library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)
<- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
url <- fread(url)
dat <- na.omit(dat) dat
The nest
/unnest
programming pattern can be a useful way to organize your data analyses in R
. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply
or purrr::map
.
One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.
I could not find an easy introductory tutorial for nesting list-columns in data.table
, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table
and tidyverse
.
To begin, we load libraries and download the Guerry data from Rdatasets:
data.table
In data.table
, we can split our dataset using the by
argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:
list(clergy_sd=sd(Clergy)), by=Region] dat[,
Region clergy_sd
<char> <num>
1: E 19.04754
2: N 24.09845
3: C 25.03527
4: S 25.76706
5: W 20.29181
6: NA
To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)
), we return a list with the “Subset of Data” special object (.SD
) inside:
= dat[, list(data=list(.SD)), by=Region] dat
The result is a data table with a new list-column called data
. Each element of the data
column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:
dat
Region data
<char> <list>
1: E <data.table[17x23]>
2: N <data.table[17x23]>
3: C <data.table[17x23]>
4: S <data.table[17x23]>
5: W <data.table[17x23]>
6: <data.table[1x23]>
This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply
or purrr::map
. For instance, this estimates a separate linear regression model in each subset of the data:
:= lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, model dat
Region data model
<char> <list> <list>
1: E <data.table[17x23]> <lm[12]>
2: N <data.table[17x23]> <lm[12]>
3: C <data.table[17x23]> <lm[12]>
4: S <data.table[17x23]> <lm[12]>
5: W <data.table[17x23]> <lm[12]>
6: <data.table[1x23]> <lm[12]>
Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm
object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary
:
modelsummary(dat$model)
Warning: Model has zero degrees of freedom!
Warning: Model has zero degrees of freedom!
(1) | (2) | (3) | (4) | (5) | (6) | |
---|---|---|---|---|---|---|
(Intercept) | 3459.206 | 12616.594 | 1906.380 | 4792.878 | 4646.248 | 37015.000 |
(1698.357) | (2708.688) | (1719.969) | (869.711) | (4255.843) | ||
Clergy | 43.731 | -110.731 | 107.241 | -38.621 | 91.106 | |
(43.830) | (53.525) | (29.720) | (21.940) | (70.121) | ||
Num.Obs. | 17 | 17 | 17 | 17 | 17 | 1 |
R2 | 0.062 | 0.222 | 0.465 | 0.171 | 0.101 | 0.000 |
R2 Adj. | 0.000 | 0.170 | 0.429 | 0.116 | 0.041 | 0.000 |
AIC | 328.0 | 342.8 | 324.1 | 314.7 | 346.1 | |
BIC | 330.5 | 345.3 | 326.6 | 317.2 | 348.6 | |
Log.Lik. | -160.988 | -168.384 | -159.031 | -154.361 | -170.052 | |
RMSE | 3136.83 | 4846.46 | 2795.64 | 2124.14 | 5346.24 | 0.00 |
We could also use the broom::tidy
function to extract model estimates in a tidy format:
:= lapply(model, tidy)]
dat[, result dat
Region data model result
<char> <list> <list> <list>
1: E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2: N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3: C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4: S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5: W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6: <data.table[1x23]> <lm[12]> <tbl_df[2x5]>
In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble
). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.
Since each element of the result
column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]
:
1]], by=Region] dat[, result[[
Region term estimate std.error statistic p.value
<char> <char> <num> <num> <num> <num>
1: E (Intercept) 3459.20615 1698.35697 2.0367957 5.972055e-02
2: E Clergy 43.73143 43.82991 0.9977531 3.342234e-01
3: N (Intercept) 12616.59395 2708.68810 4.6578246 3.094474e-04
4: N Clergy -110.73145 53.52470 -2.0687918 5.625602e-02
5: C (Intercept) 1906.38003 1719.96865 1.1083807 2.851535e-01
6: C Clergy 107.24137 29.71992 3.6083997 2.580823e-03
7: S (Intercept) 4792.87834 869.71107 5.5108857 5.982135e-05
8: S Clergy -38.62128 21.94008 -1.7603077 9.872506e-02
9: W (Intercept) 4646.24822 4255.84350 1.0917338 2.921758e-01
10: W Clergy 91.10633 70.12076 1.2992775 2.134648e-01
11: (Intercept) 37015.00000 NaN NaN NaN
12: Clergy NA NA NA NA
Putting everything together, we have:
# Nest
= dat[, list(data=list(.SD)), by=Region]
dat
# Operations
:= lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, model := lapply(model, tidy)]
dat[, result
# Unnest
1]], by=Region] dat[, result[[
tidyverse
The same result can be achieved in a tidyverse
style:
<- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
dat %>%
read_csv %>%
drop_na nest_by(Region) %>%
mutate(model = list(lm(Donations ~ Clergy, data)),
result = list(tidy(model))) %>%
unnest(result)
dat
# A tibble: 10 × 8
# Groups: Region [5]
Region data model term estimate std.error statistic p.value
<chr> <list<tibble[,23]>> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 C [17 × 23] <lm> (Inte… 1906. 1720. 1.11 2.85e-1
2 C [17 × 23] <lm> Clergy 107. 29.7 3.61 2.58e-3
3 E [17 × 23] <lm> (Inte… 3459. 1698. 2.04 5.97e-2
4 E [17 × 23] <lm> Clergy 43.7 43.8 0.998 3.34e-1
5 N [17 × 23] <lm> (Inte… 12617. 2709. 4.66 3.09e-4
6 N [17 × 23] <lm> Clergy -111. 53.5 -2.07 5.63e-2
7 S [17 × 23] <lm> (Inte… 4793. 870. 5.51 5.98e-5
8 S [17 × 23] <lm> Clergy -38.6 21.9 -1.76 9.87e-2
9 W [17 × 23] <lm> (Inte… 4646. 4256. 1.09 2.92e-1
10 W [17 × 23] <lm> Clergy 91.1 70.1 1.30 2.13e-1
Benchmarking
Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table
is about 8x faster than tidyverse
in my naive tests:
= function(k) lm(y ~ x, k)
fit = data.table(group=sample(1:5000, 1e4, replace=TRUE),
dat_dt y=rnorm(1e4),
x=rnorm(1e4))
= dat_dt[, .(data=.(.SD)), by=group][
dat_dt := lapply(data, fit)][
, model := lapply(model, tidy)]
, result = tibble(dat_dt)
dat_tb
= function() {
unnest_dt 1L]], by=group][dat_dt, on='group']
dat_dt[, result[[
}= function() {
unnest_tb %>% unnest(result)
dat_dt
}
microbenchmark(unnest_dt(), unnest_tb(), times=5)
Unit: milliseconds
expr min lq mean median uq max neval
unnest_dt() 3.827038 3.848272 4.771732 4.173371 5.779775 6.230204 5
unnest_tb() 12.151467 12.320894 13.244025 13.003458 13.756246 14.988060 5
cld
a
b
Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest
returns all the original data).