The nest
/unnest
programming pattern can be a useful way to organize your data analyses in R
. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply
or purrr::map
.
One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.
I could not find an easy introductory tutorial for nesting list-columns in data.table
, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table
and tidyverse
.
To begin, we load libraries and download the Guerry data from Rdatasets:
library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)
url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)
data.table
In data.table
, we can split our dataset using the by
argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:
Region clergy_sd
1: E 19.04754
2: N 24.09845
3: C 25.03527
4: S 25.76706
5: W 20.29181
To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)
), we return a list with the “Subset of Data” special object (.SD
) inside:
The result is a data table with a new list-column called data
. Each element of the data
column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:
dat
Region data
1: E <data.table[17x23]>
2: N <data.table[17x23]>
3: C <data.table[17x23]>
4: S <data.table[17x23]>
5: W <data.table[17x23]>
This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply
or purrr::map
. For instance, this estimates a separate linear regression model in each subset of the data:
Region data model
1: E <data.table[17x23]> <lm[12]>
2: N <data.table[17x23]> <lm[12]>
3: C <data.table[17x23]> <lm[12]>
4: S <data.table[17x23]> <lm[12]>
5: W <data.table[17x23]> <lm[12]>
Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm
object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary
:
modelsummary(dat$model)
Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | |
---|---|---|---|---|---|
(Intercept) | 3459.206 | 12616.594 | 1906.380 | 4792.878 | 4646.248 |
(1698.357) | (2708.688) | (1719.969) | (869.711) | (4255.843) | |
Clergy | 43.731 | -110.731 | 107.241 | -38.621 | 91.106 |
(43.830) | (53.525) | (29.720) | (21.940) | (70.121) | |
Num.Obs. | 17 | 17 | 17 | 17 | 17 |
R2 | 0.062 | 0.222 | 0.465 | 0.171 | 0.101 |
R2 Adj. | 0.000 | 0.170 | 0.429 | 0.116 | 0.041 |
AIC | 328.0 | 342.8 | 324.1 | 314.7 | 346.1 |
BIC | 330.5 | 345.3 | 326.6 | 317.2 | 348.6 |
Log.Lik. | -160.988 | -168.384 | -159.031 | -154.361 | -170.052 |
F | 0.996 | 4.280 | 13.021 | 3.099 | 1.688 |
We could also use the broom::tidy
function to extract model estimates in a tidy format:
dat[, result := lapply(model, tidy)]
dat
Region data model result
1: E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2: N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3: C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4: S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5: W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble
). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.
Since each element of the result
column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]
:
dat[, result[[1]], by=Region]
Region term estimate std.error statistic p.value
1: E (Intercept) 3459.20615 1698.35697 2.0367957 5.972055e-02
2: E Clergy 43.73143 43.82991 0.9977531 3.342234e-01
3: N (Intercept) 12616.59395 2708.68810 4.6578246 3.094474e-04
4: N Clergy -110.73145 53.52470 -2.0687918 5.625602e-02
5: C (Intercept) 1906.38003 1719.96865 1.1083807 2.851535e-01
6: C Clergy 107.24137 29.71992 3.6083997 2.580823e-03
7: S (Intercept) 4792.87834 869.71107 5.5108857 5.982135e-05
8: S Clergy -38.62128 21.94008 -1.7603077 9.872506e-02
9: W (Intercept) 4646.24822 4255.84350 1.0917338 2.921758e-01
10: W Clergy 91.10633 70.12076 1.2992775 2.134648e-01
Putting everything together, we have:
tidyverse
The same result can be achieved in a tidyverse
style:
dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
read_csv %>%
drop_na %>%
nest_by(Region) %>%
mutate(model = list(lm(Donations ~ Clergy, data)),
result = list(tidy(model))) %>%
unnest(result)
dat
# A tibble: 10 x 8
# Groups: Region [5]
Region data model term estimate std.error statistic p.value
<chr> <list<tbl_> <lis> <chr> <dbl> <dbl> <dbl> <dbl>
1 C [17 × 23] <lm> (Int… 1906. 1720. 1.11 2.85e-1
2 C [17 × 23] <lm> Cler… 107. 29.7 3.61 2.58e-3
3 E [17 × 23] <lm> (Int… 3459. 1698. 2.04 5.97e-2
4 E [17 × 23] <lm> Cler… 43.7 43.8 0.998 3.34e-1
5 N [17 × 23] <lm> (Int… 12617. 2709. 4.66 3.09e-4
6 N [17 × 23] <lm> Cler… -111. 53.5 -2.07 5.63e-2
7 S [17 × 23] <lm> (Int… 4793. 870. 5.51 5.98e-5
8 S [17 × 23] <lm> Cler… -38.6 21.9 -1.76 9.87e-2
9 W [17 × 23] <lm> (Int… 4646. 4256. 1.09 2.92e-1
10 W [17 × 23] <lm> Cler… 91.1 70.1 1.30 2.13e-1
Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table
is about 8x faster than tidyverse
in my naive tests:
fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
y=rnorm(1e4),
x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
, model := lapply(data, fit)][
, result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)
unnest_dt = function() {
dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
dat_dt %>% unnest(result)
}
microbenchmark(unnest_dt(), unnest_tb(), times=5)
Unit: milliseconds
expr min lq mean median uq max
unnest_dt() 12.13646 12.81705 13.40778 13.18411 13.3367 15.56458
unnest_tb() 95.01422 98.74043 107.48908 106.86359 118.2567 118.57048
neval cld
5 a
5 b
Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest
returns all the original data).