Nesting with data.table and tidyverse

The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.

One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.

I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.

To begin, we load libraries and download the Guerry data from Rdatasets:

library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)

url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)

`data.table`

In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:

dat[, list(clergy_sd=sd(Clergy)), by=Region]

   Region clergy_sd
   <char>     <num>
1:      E  19.04754
2:      N  24.09845
3:      C  25.03527
4:      S  25.76706
5:      W  20.29181
6:               NA

To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:

dat = dat[, list(data=list(.SD)), by=Region]

The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:

dat

   Region                data
   <char>              <list>
1:      E <data.table[17x23]>
2:      N <data.table[17x23]>
3:      C <data.table[17x23]>
4:      S <data.table[17x23]>
5:      W <data.table[17x23]>
6:         <data.table[1x23]>

This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:

dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat

   Region                data    model
   <char>              <list>   <list>
1:      E <data.table[17x23]> <lm[12]>
2:      N <data.table[17x23]> <lm[12]>
3:      C <data.table[17x23]> <lm[12]>
4:      S <data.table[17x23]> <lm[12]>
5:      W <data.table[17x23]> <lm[12]>
6:         <data.table[1x23]> <lm[12]>

Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:

modelsummary(dat$model)

Warning: Model has zero degrees of freedom!
Warning: Model has zero degrees of freedom!

	(1)	(2)	(3)	(4)	(5)	(6)
(Intercept)	3459.206	12616.594	1906.380	4792.878	4646.248	37015.000
	(1698.357)	(2708.688)	(1719.969)	(869.711)	(4255.843)
Clergy	43.731	-110.731	107.241	-38.621	91.106
	(43.830)	(53.525)	(29.720)	(21.940)	(70.121)
Num.Obs.	17	17	17	17	17	1
R2	0.062	0.222	0.465	0.171	0.101	0.000
R2 Adj.	-0.000	0.170	0.429	0.116	0.041	0.000
AIC	328.0	342.8	324.1	314.7	346.1
BIC	330.5	345.3	326.6	317.2	348.6
Log.Lik.	-160.988	-168.384	-159.031	-154.361	-170.052
RMSE	3136.83	4846.46	2795.64	2124.14	5346.24	0.00

We could also use the broom::tidy function to extract model estimates in a tidy format:

dat[, result := lapply(model, tidy)]
dat

Warning: `...` must be empty in `format.tbl()`
`...` must be empty in `format.tbl()`
`...` must be empty in `format.tbl()`
`...` must be empty in `format.tbl()`
`...` must be empty in `format.tbl()`
`...` must be empty in `format.tbl()`
Caused by error in `format_tbl()`:
! `...` must be empty.
✖ Problematic arguments:
• na.encode = FALSE
• timezone = timezone
• justify = justify

   Region                data    model        result
   <char>              <list>   <list>        <list>
1:      E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2:      N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3:      C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4:      S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5:      W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6:         <data.table[1x23]> <lm[12]> <tbl_df[2x5]>

In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.

Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:

dat[, result[[1]], by=Region]

    Region        term    estimate  std.error  statistic      p.value
    <char>      <char>       <num>      <num>      <num>        <num>
 1:      E (Intercept)  3459.20615 1698.35697  2.0367957 5.972055e-02
 2:      E      Clergy    43.73143   43.82991  0.9977531 3.342234e-01
 3:      N (Intercept) 12616.59395 2708.68810  4.6578246 3.094474e-04
 4:      N      Clergy  -110.73145   53.52470 -2.0687918 5.625602e-02
 5:      C (Intercept)  1906.38003 1719.96865  1.1083807 2.851535e-01
 6:      C      Clergy   107.24137   29.71992  3.6083997 2.580823e-03
 7:      S (Intercept)  4792.87834  869.71107  5.5108857 5.982135e-05
 8:      S      Clergy   -38.62128   21.94008 -1.7603077 9.872506e-02
 9:      W (Intercept)  4646.24822 4255.84350  1.0917338 2.921758e-01
10:      W      Clergy    91.10633   70.12076  1.2992775 2.134648e-01
11:        (Intercept) 37015.00000        NaN        NaN          NaN
12:             Clergy          NA         NA         NA           NA

Putting everything together, we have:

# Nest
dat = dat[, list(data=list(.SD)), by=Region]

# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]

# Unnest
dat[, result[[1]], by=Region]

`tidyverse`

The same result can be achieved in a tidyverse style:

dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
       read_csv %>%
       drop_na %>%
       nest_by(Region) %>%
       mutate(model = list(lm(Donations ~ Clergy, data)),
              result = list(tidy(model))) %>%
       unnest(result)
dat

# A tibble: 10 × 8
# Groups:   Region [5]
   Region                data model  term   estimate std.error statistic p.value
   <chr>  <list<tibble[,23]>> <list> <chr>     <dbl>     <dbl>     <dbl>   <dbl>
 1 C                [17 × 23] <lm>   (Inte…   1906.     1720.      1.11  2.85e-1
 2 C                [17 × 23] <lm>   Clergy    107.       29.7     3.61  2.58e-3
 3 E                [17 × 23] <lm>   (Inte…   3459.     1698.      2.04  5.97e-2
 4 E                [17 × 23] <lm>   Clergy     43.7      43.8     0.998 3.34e-1
 5 N                [17 × 23] <lm>   (Inte…  12617.     2709.      4.66  3.09e-4
 6 N                [17 × 23] <lm>   Clergy   -111.       53.5    -2.07  5.63e-2
 7 S                [17 × 23] <lm>   (Inte…   4793.      870.      5.51  5.98e-5
 8 S                [17 × 23] <lm>   Clergy    -38.6      21.9    -1.76  9.87e-2
 9 W                [17 × 23] <lm>   (Inte…   4646.     4256.      1.09  2.92e-1
10 W                [17 × 23] <lm>   Clergy     91.1      70.1     1.30  2.13e-1

Benchmarking

Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:

fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
                    y=rnorm(1e4), 
                    x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
                , model := lapply(data, fit)][
                , result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)

unnest_dt = function() {
  dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
  dat_dt %>% unnest(result)
}

microbenchmark(unnest_dt(), unnest_tb(), times=5)

Warning in microbenchmark(unnest_dt(), unnest_tb(), times = 5): less accurate
nanosecond times to avoid potential integer overflows

Unit: milliseconds
        expr       min        lq      mean    median        uq       max neval
 unnest_dt()  3.723702  3.894344  4.288879  4.095039  4.766127  4.965182     5
 unnest_tb() 11.767246 11.977781 12.644794 12.553216 13.037016 13.888709     5
 cld
  a 
   b

Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).