Nesting with data.table and tidyverse

Vincent Arel-Bundock
2020-08-17

The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.

One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.

I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.

To begin, we load libraries and download the Guerry data from Rdatasets:

library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)

url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)

data.table

In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:

dat[, list(clergy_sd=sd(Clergy)), by=Region]
   Region clergy_sd
1:      E  19.04754
2:      N  24.09845
3:      C  25.03527
4:      S  25.76706
5:      W  20.29181

To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:

dat = dat[, list(data=list(.SD)), by=Region]

The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:

dat
   Region                data
1:      E <data.table[17x23]>
2:      N <data.table[17x23]>
3:      C <data.table[17x23]>
4:      S <data.table[17x23]>
5:      W <data.table[17x23]>

This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:

dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat
   Region                data    model
1:      E <data.table[17x23]> <lm[12]>
2:      N <data.table[17x23]> <lm[12]>
3:      C <data.table[17x23]> <lm[12]>
4:      S <data.table[17x23]> <lm[12]>
5:      W <data.table[17x23]> <lm[12]>

Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:

modelsummary(dat$model)
Model 1 Model 2 Model 3 Model 4 Model 5
(Intercept) 3459.206 12616.594 1906.380 4792.878 4646.248
(1698.357) (2708.688) (1719.969) (869.711) (4255.843)
Clergy 43.731 -110.731 107.241 -38.621 91.106
(43.830) (53.525) (29.720) (21.940) (70.121)
Num.Obs. 17 17 17 17 17
R2 0.062 0.222 0.465 0.171 0.101
R2 Adj. 0.000 0.170 0.429 0.116 0.041
AIC 328.0 342.8 324.1 314.7 346.1
BIC 330.5 345.3 326.6 317.2 348.6
Log.Lik. -160.988 -168.384 -159.031 -154.361 -170.052
F 0.996 4.280 13.021 3.099 1.688

We could also use the broom::tidy function to extract model estimates in a tidy format:

dat[, result := lapply(model, tidy)]
dat
   Region                data    model        result
1:      E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2:      N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3:      C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4:      S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5:      W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>

In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.

Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:

dat[, result[[1]], by=Region]
    Region        term    estimate  std.error  statistic      p.value
 1:      E (Intercept)  3459.20615 1698.35697  2.0367957 5.972055e-02
 2:      E      Clergy    43.73143   43.82991  0.9977531 3.342234e-01
 3:      N (Intercept) 12616.59395 2708.68810  4.6578246 3.094474e-04
 4:      N      Clergy  -110.73145   53.52470 -2.0687918 5.625602e-02
 5:      C (Intercept)  1906.38003 1719.96865  1.1083807 2.851535e-01
 6:      C      Clergy   107.24137   29.71992  3.6083997 2.580823e-03
 7:      S (Intercept)  4792.87834  869.71107  5.5108857 5.982135e-05
 8:      S      Clergy   -38.62128   21.94008 -1.7603077 9.872506e-02
 9:      W (Intercept)  4646.24822 4255.84350  1.0917338 2.921758e-01
10:      W      Clergy    91.10633   70.12076  1.2992775 2.134648e-01

Putting everything together, we have:

# Nest
dat = dat[, list(data=list(.SD)), by=Region]

# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]

# Unnest
dat[, result[[1]], by=Region]

tidyverse

The same result can be achieved in a tidyverse style:

dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
       read_csv %>%
       drop_na %>%
       nest_by(Region) %>%
       mutate(model = list(lm(Donations ~ Clergy, data)),
              result = list(tidy(model))) %>%
       unnest(result)
dat
# A tibble: 10 x 8
# Groups:   Region [5]
   Region        data model term  estimate std.error statistic p.value
   <chr>  <list<tbl_> <lis> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
 1 C        [17 × 23] <lm>  (Int…   1906.     1720.      1.11  2.85e-1
 2 C        [17 × 23] <lm>  Cler…    107.       29.7     3.61  2.58e-3
 3 E        [17 × 23] <lm>  (Int…   3459.     1698.      2.04  5.97e-2
 4 E        [17 × 23] <lm>  Cler…     43.7      43.8     0.998 3.34e-1
 5 N        [17 × 23] <lm>  (Int…  12617.     2709.      4.66  3.09e-4
 6 N        [17 × 23] <lm>  Cler…   -111.       53.5    -2.07  5.63e-2
 7 S        [17 × 23] <lm>  (Int…   4793.      870.      5.51  5.98e-5
 8 S        [17 × 23] <lm>  Cler…    -38.6      21.9    -1.76  9.87e-2
 9 W        [17 × 23] <lm>  (Int…   4646.     4256.      1.09  2.92e-1
10 W        [17 × 23] <lm>  Cler…     91.1      70.1     1.30  2.13e-1

Benchmarking

Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:

fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
                    y=rnorm(1e4), 
                    x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
                , model := lapply(data, fit)][
                , result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)

unnest_dt = function() {
  dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
  dat_dt %>% unnest(result)
}

microbenchmark(unnest_dt(), unnest_tb(), times=5)
Unit: milliseconds
        expr      min       lq      mean    median       uq       max
 unnest_dt() 12.13646 12.81705  13.40778  13.18411  13.3367  15.56458
 unnest_tb() 95.01422 98.74043 107.48908 106.86359 118.2567 118.57048
 neval cld
     5  a 
     5   b

Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).