Nesting with data.table and tidyverse

Author

Vincent Arel-Bundock

Published

August 17, 2020

The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.

One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.

I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.

To begin, we load libraries and download the Guerry data from Rdatasets:

library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)

url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)

data.table

In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:

dat[, list(clergy_sd=sd(Clergy)), by=Region]
   Region clergy_sd
   <char>     <num>
1:      E  19.04754
2:      N  24.09845
3:      C  25.03527
4:      S  25.76706
5:      W  20.29181
6:               NA

To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:

dat = dat[, list(data=list(.SD)), by=Region]

The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:

dat
   Region                data
   <char>              <list>
1:      E <data.table[17x23]>
2:      N <data.table[17x23]>
3:      C <data.table[17x23]>
4:      S <data.table[17x23]>
5:      W <data.table[17x23]>
6:         <data.table[1x23]>

This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:

dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat
   Region                data    model
   <char>              <list>   <list>
1:      E <data.table[17x23]> <lm[12]>
2:      N <data.table[17x23]> <lm[12]>
3:      C <data.table[17x23]> <lm[12]>
4:      S <data.table[17x23]> <lm[12]>
5:      W <data.table[17x23]> <lm[12]>
6:         <data.table[1x23]> <lm[12]>

Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:

modelsummary(dat$model)
Warning: Model has zero degrees of freedom!
Warning: Model has zero degrees of freedom!
(1) (2) (3) (4) (5) (6)
(Intercept) 3459.206 12616.594 1906.380 4792.878 4646.248 37015.000
(1698.357) (2708.688) (1719.969) (869.711) (4255.843)
Clergy 43.731 -110.731 107.241 -38.621 91.106
(43.830) (53.525) (29.720) (21.940) (70.121)
Num.Obs. 17 17 17 17 17 1
R2 0.062 0.222 0.465 0.171 0.101 0.000
R2 Adj. 0.000 0.170 0.429 0.116 0.041 0.000
AIC 328.0 342.8 324.1 314.7 346.1
BIC 330.5 345.3 326.6 317.2 348.6
Log.Lik. -160.988 -168.384 -159.031 -154.361 -170.052
RMSE 3136.83 4846.46 2795.64 2124.14 5346.24 0.00

We could also use the broom::tidy function to extract model estimates in a tidy format:

dat[, result := lapply(model, tidy)]
dat
   Region                data    model        result
   <char>              <list>   <list>        <list>
1:      E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2:      N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3:      C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4:      S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5:      W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6:         <data.table[1x23]> <lm[12]> <tbl_df[2x5]>

In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.

Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:

dat[, result[[1]], by=Region]
    Region        term    estimate  std.error  statistic      p.value
    <char>      <char>       <num>      <num>      <num>        <num>
 1:      E (Intercept)  3459.20615 1698.35697  2.0367957 5.972055e-02
 2:      E      Clergy    43.73143   43.82991  0.9977531 3.342234e-01
 3:      N (Intercept) 12616.59395 2708.68810  4.6578246 3.094474e-04
 4:      N      Clergy  -110.73145   53.52470 -2.0687918 5.625602e-02
 5:      C (Intercept)  1906.38003 1719.96865  1.1083807 2.851535e-01
 6:      C      Clergy   107.24137   29.71992  3.6083997 2.580823e-03
 7:      S (Intercept)  4792.87834  869.71107  5.5108857 5.982135e-05
 8:      S      Clergy   -38.62128   21.94008 -1.7603077 9.872506e-02
 9:      W (Intercept)  4646.24822 4255.84350  1.0917338 2.921758e-01
10:      W      Clergy    91.10633   70.12076  1.2992775 2.134648e-01
11:        (Intercept) 37015.00000        NaN        NaN          NaN
12:             Clergy          NA         NA         NA           NA

Putting everything together, we have:

# Nest
dat = dat[, list(data=list(.SD)), by=Region]

# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]

# Unnest
dat[, result[[1]], by=Region]

tidyverse

The same result can be achieved in a tidyverse style:

dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
       read_csv %>%
       drop_na %>%
       nest_by(Region) %>%
       mutate(model = list(lm(Donations ~ Clergy, data)),
              result = list(tidy(model))) %>%
       unnest(result)
dat
# A tibble: 10 × 8
# Groups:   Region [5]
   Region                data model  term   estimate std.error statistic p.value
   <chr>  <list<tibble[,23]>> <list> <chr>     <dbl>     <dbl>     <dbl>   <dbl>
 1 C                [17 × 23] <lm>   (Inte…   1906.     1720.      1.11  2.85e-1
 2 C                [17 × 23] <lm>   Clergy    107.       29.7     3.61  2.58e-3
 3 E                [17 × 23] <lm>   (Inte…   3459.     1698.      2.04  5.97e-2
 4 E                [17 × 23] <lm>   Clergy     43.7      43.8     0.998 3.34e-1
 5 N                [17 × 23] <lm>   (Inte…  12617.     2709.      4.66  3.09e-4
 6 N                [17 × 23] <lm>   Clergy   -111.       53.5    -2.07  5.63e-2
 7 S                [17 × 23] <lm>   (Inte…   4793.      870.      5.51  5.98e-5
 8 S                [17 × 23] <lm>   Clergy    -38.6      21.9    -1.76  9.87e-2
 9 W                [17 × 23] <lm>   (Inte…   4646.     4256.      1.09  2.92e-1
10 W                [17 × 23] <lm>   Clergy     91.1      70.1     1.30  2.13e-1

Benchmarking

Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:

fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
                    y=rnorm(1e4), 
                    x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
                , model := lapply(data, fit)][
                , result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)

unnest_dt = function() {
  dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
  dat_dt %>% unnest(result)
}

microbenchmark(unnest_dt(), unnest_tb(), times=5)
Warning in microbenchmark(unnest_dt(), unnest_tb(), times = 5): less accurate
nanosecond times to avoid potential integer overflows
Unit: milliseconds
        expr      min        lq      mean    median        uq       max neval
 unnest_dt()  3.18898  3.489633  3.677347  3.610952  4.029316  4.067856     5
 unnest_tb() 10.50682 10.717318 11.239732 11.466306 11.679096 11.829115     5
 cld
  a 
   b

Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).