The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.

One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.

I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.

To begin, we load libraries and download the Guerry data from Rdatasets:

library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)

url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)

data.table

In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:

dat[, list(clergy_sd=sd(Clergy)), by=Region]
##    Region clergy_sd
## 1:      E  19.04754
## 2:      N  24.09845
## 3:      C  25.03527
## 4:      S  25.76706
## 5:      W  20.29181

To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:

dat = dat[, list(data=list(.SD)), by=Region]

The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:

dat
##    Region                data
## 1:      E <data.table[17x23]>
## 2:      N <data.table[17x23]>
## 3:      C <data.table[17x23]>
## 4:      S <data.table[17x23]>
## 5:      W <data.table[17x23]>

This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:

dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat
##    Region                data    model
## 1:      E <data.table[17x23]> <lm[12]>
## 2:      N <data.table[17x23]> <lm[12]>
## 3:      C <data.table[17x23]> <lm[12]>
## 4:      S <data.table[17x23]> <lm[12]>
## 5:      W <data.table[17x23]> <lm[12]>

Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:

modelsummary(dat$model)
Model 1 Model 2 Model 3 Model 4 Model 5
(Intercept) 3459.206 12616.594 1906.380 4792.878 4646.248
(1698.357) (2708.688) (1719.969) (869.711) (4255.843)
Clergy 43.731 -110.731 107.241 -38.621 91.106
(43.830) (53.525) (29.720) (21.940) (70.121)
Num.Obs. 17 17 17 17 17
R2 0.062 0.222 0.465 0.171 0.101
R2 Adj. -0.000 0.170 0.429 0.116 0.041
AIC 328.0 342.8 324.1 314.7 346.1
BIC 330.5 345.3 326.6 317.2 348.6
Log.Lik. -160.988 -168.384 -159.031 -154.361 -170.052

We could also use the broom::tidy function to extract model estimates in a tidy format:

dat[, result := lapply(model, tidy)]
dat
##    Region                data    model        result
## 1:      E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
## 2:      N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
## 3:      C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
## 4:      S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
## 5:      W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>

In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.

Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:

dat[, result[[1]], by=Region]
##     Region        term    estimate  std.error  statistic      p.value
##  1:      E (Intercept)  3459.20615 1698.35697  2.0367957 5.972055e-02
##  2:      E      Clergy    43.73143   43.82991  0.9977531 3.342234e-01
##  3:      N (Intercept) 12616.59395 2708.68810  4.6578246 3.094474e-04
##  4:      N      Clergy  -110.73145   53.52470 -2.0687918 5.625602e-02
##  5:      C (Intercept)  1906.38003 1719.96865  1.1083807 2.851535e-01
##  6:      C      Clergy   107.24137   29.71992  3.6083997 2.580823e-03
##  7:      S (Intercept)  4792.87834  869.71107  5.5108857 5.982135e-05
##  8:      S      Clergy   -38.62128   21.94008 -1.7603077 9.872506e-02
##  9:      W (Intercept)  4646.24822 4255.84350  1.0917338 2.921758e-01
## 10:      W      Clergy    91.10633   70.12076  1.2992775 2.134648e-01

Putting everything together, we have:

# Nest
dat = dat[, list(data=list(.SD)), by=Region]

# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]

# Unnest
dat[, result[[1]], by=Region]

tidyverse

The same result can be achieved in a tidyverse style:

dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
       read_csv %>%
       drop_na %>%
       nest_by(Region) %>%
       mutate(model = list(lm(Donations ~ Clergy, data)),
              result = list(tidy(model))) %>%
       unnest(result)
dat
## # A tibble: 10 x 8
## # Groups:   Region [5]
##    Region             data model  term     estimate std.error statistic  p.value
##    <chr>  <list<tbl_df[,2> <list> <chr>       <dbl>     <dbl>     <dbl>    <dbl>
##  1 C             [17 × 23] <lm>   (Interc…   1906.     1720.      1.11   2.85e-1
##  2 C             [17 × 23] <lm>   Clergy      107.       29.7     3.61   2.58e-3
##  3 E             [17 × 23] <lm>   (Interc…   3459.     1698.      2.04   5.97e-2
##  4 E             [17 × 23] <lm>   Clergy       43.7      43.8     0.998  3.34e-1
##  5 N             [17 × 23] <lm>   (Interc…  12617.     2709.      4.66   3.09e-4
##  6 N             [17 × 23] <lm>   Clergy     -111.       53.5    -2.07   5.63e-2
##  7 S             [17 × 23] <lm>   (Interc…   4793.      870.      5.51   5.98e-5
##  8 S             [17 × 23] <lm>   Clergy      -38.6      21.9    -1.76   9.87e-2
##  9 W             [17 × 23] <lm>   (Interc…   4646.     4256.      1.09   2.92e-1
## 10 W             [17 × 23] <lm>   Clergy       91.1      70.1     1.30   2.13e-1

Benchmarking

Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:

fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
                    y=rnorm(1e4), 
                    x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
                , model := lapply(data, fit)][
                , result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)

unnest_dt = function() {
  dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
  dat_dt %>% unnest(result)
}

microbenchmark(unnest_dt(), unnest_tb(), times=5)
## Unit: milliseconds
##         expr      min       lq     mean   median        uq     max neval
##  unnest_dt() 11.55791 12.47092 14.90513 15.66596  16.96684  17.864     5
##  unnest_tb() 89.25033 97.04142 99.06009 98.43822 105.01252 105.558     5

Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).