Skip to main content

The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.

One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.

I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.

To begin, we load libraries and download the Guerry data from Rdatasets:

library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)

url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)
package ‘data.table’ was built under R version 4.5.2
package ‘ggplot2’ was built under R version 4.5.2
package ‘tibble’ was built under R version 4.5.2
package ‘tidyr’ was built under R version 4.5.2
package ‘readr’ was built under R version 4.5.2
package ‘purrr’ was built under R version 4.5.2
package ‘dplyr’ was built under R version 4.5.2
package ‘lubridate’ was built under R version 4.5.2
package ‘broom’ was built under R version 4.5.2
package ‘modelsummary’ was built under R version 4.5.2

data.table

In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:

dat[, list(clergy_sd=sd(Clergy)), by=Region]
>    Region clergy_sd
>         
> 1:      E  19.04754
> 2:      N  24.09845
> 3:      C  25.03527
> 4:      S  25.76706
> 5:      W  20.29181
> 6:               NA

To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:

dat = dat[, list(data=list(.SD)), by=Region]

The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:

dat
>    Region                data
>                  
> 1:      E 
> 2:      N 
> 3:      C 
> 4:      S 
> 5:      W 
> 6:         

This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:

dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
>    Region                data    model
>                     
> 1:      E  
> 2:      N  
> 3:      C  
> 4:      S  
> 5:      W  
> 6:          
dat
>    Region                data    model
>                     
> 1:      E  
> 2:      N  
> 3:      C  
> 4:      S  
> 5:      W  
> 6:          

Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:

modelsummary(dat$model)
(1) (2) (3) (4) (5) (6)
(Intercept) 3459.206 12616.594 1906.380 4792.878 4646.248 37015.000
(1698.357) (2708.688) (1719.969) (869.711) (4255.843)
Clergy 43.731 -110.731 107.241 -38.621 91.106
(43.830) (53.525) (29.720) (21.940) (70.121)
Num.Obs. 17 17 17 17 17 1
R2 0.062 0.222 0.465 0.171 0.101 0.000
R2 Adj. -0.000 0.170 0.429 0.116 0.041 0.000
AIC 328.0 342.8 324.1 314.7 346.1
BIC 330.5 345.3 326.6 317.2 348.6
Log.Lik. -160.988 -168.384 -159.031 -154.361 -170.052
RMSE 3136.83 4846.46 2795.64 2124.14 5346.24 0.00
Model has zero degrees of freedom!
Model has zero degrees of freedom!
Model matrix is rank deficient. Parameters `Clergy` were not estimable.

We could also use the broom::tidy function to extract model estimates in a tidy format:

dat[, result := lapply(model, tidy)]
>    Region                data    model        result
>                             
> 1:      E   
> 2:      N   
> 3:      C   
> 4:      S   
> 5:      W   
> 6:           
dat
>    Region                data    model        result
>                             
> 1:      E   
> 2:      N   
> 3:      C   
> 4:      S   
> 5:      W   
> 6:           

In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.

Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:

dat[, result[[1]], by=Region]
>     Region        term    estimate  std.error  statistic      p.value
>                                      
>  1:      E (Intercept)  3459.20615 1698.35697  2.0367957 5.972055e-02
>  2:      E      Clergy    43.73143   43.82991  0.9977531 3.342234e-01
>  3:      N (Intercept) 12616.59395 2708.68810  4.6578246 3.094474e-04
>  4:      N      Clergy  -110.73145   53.52470 -2.0687918 5.625602e-02
>  5:      C (Intercept)  1906.38003 1719.96865  1.1083807 2.851535e-01
>  6:      C      Clergy   107.24137   29.71992  3.6083997 2.580823e-03
>  7:      S (Intercept)  4792.87834  869.71107  5.5108857 5.982135e-05
>  8:      S      Clergy   -38.62128   21.94008 -1.7603077 9.872506e-02
>  9:      W (Intercept)  4646.24822 4255.84350  1.0917338 2.921758e-01
> 10:      W      Clergy    91.10633   70.12076  1.2992775 2.134648e-01
> 11:        (Intercept) 37015.00000        NaN        NaN          NaN
> 12:             Clergy          NA         NA         NA           NA

Putting everything together, we have:

# Nest
dat = dat[, list(data=list(.SD)), by=Region]

# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]

# Unnest
dat[, result[[1]], by=Region]

tidyverse

The same result can be achieved in a tidyverse style:

dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
       read_csv %>%
       drop_na %>%
       nest_by(Region) %>%
       mutate(model = list(lm(Donations ~ Clergy, data)),
              result = list(tidy(model))) %>%
       unnest(result)
dat
> # A tibble: 10 × 8
> # Groups:   Region [5]
>    Region                data model  term   estimate std.error statistic p.value
>      >                    
>  1 C                [17 × 23]    (Inte…   1906.     1720.      1.11  2.85e-1
>  2 C                [17 × 23]    Clergy    107.       29.7     3.61  2.58e-3
>  3 E                [17 × 23]    (Inte…   3459.     1698.      2.04  5.97e-2
>  4 E                [17 × 23]    Clergy     43.7      43.8     0.998 3.34e-1
>  5 N                [17 × 23]    (Inte…  12617.     2709.      4.66  3.09e-4
>  6 N                [17 × 23]    Clergy   -111.       53.5    -2.07  5.63e-2
>  7 S                [17 × 23]    (Inte…   4793.      870.      5.51  5.98e-5
>  8 S                [17 × 23]    Clergy    -38.6      21.9    -1.76  9.87e-2
>  9 W                [17 × 23]    (Inte…   4646.     4256.      1.09  2.92e-1
> 10 W                [17 × 23]    Clergy     91.1      70.1     1.30  2.13e-1

Benchmarking

Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:

fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
                    y=rnorm(1e4), 
                    x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
                , model := lapply(data, fit)][
                , result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)

unnest_dt = function() {
  dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
  dat_dt %>% unnest(result)
}

microbenchmark(unnest_dt(), unnest_tb(), times=5)
> Unit: milliseconds
>         expr       min        lq     mean   median       uq      max neval cld
>  unnest_dt()  8.088398  8.752147 10.01684 10.42368 11.25532 11.56466     5  a 
>  unnest_tb() 26.872958 27.715098 30.99372 29.95653 31.78115 38.64287     5   b
less accurate nanosecond times to avoid potential integer overflows

Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).

Loading source...