noalign <- function(x) {
    x <- tinytable::theme_tinytable(x)
    fn <- function(table) {
        if (table@output != "typst") {
            return(table)
        }
        tab <- unlist(strsplit(table@table_string, "\\n"))
        idx <- grepl("^\\s*#align\\(center, \\[\\s*$|^\\s*\\]\\) // end align\\s*$", tab)
        table@table_string <- paste(tab[!idx], collapse = "\n")
        return(table)
    }
    x <- tinytable::style_tt(x, finalize = fn)
    return(x)
}
options(tinytable_tt_theme = noalign)

Nesting with data.table and tidyverse

The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.

One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.

I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.

To begin, we load libraries and download the Guerry data from Rdatasets:

library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)

url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)
package ‘data.table’ was built under R version 4.5.2
package ‘ggplot2’ was built under R version 4.5.2
package ‘tibble’ was built under R version 4.5.2
package ‘tidyr’ was built under R version 4.5.2
package ‘readr’ was built under R version 4.5.2
package ‘purrr’ was built under R version 4.5.2
package ‘dplyr’ was built under R version 4.5.2
package ‘lubridate’ was built under R version 4.5.2
package ‘broom’ was built under R version 4.5.2

data.table

In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:

dat[, list(clergy_sd=sd(Clergy)), by=Region]
   Region clergy_sd
   <char>     <num>
1:      E  19.04754
2:      N  24.09845
3:      C  25.03527
4:      S  25.76706
5:      W  20.29181
6:               NA

To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:

dat = dat[, list(data=list(.SD)), by=Region]

The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:

dat
   Region                data
   <char>              <list>
1:      E <data.table[17x23]>
2:      N <data.table[17x23]>
3:      C <data.table[17x23]>
4:      S <data.table[17x23]>
5:      W <data.table[17x23]>
6:         <data.table[1x23]>

This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:

dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat
   Region                data    model
   <char>              <list>   <list>
1:      E <data.table[17x23]> <lm[12]>
2:      N <data.table[17x23]> <lm[12]>
3:      C <data.table[17x23]> <lm[12]>
4:      S <data.table[17x23]> <lm[12]>
5:      W <data.table[17x23]> <lm[12]>
6:         <data.table[1x23]> <lm[12]>
   Region                data    model
   <char>              <list>   <list>
1:      E <data.table[17x23]> <lm[12]>
2:      N <data.table[17x23]> <lm[12]>
3:      C <data.table[17x23]> <lm[12]>
4:      S <data.table[17x23]> <lm[12]>
5:      W <data.table[17x23]> <lm[12]>
6:         <data.table[1x23]> <lm[12]>

Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:

modelsummary(dat$model)
(1)(2)(3)(4)(5)(6)
(Intercept)3459.20612616.5941906.3804792.8784646.24837015.000
(1698.357)(2708.688)(1719.969)(869.711)(4255.843)
Clergy43.731−110.731107.241−38.62191.106
(43.830)(53.525)(29.720)(21.940)(70.121)
Num.Obs.17171717171
R20.0620.2220.4650.1710.1010.000
R2 Adj.−0.0000.1700.4290.1160.0410.000
AIC328.0342.8324.1314.7346.1
BIC330.5345.3326.6317.2348.6
Log.Lik.−160.988−168.384−159.031−154.361−170.052
RMSE3136.834846.462795.642124.145346.240.00
Model has zero degrees of freedom!
Model has zero degrees of freedom!
Model matrix is rank deficient. Parameters `Clergy` were not estimable.

We could also use the broom::tidy function to extract model estimates in a tidy format:

dat[, result := lapply(model, tidy)]
dat
   Region                data    model        result
   <char>              <list>   <list>        <list>
1:      E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2:      N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3:      C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4:      S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5:      W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6:         <data.table[1x23]> <lm[12]> <tbl_df[2x5]>
   Region                data    model        result
   <char>              <list>   <list>        <list>
1:      E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2:      N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3:      C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4:      S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5:      W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6:         <data.table[1x23]> <lm[12]> <tbl_df[2x5]>

In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.

Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:

dat[, result[[1]], by=Region]
    Region        term    estimate  std.error  statistic      p.value
    <char>      <char>       <num>      <num>      <num>        <num>
 1:      E (Intercept)  3459.20615 1698.35697  2.0367957 5.972055e-02
 2:      E      Clergy    43.73143   43.82991  0.9977531 3.342234e-01
 3:      N (Intercept) 12616.59395 2708.68810  4.6578246 3.094474e-04
 4:      N      Clergy  -110.73145   53.52470 -2.0687918 5.625602e-02
 5:      C (Intercept)  1906.38003 1719.96865  1.1083807 2.851535e-01
 6:      C      Clergy   107.24137   29.71992  3.6083997 2.580823e-03
 7:      S (Intercept)  4792.87834  869.71107  5.5108857 5.982135e-05
 8:      S      Clergy   -38.62128   21.94008 -1.7603077 9.872506e-02
 9:      W (Intercept)  4646.24822 4255.84350  1.0917338 2.921758e-01
10:      W      Clergy    91.10633   70.12076  1.2992775 2.134648e-01
11:        (Intercept) 37015.00000        NaN        NaN          NaN
12:             Clergy          NA         NA         NA           NA

Putting everything together, we have:

# Nest
dat = dat[, list(data=list(.SD)), by=Region]

# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]

# Unnest
dat[, result[[1]], by=Region]

tidyverse

The same result can be achieved in a tidyverse style:

dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
       read_csv %>%
       drop_na %>%
       nest_by(Region) %>%
       mutate(model = list(lm(Donations ~ Clergy, data)),
              result = list(tidy(model))) %>%
       unnest(result)
dat
# A tibble: 10 × 8
# Groups:   Region [5]
   Region                data model  term   estimate std.error statistic p.value
   <chr>  <list<tibble[,23]>> <list> <chr>     <dbl>     <dbl>     <dbl>   <dbl>
 1 C                [17 × 23] <lm>   (Inte…   1906.     1720.      1.11  2.85e-1
 2 C                [17 × 23] <lm>   Clergy    107.       29.7     3.61  2.58e-3
 3 E                [17 × 23] <lm>   (Inte…   3459.     1698.      2.04  5.97e-2
 4 E                [17 × 23] <lm>   Clergy     43.7      43.8     0.998 3.34e-1
 5 N                [17 × 23] <lm>   (Inte…  12617.     2709.      4.66  3.09e-4
 6 N                [17 × 23] <lm>   Clergy   -111.       53.5    -2.07  5.63e-2
 7 S                [17 × 23] <lm>   (Inte…   4793.      870.      5.51  5.98e-5
 8 S                [17 × 23] <lm>   Clergy    -38.6      21.9    -1.76  9.87e-2
 9 W                [17 × 23] <lm>   (Inte…   4646.     4256.      1.09  2.92e-1
10 W                [17 × 23] <lm>   Clergy     91.1      70.1     1.30  2.13e-1

Benchmarking

Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:

fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
                    y=rnorm(1e4), 
                    x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
                , model := lapply(data, fit)][
                , result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)

unnest_dt = function() {
  dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
  dat_dt %>% unnest(result)
}

microbenchmark(unnest_dt(), unnest_tb(), times=5)
Unit: milliseconds
        expr      min        lq     mean    median       uq       max neval cld
 unnest_dt()  4.82283  5.498305 29.60581  5.928354 11.97643 119.80315     5   a
 unnest_tb() 15.14622 15.384922 16.76398 16.464165 16.61488  20.20972     5   a
less accurate nanosecond times to avoid potential integer overflows

Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).