library(data.table)
library(tidyverse)
library(broom)
library(modelsummary)
library(microbenchmark)
url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <- fread(url)
dat <- na.omit(dat)The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.
One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.
I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.
To begin, we load libraries and download the Guerry data from Rdatasets:
data.table
In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:
dat[, list(clergy_sd=sd(Clergy)), by=Region] Region clergy_sd
<char> <num>
1: E 19.04754
2: N 24.09845
3: C 25.03527
4: S 25.76706
5: W 20.29181
6: NA
To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:
dat = dat[, list(data=list(.SD)), by=Region]The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:
dat Region data
<char> <list>
1: E <data.table[17x23]>
2: N <data.table[17x23]>
3: C <data.table[17x23]>
4: S <data.table[17x23]>
5: W <data.table[17x23]>
6: <data.table[1x23]>
This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat Region data model
<char> <list> <list>
1: E <data.table[17x23]> <lm[12]>
2: N <data.table[17x23]> <lm[12]>
3: C <data.table[17x23]> <lm[12]>
4: S <data.table[17x23]> <lm[12]>
5: W <data.table[17x23]> <lm[12]>
6: <data.table[1x23]> <lm[12]>
Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:
modelsummary(dat$model)Warning: Model has zero degrees of freedom!
Warning: Model has zero degrees of freedom!
Model matrix is rank deficient. Parameters `Clergy` were not estimable.
| (1) | (2) | (3) | (4) | (5) | (6) | |
|---|---|---|---|---|---|---|
| (Intercept) | 3459.206 | 12616.594 | 1906.380 | 4792.878 | 4646.248 | 37015.000 |
| (1698.357) | (2708.688) | (1719.969) | (869.711) | (4255.843) | ||
| Clergy | 43.731 | -110.731 | 107.241 | -38.621 | 91.106 | |
| (43.830) | (53.525) | (29.720) | (21.940) | (70.121) | ||
| Num.Obs. | 17 | 17 | 17 | 17 | 17 | 1 |
| R2 | 0.062 | 0.222 | 0.465 | 0.171 | 0.101 | 0.000 |
| R2 Adj. | -0.000 | 0.170 | 0.429 | 0.116 | 0.041 | 0.000 |
| AIC | 328.0 | 342.8 | 324.1 | 314.7 | 346.1 | |
| BIC | 330.5 | 345.3 | 326.6 | 317.2 | 348.6 | |
| Log.Lik. | -160.988 | -168.384 | -159.031 | -154.361 | -170.052 | |
| RMSE | 3136.83 | 4846.46 | 2795.64 | 2124.14 | 5346.24 | 0.00 |
We could also use the broom::tidy function to extract model estimates in a tidy format:
dat[, result := lapply(model, tidy)]
dat Region data model result
<char> <list> <list> <list>
1: E <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
2: N <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
3: C <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
4: S <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
5: W <data.table[17x23]> <lm[12]> <tbl_df[2x5]>
6: <data.table[1x23]> <lm[12]> <tbl_df[2x5]>
In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.
Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:
dat[, result[[1]], by=Region] Region term estimate std.error statistic p.value
<char> <char> <num> <num> <num> <num>
1: E (Intercept) 3459.20615 1698.35697 2.0367957 5.972055e-02
2: E Clergy 43.73143 43.82991 0.9977531 3.342234e-01
3: N (Intercept) 12616.59395 2708.68810 4.6578246 3.094474e-04
4: N Clergy -110.73145 53.52470 -2.0687918 5.625602e-02
5: C (Intercept) 1906.38003 1719.96865 1.1083807 2.851535e-01
6: C Clergy 107.24137 29.71992 3.6083997 2.580823e-03
7: S (Intercept) 4792.87834 869.71107 5.5108857 5.982135e-05
8: S Clergy -38.62128 21.94008 -1.7603077 9.872506e-02
9: W (Intercept) 4646.24822 4255.84350 1.0917338 2.921758e-01
10: W Clergy 91.10633 70.12076 1.2992775 2.134648e-01
11: (Intercept) 37015.00000 NaN NaN NaN
12: Clergy NA NA NA NA
Putting everything together, we have:
# Nest
dat = dat[, list(data=list(.SD)), by=Region]
# Operations
dat[, model := lapply(data, function(x) lm(Donations ~ Clergy, x))]
dat[, result := lapply(model, tidy)]
# Unnest
dat[, result[[1]], by=Region]tidyverse
The same result can be achieved in a tidyverse style:
dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
read_csv %>%
drop_na %>%
nest_by(Region) %>%
mutate(model = list(lm(Donations ~ Clergy, data)),
result = list(tidy(model))) %>%
unnest(result)
dat# A tibble: 10 × 8
# Groups: Region [5]
Region data model term estimate std.error statistic p.value
<chr> <list<tibble[,23]>> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 C [17 × 23] <lm> (Inte… 1906. 1720. 1.11 2.85e-1
2 C [17 × 23] <lm> Clergy 107. 29.7 3.61 2.58e-3
3 E [17 × 23] <lm> (Inte… 3459. 1698. 2.04 5.97e-2
4 E [17 × 23] <lm> Clergy 43.7 43.8 0.998 3.34e-1
5 N [17 × 23] <lm> (Inte… 12617. 2709. 4.66 3.09e-4
6 N [17 × 23] <lm> Clergy -111. 53.5 -2.07 5.63e-2
7 S [17 × 23] <lm> (Inte… 4793. 870. 5.51 5.98e-5
8 S [17 × 23] <lm> Clergy -38.6 21.9 -1.76 9.87e-2
9 W [17 × 23] <lm> (Inte… 4646. 4256. 1.09 2.92e-1
10 W [17 × 23] <lm> Clergy 91.1 70.1 1.30 2.13e-1
Benchmarking
Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:
fit = function(k) lm(y ~ x, k)
dat_dt = data.table(group=sample(1:5000, 1e4, replace=TRUE),
y=rnorm(1e4),
x=rnorm(1e4))
dat_dt = dat_dt[, .(data=.(.SD)), by=group][
, model := lapply(data, fit)][
, result := lapply(model, tidy)]
dat_tb = tibble(dat_dt)
unnest_dt = function() {
dat_dt[, result[[1L]], by=group][dat_dt, on='group']
}
unnest_tb = function() {
dat_dt %>% unnest(result)
}
microbenchmark(unnest_dt(), unnest_tb(), times=5)Unit: milliseconds
expr min lq mean median uq max neval
unnest_dt() 4.744702 4.941496 6.315508 5.042935 7.144489 9.703919 5
unnest_tb() 15.745042 16.043001 16.784949 16.473661 16.591763 19.071277 5
cld
a
b
Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).