Nesting with data.table and tidyverse
August 17, 2020
The nest/unnest programming pattern can be a useful way to organize your data analyses in R. The idea is to store a series of objects (e.g., datasets, models, summaries) in a “list-column.” A list-column is simply a column in your data frame where each entry is a list. Each of the lists in your list-column can itself contain any object you want, including other datasets or statistical model objects. The benefit of a list columns is that analysts can operate on them using standard tools such as lapply or purrr::map.
One common use-case is to split a dataset into subgroups, and to “nest” those subgroups of the data into a list-column. Then, we can estimate models and summarize the data easily by operating on that list column.
I could not find an easy introductory tutorial for nesting list-columns in data.table, so I decided to write this notebook. In it, I compare how list-columns can be created in both data.table and tidyverse.
To begin, we load libraries and download the Guerry data from Rdatasets:
url <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv"
dat <-
dat <-
package ‘data.table’ was built under R version 4.5.2
package ‘ggplot2’ was built under R version 4.5.2
package ‘tibble’ was built under R version 4.5.2
package ‘tidyr’ was built under R version 4.5.2
package ‘readr’ was built under R version 4.5.2
package ‘purrr’ was built under R version 4.5.2
package ‘dplyr’ was built under R version 4.5.2
package ‘lubridate’ was built under R version 4.5.2
package ‘broom’ was built under R version 4.5.2
package ‘modelsummary’ was built under R version 4.5.2
data.table
In data.table, we can split our dataset using the by argument, and return any summary of the data we want inside a list object. For instance, we calculate the standard deviation of the “Clergy” variable for each region by typing:
dat
> Region clergy_sd
>
> 1: E 19.04754
> 2: N 24.09845
> 3: C 25.03527
> 4: S 25.76706
> 5: W 20.29181
> 6: NA
To create a list-column, we use the same syntax, but instead of returning a single numeric value (sd(Clergy)), we return a list with the “Subset of Data” special object (.SD) inside:
dat = dat
The result is a data table with a new list-column called data. Each element of the data column is now a list, with a data table inside, and each of those data tables has 17 rows and 23 columns:
dat
> Region data
>
> 1: E
> 2: N
> 3: C
> 4: S
> 5: W
> 6:
This is a convenient data structure, because it allows us to operate on the data subsets with standard commands like lapply or purrr::map. For instance, this estimates a separate linear regression model in each subset of the data:
dat
> Region data model
>
> 1: E
> 2: N
> 3: C
> 4: S
> 5: W
> 6:
dat
> Region data model
>
> 1: E
> 2: N
> 3: C
> 4: S
> 5: W
> 6:
Above, we see that the dataset now includes a new list-column called “model”. Each element of that column is a list, with a lm object inside. We can do cool things with this list-column. For example, we can pass it directly to modelsummary:
| (1) | (2) | (3) | (4) | (5) | (6) | |
|---|---|---|---|---|---|---|
| (Intercept) | 3459.206 | 12616.594 | 1906.380 | 4792.878 | 4646.248 | 37015.000 |
| (1698.357) | (2708.688) | (1719.969) | (869.711) | (4255.843) | ||
| Clergy | 43.731 | -110.731 | 107.241 | -38.621 | 91.106 | |
| (43.830) | (53.525) | (29.720) | (21.940) | (70.121) | ||
| Num.Obs. | 17 | 17 | 17 | 17 | 17 | 1 |
| R2 | 0.062 | 0.222 | 0.465 | 0.171 | 0.101 | 0.000 |
| R2 Adj. | -0.000 | 0.170 | 0.429 | 0.116 | 0.041 | 0.000 |
| AIC | 328.0 | 342.8 | 324.1 | 314.7 | 346.1 | |
| BIC | 330.5 | 345.3 | 326.6 | 317.2 | 348.6 | |
| Log.Lik. | -160.988 | -168.384 | -159.031 | -154.361 | -170.052 | |
| RMSE | 3136.83 | 4846.46 | 2795.64 | 2124.14 | 5346.24 | 0.00 |
Model has zero degrees of freedom!
Model has zero degrees of freedom!
We could also use the broom::tidy function to extract model estimates in a tidy format:
dat
> Region data model result
>
> 1: E
> 2: N
> 3: C
> 4: S
> 5: W
> 6:
dat
> Region data model result
>
> 1: E
> 2: N
> 3: C
> 4: S
> 5: W
> 6:
In this new “result” list-column, each entry is a list with a data frame-like object inside (a tibble). This data frame includes information about coefficient estimates, standard errors, etc. To extract that information from the list-column we move to the final step of today’s demonstration: unnesting.
Since each element of the result column is a list with a single data frame inside, we simply extract that single element with brackets [[1]]:
dat
> Region term estimate std.error statistic p.value
>
> 1: E (Intercept) 3459.20615 1698.35697 2.0367957 5.972055e-02
> 2: E Clergy 43.73143 43.82991 0.9977531 3.342234e-01
> 3: N (Intercept) 12616.59395 2708.68810 4.6578246 3.094474e-04
> 4: N Clergy -110.73145 53.52470 -2.0687918 5.625602e-02
> 5: C (Intercept) 1906.38003 1719.96865 1.1083807 2.851535e-01
> 6: C Clergy 107.24137 29.71992 3.6083997 2.580823e-03
> 7: S (Intercept) 4792.87834 869.71107 5.5108857 5.982135e-05
> 8: S Clergy -38.62128 21.94008 -1.7603077 9.872506e-02
> 9: W (Intercept) 4646.24822 4255.84350 1.0917338 2.921758e-01
> 10: W Clergy 91.10633 70.12076 1.2992775 2.134648e-01
> 11: (Intercept) 37015.00000 NaN NaN NaN
> 12: Clergy NA NA NA NA
Putting everything together, we have:
# Nest
dat = dat
# Operations
dat
dat
# Unnest
dat
tidyverse
The same result can be achieved in a tidyverse style:
dat <- "https://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv" %>%
read_csv %>%
drop_na %>%
%>%
%>%
dat
> # A tibble: 10 × 8
> # Groups: Region [5]
> Region data model term estimate std.error statistic p.value
> >
> 1 C [17 × 23] (Inte… 1906. 1720. 1.11 2.85e-1
> 2 C [17 × 23] Clergy 107. 29.7 3.61 2.58e-3
> 3 E [17 × 23] (Inte… 3459. 1698. 2.04 5.97e-2
> 4 E [17 × 23] Clergy 43.7 43.8 0.998 3.34e-1
> 5 N [17 × 23] (Inte… 12617. 2709. 4.66 3.09e-4
> 6 N [17 × 23] Clergy -111. 53.5 -2.07 5.63e-2
> 7 S [17 × 23] (Inte… 4793. 870. 5.51 5.98e-5
> 8 S [17 × 23] Clergy -38.6 21.9 -1.76 9.87e-2
> 9 W [17 × 23] (Inte… 4646. 4256. 1.09 2.92e-1
> 10 W [17 × 23] Clergy 91.1 70.1 1.30 2.13e-1
Benchmarking
Admittedly, benchmarks are pretty useless in this context, because unnesting is unlikely to be your major bottleneck unless you are dealing with a ton of subgroups. Still, it is interesting to note that unnesting data.table is about 8x faster than tidyverse in my naive tests:
dat_dt =
dat_dt = dat_dt
dat_tb =
{
dat_dt
}
{
dat_dt %>%
}
> Unit: milliseconds
> expr min lq mean median uq max neval cld
> unnest_dt() 8.088398 8.752147 10.01684 10.42368 11.25532 11.56466 5 a
> unnest_tb() 26.872958 27.715098 30.99372 29.95653 31.78115 38.64287 5 b
less accurate nanosecond times to avoid potential integer overflows
Note that in the code above, I merged the original dataset back into data.table for a fairer comparison (tidyr::unnest returns all the original data).
Loading source...