Rowwise operations with data.table
August 23, 2020
This notebook illustrates how one can apply a function to every row of a dataset using the data.table package. I begin with a trivial example, and continue with a slightly less trivial one.
Trivial example
Say you want to apply this function:
f <- function(x, y) {
    if (x > y) {
        return(x)
    } else {
        return(y)
    }
}
to every row of this dataset:
library(data.table)

dat <- data.table(x = c(9, 2, 3, 1), y = c(1, 4, 9, 6))
dat
> x y
>
> 1: 9 1
> 2: 2 4
> 3: 3 9
> 4: 1 6
Since f() is not vectorized, you can’t simply pass the x and y vectors to f(). This would produce a warning and the wrong answer:
# dat[, z := f(x, y)][] # not run: would warn and return the wrong answer
Instead, we will explore two alternatives for rowwise operations in data.table: the by argument and the Map function. In the end, our code will look like this:
dat[, z1 := f(x, y), by = 1:nrow(dat)]
dat
> x y z1
>
> 1: 9 1 9
> 2: 2 4 4
> 3: 3 9 9
> 4: 1 6 6
dat[, z2 := unlist(Map(f, x, y))] # Map returns a list, which we flatten with unlist()
dat
> x y z1 z2
>
> 1: 9 1 9 9
> 2: 2 4 4 4
> 3: 3 9 9 9
> 4: 1 6 6 6
In the code above, by=1:nrow(dat) tells data.table to treat every row as a distinct group, and to apply the f function to each of those groups separately. The Map function (from base R) applies f element-wise to the x and y vectors and returns a list, which we flatten with unlist() before assigning it to a column.
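To see what Map does on its own, here is a minimal sketch reusing f() with arbitrary values:

# Map() calls f() once per pair of elements and collects the results in a list
Map(f, c(9, 2, 3), c(1, 4, 9)) # returns list(9, 4, 9)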
To understand how these two strategies work, we will now consider a slightly more complicated (but still rather trivial) example.
Task description
Consider this data generating process:
$$y = 1 \cdot x + \varepsilon$$ $$z = \beta_{xz}\cdot x + \beta_{yz}\cdot y + \nu,$$
where $\varepsilon, \nu, x \sim \mbox{Normal}(0, 1)$. To assess the effect of $x$ on $y$, we can estimate a linear regression model with $y$ as the dependent variable and $x$ as the regressor. In R, we could type lm(y ~ x). However, if we estimate a more complex model with z as a control variable, lm(y ~ x + z), our estimates are likely to be biased, because z is “post-treatment” with respect to x. In DAG-world, they call this “collider bias”.
Our task will be to estimate the magnitude of this collider bias for different values of $\beta_{yz}$ and $\beta_{xz}$ using rowwise operations in data.table.
A partial solution
First, we define a function to simulate datasets which conform to the data generating process described above:
simulate <- function(b_xz, b_yz) {
    N <- 100                                  # 100 observations
    x <- rnorm(N)                             # x has no ancestors
    y <- 1 * x + rnorm(N)                     # x causes y
    z <- b_xz * x + b_yz * y + rnorm(N)       # x and y cause z
    out <- list(
        "Biased" = coef(lm(y ~ x + z))["x"],  # biased model
        "Unbiased" = coef(lm(y ~ x))["x"])    # unbiased model
    out
}
Note that this function returns two “x” coefficients in a named list. This is important because data.table treats the elements of a list as distinct columns. This ensures that our final output will be a dataset with a column called “Biased” and a column called “Unbiased”.
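As a quick illustration of this behavior (a hypothetical toy table, not part of the simulation), returning a named list in the j position creates one column per list element:

# Hypothetical toy example: a named list in `j` becomes one column per element
data.table(g = 1:2)[, list(lo = g - 1, hi = g + 1), by = g]
# >        g    lo    hi
# > 1:     1     0     2
# > 2:     2     1     3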
We want to assess the performance of both models for every combination of b_xz and b_yz in {-1, 0, 1}, and we want to replicate our experiment 100 times. To achieve this, we use the CJ function to create a dataset with every combination of b_xz, b_yz, and a replication index:
params <- CJ(replication = 1:100, b_xz = -1:1, b_yz = -1:1)
dim(params)
> [1] 900 3
tail(params)
> Key:
> replication b_xz b_yz
>
> 1: 100 0 -1
> 2: 100 0 0
> 3: 100 0 1
> 4: 100 1 -1
> 5: 100 1 0
> 6: 100 1 1
The simulation function is not vectorized: it only accepts two numerical values as arguments, so we can’t simply give it the full vectors of parameters in params. We need to apply the simulation function rowwise, that is, to every row of the dataset. To achieve this, we use data.table’s by argument. Typically, by is used to apply an operation to a “group” or “subset” of the data. Here, we assign each row to its own group:
results <- params[, simulate(b_xz, b_yz), by = 1:nrow(params)]
head(results)
> nrow Biased Unbiased
>
> 1: 1 -0.01723128 0.9884276
> 2: 2 0.92631128 1.0146390
> 3: 3 1.03765059 1.0041715
> 4: 4 0.28818854 0.9350003
> 5: 5 0.91518233 0.9129573
> 6: 6 0.61884030 1.2013585
Great! Now we know that the biased estimate for row 1 is equal to -0.017, and that the unbiased estimate for row 1 is equal to 0.988. The problem, of course, is that “row 1” is not very informative: we lost the relevant information about the values of b_xz and b_yz on every row.
A better solution
A better way to leverage the by argument would be:
results <- params[, simulate(b_xz, b_yz), keyby = .(replication, b_xz, b_yz)]
head(results)
> Key:
> replication b_xz b_yz Biased Unbiased
>
> 1: 1 -1 -1 0.01247715 1.0371168
> 2: 1 -1 0 1.03431370 0.9687347
> 3: 1 -1 1 1.07899244 0.9738551
> 4: 1 0 -1 0.72669444 1.1135464
> 5: 1 0 0 0.97200903 0.9751390
> 6: 1 0 1 0.70268068 1.1879717
Now we can find the mean biased and unbiased estimates using normal data.table operations:
results[, .(Biased = mean(Biased), Unbiased = mean(Unbiased)), by = .(b_xz, b_yz)]
> b_xz b_yz Biased Unbiased
>
> 1: -1 -1 0.005753066 1.0060821
> 2: -1 0 1.032445223 1.0195756
> 3: -1 1 0.998845382 0.9945345
> 4: 0 -1 0.510010117 0.9965415
> 5: 0 0 0.990174328 0.9907054
> 6: 0 1 0.497241640 0.9976653
> 7: 1 -1 1.006002068 1.0077494
> 8: 1 0 1.003550065 0.9986808
> 9: 1 1 0.004560044 0.9867779
Recall that the true value of the coefficient is 1. There appears to be no collider bias when (a) y does not cause z ($\beta_{yz} = 0$), or (b) the effects of x and y on z exactly offset each other ($\beta_{xz} = -\beta_{yz}$, as in rows 3 and 7).
(These are weird edge cases. Don’t condition on colliders!)
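For a one-line sanity check of the offsetting case, substitute the equation for $y$ into the equation for $z$:

$$z = \beta_{xz} x + \beta_{yz} (x + \varepsilon) + \nu = (\beta_{xz} + \beta_{yz})\, x + \beta_{yz}\, \varepsilon + \nu.$$

When $\beta_{xz} = -\beta_{yz}$, the first term vanishes and $z$ is independent of $x$, so controlling for it cannot bias the coefficient on $x$. Similarly, when $\beta_{yz} = 0$, $z$ contains no part of $y$’s error term $\varepsilon$, so it is also a harmless control.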
A useful alternative
An alternative approach is to use the Map function from base R. The Map(simulate, x, y) call produces a list where element 1 is simulate(x[1], y[1]), element 2 is simulate(x[2], y[2]), etc. This list can be used as a list-column in a “nesting” / “unnesting” programming pattern:
params[, coefs := Map(simulate, b_xz, b_yz)]
params
> Key:
> replication b_xz b_yz coefs
>
> 1: 1 -1 -1
> 2: 1 -1 0
> 3: 1 -1 1
> 4: 1 0 -1
> 5: 1 0 0
> ---
> 896: 100 0 0
> 897: 100 0 1
> 898: 100 1 -1
> 899: 100 1 0
> 900: 100 1 1
head(params)
> Key:
> replication b_xz b_yz coefs
>
> 1: 1 -1 -1
> 2: 1 -1 0
> 3: 1 -1 1
> 4: 1 0 -1
> 5: 1 0 0
> 6: 1 0 1
Notice that each entry in the coefs column is itself a list of coefficients. We can “unnest” that list:
params <- params[, coefs[[1]], keyby = .(replication, b_xz, b_yz)]
params
> Key:
> replication b_xz b_yz Biased Unbiased
>
> 1: 1 -1 -1 -0.1548834 0.9964310
> 2: 1 -1 0 1.1573618 1.0512472
> 3: 1 -1 1 1.0874509 1.1651031
> 4: 1 0 -1 0.4060620 0.9381167
> 5: 1 0 0 1.0408373 1.0438923
> ---
> 896: 100 0 0 0.9937826 0.9890470
> 897: 100 0 1 0.5487397 1.1051524
> 898: 100 1 -1 0.9404295 1.0078137
> 899: 100 1 0 1.0258530 1.0156454
> 900: 100 1 1 -0.2287936 0.9910195
Benchmarking
To optimize efficiency, the first thing to consider is whether you actually need to apply a function to every row. In general, this is a very expensive operation, and you are better off using a vectorized approach.
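For example, the trivial f() from the first section never needed a rowwise call at all: base R’s pmax() (or data.table’s fifelse()) is a vectorized alternative. A sketch, reusing the dat object from above:

# Vectorized alternatives to the rowwise f(): element-wise maximum of x and y
dat[, z3 := pmax(x, y)]
dat[, z4 := fifelse(x > y, x, y)]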
In most rowwise applications, the computational difference between the two approaches outlined above is likely to be dwarfed by the cost of the computation inside the function you apply rowwise. My advice is to just use the approach you find easier to read and understand.
params <- CJ(replication = 1:100, b_xz = -1:1, b_yz = -1:1)

f_map <- function() {
    out <- copy(params) # copy to avoid modifying params by reference
    out[, coefs := Map(simulate, b_xz, b_yz)]
    out <- out[, coefs[[1]], keyby = .(replication, b_xz, b_yz)]
    out
}

f_by <- function() {
    out <- params[, simulate(b_xz, b_yz), keyby = .(replication, b_xz, b_yz)]
    out
}

microbenchmark::microbenchmark(f_map(), f_by(), times = 3)
> Unit: milliseconds
> expr min lq mean median uq max neval cld
> f_map() 729.9997 797.9679 869.5278 865.936 939.2919 1012.648 3 a
> f_by() 1008.2642 1019.2288 1055.9283 1030.193 1079.7604 1129.327 3 a