Simulation-based power calculations and model assessment with R and DeclareDesign
February 4, 2021
The DeclareDesign package for R is a flexible tool to describe the elements of a research design, and to assess the properties of that design using simulations (power, coverage, etc.).
The authors of this package clearly put in a lot of effort to document their work. Make sure you visit their website and the examples library for an in-depth treatment. Also, check out @BlaCooCopHum2019 for a more conceptual treatment of the DeclareDesign approach.
The present notebook has four goals:
- Explain how to use the DeclareDesign package for R to conduct simulation-based power analyses and model assessments.
- Show how to easily compare the performance of several estimators and sampling strategies.
- Work through three detailed examples: Two-Arm Trial, Blocking vs. Adjusting, and Cluster Sampling.
- Illustrate how users can use “custom” estimators and estimands.
A modular approach
DeclareDesign offers a modular “grammar” to encode the core features of a research design: data generating process, treatment assignment, sampling scheme, estimands, and estimators. Each of these features is specified using a distinct function from the declare_* family. This modular design is powerful, because it allows us to swap out parts of a research design to, for example, compare the power of different sampling schemes while keeping everything else constant.
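As a toy sketch of this modularity (the design, numbers, and names below are illustrative, not one of the examples treated later), we can declare a design once and then swap out a single step:

```r
library(DeclareDesign)

# A toy design: estimate the mean of Y from a simple random sample.
population = declare_population(N = 500, Y = rnorm(N))
inquiry    = declare_inquiry(Y_bar = mean(Y))
sampling   = declare_sampling(S = complete_rs(N, n = 50))
estimator  = declare_estimator(Y ~ 1, inquiry = "Y_bar")

design_small = population + inquiry + sampling + estimator

# Swap only the sampling step; every other part of the design stays fixed.
design_large = replace_step(
  design_small, sampling,
  declare_sampling(S = complete_rs(N, n = 200))
)
```

Comparing `design_small` and `design_large` then isolates the effect of sample size, since the two designs are identical in every other respect.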
In this notebook, we will use several declare_* functions. Their names are self-explanatory:
- declare_population
- declare_potential_outcomes
- declare_inquiry
- declare_assignment
- declare_sampling
- declare_estimator
Two-arm design
This section illustrates how DeclareDesign works in a simple two-arm experimental design with outcome $Y$, treatment $Z$, covariate $W$, and noise $e$.
To begin, we use the declare_population function to define the basic structure of the dataset and to generate exogenous variables. The declare_population function can create hierarchical data (e.g., cities nested in counties, or time periods nested in units), but here we build a flat dataset with 100 observations:
```r
library(DeclareDesign)

population = declare_population(
  N = 100,
  e = rnorm(N), # exogenous noise
  W = rnorm(N)  # pre-treatment covariate
)
```
The code above called the declare_population function and assigned the result to a new object called population. This new object is itself a function that can be used to simulate data. Every time we call population(), we get a new simulated dataset of 100 observations with different numerical values for $e$ and $W$:
```r
population() %>% head
```

```
>   ID          e           W
> 1 001  0.3780865  0.79871157
> 2 002  0.8795205 -0.64124924
> 3 003  1.0258454  0.03580886
> 4 004 -0.9988735  0.46703447
> 5 005  0.5194171  1.22085659
> 6 006  0.6018485 -0.65019074
```
The next step is to assign units to treatment conditions using declare_assignment. This function can randomize treatment in a variety of ways, including blocked, clustered, factorial, and multi-arm designs. Here, we use simple random assignment with a 45% probability of receiving the treatment $Z=1$:
```r
assignment = declare_assignment(
  Z = simple_ra(N, prob = 0.45)
)
```
We have now defined a new function called assignment. Every time we call it, we obtain a new simulated dataset with randomly assigned $Z$ treatment:
```r
population() %>%
  assignment() %>%
  head
```

```
>   ID           e          W Z
> 1 001 -0.90802590  1.3979111 1
> 2 002  1.87353318  0.3681868 0
> 3 003  0.25383880  1.0398030 0
> 4 004  1.18620026 -0.6756406 1
> 5 005  0.04633063  0.9697663 0
> 6 006  0.02695792 -0.4390266 0
```
To define the potential outcomes, we use declare_potential_outcomes. This function takes two main arguments: (1) a formula to link treatment, exogenous variables, and potential outcomes; and (2) the list of conditions that we want to consider. In principle, we could consider several conditions (and their interactions) in some kind of factorial design. Here we only consider treatment $Z\in\{0,1\}$ in a linear data generating process:
$$Y = 0.31416 \cdot Z + W + e,$$
```r
potential_outcomes = declare_potential_outcomes(
  Y ~ 0.31416 * Z + W + e
)

population() %>%
  assignment() %>%
  potential_outcomes() %>%
  head
```
Our data generation process is now fully specified. Now, we need to declare the estimand, that is, the quantity that we are trying to estimate. To do this, we use the declare_inquiry function.
```r
inquiry = declare_inquiry(
  ATE = mean(Y_Z_1 - Y_Z_0)
)

population() %>%
  assignment() %>%
  potential_outcomes() %>%
  inquiry()
```
The last thing we need to define is the estimation strategy, using declare_estimator. This function is flexible enough to accommodate pretty much any estimator (see the last section of this notebook). Here, we will use the lm function from base R:
```r
estimator = declare_estimator(
  Y ~ Z,
  .method = lm,
  inquiry = "ATE"
)
```
It’s now time for the big payoff! We can chain all our functions to run a simulation, power analysis, and model assessment automatically:
```r
design = population +
  assignment +
  potential_outcomes +
  inquiry +
  estimator

diagnose_design(design)
```
Our simulation suggests power and coverage of about 0.20 and 0.95 (for $\alpha=0.05$).
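We can cross-check the power figure analytically with base R's power.t.test, approximating the design as two equal arms of 50 units with residual standard deviation $\sqrt{2}$ (since $W$ and $e$ each have variance 1); the equal-arm split and pooled residual variance are simplifying assumptions:

```r
# Approximate analytic power for a difference in means:
# effect size 0.31416, residual sd sqrt(2), ~50 units per arm.
power.t.test(n = 50, delta = 0.31416, sd = sqrt(2))
```

The analytic approximation also comes out near 0.20, which corroborates the simulation.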
Here’s a compact version of what we have done so far:
```r
design =
  declare_population(N = 100, e = rnorm(N), W = rnorm(N)) +
  declare_assignment(Z = simple_ra(N, prob = 0.45)) +
  declare_potential_outcomes(Y ~ 0.31416 * Z + W + e) +
  declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(Y ~ Z, .method = lm, inquiry = "ATE")
```
Comparing estimators
One benefit of the modular DeclareDesign approach is that we can easily compare different estimators. For instance, to see what happens when I adjust for $W$ (a cause of outcome $Y$), I define a new estimator and simply add it to the chain:
```r
estimator = declare_estimator(
  Y ~ Z, .method = lm, inquiry = "ATE", label = "lm"
)

estimator_w = declare_estimator(
  Y ~ Z + W, .method = lm, inquiry = "ATE", label = "lm_W"
)

design = population +
  assignment +
  potential_outcomes +
  inquiry +
  estimator +
  estimator_w

diagnose_design(design)
```
Controlling for $W$ increases power and decreases RMSE.
Blocking vs. adjusting
In experimental settings, researchers often try to improve the precision of their estimates by using “block” randomization or by adjusting for pre-treatment covariates in a regression framework. We can use DeclareDesign to compare those two strategies in a two-arm design with outcome $Y$, treatment $Z\in\{0,1\}$, blocking covariate $W\in\{0,1\}$ and noise $e$.
First, we declare the data generation mechanism and estimand:
```r
population = declare_population(
  N = 100,
  e = rnorm(N),
  W = rbinom(N, size = 1, prob = 0.5) # binary blocking covariate
)

# the exact coefficients of the data generating process are illustrative
potential_outcomes = declare_potential_outcomes(Y ~ Z + W + e)

estimand = declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0))
```
Second, we declare two alternative assignment mechanisms: blocking vs. no blocking.
```r
assignment = declare_assignment(
  Z = simple_ra(N, prob = 0.5)
)

assignment_block = declare_assignment(
  Z = block_ra(blocks = W, prob = 0.5)
)
```
Third, we declare two alternative estimators: linear regression models with or without control for $W$.
```r
estimator = declare_estimator(
  Y ~ Z, .method = lm, inquiry = "ATE", label = "lm"
)

estimator_control = declare_estimator(
  Y ~ Z + W, .method = lm, inquiry = "ATE", label = "lm_W"
)
```
Finally, we combine all the steps to create two distinct designs: design and design_block. Since the two designs are identical except for one “step,” we use the replace_step function to avoid code duplication:
```r
design = population +
  assignment +
  potential_outcomes +
  estimand +
  estimator +
  estimator_control

design_block = replace_step(design, assignment, assignment_block)
```
At this stage, we could use the diagnose_design function to compare power, coverage and other diagnostics. A useful alternative is to run the simulations with simulate_design, and to plot the results with ggplot2:
```r
sims = simulate_design(design, design_block)

# illustrative plotting code; column names assume a recent DeclareDesign version
ggplot(sims, aes(x = estimate)) +
  geom_histogram(bins = 30) +
  facet_grid(design ~ estimator) +
  labs(x = "Simulated estimates", y = "Count") +
  theme_minimal()
```
The histograms show that the simulated estimates are more spread out (i.e., less precise) in the left-most panel, which corresponds to the design with linear regression, no blocking, and no adjustment. In this context, it seems useful to block or adjust.
Cluster sampling design
This example is copied almost line-for-line from the DeclareDesign library. I think it is useful to describe it here as a contrast to the other examples.
We want to estimate the mean of an outcome $Y$ in some large population (e.g., the population of Canada). For practical reasons, we cannot take a simple random sample. Instead, we draw a sample of 600 subjects in two steps:
- Sample 30 clusters from the larger population (e.g., 30 districts).
- Sample 20 subjects within each cluster.
One problem with this strategy is that observations on $Y$ might be correlated within clusters. To simulate this, we use draw_normal_icc to create a latent outcome variable with intra-cluster correlation, and draw_ordered to convert that latent variable into an ordinal outcome. We also use the add_level function to create a nested data structure.
```r
# the block/cluster split and the ICC value are illustrative
population_total = declare_population(
  block = add_level(N = 5),
  cluster = add_level(N = 200),
  subject = add_level(
    N = 50,
    latent = draw_normal_icc(mean = 0, N = N, clusters = cluster, ICC = 0.4),
    Y = draw_ordered(x = latent, breaks = qnorm(seq(0, 1, length.out = 10)))
  )
)

population_total() %>%
  head
```

```
>   block cluster subject     latent Y
> 1     1    0001   00001  0.5208240 7
> 2     1    0001   00002 -0.6083444 3
> 3     1    0001   00003  1.0096762 8
> 4     1    0001   00004 -0.6930921 3
> 5     1    0001   00005 -0.4665348 3
> 6     1    0001   00006 -1.7920082 1
```

```r
population_total() %>%
  nrow
```

```
> [1] 50000
```
Our total population has 50000 observations: 1000 clusters of 50 individuals.
In this application, we want to treat the total population as fixed, and to simulate what happens when different clusters are sampled. To do this, we create a single data.frame to hold population-level information. Then, we tell the declare_population function to draw from that fixed population:
```r
population_fixed = population_total()

population = declare_population(data = population_fixed)
```
We declare the two stages of our sampling strategy using the declare_sampling function. When we apply those two stages to simulated data, we obtain a dataset with 600 observations (30 clusters and 20 subjects per cluster):
```r
sampling_stage1 <- declare_sampling(
  S1 = cluster_rs(clusters = cluster, n = 30),
  filter = S1 == 1
)

sampling_stage2 <- declare_sampling(
  S2 = strata_rs(strata = cluster, n = 20),
  filter = S2 == 1
)

population() %>%
  sampling_stage1() %>%
  sampling_stage2() %>%
  nrow
```

```
> [1] 600
```
We wish to estimate the mean of outcome $Y$:
```r
estimand = declare_inquiry(Y_bar = mean(Y))
```
Finally, we check the coverage of confidence intervals computed using classical or clustered standard errors.
```r
estimator_cluster = declare_estimator(
  Y ~ 1, .method = lm_robust, clusters = cluster,
  inquiry = "Y_bar", label = "clustered"
)

estimator_classical = declare_estimator(
  Y ~ 1, .method = lm_robust, se_type = "classical",
  inquiry = "Y_bar", label = "classical"
)
```
```r
design =
  population +
  sampling_stage1 +
  sampling_stage2 +
  estimand +
  estimator_classical +
  estimator_cluster
```
```r
sim = simulate_design(design)

# illustrative summary and plot; column names assume a recent DeclareDesign version
sim %>%
  mutate(covered = conf.low <= estimand & estimand <= conf.high) %>%
  group_by(estimator) %>%
  summarize(coverage = mean(covered)) %>%
  ggplot(aes(x = estimator, y = coverage)) +
  geom_col() +
  geom_hline(yintercept = 0.95, linetype = "dashed") +
  labs(x = "Standard errors", y = "Coverage of the 95% confidence interval") +
  theme_minimal()
```
Custom estimands and estimators
Above, we saw how to use built-in estimation and estimand functions from the DeclareDesign package. Now, we’ll see how to easily define custom estimands and estimators.
The declare_inquiry function has a nice shortcut syntax, but to properly understand what it does, I think it is useful to use the more flexible (and verbose) handler argument.
The handler argument accepts a function that takes our simulated dataset as a data.frame, and returns a data.frame with two entries: the inquiry label and the estimand value. Here, the “handler” takes the data and calculates the average of the difference between potential outcomes. This is the average treatment effect that we ultimately want to estimate:
```r
ate_handler = function(data) {
  data.frame(
    inquiry = "ATE",
    estimand = mean(data$Y_Z_1 - data$Y_Z_0)
  )
}

estimand = declare_inquiry(handler = ate_handler)
```
The “handler” argument of the declare_estimator function accepts a simulated dataset, and returns a data.frame that conforms to the tidy broom package format. This is nice, because it means that DeclareDesign can easily accommodate pretty much any estimation strategy.
```r
lm_handler = function(data) {
  model = lm(Y ~ Z, data = data)
  broom::tidy(model, conf.int = TRUE)[2, ] # keep the row for the Z coefficient
}

estimator = declare_estimator(
  handler = label_estimator(lm_handler),
  inquiry = "ATE",
  label = "custom_lm"
)
```
These functions can be plugged into any of the workflows described above and should work immediately. I hope these examples make clear that DeclareDesign can accommodate pretty much any estimator or quantity of interest.