The `DeclareDesign` package for `R` is a flexible tool to *describe* the elements of a research design, and to *assess* the properties of that design using simulations (power, coverage, etc.).

The authors of this package clearly put a lot of effort into documenting their work. Make sure you visit their website and examples library for an in-depth treatment. Also, check out Blair et al. (2019) for a more conceptual treatment of the `DeclareDesign` approach.

The present notebook has four goals:

- Explain how to use the `DeclareDesign` package for `R` to conduct simulation-based power analyses and model assessments.
- Show how to easily compare the performance of several estimators and sampling strategies.
- Work through 3 detailed examples: Two-Arm Trial, Blocking vs. Adjusting, and Cluster Sampling.
- Illustrate how users can define “custom” estimators and estimands.

`DeclareDesign` offers a modular “grammar” to encode the core features of a research design: data generating process, treatment assignment, sampling scheme, estimands, and estimators. Each of these features is specified using a distinct function from the `declare_*` family. This modular design is powerful, because it allows us to swap out parts of a research design to, for example, compare the power of different sampling schemes while keeping everything else constant.

In this notebook, we will use several `declare_*` functions. Their names are self-explanatory:

- `declare_population`
- `declare_potential_outcomes`
- `declare_estimand`
- `declare_assignment`
- `declare_sampling`
- `declare_estimator`

This section illustrates how `DeclareDesign` works in a simple two-arm experimental design with outcome \(Y\), treatment \(Z\), covariate \(W\), and noise \(e\).

To begin, we use the `declare_population` function to define the basic structure of the dataset and to generate exogenous variables. The `declare_population` function can create hierarchical data (e.g., cities in counties, or time periods within units), but here we build a flat dataset with 100 observations:

```
library(DeclareDesign)
library(tidyverse)

set.seed(924033)

population = declare_population(
  N = 100,
  e = rnorm(N),
  W = rnorm(N))
```

The code above called the `declare_population` function and assigned the result to a new object called `population`. This new object is itself a function that can be used to simulate data. Every time we call `population()`, we get a new simulated dataset of 100 observations with different numerical values for \(e\) and \(W\):

```
population() %>%
  head()
```

```
ID e W
1 001 0.3780865 0.79871157
2 002 0.8795205 -0.64124924
3 003 1.0258454 0.03580886
4 004 -0.9988735 0.46703447
5 005 0.5194171 1.22085659
6 006 0.6018485 -0.65019074
```

The next step is to assign units to different treatment conditions using `declare_assignment`. This function can randomize treatment in a variety of ways, with blocks, clusters, factorial, or multi-arm designs. Here, we use simple random assignment with a 45% probability of receiving treatment \(Z=1\):

```
assignment = declare_assignment(
  assignment_variable = "Z",
  prob = 0.45)
```

We have now defined a new function called `assignment`. Every time we call it, we obtain a new simulated dataset with randomly assigned treatment \(Z\):

```
population() %>%
  assignment() %>%
  head()
```

```
ID e W Z Z_cond_prob
1 001 -0.90802590 1.3979111 1 0.45
2 002 1.87353318 0.3681868 0 0.55
3 003 0.25383880 1.0398030 0 0.55
4 004 1.18620026 -0.6756406 1 0.45
5 005 0.04633063 0.9697663 0 0.55
6 006 0.02695792 -0.4390266 0 0.55
```

To define the potential outcomes, we use `declare_potential_outcomes`. This function takes two main arguments: (1) a formula linking treatment, exogenous variables, and potential outcomes; and (2) the list of conditions that we want to consider. In principle, we could consider several conditions (and their interactions) in some kind of factorial design. Here, we only consider treatment \(Z\in\{0,1\}\) in a linear data generating process:

\[Y = 0.31416 \cdot Z + W + e.\]

```
potential_outcomes = declare_potential_outcomes(
  Y ~ 0.31416 * Z + W + e,
  conditions = list(Z = 0:1))

population() %>%
  assignment() %>%
  potential_outcomes() %>%
  head()
```

```
ID e W Z Z_cond_prob Y_Z_0 Y_Z_1
1 001 -0.0301615 0.6814670 0 0.55 0.6513055 0.9654655
2 002 0.6005652 0.3532419 1 0.45 0.9538071 1.2679671
3 003 0.4977127 1.4589256 1 0.45 1.9566383 2.2707983
4 004 0.8799200 1.1666211 1 0.45 2.0465411 2.3607011
5 005 0.7868867 0.1957344 0 0.55 0.9826211 1.2967811
6 006 1.5831428 2.0717843 0 0.55 3.6549272 3.9690872
```

Our data generating process is now fully specified. Next, we need to declare the estimand, that is, the quantity that we are trying to estimate. To do this, we use the `declare_estimand` function.

```
estimand = declare_estimand(
  mean(Y_Z_1 - Y_Z_0))

population() %>%
  assignment() %>%
  potential_outcomes() %>%
  estimand()
```

```
estimand_label estimand
1 estimand 0.31416
```

The last thing we need to define is the estimation strategy, using `declare_estimator`. This function is flexible enough to accommodate pretty much any estimator (see the last section of this notebook). Here, we will use the `lm` model from base `R`:

```
estimator = declare_estimator(
  formula = Y ~ Z,
  model = lm)
```

It’s now time for the big payoff! We can chain all our functions to run a simulation, power analysis, and model assessment automatically:

```
design = population +
  assignment +
  potential_outcomes +
  estimand +
  estimator

diagnose_design(design)
```

```
Research design diagnosis based on 500 simulations. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).
Design Label Estimand Label Estimator Label Term N Sims Bias
design estimand estimator Z 500 0.00
(0.01)
RMSE Power Coverage Mean Estimate SD Estimate Mean Se Type S Rate
0.29 0.19 0.94 0.31 0.29 0.28 0.00
(0.01) (0.02) (0.01) (0.01) (0.01) (0.00) (0.00)
Mean Estimand
0.31
(0.00)
```

Our simulation suggests power and coverage of about 0.20 and 0.95 (for \(\alpha=0.05\)).
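To demystify what `diagnose_design` is doing, here is a minimal base-`R` sketch of the same power calculation (the seed and the number of simulations are my own choices, and I use simple Bernoulli assignment rather than `declare_assignment`'s complete random assignment, which makes almost no difference here):

```r
# Hand-rolled power analysis for the two-arm design above: simulate the
# data generating process many times, fit lm(Y ~ Z), and record the
# p-value on the treatment coefficient.
set.seed(2024)
pvals <- replicate(2000, {
  N <- 100
  e <- rnorm(N)
  W <- rnorm(N)
  Z <- rbinom(N, 1, 0.45)
  Y <- 0.31416 * Z + W + e
  summary(lm(Y ~ Z))$coefficients["Z", "Pr(>|t|)"]
})
mean(pvals < 0.05)  # estimated power: roughly 0.2, as in diagnose_design()
```

The `declare_*` steps bundle exactly this kind of loop, along with bookkeeping for bias, RMSE, coverage, and the other diagnosands.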

Here’s a compact version of what we have done so far:

```
design =
  declare_population(N = 100, e = rnorm(N), W = rnorm(N)) +
  declare_assignment(assignment_variable = "Z", prob = 0.45) +
  declare_potential_outcomes(Y ~ 0.31416 * Z + W + e) +
  declare_estimand(mean(Y_Z_1 - Y_Z_0)) +
  declare_estimator(formula = Y ~ Z, model = lm)

diagnose_design(design)
```

One benefit of the modular `DeclareDesign` approach is that we can easily compare different estimators. For instance, to see what happens when we adjust for \(W\) (a cause of outcome \(Y\)), we define a new estimator and simply add it to the chain:

```
estimator = declare_estimator(
  formula = Y ~ Z,
  model = lm,
  label = "LM")

estimator_w = declare_estimator(
  formula = Y ~ Z + W,
  model = lm,
  label = "LM w/ W")

design = population +
  assignment +
  potential_outcomes +
  estimand +
  estimator +
  estimator_w

diagnose_design(design)
```

```
Research design diagnosis based on 500 simulations. Diagnosand estimates with bootstrapped standard errors in parentheses (100 replicates).
Design Label Estimand Label Estimator Label Term N Sims Bias
design estimand LM Z 500 0.02
(0.01)
design estimand LM w/ W Z 500 0.01
(0.01)
RMSE Power Coverage Mean Estimate SD Estimate Mean Se Type S Rate
0.28 0.21 0.97 0.33 0.27 0.28 0.00
(0.01) (0.02) (0.01) (0.01) (0.01) (0.00) (0.00)
0.20 0.36 0.96 0.32 0.20 0.20 0.00
(0.01) (0.02) (0.01) (0.01) (0.01) (0.00) (0.00)
Mean Estimand
0.31
(0.00)
0.31
(0.00)
```

Controlling for \(W\) increases power and decreases RMSE.
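Why does adjusting help? A back-of-the-envelope check (my own calculation, not part of the simulation output): in this data generating process, \(W\) and \(e\) each have unit variance, so omitting \(W\) roughly doubles the residual variance. The standard error should therefore shrink by a factor of about

\[\frac{1}{\sqrt{2}} \approx 0.71\]

when we control for \(W\), which lines up with the RMSE dropping from 0.28 to 0.20 in the table above.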

In experimental settings, researchers often try to improve the precision of their estimates by using “block” randomization or by adjusting for pre-treatment covariates in a regression framework. We can use `DeclareDesign` to compare those two strategies in a two-arm design with outcome \(Y\), treatment \(Z\in\{0,1\}\), blocking covariate \(W\in\{0,1\}\), and noise \(e\).

First, we declare the data generation mechanism and estimand:

```
population = declare_population(
  N = 1000,
  W = rbinom(N, 1, .5),
  e = rnorm(N))

potential_outcomes = declare_potential_outcomes(
  Y ~ 0.31416 * Z + 2 * W + e)

estimand = declare_estimand(
  mean(Y_Z_1 - Y_Z_0))
```

Second, we declare two alternative assignment mechanisms: blocking vs. no blocking.

```
assignment = declare_assignment(
  prob = .5,
  assignment_variable = "Z")

assignment_block = declare_assignment(
  prob = .5,
  blocks = W,
  assignment_variable = "Z")
```

Third, we declare two alternative estimators: linear regression models with or without control for \(W\).

```
estimator = declare_estimator(
  formula = Y ~ Z,
  model = lm,
  label = "LM")

estimator_control = declare_estimator(
  formula = Y ~ Z + W,
  model = lm,
  term = "Z",
  label = "LM + control")
```

Finally, we combine all the steps to create two distinct designs: `design` and `design_block`. Since the two designs are identical except for one “step,” we use the `replace_step` function to avoid code duplication:

```
design = population +
  assignment +
  potential_outcomes +
  estimand +
  estimator +
  estimator_control

design_block = replace_step(
  design, assignment, assignment_block)
```

At this stage, we could use the `diagnose_design` function to compare power, coverage, and other diagnostics. A useful alternative is to run the simulations with `simulate_design`, and to plot the results with `ggplot2`:

```
sims = simulate_design(design, design_block)

ggplot(sims, aes(estimate)) +
  geom_histogram() +
  geom_vline(xintercept = 0.31416, color = "red") +
  facet_grid(~design_label + estimator_label) +
  theme_minimal()
```

The histograms show that the simulated estimates are more spread out (i.e., less precise) in the left-most panel, which corresponds to the design with linear regression, no blocking, and no adjustment. In this context, it seems useful to block or adjust.

This example is copied almost line-for-line from the DeclareDesign library. I think it is useful to describe it here as a contrast to the other examples.

We want to estimate the mean of an outcome \(Y\) in some large population (e.g., the population of Canada). For practical reasons, we cannot take a simple random sample. Instead, we draw a sample of 600 subjects in two steps:

- Sample 30 clusters from the larger population (e.g., 30 districts).
- Sample 20 subjects within each cluster.

One problem with this strategy is that observations on \(Y\) might be correlated within clusters. To simulate this, we use `draw_normal_icc` to create a latent outcome variable with intra-cluster correlation, and `draw_ordered` to convert the latent variable into an ordinal one. We also use the `add_level` function to create a nested data structure.

```
population_total = declare_population(
  block = add_level(N = 1),
  cluster = add_level(N = 1000),
  subject = add_level(
    N = 50,
    latent = draw_normal_icc(mean = 0, N = N, clusters = cluster, ICC = 0.5),
    Y = draw_ordered(x = latent, breaks = qnorm(seq(0, 1, length.out = 10)))))

population_total() %>%
  head
```

```
block cluster subject latent Y
1 1 0001 00001 -1.8909168 1
2 1 0001 00002 -2.3684342 1
3 1 0001 00003 -0.4404240 3
4 1 0001 00004 -2.5420079 1
5 1 0001 00005 -1.6946567 1
6 1 0001 00006 -0.6603045 3
```

```
population_total() %>%
  nrow
```

`[1] 50000`

Our total population has 50000 observations: 1000 clusters of 50 individuals.

In this application, we want to treat the total population as *fixed*, and to simulate what happens when different clusters are sampled. To do this, we create a single `data.frame` to hold population-level information. Then, we tell the `declare_population` function to draw from that fixed population:

```
population_fixed = population_total()

population = declare_population(
  data = population_fixed)
```

We declare the two stages of our sampling strategy using the `declare_sampling` function. When we apply those two stages to simulated data, we obtain a dataset with 600 observations (30 clusters with 20 subjects per cluster):

```
sampling_stage1 <- declare_sampling(
  strata = block,
  clusters = cluster,
  n = 30,
  sampling_variable = "Population")

sampling_stage2 <- declare_sampling(
  strata = cluster,
  n = 20,
  sampling_variable = "Cluster")

population() %>%
  sampling_stage1 %>%
  sampling_stage2 %>%
  nrow
```

`[1] 600`

We wish to estimate the mean of outcome \(Y\):

```
estimand = declare_estimand(
  mean(Y), label = "Ybar")
```

Finally, we check the coverage of confidence intervals computed using classical or clustered standard errors.
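Before running the simulation, a rough calculation (mine, treating the latent ICC of 0.5 as a stand-in for the ICC of \(Y\)) suggests what to expect. With \(m = 20\) subjects per cluster, the usual design effect formula gives

\[\text{DEFF} = 1 + (m - 1)\,\rho = 1 + 19 \times 0.5 = 10.5,\]

so classical standard errors, which ignore the clustering, should understate the true sampling variance by roughly an order of magnitude, and their confidence intervals should cover the true mean far less than 95% of the time.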

```
estimator_cluster = declare_estimator(
  formula = Y ~ 1,
  model = lm_robust,
  clusters = cluster,
  label = "Clustered")

estimator_classical = declare_estimator(
  formula = Y ~ 1,
  model = lm,
  label = "Classical")

design =
  population +
  sampling_stage1 +
  sampling_stage2 +
  estimand +
  estimator_classical +
  estimator_cluster

sim = simulate_design(design)

sim %>%
  mutate(Covered = conf.low < 5 & conf.high > 5) %>%
  arrange(estimator_label, conf.low) %>%
  mutate(sim_ID = factor(sim_ID, unique(sim_ID))) %>%
  ggplot(aes(x = sim_ID, ymin = conf.low, ymax = conf.high, color = Covered)) +
  geom_linerange() +
  geom_hline(yintercept = 5, color = "red") +
  facet_wrap(~estimator_label) +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid = element_blank()) +
  scale_colour_grey() +
  xlab("500 Simulations")
```

Above, we saw how to use built-in estimation and estimand functions from the `DeclareDesign` package. Now, we’ll see how to easily define custom estimands and estimators.

The `declare_estimand` function has a nice shortcut syntax, but to properly understand what it does, I think it is useful to use the more flexible (and verbose) `handler` argument.

The `handler` argument accepts a function that takes our simulated dataset as a `data.frame`, and returns a `data.frame` with two entries. Here, the “handler” takes `data` and calculates the average of the difference between potential outcomes. This is the average treatment effect that we ultimately want to estimate:

```
estimand = declare_estimand(
  handler = function(data) {
    data.frame(
      estimand_label = "ATE",
      estimand = mean(data$Y_Z_1 - data$Y_Z_0))
  },
  label = "ATE")
```

The `handler` argument of the `declare_estimator` function accepts a simulated dataset, and returns a `data.frame` that conforms to the tidy `broom` package format. This is nice, because it means that `DeclareDesign` can easily accommodate pretty much any estimation strategy.

```
estimator = declare_estimator(
  handler = label_estimator(function(data) {
    mod = lm(Y ~ Z, data)
    out = broom::tidy(mod, conf.int = TRUE)
    out[out$term == "Z", ]
  }),
  label = "LM")
```
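Any base-`R` routine that returns an estimate and a confidence interval can be wrapped the same way. The sketch below (the function name and column choices are mine) builds a difference-in-means estimator on top of `t.test`; wrapped in `label_estimator`, it could be passed to `declare_estimator` just like the `lm` handler above:

```r
# A custom "handler" is just a function: data.frame in, tidy data.frame out.
# This hypothetical example estimates the difference in means via t.test().
tidy_dim <- function(data) {
  tt <- t.test(Y ~ Z, data = data)
  data.frame(
    term      = "Z",
    estimate  = unname(diff(tt$estimate)),  # mean(Y | Z=1) - mean(Y | Z=0)
    p.value   = tt$p.value,
    conf.low  = -tt$conf.int[2],            # flip: t.test reports group0 - group1
    conf.high = -tt$conf.int[1])
}
```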

These functions can be plugged into any of the workflows described above and should work immediately. I hope these examples make clear that `DeclareDesign` can accept pretty much any estimator or quantity of interest.

Blair, Graeme, Jasper Cooper, Alexander Coppock, and Macartan Humphreys. 2019. “Declaring and Diagnosing Research Designs.” *American Political Science Review* 113 (3): 838–59.