Frontdoor adjustment for causal inference: A primer with examples in R
This notebook has 3 objectives:
- Offer some intuition about the “frontdoor” criterion and adjustment formula for causal inference.
- Show how to derive the frontdoor adjustment formula both algebraically and algorithmically, using the `dosearch` package for `R`.
- Show how to apply the frontdoor adjustment formula in simulated data.
I assume that you know what a DAG and a backdoor path are.[^1] The data I will consider follow this data generating process, with a cause $X$, a mediator $Z$, an outcome $Y$, and an unobserved confounder $U$:
library(ggdag)
#theme_set(theme_dag_blank())
co = read.table(header = TRUE, text = "
x y name
0 0 X
1 0 Z
2 0 Y
1 1 U")
d = dagify(
X ~ U,
Z ~ X,
Y ~ Z,
Y ~ U,
coords = co)
ggdag(d)

@Pea2009 uses the $do()$ operator to represent variables on which we intervene or that we manipulate.[^2] For instance, the expression $P(Y \mid do(X = x))$ represents the distribution of $Y$ when we manipulate the treatment variable $X$ to give it a specific value $x$.
Our goal is to estimate $P(Y \mid do(X))$. Unfortunately, the relationship between $X$ and $Y$ is confounded by the unobserved variable $U$, via this backdoor path: $X \leftarrow U \rightarrow Y$. Therefore, we cannot estimate the causal quantity of interest directly.
Instead, we will estimate the effect of $X$ on $Y$ indirectly via frontdoor adjustment. The key intuition behind this approach is this:
In a causal chain with three nodes $X \rightarrow Z \rightarrow Y$, we can estimate the effect of $X$ on $Y$ indirectly by combining two distinct quantities: (1) an estimate of the effect of $X$ on $Z$, and (2) an estimate of the effect of $Z$ on $Y$.
Roughly speaking, frontdoor adjustment thus proceeds in 3 steps:
- Estimate $P(Z \mid do(X))$
- Estimate $P(Y \mid do(Z))$
- Combine the two
To illustrate these steps, I will use a simulated dataset that conforms to the DAG above, and where the true effect of $X$ on $Y$ is equal to 0.25:
library(data.table)
set.seed(731460)
N = 1e5
U = rbinom(N, 1, prob = .2)
X = rbinom(N, 1, prob = .1 + U * .6)
Z = rbinom(N, 1, prob = .3 + X * .5)
Y = rbinom(N, 1, prob = .1 + U * .3 + Z * .5)
dat = data.table(X, Z, Y)

In the simplest case, we can estimate the effect of $X$ on $Y$ by multiplying two linear regression coefficients.
First, we estimate the effect of $X$ on $Z$. Since there is no open backdoor path between them, we do not need to control for other variables:

step1 = lm(Z ~ X, dat)

Then, we estimate the effect of $Z$ on $Y$, controlling for $X$ to close the backdoor:

step2 = lm(Y ~ Z + X, dat)

Finally, we combine the two estimates by multiplication:
coef(step1)["X"] * coef(step2)["Z"]

        X
0.2496002

Why did this recover the true effect (0.25)? To answer this question, we can use do-calculus to derive a frontdoor adjustment formula, or we can use an R package called dosearch to derive the formula automatically. The algorithmic approach is very useful in more complicated cases, but the algebraic approach helps us understand the assumptions that underlie the method.
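Before deriving anything, it may help to confirm what the target is. The sketch below is not an estimate from data: it plugs the probabilities used in the simulation code into the data generating process, once under an intervention on $X$ and once under plain conditioning on $X$. The helper names (`p_u`, `p_x`, `p_z`, `p_y`, `mean_y`) are mine, introduced for illustration only.

```r
# Probabilities copied from the simulation code
p_u <- 0.2                                     # P(U = 1)
p_x <- function(u) 0.1 + 0.6 * u               # P(X = 1 | U = u)
p_z <- function(x) 0.3 + 0.5 * x               # P(Z = 1 | X = x)
p_y <- function(u, z) 0.1 + 0.3 * u + 0.5 * z  # P(Y = 1 | U = u, Z = z)

# P(Y = 1) when Z follows P(Z | X = x) and U equals 1 with probability q
mean_y <- function(x, q) {
  g <- expand.grid(u = 0:1, z = 0:1)
  w_u <- ifelse(g$u == 1, q, 1 - q)
  w_z <- ifelse(g$z == 1, p_z(x), 1 - p_z(x))
  sum(w_u * w_z * p_y(g$u, g$z))
}

# Interventional contrast: setting X by hand leaves U at its marginal distribution
true_effect <- mean_y(1, p_u) - mean_y(0, p_u)

# Observational contrast: conditioning on X shifts the distribution of U (Bayes' rule)
p_u_given_x1 <- p_u * p_x(1) / (p_u * p_x(1) + (1 - p_u) * p_x(0))
p_u_given_x0 <- p_u * (1 - p_x(1)) /
  (p_u * (1 - p_x(1)) + (1 - p_u) * (1 - p_x(0)))
naive_diff <- mean_y(1, p_u_given_x1) - mean_y(0, p_u_given_x0)

print(c(true_effect = true_effect, naive_diff = naive_diff))
```

The interventional contrast is 0.25, while the naive observational contrast is about 0.418. The gap between the two is the confounding bias introduced by $U$, which the frontdoor strategy avoids.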
Algebraic frontdoor
I borrow notation from a nice Cross Validated answer, and make the following assumptions:
- Full mediation: There is no directed path from $X$ to $Y$, except through $Z$.
- Unconfoundedness 1: There is no open backdoor from $X$ to $Z$.
- Unconfoundedness 2: All backdoors from $Z$ to $Y$ are blocked by $X$.
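These three assumptions can also be checked mechanically on the DAG. The sketch below uses the `dagitty` package, which is my assumption here: it is not loaded anywhere else in this notebook, although `ggdag` is built on top of it. The object names are illustrative.

```r
library(dagitty)

# Same DAG as at the top of the notebook, with U declared as unobserved
g <- dagitty("dag {
  U [latent]
  U -> X
  U -> Y
  X -> Z
  Z -> Y
}")

# Assumption 1: the only directed path from X to Y runs through Z
directed_xy <- paths(g, from = "X", to = "Y", directed = TRUE)$paths
print(directed_xy)

# Assumption 2: any backdoor path from X to Z (a path that starts with an
# arrow pointing into X) must be closed
xz <- paths(g, from = "X", to = "Z")
backdoor_xz_open <- xz$open[startsWith(xz$paths, "X <-")]
print(backdoor_xz_open)

# Assumption 3: adjusting for X is enough to identify the effect of Z on Y
adj_zy <- adjustmentSets(g, exposure = "Z", outcome = "Y")
print(adj_zy)
```

If the assumptions hold, the first call should list a single path through $Z$, the second should report no open backdoor, and the third should return $X$ as an adjustment set.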
The estimation proceeds in three steps.
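Step 1: Under assumption 2, the relationship between $X$ and $Z$ is not confounded (see DAG at the top). As a result, we have:

$$
P(Z \mid do(X)) = P(Z \mid X)
$$

Step 2: In contrast, the relationship between $Z$ and $Y$ is confounded. Thankfully, adjusting for $X$ suffices to estimate the effect of $Z$ on $Y$, because it blocks the backdoor path $Z \leftarrow X \leftarrow U \rightarrow Y$. As a result, we can use the backdoor adjustment formula[^3] to get the following, where $X'$ ranges over the values of $X$:

$$
P(Y \mid do(Z)) = \sum_{X'} P(Y \mid X', Z) \, P(X')
$$

Step 3: Back out the effect of $X$ on $Y$ by combining what we obtained above:

$$
\begin{aligned}
P(Y \mid do(X)) &= \sum_Z P(Y \mid Z, do(X)) \, P(Z \mid do(X)) && (1) \\
&= \sum_Z P(Y \mid do(Z)) \, P(Z \mid do(X)) && (2) \\
&= \sum_Z P(Y \mid do(Z)) \, P(Z \mid X) && (3) \\
&= \sum_Z \sum_{X'} P(Y \mid X', Z) \, P(X') \, P(Z \mid X) && (4) \\
&= \sum_Z P(Z \mid X) \sum_{X'} P(Y \mid X', Z) \, P(X') && (5)
\end{aligned}
$$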
Equation (1) conditions on $Z$ and sums over its values. Equation (2) is allowed because the effect of $X$ is entirely mediated by $Z$, and because $X$ blocks the backdoor path from $Z$ to $Y$. Intuitively, if we manipulate $Z$, it no longer matters what happened to $X$. Equation (3) is allowed because the relationship between $X$ and $Z$ is unconfounded. Equation (4) applies the backdoor adjustment formula to estimate the effect of $Z$ on $Y$ by conditioning on $X$.
Equation (5) is the frontdoor adjustment formula. The left part is the effect of $X$ on $Z$. The right part is the effect of $Z$ on $Y$.
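As a numerical check of the frontdoor adjustment formula, the sketch below builds the exact joint distribution of $(X, Z, Y)$ implied by the simulation parameters, with $U$ summed out, and then applies the formula using only that observable distribution. The helpers `bern`, `p_y_do_z`, and `frontdoor_p` are names I made up for illustration. Because $Z$ is binary, the difference $P(Y = 1 \mid do(X = 1)) - P(Y = 1 \mid do(X = 0))$ also factors into the product of the effect of $X$ on $Z$ and the effect of $Z$ on $Y$, which the last lines verify.

```r
# Bernoulli probability mass: P(V = v) when P(V = 1) = p
bern <- function(p, v) if (v == 1) p else 1 - p

# Exact joint P(X, Z, Y), obtained by summing the unobserved U out of the
# data generating process used in the simulation
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(X = 0:1, Z = 0:1, Y = 0:1))
for (u in 0:1) for (x in 0:1) for (z in 0:1) for (y in 0:1) {
  joint[x + 1, z + 1, y + 1] <- joint[x + 1, z + 1, y + 1] +
    bern(0.2, u) * bern(0.1 + 0.6 * u, x) *
    bern(0.3 + 0.5 * x, z) * bern(0.1 + 0.3 * u + 0.5 * z, y)
}

p_x  <- apply(joint, 1, sum)        # P(X)
p_xz <- apply(joint, c(1, 2), sum)  # P(X, Z)

# P(Y = 1 | do(Z = z)): backdoor adjustment for X
p_y_do_z <- function(z) sum(joint[, z + 1, 2] / p_xz[, z + 1] * p_x)

# P(Y = 1 | do(X = x)): frontdoor adjustment formula
frontdoor_p <- function(x) {
  p_z_given_x <- p_xz[x + 1, ] / p_x[x + 1]  # P(Z = 0 | x), P(Z = 1 | x)
  sum(p_z_given_x * c(p_y_do_z(0), p_y_do_z(1)))
}

effect <- frontdoor_p(1) - frontdoor_p(0)

# With a binary mediator, the same contrast is a product of two effects
x_on_z <- unname(p_xz[2, 2] / p_x[2] - p_xz[1, 2] / p_x[1])
z_on_y <- p_y_do_z(1) - p_y_do_z(0)

print(c(frontdoor = effect, product = x_on_z * z_on_y))
```

Both numbers equal 0.25, the true effect, even though $U$ never enters the calculation after the joint distribution has been built.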
Algorithmic frontdoor: R & dosearch
The dosearch package for R includes an algorithm that can automatically apply the rules of do-calculus to convert DAGs to adjustment formulas. Instead of manipulating equations ourselves like we did above, we can simply call the dosearch function to obtain the frontdoor adjustment formula automatically:
library('dosearch')
data1 = "P(X, Y, Z)"
query1 = "P(Y | do(X))"
graph1 = "
U -> X
U -> Y
X -> Z
Z -> Y
"
# compute
frontdoor = dosearch(data1, query1, graph1)
# convert to Rmarkdown equation
cat(paste("$$", frontdoor$formula, "$$"))

$$
\sum_{Z}\left(p(Z|X)\sum_{X}\left(p(X)p(Y|X,Z)\right)\right)
$$
Which is equivalent to the formula we obtained above.
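To see why the mediator matters, we can ask the same question while pretending that $Z$ was never measured. The sketch below relies on the `identifiable` and `formula` elements of the object returned by `dosearch`, as I understand them from the package documentation. The object names `with_mediator` and `no_mediator` are illustrative.

```r
library(dosearch)

graph <- "
  U -> X
  U -> Y
  X -> Z
  Z -> Y
"
query <- "P(Y | do(X))"

# Z observed: the frontdoor formula is available
with_mediator <- dosearch("P(X, Y, Z)", query, graph)

# Z unobserved: only the confounded X-Y relationship is left
no_mediator <- dosearch("P(X, Y)", query, graph)

print(with_mediator$identifiable)
print(no_mediator$identifiable)
```

With only $P(X, Y)$ as input, the search should report that the effect is not identifiable, since the backdoor through $U$ cannot be closed and there is no mediator to go through.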
Example: Frontdoor simulation
With the frontdoor adjustment formula in hand, we can finally estimate the causal effect of $X$ on $Y$ in our simulated data. We will work with this version of the formula:

$$
P(Y \mid do(X)) = \sum_Z P(Z \mid X) \sum_{X'} P(Y \mid X', Z) \, P(X')
$$
# Plug-in estimates of each term in the formula
dat[, `P(X)` := fifelse(X == 1, mean(X), 1 - mean(X))][
    , `P(Z|X)` := mean(Z), by = X][
    , `P(Y|Z,X)` := mean(Y), by = .(Z, X)][
    , Y := NULL]
# Keep one row per (X, Z) cell
dat = unique(dat)
# Backdoor adjustment: average P(Y|Z,X) over the marginal distribution of X
dat[, `P(Y|do(Z))` := sum(`P(Y|Z,X)` * `P(X)`), by = Z]
# Frontdoor formula evaluated at X = 0
`P(Y|do(X=0))` = with(dat[X == 0],
    `P(Z|X)`[Z == 1] * `P(Y|do(Z))`[Z == 1] +
    (1 - `P(Z|X)`)[Z == 0] * `P(Y|do(Z))`[Z == 0]
)
# Frontdoor formula evaluated at X = 1
`P(Y|do(X=1))` = with(dat[X == 1],
    `P(Z|X)`[Z == 1] * `P(Y|do(Z))`[Z == 1] +
    (1 - `P(Z|X)`)[Z == 0] * `P(Y|do(Z))`[Z == 0]
)
`P(Y|do(X=1))` - `P(Y|do(X=0))`

            X     Z    P(X)    P(Z|X)  P(Y|Z,X)
        <int> <int>   <num>     <num>     <num>
     1:     0     0 0.78018 0.2993796 0.1245495
     2:     0     0 0.78018 0.2993796 0.1245495
     3:     0     0 0.78018 0.2993796 0.1245495
     4:     1     1 0.21982 0.7977436 0.7966469
     5:     0     0 0.78018 0.2993796 0.1245495
    ---
 99996:     0     1 0.78018 0.2993796 0.6239671
 99997:     1     1 0.21982 0.7977436 0.7966469
 99998:     0     0 0.78018 0.2993796 0.1245495
 99999:     0     1 0.78018 0.2993796 0.6239671
100000:     1     1 0.21982 0.7977436 0.7966469

       X     Z    P(X)    P(Z|X)  P(Y|Z,X) P(Y|do(Z))
   <int> <int>   <num>     <num>     <num>      <num>
1:     0     0 0.78018 0.2993796 0.1245495  0.1607537
2:     1     1 0.21982 0.7977436 0.7966469  0.6619256
3:     0     1 0.78018 0.2993796 0.6239671  0.6619256
4:     1     0 0.21982 0.7977436 0.2892488  0.1607537

[1] 0.249766

As shown above, we can get essentially the same result using regression and multiplication:
coef(lm(Y ~ Z + X))["Z"] * coef(lm(Z ~ X))["X"]

        Z
0.2496002

Or by estimating a model that would be impossible to fit in practice (remember that $U$ is unobservable):

coef(lm(Y ~ X + U))["X"]

        X
0.2549541

[^1]: French-speaking readers can refer to chapter 6 of my book “Analyse Causale et Méthodes Quantitatives”, available for free as a PDF: https://www.pum.umontreal.ca/catalogue/analyse_causale_et_methodes_quantitatives
[^2]: The manipulation could be hypothetical or counterfactual.
[^3]: See @Pea2009 for a detailed treatment of backdoor adjustment, and @Pea2016 for an accessible primer.