hmx_simonsohn

This notebook explores:

A Response to Recent Critiques of Hainmueller, Mummolo and Xu (2019) on Estimating Conditional Relationships https://arxiv.org/pdf/2502.05717

In Section A.2 of the paper, the authors describe a data generating process.

Given this data generating process, we know that the true CME is: \(X-X^2\).
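
As a brief check on that expression, the derivation can be sketched from the simulation code further down, where \(D = 0.5X + e_D\) and \(Y = 1 + 1.5X + D^2 - DX^2 + e_Y\). The symbols \(e_D\) and \(e_Y\) mirror the code's `eD` and `eY`; the only assumption is that \(e_D\) has mean zero given \(X\), which holds for the independent normal draw used in the simulation.

\[
\frac{\partial Y}{\partial D} = 2D - X^2,
\qquad
E\left[\frac{\partial Y}{\partial D} \,\middle|\, X\right]
= 2\,E[D \mid X] - X^2
= 2(0.5X) - X^2
= X - X^2 .
\]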

The authors’ main criticism of the GAM approach is that it does not target the “right” estimand, that is, the conditional marginal effect (CME):

“The key issue is that, as a method for estimating the response surface, i.e., the conditional expectation of the outcome, GAM does not explicitly model the CME. In contrast, kernel estimators based on local linear regression approximate derivatives more directly.”

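Below, I show that it is easy to target the CME with a GAM: fit the response surface, compute the unit-level slope of Y with respect to D, then smooth those slopes over X. The dashed red line represents the known true CME. The grey line represents the GAM estimate. They are very close to one another.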

library(mgcv)
library(ggplot2)
library(data.table)
library(marginaleffects)
theme_set(theme_bw())

# Simulate data
set.seed(48103)
N <- 10000
X <- runif(N, min = -2, max = 2)
eD <- rnorm(N, sd = .1)
eY <- rnorm(N, sd = .1)
D <-  0.5 * X + eD
Y <- 1 + 1.5 * X + D^2 - D * X^2 + eY 
dat <- data.table(Y, D, X)

# True CME 
x_true <- seq(-2, 2, length.out = 100)
y_true <- x_true - x_true^2
truth <- data.frame(x = x_true, y = y_true)

# Fit GAM model
mod <- gam(Y ~ s(X) + s(D) + te(X, D), data = dat)

# Compute unit-level slopes dY/dD, then smooth them over X with a second GAM
# (the X inside `transform` is the simulated vector defined above)
s <- comparisons(mod,
  variables = "D", 
  comparison = "dydx",
  vcov = FALSE,
  transform = \(x) predict(gam(x ~ s(X))))

# Plot estimated CME vs. truth
# (`linewidth` replaces `size` for lines in ggplot2 >= 3.4.0)
ggplot(s) +
  geom_line(data = truth, aes(x, y), linetype = 2, col = "red", linewidth = 1) +
  geom_line(data = s, aes(X, estimate), linewidth = 1, alpha = .5) +
  labs(y = "CME", x = "X")

A second strategy is to bin X and average the unit-level slopes within each bin, which “flattens” the X dimension:

dat[, X_bin := cut(X, breaks = 50)]
dat[, X_bin_avg := mean(X), by = X_bin]

# Fit GAM model
mod <- gam(Y ~ s(X) + s(D) + te(X, D), data = dat)

# Compute and plot CME vs. Truth
s <- avg_slopes(mod, variables = "D", by = "X_bin_avg", vcov = FALSE)

ggplot(s) +
  geom_line(data = truth, aes(x, y), linetype = 2, col = "red", linewidth = 1) +
  geom_line(data = s, aes(X_bin_avg, estimate), linewidth = 1, alpha = .5) +
  labs(y = "CME", x = "X")
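
In symbols, the binned estimate can be read as a within-bin average of unit-level slopes. The notation here is my own shorthand rather than the paper's: \(\hat f\) is the fitted GAM surface, \(B_b\) is the set of observations falling in bin \(b\), and \(n_b\) is the number of observations in that bin.

\[
\widehat{\mathrm{CME}}_b
= \frac{1}{n_b} \sum_{i \in B_b} \frac{\partial \hat f(X_i, D_i)}{\partial D},
\qquad
\bar X_b = \frac{1}{n_b} \sum_{i \in B_b} X_i .
\]

Each point on the grey line is then the pair \((\bar X_b, \widehat{\mathrm{CME}}_b)\), with \(\bar X_b\) corresponding to `X_bin_avg` in the code.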