Regular assignment mechanism

Causal inference for observational studies

  • So far, we have considered situations where the treatment assignment was controlled by an experimenter and the assignment mechanism was known.
  • In observational studies, the mechanism is not known.
  • Nevertheless, we often proceed with the analysis assuming that it is a regular assignment mechanism, that is:
  1. Individualistic,
  2. Probabilistic,
  3. Unconfounded.

Unconfoundedness

  • Among the three assumptions, unconfoundedness is the most controversial.
  • In the super-population setting, unconfoundedness implies: \[ \mathbb{P}[W_i = 1| Y_i(0), Y_i(1), X_i] = \mathbb{P}[W_i = 1| X_i] = e(X_i), \] or: \[ W_i \perp \!\!\! \perp Y_i(1), Y_i(0) | X_i. \]
  • This assumption is not testable, because it compares the distribution of missing outcomes \(Y_i(1 - W_i)\) with the distribution of observed outcomes \(Y_i(W_i)\).
  • Therefore, we need to rely on background knowledge to justify the assumption: for example, when caseworkers assign job training programs according to a known index of observed characteristics.

Balancing covariate distributions

  • In observational studies, the observed covariates can be substantially unbalanced between treated and control units.
  • In the regular assignment mechanism, we can remove all biases in comparisons between treated and control units as long as we adjust for differences in observed covariates.
  • Therefore, the main focus of the analysis under the regular assignment mechanism is how to adjust for differences in the covariate distributions.

Balancing score

  • It is easier to balance the covariate distributions if we can reduce the dimensionality of the covariates while maintaining unconfoundedness.
  • Definition: Balancing score
    • A balancing score \(b(x)\) is a function of the covariates such that: \[ W_i \perp \!\!\! \perp X_i | b(X_i). \]
  • There can be multiple balancing scores including \(X_i\) itself.

Propensity score is a balancing score

  • We can show that the propensity score is a balancing score.
  • First, we have: \[ \mathbb{P}[W_i = 1| X_i, e(X_i)] = \mathbb{P}[W_i = 1| X_i] = e(X_i). \]
  • Second, we have: \[ \begin{split} \mathbb{P}[W_i = 1| e(X_i)] &= \mathbb{E}[W_i|e(X_i)]\\ &= \mathbb{E}[\mathbb{E}[W_i | X_i, e(X_i)] | e(X_i)]\\ &= \mathbb{E}[e(X_i)| e(X_i)]\\ &= e(X_i). \end{split} \]
  • Therefore: \[ \mathbb{P}[W_i = 1| X_i, e(X_i)] = \mathbb{P}[W_i = 1| e(X_i)]. \]
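  • As a quick numerical illustration (a sketch with a hypothetical two-covariate design, not the simulation used later in this document), the covariate means should be approximately equal across treatment arms within coarse strata of the true propensity score:
set.seed(2)
x1 <- rnorm(10000)                               # two covariates
x2 <- rnorm(10000)
e <- exp(0.5 * x1 - 0.5 * x2) / (1 + exp(0.5 * x1 - 0.5 * x2))  # true propensity score
w <- runif(10000) < e                            # treatment assignment
stratum <- cut(e, breaks = seq(0, 1, by = 0.2))  # coarse strata of e(X)
# compare the mean of x1 by treatment status within each stratum
aggregate(x1, by = list(stratum = stratum, treated = w), FUN = mean)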

Balancing score maintains unconfoundedness

  • Lemma: Unconfoundedness given a balancing score
    • In a regular assignment mechanism, the assignment is unconfounded given any balancing score: \[ W_i \perp \!\!\! \perp Y_i(0), Y_i(1) | b(X_i). \] \[ \begin{split} &\mathbb{P}[W_i = 1 | Y_i(0), Y_i(1), b(X_i)]\\ &= \mathbb{E}[W_i | Y_i(0), Y_i(1), b(X_i)]\\ &= \mathbb{E}[\mathbb{E}[W_i | Y_i(0), Y_i(1), X_i, b(X_i)] | Y_i(0), Y_i(1), b(X_i)]\\ &= \mathbb{E}[\mathbb{E}[W_i | X_i, b(X_i)] | Y_i(0), Y_i(1), b(X_i)] \text{ by unconfoundedness}\\ &= \mathbb{E}[\mathbb{E}[W_i | b(X_i)] | Y_i(0), Y_i(1), b(X_i)] \text{ by balancing score}\\ &= \mathbb{E}[W_i | b(X_i)] \\ &= \mathbb{P}[W_i = 1 | b(X_i)]. \end{split} \]

Adjusting balancing score is sufficient

  • The previous lemma implies that, to remove the biases associated with differences in the covariates, it is sufficient to adjust for differences in the balancing scores.
  • Because the propensity score is a balancing score, it is sufficient to adjust for differences in the propensity scores.

Propensity score is the coarsest balancing score

  • Lemma: Coarseness of balancing score
    • The propensity score is the coarsest balancing score. That is, the propensity score is a function of every balancing score.
  • Suppose not. Then, there is a balancing score \(b(x)\) and values \(x, x'\) such that \(b(x) = b(x')\) but \(e(x) \neq e(x')\).
  • Then, we have: \[ \mathbb{P}(W_i = 1| X_i = x) = e(x) \neq e(x') = \mathbb{P}(W_i = 1| X_i = x'). \]
  • Then, \(W_i\) and \(X_i\) are not independent conditional on \(b(X_i) = b(x) = b(x')\), which contradicts that \(b(x)\) is a balancing score.

Estimation and inference

Estimands

  • The super-population average treatment effect: \[ \tau_{sp} \equiv \mathbb{E}[Y_i(1) - Y_i(0)] = \mathbb{E}[\tau_{sp}(X_i)], \] where \(\tau_{sp}(x)\) is: \[ \tau_{sp}(x) \equiv \mu_t(x) - \mu_c(x) = \mathbb{E}[Y_i(1) - Y_i(0) | X_i = x]. \]

Estimands

  • The finite-sample average treatment effect: \[ \tau_{fs} \equiv \frac{1}{N} \sum_{i = 1}^N[Y_i(1) - Y_i(0)]. \]
  • It is sometimes useful to estimate the average treatment effect conditional on the values of the pretreatment variables in the finite sample: \[ \tau_{cond} \equiv \frac{1}{N} \sum_{i = 1}^N \tau_{sp}(X_i). \]

Efficiency bound

  • The asymptotic sampling variance of any estimator of \(\tau_{sp}\) cannot be smaller than: \[ \mathbb{V}_{sp}^{eff} \equiv \mathbb{E}\Big[ \frac{\sigma_c^2(X_i)}{1 - e(X_i)} + \frac{\sigma_t^2(X_i)}{e(X_i)} + [\tau_{sp}(X_i) - \tau_{sp}]^2 \Big]. \]
  • The first two terms get larger as the propensity scores become more extreme (closer to zero or one).
  • The third term gets larger as the treatment effects become more heterogeneous across the values of the covariates.

Efficiency bound

  • The asymptotic sampling variance of any estimator of \(\tau_{cond}\) cannot be smaller than: \[ \mathbb{V}_{cond}^{eff} \equiv \mathbb{E}\Big[ \frac{\sigma_c^2(X_i)}{1 - e(X_i)} + \frac{\sigma_t^2(X_i)}{e(X_i)}\Big]. \]
  • The two terms get larger as the propensity scores become more extreme (closer to zero or one).
  • The term \([\tau_{sp}(X_i) - \tau_{sp}]^2\) does not appear here, because \(\tau_{cond}\) conditions on the realized covariate values, so the variation of \(\tau_{sp}(X_i)\) across the covariate distribution does not contribute to the uncertainty.
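  • As an illustration, the following sketch evaluates both bounds by Monte Carlo for the data-generating process used in the simulation section below, where \(X_i \sim N(0, 1)\), \(e(x) = \exp(x) / (1 + \exp(x))\), \(\sigma_c^2(x) = \sigma_t^2(x) = 1\), \(\tau_{sp}(x) = 1 + 0.05 x\), and \(\tau_{sp} = 1\):
set.seed(2)
x <- rnorm(1000000)
e <- exp(x) / (1 + exp(x))                            # propensity score
tau_x <- 1 + 0.05 * x                                 # tau_sp(x) under the simulated DGP
v_sp_eff <- mean(1 / (1 - e) + 1 / e + (tau_x - 1)^2) # bound for tau_sp
v_cond_eff <- mean(1 / (1 - e) + 1 / e)               # bound for tau_cond
c(v_sp_eff, v_cond_eff)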

Estimation strategies

  • Model-based imputation: Imputing the missing potential outcomes with a model such as a linear regression function.
  • Weighting estimator: Estimating the propensity score and then estimating the treatment effect as if the data came from a randomized experiment.
  • Blocking estimator: Estimating the propensity score and then estimating the treatment effect as if the data came from a stratified randomized experiment.
  • Matching estimator: Matching each unit to units with similar covariates in the opposite treatment group and comparing their outcomes.

Model-based imputation

  • It is common in Economics to assume: \[ Y_i^{obs} = \tau \cdot W_i + W_i \cdot X_i \beta_t + (1 - W_i) \cdot X_i \beta_c + \epsilon_i, \] and estimate the super-population or finite-sample average treatment effect by: \[ \hat{\tau}_{ols} = \frac{1}{N} \sum_{i = 1}^N[W_i \cdot (Y_i^{obs} - X_i \hat{\beta}_{c, ols}) + (1 - W_i) \cdot (X_i \hat{\beta}_{t, ols} - Y_i^{obs})]. \]
  • The estimator can be inaccurate when the covariates are unbalanced, because, on average over the treated and control units respectively, the imputed effects rely on extrapolation through the terms \((\overline{X}_t - \overline{X}_c) \hat{\beta}\): \[ Y_i^{obs} - X_i \hat{\beta}_{c, ols} = \overline{Y}_t^{obs} - \overline{Y}_c^{obs} - (\overline{X}_t - \overline{X}_c) \hat{\beta}_{c, ols}, \] \[ X_i \hat{\beta}_{t, ols} - Y_i^{obs} = \overline{Y}_t^{obs} - \overline{Y}_c^{obs} - (\overline{X}_t - \overline{X}_c) \hat{\beta}_{t, ols}. \]

Weighting estimator

  • Under the unconfoundedness, we have: \[ \begin{split} \mathbb{E}\Bigg[ \frac{Y_i^{obs} \cdot W_i}{e(X_i)} \Bigg] &= \mathbb{E}\Bigg[ \mathbb{E}\Bigg[ \frac{Y_i^{obs} \cdot W_i}{e(X_i)} \Bigg | X_i \Bigg] \Bigg]\\ &= \mathbb{E}\Bigg[ \mathbb{E}\Bigg[ \frac{Y_i(1) \cdot W_i}{e(X_i)} \Bigg | X_i \Bigg] \Bigg]\\ &= \mathbb{E}\Bigg[ \frac{\mathbb{E}[Y_i(1) | X_i] \cdot \mathbb{E}[W_i | X_i]}{e(X_i)} \Bigg]\\ &= \mathbb{E}[ \mathbb{E}[Y_i(1) | X_i] ]\\ &= \mathbb{E}[Y_i(1)]. \end{split} \]

Weighting estimator

  • Similarly, we have: \[ \mathbb{E}\Bigg[ \frac{Y_i^{obs} \cdot (1 - W_i)}{1 - e(X_i)} \Bigg] = \mathbb{E}[Y_i(0)]. \]
  • Then, we can estimate \(\tau_{sp}\) by: \[ \begin{split} \hat{\tau}_{weight} &\equiv \frac{1}{N} \sum_{i = 1}^N \frac{W_i \cdot Y_i^{obs}}{\hat{e}(X_i)} - \frac{1}{N} \sum_{i = 1}^N \frac{(1 - W_i) \cdot Y_i^{obs}}{1 - \hat{e}(X_i)}. \end{split} \]
  • This is again imprecise when the covariate distribution is unbalanced, i.e. when either \(\hat{e}(X_i)\) or \(1 - \hat{e}(X_i)\) is close to zero.

Blocking estimator

  • To avoid relying on extreme values of the propensity score, one may stratify the sample according to the value of the propensity score: for \(j = 1, \cdots, J\) and thresholds \(0 = b_0 \le b_1 \le \cdots \le b_J = 1\), let \(B_i(j)\) be a binary indicator of \(b_{j - 1} < \hat{e}(X_i) \le b_j\).
  • Then, define the stratum-level treatment effect: \[ \hat{\tau}^{diff}(j) \equiv \frac{\sum_{i: B_i(j) = 1} Y_i^{obs} \cdot W_i}{\sum_{i: B_i(j) = 1} W_i} - \frac{\sum_{i: B_i(j) = 1} Y_i^{obs} \cdot (1 - W_i)}{\sum_{i: B_i(j) = 1} (1 - W_i)}, \] to construct: \[ \hat{\tau}^{diff} \equiv \sum_{j = 1}^J \frac{N(j)}{N} \hat{\tau}^{diff}(j). \]

Matching estimator

  • Define some metric on the space of covariates and let \(M(i) \subset \{1, \cdots, N\}\) be the set of the \(M\) nearest units to \(i\), measured by this metric, among the units in the treatment group opposite to \(i\)'s.
  • Then, define: \[ \hat{Y}_i(0) = (1 - W_i) \cdot Y_i^{obs} + W_i \cdot \frac{1}{M} \sum_{j \in M(i)} Y_j, \] \[ \hat{Y}_i(1) = W_i \cdot Y_i^{obs} + (1 - W_i) \cdot \frac{1}{M} \sum_{j \in M(i)} Y_j, \] and construct: \[ \hat{\tau}_{matching} \equiv \frac{1}{N} \sum_{i = 1}^N [\hat{Y}_i(1) - \hat{Y}_i(0)]. \]

Assessing unconfoundedness

Seeking supporting evidence for unconfoundedness

  • The unconfoundedness assumption is not testable.
  • Instead of testing it directly, we estimate pseudo-causal estimands whose values are known a priori (typically zero), under assumptions more restrictive than unconfoundedness.
  • If these analyses suggest that the null hypothesis does not hold, the unconfoundedness assumption is viewed as less plausible.
  • If the unconfoundedness assumption is implausible, we should conclude that the causal effects of interest cannot be estimated credibly and precisely.

Estimating effects on pseudo-outcomes

  • One approach is to test the pseudo-outcome unconfoundedness: \[ W_i \perp \!\!\! \perp X_i^p | X_i^r, \] for a partition of the covariates \(X_i = (X_i^p, X_i^r)\).
  • Testing this property makes sense if a stronger version of unconfoundedness, subset unconfoundedness, is supposed to hold: \[ W_i \perp \!\!\! \perp (Y_i(1), Y_i(0)) | X_i^r, \] and \(X_i^p\) is considered a proxy for the potential outcomes, such as lagged outcomes.
  • This is a design-stage method because it does not require observing the outcomes.
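  • For example (a hypothetical sketch with simulated data; these variables do not appear in the simulation section below), one simple version of this check regresses the pseudo-outcome on the treatment indicator adjusting for \(X_i^r\); the coefficient on the treatment should be close to zero:
set.seed(3)
x_r <- rnorm(1000)                            # remaining covariates
x_p <- 0.5 * x_r + rnorm(1000)                # lagged outcome used as a pseudo-outcome
w <- runif(1000) < exp(x_r) / (1 + exp(x_r))  # assignment depends only on x_r
summary(lm(x_p ~ w + x_r))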

Estimating effects of pseudo-treatments

  • Suppose that the assignment was such that there was 1 treatment group and 2 control groups. \[ W_i = \begin{cases} 1 &\text{ if } G_i = t\\ 0 &\text{ if } G_i = c_1, c_2. \end{cases} \]
  • Moreover, suppose that a stronger version of unconfoundedness, group unconfoundedness, is supposed to hold: \[ G_i \perp \!\!\! \perp Y_i(0), Y_i(1) | X_i, \] which has a testable implication: \[ G_i \perp \!\!\! \perp Y_i(0) | X_i, G_i \in \{c_1, c_2\} \Leftrightarrow G_i \perp \!\!\! \perp Y_i^{obs} | X_i, G_i \in \{c_1, c_2\}. \]
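  • For example (a hypothetical sketch; the simulation below has no such groups), one can restrict the sample to the two control groups and regress the observed outcome on the group indicator, adjusting for the covariates; the group coefficient should be close to zero:
set.seed(5)
x <- rnorm(1000)
g <- sample(c("t", "c1", "c2"), size = 1000, replace = TRUE)  # one treatment, two control groups
y <- 1 * (g == "t") + 0.5 * x + rnorm(1000)                   # outcome does not depend on c1 vs c2
controls <- data.frame(y, g, x)[g != "t", ]
summary(lm(y ~ g + x, data = controls))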

Robustness to the pretreatment variables

  • Again assume subset unconfoundedness: \[ W_i \perp \!\!\! \perp (Y_i(1), Y_i(0)) | X_i^r, \] for a partition of the covariates \(X_i = (X_i^p, X_i^r)\).
  • Then, the estimated treatment effect should be robust to the inclusion of \(X_i^p\).
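  • A sketch on simulated data (hypothetical variables, not used elsewhere in this document): compare the estimated coefficient on the treatment with and without the additional pretreatment variable; similar estimates support, but do not prove, unconfoundedness.
set.seed(4)
x_r <- rnorm(1000)
x_p <- 0.5 * x_r + rnorm(1000)                # additional pretreatment variable
w <- runif(1000) < exp(x_r) / (1 + exp(x_r))  # assignment depends only on x_r
y <- 1 * w + 0.5 * x_r + rnorm(1000)
coef(lm(y ~ w + x_r))["wTRUE"]                # excluding x_p
coef(lm(y ~ w + x_r + x_p))["wTRUE"]          # including x_p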

Simulation

Set simulation parameters

library(magrittr)      # provides the pipe operator %>%
library(modelsummary)  # provides modelsummary()
library(ggplot2)       # provides ggplot() and friends

set.seed(1)
N <- 1000              # sample size
J <- 4                 # number of blocks for the blocking estimator
M <- 1                 # number of matches for the matching estimator
beta_t <- c(1, 0.1)    # intercept and slope of the treated potential outcome
beta_c <- c(0, 0.05)   # intercept and slope of the control potential outcome

Generate potential outcomes

outcome <-
  tibble::tibble(
    x = rnorm(N),
    y_0 = beta_c[1] + beta_c[2] * x + rnorm(N),
    y_1 = beta_t[1] + beta_t[2] * x + rnorm(N)
  )
head(outcome)
## # A tibble: 6 x 3
##        x     y_0     y_1
##    <dbl>   <dbl>   <dbl>
## 1 -0.626  1.10    0.0512
## 2  0.184  1.12   -0.904 
## 3 -0.836 -0.913   2.54  
## 4  1.60   0.290   1.68  
## 5  0.330  0.0859  0.977 
## 6 -0.820 -1.70    1.61

Check the estimand

tau_fs <- 
  outcome %>%
  dplyr::summarise(tau = mean(y_1 - y_0)) %>%
  dplyr::pull(tau)
tau_fs
## [1] 1.030989

Assign treatment

outcome <-
  outcome %>%
  dplyr::mutate(
    e = exp(x) / (1 + exp(x)),  # true propensity score
    w = (runif(N) < e)          # treatment assignment
    )
head(outcome)
## # A tibble: 6 x 5
##        x     y_0     y_1     e w    
##    <dbl>   <dbl>   <dbl> <dbl> <lgl>
## 1 -0.626  1.10    0.0512 0.348 FALSE
## 2  0.184  1.12   -0.904  0.546 FALSE
## 3 -0.836 -0.913   2.54   0.302 FALSE
## 4  1.60   0.290   1.68   0.831 TRUE 
## 5  0.330  0.0859  0.977  0.582 FALSE
## 6 -0.820 -1.70    1.61   0.306 TRUE

Check the balance in the covariate
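
  • A minimal sketch of such a check compares the covariate distribution by treatment status; because the assignment probability increases in \(x\), the treated group should have a higher mean of \(x\):
outcome %>%
  dplyr::group_by(w) %>%
  dplyr::summarise(
    mean_x = mean(x),
    sd_x = sd(x),
    n = dplyr::n()
  )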

Generate observed outcome and assignment

outcome_observed <-
  outcome %>%
  dplyr::mutate(y = y_0 * (1 - w) + y_1 * w) %>%  # observed outcome Y = W Y(1) + (1 - W) Y(0)
  dplyr::select(y, w, x)
head(outcome_observed)
## # A tibble: 6 x 3
##         y w          x
##     <dbl> <lgl>  <dbl>
## 1  1.10   FALSE -0.626
## 2  1.12   FALSE  0.184
## 3 -0.913  FALSE -0.836
## 4  1.68   TRUE   1.60 
## 5  0.0859 FALSE  0.330
## 6  1.61   TRUE  -0.820

Impute by a linear regression

result <-
  lm(
    data = outcome_observed,
    formula = y ~ w + w:x 
  )
result %>%
  modelsummary(fmt = 6)
              Model 1
(Intercept)   -0.017528
              (0.049717)
wTRUE          1.103605
              (0.073258)
wFALSE × x     0.088752
              (0.047710)
wTRUE × x      0.068441
              (0.052452)
Num.Obs.       1000
R2             0.238
R2 Adj.        0.236
AIC            2949.7
BIC            2974.3
Log.Lik.      -1469.856
F              103.888
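
  • The table reports the fitted coefficients; one way to turn them into a point estimate of the average treatment effect (a sketch in the spirit of the imputation estimator defined earlier) is to predict both potential outcomes for every unit and average the difference:
outcome_observed %>%
  dplyr::mutate(
    y_1_hat = predict(result, newdata = dplyr::mutate(outcome_observed, w = TRUE)),  # imputed Y(1)
    y_0_hat = predict(result, newdata = dplyr::mutate(outcome_observed, w = FALSE))  # imputed Y(0)
  ) %>%
  dplyr::summarise(tau = mean(y_1_hat - y_0_hat))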

Impute by a linear regression

  • If misspecified:
result <-
  c(
    y ~ w + w:exp(x),
    y ~ w + w:x^2,       # without I(), x^2 in a formula expands to x, so this is y ~ w + w:x
    y ~ w + w:(x + x^2)  # likewise reduces to y ~ w + w:x
  ) %>%
  purrr::map(
    .,
    ~ lm(
      data = outcome_observed,
      formula = .
      )
  )

Impute by a linear regression

result %>%
  modelsummary(fmt = 6)
                   Model 1      Model 2      Model 3
(Intercept)       -0.125049    -0.017528    -0.017528
                  (0.057698)   (0.049717)   (0.049717)
wTRUE              1.179050     1.103605     1.103605
                  (0.084518)   (0.073258)   (0.073258)
wFALSE × exp(x)    0.066445
                  (0.032783)
wTRUE × exp(x)     0.026059
                  (0.016001)
wFALSE × x                      0.088752     0.088752
                               (0.047710)   (0.047710)
wTRUE × x                       0.068441     0.068441
                               (0.052452)   (0.052452)
Num.Obs.           1000         1000         1000
R2                 0.240        0.238        0.238
R2 Adj.            0.237        0.236        0.236
AIC                2948.1       2949.7       2949.7
BIC                2972.7       2974.3       2974.3
Log.Lik.          -1469.059    -1469.856    -1469.856
F                  104.583      103.888      103.888

Weighting estimator

  • Estimate the propensity score by a logistic function (correctly specified):
propensity_score <-
  glm(
    data = outcome_observed,
    formula = w ~ x,
    family = binomial("logit")
  )
propensity_score %>%
  modelsummary(fmt = 6)
              Model 1
(Intercept)   -0.139605
              (0.069562)
x              0.951090
              (0.080246)
Num.Obs.       1000
AIC            1204.7
BIC            1214.5
Log.Lik.      -600.337

Weighting estimator

outcome_observed <-
  outcome_observed %>%
  dplyr::mutate(e = predict(propensity_score, type = "response"))
outcome_observed %>%
  ggplot(
    aes(
      x = x,
      y = e
    )
  ) +
  geom_line() +
  theme_classic()

Weighting estimator

(Figure: the estimated propensity score \(\hat{e}(x)\) plotted against the covariate \(x\).)

Weighting estimator

outcome_observed %>%
  # inverse probability weights 1 / e(X) for treated and 1 / (1 - e(X)) for controls, divided by N
  dplyr::mutate(weight = (w / e + (1 - w) / (1 - e)) / length(w)) %>%
  dplyr::summarise(y = sum(w * y * weight) - sum((1 - w) * y * weight))
## # A tibble: 1 x 1
##       y
##   <dbl>
## 1  1.08
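
  • A commonly used variant (not shown above) normalizes the inverse probability weights within each arm so that they sum to one:
outcome_observed %>%
  dplyr::summarise(
    y = sum(w * y / e) / sum(w / e) -
      sum((1 - w) * y / (1 - e)) / sum((1 - w) / (1 - e))
  )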

Blocking estimator

blocking_estimator <-
  outcome_observed %>%
  dplyr::mutate(j = dplyr::ntile(e, J)) %>%  # assign each unit to one of J equal-sized propensity-score blocks
  dplyr::group_by(j) %>%
  dplyr::summarise(
    N_j = length(y),                         # block size
    y = sum(y * w) / sum(w) - sum(y * (1 - w)) / sum(1 - w)  # within-block difference in means
    ) %>%
  dplyr::ungroup()
blocking_estimator
## # A tibble: 4 x 3
##       j   N_j     y
##   <int> <int> <dbl>
## 1     1   250  1.09
## 2     2   250  1.21
## 3     3   250  1.00
## 4     4   250  1.17

Blocking estimator

blocking_estimator %>%
  dplyr::summarise(y = sum(y * N_j) / sum(N_j))
## # A tibble: 1 x 1
##       y
##   <dbl>
## 1  1.12

Matching estimator

matched <-
  MatchIt::matchit(
    data = outcome_observed,
    formula = w ~ x,
    method = "nearest",
    distance = "glm",
    ratio = M
  )
matched
## A matchit object
##  - method: 1:1 nearest neighbor matching without replacement
##  - distance: Propensity score
##              - estimated with logistic regression
##  - number of obs.: 1000 (original), 938 (matched)
##  - target estimand: ATT
##  - covariates: x
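
  • Note that this 1:1 nearest-neighbor matching targets the average effect on the treated (ATT), as the printed object indicates, whereas the estimator defined earlier averages over all units. Before using the matched sample, one would typically also inspect covariate balance, for example with MatchIt's summary method:
summary(matched)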

Matching estimator

matched_observed <-
  MatchIt::match.data(matched) 
matched_observed
## # A tibble: 938 x 7
##          y w          x     e distance weights subclass
##      <dbl> <lgl>  <dbl> <dbl>    <dbl>   <dbl> <fct>   
##  1  1.10   FALSE -0.626 0.324    0.324       1 424     
##  2  1.12   FALSE  0.184 0.509    0.509       1 59      
##  3 -0.913  FALSE -0.836 0.282    0.282       1 363     
##  4  1.68   TRUE   1.60  0.799    0.799       1 158     
##  5  0.0859 FALSE  0.330 0.543    0.543       1 39      
##  6  1.61   TRUE  -0.820 0.285    0.285       1 265     
##  7  1.10   TRUE   0.487 0.580    0.580       1 314     
##  8 -0.236  TRUE   0.738 0.637    0.637       1 366     
##  9 -1.07   TRUE   0.576 0.601    0.601       1 412     
## 10  0.983  FALSE -0.305 0.394    0.394       1 113     
## # ... with 928 more rows

Matching estimator

matched_observed %>%
  dplyr::group_by(subclass) %>%
  # within each matched pair, take the treated outcome minus the matched control outcome
  dplyr::summarise(y = sum(y * w) - sum(y * (1 - w))) %>%
  dplyr::ungroup() %>%
  dplyr::summarise(y = mean(y))
## # A tibble: 1 x 1
##       y
##   <dbl>
## 1  1.15

Reference

  • Guido W. Imbens and Donald B. Rubin (2015), Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press, Chapters 12 and 13.