This assumption is not testable, because it compares the distribution of missing outcomes \(Y_i^\ast(1 - Z_i)\) with the distribution of observed outcomes \(Y_i^\ast(Z_i)\).
Therefore, we must rely on background knowledge to justify the assumption: for example, caseworkers assign job training programs according to a known index of observed characteristics.
Balancing covariate distributions
In observational studies, the observed covariates can be substantially unbalanced between treated and control units.
Under a regular assignment mechanism, we can remove all biases in comparisons between treated and control units as long as we adjust for differences in observed covariates.
Therefore, the main focus of analysis under a regular assignment mechanism is how to adjust for differences in the covariate distributions.
Balancing score
It is easier to balance the covariate distributions if we can reduce the dimensionality of the covariates while maintaining unconfoundedness.
Definition: Balancing score
A balancing score \(b(w)\) is a function of the covariates such that: \[
Z_i \perp \!\!\! \perp W_i | b(W_i).
\]
There can be multiple balancing scores, including \(W_i\) itself.
Propensity score is a balancing score
We can show that the propensity score is a balancing score.
The previous lemma implies that, to remove the biases associated with differences in the covariates, it is sufficient to adjust for differences in the balancing scores.
Because the propensity score is a balancing score, adjusting for differences in the propensity scores is sufficient.
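The balancing property can be illustrated with a small simulation (a sketch with simulated data, not part of the original text): after stratifying on the propensity score, the covariate imbalance between treated and control units largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
W = rng.normal(size=n)               # a single observed covariate
e = 1 / (1 + np.exp(-W))             # propensity score e(W)
Z = rng.random(n) < e                # treatment assignment

# Overall, the covariate is clearly unbalanced across the two groups.
overall_gap = abs(W[Z].mean() - W[~Z].mean())

# Within narrow strata of e(W), the imbalance (almost) vanishes.
edges = np.quantile(e, np.linspace(0, 1, 21))   # 20 equal-size strata
strata = np.digitize(e, edges[1:-1])
gaps = [abs(W[(strata == s) & Z].mean() - W[(strata == s) & ~Z].mean())
        for s in range(20)]
print(f"overall gap {overall_gap:.2f}, max within-stratum gap {max(gaps):.2f}")
```

The residual within-stratum imbalance shrinks further as the strata become finer, consistent with exact balance conditional on \(e(W_i)\).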
Propensity score is the coarsest balancing score
Lemma: Coarseness of balancing score
The propensity score is the coarsest balancing score. That is, the propensity score is a function of every balancing score.
Suppose not. Then, there is a balancing score \(b(w)\) and values \(w, w'\) such that \(b(w) = b(w')\) but \(e(w) \neq e(w')\).
Then, conditional on \(b(W_i) = b(w) = b(w')\), the assignment probability still differs between \(W_i = w\) and \(W_i = w'\), so \(Z_i\) and \(W_i\) are not independent given \(b(W_i)\), contradicting the assumption that \(b\) is a balancing score.
Estimation and inference
Estimands
The super-population average treatment effect: \[
\tau_{sp} \equiv \mathbb{E}[Y_i^\ast(1) - Y_i^\ast(0)] = \mathbb{E}[\tau_{sp}(W_i)],
\] where \(\tau_{sp}(w)\) is: \[
\tau_{sp}(w) \equiv \mu_1(w) - \mu_0(w) = \mathbb{E}[Y_i^\ast(1) - Y_i^\ast(0) | W_i = w].
\]
Parameters
The finite-sample average treatment effect: \[
\tau_{fs} \equiv \frac{1}{N} \sum_{i = 1}^N[Y_i^\ast(1) - Y_i^\ast(0)].
\]
It is sometimes useful to estimate the average treatment effect conditional on the realized values of the pretreatment variables in the finite sample: \[
\tau_{cond} \equiv \frac{1}{N} \sum_{i = 1}^N \tau_{sp}(W_i).
\]
Efficiency bound
The asymptotic sampling variance of any estimator of \(\tau_{sp}\) cannot be smaller than: \[
\mathbb{V}_{sp}^{eff} \equiv \mathbb{E}\Big[ \frac{\sigma_0^2(W_i)}{1 - e(W_i)} + \frac{\sigma_1^2(W_i)}{e(W_i)} + [\tau_{sp}(W_i) - \tau_{sp}]^2 \Big].
\]
The first two terms get bigger as the propensity scores approach zero or one, that is, as the covariate distributions become more unbalanced.
The third term gets bigger as the treatment effects become more heterogeneous across values of the covariates.
Efficiency bound
The asymptotic sampling variance of any estimator of \(\tau_{cond}\) cannot be smaller than: \[
\mathbb{V}_{cond}^{eff} \equiv \mathbb{E}\Big[ \frac{\sigma_0^2(W_i)}{1 - e(W_i)} + \frac{\sigma_1^2(W_i)}{e(W_i)}\Big].
\]
The first two terms again get bigger as the propensity scores approach zero or one.
The term \([\tau_{sp}(W_i) - \tau_{sp}]^2\) does not show up here, because \(\tau_{cond}\) conditions on the realized covariate values, so the heterogeneity of \(\tau_{sp}(W_i)\) across the covariate distribution contributes no sampling variance.
Estimation strategies
Model-based imputation: Imputing the missing potential outcomes by a model such as a linear regression function.
Weighting estimator: Estimating the propensity score and then balancing the covariate distributions by using the inverse of the estimated propensity score as a weight in estimating the treatment effect.
Blocking estimator: Estimating the propensity score, stratifying the sample into blocks according to its value, and averaging within-block estimates of the treatment effect.
Matching estimator: Finding, for each unit, units with a similar set of covariates in the opposite treatment group.
Doubly-robust estimator: Combining model-based imputation and weighting estimation.
Model-based imputation
Specify the conditional outcome: \[
\mathbb{E}[Y_i^\ast(1)|W_i], \mathbb{E}[Y_i^\ast(0)|W_i].
\]
It is common in economics to assume a linear model: \[
\mathbb{E}[Y_i^\ast(l)|W_i] = W_i \beta_l, l = 0, 1.
\]
Model-based imputation
Then, estimate the linear model: \[
Y_i^{obs} = Z_i \cdot W_i \beta_1 + (1 - Z_i) \cdot W_i \beta_0 + \epsilon_i,
\] and estimate the super-population or finite-sample average treatment effect by: \[
\hat{\tau}_{ols} = \frac{1}{N} \sum_{i = 1}^N[Z_i \cdot (Y_i^{obs} - W_i \hat{\beta}_{0, ols}) + (1 - Z_i) \cdot (W_i \hat{\beta}_{1, ols} - Y_i^{obs})].
\]
The estimator is unreliable when the covariates are unbalanced, because it extrapolates the fitted linear models across groups: \[
\overline{Y}_1^{obs} - \overline{W}_1 \hat{\beta}_{0, ols} = \overline{Y}_1^{obs} - \overline{Y}_0^{obs} - (\overline{W}_1 - \overline{W}_0) \hat{\beta}_{0, ols},
\] \[
\overline{W}_0 \hat{\beta}_{1, ols} - \overline{Y}_0^{obs} = \overline{Y}_1^{obs} - \overline{Y}_0^{obs} - (\overline{W}_1 - \overline{W}_0) \hat{\beta}_{1, ols}.
\]
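As a concrete sketch of \(\hat{\tau}_{ols}\) (simulated data; the data-generating process below is an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
X = rng.normal(size=n)
W = np.column_stack([np.ones(n), X])          # intercept plus one covariate
e = 1 / (1 + np.exp(-X))                      # propensity depends on the covariate
Z = rng.random(n) < e
beta0 = np.array([1.0, 1.0])                  # control outcome coefficients
beta1 = np.array([3.0, 2.0])                  # treated outcome coefficients
Y = np.where(Z, W @ beta1, W @ beta0) + rng.normal(size=n)  # true ATE = 2

# Fit the two linear models on the corresponding subsamples ...
b0 = np.linalg.lstsq(W[~Z], Y[~Z], rcond=None)[0]
b1 = np.linalg.lstsq(W[Z], Y[Z], rcond=None)[0]
# ... then impute each unit's missing potential outcome and average.
tau_ols = np.mean(Z * (Y - W @ b0) + (1 - Z) * (W @ b1 - Y))
```

Here the linear model is correctly specified, so the estimator recovers the true effect; the extrapolation problem above arises when it is not and \(\overline{W}_1 - \overline{W}_0\) is large.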
This is again imprecise when the covariate distribution is unbalanced, i.e., when either \(\hat{e}(W_i)\) or \(1 - \hat{e}(W_i)\) is close to zero.
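The weighting estimator is not written out in this section; a standard inverse-propensity (Horvitz-Thompson) form, sketched here with simulated data and the true propensity score standing in for \(\hat{e}\), illustrates both the estimator and why extreme scores make it imprecise:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
W = rng.normal(size=n)
e_hat = 1 / (1 + np.exp(-W))           # true score stands in for the estimate
Z = rng.random(n) < e_hat
Y = 2.0 * Z + W + rng.normal(size=n)   # constant treatment effect tau = 2

# Inverse-propensity weighting: each observed outcome stands in for all
# units with similar covariates that received the other treatment.
tau_ipw = np.mean(Z * Y / e_hat - (1 - Z) * Y / (1 - e_hat))

# The weights 1 / e-hat blow up when e-hat is near zero, inflating variance.
max_weight = (1 / e_hat[Z]).max()
```

With moderate scores the estimator is accurate, but a handful of units with \(\hat{e}(W_i)\) near zero or one can dominate the sum, which motivates the blocking approach below.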
Blocking estimator
To avoid extreme values of the propensity score, one may stratify the sample according to the value of the propensity score: for \(j = 1, \cdots, J\) and \(0 = b_0 \le b_1 \le \cdots \le b_J = 1\), let \(B_i(j)\) be the binary indicator of \(b_{j - 1} < \hat{e}(W_i) \le b_j\).
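A minimal sketch of the blocking estimator under this stratification, assuming simulated data, the true propensity score as \(\hat{e}(W_i)\), and quantile block boundaries:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
W = rng.normal(size=n)
e_hat = 1 / (1 + np.exp(-W))            # true score stands in for the estimate
Z = rng.random(n) < e_hat
Y = 2.0 * Z + W + rng.normal(size=n)    # constant treatment effect tau = 2

J = 5
edges = np.quantile(e_hat, np.linspace(0, 1, J + 1))  # b_0 <= ... <= b_J
blocks = np.digitize(e_hat, edges[1:-1])              # block index of unit i
tau_j = np.array([Y[(blocks == j) & Z].mean() - Y[(blocks == j) & ~Z].mean()
                  for j in range(J)])                 # within-block contrasts
share = np.array([(blocks == j).mean() for j in range(J)])
tau_block = float(share @ tau_j)        # block-size-weighted average

tau_naive = Y[Z].mean() - Y[~Z].mean()  # unadjusted comparison, badly biased
```

Even a small number of blocks removes most of the bias of the naive comparison; finer blocks remove more.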
Matching estimator
Define some metric on the space of covariates and let \(M(i) \subset \{1, \cdots, N\}\) be the set of the \(M\) nearest units to \(i\) in the opposite treatment group.
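A minimal sketch of one-to-one matching (\(M = 1\)) with the absolute difference in a scalar covariate as the metric, on simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4_000
W = rng.normal(size=n)
Z = rng.random(n) < 1 / (1 + np.exp(-W))
Y = 2.0 * Z + W + rng.normal(size=n)     # constant treatment effect tau = 2

treated = np.where(Z)[0]
control = np.where(~Z)[0]

def nearest(src, pool):
    # Index in `pool` of the nearest unit (in |W| distance) to each `src` unit.
    return pool[np.abs(W[src][:, None] - W[pool][None, :]).argmin(axis=1)]

# Impute each unit's missing potential outcome from its match (M = 1,
# matching with replacement).
imputed = np.empty(n)
imputed[treated] = Y[nearest(treated, control)]   # imputed Y*(0) for treated
imputed[control] = Y[nearest(control, treated)]   # imputed Y*(1) for controls
tau_match = np.mean(np.where(Z, Y - imputed, imputed - Y))
```

With a scalar covariate the matching bias is negligible; in higher dimensions the match quality degrades, which is one motivation for matching on the propensity score instead.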
Doubly-robust estimator
A model-based imputation requires a correctly specified model of the conditional outcome: \[
\mathbb{E}[Y_i^\ast(l)|W_i = w], l = 0, 1.
\]
A weighting estimator requires a correctly specified model of the propensity score: \[
\mathbb{E}\{Z_i|W_i = w\}.
\]
A doubly-robust estimator is consistent if either of them is correctly specified.
Doubly-robust estimator
Estimate \(\mathbb{E}\{Y_i^\ast(1)\}\) by \[
\frac{1}{N}\sum_{i = 1}^N \hat{g}_1(W_i) + \frac{1}{N} \sum_{i = 1}^N \frac{Z_i \hat{\epsilon}_i}{\hat{e}(W_i)},
\] where \(\hat{g}_1\) is some estimator of \(\mathbb{E}[Y_i^\ast(1)|W_i = w]\), \(\hat{e}\) is some estimator of \(\Pr\{Z_i = 1|W_i = w\}\), and \[
\hat{\epsilon}_i = Y_i - \hat{g}_1(W_i),
\] with \(\hat{g}_1 \to g_1\) and \(\hat{e} \to e\) in probability.
If the conditional outcome is correctly specified, \(g_1(w) = \mathbb{E}[Y_i^\ast(1)|W_i = w]\).
If the propensity score is correctly specified, \(e(w) = \Pr\{Z_i = 1|W_i = w\}\).
Doubly-robust estimator
If the conditional outcome is correctly specified: \[
\begin{split}
\frac{1}{N} \sum_{i = 1}^N \frac{Z_i \hat{\epsilon}_i}{\hat{e}(W_i)}
&\xrightarrow{p} \mathbb{E}\left[\frac{Z_i[Y_i - g_1(W_i)]}{e(W_i)}\right]\\
&= \mathbb{E}\left[\frac{Z_i[Y_i - \mathbb{E}[Y_i^\ast(1)|W_i]]}{e(W_i)}\right]\\
&= \mathbb{E}\left[\frac{Z_i[\mathbb{E}[Y_i|Z_i = 1, W_i] - \mathbb{E}[Y_i^\ast(1)|W_i]]}{e(W_i)}\right]\\
&= 0,
\end{split}
\] regardless of whether \(e\) is correctly specified.
So, the estimator converges to \(\mathbb{E}[\mathbb{E}[Y_i^\ast(1)|W_i]] = \mathbb{E}[Y_i^\ast(1)]\).
Doubly-robust estimator
If the propensity score is correctly specified, the estimator converges to: \[
\begin{split}
&\mathbb{E}\left[g_1(W_i) + \frac{Z_iY_i - Z_i g_1(W_i)}{e(W_i)} \right] \\
&= \mathbb{E}\left[g_1(W_i) + \frac{Z_iY_i - Z_i g_1(W_i)}{\mathbb{E}\{Z_i|W_i\}} \right]\\
&= \mathbb{E}\left[\frac{Z_iY_i}{\mathbb{E}\{Z_i|W_i\}} \right] + \mathbb{E}\left[\mathbb{E}\left[g_1(W_i) - \frac{Z_i g_1(W_i)}{\mathbb{E}\{Z_i|W_i\}}\Big|W_i\right] \right]\\
&= \mathbb{E}\left[\frac{Z_iY_i}{\mathbb{E}\{Z_i|W_i\}} \right]\\
&= \mathbb{E}[Y_i^\ast(1)].
\end{split}
\]
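The double robustness shown in the last two derivations can be checked numerically; the sketch below (simulated data, with deliberately misspecified models) estimates \(\mathbb{E}[Y_i^\ast(1)]\) with one of the two models broken at a time:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
W = rng.normal(size=n)
e = 1 / (1 + np.exp(-W))                   # true propensity score
Z = rng.random(n) < e
Y1 = 1.0 + W + W**2 + rng.normal(size=n)   # E[Y*(1)] = 1 + 0 + 1 = 2
Y = np.where(Z, Y1, 0.0)                   # only treated outcomes enter below

def aipw(g1, e_hat):
    # (1/N) sum [ g1-hat(W_i) + Z_i * (Y_i - g1-hat(W_i)) / e-hat(W_i) ]
    return np.mean(g1 + Z * (Y - g1) / e_hat)

g_true = 1.0 + W + W**2          # correctly specified outcome model
g_bad  = np.zeros(n)             # badly misspecified outcome model
e_bad  = np.full(n, 0.5)         # badly misspecified propensity score

est_bad_e = aipw(g_true, e_bad)  # outcome model correct: still near 2
est_bad_g = aipw(g_bad, e)       # propensity score correct: still near 2
```

Each estimate remains close to \(\mathbb{E}[Y_i^\ast(1)] = 2\) even though one of the two nuisance models is wrong, exactly as the two derivations predict.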
Assessing unconfoundedness
Seeking supporting evidence for unconfoundedness
The unconfoundedness assumption is not testable.
Instead of testing it, we estimate pseudo-causal estimands whose values are known a priori (typically zero), under assumptions more restrictive than unconfoundedness.
If these analyses suggest that the null hypothesis does not hold, then the unconfoundedness assumption will be viewed as less plausible.
In that case, we should conclude that it is not possible to estimate the causal effects of interest credibly and precisely.
Estimating effects on pseudo-outcomes
One approach is to test the pseudo-outcome unconfoundedness: \[
Z_i \perp \!\!\! \perp W_i^p | W_i^r,
\] for a partition of the covariates \(W_i = (W_i^p, W_i^r)\).
Testing this property makes sense if a stronger version of unconfoundedness, subset unconfoundedness, is supposed to hold: \[
Z_i \perp \!\!\! \perp (Y_i^\ast(1), Y_i^\ast(0)) | W_i^r,
\] and \(W_i^p\) is considered a proxy for the potential outcomes, such as a lagged outcome.
This is a design-stage method because it does not require observing the outcomes.
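A sketch of this design-stage check on simulated data; the linear adjustment for \(W_i^r\) is an assumption of this example:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
Wr = rng.normal(size=n)                        # retained covariates W^r
Wp = Wr + rng.normal(size=n)                   # pseudo-outcome W^p (lagged outcome)

def effect_on_pseudo_outcome(Z):
    # "Effect" of Z on W^p, adjusting linearly for W^r.
    X = np.column_stack([np.ones(n), Z, Wr])
    return np.linalg.lstsq(X, Wp, rcond=None)[0][1]

Z_ok = rng.random(n) < 1 / (1 + np.exp(-Wr))   # assignment driven by W^r only
Z_bad = rng.random(n) < 1 / (1 + np.exp(-Wp))  # assignment leaks through W^p

clean = effect_on_pseudo_outcome(Z_ok)         # consistent with zero
dirty = effect_on_pseudo_outcome(Z_bad)        # clearly nonzero
```

A pseudo-effect indistinguishable from zero supports (but cannot prove) unconfoundedness; a clearly nonzero pseudo-effect, as in the second scenario, undermines it.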
Estimating effects of pseudo-treatments
Suppose that the assignment was such that there were one treatment group and two control groups: \[
Z_i =
\begin{cases}
1 &\text{ if } G_i = t\\
0 &\text{ if } G_i = c_1, c_2.
\end{cases}
\]
Moreover, suppose that a stronger version of unconfoundedness, group unconfoundedness, is supposed to hold: \[
G_i \perp \!\!\! \perp (Y_i^\ast(0), Y_i^\ast(1)) | W_i,
\] which has a testable implication: \[
G_i \perp \!\!\! \perp Y_i^\ast(0) | W_i, G_i \in \{c_1, c_2\} \Leftrightarrow
G_i \perp \!\!\! \perp Y_i^{obs} | W_i, G_i \in \{c_1, c_2\}.
\]
Robustness to the pretreatment variables
Again assume subset unconfoundedness: \[
Z_i \perp \!\!\! \perp (Y_i^\ast(1), Y_i^\ast(0)) | W_i^r,
\] for a partition of the covariates \(W_i = (W_i^p, W_i^r)\).
Then, the estimated treatment effect should be robust to the inclusion of \(W_i^p\).
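A sketch of this robustness check on simulated data, using linear regression adjustment (an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50_000
Wr = rng.normal(size=n)
Wp = Wr + rng.normal(size=n)                  # proxy covariate (lagged outcome)
Z = rng.random(n) < 1 / (1 + np.exp(-Wr))     # assignment depends on W^r only
Y = 2.0 * Z + Wr + rng.normal(size=n)         # constant treatment effect tau = 2

def ols_tau(*covs):
    # Coefficient on Z from a linear regression of Y on a constant, Z, covariates.
    X = np.column_stack([np.ones(n), Z, *covs])
    return np.linalg.lstsq(X, Y, rcond=None)[0][1]

tau_r = ols_tau(Wr)          # adjust for W^r only
tau_full = ols_tau(Wr, Wp)   # additionally include the proxy W^p
```

Under subset unconfoundedness the two estimates agree; a substantial shift when \(W_i^p\) is added would cast doubt on the identification strategy.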
Reference
Chapters 12 and 13, Guido W. Imbens and Donald B. Rubin, 2015, Causal Inference for Statistics, Social, and Biomedical Sciences, Cambridge University Press.