Incorporating covariates

  • A researcher may wish to incorporate covariates into the analysis if they are recorded for each unit.
  • If covariates are observed prior to the randomization, one should modify the design of the experiment and carry out a stratified randomized experiment.
  • Moreover, if one has conducted a completely randomized experiment in which the randomization did not take the covariates into account, one can simply ignore the covariates and use the difference in means to estimate the average treatment effect.

The role of ex-post covariate adjustment

  • Consider a case where a completely randomized experiment did not take the covariates into account at the design stage.
  • Ex-post covariate adjustment can still play several roles.
  • If the covariates are sufficiently strongly correlated with the potential outcomes, adjusting for them can make the estimates more precise.
  • The covariates allow assessment of heterogeneity in treatment effects.
  • If the randomization is compromised because of practical problems, adjusting for covariate differences may remove the biases.

Regression methods and the randomization-based approach

  • We often use regression methods to estimate treatment effects, especially to control for covariates.
  • However, inference for regression methods is based on a sampling-based approach.
  • Moreover, the assumptions underlying regression methods do not immediately follow from randomization.
  • Therefore, regression methods often go beyond analyses that are justified by randomization and end up with analyses that rely on a mix of randomization assumptions, modelling assumptions, and large sample approximations.

Mixing sampling-based and randomization-based approach

  • We assume that the sample of size \(N\) is a simple random sample drawn from an infinite super-population.
  • This random sampling induces a distribution on the pairs of potential outcomes and covariates.
  • Then the treatment status is randomly assigned and the observed outcome is determined.
  • The estimand is the super-population average treatment effect: \[ \tau_{sp} \equiv \mathbb{E}[Y_i(1) - Y_i(0)], \] which is the expected value of the unit-level treatment effects under the distribution induced by sampling from the super-population.
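  • For comparison, the previous lecture's finite-sample estimand was \(\tau_{fs} \equiv \frac{1}{N} \sum_{i = 1}^N [Y_i(1) - Y_i(0)]\); under random sampling from the super-population, \(\mathbb{E}[\tau_{fs}] = \tau_{sp}\).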

Notation

  • For each covariate value \(X_i = x\): \[ \mu_c(x) \equiv \mathbb{E}[Y_i(0)|X_i = x], \quad \mu_t(x) \equiv \mathbb{E}[Y_i(1) | X_i = x], \] \[ \sigma_c^2(x) \equiv \mathbb{V}[Y_i(0)|X_i = x], \quad \sigma_t^2(x) \equiv \mathbb{V}[Y_i(1)|X_i = x], \] \[ \tau(x) \equiv \mathbb{E}[Y_i(1) - Y_i(0)| X_i = x], \quad \sigma_{ct}^2(x) \equiv \mathbb{V}[Y_i(1) - Y_i(0)| X_i = x]. \]
  • Marginally: \[ \mu_c \equiv \mathbb{E}[Y_i(0)] = \mathbb{E}[\mu_c(X_i)], \quad \mu_t \equiv \mathbb{E}[Y_i(1)] = \mathbb{E}[\mu_t(X_i)], \] \[ \sigma_c^2 \equiv \mathbb{V}[Y_i(0)] = \mathbb{E}[\sigma_c^2(X_i)] + \mathbb{V}[\mu_c(X_i)], \] \[ \sigma_t^2 \equiv \mathbb{V}[Y_i(1)] = \mathbb{E}[\sigma_t^2(X_i)] + \mathbb{V}[\mu_t(X_i)]. \]

Notation

  • For the covariates: \[ \mu_X \equiv \mathbb{E}[X_i], \Omega_X \equiv \mathbb{V}(X_i) = \mathbb{E}[(X_i - \mu_X)' (X_i - \mu_X)]. \]

Linear regression with no covariate

Define a linear regression function

  • Specify a linear regression function for the observed outcome \(Y_i^{obs}\): \[ Y_i^{obs} \equiv \alpha + \tau \cdot W_i + \epsilon_i. \]
  • This linear regression function is correctly specified without loss of generality, because we can always set: \[ \alpha = \mu_{c}, \quad \tau = \tau_{sp}, \quad \epsilon_i = Y_i(0) - \mu_c + W_i \cdot [Y_i(1) - Y_i(0) - \tau_{sp}]. \]
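  • As a check, these choices reproduce the potential outcomes exactly: for \(W_i = 0\), \(\alpha + \epsilon_i = \mu_c + [Y_i(0) - \mu_c] = Y_i(0)\); for \(W_i = 1\), \(\alpha + \tau + \epsilon_i = \mu_c + \tau_{sp} + [Y_i(1) - \mu_c - \tau_{sp}] = Y_i(1)\).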

Define ols estimator of \(\tau\)

  • The ordinary least squares (ols) estimator for \(\tau\) is defined as: \[ (\hat{\tau}_{ols}, \hat{\alpha}_{ols}) \equiv \text{argmin}_{\tau, \alpha} \sum_{i = 1}^N (Y_i^{obs} - \alpha - \tau \cdot W_i)^2, \] and is solved as: \[ \hat{\tau}_{ols} = \frac{\sum_{i = 1}^N(W_i - \overline{W}) (Y_i^{obs} - \overline{Y}^{obs})}{\sum_{i = 1}^N (W_i - \overline{W})^2}, \hat{\alpha}_{ols} = \overline{Y}^{obs} - \hat{\tau}_{ols} \cdot \overline{W}, \] where: \[ \overline{Y}^{obs} \equiv \frac{1}{N} \sum_{i = 1}^N Y_i^{obs}, \overline{W} = \frac{1}{N} \sum_{i = 1}^N W_i. \]

Ols estimator as Neyman’s estimator

  • Simple algebra (spelled out below) shows: \[ \hat{\tau}_{ols} = \overline{Y}_t^{obs} - \overline{Y}_c^{obs} \equiv \hat{\tau}_{fs}. \]
  • Thus, the ols estimator is equivalent to Neyman’s estimator for the finite-sample average treatment effect.
  • In the previous lecture, we showed that \(\hat{\tau}_{fs}\) is unbiased for \(\tau_{sp}\) with random assignment and random sampling.
  • This shows that \(\hat{\tau}_{ols}\) is unbiased for \(\tau_{sp}\).
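  • Verification of the first claim: with binary \(W_i\), \[ \sum_{i = 1}^N (W_i - \overline{W})^2 = N \overline{W} (1 - \overline{W}), \quad \sum_{i = 1}^N (W_i - \overline{W})(Y_i^{obs} - \overline{Y}^{obs}) = N \overline{W} (1 - \overline{W}) \cdot (\overline{Y}_t^{obs} - \overline{Y}_c^{obs}), \] so the ratio defining \(\hat{\tau}_{ols}\) reduces to the difference in means.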

Is the conditional mean zero assumption justified in the current setting?

  • We did not use the standard argument for the unbiasedness of the ols estimator that relies on the conditional mean zero assumption.
  • Can we justify this assumption from the random assignment and random sampling?
  • Specifying the linear regression model is equivalent to defining \(\epsilon_i\) as follows: \[ \epsilon_i \equiv Y_i(0) - \mu_c + W_i \cdot [Y_i(1) - Y_i(0) - \tau_{sp}]. \]
  • In other words, can we derive \(\mathbb{E}(\epsilon_i | W_i = w) = 0\) for \(w = 0, 1\) from random sampling and random assignment?

Showing conditional mean zero

  • Random sampling allows us to view the potential outcomes as random variables.
  • Moreover, the random assignment implies: \[ \mathbb{P}[W_i = 1|Y_i(0), Y_i(1)] = \mathbb{P}(W_i = 1). \]
  • Therefore: \[ \mathbb{E}[\epsilon_i | W_i = 0] = \mathbb{E}[Y_i(0) - \mu_c| W_i = 0] = \mathbb{E}[Y_i(0) - \mu_c] = 0, \] and: \[ \mathbb{E}[\epsilon_i | W_i = 1] = \mathbb{E}[Y_i(1) - \mu_c - \tau_{sp}| W_i = 1] = \mathbb{E}[Y_i(1) - \mu_c - \tau_{sp}] = 0. \]

Heteroskedasticity robust sampling variance estimator of \(\hat{\tau}_{ols}\)

  • The standard robust sampling variance estimator of \(\hat{\tau}_{ols}\), allowing for heteroskedasticity across the two treatment arms, is: \[ \widehat{\mathbb{V}}_{hetero} \equiv \frac{\sum_{i = 1}^N \hat{\epsilon}_i^2 \cdot (W_i - \overline{W})^2}{[\sum_{i = 1}^N (W_i - \overline{W})^2]^2}, \] where: \[ \hat{\epsilon}_i \equiv Y_i^{obs} - \hat{\alpha}_{ols} - \hat{\tau}_{ols} \cdot W_i. \]

…is the same as \(\widehat{\mathbb{V}}_{Neyman}\) for \(\hat{\tau}_{fs}\)

  • Simple algebra shows: \[ \widehat{\mathbb{V}}_{hetero} = \frac{s_c^2}{N_c} + \frac{s_t^2}{N_t} = \widehat{\mathbb{V}}_{Neyman}, \] where: \[ s_c^2 \equiv \frac{1}{N_c - 1} \sum_{i: W_i = 0} (Y_i^{obs} - \overline{Y}_c^{obs})^2, \quad s_t^2 \equiv \frac{1}{N_t - 1} \sum_{i: W_i = 1} (Y_i^{obs} - \overline{Y}_t^{obs})^2. \]
  • Whereas \(\widehat{\mathbb{V}}_{Neyman}\) for \(\hat{\tau}_{fs}\) is justified as a conservative estimator of the variance of \(\hat{\tau}_{fs}\) due to random assignment, \(\widehat{\mathbb{V}}_{hetero}\) is justified as a consistent estimator of the variance of \(\hat{\tau}_{ols}\) under random sampling.

Linear regression with additional covariates

Define a linear regression function

  • We have \(X_i\) as additional covariates.
  • Define a linear regression function with \(X_i\) as: \[ Y_i^{obs} \equiv \alpha + \tau \cdot W_i + X_i \beta + \epsilon_i. \]
  • We do not assume that the conditional mean of \(Y_i^{obs}\) is actually linear in \(W_i\) and \(X_i\).
  • We are interested in \(\tau\) but not in the nuisance parameters \(\alpha\) and \(\beta\).

Define ols estimator for \(\tau\)

  • The ols estimator is defined using least squares: \[ (\hat{\tau}_{ols}, \hat{\alpha}_{ols}, \hat{\beta}_{ols}) \equiv \text{argmin}_{\tau, \alpha, \beta} \sum_{i = 1}^N(Y_i^{obs} - \alpha - \tau \cdot W_i - X_i \beta)^2. \]
  • Define the super-population values as: \[ (\tau^*, \alpha^*, \beta^*) \equiv \text{argmin}_{\tau, \alpha, \beta} \mathbb{E}(Y_i^{obs} - \alpha - \tau \cdot W_i - X_i \beta)^2. \]

Consistency of \(\hat{\tau}_{ols}\)

  • If \(X_i \beta\) is included, \(\hat{\tau}_{ols}\) is no longer unbiased for \(\tau_{sp}\) in finite samples.
  • In large samples, \(\hat{\tau}_{ols}\) converges to \(\tau^*\), as shown in basic econometrics.
  • Moreover, we can show \(\tau^* = \tau_{sp}\).
  • Therefore, \(\hat{\tau}_{ols}\) is consistent (asymptotically unbiased) for \(\tau_{sp}\).
  • Intuition (see the sketch after this list):
    • In large samples, the correlation between \(W_i\) and \(X_i\) is zero because of the random assignment, so including \(X_i\) does not change the limit of the coefficient on \(W_i\).
    • In finite samples, however, the realized correlation between \(W_i\) and \(X_i\) can be non-zero.
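  • A minimal sketch of this point (the helper function below is illustrative, not from the original lecture):

cor_randomized <- function(n) {
  # one completely randomized assignment of n / 2 units to treatment
  x <- rnorm(n)
  w <- 1:n %in% sample(n, n / 2)
  cor(as.numeric(w), x)
}
set.seed(2)            # arbitrary seed, for illustration
cor_randomized(10)     # small sample: correlation can be far from zero
cor_randomized(10000)  # large sample: correlation is close to zero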

Asymptotic normality of \(\hat{\tau}_{ols}\)

  • Theorem:
    • Suppose we conduct a completely randomized experiment in a sample drawn at random from an infinite population. Then:
    1. \(\tau^* = \tau_{sp}\).
    2. \(\hat{\tau}_{ols}\) is asymptotically normal centered at \(\tau_{sp}\), i.e.: \[ \sqrt{N} \cdot (\hat{\tau}_{ols} - \tau_{sp}) \xrightarrow{d} N(0, V), \] where: \[ V \equiv \frac{\mathbb{E}[(W_i - p)^2 \cdot (Y_i^{obs} - \alpha^* - \tau_{sp} \cdot W_i - X_i \cdot \beta^*)^2]}{p^2 \cdot (1 - p)^2}. \]

Heteroskedasticity robust sampling variance estimator of \(\hat{\tau}_{ols}\)

  • The estimator of the sampling variance of \(\hat{\tau}_{ols}\) is: \[ \scriptsize \widehat{\mathbb{V}}_{hetero} \equiv \frac{1}{N\cdot(N - \text{dim}(X_i) - 2)} \frac{\sum_{i = 1}^N (W_i - \overline{W})^2 \cdot (Y_i^{obs} - \hat{\alpha}_{ols} - \hat{\tau}_{ols} \cdot W_i - X_i \hat{\beta}_{ols})^2}{[\overline{W} \cdot (1 - \overline{W})]^2}. \]
  • Including \(X_i\) may reduce the sampling variance, to the degree that \(X_i\) is correlated with the potential outcomes.
  • The price paid is giving up finite-sample unbiasedness.
  • Large-sample unbiasedness (consistency) is maintained.

Linear regression with covariates and interactions

Define a linear regression function

  • We can further define a linear regression function with covariates and interactions: \[ Y_i^{obs} \equiv \alpha + \tau \cdot W_i + X_i \beta + W_i \cdot (X_i - \overline{X}) \gamma + \epsilon_i. \]
  • Again, we do not assume that this is the correct specification for the conditional mean of \(Y_i^{obs}\).
  • We are interested in \(\tau\) but not in the nuisance parameters \(\alpha\), \(\beta\), and \(\gamma\).
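  • Centering the interaction at \(\overline{X}\) (in the population, \(\mu_X\)) is what lets \(\tau\) retain its interpretation: the regression difference between the two arms at covariate value \(x\) is \(\tau + (x - \mu_X) \gamma\), which averages to \(\tau\) over the distribution of \(X_i\).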

Define ols estimator for \(\tau\)

  • Define ols estimator as: \[ \scriptsize (\hat{\tau}_{ols}, \hat{\alpha}_{ols}, \hat{\beta}_{ols}, \hat{\gamma}_{ols}) \equiv \text{argmin}_{\tau, \alpha, \beta, \gamma} \sum_{i = 1}^N [Y_i^{obs} - \alpha - \tau \cdot W_i - X_i \beta - W_i \cdot (X_i - \overline{X}) \gamma]^2. \]
  • Define \(\tau^*, \alpha^*, \beta^*\), and \(\gamma^*\) as their super-population counterparts.

Consistency of \(\hat{\tau}_{ols}\)

  • \(\hat{\tau}_{ols}\) is no longer unbiased for \(\tau_{sp}\) in finite samples.
  • In large samples, \(\hat{\tau}_{ols}\) converges to \(\tau^*\), as shown in basic econometrics.
  • Moreover, we can show \(\tau^* = \tau_{sp}\).
  • Therefore, \(\hat{\tau}_{ols}\) is consistent (asymptotically unbiased) for \(\tau_{sp}\).
  • Intuition:
    • In large samples, the correlation between \(W_i\) and \(X_i\) is zero because of the random assignment.
    • In finite samples, however, the realized correlation between \(W_i\) and \(X_i\) can be non-zero.

Asymptotic normality of \(\hat{\tau}_{ols}\)

  • Theorem:
    • Suppose we conduct a completely randomized experiment in a sample drawn at random from an infinite population. Then:
    1. \(\tau^* = \tau_{sp}\).
    2. \(\hat{\tau}_{ols}\) is asymptotically normal centered at \(\tau_{sp}\), i.e.: \[ \sqrt{N} \cdot (\hat{\tau}_{ols} - \tau_{sp}) \xrightarrow{d} N(0, V), \] where: \[ \scriptsize V \equiv \frac{\mathbb{E}[(W_i - p)^2 \cdot (Y_i^{obs} - \alpha^* - \tau_{sp} \cdot W_i - X_i \cdot \beta^* - W_i \cdot (X_i - \mu_X) \gamma^*)^2]}{p^2 \cdot (1 - p)^2}. \]

Heteroskedasticity robust sampling variance estimator of \(\hat{\tau}_{ols}\)

  • The estimator of the sampling variance of \(\hat{\tau}_{ols}\) is: \[ \tiny \widehat{\mathbb{V}}_{hetero} \equiv \frac{1}{N\cdot(N - 2 \cdot \text{dim}(X_i) - 2)} \frac{\sum_{i = 1}^N (W_i - \overline{W})^2 \cdot [ Y_i^{obs} - \hat{\alpha}_{ols} - \hat{\tau}_{ols} \cdot W_i - X_i \hat{\beta}_{ols} - W_i \cdot (X_i - \overline{X})\hat{\gamma}_{ols} ]^2}{[\overline{W} \cdot (1 - \overline{W})]^2}. \]
  • Including \(X_i\) may reduce the sampling variance, to the degree that \(X_i\) is correlated with the potential outcomes.
  • The price paid is giving up finite-sample unbiasedness.
  • Large-sample unbiasedness (consistency) is maintained.

Can we test the presence of treatment effect heterogeneity?

  • Based on the ols estimator \(\hat{\tau}_{ols}\) and its sampling variance estimator, we can test the null of \(\tau_{sp} = 0\).
  • Can we further test the following null hypothesis: \(Y_i(1) - Y_i(0) = \tau\) for some \(\tau\) and all units?

Chi-squared test statistic

  • Consider the test statistic: \[ Q \equiv \hat{\gamma}_{ols}' \widehat{V}_\gamma^{-1} \hat{\gamma}_{ols}, \] where \(\widehat{V}_\gamma\) is the sampling variance estimator for \(\hat{\gamma}_{ols}\).
  • Theorem:
    • Suppose we conduct a completely randomized experiment in a random sample from an infinite population. If \(Y_i(1) - Y_i(0) = \tau\) for some value \(\tau\) and all units, then:
    1. \(\gamma^* = 0\).
    2. \(Q \xrightarrow{d} \chi^2(\text{dim}(X_i))\).

Simulation

Generate covariates

# Packages used throughout the simulation
library(magrittr)      # pipe operator %>%
library(dplyr)
library(purrr)
library(modelsummary)  # regression tables
library(kableExtra)    # kbl() and kable_styling()

set.seed(1)
N <- 1000    # sample size
R <- 1000    # number of replications (used in the Monte Carlo check below)
N_t <- 500   # number of treated units
tau_sp <- 1  # super-population average treatment effect
covariates <- rnorm(N, 0, 1)
head(covariates)
## [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684

Generate potential outcomes

outcomes <-
  data.frame(
    y0 = rnorm(N, mean = 0, sd = 1),
    y1 = rnorm(N, mean = tau_sp, sd = 1),
    x = covariates
  )
outcomes <-
  outcomes %>%
  dplyr::mutate(
    # Add a covariate-dependent term to each potential outcome;
    # subtracting exp(1/2) = E[exp(x)] for x ~ N(0, 1) keeps the
    # means at 0 and tau_sp, respectively.
    y0 = y0 + (exp(x) - exp(1/2)) * 0.01,
    y1 = y1 + (exp(x) - exp(1/2)) * 0.02,
    tau = y1 - y0
  )
head(outcomes)
##            y0          y1          x       tau
## 1  1.12382271  0.09156567 -0.6264538 -1.032257
## 2  1.10746050 -0.93119758  0.1836433 -2.038658
## 3 -0.88292883  2.59539836 -0.8356286  3.478327
## 4  0.24354150  1.58488974  1.5952808  1.341348
## 5  0.06681127  0.93898132  0.3295078  0.872170
## 6 -1.67473381  1.67224769 -0.8204684  3.346982

Distribution of the individual treatment effects

  • The sample mean of the individual treatment effects is near the super-population average treatment effect; a quick check follows.
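A minimal check using the simulated data frame defined above:

# Sample mean of the unit-level effects; should be close to tau_sp = 1.
mean(outcomes$tau)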

Assign treatment and get observations

assignment_realized <- 1:N %in% sample(N, N_t)
head(assignment_realized)
## [1] FALSE FALSE  TRUE  TRUE FALSE FALSE
data_realized <- 
  data.frame(
    x = covariates,
    w = assignment_realized,
    y = outcomes$y0 * (1 - assignment_realized) + outcomes$y1 * assignment_realized
  )
head(data_realized)
##            x     w           y
## 1 -0.6264538 FALSE  1.12382271
## 2  0.1836433 FALSE  1.10746050
## 3 -0.8356286  TRUE  2.59539836
## 4  1.5952808  TRUE  1.58488974
## 5  0.3295078 FALSE  0.06681127
## 6 -0.8204684 FALSE -1.67473381

Calculate ols estimator

  • Estimate with no covariates, covariates, and interactions:
result_ols <-
  c(
    "y ~ w",
    "y ~ w + x",
    "y ~ w + x + w:x"
  ) %>%
  purrr::map(as.formula) %>%
  purrr::map(
    ~ lm(formula = ., data = data_realized)
  )

Estimation results

result_ols %>%
  modelsummary(vcov = "HC2") 
                 Model 1     Model 2     Model 3
(Intercept)      0.003       0.004       0.003
                 (0.047)     (0.047)     (0.047)
wTRUE            1.042       1.042       1.043
                 (0.065)     (0.065)     (0.065)
x                            0.020       -0.003
                             (0.033)     (0.048)
wTRUE × x                                0.046
                                         (0.066)
Num.Obs.         1000        1000        1000
R2               0.203       0.204       0.204
R2 Adj.          0.203       0.202       0.202
AIC              2905.8      2907.3      2908.8
BIC              2920.5      2927.0      2933.3
Log.Lik.         -1449.876   -1449.666   -1449.401
F                254.748     127.508     85.142
Std.Errors       HC2         HC2         HC2
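  • The chi-squared statistic \(Q\) for the treatment effect heterogeneity test can also be computed from the interaction model. Below is a minimal sketch, assuming the sandwich package for the HC2 covariance estimate (here \(\text{dim}(X_i) = 1\), so \(Q\) is compared against \(\chi^2(1)\)):

library(sandwich)
# gamma_hat is the interaction coefficient; v_gamma is its HC2 sampling variance
gamma_hat <- coef(result_ols[[3]])[["wTRUE:x"]]
v_gamma <- vcovHC(result_ols[[3]], type = "HC2")["wTRUE:x", "wTRUE:x"]
q <- gamma_hat^2 / v_gamma
pchisq(q, df = 1, lower.tail = FALSE)  # p-value under the null of a constant effect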

Manually calculate the standard errors

se_hetero <-
  result_ols %>%
  purrr::map(
    ~ data.frame(
      data_realized,
      residual = .$residuals,   # ols residuals
      df = .$df.residual        # N minus the number of estimated parameters
      ) %>%
      dplyr::mutate(w_bar = mean(w)) %>%
      dplyr::summarise(
        # Sampling variance estimator from the slides:
        # sum of (W_i - W_bar)^2 * residual_i^2 / [W_bar (1 - W_bar)]^2,
        # divided by N times the residual degrees of freedom
        v = sum((w - w_bar)^2 * residual^2 / (w_bar * (1 - w_bar))^2) /
          (length(w) * mean(df))
      ) %>%
      sqrt()
  ) %>%
  dplyr::bind_rows()

Showing the manually calculated standard errors

data.frame(
  model = c("No covariate", "Covariates", "Interactions"),
  "Heteroskedasticity robust standard error" = se_hetero$v
  ) %>%
  kbl() %>%
  kable_styling()
model          Heteroskedasticity.robust.standard.error
No covariate   0.0652981
Covariates     0.0653172
Interactions   0.0653326
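
Monte Carlo check

  • As a check on the large-sample claims, here is a minimal Monte Carlo sketch, reusing R = 1000 defined above as the number of replications: the assignment is re-randomized over the same sample, and each of the three ols estimators of \(\tau\) is recomputed.

tau_hat <-
  purrr::map_dfr(
    seq_len(R),
    function(r) {
      # re-randomize the assignment, holding the sample fixed
      w <- 1:N %in% sample(N, N_t)
      data_r <-
        data.frame(
          x = covariates,
          w = w,
          y = outcomes$y0 * (1 - w) + outcomes$y1 * w
        )
      data.frame(
        no_covariate = coef(lm(y ~ w, data = data_r))[["wTRUE"]],
        covariates = coef(lm(y ~ w + x, data = data_r))[["wTRUE"]],
        interactions = coef(lm(y ~ w + x + w:x, data = data_r))[["wTRUE"]]
      )
    }
  )
colMeans(tau_hat)      # all three averages should be close to tau_sp = 1
apply(tau_hat, 2, sd)  # spreads are nearly identical here, because x is only
                       # weakly correlated with the potential outcomes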

References

  • Chapter 7 of Imbens, Guido W., and Donald B. Rubin. 2015. Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction. Cambridge University Press.
  • Section 5 of Athey, Susan, and Guido Imbens. 2016. “The Econometrics of Randomized Experiments.” arXiv:1607.00698 [stat.ME]. http://arxiv.org/abs/1607.00698.