Difference in Differences

2 x 2 Design

Kohei Kawaguchi, Hong Kong University of Science and Technology

2 x 2 design Difference in Differences (DID)

Repeated cross-section

  • Data are repeated cross sections.
  • \((Y_i, T_i, G_i)\), \(i = 1, \cdots, N\).
  • Period: \(T_i \in \{0, 1\}\).
  • Group: \(G_i \in \{0, 1\}\).
  • Treatment: \(D_i = T_i \cdot G_i\).
  • Potential outcomes: \(Y_i(0), Y_i(1)\).
  • Observed outcome: \(Y_i = D_i \cdot Y_i(1) + (1 - D_i) \cdot Y_i(0)\).

Repeated cross-section 2 x 2 DID parameter

\[ \begin{split} \tau_{DID} &= [\mathbb{E}(Y_i| G_i = 1, T_i = 1) - \mathbb{E}(Y_i| G_i = 1, T_i = 0)] \\ &- [\mathbb{E}(Y_i| G_i = 0, T_i = 1) - \mathbb{E}(Y_i| G_i = 0, T_i = 0)]. \end{split} \]

  • We can estimate this parameter as an OLS estimator \(\hat{\tau}_{DID}\) of: \[ Y_i = \alpha + \beta G_i + \gamma T_i + \tau \cdot G_i \cdot T_i + \epsilon_i. \]
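As a minimal sketch of why the OLS interaction coefficient recovers \(\tau_{DID}\), the base R code below simulates a repeated cross-section and computes the parameter both ways. The sample size, the coefficients, the effect of 2, and the names `grp` and `post` are illustrative assumptions, not part of the lecture's simulation.

```r
# Simulated repeated cross-section (all numbers are illustrative assumptions).
set.seed(123)
n <- 4000
grp <- rbinom(n, 1, 0.5)    # group indicator G_i
post <- rbinom(n, 1, 0.5)   # period indicator T_i
y <- 1 + 0.5 * grp + 0.3 * post + 2 * grp * post + rnorm(n)

# DID from the four cell means.
cell_mean <- function(gv, tv) mean(y[grp == gv & post == tv])
tau_means <- (cell_mean(1, 1) - cell_mean(1, 0)) -
  (cell_mean(0, 1) - cell_mean(0, 0))

# DID as the OLS coefficient on the interaction G_i * T_i.
fit <- lm(y ~ grp + post + grp:post)
tau_ols <- unname(coef(fit)["grp:post"])

c(means = tau_means, ols = tau_ols)
```

Because the regression is saturated in \((G_i, T_i)\), the two numbers agree up to floating-point error.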

In terms of potential outcomes:

\[ \begin{split} \tau_{DID} &= [\mathbb{E}(Y_i| G_i = 1, T_i = 1) - \mathbb{E}(Y_i| G_i = 1, T_i = 0)]\\ &- [\mathbb{E}(Y_i| G_i = 0, T_i = 1) - \mathbb{E}(Y_i| G_i = 0, T_i = 0)]\\ &= [\mathbb{E}(Y_i(1)| G_i = 1, T_i = 1) - \mathbb{E}(Y_i(0)| G_i = 1, T_i = 0)]\\ &- [\mathbb{E}(Y_i(0)| G_i = 0, T_i = 1) - \mathbb{E}(Y_i(0)| G_i = 0, T_i = 0)]\\ &+ [\mathbb{E}(Y_i(0)|G_i = 1, T_i = 1) - \mathbb{E}(Y_i(0)|G_i = 1, T_i = 1)]\\ &= \mathbb{E}(Y_i(1)| G_i = 1, T_i = 1) - \mathbb{E}(Y_i(0)|G_i = 1, T_i = 1)\\ &+ [\mathbb{E}(Y_i(0)|G_i = 1, T_i = 1) - \mathbb{E}(Y_i(0)| G_i = 1, T_i = 0)]\\ &- [\mathbb{E}(Y_i(0)| G_i = 0, T_i = 1) - \mathbb{E}(Y_i(0)| G_i = 0, T_i = 0)]. \end{split} \]

  • \(\tau_{DID}\) is the average treatment effect on \(G_i = 1, T_i = 1\) plus the difference in trends between the two groups.

Parallel trend assumption

  • Assume parallel trends between the two groups: \[ \begin{split} &[\mathbb{E}(Y_i(0)|G_i = 1, T_i = 1) - \mathbb{E}(Y_i(0)| G_i = 1, T_i = 0)]\\ &= [\mathbb{E}(Y_i(0)| G_i = 0, T_i = 1) - \mathbb{E}(Y_i(0)| G_i = 0, T_i = 0)]. \end{split} \]
  • Then, the DID parameter is the average treatment effect on the treated: \[ \tau_{DID} = \mathbb{E}(Y_i(1)| G_i = 1, T_i = 1) - \mathbb{E}(Y_i(0)|G_i = 1, T_i = 1). \]

Panel data

  • Data are panel data.
  • \((Y_{it}, G_i)\), \(i = 1, \cdots, N\).
  • Period: \(t \in \{0, 1\}\).
  • Group: \(G_i \in \{0, 1\}\).
  • Treatment: \(D_{it} = t \cdot G_i\).
  • Potential outcomes: \(Y_{it}(0), Y_{it}(1)\).
  • Observed outcome: \(Y_{it} = D_{it} \cdot Y_{it}(1) + (1 - D_{it}) \cdot Y_{it}(0)\).

Panel 2x2 DID parameter

\[ \begin{split} \tau_{DID} &= \mathbb{E}(Y_{i1} - Y_{i0}| G_i = 1) - \mathbb{E}(Y_{i1} - Y_{i0}| G_i = 0). \end{split} \]

  • We can estimate this parameter as a two-way fixed-effect estimator \(\hat{\tau}_{DID}\) of: \[ Y_{it} = \mu_i + \lambda_t + \tau \cdot G_i \cdot t + \epsilon_{it}. \]
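The link between the difference of mean changes and the two-way fixed-effect regression can be sketched in base R. The panel below is simulated under assumed values (200 units, an effect of 2), and the unit fixed effects are entered as dummies through `lm()` purely for illustration.

```r
# Simulated two-period panel (all numbers are illustrative assumptions).
set.seed(123)
n <- 200
grp <- rbinom(n, 1, 0.5)             # group indicator G_i
mu <- rnorm(n)                       # unit fixed effects
y0 <- mu + rnorm(n)                  # outcome in period 0
y1 <- mu + 0.3 + 2 * grp + rnorm(n)  # outcome in period 1

# Difference in mean changes between the two groups.
dy <- y1 - y0
tau_fd <- mean(dy[grp == 1]) - mean(dy[grp == 0])

# Two-way fixed effects: unit dummies, a period dummy, and G_i * t.
long <- data.frame(
  i = factor(rep(seq_len(n), times = 2)),
  t = rep(c(0, 1), each = n),
  grp = rep(grp, times = 2),
  y = c(y0, y1)
)
fit <- lm(y ~ i + factor(t) + grp:t, data = long)
tau_twfe <- unname(coef(fit)["grp:t"])

c(first_difference = tau_fd, twfe = tau_twfe)
```

In a balanced two-period panel the two estimates coincide, which is why the 2 x 2 parameter can be read off either one.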

In terms of potential outcomes:

\[ \begin{split} \tau_{DID} &= \mathbb{E}(Y_{i1} - Y_{i0}| G_i = 1) - \mathbb{E}(Y_{i1} - Y_{i0}| G_i = 0)\\ &= \mathbb{E}(Y_{i1}(1) - Y_{i0}(0)| G_i = 1) - \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)| G_i = 0)\\ &+ \mathbb{E}(Y_{i1}(0) - Y_{i1}(0)|G_i = 1)\\ &= \mathbb{E}(Y_{i1}(1) - Y_{i1}(0)| G_i = 1)\\ &+ \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)|G_i = 1) - \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)| G_i = 0). \end{split} \]

Parallel trend assumption

  • The parallel trend assumption in the panel case is: \[ \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)|G_i = 1) = \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)| G_i = 0). \]

  • This holds, for example, when the change in the untreated outcome is independent of the group: \[ Y_{i1}(0) - Y_{i0}(0) \perp\!\!\!\!\perp G_i. \]

  • Under this assumption, the DID parameter is the average treatment effect on the treated: \[ \tau_{DID} = \mathbb{E}(Y_{i1}(1) - Y_{i1}(0)| G_i = 1). \]

Semiparametric DID

Parallel trend assumption conditional on covariates

  • Suppose that the parallel trend assumption holds only conditional on covariate \(X_{i}\): \[ \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)|G_i = 1, X_i) = \mathbb{E}(Y_{i1}(0) - Y_{i0}(0)| G_i = 0, X_i). \]
  • How do we exploit this assumption?
  • One way is to assume the following parametric model: \[ Y_{it} = \mu_i + X_i'\lambda_t + \tau \cdot G_i \cdot t + \epsilon_{it}. \]
  • Can we exploit this assumption without assuming any parametric model?

Outcome regression (OR) approach

  • Heckman et al. (1997).
  • Estimate the outcome for the untreated: \[ \mu_{0t}(X_i) \equiv \mathbb{E}(Y_{it}| G_i = 0, X_i) \] by a kernel estimator or local polynomial estimator to get \(\hat{\mu}_{0t}(X_i)\).
  • Then, construct DID estimator: \[ \hat{\tau}^{OR} = \frac{1}{N_1}\sum_{i: G_i = 1} [ Y_{i1} - Y_{i0} - \hat{\mu}_{01}(X_i) + \hat{\mu}_{00}(X_i)]. \]
  • This approach relies on the correct estimation of regression function \(\mu_{0t}(\cdot)\).
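The construction of \(\hat{\tau}^{OR}\) can be sketched by hand in base R. This is only an illustration under stated assumptions: the data are simulated with a true effect of 2 and an untreated trend of \(0.5 X_i\), and a linear regression fitted by `lm()` stands in for the kernel or local polynomial estimator.

```r
# Simulated panel in which parallel trends hold only conditional on x
# (all numbers are illustrative assumptions).
set.seed(123)
n <- 2000
x <- rnorm(n)
grp <- rbinom(n, 1, plogis(x))          # treatment more likely for high x
y0 <- rnorm(n)                          # period 0, nobody treated
y1 <- 0.5 * x + 2 * grp + rnorm(n)      # period 1, true ATT = 2

# Step 1: fit mu_00 and mu_01 on the untreated group only.
ctrl <- data.frame(x = x, y0 = y0, y1 = y1)[grp == 0, ]
mu_00 <- lm(y0 ~ x, data = ctrl)
mu_01 <- lm(y1 ~ x, data = ctrl)

# Step 2: average the adjusted change over the treated group.
treated <- data.frame(x = x[grp == 1])
tau_or <- mean(
  y1[grp == 1] - y0[grp == 1] -
    predict(mu_01, newdata = treated) + predict(mu_00, newdata = treated)
)

# Unadjusted DID for comparison; it also picks up the trend difference.
tau_raw <- mean((y1 - y0)[grp == 1]) - mean((y1 - y0)[grp == 0])
c(or = tau_or, unadjusted = tau_raw)
```

The linear fit is correct here only because the simulated trend is linear in `x`; with a nonlinear trend, a flexible estimator would be needed.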

Inverse probability weight (IPW) approach

  • Abadie (2005).
  • Let \(p(X_i) = \mathbb{P}(G_i = 1| X_i)\) be the propensity score.
  • We have: \[ \begin{split} &\mathbb{E}[Y_{i1} - Y_{i0}| G_i = 1, X_i] - \mathbb{E}[Y_{i1} - Y_{i0}| G_i = 0, X_i]\\ &= \frac{1}{p(X_i)} \mathbb{E}[G_i \cdot (Y_{i1} - Y_{i0})| X_i] - \frac{1}{1 - p(X_i)} \mathbb{E}[(1 - G_i) \cdot (Y_{i1} - Y_{i0})| X_i]\\ &= \mathbb{E}\Bigg[\frac{G_i - p(X_i)}{p(X_i) \cdot [1 - p(X_i)]} \cdot (Y_{i1} - Y_{i0}) \Bigg | X_i \Bigg]. \end{split} \]
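The two equalities in this identity can be checked step by step. As a sketch, write \(\Delta Y_i = Y_{i1} - Y_{i0}\) (a shorthand used only here) and assume overlap, \(0 < p(X_i) < 1\). The law of iterated expectations gives the first equality: \[ \mathbb{E}[G_i \cdot \Delta Y_i| X_i] = p(X_i) \cdot \mathbb{E}[\Delta Y_i| G_i = 1, X_i], \quad \mathbb{E}[(1 - G_i) \cdot \Delta Y_i| X_i] = [1 - p(X_i)] \cdot \mathbb{E}[\Delta Y_i| G_i = 0, X_i]. \] The second equality combines the two weights over a common denominator: \[ \frac{G_i}{p(X_i)} - \frac{1 - G_i}{1 - p(X_i)} = \frac{G_i \cdot [1 - p(X_i)] - (1 - G_i) \cdot p(X_i)}{p(X_i) \cdot [1 - p(X_i)]} = \frac{G_i - p(X_i)}{p(X_i) \cdot [1 - p(X_i)]}. \]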

Inverse probability weight (IPW) approach

  • The average treatment effect on the treated under the conditional parallel trend assumption is: \[ \begin{split} &\mathbb{E}[Y_{i1}(1) - Y_{i1}(0)|G_i = 1] \\ &= \int \mathbb{E}[Y_{i1}(1) - Y_{i1}(0)|G_i = 1, X_i] d\mathbb{P}(X_i|G_i = 1)\\ &= \int \{\mathbb{E}[Y_{i1} - Y_{i0}| G_i = 1, X_i] - \mathbb{E}[Y_{i1} - Y_{i0}| G_i = 0, X_i]\} d\mathbb{P}(X_i|G_i = 1)\\ &= \int \mathbb{E}\Bigg[\frac{G_i - p(X_i)}{p(X_i) \cdot [1 - p(X_i)]} \cdot (Y_{i1} - Y_{i0}) \Bigg | X_i \Bigg] d\mathbb{P}(X_i|G_i = 1)\\ &= \int \mathbb{E}\Bigg[\frac{G_i - p(X_i)}{p(X_i) \cdot [1 - p(X_i)]} \cdot (Y_{i1} - Y_{i0}) \cdot \frac{p(X_i)}{\mathbb{P}(G_i = 1)} \Bigg | X_i \Bigg] d\mathbb{P}(X_i)\\ &= \mathbb{E}\Bigg[\frac{G_i - p(X_i)}{p(X_i) \cdot [1 - p(X_i)]} \cdot (Y_{i1} - Y_{i0}) \cdot \frac{p(X_i)}{\mathbb{P}(G_i = 1)} \Bigg]\\ &= \frac{1}{\mathbb{P}(G_i = 1)} \mathbb{E}\Bigg[\frac{G_i - p(X_i)}{1 - p(X_i)} \cdot (Y_{i1} - Y_{i0}) \Bigg]. \end{split} \]

Inverse probability weight (IPW) approach

  • We first estimate the propensity score \(p(\cdot)\) by a kernel or local polynomial estimator to obtain \(\hat{p}(\cdot)\).
  • Then, we can estimate the average treatment effect on treated as: \[ \hat{\tau}^{IPW} = \frac{1}{\overline{G}} \frac{1}{N} \sum_{i = 1}^N \Bigg[\frac{G_i - \hat{p}(X_i)}{1 - \hat{p}(X_i)} \cdot (Y_{i1} - Y_{i0}) \Bigg] \]
  • This relies on the correct estimation of propensity score \(p(\cdot)\).
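The formula for \(\hat{\tau}^{IPW}\) can be computed directly in base R. The sketch below makes two assumptions that are not in the lecture: the data are simulated with a true effect of 2, and a logit fitted by `glm()` replaces the kernel or local polynomial estimator of the propensity score.

```r
# Simulated panel in which parallel trends hold only conditional on x
# (all numbers are illustrative assumptions).
set.seed(123)
n <- 2000
x <- rnorm(n)
grp <- rbinom(n, 1, plogis(x))          # true propensity score is a logit in x
y0 <- rnorm(n)                          # period 0, nobody treated
y1 <- 0.5 * x + 2 * grp + rnorm(n)      # period 1, true ATT = 2
dy <- y1 - y0

# Step 1: estimate the propensity score p(x).
ps <- fitted(glm(grp ~ x, family = binomial()))

# Step 2: plug into the sample analogue of the IPW formula.
tau_ipw <- mean((grp - ps) / (1 - ps) * dy) / mean(grp)
tau_ipw
```

Control units with a propensity score close to one receive very large weights, so in practice the estimate can be sensitive to poor overlap.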

Doubly-robust (DR) approach

  • Sant'Anna and Zhao (2020) combine both approaches.
  • It considers the sample analogue of: \[ \begin{split} \mathbb{E}[(w_1(G_i) - w_0(G_i, X_i)) \cdot (\Delta Y_i - \Delta \mu_0(X_i))], \end{split} \] where: \[ w_1(G_i) = \frac{G_i}{\mathbb{E}(G_i)}, w_0(G_i, X_i) = \frac{p(X_i) \cdot (1 - G_i) }{1 - p(X_i)} \cdot \mathbb{E}\Bigg[\frac{p(X_i) \cdot (1 - G_i) }{1 - p(X_i)} \Bigg]^{-1}, \] \[ \Delta Y_i = Y_{i1} - Y_{i0}, \Delta \mu_0(X_i) = \mathbb{E}[\Delta Y_i| G_i = 0, X_i]. \]
  • This estimator is consistent if either the regression function or the propensity score is correctly specified, and it is locally efficient if both are.
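A hand-rolled version of the doubly-robust moment, as a sketch only: the data are simulated with a true effect of 2, the propensity score comes from a logit and \(\Delta \mu_0\) from a linear regression on the untreated group. These modelling choices are assumptions made for illustration and are simpler than the estimators implemented in the `DRDID` package.

```r
# Simulated panel in which parallel trends hold only conditional on x
# (all numbers are illustrative assumptions).
set.seed(123)
n <- 2000
x <- rnorm(n)
grp <- rbinom(n, 1, plogis(x))
y0 <- rnorm(n)                          # period 0, nobody treated
y1 <- 0.5 * x + 2 * grp + rnorm(n)      # period 1, true ATT = 2
dy <- y1 - y0

# Nuisance functions: propensity score and untreated outcome change.
ps <- fitted(glm(grp ~ x, family = binomial()))
dmu0 <- predict(lm(dy ~ x, subset = grp == 0), newdata = data.frame(x = x))

# Weights w_1 and w_0, each normalized to have sample mean one.
w1 <- grp / mean(grp)
w0_raw <- ps * (1 - grp) / (1 - ps)
w0 <- w0_raw / mean(w0_raw)

# Sample analogue of E[(w_1 - w_0) * (Delta Y - Delta mu_0(X))].
tau_dr <- mean((w1 - w0) * (dy - dmu0))
tau_dr
```

Replacing either nuisance model with a misspecified one (for example, `grp ~ 1` for the propensity score) is a quick way to see the double-robustness property in this simulated setting.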

Simulation

Setting

  • Consider outcome functions for \(X_i \sim N(0, 1)\): \[ Y_{it}(0) = f(X_i)t + \epsilon_{it}(0), \epsilon_{it}(0) \sim N(0, 1), \] \[ Y_{it}(1) = \tau + X_i + f(X_i)t + \epsilon_{it}(1), \epsilon_{it}(1) \sim N(0, 1), \] and the propensity score: \[ p(X_i) = \frac{\exp(g(X_i))}{1 + \exp(g(X_i))} \] with: \[ f(X_i) = X_i, \] \[ g(X_i) = \exp(X_i). \]

Set parameters

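library(magrittr)  # provides the %>% pipe used below
library(ggplot2)   # provides ggplot(), aes(), and the geoms used in the plots
set.seed(1)
N <- 1000   # number of units
T <- 2      # number of periods (note: masks the built-in alias T for TRUE)
tau <- 10   # constant part of the treatment effect
# Slope of the untreated outcome trend in x
f <- function(x) {
  y <- x
  return(y)
}
# Index inside the logistic propensity score
g <- function(x) {
  y <- exp(x)
  return(y)
}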

Make data

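# Unit-level data: covariate x and treatment group d, where d = 1 with
# probability exp(g(x)) / (1 + exp(g(x)))
df_i <-
  tibble::tibble(
    i = 1:N,
    x = rnorm(length(i)),
    v = rnorm(length(i)),
    d = (runif(length(i)) < exp(g(x)) / (1 + exp(g(x)))) %>% as.integer()
  )
# Unit-period panel: potential outcomes y_0, y_1 and observed outcome y
df <- 
  tidyr::expand_grid(i = 1:N, t = 0:(T - 1)) %>%
  dplyr::left_join(df_i, by = "i") %>%
  dplyr::mutate(
    e_0 = rnorm(length(i)),
    e_1 = rnorm(length(i)),
    y_0 = f(x) * t + e_0,
    y_1 = tau + x + f(x) * t + e_1,
    # y_1 is observed only for the treated group (d = 1) in period t = 1
    y = (1 - d) * y_0 + d * (1 - t) * y_0 + d * t * y_1
  )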

Summarize simulated data

df %>%
  modelsummary::datasummary_skim()
     Unique (#)  Missing (%)   Mean     SD   Min  Median     Max
i          1000            0  500.5  288.7   1.0   500.5  1000.0
t             2            0    0.5    0.5   0.0     0.5     1.0
x          1000            0   -0.0    1.0  -3.0    -0.0     3.8
v          1000            0   -0.0    1.0  -3.3    -0.0     3.6
d             2            0    0.7    0.4   0.0     1.0     1.0
e_0        2000            0    0.0    1.0  -3.7     0.0     3.1
e_1        2000            0   -0.0    1.0  -3.2    -0.0     3.6
y_0        2000            0    0.0    1.3  -4.6     0.0     4.4
y_1        2000            0   10.0    1.9   2.6    10.0    18.9
y          2000            0    3.8    5.3  -4.0     0.8    18.9

Heterogeneity

df %>% dplyr::mutate(t = as.factor(t)) %>%
  ggplot(aes(x = x, y = d, color = t, group = t)) + geom_smooth(method = "gam", formula = y ~ s(x, bs = "cs"), se = FALSE) + 
  scale_color_viridis_d() + theme_classic()

Heterogeneity

df %>% dplyr::mutate(t = as.factor(t)) %>%
  ggplot(aes(x = x, y = y_0, color = t, group = t)) + geom_point() + 
  scale_color_viridis_d() + theme_classic()

Heterogeneity

df %>% dplyr::mutate(t = as.factor(t)) %>%
  ggplot(aes(x = x, y = y_1, color = t, group = t)) + geom_point() + 
  scale_color_viridis_d() + theme_classic()

Heterogeneity

df %>% dplyr::mutate(t = as.factor(t)) %>%
  dplyr::mutate(tau_i = y_1 - y_0) %>%
  ggplot(aes(x = x, y = tau_i, color = t, group = t)) + geom_point() + 
  scale_color_viridis_d() + theme_classic()

Average treatment effect on treated

df %>% dplyr::filter(t == 1, d == 1) %>% dplyr::summarise(mean(y_1 - y_0))
## # A tibble: 1 x 1
##   `mean(y_1 - y_0)`
##               <dbl>
## 1              10.2
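The gap between this value and `tau` = 10 follows from the simulation code, where `y_1 - y_0` equals `tau + x + e_1 - e_0`. As a sketch, taking expectations over the treated group, with both error terms having mean zero independently of the group: \[ \mathbb{E}[Y_{i1}(1) - Y_{i1}(0)| G_i = 1] = \tau + \mathbb{E}[X_i| G_i = 1]. \] Because the probability of treatment increases in \(X_i\), the treated group has \(\mathbb{E}[X_i| G_i = 1] > 0\), so the average treatment effect on the treated lies above \(\tau\).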

Two-way fixed effects estimator

model_twfe <-
  lfe::felm(
    data = df,
    formula = y ~ d:t | i + t
  )
modelsummary::modelsummary(model_twfe)
           Model 1
d × t       10.709
           (0.171)
Num.Obs.      2000
R2           0.951
R2 Adj.      0.901
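The two-way fixed effects estimate targets the unconditional DID parameter, which by the panel decomposition equals the average treatment effect on the treated plus the difference in untreated trends. As a sketch for this simulation, where the code sets the untreated trend to `f(x)` with `f(x) = x`: \[ \tau_{DID} = \mathbb{E}[Y_{i1}(1) - Y_{i1}(0)| G_i = 1] + \mathbb{E}[X_i| G_i = 1] - \mathbb{E}[X_i| G_i = 0]. \] Since the treated group has a higher average \(X_i\) than the untreated group, the last two terms are positive in total, and the estimator is biased upward unless \(X_i\) is conditioned on.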

Outcome regression

model_ordid <-
  DRDID::ordid(data = df, yname = "y", tname = "t",
    idname = "i",dname = "d", xformla = ~ x 
  )
summary(model_ordid)
##  Call:
## DRDID::ordid(yname = "y", tname = "t", idname = "i", dname = "d", 
##     xformla = ~x, data = df)
## ------------------------------------------------------------------
##  Outcome-Regression DID estimator for the ATT:
##  
##    ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
##   9.9773     0.1336    74.6829       0        9.7154    10.2391  
## ------------------------------------------------------------------
##  Estimator based on panel data.
##  Outcome regression est. method: OLS.
##  Analytical standard error.
## ------------------------------------------------------------------
##  See Sant'Anna and Zhao (2020) for details.

IPW

model_ipwdid <-
  DRDID::ipwdid(data = df, yname = "y", tname = "t",
    idname = "i",dname = "d", xformla = ~ x
  )
summary(model_ipwdid)
##  Call:
## DRDID::ipwdid(yname = "y", tname = "t", idname = "i", dname = "d", 
##     xformla = ~x, data = df)
## ------------------------------------------------------------------
##  IPW DID estimator for the ATT:
##  
##    ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
##   10.273     0.1308    78.5239       0       10.0165    10.5294  
## ------------------------------------------------------------------
##  Estimator based on panel data.
##  Hajek-type IPW estimator (weights sum up to 1).
##  Propensity score est. method: maximum likelihood.
##  Analytical standard error.
## ------------------------------------------------------------------
##  See Sant'Anna and Zhao (2020) for details.

Doubly-robust DID

model_drdid <-
  DRDID::drdid(data = df, yname = "y", tname = "t",
    idname = "i",dname = "d", xformla = ~ x 
  )
summary(model_drdid)
##  Call:
## DRDID::drdid(yname = "y", tname = "t", idname = "i", dname = "d", 
##     xformla = ~x, data = df)
## ------------------------------------------------------------------
##  Further improved locally efficient DR DID estimator for the ATT:
##  
##    ATT     Std. Error  t value    Pr(>|t|)  [95% Conf. Interval] 
##   9.9616     0.1366    72.9093       0        9.6938    10.2294  
## ------------------------------------------------------------------
##  Estimator based on panel data.
##  Outcome regression est. method: weighted least squares.
##  Propensity score est. method: inverse prob. tilting.
##  Analytical standard error.
## ------------------------------------------------------------------
##  See Sant'Anna and Zhao (2020) for details.

References

  • Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review of Economic Studies 72 (1): 1–19.
  • Sant’Anna, Pedro H. C., and Jun Zhao. 2020. “Doubly Robust Difference-in-Differences Estimators.” Journal of Econometrics 219 (1): 101–22.
  • Heckman, James J., Hidehiko Ichimura, and Petra E. Todd. 1997. “Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme.” The Review of Economic Studies 64 (4): 605–54.
  • LaLonde, Robert J. 1986. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review 76 (4): 604–20.