Potential Outcomes and Causality
\[ \color{gray} \begin{aligned} \hat\Delta_{\text{raw}} =\textcolor[RGB]{0,191,196}{\frac{1}{N_1} \sum_{i:W_i=1} Y_i} - \textcolor[RGB]{248,118,109}{\frac{1}{N_0} \sum_{i:W_i=0} Y_i} \approx \textcolor[RGB]{0,191,196}{8.64} - \textcolor[RGB]{248,118,109}{7.02} \approx 1.62 \end{aligned} \]
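In code, \(\hat\Delta_{\text{raw}}\) is just a difference of two subgroup means. Here is a minimal Python sketch; the `y` and `w` arrays are hypothetical stand-ins (not the course data), chosen so the group means land near the numbers quoted above.

```python
import numpy as np

# Hypothetical observed outcomes y and treatment indicators w (illustration only).
y = np.array([8.0, 9.5, 8.4, 7.1, 6.8, 7.2])
w = np.array([1,   1,   1,   0,   0,   0])

# Raw difference in means: treated-group mean minus control-group mean.
delta_raw = y[w == 1].mean() - y[w == 0].mean()
print(delta_raw)  # 8.63... - 7.03... ≈ 1.6
```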
\(j\) | \(x_j\) | \(y_j(1)\) | \(y_j(0)\) | \(\tau_j\) |
---|---|---|---|---|
1 | 55 | 6 | 2 | 4 |
2 | 55 | 0 | 0 | 0 |
3 | 55 | 4 | 1 | 3 |
4 | 75 | 7 | 7 | 0 |
5 | 75 | 8 | 4 | 4 |
6 | 75 | 2 | 0 | 2 |
\[ \color{gray} \begin{aligned} \text{target} = \textcolor[RGB]{0,191,196}{\frac 1m \sum_{j=1}^m y_j(1)} - \textcolor[RGB]{248,118,109}{\frac 1m \sum_{j=1}^m y_j(0)} = \frac1m \sum_{j=1}^m \tau_j \quad \text{for} \quad \tau_j = y_j(1) - y_j(0) \end{aligned} \]
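Plugging the six-unit table above into this formula gives a concrete target:
\[ \text{target} = \frac{6+0+4+7+8+2}{6} - \frac{2+0+1+7+4+0}{6} = \frac{27 - 14}{6} = \frac{13}{6} = \frac{4+0+3+0+4+2}{6} \approx 2.17. \]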
\(\text{probability}\) | \(W\) | \(y_3(W)\) |
---|---|---|
1/2 | 0 | 1 |
1/2 | 1 | 4 |
\(j\) | \(x_j\) | \(y_j(1)\) | \(y_j(0)\) | \(\tau_j\) | \(w_j\) | \(y_j(w_j)\) |
---|---|---|---|---|---|---|
1 | 55 | 6 | 2 | 4 | 1 | 6 |
2 | 55 | 0 | 0 | 0 | 0 | 0 |
3 | 55 | 4 | 1 | 3 | 1 | 4 |
4 | 75 | 7 | 7 | 0 | 0 | 7 |
5 | 75 | 8 | 4 | 4 | 1 | 8 |
6 | 75 | 2 | 0 | 2 | 0 | 0 |
\[ W_j = \begin{cases} 0 & \text{with probability } 1/2 \\ 1 & \text{with probability } 1/2 \end{cases} \qqtext{ and } \sum_{j=1}^m 1_{=w}(W_j) = m/2 \qfor w \in \{0,1\} \]
\[ \hat\tau = \frac{1}{m/2} \sum_{j:W_j=1} y_j(W_j) - \frac{1}{m/2} \sum_{j:W_j=0} y_j(W_j) \]
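For the particular assignment shown in the table above (\(w = 1,0,1,0,1,0\)), the estimate works out to
\[ \hat\tau = \frac{6+4+8}{3} - \frac{0+7+0}{3} = 6 - \frac{7}{3} = \frac{11}{3} \approx 3.67, \]
which misses the target \(\bar\tau = 13/6 \approx 2.17\) in this one randomization, but not on average, as the calculation below shows.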
\[ \begin{aligned} \hat\mu(w) &\overset{\texttip{\small{\unicode{x2753}}}{This is our group's mean.}}{=} \frac{1}{m/2}\sum_{j:W_j=w}y_j(W_j) \\ &\overset{\texttip{\small{\unicode{x2753}}}{We rewrite our subpopulation sum as a sum over the entire population using group indicators.}}{=} \frac{1}{m/2}\sum_{j=1}^m y_j(W_j) \ 1_{=w}(W_j) \\ &\overset{\texttip{\small{\unicode{x2753}}}{We use the indicator trick. $f(W)1_{=w}(W) = f(w)1_{=w}(W)$ for any function $f$.}}{=} \frac{1}{m/2}\sum_{j=1}^m y_j(w) 1_{=w}(W_j). \\ \mathop{\mathrm{E}}[\hat\mu(w)] &\overset{\texttip{\small{\unicode{x2753}}}{distributing the expectation using linearity}}{=} \frac{1}{m/2}\sum_{j=1}^m y_j(w) \ \mathop{\mathrm{E}}[1_{=w}(W_j)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{expectations of indicators are probabilities and we're doing equal-probability randomization: $\mathop{\mathrm{E}}[1_{=w}(W)] = \P(W=w) = 1/2$}}{=} \frac{1}{m/2}\sum_{j=1}^m y_j(w) \ \frac12 \\ &\overset{\texttip{\small{\unicode{x2753}}}{multiplying: $1/(m/2) \times 1/2 = 1/m$}}{=} \frac{1}{m} \sum_{j=1}^m y_j(w) \\ \mathop{\mathrm{E}}[\hat\tau] &\overset{\texttip{\small{\unicode{x2753}}}{distributing the expectation using linearity}}{=} \frac1m \sum_{j=1}^m y_j(1) - \frac1m \sum_{j=1}^m y_j(0) \\ &\overset{\texttip{\small{\unicode{x2753}}}{grouping corresponding terms}}{=} \frac1m\sum_{j=1}^m \underset{\textcolor[RGB]{128,128,128}{\tau_j}}{\{ y_j(1) - y_j(0) \}} = \bar\tau. \end{aligned} \]
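Because there are only \(\binom{6}{3} = 20\) equally likely assignments in the six-unit example, we can check this unbiasedness claim by brute force: average \(\hat\tau\) over every assignment. A short Python sketch using the table's potential outcomes:

```python
import numpy as np
from itertools import combinations

# Potential outcomes for the six-unit example population (from the table above).
y1 = np.array([6, 0, 4, 7, 8, 2])   # y_j(1)
y0 = np.array([2, 0, 1, 7, 4, 0])   # y_j(0)
m = len(y1)
tau_bar = (y1 - y0).mean()           # target: 13/6 ≈ 2.1667

# Complete randomization: every set of m/2 treated units is equally likely.
estimates = []
for treated in combinations(range(m), m // 2):
    w = np.zeros(m, dtype=int)
    w[list(treated)] = 1
    y_obs = np.where(w == 1, y1, y0)                     # observed y_j(W_j)
    estimates.append(y_obs[w == 1].mean() - y_obs[w == 0].mean())

print(np.mean(estimates), tau_bar)   # both ≈ 2.1667, i.e. E[tau_hat] = tau_bar
```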
How different are these means?
Not very. In fact, the means in the red and green subsamples are unbiased estimators of the means of the red and green potential outcomes. \[ \mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{i:W_i=w} Y_i] = \frac{1}{m}\sum_{j=1}^m y_j(w) \qfor N_w = \sum_{i=1}^n 1_{=w}(W_i) \]
\[ \begin{array}{cccc|ccc} j & W_j & y_j(W_j) & x_j & y_j(1) & y_j(0) & \tau_j \\ \hline 1 & & & 55 & 6 & 5 & 1 \\ 2 & & & 55 & 0 & 0 & 0 \\ 3 & & & 55 & 4 & 4 & 0 \\ 4 & & & 75 & 7 & 0 & 7 \\ 5 & & & 75 & 4 & 0 & 4 \\ 6 & & & 75 & 2 & 0 & 2 \\ \end{array} \]
Let’s review the process that gives us our observed treatment+covariate+outcome triples \((W_1, X_1, Y_1) \ldots (W_n,X_n,Y_n)\).
We draw covariate+potential-outcomes triples \(\{X_i, Y_i(0), Y_i(1)\}\) uniformly-at-random from the population of all such triples \(\{x_1, y_1(0), y_1(1)\}, \ldots, \{x_m, y_m(0), y_m(1)\}\). \[ \{ X_i,Y_i(0), Y_i(1)\} = \{x_J, y_J(0), y_J(1)\} \qfor J=1 \ldots m \qqtext{ each with probability } 1/m \]
We choose treatments \(W_1 \ldots W_n\) by some random mechanism, independent of everything else. These determine which potential outcomes we observe.
\[ Y_i = Y_i(W_i) \qqtext{ for } W_1 \ldots W_n \qqtext{independent of} \{X_1,Y_1(0), Y_1(1)\} \ldots \{X_n, Y_n(0), Y_n(1)\} \]
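Here is a minimal simulation sketch of this two-step process in Python. The population is the six-unit table just above, and the assignment mechanism is taken to be independent fair coin flips; the sample size `n = 1000` and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population of covariate + potential-outcome triples (x_j, y_j(0), y_j(1)) from the table.
x_pop  = np.array([55, 55, 55, 75, 75, 75])
y0_pop = np.array([ 5,  0,  4,  0,  0,  0])
y1_pop = np.array([ 6,  0,  4,  7,  4,  2])
m, n = len(x_pop), 1000

# Step 1: draw rows J_1 ... J_n uniformly at random (with replacement) from the population.
J = rng.integers(0, m, size=n)
X, Y0, Y1 = x_pop[J], y0_pop[J], y1_pop[J]

# Step 2: assign treatments independently of everything else (here, fair coin flips)
# and observe only Y_i = Y_i(W_i).
W = rng.integers(0, 2, size=n)
Y = np.where(W == 1, Y1, Y0)

# What we actually get to see: the triples (W_i, X_i, Y_i).
print(np.column_stack([W, X, Y])[:5])
```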
\[ \frac{1}{m}\sum_{j=1}^m y_j(w) = \mathop{\mathrm{E}}[Y_i \mid W_i=w] \]
\[ \begin{aligned} \frac{1}{m}\sum_{j=1}^m y_j(w) &\overset{\texttip{\small{\unicode{x2753}}}{sampling uniformly-at-random}}{=} \mathop{\mathrm{E}}[Y_i(w)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{irrelevance of independent conditioning variables}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid W_i=w] \\ &\overset{\texttip{\small{\unicode{x2753}}}{if we've flipped $W_i=w$, $Y_i(W_i)=Y_i(w)$. It's a bit like the indicator trick.}}{=} \mathop{\mathrm{E}}[Y_i(W_i) \mid W_i=w] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Definition. $Y_i=Y_i(W_i)$.}}{=} \mathop{\mathrm{E}}[Y_i \mid W_i=w] = \mu(w) \end{aligned} \]
\[ \mathop{\mathrm{E}}[\hat\mu(w)] \overset{\texttip{\small{\unicode{x2753}}}{unbiasedness of column means}}{=} \mu(w) \overset{\texttip{\small{\unicode{x2753}}}{identification}}{=} \mathop{\mathrm{E}}[Y_i(w)] \]
\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{i:W_i=w} Y_i ] &\overset{\texttip{\small{\unicode{x2753}}}{via the law of iterated expectations}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}\qty[ \frac{1}{N_w}\sum_{i=1}^n Y_i \ 1_{=w}(W_i) \mid W_1 \ldots W_n] ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of conditional expectation; $N_w$ and the indicators are functions of $W_1 \ldots W_n$}}{=} \mathop{\mathrm{E}}\qty[ \frac{1}{N_w}\sum_{i=1}^n \mathop{\mathrm{E}}[Y_i \mid W_1 \ldots W_n] \ 1_{=w}(W_i) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{indicator trick}}{=} \mathop{\mathrm{E}}\qty[ \frac{1}{N_w}\sum_{i=1}^n \mathop{\mathrm{E}}[Y_i(w) \mid W_1 \ldots W_n] \ 1_{=w}(W_i) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{irrelevance of independent conditioning variables. The assignments $W_1 \ldots W_n$ are independent of $Y_i(w)$. }}{=} \mathop{\mathrm{E}}\qty[ \frac{1}{N_w}\sum_{i=1}^n \mathop{\mathrm{E}}[Y_i(w)] \ 1_{=w}(W_i) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{via linearity, i.e. pulling out the constant $E[Y_i(w)]$}}{=} \mathop{\mathrm{E}}[Y_i(w)] \mathop{\mathrm{E}}\qty[ \frac{1}{N_w}\sum_{i=1}^n 1_{=w}(W_i) ] = \mathop{\mathrm{E}}[Y_i(w)] \ \mathop{\mathrm{E}}\qty[\frac{N_w}{N_w}] = \mathop{\mathrm{E}}[Y_i(w)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{sampling uniformly-at-random}}{=} \frac{1}{m}\sum_{j=1}^m y_j(w) \end{aligned} \]
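This is easy to check by simulation. A Python sketch using the table just above as the population and independent coin flips as the assignment mechanism (the number of replications is arbitrary, and the subsample mean is undefined in the negligible-probability event that a group is empty):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population potential outcomes from the table above; the targets are the column means.
y0_pop = np.array([5, 0, 4, 0, 0, 0])    # mean 1.5
y1_pop = np.array([6, 0, 4, 7, 4, 2])    # mean 23/6 ≈ 3.83
m, n = len(y0_pop), 50

def subsample_means():
    """One study: sample n units uniformly, flip coins, return the two subsample means."""
    J = rng.integers(0, m, size=n)                  # uniform-at-random sampling
    W = rng.integers(0, 2, size=n)                  # independent coin-flip treatments
    Y = np.where(W == 1, y1_pop[J], y0_pop[J])      # observed outcomes Y_i = Y_i(W_i)
    return Y[W == 1].mean(), Y[W == 0].mean()

draws = np.array([subsample_means() for _ in range(100_000)])
print(draws.mean(axis=0))    # ≈ (3.83, 1.50): the column means of y_j(1) and y_j(0)
```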
How different are these means?
Not very different. Why?
How different are these means?
The counterfactual and post-randomization means are very different. Why?
\[ \color{gray} \begin{aligned} \hat\Delta_{\text{raw}} &= \textcolor[RGB]{0,191,196}{\frac{1}{N_1} \sum_{i:W_i=1} Y_i} - \textcolor[RGB]{248,118,109}{\frac{1}{N_0} \sum_{i:W_i=0} Y_i} \\ &= \sum_x \textcolor[RGB]{0,191,196}{P_{x|1}} \ \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \sum_x \textcolor[RGB]{248,118,109}{P_{x | 0}} \ {\textcolor[RGB]{248,118,109}{\hat\mu(0,x)}} \\ &= \underset{\text{adjusted difference} \ \hat\Delta_1}{\sum_x \textcolor[RGB]{0,191,196}{P_{x|1}} \{\textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}\}} + \qty{\underset{\text{covariate shift term}}{\sum_x \textcolor[RGB]{0,191,196}{P_{x|1}} \ \textcolor[RGB]{248,118,109}{\hat\mu(0,x)} - \sum_x \textcolor[RGB]{248,118,109}{P_{x|0}} \ \textcolor[RGB]{248,118,109}{\hat\mu(0,x)}}} \end{aligned} \]
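The decomposition is an algebraic identity, so it can be checked line by line in code. A minimal Python/pandas sketch with hypothetical data (made up so that the treated arm skews toward \(x = 75\); not the course data):

```python
import pandas as pd

# Hypothetical observed data: treatment w, a discrete covariate x, outcome y.
df = pd.DataFrame({
    "w": [1, 1, 1, 1, 0, 0, 0, 0],
    "x": [55, 75, 75, 75, 55, 55, 55, 75],
    "y": [6, 8, 7, 2, 2, 0, 1, 7],
})

# Cell means mu_hat(w, x) and within-arm covariate distributions P_{x|w}.
mu_hat = df.groupby(["w", "x"])["y"].mean()
p_x_given_w = df.groupby("w")["x"].value_counts(normalize=True)
xs = sorted(df["x"].unique())

# Adjusted difference: treated-vs-control contrasts reweighted by P_{x|1}.
delta_1 = sum(p_x_given_w[1, x] * (mu_hat[1, x] - mu_hat[0, x]) for x in xs)
# Covariate shift term: control cell means averaged over P_{x|1} minus over P_{x|0}.
shift = sum((p_x_given_w[1, x] - p_x_given_w[0, x]) * mu_hat[0, x] for x in xs)

delta_raw = df.loc[df.w == 1, "y"].mean() - df.loc[df.w == 0, "y"].mean()
print(delta_raw, delta_1 + shift)    # 3.25 and 3.25: the decomposition is exact
```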
If treatment assignments are conditionally independent of the potential outcomes given covariates
\[
\text{ i.e. if } \ W_i \qqtext{ is independent of } \{Y_i(0), Y_i(1)\} \qqtext{ conditional on } X_i
\]
then we can identify potential outcome means within groups with the same covariate value.
It’s a conditional version of the same formula.
\[ \begin{aligned} \mu(w,x) &= \mathop{\mathrm{E}}[Y_i \mid W_i=w, X_i=x] = \mathop{\mathrm{E}}[Y_i(w) \mid X_i=x] = \frac{1}{m_x} \sum_{j:x_j=x} y_j(w) \\ \qfor &m_x = \sum_{j=1}^m 1_{=x}(x_j) \end{aligned} \]
\[ \mathop{\mathrm{E}}[Y(w) \mid W, X] = \mathop{\mathrm{E}}[Y(w) \mid X] \qqtext{ if } W \qqtext{ is independent of } \{Y(0), Y(1)\} \qqtext{ conditional on } X \]
\[ \begin{aligned} \mathop{\mathrm{E}}[Y_i \mid W_i=w, X_i=x] &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i=Y_i(W_i)$}}{=} \mathop{\mathrm{E}}[Y_i(W_i) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(W_i)=Y_i(w)$ when $W_i=w$}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{irrelevance of conditionally independent conditioning variables}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(w)$ is sampled uniformly at random from the potential outcomes $y_j(w)$ of the $m_x$ units with $x_j=x$}}{=} \frac{1}{m_x} \sum_{j:x_j=x} y_j(w) \qfor m_x = \sum_{j=1}^m 1_{=x}(x_j) \end{aligned} \]
\[ \begin{aligned} \mu(1,x) - \mu(0,x) &= \mathop{\mathrm{E}}[Y_i(1) \mid X_i=x] - \mathop{\mathrm{E}}[Y_i(0) \mid X_i=x] \\ &= \frac{1}{m_x} \sum_{j:x_j=x} y_j(1) - \frac{1}{m_x} \sum_{j:x_j=x} y_j(0) \\ &= \frac{1}{m_x} \sum_{j:x_j=x} \qty{y_j(1) - y_j(0)} \\ &= \frac{1}{m_x} \sum_{j:x_j=x} \tau_j \end{aligned} \]
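As a concrete check, take the six-unit table with the blank \(W_j\) column above (the one with \(y_j(0) = 5,0,4,0,0,0\)). Each covariate value has \(m_x = 3\) units, so
\[ \tau(55) = \frac{6+0+4}{3} - \frac{5+0+4}{3} = \frac{1}{3}, \qquad \tau(75) = \frac{7+4+2}{3} - \frac{0+0+0}{3} = \frac{13}{3}, \]
and averaging with weights \(m_x/m = 1/2\) recovers \(\bar\tau = \tfrac12\cdot\tfrac13 + \tfrac12\cdot\tfrac{13}{3} = \tfrac{7}{3}\).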
\[ \begin{aligned} \mathop{\mathrm{E}}[\hat\Delta_\text{all}] &=\mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \qty{\hat \mu(1,x) - \hat\mu(0,x) }] \\ &\overset{\texttip{\small{\unicode{x2753}}}{law of iterated expectations}}{=} \mathop{\mathrm{E}}\qty[\mathop{\mathrm{E}}\qty[ \sum_x P_{x} \ \qty{\hat \mu(1,x) - \hat\mu(0,x)} \mid (W_1, X_1) \ldots (W_n, X_n)]] \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of conditional expectation}}{=} \mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \qty{\mathop{\mathrm{E}}\qty[\hat\mu(1,x) \mid (W_1,X_1) \ldots (W_n,X_n)] - \mathop{\mathrm{E}}\qty[\hat\mu(0,x) \mid (W_1,X_1) \ldots (W_n,X_n)]}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{unbiasedness of the sample mean}}{=} \mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \qty{\mu(1,x) - \mu(0,x)}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{identification: $\tau(x)=\mu(1,x)-\mu(0,x)$}}{=} \mathop{\mathrm{E}}\qty[\sum_x P_{x} \ \tau(x)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of expectation}}{=} \sum_x \mathop{\mathrm{E}}[P_{x}] \ \tau(x) \\ &\overset{\texttip{\small{\unicode{x2753}}}{unbiasedness of sample proportions and def of $\tau(x)$}}{=} \sum_x p_{x} \ \frac{1}{m_x}\sum_{j:x_j=x} \tau_j \qfor p_{x} = \mathop{\mathrm{E}}[P_{x}]=\frac{m_x}{m} \\ &\overset{\texttip{\small{\unicode{x2753}}}{rewriting our sum of column sums as a single sum}}{=} \sum_x \frac{m_x}{m} \ \frac{1}{m_x}\sum_{j:x_j=x} \tau_j = \frac{1}{m}\sum_{j=1}^m \tau_j \end{aligned} \]
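A simulation sketch of this unbiasedness claim in Python, using the same six-unit population and a confounded assignment mechanism in which the treatment probability depends on \(x\). The probabilities 0.2 and 0.8, the sample size, and the number of replications are arbitrary illustration choices, and the cells are large enough that empty-cell issues can be ignored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population from the six-unit table with y_j(0) = 5,0,4,0,0,0.
x_pop  = np.array([55, 55, 55, 75, 75, 75])
y0_pop = np.array([ 5,  0,  4,  0,  0,  0])
y1_pop = np.array([ 6,  0,  4,  7,  4,  2])
tau_bar = (y1_pop - y0_pop).mean()             # 7/3 ≈ 2.33

def delta_all(n=400):
    """One study: sample n units, assign treatment with probability depending on x,
    then compute the adjusted estimate sum_x P_x {mu_hat(1,x) - mu_hat(0,x)}."""
    J = rng.integers(0, len(x_pop), size=n)
    X, Y0, Y1 = x_pop[J], y0_pop[J], y1_pop[J]
    p_treat = np.where(X == 75, 0.8, 0.2)      # confounded assignment, but only through x
    W = rng.binomial(1, p_treat)
    Y = np.where(W == 1, Y1, Y0)

    est = 0.0
    for x in np.unique(X):
        in_x = X == x
        mu1 = Y[in_x & (W == 1)].mean()
        mu0 = Y[in_x & (W == 0)].mean()
        est += in_x.mean() * (mu1 - mu0)       # P_x-weighted within-x contrast
    return est

draws = np.array([delta_all() for _ in range(20_000)])
print(draws.mean(), tau_bar)                   # both ≈ 2.33, up to Monte Carlo error
```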
\[ \begin{aligned} \mathop{\mathrm{E}}[\hat\mu(w)] &\overset{\texttip{\small{\unicode{x2753}}}{This is our subsample mean.}}{=} \mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{j=1}^m y_j(W_j) \ 1_{=w}(W_j)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{the indicator trick: $y_j(W_j) 1_{=w}(W_j) = y_j(w) 1_{=w}(W_j)$}}{=} \mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{j=1}^m y_j(w) \ 1_{=w}(W_j)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{via the law of iterated expectations}}{=} \mathop{\mathrm{E}}\qty[\mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{j=1}^m y_j(w) \ 1_{=w}(W_j) \mid N_w]] \\ &\overset{\texttip{\small{\unicode{x2753}}}{the linearity of conditional expectation}}{=} \mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{j=1}^m y_j(w) \ \mathop{\mathrm{E}}[1_{=w}(W_j) \mid N_w]] \\ &\overset{\texttip{\small{\unicode{x2753}}}{substituting the conditional probability $\mathop{\mathrm{E}}[1_{=w}(W_j) \mid N_w] = N_w/m$}}{=} \mathop{\mathrm{E}}\qty[\frac{1}{N_w}\sum_{j=1}^m y_j(w) \ \frac{N_w}{m}] \\ &\overset{\texttip{\small{\unicode{x2753}}}{multiplying}}{=} \mathop{\mathrm{E}}\qty[\frac{1}{m} \sum_{j=1}^m y_j(w)] \\ &\overset{\texttip{\small{\unicode{x2753}}}{the expectation of a constant is that constant}}{=} \frac{1}{m} \sum_{j=1}^m y_j(w) \end{aligned} \]
\[ \begin{aligned} N_w &\overset{\texttip{\small{\unicode{x2753}}}{$N_w$ is a function of the conditioning variable, so it equals its own conditional expectation}}{=} \mathop{\mathrm{E}}[N_w \mid N_w] \\ &\overset{\texttip{\small{\unicode{x2753}}}{plugging in the definition of $N_w$}}{=} \mathop{\mathrm{E}}\qty[\sum_{j=1}^m 1_{=w}(W_j) \mid N_w] \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of conditional expectation}}{=} \sum_{j=1}^m \mathop{\mathrm{E}}[1_{=w}(W_j) \mid N_w] \\ &\overset{\texttip{\small{\unicode{x2753}}}{identical distribution}}{=} m \times \mathop{\mathrm{E}}[1_{=w}(W_j) \mid N_w], \qqtext{ and therefore } \mathop{\mathrm{E}}[1_{=w}(W_j) \mid N_w] = \frac{N_w}{m}. \end{aligned} \]
This is inconsistent with everyday use of the term treatment because you can, e.g., take both ibuprofen and acetaminophen. In potential outcomes language, we’d say that ibuprofen alone, acetaminophen alone, and ibuprofen plus acetaminophen are three different treatments.
This is sort of like a sampling-without-replacement version of flipping a coin for each individual. The marginal probability of treatment is the same (\(1/2\)), but treatments for different individuals aren’t independent like they would be if we flipped a coin. We could flip coins instead, but we’d wind up treating everybody or nobody sometimes, especially when our population is small.
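For example, with independent fair coin flips the chance of treating everybody or nobody is \(2 \times (1/2)^m = 2^{1-m}\), which is \(1/32 \approx 3\%\) for a population of \(m = 6\).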
Some people prefer to say ‘randomization distribution’ in this context because we’re not sampling individuals from our population; we get a dot for each one.
Why? If we draw a sample of the same size as the population without replacement, we get the whole population.
Conditional independence is a new term. We’ll define it shortly.
There are other ways, e.g. making a deck of cards for each age group, shuffling them, and treating the top \(N_{w,x}\) cards in each deck.