```r
library(tidyverse)
```
Week 13 Homework
Introduction
Summary
In this homework, we’re going to focus on estimating average treatment effects. While we haven’t done that in any previous homework, it’s not really a new thing either. It’s really just a matter of using our adjusted difference in means estimators \(\hat\Delta_1\), \(\hat\Delta_{\text{all}}\), etc. with the right data. After all, we’ve shown that \(\hat\Delta_{\text{all}}\) is an unbiased estimator of the average treatment effect \(\text{ATE} = \frac{1}{m}\sum_{j=1}^m \{y_j(1) - y_j(0)\}\) when treatment is conditionally randomized.
The new part is how we’re going to check that they work. We’re going to simulate a conditionally randomized experiment. Having done this, we can show our estimator’s sampling distribution and check coverage just like we did in the last homework. We’ll compare what happens with and without inverse probability weighting.
This is a real thing
We’ve been doing simulation all semester. That’s how we’ve been visualizing the sampling distribution of our estimators and calculating coverage. This is just bringing that into a causal inference context. This isn’t just something people do in class. It’s how people check that their estimators work and compare them to each other. Many papers in Machine Learning conferences and statistics journals use one or both of these simulation techniques. Maybe even the majority of them. Andrew Gelman has a blog post titled “Yes, I really really really like fake-data simulation, and I can’t stop talking about it.”
What to Expect
You’ll see a few variations on things we’ve done before.
- Testing estimators for causal effects using simulated data. Here that simulation involves a randomization step.
- Adjusting for covariate shift by averaging predictions. Here that shift is induced by conditional randomization.
- Using inverse probability weights to make these predictions in a way that gives us unbiased—or nearly unbiased—estimators of \(\Delta_\text{all}\), \(\Delta_1\), etc. Here we’re interpreting these causally.
That’s all useful stuff. But this is also a good opportunity to review skills and ideas from earlier in the semester.
- Approximating the coverage of biased estimators using normal approximation and the \(\text{bias}/\text{standard deviation}\) ratio.1
- Calculating the coverage of complicated point estimators using simulation.2
As usual, we’ll use the tidyverse.
Simulating the Data
We’re going to work with the example we talked about in Lecture 10.3 We’ve got a list of people that might donate to a political campaign—that’s our population—and we’re considering two ways of contacting them to ask for money: an email or a phone call. We’re running a pilot study to decide which one to use. In particular we want to estimate the average treatment effect of a call (vs. an email) on the amount they donate.
\[ \begin{aligned} \text{ATE} &= \frac{1}{m}\sum_{j=1}^m \qty{\textcolor[RGB]{0,191,196}{y_j(1)} - \textcolor[RGB]{248,118,109}{y_j(0)}} \qqtext{ where } \\ & \textcolor[RGB]{0,191,196}{y_j(1)} \text{ is amount person $j$ would donate if contacted by phone} \\ & \textcolor[RGB]{248,118,109}{y_j(0)} \text{ is amount person $j$ would donate if contacted by email} \end{aligned} \]
We call \(y_j(w)\) the potential outcome for person \(j\) under treatment \(w\). These are, of course, just things we’re imagining—we either call them or we don’t—but that’s decision-making. When you think through your options, you’re imagining the consequences of some choices that you aren’t ultimately going to make.
To look into what happens when we try to estimate the ATE, we’ll sample people from our population and then randomize them to a treatment. In particular, we’ll do it like this.
- Sample without replacement from our population to choose who we’re going to contact.
- Assign each of them to treatment by flipping a weighted coin that may depend on that person’s age.
To simulate an experiment like this, we have to start one step back. ‘Step 0’ is making up the population.
Step 0
When we make up our population, we haven’t yet assigned these people to treatment, so we describe each person by their age \(x_j\) and their potential outcomes \(y_j(\cdot)\).4 This means our whole population is a list \(x_1, y_1(\cdot) \ldots x_m, y_m(\cdot)\). Here’s some code that makes up a population like this.
We’re going to use a random number generator to help us do this, but this isn’t a part of the random process we’re analyzing statistically. It’s just a way of making up \(m\) different things without having to do it all by hand. That means that we only do this once, even when we’re simulating our study over and over to calculate the sampling distribution of our estimator. The population isn’t changing.
```r
m = 1000
set.seed(1)
x = sample(c(20,30,40,50), m, replace=TRUE)
y = function(w) { (2 + 10*((x-20)/30 + w*sqrt((x-20)/30))) * runif(m, min=0, max=2) }

po.pop = data.frame(x=x, y0=y(0), y1=y(1))
```
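If you want to see what this population looks like, it’s just a data frame with one row per person. Peeking at a few rows is a decent sanity check.

```r
# The first few people: each row has an age x and both potential outcomes.
head(po.pop)
```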
Step 1
We choose a random sample \(X_1, Y_1(\cdot) \ldots X_n, Y_n(\cdot)\) by sampling without replacement from the population. These random variables \(X_i, Y_i(\cdot)\) are sampled uniformly-at-random from our population, so \(\mathop{\mathrm{E}}[Y_i(w)]=\frac{1}{m}\sum_{j=1}^m y_j(w)\).
```r
n = 200
J = sample(1:m, n, replace=FALSE)
po.sam = po.pop[J,]
```
Step 2
For each person (\(i\)) we’ve chosen, we flip a coin with probability \(\pi(X_i)\) of landing heads. Then, taking the corresponding potential outcome for each person, we get the sample we actually observe: \(W_1, Y_1(W_1) \ldots W_n, Y_n(W_n)\). We call \(Y_i(W_i)\), which is sometimes just written as \(Y_i\), the realized outcome.
```r
pi = function(x) { 1/(1+exp(-(x-30)/8)) }
W = rbinom(n, 1, pi(po.sam$x))
sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
```
Visualizing our Simulation Process
By flipping through the tabs below, you can see the process of simulating our experiment.
- In the first tab, you can see the population of potential outcomes we’re sampling from. Each pair \(\textcolor[RGB]{248,118,109}{y_j(0)}, \textcolor[RGB]{0,191,196}{y_j(1)}\) is drawn as a pair of points connected by a thin line. I’ve made them hollow because they’re potential outcomes—they don’t necessarily happen.
- In the second, you see the 200 of these that we’ve sampled.
- In the third, the circle corresponding to the realized outcome gets filled in.
Identifying the ATE
Let’s start by reviewing our identification results from Lecture 11. We know, because it’s right in the code above this, that these two sampling and randomization conditions are satisfied.
- \(X_1,Y_1(\cdot) \ldots X_n,Y_n(\cdot)\) are sampled uniformly-at-random from our population of covariate/potential-outcome pairs \(x_1,y_1(\cdot) \ldots x_m, y_m(\cdot)\).
- The treatment \(W_i\) is a binary random variable with a probability \(\pi(X_i)\) of being one that depends only on the covariate \(X_i\).
When these conditions are satisfied, we can identify the mean potential outcome among people with each level of the covariate. That means we can work out a formula for it in terms of the distribution of the data we actually observe: the treatment, covariate, and realized outcome triples \(W_1,X_1,Y_1(W_1) \ldots W_n,X_n,Y_n(W_n)\). The mean of the potential outcome \(y_j(w)\) for treatment \(w\) among people with covariate value \(x\) is the conditional mean \(\mu(w,x)\) of the realized outcome \(Y_i(W_i)\) given that treatment \(W_i=w\) and covariate value \(X_i=x\).
\[ \begin{aligned} \mu(w,x) &\overset{\texttip{\small{\unicode{x2753}}}{This is just notation. $\mu(w,x)$ is defined by this equation.}}{=} \mathop{\mathrm{E}}[Y_i(W_i) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(W_i)=Y_i(w)$ when $W_i=w$}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid W_i=w, X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{irrelevance of conditionally independent conditioning variables}}{=} \mathop{\mathrm{E}}[Y_i(w) \mid X_i=x] \\ &\overset{\texttip{\small{\unicode{x2753}}}{$Y_i(w)$ is sampled uniformly at random from the potential outcomes $y_j(w)$ of the $m_x$ units with $x_j=x$}}{=} \frac{1}{m_x} \sum_{j:x_j=x} y_j(w) \qfor m_x = \sum_{j=1}^m 1_{=x}(x_j) \end{aligned} \tag{1}\]
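Because we made up the population ourselves, we can check Equation 1 numerically. This isn’t part of the assignment; it’s just a sanity check comparing a sample-based estimate of \(\mu(w,x)\) to the population quantity it’s supposed to identify. The age 30 here is an arbitrary choice.

```r
# Sanity check of Equation 1 (not part of the assignment). Among sampled
# people with X_i = 30 who were assigned treatment, the average realized
# outcome should estimate the average of y_j(1) over population members
# with x_j = 30. One sample is noisy; averaging over many simulated
# samples would get us closer.
mean(sam$y[sam$x == 30 & sam$w == 1])   # estimate of mu(1, 30) from our sample
mean(po.pop$y1[po.pop$x == 30])         # the population quantity it identifies
```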
From here, showing that the average treatment effect (ATE) is equal to the familiar quantity \(\Delta_{\text{all}}\) is a matter of averaging.
\[ \begin{aligned} \Delta_{\text{all}} &\overset{\texttip{\small{\unicode{x2753}}}{This equation defines $\Delta_{\text{all}}$}}{=} \mathop{\mathrm{E}}[ \mu(1,X_i) - \mu(0,X_i) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Our within-group identification result (Equation 1)}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}[ Y_i(1) \mid X_i] - \mathop{\mathrm{E}}[Y_i(0) \mid X_i] ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Linearity of conditional expectations}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}[ Y_i(1) - Y_i(0) \mid X_i] ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{The law of iterated expectations}}{=} \mathop{\mathrm{E}}\qty[ Y_i(1) - Y_i(0) ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{sampling uniformly-at-random}}{=} \frac{1}{m}\sum_{j=1}^m \left\{ y_j(1) - y_j(0) \right\} = \text{ATE}. \end{aligned} \tag{2}\]
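We can check this identity in our made-up population, too. Below is a quick sketch: it computes \(\mu(w,x)\) directly from the potential outcomes in po.pop, averages the difference \(\mu(1,x_j)-\mu(0,x_j)\) over the population, and compares that to the ATE. The helper `mu` below is mine, not something defined elsewhere in the assignment.

```r
# mu(w, x): mean potential outcome under treatment w among population
# members with covariate value x, computed directly from po.pop.
mu = function(w, x) {
  y.w = if (w == 1) po.pop$y1 else po.pop$y0
  sapply(x, function(xx) mean(y.w[po.pop$x == xx]))
}
# Delta_all, i.e. the average of mu(1,x) - mu(0,x) over the population ...
mean(mu(1, po.pop$x) - mu(0, po.pop$x))
# ... matches the ATE exactly.
mean(po.pop$y1 - po.pop$y0)
```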
Estimating the ATE
Now let’s estimate the thing. We’ve shown that the ATE is equal to \(\Delta_\text{all}\), so we can estimate it exactly the same way we estimated \(\Delta_\text{all}\) in our last assignment. We’ll estimate \(\hat\mu\) by (potentially weighted) least squares, then average the difference in predictions \(\textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)}\) over the sample. \[ \begin{aligned} \hat\Delta_\text{all} &= \frac{1}{n} \sum_{i=1}^n \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} \\ &\qfor \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i, X_i) \ \left\{ Y_i - m(W_i,X_i) \right\}^2 \end{aligned} \tag{4}\]
Below, I’ve packaged all that up into a single function that returns this estimate of the ATE when you give it three arguments.
- A sample \(W_1,X_1,Y_1 \ldots W_n,X_n,Y_n\).
- A formula describing a regression model \(\mathcal{M}\).
- A function \(\gamma\) that computes weights \(\gamma(W_i,X_i)\) for each observation.
```r
constant.weights = function(sam) { rep(1, nrow(sam)) }
estimate.ate = function(sam, formula, gamma=constant.weights) {
  model = lm(formula, weights=gamma, data=sam |> mutate(gamma=gamma(sam)))
  muhat.0X = predict(model, newdata=sam |> mutate(w=0))
  muhat.1X = predict(model, newdata=sam |> mutate(w=1))
  mean(muhat.1X - muhat.0X)
}
```
By default, it uses the weights \(\gamma(w,x)=1\), i.e., it uses unweighted least squares. If we want to use inverse probability weighted least squares, we can pass in this weighting function. We’ll talk about why these are the right weights below in Section 4.1.
```r
ate.weights = function(sam) { ifelse(sam$w==1, 1/pi(sam$x), 1/(1-pi(sam$x))) }
```
Writing the estimator as a function of the sample makes it easy to calculate its sampling distribution and its bootstrap sampling distribution. We’ll get there. First, let’s just calculate point estimates. We’ll try using both the unweighted and inverse probability weighted least squares predictors in the parallel lines model.
```r
ate.hat.lsq = sam |> estimate.ate(y~w+x)
ate.hat.ipw = sam |> estimate.ate(y~w+x, ate.weights)
```
And here’s the estimation target.
```r
ate = mean(po.pop$y1 - po.pop$y0)
```
| target | without ipw | with ipw |
|---|---|---|
| 5.52 | 5.95 | 6.28 |
It looks like we’re doing better without the weights. To get a sense of whether this is just least squares getting lucky, let’s look at the sampling distributions of these two estimators.
To do that, we’ll use two functions defined below.
- `sample.and.randomize` simulates the process of sampling and randomizing. It takes as input a population of potential outcomes, a sample size \(n\), and a function \(\pi(x)\) that tells us the probability of assigning treatment \(w=1\) to a person with covariate value \(x\). It returns a list of data frames, each describing a sample \(W_1,X_1,Y_1(W_1) \ldots W_n,X_n,Y_n(W_n)\). By default, it gives us 1000 samples like this.
- `sampling.distribution` is not particularly interesting, but it is useful. It takes a list of samples and a list of estimators, calculates each estimator on each sample, and returns the results in a data frame describing the sampling distribution of each estimator. Here an estimator is a function that takes one argument, a sample like the ones returned by `sample.and.randomize`, and returns an estimate.
```r
sample.and.randomize = function(po.pop, n, pi, replications=1000) {
  1:replications |> map(function(.) {
    J = sample(1:nrow(po.pop), n, replace=FALSE)
    po.sam = po.pop[J,]
    W = rbinom(n, 1, pi(po.sam$x))
    sam = data.frame(w=W, x=po.sam$x, y=ifelse(W==1, po.sam$y1, po.sam$y0))
    sam
  })
}
sampling.distribution = function(samples, estimators) {
  samples |> map(function(sam) {
    estimators |> map(function(estimator) {
      data.frame(value=estimator(sam))
    }) |> bind_rows(.id='estimator')
  }) |> bind_rows(.id='sample')
}
```
We can use these functions to calculate the sampling distribution of our two estimators.
```r
samples = po.pop |> sample.and.randomize(n, pi)
estimators = list(
  lsq = \(sam) estimate.ate(sam, y~w+x),
  ipw = \(sam) estimate.ate(sam, y~w+x, ate.weights))
sampling.dist = samples |> sampling.distribution(estimators)
```
Here are the results. I’ve used a function defined in the expandable box below to draw my histograms. The one using inverse probability weighted least squares is drawn in purple and the one using unweighted least squares is drawn in orange. Their means are marked by vertical lines of the same color and the target—the ATE—is marked by a thick green line.
```r
sampling.dist |> plot.sampling.distribution(target=ate)
```
What can we conclude? Both estimators work pretty well. You can see that the estimator that uses unweighted least squares is slightly biased, but not really enough to worry about.
We know this bias shouldn’t have much impact on coverage, but let’s try to quantify it. We can calculate the bias and standard deviation of that estimator like this.
```r
estimates = sampling.dist |> filter(estimator=='lsq') |> pull(value)
bias = mean(estimates) - ate
se = sd(estimates)
```
| bias | standard deviation |
|---|---|
| -0.132 | 0.993 |
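Plugging this bias/standard deviation ratio into the normal approximation from Lecture 13 gives an approximate coverage for the interval \(\hat\theta \pm 1.96\sigma_{\theta}\). This is a sketch using the `bias` and `se` computed above.

```r
# Normal approximation: (theta.hat - theta)/sigma is roughly N(bias/sigma, 1),
# so the interval theta.hat +/- 1.96 sigma covers with probability
# Phi(1.96 - bias/sigma) - Phi(-1.96 - bias/sigma).
pnorm(1.96 - bias/se) - pnorm(-1.96 - bias/se)
```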
When we’re working with real data, we don’t know our estimator’s standard deviation \(\sigma_{\theta}\). We have to estimate it. Let’s do that using the bootstrap. The way we’ve organized our code makes this easy. If we have a list of bootstrap samples, we can just call `sampling.distribution`, passing in this list and the same estimators as before.
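Here’s a rough sketch of that step. The helper `bootstrap.samples` below is mine, not part of the course code: it just resamples rows of our sample with replacement.

```r
# Draw bootstrap samples by resampling rows of sam with replacement, then
# reuse sampling.distribution to get each estimator's bootstrap distribution
# and, from that, a bootstrap standard deviation.
bootstrap.samples = function(sam, replications=1000) {
  1:replications |> map(function(.) {
    sam[sample(1:nrow(sam), nrow(sam), replace=TRUE), ]
  })
}
boot.dist = bootstrap.samples(sam) |> sampling.distribution(estimators)
boot.dist |> filter(estimator=='lsq') |> pull(value) |> sd()
```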
Here’s a version of the table above with this estimated standard deviation added.
| bias | standard deviation | estimated standard deviation |
|---|---|---|
| -0.132 | 0.993 | 1.07 |
It’s not exactly the same, but it’s close.
Let’s see how our bias and our imperfect estimate of our standard deviation affect the coverage of an interval estimate we can really use, \(\hat\theta \pm 1.96 \times \hat\sigma_{\theta}\) where \(\hat\sigma_{\theta}\) is the bootstrap standard deviation.
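Here’s one way to run that check, assuming the `bootstrap.samples` helper sketched above. For each simulated sample we compute the point estimate and a bootstrap standard deviation, then record whether the resulting interval covers the ATE. It’s slow, since it bootstraps every sample; using fewer bootstrap replications keeps it manageable.

```r
# Coverage of theta.hat +/- 1.96 sigma.hat over our simulated experiments.
covers = samples |> map_vec(function(sam) {
  theta.hat = estimate.ate(sam, y~w+x)
  sigma.hat = bootstrap.samples(sam, replications=100) |>
    map_vec(\(bs) estimate.ate(bs, y~w+x)) |> sd()
  abs(theta.hat - ate) <= 1.96 * sigma.hat
})
mean(covers)
```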
Weighting
When we talked about inverse probability weighting in Lecture 14, we talked about using the weights \(\gamma(w,x)=m_{x}/m_{wx}\) to estimate \(\Delta_\text{all}\). That was an exercise—it’s not in the slides—so we’ll review it here. Let’s start with the intuition. By weighting this way, we’re effectively using a sample drawn from an imaginary population in which each column (i.e. at each level of \(x\)) has one red dot and one green dot for each dot of either color in our actual population. As a result, the distributions \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) and \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) of \(x\) in our imaginary population are the same as the distribution \(\textcolor[RGB]{160,32,240}{p_{x}}\) of \(x\) in our actual population.
In the plot below, you can see this in action. On the left, I’ve plotted a population. We see the population, the mean in each group, and the distribution of the covariate \(x\) in each group and in the population as a whole. On the right, you can see the imaginary population we’ve created by using the weights \(\gamma(w,x)=m_{x}/m_{wx}\). I’ve scaled the size of each dot in proportion to its weight. I’ve calculated weighted means within each group and the ‘weighted covariate distribution’ for each group and the population as a whole, i.e., the proportion of weight at each level of \(x\) rather than the proportion of people. You can see that in the imaginary population, the overall (purple) covariate distribution is the same as it is in the unweighted sample, but the covariate distributions of the red and green groups now match the overall distribution. That’s what using the weights \(\gamma(w,x)=m_x/m_{wx}\) does.
There’s something about this that doesn’t quite make sense in randomized experiments. What does it mean to talk about the covariate distributions \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) and \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) among the treated and untreated people in our population when only the ones we sample are actually randomized to treatment? Or the number of people \(\textcolor[RGB]{0,191,196}{m_{1x}}\) and \(\textcolor[RGB]{248,118,109}{m_{0x}}\) with covariate value \(x\) in those groups?
Let’s think about the plot above. The population that’s shown looks a lot like a bigger version of the samples we’ve been working with because it is. To get it, I randomly assigned a treatment (i.e. a color) to each person in the population we’ve been sampling from, exactly the same way I’ve been assigning treatment to our samples. Like this.
```r
W = rbinom(nrow(po.pop), 1, pi(po.pop$x))
pop = data.frame(x=po.pop$x, w=W, y=ifelse(W==1, po.pop$y1, po.pop$y0))
```
If we like, we can think of our sample as being drawn from this pre-randomized population. Like this.
```r
J = sample(1:nrow(pop), n, replace=FALSE)
sam = pop[J,]
```
Why? Let’s think about how we get our sample in physical terms. Suppose we have a population of people and each of them has a coin. To sample without replacement, we write each person’s phone number on a card, shuffle the deck, and call the first \(n\) numbers. Does it matter if the people we call flip their coin before we call them or after? It doesn’t. Earlier, in Section 3, they flipped after we called; in the code right above, they flipped before.
If we think like this, the numbers \(\textcolor[RGB]{248,118,109}{m_{0x}}\) and \(\textcolor[RGB]{0,191,196}{m_{1x}}\) of untreated and treated people in our population with covariate value \(x\) do make sense. But they’re random. And they depend on a bunch of coin flips that have absolutely nothing to do with the people in our sample. So instead of using this random denominator \(m_{wx}\) in our weights \(\gamma(w,x)=m_x/m_{wx}\), let’s average over all these coin flips. Let’s use \(\gamma(w,x)=m_x / \mathop{\mathrm{E}}[m_{wx}]\) instead.
It turns out that this is very easy. Recall that we’ve randomized treatment with probability \(\pi(x)\) that depends only on the covariate \(x\), so \(m_{wx}\) is the sum of \(m_x\) coin flips that come up heads (\(W_j=1\)) with probability \(\pi(x)\). \[ \begin{aligned} \mathop{\mathrm{E}}\qty[m_{wx}] &= \mathop{\mathrm{E}}\qty[\sum_{j:X_j=x} 1_{=w}(W_j)] \\ &= \sum_{j:X_j=x} \mathop{\mathrm{E}}\qty[1_{=w}(W_j)] \\ &= m_x \mathop{\mathrm{E}}\qty[1_{=w}(W_j)] \\ &= m_x \begin{cases} \pi(x) & \qif w=1 \\ 1-\pi(x) & \qif w=0 \end{cases} \end{aligned} \tag{5}\] As a result, we can write the weights \(\gamma(w,x)=m_x / \mathop{\mathrm{E}}[m_{wx}]\) in terms of \(\pi(x)\). The factor of \(m_x\) cancels out. \[ \begin{aligned} \gamma(w,x) &= \color[RGB]{0,191,196}\frac{m_x}{m_x \pi(x)} &&= \color[RGB]{0,191,196}\frac{1}{\pi(x)} & \color[RGB]{0,191,196}\qif w=1 \\ &= \color[RGB]{248,118,109}\frac{m_x}{m_x \{1-\pi(x)\}} &&= \color[RGB]{248,118,109}\frac{1}{1-\pi(x)} & \color[RGB]{248,118,109}\qif w=0 \end{aligned} \tag{6}\] Those are the weights we’re using in the code above. We don’t need to know the covariate distribution in the population to use them; we just need to know how treatment was randomized. Cool, right? This is where the name ‘inverse probability weighting’ comes from. The weight we give each observation is the (multiplicative) inverse of the probability it was randomized to the treatment it received.
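As a sanity check on Equation 5, we can randomize the \(x=30\) slice of our population over and over and watch the average count \(m_{1x}\) settle at \(m_x \, \pi(x)\). Again, this is just an illustration, not part of the assignment.

```r
# Randomize treatment for everyone with x = 30 many times; the average
# number treated should be close to m_x * pi(x).
m.30 = sum(po.pop$x == 30)
m.1.30 = replicate(1000, sum(rbinom(m.30, 1, pi(30))))
mean(m.1.30)     # simulated average of m_{1x} at x = 30
m.30 * pi(30)    # the formula from Equation 5
```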
The randomization probability \(\pi(x)\) plays a pretty pivotal role, so let’s take a look at it. And, while we’re at it, let’s look at \(\gamma(w, x)\).
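One simple way to do that, since there are only four ages in our population, is a little table of \(\pi(x)\) and the two weights at each age. This is just a sketch; any way of visualizing them works.

```r
# pi(x) and the inverse probability weights at each age in our population.
ages = c(20, 30, 40, 50)
data.frame(x = ages,
           pi = pi(ages),
           weight.treated = 1/pi(ages),          # gamma(1, x)
           weight.untreated = 1/(1 - pi(ages)))  # gamma(0, x)
```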
Let’s take a look at what these weights do in our sample. On the left, I’ve plotted our sample, the means in each group, and the covariate distribution in each group and in the sample as a whole. On the right, you can see all the same stuff for the imaginary sample we’ve created using the weights \(\textcolor[RGB]{0,191,196}{\gamma(1,x)=1/\pi(x)}\) and \(\textcolor[RGB]{248,118,109}{\gamma(0,x)=1/\{1-\pi(x)\}}\). To compare to what happens using weights \(\gamma(w,x)=m_x/m_{wx}\) based on our pre-randomized population, flip to the second tab.
Having done all this, let’s think about bias. That is, after all, why we’re inverse probability weighting. This is going to look a lot like our analysis in Lecture 14.
Let’s think about \(\tilde\mu\), the minimizer of expected weighted squared error. \[ \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \mathop{\mathrm{E}}\qty[ \gamma(W_i,X_i) \left\{ Y_i - m(W_i,X_i) \right\}^2 ] \] What do we know about this? Following the same logic as in Lecture 14, we get a ‘weighted orthogonality condition’. \[ \mathop{\mathrm{E}}\qty[ \gamma(W_i,X_i) \left\{ \mu(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} m(W_i,X_i) ] = 0 \qqtext{ for all } m \in \mathcal{M}. \]
Now let’s take a moment to acknowledge a quirk. Exercise 3 does not necessarily imply that if we do the same thing in our sample, we’ll get an unbiased estimate of the ATE.5
Let’s think about two estimators. \[ \begin{aligned} \hat\theta_0 &= \frac{1}{n}\sum_{i=1}^n 1_{=1}(W_i) \gamma(W_i,X_i) Y_i - \frac{1}{n}\sum_{i=1}^n 1_{=0}(W_i) \gamma(W_i,X_i) Y_i \\ \hat\theta_1 &= \frac{1}{n}\sum_{i=1}^n \hat\mu(1,X_i) - \hat\mu(0,X_i) \qfor \hat\mu(w,x) = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \gamma(W_i, X_i) \ \left\{ Y_i - m(W_i,X_i) \right\}^2 \end{aligned} \] When you use the parallel lines model, these are pretty similar, but they are not the same.
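To see the difference concretely, here’s a sketch of \(\hat\theta_0\) alongside our regression-based \(\hat\theta_1\). The function `theta.hat.0` below is mine; \(\hat\theta_1\) is just `estimate.ate` with the inverse probability weights.

```r
# theta.hat.0: the weighted difference of sample averages, with no model.
theta.hat.0 = function(sam) {
  gamma = ate.weights(sam)
  mean(ifelse(sam$w == 1, gamma * sam$y, 0)) -
    mean(ifelse(sam$w == 0, gamma * sam$y, 0))
}
theta.hat.0(sam)
estimate.ate(sam, y~w+x, ate.weights)   # theta.hat.1 in the parallel lines model
```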
Estimating the ATT
Now let’s think about estimating the ATT: the average treatment effect on the treated.
\[ \text{ATT} = \mathop{\mathrm{E}}[Y_i(1) - Y_i(0) \mid W_i=1] \]
Using an argument a lot like the one we used to show that the ATE was equal to \(\Delta_\text{all}\) in Section 3 above, we can show that this is equal to \(\Delta_1\). This argument, like that one, uses the law of iterated expectations to boil things down to the ‘within-group’ identification Equation 1 implied by conditional randomization. But there’s a trick to using the law of iterated expectations here because we’re already conditioning on \(W_i=1\). The trick is to start by getting rid of the conditioning. Here’s a formula that’ll help us do that. \[ \mathop{\mathrm{E}}[ Z \mid W=1] = \mathop{\mathrm{E}}[ Z \ 1_{=1}(W) ] \ / \ P(W=1) \qqtext{ for any random variables $W$ and $Z$ with $W \in \{0,1\}$ }. \tag{7}\]
We’ll use this formula to get rid of that conditioning on \(W_i=1\) when we get started and put it back at the end. Here’s the full step-by-step derivation. \[ \begin{aligned} \text{ATT} &\overset{\texttip{\small{\unicode{x2753}}}{Using the formula above $Z=Y_i(1)-Y_i(0)$}}{=} \mathop{\mathrm{E}}\qty[\{Y_i(1) - Y_i(0)\} 1_{=1}(W_i) ] \ / \ P(W_i=1) \\ &\overset{\texttip{\small{\unicode{x2753}}}{iterated expectations conditioning on $W_i,X_i$}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}\qty[\{Y_i(1) - Y_i(0)\} 1_{=1}(W_i) \mid W_i, X_i]] \ / \ P(W_i=1) \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of conditional expectations}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mathop{\mathrm{E}}\qty[Y_i(1)\mid W_i, X_i] - \mathop{\mathrm{E}}\qty[Y_i(0) \mid W_i, X_i] } 1_{=1}(W_i) ] \ / \ P(W_i=1) \\ &\overset{\texttip{\small{\unicode{x2753}}}{Using the within-group-identication result}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mu(1,X_i) - \mu(0,X_i) } 1_{=1}(W_i) ] \ / \ P(W_i=1) \\ &\overset{\texttip{\small{\unicode{x2753}}}{iterated expectations conditioning on $X_i$}}{=} \mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}[ \qty{ \mu(1,X_i) - \mu(0,X_i) } 1_{=1}(W_i) \mid X_i]] \ / \ P(W_i=1) \\ &\overset{\texttip{\small{\unicode{x2753}}}{linearity of conditional expectations}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mu(1,X_i) - \mu(0,X_i) } \mathop{\mathrm{E}}[1_{=1}(W_i) \mid X_i]] \ / \ P(W_i=1) \\ &\overset{\texttip{\small{\unicode{x2753}}}{Using the formula above, in reverse, with $Z=\mu(1,X_i)-\mu(0,X_i)$}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mu(1,X_i) - \mu(0,X_i) } \mid W_i=1] \end{aligned} \tag{8}\]
Typically, we estimate this with an analogous sample average of \(\hat\mu\). \[ \widehat{\text{ATT}} = \textcolor[RGB]{0,191,196}{\frac{1}{N_1}\sum_{i:W_i=1}}\qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat \mu(0,X_i)} } \qfor N_1 = \sum_{i=1}^n W_i \] This should look familiar. We’ve been using the same formula to estimate \(\Delta_1\).
Now let’s run through a few of the same steps we did for the ATE to estimate the ATT. I’ll give you some code to help out. All you’ve got to do, in addition to running the code above here in the document, is write functions `estimate.att` and `att.weights` analogous to `estimate.ate` and `ate.weights` from Section 4. Here’s a starting point for that. What I’ve done is copy/paste those from Section 4 and changed the names. What I haven’t done is change the code so it actually estimates the ATT. You don’t have to do that much to it, but you will have to change a few lines. I’ve marked them for you.
```r
constant.weights = function(sam) { rep(1, nrow(sam)) }
estimate.att = function(sam, formula, gamma=constant.weights) {
  model = lm(formula, weights=gamma, data=sam |> mutate(gamma=gamma(sam)))
  muhat.0X = predict(model, newdata=sam |> mutate(w=0))
  muhat.1X = predict(model, newdata=sam |> mutate(w=1))
  mean(muhat.1X - muhat.0X)                                # <1>
}
att.weights = function(sam) {
  ifelse(sam$w==1, 1/pi(sam$x), 1/(1-pi(sam$x)))           # <2>
}
```
1. You’ll need to change this line if you want it to estimate the ATT.
2. You’ll need to change this line if you want your inverse probability weights to work.
```r
att = mean((po.pop$y1 - po.pop$y0)*pi(po.pop$x)/mean(pi(po.pop$x)))   # <1>
estimators = list(
  lsq = \(sam) estimate.att(sam, y~w+x),
  ipw = \(sam) estimate.att(sam, y~w+x, att.weights))
sampling.dist = samples |> sampling.distribution(estimators)
sampling.dist |> plot.sampling.distribution(target=att)
```
1. See Note 1 for a discussion of this formula.
Now let’s do some inverse probability weighting. When we talked about the inverse probability weights for the ATE, we started with the idea of using the weights \(\gamma(w,x)=m_x/m_{wx}\) and then replaced \(m_{wx}\) with its expectation. Let’s do something similar for the weights \(\gamma(w,x)=m_{1x}/m_{wx}\). These are like the weights we focused on in Lecture 14 when we were estimating \(\Delta_0\), but with the colors flipped—\(\Delta_1\) is an average over the green dots and \(\Delta_0\) is an average over the red ones.
If you’ve gotten these right, you should be able to get a nearly-unbiased estimator of the ATT using them.
The Impact of Model Choice
Now let’s think briefly about what happens when we use different models. Below, I’ve tried out four. For each, I’ve drawn two plots.
- The least squares and inverse probability weighted least squares predictors on top of the sample they’re fit to. Squares show the unweighted least squares predictions; diamonds show the inverse probability weighted least squares predictions.
- The sampling distributions of the ATT estimators using least squares and inverse probability weighted least squares. The unweighted version is in orange; the inverse probability weighted one is in purple.
Let’s use these to do an exercise on visualizing and communicating about bias and coverage.
```r
formula = y~w
sam |> lsq.plot(formula, gammas)
```
```r
formula = y~w+x
sam |> lsq.plot(formula, gammas)
```
```r
formula = y~w*x
sam |> lsq.plot(formula, gammas)
```
```r
formula = y~w+factor(x)
sam |> lsq.plot(formula, gammas)
```
Footnotes
1. This’ll be similar to these exercises from Lecture 13.
2. This’ll be similar to this exercise from the Week 9 Homework.
3. The process we use to generate the data here is different (simpler), but the overall theme is the same.
4. With this notation, we’re writing our potential outcomes as one thing (a function of treatment \(w\)) instead of two things (that function’s value at treatments \(w=0\) and \(w=1\)).
5. This isn’t just a randomization thing. In Lecture 14, we did a calculation similar to Exercise 3 in a section of the slides called ‘Proving Unbiasedness’, but we never actually proved that our estimator was unbiased. It isn’t. Here, you’ll get a chance to see why. And why it’s pretty close to unbiased anyway.
6. The derivation of this is just like the derivation of the weighted orthogonality condition in https://qtm285-1.github.io/assets/lectures/Lecture14.html#the-orthogonality-condition
7. Go ahead and try it if you like. It’s a good exercise, but not one I’m going to assign.