Misspecification and Averaging
\[ \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \qfor \mathcal{M}= \qty{ m(w,x) = a(w) + b(w)x } \]
\[ \color{gray} \begin{aligned} \hat\Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \left\{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{N_0 = \sum_{i: W_i=0} 1} \\ \Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \left\{ \textcolor[RGB]{0,191,196}{\mu(1,x_{j})} - \textcolor[RGB]{248,118,109}{\mu(0,x_{j})} \right\} &&\qfor \textcolor[RGB]{248,118,109}{m_0 = \sum_{j: w_j=0} 1} \end{aligned} \]
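Here's a minimal sketch in R of this estimator, not the course's own code: it fits the 'not necessarily parallel lines' model with least squares and averages the predicted differences over the \(W_i=0\) group. The data frame and its column names w, x, y are synthetic, just to make the sketch runnable.

```r
set.seed(1)
sam <- data.frame(                                   # small synthetic sample for illustration
  w = rbinom(500, 1, 0.5),
  x = sample(c(10, 12, 14, 16, 18), 500, replace = TRUE)
)
sam$y <- 20 + 3 * sam$x + 5 * sam$w + rnorm(500, sd = 10)

fit <- lm(y ~ factor(w) * x, data = sam)             # one intercept and slope per group: a(w) + b(w)x

sam0       <- subset(sam, w == 0)                    # units with W_i = 0
mu1.hat    <- predict(fit, transform(sam0, w = 1))   # mu-hat(1, X_i)
mu0.hat    <- predict(fit, transform(sam0, w = 0))   # mu-hat(0, X_i)
Delta0.hat <- mean(mu1.hat - mu0.hat)                # the adjusted difference in means
Delta0.hat
```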
Look at the difference in mean incomes at one education level.
Compare it to our target \(\theta\), the actual difference in means in our population. \[ \hat\theta = \textcolor[RGB]{0,191,196}{\hat\mu(1,x)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,x)} \qqtext{ estimates } \theta = \textcolor[RGB]{0,191,196}{\mu(1,x)} - \textcolor[RGB]{248,118,109}{\mu(0,x)} \]
The sampling distribution of our estimator \(\hat\theta\) is shown above.
For \(x=14\) (a 2-year degree), the education level shown here, its center is far from the target.
Change \(x\) in the code above to see what happens at other education levels.
Here’s the sampling distribution of our adjusted difference in means. \[ \hat\theta = \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1,X_i)} - \textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} } \qqtext{ estimates } \theta = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0}} \qty{ \textcolor[RGB]{0,191,196}{\mu(1,x_{j})} - \textcolor[RGB]{248,118,109}{\mu(0,x_{j})} } \]
We’re using biased estimates of the difference in means at each education level \(x\).
But when we average over the covariate distribution of male CA residents …
Our Gym Subsidy Example
\[ \begin{aligned} \hat \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(X_i) }^2 \qfor \mathcal{M}= \qty{ m(x) = a + bx } && \text{ estimates } \\ \mu(x) &= \frac{1}{m_x} \sum_{j: x_j=x} y_j \qfor m_x = \sum_{j: x_j=x} 1 \end{aligned} \]
\[ \tilde\theta = \tilde\mu(100) \qfor \tilde \mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(x_j) }^2 \]
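As a rough illustration (not the course's code), here is how the population least squares line \(\tilde\mu\) and its plug-in value \(\tilde\mu(100)\) could be computed in R, alongside the sample version \(\hat\mu(100)\). The population below is synthetic and its scale is arbitrary.

```r
set.seed(2)
pop <- data.frame(x = round(runif(5000, 50, 150)))        # synthetic population
pop$y <- 100 + 0.8 * pop$x + rnorm(5000, sd = 20)

mu.tilde    <- lm(y ~ 1 + x, data = pop)                  # population least squares line
theta.tilde <- predict(mu.tilde, data.frame(x = 100))     # theta-tilde = mu-tilde(100)

sam       <- pop[sample(nrow(pop), 100), ]                # one simulated survey
mu.hat    <- lm(y ~ 1 + x, data = sam)                    # sample least squares line
theta.hat <- predict(mu.hat, data.frame(x = 100))         # theta-hat = mu-hat(100)
c(theta.tilde = unname(theta.tilde), theta.hat = unname(theta.hat))
```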
plot.sampling.distribution simulates our experiment over and over and calculates \(\hat\theta\) each time.
\[ \hat\theta = \frac{1}{n} \sum_{i=1}^n \hat\mu(X_i) \]
\[ \mathop{\mathrm{E}}[\hat\theta] = \tilde\theta = \frac{1}{m} \sum_{j=1}^m \tilde\mu(x_j) \]
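Here's a sketch of what such a simulation might look like in R on a synthetic population; plot.sampling.distribution itself isn't reproduced here, and the slides use 10,000 simulated surveys rather than the 1,000 below.

```r
set.seed(3)
pop <- data.frame(x = sample(c(10, 12, 14, 16, 18), 5000, replace = TRUE))  # synthetic population
pop$y <- 20 + 3 * pop$x + rnorm(5000, sd = 10)

theta.tilde <- mean(fitted(lm(y ~ 1 + x, data = pop)))    # (1/m) sum_j mu-tilde(x_j)

theta.hats <- replicate(1000, {                           # the slides use 10,000 draws
  sam <- pop[sample(nrow(pop), 100, replace = TRUE), ]    # one simulated survey
  mean(fitted(lm(y ~ 1 + x, data = sam)))                 # theta-hat = (1/n) sum_i mu-hat(X_i)
})
c(mean.of.draws = mean(theta.hats), theta.tilde = theta.tilde)   # nearly identical
```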
\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(X_i) } \ m(X_i) \qqtext{ for all } m \in \mathcal{M} \]
Why? Plug in \(m(x) = 1\). That’s a line. It’s a horizontal line. Then do a little algebra. \[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(X_i) } \times 1 \implies \textcolor[RGB]{192,192,192}{\frac{1}{n}} \sum_{i=1}^n Y_i = \textcolor[RGB]{192,192,192}{\frac{1}{n}} \sum_{i=1}^n \hat\mu(X_i) \]
It tells us that our estimate \(\hat\theta\) is really just the sample mean.
This is true for any regression model that includes the constant function \(m(x)=1\).
\[ \begin{aligned} \mathcal{M}&= \{ m(x) = a : a \in \mathbb{R} \} && \text{horizontal lines} \\ \mathcal{M}&= \{ \text{all functions} \ m(x) \} && \text{all functions} \\ \mathcal{M}&= \{ m(x) = a + bx : a,b \in \mathbb{R} \} && \text{lines} \\ \mathcal{M}&= \{ m(x) = \sum_{k=0}^p a_k x^k : a_k \in \mathbb{R} \} && \text{polynomials of order } p \\ \end{aligned} \]
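A quick numeric check of this claim for three of these model classes, on synthetic data: in each case the fitted values average to the sample mean of \(y\).

```r
set.seed(4)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200)

# horizontal lines, lines, and third-order polynomials all contain m(x) = 1
for (f in list(y ~ 1, y ~ 1 + x, y ~ poly(x, 3))) {
  fit <- lm(f)
  cat(deparse(f), ": mean(fitted) - mean(y) =", mean(fitted(fit)) - mean(y), "\n")
}
```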
If you change formula = y ~ 1+x to formula = y ~ 0+x, you’ll see it. Bias!
\[ \begin{aligned} \mathcal{M}&= \{ m(x) = bx : b \in \mathbb{R} \} && \text{lines through the origin} \end{aligned} \]
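And the corresponding check for the 'lines through the origin' model, again on synthetic data: without the constant function in the model, the fitted values no longer average to the sample mean.

```r
set.seed(5)
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200)

fit <- lm(y ~ 0 + x)                                   # no constant in the model
c(mean.y = mean(y), mean.fitted = mean(fitted(fit)))   # these differ
```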
\[ \frac{1}{m} \sum_{j=1}^m \qty{ y_j - \tilde\mu(x_j) } \times m(x_j) = 0 \qqtext{ for all } m \in \mathcal{M}. \]
\[ 0 = \sum_{j=1}^m \qty{ y_j - \tilde\mu(x_j) } \times 1 \implies \textcolor[RGB]{192,192,192}{\frac{1}{m}} \sum_{j=1}^m y_j = \underset{\tilde\theta}{\textcolor[RGB]{192,192,192}{\frac{1}{m}} \sum_{j=1}^m \tilde\mu(x_j)} \]
Let’s look at our adjusted difference in means when we use the ‘not necessarily parallel lines’ model. \[ \mathcal{M}= \{ m(w,x) = a(w) + b(w)x \ \text{ for functions} \ a,b \} \]
We’ll break it down into two parts.
Our target, of course, has the same parts.
\[ \color{gray} \begin{aligned} \Delta_0 &= \textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}} \left\{ \textcolor[RGB]{0,191,196}{\mu(1,x_j)} - \textcolor[RGB]{248,118,109}{\mu(0,x_j)} \right\} &&\qfor \textcolor[RGB]{248,118,109}{m_0 = \sum_{j: w_j=0} 1} \\ &= \underset{\text{mismatched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}}\textcolor[RGB]{0,191,196}{\mu(1,x_j)}} - \underset{\text{matched part}}{\textcolor[RGB]{248,118,109}{\frac{1}{m_0}\sum_{j: w_j=0}}\textcolor[RGB]{248,118,109}{\mu(0,x_j)}} \end{aligned} \]
\[ \mathop{\mathrm{E}}\qty[ \textcolor[RGB]{248,118,109}{\frac{1}{N_0}\sum_{i: W_i=0}}\textcolor[RGB]{248,118,109}{\hat\mu(0,X_i)} ] = \textcolor[RGB]{248,118,109}{\frac{1}{m_0} \sum_{j: w_j=0} \mu(0,x_j)} \]
\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M} \]
\[ m(w,x) = 1_{=0}(w) = \begin{cases} 1 & \text{if } w=0 \\ 0 & \text{if } w=1 \end{cases} \]
\[ \begin{aligned} 0 &\overset{\texttip{\small{\unicode{x2753}}}{plugging in}}{=} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \times 1_{=0}(W_i) \\ &\overset{\texttip{\small{\unicode{x2753}}}{using the indicator trick}}{=} \sum_{i=1}^n Y_i 1_{=0}(W_i) - \sum_{i=1}^n \hat\mu(0,X_i) 1_{=0}(W_i). \end{aligned} \]
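Here's a small numeric illustration of that matched-part identity on synthetic data: because the indicator \(1_{=0}(w)\) is in the 'not necessarily parallel lines' model, the fitted values average to the mean outcome within the \(W_i=0\) group.

```r
set.seed(6)
sam <- data.frame(                                   # synthetic sample for illustration
  w = rbinom(500, 1, 0.5),
  x = sample(c(10, 12, 14, 16, 18), 500, replace = TRUE)
)
sam$y <- 20 + 3 * sam$x + 5 * sam$w + rnorm(500, sd = 10)

fit  <- lm(y ~ factor(w) * x, data = sam)            # contains the indicator of w = 0
sam0 <- subset(sam, w == 0)
c(mean.y.in.group.0      = mean(sam0$y),
  mean.fitted.in.group.0 = mean(predict(fit, sam0)))   # equal up to rounding
```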
\[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) } \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M} \]
You could, of course, use the population residuals’ orthogonality property. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\qty[ \qty{ Y_i - \tilde\mu(W_i, X_i) } \ m(W_i,X_i) ] && \qqtext{ for all } m \in \mathcal{M}\\ &\overset{\texttip{\small{\unicode{x2753}}}{By the law of iterated expectations conditioning on $(W_i,X_i)$}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \mu(W_i,X_i) - \tilde\mu(W_i,X_i) } \ m(W_i,X_i) ] \qqtext{ for all } m \in \mathcal{M}\\ &\overset{\texttip{\small{\unicode{x2753}}}{Writing out our expectation as an average over the population}}{=} \frac{1}{m}\sum_{j=1}^m \qty{ \mu(w_j,x_j) - \tilde\mu(w_j,x_j) } \ m(w_j,x_j) \end{aligned} \]
But that has the same problem. It gives us matched averages only.
\[ \begin{aligned} 0 &= \frac{1}{m}\sum_{j=1}^m \qty{ \mu(w_j,x_j) - \tilde\mu(w_j,x_j) } \ 1_{=1}(w_j) \end{aligned} \tag{1}\]
\[ \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1} \mu(1,x_j)} = \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1} \tilde\mu(1,x_j)} \]
\[ \sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1} \ \tilde\mu(1,x)} = \sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1} \ \mu(1,x)} \qfor p_{x \mid w} = \frac{\sum_{j:w_j=w,x_j=x} 1}{\sum_{j:w_j=w} 1} \]
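Here's a sketch of this identity in R, using a synthetic population whose subpopulation means are deliberately curved so the lines model is misspecified: averaged over the \(w=1\) covariate distribution, \(\tilde\mu(1,\cdot)\) and \(\mu(1,\cdot)\) agree anyway.

```r
set.seed(7)
pop <- data.frame(                                   # synthetic population
  w = rbinom(5000, 1, 0.5),
  x = sample(c(10, 12, 14, 16, 18), 5000, replace = TRUE)
)
pop$y <- 20 + 2 * pop$x + pop$w * (5 + 0.5 * (pop$x - 14)^2) + rnorm(5000)  # curved, so lines are misspecified

fit  <- lm(y ~ factor(w) * x, data = pop)            # mu-tilde, the population least squares fit
pop1 <- subset(pop, w == 1)
mu1  <- ave(pop1$y, pop1$x)                          # mu(1, x_j): subpopulation mean at each treated unit's x

c(avg.of.mu       = mean(mu1),
  avg.of.mu.tilde = mean(predict(fit, pop1)))        # equal, as the display claims
```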
What happens if our two histograms are almost the same? We can’t be that far off. \[ \qqtext{ therefore } \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \ \textcolor[RGB]{0,191,196}{\mu(1,x)} \approx \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \ \textcolor[RGB]{0,191,196}{\tilde\mu(1,x)} \qqtext{ if } \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \approx \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \]
If our histograms were exactly the same, it wouldn’t really matter what model we used here.
We could even use the formula y ~ 1+w to fit the ‘horizontal lines model’.
\[\color{gray} \begin{aligned} \textcolor[RGB]{0,191,196}{p_{x \mid 1}} &= \frac{1}{5} \qqtext{ and } \textcolor[RGB]{248,118,109}{p_{x \mid 0}} &= \frac{1}{5} - \frac{1}{20} \frac{x-14}{2} \end{aligned} \]
\[ \color{gray} \begin{aligned} \textcolor[RGB]{248,118,109}{p_{x \mid 0}} &= \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \times (a + bx) \qfor a=\frac{11}{4} \qand b=-\frac{1}{8}. \end{aligned} \]
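A quick check of that relation, assuming the five education levels \(x \in \{10,12,14,16,18\}\) implied by the formulas above.

```r
x  <- c(10, 12, 14, 16, 18)               # the five education levels (assumed)
p1 <- rep(1 / 5, length(x))               # p_{x|1}
p0 <- 1 / 5 - (1 / 20) * (x - 14) / 2     # p_{x|0}
a  <- 11 / 4
b  <- -1 / 8
cbind(x, p0, p1.times.a.plus.bx = p1 * (a + b * x))   # last two columns agree
```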
When we look at the sampling distribution of \(\hat\Delta_0\), we see no bias at all.
And it’s not because our covariate distributions are close to the same in the two groups. They’re not.
Exercise. Prove that the population version of our estimator, \(\tilde\Delta_0\), is equal to the target \(\Delta_0\), i.e., that \[ \color{gray} \begin{aligned} \tilde\Delta_0 &= \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \qty{\textcolor[RGB]{0,191,196}{\tilde\mu(1,x)} - \textcolor[RGB]{248,118,109}{\tilde\mu(0,x)}} = \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \qty{\textcolor[RGB]{0,191,196}{\mu(1,x)} - \textcolor[RGB]{248,118,109}{\mu(0,x)}} = \Delta_0 \qfor \\ \tilde \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \frac{1}{m}\sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \qqtext{ with } \mathcal{M}= \{ a(w) + b(w) x \}. \end{aligned} \]
If you’re not convinced that they’re exactly equal, run this code to check.
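That code isn't reproduced here, but here's a sketch of such a check: build a population whose covariate distributions match \(p_{x\mid 0}\) and \(p_{x \mid 1}\) above (the outcome means are arbitrary and deliberately nonlinear), fit the lines model by least squares on the whole population, and compare \(\tilde\Delta_0\) to the target.

```r
set.seed(8)
x.levels <- c(10, 12, 14, 16, 18)
pop <- rbind(
  data.frame(w = 1, x = rep(x.levels, each = 1000)),                    # p_{x|1} = 1/5 at each level
  data.frame(w = 0, x = rep(x.levels, times = 250 * c(6, 5, 4, 3, 2)))  # p_{x|0} as above
)
pop$y <- 20 + 2 * pop$x + pop$w * (pop$x - 12)^2 + rnorm(nrow(pop))     # arbitrary, nonlinear outcome means

fit  <- lm(y ~ factor(w) * x, data = pop)                               # mu-tilde
pop0 <- subset(pop, w == 0)

# true subpopulation means mu(1, x) and mu(0, x), looked up at each w = 0 unit's x
mu1 <- tapply(pop$y[pop$w == 1], pop$x[pop$w == 1], mean)[as.character(pop0$x)]
mu0 <- tapply(pop$y[pop$w == 0], pop$x[pop$w == 0], mean)[as.character(pop0$x)]

Delta0       <- mean(mu1 - mu0)                                         # the target
Delta0.tilde <- mean(predict(fit, transform(pop0, w = 1)) -
                     predict(fit, transform(pop0, w = 0)))              # population-level estimator
c(target = Delta0, estimator = Delta0.tilde)                            # equal up to rounding error
```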
\[ \begin{aligned} \frac{1}{m}\sum_{j=1}^m y_j &= \frac{1}{m} \ \sum_x \sum_{j:x_j=x} y_j \\ &= \frac{1}{m} \ \sum_x \mu(x) \times m_x \qfor \mu(x) = \frac{1}{m_x} \sum_{j:x_j=x} y_j \qand m_x = \sum_{j:x_j=x} 1 \\ &= \frac{1}{m} \ \sum_x \mu(x) \sum_{j:x_j=x} 1 \\ &= \frac{1}{m} \ \sum_x \sum_{j:x_j=x} \mu(x) \\ &= \frac{1}{m} \ \sum_{j=1}^m \mu(x_j) \end{aligned} \]
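A two-line check of this identity on synthetic data: averaging \(y\) over the population gives the same number as averaging the subpopulation means \(\mu(x_j)\).

```r
set.seed(9)
x <- sample(c(10, 12, 14, 16, 18), 1000, replace = TRUE)
y <- rnorm(1000, mean = 2 * x)
c(mean.of.y = mean(y), mean.of.mu = mean(ave(y, x)))   # identical up to rounding
```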
In the plot, ◇ indicates \(\hat\mu(w,x)\) and ⍿ indicates \(\mu(w,x)\).
Or, to be precise, a histogram that approximates it. But we used 10,000 simulated surveys, so our approximation was pretty good.
In this case, our covariate is education level. So we’re averaging over the distribution of education levels in our sample.
To see that more clearly, uncomment the + zoom.in in the code above.
Replace sam with pop in the code above to see the population least squares predictor \(\tilde\mu\). Still bad.
We proved that in class earlier this week. It’s here in the slides.
Try it! Change target in the code above to function(muhat, sam) { muhat(0) } and see.
Remember that a sum of terms multiplied by an indicator is the sum of the terms where the indicator is 1.
We’ll write \(p_{x \mid 0}\) and \(p_{x \mid 1}\) for the height of the red and green bars at covariate level \(x\) in our population histogram. That is, \(p_{x \mid 0}\) is the proportion of male residents in our population with education level \(x\) and \(p_{x \mid 1}\) the same for female residents.
Try it! Change the formula to y ~ 1+w.