The Behavior of Least Squares Predictors
We’re going to talk about statistical inference when we use least squares in linear models. For example … \[ \color{gray} \begin{aligned} \theta &= \frac{1}{m}\sum_{j=1}^m \qty{ \mu(1,x_j) - \mu(0,x_j) } \qfor \mu(w,x)=\frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \\ \hat\theta &= \frac{1}{n}\sum_{i=1}^n \qty{ \hat\mu(1,X_i) - \hat\mu(0,X_i) } \qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \end{aligned} \]
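To make this concrete, here is a minimal simulation sketch in Python. The population, sample size, and all names (e.g. `group_means`) are ours, chosen just for illustration: we draw a sample with replacement from a made-up population, fit the ‘all functions model’ by least squares — which amounts to taking group means — and compare \(\hat\theta\) to \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up population of m units with binary treatment w and binary covariate x.
m = 1000
w = rng.integers(0, 2, size=m)
x = rng.integers(0, 2, size=m)
y = 1.0 * w + 0.5 * x + rng.normal(size=m)

def group_means(w, x, y):
    """Least squares over the 'all functions' model: the mean of y in each (w, x) group."""
    return {(wv, xv): y[(w == wv) & (x == xv)].mean()
            for wv in (0, 1) for xv in (0, 1)}

# The target: theta = (1/m) sum_j { mu(1, x_j) - mu(0, x_j) } with mu the population group means.
mu = group_means(w, x, y)
theta = np.mean([mu[1, xj] - mu[0, xj] for xj in x])

# The estimator: draw a sample with replacement, fit mu_hat by least squares, and plug it in.
n = 200
J = rng.integers(0, m, size=n)
W, X, Y = w[J], x[J], y[J]
mu_hat = group_means(W, X, Y)
theta_hat = np.mean([mu_hat[1, Xi] - mu_hat[0, Xi] for Xi in X])

print(theta, theta_hat)   # theta_hat is close to theta, but not equal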
We can think of our estimator’s error as a sum of two things.
\[ \hat\theta - \theta = \underset{\text{sampling error}}{\qty{\hat\theta - \textcolor[RGB]{239,71,111}{\tilde\theta}}} + \underset{\text{modeling error}}{\qty{\textcolor[RGB]{239,71,111}{\tilde\theta} - \theta}} \]
\[ \theta = \frac{1}{m}\sum_{j=1}^m \qty{ \mu(1,x_j) - \mu(0,x_j) } \qfor \mu(w,x)=\frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \]
\[ \hat\mu(w,x) - \mu(w,x) = \qty{\hat\mu(w,x) - \textcolor[RGB]{239,71,111}{\tilde\mu(w,x)}} + \qty{\textcolor[RGB]{239,71,111}{\tilde\mu(w,x)} - \mu(w,x)} \]
Q. If the distribution of \(\hat\mu\) is centered on the population least squares predictor \(\tilde\mu\),
why is \(\hat\mu(w,x)\) unbiased when we use the ‘all functions model’?
\[ \textcolor{orange}{\hat\varepsilon_i} = Y_i - \hat\mu(W_i,X_i) \qquad \text{ are the \textbf{residuals}} \]
The residuals are orthogonal to the predictions of every function in our model. \[ \begin{aligned} 0 &= \sum_{i=1}^n \hat\varepsilon_i m(W_i, X_i) \qqtext{ for all } m \in \mathcal{M}\\ &= \underset{\text{vector of residuals}}{\vec{\hat\varepsilon}} \cdot \underset{\text{vector of predictions}}{\vec{m(W,X)}} \qfor \vec{m(W,X)} = [ m(W_1,X_1), \ldots, m(W_n,X_n) ] \end{aligned} \]
For the constant model, that’s the same as the residuals summing to zero.
For the constant-within-groups model, it’s the same as the residuals summing to zero within each group.
\[ \begin{aligned} 0 &= \sum_{i=1}^n \hat\varepsilon_i m && \qqtext{ for all constants } m \text{ if and only if } \\ 0 &= \sum_{i=1}^n \hat\varepsilon_i. \end{aligned} \]
\[ \begin{aligned} 0 &= \sum_{i=1}^n \hat\varepsilon_i m(W_i) && \qqtext{ for all functions } m(w) \text{ if and only if } \\ 0 &= \textcolor[RGB]{17,138,178}{\sum_{i:W_i=w} \hat\varepsilon_i} && \qqtext{ for every group } w, \text{ since } \\ \sum_{i=1}^n \hat\varepsilon_i m(W_i) &= \sum_{w}\sum_{i:W_i=w} \hat\varepsilon_i m(w) = \sum_{w} m(w) \textcolor[RGB]{17,138,178}{\sum_{i:W_i=w} \hat\varepsilon_i} && \text{ is a linear combination of the \textcolor[RGB]{17,138,178}{within-group sums}}. \end{aligned} \] Taking \(m\) to be the indicator of a single group shows that each within-group sum must itself be zero.
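As a quick numerical check (a sketch with made-up data; the variable names are ours), fitting the constant-within-groups model by least squares and summing the residuals within each group of \(w\) gives zeros, up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data with a binary group variable W.
n = 500
W = rng.integers(0, 2, size=n)
Y = 2.0 * W + rng.normal(size=n)

# Least squares over the constant-within-groups model is the mean of Y in each group of W.
mu_hat = {wv: Y[W == wv].mean() for wv in (0, 1)}
residuals = Y - np.array([mu_hat[wi] for wi in W])

# The residuals sum to zero within each group, up to floating point error,
# which makes them orthogonal to every function m(W).
for wv in (0, 1):
    print(wv, residuals[W == wv].sum())
```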
To prove this is true in general, we’ll think about the optimality of \(\hat\mu\) one direction at a time. \[ \begin{aligned} \hat\mu &\qqtext{minimizes} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 &&\qqtext{ over } m \in \mathcal{M}\implies \\ \hat\mu &\qqtext{minimizes} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 &&\qqtext{ over } m \in \mathcal{M}_1 \subseteq \mathcal{M}\\ & \text{and in particular} && \qqtext{ over } m \in \mathcal{M}_1 = \qty{ \hat\mu + t m \ : \ t \in \mathbb{R} } \qqtext{ for any } m \in \mathcal{M} \end{aligned} \]
Writing the squared-error criterion as a function of \(t\) in this one-parameter model, \(t=0\) must be its minimizer, so the derivative there is zero …
\[ \begin{aligned} 0 &= \frac{d}{dt}\Big|_{t=0} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) - t\, m(W_i,X_i) }^2 \\ &= \sum_{i=1}^n 2 \qty{ Y_i - \hat\mu(W_i,X_i) } \times \frac{d}{dt}\Big|_{t=0} \qty{ Y_i - \hat\mu(W_i,X_i) - t\, m(W_i,X_i) } \\ &= \sum_{i=1}^n 2 \qty{ Y_i - \hat\mu(W_i,X_i) } \times \qty{ -m(W_i,X_i) } \\ &= -2 \sum_{i=1}^n \hat\varepsilon_i\, m(W_i,X_i) = -2\, \vec{\hat\varepsilon} \cdot \vec{m(W,X)}. \end{aligned} \]
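We can check this derivative calculation numerically. This is a sketch with a made-up line model and data of our own choosing: the central finite-difference derivative of the squared-error criterion at \(t=0\), in any model direction \(m\), agrees with \(-2\,\vec{\hat\varepsilon}\cdot\vec{m}\), and both are essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)

# Least squares over the line model M = { a + b*x }.
A = np.column_stack([np.ones(n), X])
a_hat, b_hat = np.linalg.lstsq(A, Y, rcond=None)[0]
eps_hat = Y - (a_hat + b_hat * X)       # residuals

m_dir = 3.0 - 0.5 * X                   # any other function in the model, used as a direction

def sse(t):
    """Squared-error criterion along the direction m_dir, as a function of t."""
    return np.sum((Y - (a_hat + b_hat * X) - t * m_dir) ** 2)

# The criterion is quadratic in t, so the central difference is its exact derivative at t = 0
# up to floating point error.
h = 1e-3
finite_diff = (sse(h) - sse(-h)) / (2 * h)
print(finite_diff, -2 * eps_hat @ m_dir)   # both are essentially zero
```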
Claim. \[ \begin{aligned} \tilde\mu(w,x) &= \mu_E(w,x) && \qfor \mu_E(w,x) = \mathop{\mathrm{E}}[\hat\mu(W_i,X_i) \mid W_i=w, X_i=x] \\ &&& \qand \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \end{aligned} \]
Sketch.
Step 1.
\[ \begin{aligned} 0 &= \frac{1}{n}\sum_{i=1}^n \color{gray}\left\{ \color{black} Y_i -\hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M}\qqtext{ and therefore } \\ 0 &= \mathop{\mathrm{E}}\left[ \frac{1}{n}\sum_{i=1}^n \color{gray}\left\{ \color{black} Y_i -\hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \\ &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right]. \end{aligned} \] The last equality holds because the \(n\) terms in the average are identically distributed.
By the same argument applied in the population, the population least squares predictor \(\tilde\mu\) satisfies the analogous orthogonality condition in expectation over a draw from the population. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} Y_i - \tilde \mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \end{aligned} \]
Subtracting one of these from the other, \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color[RGB]{17,138,178}\left\{ \color{black} \color{gray}\left\{ \color{black} Y_i- \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} - \color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \color[RGB]{17,138,178} \right\} \color{black} \ m(W_i,X_i) \right] \\ &= \mathop{\mathrm{E}}\left[ \color[RGB]{17,138,178}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color[RGB]{17,138,178} \right\} \color{black} \ m(W_i,X_i) \right] \end{aligned} \]
Step 2. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \qqtext{ for all } m \in \mathcal{M}\implies \\ 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ \color{gray}\left\{ \color{black} \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \right] \end{aligned} \] Here we’ve taken \(m = \mu_E - \tilde\mu\), which is in our model: \(\tilde\mu \in \mathcal{M}\) by definition, \(\mu_E\) is in \(\mathcal{M}\) because it’s an average of predictors \(\hat\mu \in \mathcal{M}\), and a linear model contains differences of its members.
Step 3. \[ \begin{aligned} 0 &\overset{\texttip{\small{\unicode{x2753}}}{Steps 1-2}}{=} \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \color{gray}\left\{ \color{black} \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Law of Iterated Expectations}}{=} \mathop{\mathrm{E}}\left[ \color{blue}\mathop{\mathrm{E}}\left[ \color{black} \left\{ \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \color{blue} \mid W_i, X_i\right] \color{black} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Linearity of Conditional Expectations}}{=} \mathop{\mathrm{E}}\left[ \left\{ \color{blue}\mathop{\mathrm{E}}\left[ \color{black} \hat\mu(W_i,X_i) \color{blue} \mid W_i, X_i\right] \color{black} - \tilde\mu(W_i,X_i) \right\} \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Recognizing $\mu_E$'s definition}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \color{blue} \mu_E(W_i,X_i) \color{black} - \tilde\mu(W_i,X_i) }\qty{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) } ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Recognizing a square}}{=} \mathop{\mathrm{E}}\left[ \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\}^2 \right] \end{aligned} \] A nonnegative random variable with expectation zero is zero, so \(\mu_E(w,x)=\tilde\mu(w,x)\) at every pair \((w,x)\) that occurs in the population.
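Here is a rough Monte Carlo sketch of the claim (the population, the deliberately misspecified model, and all names are ours). We fit the same linear model on many samples drawn with replacement and compare the average of \(\hat\mu(w,x)\) across samples with the population least squares fit \(\tilde\mu(w,x)\); the two should come out close at every \((w,x)\).

```python
import numpy as np

rng = np.random.default_rng(3)

# A made-up population where y is nonlinear in x, so the model below is misspecified.
m = 2000
w = rng.integers(0, 2, size=m)
x = rng.integers(0, 4, size=m)
y = w + np.sin(x) + rng.normal(size=m)

def fit(w, x, y):
    """Least squares over the (misspecified) linear model m(w, x) = a + b*w + c*x."""
    A = np.column_stack([np.ones(len(y)), w, x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(coef, wv, xv):
    return coef[0] + coef[1] * wv + coef[2] * xv

coef_pop = fit(w, x, y)   # population least squares fit: mu_tilde

# Average the sample fit mu_hat over many samples drawn with replacement.
n, sims = 400, 2000
coef_hats = []
for _ in range(sims):
    J = rng.integers(0, m, size=n)
    coef_hats.append(fit(w[J], x[J], y[J]))
coef_mean = np.mean(coef_hats, axis=0)   # averaging coefficients = averaging predictions here

# The Monte Carlo average of mu_hat(w, x) should be close to mu_tilde(w, x) at each (w, x).
for wv in (0, 1):
    for xv in range(4):
        print(wv, xv, predict(coef_mean, wv, xv), predict(coef_pop, wv, xv))
```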
Let’s consider a more complex estimation target, e.g., the one we estimate using the adjusted difference in means \(\hat\Delta_{\text{all}}\).
This is how we’ve been estimating it. \[ \begin{aligned} \hat\theta &= \frac{1}{m}\sum_{j=1}^m \qty{ \hat\mu(1,x_j) - \hat\mu(0,x_j) } &&\qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \\ &= \sum_{w,x} \alpha(w,x) \hat\mu(w,x) &&\qfor \alpha(w,x) = \begin{cases} p_{x} & \text{ for } w=1 \\ -p_{x} & \text{ for } w=0. \end{cases} \end{aligned} \]
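As a sanity check (made-up data, our own names), the two expressions above agree: averaging the fitted differences \(\hat\mu(1,x_j)-\hat\mu(0,x_j)\) over the population covariates gives the same number as the weighted sum \(\sum_{w,x}\alpha(w,x)\hat\mu(w,x)\) with \(\alpha(w,x)=\pm p_x\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up population covariates (known for every unit) and a sample to fit on.
m = 1000
x_pop = rng.integers(0, 3, size=m)
p_x = {xv: np.mean(x_pop == xv) for xv in range(3)}   # population covariate proportions

n = 400
W = rng.integers(0, 2, size=n)
X = rng.integers(0, 3, size=n)
Y = W + 0.5 * X + rng.normal(size=n)

# mu_hat fit by least squares over the 'all functions' model: group means in the sample.
mu_hat = {(wv, xv): Y[(W == wv) & (X == xv)].mean()
          for wv in (0, 1) for xv in range(3)}

# Version 1: average the fitted differences over the population covariates x_j.
theta_hat_1 = np.mean([mu_hat[1, xj] - mu_hat[0, xj] for xj in x_pop])

# Version 2: the weighted sum with alpha(w, x) = +p_x for w = 1 and -p_x for w = 0.
theta_hat_2 = sum(p_x[xv] * mu_hat[1, xv] - p_x[xv] * mu_hat[0, xv] for xv in range(3))

print(theta_hat_1, theta_hat_2)   # these agree up to floating point error
```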
We can decompose our estimator’s error by comparing to \(\tilde\mu\) term-by-term. \[ \begin{aligned} \hat\theta-\theta &= \sum_{w,x} \alpha(w,x) \qty{ \hat\mu(w,x) - \mu(w,x) } \\ &= \underset{\textcolor[RGB]{128,128,128}{\text{sampling error}}}{\sum_{w,x} \alpha(w,x) \qty{ \hat\mu(w,x) - \tilde\mu(w,x) }} + \underset{\textcolor[RGB]{128,128,128}{\text{modeling error}}}{\sum_{w,x} \alpha(w,x) \qty{ \tilde\mu(w,x) - \mu(w,x) }} \\ \end{aligned} \]
This breaks our error down into two pieces.
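Here is that decomposition computed in a sketch (a made-up population and names of our own, with an additive model chosen so the modeling error isn’t zero). The sampling and modeling pieces add up exactly to \(\hat\theta-\theta\).

```python
import numpy as np

rng = np.random.default_rng(5)

# A made-up population; y is nonlinear in x, so the additive model below is misspecified.
m = 2000
w = rng.integers(0, 2, size=m)
x = rng.integers(0, 4, size=m)
y = w + np.sin(x) + rng.normal(size=m)

def fit(w, x, y):
    """Least squares over the additive model m(w, x) = a + b*w + c*x; returns the fitted function."""
    A = np.column_stack([np.ones(len(y)), w, x])
    a, b, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return lambda wv, xv: a + b * wv + c * xv

def mu(wv, xv):
    """Population group means mu(w, x)."""
    return y[(w == wv) & (x == xv)].mean()

# alpha(w, x) = +p_x for w = 1 and -p_x for w = 0, with p_x the population share of covariate value x.
p_x = {xv: np.mean(x == xv) for xv in range(4)}
def weighted_sum(f):
    return sum(p_x[xv] * (f(1, xv) - f(0, xv)) for xv in range(4))

mu_tilde = fit(w, x, y)                 # population least squares fit
n = 400
J = rng.integers(0, m, size=n)
mu_hat = fit(w[J], x[J], y[J])          # sample least squares fit

theta, theta_hat = weighted_sum(mu), weighted_sum(mu_hat)
sampling = weighted_sum(lambda wv, xv: mu_hat(wv, xv) - mu_tilde(wv, xv))
modeling = weighted_sum(lambda wv, xv: mu_tilde(wv, xv) - mu(wv, xv))
print(theta_hat - theta, sampling + modeling)   # the two pieces add up to the total error
```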
Here we see the sampling distributions of three estimators of \(\theta\) at three different sample sizes.
Which one couldn’t be the sampling distribution of an estimator \(\hat\theta\) like this?
\[ \hat\theta = \frac{1}{m}\sum_{j=1}^m \qty{ \hat\mu(1,x_j) - \hat\mu(0,x_j) } \qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \]
Blame us teachers. In a lot of classes, you do the math assuming you’re using a correctly specified model. Then you learn some techniques for choosing a model that isn’t obviously misspecified. The message people get is to put a little effort into choosing a model, then act as if it were correctly specified. That doesn’t work.