The Behavior of Least Squares Predictors
This is a revised version of Lecture 12 as covered in class. It’s a work in progress. Parts may be useful as supplemental reading, but it isn’t yet coherent or complete enough to replace the version from class.
We’re going to talk about statistical inference when we use least squares in linear models. For example … \[ \color{gray} \begin{aligned} \theta &= \frac{1}{m}\sum_{j=1}^m \qty{ \mu(1,x_j) - \mu(0,x_j) } \qfor \mu(w,x)=\frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \\ \hat\theta &= \frac{1}{n}\sum_{i=1}^n \qty{ \hat\mu(1,X_i) - \hat\mu(0,X_i) } \qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \end{aligned} \]
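To make this concrete, here’s a minimal numerical sketch of the target and the estimator above. It assumes a small made-up population, the ‘all functions’ model (so least squares gives subgroup means), and arbitrary variable names; it isn’t meant to match any dataset from class.

```python
# A minimal sketch of the target theta and estimator theta-hat above, assuming a
# made-up finite population, binary treatment w, a discrete covariate x, and the
# 'all functions' model, for which least squares gives subgroup means.
import numpy as np

rng = np.random.default_rng(0)

# An artificial population of m units.
m = 10_000
w_pop = rng.integers(0, 2, size=m)
x_pop = rng.integers(0, 3, size=m)
y_pop = 1.0 + 2.0 * w_pop + 0.5 * x_pop + rng.normal(size=m)

def group_means(w, x, y):
    """Least squares over the 'all functions' model: the mean of y in each (w, x) group."""
    return {(wv, xv): y[(w == wv) & (x == xv)].mean()
            for wv in (0, 1) for xv in (0, 1, 2)}

# The target: theta = (1/m) sum_j { mu(1, x_j) - mu(0, x_j) }.
mu = group_means(w_pop, x_pop, y_pop)
theta = np.mean([mu[1, xj] - mu[0, xj] for xj in x_pop])

# The estimator: the same formula, averaging over the sample and using mu-hat,
# the least squares fit to n draws taken uniformly at random from the population.
# (This sketch assumes every (w, x) group shows up in the sample.)
n = 500
i = rng.choice(m, size=n)
W, X, Y = w_pop[i], x_pop[i], y_pop[i]
mu_hat = group_means(W, X, Y)
theta_hat = np.mean([mu_hat[1, Xi] - mu_hat[0, Xi] for Xi in X])

print(theta, theta_hat)   # close, but not equal: that's the error we'll study
```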
We can think of our estimator’s error as a sum of two things.
\[ \hat\theta - \theta = \underset{\text{sampling error}}{\qty{\hat\theta - \textcolor[RGB]{239,71,111}{\tilde\theta}}} + \underset{\text{modeling error}}{\qty{\textcolor[RGB]{239,71,111}{\tilde\theta} - \theta}} \]
Our main tool for understanding least squares is the zero-derivative condition it satisfies at the minimum. Let’s start with the simplest case: predicting with a single number.
\[ \hat \mu = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} SSR(m) \qfor SSR(m) =\sum_{i=1}^n \qty{ Y_i - m }^2 \quad \text{ satisfies the \textbf{zero-derivative condition} } \]
\[ \begin{aligned} 0 &= \frac{d}{dm} \sum_{i=1}^n \qty{ Y_i - m }^2 \ \ \mid_{m=\hat\mu} &&= \sum_{i=1}^n \frac{d}{dm} \qty{ Y_i - m }^2 \ \ \mid_{m=\hat\mu} \\ &= \sum_{i=1}^n -2 \qty{ Y_i - m } \ \ \mid_{m=\hat\mu} &&= \sum_{i=1}^n -2 \qty{ Y_i - \hat\mu } \end{aligned} \]
\[ \begin{aligned} &0 = \sum_{i=1}^n -2\qty{ Y_i - \hat\mu } && \text{ when } \\ &2\sum_{i=1}^n Y_i = 2\sum_{i=1}^n \hat\mu = 2n\hat\mu && \text{ and therefore when } \\ &\frac{1}{n}\sum_{i=1}^n Y_i = \hat\mu \end{aligned} \]
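Here’s a quick numerical check of this result using made-up numbers: scanning over a grid of candidate constants, the sum of squared residuals is smallest at (the grid point next to) the sample mean.

```python
# A quick check that, among constant predictions m, SSR(m) is minimized at the
# sample mean. The data are made up.
import numpy as np

Y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])
grid = np.linspace(Y.min(), Y.max(), 1001)               # candidate constants m
ssr = ((Y[:, None] - grid[None, :]) ** 2).sum(axis=0)    # SSR(m) for each candidate
print(grid[ssr.argmin()], Y.mean())                      # both approximately 3.83
```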
Now let’s minimize over functions of a binary group indicator \(w\) rather than over constants. \[ \hat \mu = \mathop{\mathrm{argmin}}_{\text{functions}\ m(w)} SSR(m) \qfor SSR(m) =\sum_{i=1}^n \qty{ Y_i - m(W_i) }^2 \]
To generalize the zero-derivative condition to minimization over functions, we look at the sum of squared residuals along a path that starts at our least squares fit \(\hat\mu\) and moves toward any other function \(m\) in the model, \(m_t = \hat\mu + t\,(m - \hat\mu)\). \[ \begin{aligned} SSR(m_t) &= \sum_{i=1}^n \qty{ Y_i - m_t(W_i) }^2 \\ &= \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i) - t (m(W_i) - \hat\mu(W_i)) }^2 \end{aligned} \]
Because \(\hat\mu\) minimizes \(SSR\) over the whole model, \(t=0\) minimizes \(SSR(m_t)\) along this path, so the derivative there is zero. \[ \begin{aligned} 0 = \frac{d}{dt}\mid_{t=0} SSR(m_t) \end{aligned} \]
To keep the algebra simple, we’ll parameterize the path so that \[ m_t = \hat\mu + t m \qqtext{ is 'where we are' at time } t \color{lightgray} \qqtext{ instead of } m_t = \hat\mu + t (m - \hat\mu). \]
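Here’s a small numerical sketch of the path idea on made-up two-group data. Starting at the least squares fit \(\hat\mu\) and walking in the direction of any function \(m\) in the model, \(SSR(m_t)\) is smallest at \(t=0\). The data and the particular direction \(m\) are arbitrary choices for illustration.

```python
# A sketch of the path m_t = mu-hat + t*m on made-up two-group data: SSR(m_t)
# is minimized at t = 0 for any direction m(w) in the model.
import numpy as np

rng = np.random.default_rng(6)
W = rng.integers(0, 2, size=100)
Y = 1.0 + 2.0 * W + rng.normal(size=100)

mu_hat = np.where(W == 1, Y[W == 1].mean(), Y[W == 0].mean())   # fitted values of mu-hat
m      = np.where(W == 1, 1.0, -2.0)                            # some direction m(w) in the model

t   = np.linspace(-1, 1, 201)
ssr = np.array([np.sum((Y - (mu_hat + tv * m)) ** 2) for tv in t])
print(t[ssr.argmin()])                                          # approximately 0
```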
Working out this derivative term by term, \[ \begin{aligned} 0 = \frac{d}{dt}\mid_{t=0}SSR(m_t) &= \frac{d}{dt}\mid_{t=0} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i) - tm(W_i) }^2 \\ &= \sum_{i=1}^n 2 \times \qty{ Y_i - \hat\mu(W_i) } \times \frac{d}{dt}\mid_{t=0} \ \qty{ Y_i - \hat\mu(W_i) - tm(W_i) } \\ &= \sum_{i=1}^n -2 \times \qty{ Y_i - \hat\mu(W_i) } \times m(W_i) \\ \end{aligned} \]
Because the model contains every function of \(w\), we can plug any function we like into this condition. These three choices are enough. \[ \underset{\text{left}}{\textcolor{red}{m(w)}} = 1_{=0}(w) = \begin{cases} 1 & \text{ if } w=0 \\ 0 & \text{ if } w=1 \end{cases} \qquad \underset{\text{middle}}{\textcolor{magenta}{m(w)}} = 1_{=1}(w) = \begin{cases} 0 & \text{ if } w=0 \\ 1 & \text{ if } w=1 \end{cases} \qquad \underset{\text{right}}{\textcolor{orange}{m(w)}} = 1 \]
Plugging these in gives us the familiar group means. With \(m = 1_{=1}\), the condition reads \[ 0 = \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i) } \ 1_{=1}(W_i) = \sum_{i: W_i = 1} \qty{ Y_i - \hat\mu(1) }, \] so \(\hat\mu(1)\) is the mean of \(Y_i\) in the group with \(W_i=1\). Likewise, taking \(m = 1_{=0}\) tells us \(\hat\mu(0)\) is the mean in the group with \(W_i=0\). The constant function \(m=1\) gives nothing new: its condition is the sum of the other two.
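Here’s a sketch of these three conditions on made-up data: fitting the group means and checking that the residuals are orthogonal to \(1_{=0}\), \(1_{=1}\), and the constant \(1\).

```python
# Check the three orthogonality conditions on made-up data: with mu-hat(w) equal
# to the mean of Y in the group with W = w, the residuals sum to zero against
# 1_{=0}(w), 1_{=1}(w), and the constant 1.
import numpy as np

rng = np.random.default_rng(7)
W = rng.integers(0, 2, size=200)
Y = 2.0 + 3.0 * W + rng.normal(size=200)

mu_hat = {w: Y[W == w].mean() for w in (0, 1)}          # least squares over functions of w
resid = Y - np.array([mu_hat[w] for w in W])

for name, m in [("1_{=0}(w)", (W == 0).astype(float)),
                ("1_{=1}(w)", (W == 1).astype(float)),
                ("constant 1", np.ones_like(Y))]:
    print(name, np.sum(resid * m))                      # all zero, up to rounding error
```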
Nothing about this argument was specific to the two-group model. \[ \begin{aligned} \hat\mu &\qqtext{minimizes} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 &&\qqtext{ over } m \in \mathcal{M}\implies \\ \hat\mu &\qqtext{minimizes} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 &&\qqtext{ over } m \in \{ m_t = \hat\mu + t m \ : \ t \in \mathbb{R} \} \qqtext{ for any } m \in \mathcal{M} \end{aligned} \]
Differentiating at \(t=0\) as before, \[ \begin{aligned} 0 &= \frac{d}{dt}\mid_{t=0} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) - t m(W_i,X_i) }^2 \\ &= \sum_{i=1}^n 2 \times \qty{ Y_i - \hat\mu(W_i,X_i) } \frac{d}{dt}\mid_{t=0} \qty{ Y_i - \hat\mu(W_i,X_i) - t m(W_i,X_i) } \\ &= \sum_{i=1}^n 2 \times \qty{ Y_i - \hat\mu(W_i,X_i) } \times -m(W_i,X_i) \\ \end{aligned} \] In other words, the sample residuals \(Y_i - \hat\mu(W_i,X_i)\) are orthogonal to every function \(m\) in the model.
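Here’s this orthogonality condition checked numerically for a particular model, assuming for illustration that \(\mathcal{M}\) is the set of linear functions \(m(w,x)=a+bw+cx\). The residuals are orthogonal to the basis functions \(1\), \(w\), and \(x\), and therefore to every \(m \in \mathcal{M}\). Data and coefficients are made up.

```python
# Orthogonality of least squares residuals for the linear model m(w, x) = a + b*w + c*x:
# sum_i { Y_i - mu-hat(W_i, X_i) } m(W_i, X_i) = 0 for each basis function 1, w, x.
import numpy as np

rng = np.random.default_rng(1)
n = 300
W = rng.integers(0, 2, size=n)
X = rng.normal(size=n)
Y = 1.0 + 2.0 * W - 0.5 * X + rng.normal(size=n)

design = np.column_stack([np.ones(n), W, X])        # basis functions of the model
beta, *_ = np.linalg.lstsq(design, Y, rcond=None)   # least squares fit
resid = Y - design @ beta

print(design.T @ resid)    # one near-zero entry per basis function
```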
The same thing happens in the population. The population least squares predictor is \[ \begin{aligned} \tilde\mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \\ &\overset{\texttip{\small{\unicode{x2753}}}{When we sample uniformly-at-random from the population, this expectation is $1/m$ times the sum above.}}{=} \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \mathop{\mathrm{E}}\qty[ \qty{ Y_i - m(W_i,X_i) }^2 ] \end{aligned} \]
And here, informally, is the claim we’re after: on average, the sample least squares predictor is the population least squares predictor. \[ \mathop{\mathrm{E}}[\hat\mu(w,x)] = \tilde\mu(w,x) \]
To prove it, we’ll think about the population least squares residuals \(\tilde\varepsilon_j = y_j - \tilde\mu(w_j,x_j)\).
These satisfy an orthogonality property analogous to the one we saw for sample residuals. \[ \begin{aligned} 0 &= \textcolor[RGB]{192,192,192}{\frac{1}{n}}\sum_{i=1}^n \overset{\textcolor{lightgray}{\text{sample residuals}}}{\color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black}} \ m(W_i,X_i) && \qqtext{ for all } m \in \mathcal{M}\\ 0 &= \textcolor[RGB]{192,192,192}{\frac{1}{m}}\sum_{j=1}^m \overset{\textcolor{lightgray}{\text{population residuals}}}{\color{gray}\left\{ \color{black} y_j - \tilde\mu(w_j,x_j) \color{gray} \right\} \color{black}} \ m(w_j,x_j) && \qqtext{ for all } m \in \mathcal{M}\\ &= \mathop{\mathrm{E}}\qty[ \qty{ Y_i - \tilde\mu(W_i,X_i) } \ m(W_i,X_i) ] \end{aligned} \]
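Here’s the population analogue checked numerically, again assuming for illustration a made-up finite population and the linear model \(m(w,x)=a+bw+cx\). The population is deliberately nonlinear in \(x\), so the model is misspecified, but the population residuals still average to zero against every basis function.

```python
# Population least squares and the population orthogonality property, for the
# linear model m(w, x) = a + b*w + c*x fit to a made-up (deliberately nonlinear)
# population. Dividing by the population size gives the expectation form above.
import numpy as np

rng = np.random.default_rng(2)
m_pop = 5_000
w = rng.integers(0, 2, size=m_pop)
x = rng.normal(size=m_pop)
y = np.sin(3 * x) + 2.0 * w + rng.normal(size=m_pop)   # not linear in x: misspecified

design = np.column_stack([np.ones(m_pop), w, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)      # population least squares mu-tilde
pop_resid = y - design @ coef

print(design.T @ pop_resid / m_pop)                    # near zero for each basis function
```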
To prove ‘unbiasedness’, we’ll combine these two orthogonality properties in a clever way.
Claim. \[ \begin{aligned} \tilde\mu(w,x) &= \mu_E(w,x) && \qfor \mu_E(w,x) = \mathop{\mathrm{E}}[\hat\mu(W_i,X_i) \mid W_i=w, X_i=x] \\ &&& \qand \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \end{aligned} \]
Proof Sketch. 1. We’ll show that orthogonality of residuals implies a kind of orthogonality of the difference \(\hat\mu - \tilde\mu\). \[ 0=\mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \qqtext{ for all } m \in \mathcal{M} \]
Here’s the first ingredient. The sample orthogonality condition holds for every sample, so it also holds in expectation; and because our observations are identically distributed, the expectation of the average equals the expectation of a single term. \[ \begin{aligned} 0 &= \frac{1}{n}\sum_{i=1}^n \color{gray}\left\{ \color{black} Y_i -\hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \qqtext{ and therefore } \\ 0 &= \mathop{\mathrm{E}}\left[ \frac{1}{n}\sum_{i=1}^n \color{gray}\left\{ \color{black} Y_i -\hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \\ &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right]. \end{aligned} \]
The second ingredient is the population orthogonality condition, which we’ve already written in its expectation form. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} Y_i - \tilde \mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \\ \end{aligned} \]
Subtracting one from the other, the \(Y_i\) terms cancel and we get the orthogonality property we wanted. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color[RGB]{17,138,178}\left\{ \color{black} \color{gray}\left\{ \color{black} Y_i- \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} - \color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \color[RGB]{17,138,178} \right\} \color{black} \ m(W_i,X_i) \right] \\ &= \mathop{\mathrm{E}}\left[ \color[RGB]{17,138,178}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color[RGB]{17,138,178} \right\} \color{black} \ m(W_i,X_i) \right] \end{aligned} \]
2. Because \(\mu_E\) and \(\tilde\mu\) are both in the model, so is \(m=\mu_E - \tilde\mu\). And therefore … \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \qqtext{ for all } m \in \mathcal{M}\implies \\ 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ \color{gray}\left\{ \color{black} \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \right] \qfor \mu_E(w,x) = \mathop{\mathrm{E}}[\hat\mu(W_i,X_i) \mid W_i=w, X_i=x] \end{aligned} \]
We’ll use the law of iterated expectations to simplify this expectation.
\[ \begin{aligned} 0 &\overset{\texttip{\small{\unicode{x2753}}}{Step 1 + Substitution}}{=} \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \color{gray}\left\{ \color{black} \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Law of Iterated Expectations}}{=} \mathop{\mathrm{E}}\left[ \color{blue}\mathop{\mathrm{E}}\left[ \color{black} \left\{ \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \color{blue} \mid W_i, X_i\right] \color{black} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Linearity of Conditional Expectations}}{=} \mathop{\mathrm{E}}\left[ \left\{ \color{blue}\mathop{\mathrm{E}}\left[ \color{black} \hat\mu(W_i,X_i) \color{blue} \mid W_i, X_i\right] \color{black} - \tilde\mu(W_i,X_i) \right\} \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Recognizing $\mu_E$'s definition}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \color{blue} \mu_E(W_i,X_i) \color{black} - \tilde\mu(W_i,X_i) }\qty{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) } ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Recognizing a square}}{=} \mathop{\mathrm{E}}\left[ \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\}^2 \right] \end{aligned} \]
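Here’s a simulation sketch of the claim, using a made-up population with a discrete covariate and a deliberately misspecified linear model \(m(w,x)=a+bw+cx\). Averaged over many samples, \(\hat\mu(w,x)\) lands on the population least squares value \(\tilde\mu(w,x)\), not on the subgroup mean \(\mu(w,x)\); agreement is up to simulation noise and a small finite-sample discrepancy.

```python
# Simulation sketch: the average of mu-hat(w, x) over repeated samples is
# (approximately) the population least squares predictor mu-tilde(w, x), even
# though the model is misspecified and mu-tilde(w, x) differs from mu(w, x).
import numpy as np

rng = np.random.default_rng(3)

# Made-up population: binary w, x in {0, 1, 2}, and a nonlinear effect of x, so
# the linear model m(w, x) = a + b*w + c*x is misspecified.
m_pop = 20_000
w = rng.integers(0, 2, size=m_pop)
x = rng.integers(0, 3, size=m_pop)
y = 1.0 + 2.0 * w + 3.0 * (x == 2) + rng.normal(size=m_pop)

def lstsq_mu(w, x, y):
    """Fit the linear model by least squares; return the fitted function."""
    design = np.column_stack([np.ones(len(y)), w, x])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda wv, xv: b[0] + b[1] * wv + b[2] * xv

mu_tilde = lstsq_mu(w, x, y)                            # population least squares
mu = lambda wv, xv: y[(w == wv) & (x == xv)].mean()     # population subgroup means

# Average mu-hat(1, 2) over many samples of size n drawn uniformly at random.
n, reps = 200, 2000
draws = [lstsq_mu(w[i], x[i], y[i])(1, 2)
         for i in (rng.choice(m_pop, size=n) for _ in range(reps))]

print(np.mean(draws), mu_tilde(1, 2), mu(1, 2))   # the first two agree; the third differs
```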
What does this mean for our estimates? Recall that our target is defined in terms of the population subgroup means \(\mu(w,x)\). \[ \theta = \frac{1}{m}\sum_{j=1}^m \qty{ \mu(1,x_j) - \mu(0,x_j) } \qfor \mu(w,x)=\frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \]
We can decompose our predictor’s error at each \((w,x)\) by comparing to \(\tilde\mu\). \[ \hat\mu(w,x) - \mu(w,x) = \qty{\hat\mu(w,x) - \textcolor[RGB]{239,71,111}{\tilde\mu(w,x)}} + \qty{\textcolor[RGB]{239,71,111}{\tilde\mu(w,x)} - \mu(w,x)} \]
Q. If the distribution of \(\hat\mu\) is centered on the population least squares predictor \(\tilde\mu\), why is \(\hat\mu(w,x)\) unbiased for \(\mu(w,x)\) when we use the ‘all functions’ model?
Let’s consider a more complex estimation target: the adjusted difference in means, which we estimate using \(\hat\Delta_{\text{all}}\).
This is how we’ve been estimating it. \[ \begin{aligned} \hat\theta &= \frac{1}{m}\sum_{j=1}^m \qty{ \hat\mu(1,x_j) - \hat\mu(0,x_j) } &&\qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \\ &= \sum_{w,x} \alpha(w,x) \hat\mu(w,x) &&\qfor \alpha(w,x) = \begin{cases} p_{x} & \text{ for } w=1 \\ -p_{x} & \text{ for } w=0. \end{cases} \end{aligned} \] Here \(p_x\) is the proportion of the population with covariate value \(x\).
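Here’s a small check of this rewriting, assuming (as the formula suggests) that \(p_x\) is the proportion of the population with covariate value \(x\). The identity holds for any fitted \(\hat\mu\), so the sketch uses arbitrary stand-in values for \(\hat\mu(w,x)\).

```python
# Check that (1/m) sum_j { mu-hat(1, x_j) - mu-hat(0, x_j) } equals the weighted
# sum over (w, x) with weights alpha(w, x) = +/- p_x, where p_x is the fraction
# of the population with covariate value x. mu-hat here is arbitrary stand-in values.
import numpy as np

rng = np.random.default_rng(4)
x_pop = rng.integers(0, 3, size=10_000)                                  # population covariates
mu_hat = {(wv, xv): rng.normal() for wv in (0, 1) for xv in (0, 1, 2)}   # any fitted values

lhs = np.mean([mu_hat[1, xj] - mu_hat[0, xj] for xj in x_pop])
p = {xv: np.mean(x_pop == xv) for xv in (0, 1, 2)}
rhs = sum(p[xv] * (mu_hat[1, xv] - mu_hat[0, xv]) for xv in (0, 1, 2))
print(lhs, rhs)                                                          # equal up to rounding
```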
We can decompose our estimator’s error by comparing to \(\tilde\mu\) term-by-term. \[ \begin{aligned} \hat\theta-\theta &= \sum_{w,x} \alpha(w,x) \qty{ \hat\mu(w,x) - \mu(w,x) } \\ &= \underset{\textcolor[RGB]{128,128,128}{\text{sampling error}}}{\sum_{w,x} \alpha(w,x) \qty{ \hat\mu(w,x) - \tilde\mu(w,x) }} + \underset{\textcolor[RGB]{128,128,128}{\text{modeling error}}}{\sum_{w,x} \alpha(w,x) \qty{ \tilde\mu(w,x) - \mu(w,x) }} \\ \end{aligned} \]
This breaks our error down into two pieces: a sampling error term that varies from sample to sample, and a modeling error term that doesn’t, since \(\tilde\mu\) and \(\mu\) are population quantities.
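Here’s a sketch of this decomposition on a made-up population, with a deliberately misspecified linear model so that both pieces are nonzero. The two pieces add up to the error exactly, and the modeling-error piece depends only on the population and the model, not on which sample we happen to draw.

```python
# The error of theta-hat splits exactly into a sampling-error piece (mu-hat vs
# mu-tilde) and a modeling-error piece (mu-tilde vs mu), both weighted by
# alpha(w, x) = +/- p_x. Population, model, and sample size are made up.
import numpy as np

rng = np.random.default_rng(5)

# Made-up population: binary w, x in {0, 1, 2}, nonlinear effect of x, so the
# linear model m(w, x) = a + b*w + c*x is misspecified.
m_pop = 20_000
w = rng.integers(0, 2, size=m_pop)
x = rng.integers(0, 3, size=m_pop)
y = 1.0 + 2.0 * w + 3.0 * (x == 2) + rng.normal(size=m_pop)

def lstsq_mu(w, x, y):
    design = np.column_stack([np.ones(len(y)), w, x])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda wv, xv: b[0] + b[1] * wv + b[2] * xv

mu = lambda wv, xv: y[(w == wv) & (x == xv)].mean()    # population subgroup means
mu_tilde = lstsq_mu(w, x, y)                           # population least squares
i = rng.choice(m_pop, size=500)                        # a random sample
mu_hat = lstsq_mu(w[i], x[i], y[i])                    # sample least squares

levels = [(wv, xv) for wv in (0, 1) for xv in (0, 1, 2)]
alpha = {(wv, xv): (1 if wv == 1 else -1) * np.mean(x == xv) for wv, xv in levels}

theta = sum(alpha[l] * mu(*l) for l in levels)
theta_hat = sum(alpha[l] * mu_hat(*l) for l in levels)
sampling_error = sum(alpha[l] * (mu_hat(*l) - mu_tilde(*l)) for l in levels)
modeling_error = sum(alpha[l] * (mu_tilde(*l) - mu(*l)) for l in levels)

print(theta_hat - theta, sampling_error + modeling_error)   # identical, up to rounding
```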
A note on \(\mu_E\). Written out in full, it’s a weighted average of the least squares fit over the possible samples in which observation \(i\) falls in the group \((w,x)\). \[ \begin{aligned} \mu_E(w,x) &= \sum_{\substack{(w_1,x_1,y_1) \ldots (w_n,x_n,y_n) \\ w_i=w, x_i=x }} \hat\mu(w,x) \times P\qty{ (W_1,X_1,Y_1) \ldots (W_n,X_n,Y_n) = (w_1,x_1,y_1) \ldots (w_n,x_n,y_n) \mid W_i=w, X_i=x} \\ &\qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ y_i - m(w_i,x_i) }^2 \end{aligned} \]
Blame us teachers. In a lot of classes, you do the math assuming you’re using a correctly specified model. Then you learn some techniques for choosing a model that isn’t obviously misspecified. The message people get is to put a little effort into choosing a model, then act as if it were correctly specified. That doesn’t work.
This isn’t just my own metaphorical language. These are really called paths in statistical theory.
There’s no need to restrict \(t\) to \([0,1]\). For \(t>1\), we keep going past the medians. For \(t<0\), we’re walking in the opposite direction. The whole path is shown as an orange line on the contour plot.
We are talking about minimizing the sum and average of squared errors interchangeably. Why does that difference not matter? Hint. Why is the value of \(x\) that minimizes \(f(x)=x^2\) the same as the value of \(x\) that minimizes \(10x^2\) and \(x^2/10\)?
That implies \(\mu_E(W_i,X_i) - \tilde\mu(W_i, X_i)=0\). That’s how squares work. If something isn’t zero, its mean-square is positive.
See Slide 6.1