The Behavior of Least Squares Predictors
We’re going to talk about statistical inference when we use least squares in linear models. For example … \[ \color{gray} \begin{aligned} \theta &= \frac{1}{m}\sum_{j=1}^m \qty{ \mu(1,x_j) - \mu(0,x_j) } \qfor \mu(w,x)=\frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \\ \hat\theta &= \frac{1}{n}\sum_{i=1}^n \qty{ \hat\mu(1,X_i) - \hat\mu(0,X_i) } \qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \end{aligned} \]
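To make this concrete, here is a minimal simulation sketch in Python. The population, sample size, and all names (e.g. `group_means`) are ours, chosen just for illustration: we draw a sample with replacement from a made-up population, fit the ‘all functions model’ by least squares — which amounts to taking group means — and compare \(\hat\theta\) to \(\theta\).

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up population of m units with binary treatment w and binary covariate x.
m = 1000
w = rng.integers(0, 2, size=m)
x = rng.integers(0, 2, size=m)
y = 1.0 * w + 0.5 * x + rng.normal(size=m)

def group_means(w, x, y):
    """Least squares over the 'all functions' model: the mean of y in each (w, x) group."""
    return {(wv, xv): y[(w == wv) & (x == xv)].mean()
            for wv in (0, 1) for xv in (0, 1)}

# The target: theta = (1/m) sum_j { mu(1, x_j) - mu(0, x_j) } with mu the population group means.
mu = group_means(w, x, y)
theta = np.mean([mu[1, xj] - mu[0, xj] for xj in x])

# The estimator: draw a sample with replacement, fit mu_hat by least squares, and plug it in.
n = 200
J = rng.integers(0, m, size=n)
W, X, Y = w[J], x[J], y[J]
mu_hat = group_means(W, X, Y)
theta_hat = np.mean([mu_hat[1, Xi] - mu_hat[0, Xi] for Xi in X])

print(theta, theta_hat)   # theta_hat is close to theta, but not equal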
We can think of our estimator’s error as a sum of two things.
\[ \hat\theta - \theta = \underset{\text{sampling error}}{\qty{\hat\theta - \textcolor[RGB]{239,71,111}{\tilde\theta}}} + \underset{\text{modeling error}}{\qty{\textcolor[RGB]{239,71,111}{\tilde\theta} - \theta}} \]
\[ \theta = \frac{1}{m}\sum_{j=1}^m \qty{ \mu(1,x_j) - \mu(0,x_j) } \qfor \mu(w,x)=\frac{1}{m_{w,x}}\sum_{j:w_j=w,x_j=x} y_j \]
\[ \hat\mu(w,x) - \mu(w,x) = \qty{\hat\mu(w,x) - \textcolor[RGB]{239,71,111}{\tilde\mu(w,x)}} + \qty{\textcolor[RGB]{239,71,111}{\tilde\mu(w,x)} - \mu(w,x)} \]
Q. If the distribution of \(\hat\mu\) is centered on the population least squares predictor \(\tilde\mu\),
why is \(\hat\mu(w,x)\) unbiased when we use the ‘all functions model’?
\[ \textcolor{orange}{\hat\varepsilon_i} = Y_i - \hat\mu(W_i,X_i) \qquad \text{ are the \textbf{residuals}} \]
The residuals are orthogonal to the predictions of every function in our model. \[ \begin{aligned} 0 &= \sum_{i=1}^n \hat\varepsilon_i m(W_i, X_i) \qqtext{ for all } m \in \mathcal{M}\\ &= \underset{\text{vector of residuals}}{\vec{\hat\varepsilon}} \cdot \underset{\text{vector of predictions}}{\vec{m(W,X)}} \qfor \vec{m(W,X)} = [ m(W_1,X_1), \ldots, m(W_n,X_n) ] \end{aligned} \]
For the constant model, that’s the same as the residuals summing to zero.
For the constant-within-groups model, it’s the same as the residuals summing to zero within each group.
\[ \begin{aligned} 0 &= \sum_{i=1}^n \hat\varepsilon_i m && \qqtext{ for all constants } m \text{ if and only if } \\ 0 &= \sum_{i=1}^n \hat\varepsilon_i. \end{aligned} \]
\[ \begin{aligned} 0 &= \sum_{i=1}^n \hat\varepsilon_i m(W_i) && \qqtext{ for all functions } m(w) \text{ if and only if } \\ 0 &= \textcolor[RGB]{17,138,178}{\sum_{i:W_i=w} \hat\varepsilon_i} && \qqtext{ for every group } w, \text{ since } \\ \sum_{i=1}^n \hat\varepsilon_i m(W_i) &= \sum_{w}\sum_{i:W_i=w} \hat\varepsilon_i m(w) = \sum_{w} m(w) \textcolor[RGB]{17,138,178}{\sum_{i:W_i=w} \hat\varepsilon_i} && \text{ is a linear combination of the \textcolor[RGB]{17,138,178}{within-group sums}}. \end{aligned} \] Taking \(m\) to be the indicator of a single group shows that each within-group sum must itself be zero.
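As a quick numerical check (a sketch with made-up data; the variable names are ours), fitting the constant-within-groups model by least squares and summing the residuals within each group of \(w\) gives zeros, up to floating point error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data with a binary group variable W.
n = 500
W = rng.integers(0, 2, size=n)
Y = 2.0 * W + rng.normal(size=n)

# Least squares over the constant-within-groups model is the mean of Y in each group of W.
mu_hat = {wv: Y[W == wv].mean() for wv in (0, 1)}
residuals = Y - np.array([mu_hat[wi] for wi in W])

# The residuals sum to zero within each group, up to floating point error,
# which makes them orthogonal to every function m(W).
for wv in (0, 1):
    print(wv, residuals[W == wv].sum())
```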
To prove this is true in general, we’ll think about the optimality of \(\hat\mu\) one direction at a time. \[ \begin{aligned} \hat\mu &\qqtext{minimizes} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 &&\qqtext{ over } m \in \mathcal{M}\implies \\ \hat\mu &\qqtext{minimizes} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 &&\qqtext{ over } m \in \mathcal{M}_1 \subseteq \mathcal{M}\\ & \text{and in particular} && \qqtext{ over } m \in \mathcal{M}_1 = \qty{ \hat\mu + t m \ : \ t \in \mathbb{R} } \qqtext{ for any } m \in \mathcal{M} \end{aligned} \]
Writing the squared-error criterion as a function of \(t\) in this one-parameter model, \(t=0\) must be its minimizer, so the derivative there is zero …
\[ \begin{aligned} 0 &= \frac{d}{dt}\Big|_{t=0} \sum_{i=1}^n \qty{ Y_i - \hat\mu(W_i,X_i) - t\, m(W_i,X_i) }^2 \\ &= \sum_{i=1}^n 2 \qty{ Y_i - \hat\mu(W_i,X_i) } \times \frac{d}{dt}\Big|_{t=0} \qty{ Y_i - \hat\mu(W_i,X_i) - t\, m(W_i,X_i) } \\ &= \sum_{i=1}^n 2 \qty{ Y_i - \hat\mu(W_i,X_i) } \times \qty{ -m(W_i,X_i) } \\ &= -2 \sum_{i=1}^n \hat\varepsilon_i\, m(W_i,X_i) = -2\, \vec{\hat\varepsilon} \cdot \vec{m(W,X)}. \end{aligned} \]
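We can check this derivative calculation numerically. This is a sketch with a made-up line model and data of our own choosing: the central finite-difference derivative of the squared-error criterion at \(t=0\), in any model direction \(m\), agrees with \(-2\,\vec{\hat\varepsilon}\cdot\vec{m}\), and both are essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=n)
Y = 1.0 + 2.0 * X + rng.normal(size=n)

# Least squares over the line model M = { a + b*x }.
A = np.column_stack([np.ones(n), X])
a_hat, b_hat = np.linalg.lstsq(A, Y, rcond=None)[0]
eps_hat = Y - (a_hat + b_hat * X)       # residuals

m_dir = 3.0 - 0.5 * X                   # any other function in the model, used as a direction

def sse(t):
    """Squared-error criterion along the direction m_dir, as a function of t."""
    return np.sum((Y - (a_hat + b_hat * X) - t * m_dir) ** 2)

# The criterion is quadratic in t, so the central difference is its exact derivative at t = 0
# up to floating point error.
h = 1e-3
finite_diff = (sse(h) - sse(-h)) / (2 * h)
print(finite_diff, -2 * eps_hat @ m_dir)   # both are essentially zero
```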
Claim. \[ \begin{aligned} \tilde\mu(w,x) &= \mu_E(w,x) && \qfor \mu_E(w,x) = \mathop{\mathrm{E}}[\hat\mu(W_i,X_i) \mid W_i=w, X_i=x] \\ &&& \qand \tilde\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{j=1}^m \qty{ y_j - m(w_j,x_j) }^2 \end{aligned} \]
Sketch.
Step 1.
\[ \begin{aligned} 0 &= \frac{1}{n}\sum_{i=1}^n \color{gray}\left\{ \color{black} Y_i -\hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \qqtext{ for all } m \in \mathcal{M}\qqtext{ and therefore } \\ 0 &= \mathop{\mathrm{E}}\left[ \frac{1}{n}\sum_{i=1}^n \color{gray}\left\{ \color{black} Y_i -\hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \\ &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right]. \end{aligned} \] The last equality holds because the \(n\) terms in the average are identically distributed.
By the same argument applied in the population, the population least squares predictor \(\tilde\mu\) satisfies the analogous orthogonality condition in expectation over a draw from the population. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} Y_i - \tilde \mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \end{aligned} \]
Subtracting one of these from the other, \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color[RGB]{17,138,178}\left\{ \color{black} \color{gray}\left\{ \color{black} Y_i- \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} - \color{gray}\left\{ \color{black} Y_i - \hat\mu(W_i,X_i) \color{gray} \right\} \color{black} \color[RGB]{17,138,178} \right\} \color{black} \ m(W_i,X_i) \right] \\ &= \mathop{\mathrm{E}}\left[ \color[RGB]{17,138,178}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color[RGB]{17,138,178} \right\} \color{black} \ m(W_i,X_i) \right] \end{aligned} \]
Step 2. \[ \begin{aligned} 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ m(W_i,X_i) \right] \qqtext{ for all } m \in \mathcal{M}\implies \\ 0 &= \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \ \color{gray}\left\{ \color{black} \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \right] \end{aligned} \] Here we’ve taken \(m = \mu_E - \tilde\mu\), which is in our model: \(\tilde\mu \in \mathcal{M}\) by definition, \(\mu_E\) is in \(\mathcal{M}\) because it’s an average of predictors \(\hat\mu \in \mathcal{M}\), and a linear model contains differences of its members.
Step 3. \[ \begin{aligned} 0 &\overset{\texttip{\small{\unicode{x2753}}}{Steps 1-2}}{=} \mathop{\mathrm{E}}\left[ \color{gray}\left\{ \color{black} \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \color{gray}\left\{ \color{black} \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \color{gray} \right\} \color{black} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Law of Iterated Expectations}}{=} \mathop{\mathrm{E}}\left[ \color{blue}\mathop{\mathrm{E}}\left[ \color{black} \left\{ \hat\mu(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \color{blue} \mid W_i, X_i\right] \color{black} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Linearity of Conditional Expectations}}{=} \mathop{\mathrm{E}}\left[ \left\{ \color{blue}\mathop{\mathrm{E}}\left[ \color{black} \hat\mu(W_i,X_i) \color{blue} \mid W_i, X_i\right] \color{black} - \tilde\mu(W_i,X_i) \right\} \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\} \right] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Recognizing $\mu_E$'s definition}}{=} \mathop{\mathrm{E}}\qty[ \qty{ \color{blue} \mu_E(W_i,X_i) \color{black} - \tilde\mu(W_i,X_i) }\qty{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) } ] \\ &\overset{\texttip{\small{\unicode{x2753}}}{Recognizing a square}}{=} \mathop{\mathrm{E}}\left[ \left\{ \mu_E(W_i,X_i) - \tilde\mu(W_i,X_i) \right\}^2 \right] \end{aligned} \] A nonnegative random variable with expectation zero is zero, so \(\mu_E(w,x)=\tilde\mu(w,x)\) at every pair \((w,x)\) that occurs in the population.
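Here is a rough Monte Carlo sketch of the claim (the population, the deliberately misspecified model, and all names are ours). We fit the same linear model on many samples drawn with replacement and compare the average of \(\hat\mu(w,x)\) across samples with the population least squares fit \(\tilde\mu(w,x)\); the two should come out close at every \((w,x)\).

```python
import numpy as np

rng = np.random.default_rng(3)

# A made-up population where y is nonlinear in x, so the model below is misspecified.
m = 2000
w = rng.integers(0, 2, size=m)
x = rng.integers(0, 4, size=m)
y = w + np.sin(x) + rng.normal(size=m)

def fit(w, x, y):
    """Least squares over the (misspecified) linear model m(w, x) = a + b*w + c*x."""
    A = np.column_stack([np.ones(len(y)), w, x])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(coef, wv, xv):
    return coef[0] + coef[1] * wv + coef[2] * xv

coef_pop = fit(w, x, y)   # population least squares fit: mu_tilde

# Average the sample fit mu_hat over many samples drawn with replacement.
n, sims = 400, 2000
coef_hats = []
for _ in range(sims):
    J = rng.integers(0, m, size=n)
    coef_hats.append(fit(w[J], x[J], y[J]))
coef_mean = np.mean(coef_hats, axis=0)   # averaging coefficients = averaging predictions here

# The Monte Carlo average of mu_hat(w, x) should be close to mu_tilde(w, x) at each (w, x).
for wv in (0, 1):
    for xv in range(4):
        print(wv, xv, predict(coef_mean, wv, xv), predict(coef_pop, wv, xv))
```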
Let’s consider a more complex estimation target, e.g., the one we estimate using the adjusted difference in means \(\hat\Delta_{\text{all}}\).
This is how we’ve been estimating it. \[ \begin{aligned} \hat\theta &= \frac{1}{m}\sum_{j=1}^m \qty{ \hat\mu(1,x_j) - \hat\mu(0,x_j) } &&\qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \\ &= \sum_{w,x} \alpha(w,x) \hat\mu(w,x) &&\qfor \alpha(w,x) = \begin{cases} p_{x} & \text{ for } w=1 \\ -p_{x} & \text{ for } w=0. \end{cases} \end{aligned} \]
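As a sanity check (made-up data, our own names), the two expressions above agree: averaging the fitted differences \(\hat\mu(1,x_j)-\hat\mu(0,x_j)\) over the population covariates gives the same number as the weighted sum \(\sum_{w,x}\alpha(w,x)\hat\mu(w,x)\) with \(\alpha(w,x)=\pm p_x\).

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up population covariates (known for every unit) and a sample to fit on.
m = 1000
x_pop = rng.integers(0, 3, size=m)
p_x = {xv: np.mean(x_pop == xv) for xv in range(3)}   # population covariate proportions

n = 400
W = rng.integers(0, 2, size=n)
X = rng.integers(0, 3, size=n)
Y = W + 0.5 * X + rng.normal(size=n)

# mu_hat fit by least squares over the 'all functions' model: group means in the sample.
mu_hat = {(wv, xv): Y[(W == wv) & (X == xv)].mean()
          for wv in (0, 1) for xv in range(3)}

# Version 1: average the fitted differences over the population covariates x_j.
theta_hat_1 = np.mean([mu_hat[1, xj] - mu_hat[0, xj] for xj in x_pop])

# Version 2: the weighted sum with alpha(w, x) = +p_x for w = 1 and -p_x for w = 0.
theta_hat_2 = sum(p_x[xv] * mu_hat[1, xv] - p_x[xv] * mu_hat[0, xv] for xv in range(3))

print(theta_hat_1, theta_hat_2)   # these agree up to floating point error
```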
We can decompose our estimator’s error by comparing to \(\tilde\mu\) term-by-term. \[ \begin{aligned} \hat\theta-\theta &= \sum_{w,x} \alpha(w,x) \qty{ \hat\mu(w,x) - \mu(w,x) } \\ &= \underset{\textcolor[RGB]{128,128,128}{\text{sampling error}}}{\sum_{w,x} \alpha(w,x) \qty{ \hat\mu(w,x) - \tilde\mu(w,x) }} + \underset{\textcolor[RGB]{128,128,128}{\text{modeling error}}}{\sum_{w,x} \alpha(w,x) \qty{ \tilde\mu(w,x) - \mu(w,x) }} \\ \end{aligned} \]
This breaks our error down into two pieces.
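Here is that decomposition computed in a sketch (a made-up population and names of our own, with an additive model chosen so the modeling error isn’t zero). The sampling and modeling pieces add up exactly to \(\hat\theta-\theta\).

```python
import numpy as np

rng = np.random.default_rng(5)

# A made-up population; y is nonlinear in x, so the additive model below is misspecified.
m = 2000
w = rng.integers(0, 2, size=m)
x = rng.integers(0, 4, size=m)
y = w + np.sin(x) + rng.normal(size=m)

def fit(w, x, y):
    """Least squares over the additive model m(w, x) = a + b*w + c*x; returns the fitted function."""
    A = np.column_stack([np.ones(len(y)), w, x])
    a, b, c = np.linalg.lstsq(A, y, rcond=None)[0]
    return lambda wv, xv: a + b * wv + c * xv

def mu(wv, xv):
    """Population group means mu(w, x)."""
    return y[(w == wv) & (x == xv)].mean()

# alpha(w, x) = +p_x for w = 1 and -p_x for w = 0, with p_x the population share of covariate value x.
p_x = {xv: np.mean(x == xv) for xv in range(4)}
def weighted_sum(f):
    return sum(p_x[xv] * (f(1, xv) - f(0, xv)) for xv in range(4))

mu_tilde = fit(w, x, y)                 # population least squares fit
n = 400
J = rng.integers(0, m, size=n)
mu_hat = fit(w[J], x[J], y[J])          # sample least squares fit

theta, theta_hat = weighted_sum(mu), weighted_sum(mu_hat)
sampling = weighted_sum(lambda wv, xv: mu_hat(wv, xv) - mu_tilde(wv, xv))
modeling = weighted_sum(lambda wv, xv: mu_tilde(wv, xv) - mu(wv, xv))
print(theta_hat - theta, sampling + modeling)   # the two pieces add up to the total error
```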
Here we see the sampling distributions of three estimators of \(\theta\) at three different sample sizes.
Which one couldn’t be the sampling distribution of an estimator \(\hat\theta\) like this?
\[ \hat\theta = \frac{1}{m}\sum_{j=1}^m \qty{ \hat\mu(1,x_j) - \hat\mu(0,x_j) } \qfor \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \]
Blame us teachers. In a lot of classes, you do the math assuming you’re using a correctly specified model. Then you learn some techniques for choosing a model that isn’t obviously misspecified. The message people get is to put a little effort into choosing a model, then act as if it were correctly specified. That doesn’t work.