Practice Final

QTM 285-1

Premise

Whether you’re in academia or industry, you’ll run into people who are analyzing data badly. Often, what they’re doing is both the established common practice and obviously wrong. In this practice exam, I’m going to run you through some examples in which this is happening. I’m going to ask you how to do better; explain what’s going wrong; and, in some cases, deal with what happens when you’ve convinced someone to do better but they haven’t entirely understood you. Obviously, in real life not everything that happens can be explained using the concepts we’ve learned in this class, and often the problems are a bit better-hidden than they are here. But it’s surprising—at least to me—how often these things come up in a relatively obvious way.

Aside from giving you some practice recognizing and dealing with this stuff, what I hope to do here is give you some examples to help you explain what’s going on when these things happen. Often, if someone’s making a mistake in a subtler form or a more complex problem, you can point out that the situation is analogous to one where the problem is much more obvious, as it is in many of these examples.

About the Exam

This practice exam was long and, in places, hard. Think about this as a set of exercises I wrote to make sure y’all had some practice with the ideas and skills that’ll come up on the exam.

For the actual exam, expect questions that look like these, but simpler, with less repetition and fewer parts.

A Theme

Several of these exercises — Problems 2, 4, and parts of 5 — look at how evidence from simulations can be misleading. I wanted to bring this up, show you a few flavors of it, and pick apart what’s going on a little in a simple case for two reasons. To explain them, I do have to give away a few exercise solutions.

  1. A lot of the evidence people use to choose methods is based on simulation. When you write a paper proposing a new method, you’re expected to show it works—and often that it works better than the alternatives—in simulated data. That evidence is often pretty questionable. Even if you put aside the incentive to make your method look good, if you spend months or years working on a method to solve one piece of a problem, you get tunnel vision. That’s the piece of the problem that shows up in your simulated data because that’s the piece of the problem you’ve been thinking about. Knowing that this happens and a little about what it can look like can make it a little less disorienting the first time you download someone’s code, try it out, and find that it doesn’t seem to work as well as you’ve been led to believe.

  2. Arguably, many methods work pretty often for the same reason simulations don’t show that they don’t work. A stopped clock is right twice a day. And in Problem 2, we see an estimator that’s right ‘once a day’. But when we get to more complex estimators, there start to be a lot of ways for things to go right. In Problem 5, we see two that come up when estimating treatment effects.

  • We could have misspecification but not covariate shift, and therefore no bias.
  • We could have covariate shift but no misspecification, and therefore no bias.

We can even have misspecification and covariate shift that somehow fit together perfectly so there’s still no bias. We saw one example of this a few weeks ago. It’s not likely that this is happening exactly— unless you use models that can’t be misspecified (all functions) or techniques like inverse probability weighting, you’ll get bias— but maybe not enough that it really matters. I don’t mean to suggest that you should be relying on this kind of luck instead of just using better models or inverse probability weighting if you have a choice. But it might be a reason not to discount conclusions based on problematic methods too much. Pretty often, when you reanalyze the data using a more trustworthy approach, what you get is roughly the same point estimate and a slightly wider confidence interval.

Problem 1

The population and the population least squares line

A sample of size 13 and the least squares line

Suppose we’re interested in estimating the subpopulation mean \(\mu(0) = \sum_{j:x_j=0} y_j / \sum_{j:x_j=0} 1\) of the population drawn on the left above. We’re working with a sample of \((X_1, Y_1), \ldots, (X_n, Y_n)\) drawn with replacement from this population, like the one drawn on the right above.

As our estimate of \(\mu(0)\), we use \(\hat\mu(0)\) where \(\hat\mu\) is the least squares line.¹ Below is a plot of its sampling distribution at sample size \(n=13\). As usual, I’ve plotted its mean, mean plus or minus one standard deviation, and mean plus or minus two standard deviations as solid, dashed, and dotted blue lines respectively. I’ve also plotted the subpopulation mean \(\mu(0)\) itself as a green line.

You may assume that the standard deviation of the estimator at sample size \(n\) is \(\text{se}=\sigma/\sqrt{n}\) for some number \(\sigma\) that does not vary with sample size, just like it would be if \(\hat\mu(0)\) were the mean of a sample of size \(n\).

Exercise 1.1  

At which of these three sample sizes — 25, 50, or 100 — would you expect the interval \(\hat\mu(0) \pm 1.96\ \text{se}\) to have a width of roughly \(1/2\)?

From the plot of the sampling distribution at our current sample size of \(13\), we see that our current interval’s width is roughly \(1\). To halve this width, we need to increase our sample size enough to halve \(\text{se}=\sigma/\sqrt{n}\). Quadrupling our sample size does that. \(\text{new se} = \sigma/\sqrt{4 \times 13} = \frac{1}{2} \times \sigma/\sqrt{13}\). And 50 is close enough to \(13\times 4 = 52\).

Exercise 1.2  

At which of these three sample sizes — 25, 50, or 100 — would you expect the coverage of the usual 95% confidence interval \(\hat\mu(0) \pm 1.96\ \text{se}\) to be closest to 50%?

If the answer isn’t obvious to you, it might be helpful to sketch the sampling distribution of the estimator at the different sample sizes you’re considering. How does their width vary as you increase sample size? What about their center?

There are two things you need to know to answer this question.

  1. The bias of \(\hat\mu(0)\) doesn’t depend on sample size.
  2. We get roughly 50% coverage when our estimator’s bias is half of our interval’s width, i.e., when our interval is \(\hat\theta \pm \abs{\text{bias}}\).

The first tells us that no matter what sample size we use, the bias is the same as it is in the plot above: roughly \(1/4\). The second tells us that we’ll get 50% coverage when our interval’s width is \(1/4 \times 2 = 1/2\). The previous exercise tells us this happens for \(n=50\).

Where are these two things coming from?

  1. The bias of \(\hat\mu(0)\) is the error \(\tilde\mu(0)-\mu(0)\) that we get when we use the population least squares predictor in the same model. No matter what our sample size, \(\mathop{\mathrm{E}}[\hat\mu(x)]=\tilde\mu(x)\). See Lecture 12 for this.
  2. In visual terms, if our bias is half the width of our intervals (their ‘arm length’), then …
  • when we get a point estimate above our sampling distribution’s mean, it doesn’t cover the target.
  • when we get a point estimate below our sampling distribution’s mean, it does cover the target.

If our sampling distribution is (close to) symmetric around its mean, then we’ll be above half the time and below half the time, so we get roughly 50% coverage. See Lecture 13 for this and the generalization discussed in the note below.

Something to remember. When we use misspecified models, bias doesn’t depend on sample size, but the impact of that bias on inference does depend on sample size. More data, worse coverage.

Why emphasize this? Because it’s easy to get in the habit of thinking that more data is just better, and that’s not true if you’re not careful about your choice of model. It’s true if all we care about is the accuracy of our point estimators, e.g. their mean squared error; misspecification limits how well we can do in those terms, but the more data we have the closer we get to that limit. But we usually do care about interval estimation, and if you don’t stop to visualize what’s going on with the sampling distribution—that it’s actually shrinking away from the target as we get more data—it can be counterintuitive that getting more data can create problems.

This is a summary, in the context of this question, of our discussion in Lecture 13. What we saw there is that when our sampling distribution is approximately normal, what determines coverage is the ratio \(\text{bias}/\text{se}\) of our estimator’s bias to its standard deviation. And we can plot coverage as a function of that ratio. We do that in Appendix Chapter 6 (Figure 6.1).

Where is this coming from? If we’re looking at the sampling distribution of \(\hat\theta\) and thinking about the coverage of intervals \(\hat\theta \pm \ell\) of width \(2\ell\) (\(\ell\) for ‘arm-length’), the ones that cover are the ones where the arms reach from \(\hat\theta\) to \(\theta\), i.e. the ones where \(\abs{\hat\theta-\theta} \le \ell\). To count those, we integrate the sampling distribution of \(\hat\theta\) from \(\theta-\ell\) to \(\theta+\ell\). You can see this by drawing the integral you’d use to do the coverage calculation. I’ve done that below.

Look at the top 2 plots. On the left, I’ve drawn the \(n=13\) sampling distribution again, this time with this domain of integration shaded. On the right, I’ve drawn a version for \(n=50\). What you know about the sampling distribution for \(\hat\mu(0)\) when \(n=50\) is that it’s centered in the same place as it is for \(n=13\) and it’s about half as wide. I’ve drawn exactly that on the right; I’ve taken the \(n=13\) histogram on the left and ‘compressed it’ so everything is \(1/2\) as far from the mean. To show that this width-halving approach works, I’ve also drawn the actual sampling distribution for \(n=50\) as a faint gray outline. Since we’re using intervals that are ± two standard deviations, we integrate the sampling distribution from two standard deviations left of the estimation target to two standard deviations right of it—i.e., over the region I’ve shaded yellow. Because that yellow region extends right to the center of the (roughly symmetric) sampling distribution and left far enough to include essentially all of its probability mass on that side, this integral is roughly \(.5\)—half of the sampling distribution’s total mass.

Why is \(\text{bias}/\text{se}\) all that really matters? In Lecture 13, we showed it via calculus. We wrote out the integral of the normal distribution with mean \(\theta+\text{bias}\) and standard deviation \(\text{se}\) and did a change of variables. But I’ll try to give you a more visual explanation here.

Think about what’s happened when we (roughly) quadrupled our sample size to \(n=50\). We took the \(n=13\) histogram and made it half as wide. And our intervals got half as wide too: their width scales with the sampling distribution’s. What doesn’t change is where the target is. On the left, it’s 1 standard deviation from the mean. On the right, it’s 2. Not because the target or the sampling distribution’s center has changed, but because the standard deviation has. That changes the value of the integral. But if we halved the bias at the same time as we halved everything else, then we’d get exactly the same integral—the plot on the right would just be a half-as-wide version of the plot on the left.

I’ve illustrated that on the bottom row by drawing exactly the same thing twice: the \(n=50\) plot with the target changed so the bias is halved. By comparing the two plots on the right, top and bottom, you can see that all that’s changed is that the target’s been moved so the bias is halved. And by comparing the two plots on the left, you can see that all that’s changed are the numbers on the axes—it’s exactly the same plot, but half as wide and twice as tall. And because of that, the proportion of the sampling distribution in the range we’re integrating over is the same as it is in the \(n=13\) case.
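If you’d rather compute coverage than read it off a plot, the change-of-variables calculation from Lecture 13 comes down to a one-line function. This is a sketch under the normal approximation; the printed values match the cases above (a ratio of 1, as at \(n=13\), gives roughly 83% coverage; a ratio of 2, as at \(n=50\), gives roughly 48%).

# Coverage of the interval thetahat +/- 1.96 se when thetahat is normal with
# mean theta + bias and standard deviation se. After a change of variables,
# coverage depends only on the ratio r = bias/se.
coverage = function(r) { pnorm(1.96 - r) - pnorm(-1.96 - r) }
coverage(c(0, 1, 2))   # roughly 0.95, 0.83, and 0.48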

Problem 2

You have a sample \(Y_1, Y_2, \ldots, Y_n\) drawn with replacement from a population with mean \(\mu\) and standard deviation \(\sigma\). The usual estimate for \(\mu\) is the sample mean \(\bar{Y}_0 = \frac{1}{n}\sum_{i=1}^n Y_i\). But one of your coworkers has a clever idea to reduce variance—instead of dividing by \(n\), they divide by \(n+10\). This is their estimator: \(\bar{Y}_{10} = \frac{1}{n+10}\sum_{i=1}^n Y_i\). You try to convince them it’s a bad idea by being a bit hyperbolic—you argue that if that’s a good idea, then why not divide by \(n+100\)? They, unfortunately, love this idea. You now have to explain why this is a bad idea. There are some plots in Appendix Chapter 6 that may help.

Your coworker has calculated formulas for the bias and standard deviation of the ‘divide by \(n+k\) estimator’ \(\bar{Y}_k=\frac{1}{n+k}\sum_{i=1}^n Y_i\). \[ \begin{aligned} \mathop{\mathrm{E}}[\bar{Y}_k] - \mu &= -\frac{k}{n+k}\mu \\ \mathop{\mathrm{sd}}[\bar{Y}_k] &= \frac{\sigma}{\sqrt{n}} \times \frac{n}{n+k}. \end{aligned} \]

\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\bar{Y}_k] - \mu &= \mathop{\mathrm{E}}\qty[\frac{n}{n+k}\bar Y_0] - \mu \\ &= \frac{n}{n+k}\mathop{\mathrm{E}}[\bar Y_0] - \mu \\ &=\frac{n}{n+k}\mu - \mu \\ &= \qty(\frac{n}{n+k} - 1)\mu \\ &= \qty(\frac{n}{n+k} - \frac{n+k}{n+k})\mu \\ &= -\frac{k}{n+k}\mu \\ & \\ \mathop{\mathrm{V}}\qty[\bar{Y}_k] &= \mathop{\mathrm{V}}\qty[\frac{n}{n+k} \bar {Y}_0] \\ &= \qty(\frac{n}{n+k})^2 \mathop{\mathrm{V}}[\bar{Y}_0] \\ &= \frac{n^2}{(n+k)^2} \frac{\sigma^2}{n} \\ &= \frac{n\sigma^2}{(n+k)^2}=\frac{\sigma^2}{n} \times \frac{n^2}{(n+k)^2} \end{aligned} \]
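If you want to check your coworker’s algebra numerically, a short Monte Carlo sketch will do it. Here I draw the observations as normals with the stated mean and standard deviation rather than resampling a finite population; that’s just to keep the sketch short, and the particular values of \(n\), \(k\), \(\mu\), and \(\sigma\) are arbitrary choices of mine.

library(purrr)

# Monte Carlo check of the bias and sd formulas for the divide-by-(n+k)
# estimator at n = 100, k = 10, mu = 0.3, sigma = 1.
n = 100; k = 10; mu = 0.3; sigma = 1
Ybar.k = 1:100000 |> map_vec(function(.) { sum(rnorm(n, mu, sigma)) / (n + k) })
mean(Ybar.k) - mu    # formula says -k/(n+k) * mu            = -0.027
sd(Ybar.k)           # formula says sigma/sqrt(n) * n/(n+k)  =  0.091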

And they’ve drawn the sampling distribution of \(\bar{Y}_0\), \(\bar{Y}_{10}\), and \(\bar{Y}_{100}\) (top, middle, and bottom rows respectively) for sample sizes \(n=100\) and \(n=10,000\) (left and right columns respectively) for \(\mu=0\) and \(\sigma=1\). The solid line is the estimator’s expected value \(\mathop{\mathrm{E}}[\bar{Y}_k]\), the dashed lines are \(\mathop{\mathrm{E}}[\bar Y_k] \pm \mathop{\mathrm{sd}}[\bar Y_k]\), and the dotted lines are \(\mathop{\mathrm{E}}[\bar Y_k] \pm 2 \mathop{\mathrm{sd}}[\bar Y_k]\).

On the basis of those drawings, your coworker proposes using \(\bar{Y}_{100}\) to estimate \(\mu\).

Exercise 2.1  

Suppose your sample size is \(n=100\). Explain why your coworker’s original proposal to use \(\bar Y_{10}\) and your hyperbolic suggestion to use \(\bar Y_{100}\) appear, from the drawings, to be ways to increase precision without increasing bias. To explain why this is misleading, describe what happens in the case that \(\mu=11/10\) and \(\sigma=1\) by doing the following.

  • Modifying the drawings to show the sampling distributions in this new case.
  • Stating an approximation to the coverage of each of the confidence intervals \(\bar{Y}_{k} \pm 1.96\ \text{se}\) for \(k=0\), \(k=10\), and \(k=100\).

Tip. You can ‘change the location’ of the distributions in the drawing above by changing the labels on the x-axis.

Before getting into the solution, let’s think about what this estimator is doing. It’s a weighted average of the sample mean and zero with weight \(n/(n+k)\) on the sample mean. We talked about these in the homework before our first midterm; we can think of this as adding \(k\) fake observations \(Y_{n+1} \ldots Y_{n+k}\) to our sample, all of which are zero.

\[ \bar Y_{k} = \frac{n}{n+k} \bar Y_0 + \frac{k}{n+k} \times 0 \] We call these shrinkage estimators because they take the sample mean and shrink it toward zero. The bias of \(\bar Y_{k}\) is proportional to \(\mu\). When \(\mu=0\), as it is for the plots above, there’s no bias. But if it’s not zero, there is bias. What we pay for an estimator that’s more precise when we’re estimating zero is bias toward zero when we’re not.
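If the ‘fake observations’ interpretation sounds too good to be true, it’s quick to check on a tiny made-up sample.

# The divide-by-(n+k) estimator is just the sample mean after appending k
# fake observations equal to zero. Tiny made-up numbers, for illustration only.
Y = c(2, 4, 6); k = 10
sum(Y) / (length(Y) + k)    # 12/13, about 0.92
mean(c(Y, rep(0, k)))       # the same thing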

Drawing. To draw what’s happening, we don’t have to change the shape of the distributions. That shape—the standard deviation—doesn’t depend on \(\mu\). What does change when we think about the case that \(\mu=11/10\) (1.10) are the means. \[ \mathop{\mathrm{E}}[\bar Y_k] = \frac{n}{n+k} \mathop{\mathrm{E}}[\bar Y_0] = \frac{n}{n+k} \mu = \begin{cases} \frac{100}{100} \times \frac{11}{10} = 1.10 & \qfor k=0 \\ \frac{100}{110} \times \frac{11}{10} = 1.00 & \qfor k=10 \\ \frac{100}{200} \times \frac{11}{10} = 0.55 & \qfor k=100 \end{cases} \]

To change the plots to reflect these new means, we just relabel the \(x\)-axis to be the numbers that are already there plus these means. That way, our distributions are now centered at \(\mathop{\mathrm{E}}[\bar Y_k]\) for \(\mu=11/10\) instead of at \(0\), but the shape is unchanged.

Where we’d draw in the target, of course, has also changed. For \(\mu=0\), it’s right at the center of the distribution.

  • For \(k=0\), it still is.
  • For \(k=10\), it’s 1 grid line to the right of the distribution’s center.
    • That’s just a little more than 1 standard deviation away.
    • That means coverage should be about 85%.
  • For \(k=100\), it’s 5.5 grid lines to the right of the distribution’s center.
    • That’s just off the plot. Many, many standard deviations away.
    • That means coverage should be about 0%.

Coverage. Those approximations to coverage are fine for the purposes of answering this question. But for the next one, it’s useful to have a formula for what matters for inference: the bias/standard deviation ratio. \[ \begin{aligned} \frac{\mathop{\mathrm{E}}[\bar Y_k] - \mu}{\mathop{\mathrm{sd}}[\bar Y_k]} &= \frac{-\frac{k}{n+k}\mu}{\frac{\sigma}{\sqrt{n}} \times \frac{n}{n+k}} \\ &= -\frac{k\mu}{n} \times \frac{\sqrt{n}}{\sigma} \\ &= -\frac{\mu}{\sigma} \times \frac{k}{\sqrt{n}} \\ &= \begin{cases} -\frac{11}{10} \times \frac{0}{10} = 0 & \qfor k=0 \\ -\frac{11}{10} \times \frac{10}{10} = -1.1 & \qfor k=10 \\ -\frac{11}{10} \times \frac{100}{10} = -11 & \qfor k=100 \\ \end{cases} \end{aligned} \]

To get more precise coverage numbers, we can look these ratios up in Figure 6.1. A ratio of 1.1 is about 80% coverage. A ratio of 11 is way off the chart, but we’re clearly at essentially zero coverage—at a ratio of \(5\), at the chart’s edge, we’re basically at zero.

Exercise 2.2  

This coworker isn’t going anywhere. Next time, it might be better to bite your tongue and save time. If your sample size was \(n=10,000\), maybe your coworker’s idea to use \(\bar Y_{10}\) really wasn’t creating much of a problem. What would you need to know about \(\mu\) and \(\sigma\) to guarantee at least 85% coverage?

For 85% coverage, we need the ratio \(\abs{\text{bias}}/\text{sd}\) to be less than 1. That ratio was calculated in the last exercise’s solution: in abstract terms, it’s \((\abs{\mu}/\sigma) \times (k/\sqrt{n})\). In this example, \(k/\sqrt{n}=10/\sqrt{10,000}=10/100=1/10\), so \[ 1 \ge \frac{\abs{\text{bias}}}{\text{sd}} = \frac{\abs{\mu}}{\sigma} \times \frac{1}{10} \qif \abs{\mu} \le 10\sigma. \]

Interpretation. We want to know if it matters that our coworker wants to include 10 ‘fake observations’ of zero with our 10,000 real ones. That’s a tiny fraction of fake observations in the 10,010 observations (real and fake) that they’d be averaging. It won’t cause much loss of coverage unless those fake observations are extremely far from the actual value of \(\mu\). But what ‘extremely far’ means is measured in ‘standard deviation units’—if all of our 10,000 real observations are \(.01\), zero is infinitely far away in this sense.

Exercise 2.3  

Suppose the context for all this is that you are running a poll of 100 registered voters to find out whether they intend to vote for the candidate from Party 0 or Party 1. Elections are always pretty close, so you can assume that the proportion of registered voters who intend to vote for Party 1 is between .4 and .6. Approximately how much coverage does using the interval estimator \(\bar{Y}_{k} \pm 1.96\ \text{se}\) guarantee you for \(k=0\), \(k=10\), and \(k=100\)?

The key observation here is that in a population of binary outcomes with mean \(\mu\), the standard deviation is a function of \(\mu\): \(\sigma=\sqrt{\mu(1-\mu)}\). As a result, so is the bias/standard deviation ratio. \[ \frac{\abs{\text{bias}}}{\text{sd}}=\frac{\mu}{\sigma} \times \frac{k}{\sqrt{n}} = \frac{\mu}{\sqrt{\mu (1-\mu)}} \times \frac{k}{10}. \] To see how big this can get for \(\mu \in [.4,.6]\), and therefore how bad our coverage can be, we can look at Figure 6.2, a plot of \(x/\sqrt{x(1-x)}\) in the appendix. This ratio is largest at \(\mu=.6\), which is intuitive; shrinkage toward zero creates the biggest problem when the proportion we’re estimating is furthest from zero. And at \(\mu=.6\), it’s roughly \(1.2\). That’s close enough to \(\mu/\sigma = (11/10) / 1 = 11/10\), which we considered in Exercise 2.1, that its implications for coverage should be roughly the same.

  • For \(k=0\), we’re just talking about the mean. No bias. 95% coverage.
  • For \(k=10\), our bias/sd ratio is 1.2. From Figure 6.1, we can see that’s about 80% coverage.
  • For \(k=100\), our bias/sd ratio is 12. We don’t need to look at anything. That’s virtually 0 coverage.
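If you’d rather compute these than read them off Figure 6.1, here’s a quick sketch, using the normal approximation again and taking the worst case \(\mu=.6\).

# Worst-case bias/sd ratio at n = 100 for binary outcomes with mu in [.4, .6],
# and the coverage of thetahat +/- 1.96 se implied by a normal approximation.
mu = .6; n = 100; k = c(0, 10, 100)
ratio = mu / sqrt(mu * (1 - mu)) * k / sqrt(n)
pnorm(1.96 - ratio) - pnorm(-1.96 - ratio)   # roughly 0.95, 0.77, and essentially 0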

Things to Remember.

  1. Be careful with evidence from simulations. Consider the possibility that there’s something special about the way the data is being simulated that makes the estimator look good.

  2. When you do identify a potential problem with an estimator, try to keep it in perspective. Think about what information you have that you can use to understand whether it’s a big or small problem. Sometimes just thinking about the kind of data you’re working with (e.g., binary data, as in this last exercise) tells you a lot.

Problem 3

Consider the following estimation targets. \[ \begin{aligned} \Delta_{\text{raw}} &= \frac{1}{m_1}\sum_{j:w_j=1} y_j - \frac{1}{m_0}\sum_{j:w_j=0} y_j \qfor m_w = \sum_{j:w_j=w} 1 \\ \Delta_{0} &= \frac{1}{m_0}\sum_{j:w_j=0} \color{gray}\left\{ \color{black} \mu(1,x_j) - \mu(0,x_j) \color{gray} \right\} \color{black} \\ \Delta_{1} &= \frac{1}{m_1}\sum_{j:w_j=1} \color{gray}\left\{ \color{black} \mu(1,x_j) - \mu(0,x_j) \color{gray} \right\} \color{black} \\ \Delta_{\text{all}} &= \frac{1}{m}\sum_{j=1}^m \color{gray}\left\{ \color{black} \mu(1,x_j) - \mu(0,x_j) \color{gray} \right\} \color{black} \end{aligned} \]
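To make the definitions concrete, here’s one way you could compute all four targets from a population stored as a dataframe. The name pop and its columns w, x, y are my assumptions; nothing above pins down a particular layout.

# mu(w, x) is the subpopulation mean of y among people with treatment w and
# covariate value x; tau(x) is the difference mu(1, x) - mu(0, x).
mu  = function(pop, w., x.) { mean(pop$y[pop$w == w. & pop$x == x.]) }
tau = function(pop, x.)     { mu(pop, 1, x.) - mu(pop, 0, x.) }

Delta.raw = mean(pop$y[pop$w == 1]) - mean(pop$y[pop$w == 0])
Delta.0   = mean(sapply(pop$x[pop$w == 0], function(x.) tau(pop, x.)))   # slow but clear
Delta.1   = mean(sapply(pop$x[pop$w == 1], function(x.) tau(pop, x.)))
Delta.all = mean(sapply(pop$x,             function(x.) tau(pop, x.)))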

Exercise 3.1  

For each of the examples above, answer the following questions.

  1. Is \(\Delta_1 > \Delta_\text{raw}\)?
  2. Is \(\Delta_1 > \Delta_0\)?
  3. Is \(\Delta_1 > \Delta_\text{all}\)?

I’m not asking you to compare the magnitudes of these estimation targets. If \(\Delta_1\) is positive and \(\Delta_\text{raw}\) is negative, then no matter what the magnitudes are, \(\Delta_1 > \Delta_\text{raw}\).

Here’s the essential approach I’m using to answer these questions.

  1. The difference \(\Delta_\text{raw}-\Delta_1\) is the ‘covariate shift term’ \(\sum_x \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \textcolor[RGB]{248,118,109}{\mu(0, x)} - \sum_x \textcolor[RGB]{248,118,109}{p_{x \mid 0}} \textcolor[RGB]{248,118,109}{\mu(0, x)}\).
  • if the shift from \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) to \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) is toward where \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is larger, this difference is positive: \(\Delta_{\text{raw}}> \Delta_1\).
  • if it’s toward where \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is smaller, the difference is negative: \(\Delta_{\text{raw}} < \Delta_1\).
  • if \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is constant, there’s no difference.
  2. The difference \(\Delta_1-\Delta_0\) is analogous, but with \(\tau(x)=\textcolor[RGB]{0,191,196}{\mu(1, x)}-\textcolor[RGB]{248,118,109}{\mu(0, x)}\) in place of \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\).
  • if the shift from \(\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\) to \(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}\) is toward where \(\tau(x)\) is larger, the difference is positive: \(\Delta_1 > \Delta_0\).
  • etc.
  3. \(\Delta_{\text{all}}\) is a weighted average of \(\Delta_0\) and \(\Delta_1\), so it’s between the two.
  • i.e. either \(\Delta_1 > \Delta_{\text{all}} > \Delta_0\) or \(\Delta_{1} < \Delta_{\text{all}} < \Delta_0\). Or all 3 are equal.
  • That means the answer to 3 is always the same as the answer to 2.

Example A

  1. No. \(\Delta_1 < \Delta_\text{raw}\). In intuitive terms, what we see is a ‘treatment’ of being green that does nothing, but outcomes for the green people are higher on average because they’re over on the right, where outcomes in both groups tend to be higher. In math terms, \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is increasing and we have covariate shift from left to right, so the ‘covariate shift term’ is positive.
  2. No. \(\Delta_1 = \Delta_0\). The within-group means \(\textcolor[RGB]{0,191,196}{\mu(1, x)}\) and \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) are the same, so their difference \(\tau(x)\) is constant (zero), and it doesn’t matter how you average it—you always get zero.
  3. No. \(\Delta_1=\Delta_\text{all}\) for the same reason.

Example B

  1. Yes. \(\Delta_1 > \Delta_\text{raw}\). \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is increasing and we have covariate shift from right to left, so the ‘covariate shift term’ \(\Delta_{\text{raw}} - \Delta_1\) is negative. In intuitive terms, when we adjust for covariate shift by comparing green dots to red dots in the same column, we’re looking at red dots that are further to the left than what’s typical for red dots (which are mostly on the right), so \(\textcolor[RGB]{0,191,196}{\text{same}} - \textcolor[RGB]{248,118,109}{\text{smaller}} = \text{bigger}\).
  2. No. \(\Delta_1 < \Delta_0\). The within-group difference \(\textcolor[RGB]{0,191,196}{\mu(1, x)} - \textcolor[RGB]{248,118,109}{\mu(0, x)}\) is increasing, so we get a bigger average on the right (where the red dots are) than on the left (where the green dots are).
  3. No. Because \(\Delta_\text{all}\) is an average of and therefore between \(\Delta_1\) and \(\Delta_0\), it’s bigger than \(\Delta_1\) if \(\Delta_0\) is.

Example C

  1. No. \(\Delta_1 = \Delta_\text{raw}\). The within-group mean \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) doesn’t vary with \(x\), so the average over the red distribution is the same as the average over the green distribution.
  2. No. \(\Delta_1 < \Delta_0\). The within-group difference \(\textcolor[RGB]{0,191,196}{\mu(1, x)} - \textcolor[RGB]{248,118,109}{\mu(0, x)}\) is decreasing, so averaging it over the green distribution (which is shifted to the right relative to the red distribution) gives us a smaller number than averaging it over the red distribution.
  3. No. \(\Delta_1 < \Delta_\text{all}\). Because \(\Delta_\text{all}\) is an average of and therefore between \(\Delta_1\) and \(\Delta_0\), it’s bigger than \(\Delta_1\) if \(\Delta_0\) is.

Problem 4

Your friend tells you about a new estimator for the mean \(\theta=\frac1m\sum_{j=1}^m y_j\) of a population \(y_1 \ldots y_m\). They like to use it when they think the population mean \(\theta\) is close to zero.

\[ \hat \theta = \begin{cases} 0 & \text{if } \abs{\bar Y} < .1 \\ \bar Y & \text{otherwise} \end{cases} \]

estimator = function(x, kill.below=.1) {
  if(abs(mean(x)) < kill.below) { 0 } else { mean(x) }
}

To show you how it works, they’ve made up a population \(y_1 \ldots y_{1000}\) with mean \(\theta=.02\) and standard deviation 1. They show you the sampling distribution of this estimator when you draw samples \(Y_1 \ldots Y_n\) of size \(n=25\) with replacement from this population.

They’re pretty pleased with themselves. Their estimator is …

  • almost never further from \(\theta\) than the sample mean is. It’s further in 0% of samples.
  • often closer to \(\theta\) than the sample mean is. It’s closer in 30% of samples.

This is a weird estimator, so you ask if you can still use the bootstrap to calibrate confidence intervals around it. Your friend says …

I checked this morning. The coverage probability is \(.93\). That’s not bad when you’re getting a more accurate estimate 30% of the time.

They pull up the code below to show you how they checked, but when they run it, it doesn’t say \(.93\). It says \(.84\).

bootstrap.sampling.distribution = function(Y) { 
  1:1000 |> map_vec( function(.) {  
    J.star = sample(1:n, n, replace=TRUE)
    Y.star = Y[J.star]
    estimator(Y.star)
  })
}

se = sd(bootstrap.sampling.distribution(Y))
covers = 1:1000 |> map_vec( function(.) { 
  J = sample(1:m, n, replace=TRUE)  
  Y = y[J]                          
  estimator(Y) - 1.96*se <= theta & theta <= estimator(Y) + 1.96*se
})
mean(covers)
[1] 0.841

Exercise 4.1  

Tell your friend what’s wrong with their code and how to fix it. A sentence or two should do.

It doesn’t take me more than 5 seconds to fix and I don’t type very fast.

They’re not bootstrapping the same sample they use to calculate their point estimates: se is calculated just once, from a single sample, before the loop even starts. To bootstrap properly, move the line where we calculate se so it’s right after the line Y=y[J] that draws our sample.

set.seed(0)
covers = 1:1000 |> map_vec( function(.) { 
  J = sample(1:m, n, replace=TRUE)
  Y = y[J]
  se = sd(bootstrap.sampling.distribution(Y))
  estimator(Y) - 1.96*se <= theta & theta <= estimator(Y) + 1.96*se
})
mean(covers)
[1] 0.929

Something to remember. If you’re using simulated data to look at what an estimator does, make sure it’s the estimator you’d actually be using.

The bootstrap-once thing that’s going on in your friend’s code might be a simple mistake or it might be a way to speed up the coverage calculation, but the result is that what’s being calculated isn’t the estimator’s actual coverage.

Exercise 4.2  

Extra Credit Style. Suppose that somewhere in the code they used to draw those plots, they were actually drawing a sample Y from the population y. They were running this code.

J = sample(1:m, n, replace=TRUE)  
Y = y[J]    

They ran it before they calculated the coverage probability this morning and they ran it again before they calculated it just now. Why do you think the two coverage probabilities they calculated are so different?

Take a look at the plots below. Each one shows the bootstrap sampling distributions of this new estimator in teal and the sample mean in pink for a random sample \(Y_1 \ldots Y_n\) drawn from the population \(y_1 \ldots y_m\). Each estimator’s bootstrap mean ± 2 bootstrap standard deviations is shown by a pair of vertical lines in the same colors.

The output of sample is random, so if they reran that code between this morning and now, the sample Y that they were using to calculate their one standard deviation was different this morning from what it is now.

They must’ve had a sample in which the bootstrap sampling distribution of the estimator \(\hat\theta\) had a bigger standard deviation than it does in the one we’re using now, so they were calculating the coverage of wider intervals around the same (distribution of) point estimators.

That’s not just a thing about this estimator. You can see in the plots in the tip above that it happens with the sample mean too. It’s a bit easier to understand that mathematically because \(\bar{Y}\)’s bootstrap standard deviation is just \(1/\sqrt{n}\) times the standard deviation of the sample; when we have more spread in the sample Y, the width of all these intervals is bigger.

Problem 5

You’re working at a company that makes a fancy alarm clock that watches you sleep. You don’t tell it when to wake you up. You tell it when you need to be up by; it wakes you earlier if, based on your sleep patterns, it thinks that’s better for you. You can, of course, turn that feature off so it wakes you when you ask like a regular alarm clock. You’re on a team that’s trying to figure out how well this feature is working.

You have a population \(w_1, x_1, y_1 \ldots w_m, x_m, y_m\) of people who have been using the clock.

  • \(w_j\) indicates whether they have the feature turned on or off.
  • \(x_j\) is a sleep quality score they reported when they bought the clock (0 is worst, 4 is best).
  • \(y_j\) is the average of the sleep quality scores they’ve logged since they started using the clock.

Privacy is a big deal when you’re selling a device that watches people sleep, so your app doesn’t report sleep quality scores directly to your company. You actually have to run a survey. The result is a sample \((W_1, X_1, Y_1) \ldots (W_n,X_n,Y_n)\) that’s drawn with replacement from the population. You have it in an R dataframe sam with n rows. I’ve plotted it above.

As usual, \(X\) is on the horizontal axis, the \(Y\) variable is on the vertical axis, and the \(W\) variable, which is binary, is indicated by color: green for \(W=1\), red for \(W=0\). The mean in each column is indicated by a black pointrange (⍿). The proportions of times that \(X\) takes on each value \(x\) when \(W=0\) and when \(W=1\), \(\textcolor[RGB]{248,118,109}{P_{x\mid 0}}\) and \(\textcolor[RGB]{0,191,196}{P_{x\mid 1}}\) respectively, are plotted as histograms in matching colors. They’re also shown in the table below.

\[ \color{gray} \begin{array}{lll} x & \textcolor[RGB]{248,118,109}{P_{x \mid 0}} & \textcolor[RGB]{0,191,196}{P_{x \mid 1}} \\ \hline 0 & \textcolor[RGB]{248,118,109}{0.41} & \textcolor[RGB]{0,191,196}{0.09} \\ 1 & \textcolor[RGB]{248,118,109}{0.35} & \textcolor[RGB]{0,191,196}{0.11} \\ 2 & \textcolor[RGB]{248,118,109}{0.05} & \textcolor[RGB]{0,191,196}{0.45} \\ 3 & \textcolor[RGB]{248,118,109}{0.15} & \textcolor[RGB]{0,191,196}{0.31} \\ 4 & \textcolor[RGB]{248,118,109}{0.03} & \textcolor[RGB]{0,191,196}{0.04} \\ \end{array} \]

Part 1

Suppose your coworker fits the additive model by least squares. That is, they choose \(\hat\mu(w,x)\) as their estimate of \(\mu(w,x)\) like this. \[ \begin{aligned} \hat \mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \qquad \\ &\qqtext{ where } \mathcal{M}= \qty{m(w,x) = a(w) + b(x) \ \text{ for all functions $a(w)$ and $b(x)$ }} \end{aligned} \]

They claim that the difference in the green and red lines’ values at \(x=0\), \(\textcolor[RGB]{0,191,196}{\hat\mu(1, 0)} - \textcolor[RGB]{248,118,109}{\hat\mu(0, 0)} = -1.36\), is a good estimate of the adjusted difference in means \(\Delta_1\).²,³

\[ \Delta_1 = \textcolor[RGB]{0,191,196}{\frac{1}{m_1}\sum_{j:w_j=1}} \color{gray}\left\{ \color{black} \textcolor[RGB]{0,191,196}{\mu(1, x_j)} - \textcolor[RGB]{248,118,109}{\mu(0, x_j)} \color{gray} \right\} \color{black} \qfor m_1 = \sum_{j:w_j=1} 1 \]

You’re not sure, so you plot their fitted model. Here’s what you see. As usual, \(\textcolor[RGB]{0,191,196}{\mu(1, x)}\) is the green line and \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is the red one.

Exercise 5.1  

Do you think their estimate is biased up (too large) or down (too small)? Referring to the plot, explain why.

I think it’s biased up (too large). You can see that where most of the green dots are, the difference in predictions \(\textcolor[RGB]{0,191,196}{\hat\mu(1, x)}-\textcolor[RGB]{248,118,109}{\hat\mu(0, x)}\) that we get with the additive model is smaller in magnitude, and (being negative) therefore bigger in a signed sense, than the actual spacing between the two groups. Especially at \(x=2\) where almost half of the green dots are.

A Revision

I’ve drawn in the column means in a revised version posted Monday night. If you were working on this one over the weekend, it was probably a little difficult because, in this data, it’s a little hard to see where the mean in each column is. You could get those means from the plot below, but it’s easier having them all in one place.

Part 2

When they calculate their estimator, they’re probably not explicitly making a comparison at \(x=0\), e.g. by writing code that says something like muhat(1,0)-muhat(0,0). They’re probably just looking at the coefficient on \(w\) in the output of the lm function.

additive.predictor = lm(y~w+factor(x), data=sam) 
additive.predictor    

Call:
lm(formula = y ~ w + factor(x), data = sam)

Coefficients:
(Intercept)            w   factor(x)1   factor(x)2   factor(x)3   factor(x)4  
      1.347       -1.362        0.636        0.695        1.727        3.091  

That’s the number in front of \(w\) when you write out your least squares predictor in terms of a certain basis: it’s \(\hat a_1\) for \(\hat\mu(w,x) = \hat a_0 + \hat a_1 w + \hat a_2 1_{=1}(x) + \ldots + \hat a_5 1_{=4}(x)\). When they hear what you said in Exercise 5.1, they change their model to one they’ve seen you use before. This is their new least squares predictor. \[ \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \qty{ Y_i - m(W_i,X_i)}^2 \qquad \text{ where } \mathcal{M}= \qty{\text{all functions} \ m(w,x) } \]

And they do exactly what they did before. They report the coefficient on \(w\) that lm gives them: -0.0860.

all.functions.predictor = lm(y~w*factor(x), data=sam) 
all.functions.predictor

Call:
lm(formula = y ~ w * factor(x), data = sam)

Coefficients:
 (Intercept)             w    factor(x)1    factor(x)2    factor(x)3  
       1.084        -0.086         0.954         1.731         2.427  
  factor(x)4  w:factor(x)1  w:factor(x)2  w:factor(x)3  w:factor(x)4  
       2.665        -1.472        -2.125        -1.890        -0.161  

That’s \(\hat a_1\) when we write out the least squares predictor like this. \[ \hat\mu(w,x) = \hat a_0 + \hat a_1 w + \hat a_2 1_{=1}(x) + \ldots + \hat a_5 1_{=4}(x) + \hat a_6 w 1_{=1}(x) + \ldots + \hat a_{9} w 1_{=4}(x) \tag{5.1}\]

That number doesn’t sound right to them, so they plot their new predictor using the code you just wrote. Here’s what they see.

They’re confused. The plot tells basically the same story as the one for the additive model. There’s a green curve below a red one. It seems like they should be getting a clearly negative number, not something so close to zero.

Exercise 5.2  

  1. Using the output of lm above and other information you have above, estimate \(\Delta_1\). Or, if you don’t feel like getting out your calculator, explain how to do it very concretely, so your confused coworker can handle it from there. Don’t expect too much from them. Say something like ‘take this number from this place, multiply it by this number from this other place, etc.’

  2. Explain why what they were doing before worked for the additive model but not for the all functions model.

Interpreting the output of lm isn’t something we’ve spent much time on in this class. And it’s not something you’ll have to do on the actual final exam.

If you want to try, go ahead. Reading the next tip might help. But if you don’t, you’re welcome to use this table describing the values of \(\hat\mu(w,x)\). And if you’re writing instructions for your coworker instead of doing the calculation yourself, you’re welcome to refer to this table in your instructions.

w x muhat
0 0 1.08
1 0 1.00
0 1 2.04
1 1 0.48
0 2 2.82
1 2 0.60
0 3 3.51
1 3 1.54
0 4 3.75
1 4 3.50

To interpret the coefficients in the output of lm, you need to understand how it describes the basis it’s thinking in. This is one of those skills that’s not all that necessary when you’re working on your own—you can just use predict to calculate \(\hat\mu(w,x)\) and not worry about how R is actually doing it under the hood. But it’s a good skill to have around other people, because a lot of them are in the habit of thinking about coefficients instead of predictions, and when that results in mistakes like the one ‘your coworker’ made here, you’ll need to do a little translation between their way of thinking and yours so you can say something more helpful than ‘Forget everything you know. Just do it my way.’

In the output of lm(y~w*factor(x), data=data), they’re thinking of \(\hat\mu\) like I’ve written it in Equation 5.1.

  • What I’ve called \(\hat a_0\), R writes under ‘(Intercept)’. It’s the value of \(\hat\mu(0,0)\).
  • What I’ve called \(\hat a_1\), R writes under ‘w’. It’s what’s multiplying \(w\) in Equation 5.1; as the quick check below shows, it’s the difference in predictions \(\hat\mu(1,0)-\hat\mu(0,0)\).
  • What I’ve called \(\hat a_2 \ldots \hat a_5\), R writes under ‘factor(x)1’ … ‘factor(x)4’. What it means by ‘factor(x)k’ is the indicator function \(1_{=k}(x)\).
  • What I’ve called \(\hat a_6 \ldots \hat a_9\), R writes under ‘w:factor(x)1’ … ‘w:factor(x)4’. What it means by ‘w:factor(x)k’ is the product of \(w\) and the indicator function \(1_{=k}(x)\).
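Here’s the quick check mentioned above. The coefficient lm reports under ‘w’ and the difference in predictions at \(x=0\) are the same number, about -0.086.

# The coefficient on w equals the difference in predictions at x = 0.
coef(all.functions.predictor)["w"]
predict(all.functions.predictor, data.frame(w = 1, x = 0)) -
  predict(all.functions.predictor, data.frame(w = 0, x = 0))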

Why it says factor(x). R says factor(x)1, etc., because I’ve converted \(x\) to a ‘factor’— something where we want an indicator for each possible value in our model instead of a single slope — directly in the formula y~w*factor(x). If \(x\) were already a factor, you’d see ‘x1’ … ‘x4’ instead of ‘factor(x)1’ … ‘factor(x)4’ and ‘w:x1’ … ‘w:x4’ instead of ‘w:factor(x)1’ … ‘w:factor(x)4’ in the output as you do below.

another.all.functions.predictor = lm(y~w*x, data=sam |> mutate(x=factor(x)))
another.all.functions.predictor

Call:
lm(formula = y ~ w * x, data = mutate(sam, x = factor(x)))

Coefficients:
(Intercept)            w           x1           x2           x3           x4  
      1.084       -0.086        0.954        1.731        2.427        2.665  
       w:x1         w:x2         w:x3         w:x4  
     -1.472       -2.125       -1.890       -0.161  

The difference between converting it in the formula and converting it beforehand is what predict expects. If you say factor(x) in the formula, you can pass in data in which \(x\) is not a factor. It’ll convert for you.

predict(all.functions.predictor, data.frame(w=1, x=1:4))
   1    2    3    4 
0.48 0.60 1.54 3.50 

If you convert it ahead of time when you call lm, you have to convert it when you call predict, too.

predict(another.all.functions.predictor, data.frame(w=1, x=factor(1:4)))
   1    2    3    4 
0.48 0.60 1.54 3.50 

If you don’t, it gives you an error.

Error: variable 'x' was fitted with type "factor" but type "numeric" was supplied
  1. To estimate \(\Delta_1\), we want to average the difference \(\hat\mu(1,x)-\hat\mu(0,x)\) between the green and red curves over the distribution of the green points, i.e., using the weights \(P_{x\mid 1}\). \[ \hat \Delta_1 = \sum_x \textcolor[RGB]{0,191,196}{P_{x \mid 1}} \qty{ \textcolor[RGB]{0,191,196}{\hat\mu(1, x)}-\textcolor[RGB]{248,118,109}{\hat\mu(0, x)} } \]

We have tables of both of these things above, so we can plug in numbers from those. Or, if we like, we can read \(\hat\mu(1,x)-\hat\mu(0,x)\) from the R output instead of looking at the \(\hat\mu\) table: it’s the coefficient on w for \(x=0\) and, for \(x=1 \ldots 4\), the sum of that coefficient and the coefficient on w:factor(x)1 … w:factor(x)4.

\[ \begin{aligned} \hat\Delta_1 &= 0.09 \times -0.09 + 0.11 \times -1.56 + 0.45 \times -2.21 + 0.31 \times -1.98 + 0.04 \times -0.25 \\ & = -1.8 \end{aligned} \]

  2. In the additive model, the difference \(\hat\mu(1,x) - \hat\mu(0,x)\) didn’t depend on \(x\), so making that comparison at \(x=0\) gave the same thing as making it everywhere and averaging. Once we have a model that predicts what the data tells us at \(x=0\) — that the green and red dots in that column look the same — using the difference there to tell us about the difference elsewhere doesn’t work at all because that’s not at all what happens elsewhere.

Part 3

Now you’ve managed to get your coworker estimating \(\Delta_1\) by using predict instead of just looking at coefficients. Congratulations! That means no matter what model they’re using, if \(\hat\mu(w,x)\) is a reasonable estimate of \(\mu(w,x)\), they’ll get a reasonable estimate of \(\Delta_1\). And you’ve got them packaging everything up into a function estimate.Delta.1 that takes in a sample, does all the work, and gives you back an estimate of \(\Delta_1\).
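That function isn’t written out above, so here’s a minimal sketch of what it might look like, assuming the sample has the same columns as sam, the tidyverse is loaded as in the earlier code, and we’re using the all functions model from Exercise 5.2.

library(dplyr)

# Fit the all functions model, predict both potential outcomes for everyone
# in the W = 1 group, and average the difference over that group.
estimate.Delta.1 = function(sam) {
  muhat   = lm(y ~ w * factor(x), data = sam)
  treated = sam |> filter(w == 1)
  mean(predict(muhat, treated |> mutate(w = 1)) -
       predict(muhat, treated |> mutate(w = 0)))
}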

Delta1.hat = estimate.Delta.1(sam)

Their point estimate of \(\Delta_1\) is -1.8.

Now they want a confidence interval, so they bootstrap their estimator. To do it, they first write a function that calculates the sampling distribution of any estimator that’s written as a function of a sample when you sample with replacement from a population. Something like this.

sampling.distribution = function(pop, estimator, draws=1000) { 
  map_vec(1:draws, function(b) {
    J = sample(1:nrow(pop), nrow(pop), replace=TRUE)
    sam = pop[J,]
    estimator(sam)
  }) 
}

Then, to get the bootstrap sampling distribution of their estimator, they use it as if their sample were the population.

Delta1.hat.star = sampling.distribution(sam, estimate.Delta.1) 

They calculate a 95% interval like this.

se = sd(Delta1.hat.star)
interval = c(Delta1.hat - 1.96*se, Delta1.hat + 1.96*se)

They get this interval \([-1.94 \ , \ -1.67 ]\). And they’ve done everything right. They’re using an interval estimator with 95% coverage. But your boss, who has a habit of correcting employees without really spending the time to understand what they’re doing, glances at the code and says this:

You forgot to divide by \(\sqrt{n}\). Rookie mistake.

This isn’t great for your coworker’s confidence. To help explain the \(\sqrt{n}\) thing, you draw a few things. You draw the bootstrap sampling distribution of \(\hat\Delta_1\), the bootstrap sampling distribution of \(\bar{Y}\), and the distribution of \(Y\) in the sample.

The bootstrap sampling distribution of Delta.hat.1.

The distribution of Y in the sample.

The bootstrap sampling distribution of the sample mean.

Here’s what you say.

Suppose we were just using the sample mean \(\bar Y\) to estimate the mean outcome in the population, \(\frac{1}{m}\sum_{j=1}^m y_j\). One way of calculating the standard deviation of \(\bar Y\) is to take the standard deviation of the outcomes \(Y_1 \ldots Y_n\) and divide by \(\sqrt{n}\). But when we bootstrap \(\bar Y\), we get a distribution that looks like the distribution of our estimator, not our outcomes. It’s already narrower by a factor of \(1/\sqrt{n}\). You even calculate the standard deviation of \(Y_1 \ldots Y_n\)—it’s 1.11—divide it by \(\sqrt{n}\approx 32\), and compare it to the bootstrap standard deviation of \(\bar Y\), which is 0.033. But you really want to end your explanation with a flourish. You say this. If you want to think of the standard deviation of \(Y_1 \ldots Y_n\)—the one you’d divide by \(\sqrt{n}\)—in bootstrap terms, it’s the bootstrap standard deviation of an estimator you’d never actually use. It’s the standard deviation of …

Exercise 5.3  

Finish the sentence. What simple estimator has bootstrap standard deviation \(\sigma=\sqrt{\frac{1}{n}\sum_{i=1}^n\{Y_i - \bar Y\}^2}\), the same standard deviation as the sample \(Y_1 \ldots Y_n\)?

Tip. There’s more than one. Any one you can think of is fine.

The estimator \(\hat\theta = Y_1\). Or \(\hat\theta=Y_2\), \(\hat\theta=Y_3\), etc.

When we bootstrap the estimator \(\hat\theta=Y_1\), the bootstrap estimators \(\hat\theta^\star=Y_1^\star\) are just observations drawn uniformly at random from the sample \(Y_1 \ldots Y_n\), so their variance is the variance of the sample \(Y_1 \ldots Y_n\).
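If you want to see this numerically, here’s a small sketch, with a made-up sample standing in for \(Y_1 \ldots Y_n\).

library(purrr)

# The bootstrap draws of the estimator thetahat = Y_1 are single observations
# drawn uniformly at random from the sample, so their standard deviation is,
# up to Monte Carlo error, the standard deviation of the sample itself.
Y = rnorm(25)                                                    # stand-in sample
Y1.star = 1:10000 |> map_vec(function(.) { Y[sample(1:length(Y), 1)] })
sd(Y1.star)                     # bootstrap sd of the estimator Y_1
sqrt(mean((Y - mean(Y))^2))     # the standard deviation of the sample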

Part 4

When you and your coworker bring a summary of your findings to your boss, they’re annoyed. They say this.

Why are you doing all this complicated nonsense?

They pull up the data and subtract the mean of the red points from the mean of the green points. They estimate the standard deviation of their estimator \(\hat\Delta_{\text{raw}}\) to be 0.07, which gives them a 95% confidence interval of \(\hat\Delta_{\text{raw}} \pm 0.13\).

Much better! Simple. And more precise. Your interval was \(\pm 0.14\). Mine is \(\pm 0.13\).

But width isn’t everything. Their interval isn’t just narrower. It’s nowhere near yours.

Exercise 5.4  

Is their point estimate \(\hat\Delta_{\text{raw}}\) bigger or smaller than your point estimate \(\hat\Delta_1\)? Explain why.

It’s bigger. This is like Example A in Exercise 3.1: \(\textcolor[RGB]{248,118,109}{\hat\mu(0, x)}\) is increasing and there’s covariate shift from left (where it’s small) to right (where it’s big).

Part 5

Your coworker decides to do a little simulation to test out all the new stuff they’ve learned.

They make up a population of 10,000 people. The population they’ve made up is plotted above. They draw samples of size 1000 repeatedly to compare the sampling distributions of four different estimators of \(\Delta_1\).

  1. The estimator they were using at the beginning, in Section 5.1, with \(\hat\mu\) the least squares predictor in the additive model.
  2. The estimator you calculated in Exercise 5.2. You used the least squares predictor in the all functions model.
  3. An estimator like the one they used in Section 5.1, but based on the inverse probability weighted least squares predictor in the additive model.
  4. Your boss’s estimator \(\hat\Delta_{\text{raw}}\) from Section 5.4.

From what they see, none of these estimators is biased. The sampling distributions of all four, plotted below, are centered at the estimation target \(\Delta_1\).

Exercise 5.5  

What is it about this population that makes this happen?

No need for a long explanation. A sentence should be enough.

What is it about the population? There’s no covariate shift: the red and green dots have the same covariate distribution.
That was all I was asking for, but I’ll explain.

Why this matters. When there’s no covariate shift, it doesn’t matter how we adjust for covariate shift.

How do we know this? This is proven in our Lab on Misspecification and Averaging. How do we remember it? Here are two ideas.

  1. When there’s no covariate shift, unweighted least squares is inverse probability weighted least squares: the inverse probability weights \(\gamma(w,x)=p_{x \mid 1} / p_{x \mid w}\) are one.

  2. The ‘matched term’ \(\sum_{x} \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \textcolor[RGB]{0,191,196}{\hat\mu(1, x)}\) in \(\Delta_1\) is unbiased no matter what model we use.⁴ When there’s no covariate shift (\(\textcolor[RGB]{0,191,196}{p_{x \mid 1}}=\textcolor[RGB]{248,118,109}{p_{x \mid 0}}\)), both terms in \(\Delta_1\) are effectively matched terms, as \(\sum_{x} \textcolor[RGB]{0,191,196}{p_{x \mid 1}} \textcolor[RGB]{248,118,109}{\hat\mu(0, x)} = \sum_{x} \textcolor[RGB]{248,118,109}{p_{x \mid 0}}\textcolor[RGB]{248,118,109}{\hat\mu(0, x)}\).

Some intuition that might help remember the ‘matched term’ thing is that one of the ‘least squares things to do’ is to get the mean right. If we average the least squares predictor in any model (that includes a constant) over the sample it’s fit to, we just get that sample mean. When we use, e.g. the ‘not necessarily parallel lines model’, we’re really just fitting \(\textcolor[RGB]{248,118,109}{\hat\mu(0, \cdot)}\) to the red dots and \(\textcolor[RGB]{0,191,196}{\hat\mu(1, \cdot)}\) to the green ones, so the matched terms are really just averages of \(\textcolor[RGB]{248,118,109}{\hat\mu(0, \cdot)}\) and \(\textcolor[RGB]{0,191,196}{\hat\mu(1, \cdot)}\) over the samples they’re fit to.
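In case it helps to see it written down, here’s a rough sketch of an inverse probability weighted least squares estimator of \(\Delta_1\) in the additive model, like estimator 3 in the list above. The column names match sam, and estimating the weights \(\gamma(w,x)=p_{x \mid 1}/p_{x \mid w}\) by empirical proportions is my choice; treat it as one way it might look, not the way it was done.

library(dplyr)

estimate.Delta.1.ipw = function(sam) {
  # Estimate p(x | w) from the sample and form gamma(w, x) = p(x | 1) / p(x | w).
  p.xw = sam |> count(w, x) |> group_by(w) |>
    mutate(p.x.given.w = n / sum(n)) |> ungroup() |> select(w, x, p.x.given.w)
  p.x1 = p.xw |> filter(w == 1) |> select(x, p.x.given.1 = p.x.given.w)
  weighted.sam = sam |>
    left_join(p.xw, by = c("w", "x")) |>
    left_join(p.x1, by = "x") |>
    mutate(gamma = p.x.given.1 / p.x.given.w)
  # Weighted least squares in the additive model, then average the difference
  # in predictions over the W = 1 group, as in estimate.Delta.1 above.
  muhat   = lm(y ~ w + factor(x), data = weighted.sam, weights = gamma)
  treated = sam |> filter(w == 1)
  mean(predict(muhat, treated |> mutate(w = 1)) -
       predict(muhat, treated |> mutate(w = 0)))
}

Notice that when there’s no covariate shift, the gamma column is identically one and this reduces to the unweighted fit, which is the point of item 1 above.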

Something to remember. If you’re using simulation to understand methods for adjusting for covariate shift, don’t simulate data with no covariate shift. People do it a lot more often than you’d expect.

Part 6

Your coworker has lost confidence in their ability to make up fake populations that they can use to test estimators. To try to test their four estimators again, they get a fake population from an old paper published in Statistical Science.

Now they’re angry. The problem they had in their last simulation is gone—your boss’s estimator, at least, is biased. But they’re still not seeing any problem with their estimator from Section 5.1. All the changes they’ve considered have increased their estimator’s variance without meaningfully reducing bias.

Exercise 5.6  

What’s going on? What is it about this population that makes this happen?

No need for a long explanation. A sentence should be enough.

There isn’t bias with the additive model because it’s not misspecified. The difference between the red and green means at each \(x\) is constant: \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) and \(\textcolor[RGB]{0,191,196}{\mu(1, x)}\) are parallel lines.

Something to remember. Inverse probability weighting protects you from bias due to misspecification. If you simulate data in which the model you’re using isn’t misspecified, it looks like you don’t need it.

A dangerous interpretation. You could read this as saying that if you use a model that’s not misspecified, you don’t need inverse probability weighting. But for the kinds of models we’ve talked about, with the exception of the ‘all functions model’, misspecification is pretty much a certainty when you’re analyzing real data. \(\mu(w,x)\) is the mean of an actual group of people in your population. If an additive model (or lines, or whatever) actually did go through the means for the first \(m-1\) people in your population perfectly, adding the last one will break that. That doesn’t mean models like this are badly misspecified, but in larger sample sizes, even small amounts of misspecification can be a disaster for coverage.

Exercise 5.7  

When they use the same axes for all the plots, the sampling distribution of your boss’s estimator \(\hat\Delta_{\text{raw}}\) just isn’t there. That estimator is so biased that its sampling distribution is off the plot to one side or the other. Which side is it on? Is the bias of \(\hat\Delta_{\text{raw}}\) positive or negative?

It’s to the right. \(\textcolor[RGB]{248,118,109}{\mu(0, x)}\) is decreasing and we have covariate shift from right to left, which makes \(\Delta_{\text{raw}}\) bigger than \(\Delta_1\).

Appendix

Figure 6.1: The coverage of a 95% confidence interval \(\hat\theta \pm 1.96\ \mathop{\mathrm{sd}}[\hat\theta]\) when \(\hat\theta\) has a normal sampling distribution as a function of the bias/standard deviation ratio.

Figure 6.2: This is a plot of \(f(x) = x/\sqrt{x(1-x)}\).

Footnotes

  1. If we want to be verbose, it’s the least squares predictor in the ‘lines model’ \(\mathcal{M}=\{m(x)=a+b x\}\)↩︎

  2. They don’t say ‘the difference \(\textcolor[RGB]{0,191,196}{\hat\mu(1, 0)} - \textcolor[RGB]{248,118,109}{\hat\mu(0, 0)}\)’. We’ll get to that in the next section.↩︎

  3. They describe what they’re estimating as the average treatment effect on the treated (ATT), but the treatment \(W_i\) isn’t randomized; the people in our population chose it themselves. People, myself included, do this pretty often. We say average treatment effect on the treated (ATT), average treatment effect on the controls (ATC), and average treatment effect (ATE) to refer to \(\Delta_1\), \(\Delta_0\), and \(\Delta_\text{all}\) respectively even outside the context of randomized experiments. Sometimes that’s because they really want to estimate those treatment effects and they’re doing their best despite lack of randomization. That’s what’s going on in Section 2.1 of Kim et al. (2023). But sometimes it’s just something we do because we don’t have a good alternative: there isn’t standard terminology that differentiates between \(\Delta_1\), \(\Delta_0\), and \(\Delta_\text{all}\) that isn’t about causality. Sometimes it’s something in between, like when people talk about estimating the ‘effect of race’ or ‘effect of gender’; they want to be making a causal claim but what they’re calling a treatment is a social construct and what exactly our potential outcomes mean is unclear. If you’re interested in questions like that, take a look at Issa Kohler-Hausmann’s paper Eddie Murphy and the Danger of Counterfactual Causal Thinking About Detecting Racial Discrimination.↩︎

  4. Within reason. We need group indicators in our model.↩︎