Week 9 Homework
\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{V} \DeclareMathOperator{\hVar}{\widehat{V}} \DeclareMathOperator{\bias}{bias} \]
Summary
This one is about interval estimation for estimation targets that are complex summaries of a population. We’ll focus on the summaries we discussed in our lecture on Multivariate Analysis and Covariate Shift. The lesson here is that calibration works exactly the same way it did when we were just estimating simpler summaries like means. Because of that, parts of this assignment will be very similar to things you did in the Week 2 and 3 Homeworks.1
I think working through some of those same problems in this more general context can help make these steps feel a bit more meaningful. For example, in Section 4 below, we’ll be checking that our interval estimates are properly calibrated when we use them on fake data. When you do this for a really simple summary, the process is almost too transparent. It doesn’t feel like you’re really making up a fake population, checking that it looks right, and seeing what happens when you’re sampling from it. You can see through the whole process from start to finish. In this more complex setting, you have to go step by step a bit more.
Because coverage calculations involve a fair amount of code and just calculating these estimators involves some too, this is a pretty code-heavy assignment. And when you’re new to that kind of programming, it can feel really overwhelming trying to get two pieces like this to fit together. Organizing your code well really helps. Unfortunately, some of the organizational principles you may have learned in other programming classes — e.g. the kind of Object Oriented Programming where all the examples are about bank accounts — tend to be counterproductive for this kind of work. In my experience, the best way to organize your code when you’re doing something like this is to make it look as much like the math as you can. I’ll demonstrate that for you by doing versions of a lot of the same calculations I’m asking you to do.
Please don’t hesitate to reach out if you’re having trouble with any part of this. Learning new programming styles and new mathematical ideas at the same time can be a lot. And Google and ChatGPT may not be as helpful as you’d like. There’s a lot of content on the web about this programming style and these mathematical ideas, but there’s not much that talks about both at the same time.
We’ll use the tidyverse package.
Review
Let’s review the summaries we’ll be working with. We’ll stick with the same California income data we’ve used in lecture. Here are plots to think about while we review.
Consider the population \((w_1,x_1,y_1), \ldots, (w_m,x_m,y_m)\) you see above on the right. To summarize it, we might talk about the mean \(\mu(w,x)\) in each column, the number of dots \(m_{wx}\) in each column, and the number of red (w=0) and green (w=1) dots \(m_w\) overall. That is, the mean income among people with a given education level and sex, the number of people with a given education level and sex, and the number of male and female people in our population, respectively. In mathematical notation, these are: \[ \begin{aligned} \mu(w,x)&=\frac{1}{m_{wx}}\sum_{j:w_j=w,x_j=x} y_j && \text{ column means } \\ m_{wx} &= \sum_{j:w_j=w,x_j=x} 1 && \text{number of dots in column} \\ m_w &= \sum_{j:w_j=w} 1 && \text{total number of red (w=0) or green (w=1) dots} \end{aligned} \] In the plot, we see these means \(\mu(w,x)\) represented as pointranges (⍿) and the proportion of people of a given sex with a given level of education, \(m_{wx}/m_w\), represented by the heights of red and green bars.2
In terms of these, we can summarize the income disparity between the male and female people in our population a few pretty reasonable ways. We call the first one below the raw difference in means and the remaining three, which are all weighted averages of the within-education-level means \(\mu(1,x)-\mu(0,x)\), differences that are adjusted for education level. \[ \begin{aligned} \Delta_{\text{raw}} &= \frac{1}{m_1}\sum_{j:w_j=1} y_j - \frac{1}{m_0}\sum_{j:w_j=0} y_j \\ \Delta_0 &=\frac{1}{m_0}\sum_{j:w_j=0} \qty{\mu(1,x_j) - \mu(0, x_j)} \\ \Delta_1 &= \frac{1}{m_1}\sum_{j:w_j=1} \qty{\mu(1,x_j) - \mu(0, x_j)} \\ \Delta_{\text{all}} &= \frac{1}{m}\sum_{j=1}^m \qty{\mu(1,x_j) - \mu(0, x_j)} \end{aligned} \]
To estimate them, we can do exactly the same thing with the sample. Letting \(\hat\mu\), \(N_{wx}\), and \(N_w\) be sample versions of \(\mu\), \(m_{wx}\), and \(m_w\), we can calculate our estimators like this.
\[
\begin{aligned}
\hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\
\hat \Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\
\hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\
\hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)}
\end{aligned}
\]
As discussed in our lecture on Least Squares in Linear Models, the column means in the sample are really just one estimate of the mean of the corresponding column in the population. We can estimate our adjusted differences by plugging any estimate \(\hat\mu\) of the function \(\mu\) into the formulas above. The column means are the least squares estimate in the ‘all functions’ regression model, i.e., \[ \begin{aligned} \hat\mu(w,x) = \frac{1}{N_{wx}} \sum_{i:W_i=w,X_i=x} Y_i \qqtext{ minimizes } & \frac{1}{n}\sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \\ &\qqtext{ over the set of all functions $m(w,x)$}. \end{aligned} \] We get other estimates \(\hat\mu\) by minimizing the sum of squared errors over other (smaller) sets of functions of \(w\) and \(x\). For example, the ones discussed here in that lecture. In this assignment, we’ll hold off on that until our last exercise.
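For concreteness, here’s a minimal sketch of what plugging in a different \(\hat\mu\) could look like, assuming the data frame `sam` (with columns w, x, and y) that we set up in the next section. The additive model below is just one illustrative choice, not necessarily the one from lecture; as noted, we’ll hold off on actually using anything like this until the last exercise.

# A sketch only: an alternative muhat fit by least squares in the additive
# model y ~ w + x rather than the 'all functions' model. Assumes the data
# frame sam (columns w, x, y) defined in the Estimation section below.
additive.fit   = lm(y ~ w + x, data = sam)
muhat.additive = function(w, x) {
  # as.logical(w) matches how w is coded in sam (TRUE for female)
  predict(additive.fit, newdata = data.frame(w = as.logical(w), x = x))
}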
Estimation
We’re going to work with the same California income data we used in our last homework. But this time, instead of doing everything by hand using summaries and tables, we’ll write code that works with the raw data. You can get it like this.
ca = read.csv('https://qtm285-1.github.io/assets/data/ca-income.csv')
sam = data.frame(
  w = ca$sex == 'female',
  x = ca$education,
  y = ca$income)
We’ll use some tidyverse stuff in the calculations below.
library(tidyverse)
Calculating Point Estimates
As a warm-up, we’ll repeat Exercise 5 from our last homework. We’ll calculate the following four summaries of the income discrepancy between male and female residents of California. \[ \begin{aligned} \hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\ \hat \Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \end{aligned} \] But, instead of doing it by hand using a table, we’ll do it with code. We’ll calculate each one in three ways.
- Using code that looks like the formulas above, i.e., code that sums over people.
- With code that looks like the equivalent ‘histogram form’ formulas. You needed these for last week’s Exercise 5.
- With code that looks like the ‘general form’, \(\sum_{w,x} \hat\alpha(w,x) \hat\mu(w,x)\), that we talked about in our lecture on inference for complex summaries.
Ultimately, we don’t really need three implementations of the same thing. We could get by with just the ‘general form’ code. We can use it to calculate point estimates and it’s what we’ll ultimately need to do variance calculations. If you like, you can think of the other two as steps along the way.
I’ll get you started by doing \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_1\). I like to write code that looks almost exactly like what I’ve written in mathematical notation. From what I’ve seen, when people do that, they tend to introduce fewer bugs when they go from math to code. And the code tends to be easier to understand, too.
I’ll start by defining three functions: \(\hat\mu(w,x)\), \(\hat\sigma(w,x)\), and \(N_{w,x}\). Everything else I need, e.g. the histogram height \(P_{x \mid 1}(x)\) that appears in the equivalent ‘histogram form’ expression for \(\hat\Delta_1\), can be computed from these.
summaries = sam |>
  group_by(w,x) |>
  summarize(muhat = mean(y), sigmahat = sd(y), Nwx = n(), .groups='drop')

summary.lookup = function(column.name, summaries, default=NA) {
  function(w,x) {
    out = left_join(data.frame(w=w, x=x),
                    summaries[, c('w','x',column.name)],
                    by=c('w','x')) |>
      pull(column.name)
    out[is.na(out)] = default
    out
  }
}

muhat    = summary.lookup('muhat', summaries)
sigmahat = summary.lookup('sigmahat', summaries)
Nwx      = summary.lookup('Nwx', summaries)

W = sam$w
X = sam$x
Y = sam$y
x = unique(X)
n = length(Y)
A few notes on this code:

1. If you’re unfamiliar with code that looks like this, it helps to think about the types of `summary.lookup`’s input and output. The input is the easy part: its arguments are a column name and a data frame—one that should have columns named ‘w’, ‘x’, and `column.name`. Its output is a function. It’s a function that takes as input two vectors of the same length—a vector of values for ‘w’ and a vector of values for ‘x’—and returns the corresponding values from the `column.name` column of the data frame. This basic pattern—writing functions that return other functions—is a big part of what makes functional programming effective in the right hands. There’s a tiny standalone example of the pattern just after these notes.
2. What we’re doing here is taking each row in the table `data.frame(w=w,x=x)` and adding to it a column `column.name` with values taken from the data frame `summaries` where the ‘w’ and ‘x’ match. If there is no row in `summaries` corresponding to some pair ‘w’ and ‘x’ we’ve asked for—i.e. some row in `data.frame(w=w,x=x)`—this puts `NA` in that column.
3. If `default` is passed, then we return `default` instead of `NA` for any row in `data.frame(w=w,x=x)` that doesn’t have a corresponding row in `summaries`.
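If the function-returning-a-function pattern in `summary.lookup` is new to you, here’s a tiny standalone example of the same idea. The names here are made up purely for illustration.

# make.adder takes a number and returns a *function* that adds that number,
# just like summary.lookup takes a column name and returns a lookup function.
make.adder = function(a) {
  function(b) { a + b }
}
add.two = make.adder(2)
add.two(10)   # returns 12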
This is everything I need to do calculations that look like the formulas above.
Deltahat.raw = mean(Y[W==1]) - mean(Y[W==0])
Deltahat.1   = mean(muhat(1,X[W==1]) - muhat(0,X[W==1]))
To do calculations that look like the ‘histogram form’ formulas, I need \(P_{x \mid 0}\) (`Px.0` in the code below), \(P_{x \mid 1}\) (`Px.1` in the code below), and \(P_x\) (`Px` in the code below). I can define these in terms of the function \(N_{w,x}\) implemented above.
Nw   = function(w) { sum(Nwx(w, x)) }
Px.1 = function(x) { Nwx(1,x)/Nw(1) }
Px.0 = function(x) { Nwx(0,x)/Nw(0) }
Px   = function(x) { (Nwx(0,x)+Nwx(1,x)) / n }
Here’s how I calculate \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_1\) using the ‘histogram form’ formulas.
Deltahat.raw.histform = sum(Px.1(x)*muhat(1,x)) - sum(Px.0(x)*muhat(0,x))
Deltahat.1.histform   = sum(Px.1(x)*(muhat(1,x) - muhat(0,x)))
To do calculations that look like the ‘general form’, \(\sum_{w,x} \hat\alpha(w,x) \hat\mu(w,x)\), I need two new things. First, I need a function \(\hat\alpha(w,x)\) for each estimation target. I’ll start by writing them out in mathematical notation. To do this, I take a look at the ‘histogram form’ formulas and pull out the coefficients on each instance of \(\hat\mu(w,x)\).
\[
\begin{aligned}
\hat\Delta_\text{raw} &= \sum_{x} P_{x \mid 1}(x) \ \hat\mu(1,x) - \sum_{x} P_{x \mid 0}(x) \ \hat\mu(0,x) \\
&= \sum_{w,x} \hat\alpha_\text{raw}(w,x) \ \hat\mu(w,x) \qfor \begin{cases} \hat\alpha_\text{raw}(w,x) = P_{x \mid 1}(x) & \text{if } w=1 \\ \hat\alpha_\text{raw}(w,x) = -P_{x \mid 0}(x) & \text{if } w=0 \end{cases} \\
\hat\Delta_1 &= \sum_{x} P_{x \mid 1}(x) \qty{ \hat\mu(1,x) -\hat\mu(0,x)} \\
&= \sum_{w,x} \hat\alpha_1(w,x) \ \hat\mu(w,x) \qfor \begin{cases} \hat\alpha_1(w,x) = P_{x \mid 1}(x) & \text{if } w=1 \\ \hat\alpha_1(w,x) = -P_{x \mid 1}(x) & \text{if } w=0 \end{cases}
\end{aligned}
\]
Now, having done this, I can implement functions that look like that in R.
alphahat.raw = function(w,x) { ifelse(w==1, Px.1(x), -Px.0(x)) }
alphahat.1   = function(w,x) { ifelse(w==1, Px.1(x), -Px.1(x)) }
Now that I have these, I can calculate \(\hat\Delta_\text{raw}\) and \(\hat\Delta_1\) by plugging them into the function `general.form.estimate` defined below, which sums the product \(\hat\alpha(w,x) \hat\mu(w,x)\) over all pairs \((w,x)\).
general.form.estimate = function(alphahat, muhat, w, x) {
  grid = expand_grid(w=w, x=x)
  ww = grid$w
  xx = grid$x
  sum(alphahat(ww,xx)*muhat(ww,xx))
}

Deltahat.raw.generalform = general.form.estimate(alphahat.raw, muhat, w=c(0,1), x=x)
Deltahat.1.generalform   = general.form.estimate(alphahat.1, muhat, w=c(0,1), x=x)
A note on this code: `general.form.estimate` is reusable. Just make sure you pass all levels of \(w\) and \(x\) to the function.
And to check for mistakes, we can look to see if all three methods give the same answer.3
c(Deltahat.raw, Deltahat.raw.histform, Deltahat.raw.generalform)
[1] -11824.38 -11824.38 -11824.38
c(Deltahat.1, Deltahat.1.histform, Deltahat.1.generalform)
[1] -15092.87 -15092.87 -15092.87
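If you’d rather have R check this than eyeball the printout, something like the following works: `all.equal` tolerates tiny floating-point differences, and `stopifnot` throws an error if any pair disagrees.

# Optional sanity check: error out if the three implementations disagree.
stopifnot(all.equal(Deltahat.raw, Deltahat.raw.histform),
          all.equal(Deltahat.raw, Deltahat.raw.generalform),
          all.equal(Deltahat.1,   Deltahat.1.histform),
          all.equal(Deltahat.1,   Deltahat.1.generalform))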
Now it’s your turn.
Calibrating Interval Estimates
Now that we’ve got all this machinery in place, we’re ready to calibrate some interval estimates around them. We’ll start by bootstrapping the point estimates we just calculated. As in the previous exercise, I’ll do it for \(\hat\Delta_\text{raw}\) and \(\hat\Delta_1\) and you’ll do it for \(\hat\Delta_0\) and \(\hat\Delta_\text{all}\). You can re-use a lot of my code, so all you’ll need to do is add a few lines.
bootstrap.estimates = 1:10000 |> map(function(.) {
  I.star = sample(1:n, n, replace=TRUE)
  Wstar = W[I.star]
  Xstar = X[I.star]
  Ystar = Y[I.star]

  summaries.star = sam[I.star,] |>
    group_by(w,x) |>
    summarize(muhat = mean(y), .groups='drop')
  muhat.star = summary.lookup('muhat', summaries.star, default=mean(Ystar))

  data.frame(
    Deltahat.raw = mean(Ystar[Wstar==1]) - mean(Ystar[Wstar==0]),
    Deltahat.1   = mean(muhat.star(1,Xstar[Wstar==1]) -
                        muhat.star(0,Xstar[Wstar==1]))
  )
}) |> bind_rows()
A few notes on this code:

1. Do what follows 10,000 times.
2. Sample with replacement from the sample to get a bootstrap sample.
3. Summarize the bootstrap sample to get a bootstrap-sample version of the function \(\hat\mu(w,x)\). See Note 1 below for what passing `default` is doing in the call to `summary.lookup`.
4. Calculate the estimation target using the bootstrap-sample stuff. All you’ll need to do is add a few lines here.
5. This is how we get a data frame with 10,000 rows and two columns, ‘Deltahat.raw’ and ‘Deltahat.1’. What `map` gives us is a list of 10,000 data frames with one row each (and the right two columns). `bind_rows` will combine them into one data frame.
Here are histograms showing the bootstrap sampling distributions of \(\hat\Delta_\text{raw}\) and \(\hat\Delta_1\). I’ve used the function `width` to calculate the width of the middle 95% of each distribution and draw that, as well as the distribution’s mean, in as vertical lines. And I’ve added my point estimate as a blue dot, too.
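(If you don’t already have `width` from lecture loaded, here’s my guess at what it computes. The lecture version, and the `width_annotations` helper that draws the corresponding lines on the plot, may differ in the details.)

# A guess at the width function used above: the width of the middle 95% of a
# vector of draws, i.e. the distance between its 2.5% and 97.5% quantiles.
width = function(draws, alpha = 0.05) {
  unname(diff(quantile(draws, c(alpha/2, 1 - alpha/2))))
}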
scales = list(scale_y = scale_y_continuous(breaks=seq(0, .0002, by=.00002)),
              scale_x = scale_x_continuous(breaks=seq(-40000, 10000, by=10000)),
              coord_cartesian(xlim=c(-40000, 10000), ylim=c(0, .0002)))

bootstrap.plot.raw = ggplot() +
  geom_histogram(aes(x=Deltahat.raw, y=after_stat(density)),
                 bins=50, alpha=.2, data=bootstrap.estimates) +
  geom_point(aes(x=Deltahat.raw, y=.00005), color='blue') +
  width_annotations(bootstrap.estimates$Deltahat.raw)

bootstrap.plot.1 = ggplot() +
  geom_histogram(aes(x=Deltahat.1, y=after_stat(density)),
                 bins=50, alpha=.2, data=bootstrap.estimates) +
  geom_point(aes(x=Deltahat.1, y=.00005), color='blue') +
  width_annotations(bootstrap.estimates$Deltahat.1)

bootstrap.plot.raw + scales
bootstrap.plot.1   + scales
And here are 95% intervals around my point estimates, calibrated using the bootstrap.
bootstrap.interval.raw = Deltahat.raw +
  c(-1,1)*width(bootstrap.estimates$Deltahat.raw)/2
bootstrap.interval.1 = Deltahat.1 +
  c(-1,1)*width(bootstrap.estimates$Deltahat.1)/2

bootstrap.interval.raw
[1] -17265.902 -6382.859
bootstrap.interval.1
[1] -20400.126 -9785.606
Here’s a review exercise on connecting what we see in the plot to the intervals we’ve reported.
Now it’s your turn to do all this for \(\hat\Delta_0\) and \(\hat\Delta_\text{all}\).
Let’s check that these are pretty similar to what we get using normal approximation. We’ll use our variance formula from our lecture on Inference for Complicated Summaries. And, as in lecture, we’ll ignore the term that reflects the randomness of our coefficients. The one written in red in that lecture.
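To be concrete, here’s the formula we’re implementing, as I read it, with that red term dropped:

\[
\hVar = \sum_{w,x} \hat\alpha(w,x)^2 \ \frac{\hat\sigma^2(w,x)}{N_{wx}}
\]

where \(\hat\sigma(w,x)\) is the sample standard deviation of income within the column \((w,x)\) (the `sigmahat` we computed earlier), and the corresponding 95% interval is \(\hat\Delta \pm 1.96\sqrt{\hVar}\).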
grid = expand_grid(w=c(0,1), x=x)
ww = grid$w
xx = grid$x

Vhat.raw = sum(alphahat.raw(ww,xx)^2 * sigmahat(ww,xx)^2/Nwx(ww,xx))
Vhat.1   = sum(alphahat.1(ww,xx)^2 * sigmahat(ww,xx)^2/Nwx(ww,xx))
Here are the plots above with our normal-approximation-based estimates of the sampling distribution drawn on top of the bootstrap ones.
normal.interval.raw = Deltahat.raw + 1.96*c(-1,1)*sqrt(Vhat.raw)
normal.interval.1   = Deltahat.1 + 1.96*c(-1,1)*sqrt(Vhat.1)

normal.interval.raw
[1] -16896.428 -6752.332
normal.interval.1
[1] -20346.148 -9839.583
Checking Your Estimates using Fake Data
At this point, we have confidence intervals. But this is our first time doing inference for complicated summaries like this, so it’s worth checking that they have the coverage we want. We can’t do that directly because coverage is about the relationship between our estimation target and our estimator’s sampling distribution and we don’t know either—those are functions of our population and we only have a sample. What we can do is try all this out using fake data.
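If it helps to see the logic in miniature first, here’s a toy version of a coverage check: a made-up population and the sample mean as the estimator, like in those earlier homeworks. The real check will do the same thing, just with our \(\hat\Delta\) estimators and the fake population we’re about to build.

# A toy coverage check, purely for illustration: made-up population, sample
# mean as the estimator, normal-approximation interval. Roughly 95% of the
# intervals should contain the target.
toy.pop = rexp(100000, rate = 1/50000)    # a made-up, income-like population
theta   = mean(toy.pop)                   # the estimation target
covers  = 1:1000 |> map_lgl(function(.) {
  toy.sam  = sample(toy.pop, 100, replace = TRUE)
  interval = mean(toy.sam) + 1.96 * c(-1, 1) * sd(toy.sam) / sqrt(100)
  interval[1] <= theta && theta <= interval[2]
})
mean(covers)                              # should be close to 0.95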
Simulating Fake Data
To get a fake population that looks about right, we’ll use a two-step process.5
- To get pairs \((w_1,x_1), \ldots, (w_m,x_m)\) with a distribution that looks like the one in our sample, we’ll just draw them (with replacement) from our sample.
- To get corresponding outcomes \(y_1 \ldots y_m\) that look like the ones in our sample, we’ll add some normally-distributed fuzz to the column means within the sample. We’ll choose the amount of fuzz so that the column standard deviations in our population match those in the sample.
set.seed(1)
m = n*10
I = sample(1:n, m, replace=TRUE)
pop = sam[I,]

fuzz = rnorm(m)*sigmahat(pop$w, pop$x)
pop$y = muhat(pop$w, pop$x) + fuzz

pop.summaries = pop |>
  group_by(w,x) |>
  summarize(mu = mean(y), sigma = sd(y), Nwx = n(), .groups='drop')
And now that we have a population, we can sample with replacement from it.
J = sample(1:m, n, replace=TRUE)
fake.sam = pop[J,]
Here’s the actual sample, the fake population, and the sample we just drew from it. Not bad!
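If you’d like a numerical check to go with the plots, one option is to line up the column summaries of the fake population next to those of the sample it was built from. They should be close, but not identical, since we added normally-distributed fuzz.

# Optional check: compare the sample's column summaries (muhat, sigmahat)
# to the fake population's (mu, sigma), matched on w and x.
left_join(summaries, pop.summaries, by=c('w','x'))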