Week 9 Homework
\[ \DeclareMathOperator{\E}{E} \DeclareMathOperator{\Var}{V} \DeclareMathOperator{\hVar}{\widehat{V}} \DeclareMathOperator{\bias}{bias} \]
Summary
This one is about interval estimation for estimation targets that are complex summaries of a population. We’ll focus on the summaries we discussed in our lecture on Multivariate Analysis and Covariate Shift. The lesson here is that calibration works exactly the same way it did when we were just estimating simpler summaries like means. Because of that, parts of this assignment will be very similar to things you did in the Week 2 and 3 Homeworks.1
I think working through some of those same problems in this more general context can help make these steps feel a bit more meaningful. For example, in Section 4 below, we’ll be checking that our interval estimates are properly calibrated when we use them on fake data. When you do this for a really simple summary, the process is almost too transparent. It doesn’t feel like you’re really making up a fake population, checking that it looks right, and seeing what happens when you’re sampling from it. You can see through the whole process from start to finish. In this more complex setting, you have to go step by step a bit more.
Because coverage calculations involve a fair amount of code and just calculating these estimators involves some too, this is a pretty code-heavy assignment. And when you’re new to that kind of programming, it can feel really overwhelming trying to get two pieces like this to fit together. Organizing your code well really helps. Unfortunately, some of the organizational principles you may have learned in other programming classes — e.g. the kind of Object Oriented Programming where all the examples are about bank accounts — tend to be counterproductive for this kind of work. In my experience, the best way to organize your code when you’re doing something like this is to make it look as much like the math as you can. I’ll demonstrate that for you by doing versions of a lot of the same calculations I’m asking you to do.
Please don’t hesitate to reach out if you’re having trouble with any part of this. Learning new programming styles and new mathematical ideas at the same time can be a lot. And Google and ChatGPT may not be as helpful as you’d like. There’s a lot of content on the web about this programming style and these mathematical ideas, but there’s not much that talks about both at the same time.
We’ll use the tidyverse package.
Review
Let’s review the summaries we’ll be working with. We’ll stick with the same California income data we’ve used in lecture. Here are plots to think about while we review.
Consider the population \((w_1,x_1,y_1), \ldots, (w_m,x_m,y_m)\) you see above on the right. To summarize it, we might talk about the mean \(\mu(w,x)\) in each column, the number of dots \(m_{wx}\) in each column, and the number of red (w=0) and green (w=1) dots \(m_w\) overall. That is, the mean income among people with a given education level and sex, the number of people with a given education level and sex, and the number of male and female people in our population, respectively. In mathematical notation, these are: \[ \begin{aligned} \mu(w,x)&=\frac{1}{m_{wx}}\sum_{j:w_j=w,x_j=x} y_j && \text{ column means } \\ m_{wx} &= \sum_{j:w_j=w,x_j=x} 1 && \text{number of dots in column} \\ m_w &= \sum_{j:w_j=w} 1 && \text{total number of red (w=0) or green (w=1) dots} \end{aligned} \] In the plot, we see these means \(\mu(w,x)\) represented as pointranges (⍿) and the proportion of people of a given sex with a given level of education, \(m_{wx}/m_w\), represented by the heights of red and green bars.2
In terms of these, we can summarize the income disparity between the male and female people in our population a few pretty reasonable ways. We call the first one below the raw difference in means and the remaining three, which are all weighted averages of the within-education-level means \(\mu(1,x)-\mu(0,x)\), differences that are adjusted for education level. \[ \begin{aligned} \Delta_{\text{raw}} &= \frac{1}{m_1}\sum_{j:w_j=1} y_j - \frac{1}{m_0}\sum_{j:w_j=0} y_j \\ \Delta_0 &=\frac{1}{m_0}\sum_{j:w_j=0} \qty{\mu(1,x_j) - \mu(0, x_j)} \\ \Delta_1 &= \frac{1}{m_1}\sum_{j:w_j=1} \qty{\mu(1,x_j) - \mu(0, x_j)} \\ \Delta_{\text{all}} &= \frac{1}{m}\sum_{j=1}^m \qty{\mu(1,x_j) - \mu(0, x_j)} \end{aligned} \]
To estimate them, we can do exactly the same thing with the sample. Letting \(\hat\mu\), \(N_{wx}\), and \(N_w\) be sample versions of \(\mu\), \(m_{wx}\), and \(m_w\), we can calculate our estimators like this.
\[
\begin{aligned}
\hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\
\hat \Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\
\hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\
\hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)}
\end{aligned}
\]
As discussed in our lecture on Least Squares in Linear Models, the column means in the sample are really just one estimate of the mean of the corresponding column in the population. We can estimate our adjusted differences by plugging any estimate \(\hat\mu\) of the function \(\mu\) into the formulas above. The column means are the least squares estimate in the ‘all functions’ regression model, i.e., \[ \begin{aligned} \hat\mu(w,x) = \frac{1}{N_{wx}} \sum_{i:W_i=w,X_i=x} Y_i \qqtext{ minimizes } & \frac{1}{n}\sum_{i=1}^n \qty{ Y_i - m(W_i,X_i) }^2 \\ &\qqtext{ over the set of all functions $m(w,x)$}. \end{aligned} \] We get other estimates \(\hat\mu\) by minimizing the sum of squared errors over other (smaller) sets of functions of \(w\) and \(x\). For example, the ones discussed here in that lecture. In this assignment, we’ll hold off on that until our last exercise.
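For concreteness, here’s a minimal sketch of what plugging in a different \(\hat\mu\) could look like, assuming the data frame `sam` (with columns w, x, and y) that we set up in the next section. The additive model below is just one illustrative choice, not necessarily the one from lecture; as noted, we’ll hold off on actually using anything like this until the last exercise.

# A sketch only: an alternative muhat fit by least squares in the additive
# model y ~ w + x rather than the 'all functions' model. Assumes the data
# frame sam (columns w, x, y) defined in the Estimation section below.
additive.fit   = lm(y ~ w + x, data = sam)
muhat.additive = function(w, x) {
  # as.logical(w) matches how w is coded in sam (TRUE for female)
  predict(additive.fit, newdata = data.frame(w = as.logical(w), x = x))
}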
Estimation
We’re going to work with the same California income data we used in our last homework. But this time, instead of doing everything by hand using summaries and tables, we’ll write code that works with the raw data. You can get it like this.
ca = read.csv('https://qtm285-1.github.io/assets/data/ca-income.csv')
sam = data.frame(
  w = ca$sex == 'female',
  x = ca$education,
  y = ca$income)
We’ll use some tidyverse stuff in the calculations below.
library(tidyverse)
Calculating Point Estimates
As a warm-up, we’ll repeat Exercise 5 from our last homework. We’ll calculate the following four summaries of the income discrepancy between male and female residents of California. \[ \begin{aligned} \hat \Delta_{\text{raw}} &= \frac{1}{N_1}\sum_{i: W_i=1} Y_i - \frac{1}{N_0}\sum_{i: W_i=0} Y_i \\ \hat \Delta_0 &=\frac{1}{N_0}\sum_{i: W_i=0} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_1 &= \frac{1}{N_1}\sum_{i: W_i=1} \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \\ \hat \Delta_{\text{all}} &= \frac{1}{n} \sum_{i=1}^n \qty{\hat\mu(1,X_i) - \hat \mu(0, X_i)} \end{aligned} \] But, instead of doing it by hand using a table, we’ll do it with code. We’ll calculate each one in three ways.
- Using code that looks like the formulas above, i.e., code that sums over people.
- With code that looks like the equivalent ‘histogram form’ formulas. You needed these for last week’s Exercise 5.
- With code that looks like the ‘general form’, \(\sum_{w,x} \hat\alpha(w,x) \hat\mu(w,x)\), that we talked about in our lecture on inference for complex summaries.
Ultimately, we don’t really need three implementations of the same thing. We could get by with just the ‘general form’ code. We can use it to calculate point estimates and it’s what we’ll ultimately need to do variance calculations. If you like, you can think of the other two as steps along the way.
I’ll get you started by doing \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_1\). I like to write code that looks almost exactly like what I’ve written in mathematical notation. From what I’ve seen, when people do that, they tend to introduce fewer bugs when they go from math to code. And the code tends to be easier to understand, too.
I’ll start by defining three functions: \(\hat\mu(w,x)\), \(\hat\sigma(w,x)\), and \(N_{w,x}\). Everything else I need, e.g. the histogram height \(P_{x \mid 1}(x)\) that appears in the equivalent ‘histogram form’ expression for \(\hat\Delta_1\), can be computed from these.
summaries = sam |>
  group_by(w,x) |>
  summarize(muhat = mean(y), sigmahat = sd(y), Nwx = n(), .groups='drop')

summary.lookup = function(column.name, summaries, default=NA) {
  function(w,x) {
    out = left_join(data.frame(w=w, x=x),
                    summaries[, c('w','x',column.name)],
                    by=c('w','x')) |>
      pull(column.name)
    out[is.na(out)] = default
    out
  }
}

muhat    = summary.lookup('muhat', summaries)
sigmahat = summary.lookup('sigmahat', summaries)
Nwx      = summary.lookup('Nwx', summaries)

W = sam$w
X = sam$x
Y = sam$y
x = unique(X)
n = length(Y)
A few notes on this code:

1. If you’re unfamiliar with code that looks like this, it helps to think about the types of `summary.lookup`’s input and output. The input is the easy part: its arguments are a column name and a data frame—one that should have columns named ‘w’, ‘x’, and `column.name`. Its output is a function. It’s a function that takes as input two vectors of the same length—a vector of values for ‘w’ and a vector of values for ‘x’—and returns the corresponding values from the `column.name` column of the data frame. This basic pattern—writing functions that return other functions—is a big part of what makes functional programming effective in the right hands. There’s a tiny standalone example of the pattern just after these notes.
2. What we’re doing here is taking each row in the table `data.frame(w=w,x=x)` and adding to it a column `column.name` with values taken from the data frame `summaries` where the ‘w’ and ‘x’ match. If there is no row in `summaries` corresponding to some pair ‘w’ and ‘x’ we’ve asked for—i.e. some row in `data.frame(w=w,x=x)`—this puts `NA` in that column.
3. If `default` is passed, then we return `default` instead of `NA` for any row in `data.frame(w=w,x=x)` that doesn’t have a corresponding row in `summaries`.
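If the function-returning-a-function pattern in `summary.lookup` is new to you, here’s a tiny standalone example of the same idea. The names here are made up purely for illustration.

# make.adder takes a number and returns a *function* that adds that number,
# just like summary.lookup takes a column name and returns a lookup function.
make.adder = function(a) {
  function(b) { a + b }
}
add.two = make.adder(2)
add.two(10)   # returns 12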
This is everything I need to do calculations that look like the formulas above.
Deltahat.raw = mean(Y[W==1]) - mean(Y[W==0])
Deltahat.1   = mean(muhat(1,X[W==1]) - muhat(0,X[W==1]))
To do calculations that look like the ‘histogram form’ formulas, I need \(P_{x \mid 0}\) (`Px.0` in the code below), \(P_{x \mid 1}\) (`Px.1` in the code below), and \(P_x\) (`Px` in the code below). I can define these in terms of the function \(N_{w,x}\) implemented above.
Nw   = function(w) { sum(Nwx(w, x)) }
Px.1 = function(x) { Nwx(1,x)/Nw(1) }
Px.0 = function(x) { Nwx(0,x)/Nw(0) }
Px   = function(x) { (Nwx(0,x)+Nwx(1,x)) / n }
Here’s how I calculate \(\hat\Delta_{\text{raw}}\) and \(\hat\Delta_1\) using the ‘histogram form’ formulas.
Deltahat.raw.histform = sum(Px.1(x)*muhat(1,x)) - sum(Px.0(x)*muhat(0,x))
Deltahat.1.histform   = sum(Px.1(x)*(muhat(1,x) - muhat(0,x)))
To do calculations that look like the ‘general form’, \(\sum_{w,x} \hat\alpha(w,x) \hat\mu(w,x)\), I need two new things. First, I need a function \(\hat\alpha(w,x)\) for each estimation target. I’ll start by writing them out in mathematical notation. To do this, I take a look at the ‘histogram form’ formulas and pull out the coefficients on each instance of \(\hat\mu(w,x)\).
\[
\begin{aligned}
\hat\Delta_\text{raw} &= \sum_{x} P_{x \mid 1}(x) \ \hat\mu(1,x) - \sum_{x} P_{x \mid 0}(x) \ \hat\mu(0,x) \\
&= \sum_{w,x} \hat\alpha_\text{raw}(w,x) \ \hat\mu(w,x) \qfor \begin{cases} \hat\alpha_\text{raw}(w,x) = P_{x \mid 1}(x) & \text{if } w=1 \\ \hat\alpha_\text{raw}(w,x) = -P_{x \mid 0}(x) & \text{if } w=0 \end{cases} \\
\hat\Delta_1 &= \sum_{x} P_{x \mid 1}(x) \qty{ \hat\mu(1,x) -\hat\mu(0,x)} \\
&= \sum_{w,x} \hat\alpha_1(w,x) \ \hat\mu(w,x) \qfor \begin{cases} \hat\alpha_1(w,x) = P_{x \mid 1}(x) & \text{if } w=1 \\ \hat\alpha_1(w,x) = -P_{x \mid 1}(x) & \text{if } w=0 \end{cases}
\end{aligned}
\]
Now, having done this, I can implement functions that look like that in R.
alphahat.raw = function(w,x) { ifelse(w==1, Px.1(x), -Px.0(x)) }
alphahat.1   = function(w,x) { ifelse(w==1, Px.1(x), -Px.1(x)) }
Now that I have these, I can calculate \(\hat\Delta_\text{raw}\) and \(\hat\Delta_1\) by plugging them into the function `general.form.estimate` defined below, which sums the product \(\hat\alpha(w,x) \hat\mu(w,x)\) over all pairs \((w,x)\).
general.form.estimate = function(alphahat, muhat, w, x) {
  grid = expand_grid(w=w, x=x)
  ww = grid$w
  xx = grid$x
  sum(alphahat(ww,xx)*muhat(ww,xx))
}

Deltahat.raw.generalform = general.form.estimate(alphahat.raw, muhat, w=c(0,1), x=x)
Deltahat.1.generalform   = general.form.estimate(alphahat.1, muhat, w=c(0,1), x=x)
A note on this code: `general.form.estimate` is reusable. Just make sure you pass all levels of \(w\) and \(x\) to the function.
And to check for mistakes, we can look to see if all three methods give the same answer.3
c(Deltahat.raw, Deltahat.raw.histform, Deltahat.raw.generalform)
[1] -11824.38 -11824.38 -11824.38
c(Deltahat.1, Deltahat.1.histform, Deltahat.1.generalform)
[1] -15092.87 -15092.87 -15092.87
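If you’d rather have R check this than eyeball the printout, something like the following works: `all.equal` tolerates tiny floating-point differences, and `stopifnot` throws an error if any pair disagrees.

# Optional sanity check: error out if the three implementations disagree.
stopifnot(all.equal(Deltahat.raw, Deltahat.raw.histform),
          all.equal(Deltahat.raw, Deltahat.raw.generalform),
          all.equal(Deltahat.1,   Deltahat.1.histform),
          all.equal(Deltahat.1,   Deltahat.1.generalform))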
Now it’s your turn.
Calibrating Interval Estimates
Now that we’ve got all this machinery in place, we’re ready to calibrate some interval estimates around them. We’ll start by bootstrapping the point estimates we just calculated. As in the previous exercise, I’ll do it for \(\hat\Delta_\text{raw}\) and \(\hat\Delta_1\) and you’ll do it for \(\hat\Delta_0\) and \(\hat\Delta_\text{all}\). You can re-use a lot of my code, so all you’ll need to do is add a few lines.
bootstrap.estimates = 1:10000 |> map(function(.) {
  I.star = sample(1:n, n, replace=TRUE)
  Wstar = W[I.star]
  Xstar = X[I.star]
  Ystar = Y[I.star]

  summaries.star = sam[I.star,] |>
    group_by(w,x) |>
    summarize(muhat = mean(y), .groups='drop')
  muhat.star = summary.lookup('muhat', summaries.star, default=mean(Ystar))

  data.frame(
    Deltahat.raw = mean(Ystar[Wstar==1]) - mean(Ystar[Wstar==0]),
    Deltahat.1   = mean(muhat.star(1,Xstar[Wstar==1]) -
                        muhat.star(0,Xstar[Wstar==1]))
  )
}) |> bind_rows()
A few notes on this code:

1. Do what follows 10,000 times.
2. Sample with replacement from the sample to get a bootstrap sample.
3. Summarize the bootstrap sample to get a bootstrap-sample version of the function \(\hat\mu(w,x)\). See Note 1 below for what passing `default` is doing in the call to `summary.lookup`.
4. Calculate the estimation target using the bootstrap-sample stuff. All you’ll need to do is add a few lines here.
5. This is how we get a data frame with 10,000 rows and two columns, ‘Deltahat.raw’ and ‘Deltahat.1’. What `map` gives us is a list of 10,000 data frames with one row each (and the right two columns). `bind_rows` will combine them into one data frame.
Here are histograms showing the bootstrap sampling distributions of \(\hat\Delta_\text{raw}\) and \(\hat\Delta_1\). I’ve used the function `width` to calculate the width of the middle 95% of each distribution and draw that, as well as the distribution’s mean, in as vertical lines. And I’ve added my point estimate as a blue dot, too.
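(If you don’t already have `width` from lecture loaded, here’s my guess at what it computes. The lecture version, and the `width_annotations` helper that draws the corresponding lines on the plot, may differ in the details.)

# A guess at the width function used above: the width of the middle 95% of a
# vector of draws, i.e. the distance between its 2.5% and 97.5% quantiles.
width = function(draws, alpha = 0.05) {
  unname(diff(quantile(draws, c(alpha/2, 1 - alpha/2))))
}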
scales = list(scale_y = scale_y_continuous(breaks=seq(0, .0002, by=.00002)),
              scale_x = scale_x_continuous(breaks=seq(-40000, 10000, by=10000)),
              coord_cartesian(xlim=c(-40000, 10000), ylim=c(0, .0002)))

bootstrap.plot.raw = ggplot() +
  geom_histogram(aes(x=Deltahat.raw, y=after_stat(density)),
                 bins=50, alpha=.2, data=bootstrap.estimates) +
  geom_point(aes(x=Deltahat.raw, y=.00005), color='blue') +
  width_annotations(bootstrap.estimates$Deltahat.raw)

bootstrap.plot.1 = ggplot() +
  geom_histogram(aes(x=Deltahat.1, y=after_stat(density)),
                 bins=50, alpha=.2, data=bootstrap.estimates) +
  geom_point(aes(x=Deltahat.1, y=.00005), color='blue') +
  width_annotations(bootstrap.estimates$Deltahat.1)

bootstrap.plot.raw + scales
bootstrap.plot.1   + scales
And here are 95% intervals around my point estimates, calibrated using the bootstrap.
bootstrap.interval.raw = Deltahat.raw +
  c(-1,1)*width(bootstrap.estimates$Deltahat.raw)/2
bootstrap.interval.1 = Deltahat.1 +
  c(-1,1)*width(bootstrap.estimates$Deltahat.1)/2

bootstrap.interval.raw
[1] -17265.902 -6382.859
bootstrap.interval.1
[1] -20400.126 -9785.606
Here’s a review exercise on connecting what we see in the plot to the intervals we’ve reported.
Now it’s your turn to do all this for \(\hat\Delta_0\) and \(\hat\Delta_\text{all}\).
Let’s check that these are pretty similar to what we get using normal approximation. We’ll use our variance formula from our lecture on Inference for Complicated Summaries. And, as in lecture, we’ll ignore the term that reflects the randomness of our coefficients. The one written in red in that lecture.
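To be concrete, here’s the formula we’re implementing, as I read it, with that red term dropped:

\[
\hVar = \sum_{w,x} \hat\alpha(w,x)^2 \ \frac{\hat\sigma^2(w,x)}{N_{wx}}
\]

where \(\hat\sigma(w,x)\) is the sample standard deviation of income within the column \((w,x)\) (the `sigmahat` we computed earlier), and the corresponding 95% interval is \(\hat\Delta \pm 1.96\sqrt{\hVar}\).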
grid = expand_grid(w=c(0,1), x=x)
ww = grid$w
xx = grid$x

Vhat.raw = sum(alphahat.raw(ww,xx)^2 * sigmahat(ww,xx)^2/Nwx(ww,xx))
Vhat.1   = sum(alphahat.1(ww,xx)^2 * sigmahat(ww,xx)^2/Nwx(ww,xx))
Here are the plots above with our normal-approximation-based estimates of the sampling distribution drawn on top of the bootstrap ones.
normal.interval.raw = Deltahat.raw + 1.96*c(-1,1)*sqrt(Vhat.raw)
normal.interval.1   = Deltahat.1 + 1.96*c(-1,1)*sqrt(Vhat.1)

normal.interval.raw
[1] -16896.428 -6752.332
normal.interval.1
[1] -20346.148 -9839.583
Checking Your Estimates using Fake Data
At this point, we have confidence intervals. But this is our first time doing inference for complicated summaries like this, so it’s worth checking that they have the coverage we want. We can’t do that directly because coverage is about the relationship between our estimation target and our estimator’s sampling distribution and we don’t know either—those are functions of our population and we only have a sample. What we can do is try all this out using fake data.
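If it helps to see the logic in miniature first, here’s a toy version of a coverage check: a made-up population and the sample mean as the estimator, like in those earlier homeworks. The real check will do the same thing, just with our \(\hat\Delta\) estimators and the fake population we’re about to build.

# A toy coverage check, purely for illustration: made-up population, sample
# mean as the estimator, normal-approximation interval. Roughly 95% of the
# intervals should contain the target.
toy.pop = rexp(100000, rate = 1/50000)    # a made-up, income-like population
theta   = mean(toy.pop)                   # the estimation target
covers  = 1:1000 |> map_lgl(function(.) {
  toy.sam  = sample(toy.pop, 100, replace = TRUE)
  interval = mean(toy.sam) + 1.96 * c(-1, 1) * sd(toy.sam) / sqrt(100)
  interval[1] <= theta && theta <= interval[2]
})
mean(covers)                              # should be close to 0.95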
Simulating Fake Data
To get a fake population that looks about right, we’ll use a two-step process.5
- To get pairs \((w_1,x_1), \ldots, (w_m,x_m)\) with a distribution that looks like the one in our sample, we’ll just draw them (with replacement) from our sample.
- To get corresponding outcomes \(y_1 \ldots y_m\) that look like the ones in our sample, we’ll add some normally-distributed fuzz to the column means within the sample. We’ll choose the amount of fuzz so that the column standard deviations in our population match those in the sample.
set.seed(1)
m = n*10
I = sample(1:n, m, replace=TRUE)
pop = sam[I,]

fuzz = rnorm(m)*sigmahat(pop$w, pop$x)
pop$y = muhat(pop$w, pop$x) + fuzz

pop.summaries = pop |>
  group_by(w,x) |>
  summarize(mu = mean(y), sigma = sd(y), Nwx = n(), .groups='drop')
And now that we have a population, we can sample with replacement from it.
J = sample(1:m, n, replace=TRUE)
fake.sam = pop[J,]
Here’s the actual sample, the fake population, and the sample we just drew from it. Not bad!
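If you’d like a numerical check to go with the plots, one option is to line up the column summaries of the fake population next to those of the sample it was built from. They should be close, but not identical, since we added normally-distributed fuzz.

# Optional check: compare the sample's column summaries (muhat, sigmahat)
# to the fake population's (mu, sigma), matched on w and x.
left_join(summaries, pop.summaries, by=c('w','x'))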