Week 2 Homework

Summary

This week, we’re going to get into the details of real-world calibration. The kind that doesn’t require that you know your estimator’s actual sampling distribution. We’ll start simple. In Exercise 1 and Exercise 2, you’ll implement a calibrated interval estimator for a population proportion and calculate its actual coverage probability. Then, after a detour on how exactly we go from an estimate of the sampling distribution to an interval width (Exercise 3), we’ll really get into how using an estimate affects the calibration of our interval estimates.

Here we take advantage of a very nice feature of the sampling distributions we get when we’re estimating a proportion \(\theta\) after sampling with or without replacement. They’re determined by our estimation target \(\theta\), i.e., the distribution is Binomial or Hypergeometric with success probability \(\theta\). As a result, once we have a point estimate \(\hat\theta\), we’ve essentially determined our estimate of the sampling distribution, i.e., it’s Binomial or Hypergeometric with success probability \(\hat\theta\). This means we can think of properties of our interval, e.g. its width or its upper and lower bounds, as functions of \(\hat\theta\). And we can visualize these functions. That’s what Figure 2 and Figure 3 are about, and you’ll be using them in Exercise 4-Exercise 6 to understand a phenomenon we saw on the last slide of Lecture 3: around some point estimates, we get a wider-than-perfectly-calibrated interval, which effectively helps it cover the estimation target, and around others we get a narrower-than-perfectly-calibrated interval, which does the opposite. By looking at the width of the sampling distribution as a function of \(\theta\), we can understand where and why this happens—no programming required.

We’ll conclude by using our visualizations to calculate the actual coverage probability of our imperfectly-calibrated interval estimates. First, in Exercise 7, we’ll do it at a single value of \(\theta\)—effectively repeating Exercise 2 using our understanding of what these intervals look like rather than brute-force simulation. Then, in Exercise 8, we’ll repeat the process for a range of \(\theta\) values, allowing us to plot the actual coverage probability of our real-world interval estimates as a function of \(\theta\).

This is all for the case of sampling with replacement. Some extra credit exercises, Exercise 10-Exercise 12, repeat this process for the case of sampling without replacement.

The Point

For a lot of estimators, calibration is a bit more complicated than it is here. For example, the width of the sampling distribution of a difference of two proportions depends on more than that difference—it depends on the two proportions themselves. We lose our one-to-one mapping between the point estimate and the sampling distribution, so there’s more to the question of how estimating this sampling distribution affects calibration. My hope is that, having some shared visual intuition for how calibration works in this simple case, we’ll have an easier time talking about how it works more generally.

The Data

We’ll be using the NBA data from last week’s homework. But to stay in familiar territory, we’ll need a binary outcome, so we’ll use a frequency. In particular, we’ll look at the frequency of players whose team won more than half of the games they played in.

You may have caught on that ‘your sample’ from last week’s homework probably isn’t sampled with replacement from the population after all. So we’ll work with ‘my sample’. Here’s some code you can run to get the data set up. Just like in class, we’ll use (little) y for the population and (big) Y for the sample.1

sam = read.csv("https://qtm285-1.github.io/assets/data/nba_sample_1.csv")
pop = read.csv("https://qtm285-1.github.io/assets/data/nba_population.csv") 

indicator = function(W,L,...) { W / (W+L) > 1/2 }

library(purrr)
Y = sam |> pmap_vec(indicator)
y = pop |> pmap_vec(indicator)

n = length(Y)
m = length(y)
pink     = '#ef476f'  # rgb(239,71,111)
teal     = '#118ab2'  # rgb(17,138,178)

Warm Up

To get started, we’ll redo a few exercises from last week with a few modifications. First, we’ll use this new binary outcome instead of points. And second, we’ll calibrate our intervals using estimates of our sampling distribution rather than the actual sampling distribution.

To remind you of the notation, here’s how we write out our estimation target and estimator again. \[ \begin{aligned} \underset{\text{estimation target}}{\theta} &= \frac{1}{m}\sum_{j=1}^m y_j \\ \underset{\text{estimate}}{\hat\theta} &= \frac{1}{n}\sum_{i=1}^n Y_i \end{aligned} \]

The first one is a version of last week’s Exercise 3.

Exercise 1  

Using only your sample, calculate a 95% confidence interval for \(\theta\), the proportion of players in our population whose team won more than half of the games they played in. Assume that the sample is drawn with replacement from the population.

Your confidence interval should be \(\hat\theta \pm \hat{w}/2\) where \(\hat{w}\) is the width of the middle 95% of your estimate of the sampling distribution. You can use the function width from last week’s homework, but you’ll need draws from your estimate of the sampling distribution to pass to it. This lecture slide should help.

That function width from last week’s homework

width = function(draws, center = mean(draws), alpha = .05) {
    covg.dif = function(w) {
        actual = mean(center - w / 2 <= draws & draws <= center + w / 2)
        nominal = 1 - alpha
        actual - nominal
    }
    uniroot(covg.dif, interval = c(0, 2 * max(abs(draws - center))))$root
}
theta.hat = mean(Y)
theta.hat.draws = rbinom(10000, n, theta.hat)/n
w = width(theta.hat.draws)
interval = c(theta.hat - w/2, theta.hat + w/2)

My interval estimate is \(0.48 \pm 0.10\) or equivalently \([0.38, 0.58]\).

The second one is a version of last week’s Exercise 4.

Exercise 2  

Report the coverage probability of your interval estimator. To do this, draw 10,000 samples of size \(n=100\) with replacement from the population, calculate an interval estimate based on each of these samples just like you did in Exercise 1, and report the fraction of these intervals that contain the population mean.

theta = mean(y)
coverage.probability = 1:10000 |> map_vec(function(.) {
  J = sample(1:m, n, replace=TRUE)
  Y = y[J]

  theta.hat = mean(Y)
  theta.hat.draws = rbinom(10000, n, theta.hat)/n
  w = width(theta.hat.draws)
  interval = c(theta.hat - w/2, theta.hat + w/2)
  covers = interval[1] <= theta & theta <= interval[2]
  covers
}) |> mean()

My interval estimator’s coverage probability is approximately 95%.

Improving Speed and Precision at the Same Time

So far, to calculate our interval widths, we’ve used a ‘dot counting’ method. We’ve sampled 10,000 times from our (estimated) sampling distribution, then tried drawing arms of different lengths around the sampling distribution’s mean until we got one that included 9500 (95%) of those dots. You can think of it as an iterative process like this.

Too narrow. Only 8000 dots are covered.

Too wide. 9990 dots are covered.

Getting there. 9750 dots.

Just right. 9500 dots.
Figure 1

That’s really what the code is doing. What I’m leaving out is a clever way of choosing the next width to try given what’s happened so far. That’s baked into R’s built-in function uniroot, which is being used in width to find the solution to this equation.

\[ \hat f(x) = 0 \qqtext{ where } \hat f(x) = \qty(\text{the fraction of dots covered at width } x) - .95 \]
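To see what uniroot is doing, here’s a toy example unrelated to the homework data: it brackets a sign change and homes in on the root.

```r
# uniroot finds an x with f(x) = 0 inside an interval where f changes sign.
# Toy example: the positive root of x^2 - 2, which is sqrt(2).
f = function(x) { x^2 - 2 }
root = uniroot(f, interval = c(0, 2), tol = 1e-8)$root
abs(root - sqrt(2))  # tiny
```

In width, covg.dif plays the role of f, and the bracketing interval is a range of candidate widths.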

What’s imprecise about this is that we’re using 10,000 dots instead of infinitely many. We don’t really want to know the fraction of 10,000 dots that we’re going to cover—we want to know the fraction of the sampling distribution that we’re going to cover. To do this, we can make a slight modification to our code. We’ll swap our dot counting function \(\hat f(x)\), which was called covg.dif in the code, with a version \(f(x)\) that actually calculates probability. Here’s the code. It’s a bit faster than the dot-counting version too.

width.binom = function(theta, n, alpha=.05) {
  if(theta == 0 || theta == 1) { return(0) }
  covg.dif = function(w) {
    actual = pbinom(n*(theta+w/2),   n, theta) -
             pbinom(n*(theta-w/2)-1, n, theta)
    nominal = 1-alpha 
    actual - nominal  
  }
  limits = c(0, 1)
  uniroot(covg.dif, interval=limits)$root
}
(1) This uses the built-in function pbinom, which calculates the probability that a binomial random variable is less than or equal to \(x\). To calculate the probability that it’s between two values \(a<b\), we take the probability it’s less than or equal to \(b\) and subtract the probability it’s less than \(a\): \(P(a \le X \le b) = P(X \le b) - P(X < a)\). And the probability that this average—which is some fraction of \(n\)—is less than \(a\) is, of course, the probability that it’s less than or equal to \(a-1/n\).
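As a quick sanity check on that identity, you can compare the CDF difference to summing the probability mass function directly. The values of theta, a, and b below are just illustrative stand-ins.

```r
# Check P(a <= X <= b) = pbinom(b, n, theta) - pbinom(a - 1, n, theta)
# against summing the pmf directly. Illustrative values, in counts not fractions.
n = 100; theta = 0.48
a = 40; b = 56
via.cdf = pbinom(b, n, theta) - pbinom(a - 1, n, theta)
direct  = sum(dbinom(a:b, n, theta))
abs(via.cdf - direct)  # effectively zero
```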

You can interpret the ‘dot counting’ method as doing exactly this, except using a histogram of 10,000 draws from the sampling distribution in place of the sampling distribution itself. Remember the exercise from class last week where we thought about twins in neon-green shirts? That’s what width is doing. It’s using the histogram I’ve shaded pink instead of the sampling distribution I’ve outlined on top of it in teal.

In this exercise, you’re going to compare the two approaches. You’re going to work with your estimate of the sampling distribution from Exercise 1 and plot two functions of interval width \(x\): the fraction of the sampling distribution that’s covered by the interval \(\hat\theta \pm x/2\) and the fraction of your 10,000 draws that is.

Exercise 3  

Plot these two functions on the same axes. Add the horizontal lines \(y=.8\) and \(y=.95\). Then explain how, using this plot, you can find the widths of the 95% confidence intervals calculated by width and width.binom. And report an 80% confidence interval centered on your estimate \(\hat\theta\). You should be able to read the width for that off the plot.

To reduce the amount of code you’ll have to write, I’ll give you most of what you need. All you’ll need to do is replace the NaN in probability.covered with the fraction of the sampling distribution that’s covered by the interval \(\hat\theta \pm x/2\). If you want a hint, see notes (1) and (2) below.

draws =  rbinom(10000, n, theta.hat)/n
center = mean(draws)
dots.covered = function(x) { 
  mean(center - x / 2 <= draws & draws <= center + x / 2)
}
probability.covered = function(x) {
  NaN
}

x = (0:n)/n
ggplot() +
  geom_line(aes(x=x, y = x |> map_vec(dots.covered)), color=pink) +
  geom_line(aes(x=x, y = x |> map_vec(probability.covered)), color=teal) +
  scale_y_continuous(breaks = c(0,.8,.95,1), minor_breaks = NULL) + 
  scale_x_continuous(breaks = seq(0,1,by=.125), limits=c(0,.5)) + 
  labs(x='', y='') 
(1) I’ve borrowed this code from the function width.
(2) Is there something analogous you should be borrowing from width.binom?

draws =  rbinom(10000, n, theta.hat)/n
center = mean(draws)
dots.covered = function(x) { 
  mean(center - x / 2 <= draws & draws <= center + x / 2)
}

theta = theta.hat
probability.covered = function(x) {
  pbinom(n*(theta+x/2),   n, theta) -
  pbinom(n*(theta-x/2)-1, n, theta)
}

x = (0:n)/n
ggplot() +
  geom_line(aes(x=x, y = x |> map_vec(dots.covered)), color=pink) +
  geom_line(aes(x=x, y = x |> map_vec(probability.covered)), color=teal) +
  scale_y_continuous(breaks = c(0,.8,.95,1), minor_breaks = NULL) + 
  scale_x_continuous(breaks = seq(0,1,by=.125), limits=c(0,.5)) + 
  labs(x='', y='') 
(1) I’ve borrowed this code from the function width.
(2) And I’ve borrowed this from width.binom.
(3) I needed this for what I borrowed from width.binom to work.

The heights of the two lines at a given \(x\)-coordinate, which should be and are roughly the same, show the fraction of our estimate of the sampling distribution that we cover with an interval of width \(x\). The \(x\)-coordinates where these lines cross the horizontal lines \(y=.8\) and \(y=.95\) are the widths we need for 80% and 95% coverage respectively. They cross \(y=.8\) at \(x \approx .125\), so the interval \(\hat\theta \pm .125/2\) for \(\hat\theta \approx 0.48\) (i.e. the interval \([0.42 , 0.54]\) ) is an 80% confidence interval.

Visualizing Calibration

In this problem, we’re going to try to understand how calibration works by visualizing it. This turns out to be pretty straightforward when we’re talking about estimating a frequency after sampling with replacement. That’s because we know exactly what the sampling distribution of the sample frequency looks like except for one thing: we don’t know the population frequency \(\theta\). Here’s what they look like for population frequencies \(\theta\) from 10% to 90% in 10% increments at our sample size of n=100.

It follows that the width we should be using to calibrate our interval estimates—the width of the middle 95% of a distribution like this—is also determined by the population frequency \(\theta\). I’ve drawn it as a ‘belt’ on each distribution above. We already have a function to calculate it: width.binom. And if we want to understand how these widths depend on \(\theta\), we can do the obvious thing: plot the width as a function of \(\theta\).

thetas = seq(0,1,by=.01)
widths.100 = thetas |> map_vec(function(theta) { width.binom(theta, n=n, alpha=.05) })
widths.625 = thetas |> map_vec(function(theta) { width.binom(theta, n=625, alpha=.05) })

ggplot() + 
  geom_line(aes(x=thetas, y=widths.100)) +
  labs(x='theta', y='width')
ggplot() + 
  geom_line(aes(x=thetas, y=widths.625)) +
  labs(x='theta', y='width')

The width of a calibrated 95% confidence interval as a function of the population frequency theta when \(n=100\).

The width of a calibrated 95% confidence interval as a function of the population frequency theta when \(n=625\).
Figure 2

We can even plot the interval itself as a function of \(\theta\). Here are two slightly different ways of doing it. On the left, we have a visualization that focuses on the interval’s upper and lower bounds as functions of \(\theta\). The blue line shows the upper bound and the red line shows the lower bound. On the right, we have a visualization that focuses on what the intervals themselves look like. We’re plotting the intervals themselves, laid out vertically, as functions of \(\theta\).

These are really just different ways of styling the same plot. On the left, the upper and lower bounds are styled as two lines. On the right, maybe a bit more evocatively, they’re styled as the edges of actual intervals.

Figure 3

Here’s a few quick exercises to solidify your understanding of what these plots are saying.

Exercise 4  

Let’s use the notation \(w(\theta)\) to refer to the function from Figure 2, the width of a calibrated 95% confidence interval when the population frequency is \(\theta\). And let’s call the lower and upper bounds of the corresponding interval, which we see in Figure 3, \(\ell(\theta)\) and \(u(\theta)\) respectively. Write out formulas for \(\ell(\theta)\) and \(u(\theta)\) in terms of \(\theta\) and \(w(\theta)\).

\[ \begin{aligned} \ell(\theta) &= \theta - \frac{w(\theta)}{2} \\ u(\theta) &= \theta + \frac{w(\theta)}{2} \end{aligned} \]

Exercise 5  

Let’s think about the ‘confidence interval for our estimate of the sampling distribution’ we talked about at the end of class this Wednesday. The one from the last slide. On that slide, I showed interval estimates \(\hat\theta \pm w(\hat\theta)/2\) calibrated by plugging a point estimate \(\hat\theta\) into the binomial formula to estimate the sampling distribution, like you did in Exercise 1. In particular, I showed these intervals for point estimates \(\hat\theta\) at the two ends of the sampling distribution’s middle 95%: a red one centered at \(\theta-w(\theta)/2\) and a blue one centered at \(\theta+w(\theta)/2\) where \(\theta\) was the population frequency. These intervals had different widths because they were calibrated using different estimates of the sampling distribution. Write out formulas for these two intervals in terms of \(\theta\) and the functions \(\ell\) and \(u\) you defined in Exercise 4. Then, using your solution to Exercise 4, write an equivalent formula in terms of \(\theta\) and the function \(w\).

\[ \begin{aligned} \qty[\ell\qty(\theta-\frac{w(\theta)}{2}),\ u\qty(\theta-\frac{w(\theta)}{2})] &= \qty(\theta - \frac{w(\theta)}{2}) \pm \frac{w\qty(\theta-\frac{w(\theta)}{2})}{2} && \text{ is the \textcolor[RGB]{248,118,109}{red} one } \\ \qty[\ell\qty(\theta+\frac{w(\theta)}{2}),\ u\qty(\theta+\frac{w(\theta)}{2})] &= \qty(\theta + \frac{w(\theta)}{2}) \pm \frac{w\qty(\theta+\frac{w(\theta)}{2})}{2} && \text{ is the \textcolor[RGB]{0,0,255}{blue} one } \\ \end{aligned} \]

Exercise 6  

Using Figure 2, think about why the red interval, centered on an underestimate of the population frequency, covers the population frequency but the blue interval, centered on an overestimate, does not. No need to turn anything in as evidence that you’ve done the thinking.

Now suppose that, instead of being roughly 70%, turnout was roughly 30%. Maybe we’re thinking about a midterm primary like we talked about in our first class of the semester. Suppose we’ve calibrated interval estimates exactly the same way, using widths computed from estimates of the sampling distribution. And we’re thinking about two particular point estimates that could happen: a new version of the red one, at the lower end of the actual sampling distribution’s middle 95%, and a new version of the blue one, at the upper end of the actual sampling distribution’s middle 95%. Which of the two—if any—would cover the population frequency of 30%? What about in the case that the population frequency is 50%?

If turnout were roughly 30%, the blue interval at the upper end would cover and the red interval at the lower end would not. If it were 50%, neither would cover.

I’ll explain why. Looking at the \(x \pm w(x)/2\) formulas from Figure 3, we see that intervals centered at \(x=\textcolor[RGB]{248,118,109}{\theta-w(\theta)/2}\) and \(x=\textcolor[RGB]{0,0,255}{\theta+w(\theta)/2}\) cover \(\theta\) if and only if \(w(x) \ge w(\theta)\). That’s the point of these centers: perfectly calibrated intervals centered at those points would just touch \(\theta\). If we replace the ‘perfectly calibrated’ width \(w(\theta)\) with one that’s greater than \(w(\theta)\), then the interval will cover \(\theta\); if we replace it with one that’s less than \(w(\theta)\), then the interval will not.

When \(\theta\) is roughly 70%, the red interval’s actual width is the height of the width function \(w(x)\) at \(x = \theta-\frac{w(\theta)}{2}\), a point between \(\theta\) and \(.5\), and it is greater than \(w(\theta)\) because the width function increases as we move from \(\theta\approx .7\) toward its maximum at \(.5\). On the other hand, the blue interval’s actual width is smaller than \(w(\theta)\) because we move away from \(.5\) as we go from \(\theta\) to its center. The situation is reversed when \(\theta\) is roughly 30%, as the blue interval’s center is between \(\theta\) and \(.5\) and the red interval’s center is not. When \(\theta=.5\), the width at \(\theta\) is maximal, so the widths we get for both the red and blue intervals are smaller than \(w(\theta)\).
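If you’d like to check this with code rather than the figure, here’s a sketch using the normal-approximation width \(2\,z_{.975}\sqrt{\theta(1-\theta)/n}\) as a stand-in for \(w\). That’s an assumption on my part—it’s smooth, so the comparisons are clean, and at \(n=100\) it’s close to what width.binom gives.

```r
# Normal-approximation stand-in for the calibrated width w(theta) at sample size n.
w = function(theta, n, alpha = .05) {
  2 * qnorm(1 - alpha/2) * sqrt(theta * (1 - theta) / n)
}
n = 100

# theta ~ 70%: the red center (an underestimate) sits closer to .5, so its width
# is larger than w(theta) and its interval covers; the blue center's is smaller.
theta = 0.7
c(w(theta - w(theta, n)/2, n), w(theta, n), w(theta + w(theta, n)/2, n))

# theta = 50%: w is maximal at .5, so both centers get narrower widths and
# neither interval covers.
theta = 0.5
c(w(theta - w(theta, n)/2, n), w(theta, n), w(theta + w(theta, n)/2, n))
```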

Now here’s where we start to put it all together. Let’s write out an indicator that tells us whether, when our point estimate is \(x\), our interval estimate covers the population frequency \(\theta\).2

\[ 1_{\ell(\cdot) \le \theta \le u(\cdot)}(x) = \begin{cases} 1 & \text{if } \ell(x) \le \theta \le u(x) \\ 0 & \text{otherwise} \end{cases} \]

Exercise 7  

Draw this indicator function, like you did in the Week 0 Homework, on top of the plot of your estimator’s actual sampling distribution below. Describe how, assuming you were willing to get out a ruler and spend a lot of time measuring plots, you could calculate the coverage of the \(\hat\theta \pm w(\hat\theta)/2\) interval estimates we’ve been talking about this week without writing any code.

To draw your indicator function, you need to know where it’s one and where it’s zero. You can use Figure 3 to do this. At which \(x\)-coordinate does the interval’s upper bound \(u(x)\) cross the population frequency \(\theta \approx 0.49\)? What happens to your indicator function there? What about the interval’s lower bound \(\ell(x)\)?

It’s fine to do this approximately, just by looking at the figure. But if you want to get the exact \(x\)-coordinates, you should be able to use widths.100 from the code used to draw Figure 2 in combination with your answer to Exercise 4.

theta = mean(y)
draws =  rbinom(1000, n, theta)/n
prob = dbinom(0:n, n, theta)

sampling.dist.plot = ggplot() +
  geom_col(aes(x=(0:n)/n, y=prob), alpha=.2) +
  geom_point(aes(x=draws, y=max(prob)*(1:1000)/1000), alpha=.1, size=.5, color='purple') +
  geom_vline(xintercept=theta, color='green', alpha=.7, linewidth=1.5) + 
  scale_x_continuous(breaks=seq(0,1,by=.1), limits=c(.25,.75)) +
  labs(x='', y='')
sampling.dist.plot

Let’s figure out where our indicator function is one. Here’s an easy way to do it visually. Take your pick of the two plots in Figure 3, which show the upper and lower bounds of our interval estimates, \(\ell(\cdot)\) and \(u(\cdot)\), as functions of \(x\). Add a horizontal line at \(y=\theta \approx 0.49\). I drew a green line on top of both.

The plot on the right is a bit more evocative but the one on the left is a bit easier to read precisely. Our indicator is one at \(x\)s where \(\theta\) is between the upper bound \(\textcolor[RGB]{0,0,255}{u(x)}\) and lower bound \(\textcolor[RGB]{248,118,109}{\ell(x)}\). So the edges of the region we should be shading—the \(x\)-coordinates where our indicator switches from zero to one—are the \(x\)-coordinates where the green line crosses the bounds from ‘outside’ to ‘inside’. Visually, what I’m seeing is that it crosses ‘in’ at roughly \(x=.4\) and ‘back out’ at roughly \(x=.59\).

I can get the exact \(x\)-coordinates where the green line is inside the bounds by actually calculating them. I can do that using the widths.100 vector used to draw Figure 2 and Figure 3. In the code below, inside is my indicator function evaluated at each possible value of \(\hat\theta\), i.e. the ones with denominator \(n\). To find the boundaries, I just take the minimum and maximum of the possible values where it’s one.

x = (0:n)/n
lx = x - widths.100/2
ux = x + widths.100/2

inside = lx <= theta & theta <= ux
lower.edge = min(x[inside])
upper.edge = max(x[inside])

To draw the indicator function, I’ll use annotate to draw a rectangle as in the Week 0 Homework.

binomial.indicator = annotate(geom='rect', xmin=lower.edge, xmax=upper.edge, 
                              ymin=-Inf, ymax=Inf, alpha=.2, fill='blue')
sampling.dist.plot + binomial.indicator

Our coverage probability is just the expected value of this indicator function. It’s the fraction of the sampling distribution contained in our rectangle. I didn’t actually ask you to calculate it, but it’s easy enough when we know our boundaries.

It’s 92.8%.
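If you want to see that calculation as code, here’s a self-contained sketch that reuses the unsmoothed width.binom from above. Note that theta = 0.49 is an assumed stand-in for mean(y), so the result is approximate rather than exactly 92.8%.

```r
# Coverage probability at a single theta: the expected value of our indicator,
# i.e. the binomial probability of the point estimates x whose interval
# x +/- w(x)/2 covers theta. theta = 0.49 is a stand-in for mean(y).
width.binom = function(theta, n, alpha = .05) {
  if (theta == 0 || theta == 1) { return(0) }
  covg.dif = function(w) {
    pbinom(n * (theta + w/2), n, theta) -
      pbinom(n * (theta - w/2) - 1, n, theta) - (1 - alpha)
  }
  uniroot(covg.dif, interval = c(0, 1))$root
}

n = 100; theta = 0.49
x = (0:n) / n
widths = sapply(x, width.binom, n = n)
inside = x - widths/2 <= theta & theta <= x + widths/2
coverage = sum(dbinom(0:n, n, theta)[inside])
coverage
```

I’ve used base R’s sapply instead of purrr’s map_vec so the block stands alone.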

Our final question is a little more open-ended in how you do it. You could, for example, generalize your solution to Exercise 2 or translate your solution to Exercise 7 into code. What we’re going to do is think of the coverage probability of these interval estimates as a function—a function of the actual population frequency \(\theta\)—and plot it.

Exercise 8  

Suppose we’re estimating \(\theta\), the frequency of ones in some population \(y_1 \ldots y_m\), based on a sample of size n=100 drawn with replacement. Plot, as a function of \(\theta\), the coverage probability of the interval estimate \(\hat\theta \pm w(\hat\theta)/2\). Repeat for a sample of size n=625.

When I was writing my solution, I found that my precision was a bit limited if I used the version of width.binom above, because its output bounces around a little. The problem is that, due to the discreteness of the binomial distribution, there’s a range of widths that yield the same coverage probability. The smart thing to do would be to choose the smallest one, of course, but uniroot doesn’t know to do that. Within that range, uniroot doesn’t know what to pick, and the widths it returns can bounce up and down a little as we make small changes to \(\theta\)—even when those changes are all in the same direction. I haven’t implemented the smart thing, but I have done something that breaks these ties and fixes the bouncing. What I’ve done is to ‘smooth’ the actual binomial probabilities by mixing them with a bit of the probabilities we get using its normal approximation.

width.binom = function(theta, n, alpha=.05, smooth.fraction=.001) {
  if(theta == 0 || theta == 1) { return(0) }
  covg.dif = function(w) {
    actual.normal.approx =  
      pnorm(theta+w/2, theta, sqrt(theta*(1-theta)/n)) - 
      pnorm(theta-w/2, theta, sqrt(theta*(1-theta)/n))     
    actual = pbinom(n*(theta+w/2),   n, theta) -  
             pbinom(n*(theta-w/2)-1, n, theta)   
    smoothed.actual = (1-smooth.fraction)*actual+smooth.fraction*actual.normal.approx        
    nominal = 1-alpha 
    smoothed.actual - nominal 
  }
  limits = c(0, 1)
  uniroot(covg.dif, interval=limits)$root
}

I’ll do this by extending the approach from Exercise 7, where I wrote code to calculate coverage at a single value of \(\theta\). All I need to do now is do the same thing repeatedly, looping over a grid of values of \(\theta\).

For a small population like our NBA one, we can try every possible value of \(\theta\): \(0, 1/m, 2/m, \ldots, 1\). For a population like our GA voters, that’s too many. We’ll just count from 0 to 1 by thousandths, which should be good enough for a plot.

Note that there’s a bit of bouncing up and down in this curve. To make this a little more palatable visually, I’ve plotted it as a sequence of points instead of a line. It’s a useful trick. I haven’t had time to think through whether the actual curve bounces a little or it’s an artifact introduced by my code. Extra Credit. Figure out which it is and make a convincing argument. Let’s make it worth something: 10 points on Midterm 1.

I’ve drawn a green line at \(y=.95\), our nominal coverage probability, i.e. the coverage probability we’re trying for.

coverage.curve = function(n, thetas) {
  x = (0:n)/n
  widths = x |> map_vec(function(x) { width.binom(x, n, alpha=.05) })
  lx = x - widths/2
  ux = x + widths/2

  thetas |> map_vec(function(theta) { 
    probs = dbinom(0:n, n, theta)
    inside = lx <= theta & theta <= ux
    sum(probs[inside])
  })
} 

thetas.100 = (0:m)/m
thetas.625 = (0:1000)/1000

ggplot() + 
  geom_point(aes(x=thetas.100, y=coverage.curve(100, thetas.100)), color=pink, size=.25) +
  geom_hline(yintercept=.95, color='green', alpha=.7, linewidth=.5) +
  ylim(0,1) + labs(x='θ', y='coverage')
ggplot() +
  geom_point(aes(x=thetas.625, y=coverage.curve(625,thetas.625)), color=teal, size=.25) + 
  geom_hline(yintercept=.95, color='green', alpha=.7, linewidth=.5) +
  ylim(0,1) + labs(x='θ', y='coverage')

n=100

n=625

Exercise 9  

Optional. Knowing a bit about your experience doing this homework will help me write better-calibrated assignments in the future. If you’d like, please tell me this.

  1. Roughly how long it took you.
  2. Whether that was the time you want to spend on homework for this class, too much, or too little.

Extra Credit: Sampling without Replacement

Remember from class that if we sample without replacement, the sample frequency has a hypergeometric distribution, not a binomial distribution? That means that if we want to calibrate interval estimates for sampling without replacement, we need to make a very small change to the function width.binom to switch out the binomial distribution for the hypergeometric distribution. You are, of course, welcome to look up the documentation on the R function phyper and code that up yourself, but I’ve done it for you and provided the code below in case you’d rather not.

width.hyper = function(theta, n, m, alpha=.05) { 
  if(theta == 0 || theta == 1) { return(0) }
  covg.dif = function(w) {
    actual = phyper(n*(theta+w/2),   m*theta, m*(1-theta), n) - 
             phyper(n*(theta-w/2)-1, m*theta, m*(1-theta), n)
    nominal = 1-alpha 
    actual - nominal  
  }
  limits = c(0, 1)
  uniroot(covg.dif, interval=limits)$root
}
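Before swapping it in, you can check the premise that sampling without replacement tightens the sampling distribution. Here’s a quick comparison of the two distributions’ standard deviations at our \(n\) and \(m\); the count of ones, K = 270, is an illustrative value I chose so that \(m\theta\) is an integer.

```r
# The hypergeometric sample frequency is less spread out than the binomial one:
# its variance carries the finite-population correction (m-n)/(m-1).
m = 539; n = 100
K = 270                 # illustrative count of ones in the population
theta = K / m
k = 0:n
binom.sd = sqrt(sum(dbinom(k, n, theta) * (k/n - theta)^2))
hyper.sd = sqrt(sum(dhyper(k, K, m - K, n) * (k/n - theta)^2))
c(binom.sd, hyper.sd, binom.sd * sqrt((m - n)/(m - 1)))
```

The third number matches the second: the hypergeometric standard deviation is exactly the binomial one shrunk by \(\sqrt{(m-n)/(m-1)}\).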

What we’re going to do here are versions of Exercise 7 and Exercise 8 for a version of our NBA survey where we’ve collected a sample of size n=100 from our population of size m=539 players, but this time sampling without replacement. Here’s a plot of our sample frequency’s sampling distribution when we do this.

theta = mean(y)
draws =  rhyper(1000, m*theta, m*(1-theta), n)/n
prob = dhyper(0:n, m*theta, m*(1-theta), n)

sampling.dist.plot.hyper = ggplot() +
  geom_col(aes(x=(0:n)/n, y=prob), alpha=.2) +
  geom_point(aes(x=draws, y=max(prob)*(1:1000)/1000), alpha=.1, size=.5,color='purple') +
  geom_vline(xintercept=theta, color='green', alpha=.7, linewidth=1.5) +  
  scale_x_continuous(breaks=seq(0,1,by=.1), limits=c(.25,.75)) +
  labs(x='', y='')

Exercise 10  

Draw the indicator for coverage of a 95% interval, \(1_{\ell(\cdot) \le \theta \le u(\cdot)}\), on the plot above. Remember to use the width of the hypergeometric distribution rather than the binomial when calculating \(\ell\) and \(u\).

I’ll do this the same way I did it in Exercise 7. But this time I’ll use the hypergeometric width function.

widths.100 = numeric(n + 1)
for(ii in 0:n) {
  widths.100[ii+1] = width.hyper(ii/n, n, m, alpha=.05)
}


x = (0:n)/n
lx = x - widths.100/2
ux = x + widths.100/2

inside = lx <= theta & theta <= ux
lower.edge = min(x[inside])
upper.edge = max(x[inside])
(1) Here’s the change. The intended change was to swap out width.binom for width.hyper. I also replaced the call to map_vec with a for loop because I was getting errors when I mapped and they vanished when I made exactly the same calls in a loop. I don’t know why that happened.
hypergeometric.indicator = annotate(geom='rect', xmin=lower.edge, xmax=upper.edge, 
                              ymin=-Inf, ymax=Inf, alpha=.2, fill='blue')

sampling.dist.plot.hyper + hypergeometric.indicator

Let’s calculate the coverage probability.

It’s 92.4%.

Exercise 11  

Suppose we’re estimating \(\theta\), the frequency of ones in some population \(y_1 \ldots y_m\), based on a sample of size n=100 drawn without replacement from a population of size 539. Plot, as a function of \(\theta\), the coverage probability of the interval estimate \(\hat\theta \pm w(\hat\theta)/2\).

While writing the solution, I implemented a smoothed version of this width function, too.

width.hyper = function(theta, n, m, alpha=.05, smooth.fraction=.001) { 
  if(theta == 0 || theta == 1) { return(0) }
  covg.dif = function(w) {
    actual.normal.approx =  
      pnorm(theta+w/2, theta, sqrt(theta*(1-theta)/n * (m-n)/(m-1))) - 
      pnorm(theta-w/2, theta, sqrt(theta*(1-theta)/n * (m-n)/(m-1)))     
    actual = phyper(n*(theta+w/2),   m*theta, m*(1-theta), n) - 
             phyper(n*(theta-w/2)-1, m*theta, m*(1-theta), n)
    smoothed.actual = (1-smooth.fraction)*actual+smooth.fraction*actual.normal.approx        
    nominal = 1-alpha 
    smoothed.actual - nominal  
  }
  limits = c(0, 1)
  uniroot(covg.dif, interval=limits)$root
}

The function coverage.curve differs from the one in the solution to Exercise 8 in two lines. They’re marked.

coverage.curve = function(n, thetas) {
  x = (0:n)/n
  widths = x |> map_vec(function(x) { width.hyper(x, n, m, alpha=.05) })
  lx = x - widths/2
  ux = x + widths/2

  thetas |> map_vec(function(theta) { 
    probs = dhyper(0:n, m*theta, m*(1-theta), n)
    inside = lx <= theta & theta <= ux
    sum(probs[inside])
  })
} 

thetas.100 = (0:m)/m
ggplot() + 
  geom_point(aes(x=thetas.100, y=coverage.curve(100, thetas.100)), color=pink, size=.25) +
  geom_hline(yintercept=.95, color='green', alpha=.7, linewidth=.5) +
  ylim(0,1) + labs(x='θ', y='coverage')
(1) Calculate the width using the hypergeometric instead of the binomial.
(2) Calculate the expectation of our indicator using the hypergeometric probabilities rather than the binomial.

Finally, let’s look into what would happen if we mistakenly analyzed this data, which was sampled without replacement, as if it had been sampled with replacement.

Exercise 12  

Draw your indicator from Exercise 7 on top of your plot of the sampling distribution from Exercise 10. Then calculate and plot, as a function of the population frequency \(\theta\), the coverage probability it illustrates. That is, the coverage probability of the interval estimator \(\hat\theta \pm w(\hat\theta)/2\) calibrated as if you’d drawn a sample of size 100 with replacement, when you’ve actually drawn that sample without replacement from a population of 539.

So what we’re going to see here is how much we over-cover, i.e. exceed our nominal coverage probability of 95%, when we calibrate using the wider binomial distribution. The degree of over-coverage isn’t huge because \(n=100\) is not a huge fraction of \(m=539\). There’s some, because it’s not tiny in comparison, but I’ll show the coverage plot for \(n=(m-1)/2\) to see what happens in that case too.

It’s the same binomial indicator and width calculation function as in Exercise 7 and Exercise 8, but we calculate the expected value of the indicator using the hypergeometric probabilities as in Exercise 11.

We already have the indicator from Exercise 7, so we can just overlay it on our new sampling distribution plot.

sampling.dist.plot.hyper + binomial.indicator

This time we make one change to coverage.curve from Exercise 8. I’ll mark the changes we do and don’t make.

coverage.curve = function(n, thetas) {
  x = (0:n)/n
  widths = x |> map_vec(function(x) { width.binom(x, n, alpha=.05) })
  lx = x - widths/2
  ux = x + widths/2

  thetas |> map_vec(function(theta) { 
    probs = dhyper(0:n, m*theta, m*(1-theta), n)
    inside = lx <= theta & theta <= ux
    sum(probs[inside])
  })
} 

thetas = (0:m)/m

ggplot() + 
  geom_point(aes(x=thetas, y=coverage.curve(100, thetas)), color=pink, size=.25) +
  geom_hline(yintercept=.95, color='green', alpha=.7, linewidth=.5) +
  ylim(0,1) + labs(x='θ', y='coverage')
(1) Here’s the change we don’t make that we did in Exercise 11. We’re calibrating using the binomial.
(2) Here’s the change we do make. We’re calculating the coverage probability using the hypergeometric.
ggplot() + 
  geom_point(aes(x=thetas, y=coverage.curve((m-1)/2, thetas)), color=teal, size=.25) +
  geom_hline(yintercept=.95, color='green', alpha=.7, linewidth=.5) +
  ylim(0,1) + labs(x='θ', y='coverage')

n=100

n=(m-1)/2

Footnotes

  1. If you’re getting an error about pmap_vec not being found, you’ve probably got a pretty old version of purrr installed. That function, like map_vec from last week’s homework, was added in purrr version 1.0 in December 2022. You can update it by running install.packages("purrr").↩︎

  2. You’ll have to forgive the notation here. Sometimes when we want to write out a function, but don’t want to give its argument a name, we’ll use the symbol \(\cdot\) instead of a real argument name like \(x\). What this notation is saying is that, when you pass the function an argument \(x\), you should substitute it for the \(\cdot\). Perhaps it makes more sense when you see it in action, compared to the alternative. Which looks more like an indicator variable for the interval estimate around \(\hat\theta\) covering the population frequency \(\theta\): \(1_{\ell(\cdot) \le \theta \le u(\cdot)}(\hat\theta)\) or \(1_{\ell(x) \le \theta \le u(x)}(\hat\theta)\)?↩︎