Practice Midterm 1

QTM 285-1

Problem 1

Suppose we’ve drawn a sample \(Y_1 \ldots Y_n\) with replacement from a population \(y_1 \ldots y_m\) with mean \(\theta=\frac{1}{m}\sum_{j=1}^m y_j\). The plot above shows the sampling distributions of these three estimators of \(\theta\). \[ \begin{aligned} \hat \theta_1 &= 0 \\ \hat \theta_2 &= \frac{1}{n}\sum_{i=1}^n Y_i \\ \hat \theta_3 &= \frac{1}{n+10}\sum_{i=1}^n Y_i \end{aligned} \]

Part A

Match each estimator to the plot of its sampling distribution, e.g. \(\hat\theta_1: a\), \(\hat\theta_2: b\), etc. Of the estimators \(\hat\theta_1\), \(\hat\theta_2\), and \(\hat\theta_3\), which do you know to be consistent? Which could possibly be consistent?

Correction

The x-axis ticks were mislabeled in the version of this document posted earlier. They were off by 1, so ‘estimator a’ was centered at 0, etc. This has been fixed. I apologize for any confusion this caused.

\(\hat\theta_1\) is \(c\), \(\hat\theta_2\) is \(a\), and \(\hat\theta_3\) is \(b\). I know \(\hat\theta_2\) and \(\hat\theta_3\) are consistent, whereas \(\hat\theta_1\) could be consistent but I don’t know that.

Explanation. It’s easy to identify the sampling distribution of \(\hat\theta_1\), as it has no spread at all: it’s always \(0\). And we can differentiate between the sampling distributions of \(\hat\theta_2\) and \(\hat\theta_3\) by observing that \(\hat\theta_2\) (the sample mean) is unbiased, so it must be the one whose center doesn’t change as \(n\) does.

I know \(\hat\theta_2\) and \(\hat\theta_3\) are consistent: their standard deviations both go to zero as \(n\) goes to infinity, and so do their biases, since the bias of \(\hat\theta_2\) is zero and that of \(\hat\theta_3\) is \(-10\theta/(n+10)\), which goes to zero as \(n\) goes to infinity. I don’t know that \(\hat\theta_1\) is, since its bias is \(-\theta\) no matter what \(n\) is.
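
If you want to check this numerically, here’s a sketch in R. The population y below is made up for illustration (it’s not the one behind the plots), the helper bias.and.sd is mine, and the estimator implementations match the ones in Problem 4.

set.seed(1)
y = rexp(1000, rate = 1/5)            # a made-up population y_1 ... y_m
theta = mean(y)                       # its mean, the estimation target

theta.hat.1 = function(Y) { 0 }
theta.hat.2 = function(Y) { mean(Y) }
theta.hat.3 = function(Y) { sum(Y)/(length(Y)+10) }

# bias and standard deviation of an estimator's sampling distribution,
# approximated by drawing many samples of size n with replacement
bias.and.sd = function(estimator, n, reps = 10000) {
  draws = replicate(reps, estimator(sample(y, size = n, replace = TRUE)))
  c(bias = mean(draws) - theta, sd = sd(draws))
}

# bias and sd shrink with n for theta.hat.2 and theta.hat.3;
# theta.hat.1 keeps bias -theta no matter how large n gets
for (n in c(10, 100, 1000)) {
  cat("n =", n, "\n")
  print(sapply(list(theta.hat.1 = theta.hat.1,
                    theta.hat.2 = theta.hat.2,
                    theta.hat.3 = theta.hat.3),
               bias.and.sd, n = n))
}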

But — if you disconnect this part from the sampling distributions I’ve drawn — \(\hat\theta_1\) could be consistent, since its standard deviation does go to zero as \(n\) goes to infinity (it’s zero for all \(n\)) and its bias would be zero if \(\theta\) were \(0\). I wouldn’t take points off for saying it couldn’t be consistent, for two reasons. First, I did draw the sampling distributions for you, and what I’ve drawn is inconsistent with \(\theta\) being 0. Second, it’s so rare that anything is ever exactly zero that I wouldn’t fault you for answering the question as if it outright never happened. But that idea—that \(\hat\theta_1\) would be consistent if \(\theta\) were \(0\)—was what I was trying to get at with this ‘could’ question.

Part B

In the plot below, I’ve shown a bootstrap estimate of the sampling distribution of an estimator \(\hat\theta\). Suppose it’s a good estimate, so you can get away with thinking of it as the estimator’s actual sampling distribution. On top of it, I’ve drawn four interval estimators.

  • Which are calibrated to have at least 95% coverage? Put a check next to those.
  • Which are calibrated to have almost exactly 95% coverage? Circle your check for those.
A Clarification.

If I ask a problem like this on an exam, I’ll make it clear that ‘at least but not exactly 95% coverage’ means substantially more than 95% coverage. Maybe I’ll give you a list of choices for each interval, e.g. 1%, 5%, 50%, 95%, 99% that are spread out enough that familiarity with the illustrations we use often in class will make it clear which is which.

From the top, the first, third, and fourth—the widest 3—are calibrated for at least 95% coverage. The third and fourth have almost exactly 95% coverage.

Explanation. All 3 are wide enough to span at least 95% of draws from (i.e., at least 95% of the probability mass of) the sampling distribution. The third and fourth, which are roughly the same width, have almost exactly 95% coverage. The first is too wide for that—it spans almost all draws from the sampling distribution.

Problem 2

The miracle of random sampling is that we’re able to estimate the mean of a population with a very small sample from that population. But for that to work, our observations have to be independent—or close to it. If we observe the incomes \(Y_1 \ldots Y_n\) of \(n\) people drawn with replacement from a population with mean income \(\mu\) and income standard deviation \(\sigma\), the variance of the sample mean \(\frac{1}{n}\sum_{i=1}^n Y_i\) will be \(\sigma^2/n\). Below, I’ve shown the calculation.

\[ \begin{aligned} \Var\qty[\frac1n\sum_{i=1}^n Y_i] &= \E\qty[ \qty{ \frac{1}{n}\sum_i Y_i - \E \qty( \frac{1}{n}\sum_i Y_i ) }^2 ] && \\ &= \E\qty[ \qty{ \frac{1}{n}\sum_i (Y_i - \E Y_i) }^2 ] && \\ &= \E\qty[ \qty{ \frac{1}{n}\sum_i Z_i }^2 ] && \text{for} \ \ Z_i = Y_i - \E Y_i \\ &= \E\qty[ \frac{1}{n^2}\sum_i \sum_j Z_i Z_j ] && \\ &= \frac{1}{n^2} \sum_i \sum_j \E Z_i Z_j && \\ &= \frac{1}{n^2} \sum_i \sum_j \begin{cases} \sigma^2 & \text{ when } j=i \\ 0 & \text{ otherwise } \end{cases} \\ &= \frac{1}{n^2} \sum_i \sigma^2 = \frac{1}{n^2} \times n \times \sigma^2 = \frac{\sigma^2}{n} \end{aligned} \]

Now suppose that you’ve been lazy, and instead of calling \(n\) different people, you’ve just called one and reported their income \(n\) times. That is, you’ve got ‘a sample of size n’, \(\tilde Y_1 \ldots \tilde Y_n\), with \(\tilde Y_1=Y_1, \tilde Y_2 = Y_1, \tilde Y_3 = Y_1, \ldots\). What is the variance of the mean of this ‘sample’, \(\frac1n\sum_{i=1}^n \tilde Y_i\)? And if it’s not \(\sigma^2/n\) like we got for the mean of \(Y_1 \ldots Y_n\), explain—with reference to the calculation above—why it is not. If there’s a line or lines where something different happens, say which; say what happens instead; and explain why.

The variance of this ‘sample mean’ is \(\sigma^2\). The problem is in the second to last line. Note that if we replace each instance of \(Y_i\) with \(\tilde Y_i=Y_1\), then \(Z_i=Y_1 - \E Y_1\) for all \(i\) and \(Z_iZ_j=(Y_1-\E Y_1)^2\) for all \(i\) and \(j\). This means that \(\E Z_i Z_j=\sigma^2\) always instead of only when \(i=j\) and, as a result, \(\frac{1}{n^2}\sum_i\sum_j \E Z_i Z_j\) is an average of \(n^2\) copies of \(\sigma^2\) — one for each pair \((i,j)\).
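
A quick numerical check, if you’d like one. The income population below is made up for illustration; only the comparison between the two variances matters.

set.seed(1)
incomes = rgamma(100000, shape = 2, scale = 25000)   # a made-up income population
sigma2 = mean((incomes - mean(incomes))^2)           # its variance, sigma^2
n = 25

honest.means = replicate(10000, mean(sample(incomes, size = n, replace = TRUE)))
lazy.means   = replicate(10000, mean(rep(sample(incomes, size = 1), n)))

c(honest = var(honest.means), expected = sigma2 / n)  # roughly sigma^2 / n
c(lazy = var(lazy.means), expected = sigma2)          # roughly sigma^2: copying one call averages nothing out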

Problem 3

A brewer delivers a batch of a million bottles to its distributor. The distributor has recently been getting complaints, so they’ve instituted a new quality control process to try to ensure that the proportion of bad bottles is 2% or less. They set aside a hundred bottles, drawn without replacement from that million, for testing. And they find that 3 of them—3%—are bad. Having done this, they claim that the batch doesn’t meet their standards and refuse to pay. The brewer has called you in to consult.

Part A

The distributor is claiming that, on the basis of their test, they’re confident that the proportion of bad bottles in the million delivered exceeds 2%. Are you? Explain why or why not.

You’ll probably want to use the plots below. These are plots of the sampling distribution of the sample proportion \(\hat\theta=\frac{1}{100}\sum_{i=1}^{100} Y_i\) when \(Y_1 \ldots Y_{100}\) are sampled without replacement from a binary population of size 1 million in which the proportion \(\theta\) of 1s is \(.02\), \(.015\), and \(.01\).

I am not confident that it exceeds 2%. If the proportion of bad bottles in the million were exactly 2%, and I drew 100 bottles without replacement like this repeatedly, almost 20% of the time I’d get a sample with exactly 3 bad bottles. And over 30% of the time, I’d get a sample with at least 3 bad bottles. We can see that in the plot on the left, which shows how frequently the proportion of bad bottles set aside for testing would be 0/100, 1/100, 2/100, …

That’s enough, but I’d say a bit more if I were actually in this situation. I’d say this. Even if it were lower, this would happen pretty often. From the plot in the center, we see that if the proportion of bad bottles in the million were 1.5%, we’d have at least 3 bad bottles in the sample we test about 20% of the time. And from the one on the right, we see that even if the proportion in the million were 1%, we’d see at least 3 bad ones in our test about 10% of the time.

Plots: the sampling distribution of \(\hat\theta\) for \(\theta=.02\) (left), \(\theta=.015\) (center), and \(\theta=.01\) (right).
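
Here’s how you can compute the probabilities I’ve quoted above exactly, using R’s hypergeometric distribution functions (we’re sampling 100 bottles without replacement from a million).

m.pop = 1e6                                 # bottles in the delivery
for (theta in c(.02, .015, .01)) {
  bad = theta * m.pop                       # bad bottles in the delivery
  # dhyper/phyper arguments: count, number of bad bottles, number of good bottles, sample size
  p.exactly.3  = dhyper(3, bad, m.pop - bad, 100)
  p.at.least.3 = 1 - phyper(2, bad, m.pop - bad, 100)
  cat(sprintf("theta = %.3f:  P(exactly 3 bad) = %.3f,  P(3 or more bad) = %.3f\n",
              theta, p.exactly.3, p.at.least.3))
}
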
Part B

The brewer claims, based on its own testing, that the proportion of bad bottles in their deliveries will be no more than 1%. You learn that the distributor will be rejecting deliveries whenever the proportion of bad ones in the sample they test exceeds 2% and, after consulting with the brewer, you decide that the best course of action is to ask them to adjust the size of their sample so they’ll reject no more than 5% of deliveries. Assuming the brewer is correct about their proportion of bad bottles, which of the following is a good sample size to ask for: 100, 400, or 1600?

Comment. This one was harder than I intended. I’ll make sure the actual exam’s question on choosing a sample size is more straightforward. What I’m going to do here is explain the straightforward approach I was thinking of. That’s the kind of thing you’ll want to be able to do for the exam. If you want to read a bit about the subtleties, some of which came up a bit during our review session, take a look at the note below the second solution. Don’t feel obligated. This isn’t something I intend as required reading before or after the exam. But given that the question I asked wound up raising a few questions, I felt like I owed you answers.

The Simple Explanation. Our delivery will be rejected when the sample proportion \(\hat\theta\) is greater than \(.02\). To ensure our delivery is rejected no more than 5% of the time, we want to choose \(n\) so 95% of draws from \(\hat\theta\)’s sampling distribution are less than or equal to \(.02\). Now we’re going to make 3 guesses/approximations that make this simpler.

  1. We’ll choose \(n\) to ensure that the middle 95% of draws from our sampling distribution are less than or equal to \(.02\). This is overkill. Having the smallest 95% of draws be less than or equal to \(.02\) is enough.
  2. We’ll assume the worst about the brewer’s actual proportion of bad bottles. They say \(\theta \le .01\), so let’s assume \(\theta=.01\).
  3. We’ll assume that the sampling distribution of our sample proportion \(\hat\theta\) will be approximately normal no matter which sample size we choose.

The mean and standard deviation of our estimator \(\hat\theta\) are \(\theta\) and \(\sqrt{\theta(1-\theta)/n}\) respectively.1 Given our guesses/approximations, this means that roughly 95% of draws from its sampling distribution will be in the interval \(\theta \pm 1.96 \sqrt{\theta(1-\theta)/n}\) for \(\theta=.01\) and we want to choose \(n\) so that the upper bound on this interval is no larger than \(.02\). When we do this, we find we want roughly \(n=20^2=400\) observations. \[ .01 + 1.96 \sqrt{\frac{.01 \times .99}{n}} \le .02 \qqtext{ when } 1.96 \sqrt{\frac{.01 \times .99}{n}} \le .01 \] i.e. when \[ \sqrt{n} \ge 1.96 \frac{\sqrt{.01 \times .99}}{.01} \approx 2 \frac{\sqrt{.01}}{.01} = 2 \frac{.1}{.01} = 20. \]
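
The same calculation in a couple of lines of R, solving the inequality for \(n\) exactly rather than rounding along the way.

theta = .01
z = qnorm(.975)                                        # roughly 1.96
n.required = (z * sqrt(theta * (1 - theta)) / (.02 - theta))^2
n.required                                             # about 380, so 400 is the right choice of the three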

Here I’m going to write a bit about what happens when we drop the 3 simplifying assumptions we made above. I’ll go in order: first 1, then 2, then 3.

First, let’s think about what happens when we focus on the smallest 95% of draws instead of the middle 95%. That is, when we want the 95th percentile of our sampling distribution to be less than or equal to \(.02\). If we stick with our other assumptions, this means changing \(1.96\) to \(1.64\) in the calculation above, because \(\theta+1.64\sigma\) is the 95th percentile of a normal distribution with mean \(\theta\) and standard deviation \(\sigma\).

Figure 1: An illustration comparing the draws we were counting before when we used \(1.96\) (shaded red) to the draws we’re counting now using \(1.64\) (shaded green). The orange area counts the draws they have in common; the green area the draws below them that we weren’t counting before; and the red area the larger draws we had to count to get up to 95% as a result. To improve the visibility of the two upper bounds I’m talking about, I’ve plotted them as dotted lines.

It makes a meaningful difference, but not enough to change our answer from 400 to 100. If we substitute 1.64 for 1.96 in the calculation above, we decrease our estimate of \(n\) by a factor of \((1.64 / 1.96)^2 \approx 0.7\).

We can use the smallest 95% of the sampling distribution to calibrate an interval estimate just like we’ve used the middle 95%. The difference is that this is a one-sided interval that goes from some lower bound all the way to infinity. In particular, if we let \(w_+\) be the distance from \(\theta\) to the 95th percentile of \(\hat\theta\)’s sampling distribution (the dotted green line above), then \([\hat\theta - w_+, \infty]\) is a one-sided interval estimate with 95% coverage. If that sampling distribution is normal with standard deviation \(\sigma\), it’s \([\hat\theta - 1.64\sigma, \infty]\). Why is that calibrated for 95% coverage? Because it contains \(\theta\) for 95% of draws \(\hat\theta\) from the sampling distribution — the smallest 95% of draws. We can flip all this around, and think about the largest 95% of draws, if we want an interval that extends from some upper bound all the way to \(-\infty\).

Second, let’s think about what happens when we stop assuming the worst about the brewer’s actual proportion of bad bottles. This assumption doesn’t cause us any problems. The upper bound we’re comparing to \(.02\), whether we use \(\theta + 1.96\sigma\) or \(\theta + 1.64\sigma\) for \(\sigma = \sqrt{\theta(1-\theta)/n}\), increases as we increase \(\theta\). This means ensuring the bound is \(.02\) or smaller for \(\theta=.01\) ensures it’s \(.02\) or smaller for all \(\theta \le .01\).

Finally, let’s think about what happens when we stop assuming that the sampling distribution of our sample proportion \(\hat\theta\) will be approximately normal no matter which sample size we choose. If we’re choosing between the three sample sizes \(100\), \(400\), and \(1600\), we can just work out the 95th percentile of our estimator’s actual sampling distribution at all three sample sizes and choose the smallest sample size at which it’s less than or equal to \(.02\). We already know, from our plots from Part A, that it doesn’t work at \(100\). Let’s check the other two.
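
Here’s how I’d compute those percentiles, using R’s hypergeometric quantile function qhyper.

m.pop = 1e6
bad = .01 * m.pop                             # assuming theta = .01, as above
for (n in c(100, 400, 1600)) {
  # 95th percentile of the number of bad bottles in the sample, divided by n
  q95 = qhyper(.95, bad, m.pop - bad, n) / n
  cat(sprintf("n = %4d: 95th percentile of the sample proportion = %.4f\n", n, q95))
}
# We want the smallest of these sample sizes at which this is <= .02.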

Figure 2: The sampling distribution of \(\hat\theta\) when \(\theta=.01\) for \(n=100\) (left), \(n=400\) (center), and \(n=1600\) (right). The normal approximation to each sampling distribution is drawn over it in red. The 95th percentiles of the actual sampling distribution and of its normal approximation are plotted as magenta and green dotted lines respectively, and the 97.5th percentile of the normal approximation—the upper bound we get using \(1.96\) instead of \(1.64\)—is plotted as a red dotted line.

What we see is pretty interesting. Our normal approximation is pretty bad at \(n=100\), ok for \(n=400\), and pretty good at \(n=1600\). And the 95th percentile of the normal approximation tends to be a bit smaller than the 95th percentile of the actual sampling distribution. But our choice of \(n=400\) works: the 95th percentile of the actual sampling distribution is exactly \(.02\) at \(n=400\). What’s happened is that the errors in our two approximations — in using the normal instead of the hypergeometric and thinking about the middle 95% instead of the smallest 95% of draws — have basically canceled each other out.

That this worked out so nicely was luck. I was thinking both sources of error would be small enough not to matter much, not that they’d cancel out. And when you really dig into what’s happened, it’s pretty strange. If you look at the three upper bounds we’ve been talking about as functions of \(n\) plotted below, you can see the one that’s exactly right—the 95th percentile of the actual sampling distribution—has a bit of a sawtooth pattern on top of the overall trend. Why? Think about how the 95th percentile of the number of bad bottles changes as we increase our sample size \(n\). It has flat spots—it’s 0, then 1, then 2, then 3, etc. To get the 95th percentile of the proportion of bad bottles in the sample, we take the 95th percentile of the number and divide it by \(n\). Because this denominator \(n\) is increasing in the numerator’s flat spots, the 95th percentile of the proportion goes down — that’s the downward slope of each ‘tooth’ of the saw — until the numerator jumps up to the next level — those are the jumps in the sawtooth pattern. The 95th percentile of the actual sampling distribution tends to bounce between the 95th and 97.5th percentiles of its normal approximation and at \(n=100\) and \(n=400\) it’s just taken a jump up, so it’s closer to the 97.5th.

Figure 3: Left: The three upper bounds we’ve talked about plotted as functions of \(n\). The 95th percentile of the actual sampling distribution of \(\hat\theta\) in magenta, the 95th percentile of its normal approximation in red (\(\theta+1.64\sigma\)), and the 97.5th percentile of its normal approximation in green (\(\theta+1.96\sigma\)). Right: The 95th percentile of the number of bad bottles in the sample.

If we accept the premise that what the brewer really wants is to ensure 95% of (infinitely many) shipments pass this test, there is actually a way for the brewer to take advantage of this strange sawtooth pattern in this example. Suppose that the distributor is willing to increase the sample size, but only if the brewer compensates them for the cost of the additional bottles they have to open and test. Instead of asking them to open \(400\), we can ask them to open 250. That’s the first \(n\) at which the 95th percentile of the sampling distribution is less than or equal to \(.02\), so it’s the cheapest way to ensure that 95% of the distributor’s tests result in a delivery that gets accepted.
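
A small scan in R, starting from the distributor’s current sample size of 100 and only considering increases, should recover that number.

m.pop = 1e6
bad = .01 * m.pop
ns = 100:1000                                 # only consider increasing the sample size
q95 = sapply(ns, function(n) qhyper(.95, bad, m.pop - bad, n) / n)
ns[min(which(q95 <= .02))]                    # should recover the 250 mentioned above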

This is, of course, all a bit contrived. There’s no particular reason for the brewer to want 95% of shipments to get accepted vs. 96% or 95.5%, especially if they’re not getting paid for the deliveries that get rejected.

Problem 4

In the block of R code below, I’ve implemented the estimators \(\hat\theta_2\) and \(\hat\theta_3\) from Problem 1.

theta.hat.2 = function(Y) { mean(Y) }
theta.hat.3 = function(Y) { sum(Y)/(length(Y)+10) }

And here is code that does a thing.

do.thing = function(estimator) {
  1:10000 |> map_vec(function(.) {
    Ystar = sample(Y, size=n, replace=TRUE)
    estimator(Ystar)
  }) 
}

Below, I’ve plotted the sampling distributions of \(\hat\theta_2\) (left) and \(\hat\theta_3\) (right) in gray with their means indicated by blue vertical lines, a histogram of the result of calling do.thing(theta.hat.2) (left) and do.thing(theta.hat.3) (right) in orange, a green vertical line indicating the value of \(\theta\), and interval estimates of the form \(\hat\theta_2 \pm 1.96\hat\sigma_2\) (left) and \(\hat\theta_3 \pm 1.96\hat\sigma_3\) (right) where \(\hat\sigma_2\) and \(\hat\sigma_3\) are the results of calling sd(do.thing(theta.hat.2)) and sd(do.thing(theta.hat.3)) respectively.

Part A

If we take this approach to calibrating an interval estimator centered on \(\hat\theta_2\), what is the coverage probability of these intervals: roughly 95%, roughly 50%, or roughly 5%? What about the interval estimators centered on \(\hat\theta_3\)?

95% for \(\hat\theta_2\) and 50% for \(\hat\theta_3\).

Explanation. The ‘thing’ do.thing does is draw 10,000 samples from the bootstrap sampling distribution of our estimator, so these are the intervals \(\hat\theta \pm 1.96\hat\sigma\) where \(\hat\sigma\) is the standard deviation of the estimator’s bootstrap sampling distribution. But we don’t need to know that to answer this question. All we need to know is that, whatever it does, we get histograms with the same widths as our estimators’ actual sampling distributions. That gives us 95% coverage for the unbiased estimator \(\hat\theta_2\), but only about 50% coverage for the estimator \(\hat\theta_3\), which has bias roughly equal to this interval’s half-width \(1.96\hat\sigma\). Why 50%? Because when the bias and the half-width are equal, we get an interval that covers when our point estimate is on the right half of its sampling distribution and one that doesn’t when it’s on the left half.

Part B

Below, I’ve added another interval estimate to each plot. A blue one. These are \(\hat\theta_2 \pm 1.96 \hat\sigma_2\) and \(\hat\theta_3 \pm 1.96 \hat\sigma_3\) where, letting \(\hat\theta_2^{(1)} \ldots \hat\theta_2^{(10,000)}\) be the elements of do.thing(theta.hat.2) and \(\hat\theta_3^{(1)} \ldots \hat\theta_3^{(10,000)}\) be the elements of do.thing(theta.hat.3), \[ \begin{aligned} \hat\sigma_2^2 &= \frac{1}{10,000}\sum_{r=1}^{10,000} (\hat\theta_2^{(r)} - \bar Y)^2 \\ \hat\sigma_3^2 &= \frac{1}{10,000}\sum_{r=1}^{10,000} (\hat\theta_3^{(r)} - \bar Y)^2. \end{aligned} \]

Explain why the blue interval on the left looks about the same as the black one but the one on the right is wider. Why might you want to use these blue intervals instead of the black ones?

Extra Credit Problems.

This one was meant to be a little unfamiliar—something you couldn’t do on autopilot even if you had perfect recall of the lectures and homeworks. I will put something a bit like this on the exam, but it’ll be an extra credit problem and clearly identified as such. If you want to prepare for it, you might want to think a little bit more about \(\hat\sigma_2\) and its relationship to the usual estimate of the standard deviation of the sample mean \(\bar Y\), \(\hat\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (Y_i - \bar Y)^2 / n}\).

What we’ve done is make \(\hat\sigma\) the root-mean-squared distance between a draw \(\hat\theta^\star\) from our estimator’s bootstrap sampling distribution and the sample mean \(\bar Y\), instead of the distance to the bootstrap sampling distribution’s mean. When we’re using the estimator \(\hat\theta_2\), \(\bar Y\) is the bootstrap sampling distribution’s mean, so nothing changes. When we’re using the estimator \(\hat\theta_3\), it’s not—our draws \(\hat\theta^\star\) tend to be lower—so the root-mean-squared distance to \(\bar Y\) is bigger than the standard deviation of those draws. We might want to use these because, when we have a biased estimator, they get wider to account for the bias, which improves coverage.
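
You can see this in a few lines of R. This is a sketch with a hypothetical sample Y (the actual sample isn’t given), re-using theta.hat.3 and do.thing as defined in the problem; do.thing reads Y and n from the environment and needs purrr for map_vec.

library(purrr)
set.seed(1)
Y = rexp(30, rate = 1/5)     # a made-up sample, for illustration only
n = length(Y)

boot.draws = do.thing(theta.hat.3)              # 10,000 bootstrap draws of theta.hat.3
sd(boot.draws)                                  # the 'black' sigma-hat-3: bootstrap standard deviation
sqrt(mean((boot.draws - mean(Y))^2))            # the 'blue' sigma-hat-3: rms distance to Ybar
# The blue one is larger because draws of theta.hat.3 tend to fall below Ybar.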

Detail I wouldn’t expect you to provide. In fact, it gets the right amount wider. Think of a point estimate at the left edge of the middle 95% of the sampling distribution of \(\hat\theta_3\)—about where the black interval ends. If you calibrated an interval estimate around it ‘the blue way’, it’d just touch the estimation target. You can see why by thinking about the analogy that motivates the bootstrap.

About the Bootstrap. When we use the bootstrap, we’re thinking of the relationship between a bootstrap sample and the sample as analogous to the relationship between the sample and the population. So what we’re really doing here is calculating the root-mean-squared difference between a bootstrap version of our estimator, \(\hat\theta^\star\), and a bootstrap version of our estimation target, \(\theta^\star=\bar Y\)—that’s the mean of the ‘population’ that our bootstrap sample is drawn from. When \(\hat\theta^\star\) is an unbiased estimator of \(\theta^\star\), this is just its standard deviation. But when \(\hat\theta^\star\) is a biased estimator of \(\theta^\star\), it’s a bootstrap version of the root-mean-squared error. \[ \text{bootstrap rmse} = \sqrt{\text{bootstrap bias}^2 + \text{bootstrap sd}^2}. \] Often, the bootstrap analogy works and the bootstrap bias and bootstrap sd are about the same as the actual bias and actual sd. This means we’ll get 95% coverage from our biased estimator, as we’re inflating the width of our intervals just enough for that to happen. That is not, however, necessarily the best way to take advantage of our ability to use the bootstrap to anticipate our estimator’s bias. Instead of using it to inflate the width of our intervals, we can use it to correct the bias of our estimator so we don’t have to inflate them. More on this later in the semester.

Footnotes

  1. Almost. Since we’re sampling without replacement, the actual standard deviation is smaller by a factor of \(\sqrt{(m-n)/(m-1)}\), as we saw in a recent homework. But since \(m\) is a million and \(n\) is 100, this factor is so close to 1 that it makes no practical difference.↩︎