\[
\DeclareMathOperator{\E}{E}
\DeclareMathOperator{\Var}{V}
\DeclareMathOperator{\hVar}{\widehat{V}}
\DeclareMathOperator{\bias}{bias}
\]
Summary
This week’s homework addresses two issues we’ve left hanging.
The first is why we get near-perfect calibration when we calibrate our intervals using an estimated variance. I know what you’re thinking: we had a whole homework assignment on this. But that’s only half right. In the Week 2 Homework, we focused on showing that we get near-perfect calibration, not really on why. Using normal approximation, we can do the why part pretty easily. This’ll be quick, and it’ll involve a bit of calculus, which makes it a good warm-up for what we’ll do next. It’ll give us an opportunity to revisit some of our Week 2 material from a more formula-driven perspective, too.
The second is the variance of comparisons between groups. I said, in our Lecture on Comparing Two Groups, that you’d be calculating the variance of a difference in subsample means in this one. You’ll be doing that and a little more: you’ll also be calculating, approximately, the variance of a ratio of subsample means, because sometimes that’s closer to what you want to know. People often say, for example, that women in this country earn 78 cents on the dollar for doing the same work as men. That’s a ratio. Tackling the ratio in addition to the difference won’t be too much additional work. After a little bit of calculus, we basically wind up in the same place as we do for the difference.
Calculus Review: Linear Approximation
We’re going to be using linear approximation to simplify some of our calculations. Given a function \(f(x)\), we can approximate it near any point \(x_0\) like this. \[
f(x) \approx f(x_0) + f'(x_0)(x-x_0)
\]
Hopefully you remember that from calculus. If you like, you can call it a first-order Taylor approximation. Most calculus textbooks give a few formulas for the error of this approximation, which is called the remainder in Taylor’s Theorem.
When we’re thinking about functions of multiple variables, we use the multivariate version, which involves partial derivatives. \[
\begin{aligned}
f(x,y)
&\approx f(x_0,y_0) + \qty[\frac{\partial f}{\partial x}(x_0,y_0)] (x-x_0) \ + \ \qty[\frac{\partial f}{\partial y}(x_0,y_0)] (y-y_0).
\end{aligned}
\]
Why We Usually Get Near-Perfect Calibration
Suppose we’ve sampled with replacement from a binary population in which \(\theta\) is the proportion of ones. If we use the sample mean \(\hat\theta\) as our point estimate and calibrate a 95% confidence interval around it using normal approximation, this is the interval we get.
\[
\hat\theta \pm 1.96 \hat\sigma / \sqrt{n} \qfor \hat\sigma^2 = \hat\theta(1-\hat\theta)
\]
On the other hand, the interval we’d want—assuming we’re still happy to use normal approximation—uses the actual variance of our sample proportion, \(\sigma^2=\theta(1-\theta)\), instead of the estimate \(\hat\sigma^2\). Figure 1 is an attempt to convince ourselves that it doesn’t make much of a difference at all. I think I called the difference ‘a fingernail thick’ in lecture. Now you’re going to quantify this difference. We’ll assume that \(\hat\theta\) is one of the ‘good draws’ from its sampling distribution, which for our purposes will mean that it’s in the interval \(\theta \pm 1.96\sigma / \sqrt{n}\), the middle 95% of the sampling distribution’s normal approximation.
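If you want to see the two intervals side by side, here’s a minimal sketch in Python. It isn’t part of the assignment, and the population proportion \(\theta = 0.3\) and sample size \(n = 1000\) are made-up values chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 0.3, 1000                      # hypothetical population proportion and sample size

sample = rng.binomial(1, theta, size=n)   # sample with replacement from a binary population
theta_hat = sample.mean()                 # point estimate
sigma_hat = np.sqrt(theta_hat * (1 - theta_hat))

# 95% interval calibrated with the estimated standard deviation
lo, hi = theta_hat - 1.96 * sigma_hat / np.sqrt(n), theta_hat + 1.96 * sigma_hat / np.sqrt(n)

# the interval we'd use if we knew the actual standard deviation
sigma = np.sqrt(theta * (1 - theta))
lo_ideal, hi_ideal = theta_hat - 1.96 * sigma / np.sqrt(n), theta_hat + 1.96 * sigma / np.sqrt(n)

print(f"estimated-width interval: [{lo:.4f}, {hi:.4f}]")
print(f"ideal-width interval:     [{lo_ideal:.4f}, {hi_ideal:.4f}]")
```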
Exercise 1
Suppose \(\hat\theta \in \theta \pm 1.96\sigma / \sqrt{n}\) for \(\sigma^2=\theta(1-\theta)\). Find an approximate upper bound on the difference \(|\hat w - w|\) between the estimated interval width \(\hat w = 2 \times 1.96 \hat\sigma / \sqrt{n}\) and the ideal-but-unusable interval width \(w= 2 \times 1.96 \sigma / \sqrt{n}\). What fraction of the ideal width \(w\) is your bound?
Your bound, both in absolute terms and as a fraction of \(w\), should be a function of \(\theta\) and \(n\).
You’ll probably want to use linear approximation to do this. \[
\hat\sigma - \sigma = f(\hat\theta) - f(\theta) \approx f'(\theta)(\hat\theta - \theta) \qfor f(x) = \sqrt{x(1-x)}.
\]
We’ll want to know the derivative of \(f(x) = \sqrt{x(1-x)}\), so let’s start by calculating that. \[
\begin{aligned}
f'(x)
&= \frac{d}{dx} \qty{x(1-x)}^{1/2} = \frac{1}{2} \qty{x(1-x)}^{-1/2} \frac{d}{dx} x (1-x) && \text{by the power rule and chain rule} \\
&= \frac{1}{2} \qty{x(1-x)}^{-1/2} \qty{1 \times (1-x) + x \times (-1)} && \text{by the product rule} \\
&= \frac{1}{2} \qty{x(1-x)}^{-1/2} \qty{1 - 2x} = \frac{1 - 2x}{2\sqrt{x(1-x)}}
\end{aligned}
\]
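As a sanity check on this algebra, here’s a quick Python sketch. The expansion point \(x_0 = 0.3\) and the evaluation points are arbitrary; it compares the derivative formula with a numerical derivative, and the linear approximation with the exact change in \(f\).

```python
import numpy as np

def f(x):
    # the function whose derivative we just worked out
    return np.sqrt(x * (1 - x))

def f_prime(x):
    # the formula we derived: (1 - 2x) / (2 sqrt(x(1-x)))
    return (1 - 2 * x) / (2 * np.sqrt(x * (1 - x)))

x0, h = 0.3, 1e-6
numerical = (f(x0 + h) - f(x0 - h)) / (2 * h)        # centered finite difference
print(f"formula: {f_prime(x0):.6f}   numerical: {numerical:.6f}")

# how good is the linear approximation f(x) - f(x0) ≈ f'(x0) (x - x0)?
for x in [0.28, 0.32, 0.40]:
    exact = f(x) - f(x0)
    linear = f_prime(x0) * (x - x0)
    print(f"x={x:.2f}  exact change={exact:+.5f}  linear approx={linear:+.5f}")
```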
Now we can use this to approximate the difference \(|\hat w - w|\). \[
\begin{aligned}
\hat w - w
&= 2 \times 1.96 \times \frac{\hat\sigma-\sigma}{\sqrt{n}} \\
&= \frac{2 \times 1.96}{\sqrt{n}} \times \qty{f(\hat\theta) - f(\theta)} \qfor f(x) = \sqrt{x(1-x)} \\
&\approx \frac{2 \times 1.96}{\sqrt{n}} \times f'(\theta) (\hat\theta - \theta) \\
&= \frac{2 \times 1.96}{\sqrt{n}} \times \frac{1 - 2\theta}{2\sqrt{\theta(1-\theta)}} (\hat\theta - \theta)
\end{aligned}
\] Now let’s use our assumption that \(\hat\theta \in \theta \pm 1.96\sigma / \sqrt{n}\) for \(\sigma^2=\theta(1-\theta)\). We can rephrase this as \(\hat\theta - \theta \in \pm 1.96\sigma / \sqrt{n}\) and equivalently \(\abs{\hat\theta - \theta} \leq 1.96\sigma / \sqrt{n}\). Combining this with our approximation above, we get the following approximate upper bound on the difference \(\abs{\hat w - w}\). \[
\begin{aligned}
\abs{\hat w - w}
&\approx \frac{2 \times 1.96}{\sqrt{n}} \times \frac{\abs{1 - 2\theta}}{2\sqrt{\theta(1-\theta)}} \times \textcolor{blue}{\abs{\hat\theta - \theta}}
&& \text{as calculated above} \\
&\leq \frac{2 \times 1.96}{\sqrt{n}} \times \frac{\abs{1 - 2\theta}}{2\sqrt{\theta(1-\theta)}} \times \textcolor{blue}{1.96 \frac{\sigma}{\sqrt{n}}} \qfor \sigma = \sqrt{\theta(1-\theta)} &&\text{using our rephrased assumption} \\
&= \frac{2 \times 1.96}{\sqrt{n}} \times \frac{\abs{1 - 2\theta}}{2\sqrt{\theta(1-\theta)}} \times \textcolor{blue}{1.96 \frac{\sqrt{\theta(1-\theta)}}{\sqrt{n}}} &&\text{substituting}\\
&= 1.96^2 \frac{\abs{1-2\theta}}{n} && \text{simplifying}
\end{aligned}
\] Let’s work out what fraction of the ideal width \(w\) this is. \[
\begin{aligned}
\frac{\abs{\hat w - w}}{w}
&\lessapprox
\frac{1.96^2 \frac{\abs{1-2\theta}}{n}}{2 \times 1.96 \frac{\sqrt{\theta(1-\theta)}}{\sqrt{n}}} \\
&= \frac{1.96}{2} \times \frac{\abs{1-2\theta}}{\sqrt{\theta(1-\theta)}} \times \frac{1/n}{1/\sqrt{n}} \\
&= \frac{1.96}{2} \times \frac{\abs{1-2\theta}}{\sqrt{\theta(1-\theta)}} \times \frac{1}{\sqrt{n}} \\
&\approx \frac{1}{\sqrt{n}} \times \frac{\abs{1-2\theta}}{\sqrt{\theta(1-\theta)}}
\end{aligned}
\]
Above, the notation \(a \lessapprox b\) means that \(a\) is approximately less than \(b\). What’s happening here is that \(a\) is approximately equal to its linear approximation \(a_{\text{lin}}\) and \(a_{\text{lin}}\) is less than or equal to \(b\): \(a\lessapprox b\) because \(a \approx a_{\text{lin}} \le b\).
Exercise 2
Are there values of \(\theta\) where this difference \(\hat w - w\) is a large fraction of the ideal width \(w\)? If so, how large? Use Figure 3 to explain what’s going on in intuitive terms.
Yes. Arbitrarily large, in fact. The function \(g(\theta) = \frac{\abs{1-2\theta}}{\sqrt{\theta(1-\theta)}}\) goes to infinity as \(\theta\) goes to 0 or 1.
What’s going on in terms of Figure 3? The problem is that the slope of the width curve gets large where the width curve gets small—near \(\theta=0\) and \(\theta=1\)—so when we have an estimate \(\hat\theta \approx \theta\) for such \(\theta\), the error \(w(\hat\theta) - w(\theta) \approx w'(\theta) \times (\hat\theta - \theta)\) of our estimated width gets large relative to the width \(w(\theta)\) itself.
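If you’d like numbers to go with this, here’s a small Python sketch. The \(\theta\) and \(n\) values are arbitrary; it evaluates the approximate bound from Exercise 1 alongside the exact worst-case fraction \(\abs{\hat w - w}/w\) over the good draws. The two can disagree a little, since the bound rests on the linear approximation, but both tell the same story: the fraction is tiny unless \(\theta\) is near 0 or 1 or \(n\) is small.

```python
import numpy as np

def bound(theta, n):
    # the approximate bound from Exercise 1: (1/sqrt(n)) * |1 - 2 theta| / sqrt(theta (1 - theta))
    return abs(1 - 2 * theta) / np.sqrt(theta * (1 - theta)) / np.sqrt(n)

def exact_worst_case(theta, n):
    # brute force: largest |w_hat - w| / w over good draws theta_hat in theta +/- 1.96 sigma / sqrt(n)
    sigma = np.sqrt(theta * (1 - theta))
    lo = max(theta - 1.96 * sigma / np.sqrt(n), 0.0)
    hi = min(theta + 1.96 * sigma / np.sqrt(n), 1.0)
    grid = np.linspace(lo, hi, 10001)
    return np.max(np.abs(np.sqrt(grid * (1 - grid)) - sigma)) / sigma

for theta in [0.5, 0.3, 0.1, 0.01]:
    for n in [1000, 10000]:
        print(f"theta={theta:5.2f}  n={n:6d}  "
              f"approx bound={bound(theta, n):.4f}  exact worst case={exact_worst_case(theta, n):.4f}")
```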
Variance Calculation for Comparisons
Differences in Means
In our Lecture on Comparing Two Groups, we talked about how to use subsample means to compare two groups. In particular, we talked about the case that we’ve drawn a sample \((X_1,Y_1) \ldots (X_n,Y_n)\) with replacement from a population \((x_1,y_1) \ldots (x_m,y_m)\) in which \(x_j \in \{0,1\}\) indicates membership in one of two groups, e.g. treated and control groups in Figure 4. And we talked about using the difference \(\textcolor[RGB]{0,191,196}{\hat\mu(1)}-\textcolor[RGB]{248,118,109}{\hat\mu(0)}\) in the mean of \(Y_i\) for the subsamples in which \(\textcolor[RGB]{0,191,196}{X_i=1}\) and \(\textcolor[RGB]{248,118,109}{X_i=0}\) to estimate the corresponding difference \(\textcolor[RGB]{0,191,196}{\mu(1)}-\textcolor[RGB]{248,118,109}{\mu(0)}\) in the population.
Exercise 3
We showed that an individual subsample mean \(\hat\mu(x)\) is an unbiased estimator of the corresponding population mean \(\mu(x)\). That implies that the difference of two such estimates, \(\hat\mu(1)-\hat\mu(0)\), is an unbiased estimator of the difference of the corresponding population means, \(\mu(1)-\mu(0)\).
Explain why the first implies the second. A sentence or even a couple words should do.
We also calculated a formula for the variance of a subsample mean \(\hat\mu(x)\). \[
\Var[\hat\mu(x)] = \E\qty[\frac{\sigma^2(x)}{N_x}] \text{ for } N_x = \sum_{i}1_{=x}(X_i) \qand \sigma^2(x) = \Var[Y_i \mid X_i=x]
\]
And I stated without proof a formula for the variance of the difference of two subsample means. \[
\Var\qty[\hat{\mu}(1)-\hat{\mu}(0)] = \E\qty[\frac{1}{N_1}\sigma^2(1)+\frac{1}{N_0}\sigma^2(0)] \text{ for } N_x = \sum_{i}1_{=x}(X_i)
\]
It’s a simple formula. The variance of the difference in means is the sum of the variances of the two means. Why is that the case? To see why, let’s start from definitions and do a bit of arithmetic.
\[
\begin{aligned}
\Var\qty[\hat{\mu}(1)-\hat{\mu}(0)]
&= \E\qty[ \qty(\{\hat{\mu}(1) - \hat{\mu}(0)\} - \{\mu(1)-\mu(0)\})^2 ] \\
&= \E\qty[ \qty(\{\hat{\mu}(1) - \mu(1)\} - \{\hat{\mu}(0) -\mu(0)\})^2 ] \\
&= \E\qty[ \qty(\{\hat{\mu}(1) - \mu(1)\})^2 ] + \E\qty[ \qty(\{\hat{\mu}(0) -\mu(0)\})^2 ] \\
&- 2\E\qty[ \{\hat{\mu}(1) - \mu(1)\}\{\hat{\mu}(0) -\mu(0)\}]
\end{aligned}
\tag{1}\]
The first two terms here are the ones that appear in our formula above: the variances of the two means. For that formula to be correct, the last term has to be zero. It’s up to you to prove that.
Exercise 4
Complete the argument by proving that the ‘cross term’ is zero, i.e., that \[
\mathop{\mathrm{E}}\qty[\{\hat{\mu}(1) - \mu(1)\}\{\hat{\mu}(0) -\mu(0)\}] = 0.
\]
Look over the calculation we used for one subsample mean \(\hat{\mu}(x)\) here. You’ll want to use a lot of the same ideas: writing sums over our subsamples as sums over \(1 \ldots n\) by putting in group indicators \(1_{=x}(X_i)\), conditioning on \(X_1 \ldots X_n\), the indicator trick, thinking about what happens when \(j=i\) and \(j\neq i\) in the double sum we get when we expand the product, etc. In the subsample mean calculation, the \(j=i\) terms gave us some stuff that was nonzero. Why isn’t that happening here?
I saw a lot of you guys claim that \(\hat\mu(0)\) and \(\hat\mu(1)\) are independent in your submissions. Sometimes that claim wasn’t justified; sometimes it was justified by the claim that the subsamples \(\{ (X_i,Y_i) \ : \ X_i=0\}\) and \(\{ (X_i, Y_i) \ : \ X_i=1 \}\) are independent. That’s not true. One way of seeing that the subsamples aren’t independent is to notice that the numbers \(N_0\) and \(N_1\) of observations in these subsamples sum to \(n\), so if you know one you know the other. The means do, however, have zero covariance. That’s the point of this exercise, but it does take a little calculation to show it.
We’ll show that the expectation conditional on \(X_1 \ldots X_n\) is zero. By the law of iterated expectations, this will imply that the unconditional expectation is also zero.
Let’s start by just expanding \(\hat{\mu}(1)\) and \(\hat{\mu}(0)\) in the cross term, doing a little arithmetic, and then using the indicator trick. \[
\begin{aligned}
\{\hat{\mu}(1) - \mu(1)\}\{\hat{\mu}(0) -\mu(0)\}
&= \frac{1}{N_1}\sum_{i=1}^n 1_{=1}(X_i) \{ Y_i-\mu(1) \} \frac{1}{N_0}\sum_{j=1}^n 1_{=0}(X_j) \{Y_j-\mu(0)\} \\
&= \frac{1}{N_1 N_0} \sum_{i=1}^n \sum_{j=1}^n 1_{=1}(X_i) 1_{=0}(X_j) \{ Y_i-\mu(1) \} \{Y_j-\mu(0)\} \\
&= \frac{1}{N_1 N_0} \sum_{i=1}^n \sum_{j=1}^n 1_{=1}(X_i) 1_{=0}(X_j) \{ Y_i-\mu(X_i) \} \{Y_j-\mu(X_j)\}
\end{aligned}
\] Now let’s look at the conditional expectation. Using linearity of conditional expectations, we can push the expectation all the way into the double sum. \[
\begin{aligned}
&\mathop{\mathrm{E}}\qty[\{\hat{\mu}(1) - \mu(1)\}\{\hat{\mu}(0) -\mu(0)\} \mid X_1 \ldots X_n ] \\
&= \frac{1}{N_1 N_0} \sum_{i=1}^n \sum_{j=1}^n 1_{=1}(X_i) 1_{=0}(X_j) \mathop{\mathrm{E}}\qty[\{Y_i-\mu(X_i)\}\{Y_j-\mu(X_j)\} \mid X_1 \ldots X_n ]
\end{aligned}
\] Every term in this double sum is zero. Why?
- If \(i \neq j\), then \(\mathop{\mathrm{E}}\qty[\{Y_i-\mu(X_i)\}\{Y_j-\mu(X_j)\} \mid X_1 \ldots X_n ] = 0\) because \((X_i,Y_i)\) and \((X_j,Y_j)\) are independent. This is shown here in our lecture on Comparing Two Groups.
- If \(i=j\), then it’s not possible that \(X_i=1\) and \(X_j=0\), so the product of indicators is zero.
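Here’s a minimal simulation sketch that checks this numerically. The population is made up for illustration, and this is a spot check, not a proof: across many samples, the covariance of the two subsample means is tiny compared to their variances, and the variance of the difference matches the sum of the variances.

```python
import numpy as np

rng = np.random.default_rng(0)

# a made-up population: X_i indicates the group, Y_i depends on the group
m = 10000
x_pop = rng.binomial(1, 0.4, size=m)
y_pop = np.where(x_pop == 1, 5.0, 2.0) + rng.normal(0, 1, size=m)

n, reps = 200, 20000
mu0_hat, mu1_hat = np.empty(reps), np.empty(reps)
for r in range(reps):
    idx = rng.integers(0, m, size=n)          # sample with replacement
    x, y = x_pop[idx], y_pop[idx]
    mu0_hat[r] = y[x == 0].mean()             # subsample means (n is large enough here that
    mu1_hat[r] = y[x == 1].mean()             # both groups are essentially always nonempty)

print("covariance of the subsample means:", np.cov(mu1_hat, mu0_hat)[0, 1])
print("variance of the difference:       ", np.var(mu1_hat - mu0_hat, ddof=1))
print("sum of the two variances:         ", np.var(mu1_hat, ddof=1) + np.var(mu0_hat, ddof=1))
```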
Ratios of Means
If \(\hat\mu(1)-\hat\mu(0)\) is a good estimator of \(\mu(1)-\mu(0)\), then shouldn’t \(\hat\mu(1)/\hat\mu(0)\) be a good estimator of \(\mu(1)/\mu(0)\)? Let’s look into it. To do this, we’ll think of the ratio as a function of the two means. \[
\frac{\hat\mu(1)}{\hat\mu(0)} - \frac{\mu(1)}{\mu(0)} = f(\hat\mu(1), \hat\mu(0)) - f(\mu(1), \mu(0)) \qfor f(x,y) = \frac{x}{y}.
\]
And we’ll use a linear approximation to this function to think about this difference.
\[
\begin{aligned}
f(\hat\mu(1), \hat\mu(0)) \approx f(\mu(1), \mu(0))
&+ \qty[\frac{\partial f}{\partial x}(\mu(1), \mu(0))](\hat\mu(1) - \mu(1)) \\
&+ \qty[\frac{\partial f}{\partial y}(\mu(1), \mu(0))](\hat\mu(0) - \mu(0))
\end{aligned}
\]
This approximation should be good if \(\hat\mu(1)\) and \(\hat\mu(0)\) are close to \(\mu(1)\) and \(\mu(0)\).
Exercise 5
When our sample includes a reasonably large number of observations in both groups, it’s reasonable to expect that they should be close. With a sentence or two, or a rough sketch if you prefer, explain why.
If we have a large sample size in group \(x\), then \(\hat\mu(x)-\mu(x)\) will be small with very high probability. It’s approximately normal with mean zero and standard deviation \(\sigma=\sqrt{\mathop{\mathrm{E}}[\sigma^2(x)/N_x]}\). This means that the difference should be smaller than \(2\sigma\) in 95% of samples and smaller than \(3\sigma\) in more than 99% of samples.
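Here’s a small simulation sketch along those lines, again with a made-up population: the deviations \(\hat\mu(1)-\mu(1)\) land within two empirical standard deviations about 95% of the time and within three more than 99% of the time.

```python
import numpy as np

rng = np.random.default_rng(1)

# a made-up population, as in the sketch above
m = 10000
x_pop = rng.binomial(1, 0.4, size=m)
y_pop = np.where(x_pop == 1, 5.0, 2.0) + rng.normal(0, 1, size=m)
mu1 = y_pop[x_pop == 1].mean()                 # the population subgroup mean

n, reps = 200, 10000
err = np.empty(reps)
for r in range(reps):
    idx = rng.integers(0, m, size=n)
    x, y = x_pop[idx], y_pop[idx]
    err[r] = y[x == 1].mean() - mu1            # mu_hat(1) - mu(1)

sd = err.std()
print(f"sd of mu_hat(1) - mu(1): {sd:.4f}")
print(f"within 2 sd: {np.mean(np.abs(err) <= 2 * sd):.3f}   within 3 sd: {np.mean(np.abs(err) <= 3 * sd):.3f}")
```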
Now that we’ve justified the approximation, let’s use it to analyze our ratio estimator.
Exercise 6
Write out a formula for the linear approximation to the ratio estimator \(\hat\mu(1)/\hat\mu(0)\) in terms of \(\mu(0)\), \(\mu(1)\), \(\hat\mu(1)-\mu(1)\), and \(\hat\mu(0)-\mu(0)\). Then approximate its bias by comparing the expected value of this linear approximation to the estimation target \(\mu(1)/\mu(0)\).
We’ll start by calculating the partial derivatives that appear in our linear approximation. For \(f(x,y) = x/y\), \[
\begin{aligned}
\frac{\partial f}{\partial x} &= \frac{1}{y} \\
\frac{\partial f}{\partial y} &= -\frac{x}{y^2}
\end{aligned}
\] Plugging these into our formula for the linear approximation, we get this. \[
\frac{\hat\mu(1)}{\hat\mu(0)} \approx \frac{\mu(1)}{\mu(0)} + \frac{1}{\mu(0)} (\hat\mu(1) - \mu(1)) - \frac{\mu(1)}{\mu(0)^2} (\hat\mu(0) - \mu(0)).
\] To approximate our bias, we take the expectation of both sides. \[
\begin{aligned}
\mathop{\mathrm{E}}\qty[\frac{\hat\mu(1)}{\hat\mu(0)}]
&\approx \frac{\mu(1)}{\mu(0)} + \frac{1}{\mu(0)} \mathop{\mathrm{E}}\qty[\hat\mu(1) - \mu(1)] - \frac{\mu(1)}{\mu(0)^2} \mathop{\mathrm{E}}\qty[\hat\mu(0) - \mu(0)] \\
&=\frac{\mu(1)}{\mu(0)} + \frac{1}{\mu(0)} \times 0 - \frac{\mu(1)}{\mu(0)^2} \times 0 = \frac{\mu(1)}{\mu(0)}
\end{aligned}
\] Subtracting \(\frac{\mu(1)}{\mu(0)}\) from both sides, it follows that our bias, \(\mathop{\mathrm{E}}\qty[\frac{\hat\mu(1)}{\hat\mu(0)}] - \frac{\mu(1)}{\mu(0)}\), is approximately zero.
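If you want to see this near-unbiasedness in action, here’s a simulation sketch with a made-up population. The estimated bias comes out small relative to the estimator’s standard deviation, though not exactly zero: the linear approximation drops the curvature terms, which is what Exercise 8 asks about.

```python
import numpy as np

rng = np.random.default_rng(2)

# a made-up population with known subgroup means
m = 10000
x_pop = rng.binomial(1, 0.4, size=m)
y_pop = np.where(x_pop == 1, 5.0, 2.0) + rng.normal(0, 1, size=m)
mu1, mu0 = y_pop[x_pop == 1].mean(), y_pop[x_pop == 0].mean()

n, reps = 200, 20000
ratios = np.empty(reps)
for r in range(reps):
    idx = rng.integers(0, m, size=n)          # sample with replacement
    x, y = x_pop[idx], y_pop[idx]
    ratios[r] = y[x == 1].mean() / y[x == 0].mean()

print(f"target mu(1)/mu(0):          {mu1 / mu0:.4f}")
print(f"mean of the ratio estimator: {ratios.mean():.4f}")
print(f"estimated bias:              {ratios.mean() - mu1 / mu0:+.4f}")
print(f"estimator's std deviation:   {ratios.std():.4f}")
```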
Exercise 7
By calculating the variance of the linear approximation to \(\hat\mu(1)/\hat\mu(0)\) you used in Exercise 6, find a formula approximating the variance of this ratio estimator. Use it to calculate a 95% confidence interval based on normal approximation for the ratio \(\mu(1)/\mu(0)\) in the National Supported Work Demonstration. Draw it on top of the plot of that estimator’s bootstrap sampling distribution in Figure 2. With all these approximations, are you still getting an interval similar to the one I already drew there, which is calibrated using the bootstrap?
You’ll need some information included in a table in our discussion of this study in lecture.
Using our linear approximation, we can see that our ratio estimator’s variance is approximately the variance of a linear combination of the subsample means \(\hat\mu(1)\) and \(\hat\mu(0)\). \[
\begin{aligned}
\Var\qty[\frac{\hat\mu(1)}{\hat\mu(0)}]
&\approx \Var\qty[\frac{\mu(1)}{\mu(0)} + \frac{1}{\mu(0)} \{\hat\mu(1) - \mu(1)\} - \frac{\mu(1)}{\mu(0)^2} \{\hat\mu(0) - \mu(0)\}] \\
&=\Var\qty[\frac{1}{\mu(0)} \{\hat\mu(1) - \mu(1)\} - \frac{\mu(1)}{\mu(0)^2} \{\hat\mu(0) - \mu(0)\}]
\end{aligned}
\] A slight extension of our result from Exercise 4 shows that this variance is the sum of the variances of the two terms. In particular, the cross term in an expansion analogous to Equation 1 is just a constant times the cross term that is in Equation 1, and we showed that was zero in Exercise 4. Since \(\Var[aX] = a^2 \Var[X]\) for any constant \(a\), using our formula for the variance of \(\hat\mu(1)\) and \(\hat\mu(0)\) from Lecture 6 we get this. \[
\begin{aligned}
\Var\qty[\frac{\hat\mu(1)}{\hat\mu(0)}]
&\approx \frac{1}{\mu(0)^2} \Var\qty[\hat\mu(1) - \mu(1)] + \frac{\mu(1)^2}{\mu(0)^4} \Var\qty[\hat\mu(0) - \mu(0)] \\
&=\frac{1}{\mu(0)^2} \E\qty[\frac{\sigma^2(1)}{N_1}] + \frac{\mu(1)^2}{\mu(0)^4} \E\qty[\frac{\sigma^2(0)}{N_0}].
\end{aligned}
To estimate this quantity, we plug in the subsample means \(\hat\mu(1)\) and \(\hat\mu(0)\) for \(\mu(1)\) and \(\mu(0)\); the sample variances \(\hat\sigma^2(1)\) and \(\hat\sigma^2(0)\) for the population variances \(\sigma^2(1)\) and \(\sigma^2(0)\); and the reciprocals of the observed group sizes, \(1/N_1\) and \(1/N_0\), for their expected values \(\E[1/N_1]\) and \(\E[1/N_0]\).
\[
\begin{aligned}
\hVar\qty[\frac{\hat\mu(1)}{\hat\mu(0)}]
&=\frac{1}{\hat\mu(0)^2} \frac{\hat\sigma^2(1)}{N_1} + \frac{\hat\mu(1)^2}{\hat\mu(0)^4} \frac{\hat\sigma^2(0)}{N_0} \\
&= \frac{1}{4600^2} \frac{7900^2}{185} +
\frac{6300^2}{4600^4} \frac{5500^2}{260} \\
&\approx 0.0262563 \\
&\approx 0.16^2
\end{aligned}
\]
It follows that this interval estimate for the ratio \(\mu(1)/\mu(0)\) should have approximately 95% coverage. \[
\begin{aligned}
\frac{\hat\mu(1)}{\hat\mu(0)} \pm 1.96 \times \sqrt{\hVar\qty[\frac{\hat\mu(1)}{\hat\mu(0)}]}
&\approx \frac{6300}{4600} \pm 1.96 \times 0.16 \\
&\approx [1.05, 1.69].
\end{aligned}
\tag{2}\]
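Here’s that arithmetic as a short Python sketch, using the rounded summary statistics quoted above (group 1 is the treated group, group 0 the control group).

```python
import numpy as np

# rounded summary statistics quoted above from the lecture table
mu1_hat, sigma1_hat, n1 = 6300.0, 7900.0, 185
mu0_hat, sigma0_hat, n0 = 4600.0, 5500.0, 260

# plug-in estimate of the ratio estimator's variance
v_hat = (1 / mu0_hat**2) * sigma1_hat**2 / n1 + (mu1_hat**2 / mu0_hat**4) * sigma0_hat**2 / n0
se_hat = np.sqrt(v_hat)

ratio_hat = mu1_hat / mu0_hat
lo, hi = ratio_hat - 1.96 * se_hat, ratio_hat + 1.96 * se_hat
print(f"variance estimate: {v_hat:.7f}  (standard error {se_hat:.3f})")
print(f"interval estimate: [{lo:.2f}, {hi:.2f}]")
```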
Below, we see this interval drawn in purple on top of the bootstrap sampling distribution plot from Figure 2. It’s very similar to the bootstrap-calibrated interval, but not centered in exactly the same place. The reason for that is rounding error. In Figure 2, I calculate the exact ratio of subsample means \(\hat\mu(1)/\hat\mu(0)\), whereas in the calculation above I used rounded versions of \(\hat\mu(1)\) and \(\hat\mu(0)\) from this table from Lecture.
If you replace the ratio \(6300/4600\) with the exact ratio of subsample means in Equation 2, you get the interval drawn in red, which is even more similar to the bootstrap-calibrated interval. Note that I did not replace the rounded values of \(\hat\mu(1)\) and \(\hat\mu(0)\) in the variance estimate. That’d probably have given an even more similar interval.
The unintended lesson here is that, unless you’re just doing a rough calculation by hand, you shouldn’t round until the end of your calculations.
All of that ignores the error of our linear approximation as a potential problem. We should, if we like, be able to reason about this error using tools from calculus.
Exercise 8
Extra Credit. Using some version of Taylor’s Theorem to characterize the error of the linear approximation you’ve been working with, refine your answer to Exercise 6: find an upper bound on the absolute value of the estimator’s bias. This should be a formula involving \(\mu(1)\), \(\mu(0)\), and the subsample sizes \(N_1\) and \(N_0\). Compare it to your approximation of the estimator’s standard deviation from Exercise 7. Are you worried that confidence intervals like the one you calculated in Exercise 7 might have coverage well below the nominal level of 95%?