Lecture 3

Calibrating Interval Estimates with Binary Observations

Sampling and Calibration Review

A Polling Nightmare

Before the Election


Your Sample

\[ \begin{array}{r|rrrr|r} i & 1 & 2 & \dots & 625 & \bar{Y}_{625} \\ Y_i & 1 & 1 & \dots & 0 & 0.72 \\ \end{array} \]

The Population

\[ \color{lightgray} \begin{array}{r|rrrrrr|r} j & 1 & 2 & 3 & 4 & \dots & 7.23M & \bar{y}_{7.23M} \\ y_{j} & 1 & 1 & 1 & 0 & \dots & 1 & 0.70 \\ \end{array} \]

  • You want to estimate the proportion of all registered voters — the population — who will vote.
  • To do this, you use the proportion of polled voters — your sample — who said they would.
    • You’re probably not going to match the population proportion exactly …
    • … so you report an interval estimate—a range of values we claim the population proportion is in.
    • You know you’re not going to be right 100% of the time when you make claims like this …
    • … so you state a nominal coverage probability—how often you’re right about claims like this.1
  • Without thinking too hard about it, you …
    • … report, as your interval, your sample proportion (72%) ± 1%. Because it sounds good.
    • … and say your coverage probability is 95%. Because that’s what everybody else says.

After the Election


Your Sample

\[ \begin{array}{r|rrrr|r} i & 1 & 2 & \dots & 625 & \bar{Y}_{625} \\ Y_i & 1 & 1 & \dots & 0 & 0.72 \\ \end{array} \]

The Population

\[ \begin{array}{r|rrrrrr|r} j & 1 & 2 & 3 & 4 & \dots & 7.23M & \bar{y}_{7.23M} \\ y_{j} & 1 & 1 & 1 & 0 & \dots & 1 & 0.70 \\ \end{array} \]

  • When the election occurs, we get to see who turns out to vote.
    • 5.05M people, or roughly 70% of registered voters, actually vote.
    • You — and future employers — can see how well you did. And how well everybody else did.
  • Your interval missed the target. It doesn’t contain the turnout proportion.
    • Your point estimate is only off by 2%. But you overstated your precision.
    • Now you’re kicking yourself. You’d briefly considered saying ± 3% or ± 4%.
    • That would’ve done it. But that didn’t sound as good, so you went for ± 1%.
  • You hope you’re not the only one who missed. So you check out the competition.

The Competition

\[ \begin{array}{r|rr|rr|r|rr|r} \text{call} & 1 & & 2 & & \dots & 625 & & \\ \text{poll} & J_1 & Y_1 & J_2 & Y_2 & \dots & J_{625} & Y_{625} & \overline{Y}_{625} \\ \hline \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}869369 & \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}4428455 & \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}\dots & \color[RGB]{7,59,76}1268868 & \color[RGB]{7,59,76}1 & \color[RGB]{7,59,76}0.68 \\ \color[RGB]{239,71,111}2 & \color[RGB]{239,71,111}600481 & \color[RGB]{239,71,111}0 & \color[RGB]{239,71,111}6793745 & \color[RGB]{239,71,111}1 & \color[RGB]{239,71,111}\dots & \color[RGB]{239,71,111}1377933 & \color[RGB]{239,71,111}1 & \color[RGB]{239,71,111}0.71 \\ \color[RGB]{17,138,178}3 & \color[RGB]{17,138,178}3830847 & \color[RGB]{17,138,178}1 & \color[RGB]{17,138,178}5887416 & \color[RGB]{17,138,178}1 & \color[RGB]{17,138,178}\dots & \color[RGB]{17,138,178}4706637 & \color[RGB]{17,138,178}1 & \color[RGB]{17,138,178}0.70 \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \color[RGB]{6,214,160}1M & \color[RGB]{6,214,160}1487507 & \color[RGB]{6,214,160}1 & \color[RGB]{6,214,160}393580 & \color[RGB]{6,214,160}1 & \color[RGB]{6,214,160}\dots & \color[RGB]{6,214,160}1247545 & \color[RGB]{6,214,160}0 & \color[RGB]{6,214,160}0.72 \\ {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} & {\vdots} \\ \end{array} \]


  • It turns out that everyone got their point estimate exactly like you did.
    • Each of them rolled their 7.23M-sided die 625 times.
    • And after making their 625 calls, they reported their sample proportion.
  • You’re not the only one who missed, but you’re part of a pretty small club.
    • 5% of the other polls got it wrong.
  • But you are the only one who claimed they’d get within 1%.
    • Everyone else claimed they’d get within roughly 4%.
    • Most were right. About 95%. And a lot of them had worse point estimates than you.
  • What you have is a calibration problem.

Calibration using the Sampling Distribution


  • You’d have seen that your interval was miscalibrated if you’d known then what you know now.
    1. Your competitors’ point estimates, which they got exactly like you got yours.
    2. Who actually turned out, so you could simulate even more polls if you wanted to.
  • You’d have drawn ± 1% intervals around all these point estimates.1
    • And noticed that many of these intervals cover the target. About 43% do.
    • Or, if leaving the target out of it, that only 43% of them cover the mean of all these estimates.
    • Same thing. The mean and target are right on top of one another. Like this: |.
  • And you’d have known how to choose the right interval width. One that makes 95% of these intervals cover. Right?

Quiz


  • I’ve drawn the sampling distribution of an estimator, 100 draws from it as points, and 3 interval estimates.
  • One of these interval estimates is calibrated to have exactly 95% coverage. Which is it?
    • A. The one around the point on top.
    • B. The one around the point in the middle.
    • C. The one around the point on the bottom.
  • Submit your answer here.
  • Hint. You can reach your twin with your arms if your twin can reach you with theirs.
  • Hint. Maybe you look like this and your twin like this |.

Calibration in Reality

What You Know Now
What You Knew Then
  • But you didn’t know any of this. You just knew your own point estimate.
  • Without all that post-election information, you didn’t know how to calibrate an interval estimate.
  • But everyone else did. Their widths were almost the same as the width you’d choose now.
Your competitors’ intervals, recentered for easy comparison of width to the sampling distribution’s.
  • They didn’t know the point estimator’s actual sampling distribution…
  • … but they knew enough about it to choose the right interval width.
  • That’s what we’re going to look into now. Over the course of this lecture and the next, we’ll work out what the sampling distribution of our estimator can look like and how its actual shape depends on the population.
  • And we’ll see how to use that information to calibrate our intervals.

Probability Review

Sampling Distributions of Sums of Binary Random Variables

This’ll be a slog. I’ll explain why we’ve bothered when we get to the end.

Starting Simple

\(j\) name
\(1\) Rush
\(2\) Mitt
\(3\) Al


  • Let’s start with the distribution of \(Y_1\), the response to one call, when we sample with replacement.
  • And to keep things concrete, let’s suppose we’re polling a population of size \(m=3\). We’re rolling a 3-sided die.1
    • You can see the population’s responses above in two formats: a table and a plot.
    • We have two Nos (0s) and one Yes (1).
    • You could say the ‘Yes’ frequency in the population is \(\theta_1 = 1/3\).
  • We’ll walk through the process of finding the distribution of \(Y_1\).

Finding the Distribution of \(Y_1\) in a Small Population

\(j\) name \(y_j\)
\(1\) Rush \(0\)
\(2\) Mitt \(0\)
\(3\) Al \(1\)

The Joint

\(p\) \(J_1\) \(Y_1\)
\(\color[RGB]{239,71,111}\frac13\) \(1\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{239,71,111}\frac13\) \(2\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{17,138,178}\frac13\) \(3\) \(\color[RGB]{17,138,178}1\)

The Marginal

\(p\) \(Y_1\)
\(\color[RGB]{239,71,111}\frac23\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{17,138,178}\frac13\) \(\color[RGB]{17,138,178}1\)
  1. We write the probability distribution of our die roll in a table.
  • It has two columns: one for the roll \(J_1\) and another for its probability \(p\).
  • It has \(m=3\) rows, one for each possible outcome of the roll.
  • Each roll is equally likely, so the probability in each row is \(1/3\).
  2. We add a column for the response to our call, \(Y_1\).
  • Adding this column is ‘free’—it doesn’t change the row’s probability—because what we hear is determined by the roll.
    • E.g. if we roll a 3, we’re going to hear Al say ‘Yes’.
    • So the probability we roll a 3 and hear a ‘Yes’ is the same as the probability we roll a 3.
  3. We marginalize over rolls to find the distribution of our call’s outcome.
  • We can color each row according to the call’s outcome \(Y_1\), then sum the probabilities in each color.
  • The probability of hearing ‘No’ is the probability that we roll a 1 or a 2: \(1/3+1/3 = 2/3\).
  • The probability of hearing ‘Yes’ is the probability that we roll a 3: \(1/3\).
  • These probabilities match the frequencies of ‘Yes’ and ‘No’ in our population. (The short R snippet below checks this numerically.)
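If you like, you can check this in R. The snippet below is just a numerical version of the same three steps for the \(m=3\) population above — write the joint table, attach the responses, and marginalize — using only the Rush/Mitt/Al responses from the table; nothing else is assumed.

y = c(0, 0, 1)                              # Rush, Mitt, Al: two Nos and one Yes
joint = data.frame(p = rep(1/3, 3),         # each roll J1 = 1, 2, 3 is equally likely
                   J1 = 1:3,
                   Y1 = y)                  # the response is determined by the roll
aggregate(p ~ Y1, data = joint, FUN = sum)  # P(Y1 = 0) = 2/3, P(Y1 = 1) = 1/3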

Generalizing to Larger Populations

  • Is this probabilities-match-frequencies phenomenon a fluke? Or is it a general rule?
    • To find out, we’ll work through the same steps in the abstract, i.e. for a binary population \(y_1 \ldots y_m\) of arbitrary size \(m\).
    • To ground us while we do it, we’ll think about our population of \(m=7.23M\) registered voters in Georgia.
    • But we’ll handle the general case. Plugging in 7.23M ones and zeros wouldn’t exactly make things easier.

Finding the Distribution of \(Y_1\) in a Large Population

The Population

\(j\) \(y_j\)
\(1\) 1
\(2\) 1
\(3\) 1
\(4\) 0
\(m\) 1

The Joint

\(p\) \(J_1\) \(Y_1\)
\(1/m\) \(1\) \(1\)
\(1/m\) \(2\) \(1\)
\(1/m\) \(3\) \(1\)
\(1/m\) \(4\) \(0\)
\(1/m\) \(m\) \(1\)

The Marginal

\(p\) \(Y_1\)
\(\underset{\color{gray}\approx 0.30}{\sum\limits_{j:y_j=0} \frac{1}{m}}\) \(0\)
\(\underset{\color{gray}\approx 0.70}{\sum\limits_{j:y_j=1} \frac{1}{m}}\) \(1\)

Generalizing to Larger Populations

  1. We start by writing the probability distribution of our die roll.
  • The person we call, \(J_1\), is equally likely to be anyone in the population of \(m \approx 7.23M\) people.
  • It takes on each value \(1 \ldots m\) with probability \(1/m\).
  2. We add a column for the response to our call, \(Y_1\).
  • This is ‘for free’. The roll determines what we hear, so the probabilities don’t change. Still \(1/m\).
  3. And we marginalize to find the distribution of \(Y_1\).
    • We sum the probabilities in the rows where \(Y_1=0\) and in the rows where \(Y_1=1\).
    • What we saw in our small population does generalize.
    • The probability of hearing a response \(y \in \{0,1\}\) is the frequency of that response in the population.
    • We’ll call these frequencies \(\theta_0\) and \(\theta_1\). This is a little redundant because \(\theta_0 = 1 - \theta_1\).
    • Our estimation target is just the ‘Yes’ frequency \(\theta_1\). Some people call it the success rate.
  • To find the distribution of the sum \(Y_1+Y_2\), we can follow the same steps.
  • There’s a bit more to it, so we’ll break down the marginalization step to make it a little more tractable.

Exercise: Two Calls to A Small Population

\(j\) name \(y_j\)
\(1\) Rush \(0\)
\(2\) Mitt \(0\)
\(3\) Al \(1\)


  1. Make a table for the joint distribution of two rolls of a 3-sided die. Add columns for the responses.
  2. Partially marginalize to get the joint distribution of the responses. Add a column for their sum.
  3. Marginalize to get the distribution of the sum. (An R check of this enumeration appears after the answer tables below.)

Aren’t you happy that, even though you can’t make a 3-sided die, I didn’t ask you to use a 4-sided one?

\(p\) \(J_1\) \(J_2\) \(Y_1\) \(Y_2\)
\(\color[RGB]{239,71,111}1/9\) \(1\) \(1\) \(0\) \(0\)
\(\color[RGB]{239,71,111}1/9\) \(1\) \(2\) \(0\) \(0\)
\(\color[RGB]{17,138,178}1/9\) \(1\) \(3\) \(0\) \(1\)
\(\color[RGB]{239,71,111}1/9\) \(2\) \(1\) \(0\) \(0\)
\(\color[RGB]{239,71,111}1/9\) \(2\) \(2\) \(0\) \(0\)
\(\color[RGB]{17,138,178}1/9\) \(2\) \(3\) \(0\) \(1\)
\(\color[RGB]{17,138,178}1/9\) \(3\) \(1\) \(1\) \(0\)
\(\color[RGB]{17,138,178}1/9\) \(3\) \(2\) \(1\) \(0\)
\(\color[RGB]{6,214,160}1/9\) \(3\) \(3\) \(1\) \(1\)
\(p\) \(Y_1\) \(Y_2\) \(Y_1 + Y_2\)
\(\color[RGB]{239,71,111}\frac{4}{9}\) \(0\) \(0\) \(0\)
\(\color[RGB]{17,138,178}\frac{2}{9}\) \(0\) \(1\) \(1\)
\(\color[RGB]{17,138,178}\frac{2}{9}\) \(1\) \(0\) \(1\)
\(\color[RGB]{6,214,160}\frac{1}{9}\) \(1\) \(1\) \(2\)
\(p\) \(Y_1 + Y_2\)
\(\color[RGB]{239,71,111}\frac{4}{9}\) \(0\)
\(\color[RGB]{17,138,178}\frac{4}{9}\) \(1\)
\(\color[RGB]{6,214,160}\frac{1}{9}\) \(2\)
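Here’s the promised R check of this enumeration: all nine equally likely pairs of rolls, the responses they produce, and the probabilities summed by the value of the sum. Nothing here is new; it’s the answer tables above, computed rather than written out.

y = c(0, 0, 1)                              # Rush, Mitt, Al
rolls = expand.grid(J1 = 1:3, J2 = 1:3)     # all 9 equally likely pairs of rolls
rolls$p = 1/9
rolls$Y1 = y[rolls$J1]                      # responses are determined by the rolls
rolls$Y2 = y[rolls$J2]
rolls$S  = rolls$Y1 + rolls$Y2
aggregate(p ~ S, data = rolls, FUN = sum)   # 4/9, 4/9, 1/9 for S = 0, 1, 2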

From Sampling to Coin Flip Sums: Two Responses

The Joint

\(p\) \(J_1\) \(J_2\) \(Y_1\) \(Y_2\) \(Y_1 + Y_2\)
\(\frac{1}{m^2}\) \(1\) \(1\) \(y_1\) \(y_1\) \(y_1+y_1\)
\(\frac{1}{m^2}\) \(1\) \(2\) \(y_1\) \(y_2\) \(y_1+y_2\)
\(\frac{1}{m^2}\) \(1\) \(m\) \(y_1\) \(y_m\) \(y_1+y_m\)
\(\frac{1}{m^2}\) \(2\) \(1\) \(y_2\) \(y_1\) \(y_2+y_1\)
\(\frac{1}{m^2}\) \(2\) \(2\) \(y_2\) \(y_2\) \(y_2+y_2\)
\(\frac{1}{m^2}\) \(m\) \(m\) \(y_m\) \(y_m\) \(y_m+y_m\)

Partially Marginalized

\(p\) \(Y_1\) \(Y_2\) \(Y_1 + Y_2\)
\(\textcolor[RGB]{239,71,111}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 0,0}} \frac{1}{m^2}}\) \(0\) \(0\) \(\textcolor[RGB]{239,71,111}{0}\)
\(\textcolor[RGB]{17,138,178}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 0,1}} \frac{1}{m^2}}\) \(0\) \(1\) \(\textcolor[RGB]{17,138,178}{1}\)
\(\textcolor[RGB]{17,138,178}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 1,0}} \frac{1}{m^2}}\) \(1\) \(0\) \(\textcolor[RGB]{17,138,178}{1}\)
\(\textcolor[RGB]{6,214,160}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1},y_{j_2} = 1,1}} \frac{1}{m^2}}\) \(1\) \(1\) \(\textcolor[RGB]{6,214,160}{2}\)

Fully Marginalized

\(p\) \(Y_1 + Y_2\)
\(\textcolor[RGB]{239,71,111}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1}+y_{j_2} = 0}} \frac{1}{m^2}}\) \(\textcolor[RGB]{239,71,111}{0}\)
\(\textcolor[RGB]{17,138,178}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1}+y_{j_2} = 1}} \frac{1}{m^2}}\) \(\textcolor[RGB]{17,138,178}{1}\)
\(\textcolor[RGB]{6,214,160}{\sum\limits_{\substack{j_1,j_2 \\ y_{j_1}+y_{j_2} = 2}} \frac{1}{m^2}}\) \(\textcolor[RGB]{6,214,160}{2}\)
  • To find the distribution of a sum of two responses, \(Y_1+Y_2\), we do the same thing.

  • We start with the joint distribution of two dice rolls.

    • Our rolls are equally likely to be any pair of numbers in \(1\ldots m\).
    • And there are \(m^2\) pairs, so the probability is \(1/m^2\) for each pair.
  • Then we add columns that are functions of the rolls: \(Y_1\), \(Y_2\), and \(Y_1+Y_2\).

    • This is ‘for free’ as before.
    • If we know who we’re calling, we know the responses we’ll hear and their sum.
  • Then we marginalize to find the distribution of the sum. We’ll do this in two steps.

    1. We partially marginalize over \(J_1,J_2\), to find the joint distribution of the response sequence \(Y_1, Y_2\).
    2. We fully marginalize the result to find the distribution of \(Y_1+Y_2\).
  • To sum over pairs of responses, we sum over one response then the other.
    • This is exactly like what we did for one response. Except twice.
    • And what we get is pretty similar too.
  • The probability of observing the pair of responses is determined by their frequencies in the population.
    • It’s the product of the frequencies of those responses.
    • For ‘Yes,Yes’ it’s \(\theta_1^2\). For ‘No,No’ it’s \(\theta_0^2\).
    • And for ‘Yes,No’ and ‘No,Yes’ it’s \(\theta_1\theta_0\).
  • This is, for what it’s worth, the product of the marginal probabilities of those responses.
    • When this happens for any pair of random variables, we say they are independent. More on that soon.
  • And the probability of any response pair \(a,b\) depends only on its sum \(a+b\).

\[ \begin{aligned} \sum\limits_{\substack{j_1,j_2\\ y_{j_1},y_{j_2} = a,b}} \frac{1}{m^2} &= \sum\limits_{\substack{j_1 \\ y_{j_1}=a}} \qty{ \sum\limits_{\substack{j_2 \\ y_{j_2}=b}} \frac{1}{m^2} } \\ &= \sum\limits_{\substack{j_1 \\ y_{j_1}=a}} \qty{ m_b \times \frac{1}{m^2} } \qfor m_y = \sum\limits_{j:y_j=y} 1 \\ &= m_a \times m_b \times \frac{1}{m^2} = \theta_a \times \theta_b \qfor \theta_y = \frac{m_y}{m} \\ &= \theta_1^{s} \theta_0^{2-s} \qfor s = a+b \end{aligned} \]

  • To calculate the marginal probability that \(Y_1+Y_2=s\), we sum the probabilities of all pairs with that sum.
    • This is pretty easy because there aren’t too many pairs.
    • And I’ve color coded the pairs to make it even easier.

From Sampling to Coin Flip Sums: \(n\) Responses.

\(p\) \(J_1\) \(J_n\) \(Y_1\) \(Y_n\)
\(\frac{1}{m^n}\) \(1\) \(1\) \(y_1\) \(y_1\)
\(\frac{1}{m^n}\) \(1\) \(2\) \(y_1\) \(y_2\)
\(\frac{1}{m^n}\) \(1\) \(m\) \(y_1\) \(y_m\)
\(\frac{1}{m^n}\) \(2\) \(1\) \(y_2\) \(y_1\)
\(\frac{1}{m^n}\) \(2\) \(2\) \(y_2\) \(y_2\)
\(\frac{1}{m^n}\) \(m\) \(m\) \(y_m\) \(y_m\)
\(p\) \(Y_1 + \ldots + Y_n\)
\(\color[RGB]{239,71,111}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = 0}} \sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n}\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{17,138,178}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = 1}} \sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n}\) \(\color[RGB]{17,138,178}1\)
\(\color[RGB]{6,214,160}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = n-1}} \sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n}\) \(\color[RGB]{6,214,160}n-1\)
\(\color[RGB]{255,209,102}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = n}} \sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n}\) \(\color[RGB]{255,209,102}n\)

Partially Marginalized

\(p\) \(Y_1\) \(Y_2\) \(Y_n\) \(Y_1 + \ldots + Y_n\)
\(\color[RGB]{239,71,111}\sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 0 \ldots 0}} \frac{1}{m^n}\) 0 0 \(0\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{17,138,178}\sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 0 \ldots 1}} \frac{1}{m^n}\) 0 0 \(1\) \(\color[RGB]{17,138,178}1\)
\(\color[RGB]{17,138,178}\sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 1 \ldots 0}} \frac{1}{m^n}\) 0 1 \(0\) \(\color[RGB]{17,138,178}1\)
\(\color[RGB]{6,214,160}\sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 1 \ldots 1}} \frac{1}{m^n}\) 0 1 \(1\) \(\color[RGB]{6,214,160}n-1\)
\(\color[RGB]{255,209,102}\sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 1, 1 \ldots 1}} \frac{1}{m^n}\) 1 1 \(1\) \(\color[RGB]{255,209,102}n\)
  • To find the distribution of a sum \(Y_1 + \ldots + Y_n\), we do the same thing.

  • We start by writing out the joint distribution of \(n\) dice rolls.

    • We’re equally likely to roll any sequence of \(n\) numbers in \(1\ldots m\).
    • And there are \(m^n\) sequences, so the probability is \(1/m^n\) for each sequence.
    • Then we add columns for the corresponding responses ‘for free’. And the sums.
  • Then we marginalize in two steps.

    1. Sum over roll sequences leading to the same response sequence \(a_1 \ldots a_n\).
    2. Sum over response sequences leading to the same sum \(s=a_1+\ldots+a_n\).
  • This isn’t a class about counting, so this stuff won’t be on the exam.

  • To sum over roll sequences, we sum over one roll after another.
    • This is exactly like what we did for one response. Except over and over.
    • You can think of this sum ‘inside out’.
  • Suppose we’re calculating the probability of the response sequence \(a_1 \ldots a_n = 0,0,...,0,0\).
    1. We consider any ‘head’ \(j_1\ldots j_{n-1}\) where \(y_{j_1} \ldots y_{j_{n-1}}=0,0,...0\) and think about choosing the last element. No matter what the head is, there are \(m_0\) ways to complete it by choosing \(j_{n}\) with \(y_{j_n}=0\).
    2. We repeat with shortened ‘head’ \(j_1 \ldots j_{n-2}\) to consider the choice of \(j_{n-1}\). We have \(m_0\) choices and our last step tells us that each choice results in \(m_0\) call sequences, so there are \(m_0^2\) sequences with head \(j_1 \ldots j_{n-2}\) ending in \(0,0\).
    3. Repeat \(n-2\) more times.
  • What we get is pretty similar to the two-response case.
  • The probability of observing the sequence is determined by the frequency of Yeses and Nos in the population.
    • It’s the product of the frequencies of those responses.
    • For ‘Yes,Yes,…,Yes’ it’s \(\theta_1^n\). For ‘No,No,…,No’ it’s \(\theta_0^n\).
    • And if we have \(s\) Yeses and therefore \(n-s\) Nos, it’s \(\theta_1^s\theta_0^{n-s}\).
  • This is, again, the product of the marginal probabilities of those \(n\) responses.

\[ \begin{aligned} \sum_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n} &= \sum_{\substack{j_1 \ldots j_{n-1} \\ y_{j_1} \ldots y_{j_{n-1}} = a_1 \ldots a_{n-1}}} \ \sum_{\substack{j_n \\ y_{j_n}=a_n}} \frac{1}{m^n} \\ &= \sum_{\substack{j_1 \ldots j_{n-1} \\ y_{j_1} \ldots y_{j_{n-1}} = a_1 \ldots a_{n-1}}} m_{a_n} \times \frac{1}{m^n} \\ &= \sum_{\substack{j_1 \ldots j_{n-2} \\ y_{j_1} \ldots y_{j_{n-2}} = a_1 \ldots a_{n-2}}} \ \sum_{\substack{j_{n-1} \\ y_{j_{n-1}}=a_{n-1}}} m_{a_n} \times \frac{1}{m^n} \\ &= \sum_{\substack{j_1 \ldots j_{n-2} \\ y_{j_1} \ldots y_{j_{n-2}} = a_1 \ldots a_{n-2}}} m_{a_{n-1}} m_{a_n} \times \frac{1}{m^n} \\ &= m_{a_1} \ldots m_{a_{n}} \times \frac{1}{m^n} \qqtext{after repeating $n-2$ more times} \\ &= \theta_{a_1} \ldots \theta_{a_n} \qfor \theta_y = \frac{m_y}{m} \\ &= \prod_{i:a_i=1} \theta_1 \prod_{i:a_i=0} \theta_0 = \theta_1^{s} \theta_0^{n-s} \qfor s = a_1 + \ldots + a_n \end{aligned} \]

  • The full marginalization step is a bit trickier. But we can take advantage of our partial marginalization.
  • To calculate the probability that the sum takes on the value \(s\), we order our sum of probabilities thoughtfully.
    • We sum over the response sequences \(a_1 \ldots a_n\) with \(s\) Yeses and \(n-s\) Nos.
    • And within that, sum over calls where we’d hear those responses.
  • The inner sum we’ve done. That was our partial marginalization.
  • And the outer sum boils down to counting the number of sequences with \(s\) Yeses.
    • The probability of all these sequences is the same: \(\theta_1^s\theta_0^{n-s}\). So summing is counting and multiplying.
    • You may have done the counting part in high school in a unit on ‘permutations and combinations’.
    • In any case, that count has a name. It’s spoken ‘\(n\) choose \(s\)’ and written \(\binom{n}{s}\).

\[ \begin{aligned} \sum\limits_{\substack{j_1 \ldots j_n \\ y_{j_1} + \ldots + y_{j_n} = s}} \frac{1}{m^n} &= \sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} \ \ \sum_{\substack{j_1 \ldots j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{1}{m^n} \\ & =\sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} \theta_1^{s}\theta_0^{n-s} \\ &= \binom{n}{s} \theta_1^{s} \theta_0^{n-s} \qqtext{ where } \binom{n}{s} = \sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} 1 \end{aligned} \]

\[ P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \theta_1^{s}\theta_0^{n-s} \ \ \text{ where } \ \ \binom{n}{s} \text{ is the number of binary sequences $a_1 \ldots a_n$ summing to $s$.} \]

  • We don’t actually have to do all this sequence-summing-to-\(s\) counting ourselves.
  • The choose function in R will do it for us: \(\binom{n}{s}\) is choose(n,s).
  • The dbinom function will give us the whole probability: \(\binom{s}{n}\theta_1^s\theta_0^{n-s}\) is dbinom(s, n, theta_1).
  • The rbinom function will draw samples from this distribution: rbinom(10000, n, theta_1) gives us 10,000.
library(ggplot2)   # for ggplot, geom_area, geom_bar, geom_point

theta_1 = .7
n = 625
p = dbinom(0:n, n, theta_1)     # Binomial probability of each possible sum 0..n
S = rbinom(10000, n, theta_1)   # 10,000 draws of the sum

pollc = 'steelblue'             # stand-in for the poll color defined in the lecture's setup (not shown here)

ggplot() + geom_area(aes(x=0:n,   y=p), color=pollc, alpha=.2) +
           geom_bar(aes(x=S, y=after_stat(prop)), alpha=.3) +
           geom_point(aes(x=S[1:100],   y=max(p)*seq(0,1,length.out=100)), 
                      color='purple', alpha=.2)


The Binomial Distribution


\[ \begin{aligned} &\overset{\color{gray}=P\qty(\frac{1}{n}\sum_{i=1}^n Y_i = \frac{s}{n})}{P\qty(\sum_{i=1}^n Y_i = s)} = \binom{n}{s} \theta_1^{s}\theta_0^{n-s} \\ &\qfor n =625 \\ &\qand \theta_1 \in \{\textcolor[RGB]{239,71,111}{0.68}, \textcolor[RGB]{17,138,178}{0.7}, \textcolor[RGB]{6,214,160}{0.72} \} \end{aligned} \]

  • These functions have ‘binom’ in their name because we call this distribution the Binomial distribution.
    • The Binomial distribution on \(n\) trials with success probability \(\theta\) is our name for …
    • the distribution of the sum of \(n\) independent binary random variables with probability \(\theta\) of being \(1\).
    • E.g. the number of heads in \(n\) coin flips is Binomial with \(n\) trials and success probability \(1/2\).
  • We’ve shown that’s the sampling distribution of the sum of responses, \(Y_1 + \ldots + Y_n\), when we sample …
    • … with replacement from a population of binary responses \(y_1 \ldots y_m\) in which \(\theta\) is the frequency of ones.
  • We’re interested in the mean of responses, so we divide by \(n\): \(\color{gray} \sum_{i=1}^n Y_i = s \ \text{ when } \ \frac{1}{n}\sum_{i=1}^n Y_i = s/n\).
  • It’s easy to estimate — and talk about estimating — this sampling distribution.
    • It depends only on one thing we don’t know — the population frequency \(\theta\).
    • And that’s exactly the thing we’re trying to estimate anyway.
  • That’s why we’ve started here—the case of sampling with replacement from a population of binary responses. (There’s a quick numerical check of the formula below.)
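As a quick check that the formula and R’s functions agree, here’s one probability computed both ways. The values \(n = 625\) and \(\theta_1 = 0.7\) come from our running example; \(s = 437\) is just an arbitrary sum to evaluate, roughly \(0.7n\).

n = 625; theta_1 = 0.7; s = 437
choose(n, s) * theta_1^s * (1 - theta_1)^(n - s)   # the formula we derived
dbinom(s, n, theta_1)                              # R's built-in: the same number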

Payoff

You vs. Your Competitors Now
How You’re Calibrating
  • To estimate this sampling distribution, you plug your point estimate \(\hat\theta\) into the Binomial formula. \[ \hat P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \hat\theta^{s} (1-\hat\theta)^{n-s} \qqtext{ estimates } P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \theta^{s} (1-\theta)^{n-s} \]

  • To calibrate your interval estimate, you …

    • use rbinom to draw 10,000 samples from this estimate of the sampling distribution.
    • Even remembering to divide by \(n\).
    • And use the function width from the Week 1 Homework (sketched below) to find an interval that covers 95% of them.
theta.hat = mean(Y)                           # your sample proportion: 0.72
samples = rbinom(10000, n, theta.hat) / n     # draws from the estimated sampling distribution of the mean
interval.width = width(samples, alpha=.05)    # a width that covers 95% of those draws
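The width function itself comes from the Week 1 Homework and isn’t reproduced here. A minimal sketch consistent with how it’s called above (an assumption about its behavior, not the homework solution) returns the full width of a symmetric interval, centered at the draws’ mean, that contains a \(1-\alpha\) fraction of them.

width = function(draws, alpha = .05) {
  # assumed behavior: full width of a symmetric interval around the draws' mean
  # containing a 1 - alpha fraction of them
  2 * quantile(abs(draws - mean(draws)), 1 - alpha, names = FALSE)
}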
  • You nail it. Your interval covers the estimation target \(\theta\) just like 95% of your competitors’ do.
You and your competitors’ intervals, recentered for easy comparison of width to your estimator’s sampling distribution’s middle-95% width.
  • It’s not just that you’ve widened your interval enough.
  • You’ve widened it almost exactly the right amount.
  • Just like your competitors. Almost as if you all knew …
  • how to estimate your estimator’s sampling distribution. (The simulation after this list runs through the whole recipe and checks its coverage.)
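Here’s that end-to-end check. It assumes a population frequency of \(\theta = 0.7\), as in our running example, simulates many polls of size \(n=625\), calibrates each poll’s interval by plugging its own \(\hat\theta\) into rbinom, and counts how often the interval covers \(\theta\). The answer lands very close to the nominal 95%.

theta = 0.7; n = 625
covered = replicate(1000, {
  theta.hat = rbinom(1, n, theta) / n                          # one poll's sample proportion
  draws = rbinom(10000, n, theta.hat) / n                      # estimated sampling distribution
  half = quantile(abs(draws - theta.hat), .95, names = FALSE)  # half-width covering 95% of the draws
  abs(theta.hat - theta) <= half                               # does this poll's interval cover theta?
})
mean(covered)                                                  # close to 0.95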

Sampling without Replacement

Another Example. This won’t be on the Exam.

Marginalization

\(p\) \(J_1\) \(J_n\) \(Y_1\) \(Y_n\)
\(\frac{(m-n)!}{m!}\) \(1\) \(1\) \(y_1\) \(y_1\)
\(\frac{(m-n)!}{m!}\) \(1\) \(2\) \(y_1\) \(y_2\)
\(\frac{(m-n)!}{m!}\) \(1\) \(m\) \(y_1\) \(y_m\)
\(\frac{(m-n)!}{m!}\) \(2\) \(1\) \(y_2\) \(y_1\)
\(\frac{(m-n)!}{m!}\) \(2\) \(2\) \(y_2\) \(y_2\)
\(\frac{(m-n)!}{m!}\) \(m\) \(m\) \(y_m\) \(y_m\)
\(p\) \(Y_1 + \ldots + Y_n\)
\(\color[RGB]{239,71,111}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = 0}} \sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{(m-n)!}{m!}\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{17,138,178}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = 1}} \sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{(m-n)!}{m!}\) \(\color[RGB]{17,138,178}1\)
\(\color[RGB]{6,214,160}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = n-1}} \sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{(m-n)!}{m!}\) \(\color[RGB]{6,214,160}n-1\)
\(\color[RGB]{255,209,102}\sum\limits_{\substack{a_1 \ldots a_n \\ a_1 + \ldots + a_n = n}} \sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{(m-n)!}{m!}\) \(\color[RGB]{255,209,102}n\)

Partially Marginalized

\(p\) \(Y_1\) \(Y_2\) \(Y_n\) \(Y_1 + \ldots + Y_n\)
\(\color[RGB]{239,71,111}\sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 0 \ldots 0}} \frac{(m-n)!}{m!}\) 0 0 \(0\) \(\color[RGB]{239,71,111}0\)
\(\color[RGB]{17,138,178}\sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 0 \ldots 1}} \frac{(m-n)!}{m!}\) 0 0 \(1\) \(\color[RGB]{17,138,178}1\)
\(\color[RGB]{17,138,178}\sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 1 \ldots 0}} \frac{(m-n)!}{m!}\) 0 1 \(0\) \(\color[RGB]{17,138,178}1\)
\(\color[RGB]{6,214,160}\sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 0, 1 \ldots 1}} \frac{(m-n)!}{m!}\) 0 1 \(1\) \(\color[RGB]{6,214,160}n-1\)
\(\color[RGB]{255,209,102}\sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1}, y_{j_2} \ldots y_{j_n} = 1, 1 \ldots 1}} \frac{(m-n)!}{m!}\) 1 1 \(1\) \(\color[RGB]{255,209,102}n\)
  • To find the distribution of a sum \(Y_1 + \ldots + Y_n\) when we sample without replacement we do the same thing.

  • We start by writing out the joint distribution of \(n\) dice rolls.

    • We’re equally likely to pull any sequence of \(n\) distinct numbers in \(1\ldots m\).
    • There are \(m \times (m-1) \times \ldots \times (m-n+1)=m!/(m-n)!\) sequences…
    • … so the probability is \((m-n)!/m!\) for each sequence.
    • Then we add the responses and sums as columns. That doesn’t change.
  • Then we marginalize in two steps.

    1. Sum over roll sequences leading to the same response sequence \(a_1 \ldots a_n\).
    2. Sum over response sequences leading to the same sum \(s=a_1+\ldots+a_n\).
  • To sum over roll sequences, we sum over one roll after another.
  • Suppose we’re calculating the probability of the response sequence \(a_1 \ldots a_n = 0,0,...,0,0\).
    1. We consider any ‘head’ \(j_1\ldots j_{n-1}\) where \(y_{j_1} \ldots y_{j_{n-1}}=0,0,...0\) and think about choosing the last element. No matter what the head is, we’ve ‘used up’ \(n-1\) people with response \(y_j=0\). So there are \(m_0-(n-1)=m_0-n+1\) ways to complete the sequence by choosing an as-yet unused \(j_{n}\) with \(y_{j_n}=0\).
    2. We repeat with shortened ‘head’ \(j_1 \ldots j_{n-2}\) to consider the choice of \(j_{n-1}\). We have \(m_0-n+2\) as-yet unused choices and our last step tells us that each choice results in \(m_0-(n-1)\) call sequences, so there are \((m_0 - n + 1)(m_0 - n + 2)\) sequences with head \(j_1 \ldots j_{n-2}\) ending in \(0,0\).
    3. Repeat \(n-2\) more times. We get \((m_0 - n + 1) \times \ldots \times m_0 = m_0!/(m_0-n)!\) rows.
  • When we work through the same argument for sequences with a mix of 0s and 1s, we see that a call in the head forecloses possibilities for later calls with the same response only. Thus, when we have \(s_1\) ones and \(s_0\) zeros, we can think of ourselves as choosing calls resulting with response zero and response one separately. We have \(m_0!/(m_0-s_0)! \times m_1! / (m_1-s_1)!\) rows.

\[ \begin{aligned} \sum_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} \ldots y_{j_n} = 0 \ldots 0}} \frac{(m-n)!}{m!} &= \sum_{\substack{j_1 \neq \ldots \neq j_{n-1} \\ y_{j_1} \ldots y_{j_{n-1}} = 0 \ldots 0}} \ \sum_{\substack{j_n \not\in j_1 \ldots j_{n-1} \\ y_{j_n}=0}} \frac{(m-n)!}{m!} \\ &= \sum_{\substack{j_1 \neq \ldots \neq j_{n-1} \\ y_{j_1} \ldots y_{j_{n-1}} = 0 \ldots 0}} (m_{0} - n + 1) \times \frac{(m-n)!}{m!} \\ &= \sum_{\substack{j_1 \neq \ldots \neq j_{n-2} \\ y_{j_1} \ldots y_{j_{n-2}} = 0 \ldots 0}} (m_0 - n + 2)(m_{0} - n + 1) \times \frac{(m-n)!}{m!} \\ &= m_0 \times \ldots \times (m_0 - n + 1) \times \frac{(m-n)!}{m!} = \frac{m_0!}{(m_0-n)!} \times \frac{(m-n)!}{m!} \qqtext{after repeating $n-2$ more times} \\ &= \frac{m_0!}{(m_0-s_0)!} \times \frac{m_1!}{(m_1-s_1)!} \times \frac{(m-n)!}{m!} \qqtext{ generally, for a sequence with $s_1$ ones and $s_0 = n - s_1$ zeros. } \end{aligned} \]

  • The full marginalization step is exactly the same as it was for sampling with replacement.
    • Again, our response sequence probabilities depend only on the sequence’s sum.
    • So summing amounts to multiplying that probability by the number of sequences with a given sum.
    • And we counted those when we sampled with replacement. There are \(\binom{n}{s}\).

\[ \begin{aligned} \sum\limits_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} + \ldots + y_{j_n} = s}} \frac{(m-n)!}{m!} &= \sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} \ \ \sum_{\substack{j_1 \neq \ldots \neq j_n \\ y_{j_1} \ldots y_{j_n} = a_1 \ldots a_n}} \frac{(m-n)!}{m!} \\ & =\sum\limits_{\substack{a_1 \dots a_n \\ a_1 + \ldots + a_n = s}} \frac{m_0!}{(m_0-s_0)!} \times \frac{m_1!}{(m_1-s_1)!} \times \frac{(m-n)!}{m!} \qfor s_1 = s \qand s_0 = n-s \\ &= \binom{n}{s} \frac{m_0!}{(m_0-(n-s))!} \times \frac{m_1!}{(m_1-s)!} \times \frac{(m-n)!}{m!} \end{aligned} \]

\[ P\qty(\sum_{i=1}^n Y_i = s) = \binom{n}{s} \frac{m_0!}{(m_0-(n-s))!} \times \frac{m_1!}{(m_1-s)!} \times \frac{(m-n)!}{m!} \]

  • This is called the hypergeometric distribution. R has what we need for this one too.
  • The dhyper function will give us the probability of a sum \(s\). That’s dhyper(s, m_1, m_0, n).
  • The rhyper function will draw samples: rhyper(10000, m_1, m_0, n) gives us 10,000.
library(ggplot2)

n = 625
# y is the full population of 0/1 responses (all m = 7.23M of them)
p = dhyper(0:n, sum(y==1), sum(y==0), n)     # hypergeometric probability of each possible sum
S = rhyper(10000, sum(y==1), sum(y==0), n)   # 10,000 draws of the sum

ggplot() + geom_area(aes(x=0:n,   y=p), color=pollc, alpha=.2) +
           geom_bar(aes(x=S, y=after_stat(prop)), alpha=.3) +
           geom_point(aes(x=S[1:100],   y=max(p)*seq(0,1,length.out=100)), 
                       color='purple', alpha=.2)


With or Without Replacement: What’s The Difference?

n=625
n=m/2
  • When we’re sampling a small fraction of our population, e.g. 625/7.23M, not much.
    • In this case, acting as if you’d sampled with replacement — which is easier mathematically — is benign.
  • When we’re sampling a large fraction, e.g. half, sampling without replacement gives a narrower sampling distribution than sampling with replacement. (There’s a numerical version of this comparison after this list.)
    • In this case, acting as if you’d sampled with replacement would give you wider-than-necessary intervals.
    • You get at least your claimed 95% coverage either way: 99% vs. 96%.
    • But you’re underselling your precision by inaccurately characterizing your sampling distribution.
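Here’s that comparison in numbers. The population size \(m = 7.23M\) and frequency \(\theta_1 \approx 0.7\) are from the running example; the ‘width’ reported is the middle-95% half-width of the sample proportion’s simulated sampling distribution under each scheme.

m = 7.23e6; m1 = round(.7 * m); m0 = m - m1
half_width = function(draws) quantile(abs(draws - mean(draws)), .95, names = FALSE)

# n = 625, a tiny fraction of the population: the widths are nearly identical
n1 = 625
c(with    = half_width(rbinom(10000, n1, m1 / m) / n1),
  without = half_width(rhyper(10000, m1, m0, n1) / n1))

# n = m/2, half the population: sampling without replacement is visibly narrower
n2 = m / 2
c(with    = half_width(rbinom(10000, n2, m1 / m) / n2),
  without = half_width(rhyper(10000, m1, m0, n2) / n2))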

These Worked Out



  • We got nice formulas and everything. The bad news is that it doesn’t work out like this all that often.
  • Often doing all this marginalization symbolically is too hard or doesn’t leave you with an interpretable formula.
  • The good news is that there are a variety of ways to work around this.
  • If it ‘works out halfway’, meaning you know the parameter(s) it depends on but not the formula …
    • … you can calculate the sampling distribution as accurately as you want by simulation (see the sketch after this list).
    • You’ll get a chance to try that in this week’s homework.
  • Failing that, you can work with a simpler approximation to your estimator’s sampling distribution.
    • Ideally, you’d have some mathematical result telling you that approximation is accurate.
    • And proving that yourself would involve some thinking about this marginalization process.
  • That’s what people usually do. Including us later this semester.
  • This usually works and often, though not always, we have mathematical proof.
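Here’s the sketch promised in that list: approximating a sampling distribution by simulation when you know what the estimator depends on but don’t have a formula. The population below is made up (10,000 people with a 70% ‘Yes’ frequency); in practice you’d plug in your best estimate of the population, just as we plugged \(\hat\theta\) into the Binomial formula.

y = rbinom(10000, 1, .7)                                   # a hypothetical population of 0/1 responses
n = 625
one_poll = function() mean(sample(y, n, replace = TRUE))   # simulate one poll and compute its estimate
draws = replicate(10000, one_poll())                       # many draws from the sampling distribution
quantile(draws, c(.025, .975))                             # e.g. its middle 95%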

Why It’s Important to Think About This

  • You can often get by acting as if what you actually did was whatever is most convenient mathematically.
    • But we did just see a case in which this isn’t ideal: when we sample a large fraction of our population without replacement.
    • And this was a relatively benign failure. We understated our precision but we still got the coverage we claimed. We ‘overcovered’.
  • There are, however, important instances in which the usual approximations can fail pretty badly.
    1. When we’re running an adaptive randomized trial. These let us experiment with new treatments with minimal risk of negatively impacting patients who’d have done better on the old one.
    2. When we’re estimating effect variability. These estimates help us understand whether a treatment that doesn’t work on average might still be good for some people.
    3. When we’re studying people who are connected in some way, e.g. in a social network or in a classroom.
  • This is hard stuff. We’re not going to get there in this class.
  • But knowing how inference works from the ground up will hopefully help you in the future.
    • That might mean working out a new estimator’s sampling distribution from scratch.
    • Or maybe knowing what to look for or ask about when the usual stuff isn’t working.
  • If you want to know more about any of the three estimation problems above, let me know.
    • I can give you reading or maybe even a project idea.

Calibration using the Binomial Distribution

Looking Back on Your Success

What You Know Now—Plus What You Used For Calibration
Your Calibration Picture—Plus the Estimation Target
  • Remember your successfully calibrated interval estimate from a moment ago?
    • You did that by plugging your point estimate \(\hat\theta\) into the Binomial formula.
    • And you got a nicely calibrated interval estimate. That was great.
  • But let’s take a closer look. Let’s compare your estimate of the sampling distribution to the actual thing.
    • We can do that because we have a bunch of draws from the real thing—all the other polls.
    • And if we want more draws, since it’s after the election, we can simulate as many polls as we want.
  • It doesn’t look great. It’s off center.
    • Its mean is a bit higher than the population proportion \(\theta\).
    • It’s actually our sample proportion \(\hat\theta\). That makes sense.
  • The mean of the Binomial distribution with success rate \(\theta\) is \(\theta\) and we’re using \(\hat\theta\) in its place.
  • It turns out that this doesn’t matter much. It worked just fine for calibration. Why?

Calibration Comparison


  • It doesn’t matter because we’re not putting arms on draws from the estimated sampling distribution.
  • We’re putting arms on our point estimate1, which is a draw from the actual sampling distribution.
  • To do this, we’re using the width—but not the center—of the estimated sampling distribution.
  • And it works because that’s very close to the width we’d get from the actual sampling distribution.
  • Above, I’ve drawn in shaded regions corresponding to two versions of the population proportion’s arms.
    • The green region is the one we get from the actual sampling distribution. We can’t use this.
    • The red region is the one we get from the estimated sampling distribution. We do use this.
  • And I’ve drawn in a version of our interval estimate calibrated each way. They’re almost the same. (The short computation below makes the same comparison in numbers.)
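Here it is in numbers. The middle-95% half-width of the sampling distribution barely moves when we swap the actual population frequency \(\theta = 0.70\) for our sample proportion \(\hat\theta = 0.72\).

n = 625
half_width = function(theta) {
  draws = rbinom(100000, n, theta) / n              # simulated sampling distribution of the mean
  quantile(abs(draws - theta), .95, names = FALSE)  # its middle-95% half-width
}
c(actual = half_width(.70), estimated = half_width(.72))   # about 0.036 vs. 0.035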

Our Competitors’ Calibration

Actual Sampling Distribution
Estimated Sampling Distribution (Poll 1)
Estimated Sampling Distribution (Poll 2)
Estimated Sampling Distribution (Poll 3)
  • Our competitors’ sampling distributions, just like ours, are centered on their sample frequencies \(\hat\theta\).
  • But they’re all close to the width we get using the actual sampling distribution. You can see it above.
    • I’ve plotted the intervals they’d use — based on their sampling distribution estimates — in bold colors.
    • And the intervals they wish they could use — based on the actual sampling distribution — more faintly.
  • If you look very closely, you can see that …
    • the intervals around overestimates are slightly narrower than we’d want.
    • the intervals around underestimates are slightly wider than we’d want.
  • But you have to look very closely. They’re all very close to the actual sampling distribution’s width.

All of Our Competitors


  • We saw earlier that all of our competitors had almost-perfectly calibrated intervals.
    • They got their widths by plugging their sample frequencies \(\hat\theta\) into the Binomial formula.
    • And the result was almost exactly as if they’d plugged in the population frequency \(\theta\) instead.
  • Let’s look again. This time, we’ll plot their estimated sampling distributions too.
    • But we’ll shift them all so they’re centered at zero.
    • That way we can compare their widths more easily. Because width is what matters.
  • Each competitor gets their own color and we see their …
    • centered sampling distribution plotted as a line
    • centered interval estimate as a point with arms
  • Compare to the actual sampling distribution (centered and shaded gray) and its middle 95% (dotted lines).

Why Does it Work?



  • Why are we getting a good estimate of the sampling distribution?
  • The Binomial distribution is continuous as a function of \(\theta\)—when \(\theta\) changes little, the distribution changes little.
  • This means that, if we have a good estimate of \(\theta\), we have a good estimate of the sampling distribution.
  • The relevant difference (after centering) is even smaller because the binomial changes mostly in location, not in width.
  • Here I’m showing three estimates of the sampling distribution based on three point estimates \(\hat\theta\).
    • with centering (right) and without (left).
  • In particular, point estimates at the center and two edges of the actual sampling distribution’s middle 95%.

The Estimate Works. Why Does it Work?



  • You can think of this as a sort of ‘confidence interval’ for our estimate of the sampling distribution.
    • 95% of the time, you’ll get an estimate somewhere between the red and blue ones.
    • And, as a result, the width of your interval estimate will be somewhere between the red and blue widths.
  • One way of looking at it is that, when we calibrate interval estimates this way, they’re almost perfectly calibrated.
    • Coverage may not actually be 95% but it’s very close.
    • You can see that the red interval does cover.
      • So will an interval around any point estimate between it and \(\theta\).
    • The blue interval doesn’t cover. It’s a fingernail too narrow.
      • But an interval around any estimate a fingernail smaller will cover.
      • It’ll be at least as wide and a fingernail closer.
  • Another way of looking at it is that 95% of the time, your interval misses by at most a fingernail.
  • This, or a bit worse, is usually what’s meant when someone says ‘95% interval.’ Don’t expect perfect calibration.