Lab 4

Conditional Expectations

A New Dataset

       income   education   county
1      $120k    16          kern
2      $4k      13          kern
3      $15k     12          kern
⋮
2271   $0k      20          tulare
  • We’ll do the same thing we’ve been doing, but for non-binary outcomes.
  • Using data from the Current Population Survey, we’ll estimate …
    • the mean income in our population.
    • the mean income among people in our population with 4-year degrees.
    • the difference in mean income between people with 4-year degrees and people without them.
  • Our population will be the set of California residents
    • between the ages of 25 and 35
    • with at least an 8th-grade education

Imagining a Population

The (imagined) population:

        income   education   county
1       $51k     14          sacramento
2       $0k      16          monterey
3       $0k      13          unknown
⋮
14194   $62k     12          LA

The sample:

        income   education   county
1       $120k    16          kern
2       $4k      13          kern
3       $15k     12          kern
⋮
2271    $0k      20          tulare
  • We don’t have much information about the people in our population, except the ones in our sample.
  • But, for the sake of visualization, I’ve made some up.
    • We’ll put visualizations of this fake population on the left.
    • Visualizations of the (real) sample will be on the right.

Sampling

[Figure: population (left) and sample (right), dots colored by degree: no 4-year degree vs. 4-year degree]

  • We’ll act as if our data were sampled, with replacement, from this population.
    • On the left, I’ve illustrated the population.
    • On the right, I’ve illustrated our sample.
  • Our working assumption: each dot on the right was chosen, from those on the left, by rolling a big die.
  • That’s not quite right, but we’ll stick with it throughout the semester.

Sample and Population Means

[Figure: incomes, $0k to $200k, in the population (left) and the sample (right)]

  • One thing we often want to estimate is the mean of our population.

\[ \mu = \frac{1}{m}\sum_{j=1}^m y_j \]

  • We’ll think of the mean in our sample as an estimate of it.

\[ \hat \mu = \frac{1}{n}\sum_{i=1}^n Y_i \]

  • We showed last time that this estimator is unbiased.

\[ \mathop{\mathrm{E}}[\hat \mu] = \mu \qfor \underset{\text{\color[RGB]{64,64,64}{population mean}}}{\mu = \mathop{\mathrm{E}}[Y_i] = \frac{1}{m}\sum_{j=1}^m y_j} \]

  • We have almost calculated its standard deviation. Here it is.

\[ \mathop{\mathrm{sd}}[\hat \mu] = \frac{\sigma}{\sqrt{n}} \qfor \underset{\text{\color[RGB]{64,64,64}{population standard deviation}}}{\sigma=\sqrt{\mathop{\mathrm{V}}[Y_i]}=\sqrt{\frac{1}{m}\sum_{j=1}^m (y_j - \mu)^2}} \]

  • We did it in the special case of binary \(Y_i\) in our last lecture.
  • We’ll generalize next time. It’s a one-line change.
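If you'd like to see these formulas in action, here is a minimal numpy sketch. The population below is made up (it is not the CPS data) and the sample size is arbitrary; the point is only that the simulated mean and standard deviation of \(\hat\mu\) come out close to \(\mu\) and \(\sigma/\sqrt{n}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up population of m incomes (in $k), standing in for y_1 ... y_m.
y = rng.gamma(shape=2.0, scale=25.0, size=10_000)
mu = y.mean()                             # population mean
sigma = y.std()                           # population standard deviation

n = 400                                   # sample size
draws = 10_000                            # number of simulated samples

# Each sample is n rolls of an m-sided die: indices drawn with replacement.
mu_hat = np.array([rng.choice(y, size=n, replace=True).mean() for _ in range(draws)])

print(mu, mu_hat.mean())                  # E[mu_hat] is close to mu (unbiasedness)
print(sigma / np.sqrt(n), mu_hat.std())   # sd[mu_hat] is close to sigma / sqrt(n)
```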

Why Do We Care About Unbiasedness?

[Figure: sampling distribution of an unbiased estimator (left) and of a substantially biased estimator (right)]

  • Unbiasedness means that our sampling distribution is centered at our estimation target.
  • On the left, we see the sampling distribution of an unbiased estimator.
    • When it looks like that, we’re in good shape.
  • On the right, we see one for an estimator with substantial bias.
    • When it looks like that, we’re in trouble.
    • You can see that, in a good number of samples, the estimate on the right lands far off-target.
    • Farther off, for example, than the width of the sampling distribution.
    • You can see this might cause problems with coverage.
  • If we calibrate interval estimates to cover the estimator’s mean 95% of the time,
    how often will they cover the thing we actually intend to estimate?
  • Give me a rough estimate for the picture on the right. Is it about 90%? 80%? 50%? 20%?
  • We’ll see our first examples of biased estimators in this week’s homework.
  • And we’ll have to start thinking about this relationship between bias and coverage.

Subsample and Subpopulation Means

[Figure: incomes, $0k to $200k, by degree status in the population (left) and the sample (right)]

  • We’ll often be interested in the mean income in subpopulations, too.
    • We’ll think about the subsample with \(X_i=x\) for some value \(x\).
    • e.g. \(X_i=1\) for people with a 4-year degree and \(X_i=0\) for people without. \[ \mu(x) = \frac{1}{m_x}\sum_{j:x_j = x } y_j \quad \text{ where } \quad m_x = \sum_{j:x_j=x} 1. \]
  • We’ll use the mean in the corresponding subsample to estimate it.

\[ \hat \mu(x) = \frac{1}{N_x}\sum_{i:X_i=x} Y_i \quad \text{ where } \quad N_x = \sum_{i:X_i=x} 1. \]

  • If we want to know a difference of subpopulation means, as we often do …
  • … we can estimate it using a difference of subsample means.
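Here is a short sketch of these estimators on made-up data, borrowing the invented numbers used later in the deck (mean incomes of roughly 70k and 30k, with 4/10 of people holding degrees): the subsample means are just means over the rows with \(X_i = 1\) and \(X_i = 0\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample: x[i] = 1 if person i has a 4-year degree, y[i] = their income in $k.
n = 1_000
x = rng.binomial(1, 0.4, size=n)
y = np.where(x == 1, rng.gamma(2.0, 35.0, size=n), rng.gamma(2.0, 15.0, size=n))

# Subsample means mu_hat(1), mu_hat(0), and their difference.
mu_hat_1 = y[x == 1].mean()      # mean income among degree-holders in the sample
mu_hat_0 = y[x == 0].mean()      # mean income among people without degrees
print(mu_hat_1, mu_hat_0, mu_hat_1 - mu_hat_0)
```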

Are These Good Estimators?

[Figure: two sampling distributions, one centered on the estimation target and one not]

  • We want to know about bias, i.e., the location of the sampling distribution relative to the estimation target.
    • That’s where we’re headed today. We’ll show they’re unbiased.
  • And we want to know about precision, i.e., how wide our calibrated intervals would be.
    • We’ll get to this next time.

Random Variables and Conditioning

A lot of this is review.

To make this a nice cohesive read, I’ve included all the unconditional probability stuff we’ve covered so far and the new conditional stuff.

Observations as Random Variables

[Figure: incomes by degree status in the population (left) and the sample (right)]

  • The notation \(y_j\) refers to a number.
    • It’s the income of the \(j\)th person in our population. That’s just some person.
  • The notation \(Y_i\) refers to something else.
    • It’s the income of the \(i\)th person we call in our survey. That’s not a person.
    • That’s the result of a random process—the roll of a die.
  • To summarize this result, we talk about the probability distribution of \(Y_i\).

Notation Conventions

  • Random variables are written in uppercase: \(X\), \(Y\), \(Z\), etc. Constants are written in lowercase: \(x\), \(y\), \(z\), etc.
  • We’ll use the same letter for a random variable and the value it takes on. \(x\) is a possible value of \(X\), \(y\) of \(Y\), etc.
  • Estimators are also random variables, but instead of uppercase we use a hat: \(\hat\theta\),\(\hat \mu\), \(\hat \sigma\), etc.
  • We’ll use the same letter for the estimator and what it’s meant to estimate.
    • \(\hat \mu\) is an estimator of \(\mu\); \(\hat\theta\) is an estimator of \(\theta\).
  • We’ll use \(\theta\) and \(\mu\) in different but occasionally overlapping ways.
    • \(\mu\) will be our population mean and \(\mu(x)\) the mean in the subpopulation with \(X=x\).
    • \(\theta\) will be our estimation target.
      • So far, it’s often been the population mean, so \(\theta=\mu\).
      • Or a subpopulation mean, so \(\theta=\mu(x)\).
      • Later in the semester, it’ll tend to be something a bit more complicated.

Probability Distributions: The Binary Case

The (imagined) population:

        income50k   income   education   county
1       1           $51k     14          sacramento
2       0           $0k      16          monterey
3       0           $0k      13          unknown
⋮
14194   1           $62k     12          LA

The sample, with the roll that selected each observation:

        roll   income50k   income   education   county
1       1017   1           $120k    16          kern
2       8004   0           $4k      13          kern
3       4775   0           $15k     12          kern
⋮
2271    7117   0           $0k      20          tulare
  • When our outcomes are binary, it’s easy to describe this distribution.
    • All we need to know is the probability that \(Y_i\) is 1. Why?
    • To calculate it, we sum the probabilities of the rolls that result in it being one.

\[ P(Y_i = 1) = \sum_{j: y_j = 1} P(\text{roll}_i = j) = \sum_{j: y_j = 1} \frac{1}{m} = \frac{1}{m}\sum_{j=1}^m y_j = \mu. \]

  • This collapses out information about our random process that’s irrelevant to \(Y_i\).
  • That is, it collapses the probability we roll each number in 1…m into our ‘weighted coin flip’.

Probability Distributions: The General Case

  • When our outcomes are non-binary, it’s a bit more complicated.
    • We need to know the probability that \(Y_i\) takes on each possible value. But we calculate it the same way.
    • We sum the probabilities of the rolls that result in it taking on those values.

\[ P(Y_i = y) = \sum_{j: y_j = y} P(\text{roll}_i = j) \]

  • We’re still collapsing out irrelevant information, but what we’re left with is more complicated.
  • It’s a weighted die roll, with one face for each possible value of \(Y_i\).
  • And if each person’s income is different, there’s nothing to collapse out. Then … \[ P(Y_i = y) = \begin{cases} \frac{1}{m} & \qqtext{for} y \in \{y_1, \ldots, y_m\} \\ 0 & \qqtext{otherwise} \end{cases} \]

Expectations

  • That said, we often don’t need to know the whole distribution.
  • Often the only thing we care about is the expected value of \(Y_i\). Or some related quantity.
    • The expected value of \(Y_i\) is the probability-weighted average of the values it can take on.
    • What’s nice about this is that we can think of it in ‘uncollapsed’ terms.
  • When we sample as usual, this is just the population mean.
  • The collapsed form is, in a sense, just summing in a different order.

\[ \begin{aligned} \mathop{\mathrm{E}}[Y_i] = \sum_y P(Y_i = y) \times y = \sum_y \qty(\sum_{j: y_j = y} \frac{1}{m}) \times y = \frac{1}{m}\sum_y\sum_{j:y_j = y} y = \frac{1}{m}\sum_{j=1}^m y_j \end{aligned} \]
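A tiny numeric check of this ‘summing in a different order’ claim, on a made-up population with some repeated incomes: the collapsed (value-weighted) sum and the uncollapsed (person-by-person) average agree.

```python
import numpy as np

# A small made-up population with repeated incomes, so there is something to 'collapse'.
y = np.array([0, 0, 15, 15, 51, 62, 120])
m = len(y)

# Uncollapsed: average over people, each with probability 1/m.
uncollapsed = y.mean()

# Collapsed: sum over distinct values, weighting each by P(Y_i = y).
values, counts = np.unique(y, return_counts=True)
collapsed = np.sum((counts / m) * values)

print(uncollapsed, collapsed)   # identical up to rounding: we just summed in a different order
```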

Expectations

  • What’s nice about the binary case is that, in fact, the probability is an expectation.
  • When \(Y_i\) is binary, the probability that it’s 1 is its expected value \(\mathop{\mathrm{E}}[Y_i]\).

\[ \begin{aligned} P(Y_i=1) = \sum_{j: y_j = 1} P(\text{roll}_i = j) &= \sum_{j:y_j=1} \frac{1}{m} \\ &= \sum_{j=1}^m \frac{1}{m} \times \begin{cases} 1 & \ \ \text{ if } y_j = 1 \\ 0 & \ \ \text{ otherwise} \\ \end{cases} \\ &= \sum_{j=1}^m \frac{1}{m} \times y_j = \mathop{\mathrm{E}}[Y_i]. \end{aligned} \]

  • In fact, we like expectations so much that we often use them to work with probabilities.

\[ P(Z \in A) = \mathop{\mathrm{E}}[1_A(Z)] \qfor 1_A(z) = \begin{cases} 1 & \qqtext{ for } z \in A \\ 0 & \qqtext{ otherwise} \end{cases} \]
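As a quick illustration of that identity (the distribution of \(Z\) and the set \(A\) below are arbitrary choices, not from the lab): averaging the indicator of an event estimates the event’s probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Probability as the expectation of an indicator: P(Z in A) = E[ 1_A(Z) ].
Z = rng.normal(size=100_000)
in_A = (Z > 1.0)                  # 1_A(Z) for the set A = (1, infinity)
print(in_A.mean())                # close to P(Z > 1), about 0.159 for a standard normal
```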

Independence

[Figure: incomes by degree status in the population (left) and the sample (right)]

  • Random variables are independent if …
    • … in intuitive terms, knowing the value of one doesn’t tell us anything about the value of the other.
    • … in mathematical terms, their joint probability distribution is the product of their individual marginal distributions.

\[ P(Y_1 = y_1, \ldots, Y_k = y_k) = P(Y_1=y_1) \times \ldots \times P(Y_k=y_k). \]

  • That’s what happens when the randomness in each \(Y_i\) comes from a different roll of the die.
    • And since that’s how we’re doing our sampling in the Current Population Survey, it’ll be true in our sample.
    • Because we draw each of our observations the same way, they also have the same probability distribution.
    • We say they’re independent and identically distributed.
  • At least, that’s what we’re pretending when we analyze CPS data in this class. Reality is more complicated.

Conditioning

[Figure: incomes by degree status in the population (left) and the sample (right)]

  • Conditioning is, in effect, a way of thinking about sampling as a two-stage process.
    • First, we choose the color of our dot, i.e., the value of \(X_i\), according to its frequency in the population.
    • Then, we choose a specific one of those dots, i.e. \(J_i\), from those with that color—with equal probability.
  • Because this is just a way of thinking, each person still gets chosen with probability \(1/m\). \[ P(J_i=j) = \begin{cases} \frac{m_{green}}{m} \ \ \ \ \ \times \frac{1}{m_{green}} \ \ \ \ \ =\ \frac{1}{m} & \text{if the $j$th dot is green ($x_j=1$) } \\ \frac{m-m_{green}}{m} \times \frac{1}{m-m_{green}} \ = \ \frac{1}{m} & \text{otherwise} \end{cases} \]
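Here is a short simulation of that two-stage story, using a made-up vector of degree indicators: drawing a color first and then a dot of that color still selects each person with probability \(1/m\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population of degree indicators x_1 ... x_m (1 = green dot, 0 = red dot).
x = np.array([1, 0, 0, 1, 1, 0, 0, 0, 1, 0])
m = len(x)
m_green = x.sum()

# Two-stage draw: first the color, with probability equal to its frequency,
# then a uniform draw among the dots of that color.
def draw_two_stage():
    color = rng.binomial(1, m_green / m)
    candidates = np.flatnonzero(x == color)
    return rng.choice(candidates)

draws = np.array([draw_two_stage() for _ in range(100_000)])
freq = np.bincount(draws, minlength=m) / len(draws)
print(freq)                       # every entry is close to 1/m = 0.1
```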

Conditioning

[Figure: incomes by degree status in the population (left) and the sample (right)]

  • The Conditional Probability of \(Y_i\) is the probability resulting from the second stage.
  • It’s a function of the result of the first.
    • \(P(Y_i=y \mid X_i=1)\) is the probability distribution of \(Y_i\) when we’re rolling the ‘green die’.
    • \(P(Y_i=y \mid X_i=0)\) is the probability distribution of \(Y_i\) when we’re rolling the ‘red die’.
  • And the Conditional Expectation of \(Y_i\) is the ‘second stage expected value’ in the same sense.
    • \(\mathop{\mathrm{E}}[Y_i \mid X_i=1]\) is the expected value of \(Y_i\) when we’re rolling the ‘green die’.
    • That is, the mean value \(\mu(1)\) of \(y_j\) in the subpopulation drawn as little green dots.
    • \(\mathop{\mathrm{E}}[Y_i \mid X_i=0]\) is the expected value of \(Y_i\) when we’re rolling the ‘red die’.
    • That is, the mean value \(\mu(0)\) of \(y_j\) in the subpopulation drawn as little red dots.

Working with Expectations

Conditioning

[Figure: incomes by degree status in the population (left) and the sample (right)]

The law of iterated expectations

\[ E \{ E( Y \mid X ) \} = E(Y) \quad \text{ for any random variables $X, Y$} \]

To calculate the mean of \(Y\), we can average within subpopulations, then across subpopulations.

Irrelevance of independent conditioning variables

\[ E( Y \mid X, X' ) = E( Y \mid X ) \quad \text{ when $X'$ is independent of $X,Y$ } \]

If \(X'\) is unrelated to \(X\) and \(Y\), holding it constant doesn’t affect the relationship between them.
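A small numeric check of the law of iterated expectations on a made-up finite population: averaging the subpopulation means, weighted by how common each subpopulation is, recovers the overall mean.

```python
import numpy as np

# A made-up finite population: x_j is a degree indicator, y_j an income in $k.
x = np.array([1, 1, 0, 0, 0, 1, 0, 1, 0, 0])
y = np.array([120, 70, 15, 0, 30, 80, 25, 65, 10, 45])

# Conditional means mu(0), mu(1) and the frequencies of X = 0, 1.
mu = {v: y[x == v].mean() for v in (0, 1)}
p = {v: np.mean(x == v) for v in (0, 1)}

lhs = sum(mu[v] * p[v] for v in (0, 1))   # E{ E(Y | X) }
rhs = y.mean()                            # E(Y)
print(lhs, rhs)                           # equal (up to rounding)
```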

Linearity of Expectations

[Figure: incomes by degree status in the population (left) and the sample (right)]

\[ \begin{aligned} E ( a Y + b Z ) &= E (aY) + E (bZ) \\ &= aE(Y) + bE(Z) \\ & \text{ for random variables $Y, Z$ and numbers $a,b$ } \end{aligned} \]

  • There are two things going on here.
    • To average a sum of two things, we can take two averages and sum.
    • To average a constant times a random variable, we multiply the random variable’s average by the constant.
  • In other words, we can distribute expectations and can pull constants out of them.

Proof

  • In essence, it comes down to the fact that all we’re doing is summing.
    • Expectations are probability-weighted sums.
    • And we’re looking at the expectation of a sum.
  • And we can change the order we sum in without changing what we get.

\[ \small{ \begin{aligned} \mathop{\mathrm{E}}\qty( a Y + b Z ) &= \sum_{y}\sum_z (a y + b z) \ P(Y=y, Z=z) && \text{ by definition of expectation} \\ &= \sum_{y}\sum_z a y \ P(Y=y, Z=z) + \sum_{z}\sum_y b z \ P(Y=y, Z=z) && \text{changing the order in which we sum} \\ &= \sum_{y} a y \ \sum_z P(Y=y,Z=z) + \sum_{z} b z \ \sum_y P(Y=y,Z=z) && \text{pulling constants out of the inner sums} \\ &= \sum_{y} a y \ P(Y=y) + \sum_{z} b z \ P(Z=z) && \text{summing to get marginal probabilities from our joint } \\ &= a\sum_{y} y \ P(Y=y) + b\sum_{z} z \ P(Z=z) && \text{ pulling constants out of the remaining sum } \\ &= a\mathop{\mathrm{E}}Y + b \mathop{\mathrm{E}}Z && \text{by definition} \end{aligned} } \]
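If you want to see the proof’s bookkeeping concretely, here is a check on a small made-up joint distribution: computing \(E(aY + bZ)\) directly over the joint agrees with \(aE(Y) + bE(Z)\) computed from the marginals.

```python
import numpy as np

# A small made-up joint distribution P(Y = y, Z = z) over a 3 x 2 grid of values.
y_vals = np.array([0.0, 1.0, 2.0])
z_vals = np.array([10.0, 20.0])
P = np.array([[0.10, 0.20],
              [0.25, 0.15],
              [0.05, 0.25]])     # rows index y, columns index z; entries sum to 1

a, b = 3.0, -2.0

# Left-hand side: probability-weighted sum of a*y + b*z over the joint distribution.
lhs = sum(P[i, j] * (a * y_vals[i] + b * z_vals[j])
          for i in range(len(y_vals)) for j in range(len(z_vals)))

# Right-hand side: a*E(Y) + b*E(Z), using marginals obtained by summing the joint.
EY = (P.sum(axis=1) * y_vals).sum()
EZ = (P.sum(axis=0) * z_vals).sum()
print(lhs, a * EY + b * EZ)      # the same, up to floating-point rounding
```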

Linearity of Conditional Expectations

[Figure: incomes by degree status in the population (left) and the sample (right)]

\[ \begin{aligned} E\{ a(X) Y + b(X) Z \mid X \} &= E\{a(X)Y \mid X\} + E\{ b(X)Z \mid X\} \\ &= a(X)E(Y \mid X) + b(X)E(Z \mid X) \end{aligned} \]

  • This is like linearity of expectations, but with a twist.
    • When we condition on \(X\), we’re working with subpopulations in which \(X\) is constant.
    • This means we can act as if functions of \(X\) are constants.
    • So we can pull them out of expectations that are conditional on \(X\).

It’s important to distinguish between two things

[Figure: six points, with \(X \in \{0, 1\}\) and \(Y\) between 0.75 and 1.50]

  1. The conditional expectation function \(\mu(x)=E[Y \mid X=x]\).
    • \(\mu\) is a function; evaluated at \(x\), it’s a number. It’s not random.
    • It’s the mean of the subpopulation of people with \(X=x\).
  2. The conditional expectation \(\mu(X)=E[Y \mid X]\).
    • \(\mu(X)\) is the mean of a random subpopulation of people.
    • It’s the conditional expectation function evaluated at the random variable \(X\).
    • This is the sort of thing that shows up when we use the law of iterated expectations.

Check Your Understanding

[Figure: six points, with \(X \in \{0, 1\}\) and \(Y\) between 0.75 and 1.50]

  • Suppose we sample a point \((X, Y)\) uniformly at random from the population of 6 points above.
    • What is the conditional expectation function \(\mu(x)\) at \(x=0\) and \(x=1\)?
    • What is the conditional expectation \(\mu(X)\)?
  • \(\mu(0)=1\) and \(\mu(1)=1.25\)
  • \(\mu(X)\) is a random variable taking on these two values. \[ \mu(X) = \begin{cases} 1 & \text{ when } X=0 \\ 1.25 & \text{ when } X=1 \end{cases} \]

The Indicator Trick

[Figure: six points, with \(X \in \{0, 1\}\) and \(Y\) between 0.75 and 1.50]

  • Suppose we sample a point \((X, Y)\) uniformly at random from the population of 6 points above.
    • What is \(1_{=1}(X)\mu(X)\)?
  • \(1_{=1}(X)\mu(X)\) is a random variable taking on these two values. \[ 1_{=1}(X)\mu(X) = \begin{cases} 0 \times \mu(0) = 0 \times 1 & \text{ when } X=0 \\ 1 \times \mu(1) = 1 \times 1.25 & \text{ when } X=1 \end{cases} \]

  • We can write it equivalently as \(1_{=1}(X)\mu(1)\) because either …

    • \(X=0\), so \(1_{=1}(X)\mu(X) = 1_{=1}(X)\mu(1) = 0\)
    • \(X=1\), so \(1_{=1}(X)\mu(X) = 1_{=1}(X)\mu(1) =\mu(1)\)
  • This comes up a lot when working with subsample means; see the sketch below.

  • We’ll often swap \(1_{=1}(X)\mu(X)\) for \(1_{=1}(X)\mu(1)\), referring to this as the indicator trick.
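A minimal sketch of the indicator trick, using the values \(\mu(0)=1\) and \(\mu(1)=1.25\) from the example above (the draws of \(X\) are simulated): multiplying by \(1_{=1}(X)\) makes \(\mu(X)\) and \(\mu(1)\) interchangeable.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 1.25])                 # mu[0] = mu(0), mu[1] = mu(1), from the example
X = rng.integers(0, 2, size=10)            # ten draws of X in {0, 1}

indicator = (X == 1).astype(float)         # 1_{=1}(X)
lhs = indicator * mu[X]                    # 1_{=1}(X) * mu(X)
rhs = indicator * mu[1]                    # 1_{=1}(X) * mu(1)
print(np.array_equal(lhs, rhs))            # True: the indicator zeroes out the X = 0 cases
```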

A More Realistic Example

[Figure: incomes, $0k to $200k, by degree status in the population]

  • Let’s think about a random person drawn from this population.

    • \(Y\) is their income.
    • \(X\) is an indicator for having a four-year degree.
  • Suppose the subpopulation means are 70k for people with degrees and 30k for people without.

  • And that 4/10 of our population has degrees.

  • Q. What is \(E(Y \mid X)\)? And what is \(E(Y)\)?

  • \(E(Y \mid X)\) is a random variable that is either 70k or 30k
    • It’s 70k with probability 4/10, when \(X=1\).
    • It’s 30k with probability 6/10, when \(X=0\).
  • \(\mathop{\mathrm{E}}(Y)\) is its expectation. A number.
  • We’ll calculate it using iterated expectations.

\[ \begin{aligned} E\{ E( Y \mid X ) \} &= E(Y|X=1)P(X=1) + E(Y|X=0)P(X=0) \\ &= 70k \cdot 4/10 + 30k \cdot 6/10 = 28k + 18k = 46k \end{aligned} \]
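The same calculation in code, plus a simulation version that treats \(E(Y \mid X)\) as a random variable taking the values 70k and 30k.

```python
import numpy as np

rng = np.random.default_rng(0)

mu_1, mu_0 = 70, 30                 # subpopulation mean incomes (in $k), from the example
p_1 = 4 / 10                        # fraction of the population with a degree

# Iterated expectation, exactly as on the slide.
print(mu_1 * p_1 + mu_0 * (1 - p_1))       # 46.0, i.e. $46k

# The same thing by simulation: E(Y | X) takes the value 70 or 30 depending on X.
X = rng.binomial(1, p_1, size=1_000_000)
E_Y_given_X = np.where(X == 1, mu_1, mu_0)
print(E_Y_given_X.mean())                  # close to 46
```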

Review Exercise 1

[Figure: six points, with \(X \in \{0, 1\}\) and \(Y\) between 0.75 and 1.50]

  • Suppose we sample a point \((X, Y)\) in the plot above uniformly at random.
    • Ignore the jitter; just think of the points as being at \(X=0\) and \(X=1\).
    • What is \(\mathop{\mathrm{E}}\{ \mu(X) \}\)?
  • There are two ways of thinking about calculating \(\mathop{\mathrm{E}}\{\mu(X)\}\).
    1. We just calculate the expectation of it, thinking of it as an arbitrary random variable.
    2. We use the law of iterated expectations to show it’s the unconditional mean of \(Y\).

\[ \small{ \begin{aligned} \mathop{\mathrm{E}}\qty{ \mu(X) } &= \frac{1}{2}\mu(0) + \frac{1}{2}\mu(1)= \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1.25 = 1.125 && \text{ the first way } \\ \mathop{\mathrm{E}}\qty{ \mu(X) } &= \mathop{\mathrm{E}}\qty{\mathop{\mathrm{E}}\qty(Y \mid X)} = \mathop{\mathrm{E}}\qty{Y} = \frac{1}{6} \cdot 0.75 + \frac{1}{6} \cdot 1 + \ldots && \text{ the second way } \end{aligned} } \]
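Here is a sketch of both ways of calculating, using a hypothetical six-point population chosen only so that \(\mu(0)=1\) and \(\mu(1)=1.25\) as in the exercise; the actual \(y\)-values in the plot may differ.

```python
import numpy as np

# A hypothetical six-point population, chosen so that mu(0) = 1 and mu(1) = 1.25.
x = np.array([0, 0, 0, 1, 1, 1])
y = np.array([0.75, 1.00, 1.25, 1.00, 1.25, 1.50])

mu = {v: y[x == v].mean() for v in (0, 1)}

first_way = 0.5 * mu[0] + 0.5 * mu[1]      # treat mu(X) as an arbitrary random variable
second_way = y.mean()                      # iterated expectations: E{ E(Y | X) } = E(Y)
print(first_way, second_way)               # both 1.125
```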

Review Exercise 2

[Figure: six points, with \(X \in \{0, 1\}\) and \(Y\) between 0.75 and 1.50]

  • Suppose we sample a point \((X, Y)\) in the plot above uniformly at random.
    • Ignore the jitter; just think of the points as being at \(X=0\) and \(X=1\).
    • What is \(\mathop{\mathrm{E}}\{ 1_{=1}(X)\mu(X) \}/\mathop{\mathrm{E}}(1_{=1}(X))\)?
  • It’s \(\mu(1)\). \(1_{=1}(X)\mu(X)=1_{=1}(X)\mu(1)\), so … \[ \begin{aligned} \mathop{\mathrm{E}}\{ 1_{=1}(X)\mu(X) \} &= \mathop{\mathrm{E}}\{ 1_{=1}(X)\mu(1) \} \\ &= \mu(1) \ \mathop{\mathrm{E}}\{ 1_{=1}(X) \} \\ \end{aligned} \]

Review Exercise 3

[Figure: six points, with \(X \in \{0, 1\}\) and \(Y\) between 0.75 and 1.50]

  • Suppose we sample a point \((X, Y)\) in the plot above uniformly at random.
    • Ignore the jitter; just think of the points as being at \(X=0\) and \(X=1\).
    • What is \(\mathop{\mathrm{E}}\{ 1_{=0}(X)\mu(X) \}/\mathop{\mathrm{E}}(1_{=0}(X))\)?
  • It’s \(\mu(0)\). It’s analogous to the last one. \(1_{=0}(X)\mu(X)=1_{=0}(X)\mu(0)\), so … \[ \begin{aligned} \mathop{\mathrm{E}}\{ 1_{=0}(X)\mu(X) \} &= \mathop{\mathrm{E}}\{ 1_{=0}(X)\mu(0) \} \\ &= \mu(0) \ \mathop{\mathrm{E}}\{ 1_{=0}(X) \} \\ \end{aligned} \]

Unbiasedness of Means

The Sample Mean

[Figure: two sampling distributions]

Claim. The sample mean is an unbiased estimator of the population mean. \[ \mathop{\mathrm{E}}[\hat\mu] = \mu \]

\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\frac1n\sum_{i=1}^n Y_i] &= \frac1n\sum_{i=1}^n \mathop{\mathrm{E}}[Y_i] && \text{ via linearity } \\ &= \frac1n\sum_{i=1}^n \mu && \text{ via equal-probability sampling } \\ &= \frac1n \times n \times \mu = \mu. \end{aligned} \]

A Subsample Mean

Claim. The subsample mean is unbiased for the subpopulation mean. \[ \mathop{\mathrm{E}}[\hat\mu(1)] = \mu(1) \]

  • Use the Law of Iterated Expectations, conditioning on \(X_1 \ldots X_n\).
  • Then the linearity of conditional expectations to push the conditional expectation into the sum.
  • Then irrelevance of independent conditioning variables to write things in terms of the random variable \(\mu(X_i)\).
  • Then the indicator trick. What’s \(1_{=1}(X_i) \mu(X_i)\)? How is it related to \(\mu(1)\)?

\[ \hat\mu(1) = \frac{\sum_{i:X_i=1} Y_i}{\sum_{i:X_i=1} 1} = \frac{\sum_{i=1}^{n} 1_{=1}(X_i) Y_{i}}{\sum_{i=1}^{n} 1_{=1}(X_i)} \]

  • It’s easy to make mistakes using linearity of expectations when we’re summing over a subsample.
    • Can we or can we not ‘push’ or ‘pull’ an expectation through a sum …
    • … when the terms in that sum depend on the value of a random variable?
  • To make this a lot more obvious, we can rewrite these as sums over the whole sample.
    • Instead of ‘excluding’ the terms where \(X_i=0\), we ‘make them zero’ by multiplying in the indicator \(1_{=1}(X_i)\).
    • Then, of course we can distribute expectations through the sum.
    • The question becomes whether we can ‘pull out’ the indicator.
    • And we have a rule for that. We can do it if we’re conditioning on \(X_i\).
  • When we write out a subsample mean, we do this rewriting in two places …
    • in the numerator, for the sum of observations in the subsample.
    • in the denominator, for the number of observations in it, which is a sum of ones over the subsample.

\[ \small{ \begin{aligned} \mathop{\mathrm{E}}[\hat\mu(1)] &=\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}\qty{\frac{\sum_{i:X_i=1} Y_{i}}{\sum_{i:X_i=1} 1} \mid X_1 \ldots X_n}] \\ &=\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}\qty{\frac{\sum_{i=1}^{n} 1_{=1}(X_i) Y_{i}}{\sum_{i=1}^{n} 1_{=1}(X_i)} \mid X_1 \ldots X_n}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n} 1_{=1}(X_i) \mathop{\mathrm{E}}\qty{ Y_{i} \mid X_i}}{\sum_{i=1}^{n}1_{=1}(X_i)}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n}1_{=1}(X_i) \mu(X_i)}{\sum_{i=1}^{n}1_{=1}(X_i)}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n}1_{=1}(X_i) \mu(1)}{\sum_{i=1}^{n}1_{=1}(X_{i})}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\mu(1)\sum_{i=1}^{n} 1_{=1}(X_i) }{\sum_{i=1}^{n}1_{=1}(X_{i})}] \\ &=\mu(1) \ \mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n} 1_{=1}(X_i) }{\sum_{i=1}^{n}1_{=1}(X_{i})}] = \mu(1) \mathop{\mathrm{E}}[1] = \mu(1). \end{aligned} } \]
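To see this claim numerically, here is a simulation sketch on a made-up population (the incomes and degree shares are invented): the average of \(\hat\mu(1)\) across many simulated samples comes out close to \(\mu(1)\).

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up finite population: x_j a degree indicator, y_j an income (in $k).
m = 5_000
x_pop = rng.binomial(1, 0.4, size=m)
y_pop = np.where(x_pop == 1, rng.gamma(2.0, 35.0, size=m), rng.gamma(2.0, 15.0, size=m))
mu_1 = y_pop[x_pop == 1].mean()            # subpopulation mean mu(1)

n, draws = 200, 20_000
estimates = np.empty(draws)
for k in range(draws):
    idx = rng.integers(0, m, size=n)       # n die rolls: sampling with replacement
    X, Y = x_pop[idx], y_pop[idx]
    estimates[k] = Y[X == 1].mean()        # subsample mean mu_hat(1); with n = 200 and
                                           # 40% degree-holders, the subsample isn't empty

print(mu_1, estimates.mean())              # simulated E[mu_hat(1)] is close to mu(1)
```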

Differences in Subsample Means

Claim. The difference in subsample means is unbiased for the difference in subpopulation means.

\[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0)] = \mu(1) - \mu(0) \]

This follows from the linearity of expectations and unbiasedness of the subsample means.

\[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0)] = \mathop{\mathrm{E}}[\hat\mu(1)] - \mathop{\mathrm{E}}[\hat\mu(0)] = \mu(1) - \mu(0) \]

Summary

[Figure: a sampling distribution centered on the target (“Like this.”) next to one that isn’t (“Not like this.”)]
  • Subsample means are unbiased estimators of the corresponding subpopulation means.
  • And, expectation being linear, this extends to differences in subsample means.
  • Now that we’ve worked out that the location is good, it’s time to talk about spread.
  • That’s what we’ll do next time.