Conditional Expectations
| | income | education | county |
|---|---|---|---|
| 1 | $120k | 16 | kern |
| 2 | $4k | 13 | kern |
| 3 | $15k | 12 | kern |
| 4 | $15k | 13 | kern |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 2271 | $0k | 20 | tulare |
| | income | education | county |
|---|---|---|---|
| 1 | $51k | 14 | sacramento |
| 2 | $0k | 16 | monterey |
| 3 | $0k | 13 | unknown |
| ⋮ | ⋮ | ⋮ | ⋮ |
| 14194 | $62k | 12 | LA |
\[ \mu = \frac{1}{m}\sum_{j=1}^m y_j \]
\[ \hat \mu = \frac{1}{n}\sum_{i=1}^n Y_i \]
\[ \mathop{\mathrm{E}}[\hat \mu] = \mu \qfor \underset{\text{\color[RGB]{64,64,64}{population mean}}}{\mu = \mathop{\mathrm{E}}[Y_i] = \frac{1}{m}\sum_{j=1}^m y_j} \]
\[ \mathop{\mathrm{sd}}[\hat \mu] = \frac{\sigma}{\sqrt{n}} \qfor \underset{\text{\color[RGB]{64,64,64}{population standard deviation}}}{\sigma=\sqrt{\mathop{\mathrm{V}}[Y_i]}=\sqrt{\frac{1}{m}\sum_{j=1}^m (y_j - \mu)^2}} \]
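As a quick Monte Carlo sanity check of both formulas, here's a sketch using a small made-up population of ten incomes (not the survey data above) and only Python's standard library:

```python
import random
import statistics

random.seed(0)

# Hypothetical population of m = 10 incomes in $k (made up for illustration).
population = [120, 4, 15, 15, 0, 62, 51, 30, 70, 8]
m = len(population)
mu = sum(population) / m                                      # population mean
sigma = (sum((y - mu) ** 2 for y in population) / m) ** 0.5   # population sd

n, reps = 5, 20000
# Each rep draws n people uniformly at random with replacement and
# records the sample mean mu_hat.
mu_hats = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]

# The average of the mu_hats approximates E[mu_hat] = mu, and their
# standard deviation approximates sd[mu_hat] = sigma / sqrt(n).
avg_of_means = statistics.mean(mu_hats)
sd_of_means = statistics.pstdev(mu_hats)
```

With 20,000 replications, both simulated quantities should land within a fraction of a $k of the theoretical values.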
\[ \hat \mu(x) = \frac{1}{N_x}\sum_{i:X_i=x} Y_i \quad \text{ where } \quad N_x = \sum_{i:X_i=x} 1. \]
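In code, \(\hat\mu(x)\) is just a group-by-and-average. A minimal sketch with made-up (income, education) pairs:

```python
from collections import defaultdict

# Hypothetical sample of (income in $k, years of education) pairs.
data = [(120, 16), (4, 13), (15, 12), (15, 13), (0, 20), (62, 12)]

# mu_hat(x) = (1 / N_x) * sum of Y_i over observations with X_i = x.
totals = defaultdict(float)
counts = defaultdict(int)
for income, education in data:
    totals[education] += income
    counts[education] += 1

mu_hat = {x: totals[x] / counts[x] for x in totals}
# e.g. mu_hat[13] averages the incomes of everyone with 13 years of education
```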
A lot of this is review.
To make this a nice cohesive read, I’ve included all the unconditional probability stuff we’ve covered so far and the new conditional stuff.
| | income50k | income | education | county |
|---|---|---|---|---|
| 1 | 1 | $51k | 14 | sacramento |
| 2 | 0 | $0k | 16 | monterey |
| 3 | 0 | $0k | 13 | unknown |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 14194 | 1 | $62k | 12 | LA |
| | roll | income50k | income | education | county |
|---|---|---|---|---|---|
| 1 | 1017 | 1 | $120k | 16 | kern |
| 2 | 8004 | 0 | $4k | 13 | kern |
| 3 | 4775 | 0 | $15k | 12 | kern |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 2271 | 7117 | 0 | $0k | 20 | tulare |
\[ P(Y_i = 1) = \sum_{j: y_j = 1} P(\text{roll}_i = j) = \sum_{j: y_j = 1} \frac{1}{m} = \frac{1}{m}\sum_{j=1}^m y_j = \mu. \]
\[ P(Y_i = y) = \sum_{j: y_j = y} P(\text{roll}_i = j) \]
\[ \begin{aligned} \mathop{\mathrm{E}}[Y_i] = \sum_y P(Y_i = y) \times y = \sum_y \qty(\sum_{j: y_j = y} \frac{1}{m}) \times y = \frac{1}{m}\sum_y\sum_{j:y_j = y} y = \frac{1}{m}\sum_{j=1}^m y_j \end{aligned} \]
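The grouping step in this derivation — summing within each value \(y\), then over values — is easy to check numerically. A sketch with a made-up 0/1 column like income50k:

```python
from collections import Counter

# Hypothetical y_1 ... y_m (a 0/1 column like income50k).
ys = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
m = len(ys)

# Way 1: group terms by value y, i.e. sum over y of P(Y_i = y) * y.
pmf = {y: c / m for y, c in Counter(ys).items()}
by_value = sum(p * y for y, p in pmf.items())

# Way 2: average the list directly, (1/m) * sum of y_j.
direct = sum(ys) / m
```

Both ways give the same number, as the derivation says they must.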
\[ \begin{aligned} P(Y_i=1) = \sum_{j: y_j = 1} P(\text{roll}_i = j) &= \sum_{j:y_j=1} \frac{1}{m} \\ &= \sum_{j=1}^m \frac{1}{m} \times \begin{cases} 1 & \ \ \text{ if } y_j = 1 \\ 0 & \ \ \text{ otherwise} \\ \end{cases} \\ &= \sum_{j=1}^m \frac{1}{m} \times y_j = \mathop{\mathrm{E}}[Y_i]. \end{aligned} \]
\[ P(Z \in A) = \mathop{\mathrm{E}}[1_A(Z)] \qfor 1_A(z) = \begin{cases} 1 & \qqtext{ for } z \in A \\ 0 & \qqtext{ otherwise} \end{cases} \]
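This identity — a probability is the expected value of an indicator — is easy to check by simulation. A sketch with a fair die roll \(Z\) and the made-up event \(A = \{5, 6\}\):

```python
import random

random.seed(1)

# Z is a fair die roll; A = {5, 6} is a made-up event for illustration.
def indicator_A(z):
    return 1 if z in (5, 6) else 0

reps = 60000
draws = [random.randint(1, 6) for _ in range(reps)]
mean_indicator = sum(indicator_A(z) for z in draws) / reps  # estimates E[1_A(Z)]
exact = 2 / 6                                               # P(Z in A)
```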
\[ P(Y_1 = y_1, \ldots, Y_k = y_k) = P(Y_1=y_1) \times \ldots \times P(Y_k=y_k) \quad \text{ when $Y_1 \ldots Y_k$ are independent.} \]
\[ E(Y) = E \{ E( Y \mid X ) \} \quad \text{ for any random variables $X, Y$} \]
To calculate the mean of \(Y\), we can average within subpopulations, then across subpopulations.
\[ E( Y \mid X, X' ) = E( Y \mid X ) \quad \text{ when $X'$ is independent of $X,Y$ } \]
If \(X'\) is independent of \(X\) and \(Y\), conditioning on it tells us nothing more about \(Y\).
\[ \begin{aligned} E ( a Y + b Z ) &= E (aY) + E (bZ) \\ &= aE(Y) + bE(Z) \\ & \text{ for random variables $Y, Z$ and numbers $a,b$ } \end{aligned} \]
\[ \small{ \begin{aligned} \mathop{\mathrm{E}}\qty( a Y + b Z ) &= \sum_{y}\sum_z (a y + b z) \ P(Y=y, Z=z) && \text{ by definition of expectation} \\ &= \sum_{y}\sum_z a y \ P(Y=y, Z=z) + \sum_{z}\sum_y b z \ P(Y=y, Z=z) && \text{changing the order in which we sum} \\ &= \sum_{y} a y \ \sum_z P(Y=y,Z=z) + \sum_{z} b z \ \sum_y P(Y=y,Z=z) && \text{pulling constants out of the inner sums} \\ &= \sum_{y} a y \ P(Y=y) + \sum_{z} b z \ P(Z=z) && \text{summing to get marginal probabilities from our joint } \\ &= a\sum_{y} y \ P(Y=y) + b\sum_{z} z \ P(Z=z) && \text{ pulling constants out of the remaining sum } \\ &= a\mathop{\mathrm{E}}Y + b \mathop{\mathrm{E}}Z && \text{by definition} \end{aligned} } \]
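A numeric spot-check of linearity with a small made-up joint distribution — note that \(Y\) and \(Z\) need not be independent:

```python
# Small made-up joint pmf P(Y = y, Z = z); Y and Z are dependent here,
# which is fine -- linearity doesn't require independence.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}
a, b = 2.0, -3.0

lhs = sum((a * y + b * z) * p for (y, z), p in joint.items())  # E(aY + bZ)
EY = sum(y * p for (y, z), p in joint.items())                 # marginal E(Y)
EZ = sum(z * p for (y, z), p in joint.items())                 # marginal E(Z)
rhs = a * EY + b * EZ                                          # aE(Y) + bE(Z)
```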
\[ \begin{aligned} E\{ a(X) Y + b(X) Z \mid X \} &= E\{a(X)Y \mid X\} + E\{ b(X)Z \mid X\} \\ &= a(X)E(Y \mid X) + b(X)E(Z \mid X) \end{aligned} \]
\(1_{=1}(X)\mu(X)\) is a random variable taking on these two values. \[ 1_{=1}(X)\mu(X) = \begin{cases} 0 \times \mu(0) = 0 \times 1 & \text{ when } X=0 \\ 1 \times \mu(1) = 1 \times 1.25 & \text{ when } X=1 \end{cases} \]
We can write it equivalently as \(1_{=1}(X)\mu(1)\) because either \(X=1\), in which case \(\mu(X)=\mu(1)\), or \(X=0\), in which case the indicator makes both expressions zero.
This comes up a lot when working with subsample means. We'll often swap \(1_{=1}(X)\mu(X)\) for \(1_{=1}(X)\mu(1)\), referring to this as the indicator trick.
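A two-line check of the indicator trick using the example's values \(\mu(0)=1\) and \(\mu(1)=1.25\):

```python
# mu is the example's subpopulation mean function: mu(0) = 1, mu(1) = 1.25.
mu = {0: 1.0, 1: 1.25}
indicator = lambda x: 1 if x == 1 else 0

# The indicator is zero whenever X != 1, so multiplying by mu(X) or by
# the constant mu(1) gives the same random variable.
for x in (0, 1):
    assert indicator(x) * mu[x] == indicator(x) * mu[1]
```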
Let's think about a random person drawn from this population. Suppose the subpopulation mean income is 70k for people with degrees and 30k for people without, and that 4/10 of the population has degrees.
Q. What is \(E(Y \mid X)\)? And what is \(E(Y)\)?
A. \(E(Y \mid X)\) is the random variable that equals 70k when \(X=1\) and 30k when \(X=0\). Averaging it over \(X\) gives \(E(Y)\): \[ \begin{aligned} E(Y) = E\{ E( Y \mid X ) \} &= E(Y|X=1)P(X=1) + E(Y|X=0)P(X=0) \\ &= 70k \cdot 4/10 + 30k \cdot 6/10 = 28k + 18k = 46k \end{aligned} \]
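The same two-step average in code, with the problem's numbers:

```python
# Average within subpopulations (given), then across them, using the
# problem's numbers: P(X=1) = 4/10, E(Y|X=1) = 70, E(Y|X=0) = 30 ($k).
cond_mean = {1: 70.0, 0: 30.0}
p_x = {1: 0.4, 0: 0.6}
EY = sum(cond_mean[x] * p_x[x] for x in (0, 1))  # E{E(Y|X)} = 46
```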
\[ \small{ \begin{aligned} \mathop{\mathrm{E}}\qty{ \mu(X) } &= \frac{1}{2}\mu(0) + \frac{1}{2}\mu(1)= \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot 1.25 = 1.125 && \text{ the first way } \\ \mathop{\mathrm{E}}\qty{ \mu(X) } &= \mathop{\mathrm{E}}\qty{\mathop{\mathrm{E}}\qty(Y \mid X)} = \mathop{\mathrm{E}}\qty{Y} = \frac{1}{6} \cdot 0.75 + \frac{1}{6} \cdot 1 + \ldots && \text{ the second way } \end{aligned} } \]
Claim. The sample mean is an unbiased estimator of the population mean. \[ \mathop{\mathrm{E}}[\hat\mu] = \mu \]
\[ \begin{aligned} \mathop{\mathrm{E}}\qty[\frac1n\sum_{i=1}^n Y_i] &= \frac1n\sum_{i=1}^n \mathop{\mathrm{E}}[Y_i] && \text{ via linearity } \\ &= \frac1n\sum_{i=1}^n \mu && \text{ via equal-probability sampling } \\ &= \frac1n \times n \times \mu = \mu. \end{aligned} \]
Claim. The subsample mean is unbiased for the subpopulation mean. \[ \mathop{\mathrm{E}}[\hat\mu(1)] = \mu(1) \]
\[ \hat\mu(1) = \frac{\sum_{i:X_i=1} Y_i}{\sum_{i:X_i=1} 1} = \frac{\sum_{i=1}^{n} 1_{=1}(X_i) Y_{i}}{\sum_{i=1}^{n} 1_{=1}(X_i)} \]
\[ \small{ \begin{aligned} \mathop{\mathrm{E}}[\hat\mu(1)] &=\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}\qty{\frac{\sum_{i:X_i=1} Y_{i}}{\sum_{i:X_i=1} 1} \mid X_1 \ldots X_n}] \\ &=\mathop{\mathrm{E}}\qty[ \mathop{\mathrm{E}}\qty{\frac{\sum_{i=1}^{n} 1_{=1}(X_i) Y_{i}}{\sum_{i=1}^{n} 1_{=1}(X_i)} \mid X_1 \ldots X_n}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n} 1_{=1}(X_i) \mathop{\mathrm{E}}\qty{ Y_{i} \mid X_i}}{\sum_{i=1}^{n}1_{=1}(X_i)}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n}1_{=1}(X_i) \mu(X_i)}{\sum_{i=1}^{n}1_{=1}(X_i)}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n}1_{=1}(X_i) \mu(1)}{\sum_{i=1}^{n}1_{=1}(X_{i})}] \\ &=\mathop{\mathrm{E}}\qty[\frac{\mu(1)\sum_{i=1}^{n} 1_{=1}(X_i) }{\sum_{i=1}^{n}1_{=1}(X_{i})}] \\ &=\mu(1) \ \mathop{\mathrm{E}}\qty[\frac{\sum_{i=1}^{n} 1_{=1}(X_i) }{\sum_{i=1}^{n}1_{=1}(X_{i})}] = \mu(1) \mathop{\mathrm{E}}[1] = \mu(1). \end{aligned} } \]
Claim. The difference in subsample means is unbiased for the difference in subpopulation means.
\[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0)] = \mu(1) - \mu(0) \]
This follows from the linearity of expectations and unbiasedness of the subsample means.
\[ \mathop{\mathrm{E}}[\hat\mu(1) - \hat\mu(0)] = \mathop{\mathrm{E}}[\hat\mu(1)] - \mathop{\mathrm{E}}[\hat\mu(0)] = \mu(1) - \mu(0) \]
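A simulation backing up this claim, with a hypothetical population in which 40% hold degrees and incomes are drawn so that \(\mu(1)=70\) and \(\mu(0)=30\) (in $k):

```python
import random
import statistics

random.seed(2)

# Hypothetical population: 40% hold degrees (X = 1); given X, income Y
# is drawn with mean mu(1) = 70 or mu(0) = 30 (in $k). The normal draw
# is just a convenient made-up model for this check.
def draw_person():
    x = 1 if random.random() < 0.4 else 0
    y = random.gauss(70.0 if x == 1 else 30.0, 10.0)
    return x, y

def diff_in_subsample_means(n):
    sample = [draw_person() for _ in range(n)]
    with_degree = [y for x, y in sample if x == 1]
    without = [y for x, y in sample if x == 0]
    if not with_degree or not without:  # skip the (rare) degenerate samples
        return None
    return statistics.mean(with_degree) - statistics.mean(without)

diffs = [d for d in (diff_in_subsample_means(50) for _ in range(4000)) if d is not None]
avg_diff = statistics.mean(diffs)  # approximates mu(1) - mu(0) = 40
```

Averaged over many samples, the difference in subsample means settles near the population difference of 40k, as unbiasedness predicts.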