Summarizing Trends involving Many Groups
| | income | education | age | county |
---|---|---|---|---|
1 | $55k | 13 | 35 | orange |
2 | $25k | 13 | 27 | LA |
3 | $44k | 16 | 34 | san joaquin |
4 | $22k | 14 | 34 | orange |
5 | $0k | 16 | 31 | san diego |
6 | $105k | 16 | 27 | LA |
7 | $1k | 16 | 25 | LA |
8 | $21k | 14 | 30 | unknown |
⋮ | | | | |
2270 | $85k | 16 | 31 | orange |
2271 | $150k | 16 | 32 | stanislaus |

| | income | education | county |
---|---|---|---|
1 | $22k | 18 | unknown |
2 | $0k | 16 | solano |
3 | $98k | 16 | LA |
4 | $25k | 12 | tulare |
5 | $19k | 14 | san luis obispo |
⋮ | | | |
5677499 | $11k | 10 | alameda |
5677500 | $116k | 18 | unknown |

| | income | education | county |
---|---|---|---|
1 | $55k | 13 | orange |
2 | $25k | 13 | LA |
3 | $44k | 16 | san joaquin |
4 | $22k | 14 | orange |
5 | $0k | 16 | san diego |
⋮ | | | |
2270 | $85k | 16 | orange |
2271 | $150k | 16 | stanislaus |
The population we’re looking at is made up. It’s just an illustration. We don’t actually have all this information.
education | mean | sd | N |
---|---|---|---|
< 12 | 19K | 22K | 270K |
14 | 33K | 28K | 548K |
16 | 58K | 64K | 2M |
> 16 | 84K | 82K | 770K |
library(dplyr) # provides case_match(), available in dplyr 1.1.0 and later
education <- case_match(cps.data$a_hga,
0 ~ 0, # child
31 ~ 0, # < grade 1
32 ~ 4, # grade 1-4
33 ~ 6, # grade 5-6
34 ~ 8, # grade 7-8
35 ~ 9, # grade 9
36 ~ 10, # grade 10
37 ~ 11, # grade 11
38 ~ 11, # grade 12 no diploma
39 ~ 12, # high school grad
40 ~ 13, # some college
41 ~ 14, # associate degree (vocational)
42 ~ 14, # associate degree (academic)
43 ~ 16, # bachelors degree
44 ~ 18, # masters degree
45 ~ 20, # professional school degree
46 ~ 20) # doctorate
Describe the comparison we’re visualizing, the difference in the means of the green dots and red dots, in …
People with graduate degrees vs. people with 4-year degrees (only).
Option 1. A compact, readable version \[ \frac{\sum\limits_{j:x_j > 16} y_j}{\sum_{j:x_j > 16} 1} - \frac{\sum\limits_{j:x_j = 16} y_j}{\sum_{j:x_j = 16} 1} \]
Option 2. A verbose version that’s easier to use in probability calculations. \[ \frac{\sum\limits_{j=1}^n 1_{\{18,20\}}(x_j) y_j}{\sum\limits_{j=1}^n 1_{\{18, 20\}}(x_j)} - \frac{\sum\limits_{j=1}^n1_{\{16\}}(x_j)y_j}{\sum\limits_{j=1}^n 1_{\{16\}}(x_j)} \ \ \text{ where } \ \ 1_{S}(x) = \begin{cases} 1 & \text{ if } x \in S \\ 0 & \text{ otherwise} \end{cases}. \]
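As a rough check that the two options describe the same number, here is a minimal R sketch. The vectors `x` and `y` are made-up toy data (not the CPS sample), and `indicator()` is a hypothetical helper introduced only for this illustration. It assumes, as in the education coding used here, that the only values above 16 are 18 and 20.

```r
# Hypothetical toy sample: x = years of education, y = income in $K.
x <- c(16, 16, 18, 20, 16, 18, 14, 12)
y <- c(50, 70, 90, 110, 60, 80, 30, 20)

# Option 1: pick out each group, then average within it.
diff_subset <- mean(y[x > 16]) - mean(y[x == 16])

# Option 2: sum over the whole sample, using indicator functions 1_S(x).
# indicator(S) returns a function that is 1 where its argument is in S, else 0.
indicator <- function(S) function(v) as.numeric(v %in% S)
in_grad <- indicator(c(18, 20))
in_ba   <- indicator(16)

diff_indicator <- sum(in_grad(x) * y) / sum(in_grad(x)) -
  sum(in_ba(x) * y) / sum(in_ba(x))

# The two versions should agree up to floating-point error.
all.equal(diff_subset, diff_indicator)
```

The indicator version is longer, but every sum runs over the full sample, which is what makes it convenient in probability calculations.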
Describe the comparison we’re visualizing, the difference in the means of the green dots and red dots, in …
People with 2-year degrees (only) vs. people who haven’t completed a year of college.
Option 1. A compact, readable version \[ \frac{\sum\limits_{j:x_j = 14} y_j}{\sum\limits_{j:x_j = 14} 1} - \frac{\sum\limits_{j:x_j \le 12} y_j}{\sum\limits_{j:x_j \le 12} 1} \]
Option 2. A verbose version that’s easier to use in probability calculations. \[ \frac{\sum\limits_{j=1}^n 1_{\{14\}}(x_j) y_j}{\sum\limits_{j=1}^n 1_{\{14\}}(x_j)} - \frac{\sum\limits_{j=1}^n1_{\{8 \ldots 12\}}(x_j)y_j}{\sum\limits_{j=1}^n 1_{\{8 \ldots 12\}}(x_j)} \ \ \text{ where } \ \ 1_S(x) = \begin{cases} 1 & \text{ if } x \in S \\ 0 & \text{ otherwise} \end{cases}. \]
\[ \mu(x) = \frac{1}{m_x}\sum_{j:x_j=x} y_j \quad \text{ where } \quad m_x = \sum_{j:x_j=x} 1 \]
What, on the plot, corresponds to \(\mu(14)\)?
The black dot in the \(x=14\) column.
What, on the plot, corresponds to \(m_{12}\)?
The number of red dots in the column where \(x=12\).
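The column means \(\mu(x)\) and column counts \(m_x\) can be computed in a couple of lines. This is only a sketch in base R on made-up data. The vectors `x` and `y` are hypothetical, standing in for the education and income columns.

```r
# Hypothetical toy sample: x = years of education, y = income in $K.
x <- c(12, 12, 14, 14, 14, 16)
y <- c(20, 30, 25, 35, 45, 60)

m  <- table(x)            # m_x: how many people fall in each column
mu <- tapply(y, x, mean)  # mu(x): mean income within each column

mu["14"]  # the height of the black dot in the x = 14 column
m["12"]   # the number of dots in the x = 12 column
```

Both results are indexed by the education level as a character string, which is why the lookups use `"14"` and `"12"` rather than numbers.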
What, in terms of \(\mu(x)\), is the mean outcome among people in our sample who did not attend college?
Strategy.
\[ \begin{aligned} \frac{\sum\limits_{j:x_j \in 8 \ldots 12} y_j}{\sum\limits_{j:x_j \in 8 \ldots 12} 1} &= \frac{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} y_j}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} 1} = \frac{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}} \textcolor[RGB]{239,71,111}{m_x} \times \textcolor[RGB]{17,138,178}{\frac{1}{m_x}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} y_j}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{m_x}} \\ &= \textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}} p_x \times \textcolor[RGB]{17,138,178}{\mu(x)} \quad \text{ for } \quad p_x = \frac{\textcolor[RGB]{239,71,111}{m_x}}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}\textcolor[RGB]{17,138,178}{m_x}}}. \end{aligned} \]
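The identity above says that a mean over a pooled group is a weighted average of the column means, with weights proportional to the column sizes. A minimal numerical check in base R, on made-up data (the vectors `x` and `y` are hypothetical), might look like this:

```r
# Hypothetical toy sample: x = years of education, y = income in $K.
x <- c(8, 9, 9, 10, 11, 11, 12, 12, 12, 14, 16)
y <- c(10, 12, 14, 8, 18, 22, 28, 30, 32, 40, 60)

no_college <- x %in% 8:12

# Left-hand side: average directly over the people with x in 8...12.
direct <- mean(y[no_college])

# Right-hand side: weight each column mean mu(x) by p_x = m_x / sum of m_x.
mu <- as.numeric(tapply(y[no_college], x[no_college], mean))
m  <- as.numeric(table(x[no_college]))
p  <- m / sum(m)
weighted <- sum(p * mu)

all.equal(direct, weighted)
```

Both `tapply()` and `table()` order their results by the sorted education levels, so the entries of `mu` and `m` line up column by column.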
\[ \frac{1}{4}\sum_{x=9}^{12} \qty{ \mu(x) - \mu(x-1) } \]
\[ \frac{ \qty{\mu(12) - \mu(11)} + \qty{ \mu(11) - \mu(10) } + \qty{ \mu(10) - \mu(9) } + \qty{ \mu(9) - \mu(8) } }{4} = \frac{\mu(12) - \mu(8)}{4} \]
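The cancellation can be checked numerically. In this sketch the starting value \(\mu(8) = 20\) is hypothetical. The remaining values are chosen so that the successive differences are 2.1, -11.4, 10.2, and 10.6 (in $K), the increments tabulated in this section.

```r
# Column means mu(8), ..., mu(12) in $K; mu(8) = 20 is an assumed baseline.
mu <- c("8" = 20.0, "9" = 22.1, "10" = 10.7, "11" = 20.9, "12" = 31.5)

# Average of the four year-over-year increments mu(x) - mu(x - 1).
increments <- diff(mu)
over_years <- mean(increments)

# Telescoping shortcut: only the endpoints survive.
shortcut <- (mu[["12"]] - mu[["8"]]) / 4

all.equal(over_years, shortcut)
```

Because the interior terms cancel, the average over years ignores everything that happens between the endpoints, including the dip at \(x = 10\).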
\[ \frac{\sum_{j:x_j \in 9 \ldots 12} \qty{ \mu(x_j) - \mu(x_j-1) }}{\sum_{j:x_j \in 9 \ldots 12} 1} = \sum_{x \in 9 \ldots 12} p_x \ \qty{ \mu(x) - \mu(x-1) } \quad \text{ for } \quad p_x = \frac{m_x}{\sum_{x \in 9 \ldots 12}m_x}. \]
We can try the ‘cancellation trick’ we used before, but it doesn’t work very well.
\[\small{ \begin{aligned} & p_{12} \qty{ \mu(12) - \mu(11) } + p_{11} \qty{ \mu(11) - \mu(10) } + p_{10} \qty{ \mu(10) - \mu(9) } + p_{9} \qty{ \mu(9) - \mu(8) } \\ &= p_{12}\mu(12) + (p_{11}-p_{12}) \mu(11) + (p_{10}-p_{11}) \mu(10) + (p_9 - p_{10}) \mu(9) - p_9\mu(8) \end{aligned}} \]
Year Increment | → 9 | → 10 | → 11 | → 12 |
---|---|---|---|---|
Mean Income Increment | 2.1K | -11.4K | 10.2K | 10.6K |
Proportion of People | 0.03 | 0.02 | 0.10 | 0.84 |
Year Increment | → 9 | → 10 | → 11 | → 12 |
---|---|---|---|---|
Mean Income Increment | | | | 10.6K |
Proportion of People | 0 | 0 | 0 | 1 |
Technique | Over People | Over Years |
---|---|---|
Caricature | 10.6K | 2.9K |
Calculation | 9.79K | 2.86K |
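The two summaries can be approximately reproduced from the rounded increments and proportions tabulated above. This is a sketch, not the original calculation: the inputs are the displayed, rounded values (the proportions sum to 0.99), so the results should only be expected to land near the "Calculation" row.

```r
# Rounded values copied from the increment table (income in $K).
increment  <- c(2.1, -11.4, 10.2, 10.6)  # steps to 9, 10, 11, 12 years
proportion <- c(0.03, 0.02, 0.10, 0.84)  # share of people at each level

# Average over years: every one-year step counts equally.
over_years <- mean(increment)

# Average over people: each step is weighted by how many people are there.
# weighted.mean() rescales the weights so that they sum to one.
over_people <- weighted.mean(increment, proportion)

c(over_years = over_years, over_people = over_people)
```

The average over people is dominated by the step to 12 years because 84% of this group sits there, which is why the one-column caricature gets so close.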
Year Increment | → 14 | → 16 | → 18 | → 20 |
---|---|---|---|---|
Mean Income Increment | 9.9K | 24.0K | 23.0K | 23.8K |
Proportion of People | 0.18 | 0.55 | 0.21 | 0.06 |
Year Increment | → 14 | → 16 | → 18 | → 20 |
---|---|---|---|---|
Mean Income Increment | 24K | 24K | 24K | 24K |
Proportion of People | 0.18 | 0.55 | 0.21 | 0.06 |
Year Increment | → 14 | → 16 | → 18 | → 20 |
---|---|---|---|---|
Mean Income Increment | 10K | 24K | 24K | 24K |
Proportion of People | 0.18 | 0.55 | 0.21 | 0.06 |
Technique | Over People | Over Degrees |
---|---|---|
Coarse Caricature | 24K | 24K |
Fine Caricature | 21K | 20K |
Calculation | 21.22K | 20.18K |
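The three rows of this comparison can be approximately recomputed from the tables above. The function name `summaries()` is a hypothetical helper introduced for this sketch, and the inputs are the rounded table values, so small differences from the "Calculation" row are expected.

```r
# Share of people at each degree step (14, 16, 18, 20), from the tables above.
proportion <- c(0.18, 0.55, 0.21, 0.06)

# Hypothetical helper: both summaries for one vector of increments (in $K).
summaries <- function(increment) {
  c(over_people  = weighted.mean(increment, proportion),
    over_degrees = mean(increment))
}

coarse      <- summaries(rep(24, 4))                  # every step is 24K
fine        <- summaries(c(10, 24, 24, 24))           # first step is 10K
calculation <- summaries(c(9.9, 24.0, 23.0, 23.8))    # tabulated increments

round(rbind(coarse, fine, calculation), 2)
```

With these inputs the fine caricature comes out near 21.5K over people and 20.5K over degrees, which the table reports as 21K and 20K.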
Storytelling, Visualization, and Estimation Targets
Match each story to a plot above. Then, for each story, approximately calculate our two summaries.
Story | 1 | 2 | 3 |
---|---|---|---|
Plot | B | A | C |
Average over Years | 2.95K | 2.95K | 2.95K |
Average over People | 10K | 185 | 485 |
Match each story to a plot above. Then, for each story, approximately calculate our two summaries.
Story | 1 | 2 | 3 |
---|---|---|---|
Plot | B | A | C |
Average over Years | 2.9K | 2.9K | 2.9K |
Average over People | 9.6K | 200 | 500 |
Match each story to a plot above. Then, for each story, approximately calculate our two summaries.
Story | 1 | 2 | 3 |
---|---|---|---|
Plot | A | C | B |
Average over Years | 8.31K | 8.31K | 20.01K |
Average over People | 17K | 19K | 16K |
Match each story to a plot above. Then, for each story, approximately calculate our two summaries.
Story | 1 | 2 | 3 |
---|---|---|---|
Plot | A | C | B |
Average over Years | 8K | 8K | 20K |
Average over People | 17K | 19K | 16K |