Lecture 7

Summarizing Trends involving Many Groups

Today

We’re going to look at the relationship between income and education.
We’ve done that before. When we did, we thought of education in dichotomous terms.
- We had some people with 4-year degrees — the green dots.
- We had some people without them — the red dots.
Today, we’ll be a bit more granular. We’ll think of education in terms of years of schooling.
- 8+ years = finished middle school
- 12+ years = high school diploma
- 16+ years = 4-year college degree

Our Sample

As before, we’ll work with data from the 2022 Current Population Survey.
We’ll look at California residents age 25-35 with at least an 8th-grade education.
This sample includes 2271 people.
- To visualize it, I’ve marked each one on a map.
- To display some of the information we have, I’ve made a table.
I’ve used color to emphasize the dichotomous view of education we’ve been using.

	income	education	age	county
1	$55k	13	35	orange
2	$25k	13	27	LA
3	$44k	16	34	san joaquin
4	$22k	14	34	orange
5	$0k	16	31	san diego
6	$105k	16	27	LA
7	$1k	16	25	LA
8	$21k	14	30	unknown
⋮
2270	$85k	16	31	orange
2271	$150k	16	32	stanislaus

A Reminder about Visualization

with jitter

without jitter

Our primary visualization will be a scatter plot. It shows a dot for each person in our sample.
The person’s income is on the y-axis and their education on the x-axis.
We add a bit of fuzz to the displayed x-coordinate to space the dots out.
- But not so much that it’s unclear what the real value of $X$ is.
- This is called ‘jittering’. Sometimes a plot like this is called a ‘jitter plot’.
As before, we can use color to highlight the categories of people with and without 4-year degrees.
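
Jittering can be sketched in a few lines of R. The data below are simulated stand-ins, not the CPS sample, and the names `x`, `y`, and `x_jittered` as well as the jitter width of 0.3 years are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 200, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(200, sd = 15000))         # income in dollars

# Jitter: shift each displayed x-coordinate by at most 0.3 years.
x_jittered <- x + runif(length(x), min = -0.3, max = 0.3)

# The shift is small enough that rounding recovers the real value of x.
print(all(round(x_jittered) == x))
```

Passing `x_jittered` and `y` to a plotting function such as `plot()` would then draw the jittered scatter plot.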

What We Want to Know

	income	education	county
1	$22k	18	unknown
2	$0k	16	solano
3	$98k	16	LA
4	$25k	12	tulare
5	$19k	14	san luis obispo
⋮
5677499	$11k	10	alameda
5677500	$116k	18	unknown

	income	education	county
1	$55k	13	orange
2	$25k	13	LA
3	$44k	16	san joaquin
4	$22k	14	orange
5	$0k	16	san diego
⋮
2270	$85k	16	orange
2271	$150k	16	stanislaus

Population

Sample

We’re not particularly interested in our sample itself, i.e., Californians age 25-35 who responded to the survey.
We’re interested in the population from which it’s sampled, i.e. all Californians age 25-35.
That’s not new. We’ve been doing this since the semester started. And we have a recipe for doing it.
1. Choosing a target. We think about how we would summarize the population if we had surveyed everyone.
2. Point estimation. However we would summarize the population, we do summarize our sample.
3. Interval estimation. We work out what our point estimator’s sampling distribution tells us about our target.

What is New

Now that we’re taking a more granular view of education, there’s a lot more to the first step.
Today, all we’ll do is think about what we might want to know. We’ll talk as if we had surveyed everyone.
Next time, we’ll talk about point and interval estimation. It’ll be familiar, but there’ll be some new twists.

The population we’re looking at is made up. It’s just an illustration. We don’t actually have all this information.

Summarization

Looking at the dots alone, we can get rough answers to some simple questions.
We can see that people with four-year degrees tend to earn more than people without them.
But it’s hard to be precise just eyeballing things. It helps to look at numerical summaries.
We can, e.g., overlay the mean income $\pm$ one standard deviation for each group. Do it!

That’s shown by the dot’s position and its ‘arms’ on the y-axis.
Its position on the x-axis shows the mean years of schooling within the group.
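
Here is one way the numbers behind that overlay might be computed. It is a sketch on simulated data, not the CPS sample; the names `has_degree` and `summary_table` are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

# Dichotomize: 16+ years of schooling means a 4-year degree.
has_degree <- x >= 16

# Per-group mean and standard deviation of income, and mean years of schooling.
group_mean <- tapply(y, has_degree, mean)
group_sd   <- tapply(y, has_degree, sd)
mean_x     <- tapply(x, has_degree, mean)

# One row per group: where the dot goes (mean_schooling, mean_income)
# and where its arms end (lower, upper).
summary_table <- data.frame(
  group          = c("no 4-year degree", "4-year degree"),
  mean_schooling = as.numeric(mean_x),
  lower          = as.numeric(group_mean - group_sd),
  mean_income    = as.numeric(group_mean),
  upper          = as.numeric(group_mean + group_sd)
)
print(summary_table)
```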

The Dichotomous Version

It’s worth visualizing the information lost by dichotomizing education.
To do this, we can look at a similar plot without that information. This one.
In it, I’ve replaced the x-coordinates with random values that tell us nothing beyond group membership.
People without 4-year degrees get random values uniformly distributed between 8 and 14.
- That’s the range of years of schooling in that group.
People with 4-year degrees get random values between 16 and 20.
- Same deal.
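
The replacement of x-coordinates described above might look like this in R. It is a sketch on simulated data; `x_forgotten` is a hypothetical name for the uninformative x-coordinates.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

has_degree <- x >= 16

# Replace each x-coordinate with a uniform draw from its group's range,
# so the new coordinate carries no information beyond group membership.
x_forgotten <- ifelse(has_degree,
                      runif(length(x), min = 16, max = 20),
                      runif(length(x), min = 8,  max = 14))

# Group membership is still recoverable from the new coordinates.
print(all((x_forgotten >= 16) == has_degree))
```

Incomes are untouched, so the group means and standard deviations are the same in both plots. Only the within-group pattern along the x-axis is lost.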

Column Comparisons

Let’s think about what we can do with the additional information we have now.
One simple option is to work with the finest grouping we can: the columns in our scatter plot.
We can, e.g., look at the income mean and standard deviation for people with 8, 9, 10, etc. years of schooling.
Or compare any pair of those groups. It’s natural to look at groups that are in some sense adjacent.
- 18 vs. 16: the value of a master’s degree (only) vs. a 4-year one (only).
- 16 vs. 14: the value of a 4-year degree (only) vs. a 2-year degree (only).
- 14 vs. 12: the value of a 2-year degree (only) vs. a high school degree (only).
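
Column summaries and adjacent-column comparisons like these can be sketched with `tapply`. The data are simulated stand-ins, and the names `mu`, `s`, and `comparisons` are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

# Mean and standard deviation of income within each column of the scatter plot.
# The results are named by the column's x value: "8", "9", ..., "20".
mu <- tapply(y, x, mean)
s  <- tapply(y, x, sd)

# Differences in mean income between adjacent degree levels.
comparisons <- c(
  "18 vs. 16" = unname(mu["18"] - mu["16"]),
  "16 vs. 14" = unname(mu["16"] - mu["14"]),
  "14 vs. 12" = unname(mu["14"] - mu["12"])
)
print(comparisons)
```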

Column Comparisons as ‘Forgetting’

This is essentially the same as what we did when we dichotomized education.
- We just ‘forget’ the other groups. We don’t look at them.
- Everything we proved before the midterm still applies: unbiasedness, variance formulas, etc.
- When we did the math on comparing two groups, we didn’t assume they were the only two groups.
That’s great if we know exactly what we’re interested in—and it’s a comparison like this.
But it’s not so great if we’re a bit less focused—if we want to talk about the whole population.
- That’d mean we might have to think about/report a lot of different numbers.
- It can get to be too much. We’re not summarizing enough.

Coarsening

One remedy is to go back to working with coarser groupings.
We can make up our own by aggregating—combining—the columns we have.
Here, for example, is a breakdown in 4 groups.
- < 14 years: no college degree.
- =14 years: 2-year degree.
- =16 years: 4-year degree.
- > 16 years: graduate degree.
This is more information than a two-group comparison but perhaps little enough to be manageable.
- We can visualize it all in a plot like the one above.
- Or report the set of group-specific summaries in a table like the one below.

education	mean	sd	N
< 12	19K	22K	270K
14	33K	28K	548K
16	58K	64K	2M
> 16	84K	82K	770K
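
A table of this kind can be built by cutting years of schooling into groups and summarizing within each. The sketch below uses simulated data and the four groups from the list above; the cut points and the names `group` and `summary_table` are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

# Aggregate columns into 4 groups. Intervals are open on the left and closed
# on the right: (-Inf, 13], (13, 14], (14, 16], (16, Inf).
group <- cut(x,
             breaks = c(-Inf, 13, 14, 16, Inf),
             labels = c("< 14", "14", "16", "> 16"))

# Group-specific mean, standard deviation, and count.
summary_table <- data.frame(
  education = levels(group),
  mean      = as.numeric(tapply(y, group, mean)),
  sd        = as.numeric(tapply(y, group, sd)),
  N         = as.numeric(table(group))
)
print(summary_table)
```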

Coarsening is Always Happening

library(dplyr)  # case_match() comes from dplyr (version 1.1.0 or later)
education=case_match(cps.data$a_hga,
                 0  ~ 0,   # child
                 31 ~ 0,   # < grade 1
                 32 ~ 4,   #  grade 1-4
                 33 ~ 6,   #  grade 5-6
                 34 ~ 8,   #  grade 7-8 
                 35 ~ 9,   #  grade 9
                 36 ~ 10,  #  grade 10
                 37 ~ 11,  #  grade 11
                 38 ~ 11,  #  grade 12 no diploma
                 39 ~ 12,  #  high school grad
                 40 ~ 13,  #  some college
                 41 ~ 14,  #  associate degree (vocational)
                 42 ~ 14,  #  associate degree (academic)
                 43 ~ 16,  #  bachelors degree
                 44 ~ 18,  #  masters degree
                 45 ~ 20,  #  professional school degree
                 46 ~ 20)  #  doctorate

Acknowledging This
Handling The Complexity

Often, the data we have is already coarsened to some degree.
- Our data does not, for example, distinguish between people with 7 and 8 years of schooling.
- Similarly, it doesn’t tell us years for people with graduate degrees. We’ve made that up.
And to fit it into our analysis, we often have to coarsen it further.
- To fit education into our ‘years of schooling’ framework, we’ve grouped together people with …
  - vocational and academic associate degrees. (12+2=14 years)
  - professional and doctoral degrees. (16 + 4ish ≈ 20 years)
- Our choices about some of these made-up numbers create the gaps we see at 15, 17, and 19 years.

Keeping all this in mind can be overwhelming.
It’s often a good idea to pick an abstraction — like years of schooling — and go with it.
But it’s important to come back to this at some point to make sure your conclusions mean what you think they do.
Suppose you see a bigger income jump between ‘8’ and 9 years of education than between 9 and 10.
- That might be because the group with ‘8’ years actually includes people with 7 and 8 years.
- So many of the people with ‘8’ years have 2 fewer years of schooling than the people with 9.

Dichotomization

Often, when we want to report a numeric summary, we do resort to dichotomization.
But having the more granular information allows us to choose the groups we want.
There are many meaningful choices.
- >16: people with graduate degrees.
- ≤12: people who haven’t been to college.

A Coarsened Comparison: >16 vs. 16

Activity
In words
In math
In code

Describe the comparison we’re visualizing, the difference in the means of the green dots and red dots, in …

Words
Mathematical notation
R code

People with graduate degrees vs. people with 4-year degrees (only).

Option 1. A compact, readable version \[ \frac{\sum\limits_{j:x_j > 16} y_j}{\sum_{j:x_j > 16} 1} - \frac{\sum\limits_{j:x_j = 16} y_j}{\sum_{j:x_j = 16} 1} \]
Option 2. A verbose version that’s easier to use in probability calculations. \[ \frac{\sum\limits_{j=1}^n 1_{\{18,20\}}(x_j) y_j}{\sum\limits_{j=1}^n 1_{\{18, 20\}}(x_j)} - \frac{\sum\limits_{j=1}^n1_{\{16\}}(x_j)y_j}{\sum\limits_{j=1}^n 1_{\{16\}}(x_j)} \ \ \text{ where } \ \ 1_{S}(x) = \begin{cases} 1 & \text{ if } x \in S \\ 0 & \text{ otherwise} \end{cases}. \]

Option 1. A compact, readable version

mean(y[x > 16]) - mean(y[x==16])

[1] 25930.81

Option 2. A verbose version that’s easier to combine with more abstract code

groups = list(a = c(18,20), b=16)
mean(y[x %in% groups$a]) - mean(y[x %in% groups$b])

[1] 25930.81

A Coarsened Comparison: 14 vs ≤12

Activity
In words
In math
In code

Describe the comparison we’re visualizing, the difference in the means of the green dots and red dots, in …

Words
Mathematical notation
R code

People with 2-year degrees (only) vs. people who haven’t completed a year of college.

A compact, readable version \[ \frac{\sum\limits_{j:x_j = 14} y_j}{\sum\limits_{j:x_j = 14} 1} - \frac{\sum\limits_{j:x_j \le 12} y_j}{\sum\limits_{j:x_j \le 12} 1} \]
A verbose version that’s easier to use in probability calculations \[ \frac{\sum\limits_{j=1}^n 1_{\{14\}}(x_j) y_j}{\sum\limits_{j=1}^n 1_{\{14\}}(x_j)} - \frac{\sum\limits_{j=1}^n1_{\{8 \ldots 12\}}(x_j)y_j}{\sum\limits_{j=1}^n 1_{\{8 \ldots 12\}}(x_j)} \ \ \text{ where } \ \ 1_S(x) = \begin{cases} 1 & \text{ if } x \in S \\ 0 & \text{ otherwise} \end{cases}. \]

Option 1. A compact, readable version

mean(y[x == 14]) - mean(y[x<=12])

[1] 5338.928

Option 2. A verbose version that’s easier to combine with more abstract code

groups = list(a = c(14), b=8:12)
mean(y[x %in% groups$a]) - mean(y[x %in% groups$b])

[1] 5338.928

From Column Means to Coarsened Means

We can express the means in our aggregate groups in terms of the means in our columns.
Let’s define some notation describing the dots in each column to make this easier.
- $m_x$ will be the number of dots, i.e. the number of people with $x_j=x$.
- $\mu(x)$ will be the mean height of the dots, i.e. the mean income of people with $x_j=x$.

\[ \mu(x) = \frac{1}{m_x}\sum_{j:x_j=x} y_j \quad \text{ where } \quad m_x = \sum_{j:x_j=x} 1 \]

Question 1
Answer
Question 2
Answer
Question 3
Answer

What, on the plot, corresponds to $\mu(14)$?

The black dot in the $x=14$ column.

What, on the plot, corresponds to $m_{12}$?

The number of red dots in the column where $x=12$.

What, in terms of $\mu(x)$, is the mean outcome among people in our sample who did not attend college?

Strategy.

Where we see a sum over the people in the aggregate group, we rewrite it as a sum of sums.
A sum over columns of sums over the people in them.
We express each column’s sum in terms of its mean and its size.
We make sense of the result.

\[ \begin{aligned} \frac{\sum\limits_{j:x_j \in 8 \ldots 12} y_j}{\sum\limits_{j:x_j \in 8 \ldots 12} 1} &= \frac{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} y_j}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} 1} = \frac{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}} \textcolor[RGB]{239,71,111}{m_x} \times \textcolor[RGB]{17,138,178}{\frac{1}{m_x}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} y_j}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{m_x}} \\ &= \textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}} p_x \times \textcolor[RGB]{17,138,178}{\mu(x)} \quad \text{ for } \quad p_x = \frac{\textcolor[RGB]{239,71,111}{m_x}}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}\textcolor[RGB]{17,138,178}{m_x}}}. \end{aligned} \]

It’s a weighted average of the column means $\mu(x)$.
The weight $p_x$ is the proportion of the people in the aggregate group that are in that column.
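
The identity can be checked numerically. This is a sketch on simulated data, with `m`, `mu`, and `p` as R stand-ins for $m_x$, $\mu(x)$, and $p_x$.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

cols <- 8:12                                       # columns in the aggregate group
m  <- sapply(cols, function(v) sum(x == v))        # m_x: people per column
mu <- sapply(cols, function(v) mean(y[x == v]))    # mu(x): mean income per column
p  <- m / sum(m)                                   # p_x: column's share of the group

direct   <- mean(y[x %in% cols])   # mean over the people in the aggregate group
weighted <- sum(p * mu)            # weighted average of the column means

print(all.equal(direct, weighted))
```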

Aggregating Comparisons

Another option—rather than comparing aggregates—is aggregating comparisons.
We can summarize the ‘value’ of each year of high school by the associated increment in mean income.
- Visually, that’s the slope of the dashed black line leading up to it.
To reduce these to a single number, we can take the average of the increments.
- Visually, that’s the slope of the secant—the blue one. In a sense.
- If that’s not obvious to you, don’t worry. We’ll come back to it in a minute.
Why in a sense? Because there are two things we might mean by ‘the average of the increments’.

Different Ways of Aggregating

The average, over the four years of high school, of the increments. Let’s write it out.

\[ \frac{1}{4}\sum_{x=9}^{12} \qty{ \mu(x) - \mu(x-1) } \]

That’s the slope of the secant from $\mu(8)$ to $\mu(12)$—the blue line.
- To see this, we can write out our four terms and see what cancels.
- It helps to count backward from $x=12$.

\[ \frac{ \qty{\mu(12) - \mu(11)} + \qty{ \mu(11) - \mu(10) } + \qty{ \mu(10) - \mu(9) } + \qty{ \mu(9) - \mu(8) } }{4} = \frac{\mu(12) - \mu(8)}{4} \]

As we go from $x=8$ to $x=12$, the ‘rise’ is $\mu(12)-\mu(8)$ and the ‘run’ is $12-8=4$; slope= rise / run.
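
The cancellation can also be confirmed numerically. This sketch uses simulated data; `increments`, `avg_over_years`, and `secant_slope` are hypothetical names.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

# Column means mu(8), ..., mu(12).
mu <- sapply(8:12, function(v) mean(y[x == v]))

# Increments mu(9) - mu(8), ..., mu(12) - mu(11), averaged over the four years.
increments     <- diff(mu)
avg_over_years <- mean(increments)

# Slope of the secant from mu(8) to mu(12): rise over run.
secant_slope <- (mu[5] - mu[1]) / (12 - 8)

print(all.equal(avg_over_years, secant_slope))
```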

Different Ways of Aggregating

The average, over people with ≥1 year of high school, of the increments affecting them in their last year.
- That’s something different. For the majority of these people, their last year was their fourth.
- So most of the terms we’re averaging are the increment from the third to fourth year.

\[ \frac{\sum_{j:x_j \in 9 \ldots 12} \qty{ \mu(x_j) - \mu(x_j-1) }}{\sum_{j:x_j \in 9 \ldots 12} 1} = \sum_{x \in 9 \ldots 12} p_x \ \qty{ \mu(x) - \mu(x-1) } \quad \text{ for } \quad p_x = \frac{m_x}{\sum_{x \in 9 \ldots 12}m_x}. \]

We can try the ‘cancellation trick’ we used before, but it doesn’t work very well.

\[\small{ \begin{aligned} & p_{12} \qty{ \mu(12) - \mu(11) } + p_{11} \qty{ \mu(11) - \mu(10) } + p_{10} \qty{ \mu(10) - \mu(9) } + p_{9} \qty{ \mu(9) - \mu(8) } \\ &= p_{12}\mu(12) + (p_{11}-p_{12}) \mu(11) + (p_{10}-p_{11}) \mu(10) + (p_9 - p_{10}) \mu(9) - p_9\mu(8) \end{aligned}} \]

We do get a linear combination of the column means, but it’s not an interpretable one.
It will, however, be useful in calculations next time when we’re talking about inference.
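
The three expressions for the average over people can be checked against each other. This is a sketch on simulated data; `direct`, `weighted`, and `linear_comb` are hypothetical names for the person-level average, the weighted sum of increments, and the linear combination of column means.

```r
set.seed(1)

# Hypothetical toy data standing in for the CPS sample (not the real data).
x <- sample(c(8:14, 16, 18, 20), size = 500, replace = TRUE)  # years of schooling
y <- pmax(0, 3000 * (x - 6) + rnorm(500, sd = 15000))         # income in dollars

# Column means mu(8), ..., mu(12), named by their x value.
mu <- sapply(8:12, function(v) mean(y[x == v]))
names(mu) <- 8:12

# Weights p_9, ..., p_12: each column's share of people with 9 to 12 years.
m <- sapply(9:12, function(v) sum(x == v))
p <- m / sum(m)

# 1. Average, over people, of the increment for their last year of high school.
xs     <- x[x %in% 9:12]
direct <- mean(mu[as.character(xs)] - mu[as.character(xs - 1)])

# 2. Weighted sum of the four increments.
weighted <- sum(p * diff(mu))

# 3. Linear combination of column means, with coefficients
#    -p_9, p_9 - p_10, p_10 - p_11, p_11 - p_12, p_12 on mu(8), ..., mu(12).
coefs       <- c(0, p) - c(p, 0)
linear_comb <- sum(coefs * mu)

print(c(direct = direct, weighted = weighted, linear_comb = linear_comb))
```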

Comparing Two Summaries

Year Increment	→ 9	→ 10	→ 11	→ 12
Mean Income Increment	2.1K	-11.4K	10.2K	10.6K
Proportion of People	0.03	0.02	0.10	0.84

Question
Strategy: Caricature
Answer

Compare the average over people to the average over years.
Is it bigger, smaller, or about the same?

To answer this without calculation, we’ll draw a caricature of our data.
4 in 5 people graduate. We’ll think about what’d happen if 5 in 5 did.
- We’ve said the average over years is the slope of the blue line.
- In this caricature, what does the average over people look like?

Year Increment	→ 9	→ 10	→ 11	→ 12
Mean Income Increment				10.6K
Proportion of People	0	0	0	1

The average over people is bigger.
In our caricature, it’s the increment in mean income from finishing year 12.
- That’s the biggest increment in mean income we see from any year of high school.
- So it’ll be bigger than the average of all 4.
In visual terms, we’re comparing the slope of the blue line to the slope of the last dashed line.

Technique	Over People	Over Years
Caricature	10.6K	2.9K
Calculation	9.79K	2.86K
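
Both averages can be recomputed from the rounded numbers in the table at the top of this slide. Because those entries are rounded (the proportions sum to 0.99, so the sketch renormalizes them), the results only approximately match the calculated values reported here. The variable names are assumptions made for illustration.

```r
# Values copied from the table of increments (in thousands of dollars).
increment <- c(2.1, -11.4, 10.2, 10.6)   # increments for years 9, 10, 11, 12
p         <- c(0.03, 0.02, 0.10, 0.84)   # proportion of people ending at each year

# Average over years: every year gets weight 1/4.
over_years <- mean(increment)

# Average over people: each year is weighted by the share of people ending there.
# Dividing by sum(p) renormalizes the rounded proportions.
over_people <- sum(p * increment) / sum(p)

# Caricature: everyone graduates, so all the weight is on the last increment.
p_caricature           <- c(0, 0, 0, 1)
over_people_caricature <- sum(p_caricature * increment)

print(c(over_years = over_years,
        over_people = over_people,
        over_people_caricature = over_people_caricature))
```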

Comparing Two More

Year Increment	→ 14	→ 16	→ 18	→ 20
Mean Income Increment	9.9K	24.0K	23.0K	23.8K
Proportion of People	0.18	0.55	0.21	0.06

Question
A Caricature
A Highly Refined Caricature
Answers

Let’s look at the same two types of average in a new context.
We’ll think about the average value of another degree.
That is, the average value of the increments …
- 12 → 14: from a high school diploma to a 2-year degree.
- 14 → 16: from a 2-year degree to a 4-year one.
- 16 → 18: from a 4-year degree to a master’s degree.
- 18 → 20: from a master’s degree to a doctorate.
Compare the average over people to the average over degrees.
Is it bigger, smaller, or about the same?

In our last example, our caricature exaggerated the proportions—it made them 0 or 1.
In this one, we can exaggerate the constancy of the increments.

Year Increment	→ 14	→ 16	→ 18	→ 20
Mean Income Increment	24K	24K	24K	24K
Proportion of People	0.18	0.55	0.21	0.06

If we’re worried that we’re not being precise enough, we can use a more refined caricature.
In this one, most — but not all — of the increments are the same.

Year Increment	→ 14	→ 16	→ 18	→ 20
Mean Income Increment	10K	24K	24K	24K
Proportion of People	0.18	0.55	0.21	0.06

The two summaries are pretty close to the same. In our coarse caricature, they are the same.
In our highly refined one, they’re still more or less the same.
- The average over degrees is $1/4 \times 10K + 3/4 \times 24K$.
- The average over people is $0.18 \times 10K + 0.82 \times 24K$.
The one income difference we do acknowledge — the one for 2-year degrees — gets washed out.
- Why? Because the column it’s in has almost the same weight in both averages: 1/4.
- Roughly 1 in 4 people have 2-year degrees and that’s 1 of 4 degrees we’re talking about.

Technique	Over People	Over Degrees
Coarse Caricature	24K	24K
Fine Caricature	21K	20K
Calculation	21.22K	20.18K
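
The rows of this table can be approximately reproduced from the rounded increments and proportions shown on this slide. The helper `two_averages` is a hypothetical name introduced for this sketch, and results computed from rounded inputs may differ slightly from the reported calculation.

```r
# Hypothetical helper: both averages of a set of increments,
# given the proportion of people at each degree level.
two_averages <- function(increment, p) {
  c(over_people  = sum(p * increment) / sum(p),
    over_degrees = mean(increment))
}

# Proportions of people with each degree (14, 16, 18, 20 years), from the tables.
p <- c(0.18, 0.55, 0.21, 0.06)

# Increments in thousands of dollars: caricatures and the rounded table values.
coarse <- two_averages(c(24, 24, 24, 24), p)
fine   <- two_averages(c(10, 24, 24, 24), p)
actual <- two_averages(c(9.9, 24.0, 23.0, 23.8), p)

print(rbind(coarse, fine, actual))
```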

The Impact of Variation in Slope

A Question
An Answer
Summary and Implications

In our high school example, our two summaries were pretty different.
In our degree example, they were very similar.
Why the difference?

If the things you’re averaging are constant, it doesn’t matter how you average them.
If they aren’t, it does matter how you average them. Or it might. The differences might wash out, too.
- Sometimes we happen to have the right weight in the right place, like in our refined caricature a minute ago.
- Sometimes things that are bigger and smaller than average wind up canceling out.
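
To see why, it may help to write the gap between the two averages directly. In the high school example, using the shorthand $\Delta(x) = \mu(x) - \mu(x-1)$ (introduced here for convenience), the average over people minus the average over years is

\[ \sum_{x \in 9 \ldots 12} p_x \ \Delta(x) - \frac{1}{4}\sum_{x=9}^{12} \Delta(x) = \sum_{x \in 9 \ldots 12} \qty(p_x - \frac{1}{4}) \Delta(x). \]

If every increment is the same, say $\Delta(x)=c$, the right side is $c \sum_x (p_x - 1/4) = c \ (1 - 1) = 0$. It is also zero when every weight $p_x$ equals $1/4$. Otherwise, it is zero only when the positive and negative terms happen to cancel.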

Meaningfully different estimation targets often, but not always, have roughly the same numerical value.
This has created a bit of a cultural problem in the sciences.
- Many people act as if different estimation targets were the same.
- They use language and notation that obscures the distinction.
And people wind up choosing their estimation target when they think they’re only choosing an estimator.
- This leads to confusion when two people estimate different targets but don’t know it.
- And to some hard-to-make-sense-of estimation targets being very common.
Next class, we’ll start to see why people make the decisions they do.
- We’ll talk about interval estimates for the targets we’ve discussed today.
- We’ll see that some targets are easier to estimate than others.
  - i.e. we’ll get narrower intervals for some targets than for others using the same data.

Exercises

Storytelling, Visualization, and Estimation Targets

Three Stories About High School

Story A

Story B

Story C

Exercise
Solution

Match each story to a plot above. Then, for each story, approximately calculate our two summaries.

All that matters is the diploma.
All that matters is 10th grade algebra.
They teach the most important stuff first.

Story	1	2	3
Plot	B	A	C
Average over Years	2.95K	2.95K	2.95K
Average over People	10K	185	485

Match each story to a plot above. Then, for each story, approximately calculate our two summaries.

All that matters is the diploma.
All that matters is 10th grade algebra.
They teach the most important stuff first.

Story	1	2	3
Plot	B	A	C
Average over Years	2.9K	2.9K	2.9K
Average over People	9.6K	200.0	500.0

Three Stories About Degrees

Story A

Story B

Story C

Exercise
Solution

Match each story to a plot above. Then, for each story, approximately calculate our two summaries.

Graduate degrees don’t pay.
Doctorates actually hurt your earning potential.
4-year degrees aren’t worth it anymore.

Story	1	2	3
Plot	A	C	B
Average over Years	8.31K	8.31K	20.01K
Average over People	17K	19K	16K

Match each story to a plot above. Then, for each story, approximately calculate our two summaries.

Graduate degrees don’t pay.
Doctorates actually hurt your earning potential.
4-year degrees aren’t worth it anymore.

Story	1	2	3
Plot	A	C	B
Average over Years	8K	8K	20K
Average over People	17K	19K	16K
