Lecture 7

Summarizing Trends involving Many Groups

Today

  • We’re going to look at the relationship between income and education.
  • We’ve done that before. When we did, we thought of education in dichotomous terms.
    • We had some people with 4-year degrees — the green dots.
    • We had some people without them — the red dots.
  • Today, we’ll be a bit more granular. We’ll think of education in terms of years of schooling.
    • 8+ years = finished middle school
    • 12+ years = high school diploma
    • 16+ years = 4-year college degree

Our Sample

  • As before, we’ll work with data from the 2022 Current Population Survey.
  • We’ll look at California residents age 25-35 with at least an 8th-grade education.
  • This sample includes 2271 people.
    • To visualize it, I’ve marked each one on a map.
    • To display some of the information we have, I’ve made a table.
  • I’ve used color to emphasize the dichotomous view of education we’ve been using.

income education age county
1 $55k 13 35 orange
2 $25k 13 27 LA
3 $44k 16 34 san joaquin
4 $22k 14 34 orange
5 $0k 16 31 san diego
6 $105k 16 27 LA
7 $1k 16 25 LA
8 $21k 14 30 unknown
2270 $85k 16 31 orange
2271 $150k 16 32 stanislaus

A Reminder about Visualization

[Figure: two scatter plots of income (y-axis) against years of schooling (x-axis). Panels: with jitter, without jitter.]
  • Our primary visualization will be a scatter plot. It shows a dot for each person in our sample.
  • The person’s income is on the y-axis and their education on the x-axis.
  • We add a bit of fuzz to the displayed x-coordinate to space the dots out.
    • But not so much that it’s unclear what the real value of \(X\) is.
    • This is called ‘jittering’. Sometimes a plot like this is called a ‘jitter plot’.
  • As before, we can use color to highlight the categories of people with and without 4-year degrees.
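As a rough illustration of jittering, here is a minimal sketch. It uses simulated years of schooling, not the CPS sample, and the names `x.shown` and the jitter width of 0.3 are assumptions made for this example.

```r
set.seed(1)   # toy data for illustration; not the CPS sample
n = 200
x = sample(c(8:14, 16, 18, 20), n, replace = TRUE)   # years of schooling

# Jitter: shift each displayed x-coordinate by a small uniform amount.
# Keeping the shift below 0.5 in size means rounding recovers the real x.
x.shown = x + runif(n, min = -0.3, max = 0.3)

head(cbind(x, x.shown))
```

Plotting income against `x.shown` instead of `x` spreads the dots out within each column while leaving no doubt about which column a dot belongs to.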

What We Want to Know

Population

income education county
1 $22k 18 unknown
2 $0k 16 solano
3 $98k 16 LA
4 $25k 12 tulare
5 $19k 14 san luis obispo
5677499 $11k 10 alameda
5677500 $116k 18 unknown

Sample

income education county
1 $55k 13 orange
2 $25k 13 LA
3 $44k 16 san joaquin
4 $22k 14 orange
5 $0k 16 san diego
2270 $85k 16 orange
2271 $150k 16 stanislaus

  • We’re not particularly interested in our sample itself, i.e., Californians age 25-35 who responded to the survey.
  • We’re interested in the population from which it’s sampled, i.e. all Californians age 25-35.
  • That’s not new. We’ve been doing this since the semester started. And we have a recipe for doing it.
    1. Choosing a target. We think about how we would summarize the population if we had surveyed everyone.
    2. Point estimation. However we would summarize the population, we do summarize our sample.
    3. Interval estimation. We work out what our point estimator’s sampling distribution tells us about our target.

What is New

  • Now that we’re taking a more granular view of education, there’s a lot more to the first step.
  • Today, all we’ll do is think about what we might want to know. We’ll talk as if we had surveyed everyone.
  • Next time, we’ll talk about point and interval estimation. It’ll be familiar, but there’ll be some new twists.

The population we’re looking at is made up. It’s just an illustration. We don’t actually have all this information.

Summarization

  • Looking at the dots alone, we can get rough answers to some simple questions.
  • We can see that people with four-year degrees tend to earn more than people without them.
  • But it’s hard to be precise just eyeballing things. It helps to look at numerical summaries.
  • We can, e.g., overlay the mean income \(\pm\) one standard deviation for each group. Do it!
  • That’s shown by the dot’s position and its ‘arms’ on the y-axis.
  • Its position on the x-axis shows the mean years of schooling within the group.
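The summaries described above could be computed along these lines. This is a sketch on simulated data; the names `has.degree` and `summaries` are made up for the example, and note that R’s `sd()` divides by n-1, which may differ slightly from the convention used elsewhere in the course.

```r
set.seed(2)   # toy data for illustration; not the CPS sample
n = 500
x = sample(c(12, 13, 14, 16, 18), n, replace = TRUE)      # years of schooling
y = pmax(0, rnorm(n, mean = 4000 * (x - 8), sd = 15000))  # made-up incomes

has.degree = x >= 16   # the dichotomous view of education

# One row per group: mean schooling (the dot's x-position),
# mean income (its y-position), and the sd of income (its 'arms').
summaries = data.frame(
  group  = c("no 4-year degree", "4-year degree"),
  x.mean = c(mean(x[!has.degree]), mean(x[has.degree])),
  y.mean = c(mean(y[!has.degree]), mean(y[has.degree])),
  y.sd   = c(sd(y[!has.degree]),   sd(y[has.degree]))
)
summaries$lower = summaries$y.mean - summaries$y.sd   # bottom of the arm
summaries$upper = summaries$y.mean + summaries$y.sd   # top of the arm
summaries
```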

The Dichotomous Version

  • It’s worth visualizing the information lost by dichotomizing education.
  • To do this, we can look at a similar plot without that information. This one.
  • In it, I’ve replaced the x-coordinates with random values that tell us nothing beyond group membership.
  • People without 4-year degrees get random values uniformly distributed between 8 and 14.
    • That’s the range of years of schooling in that group.
  • People with 4-year degrees get random values between 16 and 20.
    • Same deal.
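Here is one way the ‘forgetting’ could be done, as a sketch on simulated data. The name `x.forgotten` is an assumption for this example; the ranges 8 to 14 and 16 to 20 are the ones given above.

```r
set.seed(3)   # toy data for illustration; not the CPS sample
n = 300
x = sample(c(8:14, 16, 18, 20), n, replace = TRUE)   # years of schooling
has.degree = x >= 16

# Replace each x with a random value that depends only on group membership.
x.forgotten = ifelse(has.degree,
                     runif(n, min = 16, max = 20),   # people with 4-year degrees
                     runif(n, min = 8,  max = 14))   # people without them

# Group membership survives; everything else about x is gone.
table(recovered = x.forgotten >= 16, actual = has.degree)
```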

Column Comparisons

  • Let’s think about what we can do with the additional information we have now.
  • One simple option is to work with the finest grouping we can: the columns in our scatter plot.
  • We can, e.g., look at the income mean and standard deviation for people with 8, 9, 10, etc. years of schooling.
  • Or compare any pair of those groups. It’s natural to look at groups that are in some sense adjacent.
    • 18 vs. 16: the value of a master’s degree (only) vs. a 4-year one (only).
    • 16 vs. 14: the value of a 4-year degree (only) vs. a 2-year degree (only).
    • 14 vs. 12: the value of a 2-year degree (only) vs. a high school degree (only).
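A sketch of these column summaries and adjacent comparisons, on simulated data. The helper `compare` and the names `column.mean` and `column.sd` are made up for this example.

```r
set.seed(4)   # toy data for illustration; not the CPS sample
n = 1000
x = sample(c(8:14, 16, 18, 20), n, replace = TRUE)        # years of schooling
y = pmax(0, rnorm(n, mean = 4000 * (x - 6), sd = 15000))  # made-up incomes

# Mean and sd of income within each column of the scatter plot.
# The results are named by the value of x, e.g. column.mean["16"].
column.mean = tapply(y, x, mean)
column.sd   = tapply(y, x, sd)

# Difference in mean income between the columns x = a and x = b.
compare = function(a, b) {
  unname(column.mean[as.character(a)] - column.mean[as.character(b)])
}

c("18 vs 16" = compare(18, 16),
  "16 vs 14" = compare(16, 14),
  "14 vs 12" = compare(14, 12))
```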

Column Comparisons as ‘Forgetting’

  • This is essentially the same as what we did when we dichotomized education.
    • We just ‘forget’ the other groups. We don’t look at them.
    • Everything we proved before the midterm still applies: unbiasedness, variance formulas, etc.
    • When we did the math on comparing two groups, we didn’t assume they were the only two groups.
  • That’s great if we know exactly what we’re interested in—and it’s a comparison like this.
  • But it’s not so great if we’re a bit less focused—if we want to talk about the whole population.
    • That’d mean we might have to think about/report a lot of different numbers.
    • It can get to be too much. We’re not summarizing enough.

Coarsening

  • One remedy is to go back to working with coarser groupings.
  • We can make up our own by aggregating—combining—the columns we have.
  • Here, for example, is a breakdown in 4 groups.
    • < 14 years: no college degree.
    • =14 years: 2-year degree.
    • =16 years: 4-year degree.
    • > 16 years: graduate degree.
  • This is more information than a two-group comparison but perhaps little enough to be manageable.
    • We can visualize it all in a plot like the one above.
    • Or report the set of group-specific summaries in a table like the one below.
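A table of group-specific summaries could be built roughly like this. It is a sketch on simulated data, so its numbers will not match the lecture’s; the names `group` and `summary.table` are assumptions for the example.

```r
set.seed(5)   # toy data for illustration; not the CPS sample
n = 1000
x = sample(c(8:14, 16, 18, 20), n, replace = TRUE)        # years of schooling
y = pmax(0, rnorm(n, mean = 4000 * (x - 6), sd = 15000))  # made-up incomes

# Breaks chosen so the intervals (-Inf,13], (13,14], (14,16], (16,Inf)
# line up with the four groups: < 14, = 14, = 16, > 16.
group = cut(x, breaks = c(-Inf, 13, 14, 16, Inf),
            labels = c("< 14", "14", "16", "> 16"))

summary.table = data.frame(
  education = levels(group),
  mean = as.vector(tapply(y, group, mean)),
  sd   = as.vector(tapply(y, group, sd)),
  N    = as.vector(table(group))
)
summary.table
```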



education mean sd N
< 12 19K 22K 270K
14 33K 28K 548K
16 58K 64K 2M
> 16 84K 82K 770K

Coarsening is Always Happening

library(dplyr)  # provides case_match() (dplyr 1.1.0 or later)

education=case_match(cps.data$a_hga,
                 0  ~ 0,   # child
                 31 ~ 0,   # < grade 1
                 32 ~ 4,   #  grade 1-4
                 33 ~ 6,   #  grade 5-6
                 34 ~ 8,   #  grade 7-8 
                 35 ~ 9,   #  grade 9
                 36 ~ 10,  #  grade 10
                 37 ~ 11,  #  grade 11
                 38 ~ 11,  #  grade 12 no diploma
                 39 ~ 12,  #  high school grad
                 40 ~ 13,  #  some college
                 41 ~ 14,  #  associate degree (vocational)
                 42 ~ 14,  #  associate degree (academic)
                 43 ~ 16,  #  bachelors degree
                 44 ~ 18,  #  masters degree
                 45 ~ 20,  #  professional school degree
                 46 ~ 20)  #  doctorate

  • Often, the data we have is already coarsened to some degree.
    • Our data does not, for example, distinguish between people with 7 and 8 years of schooling.
    • Similarly, it doesn’t tell us years for people with graduate degrees. We’ve made that up.
  • And to fit it into our analysis, we often have to coarsen it further.
    • To fit education into our ‘years of schooling’ framework, we’ve grouped together people with …
      • vocational and academic associate degrees. (12+2=14 years)
      • professional and doctoral degrees. (16 + 4ish ≈ 20 years)
    • Our choices about some of these made-up numbers create the gaps we see at 15, 17, and 19 years.
  • Keeping all this in mind can be overwhelming.
  • It’s often a good idea to pick an abstraction, like years of schooling, and go with it.
  • But it’s important to come back to this at some point to make sure your conclusions mean what you think they do.
  • Suppose you see a bigger income jump between ‘8’ and 9 years of education than between 9 and 10.
    • That might be because the group with ‘8’ years actually includes people with 7 and 8 years.
    • So many of the people with ‘8’ years have 2 fewer years of schooling than the people with 9.

Dichotomization

  • Often, when we want to report a numeric summary, we do resort to dichotomization.
  • But having the more granular information allows us to choose the groups we want.
  • There are many meaningful choices.
    • >16: people with graduate degrees.
    • ≤12: people who haven’t been to college.

A Coarsened Comparison: >16 vs. 16

Describe the comparison we’re visualizing, the difference in the means of the green dots and red dots, in …

  1. Words
  2. Mathematical notation
  3. R code

People with graduate degrees vs. people with 4-year degrees (only).

  • Option 1. A compact, readable version \[ \frac{\sum\limits_{j:x_j > 16} y_j}{\sum_{j:x_j > 16} 1} - \frac{\sum\limits_{j:x_j = 16} y_j}{\sum_{j:x_j = 16} 1} \]

  • Option 2. A verbose version that’s easier to use in probability calculations. \[ \frac{\sum\limits_{j=1}^n 1_{\{18,20\}}(x_j) y_j}{\sum\limits_{j=1}^n 1_{\{18, 20\}}(x_j)} - \frac{\sum\limits_{j=1}^n1_{\{16\}}(x_j)y_j}{\sum\limits_{j=1}^n 1_{\{16\}}(x_j)} \ \ \text{ where } \ \ 1_{S}(x) = \begin{cases} 1 & \text{ if } x \in S \\ 0 & \text{ otherwise} \end{cases}. \]

  • Option 1. A compact, readable version
mean(y[x > 16]) - mean(y[x==16])
[1] 25930.81
  • Option 2. A verbose version that’s easier to combine with more abstract code
groups = list(a = c(18,20), b=16)
mean(y[x %in% groups$a]) - mean(y[x %in% groups$b]) 
[1] 25930.81

A Coarsened Comparison: 14 vs ≤12

Describe the comparison we’re visualizing, the difference in the means of the green dots and red dots, in …

  1. Words
  2. Mathematical notation
  3. R code

People with 2-year degrees (only) vs. people who haven’t completed a year of college.

  • A compact, readable version \[ \frac{\sum\limits_{j:x_j = 14} y_j}{\sum\limits_{j:x_j = 14} 1} - \frac{\sum\limits_{j:x_j \le 12} y_j}{\sum\limits_{j:x_j \le 12} 1} \]

  • A verbose version that’s easier to use in probability calculations \[ \frac{\sum\limits_{j=1}^n 1_{\{14\}}(x_j) y_j}{\sum\limits_{j=1}^n 1_{\{14\}}(x_j)} - \frac{\sum\limits_{j=1}^n1_{\{8 \ldots 12\}}(x_j)y_j}{\sum\limits_{j=1}^n 1_{\{8 \ldots 12\}}(x_j)} \ \ \text{ where } \ \ 1_S(x) = \begin{cases} 1 & \text{ if } x \in S \\ 0 & \text{ otherwise} \end{cases}. \]

  • Option 1. A compact, readable version
mean(y[x == 14]) - mean(y[x<=12])
[1] 5338.928
  • Option 2. A verbose version that’s easier to combine with more abstract code
groups = list(a = c(14), b=8:12)
mean(y[x %in% groups$a]) - mean(y[x %in% groups$b]) 
[1] 5338.928
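To see why the verbose version combines well with more abstract code, here is a sketch that wraps it in a function. The function name `diff_in_means` and the tiny data set are made up for this example.

```r
# Difference in mean y between people whose x is in groups$a
# and people whose x is in groups$b.
diff_in_means = function(x, y, groups) {
  mean(y[x %in% groups$a]) - mean(y[x %in% groups$b])
}

# Tiny made-up data set: years of schooling and income in $K.
x = c(12, 12, 14, 14, 16, 16, 18, 20)
y = c(20, 30, 30, 40, 50, 70, 80, 100)

grad.vs.college = diff_in_means(x, y, list(a = c(18, 20), b = 16))
aa.vs.no.college = diff_in_means(x, y, list(a = 14, b = 8:12))
c(grad.vs.college, aa.vs.no.college)
```

Both comparisons on this slide and the previous one are then the same function called with a different `groups` argument.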

From Column Means to Coarsened Means

  • We can express the means in our aggregate groups in terms of the means in our columns.
  • Let’s define some notation describing the dots in each column to make this easier.
    • \(m_x\) will be the number of dots, i.e. the number of people with \(x_j=x\).
    • \(\mu(x)\) will be the mean height of the dots, i.e. the mean income of people with \(x_j=x\).

\[ \mu(x) = \frac{1}{m_x}\sum_{j:x_j=x} y_j \quad \text{ where } \quad m_x = \sum_{j:x_j=x} 1 \]

What, on the plot, corresponds to \(\mu(14)\)?

The black dot in the \(x=14\) column.

What, on the plot, corresponds to \(m_{12}\)?

The number of red dots in the column where \(x=12\).

What, in terms of \(\mu(x)\), is the mean outcome among people in our sample who did not attend college?

Strategy.

  1. Where we see a sum over the people in the aggregate group, we rewrite it as a sum of sums.
    A sum over columns of sums over the people in them.
  2. We express each column’s sum in terms of its mean and its size.
  3. We make sense of the result.

\[ \begin{aligned} \frac{\sum\limits_{j:x_j \in 8 \ldots 12} y_j}{\sum\limits_{j:x_j \in 8 \ldots 12} 1} &= \frac{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} y_j}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} 1} = \frac{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}} \textcolor[RGB]{239,71,111}{m_x} \times \textcolor[RGB]{17,138,178}{\frac{1}{m_x}}\textcolor[RGB]{17,138,178}{\sum\limits_{j:x_j=x}} y_j}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}}\textcolor[RGB]{17,138,178}{m_x}} \\ &= \textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}} p_x \times \textcolor[RGB]{17,138,178}{\mu(x)} \quad \text{ for } \quad p_x = \frac{\textcolor[RGB]{239,71,111}{m_x}}{\textcolor[RGB]{239,71,111}{\sum\limits_{x \in 8 \ldots 12}\textcolor[RGB]{17,138,178}{m_x}}}. \end{aligned} \]

  • It’s a weighted average of the column means \(\mu(x)\).
  • The weight \(p_x\) is the proportion of the people in the aggregate group that are in that column.
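We can check this identity numerically. The sketch below uses simulated data, and the variable names (`in.group`, `mu`, `m`, `p`) are chosen for this example to mirror the notation above.

```r
set.seed(6)   # toy data for illustration; not the CPS sample
n = 1000
x = sample(8:13, n, replace = TRUE)                       # years of schooling
y = pmax(0, rnorm(n, mean = 3000 * (x - 4), sd = 10000))  # made-up incomes

in.group = x %in% 8:12   # people who did not attend college

mu = as.vector(tapply(y[in.group], x[in.group], mean))  # column means mu(x), x = 8..12
m  = as.vector(table(x[in.group]))                      # column sizes m_x
p  = m / sum(m)                                         # weights p_x

direct   = mean(y[in.group])   # mean over the people in the aggregate group
weighted = sum(p * mu)         # weighted average of the column means
all.equal(direct, weighted)
```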

Aggregating Comparisons

  • Another option—rather than comparing aggregates—is aggregating comparisons.
  • We can summarize the ‘value’ of each year of high school by the associated increment in mean income.
    • Visually, that’s the slope of the dashed black line leading up to it.
  • To reduce these to a single number, we can take the average of the increments.
    • Visually, that’s the slope of the secant—the blue one. In a sense.
    • If that’s not obvious to you, don’t worry. We’ll come back to it in a minute.
  • Why in a sense? Because there are two things we might mean by ‘the average of the increments’.

Different Ways of Aggregating

  1. The average, over the four years of high school, of the increments. Let’s write it out.

\[ \frac{1}{4}\sum_{x=9}^{12} \qty{ \mu(x) - \mu(x-1) } \]

  • That’s the slope of the secant from \(\mu(8)\) to \(\mu(12)\), the blue line.
    • To see this, we can write out our four terms and see what cancels.
    • It helps to count backward from \(x=12\).

\[ \frac{ \qty{\mu(12) - \mu(11)} + \qty{ \mu(11) - \mu(10) } + \qty{ \mu(10) - \mu(9) } + \qty{ \mu(9) - \mu(8) } }{4} = \frac{\mu(12) - \mu(8)}{4} \]

  • As we go from \(x=8\) to \(x=12\), the ‘rise’ is \(\mu(12)-\mu(8)\) and the ‘run’ is \(12-8=4\); slope= rise / run.
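The cancellation is easy to check in code. The column means below are made-up numbers (in $K) chosen only to illustrate the arithmetic.

```r
# Hypothetical column means mu(8), ..., mu(12), in $K.
mu = c("8" = 20, "9" = 22, "10" = 11, "11" = 21, "12" = 32)

increments = diff(mu)   # mu(x) - mu(x-1) for x = 9, ..., 12

average.increment = mean(increments)                       # average over the four years
secant.slope      = unname((mu["12"] - mu["8"]) / (12 - 8))  # rise / run

c(average.increment, secant.slope)
```

The intermediate column means affect the individual increments but drop out of their average.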

Different Ways of Aggregating

  2. The average, over people with ≥1 year of high school, of the increments affecting them in their last year.
    • That’s something different. For the majority of these people, their last year was their fourth.
    • So most of the terms we’re averaging are the increment from the third to fourth year.

\[ \frac{\sum_{j:x_j \in 9 \ldots 12} \qty{ \mu(x_j) - \mu(x_j-1) }}{\sum_{j:x_j \in 9 \ldots 12} 1} = \sum_{x \in 9 \ldots 12} p_x \ \qty{ \mu(x) - \mu(x-1) } \quad \text{ for } \quad p_x = \frac{m_x}{\sum_{x \in 9 \ldots 12}m_x}. \]

We can try the ‘cancellation trick’ we used before, but it doesn’t work very well.

\[\small{ \begin{aligned} & p_{12} \qty{ \mu(12) - \mu(11) } + p_{11} \qty{ \mu(11) - \mu(10) } + p_{10} \qty{ \mu(10) - \mu(9) } + p_{9} \qty{ \mu(9) - \mu(8) } \\ &= p_{12}\mu(12) + (p_{11}-p_{12}) \mu(11) + (p_{10}-p_{11}) \mu(10) + (p_9 - p_{10}) \mu(9) - p_9\mu(8) \end{aligned}} \]

  • We do get a linear combination of the column means, but it’s not an interpretable one.
  • It will, however, be useful in calculations next time when we’re talking about inference.
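Here is a sketch of the average over people and of its rewriting as a linear combination of column means. The column means and column sizes are made-up numbers, and `coef` is a name introduced for this example.

```r
# Hypothetical column means mu(8), ..., mu(12) in $K
# and column sizes m_9, ..., m_12.
mu = c(20, 22, 11, 21, 32)
m  = c(30, 20, 100, 850)
p  = m / sum(m)              # p_9, ..., p_12

increments = diff(mu)        # mu(x) - mu(x-1) for x = 9, ..., 12

# Average over people: each increment is weighted by the
# proportion of people whose last year was that year.
over.people = sum(p * increments)

# The same number as a linear combination of mu(8), ..., mu(12), with
# coefficients -p_9, p_9 - p_10, p_10 - p_11, p_11 - p_12, p_12.
coef = c(0, p) - c(p, 0)
linear.combination = sum(coef * mu)

c(over.people, linear.combination)
```

The coefficients sum to zero, as they must for any combination of differences.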

Comparing Two Summaries

Year Increment → 9 → 10 → 11 → 12
Mean Income Increment 2.1K -11.4K 10.2K 10.6K
Proportion of People 0.03 0.02 0.10 0.84
  • Compare the average over people to the average over years.
  • Is it bigger, smaller, or about the same?
  • To answer this without calculation, we’ll draw a caricature of our data.
  • 4 in 5 people graduate. We’ll think about what’d happen if 5 in 5 did.
    • We’ve said the average over years is the slope of the blue line.
    • In this caricature, what does the average over people look like?

Year Increment → 9 → 10 → 11 → 12
Mean Income Increment 10.6K
Proportion of People 0 0 0 1
  • The average over people is bigger.
  • In our caricature, it’s the increment in mean income from finishing year 12.
    • That’s the biggest increment in mean income we see from any year of high school.
    • So it’ll be bigger than the average of all 4.
  • In visual terms, we’re comparing the slope of the blue line to the slope of the last dashed line.
Technique Over People Over Years
Caricature 10.6K 2.9K
Calculation 9.79K 2.86K
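Both averages can be recomputed from the rounded numbers in the first table. Because those inputs are rounded (the proportions sum to 0.99), the results should land close to, but not exactly on, the ‘Calculation’ row. The variable names are made up for this example.

```r
increment  = c(2.1, -11.4, 10.2, 10.6)   # in $K, for years 9, 10, 11, 12 (rounded)
proportion = c(0.03, 0.02, 0.10, 0.84)   # rounded proportions of people

over.years  = mean(increment)                        # equal weight on each year
over.people = weighted.mean(increment, proportion)   # weights are renormalized to sum to 1

# The caricature: everyone finishes year 12.
caricature.over.people = weighted.mean(increment, c(0, 0, 0, 1))

c(over.years = over.years,
  over.people = over.people,
  caricature.over.people = caricature.over.people)
```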

Comparing Two More

Year Increment → 14 → 16 → 18 → 20
Mean Income Increment 9.9K 24.0K 23.0K 23.8K
Proportion of People 0.18 0.55 0.21 0.06
  • Let’s look at the same two types of average in a new context.
  • We’ll think about the average value of another degree.
  • That is, the average value of the increments …
    • 12 → 14: from a high school diploma to a 2-year degree.
    • 14 → 16: from a 2-year degree to a 4-year one.
    • 16 → 18: from a 4-year degree to a master’s degree.
    • 18 → 20: from a master’s degree to a doctorate.
  • Compare the average over people to the average over degrees.
  • Is it bigger, smaller, or about the same?
  • In our last example, our caricature exaggerated the proportions—it made them 0 or 1.
  • In this one, we can exaggerate the constancy of the increments.
Year Increment → 14 → 16 → 18 → 20
Mean Income Increment 24K 24K 24K 24K
Proportion of People 0.18 0.55 0.21 0.06
  • If we’re worried that we’re not being precise enough, we can use a more refined caricature.
  • In this one, most — but not all — of the increments are the same.
Year Increment → 14 → 16 → 18 → 20
Mean Income Increment 10K 24K 24K 24K
Proportion of People 0.18 0.55 0.21 0.06
  • The two summaries are pretty close to the same. In our coarse caricature, they are the same.
  • In our highly refined one, they’re still more or less the same.
    • The average over degrees is \(1/4 \times 10K + 3/4 \times 24K\).
    • The average over people is \(0.18 \times 10K + 0.82 \times 24K\).
  • The one income difference we do acknowledge — the one for 2-year degrees — gets washed out.
    • Why? Because the column it’s in has almost the same weight in both averages: 1/4.
    • Roughly 1 in 4 people have 2-year degrees and that’s 1 of 4 degrees we’re talking about.
Technique Over People Over Degrees
Coarse Caricature 24K 24K
Fine Caricature 21K 20K
Calculation 21.22K 20.18K
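The same check for the degree example, again working from the rounded numbers in the tables above, so small discrepancies from the ‘Calculation’ row are expected. The helper `both` is a name introduced for this example.

```r
proportion = c(0.18, 0.55, 0.21, 0.06)   # people whose highest degree is at 14, 16, 18, 20 years

# Returns the average over degrees and the average over people for a set of increments.
both = function(increment) {
  c(over.degrees = mean(increment),
    over.people  = weighted.mean(increment, proportion))
}

actual = both(c(9.9, 24.0, 23.0, 23.8))   # rounded increments from the table, in $K
coarse = both(c(24, 24, 24, 24))          # coarse caricature: constant increments
fine   = both(c(10, 24, 24, 24))          # refined caricature: one increment differs

rbind(coarse, fine, actual)
```

With constant increments the weights cannot matter, which is why the coarse caricature gives the same number twice.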

The Impact of Variation in Slope

  • In our high school example, our two summaries were pretty different.
  • In our degree example, they were very similar.
  • Why the difference?
  • If the things you’re averaging are constant, it doesn’t matter how you average them.
  • If they aren’t, it does matter how you average them. Or it might. The differences might wash out, too.
    • Sometimes we happen to have the right weight in the right place, like in our refined caricature a minute ago.
    • Sometimes things that are bigger and smaller than average wind up canceling out.
  • Meaningfully different estimation targets often, but not always, have roughly the same numerical value.
  • This has created a bit of a cultural problem in the sciences.
    • Many people act as if different estimation targets were the same.
    • They use language and notation that obscures the distinction.
  • And people wind up choosing their estimation target when they think they’re only choosing an estimator.
    • This leads to confusion when two people estimate different targets but don’t know it.
    • And to some hard-to-make-sense-of estimation targets being very common.
  • Next class, we’ll start to see why people make the decisions they do.
    • We’ll talk about interval estimates for the targets we’ve discussed today.
    • We’ll see that some targets are easier to estimate than others.
      • i.e. we’ll get narrower intervals for some targets than for others using the same data.

Exercises

Storytelling, Visualization, and Estimation Targets

Three Stories About High School

Story A
Story B
Story C

Match each story to a plot above. Then, for each story, approximately calculate our two summaries.
 

  1. All that matters is the diploma.
  2. All that matters is 10th grade algebra.
  3. They teach the most important stuff first.
Story 1 2 3
Plot
Average over Years
Average over People

Story 1 2 3
Plot B A C
Average over Years 2.9K 2.9K 2.9K
Average over People 9.6K 200.0 500.0

Three Stories About Degrees

Story A
Story B
Story C

Match each story to a plot above. Then, for each story, approximately calculate our two summaries.
 

  1. Graduate degrees don’t pay.
  2. Doctorates actually hurt your earning potential.
  3. 4-year degrees aren’t worth it anymore.
Story 1 2 3
Plot
Average over Years
Average over People

Story 1 2 3
Plot A C B
Average over Years 8K 8K 20K
Average over People 17K 19K 16K