This is a little harder to understand than either the horizontal or saturated case.
In intuitive terms, you can think of what’s happening as follows.
We choose the shape of the two curves to follow the trend we see when we ignore color.
On the right, we’re following the trend of the green folks. That’s who is on the right.
On the left, we’re following the trend of the red folks. That’s who is on the left.
We choose the heights, shifting that shape up or down, to get the within-group means right. \[
0 = \sum_{i=1}^n \gamma(W_i,X_i) \ \qty{Y_i - \hat\mu(W_i,X_i)} \ m(W_i,X_i) \qfor \textcolor[RGB]{248,118,109}{m(w,x)=1_{w=0}} \qand \textcolor[RGB]{0,191,196}{m(w,x)=1_{w=1}}.
\]
This means that—unless the trend is the same for both groups—we’re not really fitting the red folks on the right.
Unless we weight, which emphasizes the red folks on the right when we choose both shape and height.
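Here's a small numerical check of that orthogonality condition, with made-up data and made-up weights \(\gamma\): after the weighted least squares fit, the \(\gamma\)-weighted residuals sum to zero within each color group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
w = rng.integers(0, 2, size=n)                  # group: 0 = red, 1 = green
x = rng.normal(size=n) + w                      # covariate, shifted by group
y = 1.0 + 2.0 * w + 0.5 * x + rng.normal(scale=0.5, size=n)
gamma = np.where(w == 1, 1.0, np.exp(x))        # hypothetical weights gamma(W_i, X_i)

# Additive model m(w, x) = a_w + b x: one height per group, a shared slope for the shape.
X = np.column_stack([w == 0, w == 1, x]).astype(float)

# Weighted least squares: minimize sum_i gamma_i (Y_i - m(W_i, X_i))^2.
sw = np.sqrt(gamma)
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
resid = y - X @ coef

for g in (0, 1):                                # both sums are ~ 0
    print(g, np.sum(gamma[w == g] * resid[w == g]))
```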
When we have lots of bins, our predictions are all over the place. It ‘feels noisy’.
What’s happening is that this noise is averaging out. Let’s start somewhere simple.
Suppose we pair up all our observations, average the pairs, and average the averages.
We get the same thing as if we just averaged all the observations without pairing.
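For example, pairing the observations 1, 3, 5, 7 as \((1,3)\) and \((5,7)\): \[
\frac{1}{2}\qty(\frac{1+3}{2} + \frac{5+7}{2}) = \frac{2+6}{2} = 4 = \frac{1+3+5+7}{4}.
\]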
That’s essentially what’s going on with the easy part of the ATT formula.
If we have \(1\) bin for each level of \(X\), we get the same thing as if we have 1 bin outright. \[
\sum_x \textcolor[RGB]{0,191,196}{P_{x \mid 1}} \times \textcolor[RGB]{0,191,196}{\hat\mu(1,x)}
= \sum_{x} \frac{\sum_{i:W_i=1,X_i=x} 1}{\sum_{i:W_i=1} 1} \times \frac{\sum_{i:W_i=1, X_i=x} Y_i}{\sum_{i:W_i=1, X_i=x} 1}
= \frac{\sum_{i:W_i=1} Y_i}{\sum_{i:W_i=1} 1}
\]
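Here's a quick numerical check of that identity, with made-up data in place of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
w = rng.integers(0, 2, size=n)
x = rng.integers(0, 5, size=n)                  # a discrete covariate with 5 levels
y = x + 2.0 * w + rng.normal(size=n)

w1 = w == 1
lhs = sum(
    np.mean(x[w1] == v) * y[w1][x[w1] == v].mean()   # P_{x|1} * mu_hat(1, x)
    for v in np.unique(x[w1])
)
print(lhs, y[w1].mean())                        # identical up to rounding
```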
Now let’s think about the hard part: the mixed-color one.
After coarsening to bins, we’re just using within-bin subsample means as our estimates \(\textcolor[RGB]{248,118,109}{\hat\mu(0,x)}\).
So what we really have is a linear combination of subsample means.
The purple lines show the ratio of bin-averaged densities from the variance formula: \(\textcolor{purple}{\{ \textcolor[RGB]{0,191,196}{\ldots} / \textcolor[RGB]{248,118,109}{\ldots}\}}\).
For a 3-piece model, we get the solid one. The sum in the variance is 1.49.
For a 6-piece model, we get the dashed one. The sum in the variance is 1.53.
For a 30-piece (saturated) model, we get the dotted one. The sum in the variance is 1.55.
The variance improvement we get by using a smaller model is pretty small.
When our densities are rougher, the inverse probability weights \(\textcolor[RGB]{160,32,240}{\gamma(1,x)}=\textcolor[RGB]{0,191,196}{P_{x\mid 1}}/\textcolor[RGB]{248,118,109}{P_{x\mid 0}}\) aren’t constant within segments.
That means we don’t ‘automatically’ get the benefits of weighting when we do least squares.
We’re paying for our variance reduction with a bias increase. Or the potential for one.
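Here's a rough sketch of where lines like those purple ones come from, using made-up densities rather than the ones in the plot: the bin-averaged ratio is constant within each piece, while the pointwise weights \(\gamma(1,x)\) are not.

```python
import numpy as np

# Made-up densities over 30 levels of x, standing in for the ones in the plot.
x = np.arange(30)
p_x1 = np.exp(-0.5 * ((x - 18) / 6) ** 2); p_x1 /= p_x1.sum()   # P_{x|1}, green
p_x0 = np.exp(-0.5 * ((x - 12) / 6) ** 2); p_x0 /= p_x0.sum()   # P_{x|0}, red

# Pointwise inverse probability weights gamma(1, x) = P_{x|1} / P_{x|0}.
gamma_pointwise = p_x1 / p_x0

# The 3-piece analogue: the ratio of bin-averaged densities, constant per bin.
bins = np.array_split(x, 3)
gamma_binned = np.concatenate(
    [np.full(len(b), p_x1[b].sum() / p_x0[b].sum()) for b in bins]
)

print(np.round(gamma_pointwise[:10], 2))        # varies within the first bin
print(np.round(gamma_binned[:10], 2))           # constant within the first bin
```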
Grids and Trees
Piecewise Constant Models in 2D
A 2D Example
Suppose we’re interested in the health impacts of a chemical leak.
What we’re looking at on the left is a map of the chemical concentration in the air.
Lighter colors on the left mean higher concentrations.
Those are our subpopulation means. But we can’t measure concentration everywhere all the time.
We’ll measure it at a set of randomly chosen locations. The density of those locations is shown on the right.
A 2D Example
Our sample is this set of measurements \((Z_i,X_i,Y_i)\).
\(Z_i\) = concentration at measurement location \(i\).
\(X_i\) = longitude at location \(i\).
\(Y_i\) = latitude at location \(i\).
A 2D Example
We’ll take them at random times, too. In the population we’re sampling from …
There are many observations \((z_j,x_j,y_j)\) at the same location \((x_j,y_j)\).
We’ll sample with replacement. The probability a sample is at a given location is shown on the right.
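Here's a minimal sketch of that sampling scheme; the grid size and the location probabilities are made up, standing in for the map on the right.

```python
import numpy as np

# Assumed: a 40 x 40 grid of locations and made-up sampling probabilities.
rng = np.random.default_rng(0)
nx = ny = 40
p_location = rng.dirichlet(np.ones(nx * ny))    # stand-in for the map on the right

# Draw 2000 measurement locations with replacement, then recover (x, y) indices.
idx = rng.choice(nx * ny, size=2000, replace=True, p=p_location)
x, y = idx % nx, idx // nx
```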
A 2D Example
There are 2000 observations in our sample.
We want to estimate \(\mu(x,y)=\mathop{\mathrm{E}}[Z_i \mid X_i=x, Y_i=y]\).
What we see on the right are the predictions \(\hat\mu(x,y)\) of the saturated model.
i.e., the mean of the observations at each location.
Fitting A Saturated Model
Here are the predictions of the saturated model.
i.e., the means of the observed concentrations at each location.
white = zero observations at that location.
It’s ok near the center but bad elsewhere, where we sample less.
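Here's a minimal sketch of that fit; the function and variable names are mine, not the course code. It just groups observations by location and averages.

```python
import numpy as np

def saturated_fit(z, x, y):
    """Mean of the observed concentrations at each sampled location.

    Locations with no observations simply don't appear (the white pixels)."""
    obs = {}
    for xi, yi, zi in zip(x, y, z):
        obs.setdefault((xi, yi), []).append(zi)
    return {loc: float(np.mean(vals)) for loc, vals in obs.items()}
```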
Grid Models
Here we see a ‘grid model’—a piecewise-constant model with square bins.
This is less messy than the saturated model.
But it’s not resolving detail very well.
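Here's a minimal sketch of a grid fit, assuming coordinates scaled to the unit square; the names are mine. Square bins take \(n_x = n_y\), and the anisotropic models below just take \(n_x \neq n_y\).

```python
import numpy as np

def grid_fit(z, x, y, nx, ny):
    """Piecewise-constant fit on an nx-by-ny grid over the unit square:
    predict the mean of z within each rectangular bin."""
    xb = np.minimum((np.asarray(x) * nx).astype(int), nx - 1)   # column index
    yb = np.minimum((np.asarray(y) * ny).astype(int), ny - 1)   # row index
    bins = {}
    for bx, by, zi in zip(xb, yb, z):
        bins.setdefault((bx, by), []).append(zi)
    return {b: float(np.mean(v)) for b, v in bins.items()}
```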
A Piecewise-Constant Model
Larger grid models—ones with smaller bins—resolve detail better, but they’re huge.
On the left, we have \(12 \times 12 = 144\) bins; on the right, \(16 \times 16 = 256\) bins.
We only have 2000 observations.
Anisotropic Grid Models
We can try using rectangles, rather than squares, as bins.
Both of these have \(32 \times 8 = 256\) bins.
But, fundamentally, there’s no right rectangle shape either.
We want smaller bins near the center of the data, where we have more observations and more variation in the outcomes.
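In terms of the grid_fit sketch above, one of these rectangular fits would look something like this (a hypothetical call, not the course code).

```python
# 32 bins along one axis, 8 along the other: 32 * 8 = 256 rectangular bins.
mu_hat = grid_fit(z, x, y, nx=32, ny=8)
```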
Anisotropic Grid Models
There’s no reason, statistically speaking, that our bins should be …
all the same shape
all the same size (number of locations)
made up of spatially contiguous locations
Any set of pixels can be a bin.
But we need some way to search for good partitions of the pixels into bins.
Trying all possible partitions is computationally infeasible.
And it runs into a major statistical issue we haven’t had time to address satisfactorily—overfitting.
To make things manageable computationally and statistically…
we need to make some restrictions on the set of possible partitions.
People have, for the most part, settled on this: bins are axis-aligned rectangles.
Partitioning as Tree-Search
We can describe any partition like this as the result of a tree of splits, each on a single covariate.
This aligns the problem naturally with tree-search algorithms from the AI/Optimization tradition in CS.
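Here's a minimal sketch of that representation; the class and field names are assumptions. Each internal node splits one covariate at a threshold, and each leaf is an axis-aligned rectangular bin with its own prediction.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SplitNode:
    axis: Optional[int] = None           # 0 = longitude, 1 = latitude
    threshold: Optional[float] = None    # None means this node is a leaf
    left: Optional["SplitNode"] = None   # points with covariate <  threshold
    right: Optional["SplitNode"] = None  # points with covariate >= threshold
    prediction: Optional[float] = None   # within-bin mean, stored at leaves

def predict(node, x, y):
    """Walk the tree of splits to the leaf (bin) containing (x, y)."""
    while node.threshold is not None:
        value = x if node.axis == 0 else y
        node = node.left if value < node.threshold else node.right
    return node.prediction
```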
All we need is a way to evaluate the quality of a partition. For that, we have…