Least Squares Regression in Linear Models
\[ \mu(x) = \frac{1}{m_x}\sum_{j:x_j=x} y_j \quad \text{ for } m_x = \sum_{j:x_j=x} 1 \]
\[ \hat\mu(x) = \frac{1}{N_x}\sum_{i:X_i=x} Y_i \quad \text{ for } N_x = \sum_{i:X_i=x} 1 \]
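To make the notation concrete, here is a minimal Python sketch of the subsample-mean estimator \(\hat\mu(x)\); the arrays are hypothetical toy data, not from the slides.

```python
# A minimal sketch, assuming hypothetical toy arrays X and Y (not from the slides).
import numpy as np

X = np.array([8, 8, 12, 12, 12, 16])           # covariate values X_i
Y = np.array([20., 22., 30., 28., 32., 40.])   # responses Y_i

def muhat(x):
    """Subsample mean: average of Y_i over observations with X_i = x."""
    mask = (X == x)
    return Y[mask].sum() / mask.sum()          # (1/N_x) * sum_{i: X_i = x} Y_i

print(muhat(8), muhat(12))                     # 21.0 30.0
```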
\[ \hat\varepsilon_i = Y_i - \hat\mu(X_i) \qquad \text{ are the \textbf{residuals}} \]
\[ \textcolor{red}{\hat \mu} = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i=1}^n \qty{ Y_i - m }^2 \quad \text{ is the \textbf{least squares estimate}} \]
\[ \hat \mu = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i=1}^n \qty{ Y_i - m }^2 \quad \text{ satisfies the \textbf{zero-derivative condition} } \] \[ \begin{aligned} 0 &= \frac{d}{dm} \sum_{i=1}^n \qty{ Y_i - m }^2 \ \Big|_{m=\hat\mu} \\ &= \sum_{i=1}^n \frac{d}{dm} \qty{ Y_i - m }^2 \ \Big|_{m=\hat\mu} \\ &= \sum_{i=1}^n -2 \qty{ Y_i - m } \ \Big|_{m=\hat\mu} \\ &= \sum_{i=1}^n -2 \qty{ Y_i - \hat\mu } \end{aligned} \]
\[ \begin{aligned} &0 = \sum_{i=1}^n -2\qty{ Y_i - \hat\mu } && \text{ when } \\ &2\sum_{i=1}^n Y_i = 2\sum_{i=1}^n \hat\mu = 2n\hat\mu && \text{ and therefore when } \\ &\frac{1}{n}\sum_{i=1}^n Y_i = \hat\mu \end{aligned} \]
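As a quick numerical check with the same hypothetical data, the minimizer of the sum of squares over a fine grid of candidate constants \(m\) agrees with the sample mean.

```python
# A quick check, assuming the same hypothetical data: the grid minimizer of
# sum_i (Y_i - m)^2 over constants m agrees with the sample mean.
import numpy as np

Y = np.array([20., 22., 30., 28., 32., 40.])
grid = np.linspace(Y.min(), Y.max(), 10001)             # candidate constants m
sse = ((Y[:, None] - grid[None, :]) ** 2).sum(axis=0)   # sum of squares at each m
print(grid[sse.argmin()], Y.mean())                     # both about 28.67
```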
\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{functions}\ m(x)} \sum_{i=1}^n \qty{ Y_i - m(X_i) }^2 \]
\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{functions}\ m(x)} \sum_{i:X_i=8} \{ Y_i - m(8) \}^2 + \sum_{i:X_i=12} \{ Y_i - m(12) \}^2 \]
\[ \begin{aligned} \hat \mu(8) &= \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:X_i=8} \{ Y_i - m \}^2 = \frac{\sum_{i:X_i=8} Y_i}{\sum_{i:X_i=8} 1 } \\ \hat \mu(12) &= \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:X_i=12} \{ Y_i - m \}^2 = \frac{\sum_{i:X_i=12} Y_i}{\sum_{i:X_i=12} 1 } \end{aligned} \]
There’s nothing special about two locations. We can do this for as many locations as we like.
We break our sum into pieces where \(X\) is the same. \[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{functions}\ m(x)} \sum_x \sum_{i:X_i=x} \{ Y_i - m(x) \}^2 \]
And we get the best value at each \(x\), in terms of squared residuals, by minimizing each piece separately.
The solution is, as before, the mean of each subsample.
\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:X_i=x} \{ Y_i - m \}^2 = \frac{\sum_{i:X_i=x} Y_i}{\sum_{i:X_i=x} 1 } \]
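In code, this location-by-location solution is just a group-wise mean; the sketch below reuses the hypothetical arrays from before.

```python
# A sketch of the location-by-location solution with the same hypothetical data:
# the minimizer at each x is the mean of the subsample with X_i = x.
import numpy as np

X = np.array([8, 8, 12, 12, 12, 16])
Y = np.array([20., 22., 30., 28., 32., 40.])

muhat = {int(x): float(Y[X == x].mean()) for x in np.unique(X)}
print(muhat)   # {8: 21.0, 12: 30.0, 16: 40.0}
```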
\[ \begin{aligned} \hat \mu(c(x)) &= \mathop{\mathrm{argmin}}_{\text{functions}\ m(c(x)) } \sum_{c(x)} \sum_{i:c(X_i)=c(x)} \{ Y_i - m(c(x)) \}^2 \\ &= \mathop{\mathrm{argmin}}_{\text{functions}\ m(c(x)) } \textcolor[RGB]{248,118,109}{\sum_{i:c(X_i)=\text{red}} \{ Y_i - m(\text{red}) \}^2} + \textcolor[RGB]{0,191,196}{\sum_{i:c(X_i)=\text{green}} \{ Y_i - m(\text{green}) \}^2} \end{aligned} \]
\[ \hat \mu(x) = \mathop{\mathrm{argmin}}_{\text{numbers}\ m} \sum_{i:c(X_i)=c(x)} \{ Y_i - m \}^2 = \frac{\sum_{i:c(X_i)=c(x)} Y_i}{\sum_{i:c(X_i)=c(x)} 1 } \]
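The same idea works for any grouping \(c(x)\). The sketch below uses an indicator \(c(x)=1_{\ge 12}(x)\) as a hypothetical stand-in for the color grouping.

```python
# A sketch with a hypothetical coarsening c(x) = 1{x >= 12} standing in for the
# color grouping: all observations with the same c(X_i) share one fitted value.
import numpy as np

X = np.array([8, 8, 12, 12, 12, 16])
Y = np.array([20., 22., 30., 28., 32., 40.])
c = lambda x: x >= 12                          # two groups: x < 12 and x >= 12

def muhat(x):
    mask = c(X) == c(x)                        # all i with c(X_i) = c(x)
    return Y[mask].mean()

print(muhat(8), muhat(12), muhat(16))          # 21.0 32.5 32.5
```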
\[ \hat\mu = \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \{ Y_i - m(X_i) \}^2 \qquad \text{ where } \quad \mathcal{M} \ \ \text{ is our model.} \]
\[ Y_i \approx \hat\mu(X_i) \qfor \hat\mu \in \mathcal{M} \]
\[ \begin{aligned} \hat\mu &= \mathop{\mathrm{argmin}}_{m \in \mathcal{M}} \sum_{i=1}^n \{ Y_i - m(X_i) \}^2 \qfor && \textcolor{red}{\mathcal{M}} = \qty{ \text{all functions} \ m(x) } \\ &&& \textcolor{blue}{\mathcal{M}} = \qty{ \text{all lines} \ m(x) = a + bx } \\ &&& \textcolor{magenta}{\mathcal{M}} = \qty{ \text{all increasing functions} \ m(x) } \\ &&& \textcolor{cyan}{\mathcal{M}} = \qty{ \text{all functions of an indicator} \ m(x)=f(1_{\ge 16}(x))} \\ \end{aligned} \]
\[ \begin{aligned} \textcolor{red}{\mathcal{M}} &= \qty{ \text{all functions} \ m(x) } \\ \textcolor{blue}{\mathcal{M}} &= \qty{ \text{all lines} \ m(x)=a + bx } \\ \textcolor{magenta}{\mathcal{M}} &= \qty{ \text{all increasing functions} \ m(x) } \\ \textcolor{cyan}{\mathcal{M}} &= \qty{ \text{all functions of an indicator} \ m(x)=f(1_{\ge 16}(x))} \\ \end{aligned} \]
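For a concrete instance of minimizing over a model, here is a sketch of least squares within the line model \(\mathcal{M}=\qty{m(x)=a+bx}\), solved on a design matrix with columns \(1\) and \(x\); the data are the same hypothetical toy arrays.

```python
# A sketch of least squares within the line model M = { m(x) = a + b x },
# using a design matrix with columns 1 and x (same hypothetical data).
import numpy as np

X = np.array([8., 8., 12., 12., 12., 16.])
Y = np.array([20., 22., 30., 28., 32., 40.])

design = np.column_stack([np.ones_like(X), X])        # columns: 1, x
(a, b), *_ = np.linalg.lstsq(design, Y, rcond=None)   # minimizes sum (Y_i - a - b X_i)^2
print(a, b)                                           # roughly a = 2.0, b = 2.35
```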
A few examples of models for two covariates \(w\) and \(x\).
\[\small{ \begin{aligned} \textcolor{blue}{\mathcal{M}} &= \qty{ m(w,x) = m_0(w) + m_1(x) \ \ \text{for univariate functions} \ \ m_0, m_1 } &&\text{additive bivariate model} \\ \end{aligned} } \]
\[\small{ \begin{aligned} \textcolor{red}{\mathcal{M}} &= \qty{ m(w,x) = a(w) + b x } &&\text{parallel lines} \\ \end{aligned} } \]
\[\small{ \begin{aligned} \textcolor{magenta}{\mathcal{M}} &= \qty{ m(w,x) = a(w) + b(w) x } &&\text{not-necessarily-parallel lines} \\ \end{aligned} } \]
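As a sketch of how two of these models differ in practice, with hypothetical data in which \(w\) is binary, the parallel-lines and not-necessarily-parallel-lines models are least squares problems on different design matrices.

```python
# A sketch contrasting the parallel-lines and not-necessarily-parallel-lines
# models with hypothetical data where w is binary: each model is least squares
# on its own design matrix.
import numpy as np

w = np.array([0, 0, 0, 1, 1, 1])                 # hypothetical group labels w_i
x = np.array([8., 10., 12., 8., 10., 12.])
Y = np.array([20., 24., 28., 30., 36., 42.])

# Parallel lines m(w,x) = a(w) + b x: one intercept per group, a shared slope.
D_par = np.column_stack([w == 0, w == 1, x]).astype(float)
coef_par, *_ = np.linalg.lstsq(D_par, Y, rcond=None)

# Separate lines m(w,x) = a(w) + b(w) x: intercept and slope per group.
D_sep = np.column_stack([w == 0, w == 1, (w == 0) * x, (w == 1) * x]).astype(float)
coef_sep, *_ = np.linalg.lstsq(D_sep, Y, rcond=None)

print(coef_par)   # a(0), a(1), shared slope b
print(coef_sep)   # a(0), a(1), b(0), b(1)
```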
Sketch in the least squares predictor in the horizontal lines model.
\[ \mathcal{M} = \qty{ m(w,x) = a(w) \ \ \text{for univariate functions} \ \ a } \]
Flip a slide forward to check your answer.
Sketch in the least squares predictor in the not-necessarily-parallel lines model.
\[ \mathcal{M} = \qty{ m(w,x) = a(w) + b(w) x \ \ \text{for univariate functions} \ \ a,b } \]
Flip a slide forward to check your answer.
Sketch in the least squares predictor in the parallel lines model.
\[ \mathcal{M} = \qty{ m(w,x) = a(w) + b x \ \ \text{for univariate functions} \ \ a \ \ \text{and constants} \ \ b } \]
Flip a slide forward to check your answer.
Sketch in the least squares predictor in the additive model. \[ \mathcal{M} = \qty{ m(w,x) = a(w) + b(x) \ \ \text{for univariate functions} \ \ a,b } \]
Flip a slide forward to check your answer. Is what you’ve drawn familiar? If so, why?