Lesson 105 — Simple linear regression

OLS model, least squares estimators, R², residuals, inference on the slope. Foundation of supervised learning and econometrics.

Used in: German Stochastik LK (Klasse 12) · Singapore H2 Mathematics (§14) · Japanese Math B

\hat{Y} = \hat\beta_0 + \hat\beta_1 X, \qquad \hat\beta_1 = \frac{S_{xy}}{S_{xx}}

Rigorous definition

Simple linear regression model

"The regression equation is written as $\hat{y} = a + bx$ , where $b$ is the slope and $a$ is the $y$ -intercept." — OpenStax Statistics, §12.3
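
In the $\beta$ notation used in the rest of this lesson (the quotation's $a$ and $b$ correspond to $\hat\beta_0$ and $\hat\beta_1$), the model is commonly stated as below. This is the standard textbook formulation, given here as an assumption about what this subsection intends. The normality condition is needed only for the $t$-based inference used later.

```latex
Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \qquad i = 1, \dots, n,
\qquad E[\varepsilon_i] = 0, \quad \mathrm{Var}(\varepsilon_i) = \sigma^2, \quad
\varepsilon_i \ \text{independent (and, for inference, normally distributed)}
```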

Definition · Ordinary Least Squares (OLS) estimators

The OLS estimators $\hat\beta_0$ and $\hat\beta_1$ minimize $SSE = \sum_{i=1}^n (Y_i - \hat Y_i)^2$, where $\hat Y_i = \hat\beta_0 + \hat\beta_1 X_i$. The closed-form solution is:

\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sum_{i=1}^n (X_i - \bar X)^2}

what this means · Sample slope: sample covariance of $X$ and $Y$ divided by the sample variance of $X$ (the common factor $1/(n-1)$ cancels).

\hat\beta_0 = \bar Y - \hat\beta_1 \bar X

what this means · Intercept: forces the line to pass through the data centroid $(\bar X, \bar Y)$.

The residual $e_i = Y_i - \hat Y_i$ is the vertical distance of each point from the fitted line.
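
The closed-form estimators and the residual definition above can be sketched in a few lines of Python. The function names `ols_fit` and `residuals` and the toy data are illustrative assumptions, not part of the lesson.

```python
from statistics import fmean


def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for the simple regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("all x values are equal, so the slope is undefined")
    beta1 = s_xy / s_xx              # slope: S_xy / S_xx
    beta0 = y_bar - beta1 * x_bar    # intercept: line passes through the centroid
    return beta0, beta1


def residuals(x, y, beta0, beta1):
    """Vertical distances e_i = Y_i - (beta0 + beta1 * X_i)."""
    return [yi - (beta0 + beta1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data lying exactly on y = 1 + 2x, so every residual is zero.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = ols_fit(x, y)
e = residuals(x, y, beta0, beta1)
print(beta0, beta1, e)
```

The guard on $S_{xx} = 0$ reflects the formula itself: with no variation in $X$ the slope has no defined value.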

Variance decomposition and R²

"The coefficient of determination $r^2$ is the square of the correlation coefficient $r$ . It tells you the fraction of total variability in the response that is explained by the least-squares line." — OpenIntro Statistics, §7.2, p. 331
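
The quotation can be made concrete with the usual sum-of-squares identity for a least squares line with intercept. The block below is a sketch in the lesson's notation; the names $SST$, $SSR$ and $SSE$ follow their use in Example 105.2.

```latex
SST = \sum_{i=1}^n (Y_i - \bar Y)^2, \qquad
SSR = \sum_{i=1}^n (\hat Y_i - \bar Y)^2, \qquad
SSE = \sum_{i=1}^n (Y_i - \hat Y_i)^2

SST = SSR + SSE, \qquad
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}
```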

Inference on the slope

Figure. Least squares line (gold) minimizing the sum of squared residuals (orange). Each residual $e_i$ is the vertical distance from the point to the line.
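
For inference on the slope, the solved examples and exercises rely on the standard results summarized below. They hold under the usual assumption of independent normal errors with constant variance. Treat this as a sketch of the textbook formulas rather than the lesson's own derivation.

```latex
\hat\sigma^2 = MSE = \frac{SSE}{n-2}, \qquad
SE(\hat\beta_1) = \frac{\hat\sigma}{\sqrt{S_{xx}}}, \qquad
T = \frac{\hat\beta_1}{SE(\hat\beta_1)} \sim t_{n-2} \quad \text{under } H_0: \beta_1 = 0

\text{CI for } \beta_1 \text{ at level } 1-\alpha: \quad
\hat\beta_1 \pm t_{n-2;\,\alpha/2} \, SE(\hat\beta_1)
```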

Solved examples

Example 105.1 · Calculate the regression line with small data

Problem. Five municipalities in the interior of São Paulo recorded GDP per capita $X$ (in thousand BRL/year) and HDI $Y$ :

$X$	18	24	30	36	42
$Y$	0.62	0.68	0.72	0.78	0.84

Find the least squares line and interpret the coefficients.

Strategy. Calculate $\bar X$ , $\bar Y$ , $S_{xx}$ , $S_{xy}$ , then apply the estimator formulas.

Resolution.

$\bar X = (18+24+30+36+42)/5 = 30$ ; $\bar Y = (0.62+0.68+0.72+0.78+0.84)/5 = 0.728$ .

$S_{xx} = (18-30)^2+(24-30)^2+(30-30)^2+(36-30)^2+(42-30)^2 = 144+36+0+36+144 = 360$ .

$S_{xy} = (18-30)(0.62-0.728)+\ldots = (-12)(-0.108)+(-6)(-0.048)+0+(6)(0.052)+(12)(0.112)$ $= 1.296+0.288+0+0.312+1.344 = 3.24$ .

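$\hat\beta_1 = 3.24/360 = 0.009$: each additional thousand BRL of GDP per capita is associated with an HDI that is higher by 0.009 points, on average.

$\hat\beta_0 = 0.728 - 0.009 \times 30 = 0.458$: the fitted HDI at $X = 0$, which lies far outside the observed range (18 to 42) and should not be read literally.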

Line: $\hat Y = 0.458 + 0.009 X$ .

Verification. For $X=30$ : $\hat Y = 0.458+0.009\times30 = 0.728 = \bar Y$ . Correct — the line passes through the centroid.

Source. OpenStax Statistics, §12.3, Example 12.5 — CC-BY
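
The arithmetic of this example can be checked with a short script. This is a minimal sketch using only the data in the problem statement; the variable names are illustrative.

```python
# Data from Example 105.1: GDP per capita (thousand BRL/year) and HDI.
x = [18, 24, 30, 36, 42]
y = [0.62, 0.68, 0.72, 0.78, 0.84]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Centered sums of squares and cross-products.
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

beta1 = s_xy / s_xx              # slope
beta0 = y_bar - beta1 * x_bar    # intercept

print(f"S_xx = {s_xx:.2f}, S_xy = {s_xy:.4f}")
print(f"fitted line: Y_hat = {beta0:.3f} + {beta1:.4f} X")
```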

Example 105.2 · Calculate and interpret R²

Problem. With the data from the previous example, calculate $SST$ , $SSE$ , $SSR$ and $R^2$ .

Strategy. Calculate $\hat Y_i$ for each point, then the three sums of squares.

Resolution.

$X$	$Y$	$\hat Y$	$e = Y-\hat Y$	$e^2$	$(Y-\bar Y)^2$
18	0.62	0.620	0	0	0.01166
24	0.68	0.674	0.006	0.000036	0.00230
30	0.72	0.728	-0.008	0.000064	0.000064
36	0.78	0.782	-0.002	0.000004	0.00270
42	0.84	0.836	0.004	0.000016	0.01254

$SSE = 0.000120$ ; $SST = 0.02928$ (from the unrounded squares); $SSR = SST - SSE = 0.02916$ .

$R^2 = 0.02916/0.02928 \approx 0.996$ .

Verification. $R^2$ very close to 1 — makes sense: the points are almost perfectly aligned.

Source. OpenIntro Statistics, §7.2, Exercise 7.9 — CC-BY-SA
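
The three sums of squares can be recomputed directly from the data, which is a useful guard against rounding slips in a hand-built table. The sketch below takes the fitted coefficients from Example 105.1 as given; the variable names are illustrative.

```python
# Data and fitted line from Example 105.1.
x = [18, 24, 30, 36, 42]
y = [0.62, 0.68, 0.72, 0.78, 0.84]
beta0, beta1 = 0.458, 0.009

y_bar = sum(y) / len(y)
y_hat = [beta0 + beta1 * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained variation
r2 = 1 - sse / sst

print(f"SSE = {sse:.6f}, SST = {sst:.5f}, SSR = {ssr:.5f}, R^2 = {r2:.4f}")
```

Computing $SSR$ from the fitted values instead of as $SST - SSE$ gives an independent check that the decomposition holds.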

Example 105.3 · t-test for the slope

Problem. With $n=25$ pairs of observations, we obtained $\hat\beta_1 = 3.42$ and $SE(\hat\beta_1) = 1.14$ . Test $H_0: \beta_1 = 0$ at the 5% level (two-tailed).

Strategy. Calculate the $T$ statistic and compare with the critical value $t_{23;\,0.025}$ .

Resolution.

$T = \hat\beta_1 / SE(\hat\beta_1) = 3.42/1.14 = 3.00$ .

Degrees of freedom: $n-2 = 23$ . Critical value $t_{23;\,0.025} \approx 2.069$ .

Since $|T| = 3.00 > 2.069$ , we reject $H_0$ at the 5% level.

Verification. p-value: $P(|t_{23}| > 3.00) \approx 0.006 < 0.05$ . Consistent with rejection.

Source. OpenStax Statistics, §12.4, Example 12.8 — CC-BY
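
The test can be reproduced without a statistics library. In the sketch below the critical value 2.069 is taken from the example itself, and the p-value is approximated by integrating the Student $t$ density with Simpson's rule. The helper names `t_pdf` and `two_sided_p_value` are assumptions made for illustration.

```python
import math


def t_pdf(u, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|t_df| > |t_stat|) with Simpson's rule (steps must be even)."""
    b = abs(t_stat)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_area = total * h / 3      # P(0 < t < |t_stat|)
    return 1 - 2 * central_area


# Values from Example 105.3.
beta1_hat, se_beta1, n = 3.42, 1.14, 25
df = n - 2
t_stat = beta1_hat / se_beta1
t_crit = 2.069                        # tabulated t_{23; 0.025}, as quoted in the example
reject = abs(t_stat) > t_crit
p_value = two_sided_p_value(t_stat, df)

print(f"T = {t_stat:.2f}, df = {df}, reject H0: {reject}, p-value = {p_value:.4f}")
```

If SciPy is available, a library routine for the $t$ distribution would normally replace the hand-written integration.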

Example 105.4 · Point prediction and confidence interval

Problem. The fitted line is $\hat Y = 42.6 + 1.8 X$ with $n=20$ , $MSE = 9.61$ , $\bar X = 15$ , $S_{xx} = 280$ . Obtain: (a) point prediction for $X^* = 20$ ; (b) 95% CI for the mean value of $Y$ when $X = 20$ .

Strategy. Substitute $X^*$ into the fitted line. Use the CI formula for the conditional mean.

Resolution.

(a) $\hat Y^* = 42.6 + 1.8 \times 20 = 42.6 + 36 = 78.6$ .

(b) $SE(\hat Y^*) = \hat\sigma\sqrt{\frac{1}{n} + \frac{(X^*-\bar X)^2}{S_{xx}}} = \sqrt{9.61}\sqrt{\frac{1}{20}+\frac{25}{280}} = 3.10 \times \sqrt{0.0500+0.0893} = 3.10 \times 0.373 = 1.156$ .

$t_{18;\,0.025} \approx 2.101$ .

95% CI: $78.6 \pm 2.101 \times 1.156 = 78.6 \pm 2.43 = (76.2; 81.0)$ .

Verification. The further $X^*$ is from $\bar X$ , the larger the SE — CI widens at the extremes.

Source. OpenIntro Statistics, §7.3 — CC-BY-SA
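
The interval for the conditional mean can be sketched as follows, using the summary statistics from the problem and the tabulated critical value quoted in the resolution. Variable names are illustrative.

```python
import math

# Summary statistics from Example 105.4.
beta0, beta1 = 42.6, 1.8
n, mse, x_bar, s_xx = 20, 9.61, 15, 280
x_star = 20
t_crit = 2.101                      # tabulated t_{18; 0.025}, as quoted in the example

y_hat = beta0 + beta1 * x_star      # point prediction

# Standard error of the estimated mean response at x_star.
se_mean = math.sqrt(mse) * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)

half_width = t_crit * se_mean
ci = (y_hat - half_width, y_hat + half_width)

print(f"Y_hat* = {y_hat:.1f}, SE = {se_mean:.3f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
```

Changing `x_star` to a value further from `x_bar` shows the widening described in the verification step, since the term `(x_star - x_bar) ** 2 / s_xx` grows.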

Example 105.5 · Residual diagnosis and assumption violation

Problem. A regression of energy consumption $(Y)$ vs. temperature $(X)$ produced a residual vs. $\hat Y$ plot with a "U" shape (negative residuals in the center, positive at the ends). Which assumption is violated, and what should be done?

Strategy. Identify the pattern in the residual plot and relate it to the linearity assumption.

Resolution.

"U" pattern (systematic curvature) in residuals vs. $\hat Y$ indicates violation of the linearity assumption: the real relationship between $X$ and $Y$ is not linear.

Corrective action: include $X^2$ in the model (polynomial regression) or apply a transformation to $X$ (e.g., $\log X$ , $\sqrt{X}$ ).

Other common patterns:

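Funnel (spread of residuals increasing with $\hat Y$) → constant-variance (homoscedasticity) assumption violated → transform $Y$ (e.g., $\log Y$) or use heteroscedasticity-robust standard errors.
Diagonal bands → discrete or grouped values of $Y$ → consider a model suited to such data (for grouped data, e.g., a mixed-effects model).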

Verification. After including $X^2$ , the new residual plot should be random around zero.

Source. OpenIntro Statistics, §7.4, Figure 7.17 — CC-BY-SA
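
The "U" pattern is easy to reproduce on synthetic data. The sketch below invents a noise-free quadratic relationship between temperature and energy use (the numbers are an assumption, not data from the example), fits a straight line, and inspects the residuals in order of increasing $\hat Y$.

```python
# Synthetic, noise-free data: energy use is quadratic in temperature.
temps = list(range(0, 45, 5))                                  # 0, 5, ..., 40 degrees C
energy = [100 + 2 * t + 0.5 * (t - 20) ** 2 for t in temps]    # hypothetical values

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(energy) / n
s_xx = sum((t - x_bar) ** 2 for t in temps)
s_xy = sum((t - x_bar) * (e - y_bar) for t, e in zip(temps, energy))

beta1 = s_xy / s_xx
beta0 = y_bar - beta1 * x_bar

fitted = [beta0 + beta1 * t for t in temps]
resid = [e - f for e, f in zip(energy, fitted)]

# The slope is positive, so increasing temperature means increasing Y_hat.
for f, r in zip(fitted, resid):
    print(f"Y_hat = {f:7.2f}   residual = {r:8.2f}")
```

The residuals are positive at both ends and negative in the middle, which is the curvature signature that a straight-line fit cannot absorb.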

Exercise list

30 exercises · 7 with worked solution (23%)

Application 15 · Understanding 4 · Modeling 5 · Challenge 4 · Proof 2

Ex. 105.1 · Application
Data: $n=6$ , $\bar X = 4$ , $\bar Y = 10$ , $S_{xx} = 20$ , $S_{xy} = 30$ . Calculate $\hat\beta_0$ and $\hat\beta_1$ .
Ex. 105.2 · Application
Pairs $(X,Y)$ : $(2,5)$ , $(4,9)$ , $(6,11)$ , $(8,15)$ , $(10,20)$ . Calculate the least squares line.
Ex. 105.3 · Application
Using $\hat Y = 1.2 + 1.8X$ (previous exercise), predict $Y$ for $X=7$ and $X=12$ . Identify which prediction is extrapolation.
Ex. 105.4 · Application
For the data in Exercise 105.1: $\bar X=4$ , $\bar Y=10$ , $S_{xx}=20$ , $S_{xy}=30$ , $S_{yy}=52$ . Calculate $R^2$ and interpret.
Ex. 105.5 · Application · Answer key
The Pearson correlation coefficient between two variables is $r = 0.87$ . What is the $R^2$ of the simple regression of $Y$ on $X$ ?
Ex. 105.6 · Application · Answer key
Regression of annual salary (in thousand BRL) on years of experience produced $\hat Y = 32.4 + 2.5X$ . Interpret $\hat\beta_0$ and $\hat\beta_1$ .
Ex. 105.7 · Application
Using $\hat Y = 32.4 + 2.5X$ , an employee with 14 years of experience earns 72,000 BRL/year. Calculate the residual.
Ex. 105.8 · Application · Answer key
Five observed values of $Y$ : $(8, 10, 12, 9, 11)$ with $\bar Y = 10$ . The SSE of the regression is 3.2. Calculate SST, SSR and $R^2$ .
Ex. 105.9 · Application
A regression with $n=20$ produced $SSE = 48.6$ . Calculate $MSE$ and $\hat\sigma$ and interpret.
Ex. 105.10 · Application
$\hat\beta_1 = 3.6$ , $\hat\sigma = 2.1$ , $S_{xx} = 144$ . Calculate $SE(\hat\beta_1)$ and the $T$ statistic.
Ex. 105.11 · Application
$n=30$ , $\hat\beta_1 = 1.4$ , $SE(\hat\beta_1) = 0.38$ . Construct a 95% CI for $\beta_1$ and interpret.
Ex. 105.12 · Application
$r = -0.73$ , $s_X = 4$ , $s_Y = 6$ . What is the sign of $\hat\beta_1$ ? Calculate $\hat\beta_1$ using the relation $\hat\beta_1 = r(s_Y/s_X)$ .
Ex. 105.13 · Understanding · Answer key
Which of the statements about the least squares line is CORRECT?
Ex. 105.14 · Understanding
What is the correct interpretation of $R^2 = 0$ in simple linear regression?
Ex. 105.15 · Understanding
A regression produced $R^2 = 0.85$ and $\hat\beta_1 = 2.3 > 0$ . What can be concluded?
Ex. 105.16 · Modeling
A real estate agent in Curitiba collected data from 10 apartments: area ( $X$ , in m²) and rent cost ( $Y$ , in BRL/month). $\bar X=80$ , $\bar Y=1600$ , $S_{xx}=3200$ , $S_{xy}=64000$ . Fit the line and predict the rent for a 95 m² apartment.
Ex. 105.17 · Modeling
Young people aged 10 to 25: $\bar X = 22$ years, $\bar Y = 74$ kg, $s_X = 2.3$ , $s_Y = 8.5$ , $r = 0.82$ . Fit the line using $\hat\beta_1 = r(s_Y/s_X)$ and predict the weight of a 30-year-old person.
Ex. 105.18 · Modeling · Answer key
Regression with $n=25$ , $SST=1200$ , $R^2=0.72$ . Build the ANOVA table (SSR, SSE, MSR, MSE, F) and test $H_0: \beta_1 = 0$ at the 5% level.
Ex. 105.19 · Modeling
A regression of water consumption (liters/day) on temperature (°C) produced $\hat Y = 50 + 8X$ with $R^2=0.91$ for $n=30$ points. The point $(15; 430)$ appears very far from the others. What procedure should be used to evaluate its influence?
Ex. 105.20 · Modeling
A carrier recorded the number of orders $X$ and monthly logistics cost $Y$ (in thousand BRL) for 5 branches: $(10,100)$ , $(20,180)$ , $(30,270)$ , $(40,340)$ , $(50,400)$ . Fit the line.
Ex. 105.21 · Application
Using $\hat Y = 30 + 7.6X$ , calculate the prediction and the residual for a branch with $X=35$ orders and an observed cost of 310,000 BRL.
Ex. 105.22 · Application
For the regression in Exercise 105.20, calculate the 5 residuals, the SSE, and the residual standard deviation $\hat\sigma$ .
Ex. 105.23 · Understanding
The residuals vs. $\hat Y$ plot has a funnel shape (increasing variance). What does this indicate?
Ex. 105.24 · Application
For the regression in Exercise 105.20 ( $\hat Y = 30 + 7.6X$ , $n=5$ , $\bar X=30$ , $S_{xx}=1000$ , $\hat\sigma \approx 10.33$ ), construct a 95% CI for the average cost of a branch with $X^*=40$ orders. Use $t_{3;\,0.025} = 3.182$ .
Ex. 105.25 · Challenge · Answer key
Prove algebraically that, for simple linear regression, $R^2 = r^2$ (square of the Pearson correlation coefficient).
Ex. 105.26 · Challenge · Answer key
Derive the formulas for $\hat\beta_0$ and $\hat\beta_1$ by minimizing $SSE = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$ via differential calculus (normal equations).
Ex. 105.27 · Proof
Prove that, for any least squares line, the sum of residuals is zero: $\sum_{i=1}^n e_i = 0$ .
Ex. 105.28 · Challenge
Summary data: $n=15$ , $\bar X=12$ , $\bar Y=45$ , $S_{xx}=420$ , $S_{xy}=1260$ , $S_{yy}=4800$ . Calculate: fitted line, $R^2$ , test $H_0:\beta_1=0$ at the 5% level.
Ex. 105.29 · Challenge
Why does reducing the variability of $X$ (narrowing the sampled range) hurt the estimation of $\beta_1$ ? Relate to the formula for $SE(\hat\beta_1)$ .
Ex. 105.30 · Proof
Prove that the OLS estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased, i.e., $E[\hat\beta_j] = \beta_j$ .

Sources

Statistics, OpenStax. Illowsky, Dean · CC-BY · Chapter 12 (Linear Regression and Correlation). Primary source for examples, equations, and exercises in this lesson.
OpenIntro Statistics (4th ed.). Diez, Çetinkaya-Rundel, Barr · CC-BY-SA · Chapter 7 (Introduction to linear regression). Primary source for residual diagnostics, inference, and exercises with real data.
Probabilidade e Estatística, Wikilivros. Collaborative · CC-BY-SA · Linear regression section. Reference in PT-BR with notation compatible with the Brazilian national curriculum.