Lesson 78 — Correlation and simple linear regression
Pearson's r coefficient, covariance, least-squares line, coefficient of determination r². Correlation is not causation — the Anscombe quartet, the set every scientist must know.
Used in: 2nd year of Ensino Médio (upper secondary, ages 16-17) · German Stochastik LK §12 · Singaporean H2 Math §19 · AP Statistics (USA) §3
Rigorous notation, full derivation, hypotheses
Rigorous definitions and properties
Covariance
"The covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive." — OpenStax Statistics, §12.1
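The sign behaviour described in the quote can be checked numerically. The sketch below assumes the sample covariance with the n - 1 denominator; the function name `sample_covariance` and the two small data sets are illustrative and not taken from the lesson.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator (an assumed convention)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means, divided by n - 1.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Illustrative data: y tends to rise with x in the first pair, fall in the second.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]
falling = [5, 4, 5, 4, 2]

cov_rising = sample_covariance(xs, rising)
cov_falling = sample_covariance(xs, falling)
print(cov_rising, cov_falling)
```

Large x paired with large y gives products of deviations that are mostly positive, so `cov_rising` is positive; reversing the trend flips the sign.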
Pearson correlation coefficient
Figure: four scatterplots with different r values. The point cloud clusters more tightly around a straight line as |r| approaches 1.
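As a minimal sketch of how r behaves, the code here computes Pearson's coefficient as S_xy / sqrt(S_xx · S_yy), where S denotes a sum of (products of) deviations from the means. The helper `pearson_r` and the data are hypothetical illustrations, not the lesson's own figures.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r as S_xy / sqrt(S_xx * S_yy), using sums of deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4, 5]
r_exact_up = pearson_r(xs, [3, 5, 7, 9, 11])    # points exactly on a rising line
r_scattered = pearson_r(xs, [2, 4, 5, 4, 5])    # rising trend with scatter
r_exact_down = pearson_r(xs, [11, 9, 7, 5, 3])  # points exactly on a falling line
print(round(r_exact_up, 3), round(r_scattered, 3), round(r_exact_down, 3))
```

Points lying exactly on a line give |r| = 1, with the sign of the slope; the scattered sample gives an intermediate value of about 0.775.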
Least-squares line (OLS)
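A minimal sketch of the least-squares fit, assuming the standard closed form: slope b = S_xy / S_xx and intercept a = ȳ - b·x̄, so that the line passes through the point of means. The function `fit_least_squares` and the sample are illustrative assumptions.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the line minimising the sum of squared residuals."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("slope is undefined when all x values are equal")
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x  # forces the line through (mean_x, mean_y)
    return intercept, slope


# Illustrative sample.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept, slope = fit_least_squares(xs, ys)

# Prediction inside the sampled x-range, and the residual of the point (3, 5).
prediction_at_2_5 = intercept + slope * 2.5
residual_at_3 = 5 - (intercept + slope * 3)
# With an intercept in the model, the residuals sum to zero (up to rounding).
residual_sum = sum(y - (intercept + slope * x) for x, y in zip(xs, ys))
print(intercept, slope, prediction_at_2_5, residual_at_3, residual_sum)
```

For this sample the fitted line is approximately ŷ = 2.2 + 0.6x. A residual is always observed minus fitted, so the point (3, 5) sits about 1 unit above the line.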
Coefficient of determination
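One way to see the coefficient of determination at work is to compute it twice: as 1 - SSE/SST from the fitted line, and as the square of Pearson's r. In simple linear regression the two agree. The sketch assumes the usual definitions (SST as total sum of squares about ȳ, SSE as sum of squared residuals); all names and data are illustrative.

```python
def r_squared_two_ways(xs, ys):
    """Return (1 - SSE/SST, r**2) for a simple least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    sst = sum((y - mean_y) ** 2 for y in ys)  # total variation of y

    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    # Unexplained variation: squared vertical distances to the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    from_sums = 1 - sse / sst
    from_r = s_xy ** 2 / (s_xx * sst)  # r squared, since r = S_xy / sqrt(S_xx * S_yy)
    return from_sums, from_r


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
from_sums, from_r = r_squared_two_ways(xs, ys)
print(round(from_sums, 3), round(from_r, 3))
```

Here both routes give 0.6: the line accounts for 60% of the variation in y around its mean, and the remaining 40% stays in the residuals.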
LINE assumptions
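Reading LINE as the usual mnemonic (Linearity, Independence, Normality of residuals, Equal variance), the linearity assumption is the easiest to probe with a few lines of code. The sketch fits a straight line to deliberately curved, made-up data (y = x²) and lists the residuals in order of x; the names are hypothetical.

```python
def residuals_of_line_fit(xs, ys):
    """Fit a least-squares line and return the residuals (observed - fitted)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Made-up curved data: y = x**2, so a straight line is the wrong model.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]
residuals = residuals_of_line_fit(xs, ys)
signs = ["+" if e > 0 else "-" for e in residuals]
print(residuals, signs)
```

The residuals come out positive at both ends and negative in the middle. A systematic curve like this in a residual plot suggests that the linearity assumption fails, even when r itself is high.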
Solved examples
Exercise list
32 exercises · 8 with worked solutions (25%)
- Ex. 78.1 · Application · Answer key
, . Calculate r without using a calculator and justify the result.
- Ex. 78.2 · Application
, . Calculate r and identify the expected sign before computing.
- Ex. 78.3 · Application
, . Calculate r and discuss if the relationship is linear.
- Ex. 78.4 · Application · Answer key
If and , what is the relationship between and ? Justify with the definition.
- Ex. 78.5 · Application · Answer key
Data with pairs: and . Calculate r.
- Ex. 78.6 · Application · Answer key
, . Calculate r and the covariance .
- Ex. 78.7 · Application
, , , , . Find the least-squares line.
- Ex. 78.8 · Application
Using the line from exercise 78.7 (), predict Y for and for .
- Ex. 78.9 · Application
With (exercise 78.7), calculate r² and interpret in terms of explained variance.
- Ex. 78.10 · Application
Using the line from 78.7, calculate the residual of the point .
- Ex. 78.11 · Understanding
What does mean?
- Ex. 78.12 · Understanding
Ice cream sales correlate positively with drowning deaths (). The best explanation is:
- Ex. 78.13 · Application
With , , , calculate the slopes of the two regression lines: Y on X and X on Y. Do the lines coincide?
- Ex. 78.14 · Application
A regression model explains 64% of the variance in spending as a function of income. What is ?
- Ex. 78.15 · Application
If , what is the relationship between and ?
- Ex. 78.16 · Modeling
Height () vs. weight () relationship: cm, kg, cm, kg, . Line equation and prediction for a person of 175 cm.
- Ex. 78.17 · Modeling
A researcher found between the Corruption Perception Index and GDP per capita in 120 countries. Interpret r² and discuss causal limitations.
- Ex. 78.18 · Modeling
A plot of residuals vs. fitted values shows a U-shaped pattern (residuals positive, then negative, then positive again). What does this indicate about the linear model?
- Ex. 78.19 · Application
, . Test vs. at the 5% level.
- Ex. 78.20 · Application
, . Construct a 95% CI for using the Fisher transformation.
- Ex. 78.21 · Modeling
For each pair, identify whether the correlation is causal, spurious, or a case of reverse causality: (a) rain and umbrella sales; (b) number of police officers and crime rate per city.
- Ex. 78.22 · Application · Answer key
Interpret in a study relating years of schooling to salary.
- Ex. 78.23 · Application
Explain the risk of extrapolating the regression line to x-values outside the sample range.
- Ex. 78.24 · Modeling
In finance, the "beta" of a stock is the regression coefficient of the stock return on the market return. Express beta in terms of , , and .
- Ex. 78.25 · Modeling
An energy distributor has monthly data on average temperature (°C) and consumption (MWh) for the last 5 years. Describe the workflow of a correlation and regression analysis for predicting consumption.
- Ex. 78.26 · Application
The four Anscombe sets have practically the same r (about 0.816) and the same regression line. Why is the linear model adequate for set I but not for the other three?
- Ex. 78.27 · Modeling · Answer key
Why is the Spearman correlation more suitable than Pearson's for ordinal data (e.g., satisfaction rated 1 to 5) or for data with outliers?
- Ex. 78.28 · Modeling
Distinguish between a confounder, a mediator, and a moderator in an observational study.
- Ex. 78.29 · Challenge · Answer key
pairs; ; SST = 500. Calculate the Sum of Squared Errors (SSE) and the RMSE.
- Ex. 78.30 · Challenge
Why does R² never decrease when adding a variable to the model, and how does adjusted R² solve this problem?
- Ex. 78.31 · Understanding
What property defines the least-squares line (OLS)?
- Ex. 78.32 · Proof · Answer key
Prove that |r| ≤ 1 using the Cauchy-Schwarz inequality.
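The Anscombe quartet mentioned in the lesson summary and in exercise 78.26 can be explored numerically. The sketch uses the published Anscombe (1973) values and recomputes r and the least-squares line for each set with a hypothetical helper `summarize`; it is an illustration, not the lesson's own solution.

```python
import math

# Anscombe (1973) quartet. Sets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return (r, intercept, slope) for one data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return s_xy / math.sqrt(s_xx * s_yy), intercept, slope


summaries = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, (r, intercept, slope) in summaries.items():
    print(f"Set {name}: r = {r:.3f}, line: y = {intercept:.2f} + {slope:.3f} x")
```

All four sets give r close to 0.816 and a line close to y = 3.00 + 0.500x, so the summary numbers alone cannot tell them apart. Only a scatterplot or a residual plot reveals the curvature in set II, the outlier in set III, and the single high-leverage point in set IV.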
Sources
- OpenStax Statistics — Illowsky, Dean · 2022 · CC-BY. Primary source for exercises 78.1–2, 78.5–10, 78.14, 78.16, 78.19–20, 78.22–25, 78.29–31 and examples 1–3, 5.
- OpenIntro Statistics (4th ed) — Diez, Çetinkaya-Rundel, Barr · 2019 · CC-BY-SA. Source for exercises 78.3, 78.9, 78.11–12, 78.17–18, 78.21, 78.23, 78.26–28, 78.32 and example 4.
- Introduction to Probability (Grinstead-Snell) — Grinstead, Snell · Dartmouth · GNU FDL. Source for exercises 78.4, 78.13, 78.15 and proof of |r| ≤ 1.