Math Club
v1 · canonical standard

Lesson 78 — Correlation and simple linear regression

Pearson's correlation coefficient r, covariance, the least-squares line, and the coefficient of determination r². Correlation is not causation: Anscombe's quartet, the data set every scientist should know.

Used in: 2nd year of Ensino Médio (upper secondary, ages 16-17) · German Stochastik LK §12 · Singapore H2 Math §19 · AP Statistics (USA) §3

$$r = \frac{\displaystyle\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sqrt{\displaystyle\sum_{i=1}^n (x_i-\bar x)^2 \cdot \sum_{i=1}^n (y_i-\bar y)^2}}$$

Rigorous definitions and properties

Covariance
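
In symbols, a common sample definition is sketched here. It assumes the $n-1$ divisor (some texts divide by $n$ instead), and $s_x$, $s_y$ denote the sample standard deviations computed with the same divisor:

```latex
s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y),
\qquad
r = \frac{s_{xy}}{s_x \, s_y}
```

Whichever divisor is chosen, it cancels in the ratio, so this expression for r agrees with the sum formula given at the top of the lesson.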

"The covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive." — OpenStax Statistics, §12.1

Pearson correlation coefficient

Figure: four scatterplots with r ≈ +1, r ≈ −1, r ≈ 0, and r ≈ 0.7. The cloud of points concentrates more tightly around a line as |r| approaches 1.
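
The sum formula for r translates directly into code. The sketch below is a minimal illustration, not a reference implementation: the function name `pearson_r` and the five-point data set are made up for this example and do not come from the lesson.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r computed from sums of deviation products."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the two sums of squares.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return s_xy / sqrt(s_xx * s_yy)


# Illustrative data (hypothetical, not taken from the exercises).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))  # about 0.775
```

Here the deviation products sum to 6 and the two sums of squares are 10 and 6, so r = 6/√60, a moderately strong positive association.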

Least-squares line (OLS)
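
For the line ŷ = a + bx, the least-squares slope and intercept can be written in terms of summary statistics as b = r·s_y/s_x and a = ȳ − b·x̄. The sketch below applies these standard formulas to the summary values stated in exercise 78.7; the function name `ols_from_summary` is hypothetical.

```python
def ols_from_summary(r, x_bar, y_bar, s_x, s_y):
    """Least-squares intercept and slope from summary statistics.

    slope     b = r * s_y / s_x
    intercept a = y_bar - b * x_bar  (the line passes through the point of means)
    """
    if s_x <= 0:
        raise ValueError("s_x must be positive")
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Summary values from exercise 78.7.
intercept, slope = ols_from_summary(r=0.85, x_bar=10, y_bar=50, s_x=3, s_y=12)
print(f"y_hat = {intercept:.1f} + {slope:.1f} x")  # y_hat = 16.0 + 3.4 x
```

The result matches the line quoted in exercise 78.8.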

Coefficient of determination

$$r^2 = 1 - \frac{\text{SSE}}{\text{SST}}, \quad \text{SST} = \sum(y_i - \bar y)^2$$
what this means · r² measures the fraction of the variance of Y explained by the linear model in X.
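
As a rough numerical check of this identity, the sketch below fits a least-squares line, computes SSE and SST, and evaluates 1 − SSE/SST. The helper names and the small data set are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope computed from raw data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r_squared(xs, ys):
    """Coefficient of determination as 1 - SSE/SST."""
    intercept, slope = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    # SSE: squared vertical distances from the points to the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # SST: squared distances from the points to the mean of y.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Illustrative data (hypothetical): SSE = 2.4 and SST = 6.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 3))  # 0.6
```

For these points the fitted line is ŷ = 2.2 + 0.6x, so the linear model accounts for 60% of the variance of Y.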

LINE assumptions
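
LINE is commonly expanded as Linearity, Independence, Normality of the residuals, and Equal variance. Residuals are the usual tool for checking the first of these. The sketch below fits a line to deliberately curved data (y = x², an illustrative choice, not data from the lesson) and lists the residuals; the helper name `residuals` is hypothetical.

```python
def residuals(xs, ys):
    """Residuals y_i - y_hat_i of the least-squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Curved data: y = x^2 (illustrative).
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]
print(residuals(xs, ys))  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. A systematic pattern like this, rather than random scatter around zero, suggests that the linearity assumption fails even though the residuals sum to zero.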

Solved examples

Exercise list

32 exercises · 8 with worked solution (25%)

Application 18 · Understanding 3 · Modeling 8 · Challenge 2 · Proof 1
  1. Ex. 78.1 · Application · Answer key

    $X = (1, 2, 3, 4)$, $Y = (2, 4, 6, 8)$. Calculate r without using a calculator and justify the result.

  2. Ex. 78.2 · Application

    $X = (1, 2, 3, 4)$, $Y = (8, 6, 4, 2)$. Calculate r and identify the expected sign before computing.

  3. Ex. 78.3 · Application

    $X = (1, 2, 3)$, $Y = (1, 4, 9)$. Calculate r and discuss whether the relationship is linear.

  4. Ex. 78.4 · Application · Answer key

    If $U = X + 5$ and $V = 2Y$, what is the relationship between $r(U, V)$ and $r(X, Y)$? Justify with the definition.

  5. Ex. 78.5 · Application · Answer key

    Data with $n = 5$ pairs: $x = (1, 2, 3, 4, 5)$ and $y = (10, 7, 5, 4, 3)$. Calculate r.

  6. Ex. 78.6 · Application · Answer key

    $X = (1, 2, 3, 4, 5)$, $Y = (1, 4, 5, 9, 10)$. Calculate r and the covariance $s_{xy}$.

  7. Ex. 78.7 · Application

    $r = 0.85$, $\bar x = 10$, $\bar y = 50$, $s_x = 3$, $s_y = 12$. Find the least-squares line.

  8. Ex. 78.8 · Application

    Using the line from exercise 78.7 ($\hat y = 16 + 3.4x$), predict Y for $x = 15$ and for $x = 5$.

  9. Ex. 78.9 · Application

    With $r = 0.85$ (exercise 78.7), calculate r² and interpret it in terms of explained variance.

  10. Ex. 78.10 · Application

    Using the line from 78.7, calculate the residual of the point $(10, 55)$.

  11. Ex. 78.11 · Understanding

    What does $r = 0$ mean?

  12. Ex. 78.12 · Understanding

    Ice cream sales correlate positively with drowning deaths ($r \approx 0.8$). The best explanation is:

  13. Ex. 78.13 · Application

    With $r = 0.6$, $s_x = 2$, $s_y = 5$, calculate the slopes of the two regression lines: Y on X and X on Y. Do the lines coincide?

  14. Ex. 78.14 · Application

    A regression model explains 64% of the variance in spending as a function of income. What is $|r|$?

  15. Ex. 78.15 · Application

    If $V = -Y$, what is the relationship between $r(X, V)$ and $r(X, Y)$?

  16. Ex. 78.16 · Modeling

    Height ($X$) vs. weight ($Y$): $\bar x = 170$ cm, $\bar y = 70$ kg, $s_x = 8$ cm, $s_y = 12$ kg, $r = 0.75$. Find the regression line and predict the weight of a person 175 cm tall.

  17. Ex. 78.17 · Modeling

    A researcher found $r = 0.82$ between the Corruption Perceptions Index and GDP per capita in 120 countries. Interpret r² and discuss causal limitations.

  18. Ex. 78.18 · Modeling

    A plot of residuals vs. fitted values shows a U-pattern (residuals first negative, then positive). What does this indicate about the linear model?

  19. Ex. 78.19 · Application

    $n = 25$, $r = 0.45$. Test $H_0: \rho = 0$ vs. $H_1: \rho \neq 0$ at the 5% level.

  20. Ex. 78.20 · Application

    $n = 50$, $r = 0.60$. Construct a 95% CI for $\rho$ using the Fisher transformation.

  21. Ex. 78.21 · Modeling

    For each pair, identify whether the correlation is causal, spurious, or a case of reverse causality: (a) rain and umbrella sales; (b) number of police and crime per city.

  22. Ex. 78.22 · Application · Answer key

    Interpret $r^2 = 0.25$ in a study relating years of schooling to salary.

  23. Ex. 78.23 · Application

    Explain the risk of extrapolating the regression line to x-values outside the sample range.

  24. Ex. 78.24 · Modeling

    In finance, the "beta" of a stock is the regression coefficient of the stock return on the market return. Express beta in terms of $r$, $s_{r_i}$, and $s_{r_m}$.

  25. Ex. 78.25 · Modeling

    An energy distributor has monthly data on average temperature (°C) and consumption (MWh) for the last 5 years. Describe the correlation and regression analysis flow to predict consumption.

  26. Ex. 78.26 · Application

    The four Anscombe sets have $r \approx 0.82$ and the same regression line. Why is the linear model adequate for set I but not for the other three?

  27. Ex. 78.27 · Modeling · Answer key

    Why is Spearman correlation more suitable than Pearson for ordinal data (e.g., satisfaction 1 to 5) or with outliers?

  28. Ex. 78.28 · Modeling

    Differentiate confounder, mediator, and moderator in an observational study.

  29. Ex. 78.29 · Challenge · Answer key

    $n = 22$ pairs; $r^2 = 0.64$; SST = 500. Calculate the Sum of Squared Errors (SSE) and the RMSE.

  30. Ex. 78.30 · Challenge

    Why does R² never decrease when adding a variable to the model, and how does adjusted R² solve this problem?

  31. Ex. 78.31 · Understanding

    What property defines the least-squares line (OLS)?

  32. Ex. 78.32 · Proof · Answer key

    Prove that $-1 \leq r \leq 1$ using the Cauchy-Schwarz inequality.

Sources

  • OpenStax Statistics — Illowsky, Dean · 2022 · CC-BY. Primary source for exercises 78.1–2, 78.5–10, 78.14, 78.16, 78.19–20, 78.22–25, 78.29–31 and examples 1–3, 5.
  • OpenIntro Statistics (4th ed.) — Diez, Çetinkaya-Rundel, Barr · 2019 · CC-BY-SA. Source for exercises 78.3, 78.9, 78.11–12, 78.17–18, 78.21, 78.23, 78.26–28, 78.32 and example 4.
  • Introduction to Probability (Grinstead-Snell) — Grinstead, Snell · Dartmouth · GNU FDL. Source for exercises 78.4, 78.13, 78.15 and proof of |r| ≤ 1.

Updated on 2025-05-14 · Author(s): Clube da Matemática

Found an error? Open an issue on GitHub or submit a PR — open source forever.