Lesson 118 — Principal Component Analysis (PCA)
Hotelling 1933: diagonalize covariance to find directions of maximum variance. Scores, explained variance, scree plot. Connection with SVD. Applications in ML, finance, genomics.
Used in: 3rd year of high school (ages 17-18) · Equiv. German Stochastik LK · Equiv. Singapore H2 Math (Statistics) · Equiv. Japanese Math B (advanced)
Rigorous notation, full derivation, hypotheses
Mathematical definition
Setup and sample covariance
"The covariance matrix is always symmetric and positive semidefinite. Its eigenvalues are nonnegative and the eigenvectors form an orthonormal basis of ." — Introduction to Applied Linear Algebra (VMLS), §10.1
Principal components
Optimality
"The principal components are the eigenvectors of the data covariance matrix, ordered by decreasing eigenvalue. The first principal component captures the maximum variance; successive components capture maximum residual variance subject to orthogonality." — Understanding Linear Algebra, §7.1
Connection with SVD
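The key identity (a sketch, assuming a centered data matrix $X_c$ of shape $N \times d$ and the $N-1$ covariance denominator): if $X_c = U \Sigma V^{\top}$, the principal components are the columns of $V$ and the covariance eigenvalues are $\lambda_k = \sigma_k^2/(N-1)$. A minimal NumPy check:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))          # illustrative data, N = 300, d = 5
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance, sorted in decreasing order.
S = Xc.T @ Xc / (N - 1)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

# Route 2: singular values of the centered data matrix (already in decreasing order).
sing_vals = np.linalg.svd(Xc, compute_uv=False)

# lambda_k = sigma_k^2 / (N - 1)
print(np.allclose(eigvals, sing_vals**2 / (N - 1)))
```

Computing PCA via the SVD also avoids forming $X_c^{\top} X_c$ explicitly, which is numerically preferable when features are nearly collinear.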
Reconstruction and approximation error
Reconstruction error: retaining the first K principal components, the mean squared reconstruction error equals the sum of the discarded eigenvalues, $\lambda_{K+1} + \cdots + \lambda_d$.
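A quick numerical check of this identity, as a sketch (illustrative data; the $N-1$ covariance convention is assumed throughout):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))   # illustrative correlated data
Xc = X - X.mean(axis=0)
N, d = Xc.shape

S = Xc.T @ Xc / (N - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, W = eigvals[order], eigvecs[:, order]

K = 2
Wk = W[:, :K]                    # first K principal components
X_hat = Xc @ Wk @ Wk.T           # rank-K reconstruction (still centered)

# Reconstruction error per sample, measured with the same N-1 convention as the covariance.
err = np.sum((Xc - X_hat) ** 2) / (N - 1)

# It equals the sum of the discarded eigenvalues.
print(np.allclose(err, lam[K:].sum()))
```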
Worked examples
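An illustrative hand-computable case (the numbers below are chosen here for simplicity and are not taken from the sources):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative 2D worked example (numbers chosen for hand computation).
Suppose the sample covariance matrix of a centered 2D dataset is
\[
S = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}.
\]
The characteristic polynomial is $\lambda^2 - 7\lambda + 6 = (\lambda - 6)(\lambda - 1)$,
so $\lambda_1 = 6$ and $\lambda_2 = 1$. The corresponding unit eigenvectors are
\[
w_1 = \tfrac{1}{\sqrt{5}} \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad
w_2 = \tfrac{1}{\sqrt{5}} \begin{pmatrix} 1 \\ -2 \end{pmatrix}.
\]
PC1 explains $\lambda_1/(\lambda_1 + \lambda_2) = 6/7 \approx 85.7\%$ of the total
variance, and projecting onto PC1 alone leaves a mean squared reconstruction error
of $\lambda_2 = 1$.
\end{document}
```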
Exercise list
30 exercises · 7 with worked solution (23%)
- Ex. 118.1 · Understanding
Why is it necessary to center the data (subtract the mean) before applying PCA?
- Ex. 118.2 · Application
Given , compute the covariance and principal components.
- Ex. 118.3 · Application
Eigenvalues of : . Compute the explained variance by each PC and the cumulative variance. For which K does the cumulative variance reach 90%?
- Ex. 118.4 · Application
PCs of a 2D dataset: , . Compute the scores of in both components.
- Ex. 118.5 · Application
Using the data from the previous exercise, reconstruct the point retaining only PC1. What is the reconstruction error?
- Ex. 118.6 · Understanding · Answer key
Why is computing PCA via the SVD of the data matrix preferable to eigendecomposing the covariance matrix directly?
- Ex. 118.7 · Application · Answer key
Compute the PCs of . What is the explained variance by each component?
- Ex. 118.8 · Application · Answer key
With standardized data (z-score), what is the total variance? What does the Kaiser criterion of retaining only PCs with eigenvalue greater than 1 mean?
- Ex. 118.9 · Application
Dataset with eigenvalues (total 10). Compute the mean squared reconstruction error when keeping K = 1 and K = 2 components.
- Ex. 118.10 · Modeling
PCA of the returns of 10 stocks yields a PC1 with positive loadings of similar magnitude on all stocks. What does PC1 represent economically? How would a portfolio manager use this information?
- Ex. 118.11 · Application · Answer key
SVD of with samples gave singular values . Compute the corresponding covariance eigenvalues and the explained variance by PC1.
- Ex. 118.12 · Understanding · Answer key
Prove that the scores of different principal components are uncorrelated with each other.
- Ex. 118.13 · Application
A dataset has 50 standardized features. With K = 10 PCs capturing 95% variance, how many parameters are needed to represent covariance via rank-K PCA versus full covariance?
- Ex. 118.14 · Modeling · Answer key
Explain what the first 3 PCs of the Brazilian yield curve represent. Why do these 3 factors explain ~99% of the variance?
- Ex. 118.15 · Application
Explain the difference between performing PCA with and without prior standardization (z-score). When should you NOT standardize?
- Ex. 118.16 · Proof
Show that projection onto the first K PCs minimizes the mean squared reconstruction error among all rank-K linear projections. What is the minimum error in terms of the eigenvalues?
- Ex. 118.17 · Application
Eigenvalues in descending order: 12, 8, 3, 1, 1, 1, 1, 1. Sketch the scree plot and identify the "elbow". How many PCs should you retain to capture 80% of the variance?
- Ex. 118.18 · Modeling
Describe the Eigenfaces method (Turk-Pentland 1991) for facial recognition using PCA. What reduction in dimensionality is achieved compared to the original pixel representation?
- Ex. 118.19 · Application
What is the conceptual difference between PCA and ICA (Independent Component Analysis)? In what type of problem is ICA necessary?
- Ex. 118.20 · Application
What happens when the covariance matrix is the identity? What does this imply for PCA and dimensionality reduction?
- Ex. 118.21 · Proof
Prove that eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal. Use this to justify the orthogonality of PCs.
- Ex. 118.22 · Modeling · Answer key
Explain what a PCA biplot shows. How do you interpret the direction and length of feature arrows and the position of samples?
- Ex. 118.23 · Application
Why is classical PCA sensitive to outliers? What is the idea of Robust PCA for handling this problem?
- Ex. 118.24 · Application
Dataset: N = 1000 samples, d = 100 standardized features. PCA with K = 5 PCs explains 80% variance. Compute the data compression factor (ratio between original and PCA storage).
- Ex. 118.25 · Challenge
Explain the idea of Kernel PCA. How does replacing the inner product with a kernel allow capturing non-linear structure? What is the computational complexity?
- Ex. 118.26 · Application
Explain the "dual trick" of PCA: when d > N (more features than samples), how do you compute PCA efficiently? What is the complexity in each case?
- Ex. 118.27 · Application
PCA of the 2023 ENEM microdata (5 grades: CN, CH, LC, MT, Essay) yields a PC1 with positive loadings of similar magnitude on all grades. Interpret PC1. What could PC2 represent?
- Ex. 118.28 · Proof
Prove that the sample variance of the k-th score equals the k-th eigenvalue of the covariance matrix. Use the connection with SVD.
- Ex. 118.29 · Modeling
In the 1000 Genomes Project, PCA of genomic data from ~2500 people across 26 populations reveals clusters by continent. Explain how this is possible and what the first 3 PCs represent genetically.
- Ex. 118.30 · Challenge
Describe the Probabilistic PCA model (Tipping-Bishop 1999). What are the advantages over classical PCA? How does this model reduce to classical PCA in a limiting case?
Sources
- Understanding Linear Algebra — David Austin · Grand Valley State University · CC-BY-SA · Chapter 7: PCA via SVD, explained variance, scree plot, applications.
- Introduction to Applied Linear Algebra (VMLS) — Stephen Boyd, Lieven Vandenberghe · Stanford University · CC-BY-NC-ND · Ch. 10: rigorous PCA theory, optimality, SVD connection, ML applications.
- OpenIntro Statistics — Diez, Çetinkaya-Rundel, Barr · CC-BY-SA · §8.3: statistical perspective, explained variance, component interpretation, real data exercises.