Lesson 118 — Principal Component Analysis (PCA)
Hotelling 1933: diagonalize covariance to find directions of maximum variance. Scores, explained variance, scree plot. Connection with SVD. Applications in ML, finance, genomics.
Used in: 3rd year of high school (ages 17-18) · Equiv. German Stochastik LK · Equiv. Singapore H2 Math (Statistics) · Equiv. Japanese Math B (advanced)
Rigorous notation, full derivation, hypotheses
Mathematical definition
Setup and sample covariance
"The covariance matrix is always symmetric and positive semidefinite. Its eigenvalues are nonnegative and the eigenvectors form an orthonormal basis of ." — Introduction to Applied Linear Algebra (VMLS), §10.1
Principal components
Optimality
"The principal components are the eigenvectors of the data covariance matrix, ordered by decreasing eigenvalue. The first principal component captures the maximum variance; successive components capture maximum residual variance subject to orthogonality." — Understanding Linear Algebra, §7.1
Connection with SVD
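The key identity (a sketch, assuming a centered data matrix $X_c$ of shape $N \times d$ and the $N-1$ covariance denominator): if $X_c = U \Sigma V^{\top}$, the principal components are the columns of $V$ and the covariance eigenvalues are $\lambda_k = \sigma_k^2/(N-1)$. A minimal NumPy check:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))          # illustrative data, N = 300, d = 5
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance, sorted in decreasing order.
S = Xc.T @ Xc / (N - 1)
eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]

# Route 2: singular values of the centered data matrix (already in decreasing order).
sing_vals = np.linalg.svd(Xc, compute_uv=False)

# lambda_k = sigma_k^2 / (N - 1)
print(np.allclose(eigvals, sing_vals**2 / (N - 1)))
```

Computing PCA via the SVD also avoids forming $X_c^{\top} X_c$ explicitly, which is numerically preferable when features are nearly collinear.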
Reconstruction and approximation error
Reconstruction error: retaining the first K principal components, the mean squared reconstruction error equals the sum of the discarded eigenvalues, $\lambda_{K+1} + \cdots + \lambda_d$.
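A quick numerical check of this identity, as a sketch (illustrative data; the $N-1$ covariance convention is assumed throughout):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 6)) @ rng.normal(size=(6, 6))   # illustrative correlated data
Xc = X - X.mean(axis=0)
N, d = Xc.shape

S = Xc.T @ Xc / (N - 1)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lam, W = eigvals[order], eigvecs[:, order]

K = 2
Wk = W[:, :K]                    # first K principal components
X_hat = Xc @ Wk @ Wk.T           # rank-K reconstruction (still centered)

# Reconstruction error per sample, measured with the same N-1 convention as the covariance.
err = np.sum((Xc - X_hat) ** 2) / (N - 1)

# It equals the sum of the discarded eigenvalues.
print(np.allclose(err, lam[K:].sum()))
```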
Worked examples
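An illustrative hand-computable case (the numbers below are chosen here for simplicity and are not taken from the sources):

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative 2D worked example (numbers chosen for hand computation).
Suppose the sample covariance matrix of a centered 2D dataset is
\[
S = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix}.
\]
The characteristic polynomial is $\lambda^2 - 7\lambda + 6 = (\lambda - 6)(\lambda - 1)$,
so $\lambda_1 = 6$ and $\lambda_2 = 1$. The corresponding unit eigenvectors are
\[
w_1 = \tfrac{1}{\sqrt{5}} \begin{pmatrix} 2 \\ 1 \end{pmatrix}, \qquad
w_2 = \tfrac{1}{\sqrt{5}} \begin{pmatrix} 1 \\ -2 \end{pmatrix}.
\]
PC1 explains $\lambda_1/(\lambda_1 + \lambda_2) = 6/7 \approx 85.7\%$ of the total
variance, and projecting onto PC1 alone leaves a mean squared reconstruction error
of $\lambda_2 = 1$.
\end{document}
```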
Exercise list
30 exercises · 7 with worked solution (23%)
- Ex. 118.1 · Understanding
Why is it necessary to center the data (subtract the mean) before applying PCA?
- Ex. 118.2 · Application
Given , compute the covariance and principal components.
- Ex. 118.3 · Application
Eigenvalues of : . Compute the explained variance by each PC and the cumulative variance. For which K does the cumulative variance reach 90%?
- Ex. 118.4 · Application
PCs of a 2D dataset: , . Compute the scores of in both components.
- Ex. 118.5 · Application
Using the data from the previous exercise, reconstruct the point retaining only PC1. What is the reconstruction error?
- Ex. 118.6 · Understanding · Answer key
Why is computing PCA via the SVD of the data matrix preferable to eigendecomposing the covariance matrix directly?
- Ex. 118.7 · Application · Answer key
Compute the PCs of . What is the explained variance by each component?
- Ex. 118.8 · Application · Answer key
With standardized data (z-score), what is the total variance? What does the Kaiser criterion of retaining only PCs with eigenvalue greater than 1 mean?
- Ex. 118.9 · Application
Dataset with eigenvalues (total 10). Compute the mean squared reconstruction error when keeping K = 1 and K = 2 components.
- Ex. 118.10 · Modeling
PCA of the returns of 10 stocks yields a PC1 with positive loadings of similar magnitude on all stocks. What does PC1 represent economically? How would a portfolio manager use this information?
- Ex. 118.11 · Application · Answer key
SVD of with samples gave singular values . Compute the corresponding covariance eigenvalues and the explained variance by PC1.
- Ex. 118.12 · Understanding · Answer key
Prove that the scores of different principal components are uncorrelated with each other.
- Ex. 118.13 · Application
A dataset has 50 standardized features. With K = 10 PCs capturing 95% variance, how many parameters are needed to represent covariance via rank-K PCA versus full covariance?
- Ex. 118.14 · Modeling · Answer key
Explain what the first 3 PCs of the Brazilian yield curve represent. Why do these 3 factors explain ~99% of the variance?
- Ex. 118.15 · Application
Explain the difference between performing PCA with and without prior standardization (z-score). When should you NOT standardize?
- Ex. 118.16 · Proof
Show that projection onto the first K PCs minimizes the mean squared reconstruction error among all rank-K linear projections. What is the minimum error in terms of the eigenvalues?
- Ex. 118.17 · Application
Eigenvalues in descending order: 12, 8, 3, 1, 1, 1, 1, 1. Sketch the scree plot and identify the "elbow". How many PCs should you retain to capture 80% of the variance?
- Ex. 118.18 · Modeling
Describe the Eigenfaces method (Turk-Pentland 1991) for facial recognition using PCA. What reduction in dimensionality is achieved compared to the original pixel representation?
- Ex. 118.19 · Application
What is the conceptual difference between PCA and ICA (Independent Component Analysis)? In what type of problem is ICA necessary?
- Ex. 118.20 · Application
What happens when the covariance matrix is the identity? What does this imply for PCA and dimensionality reduction?
- Ex. 118.21 · Proof
Prove that eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal. Use this to justify the orthogonality of PCs.
- Ex. 118.22 · Modeling · Answer key
Explain what a PCA biplot shows. How do you interpret the direction and length of feature arrows and the position of samples?
- Ex. 118.23 · Application
Why is classical PCA sensitive to outliers? What is the idea of Robust PCA for handling this problem?
- Ex. 118.24 · Application
Dataset: N = 1000 samples, d = 100 standardized features. PCA with K = 5 PCs explains 80% variance. Compute the data compression factor (ratio between original and PCA storage).
- Ex. 118.25 · Challenge
Explain the idea of Kernel PCA. How does replacing the inner product with a kernel allow capturing non-linear structure? What is the computational complexity?
- Ex. 118.26 · Application
Explain the "dual trick" of PCA: when d > N (more features than samples), how do you compute PCA efficiently? What is the complexity in each case?
- Ex. 118.27 · Application
PCA of the 2023 ENEM microdata (5 grades: CN, CH, LC, MT, Essay) yields a PC1 with positive loadings of similar magnitude on all grades. Interpret PC1. What could PC2 represent?
- Ex. 118.28 · Proof
Prove that the sample variance of the k-th score equals the k-th eigenvalue of the covariance matrix. Use the connection with SVD.
- Ex. 118.29 · Modeling
In the 1000 Genomes Project, PCA of genomic data from ~2500 people across 26 populations reveals clusters by continent. Explain how this is possible and what the first 3 PCs represent genetically.
- Ex. 118.30 · Challenge
Describe the Probabilistic PCA model (Tipping-Bishop 1999). What are the advantages over classical PCA? How does this model reduce to classical PCA in a limiting case?
Sources
- Understanding Linear Algebra — David Austin · Grand Valley State University · CC-BY-SA · Chapter 7: PCA via SVD, explained variance, scree plot, applications.
- Introduction to Applied Linear Algebra (VMLS) — Stephen Boyd, Lieven Vandenberghe · Stanford University · CC-BY-NC-ND · Ch. 10: rigorous PCA theory, optimality, SVD connection, ML applications.
- OpenIntro Statistics — Diez, Çetinkaya-Rundel, Barr · CC-BY-SA · §8.3: statistical perspective, explained variance, component interpretation, real data exercises.