Quantitative Methods · Machine Learning

Section: Machine Learning Methods in Finance

Estimated study time: 60 minutes

Content:

Machine learning (ML) has become an increasingly important topic in CFA Level 2, reflecting the growing role of data science in investment management. The CFA curriculum distinguishes between supervised learning (where the model learns a mapping from labeled input data to outputs), unsupervised learning (where the model finds structure in unlabeled data), and deep learning (multi-layer neural networks). At Level 2, candidates must understand the conceptual underpinnings of key algorithms, their applications in finance, their limitations, and the overfitting problem. They are not expected to implement algorithms mathematically, but they must be able to evaluate whether a method is used appropriately and interpret its results in vignette scenarios.

Supervised learning methods include linear regression (for continuous outcomes), logistic regression (for binary classification, such as predicting default vs. non-default), classification and regression trees (CART), random forests, gradient boosting (e.g., XGBoost), and support vector machines (SVM). In finance, supervised learning applications include predicting credit default, classifying market regimes, forecasting earnings surprises, and identifying alpha signals. A critical concept is the bias-variance tradeoff. Simple models (e.g., linear regression) have high bias and low variance, so they tend to underfit. Complex models (e.g., deep neural networks with many layers) have low bias and high variance, so they tend to overfit the training data and generalize poorly to new data. Regularization techniques such as LASSO (an L1 penalty on the sum of absolute coefficient values) and ridge regression (an L2 penalty on the sum of squared coefficients) reduce variance by shrinking coefficients, at the cost of slightly increased bias.
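
As a sketch of the two penalties described above, the penalized least-squares objectives can be written as follows. The notation is assumed here for illustration and is not taken from the curriculum text: y_i is the outcome for observation i, x_ij is feature j, b_j is its coefficient, n is the number of observations, K is the number of features, and lambda >= 0 is the penalty parameter.

```latex
\text{Ridge:}\quad \min_{b_0,\dots,b_K}\ \sum_{i=1}^{n}\Bigl(y_i - b_0 - \sum_{j=1}^{K} b_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{K} b_j^2

\text{LASSO:}\quad \min_{b_0,\dots,b_K}\ \sum_{i=1}^{n}\Bigl(y_i - b_0 - \sum_{j=1}^{K} b_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{K} \lvert b_j \rvert
```

Setting lambda to zero recovers ordinary least squares in both cases; larger values of lambda place more weight on keeping coefficients small relative to fitting the training data.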

The training, validation, and test set framework is fundamental to obtaining an honest estimate of how an ML model will perform on new data. The full dataset is divided into a training set (used to fit the model), a validation set (used to tune hyperparameters and compare alternative models), and a test set (held out entirely until final evaluation). In finance, time series data requires a chronological split: the test set must always be the most recent data, not a random sample, because training on observations that postdate the test period introduces look-ahead bias. Cross-validation for time series uses walk-forward validation (with a rolling or expanding training window) rather than standard k-fold cross-validation, so that temporal ordering is preserved. Performance metrics should match the task: mean squared error (MSE) for regression; accuracy, precision, recall, and the area under the ROC curve (AUC) for classification.
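
The walk-forward idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `walk_forward_splits`, the fixed-length rolling training window, and the window sizes are all hypothetical choices made for the example.

```python
from typing import Iterator, List, Tuple


def walk_forward_splits(
    n_obs: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs for a rolling-window walk-forward scheme.

    Every test index is strictly later than every training index in its fold,
    so no fold trains on data from after its own test period.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + test_size))
        yield train_idx, test_idx
        start += test_size  # roll the window forward by one test block


# Hypothetical example: 12 time-ordered observations, 6 to train, 2 to test
folds = list(walk_forward_splits(n_obs=12, train_size=6, test_size=2))
for train_idx, test_idx in folds:
    print(f"train {train_idx[0]}-{train_idx[-1]}  test {test_idx[0]}-{test_idx[-1]}")
```

An expanding-window variant would keep the training start fixed at zero and let the training set grow with each fold; which variant is preferable depends on how quickly older data is thought to lose relevance.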

Unsupervised learning includes clustering (k-means, hierarchical clustering) and dimensionality reduction (principal components analysis, PCA). In finance, k-means clustering groups stocks or economic periods into regimes without predefined labels — for example, identifying bull, bear, and sideways market regimes from a multi-factor dataset. PCA reduces high-dimensional factor data to a smaller number of principal components that capture the most variance, useful for constructing factor models or compressing macroeconomic datasets. An important caveat is that unsupervised methods produce outputs that require human interpretation — the algorithm identifies patterns, but the analyst must determine whether those patterns are economically meaningful or statistical artifacts.
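
To make the clustering step concrete, here is a minimal k-means sketch (Lloyd's algorithm) in plain Python. Everything specific in it is an assumption for illustration: the nine (GDP growth, inflation) observations are made-up numbers, the starting centroids are hand-picked, and the function names are hypothetical. Note that the algorithm returns only cluster numbers; any economic name for a cluster has to come from the analyst.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def squared_distance(a: Point, b: Point) -> float:
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points: Sequence[Point], initial_centroids: Sequence[Point], n_iter: int = 20):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    centroids: List[Point] = list(initial_centroids)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


# Made-up (GDP growth %, inflation %) observations, for illustration only
points = [
    (3.0, 2.0), (2.8, 2.2), (3.2, 1.9),
    (-1.5, 1.0), (-2.0, 0.8), (-1.8, 1.2),
    (0.5, 7.0), (0.2, 7.5), (0.4, 6.8),
]
labels, centroids = k_means(points, initial_centroids=[points[0], points[3], points[6]])
print(labels)
print(centroids)
```

Real implementations typically standardize the inputs first and rerun the algorithm from several random starting points, since the result can depend on both the scale of the variables and the initialization.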

Overfitting is the central risk in ML applications to finance. A model overfit to historical data will have high in-sample R-squared but poor out-of-sample performance. Symptoms include dramatic performance decay from backtested to live trading results. Techniques to mitigate overfitting include: regularization (LASSO/ridge), limiting model complexity (reducing tree depth or number of layers), early stopping in neural network training, ensemble methods (random forests average many trees, reducing variance), and proper out-of-sample testing. In the context of investment backtests, data mining bias (or data snooping) occurs when many strategies are tested on the same dataset and only the best is reported — the reported performance overstates expected future performance because some of the strategy's apparent alpha is actually sampling error. The required adjustment is a correction for multiple testing or use of an independent out-of-sample dataset.
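
Data mining bias can be illustrated with a small simulation. The sketch below generates strategies whose true expected excess return is zero by construction and then picks the one with the best realized Sharpe ratio. All parameters are assumptions chosen for illustration (200 strategies, 240 months, 4% monthly volatility, normally distributed returns, a fixed random seed), and the function names are hypothetical.

```python
import random
import statistics


def annualized_sharpe(monthly_excess_returns):
    """Annualized Sharpe ratio from monthly excess returns: mean / stdev * sqrt(12)."""
    mean = statistics.mean(monthly_excess_returns)
    stdev = statistics.stdev(monthly_excess_returns)
    return mean / stdev * 12 ** 0.5


def simulate_sharpes(n_strategies=200, n_months=240, monthly_vol=0.04, seed=0):
    """Realized Sharpe ratios of strategies that have no true edge (expected return of zero)."""
    rng = random.Random(seed)
    return [
        annualized_sharpe([rng.gauss(0.0, monthly_vol) for _ in range(n_months)])
        for _ in range(n_strategies)
    ]


sharpes = simulate_sharpes()
best = max(sharpes)
average = statistics.mean(sharpes)
print(f"average Sharpe across all strategies: {average:.2f}")
print(f"best Sharpe (the one that would be reported): {best:.2f}")
```

Because every simulated strategy is pure noise, whatever Sharpe ratio the best one shows is sampling error. Reporting only that winner is the selection step that creates the bias.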

Key Terms:

  • Supervised Learning: ML methods where models are trained on labeled data (input-output pairs) to learn a predictive mapping; includes regression and classification algorithms.
  • Unsupervised Learning: ML methods that identify structure in unlabeled data without predefined outputs; includes clustering and dimensionality reduction.
  • Bias-Variance Tradeoff: The fundamental tension between model flexibility (low bias, high variance — overfitting) and model simplicity (high bias, low variance — underfitting).
  • Overfitting: When a model learns the noise in training data rather than the underlying signal, resulting in poor out-of-sample performance.
  • Regularization: Techniques (LASSO, ridge) that penalize model complexity to reduce variance and mitigate overfitting.
  • Cross-Validation: A technique for estimating model performance using held-out data subsets; for time series, walk-forward validation preserves temporal order.
  • Principal Components Analysis (PCA): A dimensionality reduction technique that transforms correlated variables into uncorrelated principal components capturing maximum variance.
  • Data Mining Bias (Data Snooping): The inflation of apparent model performance from testing many hypotheses on the same dataset and reporting only the best result.

Quiz Questions:

Q1. A quant analyst builds a gradient boosting model to predict next-month stock returns using 50 financial and technical features. The model achieves an in-sample R-squared of 0.82 on training data but only 0.03 on a held-out test set. What does this pattern most likely indicate and what is the appropriate response?

A) The model is underfit; the analyst should add more features.
B) The model is overfit; the analyst should apply regularization, reduce model complexity, or use a simpler algorithm, and re-evaluate on an independent test set.
C) The low test R-squared indicates the model is wrong about the direction of returns; the analyst should reverse the predictions.
D) The model is correctly calibrated; an R-squared of 0.03 is typical for monthly return predictions.

Answer: B — The dramatic gap between in-sample R-squared (0.82) and out-of-sample R-squared (0.03) is a hallmark of severe overfitting. The model has memorized training data noise rather than learning generalizable signals. Appropriate responses include reducing model complexity (e.g., limiting tree depth, reducing the number of features), applying LASSO or ridge regularization, using ensemble averaging, or switching to a simpler model class. An R-squared of 0.03 might be realistic for a well-calibrated model (Option D), but the gap, not the level, is the diagnostic signal here.
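
The in-sample versus out-of-sample gap can be reproduced with a deliberately extreme toy model. The sketch below does not use gradient boosting; it swaps in a one-nearest-neighbor predictor, which memorizes its training set, and fits it to data where the feature and the target are independent noise. The data, seed, and function names are assumptions for illustration.

```python
import random


def r_squared(actual, predicted):
    """R-squared as 1 - SS_res / SS_tot; it can be negative out of sample."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def one_nn_predict(train_x, train_y, query_x):
    """Predict each query point with the target of its nearest training point."""
    predictions = []
    for q in query_x:
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - q))
        predictions.append(train_y[nearest])
    return predictions


rng = random.Random(42)
# Feature and target are independent noise, so there is nothing real to learn
train_x = [rng.random() for _ in range(200)]
train_y = [rng.gauss(0.0, 1.0) for _ in range(200)]
test_x = [rng.random() for _ in range(200)]
test_y = [rng.gauss(0.0, 1.0) for _ in range(200)]

in_sample_r2 = r_squared(train_y, one_nn_predict(train_x, train_y, train_x))
out_of_sample_r2 = r_squared(test_y, one_nn_predict(train_x, train_y, test_x))
print(f"in-sample R-squared:     {in_sample_r2:.2f}")
print(f"out-of-sample R-squared: {out_of_sample_r2:.2f}")
```

The model reproduces its training targets perfectly because each training point is its own nearest neighbor, yet that fit carries no information about new observations. A flexible model such as a deep boosted ensemble can fail in the same way, only less visibly.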

---

Q2. An investment firm applies k-means clustering to 10 years of monthly macroeconomic data (GDP growth, inflation, credit spreads, yield curve slope) to identify market regimes. The algorithm identifies three clusters, which the research team labels "expansion," "stagflation," and "recession." The team then constructs a strategy that rotates between equities and bonds based on the identified regime. Which of the following is a key limitation of this approach?

A) K-means clustering cannot handle more than two input variables.
B) The algorithm identifies patterns but the economic labels are imposed by the human analyst, and the patterns may not be stable out-of-sample.
C) Unsupervised methods cannot be used with macroeconomic data.
D) K-means clustering always produces exactly three clusters regardless of the data.

Answer: B — K-means clustering identifies mathematical clusters in the data but does not provide economic meaning. The labels ("expansion," "stagflation," "recession") are interpretations added by the analyst and may not accurately characterize what the algorithm found. Furthermore, clusters derived from historical data may not persist in future regimes, and the algorithm is sensitive to initialization and the choice of k. These are fundamental limitations of applying unsupervised learning to financial strategy construction.

---

Q3. An analyst wants to build a model to predict corporate bond default within one year (1 = default, 0 = no default). She has 10,000 historical observations, of which only 200 experienced default. She trains a logistic regression model and reports 98% accuracy. Which of the following best evaluates the model's performance?

A) 98% accuracy is excellent and the model is ready for deployment.
B) The 98% accuracy is misleading because a model that predicts "no default" for every observation would also achieve 98% accuracy; precision, recall, and AUC-ROC are more informative metrics for imbalanced classification.
C) The model is overfit because accuracy exceeds 95%.
D) Logistic regression is inappropriate for binary classification and should be replaced with linear regression.

Answer: B — With only 200 defaults out of 10,000 observations (2% base rate), a naive model that always predicts "no default" achieves 98% accuracy without any predictive skill. This is the class imbalance problem. For imbalanced classification, relevant metrics are precision (of predicted defaults, how many actually defaulted), recall (of actual defaults, how many did the model catch), F1-score (harmonic mean of precision and recall), and AUC-ROC (the model's ability to discriminate between default and non-default across all probability thresholds). Logistic regression is an appropriate and common tool for binary classification.
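
The arithmetic behind this answer can be checked directly. The sketch below scores the naive "always no default" rule on the 200-in-10,000 sample from the question, and compares it with a second set of predictions that is purely hypothetical (120 defaults caught, 300 false alarms), included only to show how the metrics move. The function name is also an assumption.

```python
from typing import Dict, Optional, Sequence


def classification_metrics(
    actual: Sequence[int], predicted: Sequence[int]
) -> Dict[str, Optional[float]]:
    """Accuracy, precision, and recall for binary labels (1 = default, 0 = no default)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision is undefined when the model never predicts a default
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# 200 defaults followed by 9,800 non-defaults, as in the question
actual = [1] * 200 + [0] * 9800

# Naive rule: predict "no default" for every bond
naive = classification_metrics(actual, [0] * 10_000)

# Hypothetical model output: catches 120 of 200 defaults, raises 300 false alarms
hypothetical_predictions = [1] * 120 + [0] * 80 + [1] * 300 + [0] * 9500
hypothetical = classification_metrics(actual, hypothetical_predictions)

print("naive:       ", naive)
print("hypothetical:", hypothetical)
```

The naive rule scores higher on accuracy than the hypothetical model even though it never identifies a single default, which is why accuracy alone is a poor guide when classes are this imbalanced.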

---

Q4. A portfolio manager uses LASSO regularization when building a factor model for stock selection. As the LASSO penalty parameter lambda increases, which of the following best describes the effect on the model?

A) Model variance decreases and some coefficients are shrunk to exactly zero, effectively performing variable selection at the cost of increased bias.
B) Model variance increases and all coefficients grow larger.
C) LASSO increases model complexity, improving in-sample fit.
D) LASSO cannot be applied to factor models; it is only appropriate for time series.

Answer: A — LASSO (Least Absolute Shrinkage and Selection Operator) adds an L1 penalty (lambda times the sum of absolute coefficient values) to the regression objective. As lambda increases, coefficients are shrunk further toward zero, and a growing number reach exactly zero, which effectively removes those variables from the model. This reduces model complexity and variance (less overfitting) at the cost of increased bias. The variable selection property of LASSO distinguishes it from ridge regression, which shrinks coefficients toward zero but generally does not set them to exactly zero. LASSO is widely applicable to factor models and cross-sectional regressions.
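
The difference between the two penalties is easiest to see in a simplified special case. The sketch below assumes orthonormal regressors and objectives scaled as (1/2) x RSS + lambda x sum|b| for LASSO and (1/2) x RSS + (lambda/2) x sum(b^2) for ridge; under those assumptions the LASSO solution is a soft-thresholding of each OLS coefficient and the ridge solution is a proportional shrinkage. The four OLS coefficients and the function names are made up for illustration.

```python
def lasso_shrink(ols_coef: float, lam: float) -> float:
    """Soft-thresholding: move the coefficient toward zero by lam, stopping at zero."""
    magnitude = max(abs(ols_coef) - lam, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if ols_coef > 0 else -magnitude


def ridge_shrink(ols_coef: float, lam: float) -> float:
    """Proportional shrinkage: the coefficient gets smaller but keeps a nonzero value."""
    return ols_coef / (1.0 + lam)


# Hypothetical OLS factor loadings, for illustration only
ols_coefs = [0.9, -0.4, 0.15, -0.05]
lambdas = [0.0, 0.1, 0.5]

zero_counts = {}
for lam in lambdas:
    lasso = [lasso_shrink(b, lam) for b in ols_coefs]
    ridge = [ridge_shrink(b, lam) for b in ols_coefs]
    zero_counts[lam] = sum(1 for b in lasso if b == 0.0)
    print(f"lambda={lam}: LASSO={[round(b, 3) for b in lasso]}  ridge={[round(b, 3) for b in ridge]}")
```

With correlated regressors there is no such closed form and the coefficients are found numerically, but the qualitative behavior (more exact zeros under LASSO as lambda rises) is the same.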

---

Q5. A research team backtests 200 different quantitative trading strategies on the same 20-year dataset and selects the strategy with the highest Sharpe ratio of 1.8. They present this to investors as evidence of the strategy's strength. Which concept best describes the primary concern with this approach?

A) Survivorship bias, because losing strategies are not represented.
B) Data mining bias (multiple testing problem), because the best result from 200 trials is expected to be inflated; the true expected Sharpe ratio of the selected strategy is likely much lower.
C) Look-ahead bias, because the analyst used future price data.
D) Sample selection bias, because the 20-year period does not represent all market conditions.

Answer: B — When 200 strategies are tested on the same dataset, the best result is expected to look good by chance even if none of the strategies has true predictive power. The probability of finding a strategy with an apparently strong Sharpe ratio increases with the number of trials. This is the multiple testing (data snooping) problem, and it biases the Sharpe ratio of the selected strategy upward. Proper corrections include tightening the significance threshold to reflect the number of strategies tested (for example, with a Bonferroni correction) or validating on a completely independent out-of-sample period. Survivorship bias (Option A) refers to analyzing a dataset from which failed funds or companies have dropped out, which is a different problem from the multiple-testing issue described here.
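
As a worked illustration of the Bonferroni adjustment, assume the team would have used a conventional 5% significance level for a single test. That level is an assumption for this example and is not given in the question; m is the number of strategies tested.

```latex
\alpha_{\text{adjusted}} = \frac{\alpha}{m} = \frac{0.05}{200} = 0.00025
```

Under this adjustment, the selected strategy would need a p-value below 0.025%, rather than 5%, to be judged statistically significant. The Bonferroni correction is conservative, so it is best read as a simple upper bound on how demanding the threshold should be.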

---