Quantitative Methods·Multiple Regression

Section: Multiple Regression Analysis

Estimated study time: 60 minutes

Content:

Multiple regression is a cornerstone of quantitative finance and is tested extensively at CFA Level 2. The general multiple regression model takes the form: Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk + epsilon, where Y is the dependent variable, X1 through Xk are independent variables (predictors), b0 is the intercept, b1 through bk are slope coefficients, and epsilon is the error term. The ordinary least squares (OLS) estimator minimizes the sum of squared residuals. At Level 2, candidates must interpret regression output from a standard regression table, assess model quality, and identify violations of the classical linear regression model (CLRM) assumptions.
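
As a minimal sketch of the OLS estimator described above, the following fits a two-predictor model to synthetic data. It assumes NumPy is available; the sample size, the "true" coefficients, and all variable names are illustrative assumptions rather than anything from the curriculum.

```python
import numpy as np

# Synthetic data (assumption): 200 observations, two predictors, known coefficients.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
true_b = np.array([0.5, 1.2, -0.7])            # b0, b1, b2 (illustrative values)
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=n)

# Prepend a column of ones so the first estimated coefficient is the intercept b0.
X_design = np.column_stack([np.ones(n), X])

# lstsq returns the coefficient vector that minimizes the sum of squared residuals.
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ b_hat
sse = float(residuals @ residuals)             # minimized sum of squared residuals
```

Because the model includes an intercept, the fitted residuals sum to (numerically) zero, and moving the coefficients away from b_hat can only increase the sum of squared residuals.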

The six CLRM assumptions are: (1) the relationship between the dependent and independent variables is linear in parameters; (2) the independent variables are not random and are not perfectly collinear with each other; (3) the expected value of the error term is zero — E(epsilon) = 0; (4) the variance of the error term is constant across all observations (homoskedasticity); (5) the error terms are uncorrelated with each other (no serial correlation); and (6) the error term is normally distributed. Violations of assumptions 4 and 5 — heteroskedasticity and serial correlation — are the most commonly tested at Level 2. These violations do not bias coefficient estimates but render standard errors (and therefore t-statistics and p-values) unreliable, leading to incorrect inference.

Evaluating regression model fit involves several key statistics. The coefficient of determination (R-squared) measures the proportion of the dependent variable's variance explained by the regression: R-squared = SSR/SST = 1 - SSE/SST, where SST is total sum of squares, SSR is regression (explained) sum of squares, and SSE is error (residual) sum of squares. Adjusted R-squared penalizes for adding independent variables that don't improve fit: Adjusted R-squared = 1 - [(1 - R-squared)(n-1)/(n-k-1)], where n is the number of observations and k is the number of independent variables. The F-statistic tests whether at least one slope coefficient is nonzero: F = (SSR/k) / (SSE/(n-k-1)) = MSR/MSE. Individual slope coefficients are tested with t-statistics: t = (b_hat - b_hypothesized) / (standard error of b_hat), with n-k-1 degrees of freedom.
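
The fit statistics above can be computed directly from the sums of squares. The sketch below is one way to do it in plain Python; the function names and the input values (SST = 100, SSE = 40, n = 50, k = 4, and the coefficient and standard error passed to the t-statistic) are hypothetical and chosen only for illustration.

```python
def fit_statistics(sst: float, sse: float, n: int, k: int) -> dict:
    """Return R-squared, adjusted R-squared, and the F-statistic.

    sst: total sum of squares; sse: error (residual) sum of squares;
    n: number of observations; k: number of independent variables.
    """
    ssr = sst - sse                      # regression (explained) sum of squares
    r2 = ssr / sst                       # equivalently 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    msr = ssr / k                        # mean square regression
    mse = sse / (n - k - 1)              # mean square error
    return {"r2": r2, "adj_r2": adj_r2, "f_stat": msr / mse}


def t_statistic(b_hat: float, b_hypothesized: float, std_error: float) -> float:
    """t-statistic for one slope coefficient (n - k - 1 degrees of freedom)."""
    return (b_hat - b_hypothesized) / std_error


# Hypothetical inputs, for illustration only.
stats = fit_statistics(sst=100.0, sse=40.0, n=50, k=4)
t_value = t_statistic(b_hat=1.5, b_hypothesized=0.0, std_error=0.5)
```

With these inputs, R-squared is 0.60, adjusted R-squared is about 0.564, the F-statistic is 16.875, and the t-statistic is 3.0.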

Heteroskedasticity occurs when the variance of the regression error term is not constant: it may increase with the size of the independent variable (conditional heteroskedasticity) or change over time. Consequences include incorrect standard errors (typically understated in conditional heteroskedasticity), inflated t-statistics, and spurious significance. Detection methods include the Breusch-Pagan test and visual inspection of residual plots. Correction methods include using heteroskedasticity-consistent (White) standard errors, or transforming variables (e.g., using log returns instead of price levels).

Serial correlation (autocorrelation) occurs when regression residuals are correlated across observations, which is common in time series financial data. The Durbin-Watson statistic tests for first-order serial correlation: DW approximately equals 2*(1-r), where r is the correlation of residuals with lagged residuals. DW near 2 indicates no serial correlation; DW near 0 indicates positive serial correlation; DW near 4 indicates negative serial correlation.
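
To make the Durbin-Watson reading concrete, here is a small sketch in plain Python. It uses the standard definition of the statistic (sum of squared first differences of the residuals divided by the sum of squared residuals), which the text above gives only in its approximate form; the two residual series are hypothetical.

```python
def durbin_watson(residuals):
    """DW = sum of squared first differences of residuals / sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that drift slowly (positive serial correlation).
trending = [1.0, 0.8, 0.6, 0.3, -0.1, -0.4, -0.6, -0.5, -0.2, 0.1]

# Hypothetical residuals that flip sign every period (negative serial correlation).
alternating = [1.0, -1.0] * 5

dw_trending = durbin_watson(trending)        # well below 2
dw_alternating = durbin_watson(alternating)  # well above 2
```

The slowly drifting series produces a DW far below 2, and the sign-flipping series produces a DW close to 4, matching the interpretation rules above.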

Multicollinearity occurs when two or more independent variables are highly correlated with each other. Like violations of assumptions 4 and 5, multicollinearity does not bias coefficient estimates; its effect is to inflate their standard errors, making it harder to establish statistical significance even when variables are individually important. Indicators of multicollinearity include: a high F-statistic (significant overall regression) combined with low t-statistics for individual coefficients, high R-squared with few significant individual predictors, and large changes in coefficient estimates when variables are added or removed. The variance inflation factor (VIF) quantifies multicollinearity for each variable: VIF_j = 1/(1 - R-squared_j), where R-squared_j is the R-squared from regressing variable j on all other independent variables. VIF > 5 suggests problematic multicollinearity; VIF > 10 is severe. Solutions include removing one of the collinear variables or combining them into a composite factor.
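
The VIF formula above can be sketched as a loop of auxiliary regressions. This example assumes NumPy and uses synthetic predictors in which x3 is deliberately built as a noisy copy of x1; the function name, the data, and the seed are illustrative assumptions.

```python
import numpy as np


def variance_inflation_factors(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R2_j), with R2_j from regressing column j on the other columns."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # auxiliary regression with intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        sse = float(resid @ resid)
        sst = float(((target - target.mean()) ** 2).sum())
        r2_j = 1 - sse / sst
        vifs[j] = 1 / (1 - r2_j)
    return vifs


# Synthetic predictors (assumption): x3 is nearly a copy of x1, x2 is unrelated.
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
x3 = x1 + 0.1 * rng.normal(size=300)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
```

In this setup the VIFs for x1 and x3 come out far above 10, while the VIF for x2 stays close to 1, which is the pattern that would point to dropping or combining x1 and x3.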

Key Terms:

  • Ordinary Least Squares (OLS): The standard method for estimating regression coefficients by minimizing the sum of squared differences between observed and predicted values.
  • R-Squared: The proportion of the dependent variable's variation explained by the independent variables in the regression model; ranges from 0 to 1.
  • Adjusted R-Squared: A modified R-squared that penalizes for the number of independent variables, used to compare models with different numbers of predictors.
  • F-Statistic: A test statistic for the joint hypothesis that all slope coefficients equal zero; a significant F-test indicates that at least one predictor is useful.
  • Heteroskedasticity: A violation of CLRM assumption 4, where the variance of the error term is not constant across observations, distorting standard error estimates.
  • Serial Correlation (Autocorrelation): A violation of CLRM assumption 5, where regression error terms are correlated with each other, commonly occurring in time series data.
  • Multicollinearity: A condition where two or more independent variables are highly correlated, inflating coefficient standard errors and reducing statistical power.
  • Variance Inflation Factor (VIF): A statistic measuring the degree of multicollinearity for each independent variable; VIF > 5 indicates concern, VIF > 10 is severe.

Quiz Questions:

Q1. An analyst runs a multiple regression of monthly portfolio excess returns on three Fama-French factors: market excess return (MKT), size (SMB), and value (HML). The regression output shows R-squared = 0.72, adjusted R-squared = 0.705, and an F-statistic of 48.0 with a p-value < 0.001. The t-statistics for MKT, SMB, and HML are 9.2, 1.1, and 0.8, respectively, and the sample includes 60 months. What is the most likely conclusion from this output?

A) The model explains no variation in returns because two of three variables are insignificant. B) The overall model is statistically significant, but SMB and HML individually are not significant contributors at conventional significance levels, suggesting possible multicollinearity or irrelevance. C) The model is misspecified because adjusted R-squared is lower than R-squared. D) Serial correlation is present because the F-statistic exceeds 30.

Answer: B — The high F-statistic (p < 0.001) confirms the overall model is statistically significant. However, only MKT (t = 9.2) is individually significant; SMB and HML have low t-statistics (1.1 and 0.8), suggesting they may not be contributing independently. This pattern — high F with low individual t-statistics — is a classic sign of multicollinearity, or the variables may simply not be relevant for this portfolio. Adjusted R-squared being lower than R-squared is always true (it is designed that way), not a sign of misspecification.

---

Q2. A regression of stock returns on earnings surprise and trading volume yields a Durbin-Watson statistic of 0.62. The sample includes 120 monthly observations with 2 independent variables. What does this indicate and what is the appropriate remediation?

A) DW = 0.62 suggests no serial correlation; no remediation is needed. B) DW = 0.62 suggests positive serial correlation in the residuals; the analyst should use Newey-West (HAC) standard errors or include lagged values of the dependent variable. C) DW = 0.62 suggests negative serial correlation; the analyst should use the Prais-Winsten transformation. D) DW = 0.62 is inconclusive and requires a Breusch-Pagan test for confirmation.

Answer: B — The Durbin-Watson statistic ranges from 0 to 4, with DW near 2 indicating no serial correlation, DW near 0 indicating positive serial correlation, and DW near 4 indicating negative serial correlation. DW = 0.62 is far below 2, indicating positive serial correlation. Common corrections include using heteroskedasticity- and autocorrelation-consistent (HAC or Newey-West) standard errors, adding lagged dependent variables, or using generalized least squares. The Breusch-Pagan test is for heteroskedasticity, not serial correlation.
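
As a rough check, rearranging the approximation DW ≈ 2(1 - r) gives the first-order residual autocorrelation implied by the reported statistic. This is only the large-sample approximation, not an exact estimate:

```latex
\[
r \approx 1 - \frac{DW}{2} = 1 - \frac{0.62}{2} = 0.69
\]
```

An implied residual autocorrelation of roughly 0.69 is strongly positive, consistent with answer B.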

---

Q3. A regression model estimates the return on a bond portfolio as a function of changes in the yield curve slope and credit spread. The output shows an intercept of 0.003, a slope coefficient on yield curve change of -1.42 (t-stat = -4.1), and a slope on credit spread of 0.87 (t-stat = 0.9). The p-value for credit spread is 0.37. At the 5% significance level, which conclusion is most appropriate?

A) Both variables are statistically significant and should be retained. B) Yield curve change is statistically significant; credit spread is not statistically significant and may be removed or further examined. C) Neither variable is significant because the overall model needs an F-test first. D) The negative coefficient on yield curve change indicates model misspecification.

Answer: B — At the 5% significance level, a variable is significant if its p-value < 0.05 (equivalently, |t-stat| > critical value, approximately 2 for large samples). Yield curve change has |t| = 4.1, clearly significant. Credit spread has |t| = 0.9 and p = 0.37, far from significant; we fail to reject the null that its coefficient is zero. This does not mean the variable is unimportant conceptually, but statistically it does not appear to add explanatory power in this model. A negative coefficient is not in itself evidence of misspecification; for a bond portfolio, a negative sensitivity to yield curve changes is economically plausible.

---

Q4. An analyst suspects conditional heteroskedasticity in a regression of stock volatility on earnings variability and market cap. She plots the squared residuals against the fitted values and observes a clear upward fan shape. She then runs a Breusch-Pagan test and obtains a chi-squared statistic of 18.3 with 2 degrees of freedom (p-value = 0.0001). What are the consequences of conditional heteroskedasticity and the preferred correction?

A) Coefficient estimates are biased; the analyst must re-estimate the model with different variables. B) Coefficient estimates remain unbiased and consistent, but standard errors are incorrect, leading to unreliable t-statistics; the analyst should use heteroskedasticity-consistent (White) standard errors. C) The model R-squared is inflated and must be recalculated. D) The F-statistic is unaffected by heteroskedasticity.

Answer: B — Heteroskedasticity does not bias OLS coefficient estimates; they remain unbiased and consistent. However, the standard errors of the coefficients are incorrect (typically understated when heteroskedasticity is positively related to variable size), making t-statistics unreliable. This can lead to spurious significance. The standard correction is to use heteroskedasticity-consistent (White) standard errors, which adjust the estimated coefficient variances for the non-constant error variance while leaving the coefficient estimates unchanged. The F-test is also unreliable under heteroskedasticity, so D is incorrect; R-squared is computed from sums of squares rather than standard errors, so it is not inflated and C is incorrect.
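
A sketch of how a Breusch-Pagan statistic like the one in this question might be produced, assuming NumPy and the n × R-squared form of the test (squared residuals regressed on the independent variables, compared with a chi-squared distribution with k degrees of freedom). The data are synthetic, with the error standard deviation deliberately made proportional to x1; all names and values are illustrative assumptions.

```python
import math

import numpy as np


def ols_r_squared(design: np.ndarray, target: np.ndarray) -> float:
    """R-squared from an OLS fit of target on design (design includes the intercept)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    sst = ((target - target.mean()) ** 2).sum()
    return float(1 - (resid @ resid) / sst)


# Synthetic data (assumption): error standard deviation grows with x1.
rng = np.random.default_rng(2)
n = 500
x1 = rng.uniform(1.0, 5.0, size=n)
x2 = rng.normal(size=n)
errors = rng.normal(size=n) * x1
y = 1.0 + 0.5 * x1 - 0.3 * x2 + errors

# Step 1: fit the original regression and keep the squared residuals.
design = np.column_stack([np.ones(n), x1, x2])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
squared_resid = (y - design @ coef) ** 2

# Step 2: auxiliary regression of squared residuals on the same predictors.
bp_stat = n * ols_r_squared(design, squared_resid)

# Step 3: with exactly 2 degrees of freedom, the chi-squared upper-tail
# probability has the closed form exp(-x / 2).
p_value = math.exp(-bp_stat / 2)

# The same closed form applied to the statistic quoted in the question.
p_quiz = math.exp(-18.3 / 2)   # about 0.0001
```

The closed-form p-value applies only to the two-predictor case used here; for other degrees of freedom a chi-squared table or a statistics library would be needed.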

---

Q5. In a cross-sectional regression of 200 stocks' annual returns on three factors — price-to-earnings ratio, 12-month momentum, and analyst coverage — the analyst computes VIF values of 1.2, 1.5, and 7.8 respectively for the three factors. What is the most appropriate interpretation and action?

A) All VIF values are below 10; multicollinearity is not a concern for any variable. B) Analyst coverage (VIF = 7.8) shows evidence of problematic multicollinearity with one or more other predictors; the analyst should investigate the correlation structure and consider removing or transforming this variable. C) Only variables with VIF > 10 require attention; no action is needed at this time. D) VIF values cannot be used in cross-sectional regressions; they apply only to time series.

Answer: B — VIF > 5 is generally considered a warning sign of multicollinearity; VIF > 10 is considered severe. The analyst coverage variable at VIF = 7.8 suggests it is substantially correlated with one or more of the other predictors. This will inflate the standard error of the analyst coverage coefficient, making it harder to establish its statistical significance even if it genuinely matters. The analyst should examine pairwise correlations, consider removing one of the correlated variables, or use a dimensionality reduction technique. VIF applies to any regression, including cross-sectional.
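
Inverting the VIF formula shows how much of analyst coverage is explained by the other two factors. The second line uses the standard result, not stated in the text above, that a coefficient's standard error is inflated by the square root of its VIF relative to the case of uncorrelated predictors:

```latex
\[
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} = 1 - \frac{1}{7.8} \approx 0.872
\]
\[
\sqrt{\mathrm{VIF}_j} = \sqrt{7.8} \approx 2.79
\]
```

Roughly 87% of the variation in analyst coverage is explained by the other predictors, and its standard error is about 2.8 times what it would be without that overlap.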

---