New Zealand Statistical Association 2024 Conference


Alan Welsh - Plenary Session 1

Australian National University

Double descent and noise in fitting linear regression models


This is joint work with Insha Ullah.

"Double descent" is used in statistical machine learning to describe the fact that models with more parameters than observations can have better predictive performance (as measured by the test error) than models with fewer parameters than observations. This challenge to the belief that simpler models are generally better, implies we need a rethink of fundamental statistical ideas. We explore the effects of including noise predictors and noise observations when fitting linear regression models. We present empirical and theoretical results that show that double descent occurs in both cases, albeit with contradictory implications: the implication for noise predictors is that complex models are often better than simple ones, while the implication for noise observations is that relatively simple models are often better than complex ones. That is, double descent is not just a high-dimensional big data/machine learning phenomenon but can also occur in small datasets fitted with simple statistical models. We resolve this contradiction by showing that it is not the model complexity but rather the implicit shrinkage by the inclusion of noise in the model that drives the double descent. We also show that including noise observations in the model makes the (usually unbiased) ordinary least squares estimator biased and indicates that the ridge regression estimator may need a negative ridge parameter to avoid over-shrinkage.

