1. Agliari E, Alemanno F, Aquaro M, Fachechi A. Regularization, early-stopping and dreaming: A Hopfield-like setup to address generalization and overfitting. Neural Netw 2024; 177:106389. PMID: 38788291. DOI: 10.1016/j.neunet.2024.106389.
Abstract
In this work we approach attractor neural networks from a machine learning perspective: we look for optimal network parameters by applying gradient descent over a regularized loss function. Within this framework, the optimal neuron-interaction matrices turn out to be a class of matrices corresponding to Hebbian kernels revised by a reiterated unlearning protocol. Remarkably, the extent of such unlearning is proved to be related to the regularization hyperparameter of the loss function and to the training time. Thus, we can design strategies to avoid overfitting that are formulated in terms of regularization and early-stopping tuning. The generalization capabilities of these attractor networks are also investigated: analytical results are obtained for random synthetic datasets; next, the emerging picture is corroborated by numerical experiments that highlight the existence of several regimes (i.e., overfitting, failure and success) as the dataset parameters are varied.
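As a concrete companion to the abstract, the sketch below (not the authors' code; the sizes and the "dreaming time" t are illustrative, and the paper's mapping from the regularization hyperparameter and training time onto t is left abstract here) builds the Hebbian coupling matrix for random binary patterns, applies the closed-form unlearning revision used in the dreaming literature (written out in the docstring), and checks how pattern stability changes with t.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 20                               # neurons and stored patterns (illustrative sizes)

xi = rng.choice([-1.0, 1.0], size=(K, N))    # K random binary patterns
C = xi @ xi.T / N                            # K x K pattern-overlap (correlation) matrix

def coupling(t):
    """Hebbian kernel revised by reiterated unlearning ("dreaming"):
    J(t) = (1 + t)/N * xi^T (I + t C)^{-1} xi.
    t = 0 recovers the plain Hebbian rule; t -> infinity approaches the projector rule."""
    J = (1.0 + t) / N * xi.T @ np.linalg.solve(np.eye(K) + t * C, xi)
    np.fill_diagonal(J, 0.0)                 # no self-couplings
    return J

for t in (0.0, 1.0, 5.0, 50.0):
    J = coupling(t)
    stable = np.mean(np.sign(xi @ J) == xi)  # fraction of spins aligned with their local field
    print(f"t = {t:5.1f}: fraction of stable pattern spins = {stable:.3f}")
```

In the paper, t is not a free knob: its effective value is tied to the ridge penalty and to how long gradient descent runs, which is what turns regularization and early stopping into overfitting controls.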
Affiliation(s)
- E Agliari: Dipartimento di Matematica "Guido Castelnuovo", Sapienza Università di Roma, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy.
- F Alemanno: Dipartimento di Matematica, Università di Bologna, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy.
- M Aquaro: Dipartimento di Matematica "Guido Castelnuovo", Sapienza Università di Roma, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy.
- A Fachechi: Dipartimento di Matematica "Guido Castelnuovo", Sapienza Università di Roma, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy.
2. Bahri Y, Dyer E, Kaplan J, Lee J, Sharma U. Explaining neural scaling laws. Proc Natl Acad Sci U S A 2024; 121:e2311878121. PMID: 38913889. PMCID: PMC11228526. DOI: 10.1073/pnas.2311878121.
Abstract
The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origin and relationships between scaling exponents.
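A hedged, minimal illustration of what a resolution-limited scaling law looks like operationally (this is not the authors' pipeline; the curve, exponent, and noise level are made up for the example): generate a loss curve of the form L(D) = c·D^(-alpha) + L_inf and recover the exponent with a log-log fit.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic resolution-limited learning curve: L(D) = c * D**(-alpha) + L_inf (illustrative values).
alpha_true, c, L_inf = 0.35, 2.0, 0.05
D = np.logspace(2, 6, num=20)                    # dataset sizes
loss = c * D ** (-alpha_true) * np.exp(0.02 * rng.standard_normal(D.size)) + L_inf

# On log-log axes the excess loss is linear: log(L - L_inf) = log c - alpha * log D.
# In real experiments L_inf must itself be estimated; here it is assumed known for simplicity.
slope, _ = np.polyfit(np.log(D), np.log(loss - L_inf), deg=1)
print(f"fitted scaling exponent: {-slope:.3f}  (ground truth: {alpha_true})")
```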
Affiliation(s)
- Jared Kaplan: Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218.
- Utkarsh Sharma: Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218.
3. Gerace F, Krzakala F, Loureiro B, Stephan L, Zdeborová L. Gaussian universality of perceptrons with random labels. Phys Rev E 2024; 109:034305. PMID: 38632742. DOI: 10.1103/physreve.109.034305.
Abstract
While classical in many theoretical settings, in particular in works inspired by statistical physics, the assumption of Gaussian i.i.d. input data is often perceived as a strong limitation in the context of statistics and machine learning. In this study, we redeem this line of work in the case of generalized linear classification, also known as the perceptron model, with random labels. We argue that there is a large universality class of high-dimensional input data for which we obtain the same minimum training loss as for Gaussian data with corresponding data covariance. In the limit of vanishing regularization, we further demonstrate that the training loss is independent of the data covariance. On the theoretical side, we prove this universality for an arbitrary mixture of homogeneous Gaussian clouds. Empirically, we show that the universality also holds for a broad range of real data sets.
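The universality claim lends itself to a quick numerical sanity check. The sketch below uses simplified assumptions (square loss with a small ridge penalty standing in for a generic convex loss, scaled Rademacher inputs as the non-Gaussian data; none of this is the paper's exact protocol) and compares the minimum regularized training loss on random labels for non-Gaussian inputs and for Gaussian inputs with matching covariance.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 400, 800, 1e-2                    # samples, dimension, ridge strength (illustrative)

def min_train_loss(X, y, lam):
    """Minimize (1/2n) * ||y - Xw||^2 + (lam/2) * ||w||^2 and return the optimal objective value."""
    w = np.linalg.solve(X.T @ X / len(y) + lam * np.eye(X.shape[1]), X.T @ y / len(y))
    return 0.5 * np.mean((y - X @ w) ** 2) + 0.5 * lam * w @ w

y = rng.choice([-1.0, 1.0], size=n)           # random labels

scales = np.linspace(0.5, 1.5, d)             # anisotropic coordinate-wise scales
X_nongauss = rng.choice([-1.0, 1.0], size=(n, d)) * scales   # scaled Rademacher inputs
X_gauss = rng.standard_normal((n, d)) * scales                # Gaussian inputs with the same covariance

print("min training loss, non-Gaussian inputs:", round(min_train_loss(X_nongauss, y, lam), 4))
print("min training loss, Gaussian surrogate :", round(min_train_loss(X_gauss, y, lam), 4))
```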
Affiliation(s)
- Federica Gerace: International School of Advanced Studies (SISSA), Via Bonomea 265, 34136 Trieste, Italy; EPFL, Statistical Physics of Computation (SPOC) Laboratory, Rte Cantonale, 1015 Lausanne, Switzerland.
- Florent Krzakala: EPFL, Information, Learning and Physics (IdePHICS) Laboratory, Rte Cantonale, 1015 Lausanne, Switzerland.
- Bruno Loureiro: EPFL, Information, Learning and Physics (IdePHICS) Laboratory, Rte Cantonale, 1015 Lausanne, Switzerland; Département d'Informatique, École Normale Supérieure (ENS)-PSL & CNRS, F-75230 Paris Cedex 05, France.
- Ludovic Stephan: EPFL, Information, Learning and Physics (IdePHICS) Laboratory, Rte Cantonale, 1015 Lausanne, Switzerland.
- Lenka Zdeborová: EPFL, Statistical Physics of Computation (SPOC) Laboratory, Rte Cantonale, 1015 Lausanne, Switzerland.
4. Ruben BS, Pehlevan C. Learning Curves for Noisy Heterogeneous Feature-Subsampled Ridge Ensembles. arXiv 2024; arXiv:2307.03176v3. PMID: 37461424. PMCID: PMC10350086.
Abstract
Feature bagging is a well-established ensembling method which aims to reduce prediction variance by combining predictions of many estimators trained on subsets or projections of features. Here, we develop a theory of feature-bagging in noisy least-squares ridge ensembles and simplify the resulting learning curves in the special case of equicorrelated data. Using analytical learning curves, we demonstrate that subsampling shifts the double-descent peak of a linear predictor. This leads us to introduce heterogeneous feature ensembling, with estimators built on varying numbers of feature dimensions, as a computationally efficient method to mitigate double-descent. Then, we compare the performance of a feature-subsampling ensemble to a single linear predictor, describing a trade-off between noise amplification due to subsampling and noise reduction due to ensembling. Our qualitative insights carry over to linear classifiers applied to image classification tasks with realistic datasets constructed using a state-of-the-art deep learning feature map.
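To make the construction concrete, here is a small, illustrative feature-bagging ridge ensemble on equicorrelated Gaussian data (the sizes, correlation level, noise, and ridge strength below are assumptions for the example, not the paper's settings), compared against a single ridge predictor trained on all features.

```python
import numpy as np

rng = np.random.default_rng(3)
n_train, n_test, d, s = 100, 2000, 200, 0.3       # samples, features, equicorrelation (illustrative)
sigma, lam, k, n_members = 0.5, 1e-3, 60, 10       # label noise, ridge, features per member, members

# Equicorrelated features: covariance (1 - s) * I + s * 11^T.
L = np.linalg.cholesky((1 - s) * np.eye(d) + s * np.ones((d, d)))
w_star = rng.standard_normal(d) / np.sqrt(d)       # teacher weights

def sample(n):
    X = rng.standard_normal((n, d)) @ L.T
    return X, X @ w_star + sigma * rng.standard_normal(n)

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)

# Single predictor using all features.
w_full = ridge(X_tr, y_tr, lam)
err_full = np.mean((X_te @ w_full - y_te) ** 2)

# Feature-subsampled ensemble: each member sees k randomly chosen coordinates; predictions averaged.
preds = np.zeros(n_test)
for _ in range(n_members):
    idx = rng.choice(d, size=k, replace=False)
    w_m = ridge(X_tr[:, idx], y_tr, lam)
    preds += X_te[:, idx] @ w_m
preds /= n_members

print(f"test MSE, single ridge : {err_full:.3f}")
print(f"test MSE, ensemble     : {np.mean((preds - y_te) ** 2):.3f}")
```

Varying k relative to the number of training samples moves each member through its own interpolation threshold, which is the mechanism behind the shifted double-descent peak described in the abstract.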
Affiliation(s)
- Cengiz Pehlevan: Center for Brain Science; John A. Paulson School of Engineering and Applied Sciences; and Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA 02138.
5. Hanin B, Zlokapa A. Bayesian interpolation with deep linear networks. Proc Natl Acad Sci U S A 2023; 120:e2301345120. PMID: 37252994. PMCID: PMC10266010. DOI: 10.1073/pnas.2301345120.
Abstract
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through novel asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit.
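The "effective depth" invoked at the end of the abstract can be written compactly; a hedged restatement (notation and normalization are read off the abstract, so the paper's precise definition may differ in constants):

```latex
% L = number of hidden layers, P = number of data points, N = hidden-layer width.
\lambda_{\mathrm{eff}} \;=\; \frac{L\,P}{N}
```

According to the abstract, it is this single combination, rather than depth, width, or dataset size separately, that controls the structure of the posterior in the large-data limit.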
Affiliation(s)
- Boris Hanin: Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08540.
- Alexander Zlokapa: Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139; Google Quantum AI, Venice, CA 90291.
6. Okuno A, Yano K. A generalization gap estimation for overparameterized models via the Langevin functional variance. J Comput Graph Stat 2023. DOI: 10.1080/10618600.2023.2197488.
Affiliation(s)
- Akifumi Okuno: The Institute of Statistical Mathematics and RIKEN AIP.
7. DeBacker JR, McMillan GP, Martchenke N, Lacey CM, Stuehm HR, Hungerford ME, Konrad-Martin D. Ototoxicity prognostic models in adult and pediatric cancer patients: a rapid review. J Cancer Surviv 2023; 17:82-100. PMID: 36729346. DOI: 10.1007/s11764-022-01315-8.
Abstract
PURPOSE: A cornerstone of treatment for many cancers is the administration of platinum-based chemotherapies and/or ionizing radiation, which can be ototoxic. An accurate ototoxicity risk assessment would be useful for counseling, treatment planning, and survivorship follow-up in patients with cancer.
METHODS: This review evaluated the literature on predictive models for estimating a patient's risk of chemotherapy-related auditory injury, with the aim of accelerating development of computational approaches for the clinical management of ototoxicity in cancer patients. Of the 1195 articles identified in a PubMed search from 2010 forward, 15 studies met inclusion criteria for the review.
CONCLUSIONS: All but 1 study used an abstraction of the audiogram as the modeled outcome; however, specific outcome measures varied. Consistently used predictors were age, baseline hearing, cumulative cisplatin dose, and radiation dose to the cochlea. Just 5 studies were judged to have an overall low risk of bias. Future studies should attempt to minimize bias by following statistical best practices, including not selecting multivariate predictors based on univariate analysis, validating in independent cohorts, and clearly reporting the management of missing and censored data. Future modeling efforts should adopt a transdisciplinary approach to define a unified set of clinical, treatment, and/or genetic risk factors. Creating a flexible model that uses a common set of predictors to forecast the full post-treatment audiogram may accelerate work in this area. Such a model could be adapted for use in counseling, treatment planning, and follow-up by audiologists and oncologists, and could be incorporated into ototoxicity genetic association studies as well as clinical trials investigating otoprotective agents.
IMPLICATIONS FOR CANCER SURVIVORS: Improvements in the ability to model post-treatment hearing loss can help improve patient quality of life following cancer care. The improvements advocated in this review should accelerate advances in modeling the auditory impact of these treatments to support treatment planning and patient counseling during and after care.
Affiliation(s)
- J R DeBacker, G P McMillan, N Martchenke, H R Stuehm, M E Hungerford, D Konrad-Martin: VA RR&D National Center for Rehabilitative Auditory Research, VA Portland Health Care System, 3710 SW US Veterans Hospital Road (NCRAR - P5), Portland, OR 97239, USA; Oregon Health and Science University, Portland, OR, USA.
- C M Lacey: VA RR&D National Center for Rehabilitative Auditory Research, VA Portland Health Care System, 3710 SW US Veterans Hospital Road (NCRAR - P5), Portland, OR 97239, USA; University of Pittsburgh, Pittsburgh, PA, USA.
8. Anceschi N, Fasano A, Durante D, Zanella G. Bayesian Conjugacy in Probit, Tobit, Multinomial Probit and Extensions: A Review and New Results. J Am Stat Assoc 2023. DOI: 10.1080/01621459.2023.2169150.
Affiliation(s)
- Niccolò Anceschi: Department of Decision Sciences and Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy.
- Daniele Durante: Department of Decision Sciences and Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy.
- Giacomo Zanella: Department of Decision Sciences and Bocconi Institute for Data Science and Analytics, Bocconi University, Milan, Italy.
9. Ngampruetikorn V, Schwab DJ. Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality. Adv Neural Inf Process Syst 2022; 35:9784-9796. PMID: 37332888. PMCID: PMC10275337.
Abstract
Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. This puzzling contradiction necessitates new approaches to the study of overfitting. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental trade-off between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil information-theoretic analogs of double and multiple descent phenomena.
Affiliation(s)
- David J. Schwab: Initiative for the Theoretical Sciences, The Graduate Center, CUNY.
10. Ma X, Sardy S, Hengartner N, Bobenko N, Lin YT. A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks. Stat Comput 2022; 32:99. PMID: 36299529. PMCID: PMC9587964. DOI: 10.1007/s11222-022-10169-0.
Abstract
To fit sparse linear associations, a LASSO sparsity-inducing penalty with a single hyperparameter provably allows recovery of the important features (needles) with high probability in certain regimes, even if the sample size is smaller than the dimension of the input vector (haystack). More recently, learners known as artificial neural networks (ANNs) have shown great success in many machine learning tasks, in particular in fitting nonlinear associations. Small learning rates, the stochastic gradient descent algorithm, and large training sets help cope with the explosion in the number of parameters present in deep neural networks. Yet few ANN learners have been developed and studied to find needles in nonlinear haystacks. Driven by a single hyperparameter, our ANN learner, as in the sparse linear case, exhibits a phase transition in the probability of retrieving the needles, which we do not observe with other ANN learners. To select our penalty parameter, we generalize the universal threshold of Donoho and Johnstone (Biometrika 81(3):425-455, 1994); this rule outperforms cross-validation, which is conservative (too many false detections) and computationally expensive. In the spirit of simulated annealing, we propose a warm-start, sparsity-inducing algorithm to solve the high-dimensional, non-convex and non-differentiable optimization problem. We perform Monte Carlo experiments on simulated and real data to quantify the effectiveness of our approach.
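For orientation, the classical universal threshold of Donoho and Johnstone (1994) that the authors generalize has the following form for n observations with noise standard deviation sigma (the paper's generalization to nonlinear ANN learners differs in detail; this is only the standard starting point):

```latex
% Classical universal threshold for n i.i.d. N(0, sigma^2) noise terms.
\lambda_n \;=\; \sigma \sqrt{2 \log n}
```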
Affiliation(s)
- Xiaoyu Ma: Shandong University, Jinan, China; Department of Mathematics, University of Geneva, Geneva, Switzerland.
- Sylvain Sardy: Department of Mathematics, University of Geneva, Geneva, Switzerland.
- Nick Hengartner: Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, USA.
- Nikolai Bobenko: Department of Mathematics, University of Geneva, Geneva, Switzerland.
- Yen Ting Lin: Information Sciences Group, Los Alamos National Laboratory, Los Alamos, USA.
11. Montanari A, Zhong Y. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. Ann Stat 2022. DOI: 10.1214/22-aos2211.
Affiliation(s)
- Andrea Montanari: Department of Electrical Engineering and Department of Statistics, Stanford University.
- Yiqiao Zhong: Department of Electrical Engineering and Department of Statistics, Stanford University.
12. Chinot G, Löffler M, van de Geer S. On the robustness of minimum norm interpolators and regularized empirical risk minimizers. Ann Stat 2022. DOI: 10.1214/22-aos2190.
Affiliation(s)
- Geoffrey Chinot: Seminar for Statistics, Department of Mathematics, ETH Zürich.
13. Javanmard A, Soltanolkotabi M. Precise statistical analysis of classification accuracies for adversarial training. Ann Stat 2022. DOI: 10.1214/22-aos2180.
Affiliation(s)
- Adel Javanmard: Department of Data Sciences and Operations, University of Southern California.
- Mahdi Soltanolkotabi: Department of Electrical and Computer Engineering, University of Southern California.
14. Ariosto S, Pacelli R, Ginelli F, Gherardi M, Rotondo P. Universal mean-field upper bound for the generalization gap of deep neural networks. Phys Rev E 2022; 105:064309. PMID: 35854557. DOI: 10.1103/physreve.105.064309.
Abstract
Modern deep neural networks (DNNs) represent a formidable challenge for theorists: according to the commonly accepted probabilistic framework that describes their performance, these architectures should overfit due to the huge number of parameters to train, but in practice they do not. Here we employ results from replica mean-field theory to compute the generalization gap of machine learning models with quenched features, in the teacher-student scenario and for regression problems with a quadratic loss function. Notably, this framework includes the case of DNNs where the last layer is optimized given a specific realization of the remaining weights. We show how these results, combined with ideas from statistical learning theory, provide a stringent asymptotic upper bound on the generalization gap of fully trained DNNs as a function of the dataset size P. In particular, in the limit of large P and N_out (where N_out is the size of the last layer) with N_out ≪ P, the generalization gap approaches zero faster than 2N_out/P, for any choice of both architecture and teacher function. This result greatly improves existing bounds from statistical learning theory. We test our predictions on a broad range of architectures, from toy fully connected neural networks with few hidden layers to state-of-the-art deep convolutional neural networks.
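Restated in display form, the bound described in the abstract reads (with the generalization gap understood as test loss minus training loss; constants and exact conditions are in the paper):

```latex
% Asymptotic bound for large P and N_out with N_out << P, any architecture and teacher function.
\epsilon_g \;\lesssim\; \frac{2\,N_{\mathrm{out}}}{P}
```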
Affiliation(s)
- S Ariosto: Dipartimento di Scienza e Alta Tecnologia and Center for Nonlinear and Complex Systems, Università degli Studi dell'Insubria, Via Valleggio 11, 22100 Como, Italy; I.N.F.N. Sezione di Milano, Via Celoria 16, 20133 Milan, Italy.
- R Pacelli: Dipartimento di Scienza Applicata e Tecnologia, Politecnico di Torino, 10129 Turin, Italy.
- F Ginelli: Dipartimento di Scienza e Alta Tecnologia and Center for Nonlinear and Complex Systems, Università degli Studi dell'Insubria, Via Valleggio 11, 22100 Como, Italy; I.N.F.N. Sezione di Milano, Via Celoria 16, 20133 Milan, Italy.
- M Gherardi: I.N.F.N. Sezione di Milano, Via Celoria 16, 20133 Milan, Italy; Università degli Studi di Milano, Via Celoria 16, 20133 Milan, Italy.
- P Rotondo: I.N.F.N. Sezione di Milano, Via Celoria 16, 20133 Milan, Italy; Università degli Studi di Milano, Via Celoria 16, 20133 Milan, Italy.
15. Rocks JW, Mehta P. Memorizing without overfitting: Bias, variance, and interpolation in overparameterized models. Phys Rev Res 2022; 4:013201. PMID: 36713351. PMCID: PMC9879296. DOI: 10.1103/physrevresearch.4.013201.
Abstract
The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in both the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
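A minimal sketch of the interpolation transition described here, using an ordinary least-squares student on random linear projections (an illustrative stand-in; the paper's minimal models, sizes, and noise level differ): as the number of fit parameters p crosses the number of training samples, the training error reaches zero while the test error peaks and then descends again.

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, d, sigma = 100, 2000, 400, 0.2        # samples, input dim, label noise (illustrative)

w_star = rng.standard_normal(d) / np.sqrt(d)            # teacher weights
X_tr = rng.standard_normal((n_train, d))
X_te = rng.standard_normal((n_test, d))
y_tr = X_tr @ w_star + sigma * rng.standard_normal(n_train)
y_te = X_te @ w_star + sigma * rng.standard_normal(n_test)

# Student: p random linear features, fit by (minimum-norm) least squares.
for p in (20, 50, 90, 100, 110, 200, 400):
    F = rng.standard_normal((p, d)) / np.sqrt(d)        # random projection defining the features
    Phi_tr, Phi_te = X_tr @ F.T, X_te @ F.T
    w = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]    # minimum-norm solution once p >= n_train
    tr = np.mean((Phi_tr @ w - y_tr) ** 2)
    te = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"p = {p:4d}: train MSE = {tr:6.3f}, test MSE = {te:8.3f}")
```

The qualitative shape, training error pinned at zero past p = n_train while the test error spikes near the interpolation threshold and then falls, mirrors the phase transition described in the abstract.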
Affiliation(s)
- Jason W Rocks: Department of Physics, Boston University, Boston, Massachusetts 02215, USA.
- Pankaj Mehta: Department of Physics, Boston University, Boston, Massachusetts 02215, USA; Faculty of Computing and Data Sciences, Boston University, Boston, Massachusetts 02215, USA.
16. Chen X, Liu Q, Tong XT. Dimension independent excess risk by stochastic gradient descent. Electron J Stat 2022. DOI: 10.1214/22-ejs2055.
Affiliation(s)
- Xi Chen: Stern School of Business, New York University.
- Qiang Liu: School of Statistics and Management, Shanghai University of Finance and Economics.
- Xin T. Tong: Department of Mathematics, National University of Singapore.