1. Agliari E, Alemanno F, Aquaro M, Fachechi A. Regularization, early-stopping and dreaming: A Hopfield-like setup to address generalization and overfitting. Neural Netw 2024; 177:106389. PMID: 38788291. DOI: 10.1016/j.neunet.2024.106389.
Abstract
In this work we approach attractor neural networks from a machine learning perspective: we look for optimal network parameters by applying gradient descent over a regularized loss function. Within this framework, the optimal neuron-interaction matrices turn out to be a class of matrices corresponding to Hebbian kernels revised by a reiterated unlearning protocol. Remarkably, the extent of such unlearning is proved to be related to the regularization hyperparameter of the loss function and to the training time. Thus, we can design strategies to avoid overfitting that are formulated in terms of regularization and early-stopping tuning. The generalization capabilities of these attractor networks are also investigated: analytical results are obtained for random synthetic datasets; next, the emerging picture is corroborated by numerical experiments that highlight the existence of several regimes (i.e., overfitting, failure and success) as the dataset parameters are varied.
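A minimal illustrative sketch (not taken from the paper) of the setup described above: gradient descent on a ridge-regularized reconstruction loss for the coupling matrix of a small attractor network with random synthetic patterns. The loss choice, the regularization strength lam, the learning rate and the number of epochs are assumptions made for brevity.

import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 20
xi = rng.choice([-1.0, 1.0], size=(P, N))   # random synthetic patterns

J = np.zeros((N, N))                        # neuron-interaction matrix to be learned
lam = 0.1                                   # regularization hyperparameter (assumed value)
lr = 0.05                                   # gradient-descent step size
epochs = 200                                # training time; stopping earlier acts as a regularizer

for _ in range(epochs):
    # loss = (1/2P) * sum_mu ||xi_mu - J xi_mu||^2 + (lam/2) * ||J||_F^2
    err = xi - xi @ J.T                     # reconstruction error of each pattern
    grad = -(err.T @ xi) / P + lam * J      # gradient of the regularized loss
    J -= lr * grad
    np.fill_diagonal(J, 0.0)                # keep self-couplings at zero

# one step of retrieval dynamics from a corrupted cue of pattern 0
cue = xi[0] * rng.choice([1.0, -1.0], size=N, p=[0.9, 0.1])
print("overlap after one update:", np.sign(J @ cue) @ xi[0] / N)

Tuning lam and epochs plays the role of the regularization and early-stopping strategies discussed in the abstract.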
Affiliation(s)
- E Agliari
- Dipartimento di Matematica "Guido Castelnuovo", Sapienza Università di Roma, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy.
- F Alemanno
- Dipartimento di Matematica, Università di Bologna, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy
- M Aquaro
- Dipartimento di Matematica "Guido Castelnuovo", Sapienza Università di Roma, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy
- A Fachechi
- Dipartimento di Matematica "Guido Castelnuovo", Sapienza Università di Roma, Italy; GNFM-INdAM, Gruppo Nazionale di Fisica Matematica (Istituto Nazionale di Alta Matematica), Italy
2. Levine H, Tu Y. Machine learning meets physics: A two-way street. Proc Natl Acad Sci U S A 2024; 121:e2403580121. PMID: 38913898. PMCID: PMC11228530. DOI: 10.1073/pnas.2403580121.
Affiliation(s)
- Herbert Levine
- Center for Theoretical Biological Physics, Northeastern University, Boston, MA 02115
- Yuhai Tu
- IBM T. J. Watson Research Center, Yorktown Heights, New York, NY 10598
3. Sclocchi A, Wyart M. On the different regimes of stochastic gradient descent. Proc Natl Acad Sci U S A 2024; 121:e2316301121. PMID: 38377198. PMCID: PMC10907278. DOI: 10.1073/pnas.2316301121.
Abstract
Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data considered at each step, or batch size B, and the step size, or learning rate η. For small B and large η, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" T ≡ η/B. Yet this description is observed to break down for sufficiently large batches B ≳ B*, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these crossovers take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the B-η plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD, and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size B* separating regimes (i) and (ii) scales with the size P of the training set, with an exponent that characterizes the hardness of the classification problem.
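For intuition only, the following sketch runs minibatch SGD on a toy teacher-student perceptron and prints the effective temperature T = eta/B for a few (eta, B) pairs; the hinge loss, sizes and step counts are assumptions and do not reproduce the paper's phase diagram.

import numpy as np

rng = np.random.default_rng(1)
d, P = 50, 2000
teacher = rng.standard_normal(d)
X = rng.standard_normal((P, d)) / np.sqrt(d)
y = np.sign(X @ teacher)                                  # teacher-generated labels

def train(eta, B, steps=3000):
    w = rng.standard_normal(d)
    for _ in range(steps):
        idx = rng.integers(0, P, size=B)                  # minibatch of size B
        active = (y[idx] * (X[idx] @ w)) < 1.0            # hinge-loss margin condition
        grad = -(y[idx, None] * X[idx] * active[:, None]).mean(axis=0)
        w -= eta * grad                                   # learning-rate eta step
    return np.mean(np.sign(X @ w) == y)                   # training accuracy

for eta, B in [(0.5, 1), (0.5, 256), (0.01, 1)]:
    print(f"eta={eta}, B={B}, T=eta/B={eta / B:.4f}, acc={train(eta, B):.3f}")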
Affiliation(s)
- Antonio Sclocchi
- Institute of Physics, Ecole Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
- Matthieu Wyart
- Institute of Physics, Ecole Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland
4. Annesi BL, Lauditi C, Lucibello C, Malatesta EM, Perugini G, Pittorino F, Saglietti L. Star-Shaped Space of Solutions of the Spherical Negative Perceptron. Phys Rev Lett 2023; 131:227301. PMID: 38101365. DOI: 10.1103/physrevlett.131.227301.
Abstract
Empirical studies on the landscape of neural networks have shown that low-energy configurations are often found in complex connected structures, where zero-energy paths between pairs of distant solutions can be constructed. Here, we consider the spherical negative perceptron, a prototypical nonconvex neural network model framed as a continuous constraint satisfaction problem. We introduce a general analytical method for computing energy barriers in the simplex with vertex configurations sampled from the equilibrium. We find that in the overparametrized regime the solution manifold displays simple connectivity properties. There exists a large geodesically convex component that is attractive for a wide range of optimization dynamics. Inside this region we identify a subset of atypical high-margin solutions that are geodesically connected with most other solutions, giving rise to a star-shaped geometry. We analytically characterize the organization of the connected space of solutions and show numerical evidence of a transition, at larger constraint densities, where the aforementioned simple geodesic connectivity breaks down.
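A crude numerical counterpart of this geometry (illustrative only; the paper's barriers are computed analytically for equilibrium configurations): find two solutions of a small spherical negative perceptron with a perceptron-style rule and report the number of violated constraints along the geodesic joining them. Sizes, margin and the solver are assumptions.

import numpy as np

rng = np.random.default_rng(2)
N, P, kappa = 200, 300, -0.5                      # constraint density alpha = P/N, negative margin
X = rng.standard_normal((P, N)) / np.sqrt(N)

def energy(w):
    return int(np.sum(X @ w < kappa))             # number of violated margin constraints

def find_solution(steps=5000):
    w = rng.standard_normal(N)
    w *= np.sqrt(N) / np.linalg.norm(w)
    for _ in range(steps):
        violated = (X @ w) < kappa
        if not violated.any():
            break
        w += 0.5 * X[violated].sum(axis=0)        # push towards satisfying violated constraints
        w *= np.sqrt(N) / np.linalg.norm(w)       # project back onto the sphere of radius sqrt(N)
    return w

w1, w2 = find_solution(), find_solution()
theta = np.arccos(np.clip(w1 @ w2 / N, -1.0, 1.0))
for t in np.linspace(0.0, 1.0, 11):               # spherical (geodesic) interpolation
    w = (np.sin((1 - t) * theta) * w1 + np.sin(t * theta) * w2) / np.sin(theta)
    print(f"t={t:.1f}  violated constraints={energy(w)}")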
Affiliation(s)
- Clarissa Lauditi
- Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy
- Carlo Lucibello
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Bocconi Institute for Data Science and Analytics, 20136 Milano, Italy
- Enrico M Malatesta
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Bocconi Institute for Data Science and Analytics, 20136 Milano, Italy
- Gabriele Perugini
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Fabrizio Pittorino
- Bocconi Institute for Data Science and Analytics, 20136 Milano, Italy
- Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20125 Milano, Italy
- Luca Saglietti
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Bocconi Institute for Data Science and Analytics, 20136 Milano, Italy
5. Baldassi C, Malatesta EM, Perugini G, Zecchina R. Typical and atypical solutions in nonconvex neural networks with discrete and continuous weights. Phys Rev E 2023; 108:024310. PMID: 37723812. DOI: 10.1103/physreve.108.024310.
Abstract
We study the binary and continuous negative-margin perceptrons as simple nonconvex neural network models learning random rules and associations. We analyze the geometry of the landscape of solutions in both models and find important similarities and differences. Both models exhibit subdominant minimizers which are extremely flat and wide. These minimizers coexist with a background of dominant solutions which are composed of an exponential number of algorithmically inaccessible small clusters for the binary case (the frozen 1-RSB phase) or a hierarchical structure of clusters of different sizes for the spherical case (the full RSB phase). In both cases, when a certain threshold in constraint density is crossed, the local entropy of the wide flat minima becomes nonmonotonic, indicating a breakup of the space of robust solutions into disconnected components. This has a strong impact on the behavior of algorithms in binary models, which cannot access the remaining isolated clusters. For the spherical case the behavior is different: even beyond the disappearance of the wide flat minima, the remaining solutions are shown to always be surrounded by a large number of other solutions at any distance, up to capacity. Indeed, we exhibit numerical evidence that algorithms seem to find solutions up to the SAT/UNSAT transition, which we compute here using a 1RSB approximation. For both models, the generalization performance as a learning device is shown to be greatly improved by the existence of wide flat minimizers, even when trained in the highly underconstrained regime of very negative margins.
Affiliation(s)
- Carlo Baldassi
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Enrico M Malatesta
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Gabriele Perugini
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
- Riccardo Zecchina
- Department of Computing Sciences, Bocconi University, 20136 Milano, Italy
6. Baldassi C, Lauditi C, Malatesta EM, Pacelli R, Perugini G, Zecchina R. Learning through atypical phase transitions in overparameterized neural networks. Phys Rev E 2022; 106:014116. PMID: 35974501. DOI: 10.1103/physreve.106.014116.
Abstract
Current deep neural networks are highly overparameterized (up to billions of connection weights) and nonlinear. Yet they can fit data almost perfectly through variants of gradient descent algorithms and achieve unexpected levels of prediction accuracy without overfitting. These are formidable results that defy predictions of statistical learning and pose conceptual challenges for nonconvex optimization. In this paper, we use methods from statistical physics of disordered systems to analytically study the computational fallout of overparameterization in nonconvex binary neural network models, trained on data generated from a structurally simpler but "hidden" network. As the number of connection weights increases, we follow the changes of the geometrical structure of different minima of the error loss function and relate them to learning and generalization performance. A first transition happens at the so-called interpolation point, when solutions begin to exist (perfect fitting becomes possible). This transition reflects the properties of typical solutions, which however are in sharp minima and hard to sample. After a gap, a second transition occurs, with the discontinuous appearance of a different kind of "atypical" structures: wide regions of the weight space that are particularly solution dense and have good generalization properties. The two kinds of solutions coexist, with the typical ones being exponentially more numerous, but empirically we find that efficient algorithms sample the atypical, rare ones. This suggests that the atypical phase transition is the relevant one for learning. The results of numerical tests with realistic networks on observables suggested by the theory are consistent with this scenario.
Affiliation(s)
- Carlo Baldassi
- Artificial Intelligence Lab, Bocconi University, 20136 Milano, Italy
- Clarissa Lauditi
- Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy
- Rosalba Pacelli
- Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy
- Gabriele Perugini
- Artificial Intelligence Lab, Bocconi University, 20136 Milano, Italy
- Riccardo Zecchina
- Artificial Intelligence Lab, Bocconi University, 20136 Milano, Italy
7. Lucibello C, Pittorino F, Perugini G, Zecchina R. Deep learning via message passing algorithms based on belief propagation. Mach Learn Sci Technol 2022. DOI: 10.1088/2632-2153/ac7d3b.
Abstract
Message-passing algorithms based on the Belief Propagation (BP) equations constitute a well-known distributed computational scheme. They yield exact marginals on tree-like graphical models and have also proven to be effective in many problems defined on loopy graphs, from inference to optimization, from signal processing to clustering. The BP-based schemes are fundamentally different from stochastic gradient descent (SGD), on which the current success of deep networks is based. In this paper, we present and adapt to mini-batch training on GPUs a family of BP-based message-passing algorithms with a reinforcement term that biases distributions towards locally entropic solutions. These algorithms are capable of training multi-layer neural networks with performance comparable to SGD heuristics in a diverse set of experiments on natural datasets, including multi-class image classification and continual learning, while also yielding improved performance on sparse networks. Furthermore, they allow approximate Bayesian predictions that have higher accuracy than point-wise ones.
8. Chen G, Qu CK, Gong P. Anomalous diffusion dynamics of learning in deep neural networks. Neural Netw 2022; 149:18-28. DOI: 10.1016/j.neunet.2022.01.019.
9. Niroomand MP, Cafolla CT, Morgan JWR, Wales DJ. Characterising the area under the curve loss function landscape. Mach Learn Sci Technol 2022. DOI: 10.1088/2632-2153/ac49a9.
Abstract
One of the most common metrics to evaluate neural network classifiers is the area under the receiver operating characteristic curve (AUC). However, optimisation of the AUC as the loss function during network training is not a standard procedure. Here we compare minimising the cross-entropy (CE) loss and optimising the AUC directly. In particular, we analyse the loss function landscape (LFL) of approximate AUC (appAUC) loss functions to discover the organisation of this solution space. We discuss various surrogates for AUC approximation and show their differences. We find that the characteristics of the appAUC landscape are significantly different from the CE landscape. The approximate AUC loss function improves testing AUC, and the appAUC landscape has substantially more minima, but these minima are less robust, with larger average Hessian eigenvalues. We provide a theoretical foundation to explain these results. To generalise our results, we lastly provide an overview of how the LFL can help to guide loss function analysis and selection.
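To make the comparison concrete, here is a small sketch contrasting the cross-entropy loss with one common smooth surrogate of the AUC (a pairwise sigmoid over positive-negative score differences) for an untrained linear scorer on random data; the surrogate and the temperature tau are generic assumptions and not necessarily the appAUC variants analysed in the paper.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(scores, labels):
    p = sigmoid(scores)
    return -np.mean(labels * np.log(p + 1e-12) + (1 - labels) * np.log(1 - p + 1e-12))

def approx_auc_loss(scores, labels, tau=1.0):
    pos, neg = scores[labels == 1], scores[labels == 0]
    diffs = pos[:, None] - neg[None, :]           # every positive-negative score pair
    return 1.0 - np.mean(sigmoid(diffs / tau))    # smooth stand-in for 1 - AUC

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 10))
w_true = rng.standard_normal(10)
y = (X @ w_true + 0.5 * rng.standard_normal(500) > 0).astype(float)

w = rng.standard_normal(10)                       # untrained linear scorer
print("cross-entropy:", cross_entropy(X @ w, y))
print("approximate AUC loss:", approx_auc_loss(X @ w, y))

Minimising approx_auc_loss instead of cross_entropy is what "optimising the AUC directly" amounts to in this toy setting.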
10. Baskerville NP, Keating JP, Mezzadri F, Najnudel J. A Spin Glass Model for the Loss Surfaces of Generative Adversarial Networks. J Stat Phys 2022; 186:29. PMID: 35125517. PMCID: PMC8766428. DOI: 10.1007/s10955-022-02875-w.
Abstract
We present a novel mathematical model that seeks to capture the key design feature of generative adversarial networks (GANs). Our model consists of two interacting spin glasses, and we conduct an extensive theoretical analysis of the complexity of the model's critical points using techniques from Random Matrix Theory. The result is a set of insights into the loss surfaces of large GANs that build upon prior results for simpler networks, but also reveal new structure unique to this setting, which explains the greater difficulty of training GANs.
Affiliation(s)
- Francesco Mezzadri
- School of Mathematics, University of Bristol, Fry Building, Bristol, BS8 1UG, UK
- Joseph Najnudel
- School of Mathematics, University of Bristol, Fry Building, Bristol, BS8 1UG, UK
11. Baldassi C, Lauditi C, Malatesta EM, Perugini G, Zecchina R. Unveiling the Structure of Wide Flat Minima in Neural Networks. Phys Rev Lett 2021; 127:278301. PMID: 35061428. DOI: 10.1103/physrevlett.127.278301.
Abstract
The success of deep learning has revealed the application potential of neural networks across the sciences and opened up fundamental theoretical problems. In particular, the fact that learning algorithms based on simple variants of gradient methods are able to find near-optimal minima of highly nonconvex loss functions is an unexpected feature of neural networks. Moreover, such algorithms are able to fit the data even in the presence of noise, and yet they have excellent predictive capabilities. Several empirical results have shown a reproducible correlation between the so-called flatness of the minima achieved by the algorithms and the generalization performance. At the same time, statistical physics results have shown that in nonconvex networks a multitude of narrow minima may coexist with a much smaller number of wide flat minima, which generalize well. Here, we show that wide flat minima arise as complex extensive structures, from the coalescence of minima around "high-margin" (i.e., locally robust) configurations. Despite being exponentially rare compared to zero-margin ones, high-margin minima tend to concentrate in particular regions. These minima are in turn surrounded by other solutions of smaller and smaller margin, leading to dense regions of solutions over long distances. Our analysis also provides an alternative analytical method for estimating when flat minima appear and when algorithms begin to find solutions, as the number of model parameters varies.
Affiliation(s)
- Carlo Baldassi
- Artificial Intelligence Lab, Bocconi University, 20136 Milano, Italy
- Clarissa Lauditi
- Department of Applied Science and Technology, Politecnico di Torino, 10129 Torino, Italy
- Gabriele Perugini
- Artificial Intelligence Lab, Bocconi University, 20136 Milano, Italy
- Riccardo Zecchina
- Artificial Intelligence Lab, Bocconi University, 20136 Milano, Italy
12. Negri M, Tiana G, Zecchina R. Native state of natural proteins optimizes local entropy. Phys Rev E 2021; 104:064117. PMID: 35030941. DOI: 10.1103/physreve.104.064117.
Abstract
The differing ability of polypeptide conformations to act as the native state of proteins has long been rationalized in terms of differing kinetic accessibility or thermodynamic stability. Building on the successful applications of physical concepts and sampling algorithms recently introduced in the study of disordered systems, in particular artificial neural networks, we quantitatively explore how well a quantity known as the local entropy describes the native state of model proteins. In lattice models and all-atom representations of proteins, we are able to efficiently sample high local entropy states and to provide a proof of concept of enhanced stability and folding rate. Our methods are based on simple and general statistical-mechanics arguments, and thus we expect that they are of very general use.
Affiliation(s)
- M Negri
- Department of Applied Science and Technology, Politecnico di Torino, Corso Duca degli Abruzzi 24, I-10129 Turin, Italy
- G Tiana
- Department of Physics and Center for Complexity and Biosystems, Università degli Studi di Milano and INFN, via Celoria 16, 20133 Milan, Italy
- R Zecchina
- Artificial Intelligence Lab, Bocconi University, Via Sarfatti 25, 20136 Milan, Italy
13. Bulso N, Roudi Y. Restricted Boltzmann Machines as Models of Interacting Variables. Neural Comput 2021; 33:2646-2681. PMID: 34280260. DOI: 10.1162/neco_a_01420.
Abstract
We study the type of distributions that restricted Boltzmann machines (RBMs) with different activation functions can express by investigating the effect of the activation function of the hidden nodes on the marginal distribution they impose on observed binary nodes. We report an exact expression for these marginals in the form of a model of interacting binary variables with the explicit form of the interactions depending on the hidden node activation function. We study the properties of these interactions in detail and evaluate how the accuracy with which the RBM approximates distributions over binary variables depends on the hidden node activation function and the number of hidden nodes. When the inferred RBM parameters are weak, an intuitive pattern is found for the expression of the interaction terms, which substantially reduces the differences across activation functions. We show that the weak parameter approximation is a good approximation for different RBMs trained on the MNIST data set. Interestingly, in these cases, the mapping reveals that the inferred models are essentially low order interaction models.
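The exact marginal can be verified directly on a tiny machine. The sketch below evaluates the unnormalized marginal that an RBM with {0,1} hidden units imposes on its visible binary units in two ways: a brute-force sum over hidden configurations and the closed form obtained by tracing out the hidden layer (a sum of softplus terms). The paper treats general hidden activation functions; this is only the standard Bernoulli case.

import itertools
import numpy as np

rng = np.random.default_rng(4)
nv, nh = 4, 3
W = 0.5 * rng.standard_normal((nv, nh))   # visible-hidden couplings
a = 0.2 * rng.standard_normal(nv)         # visible biases
b = 0.2 * rng.standard_normal(nh)         # hidden biases

def marginal_bruteforce(v):
    # sum_h exp(a.v + b.h + v^T W h) over all 2^nh hidden configurations
    total = 0.0
    for h in itertools.product([0.0, 1.0], repeat=nh):
        h = np.asarray(h)
        total += np.exp(a @ v + b @ h + v @ W @ h)
    return total

def marginal_closed_form(v):
    # tracing out the hidden layer gives exp(a.v + sum_j softplus(b_j + (v^T W)_j))
    return np.exp(a @ v + np.sum(np.logaddexp(0.0, b + v @ W)))

v = np.array([1.0, 0.0, 1.0, 1.0])
print(marginal_bruteforce(v), marginal_closed_form(v))   # the two values coincide

Expanding the softplus terms in powers of W generates the effective multi-spin interactions among the visible units that the abstract refers to.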
Affiliation(s)
- Nicola Bulso
- Kavli Institute for Systems Neuroscience and Centre for Neural Computation, Norwegian University of Science and Technology, 7491 Trondheim, Norway, and SISSA-Cognitive Neuroscience, 34136 Trieste, Italy
- Yasser Roudi
- Kavli Institute for Systems Neuroscience and Centre for Neural Computation, Norwegian University of Science and Technology, 7491 Trondheim, Norway
14. Benedetti M, Dotsenko V, Fischetti G, Marinari E, Oshanin G. Recognition capabilities of a Hopfield model with auxiliary hidden neurons. Phys Rev E 2021; 103:L060401. PMID: 34271731. DOI: 10.1103/physreve.103.l060401.
Abstract
We study the recognition capabilities of the Hopfield model with auxiliary hidden layers, which emerge naturally upon a Hubbard-Stratonovich transformation. We show that the recognition capabilities of such a model at zero temperature outperform those of the original Hopfield model, due to a substantial increase of the storage capacity and the lack of a naturally defined basin of attraction. The modified model does not fall abruptly into the regime of complete confusion when memory load exceeds a sharp threshold. This latter circumstance, together with an increase of the storage capacity, renders such a modified Hopfield model a promising candidate for further research, with possible diverse applications.
Affiliation(s)
- Marco Benedetti
- Università di Roma La Sapienza, Piazzale Aldo Moro 5, I-00185 Rome, Italy
- Victor Dotsenko
- Sorbonne Université, CNRS, Laboratoire de Physique Théorique de la Matière Condensée (UMR 7600), 4 Place Jussieu, F-75252 Paris Cedex 05, France
- Giulia Fischetti
- Università di Roma La Sapienza, Piazzale Aldo Moro 5, I-00185 Rome, Italy
- Enzo Marinari
- Università di Roma La Sapienza, Piazzale Aldo Moro 5, I-00185 Rome, Italy; CNR-Nanotec and INFN, Sezione di Roma 1, I-00185 Rome, Italy
- Gleb Oshanin
- Sorbonne Université, CNRS, Laboratoire de Physique Théorique de la Matière Condensée (UMR 7600), 4 Place Jussieu, F-75252 Paris Cedex 05, France
15. Feng Y, Tu Y. The inverse variance-flatness relation in stochastic gradient descent is critical for finding flat minima. Proc Natl Acad Sci U S A 2021; 118:e2015617118. PMID: 33619091. PMCID: PMC7936325. DOI: 10.1073/pnas.2015617118.
Abstract
Despite the tremendous success of the stochastic gradient descent (SGD) algorithm in deep learning, little is known about how SGD finds generalizable solutions at flat minima of the loss function in high-dimensional weight space. Here, we investigate the connection between SGD learning dynamics and the loss function landscape. A principal component analysis (PCA) shows that SGD dynamics follow a low-dimensional drift-diffusion motion in the weight space. Around a solution found by SGD, the loss function landscape can be characterized by its flatness in each PCA direction. Remarkably, our study reveals a robust inverse relation between the weight variance and the landscape flatness in all PCA directions, which is the opposite of the fluctuation-response relation (also known as the Einstein relation) in equilibrium statistical physics. To understand the inverse variance-flatness relation, we develop a phenomenological theory of SGD based on statistical properties of the ensemble of minibatch loss functions. We find that both the anisotropic SGD noise strength (temperature) and its correlation time depend inversely on the landscape flatness in each PCA direction. Our results suggest that SGD serves as a landscape-dependent annealing algorithm. The effective temperature decreases with the landscape flatness, so the system seeks out (prefers) flat minima over sharp ones. Based on these insights, an algorithm with landscape-dependent constraints is developed to mitigate catastrophic forgetting efficiently when learning multiple tasks sequentially. In general, our work provides a theoretical framework to understand learning dynamics, which may eventually lead to better algorithms for different learning tasks.
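The measurement itself is easy to mimic on a toy problem. The sketch below runs minibatch SGD on a small linear regression, performs PCA on late-time weight snapshots, and pairs the weight variance along each principal direction with the curvature of the full-batch loss in that direction (curvature is used here as a simple inverse proxy for the flatness measure defined in the paper). Model, sizes and hyperparameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(5)
d, P, B, eta = 20, 500, 8, 0.05
X = rng.standard_normal((P, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(P)

def full_loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

w = rng.standard_normal(d)
snapshots = []
for step in range(20000):
    idx = rng.integers(0, P, size=B)
    w -= eta * X[idx].T @ (X[idx] @ w - y[idx]) / B        # minibatch SGD step
    if step > 10000 and step % 10 == 0:
        snapshots.append(w.copy())                         # sample weights around the minimum

S = np.array(snapshots)
w_star = S.mean(axis=0)
cov = (S - w_star).T @ (S - w_star) / len(S)
variance, directions = np.linalg.eigh(cov)                 # PCA of the weight fluctuations

eps = 1e-3
for k in (-1, -2, -3):                                     # the three largest-variance directions
    v = directions[:, k]
    curvature = (full_loss(w_star + eps * v) - 2 * full_loss(w_star)
                 + full_loss(w_star - eps * v)) / eps ** 2
    print(f"PCA variance={variance[k]:.3e}  curvature={curvature:.3e}")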
Affiliation(s)
- Yu Feng
- Foundations of AI, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
- Department of Physics, Duke University, Durham, NC 27710
- Yuhai Tu
- Foundations of AI, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
16. Dawson KA, Yan Y. Current understanding of biological identity at the nanoscale and future prospects. Nat Nanotechnol 2021; 16:229-242. PMID: 33597736. DOI: 10.1038/s41565-021-00860-0.
Abstract
Nanoscale objects are processed by living organisms using highly evolved and sophisticated endogenous cellular networks, specifically designed to manage objects of this size. While these processes potentially allow nanostructures unique access to and control over key biological machineries, they are also highly protected by cell or host defence mechanisms at all levels. A thorough understanding of bionanoscale recognition events, including the molecules involved in the cell recognition machinery, the nature of information transferred during recognition processes and the coupled downstream cellular processing, would allow us to achieve a qualitatively novel form of biological control and advanced therapeutics. Here we discuss evolving fundamental microscopic and mechanistic understanding of biological nanoscale recognition. We consider the interface between a nanostructure and a target cell membrane, outlining the categories of nanostructure properties that are recognized, and the associated nanoscale signal transduction and cellular programming mechanisms that constitute biological recognition.
Affiliation(s)
- Kenneth A Dawson
- Guangdong Provincial Education Department Key Laboratory of Nano-Immunoregulation Tumour Microenvironment, The Second Affiliated Hospital, Guangzhou Medical University, Guangzhou, Guangdong, PR China.
- Centre for BioNano Interactions, School of Chemistry, University College Dublin, Dublin, Ireland.
- Yan Yan
- Centre for BioNano Interactions, School of Chemistry, University College Dublin, Dublin, Ireland.
- School of Biomolecular and Biomedical Science, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin, Ireland.
17. Xu W, Lin J, Gao M, Chen Y, Cao J, Pu J, Huang L, Zhao J, Qian K. Rapid Computer-Aided Diagnosis of Stroke by Serum Metabolic Fingerprint Based Multi-Modal Recognition. Adv Sci (Weinh) 2020; 7:2002021. PMID: 33173737. PMCID: PMC7610260. DOI: 10.1002/advs.202002021.
Abstract
Stroke is a leading cause of mortality and disability worldwide, expected to result in 61 million disability-adjusted life-years in 2020. Rapid diagnostics is the core of stroke management for early prevention and medical treatment. Serum metabolic fingerprints (SMFs) reflect underlying disease progression and are predictive of patient phenotypes. Deep learning (DL) encoding SMFs with clinical indexes outperforms single biomarkers, while posing challenges for interpreting its predictions through feature selection. Herein, rapid computer-aided diagnosis of stroke is performed using SMF-based multi-modal recognition by DL, combining adaptive machine learning with a novel feature selection approach. SMFs are extracted by nano-assisted laser desorption/ionization mass spectrometry (LDI MS), consuming 100 nL of serum in seconds. A multi-modal recognition is constructed by integrating SMFs and clinical indexes, with an enhanced area under the curve (AUC) of up to 0.845 for stroke screening, compared to single-modal diagnosis by SMFs or clinical indexes alone. The interpretation of the DL prediction is addressed by selecting 20 key metabolite features with differential regulation through a saliency map approach, shedding light on the molecular mechanisms of stroke. The approach highlights the emerging role of DL in precision medicine and suggests an expanding utility for computational analysis of SMFs in stroke screening.
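As a generic illustration of saliency-based feature selection (the study itself applies it to a deep multi-modal model over SMFs and clinical indexes), the sketch below trains a logistic scorer on mock data and ranks input features by the average magnitude of the gradient of the predicted score with respect to each feature; all sizes and the choice of a linear model are assumptions made for brevity.

import numpy as np

rng = np.random.default_rng(6)
n, d = 400, 50                                   # samples x mock "metabolite features"
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 2.0                                 # only the first 5 features are informative
y = (X @ w_true + rng.standard_normal(n) > 0).astype(float)

w = np.zeros(d)
for _ in range(500):                             # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    w -= 0.1 * X.T @ (p - y) / n

# Saliency of feature j: mean over samples of |d p / d x_j|, which for this
# linear-logistic model equals mean(p * (1 - p)) * |w_j|; deep models would use
# backpropagated input gradients instead.
p = 1.0 / (1.0 + np.exp(-(X @ w)))
saliency = np.mean(p * (1 - p)) * np.abs(w)
top_features = np.argsort(saliency)[::-1][:20]   # keep the 20 highest-saliency features
print("top features:", top_features[:10])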
Affiliation(s)
- Wei Xu
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai 200127, P. R. China
- State Key Laboratory for Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P. R. China
- Jixian Lin
- Department of Neurology, Minhang Hospital, Fudan University, 170 Xinsong Road, Shanghai 201199, P. R. China
- Ming Gao
- School of Management Science and Engineering, Dongbei University of Finance and Economics, Dalian 116025, P. R. China
- Center for Post-doctoral Studies of Computer Science, Northeastern University, Shenyang 110819, P. R. China
- Yuhan Chen
- School of Management Science and Engineering, Dongbei University of Finance and Economics, Dalian 116025, P. R. China
- Center for Post-doctoral Studies of Computer Science, Northeastern University, Shenyang 110819, P. R. China
- Jing Cao
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai 200127, P. R. China
- State Key Laboratory for Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P. R. China
- Jun Pu
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai 200127, P. R. China
- State Key Laboratory for Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P. R. China
- Lin Huang
- Stem Cell Research Center, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai 200127, P. R. China
- Jing Zhao
- Department of Neurology, Minhang Hospital, Fudan University, 170 Xinsong Road, Shanghai 201199, P. R. China
- Kun Qian
- State Key Laboratory for Oncogenes and Related Genes, Division of Cardiology, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 160 Pujian Road, Shanghai 200127, P. R. China
- State Key Laboratory for Oncogenes and Related Genes, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200030, P. R. China