1. Lee H, Kim Y, Yang SY, Choi H. Improved weight initialization for deep and narrow feedforward neural network. Neural Netw 2024; 176:106362. PMID: 38733795. DOI: 10.1016/j.neunet.2024.106362.
Abstract
Appropriate weight initialization, together with the ReLU activation function, has become a cornerstone of modern deep learning, enabling the training and deployment of highly effective and efficient neural network models across diverse areas of artificial intelligence. The problem of "dying ReLU," where ReLU neurons become inactive and yield zero output, presents a significant challenge in the training of deep neural networks with the ReLU activation function. Theoretical research and various methods have been introduced to address the problem, yet training remains challenging for extremely deep and narrow feedforward networks with the ReLU activation function. In this paper, we propose a novel weight initialization method to address this issue. We establish several properties of our initial weight matrix and demonstrate how these properties enable the effective propagation of signal vectors. Through a series of experiments and comparisons with existing methods, we demonstrate the effectiveness of the proposed initialization method.
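The failure mode motivating this work is easy to reproduce. The sketch below is our own illustration, not the authors' initialization: it forwards a random input through a deep, narrow ReLU network under standard He initialization and counts how often the signal dies outright.

```python
import numpy as np

def dies(depth, width, rng):
    """Forward a random input through a deep, narrow ReLU MLP with
    He (Kaiming) initialization and report whether the signal dies,
    i.e. every activation hits exactly zero at some layer."""
    x = rng.standard_normal(width)
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        x = np.maximum(w @ x, 0.0)   # ReLU
        if not x.any():              # all neurons inactive: signal is gone for good
            return True
    return False

rng = np.random.default_rng(0)
dead = sum(dies(depth=200, width=4, rng=rng) for _ in range(100))
print(f"{dead}/100 networks died")
```

At width 4, each layer has roughly a (1/2)^4 chance of zeroing every neuron at once, so over hundreds of layers nearly every network goes permanently silent; wider layers make this event exponentially rarer, which is why the problem is specific to deep and narrow architectures.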
Affiliation(s)
- Hyunwoo Lee: Department of Mathematics, Kyungpook National University, Daegu 41566, Republic of Korea.
- Yunho Kim: Department of Mathematical Sciences, Ulsan National Institute of Science and Technology, Ulsan 44919, Republic of Korea.
- Seung Yeop Yang: Department of Mathematics, Kyungpook National University, Daegu 41566, Republic of Korea; KNU LAMP Research Center, KNU Institute of Basic Sciences, Kyungpook National University, Daegu 41566, Republic of Korea.
- Hayoung Choi: Department of Mathematics, Kyungpook National University, Daegu 41566, Republic of Korea.
2. Bahri Y, Dyer E, Kaplan J, Lee J, Sharma U. Explaining neural scaling laws. Proc Natl Acad Sci U S A 2024; 121:e2311878121. PMID: 38913889. PMCID: PMC11228526. DOI: 10.1073/pnas.2311878121.
Abstract
The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origin and relationships between scaling exponents.
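The empirical starting point of this theory is the power-law fit itself. A minimal toy example (synthetic data, our own illustration): generate losses obeying L(N) ∝ N^(−α) and recover the exponent by least squares in log-log space.

```python
import numpy as np

# Synthetic losses following L(N) = c * N**(-alpha), with small
# multiplicative noise, then recover the scaling exponent from a
# straight-line fit in log-log coordinates.
rng = np.random.default_rng(1)
alpha_true, c = 0.35, 4.0
N = np.logspace(3, 7, 12)                        # dataset sizes
L = c * N**(-alpha_true) * np.exp(rng.normal(0, 0.02, N.size))

slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_hat = -slope
print(f"recovered exponent: {alpha_hat:.3f}")    # close to alpha_true = 0.35
```

Deciding which of the paper's four regimes (variance- vs resolution-limited, in data or in model size) produced a measured exponent is exactly where its taxonomy comes in.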
Affiliation(s)
- Jared Kaplan: Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218.
- Utkarsh Sharma: Department of Physics and Astronomy, Johns Hopkins University, Baltimore, MD 21218.
3. Mastrovito D, Liu YH, Kusmierz L, Shea-Brown E, Koch C, Mihalas S. Transition to chaos separates learning regimes and relates to measure of consciousness in recurrent neural networks. bioRxiv [Preprint] 2024:2024.05.15.594236. PMID: 38798582. PMCID: PMC11118502. DOI: 10.1101/2024.05.15.594236.
Abstract
Recurrent neural networks exhibit chaotic dynamics when the variance in their connection strengths exceeds a critical value. Recent work indicates connection variance also modulates learning strategies; networks learn "rich" representations when initialized with low coupling and "lazier" solutions with larger variance. Using Watts-Strogatz networks of varying sparsity, structure, and hidden weight variance, we find that the critical coupling strength dividing chaotic from ordered dynamics also differentiates rich and lazy learning strategies. Training moves both stable and chaotic networks closer to the edge of chaos, with networks learning richer representations before the transition to chaos. In contrast, biologically realistic connectivity structures foster stability over a wide range of variances. The transition to chaos is also reflected in a measure that clinically discriminates levels of consciousness, the perturbational complexity index (PCIst). Networks with high values of PCIst exhibit stable dynamics and rich learning, suggesting a consciousness prior may promote rich learning. The results suggest a clear relationship between critical dynamics, learning regimes and complexity-based measures of consciousness.
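The critical coupling referred to above is the classical g = 1 boundary for random rate networks: with i.i.d. Gaussian couplings of variance g²/N, the circular law places the spectral radius of the connectivity matrix near g. A small sketch of that criterion (our illustration; the paper's Watts-Strogatz networks add sparsity and structure on top of this):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500  # number of units

def spectral_radius(g):
    """Largest eigenvalue modulus of a random coupling matrix whose
    entries have variance g**2 / N (circular law: radius ~ g)."""
    J = rng.normal(0.0, g / np.sqrt(N), size=(N, N))
    return np.abs(np.linalg.eigvals(J)).max()

r_low, r_high = spectral_radius(0.5), spectral_radius(1.5)
print(f"g=0.5 -> radius {r_low:.2f} (ordered); g=1.5 -> radius {r_high:.2f} (chaotic)")
```

When the radius crosses 1, the linearized dynamics around the origin become unstable, the conventional onset of chaos in such rate models.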
4. Thomas T, Straub D, Tatai F, Shene M, Tosik T, Kersting K, Rothkopf CA. Modelling dataset bias in machine-learned theories of economic decision-making. Nat Hum Behav 2024; 8:679-691. PMID: 38216691. PMCID: PMC11045447. DOI: 10.1038/s41562-023-01784-6.
Abstract
Normative and descriptive models have long vied to explain and predict human risky choices, such as those between goods or gambles. A recent study reported the discovery of a new, more accurate model of human decision-making by training neural networks on a new online large-scale dataset, choices13k. Here we systematically analyse the relationships between several models and datasets using machine-learning methods and find evidence for dataset bias. Because participants' choices in stochastically dominated gambles were consistently skewed towards equipreference in the choices13k dataset, we hypothesized that this reflected increased decision noise. Indeed, a probabilistic generative model adding structured decision noise to a neural network trained on data from a laboratory study transferred best, that is, outperformed all models apart from those trained on choices13k. We conclude that a careful combination of theory and data analysis is still required to understand the complex interactions of machine-learning models and data of human risky choices.
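One simple way to formalize "skewed towards equipreference" is a lapse-style noise model in which some fraction of choices are uniform guesses. This is our simplification for illustration, not the paper's structured generative model:

```python
import numpy as np

# With probability eps the participant guesses uniformly between the two
# gambles, pulling every choice probability toward equipreference (0.5),
# which is the qualitative pattern reported for the choices13k dataset.
def add_decision_noise(p_choose_A, eps):
    p = np.asarray(p_choose_A, dtype=float)
    return (1.0 - eps) * p + eps * 0.5

model_p = np.array([0.05, 0.50, 0.95])  # e.g. dominated / neutral / dominant gambles
print(add_decision_noise(model_p, eps=0.4))  # extremes move toward 0.5
```

Under such a model, stochastically dominated gambles, where a noiseless chooser should sit near 0 or 1, are exactly where the compression toward 0.5 is most visible.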
Affiliation(s)
- Tobias Thomas: Centre for Cognitive Science and Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany; Hessian Center for Artificial Intelligence, Darmstadt, Germany.
- Dominik Straub: Centre for Cognitive Science and Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany.
- Fabian Tatai: Centre for Cognitive Science and Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany.
- Megan Shene: Centre for Cognitive Science and Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany.
- Tümer Tosik: Centre for Cognitive Science and Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany.
- Kristian Kersting: Hessian Center for Artificial Intelligence, Darmstadt, Germany; Centre for Cognitive Science and Computer Science Department, Technical University of Darmstadt, Darmstadt, Germany.
- Constantin A Rothkopf: Centre for Cognitive Science and Institute of Psychology, Technical University of Darmstadt, Darmstadt, Germany; Hessian Center for Artificial Intelligence, Darmstadt, Germany.
5. Katsuno H, Kimura Y, Yamazaki T, Takigawa I. Machine Learning Refinement of In Situ Images Acquired by Low Electron Dose LC-TEM. Microsc Microanal 2024; 30:77-84. PMID: 38285924. DOI: 10.1093/micmic/ozad142.
Abstract
We have studied a machine learning (ML) technique for refining images acquired during in situ observation using liquid-cell transmission electron microscopy. Our model is constructed using a U-Net architecture and a ResNet encoder. For training our ML model, we prepared an original image dataset that contained pairs of images of samples acquired with and without a solution present. The former images were used as noisy images, and the latter images were used as corresponding ground truth images. The number of pairs of image sets was 1,204, and the image sets included images acquired at several different magnifications and electron doses. The trained model converted a noisy image into a clear image. The time necessary for the conversion was on the order of 10 ms, and we applied the model to in situ observations using the software Gatan DigitalMicrograph (DM). Even if a nanoparticle was not visible in a view window in the DM software because of the low electron dose, it was visible in a successive refined image generated by our ML model.
Affiliation(s)
- Hiroyasu Katsuno: Emerging Media Initiative, Kanazawa University, Kakuma-machi, Kanazawa, 920-1192 Ishikawa, Japan.
- Yuki Kimura: Institute of Low Temperature Science, Hokkaido University, Kita-19, Nishi-8, Kita-ku, Sapporo, 060-0819 Hokkaido, Japan.
- Tomoya Yamazaki: Institute of Low Temperature Science, Hokkaido University, Kita-19, Nishi-8, Kita-ku, Sapporo, 060-0819 Hokkaido, Japan.
- Ichigaku Takigawa: Institute for Liberal Arts and Sciences, Kyoto University, 302 Konoe-kae, 69 Konoe-cho, Sakyo-ku, Kyoto, 606-8315 Kyoto, Japan; Institute for Chemical Reaction Design and Discovery, Hokkaido University, N21 W10, Kita-ku, Sapporo, 001-0021 Hokkaido, Japan.
6. Huang L, Zhang C, Zhang H. Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning. IEEE Trans Pattern Anal Mach Intell 2024; 46:1362-1377. PMID: 36306295. DOI: 10.1109/tpami.2022.3217792.
Abstract
We propose self-adaptive training, a unified training algorithm that dynamically calibrates and enhances the training process using model predictions without incurring extra computational cost, to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data, and that this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions can substantially benefit the training process: self-adaptive training improves the generalization of deep networks under noise and enhances self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently discovered double-descent phenomenon in empirical risk minimization and the collapsing issue of state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.
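The calibration idea lends itself to a few-line sketch: soft training targets start at the (possibly noisy) labels and are exponentially averaged with the model's own predictions, so a consistently confident prediction can gradually override a corrupted label. This is a minimal sketch of that target-update idea, our simplification rather than the paper's full algorithm (which includes warm-up and per-sample bookkeeping):

```python
import numpy as np

def update_targets(targets, predictions, momentum=0.9):
    """Exponential moving average of soft targets toward model predictions."""
    return momentum * targets + (1.0 - momentum) * predictions

targets = np.array([0.0, 1.0, 0.0])       # noisy one-hot label (wrong class)
prediction = np.array([0.9, 0.05, 0.05])  # the model consistently predicts class 0
for _ in range(50):                       # one update per epoch, say
    targets = update_targets(targets, prediction)
print(targets.argmax())                   # the target has flipped to class 0
```

After 50 updates the initial label carries weight 0.9^50 ≈ 0.005, so the target now agrees with the model's stable prediction while remaining a valid probability vector.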
7. Lasko TA, Strobl EV, Stead WW. Why do probabilistic clinical models fail to transport between sites. NPJ Digit Med 2024; 7:53. PMID: 38429353. PMCID: PMC10907678. DOI: 10.1038/s41746-024-01037-4.
Abstract
The rising popularity of artificial intelligence in healthcare is highlighting the problem that a computational model achieving super-human clinical performance at its training sites may perform substantially worse at new sites. In this perspective, we argue that we should typically expect this failure to transport, and we present its common sources, divided into those under the control of the experimenter and those inherent to the clinical data-generating process. Among the inherent sources, we look more closely at site-specific clinical practices that can affect the data distribution, and we propose a potential solution intended to isolate the imprint of those practices on the data from the patterns of disease cause and effect that are the usual target of probabilistic clinical models.
Affiliation(s)
- Thomas A Lasko: Vanderbilt University Medical Center, Nashville, TN, USA.
- Eric V Strobl: Vanderbilt University Medical Center, Nashville, TN, USA.
8. Liu YH, Baratin A, Cornford J, Mihalas S, Shea-Brown E, Lajoie G. How connectivity structure shapes rich and lazy learning in neural circuits. arXiv [Preprint] 2024:arXiv:2310.08513v2. PMID: 37873007. PMCID: PMC10593070.
Abstract
In theoretical neuroscience, recent work leverages deep learning tools to explore how certain network attributes critically influence learning dynamics. Notably, initial weight distributions with small (resp. large) variance may yield a rich (resp. lazy) regime, in which significant (resp. minor) changes to network states and representations are observed over the course of learning. In biology, however, neural circuit connectivity may exhibit a low-rank structure and therefore differs markedly from the random initializations generally used in these studies. We therefore investigate how the structure of the initial weights, in particular their effective rank, influences the network learning regime. Through both empirical and theoretical analyses, we find that high-rank initializations typically yield smaller network changes indicative of lazier learning, a finding we also confirm with experimentally driven initial connectivity in recurrent neural networks. Conversely, low-rank initialization biases networks toward richer learning. Importantly, as an exception to this rule, we find that lazier learning can still occur with a low-rank initialization that aligns with task and data statistics. Our research highlights the pivotal role of initial weight structure in shaping learning regimes, with implications for the metabolic costs of plasticity and the risk of catastrophic forgetting.
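Effective rank, the key knob in this abstract, can be measured as the exponential of the entropy of the normalized singular-value spectrum (one common definition; the paper may use a different estimator). A quick sketch contrasting a rank-2 initialization with a dense random one:

```python
import numpy as np

def effective_rank(W):
    """exp of the entropy of the normalized singular-value spectrum."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-(p * np.log(p + 1e-12)).sum()))

rng = np.random.default_rng(3)
n, r = 200, 2
low_rank = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 init
dense = rng.standard_normal((n, n))                                   # generic random init
print(f"rank-2: {effective_rank(low_rank):.1f}; dense: {effective_rank(dense):.1f}")
```

The low-rank product scores near 2 while the dense Gaussian matrix scores in the hundreds, the axis along which the abstract's rich-to-lazy transition is organized.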
9. Fischer B, Chemnitz M, Zhu Y, Perron N, Roztocki P, MacLellan B, Di Lauro L, Aadhi A, Rimoldi C, Falk TH, Morandotti R. Neuromorphic Computing via Fission-based Broadband Frequency Generation. Adv Sci (Weinh) 2023; 10:e2303835. PMID: 37786262. PMCID: PMC10724387. DOI: 10.1002/advs.202303835.
Abstract
The performance limitations of traditional computer architectures have led to the rise of brain-inspired hardware, with optical solutions gaining popularity due to the energy efficiency, high speed, and scalability of linear operations. However, the use of optics to emulate the synaptic activity of neurons has remained a challenge since the integration of nonlinear nodes is power-hungry and, thus, hard to scale. Neuromorphic wave computing offers a new paradigm for energy-efficient information processing, building upon transient and passively nonlinear interactions between optical modes in a waveguide. Here, an implementation of this concept is presented using broadband frequency conversion by coherent higher-order soliton fission in a single-mode fiber. It is shown that phase encoding on femtosecond pulses at the input, alongside frequency selection and weighting at the system output, makes transient spectro-temporal system states interpretable and allows for the energy-efficient emulation of various digital neural networks. The experiments in a compact, fully fiber-integrated setup substantiate an anticipated enhancement in computational performance with increasing system nonlinearity. The findings suggest that broadband frequency generation, accessible on-chip and in-fiber with off-the-shelf components, may challenge the traditional approach to node-based brain-inspired hardware design, ultimately leading to energy-efficient, scalable, and dependable computing with minimal optical hardware requirements.
Affiliation(s)
- Bennet Fischer: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada; Leibniz Institute of Photonic Technology, Albert-Einstein-Str. 9, 07745 Jena, Germany.
- Mario Chemnitz: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada; Leibniz Institute of Photonic Technology, Albert-Einstein-Str. 9, 07745 Jena, Germany.
- Yi Zhu: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
- Nicolas Perron: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
- Piotr Roztocki: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada; Ki3 Photonics Technologies, 2547 Rue Sicard, Montreal, Quebec H1V 2Y8, Canada.
- Benjamin MacLellan: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
- Luigi Di Lauro: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
- A. Aadhi: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
- Cristina Rimoldi: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada; Dipartimento di Elettronica e Telecomunicazioni, Politecnico di Torino, Corso Duca degli Abruzzi 24, 10129 Torino, Italy.
- Tiago H. Falk: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
- Roberto Morandotti: Institut National de la Recherche Scientifique – Énergie Matériaux et Télécommunications, 1650 Blvd. Lionel-Boulet, Varennes, Quebec J3X 1S2, Canada.
10. Sun W, Advani M, Spruston N, Saxe A, Fitzgerald JE. Organizing memories for generalization in complementary learning systems. Nat Neurosci 2023; 26:1438-1448. PMID: 37474639. PMCID: PMC10400413. DOI: 10.1038/s41593-023-01382-9.
Abstract
Memorization and generalization are complementary cognitive processes that jointly promote adaptive behavior. For example, animals should memorize safe routes to specific water sources and generalize from these memories to discover environmental features that predict new ones. These functions depend on systems consolidation mechanisms that construct neocortical memory traces from hippocampal precursors, but why systems consolidation only applies to a subset of hippocampal memories is unclear. Here we introduce a new neural network formalization of systems consolidation that reveals an overlooked tension: unregulated neocortical memory transfer can cause overfitting and harm generalization in an unpredictable world. We resolve this tension by postulating that memories only consolidate when doing so aids generalization. This framework accounts for partial hippocampal-cortical memory transfer and provides a normative principle for reconceptualizing numerous observations in the field. Generalization-optimized systems consolidation thus provides new insight into how adaptive behavior benefits from complementary learning systems specialized for memorization and generalization.
Affiliation(s)
- Weinan Sun: Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA, USA.
- Madhu Advani: Center for Brain Science, Harvard University, Cambridge, MA, USA.
- Nelson Spruston: Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA, USA.
- Andrew Saxe: Center for Brain Science, Harvard University, Cambridge, MA, USA; Department of Experimental Psychology, University of Oxford, Oxford, UK; Gatsby Computational Neuroscience Unit & Sainsbury Wellcome Centre, UCL, London, UK; CIFAR Azrieli Global Scholars Program, CIFAR, Toronto, Ontario, Canada.
- James E Fitzgerald: Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA, USA.
11. Hanin B, Zlokapa A. Bayesian interpolation with deep linear networks. Proc Natl Acad Sci U S A 2023; 120:e2301345120. PMID: 37252994. PMCID: PMC10266010. DOI: 10.1073/pnas.2301345120.
Abstract
Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through novel asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit.
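The effective depth highlighted in the closing sentence admits a one-line formula. In our notation (the abstract states it in words), with $L$ hidden layers, $P$ training points, and hidden width $N$:

```latex
\lambda_{\mathrm{eff}} \;=\; \frac{L\,P}{N},
```

and, per the abstract, this combination determines the structure of the posterior in the large-data limit, so very deep networks at fixed width and dataset size occupy a different regime than wide, shallow ones.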
Affiliation(s)
- Boris Hanin: Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08540.
- Alexander Zlokapa: Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139; Google Quantum AI, Venice, CA 90291.
12. Shan H, Sompolinsky H. Minimum perturbation theory of deep perceptual learning. Phys Rev E 2022; 106:064406. PMID: 36671118. DOI: 10.1103/physreve.106.064406.
Abstract
Perceptual learning (PL) involves long-lasting improvement in perceptual tasks following extensive training and is accompanied by modified neuronal responses in sensory cortical areas in the brain. Understanding the dynamics of PL and the resultant synaptic changes is important for causally connecting PL to the observed neural plasticity. This is theoretically challenging because learning-related changes are distributed across many stages of the sensory hierarchy. In this paper, we modeled the sensory hierarchy as a deep nonlinear neural network and studied PL of fine discrimination, a common and well-studied paradigm of PL. Using tools from statistical physics, we developed a mean-field theory of the network in the limit of a large number of neurons and large number of examples. Our theory suggests that, in this thermodynamic limit, the input-output function of the network can be exactly mapped to that of a deep linear network, allowing us to characterize the space of solutions for the task. Surprisingly, we found that modifying synaptic weights in the first layer of the hierarchy is both sufficient and necessary for PL. To address the degeneracy of the space of solutions, we postulate that PL dynamics are constrained by a normative minimum perturbation (MP) principle, which favors weight matrices with minimal changes relative to their prelearning values. Interestingly, MP plasticity induces changes to weights and neural representations in all layers of the network, except for the readout weight vector. While weight changes in higher layers are not necessary for learning, they help reduce overall perturbation to the network. In addition, such plasticity can be learned simply through slow learning. We further elucidate the properties of MP changes and compare them against experimental findings. Overall, our statistical mechanics theory of PL provides mechanistic and normative understanding of several important empirical findings of PL.
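The minimum perturbation principle postulated here can be written as a constrained optimization. The notation below is ours, a formalization of the abstract's verbal statement, with $W$ the pre-learning weights and $f_W$ the network's input-output map:

```latex
\Delta W^{\star} \;=\; \operatorname*{arg\,min}_{\Delta W}\; \lVert \Delta W \rVert^{2}
\qquad \text{subject to } f_{W+\Delta W} \text{ solving the fine-discrimination task.}
```

Learning then moves the weights to $W + \Delta W^{\star}$, the solution minimally displaced from the pre-learning configuration, consistent with the distributed, compensatory changes across layers described in the abstract.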
Affiliation(s)
- Haozhe Shan: Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA; Program in Neuroscience, Harvard Medical School, Boston, Massachusetts 02115, USA.
- Haim Sompolinsky: Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA; Edmond and Lily Safra Center for Brain Sciences, Hebrew University of Jerusalem, Jerusalem 9190401, Israel.
13. Saglietti L, Mannelli SS, Saxe A. An analytical theory of curriculum learning in teacher-student networks. J Stat Mech 2022; 2022:114014. PMID: 37817944. PMCID: PMC10561397. DOI: 10.1088/1742-5468/ac9b3c.
Abstract
In animals and humans, curriculum learning (presenting data in a curated order) is critical to rapid learning and effective pedagogy. A long history of experiments has demonstrated the impact of curricula in a variety of animals but, despite its ubiquitous presence, a theoretical understanding of the phenomenon is still lacking. Surprisingly, in contrast to animal learning, curriculum strategies are not widely used in machine learning, and recent simulation studies conclude that curricula are moderately effective or even ineffective in most cases. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. We study a task in which a sparse set of informative features is embedded amidst a large set of noisy features. We analytically derive average learning trajectories for simple neural networks on this task, which establish a clear speed benefit for curriculum learning in the online setting. However, when training experiences can be stored and replayed (for instance, during sleep), the advantage of curriculum in standard neural networks disappears, in line with observations from the deep learning literature. Inspired by synaptic consolidation techniques developed to combat catastrophic forgetting, we propose curriculum-aware algorithms that consolidate synapses at curriculum change points and investigate whether this can boost the benefits of curricula. We derive generalisation performance as a function of consolidation strength (implemented as an L2 regularisation/elastic coupling connecting learning phases), and show that curriculum-aware algorithms can yield a large improvement in test performance. Our reduced analytical descriptions help reconcile apparently conflicting empirical results, trace regimes where curriculum learning yields the largest gains, and provide experimentally accessible predictions for the impact of task parameters on curriculum benefits. More broadly, our results suggest that fully exploiting a curriculum may require explicit adjustments in the loss.
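The consolidation device can be sketched concretely: at a curriculum change point the current weights become an anchor, and later training minimizes the task loss plus an elastic (L2) coupling to that anchor. The function name and scalar form below are our illustration of that penalty, not the paper's full objective:

```python
import numpy as np

def consolidated_loss(task_loss, w, w_anchor, strength):
    """Task loss plus an elastic (L2) coupling pulling the weights
    toward their value at the last curriculum change point."""
    return task_loss + 0.5 * strength * np.sum((w - w_anchor) ** 2)

w_anchor = np.array([1.0, -2.0])  # weights frozen in at the curriculum change point
w = np.array([1.5, -1.0])         # current weights, drifting in the new phase
total = consolidated_loss(0.8, w, w_anchor, strength=2.0)
print(total)                      # 0.8 task loss + 1.25 elastic penalty
```

The consolidation strength is the knob the paper sweeps when deriving generalisation performance: at zero the phases decouple, and at large values the easy-phase solution is effectively frozen in.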
Affiliation(s)
- Luca Saglietti: Institute for Data Science and Analytics, Bocconi University, Italy.
- Stefano Sarao Mannelli: Gatsby Computational Neuroscience Unit and Sainsbury Wellcome Centre, University College London, London, United Kingdom.
- Andrew Saxe: Institute for Data Science and Analytics, Bocconi University, Italy; FAIR, Meta AI, United States of America.
14. Ma X, Sardy S, Hengartner N, Bobenko N, Lin YT. A phase transition for finding needles in nonlinear haystacks with LASSO artificial neural networks. Stat Comput 2022; 32:99. PMID: 36299529. PMCID: PMC9587964. DOI: 10.1007/s11222-022-10169-0.
Abstract
To fit sparse linear associations, a LASSO sparsity-inducing penalty with a single hyperparameter provably allows the important features (needles) to be recovered with high probability in certain regimes, even if the sample size is smaller than the dimension of the input vector (haystack). More recently, learners known as artificial neural networks (ANNs) have shown great success in many machine learning tasks, in particular in fitting nonlinear associations. Small learning rates, the stochastic gradient descent algorithm, and large training sets help to cope with the explosion in the number of parameters present in deep neural networks. Yet few ANN learners have been developed and studied to find needles in nonlinear haystacks. Driven by a single hyperparameter, our ANN learner, as for sparse linear associations, exhibits a phase transition in the probability of retrieving the needles, which we do not observe with other ANN learners. To select our penalty parameter, we generalize the universal threshold of Donoho and Johnstone (Biometrika 81(3):425-455, 1994), which is a better rule than conservative (too many false detections) and expensive cross-validation. In the spirit of simulated annealing, we propose a warm-start sparsity-inducing algorithm to solve the high-dimensional, non-convex and non-differentiable optimization problem. We perform Monte Carlo experiments on simulated and real data to quantify the effectiveness of our approach.
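The sparsity mechanism underlying such a learner is the LASSO proximal operator, soft-thresholding, which snaps sub-threshold weights exactly to zero. A minimal sketch (the paper's warm-start annealing schedule and universal-threshold choice of the hyperparameter are not reproduced here):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 (LASSO) penalty: shrink every weight
    by lam and set anything that crosses zero exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([0.03, -0.4, 1.2, -0.01])  # mostly "hay", two sizeable "needles"
print(soft_threshold(w, lam=0.05))      # small weights snap exactly to zero
```

Because the output is exactly sparse rather than merely small, sweeping lam makes the needle-retrieval probability a well-defined quantity, which is where the phase transition is observed.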
Affiliation(s)
- Xiaoyu Ma
- Shandong University, Jinan, China
- Department of Mathematics, University of Geneva, Geneva, Switzerland
- Sylvain Sardy
- Department of Mathematics, University of Geneva, Geneva, Switzerland
- Nick Hengartner
- Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, USA
- Nikolai Bobenko
- Department of Mathematics, University of Geneva, Geneva, Switzerland
- Yen Ting Lin
- Information Sciences Group, Los Alamos National Laboratory, Los Alamos, USA
15
Ingrosso A, Goldt S. Data-driven emergence of convolutional structure in neural networks. Proc Natl Acad Sci U S A 2022; 119:e2201854119. [PMID: 36161906 PMCID: PMC9546588 DOI: 10.1073/pnas.2201854119]
Abstract
Exploiting data invariances is crucial for efficient learning in both artificial and biological neural circuits. Understanding how neural networks can discover appropriate representations capable of harnessing the underlying symmetries of their inputs is therefore a central question in machine learning and neuroscience. Convolutional neural networks, for example, were designed to exploit translation symmetry, and their capabilities triggered the first wave of deep learning successes. However, learning convolutions directly from translation-invariant data with a fully connected network has so far proven elusive. Here we show how initially fully connected neural networks solving a discrimination task can learn a convolutional structure directly from their inputs, resulting in localized, space-tiling receptive fields. These receptive fields match the filters of a convolutional network trained on the same task. By carefully designing data models for the visual scene, we show that the emergence of this pattern is triggered by the non-Gaussian, higher-order local structure of the inputs, which has long been recognized as the hallmark of natural images. We provide an analytical and numerical characterization of the pattern formation mechanism responsible for this phenomenon in a simple model and find an unexpected link between receptive field formation and tensor decomposition of higher-order input correlations. These results provide a perspective on the development of low-level feature detectors in various sensory modalities and pave the way for studying the impact of higher-order statistics on learning in neural networks.
Affiliation(s)
- Alessandro Ingrosso
- Quantitative Life Sciences, The Abdus Salam International Centre for Theoretical Physics, 34151 Trieste, Italy
- Sebastian Goldt
- Department of Physics, International School of Advanced Studies, 34136 Trieste, Italy
16
Rocks JW, Mehta P. Bias-variance decomposition of overparameterized regression with random linear features. Phys Rev E 2022; 106:025304. [PMID: 36109970 PMCID: PMC9906786 DOI: 10.1103/physreve.106.025304]
Abstract
In classical statistics, the bias-variance trade-off describes how varying a model's complexity (e.g., number of fit parameters) affects its ability to make accurate predictions. According to this trade-off, optimal performance is achieved when a model is expressive enough to capture trends in the data, yet not so complex that it overfits idiosyncratic features of the training data. Recently, it has become clear that this classic understanding of the bias-variance trade-off must be fundamentally revisited in light of the incredible predictive performance of overparameterized models: models that avoid overfitting even when the number of fit parameters is large enough to perfectly fit the training data. Here, we present results for one of the simplest examples of an overparameterized model: regression with random linear features (i.e., a two-layer neural network with a linear activation function). Using the zero-temperature cavity method, we derive analytic expressions for the training error, test error, bias, and variance. We show that the random linear features model exhibits three phase transitions: two different transitions to an interpolation regime where the training error is zero, along with an additional transition between regimes with large bias and minimal bias. Using random matrix theory, we show how each transition arises due to small nonzero eigenvalues in the Hessian matrix. Finally, we compare and contrast the phase diagram of the random linear features model to the random nonlinear features model and ordinary regression, highlighting the additional phase transitions that result from the use of linear basis functions.
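The model is small enough to simulate directly. A sketch of the random linear features setup, with illustrative sizes (not the paper's cavity-method analysis): a fixed random first layer with linear activation, a least-squares second layer, and a training error that drops to zero once the feature count reaches the sample size.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 50                     # samples, input dimension
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def train_error(p):
    # Random linear features: z = W x with W fixed at random ("linear activation")
    W = rng.standard_normal((p, d)) / np.sqrt(d)
    Z = X @ W.T
    a = np.linalg.pinv(Z) @ y     # minimum-norm least-squares fit of the second layer
    return np.mean((Z @ a - y) ** 2)

for p in (5, 10, 50):
    print(p, train_error(p))      # error reaches ~0 once p >= n (interpolation regime)
```

With p >= n the feature matrix has full row rank almost surely, so the min-norm solution interpolates the training data exactly.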
Affiliation(s)
- Jason W. Rocks
- Department of Physics, Boston University, Boston, Massachusetts 02215, USA
- Pankaj Mehta
- Department of Physics, Boston University, Boston, Massachusetts 02215, USA
- Faculty of Computing and Data Sciences, Boston University, Boston, Massachusetts 02215, USA
17
Gu K, Masotto X, Bachani V, Lakshminarayanan B, Nikodem J, Yin D. An instance-dependent simulation framework for learning with label noise. Mach Learn 2022. [DOI: 10.1007/s10994-022-06207-7]
18
Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00498-0]
19
Sarraf A, Khalili S. An upper bound on the variance of scalar multilayer perceptrons for log-concave distributions. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.11.062]
20
Zavatone-Veth JA, Tong WL, Pehlevan C. Contrasting random and learned features in deep Bayesian linear regression. Phys Rev E 2022; 105:064118. [PMID: 35854590 DOI: 10.1103/physreve.105.064118]
Abstract
Understanding how feature learning affects generalization is among the foremost goals of modern deep learning theory. Here, we study how the ability to learn representations affects the generalization performance of a simple class of models: deep Bayesian linear neural networks trained on unstructured Gaussian data. By comparing deep random feature models to deep networks in which all layers are trained, we provide a detailed characterization of the interplay between width, depth, data density, and prior mismatch. We show that both models display samplewise double-descent behavior in the presence of label noise. Random feature models can also display modelwise double descent if there are narrow bottleneck layers, while deep networks do not show these divergences. Random feature models can have particular widths that are optimal for generalization at a given data density, while making neural networks as wide or as narrow as possible is always optimal. Moreover, we show that the leading-order correction to the kernel-limit learning curve cannot distinguish between random feature models and deep networks in which all layers are trained. Taken together, our findings begin to elucidate how architectural details affect generalization performance in this simple class of deep regression models.
Affiliation(s)
- Jacob A Zavatone-Veth
- Department of Physics, Harvard University, Cambridge, Massachusetts 02138, USA
- Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
- William L Tong
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
- Cengiz Pehlevan
- Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
- John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA
21
Sahs J, Pyle R, Damaraju A, Caro JO, Tavaslioglu O, Lu A, Anselmi F, Patel AB. Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics. Front Artif Intell 2022; 5:889981. [PMID: 35647529 PMCID: PMC9131019 DOI: 10.3389/frai.2022.889981]
Abstract
Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. This is partly due to symmetries inherent in the NN parameterization, which allow many different parameter settings to yield an identical output function, creating redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of the neurons and a continuous family of transformations of the scale of the weight and bias parameters. We propose taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise-linear splines. Using this spline lens, we study learning dynamics in shallow univariate ReLU NNs, finding unexpected insights and explanations for several perplexing phenomena. We develop a surprisingly simple and transparent view of the structure of the loss surface, including its critical and fixed points, Hessian, and Hessian spectrum. We also show that standard weight initializations yield very flat initial functions, and that this flatness, together with overparametrization and the initial weight scale, is responsible for the strength and type of implicit regularization, consistent with previous work. Our implicit regularization results are complementary to recent work showing that initialization scale critically controls implicit regularization via a kernel-based argument. Overall, removing the weight-scale symmetry enables us to prove these results more simply, to prove new results, and to gain new insights, while offering a far more transparent and intuitive picture. Looking forward, our quotiented spline-based approach will extend naturally to the multivariate and deep settings and, alongside the kernel-based view, we believe it will play a foundational role in efforts to understand neural networks.
Videos of learning dynamics using a spline-based visualization are available at http://shorturl.at/tFWZ2.
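The spline view is easy to verify numerically: a shallow univariate ReLU network is continuous and piecewise linear, with one knot per hidden unit where that unit's pre-activation crosses zero. A minimal sketch, with all sizes and random parameters invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
H = 8                                   # hidden units
w = rng.standard_normal(H)
b = rng.standard_normal(H)
a = rng.standard_normal(H)

def f(x):
    # Shallow univariate ReLU network: a weighted sum of hinge functions
    return np.maximum(np.outer(x, w) + b, 0.0) @ a

# Each hidden unit contributes one spline knot where w_i * x + b_i = 0
knots = np.sort(-b / w)
print(knots)

# Between consecutive knots the function is exactly affine
for lo, hi in zip(knots[:-1], knots[1:]):
    xs = np.linspace(lo, hi, 5)[1:-1]   # points strictly inside the interval
    slopes = np.diff(f(xs)) / np.diff(xs)
    assert np.allclose(slopes, slopes[0])
```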
Affiliation(s)
- Justin Sahs
- Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States
- Ryan Pyle
- Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States
- Aneel Damaraju
- Department of Electrical Engineering, Rice University, Houston, TX, United States
- Josue Ortega Caro
- Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States
- Onur Tavaslioglu
- Department of Computational and Applied Mathematics, Rice University, Houston, TX, United States
- Andy Lu
- Department of Electrical Engineering, Rice University, Houston, TX, United States
- Fabio Anselmi
- Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States
- Ankit B. Patel
- Department of Neuroscience, Baylor College of Medicine, Houston, TX, United States
- Department of Electrical Engineering, Rice University, Houston, TX, United States
22
Hastie T, Montanari A, Rosset S, Tibshirani RJ. Surprises in high-dimensional ridgeless least squares interpolation. Ann Stat 2022; 50:949-986. [PMID: 36120512 PMCID: PMC9481183 DOI: 10.1214/21-aos2133]
Abstract
Interpolators, estimators that achieve zero training error, have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ2-norm ("ridgeless") interpolation least squares regression, focusing on the high-dimensional regime in which the number of unknown parameters p is of the same order as the number of samples n. We consider two different models for the feature distribution: a linear model, where the feature vectors x_i ∈ ℝ^p are obtained by applying a linear transform to a vector of i.i.d. entries, x_i = Σ^{1/2} z_i (with z_i ∈ ℝ^p); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i ∈ ℝ^d, W ∈ ℝ^{p×d} a matrix of i.i.d. entries, and φ an activation function acting componentwise on W z_i). We recover, in a precise quantitative way, several phenomena that have been observed in large-scale neural networks and kernel machines, including the "double descent" behavior of the prediction risk and the potential benefits of overparametrization.
Affiliation(s)
- Trevor Hastie
- Department of Statistics and Department of Biomedical Data Science, Stanford University
- Andrea Montanari
- Department of Statistics and Department of Electrical Engineering, Stanford University
- Ryan J. Tibshirani
- Department of Statistics and Department of Machine Learning, Carnegie Mellon University
23
Hiratani N, Latham PE. Developmental and evolutionary constraints on olfactory circuit selection. Proc Natl Acad Sci U S A 2022; 119:e2100600119. [PMID: 35263217 PMCID: PMC8931209 DOI: 10.1073/pnas.2100600119]
Abstract
Significance: In this work, we explore the hypothesis that biological neural networks optimize their architecture, through evolution, for learning. We study early olfactory circuits of mammals and insects, which have relatively similar structure but a huge diversity in size. We approximate these circuits as three-layer networks and estimate, analytically, the scaling of the optimal hidden-layer size with input-layer size. We find that both longevity and information in the genome constrain the hidden-layer size, so a range of allometric scalings is possible. However, the experimentally observed allometric scalings in mammals and insects are consistent with biologically plausible values. This analysis should pave the way for a deeper understanding of both biological and artificial networks.
Affiliation(s)
- Naoki Hiratani
- Gatsby Computational Neuroscience Unit, University College London, London W1T 4JG, United Kingdom
- Peter E. Latham
- Gatsby Computational Neuroscience Unit, University College London, London W1T 4JG, United Kingdom
24
Rocks JW, Mehta P. Memorizing without overfitting: bias, variance, and interpolation in overparameterized models. Physical Review Research 2022; 4:013201. [PMID: 36713351 PMCID: PMC9879296 DOI: 10.1103/physrevresearch.4.013201]
Abstract
The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities which strike a balance between bias and variance. Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized models" where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in two minimal models of over-parameterization (linear regression and two-layer neural networks with nonlinear data distributions), allowing us to disentangle properties stemming from the model architecture and random sampling of data. In both models, increasing the number of fit parameters leads to a phase transition where the training error goes to zero and the test error diverges as a result of the variance (while the bias remains finite). Beyond this threshold, the test error of the two-layer neural network decreases due to a monotonic decrease in both the bias and variance in contrast with the classical bias-variance trade-off. We also show that in contrast with classical intuition, over-parameterized models can overfit even in the absence of noise and exhibit bias even if the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.
Affiliation(s)
- Jason W Rocks
- Department of Physics, Boston University, Boston, Massachusetts 02215, USA
- Pankaj Mehta
- Department of Physics, Boston University, Boston, Massachusetts 02215, USA
- Faculty of Computing and Data Sciences, Boston University, Boston, Massachusetts 02215, USA
25
Gerace F, Saglietti L, Sarao Mannelli S, Saxe A, Zdeborová L. Probing transfer learning with a model of synthetic correlated datasets. Machine Learning: Science and Technology 2022. [DOI: 10.1088/2632-2153/ac4f3f]
Abstract
Transfer learning can significantly improve the sample efficiency of neural networks by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between datasets. This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task. Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data. Moreover, by exploiting parametric control over the correlation between the two datasets, we systematically investigate under which conditions the transfer of features is beneficial for generalization.
26
D'Amario V, Srivastava S, Sasaki T, Boix X. The Data Efficiency of Deep Learning Is Degraded by Unnecessary Input Dimensions. Front Comput Neurosci 2022; 16:760085. [PMID: 35173595 PMCID: PMC8842477 DOI: 10.3389/fncom.2022.760085]
Abstract
Biological learning systems are outstanding in their ability to learn from limited training data compared to the most successful learning machines, i.e., deep neural networks (DNNs). Which key aspects underlie this data-efficiency gap is an unresolved question at the core of biological and artificial intelligence. We hypothesize that one important aspect is that biological systems rely on mechanisms such as foveation to reduce unnecessary input dimensions for the task at hand, e.g., the background in object recognition, while state-of-the-art DNNs do not. Datasets used to train DNNs often contain such unnecessary input dimensions, and these lead to more trainable parameters. Yet it is not clear whether this affects DNNs' data efficiency, because DNNs are robust to increasing the number of parameters in the hidden layers, and it is uncertain whether this holds true for the input layer. In this paper, we investigate the impact of unnecessary input dimensions on DNNs' data efficiency, namely, the number of examples needed to achieve a given generalization performance. Our results show that task-unrelated input dimensions substantially degrade data efficiency. This highlights the need for mechanisms that remove task-unrelated dimensions, such as foveation for image classification, in order to enable data-efficiency gains.
Affiliation(s)
- Vanessa D'Amario
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, United States
- Center for Brains, Minds and Machines, Cambridge, MA, United States
- Sanjana Srivastava
- Center for Brains, Minds and Machines, Cambridge, MA, United States
- Department of Computer Science, Stanford University, Stanford, CA, United States
- Tomotake Sasaki
- Artificial Intelligence Laboratory, Fujitsu Limited, Kawasaki, Japan
- Xavier Boix
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, United States
- Center for Brains, Minds and Machines, Cambridge, MA, United States
27
Schneider J. Correlated Initialization for Correlated Data. Neural Process Lett 2022. [DOI: 10.1007/s11063-021-10728-y]
28
Allaire F, Mallet V, Filippi JB. Emulation of wildland fire spread simulation using deep learning. Neural Netw 2021; 141:184-198. [PMID: 33906084 DOI: 10.1016/j.neunet.2021.04.006]
Abstract
Numerical simulation of wildland fire spread is useful for predicting the locations that are likely to burn and for supporting decisions in an operational context, notably in crisis situations and long-term planning. In the short term, the computational time of traditional simulators is too high to be tractable over large zones such as a country or part of a country, especially for fire danger mapping. This issue is tackled by emulating the area of the burned surface returned after simulation of a fire igniting anywhere on the island of Corsica and spreading freely for one hour, under a wide range of possible environmental input conditions. A deep neural network with a hybrid architecture is used to account for two types of inputs: the spatial fields describing the surrounding landscape and the remaining scalar inputs. After training on a large simulation dataset, the network shows a satisfactory approximation error on a complementary test dataset, with a MAPE of 32.8%. The convolutional part is pre-computed, and the emulator is defined as the remaining part of the network, saving significant computational time. On a 32-core machine, the emulator has a speed-up factor of several thousand compared to the simulator, and the overall relationship between its inputs and output is consistent with the expected physical behavior of fire spread. This reduction in computational time allows the computation of a one-hour burned-area map for the whole island of Corsica in less than a minute, opening new applications in short-term fire danger mapping.
Affiliation(s)
- Frédéric Allaire
- Institut national de recherche en informatique et en automatique (INRIA), 2 rue Simone Iff, Paris, France; Sorbonne Université, Laboratoire Jacques-Louis Lions, France.
- Vivien Mallet
- Institut national de recherche en informatique et en automatique (INRIA), 2 rue Simone Iff, Paris, France; Sorbonne Université, Laboratoire Jacques-Louis Lions, France
- Jean-Baptiste Filippi
- Centre national de la recherche scientifique (CNRS), Sciences pour l'Environnement - Unité Mixte de Recherche 6134, Università di Corsica, Campus Grossetti, Corte, France
29
A Statistician Teaches Deep Learning. Journal of Statistical Theory and Practice 2021. [DOI: 10.1007/s42519-021-00193-0]
30
Rossbroich J, Trotter D, Beninger J, Tóth K, Naud R. Linear-nonlinear cascades capture synaptic dynamics. PLoS Comput Biol 2021; 17:e1008013. [PMID: 33720935 PMCID: PMC7993773 DOI: 10.1371/journal.pcbi.1008013]
Abstract
Short-term synaptic dynamics differ markedly across connections and strongly regulate how action potentials communicate information. To model the range of synaptic dynamics observed in experiments, we have developed a flexible mathematical framework based on a linear-nonlinear operation. This model can capture various experimentally observed features of synaptic dynamics and different types of heteroskedasticity. Despite its conceptual simplicity, we show that it is more adaptable than previous models. Combined with a standard maximum likelihood approach, synaptic dynamics can be accurately and efficiently characterized using naturalistic stimulation patterns. These results make explicit that synaptic processing bears algorithmic similarities with information processing in convolutional neural networks.
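The linear-nonlinear idea can be sketched schematically: convolve the presynaptic spike train with a linear kernel, then squash the result through a sigmoidal nonlinearity to obtain a release efficacy per time bin. This is not the paper's fitted model; the kernel shape and all parameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)

def linear_nonlinear_efficacy(spikes, kernel, b):
    # Linear stage: causal convolution of the spike train with a kernel
    drive = np.convolve(spikes, kernel)[:len(spikes)]
    # Nonlinear stage: sigmoidal readout gives an efficacy in (0, 1)
    return 1.0 / (1.0 + np.exp(-(drive + b)))

spikes = (rng.random(200) < 0.05).astype(float)   # 5% firing probability per bin
kernel = -2.0 * np.exp(-np.arange(50) / 10.0)     # negative kernel mimics depression
eff = linear_nonlinear_efficacy(spikes, kernel, b=1.0)
print(eff[:5])
```

With a negative kernel, efficacy dips after each spike and recovers with the kernel's time constant, a cartoon of short-term depression.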
Affiliation(s)
- Julian Rossbroich
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- Daniel Trotter
- Department of Physics, University of Ottawa, Ottawa, ON, Canada
- John Beninger
- uOttawa Brain Mind Institute, Center for Neural Dynamics, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON, Canada
- Katalin Tóth
- uOttawa Brain Mind Institute, Center for Neural Dynamics, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON, Canada
- Richard Naud
- Department of Physics, University of Ottawa, Ottawa, ON, Canada
- uOttawa Brain Mind Institute, Center for Neural Dynamics, Department of Cellular and Molecular Medicine, University of Ottawa, Ottawa, ON, Canada
31
Steinberg J, Advani M, Sompolinsky H. New role for circuit expansion for learning in neural networks. Phys Rev E 2021; 103:022404. [PMID: 33736047 DOI: 10.1103/physreve.103.022404]
Abstract
Many sensory pathways in the brain include sparsely active populations of neurons downstream from the input stimuli. The biological purpose of this expanded structure is unclear, but it may be beneficial due to the increased expressive power of the network. In this work, we show that certain ways of expanding a neural network can improve its generalization performance even when the expanded structure is pruned after the learning period. To study this setting, we use a teacher-student framework where a perceptron teacher network generates labels corrupted with small amounts of noise. We then train a student network structurally matched to the teacher. In this scenario, the student can achieve optimal accuracy if given the teacher's synaptic weights. We find that sparse expansion of the input layer of a student perceptron network both increases its capacity and improves its generalization performance when learning a noisy rule from a teacher perceptron, even when the expansion is pruned after learning. We find similar behavior when the expanded units are stochastic and uncorrelated with the input, and we analyze this network in the mean-field limit. By solving the mean-field equations, we show that the generalization error of the stochastic expanded student network continues to drop as the size of the network increases. This improvement in generalization performance occurs despite the increased complexity of the student network relative to the teacher it is trying to learn. We show that this effect is closely related to the addition of slack variables in artificial neural networks and suggest possible implications for artificial and biological neural networks.
Affiliation(s)
- Julia Steinberg
- Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
- Department of Physics, Harvard University, Cambridge, Massachusetts 02138, USA
- Madhu Advani
- Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
- Haim Sompolinsky
- Center for Brain Science, Harvard University, Cambridge, Massachusetts 02138, USA
- Edmond and Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem 91904, Israel
32
Goldt S, Advani MS, Saxe AM, Krzakala F, Zdeborová L. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Journal of Statistical Mechanics (Online) 2020; 2020:124010. [PMID: 34262607 PMCID: PMC8252911 DOI: 10.1088/1742-5468/abc61e]
Abstract
Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
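A minimal teacher-student simulation in this spirit, assuming tanh activations, a fixed second layer, and illustrative sizes (a sketch of the setup, not the paper's exact model or its ODE analysis): the over-parameterised student is trained by online SGD on labels generated by a smaller teacher, and its generalisation error falls during training.

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, m = 50, 2, 8          # input dimension, teacher width, student width

# Teacher network that generates the labels (a small soft committee machine)
Wt = rng.standard_normal((k, d)) / np.sqrt(d)
def teacher(X):
    return np.tanh(X @ Wt.T).sum(axis=1)

def gen_error(W, n_test=2000):
    # Generalisation error on fresh Gaussian inputs
    X = rng.standard_normal((n_test, d))
    return np.mean((np.tanh(X @ W.T).sum(axis=1) - teacher(X)) ** 2)

# Over-parameterised student (m > k); train only the first layer by online SGD
Ws = 0.1 * rng.standard_normal((m, d)) / np.sqrt(d)
e0 = gen_error(Ws)
lr = 0.1 / d                # small learning rate for stability
for _ in range(50_000):
    x = rng.standard_normal(d)
    h = np.tanh(Ws @ x)
    err = h.sum() - teacher(x[None])[0]
    Ws -= lr * err * np.outer(1 - h ** 2, x)   # gradient of the loss 0.5 * err**2

print(e0, gen_error(Ws))    # generalisation error drops during training
```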
Affiliation(s)
- Sebastian Goldt
- Institut de Physique Théorique, CNRS, CEA, Université Paris-Saclay, France
- Madhu S Advani
- Center for Brain Science, Harvard University, Cambridge, MA 02138, United States of America
- Andrew M Saxe
- Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom
- Florent Krzakala
- Laboratoire de Physique Statistique, Sorbonne Universités, Université Pierre et Marie Curie Paris 6, Ecole Normale Supérieure, 75005 Paris, France
- Lenka Zdeborová
- Institut de Physique Théorique, CNRS, CEA, Université Paris-Saclay, France