1
Giroud J, Trébuchon A, Mercier M, Davis MH, Morillon B. The human auditory cortex concurrently tracks syllabic and phonemic timescales via acoustic spectral flux. Science Advances 2024; 10:eado8915. PMID: 39705351; DOI: 10.1126/sciadv.ado8915.
Abstract
Dynamical theories of speech processing propose that the auditory cortex parses acoustic information in parallel at the syllabic and phonemic timescales. We developed a paradigm to independently manipulate both linguistic timescales and acquired intracranial recordings from 11 patients with epilepsy listening to French sentences. Our results indicate that (i) syllabic and phonemic timescales are both reflected in the acoustic spectral flux; (ii) during comprehension, the auditory cortex tracks the syllabic timescale in the theta range, while neural activity in the alpha-beta range phase-locks to the phonemic timescale; (iii) these neural dynamics occur simultaneously and share a joint spatial location; (iv) the spectral flux embeds two timescales (in the theta and low-beta ranges) across 17 natural languages. These findings help us understand how the human brain extracts acoustic information from the continuous speech signal at multiple timescales simultaneously, a prerequisite for subsequent linguistic processing.
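A minimal sketch of how an acoustic spectral-flux time series of the kind analyzed here can be computed from a waveform (illustrative window and hop sizes; not the authors' implementation). The modulation content of this series is what would then be inspected for syllabic- and phonemic-rate peaks.

```python
# Minimal spectral-flux sketch (illustrative; not the authors' implementation).
# Spectral flux here is the frame-to-frame change in the magnitude spectrum, a signal
# whose modulation spectrum can be inspected for peaks at syllabic (~theta) and
# phonemic (~low-beta) rates. Window/hop sizes are assumptions for illustration.
import numpy as np

def spectral_flux(waveform: np.ndarray, sr: int, win_s: float = 0.025, hop_s: float = 0.010) -> np.ndarray:
    """Return a spectral-flux time series sampled every `hop_s` seconds."""
    win = int(win_s * sr)
    hop = int(hop_s * sr)
    window = np.hanning(win)
    n_frames = 1 + (len(waveform) - win) // hop
    frames = np.stack([waveform[i * hop:i * hop + win] * window for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))                   # magnitude spectrogram
    diff = np.diff(mag, axis=0)                                 # frame-to-frame spectral change
    return np.sqrt((np.maximum(diff, 0.0) ** 2).sum(axis=1))   # half-wave-rectified flux

# Example: a 2-s noise snippet at 16 kHz; the flux series has one value per 10-ms hop.
sr = 16000
flux = spectral_flux(np.random.randn(2 * sr), sr)
print(flux.shape)  # -> (n_frames - 1,)
```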
Affiliation(s)
- Jérémy Giroud
- MRC Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, UK
- Agnès Trébuchon
- Aix Marseille Université, INSERM, INS, Institut de Neurosciences des Systèmes, Marseille, France
- APHM, Clinical Neurophysiology, Timone Hospital, Marseille, France
- Manuel Mercier
- Aix Marseille Université, INSERM, INS, Institut de Neurosciences des Systèmes, Marseille, France
- Matthew H Davis
- MRC Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, UK
- Benjamin Morillon
- Aix Marseille Université, INSERM, INS, Institut de Neurosciences des Systèmes, Marseille, France
2
Ahmed B, Downer JD, Malone BJ, Makin JG. Deep Neural Networks Explain Spiking Activity in Auditory Cortex. bioRxiv [Preprint] 2024:2024.11.12.623280. PMID: 39605715; PMCID: PMC11601425; DOI: 10.1101/2024.11.12.623280.
Abstract
For static stimuli or at gross (~1 s) time scales, artificial neural networks (ANNs) that have been trained on challenging engineering tasks, like image classification and automatic speech recognition, are now the best predictors of neural responses in primate visual and auditory cortex. It is, however, unknown whether this success can be extended to spiking activity at fine time scales, which are particularly relevant to audition. Here we address this question with ANNs trained on speech audio, and acute multi-electrode recordings from the auditory cortex of squirrel monkeys. We show that layers of trained ANNs can predict the spike counts of neurons responding to speech audio and to monkey vocalizations at bin widths of 50 ms and below. For some neurons, the ANNs explain close to all of the explainable variance, much more than traditional spectrotemporal-receptive-field models and more than untrained networks. Non-primary neurons tend to be more predictable by deeper layers of the ANNs, but there is much variation by neuron, which would be invisible to coarser recording modalities.
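A hedged sketch of the general layer-wise encoding approach the abstract describes: activations from one ANN layer, aligned to 50-ms bins, are mapped to a neuron's binned spike counts with ridge regression and scored by held-out correlation. The shapes, synthetic data, and regularization strength are illustrative assumptions rather than the paper's settings.

```python
# Illustrative layer-wise encoding-model sketch (assumed shapes and hyperparameters;
# not the authors' pipeline). X holds ANN-layer activations aligned to 50-ms bins,
# y holds a neuron's spike counts in the same bins.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_bins, n_features = 2000, 256                 # hypothetical: 2000 bins x 256 layer units
X = rng.standard_normal((n_bins, n_features))
true_w = rng.standard_normal(n_features)
y = rng.poisson(np.exp(0.05 * X @ true_w))     # synthetic spike counts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=10.0).fit(X_tr, y_tr)      # alpha is an illustrative choice
pred = model.predict(X_te)
r = np.corrcoef(pred, y_te)[0, 1]              # held-out prediction correlation
print(f"held-out r = {r:.3f}")
```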
Affiliation(s)
- Bilal Ahmed
- Elmore School of Electrical and Computer Engineering, Purdue University
- Joshua D Downer
- Otolaryngology and Head and Neck Surgery, University of California, San Francisco
- Brian J Malone
- Otolaryngology and Head and Neck Surgery, University of California, San Francisco
- Center for Neuroscience, UC Davis
- Joseph G Makin
- Elmore School of Electrical and Computer Engineering, Purdue University
3
Anderson AJ, Davis C, Lalor EC. Deep-learning models reveal how context and listener attention shape electrophysiological correlates of speech-to-language transformation. PLoS Comput Biol 2024; 20:e1012537. PMID: 39527649; PMCID: PMC11581396; DOI: 10.1371/journal.pcbi.1012537.
Abstract
To transform continuous speech into words, the human brain must resolve variability across utterances in intonation, speech rate, volume, accents, and so on. A promising approach to explaining this process has been to model electroencephalogram (EEG) recordings of brain responses to speech. Contemporary models typically invoke context-invariant speech categories (e.g., phonemes) as an intermediary representational stage between sounds and words. However, such models may not capture the complete picture because they do not model the brain mechanism that categorizes sounds and consequently may overlook associated neural representations. By providing end-to-end accounts of speech-to-text transformation, new deep-learning systems could enable more complete brain models. We model EEG recordings of audiobook comprehension with the deep-learning speech recognition system Whisper. We find that (1) Whisper provides a self-contained EEG model of an intermediary representational stage that reflects elements of prelexical and lexical representation and prediction; (2) EEG modeling is more accurate when informed by 5 to 10 s of speech context, which traditional context-invariant categorical models do not encode; (3) deep Whisper layers encoding linguistic structure were more accurate EEG models of selectively attended speech in two-speaker "cocktail party" listening conditions than early layers encoding acoustics. No such layer-depth advantage was observed for unattended speech, consistent with a more superficial level of linguistic processing in the brain.
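A schematic sketch of the lagged (TRF-style) regression commonly used to map model features onto EEG, in the spirit of the modeling described above; the Whisper-layer embeddings are assumed to be precomputed and resampled to the EEG rate, and the lag range, shapes, and regularization are illustrative.

```python
# Schematic lagged (TRF-style) encoding model mapping precomputed speech-model features
# to EEG channels (illustrative shapes, lags, and regularization; not the paper's code).
import numpy as np
from sklearn.linear_model import Ridge

def add_lags(features: np.ndarray, lags: range) -> np.ndarray:
    """Stack time-shifted copies of the feature matrix to model delayed responses."""
    n_t, n_f = features.shape
    out = np.zeros((n_t, n_f * len(lags)))
    for j, lag in enumerate(lags):
        shifted = np.roll(features, lag, axis=0)
        if lag > 0:
            shifted[:lag] = 0
        out[:, j * n_f:(j + 1) * n_f] = shifted
    return out

rng = np.random.default_rng(1)
n_samples, n_feat, n_chan = 6000, 64, 32                 # hypothetical: 60 s at 100 Hz, 64 features, 32 channels
features = rng.standard_normal((n_samples, n_feat))      # stand-in for resampled model-layer embeddings
eeg = rng.standard_normal((n_samples, n_chan))           # stand-in for EEG

X = add_lags(features, range(0, 40, 4))                  # lags 0-360 ms at 100 Hz (assumed)
split = int(0.8 * n_samples)
trf = Ridge(alpha=100.0).fit(X[:split], eeg[:split])
pred = trf.predict(X[split:])
r_per_channel = [np.corrcoef(pred[:, c], eeg[split:, c])[0, 1] for c in range(n_chan)]
print(f"mean held-out r across channels = {np.mean(r_per_channel):.3f}")
```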
Affiliation(s)
- Andrew J. Anderson
- Department of Neurology, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
- Department of Biomedical Engineering, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
- Department of Neurosurgery, Medical College of Wisconsin, Milwaukee, Wisconsin, United States of America
- Department of Neuroscience and Del Monte Institute for Neuroscience, University of Rochester, Rochester, New York, United States of America
- Chris Davis
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Westmead Innovation Quarter, Westmead, New South Wales, Australia
- Edmund C. Lalor
- Department of Neuroscience and Del Monte Institute for Neuroscience, University of Rochester, Rochester, New York, United States of America
- Department of Biomedical Engineering, University of Rochester, Rochester, New York, United States of America
- Center for Visual Science, University of Rochester, Rochester, New York, United States of America
4
Nentwich M, Leszczynski M, Schroeder CE, Bickel S, Parra LC. Intrinsic dynamic shapes responses to external stimulation in the human brain. bioRxiv [Preprint] 2024:2024.08.05.606665. PMID: 39463938; PMCID: PMC11507726; DOI: 10.1101/2024.08.05.606665.
Abstract
Sensory stimulation of the brain reverberates in its recurrent neuronal networks. However, current computational models of brain activity do not separate immediate sensory responses from intrinsic recurrent dynamics. We apply a vector autoregressive model with external input (VARX), combining the concepts of "functional connectivity" and "encoding models", to intracranial recordings in humans. We find that the recurrent connectivity observed during rest is largely unaltered during movie watching. The intrinsic recurrent dynamic enhances and prolongs the neural responses to scene cuts, eye movements, and sounds. Failing to account for these exogenous inputs leads to spurious connections in the intrinsic "connectivity". The model shows that an external stimulus can reduce intrinsic noise. It also shows that sensory areas have mostly outgoing connections, whereas higher-order brain areas have mostly incoming connections. We conclude that the response to an external audiovisual stimulus can largely be attributed to the intrinsic dynamic of the brain, already observed during rest.
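A toy sketch of the VARX idea: each channel is regressed on its own recent past (the recurrent "connectivity" term) and on recent external inputs (the "encoding" term). Model orders, shapes, and the plain least-squares fit are illustrative assumptions, not the authors' estimator.

```python
# Toy VARX (vector autoregression with external input) sketch, illustrative only:
# y_t is modeled as sum_k A_k y_{t-k} + sum_k B_k x_{t-k}. Orders and shapes are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_t, n_chan, n_input, p, q = 5000, 8, 2, 3, 5    # time points, channels, inputs, AR order, input order
y = rng.standard_normal((n_t, n_chan))           # stand-in neural signals
x = rng.standard_normal((n_t, n_input))          # stand-in external inputs (e.g., sound envelope, scene cuts)

# Build the design matrix: lagged copies of y (recurrent term) and x (exogenous term).
rows = []
for t in range(max(p, q), n_t):
    past_y = y[t - p:t][::-1].ravel()            # y_{t-1}, ..., y_{t-p}
    past_x = x[t - q:t][::-1].ravel()            # x_{t-1}, ..., x_{t-q}
    rows.append(np.concatenate([past_y, past_x]))
design = np.asarray(rows)
target = y[max(p, q):]

# Ordinary least-squares fit of all A_k and B_k jointly.
coef, *_ = np.linalg.lstsq(design, target, rcond=None)
A = coef[:p * n_chan].reshape(p, n_chan, n_chan)     # recurrent ("connectivity") coefficients
B = coef[p * n_chan:].reshape(q, n_input, n_chan)    # input ("encoding") coefficients
print(A.shape, B.shape)
```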
Affiliation(s)
- Maximilian Nentwich
- The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Marcin Leszczynski
- Departments of Psychiatry and Neurology, Columbia University College of Physicians and Surgeons, New York, NY, USA
- Translational Neuroscience Lab Division, Center for Biomedical Imaging and Neuromodulation, Nathan Kline Institute, Orangeburg, NY, USA
- Cognitive Science Department, Institute of Philosophy, Jagiellonian University, Kraków, Poland
- Charles E Schroeder
- Departments of Psychiatry and Neurology, Columbia University College of Physicians and Surgeons, New York, NY, USA
- Translational Neuroscience Lab Division, Center for Biomedical Imaging and Neuromodulation, Nathan Kline Institute, Orangeburg, NY, USA
- Stephan Bickel
- The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Departments of Neurology and Neurosurgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY, USA
- Center for Biomedical Imaging and Neuromodulation, Nathan Kline Institute, Orangeburg, NY, USA
- Lucas C Parra
- Department of Biomedical Engineering, The City College of New York, New York, NY, USA
5
Ara A, Provias V, Sitek K, Coffey EBJ, Zatorre RJ. Cortical-subcortical interactions underlie processing of auditory predictions measured with 7T fMRI. Cereb Cortex 2024; 34:bhae316. PMID: 39087881; PMCID: PMC11292673; DOI: 10.1093/cercor/bhae316.
Abstract
Perception integrates both sensory inputs and internal models of the environment. In the auditory domain, predictions play a critical role because of the temporal nature of sounds. However, the precise contributions of cortical and subcortical structures to these processes, and their interaction, remain unclear. It is also unclear whether these brain interactions are specific to abstract rules or whether they also underlie the predictive coding of local features. We used high-field 7T functional magnetic resonance imaging to investigate interactions between cortical and subcortical areas during auditory predictive processing. Volunteers listened to tone sequences in an oddball paradigm in which the predictability of the deviant was manipulated. Perturbations in periodicity were also introduced to test the specificity of the response. Results indicate that both cortical and subcortical auditory structures encode high-order predictive dynamics, with the effect of predictability being strongest in the auditory cortex. These predictive dynamics were best explained by modeling a top-down information flow, in contrast to unpredicted responses. No error signals were observed in response to deviations in periodicity, suggesting that these responses are specific to abstract rule violations. Our results support the idea that the high-order predictive dynamics observed in subcortical areas propagate from the auditory cortex.
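A trivial sketch of the kind of predictability manipulation described above: in a predictable block the deviant occupies a fixed position within each cycle, whereas in an unpredictable block its position is drawn at random at the same overall rate. Sequence lengths and probabilities are made up for illustration.

```python
# Toy oddball sequence generator (illustrative; the actual stimulus design may differ).
# "Predictable" deviants recur at a fixed position in every cycle; "unpredictable"
# deviants fall at a random position with the same overall probability.
import numpy as np

rng = np.random.default_rng(7)

def oddball_block(n_cycles=20, cycle_len=8, predictable=True):
    block = []
    for _ in range(n_cycles):
        cycle = ["standard"] * cycle_len
        pos = cycle_len - 1 if predictable else rng.integers(cycle_len)
        cycle[pos] = "deviant"
        block.extend(cycle)
    return block

predictable = oddball_block(predictable=True)
unpredictable = oddball_block(predictable=False)
print(predictable.count("deviant"), unpredictable.count("deviant"))   # same deviant rate in both
```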
Affiliation(s)
- Alberto Ara
- Montreal Neurological Institute, McGill University, 3801 University Street, Montreal, QC H3A 2B4, Canada
- International Laboratory for Brain, Music and Sound Research (BRAMS), 90 Vincent-d’Indy Avenue, Outremont, QC H2V 2S9, Canada
- Centre for Research in Brain, Language and Music (CRBLM), 3640 de la Montagne Street, Montreal, QC H3G 2A8, Canada
- Vasiliki Provias
- International Laboratory for Brain, Music and Sound Research (BRAMS), 90 Vincent-d’Indy Avenue, Outremont, QC H2V 2S9, Canada
- Centre for Research in Brain, Language and Music (CRBLM), 3640 de la Montagne Street, Montreal, QC H3G 2A8, Canada
- Department of Psychology, Concordia University, 7141 Sherbrooke Street West, Montreal, QC H4B 1R6, Canada
- Kevin Sitek
- Department of Communication Sciences and Disorders, Northwestern University, 2240 Campus Drive, Evanston, IL 60208, USA
- Emily B J Coffey
- International Laboratory for Brain, Music and Sound Research (BRAMS), 90 Vincent-d’Indy Avenue, Outremont, QC H2V 2S9, Canada
- Centre for Research in Brain, Language and Music (CRBLM), 3640 de la Montagne Street, Montreal, QC H3G 2A8, Canada
- Department of Psychology, Concordia University, 7141 Sherbrooke Street West, Montreal, QC H4B 1R6, Canada
- Robert J Zatorre
- Montreal Neurological Institute, McGill University, 3801 University Street, Montreal, QC H3A 2B4, Canada
- International Laboratory for Brain, Music and Sound Research (BRAMS), 90 Vincent-d’Indy Avenue, Outremont, QC H2V 2S9, Canada
- Centre for Research in Brain, Language and Music (CRBLM), 3640 de la Montagne Street, Montreal, QC H3G 2A8, Canada
6
Kumar S, Sumers TR, Yamakoshi T, Goldstein A, Hasson U, Norman KA, Griffiths TL, Hawkins RD, Nastase SA. Shared functional specialization in transformer-based language models and the human brain. Nat Commun 2024; 15:5523. PMID: 38951520; PMCID: PMC11217339; DOI: 10.1038/s41467-024-49173-5.
Abstract
When processing language, the brain is thought to deploy specialized computations to construct meaning from complex linguistic structures. Recently, artificial neural networks based on the Transformer architecture have revolutionized the field of natural language processing. Transformers integrate contextual information across words via structured circuit computations. Prior work has focused on the internal representations ("embeddings") generated by these circuits. In this paper, we instead analyze the circuit computations directly: we deconstruct these computations into the functionally specialized "transformations" that integrate contextual information across words. Using functional MRI data acquired while participants listened to naturalistic stories, we first verify that the transformations account for considerable variance in brain activity across the cortical language network. We then demonstrate that the emergent computations performed by individual, functionally specialized "attention heads" differentially predict brain activity in specific cortical regions. These heads fall along gradients corresponding to different layers and context lengths in a low-dimensional cortical space.
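A small numpy sketch of what a per-head "transformation" is: each attention head computes a weighted sum of value vectors across context words, and keeping these head-wise outputs separate (rather than summing them into a single embedding) yields the per-head features that an encoding analysis like the one above would regress against brain activity. Dimensions and random weights are illustrative.

```python
# Numpy sketch of per-head attention "transformations" (illustrative dimensions and
# random weights; not the paper's model). Each head's contextual output is kept
# separate so it can serve as a distinct feature set for an encoding model.
import numpy as np

rng = np.random.default_rng(3)
n_tokens, d_model, n_heads = 12, 64, 8
d_head = d_model // n_heads

tokens = rng.standard_normal((n_tokens, d_model))        # stand-in token embeddings
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(3))

def per_head_transformations(x):
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    # reshape to (heads, tokens, d_head)
    q, k, v = (m.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2) for m in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, tokens, tokens)
    # causal mask: each token attends only to itself and earlier tokens
    mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                        # (heads, tokens, d_head): one "transformation" per head

heads_out = per_head_transformations(tokens)
print(heads_out.shape)   # (8, 12, 8): per-head contextual features for each token
```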
Affiliation(s)
- Sreejan Kumar
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ, 08540, USA
- Theodore R Sumers
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA
- Takateru Yamakoshi
- Faculty of Medicine, The University of Tokyo, Bunkyo-ku, Tokyo, 113-0033, Japan
- Ariel Goldstein
- Department of Cognitive and Brain Sciences and Business School, Hebrew University, Jerusalem, 9190401, Israel
- Uri Hasson
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ, 08540, USA
- Department of Psychology, Princeton University, Princeton, NJ, 08540, USA
- Kenneth A Norman
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ, 08540, USA
- Department of Psychology, Princeton University, Princeton, NJ, 08540, USA
- Thomas L Griffiths
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA
- Department of Psychology, Princeton University, Princeton, NJ, 08540, USA
- Robert D Hawkins
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ, 08540, USA
- Department of Psychology, Princeton University, Princeton, NJ, 08540, USA
- Samuel A Nastase
- Princeton Neuroscience Institute, Princeton University, Princeton, NJ, 08540, USA
7
Wang R, Chen ZS. Large-scale foundation models and generative AI for BigData neuroscience. Neurosci Res 2024:S0168-0102(24)00075-0. PMID: 38897235; PMCID: PMC11649861; DOI: 10.1016/j.neures.2024.06.003.
Abstract
Recent advances in machine learning have led to revolutionary breakthroughs in computer games, image and natural language understanding, and scientific discovery. Foundation models and large-scale language models (LLMs) have recently achieved human-like intelligence thanks to BigData. With the help of self-supervised learning (SSL) and transfer learning, these models may potentially reshape the landscapes of neuroscience research and make a significant impact on the future. Here we present a mini-review on recent advances in foundation models and generative AI models as well as their applications in neuroscience, including natural language and speech, semantic memory, brain-machine interfaces (BMIs), and data augmentation. We argue that this paradigm-shift framework will open new avenues for many neuroscience research directions and discuss the accompanying challenges and opportunities.
Affiliation(s)
- Ran Wang
- Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
- Zhe Sage Chen
- Department of Psychiatry, New York University Grossman School of Medicine, New York, NY 10016, USA
- Department of Neuroscience and Physiology, Neuroscience Institute, New York University Grossman School of Medicine, New York, NY 10016, USA
- Department of Biomedical Engineering, New York University Tandon School of Engineering, Brooklyn, NY 11201, USA
8
Li Y, Yang H, Gu S. Enhancing neural encoding models for naturalistic perception with a multi-level integration of deep neural networks and cortical networks. Sci Bull (Beijing) 2024; 69:1738-1747. PMID: 38490889; DOI: 10.1016/j.scib.2024.02.035.
Abstract
Cognitive neuroscience aims to develop computational models that can accurately predict and explain neural responses to sensory inputs in the cortex. Recent studies attempt to leverage the representational power of deep neural networks (DNNs) to predict brain responses and suggest a correspondence between artificial and biological neural networks in their feature representations. However, typical voxel-wise encoding models tend to rely on specific networks designed for computer vision tasks, leading to suboptimal brain-wide correspondence during cognitive tasks. To address this challenge, this work proposes a novel approach that upgrades voxel-wise encoding models through multi-level integration of features from DNNs and information from brain networks. Our approach combines DNN feature-level ensemble learning and brain atlas-level model integration, resulting in significant improvements in predicting whole-brain neural activity during naturalistic video perception. Furthermore, this multi-level integration framework enables a deeper understanding of the brain's neural representation mechanism, accurately predicting the neural response to complex visual concepts. We demonstrate that neural encoding models can be optimized by leveraging a framework that integrates both data-driven approaches and theoretical insights into the functional structure of the cortical networks.
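A toy sketch of feature-level ensemble learning in an encoding setting: per-network ridge predictions for a voxel are combined by a second-stage linear model. This is only loosely analogous to the multi-level framework described above, and all data and dimensions are synthetic.

```python
# Toy feature-level ensemble encoding sketch (synthetic data; loosely analogous to
# combining features from multiple DNNs rather than the authors' full framework).
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(6)
n_stim = 500
feat_sets = [rng.standard_normal((n_stim, d)) for d in (64, 128, 96)]   # stand-ins for 3 DNNs' features
voxel = feat_sets[1] @ rng.standard_normal(128) + 0.5 * rng.standard_normal(n_stim)

split = int(0.6 * n_stim)
# First stage: one ridge model per feature set, trained on the first portion of the data.
base_models = [Ridge(alpha=1.0).fit(f[:split], voxel[:split]) for f in feat_sets]
# Second stage: learn how to weight the base predictions on held-out data.
stack_X = np.column_stack([m.predict(f[split:]) for m, f in zip(base_models, feat_sets)])
stacker = LinearRegression().fit(stack_X[:100], voxel[split:split + 100])
pred = stacker.predict(stack_X[100:])
r = np.corrcoef(pred, voxel[split + 100:])[0, 1]
print(f"ensemble held-out r = {r:.3f}")
```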
Affiliation(s)
- Yuanning Li
- School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai 201210, China
- Huzheng Yang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Shi Gu
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, China
9
Rupp KM, Hect JL, Harford EE, Holt LL, Ghuman AS, Abel TJ. A hierarchy of processing complexity and timescales for natural sounds in human auditory cortex. bioRxiv [Preprint] 2024:2024.05.24.595822. PMID: 38826304; PMCID: PMC11142240; DOI: 10.1101/2024.05.24.595822.
Abstract
Efficient behavior is supported by humans' ability to rapidly recognize acoustically distinct sounds as members of a common category. Within auditory cortex, there are critical unanswered questions regarding the organization and dynamics of sound categorization. Here, we performed intracerebral recordings in the context of epilepsy surgery as 20 patient-participants listened to natural sounds. We built encoding models to predict neural responses using features of these sounds extracted from different layers within a sound-categorization deep neural network (DNN). This approach yielded highly accurate models of neural responses throughout auditory cortex. The complexity of a cortical site's representation (measured by the depth of the DNN layer that produced the best model) was closely related to its anatomical location, with shallow, middle, and deep layers of the DNN associated with core (primary auditory cortex), lateral belt, and parabelt regions, respectively. Smoothly varying gradients of representational complexity also existed within these regions, with complexity increasing along a posteromedial-to-anterolateral direction in core and lateral belt, and along posterior-to-anterior and dorsal-to-ventral dimensions in parabelt. When we estimated the time window over which each recording site integrates information, we found shorter integration windows in core relative to lateral belt and parabelt. Lastly, we found a relationship between the length of the integration window and the complexity of information processing within core (but not lateral belt or parabelt). These findings suggest that hierarchies of timescales and processing complexity, and their interrelationship, represent a functional organizational principle of the auditory stream that underlies our perception of complex, abstract auditory information.
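A hedged sketch of the "best layer" logic described above: responses at one recording site are predicted from each DNN layer's features in turn, and the depth of the best-predicting layer serves as the complexity index. Features and responses here are synthetic placeholders.

```python
# Illustrative "best-layer" analysis (synthetic data; not the authors' pipeline).
# For one recording site, fit a ridge encoding model per DNN layer and take the depth
# of the layer with the highest held-out correlation as the complexity index.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
n_stim, n_layers, n_feat = 400, 6, 128
layer_feats = [rng.standard_normal((n_stim, n_feat)) for _ in range(n_layers)]  # stand-in DNN features per layer
response = layer_feats[3] @ rng.standard_normal(n_feat) + rng.standard_normal(n_stim)  # site driven by layer 3

split = int(0.8 * n_stim)
scores = []
for feats in layer_feats:
    model = Ridge(alpha=1.0).fit(feats[:split], response[:split])
    r = np.corrcoef(model.predict(feats[split:]), response[split:])[0, 1]
    scores.append(r)

best_layer = int(np.argmax(scores))
print(f"best-predicting layer (complexity index): {best_layer}")  # expected: 3
```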
Affiliation(s)
- Kyle M. Rupp
- Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Jasmine L. Hect
- Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Emily E. Harford
- Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Lori L. Holt
- Department of Psychology, The University of Texas at Austin, Austin, Texas, United States of America
- Avniel Singh Ghuman
- Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Taylor J. Abel
- Department of Neurological Surgery, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
10
Tuckute G, Feather J, Boebinger D, McDermott JH. Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions. PLoS Biol 2023; 21:e3002366. PMID: 38091351; PMCID: PMC10718467; DOI: 10.1371/journal.pbio.3002366.
Abstract
Models that predict brain responses to stimuli provide one measure of understanding of a sensory system and have many potential applications in science and engineering. Deep artificial neural networks have emerged as the leading such predictive models of the visual system but are less explored in audition. Prior work provided examples of audio-trained neural networks that produced good predictions of auditory cortical fMRI responses and exhibited correspondence between model stages and brain regions, but left it unclear whether these results generalize to other neural network models and, thus, how to further improve models in this domain. We evaluated model-brain correspondence for publicly available audio neural network models along with in-house models trained on 4 different tasks. Most tested models outpredicted standard spectrotemporal filter-bank models of auditory cortex and exhibited systematic model-brain correspondence: middle stages best predicted primary auditory cortex, while deep stages best predicted non-primary cortex. However, some state-of-the-art models produced substantially worse brain predictions. Models trained to recognize speech in background noise produced better brain predictions than models trained to recognize speech in quiet, potentially because hearing in noise imposes constraints on biological auditory representations. The training task influenced the prediction quality for specific cortical tuning properties, with best overall predictions resulting from models trained on multiple tasks. The results generally support the promise of deep neural networks as models of audition, though they also indicate that current models do not explain auditory cortical responses in their entirety.
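A minimal sketch of the kind of spectrotemporal filter-bank baseline referenced above: a log-spectrogram is convolved with 2D Gabor-like modulation filters at a few temporal and spectral rates, yielding the features a baseline encoding model would use. Filter parameters and the random input are illustrative assumptions, not the paper's baseline.

```python
# Minimal spectrotemporal modulation filter-bank sketch (illustrative parameters;
# one simple instance of the class of baseline models referenced in the abstract).
import numpy as np
from scipy.signal import fftconvolve

def gabor_2d(temporal_rate_hz, spectral_rate_cyc_per_oct, frame_rate=100.0, bins_per_oct=12,
             size=(21, 21)):
    """2D Gabor filter over (frequency bins, time frames)."""
    f = (np.arange(size[0]) - size[0] // 2) / bins_per_oct        # octaves
    t = (np.arange(size[1]) - size[1] // 2) / frame_rate          # seconds
    F, T = np.meshgrid(f, t, indexing="ij")
    carrier = np.cos(2 * np.pi * (spectral_rate_cyc_per_oct * F + temporal_rate_hz * T))
    envelope = np.exp(-(F ** 2) / 0.08 - (T ** 2) / 0.005)
    return carrier * envelope

rng = np.random.default_rng(5)
log_spec = rng.standard_normal((48, 300))    # stand-in log-spectrogram: 48 freq bins x 3 s at 100 Hz

features = []
for rate in (2.0, 4.0, 8.0):                 # temporal modulation rates (Hz), assumed
    for scale in (0.5, 1.0, 2.0):            # spectral modulation rates (cyc/oct), assumed
        filt = gabor_2d(rate, scale)
        features.append(np.abs(fftconvolve(log_spec, filt, mode="same")).mean(axis=0))
features = np.stack(features, axis=1)        # (time frames, n_filters) baseline feature matrix
print(features.shape)                         # (300, 9)
```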
Affiliation(s)
- Greta Tuckute
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Jenelle Feather
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Dana Boebinger
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Program in Speech and Hearing Biosciences and Technology, Harvard, Cambridge, Massachusetts, United States of America
- University of Rochester Medical Center, Rochester, New York, United States of America
- Josh H. McDermott
- Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, MIT, Cambridge, Massachusetts, United States of America
- Center for Brains, Minds, and Machines, MIT, Cambridge, Massachusetts, United States of America
- Program in Speech and Hearing Biosciences and Technology, Harvard, Cambridge, Massachusetts, United States of America
11
Antonello RJ, Vaidya AR, Huth AG. Scaling laws for language encoding models in fMRI. Advances in Neural Information Processing Systems 2023; 36:21895-21907. PMID: 39035676; PMCID: PMC11258918.
Abstract
Representations from transformer-based unidirectional language models are known to be effective at predicting brain responses to natural language. However, most studies comparing language models to brains have used GPT-2 or similarly sized language models. Here we tested whether larger open-source models such as those from the OPT and LLaMA families are better at predicting brain responses recorded using fMRI. Mirroring scaling results from other contexts, we found that brain prediction performance scales logarithmically with model size from 125M to 30B parameter models, with ~15% increased encoding performance as measured by correlation with a held-out test set across 3 subjects. Similar logarithmic behavior was observed when scaling the size of the fMRI training set. We also characterized scaling for acoustic encoding models that use HuBERT, WavLM, and Whisper, and we found comparable improvements with model size. A noise ceiling analysis of these large, high-performance encoding models showed that performance is nearing the theoretical maximum for brain areas such as the precuneus and higher auditory cortex. These results suggest that increasing scale in both models and data will yield incredibly effective models of language processing in the brain, enabling better scientific understanding as well as applications such as decoding.
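A small sketch of how the logarithmic scaling trend can be summarized: encoding performance is fit as a linear function of log model size. The performance values below are made up; only the fitting procedure is the point.

```python
# Illustrative fit of a logarithmic scaling trend (made-up performance values;
# only the fitting procedure is the point). Encoding correlation is modeled as
# r = a + b * log10(n_parameters).
import numpy as np

n_params = np.array([125e6, 350e6, 1.3e9, 6.7e9, 30e9])   # model sizes (assumed grid)
encoding_r = np.array([0.20, 0.21, 0.22, 0.225, 0.23])    # hypothetical mean held-out correlations

b, a = np.polyfit(np.log10(n_params), encoding_r, deg=1)  # slope per decade of parameters
print(f"fit: r = {a:.3f} + {b:.4f} * log10(n_params)")
print(f"predicted r at 70B params: {a + b * np.log10(70e9):.3f}")
```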
Affiliation(s)
| | - Aditya R Vaidya
- Department of Computer Science, The University of Texas at Austin
| | - Alexander G Huth
- Departments of Computer Science and Neuroscience, The University of Texas at Austin
| |
Collapse
|