1. Cusimano M, Hewitt LB, McDermott JH. Listening with generative models. Cognition 2024;253:105874. [PMID: 39216190] [DOI: 10.1016/j.cognition.2024.105874]
Abstract
Perception has long been envisioned to use an internal model of the world to explain the causes of sensory signals. However, such accounts have historically not been testable, typically requiring intractable search through the space of possible explanations. Using auditory scenes as a case study, we leveraged contemporary computational tools to infer explanations of sounds in a candidate internal generative model of the auditory world (ecologically inspired audio synthesizers). Model inferences accounted for many classic illusions. Unlike traditional accounts of auditory illusions, the model is applicable to any sound, and exhibited human-like perceptual organization for real-world sound mixtures. The combination of stimulus-computability and interpretable model structure enabled 'rich falsification', revealing additional assumptions about sound generation needed to account for perception. The results show how generative models can account for the perception of both classic illusions and everyday sensory signals, and illustrate the opportunities and challenges involved in incorporating them into theories of perception.
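The paper's synthesizer-based inference is far richer than anything reproducible here, but the underlying "analysis by synthesis" logic can be illustrated with a toy sketch: posit a small generative model of sound, then search its parameter space for the setting that best explains an observed waveform. Everything below (the synthesizer, parameter ranges, grid search) is an illustrative assumption, not the authors' method.

```python
import numpy as np

FS = 16000
T = np.arange(0, 0.5, 1 / FS)

def synthesize(freq, decay):
    """Toy generative model: an exponentially decaying pure tone."""
    return np.exp(-decay * T) * np.sin(2 * np.pi * freq * T)

def infer(observed):
    """Analysis by synthesis: grid-search the synthesizer's
    parameters for the explanation that best fits the data."""
    best, best_err = None, np.inf
    for freq in np.arange(100.0, 1000.0, 2.0):
        for decay in np.arange(0.5, 10.0, 0.5):
            err = np.mean((synthesize(freq, decay) - observed) ** 2)
            if err < best_err:
                best, best_err = (freq, decay), err
    return best

observed = synthesize(440.0, 3.0) + 0.01 * np.random.randn(T.size)
print(infer(observed))  # recovers approximately (440.0, 3.0)
```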
Affiliation(s)
- Maddie Cusimano
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, United States of America
- Luke B Hewitt
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, United States of America
- Josh H McDermott
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, United States of America; McGovern Institute, Massachusetts Institute of Technology, United States of America; Center for Brains Minds and Machines, Massachusetts Institute of Technology, United States of America; Speech and Hearing Bioscience and Technology, Harvard University, United States of America
2. García-Lázaro HG, Teng S. Sensory and Perceptual Decisional Processes Underlying the Perception of Reverberant Auditory Environments. eNeuro 2024;11:ENEURO.0122-24.2024. [PMID: 39122554] [PMCID: PMC11335967] [DOI: 10.1523/eneuro.0122-24.2024]
Abstract
Reverberation, a ubiquitous feature of real-world acoustic environments, exhibits statistical regularities that human listeners leverage to self-orient, facilitate auditory perception, and understand their environment. Despite extensive research on sound source representation in the auditory system, it remains unclear how the brain represents real-world reverberant environments. Here, we characterized the neural response to reverberation of varying realism by applying multivariate pattern analysis to electroencephalographic (EEG) brain signals. Human listeners (12 males and 8 females) heard speech samples convolved with real-world and synthetic reverberant impulse responses and judged whether the speech samples were in a "real" or "fake" environment, focusing on the reverberant background rather than the properties of speech itself. Participants distinguished real from synthetic reverberation with ∼75% accuracy; EEG decoding revealed a multistage time course, with dissociable components early in the stimulus presentation and later in the perioffset stage. The early component predominantly occurred in temporal electrode clusters, while the later component was prominent in centroparietal clusters. These findings suggest distinct neural stages in perceiving natural acoustic environments, likely reflecting sensory encoding followed by higher-level perceptual decision-making. Overall, our findings provide evidence that reverberation, rather than being largely suppressed as a noise-like signal, carries relevant environmental information and gains representation along the auditory system. This understanding also has practical applications: it suggests how reverberation could serve as a navigation cue for blind and visually impaired people, and how realism could be enhanced in immersive virtual reality, gaming, music, and film production.
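Stimuli of the kind used here are typically constructed by convolving dry speech with a room impulse response. A minimal sketch of that step, with level matching so loudness does not trivially separate conditions (the function name and the RMS-matching choice are illustrative assumptions):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(dry, rir):
    """Convolve a dry speech signal with a room impulse response,
    then rescale to the original RMS level so that overall loudness
    does not distinguish conditions."""
    wet = fftconvolve(dry, rir)[: len(dry)]
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    return wet * (rms(dry) / rms(wet))
```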
Affiliation(s)
- Santani Teng
- Smith-Kettlewell Eye Research Institute, San Francisco, California 94115
3. de Hoz L, McAlpine D. Noises on: How the Brain Deals with Acoustic Noise. Biology 2024;13:501. [PMID: 39056695] [PMCID: PMC11274191] [DOI: 10.3390/biology13070501]
Abstract
What is noise? When does a sound form part of the acoustic background and when might it come to our attention as part of the foreground? Our brain seems to filter out irrelevant sounds in a seemingly effortless process, but how this is achieved remains opaque and, to date, unparalleled by any algorithm. In this review, we discuss how noise can be both background and foreground, depending on what a listener/brain is trying to achieve. We do so by addressing questions concerning the brain's potential bias to interpret certain sounds as part of the background, the extent to which the interpretation of sounds depends on the context in which they are heard, as well as their ethological relevance, task-dependence, and a listener's overall mental state. We explore these questions with specific regard to the implicit, or statistical, learning of sounds and the role of feedback loops between cortical and subcortical auditory structures.
Affiliation(s)
- Livia de Hoz
- Neuroscience Research Center, Charité—Universitätsmedizin Berlin, 10117 Berlin, Germany
- Bernstein Center for Computational Neuroscience, 10115 Berlin, Germany
- David McAlpine
- Neuroscience Research Center, Charité—Universitätsmedizin Berlin, 10117 Berlin, Germany
- Department of Linguistics, Macquarie University Hearing, Australian Hearing Hub, Sydney, NSW 2109, Australia
4. Lavechin M, de Seyssel M, Métais M, Metze F, Mohamed A, Bredin H, Dupoux E, Cristia A. Modeling early phonetic acquisition from child-centered audio data. Cognition 2024;245:105734. [PMID: 38335906] [DOI: 10.1016/j.cognition.2024.105734]
Abstract
Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement and the circumstances in which it takes place remain unclear. This paper presents the first attempt to study perceptual attunement using ecological child-centered audio data. We show that a simple prediction algorithm exhibits perceptual attunement when applied to unrealistically clean audio-book data, but fails to do so when applied to ecologically valid child-centered data. In the latter scenario, perceptual attunement only emerges when the prediction mechanism is supplemented with inductive biases that force the algorithm to focus exclusively on speech segments while learning speaker-, pitch-, and room-invariant representations. We argue these biases are plausible given previous research on infants and non-human animals. More generally, we show that what our model learns, and how it develops through exposure to speech, depends exquisitely on the details of the input signal. By doing so, we illustrate the importance of considering ecologically valid input data when modeling language acquisition.
Affiliation(s)
- Marvin Lavechin
- Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Etudes Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Cognitive Machine Learning Team, INRIA, Paris, France; Meta AI Research, Paris, France
- Maureen de Seyssel
- Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Etudes Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Cognitive Machine Learning Team, INRIA, Paris, France; Laboratoire de linguistique formelle, Université de Paris, CNRS, Paris, France
- Marianne Métais
- Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Etudes Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Cognitive Machine Learning Team, INRIA, Paris, France
- Hervé Bredin
- Institut de Recherche en Informatique de Toulouse, Université de Toulouse, CNRS, Toulouse, France
- Emmanuel Dupoux
- Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Etudes Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Cognitive Machine Learning Team, INRIA, Paris, France; Meta AI Research, Paris, France
- Alejandrina Cristia
- Laboratoire de Sciences Cognitives et Psycholinguistique, Département d'Etudes Cognitives, ENS, EHESS, CNRS, PSL University, Paris, France; Cognitive Machine Learning Team, INRIA, Paris, France
5. Tsironis A, Vlahou E, Kontou P, Bagos P, Kopčo N. Adaptation to Reverberation for Speech Perception: A Systematic Review. Trends Hear 2024;28:23312165241273399. [PMID: 39246212] [PMCID: PMC11384524] [DOI: 10.1177/23312165241273399]
Abstract
In everyday acoustic environments, reverberation alters the speech signal received at the ears. Normal-hearing listeners are robust to these distortions, quickly recalibrating to achieve accurate speech perception. Over the past two decades, multiple studies have investigated the various adaptation mechanisms that listeners use to mitigate the negative impacts of reverberation and improve speech intelligibility. Following the PRISMA guidelines, we performed a systematic review of these studies, with the aim to summarize existing research, identify open questions, and propose future directions. Two researchers independently assessed a total of 661 studies, ultimately including 23 in the review. Our results showed that adaptation to reverberant speech is robust across diverse environments, experimental setups, speech units, and tasks, in noise-masked or unmasked conditions. The time course of adaptation is rapid, sometimes occurring in less than 1 s, but this can vary depending on the reverberation and noise levels of the acoustic environment. Adaptation is stronger in moderately reverberant rooms and minimal in rooms with very intense reverberation. While the mechanisms underlying the recalibration are largely unknown, adaptation to the direct-to-reverberant ratio-related changes in amplitude modulation appears to be the predominant candidate. However, additional factors need to be explored to provide a unified theory for the effect and its applications.
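The direct-to-reverberant ratio (DRR) mentioned above is commonly estimated from a room impulse response by splitting its energy at a short window around the direct-path peak. A sketch under that convention (the 2.5 ms window is a common choice, not a universal standard):

```python
import numpy as np

def direct_to_reverberant_ratio(rir, fs, direct_window_ms=2.5):
    """Estimate the DRR (in dB) of a room impulse response: energy
    in a short window around the largest peak counts as direct
    sound, everything later as reverberant energy."""
    onset = np.argmax(np.abs(rir))
    half = int(fs * direct_window_ms / 1000)
    direct = rir[max(0, onset - half): onset + half + 1]
    reverberant = rir[onset + half + 1:]
    return 10 * np.log10(np.sum(direct ** 2) / np.sum(reverberant ** 2))
```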
Affiliation(s)
- Avgeris Tsironis
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Eleni Vlahou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Panagiota Kontou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Pantelis Bagos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Norbert Kopčo
- Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University, Košice, Slovakia
6. Brown AD, Hayward T, Portfors CV, Coffin AB. On the value of diverse organisms in auditory research: From fish to flies to humans. Hear Res 2023;432:108754. [PMID: 37054531] [PMCID: PMC10424633] [DOI: 10.1016/j.heares.2023.108754]
Abstract
Historically, diverse organisms have contributed to our understanding of auditory function. In recent years, the laboratory mouse has become the prevailing non-human model in auditory research, particularly for biomedical studies. There are many questions in auditory research for which the mouse is the most appropriate (or the only) model system available. But mice cannot provide answers for all auditory problems of basic and applied importance, nor can any single model system provide a synthetic understanding of the diverse solutions that have evolved to facilitate effective detection and use of acoustic information. In this review, spurred by trends in funding and publishing and inspired by parallel observations in other domains of neuroscience, we highlight a few examples of the profound impact and lasting benefits of comparative and basic organismal research in the auditory system. We begin with the serendipitous discovery of hair cell regeneration in non-mammalian vertebrates, a finding that has fueled an ongoing search for pathways to hearing restoration in humans. We then turn to the problem of sound source localization - a fundamental task that most auditory systems have been compelled to solve despite large variation in the magnitudes and kinds of spatial acoustic cues available, begetting varied direction-detecting mechanisms. Finally, we consider the power of work in highly specialized organisms to reveal exceptional solutions to sensory problems - and the diverse returns of deep neuroethological inquiry - via the example of echolocating bats. Throughout, we consider how discoveries made possible by comparative and curiosity-driven organismal research have driven fundamental scientific, biomedical, and technological advances in the auditory field.
Affiliation(s)
- Andrew D Brown
- Department of Speech and Hearing Sciences, University of Washington, 1417 NE 42nd St, Seattle, WA 98105, USA; Virginia-Merrill Bloedel Hearing Research Center, University of Washington, 1701 NE Columbia Rd, Seattle, WA 98195, USA
- Tamasen Hayward
- College of Arts and Sciences, Washington State University, 14204 NE Salmon Creek Ave, Vancouver, WA 98686, USA
- Christine V Portfors
- School of Biological Sciences, Washington State University, 14204 NE Salmon Creek Ave, Vancouver, WA 98686, USA
- Allison B Coffin
- College of Arts and Sciences, Washington State University, 14204 NE Salmon Creek Ave, Vancouver, WA 98686, USA; School of Biological Sciences, Washington State University, 14204 NE Salmon Creek Ave, Vancouver, WA 98686, USA; Department of Integrative Physiology and Neuroscience, Washington State University, 14204 NE Salmon Creek Ave, Vancouver, WA 98686, USA
7. Apoux F, Miller-Viacava N, Ferrière R, Dai H, Krause B, Sueur J, Lorenzi C. Auditory discrimination of natural soundscapes. J Acoust Soc Am 2023;153:2706. [PMID: 37133815] [DOI: 10.1121/10.0017972]
Abstract
A previous modelling study reported that spectro-temporal cues perceptually relevant to humans provide enough information to accurately classify "natural soundscapes" recorded in four distinct temperate habitats of a biosphere reserve [Thoret, Varnet, Boubenec, Ferriere, Le Tourneau, Krause, and Lorenzi (2020). J. Acoust. Soc. Am. 147, 3260]. The goal of the present study was to test this prediction in humans using 2 s samples taken from the same soundscape recordings. Thirty-one listeners were asked to discriminate these recordings based on differences in habitat, season, or period of the day using an oddity task. Listeners' performance was well above chance, demonstrating effective processing of these differences and suggesting generally high sensitivity for natural soundscape discrimination. This performance did not improve with training of up to 10 h. Additional results obtained for habitat discrimination indicate that temporal cues play only a minor role; instead, listeners appear to base their decisions primarily on gross spectral cues related to biological sound sources and habitat acoustics. Convolutional neural networks were trained to perform a similar task using spectro-temporal cues extracted by an auditory model as input. The results are consistent with the idea that humans exclude the available temporal information when discriminating short samples of habitats, implying a form of suboptimality.
Affiliation(s)
- Frédéric Apoux
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole normale supérieure, Université Paris Sciences et Lettres (PSL), Paris, 75005, France
- Nicole Miller-Viacava
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole normale supérieure, Université Paris Sciences et Lettres (PSL), Paris, 75005, France
- Régis Ferrière
- International Research Laboratory for Interdisciplinary Global Environmental Studies (iGLOBES), CNRS, ENS-PSL University, University of Arizona, Tucson, Arizona 85721, USA
- Huanping Dai
- Speech Language and Hearing Sciences, University of Arizona, Tucson, Arizona 85721-0071, USA
- Bernie Krause
- Wild Sanctuary, 1102 Princeton Drive, Sonoma, California 95476, USA
- Jérôme Sueur
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, 57 rue Cuvier, 75005 Paris, France
- Christian Lorenzi
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole normale supérieure, Université Paris Sciences et Lettres (PSL), Paris, 75005, France
8. Barzelay O, David S, Delgutte B. Effect of Reverberation on Neural Responses to Natural Speech in Rabbit Auditory Midbrain: No Evidence for a Neural Dereverberation Mechanism. eNeuro 2023;10:ENEURO.0447-22.2023. [PMID: 37072174] [PMCID: PMC10179871] [DOI: 10.1523/eneuro.0447-22.2023]
Abstract
Reverberation is ubiquitous in everyday acoustic environments. It degrades both binaural cues and the envelope modulations of sounds and thus can impair speech perception. Still, both humans and animals can accurately perceive reverberant stimuli in most everyday settings. Previous neurophysiological and perceptual studies have suggested the existence of neural mechanisms that partially compensate for the effects of reverberation. However, these studies were limited by their use of either highly simplified stimuli or rudimentary reverberation simulations. To further characterize how reverberant stimuli are processed by the auditory system, we recorded single-unit (SU) and multiunit (MU) activity from the inferior colliculus (IC) of unanesthetized rabbits in response to natural speech utterances presented with no reverberation ("dry") and in various degrees of simulated reverberation (direct-to-reverberant energy ratios (DRRs) ranging from 9.4 to -8.2 dB). Linear stimulus reconstruction techniques (Mesgarani et al., 2009) were used to quantify the amount of speech information available in the responses of neural ensembles. We found that high-quality spectrogram reconstructions could be obtained for dry speech and in moderate reverberation from ensembles of 25 units. However, spectrogram reconstruction quality deteriorated in severe reverberation for both MUs and SUs such that the neural degradation paralleled the degradation in the stimulus spectrogram. Furthermore, spectrograms reconstructed from responses to reverberant stimuli resembled spectrograms of reverberant speech better than spectrograms of dry speech. Overall, the results provide no evidence for a dereverberation mechanism in neural responses from the rabbit IC when studied with linear reconstruction techniques.
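A minimal sketch of the linear stimulus-reconstruction idea (cf. Mesgarani et al., 2009): regress the stimulus spectrogram onto lagged copies of the population response with ridge regression. Array shapes, the lag count, and the regularization strength are illustrative assumptions, not the study's exact settings.

```python
import numpy as np

def fit_reconstruction_filter(resp, spec, lags=20, alpha=1.0):
    """resp: (time, n_units) firing rates; spec: (time, n_freq)
    spectrogram. Returns ridge-regression weights mapping a
    lagged window of the population response onto the spectrogram
    frame at the window's centre."""
    T = resp.shape[0] - lags
    X = np.hstack([resp[lag: lag + T] for lag in range(lags)])
    Y = spec[lags // 2: lags // 2 + T]
    # Ridge solution: W = (X'X + alpha*I)^-1 X'Y
    W = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)
    return W
```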
Affiliation(s)
- Oded Barzelay
- Eaton-Peabody Laboratories, Massachusetts Eye and Ear, Boston, MA 02114-3096
- Department of Otolaryngology, Head and Neck Surgery, Harvard Medical School, Boston, MA 02115
- Stephen David
- Oregon Hearing Research Center, Oregon Health and Science University, Portland, OR 97239-3098
- Bertrand Delgutte
- Eaton-Peabody Laboratories, Massachusetts Eye and Ear, Boston, MA 02114-3096
- Department of Otolaryngology, Head and Neck Surgery, Harvard Medical School, Boston, MA 02115
9. Willmore BDB, King AJ. Adaptation in auditory processing. Physiol Rev 2023;103:1025-1058. [PMID: 36049112] [PMCID: PMC9829473] [DOI: 10.1152/physrev.00011.2022]
Abstract
Adaptation is an essential feature of auditory neurons, which reduces their responses to unchanging and recurring sounds and allows their response properties to be matched to the constantly changing statistics of sounds that reach the ears. As a consequence, processing in the auditory system highlights novel or unpredictable sounds and produces an efficient representation of the vast range of sounds that animals can perceive by continually adjusting the sensitivity and, to a lesser extent, the tuning properties of neurons to the most commonly encountered stimulus values. Together with attentional modulation, adaptation to sound statistics also helps to generate neural representations of sound that are tolerant to background noise and therefore plays a vital role in auditory scene analysis. In this review, we consider the diverse forms of adaptation that are found in the auditory system in terms of the processing levels at which they arise, the underlying neural mechanisms, and their impact on neural coding and perception. We also ask what the dynamics of adaptation, which can occur over multiple timescales, reveal about the statistical properties of the environment. Finally, we examine how adaptation to sound statistics is influenced by learning and experience and changes as a result of aging and hearing loss.
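One simple form of the adaptation this review surveys can be sketched as divisive gain control: a unit normalizes its input by a slowly updated estimate of recent stimulus level, so sustained sounds are de-emphasized while changes are highlighted. The time constant and semi-saturation constant below are arbitrary illustrations, not a model from the review.

```python
import numpy as np

def adaptive_gain(stimulus, tau=200, sigma=0.1):
    """Divisive gain control with an exponentially weighted
    running estimate of stimulus magnitude."""
    alpha = 1.0 / tau
    level, out = 0.0, np.empty_like(stimulus)
    for t, s in enumerate(stimulus):
        level += alpha * (abs(s) - level)   # slow level estimate
        out[t] = s / (sigma + level)        # normalized response
    return out

# A step that stays on: the input is constant after onset, but the
# adapted response peaks at onset and then declines.
step = np.concatenate([np.zeros(100), np.ones(1000)])
print(adaptive_gain(step)[[100, 1099]])  # large at onset, smaller later
```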
Affiliation(s)
- Ben D. B. Willmore
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Andrew J King
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
10. Zhang J, Liss J, Jayasuriya S, Berisha V. Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection. IEEE/ACM Trans Audio Speech Lang Process 2023;31:1348-1359. [PMID: 37899766] [PMCID: PMC10602198] [DOI: 10.1109/taslp.2023.3261753]
Abstract
Approximately 1.2% of the world's population has impaired voice production. As a result, automatic dysphonic voice detection has attracted considerable academic and clinical interest. However, existing methods for automated voice assessment often fail to generalize outside the training conditions or to other related applications. In this paper, we propose a deep learning framework for generating acoustic feature embeddings sensitive to vocal quality and robust across different corpora. A contrastive loss is combined with a classification loss to train our deep learning model jointly. Data warping methods are used on input voice samples to improve the robustness of our method. Empirical results demonstrate that our method not only achieves high in-corpus and cross-corpus classification accuracy but also generates good embeddings sensitive to voice quality and robust across different corpora. We also compare our results against three baseline methods on clean and three variations of deteriorated in-corpus and cross-corpus datasets and demonstrate that the proposed model consistently outperforms the baseline methods.
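A sketch of the kind of joint objective described, combining cross-entropy with a pairwise contrastive term; this is a generic formulation with an assumed margin and weighting, not the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def joint_loss(embeddings, logits, labels, margin=1.0, weight=0.5):
    """Cross-entropy on class logits plus a pairwise contrastive
    term that pulls same-class embeddings together and pushes
    different-class embeddings at least `margin` apart."""
    ce = F.cross_entropy(logits, labels)
    dists = torch.cdist(embeddings, embeddings)           # (B, B) distances
    same = (labels[:, None] == labels[None, :]).float()   # same-class mask
    contrastive = (same * dists ** 2 +
                   (1 - same) * F.relu(margin - dists) ** 2).mean()
    return ce + weight * contrastive
```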
Affiliation(s)
- Jianwei Zhang
- School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, USA
- Julie Liss
- College of Health Solutions, Arizona State University, Tempe, AZ 85287, USA
- Suren Jayasuriya
- School of Arts, Media and Engineering and the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, USA
- Visar Berisha
- College of Health Solutions and School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281, USA
11. Lorenzi C, Apoux F, Grinfeder E, Krause B, Miller-Viacava N, Sueur J. Human Auditory Ecology: Extending Hearing Research to the Perception of Natural Soundscapes by Humans in Rapidly Changing Environments. Trends Hear 2023;27:23312165231212032. [PMID: 37981813] [PMCID: PMC10658775] [DOI: 10.1177/23312165231212032]
Abstract
Research in hearing sciences has provided extensive knowledge about how the human auditory system processes speech and assists communication. In contrast, little is known about how this system processes "natural soundscapes," that is, the complex arrangements of biological and geophysical sounds shaped by sound propagation through non-anthropogenic habitats [Grinfeder et al. (2022). Frontiers in Ecology and Evolution. 10: 894232]. This is surprising given that, for many species, the capacity to process natural soundscapes determines survival and reproduction through the ability to represent and monitor the immediate environment. Here we propose a framework to encourage research programmes in the field of "human auditory ecology," focusing on the study of human auditory perception of ecological processes at work in natural habitats. Based on large acoustic databases with high ecological validity, these programmes should investigate the extent to which this presumably ancestral monitoring function of the human auditory system is adapted to the specific information conveyed by natural soundscapes, whether it operates throughout the life span, and whether it emerges through individual learning or cultural transmission. Beyond fundamental knowledge of human hearing, these programmes should yield a better understanding of how normal-hearing and hearing-impaired listeners monitor rural and urban green and blue spaces and benefit from them, and whether rehabilitation devices (hearing aids and cochlear implants) restore natural soundscape perception and emotional responses to normal. Importantly, they should also reveal whether and how humans hear the rapid changes in the environment brought about by human activity.
Affiliation(s)
- Christian Lorenzi
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole Normale Supérieure, Université Paris Sciences et Lettres (PSL), Paris, France
- Frédéric Apoux
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole Normale Supérieure, Université Paris Sciences et Lettres (PSL), Paris, France
- Elie Grinfeder
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole Normale Supérieure, Université Paris Sciences et Lettres (PSL), Paris, France
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, Paris, France
- Nicole Miller-Viacava
- Laboratoire des Systèmes Perceptifs, UMR CNRS 8248, Département d'Etudes Cognitives, Ecole Normale Supérieure, Université Paris Sciences et Lettres (PSL), Paris, France
- Jérôme Sueur
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum national d'Histoire naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, Paris, France
12. Atmaja BT, Sasou A. Effects of Data Augmentations on Speech Emotion Recognition. Sensors (Basel) 2022;22:5941. [PMID: 36015717] [PMCID: PMC9415521] [DOI: 10.3390/s22165941]
Abstract
Data augmentation techniques have recently gained more adoption in speech processing, including speech emotion recognition. Although more data tend to be more effective, there may be a trade-off in which more data will not yield a better model. This paper reports experiments investigating the effects of data augmentation in speech emotion recognition, aiming to identify the most useful type and number of augmentations under various conditions. The experiments are conducted on the Japanese Twitter-based emotional speech and IEMOCAP datasets. The results show that for speaker-independent data, a combination of two augmentations, glottal source extraction and silence removal, performed best, even compared with conditions using more augmentation techniques. For text-independent data (including speaker- and text-independent), more augmentations tended to improve speech emotion recognition performance. The results highlight the trade-off between the number of augmentations and recognition performance, showing the need to choose a proper augmentation strategy for a specific condition.
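Of the augmentations compared, silence removal is the easiest to sketch: drop frames whose energy falls below a threshold relative to the utterance peak. The frame length and threshold below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def remove_silence(signal, fs, frame_ms=25, threshold_db=-40.0):
    """Energy-based silence removal: discard frames whose RMS level
    falls below `threshold_db` relative to the loudest frame."""
    frame = int(fs * frame_ms / 1000)
    n = len(signal) // frame
    frames = signal[: n * frame].reshape(n, frame)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    level_db = 20 * np.log10(rms / (rms.max() + 1e-12))
    return frames[level_db > threshold_db].reshape(-1)
```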
Affiliation(s)
- Bagus Tris Atmaja
- National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8560, Japan
- Institut Teknologi Sepuluh Nopember, Surabaya 60111, Indonesia
- Akira Sasou
- National Institute of Advanced Industrial Science and Technology, Tsukuba 305-8560, Japan
13. Ivanov AZ, King AJ, Willmore BDB, Walker KMM, Harper NS. Cortical adaptation to sound reverberation. eLife 2022;11:e75090. [PMID: 35617119] [PMCID: PMC9213001] [DOI: 10.7554/elife.75090]
Abstract
In almost every natural environment, sounds are reflected by nearby objects, producing many delayed and distorted copies of the original sound, known as reverberation. Our brains usually cope well with reverberation, allowing us to recognize sound sources regardless of their environments. In contrast, reverberation can cause severe difficulties for speech recognition algorithms and hearing-impaired people. The present study examines how the auditory system copes with reverberation. We trained a linear model to recover a rich set of natural, anechoic sounds from their simulated reverberant counterparts. The model neurons achieved this by extending the inhibitory component of their receptive filters for more reverberant spaces, and did so in a frequency-dependent manner. These predicted effects were observed in the responses of auditory cortical neurons of ferrets in the same simulated reverberant environments. Together, these results suggest that auditory cortical neurons adapt to reverberation by adjusting their filtering properties in a manner consistent with dereverberation.
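The study's linear dereverberation model can be caricatured, per frequency channel, as an FIR kernel fit by least squares to map the recent history of the reverberant envelope onto the current anechoic envelope; negative weights at long lags then play the role of the extended inhibition described above. Shapes and the history length are assumptions for illustration.

```python
import numpy as np

def fit_dereverb_kernel(reverb_env, dry_env, history=30):
    """For one frequency channel, fit a kernel that predicts the
    current dry (anechoic) envelope frame from the preceding
    `history` frames of the reverberant envelope."""
    T = len(reverb_env) - history
    X = np.stack([reverb_env[t: t + history] for t in range(T)])
    y = dry_env[history: history + T]
    kernel, *_ = np.linalg.lstsq(X, y, rcond=None)
    return kernel
```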
Affiliation(s)
- Aleksandar Z Ivanov
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Andrew J King
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Ben DB Willmore
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Kerry MM Walker
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
- Nicol S Harper
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
14. Luke R, Innes-Brown H, Undurraga JA, McAlpine D. Human cortical processing of interaural coherence. iScience 2022;25:104181. [PMID: 35494228] [PMCID: PMC9051632] [DOI: 10.1016/j.isci.2022.104181]
Abstract
Sounds reach the ears as a mixture of energy generated by different sources. Listeners extract cues that distinguish different sources from one another, including how similar sounds arrive at the two ears, the interaural coherence (IAC). Here, we find listeners cannot reliably distinguish two completely interaurally coherent sounds from a single sound with reduced IAC. Pairs of sounds heard toward the front were readily confused with single sounds with high IAC, whereas those heard to the sides were confused with single sounds with low IAC. Sounds that hold supra-ethological spatial cues are perceived as more diffuse than can be accounted for by their IAC, and this is accounted for by a computational model comprising a restricted, and sound-frequency dependent, distribution of auditory-spatial detectors. We observed elevated cortical hemodynamic responses for sounds with low IAC, suggesting that the ambiguity elicited by sounds with low interaural similarity imposes elevated cortical load.
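Interaural coherence is conventionally computed as the maximum of the normalized cross-correlation between the two ear signals over a physiologically plausible range of lags. A sketch (the lag range is an assumption tied to the sample rate, roughly ±1 ms at 32 kHz):

```python
import numpy as np

def interaural_coherence(left, right, max_lag=32):
    """IAC: peak of the normalized cross-correlation between the
    ear signals, searched over lags of -max_lag..+max_lag samples."""
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))

    def xcorr(lag):
        if lag >= 0:
            return np.sum(left[lag:] * right[: len(right) - lag])
        return np.sum(left[:lag] * right[-lag:])

    return max(abs(xcorr(l)) for l in range(-max_lag, max_lag + 1)) / norm
```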
Affiliation(s)
- Robert Luke
- Macquarie University, Sydney, NSW, Australia
- The Bionics Institute, Melbourne, VIC, Australia
15. Peng ZE, Waz S, Buss E, Shen Y, Richards V, Bharadwaj H, Stecker GC, Beim JA, Bosen AK, Braza MD, Diedesch AC, Dorey CM, Dykstra AR, Gallun FJ, Goldsworthy RL, Gray L, Hoover EC, Ihlefeld A, Koelewijn T, Kopun JG, Mesik J, Shub DE, Venezia JH. FORUM: Remote testing for psychological and physiological acoustics. J Acoust Soc Am 2022;151:3116. [PMID: 35649891] [PMCID: PMC9305596] [DOI: 10.1121/10.0010422]
Abstract
Acoustics research involving human participants typically takes place in specialized laboratory settings. Listening studies, for example, may present controlled sounds using calibrated transducers in sound-attenuating or anechoic chambers. In contrast, remote testing takes place outside of the laboratory in everyday settings (e.g., participants' homes). Remote testing could provide greater access to participants, larger sample sizes, and opportunities to characterize performance in typical listening environments, at the cost of reduced control over environmental conditions, less precise calibration, and inconsistency in attentional state and/or response behaviors from relatively smaller sample sizes and unintuitive experimental tasks. The Acoustical Society of America Technical Committee on Psychological and Physiological Acoustics launched the Task Force on Remote Testing (https://tcppasa.org/remotetesting/) in May 2020 with goals of surveying approaches and platforms available to support remote testing and identifying challenges and considerations for prospective investigators. The results of this task force survey were made available online in the form of a set of Wiki pages and summarized in this report. This report outlines the state of the art of remote testing in auditory-related research as of August 2021, based on the Wiki and a literature search of papers published in this area since 2020, and provides three case studies to demonstrate feasibility in practice.
Affiliation(s)
- Z Ellen Peng
- Boys Town National Research Hospital, Omaha, Nebraska 68131, USA
- Sebastian Waz
- University of California, Irvine, Irvine, California 92697, USA
- Emily Buss
- The University of North Carolina, Chapel Hill, North Carolina 27599, USA
- Yi Shen
- University of Washington, Seattle, Washington 98195, USA
- Jordan A Beim
- University of Minnesota, Minneapolis, Minnesota 55455, USA
- Adam K Bosen
- Boys Town National Research Hospital, Omaha, Nebraska 68131, USA
- Meredith D Braza
- The University of North Carolina, Chapel Hill, North Carolina 27599, USA
- Anna C Diedesch
- Western Washington University, Bellingham, Washington 98225, USA
- Lincoln Gray
- James Madison University, Harrisonburg, Virginia 22807, USA
- Eric C Hoover
- University of Maryland, College Park, Maryland 20742, USA
- Antje Ihlefeld
- Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA
- Judy G Kopun
- Boys Town National Research Hospital, Omaha, Nebraska 68131, USA
- Juraj Mesik
- University of Minnesota, Minneapolis, Minnesota 55455, USA
- Daniel E Shub
- Walter Reed National Military Medical Center, Bethesda, Maryland 20814, USA
16. Stowell D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 2022;10:e13152. [PMID: 35341043] [PMCID: PMC8944344] [DOI: 10.7717/peerj.13152]
Abstract
Animal vocalisations and natural soundscapes are fascinating objects of study, and contain valuable evidence about animal behaviours, populations and ecosystems. They are studied in bioacoustics and ecoacoustics, with signal processing and analysis an important component. Computational bioacoustics has accelerated in recent decades due to the growth of affordable digital sound recording devices, and to huge progress in informatics such as big data, signal processing and machine learning. Methods are inherited from the wider field of deep learning, including speech and image processing. However, the tasks, demands and data characteristics are often different from those addressed in speech or music analysis. There remain unsolved problems, and tasks for which evidence is surely present in many acoustic signals, but not yet realised. In this paper I perform a review of the state of the art in deep learning for computational bioacoustics, aiming to clarify key concepts and identify and analyse knowledge gaps. Based on this, I offer a subjective but principled roadmap for computational bioacoustics with deep learning: topics that the community should aim to address, in order to make the most of future developments in AI and informatics, and to use audio data in answering zoological and ecological questions.
Affiliation(s)
- Dan Stowell
- Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg, The Netherlands
- Naturalis Biodiversity Center, Leiden, The Netherlands
17. Deep neural network models of sound localization reveal how perception is adapted to real-world environments. Nat Hum Behav 2022;6:111-133. [PMID: 35087192] [PMCID: PMC8830739] [DOI: 10.1038/s41562-021-01244-z]
Abstract
Mammals localize sounds using information from their two ears. Localization in real-world conditions is challenging, as echoes provide erroneous information, and noises mask parts of target sounds. To better understand real-world localization we equipped a deep neural network with human ears and trained it to localize sounds in a virtual environment. The resulting model localized accurately in realistic conditions with noise and reverberation. In simulated experiments, the model exhibited many features of human spatial hearing: sensitivity to monaural spectral cues and interaural time and level differences, integration across frequency, biases for sound onsets, and limits on localization of concurrent sources. But when trained in unnatural environments without either reverberation, noise, or natural sounds, these performance characteristics deviated from those of humans. The results show how biological hearing is adapted to the challenges of real-world environments and illustrate how artificial neural networks can reveal the real-world constraints that shape perception.
18. Lowe MX, Mohsenzadeh Y, Lahner B, Charest I, Oliva A, Teng S. Cochlea to categories: The spatiotemporal dynamics of semantic auditory representations. Cogn Neuropsychol 2021;38:468-489. [PMID: 35729704] [PMCID: PMC10589059] [DOI: 10.1080/02643294.2022.2085085]
Abstract
How does the auditory system categorize natural sounds? Here we apply multimodal neuroimaging to illustrate the progression from acoustic to semantically dominated representations. Combining magnetoencephalographic (MEG) and functional magnetic resonance imaging (fMRI) scans of observers listening to naturalistic sounds, we found superior temporal responses beginning ∼55 ms post-stimulus onset, spreading to extratemporal cortices by ∼100 ms. Early regions were distinguished less by onset/peak latency than by functional properties and overall temporal response profiles. Early acoustically-dominated representations trended systematically toward category dominance over time (after ∼200 ms) and space (beyond primary cortex). Semantic category representation was spatially specific: Vocalizations were preferentially distinguished in frontotemporal voice-selective regions and the fusiform; scenes and objects were distinguished in parahippocampal and medial place areas. Our results are consistent with real-world events coded via an extended auditory processing hierarchy, in which acoustic representations rapidly enter multiple streams specialized by category, including areas typically considered visual cortex.
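MEG-fMRI fusion of this kind is commonly implemented with representational similarity analysis: correlate the time-resolved MEG representational dissimilarity matrix (RDM) with each fMRI region's RDM. A sketch with assumed array shapes; not necessarily the authors' exact pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def fusion_timecourse(meg, fmri_rdm):
    """meg: (n_times, n_stimuli, n_sensors) response patterns;
    fmri_rdm: condensed RDM over the same stimuli for one region.
    Returns the Spearman correlation between the MEG RDM and the
    fMRI RDM at each time point."""
    out = np.empty(meg.shape[0])
    for t in range(meg.shape[0]):
        meg_rdm = pdist(meg[t], metric="correlation")
        rho, _ = spearmanr(meg_rdm, fmri_rdm)
        out[t] = rho
    return out
```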
Affiliation(s)
- Matthew X. Lowe
- Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA
- Unlimited Sciences, Colorado Springs, CO
- Yalda Mohsenzadeh
- Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA
- The Brain and Mind Institute, The University of Western Ontario, London, ON, Canada
- Department of Computer Science, The University of Western Ontario, London, ON, Canada
- Benjamin Lahner
- Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA
- Ian Charest
- Département de Psychologie, Université de Montréal, Montréal, Québec, Canada
- Center for Human Brain Health, University of Birmingham, UK
- Aude Oliva
- Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA
- Santani Teng
- Computer Science and Artificial Intelligence Lab (CSAIL), MIT, Cambridge, MA
- Smith-Kettlewell Eye Research Institute (SKERI), San Francisco, CA
19. Vlahou E, Ueno K, Shinn-Cunningham BG, Kopčo N. Calibration of Consonant Perception to Room Reverberation. J Speech Lang Hear Res 2021;64:2956-2976. [PMID: 34297606] [DOI: 10.1044/2021_jslhr-20-00396]
Abstract
Purpose: We examined how consonant perception is affected by a preceding speech carrier simulated in the same or a different room, for different classes of consonants. Carrier room, carrier length, and carrier length/target room uncertainty were manipulated. A phonetic feature analysis tested which phonetic categories are influenced by the manipulations in the acoustic context of the carrier.
Method: Two experiments were performed, each with nine participants. Targets consisted of 10 or 16 vowel-consonant (VC) syllables presented in one of two strongly reverberant rooms, preceded by a multiple-VC carrier presented in either the same room, a different reverberant room, or an anechoic room. In Experiment 1, the carrier length and the target room randomly varied from trial to trial, whereas in Experiment 2, they were fixed within a block of trials.
Results: Overall, a consistent carrier provided an advantage for consonant perception compared to inconsistent carriers, whether in anechoic or differently reverberant rooms. Phonetic analysis showed that carrier inconsistency significantly degraded identification of the manner of articulation, especially for stop consonants and, in one of the rooms, also of voicing. Carrier length and carrier/target uncertainty did not affect adaptation to reverberation for individual phonetic features. The detrimental effects of anechoic and different reverberant carriers on target perception were similar.
Conclusions: The strength of calibration varies across different phonetic features, as well as across rooms with different levels of reverberation. Even though place of articulation is the feature that is affected by reverberation the most, it is the manner of articulation and, partially, voicing for which room adaptation is observed.
Affiliation(s)
- Eleni Vlahou
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Volos, Greece
- Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University, Košice, Slovakia
- Hearing Research Center and Department of Biomedical Engineering, Boston University, MA
- Kanako Ueno
- School of Science and Technology, Meiji University, Chiyoda, Japan
- Norbert Kopčo
- Institute of Computer Science, Faculty of Science, Pavol Jozef Šafárik University, Košice, Slovakia
- Hearing Research Center and Department of Biomedical Engineering, Boston University, MA
20. Auditory Brainstem Models: Adapting Cochlear Nuclei Improve Spatial Encoding by the Medial Superior Olive in Reverberation. J Assoc Res Otolaryngol 2021;22:289-318. [PMID: 33861395] [DOI: 10.1007/s10162-021-00797-0]
Abstract
Listeners typically perceive a sound as originating from the direction of its source, even as direct sound is followed milliseconds later by reflected sound from multiple different directions. Early-arriving sound is emphasised in the ascending auditory pathway, including the medial superior olive (MSO) where binaural neurons encode the interaural-time-difference (ITD) cue for spatial location. Perceptually, weighting of ITD conveyed during rising sound energy is stronger at 600 Hz than at 200 Hz, consistent with the minimum stimulus rate for binaural adaptation, and with the longer reverberation times at 600 Hz, compared with 200 Hz, in many natural outdoor environments. Here, we computationally explore the combined efficacy of adaptation prior to the binaural encoding of ITD cues, and excitatory binaural coincidence detection within MSO neurons, in emphasising ITDs conveyed in early-arriving sound. With excitatory inputs from adapting, nonlinear model spherical bushy cells (SBCs) of the bilateral cochlear nuclei, a nonlinear model MSO neuron with low-threshold potassium channels reproduces the rate-dependent emphasis of rising vs. peak sound energy in ITD encoding; adaptation is equally effective in the model MSO. Maintaining adaptation in model SBCs, and adjusting membrane speed in model MSO neurons, 'left' and 'right' populations of computationally efficient, linear model SBCs and MSO neurons reproduce this stronger weighting of ITD conveyed during rising sound energy at 600 Hz compared to 200 Hz. This hemispheric population model demonstrates a link between strong weighting of spatial information during rising sound energy, and correct unambiguous lateralisation of a speech source in reverberation.
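The binaural coincidence detection at the heart of such MSO models (here stripped of the adapting bushy-cell inputs and membrane nonlinearities the study actually uses) can be sketched as counting near-simultaneous spikes across the two ears, with an internal delay setting the model neuron's preferred ITD. Window, delay grid, and spike statistics below are illustrative assumptions.

```python
import numpy as np

def coincidence_count(left_spikes, right_spikes, window=50e-6, delay=0.0):
    """Count MSO-style coincidences: a right-ear spike falling
    within +/-window seconds of an internally delayed left-ear
    spike. Sweeping `delay` maps out ITD tuning."""
    count = 0
    for t in left_spikes + delay:
        count += np.any(np.abs(right_spikes - t) <= window)
    return count

# ITD tuning curve for spike trains arriving with a fixed
# 300-microsecond interaural time difference.
rng = np.random.default_rng(0)
left = np.sort(rng.uniform(0, 1, 200))
right = left + 300e-6
delays = np.linspace(-1e-3, 1e-3, 21)
tuning = [coincidence_count(left, right, delay=d) for d in delays]
print(delays[np.argmax(tuning)])   # peak near +3e-4 s, the imposed ITD
```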
21. Using a cochlear implant processor as contralateral routing of signals device in unilateral cochlear implant recipients. Eur Arch Otorhinolaryngol 2021;279:645-652. [PMID: 33616750] [PMCID: PMC8794901] [DOI: 10.1007/s00405-021-06684-x]
Abstract
Purpose: In unilateral cochlear implant (CI) recipients, a contralateral routing of signals (CROS) device enables the recipient to receive auditory information from the unaided side. This study investigates the feasibility as well as the subjective and objective benefits of using a CI processor as a CROS device in unilateral CI recipients.
Methods: This is a single-center, prospective cohort study. First, we tested the directionality of the CROS processor in an acoustic chamber. Second, we examined the difference in speech perception in quiet and in noise in ten unilateral CI recipients with and without the CROS processor. Third, subjective ratings with the CROS processor were evaluated according to the Client Oriented Scale of Improvement Questionnaire.
Results: There was a time delay of 3 ms between the two devices. Connection of the CROS processor led to a summation effect of 3 dB as well as more constant amplification across all azimuths. Speech perception in quiet showed an increased word recognition score at 50 dB (mean improvement 7%). In noise, the head shadow effect could be mitigated, with a significant gain in speech perception (mean improvement 8.4 dB). This advantage was reversed in unfavorable listening situations, where the CROS device considerably amplified the noise (mean: -4.8 dB). Subjectively, patients who did not normally wear a hearing aid on the non-CI side were satisfied with the CROS device.
Conclusions: The connection and synchronization of a CI processor as a CROS device is technically feasible, and the signal processing strategies of the device can be exploited. In contralaterally unaided patients, a subjective benefit can be achieved when wearing the CROS processor.
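The ~3 dB summation effect reported is what one expects when a second device contributes an equal, independent copy of the signal, doubling the received power; a one-line check:

```python
import math
print(10 * math.log10(2))  # ~3.01 dB gain from doubling signal power
```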
23. Kuusinen A, Saariniemi E, Sivonen V, Dietz A, Aarnisalo AA, Lokki T. An exploratory investigation of speech recognition thresholds in noise with auralisations of two reverberant rooms. Int J Audiol 2020;60:210-219. [PMID: 32964762] [DOI: 10.1080/14992027.2020.1817993]
Abstract
Objective: Speech-in-noise tests are widely used in hearing diagnostics but typically without reverberation, although reverberation is an inextricable part of everyday listening conditions. To support the development of more real-life-like test paradigms, the objective of this study was to explore how spatially reproduced reverberation affects speech recognition thresholds in normal-hearing and hearing-impaired listeners.
Design: Thresholds were measured with a Finnish speech-in-noise test without reverberation and with two test conditions with reverberation times of ∼0.9 and 1.8 s. Reverberant conditions were produced with a multichannel auralisation technique not used before in this context.
Study sample: Thirty-four normal-hearing and 14 hearing-impaired listeners participated in this study. Five people were tested with and without hearing aids.
Results: No significant differences between test conditions were found for the normal-hearing listeners. Results for the hearing-impaired listeners indicated better performance for the 0.9 s reverberation time compared to the reference and the 1.8 s conditions. Benefit from hearing aid use varied between individuals; for one person, an advantage was observed only with reverberation.
Conclusions: Auralisations may offer information on speech recognition performance that is not obtained with a test without reverberation. However, more complex stimuli and/or higher signal-to-noise ratios should be used in the future.
Affiliation(s)
- Antti Kuusinen
- Aalto Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
- Eero Saariniemi
- Aalto Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
- Ville Sivonen
- Department of Otorhinolaryngology, Helsinki University Hospital, Helsinki, Finland
- Aarno Dietz
- Department of Otolaryngology, Kuopio University Hospital, Kuopio, Finland
- Antti A Aarnisalo
- Department of Otorhinolaryngology, Helsinki University Hospital, Helsinki, Finland
- Tapio Lokki
- Aalto Acoustics Lab, Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
24.
Abstract
Being able to pick out particular sounds, such as speech, against a background of other sounds represents one of the key tasks performed by the auditory system. Understanding how this happens is important because speech recognition in noise is particularly challenging for older listeners and for people with hearing impairments. Central to this ability is the capacity of neurons to adapt to the statistics of sounds reaching the ears, which helps to generate noise-tolerant representations of sounds in the brain. In more complex auditory scenes, such as a cocktail party where the background noise comprises other voices, the sound features associated with each source have to be grouped together and segregated from those belonging to other sources. This depends on precise temporal coding and on modulation of cortical response properties when attending to a particular speaker in a multi-talker environment. Furthermore, the neural processing underlying auditory scene analysis is shaped by experience over multiple timescales.
25. Tang Z, Bryan NJ, Li D, Langlois TR, Manocha D. Scene-Aware Audio Rendering via Deep Acoustic Analysis. IEEE Trans Vis Comput Graph 2020;26:1991-2001. [PMID: 32070967] [DOI: 10.1109/tvcg.2020.2973058]
Abstract
We present a new method to capture the acoustic characteristics of real-world rooms using commodity devices, and use the captured characteristics to generate similar sounding sources with virtual models. Given the captured audio and an approximate geometric model of a real-world room, we present a novel learning-based method to estimate its acoustic material properties. Our approach is based on deep neural networks that estimate the reverberation time and equalization of the room from recorded audio. These estimates are used to compute material properties related to room reverberation using a novel material optimization objective. We use the estimated acoustic material characteristics for audio rendering using interactive geometric sound propagation and highlight the performance on many real-world scenarios. We also perform a user study to evaluate the perceptual similarity between the recorded sounds and our rendered audio.
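Reverberation time, one of the two quantities the networks estimate, is classically derived from a room impulse response via Schroeder backward integration; a sketch of a T20-style estimate (fit the -5 to -25 dB portion of the decay and extrapolate to -60 dB). This illustrates the quantity itself, not the paper's learned estimator.

```python
import numpy as np

def rt60_from_rir(rir, fs):
    """Schroeder backward integration: build the energy decay
    curve, fit its -5 to -25 dB segment with a line, and
    extrapolate the fitted slope to a 60 dB decay."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]        # energy decay curve
    edc_db = 10 * np.log10(edc / edc[0])
    t = np.arange(len(rir)) / fs
    sel = (edc_db <= -5) & (edc_db >= -25)
    slope, _ = np.polyfit(t[sel], edc_db[sel], 1)  # dB per second
    return -60.0 / slope
```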
Collapse
|
26
|
Chin BM, Burge J. Predicting the Partition of Behavioral Variability in Speed Perception with Naturalistic Stimuli. J Neurosci 2020; 40:864-879. [PMID: 31772139 PMCID: PMC6975300 DOI: 10.1523/jneurosci.1904-19.2019] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2019] [Revised: 11/12/2019] [Accepted: 11/17/2019] [Indexed: 11/21/2022] Open
Abstract
A core goal of visual neuroscience is to predict human perceptual performance from natural signals. Performance in any natural task can be limited by at least three sources of uncertainty: stimulus variability, internal noise, and suboptimal computations. Determining the relative importance of these factors has been a focus of interest for decades but requires methods for predicting the fundamental limits imposed by stimulus variability on sensory-perceptual precision. Most successes have been limited to simple stimuli and simple tasks. But perception science ultimately aims to understand how vision works with natural stimuli. Successes in this domain have proven elusive. Here, we develop a model of humans based on an image-computable (images in, estimates out) Bayesian ideal observer. Given biological constraints, the ideal observer optimally uses the statistics relating local intensity patterns in moving images to speed, specifying the fundamental limits imposed by natural stimuli. Next, we propose a theoretical link between two key decision-theoretic quantities that suggests how to experimentally disentangle the impacts of internal noise and deterministic suboptimal computations. In several interlocking discrimination experiments with three male observers, we confirm this link and determine the quantitative impact of each candidate performance-limiting factor. Human performance is near-exclusively limited by natural stimulus variability and internal noise, and humans use near-optimal computations to estimate speed from naturalistic image movies. The findings indicate that the partition of behavioral variability can be predicted from a principled analysis of natural images and scenes. The approach should be extendable to studies of neural variability with natural signals.
SIGNIFICANCE STATEMENT: Accurate estimation of speed is critical for determining motion in the environment, but humans cannot perform this task without error. Different objects moving at the same speed cast different images on the eyes. This stimulus variability imposes fundamental external limits on the human ability to estimate speed. Predicting these limits has proven difficult. Here, by analyzing natural signals, we predict the quantitative impact of natural stimulus variability on human performance given biological constraints. With integrated experiments, we compare its impact to well-studied performance-limiting factors internal to the visual system. The results suggest that the deterministic computations humans perform are near optimal, and that behavioral responses to natural stimuli can be studied with the rigor and interpretability defining work with simpler stimuli.
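The decision-theoretic structure of an ideal-observer readout can be illustrated with a toy discrete posterior. The sketch below is not the paper's image-computable observer; the speed grid, measurement, and noise level are invented for illustration.

```python
import numpy as np

# Toy ideal-observer readout: posterior over candidate speeds given one
# noisy internal measurement, with Gaussian likelihoods and a flat prior.
speeds = np.linspace(0.5, 8.0, 200)     # candidate speeds (deg/s), assumed grid
prior = np.ones_like(speeds) / speeds.size
measurement, sigma = 3.2, 0.6           # assumed noisy estimate and noise level
likelihood = np.exp(-0.5 * ((measurement - speeds) / sigma) ** 2)
posterior = likelihood * prior
posterior /= posterior.sum()            # normalize to a proper distribution
map_speed = speeds[np.argmax(posterior)]
```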
Collapse
Affiliation(s)
| | - Johannes Burge
- Department of Psychology,
- Neuroscience Graduate Group, and
- Bioengineering Graduate Group, University of Pennsylvania, Philadelphia, Pennsylvania 19104
| |
Collapse
|
27
|
Młynarski W, McDermott JH. Ecological origins of perceptual grouping principles in the auditory system. Proc Natl Acad Sci U S A 2019; 116:25355-25364. [PMID: 31754035 PMCID: PMC6911196 DOI: 10.1073/pnas.1903887116] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Events and objects in the world must be inferred from sensory signals to support behavior. Because sensory measurements are temporally and spatially local, the estimation of an object or event can be viewed as the grouping of these measurements into representations of their common causes. Perceptual grouping is believed to reflect internalized regularities of the natural environment, yet grouping cues have traditionally been identified using informal observation and investigated using artificial stimuli. The relationship of grouping to natural signal statistics has thus remained unclear, and additional or alternative cues remain possible. Here, we develop a general methodology for relating grouping to natural sensory signals and apply it to derive auditory grouping cues from natural sounds. We first learned local spectrotemporal features from natural sounds and measured their co-occurrence statistics. We then learned a small set of stimulus properties that could predict the measured feature co-occurrences. The resulting cues included established grouping cues, such as harmonic frequency relationships and temporal coincidence, but also revealed previously unappreciated grouping principles. Human perceptual grouping was predicted by natural feature co-occurrence, with humans relying on the derived grouping cues in proportion to their informativity about co-occurrence in natural sounds. The results suggest that auditory grouping is adapted to natural stimulus statistics, show how these statistics can reveal previously unappreciated grouping phenomena, and provide a framework for studying grouping in natural signals.
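The first step of the methodology, measuring co-occurrence statistics of learned features, can be sketched as follows; the activation matrix, binarization threshold, and normalization are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

# Activations of K learned spectrotemporal features across N natural-sound
# excerpts; random placeholder values stand in for measured activations.
rng = np.random.default_rng(0)
activations = rng.gamma(2.0, 1.0, size=(10000, 32))   # N excerpts x K features

# Binarize: a feature counts as "active" when in its own top decile.
active = activations > np.percentile(activations, 90, axis=0)

joint = active.T.astype(float) @ active.astype(float)  # joint activation counts
cooccur = joint / active.sum(axis=0)   # ~ P(row feature active | column active)
```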
Collapse
Affiliation(s)
- Wiktor Młynarski
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139;
- Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Josh H McDermott
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139;
- Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139
- McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139
- Program in Speech and Hearing Biosciences and Technology, Harvard University, Boston, MA 02115
| |
Collapse
|
28
|
Bianco MJ, Gerstoft P, Traer J, Ozanich E, Roch MA, Gannot S, Deledalle CA. Machine learning in acoustics: Theory and applications. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 146:3590. [PMID: 31795641 DOI: 10.1121/1.5133944] [Citation(s) in RCA: 140] [Impact Index Per Article: 28.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/09/2019] [Accepted: 10/14/2019] [Indexed: 06/10/2023]
Abstract
Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of techniques, which are often based in statistics, for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features and desired labels or actions, or between features themselves. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in four acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, and environmental sounds in everyday scenes.
Collapse
Affiliation(s)
- Michael J Bianco
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, California 92093, USA
| | - Peter Gerstoft
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, California 92093, USA
| | - James Traer
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
| | - Emma Ozanich
- Scripps Institution of Oceanography, University of California San Diego, La Jolla, California 92093, USA
| | - Marie A Roch
- Department of Computer Science, San Diego State University, San Diego, California 92182, USA
| | - Sharon Gannot
- Faculty of Engineering, Bar-Ilan University, Ramat-Gan 5290002, Israel
| | - Charles-Alban Deledalle
- Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, California 92093, USA
| |
Collapse
|
29
|
Kell AJE, McDermott JH. Invariance to background noise as a signature of non-primary auditory cortex. Nat Commun 2019; 10:3958. [PMID: 31477711 PMCID: PMC6718388 DOI: 10.1038/s41467-019-11710-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2018] [Accepted: 07/30/2019] [Indexed: 12/22/2022] Open
Abstract
Despite well-established anatomical differences between primary and non-primary auditory cortex, the associated representational transformations have remained elusive. Here we show that primary and non-primary auditory cortex are differentiated by their invariance to real-world background noise. We measured fMRI responses to natural sounds presented in isolation and in real-world noise, quantifying invariance as the correlation between the two responses for individual voxels. Non-primary areas were substantially more noise-invariant than primary areas. This difference between primary and non-primary areas occurred both for speech and non-speech sounds and was unaffected by a concurrent demanding visual task, suggesting that the observed invariance is not specific to speech processing and is robust to inattention. The difference was most pronounced for real-world background noise: both primary and non-primary areas were relatively robust to simple types of synthetic noise. Our results suggest a general representational transformation between auditory cortical stages, illustrating a representational consequence of hierarchical organization in the auditory system.
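A minimal sketch of the invariance metric described above: for each voxel, correlate its responses across sounds presented in isolation with its responses to the same sounds in noise. The response matrices below are random placeholders standing in for measured fMRI responses.

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.standard_normal((30, 5000))   # 30 sounds x 5000 voxels (assumed)
noisy = 0.7 * clean + 0.3 * rng.standard_normal(clean.shape)

def noise_invariance(clean, noisy):
    """Pearson correlation, per voxel, between responses to sounds in
    isolation and responses to the same sounds in background noise."""
    cz = (clean - clean.mean(0)) / clean.std(0)
    nz = (noisy - noisy.mean(0)) / noisy.std(0)
    return (cz * nz).mean(0)

invariance = noise_invariance(clean, noisy)   # one value in [-1, 1] per voxel
```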
Collapse
Affiliation(s)
- Alexander J E Kell
- Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, 02139, USA.
- McGovern Institute for Brain Research, MIT, Cambridge, MA, 02139, USA.
- Center for Brains, Minds, and Machines, MIT, Cambridge, MA, 02139, USA.
- Zuckerman Institute of Mind, Brain, and Behavior, Columbia University, New York, NY, 10027, USA.
| | - Josh H McDermott
- Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, 02139, USA.
- McGovern Institute for Brain Research, MIT, Cambridge, MA, 02139, USA.
- Center for Brains, Minds, and Machines, MIT, Cambridge, MA, 02139, USA.
- Program in Speech and Hearing Biosciences and Technology, Harvard University, Boston, MA, USA.
| |
Collapse
|
30
|
Sterling A, Rewkowski N, Klatzky RL, Lin MC. Audio-Material Reconstruction for Virtualized Reality Using a Probabilistic Damping Model. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2019; 25:1855-1864. [PMID: 30762560 DOI: 10.1109/tvcg.2019.2898822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Modal sound synthesis has been used to create realistic sounds from rigid-body objects, but requires accurate real-world material parameters. These material parameters can be estimated from recorded sounds of an impacted object, but external factors can interfere with accurate parameter estimation. We present a novel technique for estimating the damping parameters of materials from recorded impact sounds that probabilistically models these external factors. We represent the combined effects of material damping, support damping, and sampling inaccuracies with a probabilistic generative model, then use maximum likelihood estimation to fit a damping model to recorded data. This technique greatly reduces the human effort needed and does not require the precise object geometry or the exact hit location. We validate the effectiveness of this technique with a comprehensive analysis of a synthetic dataset and a perceptual study on object identification. We also present a study establishing human performance on the same parameter estimation task for comparison.
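A deterministic simplification of the damping-estimation problem can be sketched by fitting the log amplitude envelope of a single decaying mode; the paper's probabilistic model additionally accounts for support damping and sampling inaccuracies, which this toy version ignores.

```python
import numpy as np
from scipy.signal import hilbert

def estimate_damping(mode, fs):
    """Estimate the exponential damping coefficient alpha of a single
    decaying mode a(t) = A * exp(-alpha * t) * sin(2*pi*f*t) by a linear
    fit to the log of its Hilbert envelope."""
    env = np.abs(hilbert(mode))
    t = np.arange(mode.size) / fs
    slope, _ = np.polyfit(t, np.log(env + 1e-12), 1)
    return -slope

# Synthetic 440 Hz mode with alpha = 30; the estimate should be close.
fs = 44100
t = np.arange(int(0.5 * fs)) / fs
mode = np.exp(-30.0 * t) * np.sin(2 * np.pi * 440 * t)
alpha_hat = estimate_damping(mode, fs)
```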
Collapse
|
31
|
Schutte M, Ewert SD, Wiegrebe L. The percept of reverberation is not affected by visual room impression in virtual environments. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:EL229. [PMID: 31067971 DOI: 10.1121/1.5093642] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/03/2018] [Accepted: 02/21/2019] [Indexed: 06/09/2023]
Abstract
Humans possess mechanisms to suppress distracting early sound reflections, summarized as the precedence effect. Recent work shows that precedence is affected by visual stimulation. This paper investigates possible effects of visual stimulation on the perception of later reflections, i.e., reverberation. In a highly immersive audio-visual virtual reality environment, subjects were asked to quantify reverberation in conditions where simultaneously presented auditory and visual stimuli either matched in room identity, sound source azimuth, and sound source distance, or diverged in one of these aspects. While subjects reliably judged reverberation across acoustic environments, the visual room impression did not affect reverberation estimates.
Collapse
Affiliation(s)
- Michael Schutte
- Division of Neurobiology, Department Biology II and Graduate School of Systemic Neurosciences, Ludwig-Maximilians-Universität München, Germany
| | - Stephan D Ewert
- Medical Physics and Cluster of Excellence Hearing4all, University of Oldenburg, Germany
| | - Lutz Wiegrebe
- Division of Neurobiology, Department Biology II and Graduate School of Systemic Neurosciences, Ludwig-Maximilians-Universität München, Germany
| |
Collapse
|
32
|
Kopp-Scheinpflug C, Sinclair JL, Linden JF. When Sound Stops: Offset Responses in the Auditory System. Trends Neurosci 2018; 41:712-728. [DOI: 10.1016/j.tins.2018.08.009] [Citation(s) in RCA: 60] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2018] [Revised: 07/30/2018] [Accepted: 08/10/2018] [Indexed: 11/17/2022]
|
33
|
Abstract
Psychophysical experiments conducted remotely over the internet permit data collection from large numbers of participants but sacrifice control over sound presentation and therefore are not widely employed in hearing research. To help standardize online sound presentation, we introduce a brief psychophysical test for determining whether online experiment participants are wearing headphones. Listeners judge which of three pure tones is quietest, with one of the tones presented 180° out of phase across the stereo channels. This task is intended to be easy over headphones but difficult over loudspeakers due to phase cancellation. We validated the test in the lab by testing listeners known to be wearing headphones or listening over loudspeakers. The screening test was effective and efficient, discriminating between the two modes of listening with a small number of trials. When run online, a bimodal distribution of scores was obtained, suggesting that some participants performed the task over loudspeakers despite instructions to use headphones. The ability to detect and screen out these participants mitigates concerns over sound quality for online experiments, a first step toward opening auditory perceptual research to the possibilities afforded by crowdsourcing.
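A sketch of the stimulus construction for this screening task is shown below: three stereo pure tones, one of which has its channels inverted so that it largely cancels over loudspeakers but not over headphones. The frequency, levels, and duration are illustrative assumptions, not the published parameters.

```python
import numpy as np

def stereo_tone(freq, dur, fs, amp, antiphase=False):
    """A stereo pure tone; with antiphase=True the right channel is
    inverted, so left and right are 180 degrees out of phase."""
    t = np.arange(int(dur * fs)) / fs
    x = amp * np.sin(2 * np.pi * freq * t)
    right = -x if antiphase else x
    return np.stack([x, right], axis=1)

# One illustrative trial: over headphones the attenuated tone is quietest,
# while over loudspeakers the antiphase tone tends to cancel acoustically.
fs = 44100
tones = [stereo_tone(200.0, 1.0, fs, amp=1.0),
         stereo_tone(200.0, 1.0, fs, amp=0.5),            # -6 dB target
         stereo_tone(200.0, 1.0, fs, amp=1.0, antiphase=True)]
```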
Collapse
|
34
|
Salminen NH, Jones SJ, Christianson GB, Marquardt T, McAlpine D. A common periodic representation of interaural time differences in mammalian cortex. Neuroimage 2018; 167:95-103. [PMID: 29122721 PMCID: PMC5854251 DOI: 10.1016/j.neuroimage.2017.11.012] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2017] [Revised: 10/01/2017] [Accepted: 11/04/2017] [Indexed: 11/16/2022] Open
Abstract
Binaural hearing, the ability to detect small differences in the timing and level of sounds at the two ears, underpins the ability to localize sound sources along the horizontal plane, and is important for decoding complex spatial listening environments into separate objects – a critical factor in 'cocktail-party listening'. For human listeners, the most important spatial cue is the interaural time difference (ITD). Despite many decades of neurophysiological investigations of ITD sensitivity in small mammals, and computational models aimed at accounting for human perception, a lack of concordance between these studies has hampered our understanding of how the human brain represents and processes ITDs. Further, neural coding of spatial cues might depend on factors such as head size or hearing range, which differ considerably between humans and commonly used experimental animals. Here, using magnetoencephalography (MEG) in human listeners, and electrocorticography (ECoG) recordings in guinea pig—a small mammal representative of a range of animals in which ITD coding has been assessed at the level of single-neuron recordings—we tested whether processing of ITDs in human auditory cortex accords with a frequency-dependent periodic code of ITD reported in small mammals, or whether alternative or additional processing stages implemented in psychoacoustic models of human binaural hearing must be assumed. Our data were well accounted for by a model consisting of periodically tuned ITD-detectors, and were highly consistent across the two species. The results suggest that the representation of ITD in human auditory cortex is similar to that found in other mammalian species, a representation in which neural responses to ITD are determined by phase differences relative to sound frequency rather than, for instance, the range of ITDs permitted by head size or the absolute magnitude or direction of ITD.
Highlights: ITD tuning is studied in human MEG and guinea pig ECoG with identical stimuli. Auditory cortical tuning to ITD is highly consistent across species. Results are consistent with a periodic, frequency-dependent code.
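The periodic, frequency-dependent code can be illustrated with a toy detector whose response depends on the interaural phase difference relative to a best phase, as in cross-correlation-style binaural models; the parameter values below are assumptions.

```python
import numpy as np

def detector_response(itd_s, freq_hz, best_ipd_rad):
    """Toy periodically tuned ITD detector: the rate depends on the
    interaural phase difference (frequency times ITD) relative to the
    detector's best phase, so tuning repeats with period 1/frequency."""
    ipd = 2 * np.pi * freq_hz * itd_s
    return 0.5 * (1 + np.cos(ipd - best_ipd_rad))   # rate in [0, 1]

itds = np.linspace(-2e-3, 2e-3, 401)                # -2 to +2 ms
tuning_500hz = detector_response(itds, 500.0, np.pi / 4)
```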
Collapse
Affiliation(s)
- Nelli H Salminen
- Brain and Mind Laboratory, Dept. of Neuroscience and Biomedical Engineering, MEG Core, Aalto NeuroImaging, Aalto University School of Science, Espoo, Finland.
| | - Simon J Jones
- UCL Ear Institute, 332 Gray's Inn Road, London, WC1X 8EE, UK
| | | | | | - David McAlpine
- UCL Ear Institute, 332 Gray's Inn Road, London, WC1X 8EE, UK; Dept of Linguistics, Australian Hearing Hub, Macquarie University, Sydney, NSW 2109, Australia
| |
Collapse
|
35
|
Context-Dependent Effect of Reverberation on Material Perception from Impact Sound. Sci Rep 2017; 7:16455. [PMID: 29184117 PMCID: PMC5705663 DOI: 10.1038/s41598-017-16651-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2017] [Accepted: 11/15/2017] [Indexed: 12/02/2022] Open
Abstract
Our hearing is usually robust against reverberation. This study asked how such robustness to everyday sounds is realized, and what kinds of acoustic cues contribute to it. We focused on the perception of materials from impact sounds, which is a common everyday experience and for which the responsible acoustic features have already been identified in the absence of reverberation. In our experiment, participants identified materials from impact sounds with and without reverberation. Imposing reverberation did not alter the material responses averaged across participants. However, an analysis of each participant revealed a significant effect of reverberation, with response patterns varying among participants. The effect depended on the context of stimulus presentation: it was smaller when the reverberation was constant than when it varied from presentation to presentation. The context modified the relative contribution of the spectral features of the sounds to material identification, while no consistent change across participants was observed for the temporal features. Although the detailed results varied greatly among participants, they suggest that a mechanism exists in the auditory system that compensates for reverberation based on adaptation to the spectral features of reverberant sound.
Collapse
|
36
|
Town SM, Brimijoin WO, Bizley JK. Egocentric and allocentric representations in auditory cortex. PLoS Biol 2017; 15:e2001878. [PMID: 28617796 PMCID: PMC5472254 DOI: 10.1371/journal.pbio.2001878] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2016] [Accepted: 05/08/2017] [Indexed: 11/18/2022] Open
Abstract
A key function of the brain is to provide a stable representation of an object's location in the world. In hearing, sound azimuth and elevation are encoded by neurons throughout the auditory system, and auditory cortex is necessary for sound localization. However, the coordinate frame in which neurons represent sound space remains undefined: classical spatial receptive fields in head-fixed subjects can be explained either by sensitivity to sound source location relative to the head (egocentric) or relative to the world (allocentric encoding). This coordinate frame ambiguity can be resolved by studying freely moving subjects; here we recorded spatial receptive fields in the auditory cortex of freely moving ferrets. We found that most spatially tuned neurons represented sound source location relative to the head across changes in head position and direction. In addition, we also recorded a small number of neurons in which sound location was represented in a world-centered coordinate frame. We used measurements of spatial tuning across changes in head position and direction to explore the influence of sound source distance and speed of head movement on auditory cortical activity and spatial tuning. Modulation depth of spatial tuning increased with distance for egocentric but not allocentric units, whereas, for both populations, modulation was stronger at faster movement speeds. Our findings suggest that early auditory cortex primarily represents sound source location relative to ourselves but that a minority of cells can represent sound location in the world independent of our own position.
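The distinction between the two coordinate frames reduces to a transform of source position by head position and direction; a minimal 2D sketch is given below, with invented positions and angles. An egocentric unit's preferred location stays fixed in the head-centered output, an allocentric unit's in the world-centered input.

```python
import numpy as np

def world_to_head(source_xy, head_xy, head_dir_rad):
    """Convert a sound source position from world-centered to
    head-centered 2D coordinates: translate to the head origin, then
    rotate by the negative of the head direction."""
    dx, dy = np.asarray(source_xy) - np.asarray(head_xy)
    c, s = np.cos(-head_dir_rad), np.sin(-head_dir_rad)
    return np.array([c * dx - s * dy, s * dx + c * dy])

# Same world position seen from two head poses: the head-centered
# coordinates change, while the world-centered ones stay fixed.
p1 = world_to_head([1.0, 0.5], [0.0, 0.0], 0.0)
p2 = world_to_head([1.0, 0.5], [0.5, 0.0], np.pi / 2)
```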
Collapse
Affiliation(s)
- Stephen M. Town
- Ear Institute, University College London, London, United Kingdom
| | - W. Owen Brimijoin
- MRC/CSO Institute of Hearing Research – Scottish Section, Glasgow, United Kingdom
| | | |
Collapse
|
37
|
Fuglsang SA, Dau T, Hjortkjær J. Noise-robust cortical tracking of attended speech in real-world acoustic scenes. Neuroimage 2017; 156:435-444. [PMID: 28412441 DOI: 10.1016/j.neuroimage.2017.04.026] [Citation(s) in RCA: 97] [Impact Index Per Article: 13.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2016] [Revised: 04/07/2017] [Accepted: 04/10/2017] [Indexed: 11/30/2022] Open
Abstract
Selectively attending to one speaker in a multi-speaker scenario is thought to synchronize low-frequency cortical activity to the attended speech signal. In recent studies, reconstruction of speech from single-trial electroencephalogram (EEG) data has been used to decode which talker a listener is attending to in a two-talker situation. It is currently unclear how this generalizes to more complex sound environments. Behaviorally, speech perception is robust to the acoustic distortions that listeners typically encounter in everyday life, but it is unknown whether this is mirrored by a noise-robust neural tracking of attended speech. Here we used advanced acoustic simulations to recreate real-world acoustic scenes in the laboratory. In virtual acoustic realities with varying amounts of reverberation and numbers of interfering talkers, listeners selectively attended to the speech stream of a particular talker. Across the different listening environments, we found that the attended talker could be accurately decoded from single-trial EEG data irrespective of the different distortions in the acoustic input. For highly reverberant environments, speech envelopes reconstructed from neural responses to the distorted stimuli resembled the original clean signal more than the distorted input. With reverberant speech, we observed a late cortical response to the attended speech stream that encoded temporal modulations in the speech signal without its reverberant distortion. Single-trial attention decoding accuracies based on 40-50 s long blocks of data from 64 scalp electrodes were equally high (80-90% correct) in all considered listening environments and remained statistically significant with as few as 10 scalp electrodes and short (<30 s) unaveraged EEG segments. In contrast to the robust decoding of the attended talker, we found that decoding of the unattended talker deteriorated with the acoustic distortions. These results suggest that cortical activity tracks an attended speech signal in a way that is invariant to acoustic distortions encountered in real-life sound environments. Noise-robust attention decoding additionally suggests a potential utility of stimulus-reconstruction techniques in attention-controlled brain-computer interfaces.
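Stimulus reconstruction of the kind described above is commonly implemented as a lagged ridge-regression decoder; the sketch below follows that standard recipe (not necessarily the study's exact estimator), with random arrays standing in for EEG and the attended speech envelope.

```python
import numpy as np

def lagged(eeg, max_lag):
    """Stack time-lagged copies of each EEG channel (0..max_lag samples)."""
    T, C = eeg.shape
    X = np.zeros((T, C * (max_lag + 1)))
    for k in range(max_lag + 1):
        X[k:, k * C:(k + 1) * C] = eeg[: T - k]
    return X

def train_decoder(eeg, envelope, max_lag=32, lam=1e2):
    """Fit ridge-regression weights mapping lagged EEG to the envelope."""
    X = lagged(eeg, max_lag)
    XtX = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ envelope)

# Random stand-ins for 64-channel EEG and the attended talker's envelope.
rng = np.random.default_rng(2)
eeg = rng.standard_normal((4096, 64))        # samples x channels
env_attended = rng.standard_normal(4096)
w = train_decoder(eeg, env_attended)
recon = lagged(eeg, 32) @ w
# Attention decoding then asks which talker's envelope the reconstruction
# correlates with more strongly.
```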
Collapse
Affiliation(s)
- Søren Asp Fuglsang
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Ørsteds Plads, Building 352, 2800 Kgs. Lyngby, Denmark.
| | - Torsten Dau
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Ørsteds Plads, Building 352, 2800 Kgs. Lyngby, Denmark
| | - Jens Hjortkjær
- Hearing Systems Group, Department of Electrical Engineering, Technical University of Denmark, Ørsteds Plads, Building 352, 2800 Kgs. Lyngby, Denmark; Danish Research Centre for Magnetic Resonance, Centre for Functional and Diagnostic Imaging and Research, Copenhagen University Hospital Hvidovre, Kettegaard Allé 30, 2650 Hvidovre, Denmark.
| |
Collapse
|
38
|
Hearing Scenes: A Neuromagnetic Signature of Auditory Source and Reverberant Space Separation. eNeuro 2017; 4:eN-NWR-0007-17. [PMID: 28451630 PMCID: PMC5394928 DOI: 10.1523/eneuro.0007-17.2017] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2016] [Revised: 02/03/2017] [Accepted: 02/06/2017] [Indexed: 11/21/2022] Open
Abstract
Perceiving the geometry of surrounding space is a multisensory process, crucial to contextualizing object perception and guiding navigation behavior. Humans can make judgments about surrounding spaces from reverberation cues, caused by sounds reflecting off multiple interior surfaces. However, it remains unclear how the brain represents reverberant spaces separately from sound sources. Here, we report separable neural signatures of auditory space and source perception during magnetoencephalography (MEG) recording as subjects listened to brief sounds convolved with monaural room impulse responses (RIRs). The decoding signature of sound sources began at 57 ms after stimulus onset and peaked at 130 ms, while space decoding started at 138 ms and peaked at 386 ms. Importantly, these neuromagnetic responses were readily dissociable in form and time: while sound source decoding exhibited an early and transient response, the neural signature of space was sustained and independent of the original source that produced it. The reverberant space response was robust to variations in sound source, and vice versa, indicating a generalized response not tied to specific source-space combinations. These results provide the first neuromagnetic evidence for robust, dissociable auditory source and reverberant space representations in the human brain and reveal the temporal dynamics of how auditory scene analysis extracts percepts from complex naturalistic auditory signals.
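Decoding time courses like those reported above are typically obtained by training a classifier independently at each timepoint; a minimal sketch using scikit-learn follows, with random arrays standing in for MEG sensor patterns and the data shape (trials, sensors, timepoints) assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Time-resolved decoding sketch: cross-validate a linear classifier on the
# sensor pattern at each timepoint, yielding accuracy as a function of time.
rng = np.random.default_rng(3)
X = rng.standard_normal((120, 306, 100))     # trials x sensors x timepoints
y = rng.integers(0, 2, size=120)             # two stimulus classes

accuracy = np.array([
    cross_val_score(LogisticRegression(max_iter=1000), X[:, :, t], y, cv=5).mean()
    for t in range(X.shape[2])
])
```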
Collapse
|