1
Dossa RFJ, Arulkumaran K, Juliani A, Sasai S, Kanai R. Design and evaluation of a global workspace agent embodied in a realistic multimodal environment. Front Comput Neurosci 2024; 18:1352685. [PMID: 38948336 PMCID: PMC11211627 DOI: 10.3389/fncom.2024.1352685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Accepted: 05/20/2024] [Indexed: 07/02/2024] Open
Abstract
As the apparent intelligence of artificial neural networks (ANNs) advances, they are increasingly likened to the functional networks and information processing capabilities of the human brain. Such comparisons have typically focused on particular modalities, such as vision or language. The next frontier is to use the latest advances in ANNs to design and investigate scalable models of higher-level cognitive processes, such as conscious information access, which have historically lacked concrete and specific hypotheses for scientific evaluation. In this work, we propose and then empirically assess an embodied agent with a structure based on global workspace theory (GWT) as specified in the recently proposed "indicator properties" of consciousness. In contrast to prior works on GWT which utilized single modalities, our agent is trained to navigate 3D environments based on realistic audiovisual inputs. We find that the global workspace architecture performs better and more robustly at smaller working memory sizes, as compared to a standard recurrent architecture. Beyond performance, we perform a series of analyses on the learned representations of our architecture and share findings that point to task complexity and regularization being essential for feature learning and the development of meaningful attentional patterns within the workspace.
2
Lutfi RA, Zandona M, Lee J. Simultaneous relative cue reliance in speech-on-speech masking. The Journal of the Acoustical Society of America 2023; 154:2530-2538. [PMID: 37870932 DOI: 10.1121/10.0021874] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Accepted: 09/27/2023] [Indexed: 10/25/2023]
Abstract
Modern hearing research has identified the ability of listeners to segregate simultaneous speech streams with a reliance on three major voice cues: fundamental frequency, level, and location. Few of these studies evaluated reliance for these cues presented simultaneously, as occurs in nature, and fewer still considered the listeners' relative reliance on these cues owing to the cues' different units of measure. In the present study, trial-by-trial analyses were used to isolate the listener's simultaneous reliance on the three voice cues, with the behavior of an ideal observer [Green and Swets (1966). (Wiley, New York), pp.151-178] serving as a comparison standard for evaluating relative reliance. Listeners heard on each trial a pair of randomly selected, simultaneous recordings of naturally spoken sentences. One of the recordings was always from the same talker, a distracter, and the other, with equal probability, was from one of two target talkers differing in the three voice cues. The listener's task was to identify the target talker. Among 33 clinically normal-hearing adults, only one relied predominantly on voice level; the remainder were split between voice fundamental frequency and/or location. The results are discussed regarding their implications for the common practice in studies of using target-distracter level as a dependent measure of speech-on-speech masking.
Affiliation(s)
- R A Lutfi
  - Auditory Behavioral Research Lab, Department of Communication Sciences and Disorders, University of South Florida, Tampa, Florida 33620, USA
- M Zandona
  - Auditory Behavioral Research Lab, Department of Communication Sciences and Disorders, University of South Florida, Tampa, Florida 33620, USA
- J Lee
  - Auditory Behavioral Research Lab, Department of Communication Sciences and Disorders, University of South Florida, Tampa, Florida 33620, USA
3
Chou KF, Boyd AD, Best V, Colburn HS, Sen K. A biologically oriented algorithm for spatial sound segregation. Front Neurosci 2022; 16:1004071. [PMID: 36312015 PMCID: PMC9614053 DOI: 10.3389/fnins.2022.1004071] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/26/2022] [Accepted: 09/28/2022] [Indexed: 11/13/2022] Open
Abstract
Listening in an acoustically cluttered scene remains a difficult task for both machines and hearing-impaired listeners. Normal-hearing listeners accomplish this task with relative ease by segregating the scene into its constituent sound sources, then selecting and attending to a target source. An assistive listening device that mimics the biological mechanisms underlying this behavior may provide an effective solution for those with difficulty listening in acoustically cluttered environments (e.g., a cocktail party). Here, we present a binaural sound segregation algorithm based on a hierarchical network model of the auditory system. In the algorithm, binaural sound inputs first drive populations of neurons tuned to specific spatial locations and frequencies. The spiking responses of neurons in the output layer are then reconstructed into audible waveforms via a novel reconstruction method. We evaluate the performance of the algorithm with a speech-on-speech intelligibility task in normal-hearing listeners. This two-microphone-input algorithm is shown to provide listeners with perceptual benefit similar to that of a 16-microphone acoustic beamformer. These results demonstrate the promise of this biologically inspired algorithm for enhancing selective listening in challenging multi-talker scenes.
Affiliation(s)
- Kenny F. Chou
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Alexander D. Boyd
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Virginia Best
  - Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA, United States
- H. Steven Colburn
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
- Kamal Sen
  - Department of Biomedical Engineering, Boston University, Boston, MA, United States
  - *Correspondence: Kamal Sen,
4
Widmann A, Schröger E. Intention-based predictive information modulates auditory deviance processing. Front Neurosci 2022; 16:995119. [PMID: 36248631 PMCID: PMC9554204 DOI: 10.3389/fnins.2022.995119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 09/08/2022] [Indexed: 11/26/2022] Open
Abstract
The human brain is highly responsive to (deviant) sounds violating an auditory regularity. Respective brain responses are usually investigated in situations when the sounds were produced by the experimenter. Acknowledging that humans also actively produce sounds, the present event-related potential study tested for differences in the brain responses to deviants that were produced by the listeners by pressing one of two buttons. In one condition, deviants were unpredictable with respect to the button-sound association. In another condition, deviants were predictable with high validity yielding correctly predicted deviants and incorrectly predicted (mispredicted) deviants. Temporal principal component analysis revealed deviant-specific N1 enhancement, mismatch negativity (MMN) and P3a. N1 enhancements were highly similar for each deviant type, indicating that the underlying neural mechanism is not affected by intention-based expectation about the self-produced forthcoming sound. The MMN was abolished for predictable deviants, suggesting that the intention-based prediction for a deviant can overwrite the prediction derived from the auditory regularity (predicting a standard). The P3a was present for each deviant type but was largest for mispredicted deviants. It is argued that the processes underlying P3a not only evaluate the deviant with respect to the fact that it violates an auditory regularity but also with respect to the intended sensorial effect of an action. Overall, our results specify current theories of auditory predictive processing, as they reveal that intention-based predictions exert different effects on different deviance-specific brain responses.
Affiliation(s)
- Andreas Widmann
  - Wilhelm Wundt Institute for Psychology, Leipzig University, Leipzig, Germany
  - Leibniz Institute for Neurobiology, Magdeburg, Germany
  - *Correspondence: Andreas Widmann,
- Erich Schröger
  - Wilhelm Wundt Institute for Psychology, Leipzig University, Leipzig, Germany
  - Erich Schröger,
5
Huang Y, Hao Y, Xu J, Xu B. Compressing speaker extraction model with ultra-low precision quantization and knowledge distillation. Neural Netw 2022; 154:13-21. [PMID: 35841810 DOI: 10.1016/j.neunet.2022.06.026] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/11/2021] [Revised: 04/20/2022] [Accepted: 06/21/2022] [Indexed: 11/25/2022]
Abstract
Recently, our proposed speaker extraction model, WASE (learning When to Attend for Speaker Extraction) yielded superior performance over the prior state-of-the-art methods by explicitly modeling onset clue and regarding it as important guidance in speaker extraction tasks. However, it still remains challenging when it comes to the deployments on the resource-constrained devices, where the model must be tiny and fast to perform inference with minimal budget in CPU and memory while keeping the speaker extraction performance. In this work, we utilize model compression techniques to alleviate the problem and propose a lightweight speaker extraction model, TinyWASE, which aims to run on resource-constrained devices. Specifically, we mainly investigate the grouping effects of quantization-aware training and knowledge distillation techniques in the speaker extraction task and propose Distillation-aware Quantization. Experiments on WSJ0-2mix dataset show that our proposed model can achieve comparable performance as the full-precision model while reducing the model size using ultra-low bits (e.g. 3 bits), obtaining 8.97x compression ratio and 2.15 MB model size. We further show that TinyWASE can combine with other model compression techniques, such as parameter sharing, to achieve compression ratio as high as 23.81 with limited performance degradation. Our code is available at https://github.com/aispeech-lab/TinyWASE.
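As a rough illustration of what ultra-low-bit quantization means, the sketch below maps a handful of weights onto 2^3 = 8 evenly spaced levels and back. It is plain uniform quantization applied after the fact, not the Distillation-aware Quantization described in the abstract; the function names, the toy weights, and the assumption of 32-bit full-precision storage are all illustrative.

```python
def quantize_uniform(weights, bits=3):
    """Map floats to integer codes on 2**bits evenly spaced levels in [min, max]."""
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # Degenerate case: every weight is identical, so one level suffices.
        return [0] * len(weights), lo, 1.0
    step = (hi - lo) / (levels - 1)
    codes = [round((w - lo) / step) for w in weights]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate float weights from integer codes."""
    return [lo + c * step for c in codes]


# Hypothetical toy weights, not taken from any real model.
weights = [-0.82, -0.31, 0.0, 0.07, 0.45, 0.93]
codes, lo, step = quantize_uniform(weights, bits=3)
restored = dequantize(codes, lo, step)

# The rounding error of uniform quantization is at most half a step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))

# Idealized bound, assuming 32-bit full-precision weights and ignoring
# per-tensor metadata (lo, step) and any layers left unquantized.
ideal_ratio = 32 / 3
```

Under that 32-bit assumption the idealized ratio is about 10.7x, an upper bound that sits above the 8.97x the abstract reports for the full model.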
Affiliation(s)
- Yating Huang
  - Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China.
- Yunzhe Hao
  - Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China.
- Jiaming Xu
  - Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China.
- Bo Xu
  - Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China; School of Future Technology, University of Chinese Academy of Sciences, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; Center for Excellence in Brain Science and Intelligence Technology, CAS, Shanghai, China
6
Automated Beehive Acoustics Monitoring: A Comprehensive Review of the Literature and Recommendations for Future Work. Applied Sciences-Basel 2022. [DOI: 10.3390/app12083920] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 11/16/2022]
Abstract
Bees play an important role in agriculture and ecology, and their pollination efficiency is essential to the economic profitability of farms. The drastic decrease in bee populations witnessed over the last decade has attracted great attention to automated remote beehive monitoring research, with beehive acoustics analysis emerging as a prominent field. In this paper, we review the existing literature on bee acoustics analysis and report on the articles published between January 2012 and December 2021. Five categories are explored in further detail, including the origin of the articles, their study goal, experimental setup, audio analysis methodology, and reproducibility. Highlights and limitations in each of these categories are presented and discussed. We conclude with a set of recommendations for future studies, with suggestions ranging from bee species characterization, to recording and testing setup descriptions, to making data and codes available to help advance this new multidisciplinary field.
7
Wang L, Wang Y, Liu Z, Wu EX, Chen F. A Speech-Level–Based Segmented Model to Decode the Dynamic Auditory Attention States in the Competing Speaker Scenes. Front Neurosci 2022; 15:760611. [PMID: 35221885 PMCID: PMC8866945 DOI: 10.3389/fnins.2021.760611] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2021] [Accepted: 12/30/2021] [Indexed: 11/21/2022] Open
Abstract
In the competing speaker environments, human listeners need to focus or switch their auditory attention according to dynamic intentions. The reliable cortical tracking ability to the speech envelope is an effective feature for decoding the target speech from the neural signals. Moreover, previous studies revealed that the root mean square (RMS)–level–based speech segmentation made a great contribution to the target speech perception with the modulation of sustained auditory attention. This study further investigated the effect of the RMS-level–based speech segmentation on the auditory attention decoding (AAD) performance with both sustained and switched attention in the competing speaker auditory scenes. Objective biomarkers derived from the cortical activities were also developed to index the dynamic auditory attention states. In the current study, subjects were asked to concentrate or switch their attention between two competing speaker streams. The neural responses to the higher- and lower-RMS-level speech segments were analyzed via the linear temporal response function (TRF) before and after the attention switching from one to the other speaker stream. Furthermore, the AAD performance decoded by the unified TRF decoding model was compared to that by the speech-RMS-level–based segmented decoding model with the dynamic change of the auditory attention states. The results showed that the weight of the typical TRF component at approximately 100-ms time lag was sensitive to the switching of the auditory attention. Compared to the unified AAD model, the segmented AAD model improved attention decoding performance under both the sustained and switched auditory attention modulations in a wide range of signal-to-masker ratios (SMRs). In the competing speaker scenes, the TRF weight and AAD accuracy could be used as effective indicators to detect the changes of the auditory attention.
In addition, with a wide range of SMRs (i.e., from 6 to –6 dB in this study), the segmented AAD model showed the robust decoding performance even with short decision window length, suggesting that this speech-RMS-level–based model has the potential to decode dynamic attention states in the realistic auditory scenarios.
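The RMS-level-based segmentation mentioned in this abstract can be pictured with a small sketch: each fixed-length frame is labeled "higher" or "lower" according to its RMS level relative to the whole signal. The frame length, the -10 dB threshold, and the function names are assumptions chosen for illustration; the abstract does not state the criterion the authors used.

```python
import math


def rms(samples):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def label_segments(signal, frame_len, threshold_db=-10.0):
    """Label non-overlapping frames by RMS level relative to the overall RMS.

    threshold_db is a hypothetical cut-off, not a value from the paper.
    """
    overall = rms(signal)
    labels = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame_rms = rms(signal[start:start + frame_len])
        if overall == 0.0 or frame_rms == 0.0:
            # Silence has no defined level in dB; treat it as lower.
            labels.append("lower")
            continue
        level_db = 20.0 * math.log10(frame_rms / overall)
        labels.append("higher" if level_db >= threshold_db else "lower")
    return labels


# Toy signal: a loud frame, a quiet frame, and another loud frame.
signal = ([1.0, -1.0, 1.0, -1.0]
          + [0.01, -0.01, 0.01, -0.01]
          + [0.8, -0.8, 0.8, -0.8])
labels = label_segments(signal, frame_len=4)
print(labels)  # ['higher', 'lower', 'higher']
```

In a segmented decoding scheme of the kind the abstract describes, frames with different labels would then be routed to separately trained decoders; that routing is not shown here.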
Affiliation(s)
- Lei Wang
  - Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China
  - Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Yihan Wang
  - Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China
- Zhixing Liu
  - Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China
- Ed X. Wu
  - Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
- Fei Chen
  - Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China
  - *Correspondence: Fei Chen,
8
Spatial Audio Scene Characterization (SASC): Automatic Localization of Front-, Back-, Up-, and Down-Positioned Music Ensembles in Binaural Recordings. Applied Sciences-Basel 2022. [DOI: 10.3390/app12031569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
The automatic localization of audio sources distributed symmetrically with respect to coronal or transverse planes using binaural signals still poses a challenging task, due to the front–back and up–down confusion effects. This paper demonstrates that the convolutional neural network (CNN) can be used to automatically localize music ensembles panned to the front, back, up, or down positions. The network was developed using the repository of the binaural excerpts obtained by the convolution of multi-track music recordings with the selected sets of head-related transfer functions (HRTFs). They were generated in such a way that a music ensemble (of circular shape in terms of its boundaries) was positioned in one of the following four locations with respect to the listener: front, back, up, and down. According to the obtained results, CNN identified the location of the ensembles with the average accuracy levels of 90.7% and 71.4% when tested under the HRTF-dependent and HRTF-independent conditions, respectively. For HRTF-dependent tests, the accuracy decreased monotonically with the increase in the ensemble size. A modified image occlusion sensitivity technique revealed selected frequency bands as being particularly important in terms of the localization process. These frequency bands are largely in accordance with the psychoacoustical literature.
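The binaural excerpts described above come from convolving recordings with head-related transfer functions. A minimal time-domain sketch of that operation follows, using head-related impulse responses (the time-domain counterparts of HRTFs). The three-tap impulse responses are made-up toy values, not measured data, and the function names are illustrative.

```python
def convolve(signal, kernel):
    """Direct (full) linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def binauralize(mono, hrir_left, hrir_right):
    """Render a mono source to two ears with one impulse response per ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


mono = [1.0, 0.5, 0.25]

# Toy impulse responses (assumed): the right ear receives the sound one
# sample later and attenuated, as if the source were on the listener's left.
hrir_left = [1.0, 0.0, 0.0]
hrir_right = [0.0, 0.6, 0.0]

left, right = binauralize(mono, hrir_left, hrir_right)
```

A multi-track ensemble would be rendered by convolving each track with the impulse-response pair for its position and summing the results per ear; real HRIRs are hundreds of taps long, so an FFT-based convolution would normally replace the direct loop.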
9
Luberadzka J, Kayser H, Hohmann V. Making sense of periodicity glimpses in a prediction-update-loop: A computational model of attentive voice tracking. The Journal of the Acoustical Society of America 2022; 151:712. [PMID: 35232067 PMCID: PMC9088677 DOI: 10.1121/10.0009337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2021] [Revised: 11/13/2021] [Accepted: 01/03/2022] [Indexed: 06/14/2023]
Abstract
Humans are able to follow a speaker even in challenging acoustic conditions. The perceptual mechanisms underlying this ability remain unclear. A computational model of attentive voice tracking, consisting of four computational blocks: (1) sparse periodicity-based auditory features (sPAF) extraction, (2) foreground-background segregation, (3) state estimation, and (4) top-down knowledge, is presented. The model connects the theories about auditory glimpses, foreground-background segregation, and Bayesian inference. It is implemented with the sPAF, sequential Monte Carlo sampling, and probabilistic voice models. The model is evaluated by comparing it with the human data obtained in the study by Woods and McDermott [Curr. Biol. 25(17), 2238-2246 (2015)], which measured the ability to track one of two competing voices with time-varying parameters [fundamental frequency (F0) and formants (F1,F2)]. Three model versions were tested, which differ in the type of information used for the segregation: version (a) uses the oracle F0, version (b) uses the estimated F0, and version (c) uses the spectral shape derived from the estimated F0 and oracle F1 and F2. Version (a) simulates the optimal human performance in conditions with the largest separation between the voices, version (b) simulates the conditions in which the separation is not sufficient to follow the voices, and version (c) is closest to the human performance for moderate voice separation.
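To make the prediction-update loop with sequential Monte Carlo sampling concrete, here is a minimal bootstrap particle filter that tracks a slowly drifting F0 from noisy observations. It is a generic textbook filter, not the authors' model: there is no sPAF front end, no competing voice, and no probabilistic voice model, and every parameter value below is an assumption.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_sd=3.0,
                    obs_sd=5.0, f0_range=(100.0, 300.0), seed=0):
    """Track a scalar state (F0 in Hz) with a bootstrap particle filter."""
    rng = random.Random(seed)
    particles = [rng.uniform(*f0_range) for _ in range(n_particles)]
    estimates = []
    for obs in observations:
        # Prediction: random-walk dynamics for the voice's F0.
        particles = [p + rng.gauss(0.0, process_sd) for p in particles]
        # Update: weight each particle by a Gaussian observation likelihood.
        weights = [math.exp(-0.5 * ((obs - p) / obs_sd) ** 2) for p in particles]
        total = sum(weights)
        if total == 0.0:
            # All likelihoods underflowed; fall back to uniform weights.
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # State estimate: posterior mean over the weighted particles.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # Resampling: draw a new particle set in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Synthetic test data (assumed): F0 rises 0.5 Hz per step from 200 Hz,
# observed with Gaussian noise of 5 Hz standard deviation.
data_rng = random.Random(1)
true_f0 = [200.0 + 0.5 * t for t in range(50)]
observations = [f + data_rng.gauss(0.0, 5.0) for f in true_f0]

estimates = particle_filter(observations)
mean_abs_error = sum(abs(e - f) for e, f in zip(estimates, true_f0)) / len(true_f0)
```

Tracking one of two competing voices would additionally require deciding which observations belong to the target, which is the role of the foreground-background segregation block named in the abstract.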
Affiliation(s)
- Joanna Luberadzka
  - Auditory Signal Processing, Department of Medical Physics and Acoustics, University of Oldenburg, Germany
- Hendrik Kayser
  - Auditory Signal Processing, Department of Medical Physics and Acoustics, University of Oldenburg, Germany
- Volker Hohmann
  - Auditory Signal Processing, Department of Medical Physics and Acoustics, University of Oldenburg, Germany
10
Attentional control via synaptic gain mechanisms in auditory streaming. Brain Res 2021; 1778:147720. [PMID: 34785256 DOI: 10.1016/j.brainres.2021.147720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2021] [Revised: 09/13/2021] [Accepted: 11/05/2021] [Indexed: 11/21/2022]
Abstract
Attention is a crucial component in sound source segregation allowing auditory objects of interest to be both singled out and held in focus. Our study utilizes a fundamental paradigm for sound source segregation: a sequence of interleaved tones, A and B, of different frequencies that can be heard as a single integrated stream or segregated into two streams (auditory streaming paradigm). We focus on the irregular alternations between integrated and segregated that occur for long presentations, so-called auditory bistability. Psychoacoustic experiments demonstrate how attentional control, a listener's intention to experience integrated or segregated, biases perception in favour of different perceptual interpretations. Our data show that this is achieved by prolonging the dominance times of the attended percept and, to a lesser extent, by curtailing the dominance times of the unattended percept, an effect that remains consistent across a range of values for the difference in frequency between A and B. An existing neuromechanistic model describes the neural dynamics of perceptual competition downstream of primary auditory cortex (A1). The model allows us to propose plausible neural mechanisms for attentional control, as linked to different attentional strategies, in a direct comparison with behavioural data. A mechanism based on a percept-specific input gain best accounts for the effects of attentional control.
11
Holmes E, Parr T, Griffiths TD, Friston KJ. Active inference, selective attention, and the cocktail party problem. Neurosci Biobehav Rev 2021; 131:1288-1304. [PMID: 34687699 DOI: 10.1016/j.neubiorev.2021.09.038] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/12/2021] [Revised: 08/27/2021] [Accepted: 09/17/2021] [Indexed: 11/25/2022]
Abstract
In this paper, we introduce a new generative model for an active inference account of preparatory and selective attention, in the context of a classic 'cocktail party' paradigm. In this setup, pairs of words are presented simultaneously to the left and right ears and an instructive spatial cue directs attention to the left or right. We use this generative model to test competing hypotheses about the way that human listeners direct preparatory and selective attention. We show that assigning low precision to words at attended, relative to unattended, locations can explain why a listener reports words from a competing sentence. Under this model, temporal changes in sensory precision were not needed to account for faster reaction times with longer cue-target intervals, but were necessary to explain ramping effects on event-related potentials (ERPs), resembling the contingent negative variation (CNV), during the preparatory interval. These simulations reveal that different processes are likely to underlie the improvement in reaction times and the ramping of ERPs that are associated with spatial cueing.
Affiliation(s)
- Emma Holmes
  - Department of Speech Hearing and Phonetic Sciences, UCL, London, WC1N 1PF, UK; Wellcome Centre for Human Neuroimaging, UCL, London, WC1N 3AR, UK.
- Thomas Parr
  - Wellcome Centre for Human Neuroimaging, UCL, London, WC1N 3AR, UK
- Timothy D Griffiths
  - Wellcome Centre for Human Neuroimaging, UCL, London, WC1N 3AR, UK; Biosciences Institute, Newcastle University, Newcastle upon Tyne, NE2 4HH, UK
- Karl J Friston
  - Wellcome Centre for Human Neuroimaging, UCL, London, WC1N 3AR, UK
12
Ferrario A, Rankin J. Auditory streaming emerges from fast excitation and slow delayed inhibition. Journal of Mathematical Neuroscience 2021; 11:8. [PMID: 33939042 PMCID: PMC8093365 DOI: 10.1186/s13408-021-00106-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/04/2020] [Accepted: 04/22/2021] [Indexed: 05/29/2023]
Abstract
In the auditory streaming paradigm, alternating sequences of pure tones can be perceived as a single galloping rhythm (integration) or as two sequences with separated low and high tones (segregation). Although studied for decades, the neural mechanisms underlying this perceptual grouping of sound remain a mystery. With the aim of identifying a plausible minimal neural circuit that captures this phenomenon, we propose a firing rate model with two periodically forced neural populations coupled by fast direct excitation and slow delayed inhibition. By analyzing the model in a non-smooth, slow-fast regime we analytically prove the existence of a rich repertoire of dynamical states and of their parameter dependent transitions. We impose plausible parameter restrictions and link all states with perceptual interpretations. Regions of stimulus parameters occupied by states linked with each percept match those found in behavioural experiments. Our model suggests that slow inhibition masks the perception of subsequent tones during segregation (forward masking), whereas fast excitation enables integration for large pitch differences between the two tones.
Affiliation(s)
- Andrea Ferrario
  - Department of Mathematics, College of Engineering, Mathematics & Physical Sciences, University of Exeter, Exeter, UK.
- James Rankin
  - Department of Mathematics, College of Engineering, Mathematics & Physical Sciences, University of Exeter, Exeter, UK
13
A Comparison of Human against Machine-Classification of Spatial Audio Scenes in Binaural Recordings of Music. Applied Sciences-Basel 2020. [DOI: 10.3390/app10175956] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/17/2022]
Abstract
The purpose of this paper is to compare the performance of human listeners against the selected machine learning algorithms in the task of the classification of spatial audio scenes in binaural recordings of music under practical conditions. The three scenes were subject to classification: (1) music ensemble (a group of musical sources) located in the front, (2) music ensemble located at the back, and (3) music ensemble distributed around a listener. In the listening test, undertaken remotely over the Internet, human listeners reached the classification accuracy of 42.5%. For the listeners who passed the post-screening test, the accuracy was greater, approaching 60%. The above classification task was also undertaken automatically using four machine learning algorithms: convolutional neural network, support vector machines, extreme gradient boosting framework, and logistic regression. The machine learning algorithms substantially outperformed human listeners, with the classification accuracy reaching 84%, when tested under the binaural-room-impulse-response (BRIR) matched conditions. However, when the algorithms were tested under the BRIR mismatched scenario, the accuracy obtained by the algorithms was comparable to that exhibited by the listeners who passed the post-screening test, implying that the machine learning algorithms' capability to perform in unknown electro-acoustic conditions needs to be further improved.
14
Nguyen QA, Rinzel J, Curtu R. Buildup and bistability in auditory streaming as an evidence accumulation process with saturation. PLoS Comput Biol 2020; 16:e1008152. [PMID: 32853256 PMCID: PMC7480857 DOI: 10.1371/journal.pcbi.1008152] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/03/2020] [Revised: 09/09/2020] [Accepted: 07/15/2020] [Indexed: 12/23/2022] Open
Abstract
A repeating triplet-sequence ABA- of non-overlapping brief tones, A and B, is a valued paradigm for studying auditory stream formation and the cocktail party problem. The stimulus is "heard" either as a galloping pattern (integration) or as two interleaved streams (segregation); the initial percept is typically integration then followed by spontaneous alternations between segregation and integration, each being dominant for a few seconds. The probability of segregation grows over seconds, from near-zero to a steady value, defining the buildup function, BUF. Its stationary level increases with the difference in tone frequencies, DF, and the BUF rises faster. Percept durations have DF-dependent means and are gamma-like distributed. Behavioral and computational studies usually characterize triplet streaming either during alternations or during buildup. Here, our experimental design and modeling encompass both. We propose a pseudo-neuromechanistic model that incorporates spiking activity in primary auditory cortex, A1, as input and resolves perception along two network-layers downstream of A1. Our model is straightforward and intuitive. It describes the noisy accumulation of evidence against the current percept which generates switches when reaching a threshold. Accumulation can saturate either above or below threshold; if below, the switching dynamics resemble noise-induced transitions from an attractor state. Our model accounts quantitatively for three key features of data: the BUFs, mean durations, and normalized dominance duration distributions, at various DF values. It describes perceptual alternations without competition per se, and underscores that treating triplets in the sequence independently and averaging across trials, as implemented in earlier widely cited studies, is inadequate.
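The core idea of noisy evidence accumulating against the current percept, saturating, and triggering a switch at a threshold can be sketched with a toy leaky accumulator. This is not the authors' two-layer model driven by A1 spiking activity; the update rule, the parameter values, and the reset-to-zero after each switch are simplifying assumptions.

```python
import random


def simulate_percepts(n_steps, drift=0.05, noise_sd=0.1, threshold=1.0,
                      saturation=1.2, seed=0):
    """Simulate percept switches from noisy, saturating evidence accumulation.

    Evidence against the current percept relaxes toward `saturation`.
    With saturation above threshold, switches are driven mainly by the
    accumulation itself; with saturation below threshold, only noise can
    push the evidence across, giving noise-induced switching.
    """
    rng = random.Random(seed)
    percept = "integrated"  # the abstract notes integration is typically heard first
    evidence = 0.0
    history = []
    for _ in range(n_steps):
        evidence += drift * (saturation - evidence) + rng.gauss(0.0, noise_sd)
        evidence = max(evidence, 0.0)  # keep evidence non-negative
        if evidence >= threshold:
            percept = "segregated" if percept == "integrated" else "integrated"
            evidence = 0.0  # start accumulating against the new percept
        history.append(percept)
    return history


history = simulate_percepts(2000)
n_switches = sum(1 for a, b in zip(history, history[1:]) if a != b)
```

Setting `saturation` below `threshold` (for example 0.9) moves the sketch into the noise-driven regime the abstract compares to transitions from an attractor state, where switches become rarer and more irregular.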
Collapse
Affiliation(s)
- Quynh-Anh Nguyen
- Department of Mathematics, The University of Iowa, Iowa City, Iowa, United States of America
| | - John Rinzel
- Center for Neural Science, New York University, New York, New York, United States of America
- Courant Institute of Mathematical Sciences, New York University, New York, New York, United States of America
| | - Rodica Curtu
- Department of Mathematics, The University of Iowa, Iowa City, Iowa, United States of America
- Iowa Neuroscience Institute, Human Brain Research Laboratory, Iowa City, Iowa, United States of America
| |
|
15
|
Grossberg S. Developmental Designs and Adult Functions of Cortical Maps in Multiple Modalities: Perception, Attention, Navigation, Numbers, Streaming, Speech, and Cognition. Front Neuroinform 2020; 14:4. [PMID: 32116628 PMCID: PMC7016218 DOI: 10.3389/fninf.2020.00004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/15/2019] [Accepted: 01/16/2020] [Indexed: 11/13/2022] Open
Abstract
This article unifies neural modeling results that illustrate several basic design principles and mechanisms that are used by advanced brains to develop cortical maps with multiple psychological functions. One principle concerns how brains use a strip map that simultaneously enables one feature to be represented throughout its extent, as well as an ordered array of another feature at different positions of the strip. Strip maps include circuits to represent ocular dominance and orientation columns, place-value numbers, auditory streams, speaker-normalized speech, and cognitive working memories that can code repeated items. A second principle concerns how feature detectors for multiple functions develop in topographic maps, including maps for optic flow navigation, reinforcement learning, motion perception, and category learning at multiple organizational levels. A third principle concerns how brains exploit a spatial gradient of cells that respond at an ordered sequence of different rates. Such a rate gradient is found along the dorsoventral axis of the entorhinal cortex, whose lateral branch controls the development of time cells, and whose medial branch controls the development of grid cells. Populations of time cells can be used to learn how to adaptively time behaviors for which a time interval of hundreds of milliseconds, or several seconds, must be bridged, as occurs during trace conditioning. Populations of grid cells can be used to learn hippocampal place cells that represent the large spaces in which animals navigate. A fourth principle concerns how and why all neocortical circuits are organized into layers, and how functionally distinct columns develop in these circuits to enable map development. A final principle concerns the role of Adaptive Resonance Theory top-down matching and attentional circuits in the dynamic stabilization of early development and adult learning. Cortical maps are modeled in visual, auditory, temporal, parietal, prefrontal, entorhinal, and hippocampal cortices.
Affiliation(s)
- Stephen Grossberg
- Center for Adaptive Systems, Graduate Program in Cognitive and Neural Systems, Departments of Mathematics & Statistics, Psychological & Brain Sciences, and Biomedical Engineering, Boston University, Boston, MA, United States
| |
|
16
|
Auditory streaming and bistability paradigm extended to a dynamic environment. Hear Res 2019; 383:107807. [PMID: 31622836 DOI: 10.1016/j.heares.2019.107807] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/25/2019] [Revised: 09/19/2019] [Accepted: 10/01/2019] [Indexed: 11/23/2022]
Abstract
We explore stream segregation with temporally modulated acoustic features using behavioral experiments and modelling. The auditory streaming paradigm, in which alternating high-frequency tones A and low-frequency tones B appear in a repeating ABA- pattern, has been shown to be perceptually bistable for extended presentations (on the order of minutes). For a fixed, repeating stimulus, perception spontaneously changes (switches) at random times, every 2-15 s, between an integrated interpretation with a galloping rhythm and segregated streams. Streaming in a natural auditory environment requires segregation of auditory objects with features that evolve over time. With the relatively idealized ABA-triplet paradigm, we explore perceptual switching in a non-static environment by considering slowly and periodically varying stimulus features. Our previously published model captures the dynamics of auditory bistability and predicts here how perceptual switches are entrained, tightly locked to the rising and falling phase of modulation. In psychoacoustic experiments we find that entrainment depends on both the period of modulation and the intrinsic switch characteristics of individual listeners. The extended auditory streaming paradigm with slowly modulated stimulus features presented here will be of significant interest for future imaging and neurophysiology experiments by reducing the need for subjective perceptual reports of ongoing perception.
|
17
|
Rankin J, Rinzel J. Computational models of auditory perception from feature extraction to stream segregation and behavior. Curr Opin Neurobiol 2019; 58:46-53. [PMID: 31326723 DOI: 10.1016/j.conb.2019.06.009] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 04/12/2019] [Accepted: 06/22/2019] [Indexed: 10/26/2022]
Abstract
Audition is by nature dynamic, from brainstem processing on sub-millisecond time scales, to segregating and tracking sound sources with changing features, to the pleasure of listening to music and the satisfaction of getting the beat. We review recent advances from computational models of sound localization, of auditory stream segregation and of beat perception/generation. A wealth of behavioral, electrophysiological and imaging studies shed light on these processes, typically with synthesized sounds having regular temporal structure. Computational models integrate knowledge from different experimental fields and at different levels of description. We advocate a neuromechanistic modeling approach that incorporates knowledge of the auditory system from various fields, that utilizes plausible neural mechanisms, and that bridges our understanding across disciplines.
Affiliation(s)
- James Rankin
- College of Engineering, Mathematics and Physical Sciences, University of Exeter, Harrison Building, North Park Rd, Exeter EX4 4QF, UK.
| | - John Rinzel
- Center for Neural Science, New York University, 4 Washington Place, 10003 New York, NY, United States; Courant Institute of Mathematical Sciences, New York University, 251 Mercer St, 10012 New York, NY, United States
| |
|
18
|
Paredes-Gallardo A, Dau T, Marozeau J. Auditory Stream Segregation Can Be Modeled by Neural Competition in Cochlear Implant Listeners. Front Comput Neurosci 2019; 13:42. [PMID: 31333438 PMCID: PMC6616076 DOI: 10.3389/fncom.2019.00042] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 02/26/2019] [Accepted: 06/17/2019] [Indexed: 11/13/2022] Open
Abstract
Auditory stream segregation is a perceptual process by which the human auditory system groups sounds from different sources into perceptually meaningful elements (e.g., a voice or a melody). The perceptual segregation of sounds is important, for example, for the understanding of speech in noisy scenarios, a particularly challenging task for listeners with a cochlear implant (CI). It has been suggested that some aspects of stream segregation may be explained by relatively basic neural mechanisms at a cortical level. During the past decades, a variety of models have been proposed to account for the data from stream segregation experiments in normal-hearing (NH) listeners. However, little attention has been given to corresponding findings in CI listeners. The present study investigated whether a neural model of sequential stream segregation, proposed to describe the behavioral effects observed in NH listeners, can account for behavioral data from CI listeners. The model operates on the stimulus features at the cortical level and includes a competition stage between the neuronal units encoding the different percepts. The competition arises from a combination of mutual inhibition, adaptation, and additive noise. The model was found to capture the main trends in the behavioral data from CI listeners, such as the larger probability of a segregated percept with increasing feature difference between the sounds, as well as the build-up effect. Importantly, this was achieved without any modification to the model's competition stage, suggesting that stream segregation could be mediated by a similar mechanism in both groups of listeners.
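The three ingredients named in this abstract (mutual inhibition, adaptation, and additive noise) can be sketched as a generic two-unit rate model. This is an illustrative textbook-style formulation, not the model evaluated in the study; the sigmoid, the parameter names (`beta`, `g`, `tau_a`, `sigma`) and their values are all assumptions, and the noise is added as an independent Gaussian sample at each time step.

```python
import math
import random


def sigmoid(x, gain=10.0, theta=0.2):
    """Firing-rate nonlinearity mapping net input to an activity in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-gain * (x - theta)))


def simulate_competition(inp=(0.6, 0.6), beta=1.0, g=0.5, tau=0.01,
                         tau_a=2.0, sigma=0.05, dt=0.001, t_end=60.0, seed=1):
    """Two units, one per percept; returns the index of the more active unit
    at each time step (hypothetical parameters throughout)."""
    rng = random.Random(seed)
    u = [0.0, 0.0]  # unit activities
    a = [0.0, 0.0]  # slow adaptation variables
    dominant = []
    for _ in range(int(round(t_end / dt))):
        new_u = []
        for i in (0, 1):
            j = 1 - i
            noise = sigma * rng.gauss(0.0, 1.0)               # additive noise
            net = inp[i] - beta * u[j] - g * a[i] + noise     # inhibition + adaptation
            new_u.append(u[i] + dt / tau * (-u[i] + sigmoid(net)))
        for i in (0, 1):
            a[i] += dt / tau_a * (-a[i] + u[i])               # adaptation tracks activity
        u = new_u
        dominant.append(0 if u[0] >= u[1] else 1)
    return dominant


def dominance_durations(dominant, dt):
    """Run-length encode the dominance trace into durations in seconds."""
    durations = []
    run = 0
    for k, d in enumerate(dominant):
        run += 1
        if k == len(dominant) - 1 or dominant[k + 1] != d:
            durations.append(run * dt)
            run = 0
    return durations


if __name__ == "__main__":
    trace = simulate_competition(t_end=30.0)
    print(dominance_durations(trace, dt=0.001)[:10])
```

In a sketch like this, a larger feature difference would be represented by changing the inputs `inp` to the two units, which is one way such models relate stimulus features to the probability of each percept.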
Affiliation(s)
- Andreu Paredes-Gallardo
- Hearing Systems Section, Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
| | - Torsten Dau
- Hearing Systems Section, Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
| | - Jeremy Marozeau
- Hearing Systems Section, Department of Health Technology, Technical University of Denmark, Lyngby, Denmark
| |
|
19
|
Automatic Spatial Audio Scene Classification in Binaural Recordings of Music. APPLIED SCIENCES-BASEL 2019. [DOI: 10.3390/app9091724] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/16/2022]
Abstract
The aim of the study was to develop a method for automatic classification of three spatial audio scenes, differing in horizontal distribution of foreground and background audio content around a listener in binaurally rendered recordings of music. For the purpose of the study, audio recordings were synthesized using thirteen sets of binaural-room-impulse-responses (BRIRs), representing room acoustics of both semi-anechoic and reverberant venues. Head movements were not considered in the study. The proposed method was assumption-free with regard to the number and characteristics of the audio sources. A least absolute shrinkage and selection operator was employed as a classifier. According to the results, it is possible to automatically identify the spatial scenes using a combination of binaural and spectro-temporal features. The method exhibits a satisfactory classification accuracy when it is trained and then tested on different stimuli synthesized using the same BRIRs (accuracy ranging from 74% to 98%), even in highly reverberant conditions. However, the generalizability of the method needs to be further improved. This study demonstrates that in addition to the binaural cues, the Mel-frequency cepstral coefficients constitute an important carrier of spatial information, imperative for the classification of spatial audio scenes.
|
20
|
Kondo HM, Pressnitzer D, Shimada Y, Kochiyama T, Kashino M. Inhibition-excitation balance in the parietal cortex modulates volitional control for auditory and visual multistability. Sci Rep 2018; 8:14548. [PMID: 30267021 PMCID: PMC6162284 DOI: 10.1038/s41598-018-32892-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 03/08/2018] [Accepted: 09/18/2018] [Indexed: 11/25/2022] Open
Abstract
Perceptual organisation must select one interpretation from several alternatives to guide behaviour. Computational models suggest that this could be achieved through an interplay between inhibition and excitation across competing neural populations coding for each interpretation. Here, to test such models, we used magnetic resonance spectroscopy to measure non-invasively the concentrations of inhibitory γ-aminobutyric acid (GABA) and excitatory glutamate-glutamine (Glx) in several brain regions. Human participants first performed auditory and visual multistability tasks that produced spontaneous switching between percepts. Then, we observed that longer percept durations during behaviour were associated with higher GABA/Glx ratios in the sensory area coding for each modality. When participants were asked to voluntarily modulate their perception, a common factor across modalities emerged: the GABA/Glx ratio in the posterior parietal cortex tended to be positively correlated with the amount of effective volitional control. Our results provide direct evidence implying that the balance between neural inhibition and excitation within sensory regions resolves perceptual competition. This powerful computational principle appears to be leveraged by both audition and vision, implemented independently across modalities, but modulated by an integrated control process.
Affiliation(s)
- Hirohito M Kondo
- School of Psychology, Chukyo University, Nagoya, Aichi, Japan.
- Human Information Science Laboratory, NTT Communication Science Laboratories, NTT Corporation, Atsugi, Kanagawa, Japan.
| | - Daniel Pressnitzer
- Laboratoire des Systèmes Perceptifs, CNRS UMR 8248, Paris, France
- Département d'Études Cognitives, École Normale Supérieure, Paris, France
| | - Yasuhiro Shimada
- Brain Activity Imaging Center, ATR-Promotions, Seika-cho, Kyoto, Japan
| | - Takanori Kochiyama
- Brain Activity Imaging Center, ATR-Promotions, Seika-cho, Kyoto, Japan
- Department of Cognitive Neuroscience, Advanced Telecommunications Research Institute International, Seika-cho, Kyoto, Japan
| | - Makio Kashino
- Sports Brain Science Project, NTT Communication Science Laboratories, NTT Corporation, Atsugi, Kanagawa, Japan
- School of Engineering, Tokyo Institute of Technology, Yokohama, Kanagawa, Japan
| |
|