1. Di Tullio RW, Wei L, Balasubramanian V. Slow and steady: auditory features for discriminating animal vocalizations. bioRxiv 2024:2024.06.20.599962. PMID: 39005308; PMCID: PMC11244870; DOI: 10.1101/2024.06.20.599962.
Abstract
We propose that listeners can use temporal regularities - spectro-temporal correlations that change smoothly over time - to discriminate animal vocalizations within and between species. To test this idea, we used Slow Feature Analysis (SFA) to find the most temporally regular components of vocalizations from birds (blue jay, house finch, American yellow warbler, and great blue heron), humans (English speakers), and rhesus macaques. We projected vocalizations into the learned feature space and tested intra-class (same speaker/species) and inter-class (different speakers/species) auditory discrimination by a trained classifier. We found that: 1) Vocalization discrimination was excellent (> 95%) in all cases; 2) Performance depended primarily on the ~10 most temporally regular features; 3) Most vocalizations are dominated by ~10 features with high temporal regularity; and 4) These regular features are highly correlated with the most predictable components of animal sounds.
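Since Slow Feature Analysis is central to the paper's method, here is a minimal linear sketch of the idea on a toy signal: whiten the input, then keep the projections whose time derivatives have the smallest variance. The function name and toy data below are illustrative, not taken from the authors' code.

```python
import numpy as np

def slow_feature_analysis(x, n_features):
    """Linear SFA: find projections of the centered, whitened signal x
    whose time derivatives vary most slowly.

    x: array of shape (T, D) -- T time steps, D input dimensions.
    Returns an array of shape (T, n_features), slowest feature first.
    """
    # Center and whiten so every direction has unit variance.
    x = x - x.mean(axis=0)
    cov = x.T @ x / len(x)
    evals, evecs = np.linalg.eigh(cov)
    z = x @ (evecs / np.sqrt(evals))
    # Eigenvectors of the derivative covariance with the SMALLEST
    # eigenvalues give the most temporally regular (slowest) features.
    dz = np.diff(z, axis=0)
    dcov = dz.T @ dz / len(dz)
    _, devecs = np.linalg.eigh(dcov)  # eigenvalues ascend, slowest first
    return z @ devecs[:, :n_features]

# Toy input: a slow sinusoid mixed into fast noise across 5 channels.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 2000)
slow = np.sin(2 * np.pi * 0.2 * t)
x = rng.normal(size=(2000, 5))
x[:, 0] += 5 * slow
y = slow_feature_analysis(x, n_features=2)
```

On this toy input the slowest extracted feature tracks the underlying sinusoid despite the broadband noise, which is the sense in which SFA isolates "temporally regular" components.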
Affiliation(s)
- Ronald W Di Tullio
- David Rittenhouse Laboratory, Department of Physics and Astronomy, University of Pennsylvania, USA
- Computational Neuroscience Initiative, University of Pennsylvania, USA
- Linran Wei
- David Rittenhouse Laboratory, Department of Physics and Astronomy, University of Pennsylvania, USA
- Vijay Balasubramanian
- David Rittenhouse Laboratory, Department of Physics and Astronomy, University of Pennsylvania, USA
- Computational Neuroscience Initiative, University of Pennsylvania, USA
- Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA
2. Selection levels on vocal individuality: strategic use or byproduct. Curr Opin Behav Sci 2022. DOI: 10.1016/j.cobeha.2022.101140.
3. Linhart P, Mahamoud-Issa M, Stowell D, Blumstein DT. The potential for acoustic individual identification in mammals. Mamm Biol 2022. DOI: 10.1007/s42991-021-00222-2.
4. Sainburg T, Gentner TQ. Toward a Computational Neuroethology of Vocal Communication: From Bioacoustics to Neurophysiology, Emerging Tools and Future Directions. Front Behav Neurosci 2021; 15:811737. PMID: 34987365; PMCID: PMC8721140; DOI: 10.3389/fnbeh.2021.811737.
Abstract
Recently developed methods in computational neuroethology have enabled increasingly detailed and comprehensive quantification of animal movements and behavioral kinematics. Vocal communication behavior is well poised for application of similar large-scale quantification methods in the service of physiological and ethological studies. This review describes emerging techniques that can be applied to acoustic and vocal communication signals with the goal of enabling study beyond a small number of model species. We review a range of modern computational methods for bioacoustics, signal processing, and brain-behavior mapping. Along with a discussion of recent advances and techniques, we include challenges and broader goals in establishing a framework for the computational neuroethology of vocal communication.
Affiliation(s)
- Tim Sainburg
- Department of Psychology, University of California, San Diego, La Jolla, CA, United States
- Center for Academic Research & Training in Anthropogeny, University of California, San Diego, La Jolla, CA, United States
- Timothy Q. Gentner
- Department of Psychology, University of California, San Diego, La Jolla, CA, United States
- Neurosciences Graduate Program, University of California, San Diego, La Jolla, CA, United States
- Neurobiology Section, Division of Biological Sciences, University of California, San Diego, La Jolla, CA, United States
- Kavli Institute for Brain and Mind, University of California, San Diego, La Jolla, CA, United States
5. Bermant PC. BioCPPNet: automatic bioacoustic source separation with deep neural networks. Sci Rep 2021; 11:23502. PMID: 34873197; PMCID: PMC8648737; DOI: 10.1038/s41598-021-02790-2.
Abstract
We introduce the Bioacoustic Cocktail Party Problem Network (BioCPPNet), a lightweight, modular, and robust U-Net-based machine learning architecture optimized for bioacoustic source separation across diverse biological taxa. Employing learnable or handcrafted encoders, BioCPPNet operates directly on the raw acoustic mixture waveform containing overlapping vocalizations and separates the input waveform into estimates corresponding to the sources in the mixture. Predictions are compared to the reference ground truth waveforms by searching over the space of (output, target) source order permutations, and we train using an objective function motivated by perceptual audio quality. We apply BioCPPNet to several species with unique vocal behavior, including macaques, bottlenose dolphins, and Egyptian fruit bats, and we evaluate reconstruction quality of separated waveforms using the scale-invariant signal-to-distortion ratio (SI-SDR) and downstream identity classification accuracy. We consider mixtures with two or three concurrent conspecific vocalizers, and we examine separation performance in open and closed speaker scenarios. To our knowledge, this paper redefines the state-of-the-art in end-to-end single-channel bioacoustic source separation in a permutation-invariant regime across a heterogeneous set of non-human species. This study serves as a major step toward the deployment of bioacoustic source separation systems for processing substantial volumes of previously unusable data containing overlapping bioacoustic signals.
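The evaluation described above rests on two standard ingredients that can be sketched compactly: the SI-SDR metric and a brute-force search over (output, target) source pairings. The helper names and toy waveforms below are illustrative, not the authors' implementation.

```python
import itertools
import numpy as np

def si_sdr(est, ref):
    """Scale-invariant signal-to-distortion ratio in dB."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to factor out scale.
    target = (est @ ref) / (ref @ ref) * ref
    noise = est - target
    return 10 * np.log10((target @ target) / (noise @ noise))

def best_permutation_si_sdr(estimates, references):
    """Score every (output, target) pairing, as in permutation-invariant
    evaluation, and return the best mean SI-SDR with its pairing."""
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(len(references))):
        score = np.mean([si_sdr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm

# Two toy "sources"; the separator returned them in swapped order.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sign(np.sin(2 * np.pi * 220 * t))
e1 = s2 + 0.01 * rng.normal(size=t.size)  # noisy estimate of s2
e2 = s1 + 0.01 * rng.normal(size=t.size)  # noisy estimate of s1
score, perm = best_permutation_si_sdr([e1, e2], [s1, s2])
```

The permutation search correctly matches each estimate to its source even though the separator's output order was swapped; for real mixtures with n sources the search is over n! pairings, which is cheap for the two- or three-vocalizer mixtures considered here.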
6. Sainburg T, Thielk M, Gentner TQ. Finding, visualizing, and quantifying latent structure across diverse animal vocal repertoires. PLoS Comput Biol 2020; 16:e1008228. PMID: 33057332; PMCID: PMC7591061; DOI: 10.1371/journal.pcbi.1008228.
Abstract
Animals produce vocalizations that range in complexity from a single repeated call to hundreds of unique vocal elements patterned in sequences unfolding over hours. Characterizing complex vocalizations can require considerable effort and a deep intuition about each species' vocal behavior. Even with a great deal of experience, human characterizations of animal communication can be affected by human perceptual biases. We present a set of computational methods for projecting animal vocalizations into low dimensional latent representational spaces that are directly learned from the spectrograms of vocal signals. We apply these methods to diverse datasets from over 20 species, including humans, bats, songbirds, mice, cetaceans, and nonhuman primates. Latent projections uncover complex features of data in visually intuitive and quantifiable ways, enabling high-powered comparative analyses of vocal acoustics. We introduce methods for analyzing vocalizations as both discrete sequences and as continuous latent variables. Each method can be used to disentangle complex spectro-temporal structure and observe long-timescale organization in communication.
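As a simplified stand-in for the nonlinear embeddings used in this line of work (e.g. UMAP over spectrograms), the sketch below projects flattened toy "spectrograms" into a two-dimensional latent space with PCA; the data and names are invented for illustration only.

```python
import numpy as np

def latent_projection(spectrograms, n_dims=2):
    """Project flattened spectrograms into a low-dimensional latent
    space. PCA stands in for the nonlinear embeddings used in practice."""
    X = spectrograms.reshape(len(spectrograms), -1).astype(float)
    X = X - X.mean(axis=0)
    # SVD gives the principal axes; keep the top n_dims of them.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_dims].T

# Toy repertoire: two "call types" with energy in different bands.
rng = np.random.default_rng(0)
a = rng.normal(size=(30, 16, 16))
a[:, :8, :] += 3   # type A: energy in the low-frequency rows
b = rng.normal(size=(30, 16, 16))
b[:, 8:, :] += 3   # type B: energy in the high-frequency rows
z = latent_projection(np.concatenate([a, b]), n_dims=2)
```

Even this linear projection separates the two toy call types into distinct clusters along the first latent dimension, which is the kind of structure the latent-space visualizations in the paper make quantifiable.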
Affiliation(s)
- Tim Sainburg
- Department of Psychology, University of California, San Diego, La Jolla, CA, USA
- Center for Academic Research & Training in Anthropogeny, University of California, San Diego, La Jolla, CA, USA
- Marvin Thielk
- Neurosciences Graduate Program, University of California, San Diego, La Jolla, CA, USA
- Timothy Q. Gentner
- Department of Psychology, University of California, San Diego, La Jolla, CA, USA
- Neurosciences Graduate Program, University of California, San Diego, La Jolla, CA, USA
- Neurobiology Section, Division of Biological Sciences, University of California, San Diego, La Jolla, CA, USA
- Kavli Institute for Brain and Mind, University of California, San Diego, La Jolla, CA, USA
7. Liu ST, Montes-Lourido P, Wang X, Sadagopan S. Optimal features for auditory categorization. Nat Commun 2019; 10:1302. PMID: 30899018; PMCID: PMC6428858; DOI: 10.1038/s41467-019-09115-y.
Abstract
Humans and vocal animals use vocalizations to communicate with members of their species. A necessary function of auditory perception is to generalize across the high variability inherent in vocalization production and classify them into behaviorally distinct categories ('words' or 'call types'). Here, we demonstrate that detecting mid-level features in calls achieves production-invariant classification. Starting from randomly chosen marmoset call features, we use a greedy search algorithm to determine the most informative and least redundant features necessary for call classification. High classification performance is achieved using only 10-20 features per call type. Predictions of tuning properties of putative feature-selective neurons accurately match some observed auditory cortical responses. This feature-based approach also succeeds for call categorization in other species, and for other complex classification tasks such as caller identification. Our results suggest that high-level neural representations of sounds are based on task-dependent features optimized for specific computational goals.
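The general shape of the greedy search described above can be sketched as forward feature selection: repeatedly add whichever candidate feature most improves a classifier's accuracy. The scoring classifier here is a simple nearest-centroid rule on synthetic data, not the paper's marmoset-call features; all names are illustrative.

```python
import numpy as np

def greedy_select(X, y, n_keep):
    """Forward greedy feature selection: repeatedly add the feature
    that most improves nearest-centroid classification accuracy."""
    def accuracy(cols):
        Xs = X[:, cols]
        classes = np.unique(y)
        centroids = np.stack([Xs[y == c].mean(axis=0) for c in classes])
        # Squared distance from every sample to every class centroid.
        d = ((Xs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return (classes[d.argmin(axis=1)] == y).mean()

    chosen, remaining = [], list(range(X.shape[1]))
    for _ in range(n_keep):
        best = max(remaining, key=lambda j: accuracy(chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data: only features 0 and 3 carry class information.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 8))
X[:, 0] += 3 * y
X[:, 3] -= 3 * y
chosen = greedy_select(X, y, n_keep=2)
```

The greedy criterion finds the two informative features and ignores the six noise features; the paper's version additionally penalizes redundancy between selected features, which this sketch omits.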
Affiliation(s)
- Shi Tong Liu
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, 15213, PA, USA
- Pilar Montes-Lourido
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, 15213, PA, USA
- Xiaoqin Wang
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, 21205, MD, USA
- Srivatsun Sadagopan
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, 15213, PA, USA
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, 15213, PA, USA
- Department of Otolaryngology, University of Pittsburgh, Pittsburgh, 15213, PA, USA
8. A "voice patch" system in the primate brain for processing vocal information? Hear Res 2018; 366:65-74. PMID: 29776691; DOI: 10.1016/j.heares.2018.04.010.
Abstract
We review behavioural and neural evidence for the processing of information contained in conspecific vocalizations (CVs) in three primate species: humans, macaques and marmosets. We focus on abilities that are present and ecologically relevant in all three species: the detection and sensitivity to CVs; and the processing of identity cues in CVs. Current evidence, although fragmentary, supports the notion of a "voice patch system" in the primate brain analogous to the face patch system of visual cortex: a series of discrete, interconnected cortical areas supporting increasingly abstract representations of the vocal input. A central question concerns the degree to which the voice patch system is conserved in evolution. We outline challenges that arise and suggest potential avenues for comparing the organization of the voice patch system across primate brains.
9. Santoro R, Moerel M, De Martino F, Valente G, Ugurbil K, Yacoub E, Formisano E. Reconstructing the spectrotemporal modulations of real-life sounds from fMRI response patterns. Proc Natl Acad Sci U S A 2017; 114:4799-4804. PMID: 28420788; PMCID: PMC5422795; DOI: 10.1073/pnas.1617622114.
Abstract
Ethological views of brain functioning suggest that sound representations and computations in the auditory neural system are optimized finely to process and discriminate behaviorally relevant acoustic features and sounds (e.g., spectrotemporal modulations in the songs of zebra finches). Here, we show that modeling of neural sound representations in terms of frequency-specific spectrotemporal modulations enables accurate and specific reconstruction of real-life sounds from high-resolution functional magnetic resonance imaging (fMRI) response patterns in the human auditory cortex. Region-based analyses indicated that response patterns in separate portions of the auditory cortex are informative of distinctive sets of spectrotemporal modulations. Most relevantly, results revealed that in early auditory regions, and progressively more in surrounding regions, temporal modulations in a range relevant for speech analysis (∼2-4 Hz) were reconstructed more faithfully than other temporal modulations. In early auditory regions, this effect was frequency-dependent and only present for lower frequencies (<∼2 kHz), whereas for higher frequencies, reconstruction accuracy was higher for faster temporal modulations. Further analyses suggested that auditory cortical processing optimized for the fine-grained discrimination of speech and vocal sounds underlies this enhanced reconstruction accuracy. In sum, the present study introduces an approach to embed models of neural sound representations in the analysis of fMRI response patterns. Furthermore, it reveals that, in the human brain, even general purpose and fundamental neural processing mechanisms are shaped by the physical features of real-world stimuli that are most relevant for behavior (i.e., speech, voice).
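Spectrotemporal modulations of a sound can be read off the 2-D Fourier transform of its spectrogram. The sketch below recovers the temporal modulation rate of a synthetic drifting ripple; it illustrates the representation only, not the paper's fMRI encoding model, and the spectrogram parameters are invented.

```python
import numpy as np

def modulation_spectrum(spec, dt, df):
    """2-D Fourier amplitude of a (freq x time) spectrogram.
    Returns the temporal modulation axis (Hz), the spectral modulation
    axis (cycles/Hz), and the amplitude spectrum."""
    m = np.abs(np.fft.fftshift(np.fft.fft2(spec - spec.mean())))
    wt = np.fft.fftshift(np.fft.fftfreq(spec.shape[1], d=dt))  # Hz
    wf = np.fft.fftshift(np.fft.fftfreq(spec.shape[0], d=df))  # cyc/Hz
    return wt, wf, m

# Toy spectrogram: a ripple drifting at a 3 Hz temporal modulation.
dt, df = 0.01, 100.0                  # 10 ms frames, 100 Hz bins
t = np.arange(0, 2, dt)
f = np.arange(0, 4000, df)
spec = np.cos(2 * np.pi * (3 * t[None, :] + f[:, None] / 2000))
wt, wf, m = modulation_spectrum(spec, dt, df)
peak_f, peak_t = np.unravel_index(m.argmax(), m.shape)
```

The amplitude peak sits at the ripple's temporal modulation rate (3 Hz) and its spectral modulation density (1/2000 cycles per Hz), which is exactly the feature space the study's encoding models operate in.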
Affiliation(s)
- Roberta Santoro
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6200 MD Maastricht, The Netherlands
- Maastricht Brain Imaging Center, 6200 MD Maastricht, The Netherlands
- Brain and Language Laboratory, Department of Clinical Neuroscience, University Medical School, University of Geneva, CH-1211 Geneva, Switzerland
- Michelle Moerel
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6200 MD Maastricht, The Netherlands
- Maastricht Brain Imaging Center, 6200 MD Maastricht, The Netherlands
- Center for Magnetic Resonance Research, Department of Radiology, University of Minnesota, Minneapolis, MN 55455
- Maastricht Centre for Systems Biology, Maastricht University, 6200 MD Maastricht, The Netherlands
- Federico De Martino
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6200 MD Maastricht, The Netherlands
- Maastricht Brain Imaging Center, 6200 MD Maastricht, The Netherlands
- Giancarlo Valente
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6200 MD Maastricht, The Netherlands
- Maastricht Brain Imaging Center, 6200 MD Maastricht, The Netherlands
- Kamil Ugurbil
- Center for Magnetic Resonance Research, Department of Radiology, University of Minnesota, Minneapolis, MN 55455
- Essa Yacoub
- Center for Magnetic Resonance Research, Department of Radiology, University of Minnesota, Minneapolis, MN 55455
- Elia Formisano
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6200 MD Maastricht, The Netherlands
- Maastricht Brain Imaging Center, 6200 MD Maastricht, The Netherlands
- Maastricht Centre for Systems Biology, Maastricht University, 6200 MD Maastricht, The Netherlands