1. Ekström AG. Correcting the record: Phonetic potential of primate vocal tracts and the legacy of Philip Lieberman (1934-2022). Am J Primatol 2024; 86:e23637. PMID: 38741274. DOI: 10.1002/ajp.23637.
Abstract
The phonetic potential of nonhuman primate vocal tracts has been the subject of considerable contention in recent literature. Here, the work of Philip Lieberman (1934-2022) is considered at length, and two research papers (both purported challenges to Lieberman's theoretical work) and a review of Lieberman's scientific legacy are critically examined. I argue that various aspects of Lieberman's research have been consistently misinterpreted in the literature. A paper by Fitch et al. overestimates the would-be "speech-ready" capacities of a rhesus macaque, and the data presented nonetheless support Lieberman's principal position: that nonhuman primates cannot articulate the full extent of human speech sounds. The suggestion that no vocal anatomical evolution was necessary for the evolution of human speech (as spoken by all normally developing humans) is not supported by phonetic or anatomical data. The second challenge, by Boë et al., attributes vowel-like qualities of baboon calls to articulatory capacities based on audio data; I argue that such "protovocalic" properties likely result from disparate articulatory maneuvers compared to human speakers. A review of Lieberman's scientific legacy by Boë et al. ascribes a view of speech evolution (which the authors term "laryngeal descent theory") to Lieberman, which contradicts his writings. The present article documents a pattern of incorrect interpretations of Lieberman's theoretical work in recent literature. Finally, the apparent trend of vowel-like formant dispersions in great ape vocalization literature is discussed with regard to Lieberman's theoretical work. The review concludes that the "Lieberman account" of primate vocal tract phonetic capacities remains supported by research: the ready articulation of fully human speech reflects species-unique anatomy.
Affiliation(s)
- Axel G Ekström
- Speech, Music & Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
2. Fedorenko E, Piantadosi ST, Gibson EAF. Language is primarily a tool for communication rather than thought. Nature 2024; 630:575-586. PMID: 38898296. DOI: 10.1038/s41586-024-07522-w.
Abstract
Language is a defining characteristic of our species, but the function, or functions, that it serves has been debated for centuries. Here we bring recent evidence from neuroscience and allied disciplines to argue that in modern humans, language is a tool for communication, contrary to a prominent view that we use language for thinking. We begin by introducing the brain network that supports linguistic ability in humans. We then review evidence for a double dissociation between language and thought, and discuss several properties of language that suggest that it is optimized for communication. We conclude that although the emergence of language has unquestionably transformed human culture, language does not appear to be a prerequisite for complex thought, including symbolic thought. Instead, language is a powerful tool for the transmission of cultural knowledge; it plausibly co-evolved with our thinking and reasoning capacities, and only reflects, rather than gives rise to, the signature sophistication of human cognition.
Affiliation(s)
- Evelina Fedorenko
- Massachusetts Institute of Technology, Cambridge, MA, USA.
- Speech and Hearing in Bioscience and Technology Program at Harvard University, Boston, MA, USA.
3. Heeringa AN, Jüchter C, Beutelmann R, Klump GM, Köppl C. Altered neural encoding of vowels in noise does not affect behavioral vowel discrimination in gerbils with age-related hearing loss. Front Neurosci 2023; 17:1238941. PMID: 38033551. PMCID: PMC10682387. DOI: 10.3389/fnins.2023.1238941.
Abstract
Introduction: Understanding speech in a noisy environment, as opposed to speech in quiet, becomes increasingly more difficult with increasing age. Using the quiet-aged gerbil, we studied the effects of aging on speech-in-noise processing. Specifically, behavioral vowel discrimination and the encoding of these vowels by single auditory-nerve fibers were compared, to elucidate some of the underlying mechanisms of age-related speech-in-noise perception deficits.
Methods: Young-adult and quiet-aged Mongolian gerbils, of either sex, were trained to discriminate a deviant naturally-spoken vowel in a sequence of vowel standards against a speech-like background noise. In addition, we recorded responses from single auditory-nerve fibers of young-adult and quiet-aged gerbils while presenting the same speech stimuli.
Results: Behavioral vowel discrimination was not significantly affected by aging. For both young-adult and quiet-aged gerbils, the behavioral discrimination between /eː/ and /iː/ was more difficult to make than /eː/ vs. /aː/ or /iː/ vs. /aː/, as evidenced by longer response times and lower d' values. In young-adults, spike timing-based vowel discrimination agreed with the behavioral vowel discrimination, while in quiet-aged gerbils it did not. Paradoxically, discrimination between vowels based on temporal responses was enhanced in aged gerbils for all vowel comparisons. Representation schemes, based on the spectrum of the inter-spike interval histogram, revealed stronger encoding of both the fundamental and the lower formant frequencies in fibers of quiet-aged gerbils, but no qualitative changes in vowel encoding. Elevated thresholds in combination with a fixed stimulus level, i.e., lower sensation levels of the stimuli for old individuals, can explain the enhanced temporal coding of the vowels in noise.
Discussion: These results suggest that the altered auditory-nerve discrimination metrics in old gerbils may mask age-related deterioration in the central (auditory) system to the extent that behavioral vowel discrimination matches that of the young adults.
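To make the interval-based representation concrete, the following Python sketch illustrates one way to compute the spectrum of an inter-spike-interval histogram from a fiber's spike times; the bin width, the maximum interval, and the use of first-order intervals are illustrative assumptions and do not reproduce the authors' analysis pipeline.

import numpy as np

def isi_histogram_spectrum(spike_times_s, bin_s=1e-4, max_isi_s=0.02):
    # Histogram the inter-spike intervals, then take the magnitude spectrum of the histogram;
    # peaks near the fundamental and formant frequencies would indicate temporal coding.
    isis = np.diff(np.sort(np.asarray(spike_times_s, dtype=float)))
    edges = np.arange(0.0, max_isi_s + bin_s, bin_s)
    hist, _ = np.histogram(isis, bins=edges)
    spectrum = np.abs(np.fft.rfft(hist - hist.mean()))
    freqs = np.fft.rfftfreq(hist.size, d=bin_s)
    return freqs, spectrum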
Affiliation(s)
- Amarins N. Heeringa
- Research Centre Neurosensory Science and Cluster of Excellence “Hearing4all”, Department of Neuroscience, School of Medicine and Health Science, Carl von Ossietzky University Oldenburg, Oldenburg, Germany
4. Heeringa AN, Köppl C. Auditory Nerve Fiber Discrimination and Representation of Naturally-Spoken Vowels in Noise. eNeuro 2022; 9:ENEURO.0474-21.2021. PMID: 35086866. PMCID: PMC8856707. DOI: 10.1523/eneuro.0474-21.2021.
Abstract
To understand how vowels are encoded by auditory nerve (AN) fibers, a number of representation schemes have been suggested that extract the vowel's formant frequencies from AN-fiber spiking patterns. The current study aims to apply and compare these schemes for AN-fiber responses to naturally-spoken vowels in a speech-shaped background noise. Responses to three vowels were evaluated; based on behavioral experiments in the same species, two of these were perceptually difficult to discriminate from each other (/e/ vs /i/), and one was perceptually easy to discriminate from the other two (/a:/). Single-unit AN fibers were recorded from ketamine/xylazine-anesthetized Mongolian gerbils of either sex (n = 8). First, single-unit discrimination between the three vowels was studied. Compared with the perceptually easy discriminations, the average spike timing-based discrimination values were significantly lower for the perceptually difficult vowel discrimination. This was not true for an average rate-based discrimination metric, the rate d-prime (d'). Consistently, spike timing-based representation schemes, plotting the temporal responses of all recorded units as a function of their best frequency (BF), i.e., dominant component schemes, average localized interval rate, and fluctuation profiles, revealed representation of the vowel's formant frequencies, whereas no such representation was apparent in the rate-based excitation pattern. Making use of perceptual discrimination data, this study reveals that discrimination difficulties of naturally-spoken vowels in speech-shaped noise originate peripherally and can be studied in the spike timing patterns of single AN fibers.
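For readers unfamiliar with the rate metric mentioned above, the short Python sketch below computes a rate-based d' from per-trial spike counts of a single fiber in response to two vowels; it is an illustration of the standard formula, not the study's code, and the example counts are invented.

import numpy as np

def rate_dprime(counts_a, counts_b):
    # d' = |difference of mean rates| / root-mean-square of the two standard deviations.
    a, b = np.asarray(counts_a, dtype=float), np.asarray(counts_b, dtype=float)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return abs(a.mean() - b.mean()) / pooled_sd

# Hypothetical per-trial spike counts for two vowels from the same fiber:
print(rate_dprime([52, 48, 50, 55, 47], [40, 38, 44, 41, 39]))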
Affiliation(s)
- Amarins N Heeringa
- Cluster of Excellence "Hearing4all" and Research Centre Neurosensory Science, Department of Neuroscience, School of Medicine and Health Science, Carl von Ossietzky University Oldenburg, Oldenburg 26129, Germany
- Christine Köppl
- Cluster of Excellence "Hearing4all" and Research Centre Neurosensory Science, Department of Neuroscience, School of Medicine and Health Science, Carl von Ossietzky University Oldenburg, Oldenburg 26129, Germany
5. Wang X, Xu L. Speech perception in noise: Masking and unmasking. J Otol 2021; 16:109-119. PMID: 33777124. PMCID: PMC7985001. DOI: 10.1016/j.joto.2020.12.001.
Abstract
Speech perception is essential for daily communication. However, background noise or concurrent talkers can make it challenging for listeners to track the target speech (i.e., the cocktail party problem). The present study reviews and compares existing findings on speech perception and unmasking in cocktail-party listening environments in English and Mandarin Chinese. The review starts with an introduction section, followed by related concepts of auditory masking. The next two sections review factors that release speech perception from masking in English and Mandarin Chinese, respectively. The last section presents an overall summary of the findings, with comparisons between the two languages, and discusses future research directions in light of how the literature on this topic differs between the two languages.
Affiliation(s)
- Xianhui Wang
- Communication Sciences and Disorders, Ohio University, Athens, OH, 45701, USA
- Li Xu
- Communication Sciences and Disorders, Ohio University, Athens, OH, 45701, USA
6. Silva DMR, Rothe-Neves R, Melges DB. Long-latency event-related responses to vowels: N1-P2 decomposition by two-step principal component analysis. Int J Psychophysiol 2019; 148:93-102. PMID: 31863852. DOI: 10.1016/j.ijpsycho.2019.11.010.
Abstract
The N1-P2 complex of the auditory event-related potential (ERP) has been used to examine neural activity associated with speech sound perception. Since it is thought to reflect multiple generator processes, its functional significance is difficult to infer. In the present study, a temporospatial principal component analysis (PCA) was used to decompose the N1-P2 response into latent factors underlying covariance patterns in ERP data recorded during passive listening to pairs of successive vowels. In each trial, one of six sounds drawn from an /i/-/e/ vowel continuum was followed either by an identical sound, a different token of the same vowel category, or a token from the other category. Responses were examined as to how they were modulated by within- and across-category vowel differences and by adaptation (repetition suppression) effects. Five PCA factors were identified as corresponding to three well-known N1 subcomponents and two P2 subcomponents. Results added evidence that the N1 peak reflects both generators that are sensitive to spectral information and generators that are not. For later latency ranges, different patterns of sensitivity to vowel quality were found, including category-related effects. Particularly, a subcomponent identified as the Tb wave showed release from adaptation in response to an /i/ followed by an /e/ sound. A P2 subcomponent varied linearly with spectral shape along the vowel continuum, while the other was stronger the closer the vowel was to the category boundary, suggesting separate processing of continuous and category-related information. Thus, the PCA-based decomposition of the N1-P2 complex was functionally meaningful, revealing distinct underlying processes at work during speech sound perception.
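For orientation, the Python sketch below illustrates the general logic of a two-step (temporospatial) PCA on ERP data: a temporal PCA over waveforms, followed by a spatial PCA over electrodes for each temporal factor. The synthetic array shapes, component counts, and use of scikit-learn are assumptions for illustration and do not reproduce the authors' pipeline.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
erp = rng.standard_normal((120, 64, 300))            # synthetic data: (trials, electrodes, timepoints)

# Step 1: temporal PCA - every trial/electrode waveform is one observation.
X_time = erp.reshape(-1, erp.shape[2])               # (trials*electrodes, timepoints)
temporal = PCA(n_components=5).fit(X_time)
scores_time = temporal.transform(X_time)             # temporal factor scores

# Step 2: spatial PCA on each temporal factor's scores, organized by electrode.
scores = scores_time.reshape(erp.shape[0], erp.shape[1], -1)     # (trials, electrodes, factors)
spatial = [PCA(n_components=3).fit(scores[:, :, f]) for f in range(scores.shape[2])]
print(len(spatial), spatial[0].components_.shape)    # spatial loadings per temporal factor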
Affiliation(s)
- Daniel M R Silva
- Phonetics Lab, Faculty of Letters, Federal University of Minas Gerais, Belo Horizonte, Brazil
- Rui Rothe-Neves
- Phonetics Lab, Faculty of Letters, Federal University of Minas Gerais, Belo Horizonte, Brazil.
- Danilo B Melges
- Graduate Program in Electrical Engineering, Department of Electrical Engineering, Federal University of Minas Gerais
7
Abstract
Human category learning appears to be supported by dual learning systems. Previous research indicates the engagement of distinct neural systems in learning categories that require selective attention to dimensions versus those that require integration across dimensions. This evidence has largely come from studies of learning across perceptually separable visual dimensions, but recent research has applied dual system models to understanding auditory and speech categorization. Since differential engagement of the dual learning systems is closely related to selective attention to input dimensions, it may be important that acoustic dimensions are quite often perceptually integral and difficult to attend to selectively. We investigated this issue across artificial auditory categories defined by center frequency and modulation frequency acoustic dimensions. Learners demonstrated a bias to integrate across the dimensions, rather than to selectively attend, and the bias specifically reflected a positive correlation between the dimensions. Further, we found that the acoustic dimensions did not equivalently contribute to categorization decisions. These results demonstrate the need to reconsider the assumption that the orthogonal input dimensions used in designing an experiment are indeed orthogonal in perceptual space as there are important implications for category learning.
8. Constraints on learning disjunctive, unidimensional auditory and phonetic categories. Atten Percept Psychophys 2019; 81:958-980. PMID: 30761500. DOI: 10.3758/s13414-019-01683-x.
Abstract
Phonetic categories must be learned, but the processes that allow that learning to unfold are still under debate. The current study investigates constraints on the structure of categories that can be learned and whether these constraints are speech-specific. Category structure constraints are a key difference between theories of category learning, which can roughly be divided into instance-based learning (i.e., exemplar only) and abstractionist learning (i.e., at least partly rule-based or prototype-based) theories. Abstractionist theories can relatively easily accommodate constraints on the structure of categories that can be learned, whereas instance-based theories cannot easily include such constraints. The current study included three groups to investigate these possible constraints as well as their speech specificity: English speakers learning German speech categories, German speakers learning German speech categories, and English speakers learning musical instrument categories, with each group including participants who learned different sets of categories. Both speech groups had greater difficulty learning disjunctive categories (ones that require an "or" statement) than nondisjunctive categories, which suggests that instance-based learning alone is insufficient to explain the learning of the participants learning phonetic categories. This fact was true for both novices (English speakers) and experts (German speakers), which implies that expertise with the materials used cannot explain the patterns observed. However, the same was not true for the musical instrument categories, suggesting a degree of domain-specificity in these constraints that cannot be explained through recourse to expertise alone.
9
Abstract
Studies of vowel systems regularly appeal to the need to understand how the auditory system encodes and processes the information in the acoustic signal. The goal of this study is to present computational models to address this need, and to use the models to illustrate responses to vowels at two levels of the auditory pathway. Many of the models previously used to study auditory representations of speech are based on linear filter banks simulating the tuning of the inner ear. These models do not incorporate key nonlinear response properties of the inner ear that influence responses at conversational-speech sound levels. These nonlinear properties shape neural representations in ways that are important for understanding responses in the central nervous system. The model for auditory-nerve (AN) fibers used here incorporates realistic nonlinear properties associated with the basilar membrane, inner hair cells (IHCs), and the IHC-AN synapse. These nonlinearities set up profiles of f0-related fluctuations that vary in amplitude across the population of frequency-tuned AN fibers. Amplitude fluctuations in AN responses are smallest near formant peaks and largest at frequencies between formants. These f0-related fluctuations strongly excite or suppress neurons in the auditory midbrain, the first level of the auditory pathway where tuning for low-frequency fluctuations in sounds occurs. Formant-related amplitude fluctuations provide representations of the vowel spectrum in discharge rates of midbrain neurons. These representations in the midbrain are robust across a wide range of sound levels, including the entire range of conversational-speech levels, and in the presence of realistic background noise levels.
10. Tamura S, Ito K, Hirose N, Mori S. Precision of voicing perceptual identification is altered in association with voice-onset time production changes. Exp Brain Res 2019; 237:2197-2204. DOI: 10.1007/s00221-019-05584-1.
11. Suksiri B, Fukumoto M. An Efficient Framework for Estimating the Direction of Multiple Sound Sources Using Higher-Order Generalized Singular Value Decomposition. Sensors 2019; 19:2977. PMID: 31284497. PMCID: PMC6651797. DOI: 10.3390/s19132977.
Abstract
This paper presents an efficient framework for estimating the direction-of-arrival (DOA) of wideband sound sources. The proposed framework provides an efficient way to construct a wideband cross-correlation matrix from multiple narrowband cross-correlation matrices across all frequency bins. The framework is inspired by the coherent signal subspace technique but improves its linear transformation procedure: by exploiting cross-correlation matrices between the received signal and itself at distinct frequencies, together with a higher-order generalized singular value decomposition of the array of these matrices, the new procedure no longer requires a preliminary DOA estimate. Wideband DOAs are then estimated by applying any subspace-based narrowband DOA technique to the proposed wideband correlation matrix in place of a narrowband correlation matrix. As a result, recent advances in narrowband subspace methods can be used to estimate the DOAs of wideband sources directly, which reduces computational complexity and simplifies the estimation algorithm. Practical examples showcase the framework's applicability and effectiveness: the proposed fusion methods outperform alternatives over a range of signal-to-noise ratios with only a few sensors, making the framework suitable for practical use.
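As an illustration of the framework's raw ingredients, the Python sketch below constructs the per-frequency-bin (narrowband) cross-correlation matrices from a multichannel recording; the subsequent HO-GSVD fusion into a wideband matrix is not shown, and the frame length, hop size, and windowing are assumptions rather than the authors' settings.

import numpy as np

def narrowband_covariances(x, frame_len=256, hop=128):
    # x: (n_sensors, n_samples) -> (n_bins, n_sensors, n_sensors) cross-correlation matrix per bin.
    n_sensors, n_samples = x.shape
    frames = []
    for start in range(0, n_samples - frame_len + 1, hop):
        seg = x[:, start:start + frame_len] * np.hanning(frame_len)
        frames.append(np.fft.rfft(seg, axis=1))       # (n_sensors, n_bins) per frame
    X = np.stack(frames, axis=0)                       # (n_frames, n_sensors, n_bins)
    n_bins = X.shape[2]
    R = np.empty((n_bins, n_sensors, n_sensors), dtype=complex)
    for k in range(n_bins):
        Xk = X[:, :, k].T                              # (n_sensors, n_frames) at frequency bin k
        R[k] = Xk @ Xk.conj().T / Xk.shape[1]          # averaged outer products across frames
    return R

# Example with a synthetic 4-sensor recording:
print(narrowband_covariances(np.random.default_rng(0).standard_normal((4, 8000))).shape)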
Affiliation(s)
- Bandhit Suksiri
- Department of Engineering, Graduate School of Engineering, Kochi University of Technology, Kami Campus, Kochi 782-0003, Japan
- Masahiro Fukumoto
- School of Information, Kochi University of Technology, Kami Campus, Kochi 782-0003, Japan.
12. Vandermosten M, Correia J, Vanderauwera J, Wouters J, Ghesquière P, Bonte M. Brain activity patterns of phonemic representations are atypical in beginning readers with family risk for dyslexia. Dev Sci 2019; 23:e12857. PMID: 31090993. DOI: 10.1111/desc.12857.
Abstract
There is an ongoing debate about whether phonological deficits in dyslexics should be attributed to (a) less specified representations of speech sounds, as suggested by studies in young children with a familial risk for dyslexia, or (b) impaired access to these phonemic representations, as suggested by studies in adults with dyslexia. These conflicting findings are rooted in between-study differences in sample characteristics and/or testing techniques. The current study uses the same multivariate functional MRI (fMRI) approach as previously used in adults with dyslexia to investigate phonemic representations in 30 beginning readers with a familial risk and 24 beginning readers without a familial risk of dyslexia, of whom 20 were later retrospectively classified as dyslexic. Based on fMRI response patterns evoked by listening to different utterances of /bA/ and /dA/ sounds, multivoxel analyses indicate that the underlying activation patterns of the two phonemes were distinct in children with a low family risk but not in children with high family risk. However, no group differences were observed between children that were later classified as typical versus dyslexic readers, regardless of their family risk status, indicating that poor phonemic representations constitute a risk for dyslexia but are not sufficient to result in reading problems. We hypothesize that poor phonemic representations are trait (family risk) and not state (dyslexia) dependent, and that representational deficits only lead to reading difficulties when they occur in conjunction with other neuroanatomical or neurofunctional deficits.
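A schematic of the kind of multivoxel analysis described above is sketched in Python below: a cross-validated linear classifier tests whether /bA/ and /dA/ response patterns can be told apart. The synthetic data, the classifier choice, and the scikit-learn calls are illustrative assumptions, not the study's actual pipeline.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200
patterns = rng.standard_normal((n_trials, n_voxels))   # one row of voxel responses per trial
labels = np.repeat([0, 1], n_trials // 2)               # 0 = /bA/ trials, 1 = /dA/ trials

# Above-chance cross-validated accuracy would indicate distinct activation patterns for the two phonemes.
accuracy = cross_val_score(LogisticRegression(max_iter=1000), patterns, labels, cv=5).mean()
print(f"mean decoding accuracy: {accuracy:.2f}")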
Affiliation(s)
- Maaike Vandermosten
- Research Group ExpORL, Department of Neuroscience, KU Leuven, Leuven, Belgium.,Department of Cognitive Neuroscience and Maastricht Brain Imaging Center, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands
- Joao Correia
- Department of Cognitive Neuroscience and Maastricht Brain Imaging Center, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands.,Basque Center on Cognition, Brain and Language, San Sebastian, Spain
- Jolijn Vanderauwera
- Research Group ExpORL, Department of Neuroscience, KU Leuven, Leuven, Belgium.,Parenting and Special Education Research Unit, Faculty of Psychology and Educational Sciences, KU Leuven, Leuven, Belgium
- Jan Wouters
- Research Group ExpORL, Department of Neuroscience, KU Leuven, Leuven, Belgium
- Pol Ghesquière
- Parenting and Special Education Research Unit, Faculty of Psychology and Educational Sciences, KU Leuven, Leuven, Belgium
- Milene Bonte
- Department of Cognitive Neuroscience and Maastricht Brain Imaging Center, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, The Netherlands
13
Abstract
Research in speech perception has explored how knowledge of a language influences phonetic perception. The current study investigated whether such linguistic influences extend to the perceptual (sequential) organization of speech. Listeners heard sinewave analogs of word pairs (e.g., loose seam, which contains a single [s] frication but is perceived as two /s/ phonemes) cycle continuously, which causes the stimulus to split apart into foreground and background percepts. They had to identify the foreground percept when the stimuli were heard as nonspeech and then again when heard as speech. Of interest was how grouping changed across listening condition when [s] was heard as speech or as a hiss. Although the section of the signal that was identified as the foreground differed little across listening condition, a strong bias to perceive [s] as forming the onset of the foreground was observed in the speech condition (Experiment 1). This effect was reduced in Experiment 2 by increasing the stimulus repetition rate. Findings suggest that the sequential organization of speech arises from the interaction of auditory and linguistic processes, with the former constraining the latter.
14. Llanos F, Alexander JM, Stilp CE, Kluender KR. Power spectral entropy as an information-theoretic correlate of manner of articulation in American English. J Acoust Soc Am 2017; 141:EL127. PMID: 28253693. DOI: 10.1121/1.4976109.
Abstract
While all languages differentiate speech sounds by manner of articulation, none of the acoustic correlates proposed to date seem to account for how these contrasts are encoded in the speech signal. The present study describes power spectral entropy (PSE), which quantifies the amount of potential information conveyed in the power spectrum of a given sound. Results of acoustic analyses of speech samples extracted from the Texas Instruments-Massachusetts Institute of Technology database reveal a statistically significant correspondence between PSE and American English major classes of manner of articulation. Thus, PSE accurately captures an acoustic correlate of manner of articulation in American English.
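To make the measure concrete, here is a minimal Python sketch that assumes PSE is the Shannon entropy of the power spectrum normalized to behave like a probability distribution; the FFT length and the noise/tone comparison are illustrative choices, not taken from the paper.

import numpy as np

def power_spectral_entropy(signal, n_fft=512):
    # Shannon entropy (in bits) of the normalized power spectrum.
    power = np.abs(np.fft.rfft(signal, n=n_fft)) ** 2
    p = power / power.sum()                     # treat the spectrum as a probability mass function
    p = p[p > 0]                                # drop zero bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

# A broadband (noise-like) sound should yield higher PSE than a narrowband (tone-like) sound.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)
tone = np.sin(2 * np.pi * 1000 * np.arange(4096) / 16000.0)
print(power_spectral_entropy(noise), power_spectral_entropy(tone))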
Affiliation(s)
- Fernando Llanos
- School of Languages and Cultures, Purdue University, 640 Oval Drive, West Lafayette, Indiana 47907, USA
- Joshua M Alexander
- Department of Speech, Language, and Hearing Sciences, Purdue University, 715 Clinic Drive, West Lafayette, Indiana 47907, USA
- Christian E Stilp
- Department of Psychological and Brain Sciences, University of Louisville, 308 Life Sciences Building, Louisville, Kentucky 40292, USA
- Keith R Kluender
- Department of Speech, Language, and Hearing Sciences, Purdue University, 715 Clinic Drive, West Lafayette, Indiana 47907, USA
15. The cocktail-party problem revisited: early processing and selection of multi-talker speech. Atten Percept Psychophys 2015; 77:1465-87. PMID: 25828463. PMCID: PMC4469089. DOI: 10.3758/s13414-015-0882-9.
Abstract
How do we recognize what one person is saying when others are speaking at the same time? This review summarizes widespread research in psychoacoustics, auditory scene analysis, and attention, all dealing with early processing and selection of speech, which has been stimulated by this question. Important effects occurring at the peripheral and brainstem levels are mutual masking of sounds and “unmasking” resulting from binaural listening. Psychoacoustic models have been developed that can predict these effects accurately, albeit using computational approaches rather than approximations of neural processing. Grouping—the segregation and streaming of sounds—represents a subsequent processing stage that interacts closely with attention. Sounds can be easily grouped—and subsequently selected—using primitive features such as spatial location and fundamental frequency. More complex processing is required when lexical, syntactic, or semantic information is used. Whereas it is now clear that such processing can take place preattentively, there also is evidence that the processing depth depends on the task-relevancy of the sound. This is consistent with the presence of a feedback loop in attentional control, triggering enhancement of to-be-selected input. Despite recent progress, there are still many unresolved issues: there is a need for integrative models that are neurophysiologically plausible, for research into grouping based on other than spatial or voice-related cues, for studies explicitly addressing endogenous and exogenous attention, for an explanation of the remarkable sluggishness of attention focused on dynamically changing sounds, and for research elucidating the distinction between binaural speech perception and sound localization.
16. Honey C, Schnupp J. Neural Resolution of Formant Frequencies in the Primary Auditory Cortex of Rats. PLoS One 2015; 10:e0134078. PMID: 26252382. PMCID: PMC4529216. DOI: 10.1371/journal.pone.0134078.
Abstract
Pulse-resonance sounds play an important role in animal communication and auditory object recognition, yet very little is known about the cortical representation of this class of sounds. In this study we shed light on one simple aspect: how well the firing rate of cortical neurons resolves the resonant ("formant") frequencies of vowel-like pulse-resonance sounds. We recorded neural responses in the primary auditory cortex (A1) of anesthetized rats to two-formant pulse-resonance sounds, and estimated their formant-resolving power using a statistical kernel smoothing method which takes into account the natural variability of cortical responses. While formant-tuning functions were diverse in structure across different penetrations, most were sensitive to changes in formant frequency, with a frequency resolution comparable to that reported for rat cochlear filters.
Affiliation(s)
- Jan Schnupp
- Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
17. Speech Coding in the Brain: Representation of Vowel Formants by Midbrain Neurons Tuned to Sound Fluctuations. eNeuro 2015; 2:ENEURO.0004-15.2015. PMID: 26464993. PMCID: PMC4596011. DOI: 10.1523/eneuro.0004-15.2015.
Abstract
Current models for neural coding of vowels are typically based on linear descriptions of the auditory periphery, and fail at high sound levels and in background noise. These models rely on either auditory nerve discharge rates or phase locking to temporal fine structure. However, both discharge rates and phase locking saturate at moderate to high sound levels, and phase locking is degraded in the CNS at middle to high frequencies. The fact that speech intelligibility is robust over a wide range of sound levels is problematic for codes that deteriorate as the sound level increases. Additionally, a successful neural code must function for speech in background noise at levels that are tolerated by listeners. The model presented here resolves these problems, and incorporates several key response properties of the nonlinear auditory periphery, including saturation, synchrony capture, and phase locking to both fine structure and envelope temporal features. The model also includes the properties of the auditory midbrain, where discharge rates are tuned to amplitude fluctuation rates. The nonlinear peripheral response features create contrasts in the amplitudes of low-frequency neural rate fluctuations across the population. These patterns of fluctuations result in a response profile in the midbrain that encodes vowel formants over a wide range of levels and in background noise. The hypothesized code is supported by electrophysiological recordings from the inferior colliculus of awake rabbits. This model provides information for understanding the structure of cross-linguistic vowel spaces, and suggests strategies for automatic formant detection and speech enhancement for listeners with hearing loss.
18. Winter B. Spoken language achieves robustness and evolvability by exploiting degeneracy and neutrality. Bioessays 2014; 36:960-7. DOI: 10.1002/bies.201400028.
Affiliation(s)
- Bodo Winter
- Cognitive and Information Sciences, University of California, Merced, CA, USA
19. McGettigan C, Scott SK. Cortical asymmetries in speech perception: what's wrong, what's right and what's left? Trends Cogn Sci 2012; 16:269-76. PMID: 22521208. DOI: 10.1016/j.tics.2012.04.006.
Abstract
Over the past 30 years hemispheric asymmetries in speech perception have been construed within a domain-general framework, according to which preferential processing of speech is due to left-lateralized, non-linguistic acoustic sensitivities. A prominent version of this argument holds that the left temporal lobe selectively processes rapid/temporal information in sound. Acoustically, this is a poor characterization of speech and there has been little empirical support for a left-hemisphere selectivity for these cues. In sharp contrast, the right temporal lobe is demonstrably sensitive to specific acoustic properties. We suggest that acoustic accounts of speech sensitivities need to be informed by the nature of the speech signal and that a simple domain-general vs. domain-specific dichotomy may be incorrect.
Affiliation(s)
- Carolyn McGettigan
- Institute of Cognitive Neuroscience, University College London, 17 Queen Square, London WC1N 3AR, UK
20. Rapid synaptic depression explains nonlinear modulation of spectro-temporal tuning in primary auditory cortex by natural stimuli. J Neurosci 2009; 29:3374-86. PMID: 19295144. DOI: 10.1523/jneurosci.5249-08.2009.
Abstract
In this study, we explored ways to account more accurately for responses of neurons in primary auditory cortex (A1) to natural sounds. The auditory cortex has evolved to extract behaviorally relevant information from complex natural sounds, but most of our understanding of its function is derived from experiments using simple synthetic stimuli. Previous neurophysiological studies have found that existing models, such as the linear spectro-temporal receptive field (STRF), fail to capture the entire functional relationship between natural stimuli and neural responses. To study this problem, we compared STRFs for A1 neurons estimated using a natural stimulus, continuous speech, with STRFs estimated using synthetic ripple noise. For about one-third of the neurons, we found significant differences between STRFs, usually in the temporal dynamics of inhibition and/or overall gain. This shift in tuning resulted primarily from differences in the coarse temporal structure of the speech and noise stimuli. Using simulations, we found that the stimulus dependence of spectro-temporal tuning can be explained by a model in which synaptic inputs to A1 neurons are susceptible to rapid nonlinear depression. This dynamic reshaping of spectro-temporal tuning suggests that synaptic depression may enable efficient encoding of natural auditory stimuli.
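For orientation, the Python sketch below fits the baseline model the study contrasts against: a linear STRF estimated by ridge regression from a stimulus spectrogram to a neuron's time-varying firing rate. The dimensions, lag count, and regularization constant are illustrative assumptions; the synaptic-depression model itself is not implemented here.

import numpy as np

def fit_linear_strf(spectrogram, rate, n_lags=20, ridge=1.0):
    # spectrogram: (n_freqs, n_times); rate: (n_times,). Returns an (n_freqs, n_lags) STRF estimate.
    n_freqs, n_times = spectrogram.shape
    X = np.zeros((n_times, n_freqs * n_lags))
    for lag in range(n_lags):                           # build a design matrix of time-lagged spectra
        lagged = np.roll(spectrogram, lag, axis=1)
        lagged[:, :lag] = 0.0
        X[:, lag * n_freqs:(lag + 1) * n_freqs] = lagged.T
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ rate)
    return w.reshape(n_lags, n_freqs).T

# Example with synthetic data (16 frequency channels, 500 time bins):
rng = np.random.default_rng(0)
print(fit_linear_strf(rng.standard_normal((16, 500)), rng.standard_normal(500)).shape)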
21. Young ED. Neural representation of spectral and temporal information in speech. Philos Trans R Soc Lond B Biol Sci 2008; 363:923-45. PMID: 17827107. PMCID: PMC2606788. DOI: 10.1098/rstb.2007.2151.
Abstract
Speech is the most interesting and one of the most complex sounds dealt with by the auditory system. The neural representation of speech needs to capture those features of the signal on which the brain depends in language communication. Here we describe the representation of speech in the auditory nerve and in a few sites in the central nervous system from the perspective of the neural coding of important aspects of the signal. The representation is tonotopic, meaning that the speech signal is decomposed by frequency and different frequency components are represented in different populations of neurons. Essential to the representation are the properties of frequency tuning and nonlinear suppression. Tuning creates the decomposition of the signal by frequency, and nonlinear suppression is essential for maintaining the representation across sound levels. The representation changes in central auditory neurons by becoming more robust against changes in stimulus intensity and more transient. However, it is probable that the form of the representation at the auditory cortex is fundamentally different from that at lower levels, in that stimulus features other than the distribution of energy across frequency are analysed.
Affiliation(s)
- Eric D Young
- Department of Biomedical Engineering, Centre for Hearing and Balance, Johns Hopkins University, 720 Rutland Avenue, Baltimore, MD 21205, USA.
22. Campbell R. The processing of audio-visual speech: empirical and neural bases. Philos Trans R Soc Lond B Biol Sci 2008; 363:1001-10. PMID: 17827105. PMCID: PMC2606792. DOI: 10.1098/rstb.2007.2155.
Abstract
In this selective review, I outline a number of ways in which seeing the talker affects auditory perception of speech, including, but not confined to, the McGurk effect. To date, studies suggest that all linguistic levels are susceptible to visual influence, and that two main modes of processing can be described: a complementary mode, whereby vision provides information more efficiently than hearing for some under-specified parts of the speech stream, and a correlated mode, whereby vision partially duplicates information about dynamic articulatory patterning. Cortical correlates of seen speech suggest that at the neurological as well as the perceptual level, auditory processing of speech is affected by vision, so that 'auditory speech regions' are activated by seen speech. The processing of natural speech, whether it is heard, seen or heard and seen, activates the perisylvian language regions (left>right). It is highly probable that activation occurs in a specific order. First, superior temporal, then inferior parietal and finally inferior frontal regions (left>right) are activated. There is some differentiation of the visual input stream to the core perisylvian language system, suggesting that complementary seen speech information makes special use of the visual ventral processing stream, while for correlated visual speech, the dorsal processing stream, which is sensitive to visual movement, may be relatively more involved.
Affiliation(s)
- Ruth Campbell
- Department of Human Communication Science, University College London, Chandler House, 2 Wakefield Street, London WC1N 1PF, UK.
23. Moore BCJ, Tyler LK, Marslen-Wilson W. Introduction. The perception of speech: from sound to meaning. Philos Trans R Soc Lond B Biol Sci 2008; 363:917-21. PMID: 17827100. PMCID: PMC2042536. DOI: 10.1098/rstb.2007.2195.
Abstract
Spoken language communication is arguably the most important activity that distinguishes humans from non-human species. This paper provides an overview of the review papers that make up this theme issue on the processes underlying speech communication. The volume includes contributions from researchers who specialize in a wide range of topics within the general area of speech perception and language processing. It also includes contributions from key researchers in neuroanatomy and functional neuro-imaging, in an effort to cut across traditional disciplinary boundaries and foster cross-disciplinary interactions in this important and rapidly developing area of the biological and cognitive sciences.
Affiliation(s)
- Brian C J Moore
- Department of Experimental Psychology, University of Cambridge, Downing Street, Cambridge CB2 3EB, UK.
24. Moore BCJ. Basic auditory processes involved in the analysis of speech sounds. Philos Trans R Soc Lond B Biol Sci 2008; 363:947-63. PMID: 17827102. PMCID: PMC2606789. DOI: 10.1098/rstb.2007.2152.
Abstract
This paper reviews the basic aspects of auditory processing that play a role in the perception of speech. The frequency selectivity of the auditory system, as measured using masking experiments, is described and used to derive the internal representation of the spectrum (the excitation pattern) of speech sounds. The perception of timbre and distinctions in quality between vowels are related to both static and dynamic aspects of the spectra of sounds. The perception of pitch and its role in speech perception are described. Measures of the temporal resolution of the auditory system are described and a model of temporal resolution based on a sliding temporal integrator is outlined. The combined effects of frequency and temporal resolution can be modelled by calculation of the spectro-temporal excitation pattern, which gives good insight into the internal representation of speech sounds. For speech presented in quiet, the resolution of the auditory system in frequency and time usually markedly exceeds the resolution necessary for the identification or discrimination of speech sounds, which partly accounts for the robust nature of speech perception. However, for people with impaired hearing, speech perception is often much less robust.
Affiliation(s)
- Brian C J Moore
- Department of Experimental Psychology, University of Cambridge, Downing Street, Cambridge CB2 3EB, UK.
25
Abstract
Although most research on the perception of speech has been conducted with speech presented without any competing sounds, we almost always listen to speech against a background of other sounds which we are adept at ignoring. Nevertheless, such additional irrelevant sounds can cause severe problems for speech recognition algorithms and for the hard of hearing as well as posing a challenge to theories of speech perception. A variety of different problems are created by the presence of additional sound sources: detection of features that are partially masked, allocation of detected features to the appropriate sound sources and recognition of sounds on the basis of partial information. The separation of sounds is arousing substantial attention in psychoacoustics and in computer science. An effective solution to the problem of separating sounds would have important practical applications.
Affiliation(s)
- C J Darwin
- Department of Psychology, University of Sussex, Brighton BN1 9QG, UK.