1
Anikin A, Barreda S, Reby D. A practical guide to calculating vocal tract length and scale-invariant formant patterns. Behav Res Methods 2024; 56:5588-5604. [PMID: 38158551] [PMCID: PMC11525281] [DOI: 10.3758/s13428-023-02288-x]
Abstract
Formants (vocal tract resonances) are increasingly analyzed not only by phoneticians in speech but also by behavioral scientists studying diverse phenomena such as acoustic size exaggeration and articulatory abilities of non-human animals. This often involves estimating vocal tract length acoustically and producing scale-invariant representations of formant patterns. We present a theoretical framework and practical tools for carrying out this work, including open-source software solutions included in R packages soundgen and phonTools. Automatic formant measurement with linear predictive coding is error-prone, but formant_app provides an integrated environment for formant annotation and correction with visual and auditory feedback. Once measured, formants can be normalized using a single recording (intrinsic methods) or multiple recordings from the same individual (extrinsic methods). Intrinsic speaker normalization can be as simple as taking formant ratios and calculating the geometric mean as a measure of overall scale. The regression method implemented in the function estimateVTL calculates the apparent vocal tract length assuming a single-tube model, while its residuals provide a scale-invariant vowel space based on how far each formant deviates from equal spacing (the schwa function). Extrinsic speaker normalization provides more accurate estimates of speaker- and vowel-specific scale factors by pooling information across recordings with simple averaging or mixed models, which we illustrate with example datasets and R code. The take-home messages are to record several calls or vowels per individual, measure at least three or four formants, check formant measurements manually, treat uncertain values as missing, and use the statistical tools best suited to each modeling context.
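As a concrete illustration of the single-tube estimate described in this abstract, the following R sketch derives an apparent vocal tract length from four measured formants. It is a minimal sketch, not the soundgen::estimateVTL implementation itself; the formant values and the 35,400 cm/s speed of sound are illustrative assumptions.

```r
# Apparent vocal tract length under a uniform-tube (quarter-wave) model:
# F_i = (2i - 1) * c / (4 * L), so each formant gives L = c * (2i - 1) / (4 * F_i)
speed <- 35400                                            # speed of sound, cm/s
formants <- c(F1 = 520, F2 = 1510, F3 = 2500, F4 = 3500)  # measured formants, Hz
idx <- seq_along(formants)

# Average the per-formant estimates...
vtl_mean <- mean(speed * (2 * idx - 1) / (4 * formants))

# ...or regress formant frequencies on the odd multipliers with no intercept;
# the slope estimates c / (4 * L), and the residuals give each formant's
# deviation from equal spacing (the schwa function).
fit <- lm(formants ~ 0 + I(2 * idx - 1))
vtl_reg <- speed / (4 * coef(fit)[[1]])

round(c(mean_based = vtl_mean, regression = vtl_reg), 1)  # VTL in cm
```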
Affiliation(s)
- Andrey Anikin
- Division of Cognitive Science, Department of Philosophy, Lund University, Box 192, SE-221 00, Lund, Sweden.
- ENES Bioacoustics Research Laboratory, CRNL Center for Research in Neuroscience in Lyon, University of Saint Étienne, 42023, St-Étienne, France.
- Santiago Barreda
- Department of Linguistics, University of California, Davis, Davis, CA, USA
- David Reby
- ENES Bioacoustics Research Laboratory, CRNL Center for Research in Neuroscience in Lyon, University of Saint Étienne, 42023, St-Étienne, France
- Institut Universitaire de France, 75005, Paris, France
2
Luthra S. Why are listeners hindered by talker variability? Psychon Bull Rev 2024; 31:104-121. [PMID: 37580454] [PMCID: PMC10864679] [DOI: 10.3758/s13423-023-02355-6]
Abstract
Though listeners readily recognize speech from a variety of talkers, accommodating talker variability comes at a cost: Myriad studies have shown that listeners are slower to recognize a spoken word when there is talker variability compared with when talker is held constant. This review focuses on two possible theoretical mechanisms for the emergence of these processing penalties. One view is that multitalker processing costs arise through a resource-demanding talker accommodation process, wherein listeners compare sensory representations against hypothesized perceptual candidates and error signals are used to adjust the acoustic-to-phonetic mapping (an active control process known as contextual tuning). An alternative proposal is that these processing costs arise because talker changes involve salient stimulus-level discontinuities that disrupt auditory attention. Some recent data suggest that multitalker processing costs may be driven by both mechanisms operating over different time scales. Fully evaluating this claim requires a foundational understanding of both talker accommodation and auditory streaming; this article provides a primer on each literature and also reviews several studies that have observed multitalker processing costs. The review closes by underscoring a need for comprehensive theories of speech perception that better integrate auditory attention and by highlighting important considerations for future research in this area.
Affiliation(s)
- Sahil Luthra
- Department of Psychology, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, 15213, USA.
3
Oganian Y, Bhaya-Grossman I, Johnson K, Chang EF. Vowel and formant representation in the human auditory speech cortex. Neuron 2023; 111:2105-2118.e4. [PMID: 37105171] [PMCID: PMC10330593] [DOI: 10.1016/j.neuron.2023.04.004]
Abstract
Vowels, a fundamental component of human speech across all languages, are cued acoustically by formants, the resonance frequencies of the vocal tract during speaking. An outstanding question in neurolinguistics is how formants are processed neurally during speech perception. To address this, we collected high-density intracranial recordings from the human speech cortex on the superior temporal gyrus (STG) while participants listened to continuous speech. We found that two-dimensional receptive fields based on the first two formants provided the best characterization of vowel sound representation. Neural activity at single sites was highly selective for zones in this formant space. Furthermore, formant tuning was adjusted dynamically for speaker-specific spectral context. However, the entire population of formant-encoding sites was required to accurately decode single vowels. Overall, our results reveal that complex acoustic tuning in the two-dimensional formant space underlies local vowel representations in STG. As a population code, this gives rise to phonological vowel perception.
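As a toy illustration of two-dimensional formant tuning of the kind reported here, the R sketch below evaluates a Gaussian receptive field over the F1-F2 plane. All parameter values are invented for illustration; the study's fitted receptive fields are not reproduced here.

```r
# A Gaussian receptive field in F1-F2 space: response falls off with distance
# from a preferred formant pair (mu), scaled by tuning widths (sigma).
rf_response <- function(F1, F2, mu = c(500, 1500), sigma = c(150, 400)) {
  exp(-0.5 * (((F1 - mu[1]) / sigma[1])^2 + ((F2 - mu[2]) / sigma[2])^2))
}

rf_response(520, 1510)   # near the preferred zone -> response close to 1
rf_response(300, 2300)   # far from the preferred zone -> response near 0
```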
Affiliation(s)
- Yulia Oganian
- Department of Neurological Surgery, University of California, San Francisco, 675 Nelson Rising Lane, San Francisco, CA 94158, USA
- Ilina Bhaya-Grossman
- Department of Neurological Surgery, University of California, San Francisco, 675 Nelson Rising Lane, San Francisco, CA 94158, USA; University of California, Berkeley-University of California, San Francisco Graduate Program in Bioengineering, Berkeley, CA 94720, USA
- Keith Johnson
- Department of Linguistics, University of California, Berkeley, Berkeley, CA, USA
- Edward F Chang
- Department of Neurological Surgery, University of California, San Francisco, 675 Nelson Rising Lane, San Francisco, CA 94158, USA
4
Persson A, Jaeger TF. Evaluating normalization accounts against the dense vowel space of Central Swedish. Front Psychol 2023; 14:1165742. [PMID: 37416548] [PMCID: PMC10322199] [DOI: 10.3389/fpsyg.2023.1165742]
Abstract
Talkers vary in the phonetic realization of their vowels. One influential hypothesis holds that listeners overcome this inter-talker variability through pre-linguistic auditory mechanisms that normalize the acoustic or phonetic cues that form the input to speech recognition. Dozens of competing normalization accounts exist, including both accounts specific to vowel perception and general-purpose accounts that can be applied to any type of cue. We add to the cross-linguistic literature on this matter by comparing normalization accounts against a new phonetically annotated vowel database of Swedish, a language with a particularly dense vowel inventory of 21 vowels differing in quality and quantity. We evaluate normalization accounts on how they differ in predicted consequences for perception. The results indicate that the best-performing accounts either center or standardize formants by talker. The study also suggests that general-purpose accounts perform as well as vowel-specific accounts, and that vowel normalization operates in both temporal and spectral domains.
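The two best-performing families named here, centering and standardizing (Lobanov) formants by talker, can be sketched in a few lines of R. The toy data frame below is an assumption for illustration, not the Swedish database.

```r
library(dplyr)

# Toy data: two talkers, same vowels, different formant scales (Hz).
vowels <- tibble(
  talker = rep(c("A", "B"), each = 3),
  vowel  = rep(c("i", "a", "u"), times = 2),
  F1     = c(300, 800, 350, 360, 960, 420),
  F2     = c(2300, 1300, 800, 2760, 1560, 960)
)

normalized <- vowels %>%
  group_by(talker) %>%
  mutate(
    F1_c = F1 - mean(F1),              # centered by talker
    F2_c = F2 - mean(F2),
    F1_z = (F1 - mean(F1)) / sd(F1),   # standardized (Lobanov) by talker
    F2_z = (F2 - mean(F2)) / sd(F2)
  ) %>%
  ungroup()
```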
Affiliation(s)
- Anna Persson
- Department of Swedish Language and Multilingualism, Stockholm University, Stockholm, Sweden
- T. Florian Jaeger
- Brain and Cognitive Sciences, University of Rochester, Rochester, NY, United States
- Computer Science, University of Rochester, Rochester, NY, United States
5
Luthra S, Mechtenberg H, Giorio C, Theodore RM, Magnuson JS, Myers EB. Using TMS to evaluate a causal role for right posterior temporal cortex in talker-specific phonetic processing. Brain Lang 2023; 240:105264. [PMID: 37087863] [PMCID: PMC10286152] [DOI: 10.1016/j.bandl.2023.105264]
Abstract
Theories suggest that speech perception is informed by listeners' beliefs of what phonetic variation is typical of a talker. A previous fMRI study found right middle temporal gyrus (RMTG) sensitivity to whether a phonetic variant was typical of a talker, consistent with literature suggesting that the right hemisphere may play a key role in conditioning phonetic identity on talker information. The current work used transcranial magnetic stimulation (TMS) to test whether the RMTG plays a causal role in processing talker-specific phonetic variation. Listeners were exposed to talkers who differed in how they produced voiceless stop consonants while TMS was applied to RMTG, left MTG, or scalp vertex. Listeners subsequently showed near-ceiling performance in indicating which of two variants was typical of a trained talker, regardless of previous stimulation site. Thus, even though the RMTG is recruited for talker-specific phonetic processing, modulation of its function may have only modest consequences.
Affiliation(s)
- James S Magnuson
- University of Connecticut, United States; BCBL. Basque Center on Cognition Brain and Language, Donostia-San Sebastián, Spain; Ikerbasque, Basque Foundation for Science, Bilbao, Spain
6
Voeten CC, Heeringa W, Van de Velde H. Normalization of nonlinearly time-dynamic vowels. J Acoust Soc Am 2022; 152:2692. [PMID: 36456282] [DOI: 10.1121/10.0015025]
Abstract
This study compares 16 vowel-normalization methods for purposes of sociophonetic research. Most of the previous work in this domain has focused on the performance of normalization methods on steady-state vowels. By contrast, this study explicitly considers dynamic formant trajectories, using generalized additive models to model these nonlinearly. Normalization methods were compared using a hand-corrected dataset from the Flemish-Dutch Teacher Corpus, which contains 160 speakers from 8 geographical regions, who spoke regionally accented versions of Netherlandic/Flemish Standard Dutch. Normalization performance was assessed by comparing the methods' abilities to remove anatomical variation, retain vowel distinctions, and explain variation in the normalized F0-F3. In addition, it was established whether normalization competes with by-speaker random effects or complements them, by comparing how much between-speaker variance remained to be apportioned to random effects after normalization. The results partly reproduce the good performance of Lobanov, Gerstman, and Nearey 1 found earlier and generally favor log-mean and centroid methods. However, newer methods achieve higher effect sizes (i.e., explain more variance) at only marginally worse performance. Random effects were found to be equally useful before and after normalization, showing that they complement it. The findings are interpreted in light of the way the different methods handle formant dynamics.
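A generalized additive model of the kind described here can be sketched with the mgcv package. The simulated data, formula, and column names below are assumptions for illustration, not the authors' exact specification.

```r
library(mgcv)

# Toy trajectories: F2 over normalized time for two vowels and ten speakers.
set.seed(3)
trajectories <- expand.grid(time = seq(0, 1, length.out = 11),
                            vowel = factor(c("i", "u")),
                            speaker = factor(paste0("s", 1:10)))
trajectories$F2 <- with(trajectories,
  ifelse(vowel == "i", 2200, 900) +              # vowel targets (Hz)
  300 * sin(pi * time) +                         # shared nonlinear time course
  rnorm(nlevels(speaker))[speaker] * 150 +       # speaker offsets
  rnorm(nrow(trajectories), 0, 40))              # token noise

# Nonlinear trajectory model: a smooth of time per vowel plus by-speaker
# factor smooths acting as random trajectory effects.
m <- bam(F2 ~ vowel + s(time, by = vowel) + s(time, speaker, bs = "fs", m = 1),
         data = trajectories)
summary(m)
```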
Affiliation(s)
- Cesko C Voeten
- Fryske Akademy, Doelestraat 8, Leeuwarden, 8911 DX, The Netherlands
- Wilbert Heeringa
- Fryske Akademy, Doelestraat 8, Leeuwarden, 8911 DX, The Netherlands
7
Uezu Y, Hiroya S, Mochida T. Articulatory compensation for low-pass filtered formant-altered auditory feedback. J Acoust Soc Am 2021; 150:64. [PMID: 34340472] [DOI: 10.1121/10.0004775]
Abstract
Auditory feedback while speaking plays an important role in stably controlling speech articulation. Its importance has been verified in formant-altered auditory feedback (AAF) experiments where speakers utter while listening to speech with perturbed first (F1) and second (F2) formant frequencies. However, the contribution of the frequency components higher than F2 to the articulatory control under the perturbations of F1 and F2 has not yet been investigated. In this study, a formant-AAF experiment was conducted in which a low-pass filter was applied to speech. The experimental results showed that the deviation in the compensatory response was significantly larger when a low-pass filter with a cutoff frequency of 3 kHz was used compared to that when cutoff frequencies of 4 and 8 kHz were used. It was also found that the deviation in the 3-kHz condition correlated with the fundamental frequency and spectral tilt of the produced speech. Additional simulation results using a neurocomputational model of speech production (SimpleDIVA model) and the experimental data showed that the feedforward learning rate increased as the cutoff frequency decreased. These results suggest that high-frequency components of the auditory feedback would be involved in the determination of corrective motor commands from auditory errors.
Affiliation(s)
- Yasufumi Uezu
- NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, 3-1, Morinosato-Wakamiya, Atsugi-shi, Kanagawa, 243-0198, Japan
- Sadao Hiroya
- NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, 3-1, Morinosato-Wakamiya, Atsugi-shi, Kanagawa, 243-0198, Japan
- Takemi Mochida
- NTT Communication Science Laboratories, Nippon Telegraph and Telephone Corporation, 3-1, Morinosato-Wakamiya, Atsugi-shi, Kanagawa, 243-0198, Japan
8
Talker familiarity and the accommodation of talker variability. Atten Percept Psychophys 2021; 83:1842-1860. [PMID: 33398658] [DOI: 10.3758/s13414-020-02203-y]
Abstract
A fundamental problem in speech perception is how (or whether) listeners accommodate variability in the way talkers produce speech. One view of the way listeners cope with this variability is that talker differences are normalized: a mapping between talker-specific characteristics and phonetic categories is computed such that speech is recognized in the context of the talker's vocal characteristics. Consistent with this view, listeners process speech more slowly when the talker changes randomly than when the talker remains constant. An alternative view is that speech perception is based on talker-specific auditory exemplars in memory, clustered around linguistic categories, that allow talker-independent perception. Consistent with this view, listeners become more efficient at talker-specific phonetic processing after voice identification training. We asked whether phonetic efficiency would increase with talker familiarity by testing listeners with extremely familiar talkers (family members), newly familiar talkers (based on laboratory training), and unfamiliar talkers. We also asked whether familiarity would reduce the need for normalization. As predicted, phonetic efficiency (word recognition in noise) increased with familiarity (unfamiliar < trained-on < family). However, we observed a constant processing cost for talker changes, even for pairs of family members. We discuss how normalization and exemplar theories might account for these results, and the constraints the results impose on theoretical accounts of phonetic constancy.
9
Lehet M, Holt LL. Nevertheless, it persists: Dimension-based statistical learning and normalization of speech impact different levels of perceptual processing. Cognition 2020; 202:104328. [PMID: 32502867] [DOI: 10.1016/j.cognition.2020.104328]
Abstract
Speech is notoriously variable, with no simple mapping from acoustics to linguistically meaningful units like words and phonemes. Empirical research on this theoretically central issue establishes at least two classes of perceptual phenomena that accommodate acoustic variability: normalization and perceptual learning. Intriguingly, perceptual learning is supported by learning across acoustic variability, whereas normalization is thought to counteract acoustic variability, leaving open questions about how these two phenomena might interact. Here, we examine the joint impact of normalization and perceptual learning on how acoustic dimensions map to vowel categories. As listeners categorized nonwords as setch or satch, they experienced a shift in short-term distributional regularities across the vowels' acoustic dimensions. Introduction of this 'artificial accent' resulted in a shift in the contribution of vowel duration to categorization. Although this dimension-based statistical learning impacted the influence of vowel duration on vowel categorization, the duration of these very same vowels nonetheless maintained a consistent influence on categorization of a subsequent consonant via duration contrast, a form of normalization. Thus, vowel duration had a duplex role, consistent with normalization and perceptual learning operating on distinct levels in the processing hierarchy. We posit that whereas normalization operates across auditory dimensions, dimension-based statistical learning impacts the connection weights among auditory dimensions and phonetic categories.
Affiliation(s)
- Matthew Lehet
- Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15232, USA; Center for the Neural Basis of Cognition, Pittsburgh, PA 15232, USA
- Lori L Holt
- Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15232, USA; Center for the Neural Basis of Cognition, Pittsburgh, PA 15232, USA; Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA 15232, USA
10
Graham C, Post B. Constancy and Variation in Speech: Phonetic Realisation and Abstraction. Phonetica 2019; 76:87-99. [PMID: 31112964] [DOI: 10.1159/000497439]
Affiliation(s)
- Calbert Graham
- Phonetics Laboratory, Theoretical and Applied Linguistics, University of Cambridge, Cambridge, United Kingdom
- Brechtje Post
- Phonetics Laboratory, Theoretical and Applied Linguistics, University of Cambridge, Cambridge, United Kingdom
11
Long-standing problems in speech perception dissolve within an information-theoretic perspective. Atten Percept Psychophys 2019; 81:861-883. [PMID: 30937673] [DOI: 10.3758/s13414-019-01702-x]
Abstract
An information-theoretic framework is proposed to have the potential to dissolve (rather than attempt to solve) multiple long-standing problems concerning speech perception. By this view, speech perception can be reframed as a series of processes through which sensitivity to information (that which changes and/or is unpredictable) becomes increasingly sophisticated and shaped by experience. Problems concerning the appropriate objects of perception (gestures vs. sounds), rate normalization, variance consequent to articulation, and talker normalization are reframed, or even dissolved, within this information-theoretic framework. Application of discriminative models founded on information theory provides a productive approach to answering questions concerning the perception of speech, and perception most broadly.
12
Pike CD, Kriengwatana BP. Vocal tract constancy in birds and humans. Behav Processes 2018; 163:99-112. [PMID: 30145277] [DOI: 10.1016/j.beproc.2018.08.001]
Abstract
Humans perceive speech as being relatively stable despite acoustic variation caused by vocal tract (VT) differences between speakers. Humans use perceptual 'vocal tract normalisation' (VTN) and other processes to achieve this stability. Similarity in vocal apparatus/acoustics between birds and humans means that birds might also experience VT variation. This has the potential to impede bird communication. No known studies have explicitly examined this, but a number of studies show perceptual stability or 'perceptual constancy' in birds similar to that seen in humans when dealing with VT variation. This review explores similarities between birds and humans and concludes that birds show sufficient evidence of perceptual constancy to warrant further research in this area. Future work should 1) quantify the multiple sources of variation in bird vocalisations, including, but not limited to, VT variations, 2) determine whether vocalisations are perniciously disrupted by any of these, and 3) investigate how birds reduce variation to maintain perceptual constancy and perceptual efficiency.
Affiliation(s)
- Cleopatra Diana Pike
- School of Psychology and Neuroscience, University of St Andrews, St Mary's Quad, South Street, St Andrews, Fife, KY16 9JP, UK.
- Buddhamas Pralle Kriengwatana
- School of Psychology and Neuroscience, University of St Andrews, St Mary's Quad, South Street, St Andrews, Fife, KY16 9JP, UK
13
Barreda S, Nearey TM. A regression approach to vowel normalization for missing and unbalanced data. J Acoust Soc Am 2018; 144:500. [PMID: 30075677] [DOI: 10.1121/1.5047742]
Abstract
Researchers investigating the vowel systems of languages or dialects frequently employ normalization methods to minimize between-speaker variability in formant patterns while preserving between-phoneme separation and (socio-)dialectal variation. Here, two methods are considered: log-mean and Lobanov normalization. Although both of these methods express formants in a speaker-dependent space, they differ in their complexity and in their implied models of human vowel perception. Typical implementations of these methods rely on balanced data across speakers, so that researchers may have to reduce the data available in the analyses in missing-data situations. Here, an alternative method is proposed for the normalization of vowels using the log-mean method in a linear-regression framework. The performance of the traditional approaches to log-mean and Lobanov normalization was compared against the regression approach to the log-mean method using naturalistic, simulated vowel data. The results indicate that the Lobanov method likely removes legitimate linguistic variation from vowel data and often provides very noisy estimates of the actual vowel quality associated with individual tokens. The authors further argue that the Lobanov method is too complex to represent a plausible model of human vowel perception, and so is unlikely to provide results that reflect the true perceptual organization of linguistic data.
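The contrast between the traditional and regression formulations of log-mean normalization can be sketched in R as follows. The simulated long-format data and the exact regression specification are assumptions for illustration, not the authors' code.

```r
# Toy long-format data: speaker, vowel, formant number, frequency (Hz).
set.seed(1)
d <- expand.grid(speaker = c("s1", "s2", "s3"),
                 vowel   = c("i", "a", "u"),
                 formant = c("F1", "F2"))
target  <- c(i.F1 = 300, a.F1 = 800, u.F1 = 350,
             i.F2 = 2300, a.F2 = 1300, u.F2 = 800)     # vowel targets
scale_f <- c(s1 = 1.0, s2 = 1.15, s3 = 0.9)            # speaker scale factors
d$hz <- target[paste(d$vowel, d$formant, sep = ".")] *
        scale_f[as.character(d$speaker)] * exp(rnorm(nrow(d), 0, 0.02))

# Traditional log-mean normalization (assumes balanced data): subtract each
# speaker's mean log-formant from every log-formant value.
d$logF_norm <- log(d$hz) - ave(log(d$hz), d$speaker)

# Regression formulation (tolerates missing/unbalanced data): vowel-by-formant
# cell means and speaker offsets estimated jointly; the fitted speaker
# coefficients stand in for the speaker log-means (relative to a reference).
fit <- lm(log(hz) ~ interaction(vowel, formant) + speaker, data = d)
coef(fit)[grep("^speaker", names(coef(fit)))]
```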
Affiliation(s)
- Santiago Barreda
- Department of Linguistics, University of California, Davis, Davis, California 95616, USA
- Terrance M Nearey
- Department of Linguistics, University of Alberta, Edmonton T6G 2E7, Canada
14
Ménard L, Côté D, Trudeau-Fisette P. Maintaining Distinctiveness at Increased Speaking Rates: A Comparison between Congenitally Blind and Sighted Speakers. Folia Phoniatr Logop 2017; 68:232-238. [PMID: 28746935] [DOI: 10.1159/000470905]
Abstract
OBJECTIVES: The effects of increased speaking rates on vowels have been well documented in sighted adults. It has been reported that in fast speech, vowels are less widely spaced acoustically than in their citation form. Vowel space compression has also been reported in congenitally blind speakers. The objective of the study was to investigate the interaction of vision and speaking rate in adult speakers.
PATIENTS AND METHODS: Contrast distances between vowels were examined in conversational and fast speech produced by 10 congenitally blind and 10 sighted French-Canadian adults. Acoustic analyses were carried out.
RESULTS: Compared with the sighted speakers, in the fast speaking condition, the blind speakers produced more vowels with contrast along the height, place of articulation, and rounding features located within the auditory target regions typical of French vowels.
CONCLUSION: Blind speakers relied more heavily than sighted speakers on auditory properties of vowels to maintain perceptual distinctiveness.
Affiliation(s)
- Lucie Ménard
- Laboratoire de Phonétique, Center for Research on Brain, Language, and Music, Université du Québec à Montréal, Montreal, QC, Canada
15
Gaston J, Dickerson K, Hipp D, Gerhardstein P. Change deafness for real spatialized environmental scenes. Cogn Res Princ Implic 2017; 2:29. [PMID: 28680950] [PMCID: PMC5487906] [DOI: 10.1186/s41235-017-0066-3]
Abstract
The everyday auditory environment is complex and dynamic; often, multiple sounds co-occur and compete for a listener’s cognitive resources. ‘Change deafness’, framed as the auditory analog to the well-documented phenomenon of ‘change blindness’, describes the finding that changes presented within complex environments are often missed. The present study examines a number of stimulus factors that may influence change deafness under real-world listening conditions. Specifically, an AX (same-different) discrimination task was used to examine the effects of both spatial separation over a loudspeaker array and the type of change (sound source additions and removals) on discrimination of changes embedded in complex backgrounds. Results using signal detection theory and accuracy analyses indicated that, under most conditions, errors were significantly reduced for spatially distributed relative to non-spatial scenes. A second goal of the present study was to evaluate a possible link between memory for scene contents and change discrimination. Memory was evaluated by presenting a cued recall test following each trial of the discrimination task. Results using signal detection theory and accuracy analyses indicated that recall ability was similar in terms of accuracy, but there were reductions in sensitivity compared to previous reports. Finally, the present study used a large and representative sample of outdoor, urban, and environmental sounds, presented in unique combinations of nearly 1000 trials per participant. This enabled the exploration of the relationship between change perception and the perceptual similarity between change targets and background scene sounds. These (post hoc) analyses suggest both a categorical and a stimulus-level relationship between scene similarity and the magnitude of change errors.
Affiliation(s)
- Jeremy Gaston
- Army Research Laboratory, Human Research and Engineering Directorate, Adelphi, MD USA
- Kelly Dickerson
- Army Research Laboratory, Human Research and Engineering Directorate, Adelphi, MD USA
- Daniel Hipp
- Army Research Laboratory, Human Research and Engineering Directorate, Adelphi, MD USA
- Peter Gerhardstein
- Army Research Laboratory, Human Research and Engineering Directorate, Adelphi, MD USA
16
Neuromagnetic correlates of voice pitch, vowel type, and speaker size in auditory cortex. Neuroimage 2017; 158:79-89. [PMID: 28669914] [DOI: 10.1016/j.neuroimage.2017.06.065]
Abstract
Vowel recognition is largely immune to differences in speaker size despite the waveform differences associated with variation in speaker size. This has led to the suggestion that voice pitch and mean formant frequency (MFF) are extracted early in the hierarchy of hearing/speech processing and used to normalize the internal representation of vowel sounds. This paper presents a magnetoencephalographic (MEG) experiment designed to locate and compare neuromagnetic activity associated with voice pitch, MFF and vowel type in human auditory cortex. Sequences of six sustained vowels were used to contrast changes in the three components of vowel perception, and MEG responses to the changes were recorded from 25 participants. A staged procedure was employed to fit the MEG data with a source model having one bilateral pair of dipoles for each component of vowel perception. This dipole model showed that the activity associated with the three perceptual changes was functionally separable; the pitch source was located in Heschl's gyrus (bilaterally), while the vowel-type and formant-frequency sources were located (bilaterally) just behind Heschl's gyrus in planum temporale. The results confirm that vowel normalization begins in auditory cortex at an early point in the hierarchy of speech processing.
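One simple way to operationalize a mean formant frequency is as the geometric mean of the measured formants; the R sketch below uses that definition, which is an assumption here (the study's exact computation may differ), with invented formant values.

```r
# Mean formant frequency (MFF) as the geometric mean of the formants:
formants <- c(560, 1480, 2480, 3480)   # Hz, illustrative
mff <- exp(mean(log(formants)))

# Scaling all formants by a common factor (a "larger speaker") shifts MFF
# but leaves formant ratios, and hence vowel identity, unchanged:
exp(mean(log(1.2 * formants))) / mff   # = 1.2
```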
17
Mulak KE, Bonn CD, Chládková K, Aslin RN, Escudero P. Indexical and linguistic processing by 12-month-olds: Discrimination of speaker, accent and vowel differences. PLoS One 2017; 12:e0176762. [PMID: 28520762] [PMCID: PMC5435166] [DOI: 10.1371/journal.pone.0176762]
Abstract
Infants preferentially discriminate between speech tokens that cross native category boundaries prior to acquiring a large receptive vocabulary, implying a major role for unsupervised distributional learning strategies in phoneme acquisition in the first year of life. Multiple sources of between-speaker variability contribute to children's language input and thus complicate the problem of distributional learning. Adults resolve this type of indexical variability by adjusting their speech processing for individual speakers. For infants to handle indexical variation in the same way, they must be sensitive to both linguistic and indexical cues. To assess infants' sensitivity to and relative weighting of indexical and linguistic cues, we familiarized 12-month-old infants to tokens of a vowel produced by one speaker, and tested their listening preference to trials containing a vowel category change produced by the same speaker (linguistic information), and the same vowel category produced by another speaker of the same or a different accent (indexical information). Infants noticed linguistic and indexical differences, suggesting that both are salient in infant speech processing. Future research should explore how infants weight these cues in a distributional learning context that contains both phonetic and indexical variation.
Affiliation(s)
- Karen E. Mulak
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, New South Wales, Australia
- Australian Research Council Centre of Excellence for the Dynamics of Language, Western Sydney University, Penrith, New South Wales, Australia
- Cory D. Bonn
- Department of Brain & Cognitive Sciences, University of Rochester, Rochester, New York, United States of America
- Kateřina Chládková
- Amsterdam Center for Language and Communication, University of Amsterdam, Amsterdam, Netherlands
- Richard N. Aslin
- Department of Brain & Cognitive Sciences, University of Rochester, Rochester, New York, United States of America
- Paola Escudero
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, New South Wales, Australia
- Australian Research Council Centre of Excellence for the Dynamics of Language, Western Sydney University, Penrith, New South Wales, Australia
18
Tzeng CY, Alexander JED, Sidaras SK, Nygaard LC. The role of training structure in perceptual learning of accented speech. J Exp Psychol Hum Percept Perform 2016; 42:1793-1805. [PMID: 27399829] [PMCID: PMC5083239] [DOI: 10.1037/xhp0000260]
Abstract
Foreign-accented speech contains multiple sources of variation that listeners learn to accommodate. Extending previous findings showing that exposure to high-variation training facilitates perceptual learning of accented speech, the current study examines to what extent the structure of training materials affects learning. During training, native adult speakers of American English transcribed sentences spoken in English by native Spanish-speaking adults. In Experiment 1, training stimuli were blocked by speaker, sentence, or randomized with respect to speaker and sentence (Variable training). At test, listeners transcribed novel English sentences produced by unfamiliar Spanish-accented speakers. Listeners' transcription accuracy was highest in the Variable condition, suggesting that varying both speaker identity and sentence across training trials enabled listeners to generalize their learning to novel speakers and linguistic content. Experiment 2 assessed the extent to which ordering of training tokens by a single factor, speaker intelligibility, would facilitate speaker-independent accent learning, finding that listeners' test performance did not reliably differ from that in the no-training control condition. Overall, these results suggest that the structure of training exposure, specifically trial-to-trial variation on both speaker's voice and linguistic content, facilitates learning of the systematic properties of accented speech. The current findings suggest a crucial role of training structure in optimizing perceptual learning. Beyond characterizing the types of variation listeners encode in their representations of spoken utterances, theories of spoken language processing should incorporate the role of training structure in learning lawful variation in speech.
19
Ananthapadmanabha TV, Ramakrishnan AG. Intrinsic-cum-extrinsic normalization of formant data of vowels. J Acoust Soc Am 2016; 140:EL446. [PMID: 27908035] [DOI: 10.1121/1.4967311]
Abstract
Using a known speaker-intrinsic normalization procedure, formant data are scaled by the reciprocal of the geometric mean of the first three formant frequencies. This reduces the influence of the talker but results in a distorted vowel space. The proposed speaker-extrinsic procedure re-scales the normalized values by the mean formant values of vowels. When tested on the formant data of vowels published by Peterson and Barney, the combined approach leads to well separated clusters by reducing the spread due to talkers. The proposed procedure performs better than two top-ranked normalization procedures based on the accuracy of vowel classification as the objective measure.
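The two-step procedure described in this abstract can be sketched in R as follows. The data frame is an invented example, and the extrinsic step is simplified here to overall formant means; the paper's exact rescaling (e.g., per-vowel means) may differ.

```r
# Toy formant data (Hz) for a handful of tokens.
d <- data.frame(F1 = c(300, 800, 350, 360, 960, 420),
                F2 = c(2300, 1300, 800, 2760, 1560, 960),
                F3 = c(3000, 2500, 2300, 3600, 3000, 2760))

# Step 1 (speaker-intrinsic): divide each token's formants by the geometric
# mean of its first three formants.
gm <- with(d, (F1 * F2 * F3)^(1/3))
d$F1n <- d$F1 / gm; d$F2n <- d$F2 / gm; d$F3n <- d$F3 / gm

# Step 2 (speaker-extrinsic): re-scale the normalized values by mean formant
# values over the data, restoring a Hz-like vowel space.
d$F1r <- d$F1n * mean(d$F1)
d$F2r <- d$F2n * mean(d$F2)
d$F3r <- d$F3n * mean(d$F3)
```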
Affiliation(s)
- Ramakrishnan A G
- Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India
20
Liao JS. An Acoustic Study of Vowels Produced by Alaryngeal Speakers in Taiwan. Am J Speech Lang Pathol 2016; 25:481-492. [PMID: 27538050] [DOI: 10.1044/2016_ajslp-15-0068]
Abstract
PURPOSE: This study investigated the acoustic properties of 6 Taiwan Southern Min vowels produced by 10 laryngeal speakers (LA), 10 speakers with a pneumatic artificial larynx (PA), and 8 esophageal speakers (ES).
METHOD: Each of the 6 monophthongs of Taiwan Southern Min (/i, e, a, ɔ, u, ə/) was represented by a Taiwan Southern Min character and appeared randomly on a list 3 times (6 Taiwan Southern Min characters × 3 repetitions = 18 tokens). Each Taiwan Southern Min character in this study has the same syllable structure, /V/, and all were read with tone 1 (high and level). Acoustic measurements of the 1st formant, 2nd formant, and 3rd formant were taken for each vowel. Then, vowel space areas (VSAs) enclosed by /i, a, u/ were calculated for each group of speakers. The Euclidean distance between vowels in the pairs /i, a/, /i, u/, and /a, u/ was also calculated and compared across the groups.
RESULTS: PA and ES have higher 1st or 2nd formant values than LA for each vowel. The distance is significantly shorter between vowels in the corner vowel pairs /i, a/ and /i, u/. PA and ES have a significantly smaller VSA compared with LA.
CONCLUSIONS: In accordance with previous studies, alaryngeal speakers have higher formant frequency values than LA because they have a shortened vocal tract as a result of their total laryngectomy. Furthermore, the resonance frequencies are inversely related to the length of the vocal tract (on the basis of the assumption of the source filter theory). PA and ES have a smaller VSA and shorter distances between corner vowels compared with LA, which may be related to speech intelligibility. This hypothesis needs further support from future study.
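The vowel space area and corner-vowel distance measures used here can be sketched in R with the shoelace formula; the formant values below are illustrative, not the study's data.

```r
# Corner vowels /i, a, u/ in the F1-F2 plane (Hz, illustrative means).
corners <- data.frame(vowel = c("i", "a", "u"),
                      F1 = c(300, 800, 350),
                      F2 = c(2300, 1300, 800))

# Vowel space area of the /i, a, u/ triangle via the shoelace formula (Hz^2):
vsa <- with(corners, abs(F1[1] * (F2[2] - F2[3]) +
                         F1[2] * (F2[3] - F2[1]) +
                         F1[3] * (F2[1] - F2[2])) / 2)

# Euclidean distance between a corner-vowel pair, e.g. /i, a/:
dist_i_a <- with(corners, sqrt((F1[1] - F1[2])^2 + (F2[1] - F2[2])^2))
```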
Affiliation(s)
- Jia-Shiou Liao
- Department of Speech Language Pathology and Audiology, Chung Shan Medical University, Taichung, Taiwan
21
Hu W, Mi L, Yang Z, Tao S, Li M, Wang W, Dong Q, Liu C. Shifting Perceptual Weights in L2 Vowel Identification after Training. PLoS One 2016; 11:e0162876. [PMID: 27649413] [PMCID: PMC5029867] [DOI: 10.1371/journal.pone.0162876]
Abstract
Difficulties with second-language vowel perception may be related to the significant challenges in using acoustic-phonetic cues. This study investigated the effects of perception training with duration-equalized vowels on native Chinese listeners' English vowel perception and their use of acoustic-phonetic cues. Seventeen native Chinese listeners were perceptually trained with duration-equalized English vowels, and another 17 native Chinese listeners watched English videos as a control group. Both groups were tested with English vowel identification and vowel formant discrimination before training, immediately after training, and three months later. The results showed that the training effect was greater for the vowel training group than for the control group, while both groups improved their English vowel identification and vowel formant discrimination after training. Moreover, duration-equalized vowel perception training significantly reduced listeners' reliance on duration cues and improved their use of spectral cues in identifying English vowels, but video-watching did not help. The results suggest that duration-equalized English vowel perception training may improve non-native listeners' English vowel perception by changing their perceptual weights of acoustic-phonetic cues.
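Perceptual cue weights of the kind estimated here are often approximated by regressing identification responses on standardized cues. The R sketch below simulates a listener and is an illustration of that general approach, not the authors' analysis.

```r
set.seed(2)
n <- 200
responses <- data.frame(duration = rnorm(n, 170, 40),   # ms
                        F1 = rnorm(n, 550, 60),         # Hz
                        F2 = rnorm(n, 1700, 150))       # Hz
# Simulated listener who relies on duration more heavily than on F1:
p <- plogis(0.04 * (responses$duration - 170) + 0.02 * (responses$F1 - 550))
responses$resp_tense <- rbinom(n, 1, p)

# Standardized logistic-regression coefficients as relative cue weights:
fit <- glm(resp_tense ~ scale(duration) + scale(F1) + scale(F2),
           family = binomial, data = responses)
coef(fit)   # larger |coefficient| ~ heavier reliance on that cue
```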
Affiliation(s)
- Wei Hu
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- Lin Mi
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- Zhen Yang
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- Sha Tao
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- Mingshuang Li
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- Wenjing Wang
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- Qi Dong
- State Key Laboratory of Cognitive Neuroscience and Learning & IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
- National Innovation Center for Assessment of Basic Education Quality, Beijing Normal University, Beijing, China
- Chang Liu
- Department of Communication Sciences and Disorders, University of Texas at Austin, Austin, Texas, United States of America
22
Wenndt SJ. Human recognition of familiar voices. J Acoust Soc Am 2016; 140:1172. [PMID: 27586746] [DOI: 10.1121/1.4958682]
Abstract
Recognizing familiar voices is something we do every day. In quiet environments, it is usually easy to recognize a familiar voice; in noisier environments, this can become a difficult task. This paper examines how robust listeners are at identifying familiar voices in noisy, changing environments and what factors may affect their recognition rates. While there is previous research addressing familiar speaker recognition, it is limited by the difficulty of obtaining appropriate data that eliminate speaker-dependent traits, such as word choice, along with having corresponding listeners who are familiar with the speakers. The data used in this study were collected in such a fashion as to mimic conversational, free-flowing dialogue, but in a way that eliminates many variables such as word choice, intonation, or non-verbal cues. These data provide some of the most realistic test scenarios to date for familiar speaker identification. A pure-tone hearing test was used to separate listeners into normal-hearing and hearing-impaired groups; it was hypothesized that the normal-hearing group would perform significantly better. Additionally, the aspect of familiar speaker recognition was addressed by having each listener rate his or her familiarity with each speaker. Two statistical approaches showed that the more familiar a listener is with a speaker, the more likely the listener is to recognize the speaker.
Affiliation(s)
- Stanley J Wenndt
- Air Force Research Laboratory, 525 Brooks Road, Rome, New York 13441, USA
23
Bourguignon NJ, Baum SR, Shiller DM. Please say what this word is: Vowel-extrinsic normalization in the sensorimotor control of speech. J Exp Psychol Hum Percept Perform 2016; 42:1039-1047. [PMID: 26820250] [DOI: 10.1037/xhp0000209]
Abstract
The extent to which the adaptive nature of speech perception influences the acoustic targets underlying speech production is not well understood. For example, listeners can rapidly accommodate to talker-dependent phonetic properties (a process known as vowel-extrinsic normalization) without altering their speech output. Recent evidence, however, shows that reinforcement-based learning in vowel perception alters the processing of speech auditory feedback, impacting sensorimotor control during vowel production. This suggests that more automatic and ubiquitous forms of perceptual plasticity, such as those characterizing perceptual talker normalization, may also impact the sensorimotor control of speech. To test this hypothesis, we set out to examine the possible effects of vowel-extrinsic normalization on experimental subjects' interpretation of their own speech outcomes. By combining a well-known manipulation of vowel-extrinsic normalization with speech auditory-motor adaptation, we show that exposure to different vowel spectral properties subsequently alters auditory feedback processing during speech production, thereby influencing speech motor adaptation. These findings extend the scope of perceptual normalization processes to include auditory feedback and support the idea that naturally occurring adaptations found in speech perception impact speech production.
Affiliation(s)
- Nicolas J Bourguignon
- École d'orthophonie et d'audiologie, University of Montreal, Centre de recherche, CHU Sainte-Justine
- Shari R Baum
- School of Communication Sciences and Disorders, McGill University
- Douglas M Shiller
- École d'orthophonie et d'audiologie, University of Montreal, Centre de recherche, CHU Sainte-Justine
24
Kriengwatana B, Escudero P, Kerkhoven AH, ten Cate C. A general auditory bias for handling speaker variability in speech? Evidence in humans and songbirds. Front Psychol 2015; 6:1243. [PMID: 26379579] [PMCID: PMC4548094] [DOI: 10.3389/fpsyg.2015.01243]
Abstract
Different speakers produce the same speech sound differently, yet listeners are still able to reliably identify the speech sound. How listeners can adjust their perception to compensate for speaker differences in speech, and whether these compensatory processes are unique to humans, is still not fully understood. In this study we compare the ability of humans and zebra finches to categorize vowels despite speaker variation in speech, in order to test the hypothesis that accommodating speaker and gender differences in isolated vowels can be achieved without prior experience with speaker-related variability. Using a behavioral Go/No-go task and identical stimuli, we compared Australian English adults' (naïve to Dutch) and zebra finches' (naïve to human speech) ability to categorize /ɪ/ and /ɛ/ vowels of a novel Dutch speaker after learning to discriminate those vowels from only one other speaker. Experiments 1 and 2 presented vowels of two speakers interspersed or blocked, respectively. Results demonstrate that categorization of vowels is possible without prior exposure to speaker-related variability in speech for zebra finches, and in non-native vowel categories for humans. Therefore, this study is the first to provide evidence for what might be a species-shared auditory bias that may supersede speaker-related information during vowel categorization. It additionally provides behavioral evidence contradicting a prior hypothesis that accommodation of speaker differences is achieved via the use of formant ratios. Therefore, investigations of alternative accounts of vowel normalization that incorporate the possibility of an auditory bias for disregarding inter-speaker variability are warranted.
Affiliation(s)
- Buddhamas Kriengwatana
- Behavioural Biology, Institute for Biology Leiden, Leiden University, Leiden, Netherlands; Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands
- Paola Escudero
- The MARCS Institute and ARC Centre of Excellence for the Dynamics of Language, University of Western Sydney, Sydney, NSW, Australia
- Anne H Kerkhoven
- Behavioural Biology, Institute for Biology Leiden, Leiden University, Leiden, Netherlands
- Carel ten Cate
- Behavioural Biology, Institute for Biology Leiden, Leiden University, Leiden, Netherlands; Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands
25
Town SM, Atilgan H, Wood KC, Bizley JK. The role of spectral cues in timbre discrimination by ferrets and humans. J Acoust Soc Am 2015; 137:2870-2883. [PMID: 25994714] [PMCID: PMC6544515] [DOI: 10.1121/1.4916690]
Abstract
Timbre distinguishes sounds of equal loudness, pitch, and duration; however, little is known about the neural mechanisms underlying timbre perception. Such understanding requires animal models such as the ferret in which neuronal and behavioral observation can be combined. The current study asked what spectral cues ferrets use to discriminate between synthetic vowels. Ferrets were trained to discriminate vowels differing in the position of the first (F1) and second formants (F2), inter-formant distance, and spectral centroid. In experiment 1, ferrets responded to probe trials containing novel vowels in which the spectral cues of trained vowels were mismatched. Regression models fitted to behavioral responses determined that F2 and spectral centroid were stronger predictors of ferrets' behavior than either F1 or inter-formant distance. Experiment 2 examined responses to single formant vowels and found that individual spectral peaks failed to account for multi-formant vowel perception. Experiment 3 measured responses to unvoiced vowels and showed that ferrets could generalize vowel identity across voicing conditions. Experiment 4 employed the same design as experiment 1 but with human participants. Their responses were also predicted by F2 and spectral centroid. Together these findings further support the ferret as a model for studying the neural processes underlying timbre perception.
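Spectral centroid, one of the cues tested here, is simply the amplitude-weighted mean frequency of a spectrum; the toy spectrum below is an invented example.

```r
# Spectral centroid: the amplitude-weighted mean frequency of a spectrum.
spectral_centroid <- function(freq, amp) sum(freq * amp) / sum(amp)

# Example with a toy magnitude spectrum concentrated near 1.5 kHz:
freq <- seq(0, 8000, by = 50)
amp  <- dnorm(freq, mean = 1500, sd = 400)
spectral_centroid(freq, amp)   # ~1500 Hz
```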
Affiliation(s)
- Stephen M Town
- Ear Institute, University College London, 332 Gray's Inn Road, London WC1X 8EE, United Kingdom
- Huriye Atilgan
- Ear Institute, University College London, 332 Gray's Inn Road, London WC1X 8EE, United Kingdom
- Katherine C Wood
- Ear Institute, University College London, 332 Gray's Inn Road, London WC1X 8EE, United Kingdom
- Jennifer K Bizley
- Ear Institute, University College London, 332 Gray's Inn Road, London WC1X 8EE, United Kingdom
26
Donai JJ, Paschall DD. Identification of high-pass filtered male, female, and child vowels: The use of high-frequency cues. J Acoust Soc Am 2015; 137:1971-1982. [PMID: 25920848] [DOI: 10.1121/1.4916195]
Abstract
Vowels are characteristically described according to low-frequency resonance characteristics, which are presumed to provide the requisite information for identification. Classically, the study of vowel perception has focused on the lowest formant frequencies, typically F1, F2, and F3. Lehiste and Peterson [Phonetica 4, 161-177 (1959)] investigated identification accuracy of naturally produced male vowels composed of various amounts of low- and high-frequency content. Results showed near-chance identification performance for vowel segments containing only spectral information above 3.5 kHz. The authors concluded that high-frequency information was of minor importance for vowel identification. The current experiments report identification accuracy for high-pass filtered vowels produced by two male, two female, and two child talkers using both between- and within-subject designs. Identification performance was found to be significantly above chance for the majority of vowels even after high-pass filtering to remove spectral content below 3.0-3.5 kHz. Additionally, the filtered vowels having the highest fundamental frequency (child talkers) often had the highest identification accuracy scores. Linear discriminant function analysis mirrored perceptual performance when using spectral peak information between 3 and 12 kHz.
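The high-pass filtering manipulation described here can be sketched in R with the signal package; the synthetic "vowel", filter order, and cutoff handling below are assumptions for illustration, not the study's stimulus pipeline.

```r
library(signal)

fs <- 16000
t  <- seq(0, 0.5, by = 1 / fs)
# Toy "vowel": harmonics of a 120-Hz fundamental with low-frequency emphasis.
x  <- rowSums(sapply(1:40, function(h) sin(2 * pi * 120 * h * t) / h))

# 4th-order Butterworth high-pass at 3.5 kHz (cutoff normalized to Nyquist),
# applied forward and backward for zero-phase filtering.
hp <- butter(4, 3500 / (fs / 2), type = "high")
y  <- filtfilt(hp, x)
```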
Affiliation(s)
- Jeremy J Donai
- Department of Communication Sciences and Disorders, West Virginia University, P.O. Box 6122, Morgantown, West Virginia 26506
- D Dwayne Paschall
- Texas Tech University Health Sciences Center, Texas Tech University, 3601 4th Street, Lubbock, Texas 79430
27
Kriengwatana B, Escudero P, ten Cate C. Revisiting vocal perception in non-human animals: a review of vowel discrimination, speaker voice recognition, and speaker normalization. Front Psychol 2015; 5:1543. [PMID: 25628583] [PMCID: PMC4292401] [DOI: 10.3389/fpsyg.2014.01543]
Abstract
The extent to which human speech perception evolved by taking advantage of predispositions and pre-existing features of vertebrate auditory and cognitive systems remains a central question in the evolution of speech. This paper reviews asymmetries in vowel perception, speaker voice recognition, and speaker normalization in non-human animals: topics that have not been thoroughly discussed in relation to the abilities of non-human animals, but are nonetheless important aspects of vocal perception. Throughout this paper we demonstrate that addressing these issues in non-human animals is relevant and worthwhile because many non-human animals must deal with similar issues in their natural environment. That is, they must also discriminate between similar-sounding vocalizations, determine signaler identity from vocalizations, and resolve signaler-dependent variation in vocalizations from conspecifics. Overall, we find that, although plausible, the current evidence is insufficiently strong to conclude that directional asymmetries in vowel perception are specific to humans, or that non-human animals can use voice characteristics to recognize human individuals. However, we do find some indication that non-human animals can normalize speaker differences. Accordingly, we identify avenues for future research that would greatly improve and advance our understanding of these topics.
Collapse
Affiliation(s)
- Buddhamas Kriengwatana
- Behavioural Biology, Institute for Biology Leiden, Leiden University, Leiden, Netherlands
- Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands
| | - Paola Escudero
- The MARCS Institute, University of Western Sydney, Sydney, NSW, Australia
| | - Carel ten Cate
- Behavioural Biology, Institute for Biology Leiden, Leiden University, Leiden, Netherlands
- Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands
| |
Collapse
|
28
|
Liu C, Jin SH, Chen CT. Durations of American English vowels by native and non-native speakers: acoustic analyses and perceptual effects. LANGUAGE AND SPEECH 2014; 57:238-253. [PMID: 25102608 DOI: 10.1177/0023830913507692] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/03/2023]
Abstract
The goal of this study was to examine the durations of American English vowels produced by English-, Chinese-, and Korean-native speakers, and the effects of vowel duration on vowel intelligibility. Twelve American English vowels were recorded in the /hVd/ phonetic context by native and non-native speakers. The pattern of English vowel durations across vowel categories produced by non-native speakers was generally similar to that produced by native speakers. These results imply that using duration differences across vowels may be an important strategy for non-native speakers' production before they are able to employ spectral cues to produce and perceive English speech sounds. In the intelligibility experiment, vowels were selected from 10 native and non-native speakers, and vowel durations were equalized at 170 ms. The intelligibility of vowels with original and equalized durations was evaluated by native listeners of American English. Results suggested that the vowel intelligibility of native and non-native speakers degraded only slightly (by 3-8%) when durations were equalized, indicating that vowel duration plays a minor role in vowel intelligibility.
Collapse
|
29
|
Smith DRR. Does knowing speaker sex facilitate vowel recognition at short durations? Acta Psychol (Amst) 2014; 148:81-90. [PMID: 24486810 DOI: 10.1016/j.actpsy.2014.01.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2011] [Revised: 01/06/2014] [Accepted: 01/08/2014] [Indexed: 12/01/2022] Open
Abstract
A man, woman, or child saying the same vowel does so with a very different voice. The auditory system solves the complex problem of extracting what has been said despite substantial differences in the acoustic properties of the speakers' voices. Much of the acoustic variation between the voices of men and women is due to differences in the underlying anatomical mechanisms for producing speech. If the auditory system knew the sex of the speaker, it could potentially correct for sex-related acoustic variation, thus facilitating vowel recognition. This study measured the minimum stimulus duration necessary to discriminate accurately whether a brief vowel segment was spoken by a man or a woman, and the minimum stimulus duration necessary to recognise accurately which vowel was spoken. Results showed that reliable vowel recognition precedes reliable speaker-sex discrimination, questioning the use of speaker-sex information in compensating for sex-related acoustic variation in the voice. Furthermore, the pattern of performance across experiments in which the fundamental frequency and formant frequency information of speakers' voices was systematically varied differed markedly depending on whether the task was speaker-sex discrimination or vowel recognition. This argues for there being little relationship, at short durations, between the perception of speaker sex (indexical information) and the perception of what has been said (linguistic information).
Collapse
Affiliation(s)
- David R R Smith
- Department of Psychology, University of Hull, Cottingham Road, Hull HU6 7RX, United Kingdom.
| |
Collapse
|
30
|
Rao A, Carney LH. Speech enhancement for listeners with hearing loss based on a model for vowel coding in the auditory midbrain. IEEE Trans Biomed Eng 2014; 61:2081-91. [PMID: 24686228 DOI: 10.1109/tbme.2014.2313618] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
A novel signal-processing strategy is proposed to enhance speech for listeners with hearing loss. The strategy focuses on improving vowel perception based on a recent hypothesis for vowel coding in the auditory system. Traditionally, studies of neural vowel encoding have focused on the representation of formants (peaks in vowel spectra) in the discharge patterns of the population of auditory-nerve (AN) fibers. A recent hypothesis focuses instead on vowel encoding in the auditory midbrain, and suggests a robust representation of formants. AN fiber discharge rates are characterized by pitch-related fluctuations having frequency-dependent modulation depths. Fibers tuned to frequencies near formants exhibit weaker pitch-related fluctuations than those tuned to frequencies between formants. Many auditory midbrain neurons show tuning to amplitude modulation frequency in addition to audio frequency. According to the auditory midbrain vowel encoding hypothesis, the response map of a population of midbrain neurons tuned to modulations near voice pitch exhibits minima near formant frequencies, due to the lack of strong pitch-related fluctuations at their inputs. This representation is robust over the range of noise conditions in which speech intelligibility is also robust for normal-hearing listeners. Based on this hypothesis, a vowel-enhancement strategy has been proposed that aims to restore vowel encoding at the level of the auditory midbrain. The signal processing consists of pitch tracking, formant tracking, and formant enhancement. The novel formant-tracking method proposed here estimates the first two formant frequencies by modeling characteristics of the auditory periphery, such as saturated discharge rates of AN fibers and modulation tuning properties of auditory midbrain neurons. The formant enhancement stage aims to restore the representation of formants at the level of the midbrain by increasing the dominance of a single harmonic near each formant and saturating that frequency channel. A MATLAB implementation of the system with low computational complexity was developed. Objective tests of the formant-tracking subsystem on vowels suggest that the method generalizes well over a wide range of speakers and vowels.
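As a toy illustration of one stage of this strategy, the R sketch below selects the harmonic nearest a tracked formant, the candidate for the single dominant harmonic; gain application and channel saturation are omitted, and all input values are assumed.

```r
# Toy sketch: find the harmonic closest to a tracked formant, the
# selection step of the enhancement strategy described above.
nearest_harmonic <- function(formant_hz, f0_hz) {
  k <- max(1, round(formant_hz / f0_hz))   # harmonic number closest to formant
  c(harmonic = k, freq_hz = k * f0_hz)     # that harmonic and its frequency
}
nearest_harmonic(formant_hz = 620, f0_hz = 110)   # 6th harmonic, 660 Hz
```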
Collapse
|
31
|
Mesgarani N, Cheung C, Johnson K, Chang EF. Phonetic feature encoding in human superior temporal gyrus. Science 2014; 343:1006-10. [PMID: 24482117 DOI: 10.1126/science.1245994] [Citation(s) in RCA: 499] [Impact Index Per Article: 49.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
During speech perception, linguistic elements such as consonants and vowels are extracted from a complex acoustic speech signal. The superior temporal gyrus (STG) participates in high-order auditory processing of speech, but how it encodes phonetic information is poorly understood. We used high-density direct cortical surface recordings in humans while they listened to natural, continuous speech to reveal the STG representation of the entire English phonetic inventory. At single electrodes, we found response selectivity to distinct phonetic features. Encoding of acoustic properties was mediated by a distributed population response. Phonetic features could be directly related to tuning for spectrotemporal acoustic cues, some of which were encoded in a nonlinear fashion or by integration of multiple cues. These findings demonstrate the acoustic-phonetic representation of speech in human STG.
Collapse
Affiliation(s)
- Nima Mesgarani
- Department of Neurological Surgery, Department of Physiology, and Center for Integrative Neuroscience, University of California, San Francisco, CA 94143, USA
| | | | | | | |
Collapse
|
32
|
Vitela AD, Warner N, Lotto AJ. Perceptual compensation for differences in speaking style. Front Psychol 2013; 4:399. [PMID: 23847573 PMCID: PMC3698514 DOI: 10.3389/fpsyg.2013.00399] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2013] [Accepted: 06/13/2013] [Indexed: 11/13/2022] Open
Abstract
It is well-established that listeners will shift their categorization of a target vowel as a function of acoustic characteristics of a preceding carrier phrase (CP). These results have been interpreted as an example of perceptual normalization for variability resulting from differences in talker anatomy. The present study examined whether listeners would normalize for acoustic variability resulting from differences in speaking style within a single talker. Two vowel series were synthesized that varied between central and peripheral vowels (the vowels in "beat"-"bit" and "bod"-"bud"). Each member of the series was appended to one of four CPs that were spoken in either a "clear" or "reduced" speech style. Participants categorized vowels in these eight contexts. A reliable shift in categorization as a function of speaking style was obtained for three of four phrase sets. This demonstrates that phrase context effects can be obtained with a single talker. However, the directions of the obtained shifts are not reliably predicted on the basis of the speaking style of the talker. Instead, it appears that the effect is determined by an interaction of the average spectrum of the phrase with the target vowel.
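Since the proposed explanation appeals to the average spectrum of the carrier phrase, a rough R sketch of a long-term average spectrum (LTAS) follows; the waveform, sampling rate, and frame length are illustrative assumptions rather than the authors' analysis settings.

```r
# Rough sketch of a long-term average spectrum (LTAS) for a phrase:
# average the power spectra of consecutive frames.
ltas <- function(phrase, fs, frame_len = 512) {
  n_frames <- length(phrase) %/% frame_len
  frames   <- matrix(phrase[1:(n_frames * frame_len)], nrow = frame_len)
  spectra  <- apply(frames, 2, function(fr) abs(fft(fr))^2)
  avg      <- rowMeans(spectra)[1:(frame_len / 2)]   # positive frequencies
  data.frame(freq_hz  = (0:(frame_len / 2 - 1)) * fs / frame_len,
             power_db = 10 * log10(avg))
}
head(ltas(rnorm(44100), fs = 44100))   # stand-in noise; use a real phrase
```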
Collapse
Affiliation(s)
- A. Davi Vitela
- Speech, Language and Hearing Sciences, University of Arizona, Tucson, AZ, USA
| | - Natasha Warner
- Department of Linguistics, University of Arizona, Tucson, AZ, USA
| | - Andrew J. Lotto
- Speech, Language and Hearing Sciences, University of Arizona, Tucson, AZ, USA
| |
Collapse
|
33
|
Tuomainen J, Savela J, Obleser J, Aaltonen O. Attention modulates the use of spectral attributes in vowel discrimination: behavioral and event-related potential evidence. Brain Res 2012; 1490:170-83. [PMID: 23174416 DOI: 10.1016/j.brainres.2012.10.067] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2012] [Revised: 10/26/2012] [Accepted: 10/31/2012] [Indexed: 10/27/2022]
Abstract
Speech contains a variety of acoustic cues to auditory and phonetic contrasts that are exploited by the listener in decoding the acoustic signal. In three experiments, we tried to elucidate whether listeners rely on formant peak frequencies or whole spectrum attributes in vowel discrimination. We created two vowel continua in which the acoustic distance in formant frequencies was constant but the continua differed in spectral moments (i.e., the whole spectrum modeled as a probability density function). In Experiment 1, we measured reaction times and response accuracy while listeners performed a go/no-go discrimination task. The results indicated that the performance of the listeners was based on the spectral moments (especially the first and second moments), and not on formant peaks. Behavioral results in Experiment 2 showed that, when the stimuli were presented in noise eliminating differences in spectral moments between the two continua, listeners employed formant peak frequencies. In Experiment 3, using the same listeners and stimuli as in Experiment 1, we measured an automatic brain potential, the mismatch negativity (MMN), when listeners did not attend to the auditory stimuli. Results showed that the MMN reflects sensitivity only to the formant structure of the vowels. We suggest that the auditory cortex automatically and pre-attentively encodes formant peak frequencies, whereas attention can be deployed for processing additional spectral information, such as spectral moments, to enhance vowel discrimination.
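The spectral moments at issue treat the power spectrum as a probability density function. A minimal R sketch of the first two moments (centre of gravity and spectral standard deviation) follows, with assumed inputs and an arbitrary windowing choice.

```r
# Sketch: first two spectral moments of a sound segment, treating the
# normalized power spectrum as a probability density function.
library(signal)  # hanning()
spectral_moments <- function(x, fs) {
  n    <- length(x)
  pow  <- abs(fft(x * hanning(n)))^2
  half <- 1:(n %/% 2)                        # positive frequencies only
  freq <- (half - 1) * fs / n
  p    <- pow[half] / sum(pow[half])         # normalize to a PDF
  m1   <- sum(freq * p)                      # first moment: centre of gravity
  m2   <- sqrt(sum((freq - m1)^2 * p))       # second moment: spectral SD
  c(cog_hz = m1, sd_hz = m2)
}
spectral_moments(rnorm(2048), fs = 16000)    # stand-in noise; use a vowel
```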
Collapse
Affiliation(s)
- J Tuomainen
- Department of Speech, Hearing and Phonetic Sciences, University College London, UK.
| | | | | | | |
Collapse
|
34
|
Perkell JS. Movement goals and feedback and feedforward control mechanisms in speech production. JOURNAL OF NEUROLINGUISTICS 2012; 25:382-407. [PMID: 22661828 PMCID: PMC3361736 DOI: 10.1016/j.jneuroling.2010.02.011] [Citation(s) in RCA: 61] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Studies of speech motor control are described that support a theoretical framework in which fundamental control variables for phonemic movements are multi-dimensional regions in auditory and somatosensory spaces. Auditory feedback is used to acquire and maintain auditory goals and in the development and function of feedback and feedforward control mechanisms. Several lines of evidence support the idea that speakers with more acute sensory discrimination acquire more distinct goal regions and therefore produce speech sounds with greater contrast. Feedback modification findings indicate that fluently produced sound sequences are encoded as feedforward commands, and feedback control serves to correct mismatches between expected and produced sensory consequences.
Collapse
Affiliation(s)
- Joseph S Perkell
- Speech Communication Group, Massachusetts Institute of Technology, Research Laboratory of Electronics, Room 36-591, 50 Vassar St., Cambridge, MA 02139-4307, United States
| |
Collapse
|
35
|
Andreeva NG, Kulikov GA. Comparative analysis of the acoustic parameters of vowels in child and adult speech. DOKLADY BIOLOGICAL SCIENCES : PROCEEDINGS OF THE ACADEMY OF SCIENCES OF THE USSR, BIOLOGICAL SCIENCES SECTIONS 2012; 445:207-209. [PMID: 22945517 DOI: 10.1134/s0012496612040011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/26/2012] [Indexed: 06/01/2023]
Affiliation(s)
- N G Andreeva
- St. Petersburg State University, St. Petersburg, Russia
| | | |
Collapse
|
36
|
Katseff S, Houde J, Johnson K. Partial compensation for altered auditory feedback: a tradeoff with somatosensory feedback? LANGUAGE AND SPEECH 2012; 55:295-308. [PMID: 22783636 DOI: 10.1177/0023830911417802] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/14/2023]
Abstract
Talkers are known to compensate only partially for experimentally induced changes to their auditory feedback. In a typical experiment, talkers might hear their F1 feedback shifted higher (so that /ɛ/ sounds like /æ/, for example) and compensate by lowering F1 in their subsequent speech by about a quarter of that distance. Here, we sought to characterize and understand partial compensation by examining how talkers respond to each step on a staircase of increasing shifts in auditory feedback. Subjects wore an apparatus that altered their real-time auditory feedback. They were asked to repeat visually presented /hVd/ stimulus words while feedback was altered stepwise over the course of 360 trials. We used a novel analysis method to calculate each subject's compensation at each step of the staircase relative to their baseline. Results demonstrated that subjects compensated more for small feedback shifts than for larger shifts. We suggest that this pattern is consistent with vowel targets that incorporate both auditory and somatosensory information, and with a speech motor control system driven by differential weighting of auditory and somatosensory feedback.
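A hedged R sketch of a per-step compensation measure consistent with this description follows; the paper's actual analysis method is more involved, and all numbers are invented for illustration.

```r
# Sketch: fraction of an F1 feedback shift that a talker undoes at
# each staircase step, relative to an unshifted baseline.
baseline_f1 <- 700                        # mean F1 (Hz) on unshifted trials
shift_hz    <- c(25, 50, 75, 100)         # staircase of feedback shifts (Hz)
produced_f1 <- c(690, 680, 672, 668)      # mean produced F1 at each step
compensation <- -(produced_f1 - baseline_f1) / shift_hz
round(compensation, 2)  # 0.40 0.40 0.37 0.32: larger shifts, smaller fraction
```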
Collapse
Affiliation(s)
- Shira Katseff
- University of California at Berkeley, Berkeley, CA 94720-2650, USA.
| | | | | |
Collapse
|
37
|
A Comprehensive Vowel Space for Whispered Speech. J Voice 2012; 26:e49-56. [PMID: 21550772 DOI: 10.1016/j.jvoice.2010.12.002] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2010] [Accepted: 12/06/2010] [Indexed: 11/21/2022]
|
38
|
Kohn ME, Farrington C. Evaluating acoustic speaker normalization algorithms: evidence from longitudinal child data. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2012; 131:2237-2248. [PMID: 22423719 DOI: 10.1121/1.3682061] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
Speaker vowel formant normalization, a technique that controls for variation introduced by physical differences between speakers, is necessary in variationist studies to compare speakers of different ages, genders, and physiological makeup in order to understand non-physiological variation patterns within populations. Many algorithms have been established to reduce variation introduced into vocalic data from physiological sources. The lack of real-time studies tracking the effectiveness of these normalization algorithms from childhood through adolescence inhibits exploration of child participation in vowel shifts. This analysis compares normalization techniques applied to data collected from ten African American children across five time points. Linear regressions compare the reduction in variation attributable to age and gender for each speaker for the vowels BEET, BAT, BOT, BUT, and BOAR. A normalization technique is successful if it maintains variation attributable to a reference sociolinguistic variable, while reducing variation attributable to age. Results indicate that normalization techniques which rely on both a measure of central tendency and range of the vowel space perform best at reducing variation attributable to age, although some variation attributable to age persists after normalization for some sections of the vowel space.
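One widely used technique in the family the authors find most effective (central tendency plus range) is Lobanov z-score normalization, sketched below in R for an assumed data frame of speaker-labelled formant measurements; this illustrates the class of algorithm, not the authors' exact set.

```r
# Sketch: Lobanov normalization, z-scoring each formant within speaker,
# i.e., a central-tendency (mean) plus range (SD) method.
library(dplyr)
vowels <- data.frame(speaker = rep(c("s1", "s2"), each = 3),
                     F1 = c(300, 600, 800, 350, 700, 950),
                     F2 = c(2300, 1200, 1400, 2600, 1300, 1500))
vowels_norm <- vowels |>
  group_by(speaker) |>
  mutate(F1_z = (F1 - mean(F1)) / sd(F1),
         F2_z = (F2 - mean(F2)) / sd(F2)) |>
  ungroup()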
Collapse
Affiliation(s)
- Mary Elizabeth Kohn
- Department of Linguistics, The University of North Carolina at Chapel Hill, 104A Smith Building, CB #3155, Chapel Hill, North Carolina 27599-3155, USA
| | | |
Collapse
|
39
|
Yoon YS, Li Y, Fu QJ. Speech recognition and acoustic features in combined electric and acoustic stimulation. JOURNAL OF SPEECH, LANGUAGE, AND HEARING RESEARCH : JSLHR 2012; 55:105-24. [PMID: 22199183 PMCID: PMC3288603 DOI: 10.1044/1092-4388(2011/10-0325)] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2023]
Abstract
PURPOSE In this study, the authors aimed to identify speech information processed by a hearing aid (HA) that is additive to information processed by a cochlear implant (CI) as a function of signal-to-noise ratio (SNR). METHOD Speech recognition was measured with CI alone, HA alone, and CI + HA. Ten participants were separated into two groups: good (aided pure-tone average [PTA] < 55 dB) and poor (aided PTA ≥ 55 dB) at audiometric frequencies ≤ 1 kHz in the HA. RESULTS Results showed that the good-aided PTA group derived a clear bimodal benefit (performance difference between CI + HA and CI alone) for vowel and sentence recognition in noise, whereas the poor-aided PTA group received little benefit across speech tests and SNRs. Results also showed that a better aided PTA helped in processing cues embedded in both low and high frequencies; none of these cues was significantly perceived by the poor-aided PTA group. CONCLUSIONS The aided PTA is an important indicator of bimodal advantage in speech perception. The lack of bimodal benefit in the poor group may be attributable to nonoptimal HA fitting. Bimodal listening provides a synergistic effect for cues in both low- and high-frequency components of speech.
Collapse
Affiliation(s)
- Yang-soo Yoon
- Communication and Neuroscience Division, House Ear Institute, 2100 W. 3 St., Los Angeles, CA 90057
| | - Yongxin Li
- Department of Otolaryngology, Head and Neck Surgery, Beijing TongRen Hospital, Capital Medical University, Key Laboratory of Otolaryngology, Head and Neck Surgery, Ministry of Education of China, Beijing, People's Republic of China 100730
| | - Qian-Jie Fu
- Communication and Neuroscience Division, House Ear Institute, 2100 W. 3 St., Los Angeles, CA 90057
| |
Collapse
|
40
|
Barreda S, Nearey TM. The direct and indirect roles of fundamental frequency in vowel perception. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2012; 131:466-477. [PMID: 22280608 DOI: 10.1121/1.3662068] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
Several experiments have found that changing the intrinsic f0 of a vowel can have an effect on perceived vowel quality. It has been suggested that these shifts may occur because f0 is involved in the specification of vowel quality in the same way as the formant frequencies. Another possibility is that f0 affects vowel quality indirectly, by changing a listener's assumptions about characteristics of a speaker who is likely to have uttered the vowel. In the experiment outlined here, participants were asked to listen to vowels differing in terms of f0 and their formant frequencies and report vowel quality and the apparent speaker's gender and size on a trial-by-trial basis. The results presented here suggest that f0 affects vowel quality mainly indirectly via its effects on the apparent-speaker characteristics; however, f0 may also have some residual direct effects on vowel quality. Furthermore, the formant frequencies were also found to have significant indirect effects on vowel quality by way of their strong influence on the apparent speaker.
Collapse
Affiliation(s)
- Santiago Barreda
- Department of Linguistics, University of Alberta, Edmonton, Alberta T6G 2E7, Canada.
| | | |
Collapse
|
41
|
Fox RA, Jacewicz E, Chang CY. Auditory spectral integration in the perception of static vowels. JOURNAL OF SPEECH, LANGUAGE, AND HEARING RESEARCH : JSLHR 2011; 54:1667-1681. [PMID: 21862680 PMCID: PMC4486011 DOI: 10.1044/1092-4388(2011/09-0279)] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
PURPOSE To evaluate potential contributions of broadband spectral integration in the perception of static vowels. Specifically, can the auditory system infer formant frequency information from changes in the intensity weighting across harmonics when the formant itself is missing? Does this type of integration produce the same results in the lower (first formant [F1]) and higher (second formant [F2]) regions? Does the spacing between the spectral components affect a listener's ability to integrate the acoustic cues? METHOD Twenty young listeners with normal hearing identified synthesized vowel-like stimuli created for adjustments in the F1 region (/ʌ/-/ɑ/, /i/-/ɛ/) and in the F2 region (/ʌ/-/æ/). There were 2 types of stimuli: (a) 2-formant tokens and (b) tokens in which 1 formant was removed and 2 pairs of sine waves were inserted below and above the missing formant; the intensities of these harmonics were modified to cause variations in their spectral center of gravity (COG). The COG effects were tested over a wide range of frequencies. RESULTS Obtained patterns were consistent with calculated changes to the spectral COG, in both the F1 and F2 regions. The spacing of the sine waves did not affect listeners' responses. CONCLUSION The auditory system may perform broadband integration as a type of auditory wideband spectral analysis.
Collapse
|
42
|
Feng Y, Gracco VL, Max L. Integration of auditory and somatosensory error signals in the neural control of speech movements. J Neurophysiol 2011; 106:667-79. [PMID: 21562187 DOI: 10.1152/jn.00638.2010] [Citation(s) in RCA: 62] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
We investigated auditory and somatosensory feedback contributions to the neural control of speech. In task I, sensorimotor adaptation was studied by perturbing one of these sensory modalities or both modalities simultaneously. The first formant (F1) frequency in the auditory feedback was shifted up by a real-time processor and/or the extent of jaw opening was increased or decreased with a force field applied by a robotic device. All eight subjects lowered F1 to compensate for the up-shifted F1 in the feedback signal regardless of whether or not the jaw was perturbed. Adaptive changes in subjects' acoustic output resulted from adjustments in articulatory movements of the jaw or tongue. Adaptation in jaw opening extent in response to the mechanical perturbation occurred only when no auditory feedback perturbation was applied or when the direction of adaptation to the force was compatible with the direction of adaptation to a simultaneous acoustic perturbation. In tasks II and III, subjects' auditory and somatosensory precision and accuracy were estimated. Correlation analyses showed that the relationships 1) between F1 adaptation extent and auditory acuity for F1 and 2) between jaw position adaptation extent and somatosensory acuity for jaw position were weak and statistically not significant. Taken together, the combined findings from this work suggest that, in speech production, sensorimotor adaptation updates the underlying control mechanisms in such a way that the planning of vowel-related articulatory movements takes into account a complex integration of error signals from previous trials but likely with a dominant role for the auditory modality.
Collapse
Affiliation(s)
- Yongqiang Feng
- Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing, China
| | | | | |
Collapse
|
43
|
Monahan PJ, Idsardi WJ. Auditory Sensitivity to Formant Ratios: Toward an Account of Vowel Normalization. LANGUAGE AND COGNITIVE PROCESSES 2010; 25:808-839. [PMID: 20606713 PMCID: PMC2893733 DOI: 10.1080/01690965.2010.490047] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Abstract
A long-standing question in speech perception research is how listeners extract linguistic content from a highly variable acoustic input. In the domain of vowel perception, formant ratios, or the calculation of relative bark differences between vowel formants, have been a sporadically proposed solution. We propose a novel formant ratio algorithm in which the first (F1) and second (F2) formants are compared against the third formant (F3). Results from two magnetoencephalographic (MEG) experiments are presented that suggest auditory cortex is sensitive to formant ratios. Our findings also demonstrate that the perceptual system shows heightened sensitivity to formant ratios for tokens located in more crowded regions of the vowel space. Additionally, we present statistical evidence that this algorithm eliminates speaker-dependent variation based on age and gender from vowel productions. We conclude that these results present an impetus to reconsider formant ratios as a legitimate mechanistic component in the solution to the problem of speaker normalization.
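A minimal R sketch of the core of such an algorithm, comparing F1 and F2 against F3 as bark differences, is given below; the Traunmüller (1990) Hz-to-bark conversion is an assumption, as the authors' exact conversion is not stated here.

```r
# Sketch: bark-difference formant ratios, F1 and F2 compared against F3.
bark <- function(f) 26.81 * f / (1960 + f) - 0.53   # Traunmueller (1990)

formant_ratios <- function(f1, f2, f3) {
  c(z3_z1 = bark(f3) - bark(f1),    # larger for higher vowels
    z3_z2 = bark(f3) - bark(f2))    # larger for back vowels
}
formant_ratios(f1 = 500, f2 = 1500, f3 = 2500)
```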
Collapse
Affiliation(s)
- Philip J. Monahan
- Basque Center on Cognition, Brain and Language, Donostia-San Sebastián, Spain
| | - William J. Idsardi
- Department of Linguistics, University of Maryland, USA
- Neuroscience and Cognitive Science Program University of Maryland, USA
| |
Collapse
|
44
|
Zhang T, Dorman MF, Spahr AJ. Information from the voice fundamental frequency (F0) region accounts for the majority of the benefit when acoustic stimulation is added to electric stimulation. Ear Hear 2010; 31:63-9. [PMID: 20050394 PMCID: PMC3684557 DOI: 10.1097/aud.0b013e3181b7190c] [Citation(s) in RCA: 138] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
OBJECTIVES The aim of this study was to determine the minimum amount of low-frequency acoustic information that is required to achieve speech perception benefit in listeners with a cochlear implant in one ear and low-frequency hearing in the other ear. DESIGN The recognition of monosyllabic words in quiet and sentences in noise was evaluated in three listening conditions: electric stimulation alone, acoustic stimulation alone, and combined electric and acoustic stimulation. The acoustic stimuli presented to the nonimplanted ear were either low-pass-filtered at 125, 250, 500, or 750 Hz, or unfiltered (wideband). RESULTS Adding low-frequency acoustic information to electrically stimulated information led to a significant improvement in word recognition in quiet and sentence recognition in noise. Improvement was observed in the electric and acoustic stimulation condition even when the acoustic information was limited to the 125-Hz-low-passed signal. Further improvement for the sentences in noise was observed when the acoustic signal was increased to wideband. CONCLUSIONS Information from the voice fundamental frequency (F0) region accounts for the majority of the speech perception benefit when acoustic stimulation is added to electric stimulation. We propose that, in quiet, low-frequency acoustic information leads to an improved representation of voicing, which in turn leads to a reduction in word candidates in the lexicon. In noise, the robust representation of voicing allows access to low-frequency acoustic landmarks that mark syllable structure and word boundaries. These landmarks can bootstrap word and sentence recognition.
Collapse
Affiliation(s)
- Ting Zhang
- University of Maryland at College Park, Maryland, USA.
| | | | | |
Collapse
|
45
|
Andreeva NG, Kulikov GA. Perceptive significance of frequency and amplitude characteristics of vowels with different fundamental frequency. DOKLADY BIOLOGICAL SCIENCES : PROCEEDINGS OF THE ACADEMY OF SCIENCES OF THE USSR, BIOLOGICAL SCIENCES SECTIONS 2009; 429:487-489. [PMID: 20170052 DOI: 10.1134/s0012496609060015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/28/2023]
Affiliation(s)
- N G Andreeva
- Faculty of Biology and Soil Sciences, St. Petersburg State University, Universitetskaya nab. 7/9, St. Petersburg, 199034 Russia
| | | |
Collapse
|
47
|
The role of f0 and formant frequencies in distinguishing the voices of men and women. Atten Percept Psychophys 2009; 71:1150-66. [PMID: 19525544 DOI: 10.3758/app.71.5.1150] [Citation(s) in RCA: 104] [Impact Index Per Article: 6.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
|
48
|
Turner RE, Walters TC, Monaghan JJM, Patterson RD. A statistical, formant-pattern model for segregating vowel type and vocal-tract length in developmental formant data. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2009; 125:2374-86. [PMID: 19354411 PMCID: PMC2824129 DOI: 10.1121/1.3079772] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/21/2023]
Abstract
This paper investigates the theoretical basis for estimating vocal-tract length (VTL) from the formant frequencies of vowel sounds. A statistical inference model was developed to characterize the relationship between vowel type and VTL, on the one hand, and formant frequency and vocal cavity size, on the other. The model was applied to two well known developmental studies of formant frequency. The results show that VTL is the major source of variability after vowel type and that the contribution due to other factors like developmental changes in oral-pharyngeal ratio is small relative to the residual measurement noise. The results suggest that speakers adjust the shape of the vocal tract as they grow to maintain a specific pattern of formant frequencies for individual vowels. This formant-pattern hypothesis motivates development of a statistical-inference model for estimating VTL from formant-frequency data. The technique is illustrated using a third developmental study of formant frequencies. The VTLs of the speakers are estimated and used to provide a more accurate description of the complicated relationship between VTL and glottal pulse rate as children mature into adults.
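The textbook single-tube relation that underlies acoustic VTL estimation can be sketched in a few lines of R: for a uniform tube closed at the glottis and open at the lips, Fn = (2n - 1) * c / (4 * VTL), so each measured formant yields a VTL estimate. This is the idealized relation, not the paper's statistical inference model; the speed of sound is an assumed constant.

```r
# Sketch: apparent vocal-tract length from formants under the uniform
# closed-open tube model, averaging the per-formant estimates.
estimate_vtl <- function(formants_hz, c_cm_s = 35000) {   # speed of sound, cm/s
  n <- seq_along(formants_hz)
  mean((2 * n - 1) * c_cm_s / (4 * formants_hz))
}
estimate_vtl(c(500, 1500, 2500))   # 17.5 cm, a schwa-like adult male pattern
```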
Collapse
Affiliation(s)
- Richard E Turner
- Gatsby Computational Neuroscience Unit, Alexandra House, 17 Queen Square, London, United Kingdom.
| | | | | | | |
Collapse
|
49
|
Ames H, Grossberg S. Speaker normalization using cortical strip maps: a neural model for steady-state vowel categorization. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2008; 124:3918-3936. [PMID: 19206817 DOI: 10.1121/1.2997478] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
Auditory signals of speech are speaker dependent, but representations of language meaning are speaker independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by adaptive resonance theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [Peterson, G. E., and Barney, H.L., J. Acoust. Soc. Am. 24, 175-184 (1952).] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
Collapse
Affiliation(s)
- Heather Ames
- Department of Cognitive and Neural Systems, Center for Adaptive Systems, and Center of Excellence for Learning in Education, Science, and Technology, Boston University, Boston, Massachusetts 02215, USA
| | | |
Collapse
|
50
|
Assmann PF, Nearey TM. Identification of frequency-shifted vowels. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2008; 124:3203-3212. [PMID: 19045804 DOI: 10.1121/1.2980456] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
Within certain limits, speech intelligibility is preserved with upward or downward scaling of the spectral envelope. To study these limits and assess their interaction with fundamental frequency (F0), vowels in /hVd/ syllables were processed using the STRAIGHT vocoder and presented to listeners for identification. Identification accuracy showed a gradual decline when the spectral envelope was scaled up or down in vowels spoken by men, women, and children. Upward spectral envelope shifts led to poorer identification of children's vowels compared to adults, while downward shifts had a greater impact on men's vowels compared to women and children. Coordinated shifts (F0 and spectral envelope shifted in the same direction) generally produced higher accuracy than conditions with F0 and spectral envelope shifted in opposite directions. Vowel identification was poorest in conditions with very high F0, consistent with suggestions from the literature that sparse sampling of the spectral envelope may be a factor in vowel identification. However, the gradual decline in accuracy as a function of both upward and downward spectral envelope shifts and the interaction between spectral envelope shifts and F0 suggests the additional operation of perceptual mechanisms sensitive to the statistical covariation of F0 and formant frequencies in natural speech.
Collapse
Affiliation(s)
- Peter F Assmann
- School of Behavioral and Brain Sciences, University of Texas at Dallas, Richardson, Texas 75083-0688, USA.
| | | |
Collapse
|