1
Sarrett ME, Toscano JC. Decoding speech sounds from neurophysiological data: Practical considerations and theoretical implications. Psychophysiology 2024; 61:e14475. [PMID: 37947235] [DOI: 10.1111/psyp.14475]
Abstract
Machine learning techniques have proven to be a useful tool in cognitive neuroscience. However, their implementation in scalp-recorded electroencephalography (EEG) is relatively limited. To address this, we present three analyses using data from a previous study that examined event-related potential (ERP) responses to a wide range of naturally-produced speech sounds. First, we explore which features of the EEG signal best maximize machine learning accuracy for a voicing distinction, using a support vector machine (SVM). We manipulate three dimensions of the EEG signal as input to the SVM: number of trials averaged, number of time points averaged, and polynomial fit. We discuss the trade-offs in using different feature sets and offer some recommendations for researchers using machine learning. Next, we use SVMs to classify specific pairs of phonemes, finding that we can detect differences in the EEG signal that are not otherwise detectable using conventional ERP analyses. Finally, we characterize the timecourse of phonetic feature decoding across three phonological dimensions (voicing, manner of articulation, and place of articulation), and find that voicing and manner are decodable from neural activity, whereas place of articulation is not. This set of analyses addresses both practical considerations in the application of machine learning to EEG, particularly for speech studies, and also sheds light on current issues regarding the nature of perceptual representations of speech.
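The decoding approach described above can be illustrated with a brief, hypothetical sketch. The code below is not the authors' pipeline: it builds synthetic single-channel EEG-like epochs, averages small groups of same-label trials and adjacent time points (two of the feature manipulations discussed), and cross-validates a linear support vector machine on the result; all sizes and parameters are assumptions for illustration.

```python
# Hypothetical sketch (not the published pipeline): decode a binary voicing label
# from synthetic EEG-like epochs with an SVM, after averaging trials and time points.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials, n_times = 200, 300                 # assumed epochs x samples
labels = rng.integers(0, 2, n_trials)        # 0 = voiceless, 1 = voiced
eeg = rng.normal(size=(n_trials, n_times))
eeg[labels == 1, 100:150] += 0.5             # inject a small class difference

def average_features(X, y, n_avg_trials=5, n_avg_times=10):
    """Average groups of same-label trials, then average adjacent time points."""
    Xs, ys = [], []
    for lab in np.unique(y):
        idx = np.flatnonzero(y == lab)
        for start in range(0, len(idx) - n_avg_trials + 1, n_avg_trials):
            avg = X[idx[start:start + n_avg_trials]].mean(axis=0)
            Xs.append(avg.reshape(-1, n_avg_times).mean(axis=1))  # time-point averaging
            ys.append(lab)
    return np.array(Xs), np.array(ys)

X, y = average_features(eeg, labels)
print("decoding accuracy:", cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean())
```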
Affiliation(s)
- McCall E Sarrett
- Department of Psychological and Brain Sciences, Villanova University, Villanova, Pennsylvania, USA
- Psychology Department, Gonzaga University, Spokane, Washington, USA
- Joseph C Toscano
- Department of Psychological and Brain Sciences, Villanova University, Villanova, Pennsylvania, USA
2
Whalen DH. Direct neural coding of speech: Reconsideration of Whalen et al. (2006) (L). J Acoust Soc Am 2024; 155:1704-1706. [PMID: 38426833] [PMCID: PMC10908555] [DOI: 10.1121/10.0025125]
Abstract
Previous brain imaging results indicated that speech perception proceeded independently of the auditory primitives that are the product of primary auditory cortex [Whalen, Benson, Richardson, Swainson, Clark, Lai, Mencl, Fulbright, Constable, and Liberman (2006). J. Acoust. Soc. Am. 119, 575-581]. Recent evidence using electrocorticography [Hamilton, Oganian, Hall, and Chang (2021). Cell 184, 4626-4639] indicates that there is a more direct connection from subcortical regions to cortical speech regions than previous studies had shown. Although the mechanism differs, the Hamilton, Oganian, Hall, and Chang result supports the original conclusion even more strongly: Speech perception does not rely on the analysis of primitives from auditory analysis. Rather, the speech signal is processed as speech from the beginning.
3
Leung KKW, Wang Y. Modelling Mandarin tone perception-production link through critical perceptual cues. J Acoust Soc Am 2024; 155:1451-1468. [PMID: 38364045] [DOI: 10.1121/10.0024890]
Abstract
Theoretical accounts posit a close link between speech perception and production, but empirical findings on this relationship are mixed. To explain this apparent contradiction, a proposed view is that a perception-production relationship should be established through the use of critical perceptual cues. This study examines this view by using Mandarin tones as a test case because the perceptual cues for Mandarin tones consist of perceptually critical pitch direction and noncritical pitch height cues. The defining features of critical and noncritical perceptual cues and the perception-production relationship of each cue for each tone were investigated. The perceptual stimuli in the perception experiment were created by varying one critical and one noncritical perceptual cue orthogonally. The cues for tones produced by the same group of native Mandarin participants were measured. This study found that the critical status of perceptual cues primarily influenced within-category and between-category perception for nearly all tones. Using cross-domain bidirectional statistical modelling, a perception-production link was found for the critical perceptual cue only. A stronger link was obtained when within-category and between-category perception data were included in the models as compared to using between-category perception data alone, suggesting a phonetically and phonologically driven perception-production relationship.
Affiliation(s)
- Keith K W Leung
- Department of Linguistics, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
- Yue Wang
- Department of Linguistics, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada
4
Goldenberg D, Tiede MK, Bennett RT, Whalen DH. Congruent aero-tactile stimuli bias perception of voicing continua. Front Hum Neurosci 2022; 16:879981. [PMID: 35911601] [PMCID: PMC9334670] [DOI: 10.3389/fnhum.2022.879981]
Abstract
Multimodal integration is the formation of a coherent percept from different sensory inputs such as vision, audition, and somatosensation. Most research on multimodal integration in speech perception has focused on audio-visual integration. In recent years, audio-tactile integration has also been investigated, and it has been established that puffs of air applied to the skin and timed with listening tasks shift the perception of voicing by naive listeners. The current study has replicated and extended these findings by testing the effect of air puffs on gradations of voice onset time along a continuum rather than the voiced and voiceless endpoints of the original work. Three continua were tested: bilabial (“pa/ba”), velar (“ka/ga”), and a vowel continuum (“head/hid”) used as a control. The presence of air puffs was found to significantly increase the likelihood of choosing voiceless responses for the two VOT continua but had no effect on choices for the vowel continuum. Analysis of response times revealed that the presence of air puffs lengthened responses for intermediate (ambiguous) stimuli and shortened them for endpoint (non-ambiguous) stimuli. The slowest response times were observed for the intermediate steps for all three continua, but for the bilabial continuum this effect interacted with the presence of air puffs: responses were slower in the presence of air puffs, and faster in their absence. This suggests that during integration auditory and aero-tactile inputs are weighted differently by the perceptual system, with the latter exerting greater influence in those cases where the auditory cues for voicing are ambiguous.
Affiliation(s)
- Mark K. Tiede
- Haskins Laboratories, New Haven, CT, United States
- *Correspondence: Mark K. Tiede
- Ryan T. Bennett
- Department of Linguistics, University of California, Santa Cruz, Santa Cruz, CA, United States
- D. H. Whalen
- Haskins Laboratories, New Haven, CT, United States
- The Graduate Center, City University of New York (CUNY), New York, NY, United States
- Department of Linguistics, Yale University, New Haven, CT, United States
5
King H, Chitoran I. Difficult to hear but easy to see: Audio-visual perception of the /r/-/w/ contrast in Anglo-English. J Acoust Soc Am 2022; 152:368. [PMID: 35931552] [DOI: 10.1121/10.0012660]
Abstract
This paper investigates the influence of visual cues in the perception of the /r/-/w/ contrast in Anglo-English. Audio-visual perception of Anglo-English /r/ warrants attention because productions are increasingly non-lingual, labiodental (e.g., [ʋ]), possibly involving visual prominence of the lips for the post-alveolar approximant [ɹ]. Forty native speakers identified [ɹ] and [w] stimuli in four presentation modalities: auditory-only, visual-only, congruous audio-visual, and incongruous audio-visual. Auditory stimuli were presented in noise. The results indicate that native Anglo-English speakers can identify [ɹ] and [w] from visual information alone with almost perfect accuracy. Furthermore, visual cues dominate the perception of the /r/-/w/ contrast when auditory and visual cues are mismatched. However, auditory perception is ambiguous because participants tend to perceive both [ɹ] and [w] as /r/. Auditory ambiguity is related to Anglo-English listeners' exposure to acoustic variation for /r/, especially to [ʋ], which is often confused with [w]. It is suggested that a specific labial configuration for Anglo-English /r/ encodes the contrast with /w/ visually, compensating for the ambiguous auditory contrast. An audio-visual enhancement hypothesis is proposed, and the findings are discussed with regard to sound change.
Affiliation(s)
- Hannah King
- Université Paris Cité, UFR Linguistique, CLILLAC-ARP, F-75013 Paris, France
- Ioana Chitoran
- Université Paris Cité, UFR Linguistique, CLILLAC-ARP, F-75013 Paris, France
6
Chen J, Chang H. Sketching the Landscape of Speech Perception Research (2000-2020): A Bibliometric Study. Front Psychol 2022; 13:822241. [PMID: 35719567] [PMCID: PMC9201966] [DOI: 10.3389/fpsyg.2022.822241]
Abstract
Based on 6,407 speech perception research articles published between 2000 and 2020, a bibliometric analysis was conducted to identify leading countries, research institutes, researchers, research collaboration networks, high-impact research articles, and central research themes and trends in speech perception research. Analysis of highly cited articles and researchers indicated three foundational theoretical approaches to speech perception (the motor theory, direct realism, and the computational approach) as well as four non-native speech perception models: the Speech Learning Model, the Perceptual Assimilation Model, the Native Language Magnet model, and the Second Language Linguistic Perception model. Citation networks, term frequency analysis, and co-word networks revealed several central research topics: audio-visual speech perception, spoken word recognition, and bilingual and infant/child speech perception and learning. Two directions for future research were also identified: (1) speech perception in clinical populations (e.g., children with hearing loss who use cochlear implants) and across the lifespan, from infancy to old age; (2) application of neurocognitive techniques to investigate the activation of different brain regions during speech perception. Our bibliometric analysis can facilitate research advancements and future collaborations among linguists, psychologists, and brain scientists by offering a bird's-eye view of this interdisciplinary field.
Affiliation(s)
- Juqiang Chen
- School of Foreign Languages, Shanghai Jiao Tong University, Shanghai, China
- Hui Chang
- School of Foreign Languages, Shanghai Jiao Tong University, Shanghai, China
7
Morett LM, Feiler JB, Getz LM. Elucidating the influences of embodiment and conceptual metaphor on lexical and non-speech tone learning. Cognition 2022; 222:105014. [DOI: 10.1016/j.cognition.2022.105014]
8
Computational Modelling of Tone Perception Based on Direct Processing of f0 Contours. Brain Sci 2022; 12:brainsci12030337. [PMID: 35326294] [PMCID: PMC8946547] [DOI: 10.3390/brainsci12030337]
Abstract
It has been widely assumed that in speech perception it is imperative to first detect a set of distinctive properties or features and then use them to recognize phonetic units like consonants, vowels, and tones. Those features can be auditory cues or articulatory gestures, or a combination of both. There have been no clear demonstrations of how exactly such a two-phase process would work in the perception of continuous speech, however. Here we used computational modelling to explore whether it is possible to recognize phonetic categories from syllable-sized continuous acoustic signals of connected speech without intermediate featural representations. We used Support Vector Machine (SVM) and Self-organizing Map (SOM) models to simulate tone perception in Mandarin, by either directly processing f0 trajectories or extracting various tonal features. The results show that direct tone recognition not only yields better performance than any of the feature extraction schemes, but also requires less computational power. These results suggest that prior extraction of features is unlikely to be the operational mechanism of speech perception.
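A minimal sketch of the "direct processing" route described here, using made-up tone shapes rather than the study's Mandarin data: syllable-sized f0 trajectories are fed to an SVM with no intermediate feature-extraction step. The tone templates, noise levels, and classifier settings below are all assumptions.

```python
# Illustrative only: classify raw f0 contours for four Mandarin-like tones with an SVM.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 30)                            # normalized time points per syllable
templates = {
    1: np.full_like(t, 250.0),                       # Tone 1: high level
    2: 200 + 60 * t,                                 # Tone 2: rising
    3: 220 - 320 * t * (1 - t),                      # Tone 3: dipping
    4: 280 - 90 * t,                                 # Tone 4: falling
}
X = np.vstack([f0 + rng.normal(scale=8, size=(50, t.size))
               for f0 in templates.values()])        # 50 noisy tokens per tone
y = np.repeat(list(templates.keys()), 50)

acc = cross_val_score(SVC(kernel="rbf", C=10), X, y, cv=5).mean()
print(f"tone accuracy from raw f0 contours: {acc:.2f}")
```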
9
Ito T, Ohashi H, Gracco VL. Somatosensory contribution to audio-visual speech processing. Cortex 2021; 143:195-204. [PMID: 34450567] [DOI: 10.1016/j.cortex.2021.07.013]
Abstract
Recent studies have demonstrated that the auditory speech perception of a listener can be modulated by somatosensory input applied to the facial skin, suggesting that perception is an embodied process. However, speech perception is a multisensory process involving both the auditory and visual modalities. It is unknown whether and to what extent somatosensory stimulation to the facial skin modulates audio-visual speech perception. If speech perception is an embodied process, then somatosensory stimulation applied to the perceiver should influence audio-visual speech processing. Using the McGurk effect (the perceptual illusion that occurs when a sound is paired with the visual representation of a different sound, resulting in the perception of a third sound), we tested this prediction using a simple behavioral paradigm and at the neural level using event-related potentials (ERPs) and their cortical sources. We recorded ERPs from 64 scalp sites in response to congruent and incongruent audio-visual speech randomly presented with and without somatosensory stimulation associated with facial skin deformation. Subjects judged whether the production was /ba/ or not under all stimulus conditions. In the congruent audio-visual condition, subjects identified the sound as /ba/, but not in the incongruent condition, consistent with the McGurk effect. Concurrent somatosensory stimulation improved the ability of participants to more correctly identify the production as /ba/ relative to the non-somatosensory condition in both congruent and incongruent conditions. The ERP in response to the somatosensory stimulation for the incongruent condition reliably diverged 220 msec after stimulation onset. Cortical sources were estimated around the left anterior temporal gyrus, the right middle temporal gyrus, the right posterior superior temporal lobe, and the right occipital region. The results demonstrate a clear multisensory convergence of somatosensory and audio-visual processing in both behavioral and neural processing, consistent with the perspective that speech perception is a self-referenced, sensorimotor process.
Affiliation(s)
- Takayuki Ito
- University Grenoble-Alpes, CNRS, Grenoble-INP, GIPSA-Lab, Saint Martin D'heres Cedex, France; Haskins Laboratories, New Haven, CT, USA.
- Vincent L Gracco
- Haskins Laboratories, New Haven, CT, USA; McGill University, Montréal, QC, Canada
10
Yang J, Davis BL, Diehl RL. The development of tonal duration in Mandarin-speaking children. JASA Express Letters 2021; 1:085202. [PMID: 36154246] [DOI: 10.1121/10.0005892]
Abstract
Developmental changes in suprasegmental tonal duration were investigated in monolingual Mandarin-speaking children. Tone durations were acoustically measured in five- and eight-year-old children and adults. Children's tone duration and variability decreased with age. Five-year-olds produced significantly longer tone durations than adults. Adult-like duration patterns existed in all children: Tone 4 was the shortest and tone 3 the longest. Duration differences between tones 2 and 3 became larger between five- and eight-year-olds. Results suggest a prolonged process of tone development beyond establishing phonological contrasts, which can be viewed as a hybrid of physiological production capacities and perceptual learning for maximal contrastivity.
Affiliation(s)
- Jie Yang
- Department of Communication Disorders, Texas State University, 200 Bobcat Way, Round Rock, Texas 78665, USA
- Barbara L Davis
- Department of Speech, Language and Hearing Sciences, University of Texas at Austin, 1 University Station A1100, Austin, Texas 78712, USA
11
Abstract
Vowel-intrinsic fundamental frequency (IF0), the phenomenon that high vowels tend to have a higher fundamental frequency (f0) than low vowels, has been studied for over a century, but its causal mechanism is still controversial. The most commonly accepted "tongue-pull" hypothesis successfully explains the IF0 difference between high and low vowels but fails to account for gradient IF0 differences among low vowels. Moreover, previous studies that investigated the articulatory correlates of IF0 showed inconsistent results and did not appropriately distinguish between the tongue and the jaw. The current study used articulatory and acoustic data from two large corpora of American English (44 speakers in total) to examine the separate contributions of tongue and jaw height on IF0. Using data subsetting and stepwise linear regression, the results showed that both the jaw and tongue heights were positively correlated with vowel f0, but the contribution of the jaw to IF0 was greater than that of the tongue. These results support a dual mechanism hypothesis in which the tongue-pull mechanism contributes to raising f0 in non-low vowels while a secondary "jaw-push" mechanism plays a more important role in lowering f0 for non-high vowels.
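The regression logic summarized above can be sketched with a toy model; the data below are simulated with an assumed larger jaw effect and are not the corpus measurements reported in the study.

```python
# Toy sketch: estimate separate contributions of jaw and tongue height to vowel f0.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 500
jaw = rng.normal(size=n)                      # z-scored jaw height (hypothetical)
tongue = 0.6 * jaw + rng.normal(size=n)       # tongue height, correlated with the jaw
f0 = 2.0 * jaw + 0.8 * tongue + rng.normal(scale=1.5, size=n)  # assumed effect sizes

fit = LinearRegression().fit(np.column_stack([jaw, tongue]), f0)
print(dict(zip(["jaw", "tongue"], np.round(fit.coef_, 2))))
```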
Affiliation(s)
- Wei-Rong Chen
- Haskins Laboratories, 300 George Street #900, New Haven, CT 06511
- D. H. Whalen
- Haskins Laboratories, 300 George Street #900, New Haven, CT 06511
- City University of New York, 205 E 42nd Street, New York, NY 1001
- Yale University, New Haven, CT 06520
- Mark K. Tiede
- Haskins Laboratories, 300 George Street #900, New Haven, CT 06511
12
Yuan Y, Lleo Y, Daniel R, White A, Oh Y. The Impact of Temporally Coherent Visual Cues on Speech Perception in Complex Auditory Environments. Front Neurosci 2021; 15:678029. [PMID: 34163326] [PMCID: PMC8216555] [DOI: 10.3389/fnins.2021.678029]
Abstract
Speech perception often takes place in noisy environments, where multiple auditory signals compete with one another. The addition of visual cues such as talkers’ faces or lip movements to an auditory signal can help improve the intelligibility of speech in those suboptimal listening environments; this improvement is referred to as the audiovisual benefit. The current study aimed to delineate the signal-to-noise ratio (SNR) conditions under which visual presentations of the acoustic amplitude envelope have their most significant impact on speech perception. Seventeen adults with normal hearing were recruited. Participants were presented with spoken sentences in babble noise either in auditory-only or auditory-visual conditions at SNRs of −7, −5, −3, −1, and 1 dB. The visual stimulus applied in this study was a sphere that varied in size in sync with the amplitude envelope of the target speech signals. Participants were asked to transcribe the sentences they heard. Results showed that a significant improvement in accuracy in the auditory-visual condition versus the auditory-only condition was obtained at SNRs of −3 and −1 dB, but no improvement was observed at the other SNRs. These results show that dynamic temporal visual information can benefit speech perception in noise, and that the optimal facilitative effect of the visual amplitude envelope is observed under an intermediate SNR range.
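One common way to derive the kind of slowly varying amplitude envelope used to drive such a visual stimulus is a Hilbert magnitude followed by a low-pass filter; the sketch below is an assumption about the general technique, not the study's stimulus-generation code.

```python
# Sketch: extract a low-pass amplitude envelope and map it onto a sphere radius.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 16000
t = np.arange(0, 1.0, 1 / fs)
speech = np.sin(2 * np.pi * 150 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t))  # toy signal

envelope = np.abs(hilbert(speech))               # instantaneous amplitude
b, a = butter(4, 10 / (fs / 2), btype="low")     # keep modulations below ~10 Hz
envelope = filtfilt(b, a, envelope)

radius = 0.5 + 0.5 * envelope / envelope.max()   # hypothetical mapping to sphere size
print(round(radius.min(), 2), round(radius.max(), 2))
```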
Affiliation(s)
- Yi Yuan
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, FL, United States
- Yasneli Lleo
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, FL, United States
- Rebecca Daniel
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, FL, United States
- Alexandra White
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, FL, United States
- Yonghee Oh
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, FL, United States
13
Kalaivanan K, Sumartono F, Tan YY. The Homogenization of Ethnic Differences in Singapore English? A Consonantal Production Study. Language and Speech 2021; 64:123-140. [PMID: 32484011] [DOI: 10.1177/0023830920925510]
Abstract
Past research on Singapore English (SgE) has shown that there are specific segmental and prosodic patterns that are unique to the three major ethnic groups, Chinese, Malay, and Indian in Singapore. These features have been highlighted as the "stereotypical" ethnic markers of SgE speakers, assuming substrate influence from the speakers' "ethnic" languages (Mandarin, Malay, and Tamil). However, recent research suggests that Singaporeans are becoming increasingly English dominant and has challenged the position of the ethnic languages as true "mother tongues" of Singaporeans. Hence, this study seeks to question if such "stereotypical" ethnic features exist, and if so, the extent to which a less dominant ethnic language would affect the phonology of speakers' English. This study looks specifically at the production of consonants /f/, /θ/, /t/, /v/, and /w/ as salient segmental features in SgE. Participants' phonetic behavior of /θ/, which was produced similarly across the three ethnic groups, disputed substrate influence. Tamil speakers were the most disparate, particularly with the /v/-/w/ contrast production. However, these deviations were often sporadic phonetic changes, which scarcely reflect robust speech patterns in the community. As a result, consonantal production in SgE is found to be largely independent of substrate influence and relatively uniform across the three ethnicities. The homogeneity observed in this study sheds light on bilinguals' acquisition of sounds, and it also provides phonological evidence toward the understanding of the evolutionary process of postcolonial Englishes.
14
Chodroff E, Wilson C. Acoustic-phonetic and auditory mechanisms of adaptation in the perception of sibilant fricatives. Atten Percept Psychophys 2020; 82:2027-2048. [PMID: 31875314] [PMCID: PMC7297833] [DOI: 10.3758/s13414-019-01894-2]
Abstract
Listeners are highly proficient at adapting to contextual variation when perceiving speech. In the present study, we examined the effects of brief speech and nonspeech contexts on the perception of sibilant fricatives. We explored three theoretically motivated accounts of contextual adaptation, based on phonetic cue calibration, phonetic covariation, and auditory contrast. Under the cue calibration account, listeners adapt by estimating a talker-specific average for each phonetic cue or dimension; under the cue covariation account, listeners adapt by exploiting consistencies in how the realization of speech sounds varies across talkers; under the auditory contrast account, adaptation results from (partial) masking of spectral components that are shared by adjacent stimuli. The spectral center of gravity, a phonetic cue to fricative identity, was manipulated for several types of context sound: /z/-initial syllables, /v/-initial syllables, and white noise matched in long-term average spectrum (LTAS) to the /z/-initial stimuli. Listeners' perception of the /s/-/ʃ/ contrast was significantly influenced by /z/-initial syllables and LTAS-matched white noise stimuli, but not by /v/-initial syllables. No significant difference in adaptation was observed between exposure to /z/-initial syllables and matched white noise stimuli, and speech did not have a considerable advantage over noise when the two were presented consecutively within a context. The pattern of findings is most consistent with the auditory contrast account of short-term perceptual adaptation. The cue covariation account makes accurate predictions for speech contexts, but not for nonspeech contexts or for the absence of a speech-versus-nonspeech difference.
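The spectral center of gravity manipulated in this study is the amplitude-weighted mean frequency of the power spectrum; a minimal computation is sketched below on a toy noise token, not the experimental stimuli.

```python
# Minimal sketch: spectral center of gravity (CoG) of a fricative-like noise token.
import numpy as np

def spectral_cog(x, fs):
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    return np.sum(freqs * power) / np.sum(power)

fs = 44100
rng = np.random.default_rng(3)
noise = rng.normal(size=fs // 10)            # 100 ms of white noise
s_like = np.diff(noise, prepend=0.0)         # first difference boosts high frequencies
print(f"CoG: {spectral_cog(s_like, fs):.0f} Hz")
```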
Affiliation(s)
- Eleanor Chodroff
- Department of Language and Linguistic Science, University of York, Heslington, York, YO10 5DD, UK.
- Colin Wilson
- Department of Cognitive Science, Johns Hopkins University, 3400 N. Charles St., Baltimore, MD, 21218, USA
15
Kirby J. Acoustic correlates of plosive voicing in Madurese. J Acoust Soc Am 2020; 147:2779. [PMID: 32359255] [DOI: 10.1121/10.0000992]
Abstract
Madurese, a Malayo-Polynesian language of Indonesia, is of interest both areally and typologically: it is described as having a three-way laryngeal contrast between voiced, voiceless unaspirated, and voiceless aspirated plosives, along with a strict phonotactic restriction on consonant voicing-vowel height sequences. An acoustic analysis of Madurese consonants and vowels obtained from the recordings of 15 speakers is presented to assess whether its voiced and aspirated plosives might share acoustic properties indicative of a shared articulatory gesture. Although voiced and voiceless aspirated plosives in word-initial position pattern together in terms of several spectral balance measures, these are most likely due to the following vowel quality, rather than aspects of a shared laryngeal configuration. Conversely, the voiceless (aspirated and unaspirated) plosives share multiple acoustic properties, including F0 trajectories and overlapping voicing lag time distributions, suggesting that they share a glottal aperture target. The implications of these findings for the typology of laryngeal contrasts and the historical evolution of the Madurese consonant-vowel co-occurrence restriction are discussed.
Affiliation(s)
- James Kirby
- School of Philosophy, Psychology, and Language Science, University of Edinburgh, Dugald Stuart Building, 3 Charles Street, Edinburgh EH8 9AD, United Kingdom
16
Horo L, Sarmah P, Anderson GDS. Acoustic phonetic study of the Sora vowel system. J Acoust Soc Am 2020; 147:3000. [PMID: 32359268] [DOI: 10.1121/10.0001011]
Abstract
This paper is an acoustic phonetic study of vowels in Sora, a Munda language of the Austroasiatic language family. Descriptions here illustrate that the Sora vowel system has six vowels and provide evidence that Sora disyllables have prominence on the second syllable. While the acoustic categorization of vowels is based on formant frequencies, the presence of prominence on the second syllable is shown through temporal features of vowels, including duration, intensity, and fundamental frequency. Additionally, this paper demonstrates that acoustic categorization of vowels in Sora is better in the prominent syllable than in the non-prominent syllable, providing evidence that syllable prominence and vowel quality are correlated in Sora. These acoustic properties of Sora vowels are discussed in relation to the existing debates on vowels and patterns of syllable prominence in Munda languages of India. In this regard, it is noteworthy that Munda languages, in general, lack instrumental studies, and therefore this paper presents significant findings that are undocumented in other Munda languages. These acoustic studies are supported by exploratory statistical modeling and statistical classification methods.
Affiliation(s)
- Luke Horo
- Living Tongues Institute for Endangered Languages, Salem, Oregon 97302, USA
- Priyankoo Sarmah
- Department of Humanities and Social Sciences, Indian Institute of Technology Guwahati, Guwahati, Assam 781039, India
17
Yuan Y, Wayland R, Oh Y. Visual analog of the acoustic amplitude envelope benefits speech perception in noise. J Acoust Soc Am 2020; 147:EL246. [PMID: 32237828] [DOI: 10.1121/10.0000737]
Abstract
The nature of the visual input that integrates with the audio signal to yield speech processing advantages remains controversial. This study tests the hypothesis that the information extracted for audiovisual integration includes co-occurring suprasegmental dynamic changes in the acoustic and visual signal. English sentences embedded in multi-talker babble noise were presented to native English listeners in audio-only and audiovisual modalities. A significant intelligibility enhancement with the visual analogs congruent to the acoustic amplitude envelopes was observed. These results suggest that dynamic visual modulation provides speech rhythmic information that can be integrated online with the audio signal to enhance speech intelligibility.
Affiliation(s)
- Yi Yuan
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, Florida 32610, USA
- Ratree Wayland
- Department of Linguistics, University of Florida, Gainesville, Florida 32611, USA
- Yonghee Oh
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, Florida 32610, USA
18
Namasivayam AK, Coleman D, O’Dwyer A, van Lieshout P. Speech Sound Disorders in Children: An Articulatory Phonology Perspective. Front Psychol 2020; 10:2998. [PMID: 32047453] [PMCID: PMC6997346] [DOI: 10.3389/fpsyg.2019.02998]
Abstract
Speech Sound Disorders (SSDs) is a generic term used to describe a range of difficulties producing speech sounds in children (McLeod and Baker, 2017). The foundations of clinical assessment, classification and intervention for children with SSD have been heavily influenced by psycholinguistic theory and procedures, which largely posit a firm boundary between phonological processes and phonetics/articulation (Shriberg, 2010). Thus, in many current SSD classification systems the complex relationships between the etiology (distal), processing deficits (proximal) and the behavioral levels (speech symptoms) are under-specified (Terband et al., 2019a). It is critical to understand the complex interactions between these levels as they have implications for differential diagnosis and treatment planning (Terband et al., 2019a). There have been some theoretical attempts made towards understanding these interactions (e.g., McAllister Byun and Tessier, 2016), and characterizing speech patterns in children either solely as the product of speech motor performance limitations or purely as a consequence of phonological/grammatical competence has been challenged (Inkelas and Rose, 2007; McAllister Byun, 2012). In the present paper, we intend to reconcile the phonetic-phonology dichotomy and discuss the interconnectedness between these levels and the nature of SSDs using an alternative perspective based on the notion of an articulatory "gesture" within the broader concepts of the Articulatory Phonology model (AP; Browman and Goldstein, 1992). The articulatory "gesture" serves as a unit of phonological contrast and characterization of the resulting articulatory movements (Browman and Goldstein, 1992; van Lieshout and Goldstein, 2008). We present evidence supporting the notion of articulatory gestures at the level of speech production and as reflected in control processes in the brain, and discuss how an articulatory "gesture"-based approach can account for articulatory behaviors in typical and disordered speech production (van Lieshout, 2004; Pouplier and van Lieshout, 2016). Specifically, we discuss how the AP model can provide an explanatory framework for understanding SSDs in children. Although other theories may be able to provide alternate explanations for some of the issues we will discuss, the AP framework in our view generates a unique scope that covers linguistic (phonology) and motor processes in a unified manner.
Affiliation(s)
- Aravind Kumar Namasivayam
- Oral Dynamics Laboratory, Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada
- Toronto Rehabilitation Institute, University Health Network, Toronto, ON, Canada
- Deirdre Coleman
- Oral Dynamics Laboratory, Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada
- Independent Researcher, Surrey, BC, Canada
- Aisling O’Dwyer
- Oral Dynamics Laboratory, Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada
- St. James’s Hospital, Dublin, Ireland
- Pascal van Lieshout
- Oral Dynamics Laboratory, Department of Speech-Language Pathology, University of Toronto, Toronto, ON, Canada
- Toronto Rehabilitation Institute, University Health Network, Toronto, ON, Canada
- Rehabilitation Sciences Institute, University of Toronto, Toronto, ON, Canada
19
Patri JF, Diard J, Perrier P. Modeling Sensory Preference in Speech Motor Planning: A Bayesian Modeling Framework. Front Psychol 2019; 10:2339. [PMID: 31708828] [PMCID: PMC6824204] [DOI: 10.3389/fpsyg.2019.02339]
Abstract
Experimental studies of speech production involving compensations for auditory and somatosensory perturbations and adaptation after training suggest that both types of sensory information are considered to plan and monitor speech production. Interestingly, individual sensory preferences have been observed in this context: subjects who compensate less for somatosensory perturbations compensate more for auditory perturbations, and vice versa. We propose to integrate this sensory preference phenomenon in a model of speech motor planning using a probabilistic model in which speech units are characterized both in auditory and somatosensory terms. Sensory preference is implemented in the model according to two approaches. In the first approach, which is often used in motor control models accounting for sensory integration, sensory preference is attributed to the relative precision (i.e., inverse of the variance) of the sensory characterization of the speech motor goals associated with phonological units (which are phonemes in the context of this paper). In the second, "more original" variant, sensory preference is implemented by modulating the sensitivity of the comparison between the predicted sensory consequences of motor commands and the sensory characterizations of the phonemes. We present simulation results using these two variants, in the context of the adaptation to an auditory perturbation, implemented in a 2-dimensional biomechanical model of the tongue. Simulation results show that both variants lead to qualitatively similar results. Distinguishing them experimentally would require precise analyses of partial compensation patterns. However, the second proposed variant implements sensory preference without changing the sensory characterizations of the phonemes. This dissociates sensory preference and sensory characterizations of the phonemes, and makes the account of sensory preference more flexible. Indeed, in the second variant the sensory characterizations of the phonemes can remain stable, when sensory preference varies as a response to cognitive or attentional control. This opens new perspectives for capturing speech production variability associated with aging, disorders and speaking conditions.
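The first variant described above (sensory preference as relative precision) can be illustrated with a one-dimensional toy computation in which two sensory characterizations of the same goal are fused by precision weighting; the numbers are arbitrary and this is not the paper's full model, which uses a 2-dimensional biomechanical tongue model.

```python
# Toy illustration of precision-weighted fusion of auditory and somatosensory goals.
aud_goal, aud_var = 500.0, 20.0 ** 2   # auditory characterization of the goal (e.g., F1 in Hz)
som_goal, som_var = 520.0, 5.0 ** 2    # somatosensory characterization, mapped to the same axis

w_aud = (1 / aud_var) / (1 / aud_var + 1 / som_var)   # relative precision = sensory preference
fused_goal = w_aud * aud_goal + (1 - w_aud) * som_goal
print(f"auditory weight = {w_aud:.2f}, fused goal = {fused_goal:.1f}")
```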
Affiliation(s)
- Jean-François Patri
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France.,Université Grenoble Alpes, CNRS, LPNC, Grenoble, France.,Cognition Motion and Neuroscience Unit, Fondazione Istituto Italiano di Tecnologia, Genova, Italy
- Julien Diard
- Université Grenoble Alpes, CNRS, LPNC, Grenoble, France
- Pascal Perrier
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France
20
Abstract
Human category learning appears to be supported by dual learning systems. Previous research indicates the engagement of distinct neural systems in learning categories that require selective attention to dimensions versus those that require integration across dimensions. This evidence has largely come from studies of learning across perceptually separable visual dimensions, but recent research has applied dual system models to understanding auditory and speech categorization. Since differential engagement of the dual learning systems is closely related to selective attention to input dimensions, it may be important that acoustic dimensions are quite often perceptually integral and difficult to attend to selectively. We investigated this issue across artificial auditory categories defined by center frequency and modulation frequency acoustic dimensions. Learners demonstrated a bias to integrate across the dimensions, rather than to selectively attend, and the bias specifically reflected a positive correlation between the dimensions. Further, we found that the acoustic dimensions did not equivalently contribute to categorization decisions. These results demonstrate the need to reconsider the assumption that the orthogonal input dimensions used in designing an experiment are indeed orthogonal in perceptual space as there are important implications for category learning.
21
Abstract
Studies of vowel systems regularly appeal to the need to understand how the auditory system encodes and processes the information in the acoustic signal. The goal of this study is to present computational models to address this need, and to use the models to illustrate responses to vowels at two levels of the auditory pathway. Many of the models previously used to study auditory representations of speech are based on linear filter banks simulating the tuning of the inner ear. These models do not incorporate key nonlinear response properties of the inner ear that influence responses at conversational-speech sound levels. These nonlinear properties shape neural representations in ways that are important for understanding responses in the central nervous system. The model for auditory-nerve (AN) fibers used here incorporates realistic nonlinear properties associated with the basilar membrane, inner hair cells (IHCs), and the IHC-AN synapse. These nonlinearities set up profiles of f0-related fluctuations that vary in amplitude across the population of frequency-tuned AN fibers. Amplitude fluctuations in AN responses are smallest near formant peaks and largest at frequencies between formants. These f0-related fluctuations strongly excite or suppress neurons in the auditory midbrain, the first level of the auditory pathway where tuning for low-frequency fluctuations in sounds occurs. Formant-related amplitude fluctuations provide representations of the vowel spectrum in discharge rates of midbrain neurons. These representations in the midbrain are robust across a wide range of sound levels, including the entire range of conversational-speech levels, and in the presence of realistic background noise levels.
22
Llompart M, Reinisch E. Imitation in a Second Language Relies on Phonological Categories but Does Not Reflect the Productive Usage of Difficult Sound Contrasts. Language and Speech 2019; 62:594-622. [PMID: 30319031] [DOI: 10.1177/0023830918803978]
Abstract
This study investigated the relationship between imitation and both the perception and production abilities of second language (L2) learners for two non-native contrasts differing in their expected degree of difficulty. German learners of English were tested on perceptual categorization, imitation and a word reading task for the difficult English /ɛ/-/æ/ contrast, which tends not to be well encoded in the learners' phonological inventories, and the easy, near-native /i/-/ɪ/ contrast. As expected, within-task comparisons between contrasts revealed more robust perception and better differentiation during production for /i/-/ɪ/ than /ɛ/-/æ/. Imitation also followed this pattern, suggesting that imitation is modulated by the phonological encoding of L2 categories. Moreover, learners' ability to imitate /ɛ/ and /æ/ was related to their perception of that contrast, confirming a tight perception-production link at the phonological level for difficult L2 sound contrasts. However, no relationship was observed between acoustic measures for imitated and read-aloud tokens of /ɛ/ and /æ/. This dissociation is mostly attributed to the influence of inaccurate non-native lexical representations in the word reading task. We conclude that imitation is strongly related to the phonological representation of L2 sound contrasts, but does not need to reflect the learners' productive usage of such non-native distinctions.
24
Interactions between speech perception and production during learning of novel phonemic categories. Atten Percept Psychophys 2019; 81:981-1005. [PMID: 30976997] [DOI: 10.3758/s13414-019-01725-4]
Abstract
A successful language learner must be able to perceive and produce novel sounds in their second language. However, the relationship between learning in perception and production is unclear. Some studies show correlations between the two modalities; however, other studies have not shown such correlations. In the present study, I examine learning in perception and production after training in a distributional learning paradigm. Training modality is manipulated, while testing modality remained constant. Overall, participants showed substantial learning in the modality in which they were trained; however, learning across modalities shows a more complex pattern. Although individuals trained in perception improved in production, individuals trained in production did not show substantial learning in perception. That is, production during training disrupted perceptual learning. Further, correlations between learning in the two modalities were not strong. Several possible explanations for the pattern of results are explored, including a close examination of the role of production variability, and the results are explained using a paradigm appealing to shared cognitive resources. The article concludes with a discussion of the implications of these results for theories of second-language learning, speech perception, and production.
25
Long-standing problems in speech perception dissolve within an information-theoretic perspective. Atten Percept Psychophys 2019; 81:861-883. [PMID: 30937673] [DOI: 10.3758/s13414-019-01702-x]
Abstract
An information theoretic framework is proposed to have the potential to dissolve (rather than attempt to solve) multiple long-standing problems concerning speech perception. By this view, speech perception can be reframed as a series of processes through which sensitivity to information-that which changes and/or is unpredictable-becomes increasingly sophisticated and shaped by experience. Problems concerning appropriate objects of perception (gestures vs. sounds), rate normalization, variance consequent to articulation, and talker normalization are reframed, or even dissolved, within this information-theoretic framework. Application of discriminative models founded on information theory provides a productive approach to answer questions concerning perception of speech, and perception most broadly.
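As a concrete anchor for the notion of information invoked here ("that which changes and/or is unpredictable"), the sketch below computes Shannon surprisal and entropy for a made-up two-way segment distribution; it only illustrates the general information-theoretic quantities, not the article's discriminative models.

```python
# Illustration: rarer (less predictable) events carry more information.
import math

p = {"predictable_segment": 0.9, "surprising_segment": 0.1}
surprisal = {k: -math.log2(v) for k, v in p.items()}         # bits per event
entropy = sum(prob * surprisal[k] for k, prob in p.items())  # expected surprisal
print(surprisal)
print(f"entropy = {entropy:.2f} bits")
```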
26
Havenhill J, Do Y. Visual Speech Perception Cues Constrain Patterns of Articulatory Variation and Sound Change. Front Psychol 2018; 9:728. [PMID: 29867686] [PMCID: PMC5962885] [DOI: 10.3389/fpsyg.2018.00728]
Abstract
What are the factors that contribute to (or inhibit) diachronic sound change? While acoustically motivated sound changes are well-documented, research on the articulatory and audiovisual-perceptual aspects of sound change is limited. This paper investigates the interaction of articulatory variation and audiovisual speech perception in the Northern Cities Vowel Shift (NCVS), a pattern of sound change observed in the Great Lakes region of the United States. We focus specifically on the maintenance of the contrast between the vowels /ɑ/ and /ɔ/, both of which are fronted as a result of the NCVS. We present results from two experiments designed to test how the NCVS is produced and perceived. In the first experiment, we present data from an articulatory and acoustic analysis of the production of fronted /ɑ/ and /ɔ/. We find that some speakers distinguish /ɔ/ from /ɑ/ with a combination of both tongue position and lip rounding, while others do so using either tongue position or lip rounding alone. For speakers who distinguish /ɔ/ from /ɑ/ along only one articulatory dimension, /ɑ/ and /ɔ/ are acoustically more similar than for speakers who produce multiple articulatory distinctions. While all three groups of speakers maintain some degree of acoustic contrast between the vowels, the question is raised as to whether these articulatory strategies differ in their perceptibility. In the perception experiment, we test the hypothesis that visual speech cues play a role in maintaining contrast between the two sounds. The results of this experiment suggest that articulatory configurations in which /ɔ/ is produced with unround lips are perceptually weaker than those in which /ɔ/ is produced with rounding, even though these configurations result in acoustically similar output. We argue that these findings have implications for theories of sound change and variation in at least two respects: (1) visual cues can shape phonological systems through misperception-based sound change, and (2) phonological systems may be optimized not only for auditory but also for visual perceptibility.
Affiliation(s)
- Jonathan Havenhill
- Department of Linguistics, Georgetown University, Washington, DC, United States
- Youngah Do
- Department of Linguistics, University of Hong Kong, Hong Kong, Hong Kong
27
Dittinger E, D'Imperio M, Besson M. Enhanced neural and behavioural processing of a nonnative phonemic contrast in professional musicians. Eur J Neurosci 2018; 47:1504-1516. [DOI: 10.1111/ejn.13939]
Affiliation(s)
- Eva Dittinger
- CNRS & Aix-Marseille Université; Laboratoire de Neurosciences Cognitives (LNC, UMR 7291); Marseille France
- CNRS & Aix-Marseille Université; Laboratoire Parole et Langage (LPL, UMR 7309); Aix-en-Provence France
- Brain and Language Research Institute (BLRI); Aix-en-Provence France
- Mariapaola D'Imperio
- CNRS & Aix-Marseille Université; Laboratoire Parole et Langage (LPL, UMR 7309); Aix-en-Provence France
- Institut Universitaire de France (IUF); Paris France
- Mireille Besson
- CNRS & Aix-Marseille Université; Laboratoire de Neurosciences Cognitives (LNC, UMR 7291); Marseille France
28
Whalen DH, Chen WR, Tiede MK, Nam H. Variability of articulator positions and formants across nine English vowels. Journal of Phonetics 2018; 68:1-14. [PMID: 30034052] [PMCID: PMC6053058] [DOI: 10.1016/j.wocn.2018.01.003]
Abstract
Speech, though communicative, is quite variable both in articulation and acoustics, and it has often been claimed that articulation is more variable. Here we compared variability in articulation and acoustics for 32 speakers in the x-ray microbeam database (XRMB; Westbury, 1994). Variability in tongue, lip and jaw positions for nine English vowels (/u, ʊ, æ, ɑ, ʌ, ɔ, ε, ɪ, i/) was compared to that of the corresponding formant values. The domains were made comparable by creating three-dimensional spaces for each: the first three principal components from an analysis of a 14-dimensional space for articulation, and an F1xF2xF3 space for acoustics. More variability occurred in the articulation than the acoustics for half of the speakers, while the reverse was true for the other half. Individual tokens were further from the articulatory median than the acoustic median for 40-60% of tokens across speakers. A separate analysis of three non-low front vowels (/ε, ɪ, i/, for which the XRMB system provides the most direct articulatory evidence) did not differ from the omnibus analysis. Speakers tended to be either more or less variable consistently across vowels. Across speakers, there was a positive correlation between articulatory and acoustic variability, both for all vowels and for just the three non-low front vowels. Although the XRMB is an incomplete representation of articulation, it nonetheless provides data for direct comparisons between articulatory and acoustic variability that have not been reported previously. The results indicate that articulation is not more variable than acoustics, that speakers had relatively consistent variability across vowels, and that articulatory and acoustic variability were related for the vowels themselves.
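The cross-domain comparison described above can be sketched as follows: reduce a 14-dimensional articulatory space to its first three principal components and compare per-token dispersion with that in an F1×F2×F3 space. The random data below merely stand in for the X-ray microbeam pellet positions and formant measurements, and the dispersion measure (distance from the median token) follows the abstract's description.

```python
# Sketch: compare token dispersion in a 3-PC articulatory space vs. an F1xF2xF3 space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n_tokens = 120
artic = rng.normal(size=(n_tokens, 14))        # stand-in articulator coordinates
formants = rng.normal(size=(n_tokens, 3))      # stand-in (z-scored) F1, F2, F3

artic3 = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(artic))

def dispersion(X):
    """Mean Euclidean distance of tokens from the median token."""
    return np.linalg.norm(X - np.median(X, axis=0), axis=1).mean()

# In practice the two spaces must be scaled comparably before this comparison.
print(f"articulatory dispersion: {dispersion(artic3):.2f}")
print(f"acoustic dispersion:     {dispersion(formants):.2f}")
```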
Affiliation(s)
- D H Whalen
- Haskins Laboratories
- City University of New York
- Yale University
29
Jacewicz E, Fox RA. Regional Variation in Fundamental Frequency of American English Vowels. Phonetica 2018; 75:273-309. [PMID: 29649804] [DOI: 10.1159/000484610]
Abstract
We examined whether the fundamental frequency (f0) of vowels is influenced by regional variation, aiming to (1) establish how the relationship between vowel height and f0 ("intrinsic f0") is utilized in regional vowel systems and (2) determine whether regional varieties differ in their implementation of the effects of phonetic context on f0 variations. An extended set of acoustic measures explored f0 in vowels in isolated tokens (experiment 1) and in connected speech (experiment 2) from 36 women representing 3 different varieties of American English. Regional differences were found in f0 shape in isolated tokens, in the magnitude of intrinsic f0 difference between high and low vowels, in the nature of f0 contours in stressed vowels, and in the completion of f0 contours in the context of coda voicing. Regional varieties utilize f0 control in vowels in different ways, including regional f0 ranges and variation in f0 shape.
30
Abstract
Phonemes play a central role in traditional theories as units of speech perception and access codes to lexical representations. Phonemes have two essential properties: they are 'segment-sized' (the size of a consonant or vowel) and abstract (a single phoneme may have different acoustic realisations). Nevertheless, there is a long history of challenging the phoneme hypothesis, with some theorists arguing for differently sized phonological units (e.g. features or syllables) and others rejecting abstract codes in favour of representations that encode detailed acoustic properties of the stimulus. The phoneme hypothesis is the minority view today. We defend the phoneme hypothesis in two complementary ways. First, we show that rejection of phonemes is based on a flawed interpretation of empirical findings. For example, it is commonly argued that the failure to find acoustic invariances for phonemes rules out phonemes. However, the lack of invariance is only a problem on the assumption that speech perception is a bottom-up process. If learned sublexical codes are modified by top-down constraints (which they are), then this argument loses all force. Second, we provide strong positive evidence for phonemes on the basis of linguistic data. Almost all findings that are taken (incorrectly) as evidence against phonemes are based on psycholinguistic studies of single words. However, phonemes were first introduced in linguistics, and the best evidence for phonemes comes from linguistic analyses of complex word forms and sentences. In short, the rejection of phonemes is based on a false analysis and a too-narrow consideration of the relevant data.
Affiliation(s)
- Nina Kazanina
- School of Experimental Psychology, University of Bristol, 12a Priory Road, Bristol, BS8 1TU, UK.
- Jeffrey S Bowers
- School of Experimental Psychology, University of Bristol, 12a Priory Road, Bristol, BS8 1TU, UK
- William Idsardi
- Department of Linguistics, University of Maryland, 1401 Marie Mount Hall, College Park, MD, 20742, USA
31
MacDonald J. Hearing Lips and Seeing Voices: the Origins and Development of the 'McGurk Effect' and Reflections on Audio-Visual Speech Perception Over the Last 40 Years. Multisens Res 2018; 31:7-18. [PMID: 31264593 DOI: 10.1163/22134808-00002548] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2016] [Accepted: 01/20/2017] [Indexed: 11/19/2022]
Abstract
In 1976 Harry McGurk and I published a paper in Nature, entitled 'Hearing Lips and Seeing Voices'. The paper described a new audio-visual illusion we had discovered that showed the perception of auditorily presented speech could be influenced by the simultaneous presentation of incongruent visual speech. This hitherto unknown effect has since had a profound impact on audiovisual speech perception research. The phenomenon has come to be known as the 'McGurk effect', and the original paper has been cited in excess of 4800 times. In this paper I describe the background to the discovery of the effect, the rationale for the generation of the initial stimuli, the construction of the exemplars used and the serendipitous nature of the finding. The paper will also cover the reaction (and non-reaction) to the Nature publication, the growth of research on, and utilizing the 'McGurk effect' and end with some reflections on the significance of the finding.
Affiliation(s)
- John MacDonald
- Department of Psychology, University of the West of Scotland, Paisley, PA1 2BE, UK
32
Kawahara S. Durational compensation within a CV mora in spontaneous Japanese: Evidence from the Corpus of Spontaneous Japanese. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2017; 142:EL143. [PMID: 28764476 DOI: 10.1121/1.4994674] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
Previous experimental studies showed that in Japanese, vowels are longer after shorter onset consonants; there is durational compensation within a CV-mora. In order to address whether this compensation occurs in natural speech, this study re-examines this observation using the Corpus of Spontaneous Japanese. The results, which are based on more than 200 000 CV-mora tokens, show that there is a negative correlation between the onset consonant and the following vowel in terms of their duration. The statistical significance of this negative correlation is assessed by a traditional correlation analysis as well as a bootstrap resampling analysis, which both show that it is unlikely that the observed compensation effect occurred by chance. The compensation is not perfect, however, suggesting that it is a stochastic tendency rather than an absolute principle. This paper closes with a discussion of potential factors that may interact with the durational compensation effect.
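A minimal sketch of the two checks mentioned above (a plain correlation between onset-consonant and vowel durations within CV morae, plus a bootstrap over resampled tokens), using invented durations and an arbitrary number of resamples; it illustrates the general approach rather than the paper's analysis code:

```python
import numpy as np

def bootstrap_correlation(c_dur, v_dur, n_boot=10_000, seed=1):
    """Pearson r between consonant and vowel durations, with a bootstrap
    confidence interval obtained by resampling CV-mora tokens with replacement."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(c_dur, v_dur)[0, 1]
    n = len(c_dur)
    boot = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)
        boot[i] = np.corrcoef(c_dur[idx], v_dur[idx])[0, 1]
    low, high = np.percentile(boot, [2.5, 97.5])
    return r_obs, (low, high)

# Toy data: longer onset consonants paired with somewhat shorter vowels
# (imperfect compensation), in place of real corpus measurements.
rng = np.random.default_rng(2)
c = rng.normal(60, 15, 5000)                 # consonant durations in ms
v = 120 - 0.3 * c + rng.normal(0, 20, 5000)  # vowel durations in ms
r, ci = bootstrap_correlation(c, v)
print(f"r = {r:.3f}, 95% bootstrap CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```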
Affiliation(s)
- Shigeto Kawahara
- The Institute of Cultural and Linguistic Studies, Keio University, 2-15-45 Mita, Minato-ku, Tokyo, Japan
33
Flaherty M, Dent ML, Sawusch JR. Experience with speech sounds is not necessary for cue trading by budgerigars (Melopsittacus undulatus). PLoS One 2017; 12:e0177676. [PMID: 28562597 PMCID: PMC5451017 DOI: 10.1371/journal.pone.0177676] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2016] [Accepted: 05/01/2017] [Indexed: 11/18/2022] Open
Abstract
The influence of experience with human speech sounds on speech perception in budgerigars, vocal mimics whose speech exposure can be tightly controlled in a laboratory setting, was measured. Budgerigars were divided into groups that differed in auditory exposure and then tested on a cue-trading identification paradigm with synthetic speech. Phonetic cue trading is a perceptual phenomenon observed when changes on one cue dimension are offset by changes in another cue dimension while still maintaining the same phonetic percept. The current study examined whether budgerigars would trade the cues of voice onset time (VOT) and the first formant onset frequency when identifying syllable-initial stop consonants and whether this would be influenced by exposure to speech sounds. There were a total of four different exposure groups: No speech exposure (completely isolated), Passive speech exposure (regular exposure to human speech), and two Speech-trained groups. After the exposure period, all budgerigars were tested for phonetic cue trading using operant conditioning procedures. Birds were trained to peck keys in response to different synthetic speech sounds that began with "d" or "t" and varied in VOT and frequency of the first formant at voicing onset. Once training performance criteria were met, budgerigars were presented with the entire intermediate series, including ambiguous sounds. Responses on these trials were used to determine which speech cues were used, if a trading relation between VOT and the onset frequency of the first formant was present, and whether speech exposure had an influence on perception. Cue trading was found in all birds and these results were largely similar to those of a group of humans. Results indicated that prior speech experience was not a requirement for cue trading by budgerigars. The results are consistent with theories that explain phonetic cue trading in terms of a rich auditory encoding of the speech signal.
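A trading relation of this kind is often summarized by fitting an identification function with both cues as predictors and locating the category boundary along VOT at different F1 onset values; the sketch below does this with a logistic regression on simulated "d"/"t" responses (the stimulus ranges, coefficients, and responses are all invented for illustration and are not taken from the study):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Simulated identification trials: each has a VOT (ms) and an F1 onset frequency (Hz).
vot = rng.uniform(0, 60, 2000)
f1_onset = rng.uniform(200, 600, 2000)
# Simulated listener: higher VOT and higher F1 onset both push responses toward "t".
p_t = 1 / (1 + np.exp(-(0.15 * (vot - 30) + 0.004 * (f1_onset - 400))))
resp_t = rng.random(2000) < p_t

model = LogisticRegression(max_iter=2000).fit(np.column_stack([vot, f1_onset]), resp_t)
b0 = model.intercept_[0]
b_vot, b_f1 = model.coef_[0]

# The VOT value giving 50% "t" responses at a given F1 onset; a trading relation
# shows up as this boundary shifting when the F1 onset changes.
for f1 in (250, 400, 550):
    boundary = -(b0 + b_f1 * f1) / b_vot
    print(f"F1 onset {f1} Hz -> VOT boundary {boundary:.1f} ms")
```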
Affiliation(s)
- Mary Flaherty
- Department of Psychology, University at Buffalo, The State University of New York, Buffalo, New York, United States of America
- Micheal L. Dent
- Department of Psychology, University at Buffalo, The State University of New York, Buffalo, New York, United States of America
- James R. Sawusch
- Department of Psychology, University at Buffalo, The State University of New York, Buffalo, New York, United States of America
34
Irwin J, DiBlasi L. Audiovisual speech perception: A new approach and implications for clinical populations. LANGUAGE AND LINGUISTICS COMPASS 2017; 11:77-91. [PMID: 29520300 PMCID: PMC5839512 DOI: 10.1111/lnc3.12237] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/18/2015] [Accepted: 01/25/2017] [Indexed: 06/01/2023]
Abstract
This selected overview of audiovisual (AV) speech perception examines the influence of visible articulatory information on what is heard. AV speech perception is thought to be a cross-cultural phenomenon that emerges early in typical language development; variables that influence it include properties of the visual and the auditory signal, attentional demands, and individual differences. A brief review of the existing neurobiological evidence on how visual information influences heard speech indicates potential loci, timing, and facilitatory effects of AV over auditory-only speech. The current literature on AV speech in certain clinical populations (individuals with an autism spectrum disorder, developmental language disorder, or hearing loss) reveals differences in processing that may inform interventions. Finally, a new method of assessing AV speech that does not require obvious cross-category mismatch or auditory noise is presented as a novel approach for investigators.
Affiliation(s)
- Julia Irwin
- LEARN Center, Haskins Laboratories Inc., USA
35
Zhang K, Wang X, Peng G. Normalization of lexical tones and nonlinguistic pitch contours: Implications for speech-specific processing mechanism. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2017; 141:38. [PMID: 28147563 DOI: 10.1121/1.4973414] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Context is indispensable for accurate tone perception, especially when the target tone system is as complex as that of Cantonese. However, not all contexts are equally beneficial. Speech contexts are usually more effective in improving lexical tone identification than nonspeech contexts matched in pitch information. Some potential factors which may contribute to these unequal effects have been proposed but, thus far, their plausibility remains unclear. To shed light on this issue, the present study compares the perception of lexical tones and their nonlinguistic counterparts under specific contextual (speech, nonspeech) and attentional (with/without focal attention) conditions. The results reveal a prominent congruency effect: target sounds tend to be identified more accurately when embedded in contexts of the same nature (speech/nonspeech). This finding suggests that speech and nonspeech sounds are partly processed by domain-specific mechanisms and that information from the same domain can be integrated more effectively than that from different domains. Therefore, domain-specific processing of speech could be the most likely cause of the unequal context effect. Moreover, focal attention is not a prerequisite for extracting contextual cues from speech and nonspeech during perceptual normalization. This finding implies that context encoding is highly automatic for native listeners.
Affiliation(s)
- Kaile Zhang
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China
- Xiao Wang
- Department of Linguistics and Modern Languages, The Chinese University of Hong Kong, Hong Kong Special Administrative Region, China
- Gang Peng
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region, China
36
Best CT, Avery RA. Left-Hemisphere Advantage for Click Consonants is Determined by Linguistic Significance and Experience. Psychol Sci 2016. [DOI: 10.1111/1467-9280.00108] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022] Open
Abstract
Left-hemisphere (LH) superiority for speech perception is a fundamental neurocognitive aspect of language, and is particularly strong for consonant perception. Two key theoretical aspects of the LH advantage for consonants remain controversial, however: the processing mode (auditory vs. linguistic) and the developmental basis of the specialization (innate vs. experience dependent). Click consonants offer a unique opportunity to evaluate these theoretical issues. Brief and spectrally complex, oral clicks exemplify the acoustic properties that have been proposed for an auditorily based LH specialization, yet they retain linguistic significance only for listeners whose languages employ them as consonants (e.g., Zulu). Speakers of other languages (e.g., English) perceive these clicks as nonspeech sounds. We assessed Zulu versus English listeners' hemispheric asymmetries for clicks, in and out of syllable context, in a dichotic-listening task. Performance was good for both groups, but only Zulus showed an LH advantage. Thus, linguistic processing and experience both appear to be crucial.
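Ear asymmetries in dichotic-listening accuracy are commonly summarized with a laterality index like the one sketched below (a standard measure in the literature, not necessarily the one the authors used); positive values indicate a right-ear, left-hemisphere advantage:

```python
def laterality_index(right_correct, left_correct):
    """A common laterality index for dichotic-listening accuracy:
    100 * (R - L) / (R + L). Positive values indicate a right-ear
    (left-hemisphere) advantage. Illustrative only; not the paper's measure."""
    total = right_correct + left_correct
    return 100.0 * (right_correct - left_correct) / total if total else 0.0

# Illustrative counts of correct reports per ear.
print(laterality_index(right_correct=42, left_correct=30))   # positive: right-ear advantage
print(laterality_index(right_correct=35, left_correct=36))   # near zero: no asymmetry
```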
Affiliation(s)
- Robert A. Avery
- Wesleyan University and Haskins Laboratories
- Department of Diagnostic Radiology, Yale University School of Medicine
37
Abstract
If, as we believe, language is a specialization all the way down to its roots, then perception of its consonantal elements should be immediately phonetic, not, as in the conventional view, a secondary translation from percepts of an auditory sort. Supporting observations come from an experiment in which formant transitions that distinguish [da] and [ga] were presented as sinusoids and combined with a synthetic syllable made of resonances, thus causing the auditory system to treat these acoustically incoherent parts as different sources. Evidence for the source difference was varied by changing the intensity of the sinusoids relative to the remainder of the syllable. Over the greater part of a 60-dB range, listeners accurately identified the consonants, indicating that they had integrated the stimuli according to a coherence that existed only in the phonetic domain. At the lowest intensities, indeed, the consonants were accurately identified even though the whistles (the normal responses to the sinusoids) were not. There followed then a range over which perception was duplex: both consonants and whistles were accurately identified. At the highest intensities, phonetic integration failed, but accurate perception of the whistles was maintained. That the phonetic percept was present when its auditory counterpart was absent, and vice versa, is evidence that the phonetic percept is independent of its auditory counterpart and not a translation from it, as is the fact that the two percepts followed very different courses in response to the experimental variable.
38
Lalonde K, Holt RF. Audiovisual speech perception development at varying levels of perceptual processing. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2016; 139:1713. [PMID: 27106318 PMCID: PMC4826374 DOI: 10.1121/1.4945590] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/17/2015] [Revised: 01/04/2016] [Accepted: 03/25/2016] [Indexed: 06/05/2023]
Abstract
This study used the auditory evaluation framework [Erber (1982). Auditory Training (Alexander Graham Bell Association, Washington, DC)] to characterize the influence of visual speech on audiovisual (AV) speech perception in adults and children at multiple levels of perceptual processing. Six- to eight-year-old children and adults completed auditory and AV speech perception tasks at three levels of perceptual processing (detection, discrimination, and recognition). The tasks differed in the level of perceptual processing required to complete them. Adults and children demonstrated visual speech influence at all levels of perceptual processing. Whereas children demonstrated the same visual speech influence at each level of perceptual processing, adults demonstrated greater visual speech influence on tasks requiring higher levels of perceptual processing. These results support previous research demonstrating multiple mechanisms of AV speech processing (general perceptual and speech-specific mechanisms) with independent maturational time courses. The results suggest that adults rely on both general perceptual mechanisms that apply to all levels of perceptual processing and speech-specific mechanisms that apply when making phonetic decisions and/or accessing the lexicon. Six- to eight-year-old children seem to rely only on general perceptual mechanisms across levels. As expected, developmental differences in AV benefit on this and other recognition tasks likely reflect immature speech-specific mechanisms and phonetic processing in children.
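Visual-speech influence in designs like this one is often expressed as a raw or normalized gain of audiovisual over auditory-only performance; the sketch below shows that standard calculation on illustrative proportion-correct scores (the numbers and the choice of measure are assumptions, not values from the study):

```python
def visual_gain(auditory_only, audiovisual):
    """Raw and normalized visual-speech benefit from proportion-correct scores.
    The normalized form scales the raw gain by the room left for improvement.
    These are standard measures, not necessarily the ones reported in the paper."""
    raw = audiovisual - auditory_only
    normalized = raw / (1.0 - auditory_only) if auditory_only < 1.0 else 0.0
    return raw, normalized

# Illustrative scores for a recognition-level task.
raw, norm = visual_gain(auditory_only=0.55, audiovisual=0.75)
print(f"raw gain = {raw:.2f}, normalized gain = {norm:.2f}")
```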
Affiliation(s)
- Kaylah Lalonde
- Department of Speech and Hearing Sciences, Indiana University, 200 South Jordan Avenue, Bloomington, Indiana 47405, USA
- Rachael Frush Holt
- Department of Speech and Hearing Science, Ohio State University, 110 Pressey Hall, 1070 Carmack Road, Columbus, Ohio 43210, USA
39
Iverson P, Wagner A, Rosen S. Effects of language experience on pre-categorical perception: Distinguishing general from specialized processes in speech perception. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2016; 139:1799. [PMID: 27106328 DOI: 10.1121/1.4944755] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Cross-language differences in speech perception have traditionally been linked to phonological categories, but it has become increasingly clear that language experience has effects beginning at early stages of perception, which blurs the accepted distinctions between general and speech-specific processing. The present experiments explored this distinction by playing stimuli that manipulated the acoustic form of English /r/ and /l/ to English and Japanese speakers, in order to determine how acoustically natural and phonologically identifiable a stimulus must be for cross-language discrimination differences to emerge. Discrimination differences were found for stimuli that did not sound subjectively like speech or /r/ and /l/, but overall they were strongly linked to phonological categorization. The results thus support the view that phonological categories are an important source of cross-language differences, but also show that these differences can extend to stimuli that do not clearly sound like speech.
Affiliation(s)
- Paul Iverson
- Department of Speech, Hearing and Phonetic Sciences, University College London, Chandler House, 2 Wakefield Street, London WC1N 1PF, United Kingdom
- Anita Wagner
- Department of Speech, Hearing and Phonetic Sciences, University College London, Chandler House, 2 Wakefield Street, London WC1N 1PF, United Kingdom
- Stuart Rosen
- Department of Speech, Hearing and Phonetic Sciences, University College London, Chandler House, 2 Wakefield Street, London WC1N 1PF, United Kingdom
40
Turner AC, McIntosh DN, Moody EJ. Don't Listen With Your Mouth Full: The Role of Facial Motor Action in Visual Speech Perception. LANGUAGE AND SPEECH 2015; 58:267-278. [PMID: 26677646 DOI: 10.1177/0023830914542305] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Theories of speech perception agree that visual input enhances the understanding of speech but disagree on whether physically mimicking the speaker improves understanding. This study investigated whether facial motor mimicry facilitates visual speech perception by testing whether blocking facial motor action impairs speechreading performance. Thirty-five typically developing children (19 boys; 16 girls; M age = 7 years) completed the Revised Craig Lipreading Inventory under two conditions. While observing silent videos of 15 words being spoken, participants either held a tongue depressor horizontally with their teeth (blocking facial motor action) or squeezed a ball with one hand (allowing facial motor action). As hypothesized, blocking motor action resulted in fewer correctly understood words than the control task did. The results suggest that facial mimicry or other methods of facial action support visual speech perception in children. Future studies on the impact of motor action on the typical and atypical development of speech perception are warranted.
41
Toscano JC, McMurray B. The time-course of speaking rate compensation: Effects of sentential rate and vowel length on voicing judgments. LANGUAGE, COGNITION AND NEUROSCIENCE 2015; 30:529-543. [PMID: 25780801 PMCID: PMC4358767 DOI: 10.1080/23273798.2014.946427] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/29/2023]
Abstract
Many sources of context information in speech (such as speaking rate) occur either before or after the phonetic cues they influence, yet there is little work examining the time-course of these effects. Here, we investigate how listeners compensate for preceding sentence rate and subsequent vowel length (a secondary cue that has been used as a proxy for speaking rate) when categorizing words varying in voice-onset time (VOT). Participants selected visual objects in a display while their eye-movements were recorded, allowing us to examine when each source of information had an effect on lexical processing. We found that the effect of VOT preceded that of vowel length, suggesting that each cue is used as it becomes available. In a second experiment, we found that, in contrast, the effect of preceding sentence rate occurred simultaneously with VOT, suggesting that listeners interpret VOT relative to preceding rate.
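Time-course questions of this sort are often answered by asking when two fixation-proportion curves first diverge; the sketch below implements one simple divergence criterion on toy curves in which a VOT effect emerges earlier than a vowel-length effect (the criterion, sampling rate, and curves are illustrative assumptions, not the authors' analysis):

```python
import numpy as np

def divergence_point(fix_a, fix_b, times, threshold=0.05, run=5):
    """First time at which two fixation-proportion curves differ by more than
    `threshold` for `run` consecutive samples; a simple stand-in for the
    time-course comparisons described above, not the authors' method."""
    above = np.abs(fix_a - fix_b) > threshold
    for i in range(len(times) - run + 1):
        if above[i:i + run].all():
            return times[i]
    return None

# Toy curves sampled every 10 ms: the simulated VOT effect rises earlier
# than the simulated vowel-length effect.
times = np.arange(0, 1000, 10)
vot_effect = 1 / (1 + np.exp(-(times - 300) / 50)) * 0.3
length_effect = 1 / (1 + np.exp(-(times - 500) / 50)) * 0.3
base = np.full_like(times, 0.2, dtype=float)
print(divergence_point(base + vot_effect, base, times))     # earlier divergence
print(divergence_point(base + length_effect, base, times))  # later divergence
```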
Affiliation(s)
- Joseph C Toscano
- Beckman Institute for Advanced Science and Technology, University of Illinois at Urbana-Champaign, 405 N Mathews Ave, Urbana, IL 61801
- Bob McMurray
- Dept. of Psychology and Dept. of Communication Sciences & Disorders, University of Iowa, E11 Seashore Hall, Iowa City, IA 52242
43
Abstract
Tactile sensations at extreme distal body locations can integrate with auditory information to alter speech perception among uninformed and untrained listeners. Inaudible air puffs were applied to participants' ankles, simultaneously with audible syllables having aspirated and unaspirated stop onsets. Syllables heard simultaneously with air puffs were more likely to be heard as aspirated. These results demonstrate that event-appropriate information from distal parts of the body integrates in speech perception, even without frequent or robust location-specific experience. In addition, overall performance was significantly better for those with hair on their ankles, which suggests that the presence of hair may help establish signal relevance, and so aid in multi-modal speech perception.
Affiliation(s)
- Donald Derrick
- University of Western Sydney, MARCS Institute, Locked Bag 1791, Penrith, NSW, 2751, Australia
- University of Canterbury, New Zealand Institute of Language, Brain & Behaviour, Private Bag 4800, Christchurch 8140, New Zealand
- Bryan Gick
- Department of Linguistics, University of British Columbia, Totem Field Studios, 2613 West Mall, Vancouver, BC, Canada, V6T 1Z4
- Haskins Laboratories, New Haven, Connecticut, 06511, USA
44
Sink positive: Linguistic experience with th substitutions influences nonnative word recognition. Atten Percept Psychophys 2011; 74:613-29. [DOI: 10.3758/s13414-011-0259-7] [Citation(s) in RCA: 42] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
45
Ordin M. Palatalization and intrinsic prosodic vowel features in Russian. LANGUAGE AND SPEECH 2011; 54:547-568. [PMID: 22338791 DOI: 10.1177/0023830911404962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
This study investigates the interaction of palatalization and intrinsic prosodic features of the vowel in CVC (consonant+vowel+consonant) syllables in Russian. The universal nature of intrinsic prosodic vowel features was confirmed with data from the Russian language. It was found that palatalization of the consonants affects intrinsic fundamental frequency (IF0), intensity (I), and duration of the vowels in CVC syllables by modifying vowel articulatory parameters such as vowel height and fronting. The obtained results are discussed in the light of opposing theories: those suggesting automatic control and those suggesting active control over intrinsic vowel features.
Affiliation(s)
- Mikhail Ordin
- Moscow Academy of Humanities and Technology, Russia.