1. Magnotti JF, Lado A, Beauchamp MS. The noisy encoding of disparity model predicts perception of the McGurk effect in native Japanese speakers. Front Neurosci 2024; 18:1421713. PMID: 38988770; PMCID: PMC11233445; DOI: 10.3389/fnins.2024.1421713.
Abstract
In the McGurk effect, visual speech from the face of the talker alters the perception of auditory speech. The diversity of human languages has prompted many intercultural studies of the effect in both Western and non-Western cultures, including native Japanese speakers. Studies of large samples of native English speakers have shown that the McGurk effect is characterized by high variability in the susceptibility of different individuals to the illusion and in the strength of different experimental stimuli to induce the illusion. The noisy encoding of disparity (NED) model of the McGurk effect uses principles from Bayesian causal inference to account for this variability, separately estimating the susceptibility and sensory noise for each individual and the strength of each stimulus. To determine whether variation in McGurk perception is similar between Western and non-Western cultures, we applied the NED model to data collected from 80 native Japanese-speaking participants. Fifteen different McGurk stimuli that varied in syllable content (unvoiced auditory "pa" + visual "ka" or voiced auditory "ba" + visual "ga") were presented interleaved with audiovisual congruent stimuli. The McGurk effect was highly variable across stimuli and participants, with the percentage of illusory fusion responses ranging from 3 to 78% across stimuli and from 0 to 91% across participants. Despite this variability, the NED model accurately predicted perception, predicting fusion rates for individual stimuli with 2.1% error and for individual participants with 2.4% error. Stimuli containing the unvoiced pa/ka pairing evoked more fusion responses than the voiced ba/ga pairing. Model estimates of sensory noise were correlated with participant age, with greater sensory noise in older participants. The NED model of the McGurk effect offers a principled way to account for individual and stimulus differences when examining the McGurk effect in different cultures.
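The abstract describes the NED model only qualitatively. As a rough illustration of the idea, the sketch below treats each stimulus as having an audiovisual disparity and each participant as having a disparity threshold and a sensory-noise term, with fusion occurring when the noisily encoded disparity falls below the threshold; the parameterization and values are illustrative assumptions, not the authors' fitted model.

```python
# Hedged sketch of the noisy encoding of disparity (NED) idea described above.
# Parameter names and the exact parameterization are illustrative assumptions.
import numpy as np
from scipy.stats import norm

def fusion_probability(stimulus_disparity, participant_threshold, participant_noise):
    """P(fusion): the encoded audiovisual disparity, corrupted by sensory noise,
    falls below the participant's disparity threshold, so the auditory and
    visual syllables are attributed to a single talker and integrated."""
    return norm.cdf((participant_threshold - stimulus_disparity) / participant_noise)

# Example: a strong McGurk stimulus (low disparity) and a weak one (high disparity)
# for a participant with moderate sensory noise.
for d in (0.5, 2.0):
    print(d, fusion_probability(d, participant_threshold=1.0, participant_noise=0.8))
```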
Affiliation(s)
- John F Magnotti
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Anastasia Lado
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Michael S Beauchamp
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
2. Nidiffer AR, Cao CZ, O'Sullivan A, Lalor EC. A representation of abstract linguistic categories in the visual system underlies successful lipreading. Neuroimage 2023; 282:120391. PMID: 37757989; DOI: 10.1016/j.neuroimage.2023.120391.
Abstract
There is considerable debate over how visual speech is processed in the absence of sound and whether neural activity supporting lipreading occurs in visual brain areas. Much of the ambiguity stems from a lack of behavioral grounding and neurophysiological analyses that cannot disentangle high-level linguistic and phonetic/energetic contributions from visual speech. To address this, we recorded EEG from human observers as they watched silent videos, half of which were novel and half of which were previously rehearsed with the accompanying audio. We modeled how the EEG responses to novel and rehearsed silent speech reflected the processing of low-level visual features (motion, lip movements) and a higher-level categorical representation of linguistic units, known as visemes. The ability of these visemes to account for the EEG - beyond the motion and lip movements - was significantly enhanced for rehearsed videos in a way that correlated with participants' trial-by-trial ability to lipread that speech. Source localization of viseme processing showed clear contributions from visual cortex, with no strong evidence for the involvement of auditory areas. We interpret this as support for the idea that the visual system produces its own specialized representation of speech that is (1) well-described by categorical linguistic features, (2) dissociable from lip movements, and (3) predictive of lipreading ability. We also suggest a reinterpretation of previous findings of auditory cortical activation during silent speech that is consistent with hierarchical accounts of visual and audiovisual speech perception.
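For orientation, the sketch below illustrates the general forward-encoding logic described in the abstract: compare how well a low-level feature set versus a low-level-plus-viseme feature set predicts EEG from time-lagged regressors. The simulated data, feature names, lag range, and ridge penalty are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of a forward-encoding comparison: does adding viseme category
# features improve EEG prediction beyond low-level motion and lip features?
import numpy as np

def lagged_design(features, max_lag):
    """Stack time-lagged copies of each feature column (lags 0..max_lag samples)."""
    n, k = features.shape
    cols = [np.roll(features[:, j], lag) for lag in range(max_lag + 1) for j in range(k)]
    X = np.column_stack(cols)
    X[:max_lag, :] = 0.0          # discard samples contaminated by wrap-around
    return X

def ridge_fit_predict(X_train, y_train, X_test, lam=1.0):
    """Closed-form ridge regression; returns predictions for the test block."""
    XtX = X_train.T @ X_train + lam * np.eye(X_train.shape[1])
    w = np.linalg.solve(XtX, X_train.T @ y_train)
    return X_test @ w

rng = np.random.default_rng(0)
n = 2000                                   # samples of a silent-video recording
motion = rng.standard_normal((n, 1))       # low-level: frame-to-frame motion
lip = rng.standard_normal((n, 1))          # low-level: lip aperture
viseme = rng.standard_normal((n, 3))       # higher-level: viseme category regressors
eeg = 0.4 * lip[:, 0] + 0.3 * viseme[:, 0] + rng.standard_normal(n)  # toy EEG channel

half = n // 2
for name, feats in [("low-level only", np.hstack([motion, lip])),
                    ("low-level + visemes", np.hstack([motion, lip, viseme]))]:
    X = lagged_design(feats, max_lag=16)
    pred = ridge_fit_predict(X[:half], eeg[:half], X[half:])
    r = np.corrcoef(pred, eeg[half:])[0, 1]
    print(f"{name}: prediction r = {r:.3f}")
```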
Affiliation(s)
- Aaron R Nidiffer
- Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Cody Zhewei Cao
- Department of Psychology, University of Michigan, Ann Arbor, MI, USA
- Aisling O'Sullivan
- School of Engineering, Trinity College Institute of Neuroscience, Trinity Centre for Biomedical Engineering, Trinity College, Dublin, Ireland
- Edmund C Lalor
- Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA; School of Engineering, Trinity College Institute of Neuroscience, Trinity Centre for Biomedical Engineering, Trinity College, Dublin, Ireland.
3. Discriminatory brain processes of native and foreign language in children with and without reading difficulties. Brain Sci 2022; 13(1):76. PMID: 36672057; PMCID: PMC9856413; DOI: 10.3390/brainsci13010076.
Abstract
The association between impaired speech perception and reading difficulty is well established in native language processing, as can be observed from brain activity. However, there has been little investigation of whether this association extends to brain activity during foreign language processing, and the relationship between reading skills and the neural representation of foreign-language speech remains unclear. In the present study, we used event-related potentials (ERPs) recorded with high-density EEG to investigate this question. Eleven- to 13-year-old children who were typically developing (CTR) or had reading difficulties (RD) were tested with a passive auditory oddball paradigm containing native (Finnish) and foreign (English) speech items. Two change-detection-related ERP responses, the mismatch response (MMR) and the late discriminative negativity (LDN), were studied, and cluster-based permutation tests were performed within and between groups. The results showed an apparent language effect. In the CTR group, we found an atypical MMR during foreign language processing and a larger LDN response, in both languages, for speech items containing a diphthong. In the RD group, we found an unstable, lower-amplitude MMR and a nonsignificant LDN response; the within-group analysis showed a deficit in the LDN response in both languages. Moreover, the RD group showed larger brain responses and a hemispheric polarity reversal relative to the CTR group. Our results provide new evidence that language processing differed between the CTR and RD groups in early and late discriminatory responses and that language processing is linked to reading skills in both native and foreign language contexts.
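The cluster-based permutation test mentioned above can be sketched in its simplest form: compute pointwise t-values between groups, form clusters of contiguous suprathreshold time points, and compare each cluster's mass against a permutation null. The single-channel, time-only sketch below uses simulated data and is a didactic simplification, not the authors' high-density EEG analysis.

```python
# Simplified cluster-based permutation test (one channel, time only).
import numpy as np
from scipy.stats import t as t_dist

def clusters_1d(tvals, thresh):
    """Runs of consecutive time points with |t| > thresh; mass = sum of |t|."""
    out, start = [], None
    for i, v in enumerate(np.abs(tvals)):
        if v > thresh and start is None:
            start = i
        elif v <= thresh and start is not None:
            out.append((start, i, np.abs(tvals[start:i]).sum()))
            start = None
    if start is not None:
        out.append((start, len(tvals), np.abs(tvals[start:]).sum()))
    return out

def welch_t(a, b):
    """Pointwise Welch t-values between two groups of waveforms (trials x time)."""
    va, vb = a.var(axis=0, ddof=1), b.var(axis=0, ddof=1)
    return (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(va / len(a) + vb / len(b))

rng = np.random.default_rng(1)
ctr = rng.standard_normal((20, 200))        # toy ERPs, 200 time points per child
rd = rng.standard_normal((18, 200))
rd[:, 80:110] += 0.8                        # toy group difference (e.g., an LDN window)

thresh = t_dist.ppf(0.975, df=len(ctr) + len(rd) - 2)   # approximate cluster-forming threshold
observed = clusters_1d(welch_t(ctr, rd), thresh)

pooled, null_max = np.vstack([ctr, rd]), []
for _ in range(1000):                        # permute group labels
    perm = rng.permutation(len(pooled))
    a, b = pooled[perm[:len(ctr)]], pooled[perm[len(ctr):]]
    cl = clusters_1d(welch_t(a, b), thresh)
    null_max.append(max((c[2] for c in cl), default=0.0))

for start, end, mass in observed:
    p = (np.sum(np.array(null_max) >= mass) + 1) / (len(null_max) + 1)
    print(f"cluster {start}-{end}: mass={mass:.1f}, p={p:.3f}")
```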
4. Jenson D. Audiovisual incongruence differentially impacts left and right hemisphere sensorimotor oscillations: potential applications to production. PLoS One 2021; 16:e0258335. PMID: 34618866; PMCID: PMC8496780; DOI: 10.1371/journal.pone.0258335.
Abstract
Speech production gives rise to distinct auditory and somatosensory feedback signals that are dynamically integrated to enable online monitoring and error correction, though it remains unclear how the sensorimotor system supports the integration of these multimodal signals. Capitalizing on the parity of sensorimotor processes supporting perception and production, the current study employed the McGurk paradigm to induce multimodal sensory congruence/incongruence. EEG data from a cohort of 39 typical speakers were decomposed with independent component analysis to identify bilateral mu rhythms, which index sensorimotor activity. Subsequent time-frequency analyses revealed bilateral patterns of event-related desynchronization (ERD) across alpha and beta frequency ranges over the time course of perceptual events. Right mu activity was characterized by reduced ERD during all cases of audiovisual incongruence, while left mu activity was attenuated and protracted in McGurk trials eliciting sensory fusion. The results suggest distinct hemispheric contributions, with right hemisphere mu activity supporting a coarse incongruence detection process and left hemisphere mu activity reflecting a more granular level of analysis, including phonological identification and incongruence resolution. The findings are also considered with regard to incongruence detection and resolution processes during production.
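Event-related desynchronization is conventionally expressed as band power relative to a pre-stimulus baseline. The sketch below shows that computation on a toy mu-band signal; the filter settings, windows, and simulated data are illustrative assumptions, not the study's processing pipeline.

```python
# Minimal sketch of event-related desynchronization (ERD): band power expressed
# as percent change from a pre-stimulus baseline, for a toy 10 Hz mu rhythm.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 250.0
t = np.arange(-1.0, 2.0, 1.0 / fs)                              # trial time in seconds
sig = np.sin(2 * np.pi * 10 * t) * np.where(t < 0, 1.0, 0.6)     # amplitude drops after onset
sig = sig + 0.2 * np.random.default_rng(2).standard_normal(t.size)

b, a = butter(4, [8, 13], btype="bandpass", fs=fs)               # alpha/mu band
power = np.abs(hilbert(filtfilt(b, a, sig))) ** 2                # instantaneous band power

baseline = power[t < 0].mean()
erd = 100.0 * (power - baseline) / baseline                      # percent change from baseline
print(f"mean ERD in 0.25-1.0 s window: {erd[(t > 0.25) & (t < 1.0)].mean():.1f}%")
```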
Affiliation(s)
- David Jenson
- Department of Speech and Hearing Sciences, Washington State University, Spokane, Washington, United States of America
5. Schaadt G, van der Meer E, Pannekamp A, Oberecker R, Männel C. Children with dyslexia show a reduced processing benefit from bimodal speech information compared to their typically developing peers. Neuropsychologia 2019; 126:147-158. DOI: 10.1016/j.neuropsychologia.2018.01.013.
6. Krause JC, Lopez KA. Cued speech transliteration: effects of accuracy and lag time on message intelligibility. J Deaf Stud Deaf Educ 2017; 22:378-392. PMID: 28961869; PMCID: PMC5881264; DOI: 10.1093/deafed/enx024.
Abstract
This paper is the second in a series concerned with the level of access afforded to students who use educational interpreters. The first paper (Krause & Tessler, 2016) focused on factors affecting the accuracy of messages produced by Cued Speech (CS) transliterators (expression). In this study, factors affecting the intelligibility of those messages (reception by deaf consumers) are explored. Results for eight highly skilled receivers of CS showed that (a) overall intelligibility (72%) was higher than average accuracy (61%), (b) accuracy had a large positive effect on intelligibility, accounting for 26% of the variance, and (c) the likelihood that an utterance reached 70% intelligibility as a function of accuracy was sigmoidal, decreasing sharply for accuracy values below 65%. While no direct effect of lag time on intelligibility could be detected, intelligibility was most likely to exceed 70% for utterances produced with lag times between 0.6 and 1.8 s. Differences between CS transliterators suggested sources of transliterator variability (e.g., speechreadability, facial expression, non-manual markers, cueing rate) that are also likely to affect intelligibility and thus warrant further investigation. Such research, along with investigations of other communication options (e.g., American Sign Language, signed English systems), is important for ensuring accessibility for all students who use educational interpreters.
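A sigmoidal accuracy-intelligibility relationship of the kind reported here can be modeled with logistic regression, as in the sketch below. The data are simulated purely for illustration, and the assumed steepness and 65% inflection point merely echo the abstract; they are not the study's estimates.

```python
# Hedged sketch: logistic model of whether an utterance reaches 70% intelligibility
# as a function of transliteration accuracy, on simulated data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
accuracy = rng.uniform(0.3, 1.0, size=200)                     # per-utterance accuracy
p_true = 1.0 / (1.0 + np.exp(-(accuracy - 0.65) * 15.0))        # assumed steep rise near 65%
reached_70 = rng.random(200) < p_true                           # binary intelligibility outcome

model = LogisticRegression().fit(accuracy.reshape(-1, 1), reached_70)
for acc in (0.55, 0.65, 0.75, 0.85):
    p = model.predict_proba([[acc]])[0, 1]
    print(f"accuracy {acc:.0%}: P(intelligibility >= 70%) ~ {p:.2f}")
```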
7. High visual resolution matters in audiovisual speech perception, but only for some. Atten Percept Psychophys 2016; 78:1472-1487. DOI: 10.3758/s13414-016-1109-4.
8. Venezia JH, Thurman SM, Matchin W, George SE, Hickok G. Timing in audiovisual speech perception: a mini review and new psychophysical data. Atten Percept Psychophys 2016; 78:583-601. PMID: 26669309; PMCID: PMC4744562; DOI: 10.3758/s13414-015-1026-y.
Abstract
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally leading visual speech information contributes to perception. Previous studies exploring audiovisual speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others, randomly across trials. Variability in participants' responses (~35% identification of /apa/ compared to ~5% in the absence of the masker) served as the basis for classification analysis. The outcome was a high-resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
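The classification procedure can be pictured as a reverse-correlation analysis: contrast the random masks on trials where /apa/ was reported with those where it was not. The sketch below illustrates that logic on simulated data; the dimensions and the simulated observer are assumptions, not the authors' stimuli or analysis code.

```python
# Minimal reverse-correlation sketch: which masked frames/regions drive /apa/ reports?
import numpy as np

rng = np.random.default_rng(4)
n_trials, n_frames, n_pixels = 2000, 45, 64          # toy spatiotemporal mask grid
masks = rng.random((n_trials, n_frames, n_pixels))   # per-trial transparency (0=hidden, 1=visible)

# Toy observer: reporting /apa/ becomes more likely when an "informative" region
# (frames 10-15, pixels 20-30) is hidden, mimicking loss of the visual /aka/ cue.
informative = masks[:, 10:15, 20:30].mean(axis=(1, 2))
reported_apa = rng.random(n_trials) < (0.5 - 0.4 * (informative - 0.5))

# Classification image: mean mask on /apa/ trials minus mean mask on non-/apa/ trials.
cimg = masks[reported_apa].mean(axis=0) - masks[~reported_apa].mean(axis=0)
frame_profile = cimg.mean(axis=1)                    # collapse over pixels -> temporal profile
print("most negative frame (visibility suppresses /apa/ reports):", frame_profile.argmin())
```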
Affiliation(s)
- Jonathan H Venezia
- Department of Cognitive Sciences, University of California, Irvine, CA, 92697, USA.
- Steven M Thurman
- Department of Psychology, University of California, Los Angeles, CA, USA
- William Matchin
- Department of Linguistics, University of Maryland, Baltimore, MD, USA
- Sahara E George
- Department of Anatomy and Neurobiology, University of California, Irvine, CA, USA
- Gregory Hickok
- Department of Cognitive Sciences, University of California, Irvine, CA, 92697, USA
9. Schaadt G, Männel C, van der Meer E, Pannekamp A, Friederici AD. Facial speech gestures: the relation between visual speech processing, phonological awareness, and developmental dyslexia in 10-year-olds. Dev Sci 2015; 19:1020-1034. DOI: 10.1111/desc.12346.
Affiliation(s)
- Gesa Schaadt
- Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Department of Psychology, Humboldt-Universität zu Berlin, Germany
- Claudia Männel
- Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Elke van der Meer
- Department of Psychology, Humboldt-Universität zu Berlin, Germany
- Graduate School of Mind and Brain, Berlin, Germany
- Ann Pannekamp
- Department of Psychology, Humboldt-Universität zu Berlin, Germany
- Angela D. Friederici
- Department of Neuropsychology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
10. Files BT, Tjan BS, Jiang J, Bernstein LE. Visual speech discrimination and identification of natural and synthetic consonant stimuli. Front Psychol 2015; 6:878. PMID: 26217249; PMCID: PMC4499841; DOI: 10.3389/fpsyg.2015.00878.
Abstract
From phonetic features to connected discourse, every level of psycholinguistic structure including prosody can be perceived through viewing the talking face. Yet a longstanding notion in the literature is that visual speech perceptual categories comprise groups of phonemes (referred to as visemes), such as /p, b, m/ and /f, v/, whose internal structure is not informative to the visual speech perceiver. This conclusion has not to our knowledge been evaluated using a psychophysical discrimination paradigm. We hypothesized that perceivers can discriminate the phonemes within typical viseme groups, and that discrimination measured with d-prime (d') and response latency is related to visual stimulus dissimilarities between consonant segments. In Experiment 1, participants performed speeded discrimination for pairs of consonant-vowel spoken nonsense syllables that were predicted to be same, near, or far in their perceptual distances, and that were presented as natural or synthesized video. Near pairs were within-viseme consonants. Natural within-viseme stimulus pairs were discriminated significantly above chance (except for /k/-/h/). Sensitivity (d') increased and response times decreased with distance. Discrimination and identification were superior with natural stimuli, which comprised more phonetic information. We suggest that the notion of the viseme as a unitary perceptual category is incorrect. Experiment 2 probed the perceptual basis for visual speech discrimination by inverting the stimuli. Overall reductions in d' with inverted stimuli but a persistent pattern of larger d' for far than for near stimulus pairs are interpreted as evidence that visual speech is represented by both its motion and configural attributes. The methods and results of this investigation open up avenues for understanding the neural and perceptual bases for visual and audiovisual speech perception and for development of practical applications such as visual lipreading/speechreading speech synthesis.
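The abstract reports sensitivity as d-prime. For reference, the sketch below shows the standard computation of d' from hit and false-alarm counts, with a common correction for extreme rates; the counts are illustrative, and the study's same-different design may call for a more elaborate signal-detection model.

```python
# Standard signal-detection computation of d' from response counts.
from scipy.stats import norm

def dprime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate), with a log-linear correction
    (add 0.5 to each count) to avoid infinite z-scores at rates of 0 or 1."""
    hr = (hits + 0.5) / (hits + misses + 1.0)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hr) - norm.ppf(far)

# e.g., a within-viseme pair discriminated above chance (illustrative counts)
print(f"d' = {dprime(hits=70, misses=30, false_alarms=40, correct_rejections=60):.2f}")
```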
Affiliation(s)
- Benjamin T. Files
- U.S. Army Research Laboratory, Human Research and Engineering Directorate, Aberdeen Proving Ground, MD, USA
- Bosco S. Tjan
- Department of Psychology, University of Southern California, Los Angeles, CA, USA
- Lynne E. Bernstein
- Department of Speech and Hearing Science, George Washington University, Washington, DC, USA
11. Napoli DJ, Mellon NK, Niparko JK, Rathmann C, Mathur G, Humphries T, Handley T, Scambler S, Lantos JD. Should all deaf children learn sign language? Pediatrics 2015; 136:170-176. PMID: 26077481; DOI: 10.1542/peds.2014-1632.
Abstract
Every year, 10,000 infants are born in the United States with sensorineural deafness. Deaf children of hearing (and nonsigning) parents are unique among all children in the world in that they cannot easily or naturally learn the language that their parents speak. These parents face tough choices. Should they seek a cochlear implant for their child? If so, should they also learn to sign? As pediatricians, we need to help parents understand the risks and benefits of different approaches to parent-child communication when the child is deaf.
Affiliation(s)
- John K Niparko
- Department of Otolaryngology, University of Southern California
- Christian Rathmann
- Institute for German Sign Language and Communication of the Deaf, University of Hamburg
- Tom Humphries
- Department of Education Studies, University of California at San Diego
12. Bernstein LE, Liebenthal E. Neural pathways for visual speech perception. Front Neurosci 2014; 8:386. PMID: 25520611; PMCID: PMC4248808; DOI: 10.3389/fnins.2014.00386.
Abstract
This paper examines two questions: what levels of speech can be perceived visually, and how is visual speech represented by the brain? Review of the literature leads to the conclusions that every level of psycholinguistic speech structure (i.e., phonetic features, phonemes, syllables, words, and prosody) can be perceived visually, although individuals differ in their abilities to do so, and that there are visual modality-specific representations of speech qua speech in higher-level vision brain areas. That is, the visual system represents the modal patterns of visual speech. The suggestion that the auditory speech pathway receives and represents visual speech is examined in light of neuroimaging evidence on the auditory speech pathways. We outline the generally agreed-upon organization of the visual ventral and dorsal pathways and examine several types of visual processing that might be related to speech through those pathways, specifically face and body, orthography, and sign language processing. In this context, we examine the visual speech processing literature, which reveals widespread, diverse patterns of activity in posterior temporal cortices in response to visual speech stimuli. We outline a model of the visual and auditory speech pathways and make several suggestions: (1) the visual perception of speech relies on visual pathway representations of speech qua speech; (2) a proposed site of these representations, the temporal visual speech area (TVSA), has been demonstrated in posterior temporal cortex, ventral and posterior to the multisensory posterior superior temporal sulcus (pSTS); and (3) given that visual speech has dynamic and configural features, its representations in feedforward visual pathways are expected to integrate these features, possibly in TVSA.
Affiliation(s)
- Lynne E Bernstein
- Department of Speech and Hearing Sciences, George Washington University, Washington, DC, USA
- Einat Liebenthal
- Department of Neurology, Medical College of Wisconsin, Milwaukee, WI, USA; Department of Psychiatry, Brigham and Women's Hospital, Boston, MA, USA
13. Files BT, Auer ET, Bernstein LE. The visual mismatch negativity elicited with visual speech stimuli. Front Hum Neurosci 2013; 7:371. PMID: 23882205; PMCID: PMC3712324; DOI: 10.3389/fnhum.2013.00371.
Abstract
The visual mismatch negativity (vMMN), deriving from the brain's response to stimulus deviance, is thought to be generated by the cortex that represents the stimulus. The vMMN response to visual speech stimuli was used in a study of the lateralization of visual speech processing. Previous research suggested that the right posterior temporal cortex has specialization for processing simple non-speech face gestures, and the left posterior temporal cortex has specialization for processing visual speech gestures. Here, visual speech consonant-vowel (CV) stimuli with controlled perceptual dissimilarities were presented in an electroencephalography (EEG) vMMN paradigm. The vMMNs were obtained using the comparison of event-related potentials (ERPs) for separate CVs in their roles as deviant vs. their roles as standard. Four separate vMMN contrasts were tested, two with the perceptually far deviants (i.e., “zha” or “fa”) and two with the near deviants (i.e., “zha” or “ta”). Only far deviants evoked the vMMN response over the left posterior temporal cortex. All four deviants evoked vMMNs over the right posterior temporal cortex. The results are interpreted as evidence that the left posterior temporal cortex represents speech contrasts that are perceived as different consonants, and the right posterior temporal cortex represents face gestures that may not be perceived as different CVs.
Affiliation(s)
- Benjamin T Files
- Neuroscience Graduate Program, University of Southern California, Los Angeles, CA, USA
14. Bernstein LE, Auer ET, Eberhardt SP, Jiang J. Auditory perceptual learning for speech perception can be enhanced by audiovisual training. Front Neurosci 2013; 7:34. PMID: 23515520; PMCID: PMC3600826; DOI: 10.3389/fnins.2013.00034.
Abstract
Speech perception under audiovisual (AV) conditions is well known to confer benefits to perception such as increased speed and accuracy. Here, we investigated how AV training might benefit or impede auditory perceptual learning of speech degraded by vocoding. In Experiments 1 and 3, participants learned paired associations between vocoded spoken nonsense words and nonsense pictures. In Experiment 1, paired-associates (PA) AV training of one group of participants was compared with audio-only (AO) training of another group. When tested under AO conditions, the AV-trained group was significantly more accurate than the AO-trained group. In addition, pre- and post-training AO forced-choice consonant identification with untrained nonsense words showed that AV-trained participants had learned significantly more than AO participants. The pattern of results pointed to their having learned at the level of the auditory phonetic features of the vocoded stimuli. Experiment 2, a no-training control with testing and re-testing on the AO consonant identification, showed that the controls were as accurate as the AO-trained participants in Experiment 1 but less accurate than the AV-trained participants. In Experiment 3, PA training alternated AV and AO conditions on a list-by-list basis within participants, and training was to criterion (92% correct). PA training with AO stimuli was reliably more effective than training with AV stimuli. We explain these discrepant results in terms of the so-called “reverse hierarchy theory” of perceptual learning and in terms of the diverse multisensory and unisensory processing resources available to speech perception. We propose that early AV speech integration can potentially impede auditory perceptual learning; but visual top-down access to relevant auditory features can promote auditory perceptual learning.
Affiliation(s)
- Lynne E Bernstein
- Communication Neuroscience Laboratory, Department of Speech and Hearing Science, George Washington University, Washington, DC, USA
15. Jiang J, Bernstein LE. Psychophysics of the McGurk and other audiovisual speech integration effects. J Exp Psychol Hum Percept Perform 2011; 37:1193-1209. PMID: 21574741; DOI: 10.1037/a0023100.
Abstract
When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses: auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., both consonants are reported). Here, quantitative measures were developed to account for the distribution of the four types of perceptual responses to 384 different stimuli from four talkers. The measures included mutual information, correlations, and acoustic measures, all representing audiovisual stimulus relationships. In Experiment 1, open-set perceptual responses were obtained for acoustic /bɑ/ or /lɑ/ dubbed to video /bɑ, dɑ, gɑ, vɑ, zɑ, lɑ, wɑ, ðɑ/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, with the variance accounted for in the perceptual response category proportions ranging from 17% to 52%. That audiovisual stimulus relationships can account for perceptual response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships.
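One of the stimulus-relationship measures named here, mutual information between acoustic and optical features, can be estimated from paired feature time series with a simple histogram estimator, as sketched below; the features, binning, and simulated data are illustrative assumptions rather than the paper's measures.

```python
# Histogram estimate of mutual information between paired acoustic and optical features.
import numpy as np

def mutual_information(x, y, bins=8):
    """I(X;Y) in bits for two paired 1-D signals, from a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(5)
acoustic = rng.standard_normal(1000)                            # e.g., frame-wise acoustic energy
optical = 0.7 * acoustic + 0.7 * rng.standard_normal(1000)       # e.g., correlated lip aperture
print(f"I(acoustic; optical) ~ {mutual_information(acoustic, optical):.2f} bits")
```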
Affiliation(s)
- Jintao Jiang
- Division of Communication and Auditory Neuroscience, House Ear Institute, Los Angeles, California, USA.
16. Bernstein LE, Jiang J, Pantazis D, Lu ZL, Joshi A. Visual phonetic processing localized using speech and nonspeech face gestures in video and point-light displays. Hum Brain Mapp 2010; 32:1660-1676. PMID: 20853377; DOI: 10.1002/hbm.21139.
Abstract
The talking face affords multiple types of information. To isolate cortical sites with responsibility for integrating linguistically relevant visual speech cues, speech and nonspeech face gestures were presented in natural video and point-light displays during fMRI scanning at 3.0T. Participants with normal hearing viewed the stimuli and also viewed localizers for the fusiform face area (FFA), the lateral occipital complex (LOC), and the visual motion (V5/MT) regions of interest (ROIs). The FFA, the LOC, and V5/MT were significantly less activated for speech relative to nonspeech and control stimuli. Distinct activation of the posterior superior temporal sulcus and the adjacent middle temporal gyrus to speech, independent of media, was obtained in group analyses. Individual analyses showed that speech and nonspeech stimuli were associated with adjacent but different activations, with the speech activations more anterior. We suggest that the speech activation area is the temporal visual speech area (TVSA), and that it can be localized with the combination of stimuli used in this study.
Affiliation(s)
- Lynne E Bernstein
- Division of Communication and Auditory Neuroscience, House Ear Institute, Los Angeles, California, USA.
17. Ma WJ, Zhou X, Ross LA, Foxe JJ, Parra LC. Lip-reading aids word recognition most in moderate noise: a Bayesian explanation using high-dimensional feature space. PLoS One 2009; 4:e4638. PMID: 19259259; PMCID: PMC2645675; DOI: 10.1371/journal.pone.0004638.
Abstract
Watching a speaker's facial movements can dramatically enhance our ability to comprehend words, especially in noisy environments. From a general doctrine of combining information from different sensory modalities (the principle of inverse effectiveness), one would expect that the visual signals would be most effective at the highest levels of auditory noise. In contrast, we find, in accord with a recent paper, that visual information improves performance more at intermediate levels of auditory noise than at the highest levels, and we show that a novel visual stimulus containing only temporal information does the same. We present a Bayesian model of optimal cue integration that can explain these conflicts. In this model, words are regarded as points in a multidimensional space and word recognition is a probabilistic inference process. When the dimensionality of the feature space is low, the Bayesian model predicts inverse effectiveness; when the dimensionality is high, the enhancement is maximal at intermediate auditory noise levels. When the auditory and visual stimuli differ slightly in high noise, the model makes a counterintuitive prediction: as sound quality increases, the proportion of reported words corresponding to the visual stimulus should first increase and then decrease. We confirm this prediction in a behavioral experiment. We conclude that auditory-visual speech perception obeys the same notion of optimality previously observed only for simple multisensory stimuli.
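The model's core idea, words as points in a feature space with noisy auditory and visual observations combined by Bayesian inference, can be sketched as follows; the dimensionality, noise levels, and uniform prior are illustrative assumptions, not the authors' fitted model.

```python
# Hedged sketch of Bayesian audiovisual word recognition in a feature space.
import numpy as np

rng = np.random.default_rng(6)
n_words, dim = 50, 20
words = rng.standard_normal((n_words, dim))      # each word is a point in feature space

def posterior_over_words(obs_a, obs_v, sigma_a, sigma_v):
    """Posterior over words assuming a uniform prior and independent Gaussian
    noise on the auditory and visual feature observations."""
    log_like = (-np.sum((words - obs_a) ** 2, axis=1) / (2 * sigma_a ** 2)
                - np.sum((words - obs_v) ** 2, axis=1) / (2 * sigma_v ** 2))
    p = np.exp(log_like - log_like.max())
    return p / p.sum()

true = words[7]
sigma_a, sigma_v = 2.0, 1.0                      # noisier auditory than visual channel
obs_a = true + sigma_a * rng.standard_normal(dim)
obs_v = true + sigma_v * rng.standard_normal(dim)
post = posterior_over_words(obs_a, obs_v, sigma_a, sigma_v)
print("true word index: 7, recognized:", post.argmax(),
      "posterior:", round(float(post[post.argmax()]), 3))
```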
Affiliation(s)
- Wei Ji Ma
- Department of Neuroscience, Baylor College of Medicine, Houston, Texas, United States of America.
18. Scarborough R, Keating P, Mattys SL, Cho T, Alwan A. Optical phonetics and visual perception of lexical and phrasal stress in English. Lang Speech 2009; 52:135-175. PMID: 19624028; DOI: 10.1177/0023830909103165.
Abstract
In a study of optical cues to the visual perception of stress, three American English talkers spoke words that differed in lexical stress and sentences that differed in phrasal stress, while video and movements of the face were recorded. The production of stressed and unstressed syllables from these utterances was analyzed along many measures of facial movement, which were generally larger and faster in the stressed condition. In a visual perception experiment, 16 perceivers identified the location of stress in forced-choice judgments of video clips of these utterances (without audio). Phrasal stress was better perceived than lexical stress. The relation of the visual intelligibility of the prosody of these utterances to the optical characteristics of their production was analyzed to determine which cues are associated with successful visual perception. While most optical measures were correlated with perception performance, chin measures, especially Chin Opening Displacement, contributed the most to correct perception independently of the other measures. Thus, our results indicate that the information for visual stress perception is mainly associated with mouth opening movements.
19. Quantified acoustic-optical speech signal incongruity identifies cortical sites of audiovisual speech processing. Brain Res 2008; 1242:172-184. PMID: 18495091; DOI: 10.1016/j.brainres.2008.04.018.
Abstract
A fundamental question about human perception is how the speech perceiving brain combines auditory and visual phonetic stimulus information. We assumed that perceivers learn the normal relationship between acoustic and optical signals. We hypothesized that when the normal relationship is perturbed by mismatching the acoustic and optical signals, cortical areas responsible for audiovisual stimulus integration respond as a function of the magnitude of the mismatch. To test this hypothesis, in a previous study, we developed quantitative measures of acoustic-optical speech stimulus incongruity that correlate with perceptual measures. In the current study, we presented low incongruity (LI, matched), medium incongruity (MI, moderately mismatched), and high incongruity (HI, highly mismatched) audiovisual nonsense syllable stimuli during fMRI scanning. Perceptual responses differed as a function of the incongruity level, and BOLD measures were found to vary regionally and quantitatively with perceptual and quantitative incongruity levels. Each increase in the level of incongruity resulted in an increase in overall levels of cortical activity and in additional activations. However, the only cortical region that demonstrated differential sensitivity to the three stimulus incongruity levels (HI>MI>LI) was a subarea of the left supramarginal gyrus (SMG). The left SMG might support a fine-grained analysis of the relationship between audiovisual phonetic input in comparison with stored knowledge, as hypothesized here. The methods here show that quantitative manipulation of stimulus incongruity is a new and powerful tool for disclosing the system that processes audiovisual speech stimuli.