1
Mitchel AD, Lusk LG, Wellington I, Mook AT. Segmenting Speech by Mouth: The Role of Oral Prosodic Cues for Visual Speech Segmentation. Lang Speech 2023; 66:819-832. [PMID: 36448317 DOI: 10.1177/00238309221137607]
Abstract
Adults are able to use visual prosodic cues in the speaker's face to segment speech. Furthermore, eye-tracking data suggest that learners will shift their gaze to the mouth during visual speech segmentation. Although these findings suggest that the mouth may be viewed more than the eyes or nose during visual speech segmentation, no study has examined the direct functional importance of individual features; thus, it is unclear which visual prosodic cues are important for word segmentation. In this study, we examined the impact of first removing (Experiment 1) and then isolating (Experiment 2) individual facial features on visual speech segmentation. Segmentation performance was above chance in all conditions except for when the visual display was restricted to the eye region (eyes only condition in Experiment 2). This suggests that participants were able to segment speech when they could visually access the mouth but not when the mouth was completely removed from the visual display, providing evidence that visual prosodic cues conveyed by the mouth are sufficient and likely necessary for visual speech segmentation.
Affiliation(s)
- Laina G Lusk
- Bucknell University, USA; Children's Hospital of Philadelphia, USA
- Ian Wellington
- Bucknell University, USA; University of Connecticut, USA
2
Shan T, Wenner CE, Xu C, Duan Z, Maddox RK. Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face. Trends Hear 2022; 26:23312165221136934. [PMID: 36384325 PMCID: PMC9677167 DOI: 10.1177/23312165221136934]
Abstract
Listening in a noisy environment is challenging, but many previous studies have demonstrated that comprehension of speech can be substantially improved by looking at the talker's face. We recently developed a deep neural network (DNN)-based system that generates movies of a talking face from speech audio and a single face image. In this study, we aimed to quantify the benefits that such a system can bring to speech comprehension, especially in noise. The target speech audio was masked at signal-to-noise ratios of -9, -6, -3, and 0 dB and was presented to subjects in three audio-visual (AV) stimulus conditions: (1) synthesized AV: audio with the synthesized talking face movie; (2) natural AV: audio with the original movie from the corpus; and (3) audio-only: audio with a static image of the talker. Subjects were asked to type the sentences they heard in each trial, and keyword recognition was quantified for each condition. Overall, performance in the synthesized AV condition fell approximately halfway between the other two conditions, showing a marked improvement over the audio-only control but still falling short of the natural AV condition. Every subject showed some benefit from the synthetic AV stimulus. The results of this study support the idea that a DNN-based model that generates a talking face from speech audio can meaningfully enhance comprehension in noisy environments and has the potential to be used as a visual hearing aid.
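As a rough illustration of the masking procedure described in this abstract, the Python sketch below mixes a target waveform with a masker at a requested signal-to-noise ratio by scaling the masker to the required power ratio. The waveforms, sample rate, and function name are placeholders for illustration; this is not the authors' code.

import numpy as np

def mix_at_snr(speech, noise, snr_db):
    # Scale the masker so that the speech-to-noise power ratio equals snr_db, then mix.
    noise = noise[:len(speech)]                      # assumes the masker is at least as long as the speech
    p_speech = np.mean(speech ** 2)                  # average power of the target speech
    p_noise = np.mean(noise ** 2)                    # average power of the masker
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise                     # mixture at the requested SNR

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                  # placeholder 1-s "speech" at a 16 kHz sample rate
noise = rng.standard_normal(16000)                   # placeholder masker
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (-9, -6, -3, 0)}   # the SNRs used in the study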
Affiliation(s)
- Tong Shan
- Department of Biomedical Engineering, University of Rochester, Rochester, NY, USA; Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA; Center for Visual Science, University of Rochester, Rochester, NY, USA
- Casper E. Wenner
- Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
- Chenliang Xu
- Department of Computer Science, University of Rochester, Rochester, NY, USA
- Zhiyao Duan
- Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
- Ross K. Maddox
- Department of Biomedical Engineering, University of Rochester, Rochester, NY, USA; Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA; Center for Visual Science, University of Rochester, Rochester, NY, USA; Department of Neuroscience, University of Rochester, Rochester, NY, USA
3
Trotter AS, Banks B, Adank P. The Relevance of the Availability of Visual Speech Cues During Adaptation to Noise-Vocoded Speech. J Speech Lang Hear Res 2021; 64:2513-2528. [PMID: 34161748 DOI: 10.1044/2021_jslhr-20-00575]
Abstract
Purpose This study first aimed to establish whether viewing specific parts of the speaker's face (eyes or mouth), compared to viewing the whole face, affected adaptation to distorted noise-vocoded sentences. Second, this study also aimed to replicate results on processing of distorted speech from lab-based experiments in an online setup. Method We monitored recognition accuracy online while participants were listening to noise-vocoded sentences. We first established whether participants were able to perceive and adapt to audiovisual four-band noise-vocoded sentences when the entire moving face was visible (AV Full). Four further groups were then tested: a group in which participants viewed the moving lower part of the speaker's face (AV Mouth), a group in which participants saw only the moving upper part of the face (AV Eyes), a group in which participants could not see the moving lower or upper face (AV Blocked), and a group in which participants saw an image of a still face (AV Still). Results Participants repeated around 40% of the key words correctly and adapted during the experiment, but only when the moving mouth was visible. In contrast, performance was at floor level, and no adaptation took place, in conditions in which the moving mouth was occluded. Conclusions The results show the importance of being able to observe relevant visual speech information from the speaker's mouth region, but not the eyes/upper face region, when listening and adapting to distorted sentences online. Second, the results also demonstrated that it is feasible to run speech perception and adaptation studies online, but that not all findings reported for lab studies replicate. Supplemental Material https://doi.org/10.23641/asha.14810523.
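For readers unfamiliar with the stimulus manipulation, the following minimal Python sketch shows the core of noise vocoding: band-pass filter the speech, take each band's amplitude envelope, and use it to modulate band-limited noise. The band edges, filter order, and envelope method are illustrative assumptions, not the parameters used in this study (a real vocoder would also low-pass the envelopes).

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(speech, fs, band_edges=(100, 500, 1500, 3000, 6000)):
    # Crude four-band noise vocoder: per band, modulate noise with the speech envelope.
    rng = np.random.default_rng(0)
    out = np.zeros(len(speech))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, speech)                                 # band-limited speech
        envelope = np.abs(hilbert(band))                                # amplitude envelope of the band
        carrier = sosfiltfilt(sos, rng.standard_normal(len(speech)))    # band-limited noise carrier
        out += envelope * carrier                                       # envelope-modulated noise band
    return out / np.max(np.abs(out))                                    # rough normalisation

fs = 16000
speech = np.random.default_rng(1).standard_normal(fs)                   # placeholder signal; a recorded sentence would be used
vocoded = noise_vocode(speech, fs)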
Affiliation(s)
- Antony S Trotter
- Speech, Hearing and Phonetic Sciences, University College London, United Kingdom
- Briony Banks
- Department of Psychology, Lancaster University, United Kingdom
- Patti Adank
- Speech, Hearing and Phonetic Sciences, University College London, United Kingdom
4
Audio-visual integration in noise: Influence of auditory and visual stimulus degradation on eye movements and perception of the McGurk effect. Atten Percept Psychophys 2020; 82:3544-3557. [PMID: 32533526 PMCID: PMC7788022 DOI: 10.3758/s13414-020-02042-x]
Abstract
Seeing a talker’s face can aid audiovisual (AV) integration when speech is presented in noise. However, few studies have simultaneously manipulated auditory and visual degradation. We aimed to establish how degrading the auditory and visual signal affected AV integration. Where people look on the face in this context is also of interest; Buchan, Paré and Munhall (Brain Research, 1242, 162–171, 2008) found fixations on the mouth increased in the presence of auditory noise whilst Wilson, Alsius, Paré and Munhall (Journal of Speech, Language, and Hearing Research, 59(4), 601–615, 2016) found mouth fixations decreased with decreasing visual resolution. In Condition 1, participants listened to clear speech, and in Condition 2, participants listened to vocoded speech designed to simulate the information provided by a cochlear implant. Speech was presented in three levels of auditory noise and three levels of visual blurring. Adding noise to the auditory signal increased McGurk responses, while blurring the visual signal decreased McGurk responses. Participants fixated the mouth more on trials when the McGurk effect was perceived. Adding auditory noise led to people fixating the mouth more, while visual degradation led to people fixating the mouth less. Combined, the results suggest that modality preference and where people look during AV integration of incongruent syllables varies according to the quality of information available.
5
Talking Points: A Modulating Circle Increases Listening Effort Without Improving Speech Recognition in Young Adults. Psychon Bull Rev 2020; 27:536-543. [PMID: 32128719 DOI: 10.3758/s13423-020-01713-y]
Abstract
Speech recognition is improved when the acoustic input is accompanied by visual cues provided by a talking face (Erber in Journal of Speech and Hearing Research, 12(2), 423-425, 1969; Sumby & Pollack in The Journal of the Acoustical Society of America, 26(2), 212-215, 1954). One way that the visual signal facilitates speech recognition is by providing the listener with information about fine phonetic detail that complements information from the auditory signal. However, given that degraded face stimuli can still improve speech recognition accuracy (Munhall, Kroos, Jozan, & Vatikiotis-Bateson in Perception & Psychophysics, 66(4), 574-583, 2004), and static or moving shapes can improve speech detection accuracy (Bernstein, Auer, & Takayanagi in Speech Communication, 44(1-4), 5-18, 2004), aspects of the visual signal other than fine phonetic detail may also contribute to the perception of speech. In two experiments, we show that a modulating circle providing information about the onset, offset, and acoustic amplitude envelope of the speech does not improve recognition of spoken sentences (Experiment 1) or words (Experiment 2). Further, contrary to our hypothesis, the modulating circle increased listening effort despite subjective reports that it made the word recognition task seem easier to complete (Experiment 2). These results suggest that audiovisual speech processing, even when the visual stimulus only conveys temporal information about the acoustic signal, may be a cognitively demanding process.
6
Stemberger JP, Bernhardt BM. Phonetic Transcription for Speech-Language Pathology in the 21st Century. Folia Phoniatr Logop 2019; 72:75-83. [PMID: 31550711 DOI: 10.1159/000500701]
Abstract
The past few decades have seen rapid changes in speech-language pathology in terms of technology, information on speech production and perception, and increasing levels of multilingualism in communities. This tutorial provides an overview of phonetic transcription for the modern world, both for work with clients, and for research and training. The authors draw on their backgrounds in phonetics, phonology and speech-language pathology, and their crosslinguistic project in the phonological acquisition of children with typical versus protracted phonological development. Challenges and solutions are presented, as well as resources for further training of students, clinicians and researchers.
Affiliation(s)
- Joseph Paul Stemberger
- Department of Linguistics, University of British Columbia, Vancouver, British Columbia, Canada
- Barbara May Bernhardt
- School of Audiology and Speech Science, University of British Columbia, Vancouver, British Columbia, Canada
7
Strand JF, Brown VA, Barbour DL. Talking points: A modulating circle reduces listening effort without improving speech recognition. Psychon Bull Rev 2019; 26:291-297. [PMID: 29790122 DOI: 10.3758/s13423-018-1489-7]
Abstract
Speech recognition is improved when the acoustic input is accompanied by visual cues provided by a talking face (Erber in Journal of Speech and Hearing Research, 12(2), 423-425 1969; Sumby & Pollack in The Journal of the Acoustical Society of America, 26(2), 212-215, 1954). One way that the visual signal facilitates speech recognition is by providing the listener with information about fine phonetic detail that complements information from the auditory signal. However, given that degraded face stimuli can still improve speech recognition accuracy (Munhall et al. in Perception & Psychophysics, 66(4), 574-583, 2004), and static or moving shapes can improve speech detection accuracy (Bernstein et al. in Speech Communication, 44(1/4), 5-18, 2004), aspects of the visual signal other than fine phonetic detail may also contribute to the perception of speech. In two experiments, we show that a modulating circle providing information about the onset, offset, and acoustic amplitude envelope of the speech does not improve recognition of spoken sentences (Experiment 1) or words (Experiment 2), but does reduce the effort necessary to recognize speech. These results suggest that although fine phonetic detail may be required for the visual signal to benefit speech recognition, low-level features of the visual signal may function to reduce the cognitive effort associated with processing speech.
Affiliation(s)
- Julia F Strand
- Department of Psychology, Carleton College, Northfield, MN, USA
- Violet A Brown
- Department of Psychology, Carleton College, Northfield, MN, USA
- Dennis L Barbour
- Department of Biomedical Engineering, Washington University in St. Louis, St. Louis, MO, USA
8
Balan JR, Maruthy S. Dynamics of Speech Perception in the Auditory-Visual Mode: An Empirical Evidence for the Management of Auditory Neuropathy Spectrum Disorders. J Audiol Otol 2018; 22:197-203. [PMID: 29969891 PMCID: PMC6233939 DOI: 10.7874/jao.2018.00059]
Abstract
BACKGROUND AND OBJECTIVES The present study probed into the relative and combined contribution of auditory and visual modalities in the speech perception of individuals with auditory neuropathy spectrum disorders (ANSD). Specifically, the identification scores of consonant-vowel (CV) syllables, visual enhancement (VE), and auditory enhancement in different signal-to-noise ratios (SNRs) were compared with those of the control group. SUBJECTS AND METHODS The study used a repeated-measures standard group comparison research design. Two groups of individuals in the age range of 16 to 35 years participated in the study. The clinical group included 35 participants diagnosed with ANSD, while the control group had 35 age- and gender-matched individuals with typical auditory abilities. The participants were assessed for CV syllable identification in auditory-only (A), visual-only (V), and auditory-visual (AV) modalities. The syllables were presented in quiet and at 0 dB SNR. RESULTS The speech identification score was highest in the AV condition, followed by the A condition, and lowest in the V condition. This was true in both groups. The individuals with ANSD were able to make better use of visual cues than the control group, as evident in the VE score. CONCLUSIONS The dynamics of speech perception in the AV mode differ between individuals with ANSD and controls. There is a definite benefit of both auditory and visual cues for individuals with ANSD, suggesting the need to facilitate both modalities as part of audiological rehabilitation. Future studies can focus on independently facilitating the two modalities and testing the benefits in the AV mode of speech perception in individuals with ANSD.
Affiliation(s)
- Jithin Raj Balan
- Department of Audiology, All India Institute of Speech and Hearing, Mysuru, India
- Sandeep Maruthy
- Department of Audiology, All India Institute of Speech and Hearing, Mysuru, India
9
Jansen SD, Keebler JR, Chaparro A. Shifts in Maximum Audiovisual Integration with Age. Multisens Res 2018; 31:191-212. [DOI: 10.1163/22134808-00002599]
Abstract
Listeners attempting to understand speech in noisy environments rely on visual and auditory processes, typically referred to as audiovisual processing. Noise corrupts the auditory speech signal, and listeners naturally leverage visual cues from the talker’s face in an attempt to interpret the degraded auditory signal. Studies of speech intelligibility in noise show that the maximum improvement in speech recognition performance (i.e., maximum visual enhancement or VEmax), derived from seeing an interlocutor’s face, is invariant with age. Several studies have reported that VEmax is typically associated with a signal-to-noise ratio (SNR) of −12 dB; however, few studies have systematically investigated whether the SNR associated with VEmax changes with age. We investigated whether VEmax changes as a function of age, whether the SNR at VEmax changes as a function of age, and what perceptual/cognitive abilities account for or mediate such relationships. We measured VEmax on a nongeriatric adult sample () ranging in age from 20 to 59 years old. We found that VEmax was age-invariant, replicating earlier studies. No perceptual/cognitive measures predicted VEmax, most likely due to limited variance in VEmax scores. Importantly, we found that the SNR at VEmax shifts toward higher (quieter) SNR levels with increasing age; however, this relationship is partially mediated by working memory capacity, where those with larger working memory capacities (WMCs) can identify speech under lower (louder) SNR levels than their age equivalents with smaller WMCs. The current study is the first to report that individual differences in WMC partially mediate the age-related shift in SNR at VEmax.
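One common way to express visual enhancement is as the gain from adding the face, normalised by the headroom left by auditory-only performance; the abstract does not state which normalisation the authors used, so the Python sketch below is a generic illustration with made-up scores.

import numpy as np

snrs = np.array([-18, -15, -12, -9, -6, -3, 0])                        # test SNRs in dB (illustrative)
a_only = np.array([0.05, 0.15, 0.30, 0.55, 0.75, 0.88, 0.95])          # made-up auditory-only proportion correct
av = np.array([0.35, 0.55, 0.72, 0.85, 0.92, 0.96, 0.98])              # made-up audiovisual proportion correct

ve = (av - a_only) / (1.0 - a_only)                                    # normalised visual enhancement at each SNR
ve_max = ve.max()                                                      # VEmax
snr_at_ve_max = snrs[ve.argmax()]                                      # the SNR at which enhancement peaks
print(f"VEmax = {ve_max:.2f} at {snr_at_ve_max} dB SNR")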
Affiliation(s)
- Joseph R. Keebler
- Department of Human Factors and Behavioral Neurobiology, Embry-Riddle Aeronautical University, Daytona Beach, FL, USA
- Alex Chaparro
- Department of Human Factors and Behavioral Neurobiology, Embry-Riddle Aeronautical University, Daytona Beach, FL, USA
10
Looking Behavior and Audiovisual Speech Understanding in Children With Normal Hearing and Children With Mild Bilateral or Unilateral Hearing Loss. Ear Hear 2017; 39:783-794. [PMID: 29252979 DOI: 10.1097/aud.0000000000000534]
Abstract
OBJECTIVES Visual information from talkers facilitates speech intelligibility for listeners when audibility is challenged by environmental noise and hearing loss. Less is known about how listeners actively process and attend to visual information from different talkers in complex multi-talker environments. This study tracked looking behavior in children with normal hearing (NH), mild bilateral hearing loss (MBHL), and unilateral hearing loss (UHL) in a complex multi-talker environment to examine the extent to which children look at talkers and whether looking patterns relate to performance on a speech-understanding task. It was hypothesized that performance would decrease as perceptual complexity increased and that children with hearing loss would perform more poorly than their peers with NH. Children with MBHL or UHL were expected to demonstrate greater attention to individual talkers during multi-talker exchanges, indicating that they were more likely to attempt to use visual information from talkers to assist in speech understanding in adverse acoustics. It also was of interest to examine whether MBHL, versus UHL, would differentially affect performance and looking behavior. DESIGN Eighteen children with NH, eight children with MBHL, and 10 children with UHL participated (8-12 years). They followed audiovisual instructions for placing objects on a mat under three conditions: a single talker providing instructions via a video monitor, four possible talkers alternately providing instructions on separate monitors in front of the listener, and the same four talkers providing both target and nontarget information. Multi-talker background noise was presented at a 5 dB signal-to-noise ratio during testing. An eye tracker monitored looking behavior while children performed the experimental task. RESULTS Behavioral task performance was higher for children with NH than for either group of children with hearing loss. There were no differences in performance between children with UHL and children with MBHL. Eye-tracker analysis revealed that children with NH looked more at the screens overall than did children with MBHL or UHL, though individual differences were greater in the groups with hearing loss. Listeners in all groups spent a small proportion of time looking at relevant screens as talkers spoke. Although looking was distributed across all screens, there was a bias toward the right side of the display. There was no relationship between overall looking behavior and performance on the task. CONCLUSIONS The present study examined the processing of audiovisual speech in the context of a naturalistic task. Results demonstrated that children distributed their looking to a variety of sources during the task, but that children with NH were more likely to look at screens than were those with MBHL/UHL. However, all groups looked at the relevant talkers as they were speaking only a small proportion of the time. Despite variability in looking behavior, listeners were able to follow the audiovisual instructions and children with NH demonstrated better performance than children with MBHL/UHL. These results suggest that performance on some challenging multi-talker audiovisual tasks is not dependent on visual fixation to relevant talkers for children with NH or with MBHL/UHL.
11
Kokinous J, Tavano A, Kotz SA, Schröger E. Perceptual integration of faces and voices depends on the interaction of emotional content and spatial frequency. Biol Psychol 2017; 123:155-165. [DOI: 10.1016/j.biopsycho.2016.12.007]
12
Wilson AH, Alsius A, Paré M, Munhall KG. Spatial Frequency Requirements and Gaze Strategy in Visual-Only and Audiovisual Speech Perception. J Speech Lang Hear Res 2016; 59:601-15. [PMID: 27537379 PMCID: PMC5280058 DOI: 10.1044/2016_jslhr-s-15-0092]
Abstract
PURPOSE The aim of this article is to examine the effects of visual image degradation on performance and gaze behavior in audiovisual and visual-only speech perception tasks. METHOD We presented vowel-consonant-vowel utterances visually filtered at a range of frequencies in visual-only, audiovisual congruent, and audiovisual incongruent conditions (Experiment 1; N = 66). In Experiment 2 (N = 20), participants performed a visual-only speech perception task and in Experiment 3 (N = 20) an audiovisual task while having their gaze behavior monitored using eye-tracking equipment. RESULTS In the visual-only condition, increasing image resolution led to monotonic increases in performance, and proficient speechreaders were more affected by the removal of high spatial information than were poor speechreaders. The McGurk effect also increased with increasing visual resolution, although it was less affected by the removal of high-frequency information. Observers tended to fixate on the mouth more in visual-only perception, but gaze toward the mouth did not correlate with accuracy of silent speechreading or the magnitude of the McGurk effect. CONCLUSIONS The results suggest that individual differences in silent speechreading and the McGurk effect are not related. This conclusion is supported by differential influences of high-resolution visual information on the 2 tasks and differences in the pattern of gaze.
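The visual filtering manipulation can be approximated by removing spatial frequencies above a cutoff from each video frame. The Python sketch below applies an ideal (brick-wall) low-pass filter to a single grey-scale frame; the cutoff value and the hard cutoff itself are assumptions for illustration (published studies typically use smoother filters applied to every frame).

import numpy as np

def lowpass_frame(frame, cycles_per_image):
    # Keep spatial frequencies below cycles_per_image in a roughly square grey-scale frame.
    f = np.fft.fftshift(np.fft.fft2(frame))
    h, w = frame.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt(yy ** 2 + xx ** 2)            # radial frequency in cycles per image
    f[radius > cycles_per_image] = 0               # zero out everything above the cutoff
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

frame = np.random.default_rng(0).random((128, 128))     # placeholder frame; a video frame would be used
blurred = lowpass_frame(frame, cycles_per_image=8)      # e.g., an 8 cycles-per-image cutoff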
Affiliation(s)
- Amanda H. Wilson
- Psychology Department, Queen's University, Kingston, Ontario, Canada
- Centre for Neuroscience Studies, Queen's University, Kingston, Ontario, Canada
- Agnès Alsius
- Psychology Department, Queen's University, Kingston, Ontario, Canada
- Martin Paré
- Centre for Neuroscience Studies, Queen's University, Kingston, Ontario, Canada
- Kevin G. Munhall
- Psychology Department, Queen's University, Kingston, Ontario, Canada
- Centre for Neuroscience Studies, Queen's University, Kingston, Ontario, Canada
13
Tye-Murray N, Spehar B, Myerson J, Hale S, Sommers M. Lipreading and audiovisual speech recognition across the adult lifespan: Implications for audiovisual integration. Psychol Aging 2016; 31:380-9. [PMID: 27294718 PMCID: PMC4910521 DOI: 10.1037/pag0000094]
Abstract
In this study of visual (V-only) and audiovisual (AV) speech recognition in adults aged 22-92 years, the rate of age-related decrease in V-only performance was more than twice that in AV performance. Both auditory-only (A-only) and V-only performance were significant predictors of AV speech recognition, but age did not account for additional (unique) variance. Blurring the visual speech signal decreased speech recognition, and in AV conditions involving stimuli associated with equivalent unimodal performance for each participant, speech recognition remained constant from 22 to 92 years of age. Finally, principal components analysis revealed separate visual and auditory factors, but no evidence of an AV integration factor. Taken together, these results suggest that the benefit that comes from being able to see as well as hear a talker remains constant throughout adulthood and that changes in this AV advantage are entirely driven by age-related changes in unimodal visual and auditory speech recognition.
Affiliation(s)
- Brent Spehar
- Washington University in St Louis School of Medicine
14
High visual resolution matters in audiovisual speech perception, but only for some. Atten Percept Psychophys 2016; 78:1472-87. [DOI: 10.3758/s13414-016-1109-4]
15
Venezia JH, Thurman SM, Matchin W, George SE, Hickok G. Timing in audiovisual speech perception: A mini review and new psychophysical data. Atten Percept Psychophys 2016; 78:583-601. [PMID: 26669309 PMCID: PMC4744562 DOI: 10.3758/s13414-015-1026-y]
Abstract
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally-leading visual speech information contributes to perception. Previous studies exploring audiovisual-speech timing have relied upon psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others randomly across trials. Variability in participants' responses (~35 % identification of /apa/ compared to ~5 % in the absence of the masker) served as the basis for classification analysis. The outcome was a high resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally-leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
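The classification procedure can be thought of as relating, frame by frame, the random visibility of the mouth region to the listener's responses. The Python sketch below runs that core computation on simulated masks and responses; it is a toy version, not the authors' analysis pipeline, and the trial counts and effect sizes are invented.

import numpy as np

rng = np.random.default_rng(0)
n_trials, n_frames = 500, 30
visibility = rng.random((n_trials, n_frames))            # per-trial, per-frame mask transparency (0 = hidden, 1 = visible)

# Simulate responses that depend on visibility within a "critical" window (frames 10-14).
p_apa = 0.05 + 0.30 * visibility[:, 10:15].mean(axis=1)
responses = rng.random(n_trials) < p_apa                  # True where the simulated listener reported /apa/

# Classification image: mean visibility on /apa/ trials minus mean visibility on other trials.
ci = visibility[responses].mean(axis=0) - visibility[~responses].mean(axis=0)
critical_frames = np.argsort(ci)[-5:]                     # frames whose visibility most drove /apa/ reports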
Affiliation(s)
- Jonathan H Venezia
- Department of Cognitive Sciences, University of California, Irvine, CA, 92697, USA
- Steven M Thurman
- Department of Psychology, University of California, Los Angeles, CA, USA
- William Matchin
- Department of Linguistics, University of Maryland, Baltimore, MD, USA
- Sahara E George
- Department of Anatomy and Neurobiology, University of California, Irvine, CA, USA
- Gregory Hickok
- Department of Cognitive Sciences, University of California, Irvine, CA, 92697, USA
16
Jaekl P, Pesquita A, Alsius A, Munhall K, Soto-Faraco S. The contribution of dynamic visual cues to audiovisual speech perception. Neuropsychologia 2015; 75:402-10. [PMID: 26100561 DOI: 10.1016/j.neuropsychologia.2015.06.025]
Abstract
Seeing a speaker's facial gestures can significantly improve speech comprehension, especially in noisy environments. However, the nature of the visual information from the speaker's facial movements that is relevant for this enhancement is still unclear. Like auditory speech signals, visual speech signals unfold over time and contain both dynamic configural information and luminance-defined local motion cues, two information sources that are thought to engage anatomically and functionally separate visual systems. Whereas some past studies have highlighted the importance of local, luminance-defined motion cues in audiovisual speech perception, the contribution of dynamic configural information signalling changes in form over time has not yet been assessed. We therefore attempted to single out the contribution of dynamic configural information to audiovisual speech processing. To this end, we measured word identification performance in noise using unimodal auditory stimuli and audiovisual stimuli. In the audiovisual condition, speaking faces were presented as point-light displays achieved via motion capture of the original talker. Point-light displays could be isoluminant, to minimise the contribution of effective luminance-defined local motion information, or with added luminance contrast, allowing the combined effect of dynamic configural cues and local motion cues. Audiovisual enhancement was found in both the isoluminant and contrast-based luminance conditions compared to an auditory-only condition, demonstrating, for the first time, the specific contribution of dynamic configural cues to audiovisual speech improvement. These findings imply that globally processed changes in a speaker's facial shape contribute significantly towards the perception of articulatory gestures and the analysis of audiovisual speech.
Affiliation(s)
- Philip Jaekl
- Center for Visual Science and Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
- Ana Pesquita
- UBC Vision Lab, Department of Psychology, University of British Columbia, Vancouver, BC, Canada
- Agnes Alsius
- Department of Psychology, Queen's University, Kingston, ON, Canada
- Kevin Munhall
- Department of Psychology, Queen's University, Kingston, ON, Canada
- Salvador Soto-Faraco
- Centre for Brain and Cognition, Department of Information Technology and Communications, Universitat Pompeu Fabra, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Spain
17
Eg R, Behne DM. Perceived synchrony for realistic and dynamic audiovisual events. Front Psychol 2015; 6:736. [PMID: 26082738 PMCID: PMC4451240 DOI: 10.3389/fpsyg.2015.00736]
Abstract
In well-controlled laboratory experiments, researchers have found that humans can perceive delays between auditory and visual signals as short as 20 ms. Conversely, other experiments have shown that humans can tolerate audiovisual asynchrony that exceeds 200 ms. This seeming contradiction in human temporal sensitivity can be attributed to a number of factors such as experimental approaches and precedence of the asynchronous signals, along with the nature, duration, location, complexity and repetitiveness of the audiovisual stimuli, and even individual differences. In order to better understand how temporal integration of audiovisual events occurs in the real world, we need to close the gap between the experimental setting and the complex setting of everyday life. With this work, we aimed to contribute one brick to the bridge that will close this gap. We compared perceived synchrony for long-running and eventful audiovisual sequences to shorter sequences that contain a single audiovisual event, for three types of content: action, music, and speech. The resulting windows of temporal integration showed that participants were better at detecting asynchrony for the longer stimuli, possibly because the long-running sequences contain multiple corresponding events that offer audiovisual timing cues. Moreover, the points of subjective simultaneity differ between content types, suggesting that the nature of a visual scene could influence the temporal perception of events. An expected outcome from this type of experiment was the rich variation among participants' distributions and the derived points of subjective simultaneity. Hence, the designs of similar experiments call for more participants than traditional psychophysical studies. Heeding this caution, we conclude that existing theories on multisensory perception are ready to be tested on more natural and representative stimuli.
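Points of subjective simultaneity (PSS) and temporal integration windows are usually derived by fitting a curve to the proportion of synchronous responses across audiovisual offsets. The Python sketch below fits a Gaussian to made-up judgments and reads off its centre and width; this is a simplification of the fitting procedures typically reported, and the data are invented.

import numpy as np
from scipy.optimize import curve_fit

offsets = np.array([-300, -200, -100, 0, 100, 200, 300, 400])              # audio lead (-) / audio lag (+), in ms
p_sync = np.array([0.15, 0.45, 0.80, 0.95, 0.90, 0.70, 0.40, 0.20])        # made-up proportion of "synchronous" responses

def gaussian(x, amp, centre, width):
    return amp * np.exp(-0.5 * ((x - centre) / width) ** 2)

(amp, pss, width), _ = curve_fit(gaussian, offsets, p_sync, p0=[1.0, 50.0, 150.0])
print(f"PSS = {pss:.0f} ms, window width = {width:.0f} ms")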
Affiliation(s)
- Dawn M Behne
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
18
Yi A, Wong W, Eizenman M. Gaze patterns and audiovisual speech enhancement. J Speech Lang Hear Res 2013; 56:471-80. [PMID: 23275394 DOI: 10.1044/1092-4388(2012/10-0288)]
Abstract
PURPOSE In this study, the authors sought to quantify the relationships between speech intelligibility (perception) and gaze patterns under different auditory-visual conditions. METHOD Eleven subjects listened to low-context sentences spoken by a single talker while viewing the face of one or more talkers on a computer display. Subjects either maintained their gaze at a specific distance (0°, 2.5°, 5°, 10°, and 15°) from the center of the talker's mouth (CTM) or moved their eyes freely on the computer display. Eye movements were monitored with an eye-tracking system, and speech intelligibility was evaluated by the mean percentage of correctly perceived words. RESULTS With a single talker and a fixed point of gaze, speech intelligibility was similar for all fixations within 10° of the CTM. With visual cues from two talker faces and a speech signal from one of the talkers, speech intelligibility was similar to that of a single talker for fixations within 2.5° of the CTM. With natural viewing of a single talker, gaze strategy changed with speech-signal-to-noise ratio (SNR). For low speech-SNR, a strategy that brought the point of gaze directly to within 2.5° of the CTM was used in approximately 80% of trials, whereas in high speech-SNR it was used in only approximately 50% of trials. CONCLUSIONS With natural viewing of a single talker and high speech-SNR, subjects can shift their gaze between points on the talker's face without compromising speech intelligibility. With low speech-SNR, subjects change their gaze patterns to fixate primarily on points that are in close proximity to the talker's mouth. The latter strategy is essential to optimize speech intelligibility in situations where there are simultaneous visual cues from multiple talkers (i.e., when some of the visual cues are distracters).
Affiliation(s)
- Astrid Yi
- University of Toronto, Ontario, Canada
19
Kelly SD, Hansen BC, Clark DT. "Slight" of hand: the processing of visually degraded gestures with speech. PLoS One 2012; 7:e42620. [PMID: 22912715 PMCID: PMC3415388 DOI: 10.1371/journal.pone.0042620]
Abstract
Co-speech hand gestures influence language comprehension. The present experiment explored what part of the visual processing system is optimized for processing these gestures. Participants viewed short video clips of speech and gestures (e.g., a person saying “chop” or “twist” while making a chopping gesture) and had to determine whether the two modalities were congruent or incongruent. Gesture videos were designed to stimulate the parvocellular or magnocellular visual pathways by filtering out low or high spatial frequencies (HSF versus LSF) at two levels of degradation severity (moderate and severe). Participants were less accurate and slower at processing gesture and speech at severe versus moderate levels of degradation. In addition, they were slower for LSF versus HSF stimuli, and this difference was most pronounced in the severely degraded condition. However, exploratory item analyses showed that the HSF advantage was modulated by the range of motion and amount of motion energy in each video. The results suggest that hand gestures exploit a wide range of spatial frequencies, and depending on what frequencies carry the most motion energy, parvocellular or magnocellular visual pathways are maximized to quickly and optimally extract meaning.
Affiliation(s)
- Spencer D Kelly
- Department of Psychology and Neuroscience Program, Colgate University, Hamilton, New York, United States of America
20
Morris NL, Chaparro A, Downs D, Wood JM. Effects of simulated cataracts on speech intelligibility. Vision Res 2012; 66:49-54. [DOI: 10.1016/j.visres.2012.06.003]
21
Abstract
Visual information augments our understanding of auditory speech. New evidence shows that infants' gaze fixations to the mouth and eye region shift predictably with changes in age and language familiarity.
Affiliation(s)
- K G Munhall
- Departments of Psychology and Otolaryngology, Queen's University, 62 Arch Street, Kingston, Ontario K7L3N6, Canada
22
Jiang J, Bernstein LE. Psychophysics of the McGurk and other audiovisual speech integration effects. J Exp Psychol Hum Percept Perform 2011; 37:1193-209. [PMID: 21574741 DOI: 10.1037/a0023100]
Abstract
When the auditory and visual components of spoken audiovisual nonsense syllables are mismatched, perceivers produce four different types of perceptual responses, auditory correct, visual correct, fusion (the so-called McGurk effect), and combination (i.e., two consonants are reported). Here, quantitative measures were developed to account for the distribution of the four types of perceptual responses to 384 different stimuli from four talkers. The measures included mutual information, correlations, and acoustic measures, all representing audiovisual stimulus relationships. In Experiment 1, open-set perceptual responses were obtained for acoustic /bɑ/ or /lɑ/ dubbed to video /bɑ, dɑ, gɑ, vɑ, zɑ, lɑ, wɑ, ðɑ/. The talker, the video syllable, and the acoustic syllable significantly influenced the type of response. In Experiment 2, the best predictors of response category proportions were a subset of the physical stimulus measures, with the variance accounted for in the perceptual response category proportions between 17% and 52%. That audiovisual stimulus relationships can account for perceptual response distributions supports the possibility that internal representations are based on modality-specific stimulus relationships.
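Mutual information between an acoustic measure and an optical measure of the same stimuli can be estimated from a joint histogram. The Python sketch below shows one standard plug-in estimator on simulated measurements; the measures, bin count, and data are assumptions, not the specific stimulus measures used in the paper.

import numpy as np

def mutual_information(x, y, bins=16):
    # Plug-in mutual information estimate (in bits) from a 2-D histogram of two continuous measures.
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                              # joint probability
    px = pxy.sum(axis=1, keepdims=True)                    # marginal of x
    py = pxy.sum(axis=0, keepdims=True)                    # marginal of y
    nz = pxy > 0                                           # avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
acoustic = rng.standard_normal(384)                        # e.g., one acoustic measurement per stimulus (384 stimuli)
optical = 0.6 * acoustic + 0.8 * rng.standard_normal(384)  # a correlated visual measurement
print(mutual_information(acoustic, optical))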
Affiliation(s)
- Jintao Jiang
- Division of Communication and Auditory Neuroscience, House Ear Institute, Los Angeles, California, USA
23
Dynamic changes in superior temporal sulcus connectivity during perception of noisy audiovisual speech. J Neurosci 2011; 31:1704-14. [PMID: 21289179 DOI: 10.1523/jneurosci.4853-10.2011]
Abstract
Humans are remarkably adept at understanding speech, even when it is contaminated by noise. Multisensory integration may explain some of this ability: combining independent information from the auditory modality (vocalizations) and the visual modality (mouth movements) reduces noise and increases accuracy. Converging evidence suggests that the superior temporal sulcus (STS) is a critical brain area for multisensory integration, but little is known about its role in the perception of noisy speech. Behavioral studies have shown that perceptual judgments are weighted by the reliability of the sensory modality: more reliable modalities are weighted more strongly, even if the reliability changes rapidly. We hypothesized that changes in the functional connectivity of STS with auditory and visual cortex could provide a neural mechanism for perceptual reliability weighting. To test this idea, we performed five blood oxygenation level-dependent functional magnetic resonance imaging and behavioral experiments in 34 healthy subjects. We found increased functional connectivity between the STS and auditory cortex when the auditory modality was more reliable (less noisy) and increased functional connectivity between the STS and visual cortex when the visual modality was more reliable, even when the reliability changed rapidly during presentation of successive words. This finding matched the results of a behavioral experiment in which the perception of incongruent audiovisual syllables was biased toward the more reliable modality, even with rapidly changing reliability. Changes in STS functional connectivity may be an important neural mechanism underlying the perception of noisy speech.
24
Dickinson CM, Taylor J. The effect of simulated visual impairment on speech-reading ability. Ophthalmic Physiol Opt 2011; 31:249-57. [DOI: 10.1111/j.1475-1313.2010.00810.x]
25
Buchan JN, Munhall KG. The Influence of Selective Attention to Auditory and Visual Speech on the Integration of Audiovisual Speech Information. Perception 2011; 40:1164-82. [DOI: 10.1068/p6939]
Abstract
Conflicting visual speech information can influence the perception of acoustic speech, causing an illusory percept of a sound not present in the actual acoustic speech (the McGurk effect). We examined whether participants can voluntarily selectively attend to either the auditory or visual modality by instructing participants to pay attention to the information in one modality and to ignore competing information from the other modality. We also examined how performance under these instructions was affected by weakening the influence of the visual information by manipulating the temporal offset between the audio and video channels (experiment 1), and the spatial frequency information present in the video (experiment 2). Gaze behaviour was also monitored to examine whether attentional instructions influenced the gathering of visual information. While task instructions did have an influence on the observed integration of auditory and visual speech information, participants were unable to completely ignore conflicting information, particularly information from the visual stream. Manipulating temporal offset had a more pronounced interaction with task instructions than manipulating the amount of visual information. Participants' gaze behaviour suggests that the attended modality influences the gathering of visual information in audiovisual speech perception.
Affiliation(s)
- Kevin G Munhall
- Department of Otolaryngology, Queen's University, Kingston, Ontario, Canada
26
Legault I, Gagné JP, Rhoualem W, Anderson-Gosselin P. The effects of blurred vision on auditory-visual speech perception in younger and older adults. Int J Audiol 2010; 49:904-11. [DOI: 10.3109/14992027.2010.509112]
27
Bishop CW, Miller LM. A multisensory cortical network for understanding speech in noise. J Cogn Neurosci 2009; 21:1790-805. [PMID: 18823249 DOI: 10.1162/jocn.2009.21118]
Abstract
In noisy environments, listeners tend to hear a speaker's voice yet struggle to understand what is said. The most effective way to improve intelligibility in such conditions is to watch the speaker's mouth movements. Here we identify the neural networks that distinguish understanding from merely hearing speech, and determine how the brain applies visual information to improve intelligibility. Using functional magnetic resonance imaging, we show that understanding speech-in-noise is supported by a network of brain areas including the left superior parietal lobule, the motor/premotor cortex, and the left anterior superior temporal sulcus (STS), a likely apex of the acoustic processing hierarchy. Multisensory integration likely improves comprehension through improved communication between the left temporal-occipital boundary, the left medial-temporal lobe, and the left STS. This demonstrates how the brain uses information from multiple modalities to improve speech comprehension in naturalistic, acoustically adverse conditions.
28
Buchan JN, Paré M, Munhall KG. The effect of varying talker identity and listening conditions on gaze behavior during audiovisual speech perception. Brain Res 2008; 1242:162-71. [PMID: 18621032 DOI: 10.1016/j.brainres.2008.06.083]
Abstract
During face-to-face conversation the face provides auditory and visual linguistic information, and also conveys information about the identity of the speaker. This study investigated behavioral strategies involved in gathering visual information while watching talking faces. The effects of varying talker identity and varying the intelligibility of speech (by adding acoustic noise) on gaze behavior were measured with an eyetracker. Varying the intelligibility of the speech by adding noise had a noticeable effect on the location and duration of fixations. When noise was present subjects adopted a vantage point that was more centralized on the face by reducing the frequency of the fixations on the eyes and mouth and lengthening the duration of their gaze fixations on the nose and mouth. Varying talker identity resulted in a more modest change in gaze behavior that was modulated by the intelligibility of the speech. Although subjects generally used similar strategies to extract visual information in both talker variability conditions, when noise was absent there were more fixations on the mouth when viewing a different talker every trial as opposed to the same talker every trial. These findings provide a useful baseline for studies examining gaze behavior during audiovisual speech perception and perception of dynamic faces.
Affiliation(s)
- Julie N Buchan
- Department of Psychology, Queen's University, Humphrey Hall, 62 Arch Street, Kingston, Ontario, Canada
29
Everdell IT, Marsh HO, Yurick MD, Munhall KG, Paré M. Gaze behaviour in audiovisual speech perception: asymmetrical distribution of face-directed fixations. Perception 2008; 36:1535-45. [PMID: 18265836 DOI: 10.1068/p5852]
Abstract
Speech perception under natural conditions entails integration of auditory and visual information. Understanding how visual and auditory speech information are integrated requires detailed descriptions of the nature and processing of visual speech information. To understand better the process of gathering visual information, we studied the distribution of face-directed fixations of humans performing an audiovisual speech perception task to characterise the degree of asymmetrical viewing and its relationship to speech intelligibility. Participants showed stronger gaze fixation asymmetries while viewing dynamic faces, compared to static faces or face-like objects, especially when gaze was directed to the talkers' eyes. Although speech perception accuracy was significantly enhanced by the viewing of congruent, dynamic faces, we found no correlation between task performance and gaze fixation asymmetry. Most participants preferentially fixated the right side of the faces and their preferences persisted while viewing horizontally mirrored stimuli, different talkers, or static faces. These results suggest that the asymmetrical distributions of gaze fixations reflect the participants' viewing preferences, rather than being a product of asymmetrical faces, but that this behavioural bias does not predict correct audiovisual speech perception.
Affiliation(s)
- Ian T Everdell
- Biological Communication Centre, Queen's University, Kingston, ON K7L 3N6, Canada
30
Vatikiotis-Bateson E, Yehia HC. Speaking mode variability in multimodal speech production. IEEE Trans Neural Netw 2002; 13:894-9. [PMID: 18244485 DOI: 10.1109/tnn.2002.1021890]
Abstract
The speech acoustics and the phonetically relevant motion of the face during speech are determined by the time-varying behavior of the vocal tract. A benefit of this linkage is that we are able to estimate face motion from the spectral acoustics during speech production using simple neural networks. Thus far, however, the scope of reliable estimation has been limited to individual sentences; network training degrades sharply when multiple sentences are analyzed together. While there are a number of potential avenues for improving network generalization, this paper investigates the possibility that the experimental recording procedures introduce artificial boundary constraints between sentence-length utterances. Specifically, the same sentence materials were recorded individually and as part of longer, paragraph-length utterances. The scope of reliable network estimation was found to depend both on the length of the utterance (sentence versus paragraph) and, not surprisingly, on phonetic content: estimation of face motion from speech acoustics was reliable for larger sentence training sets when sentences were recorded in continuous paragraph readings, and greater phonetic diversity reduced reliability.
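The estimation problem described here, predicting face motion from spectral acoustic features with a simple network, can be sketched as a small feedforward regression. The feature dimensions, network size, and synthetic data below are placeholders, not the configuration or data used by the authors.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
n_frames = 2000
acoustics = rng.standard_normal((n_frames, 20))                         # e.g., 20 spectral features per analysis frame
mixing = rng.standard_normal((20, 12))
face = acoustics @ mixing + 0.1 * rng.standard_normal((n_frames, 12))   # 12 face-marker coordinates per frame (synthetic)

net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
net.fit(acoustics[:1500], face[:1500])                                  # train on part of the material
r2 = net.score(acoustics[1500:], face[1500:])                           # held-out fit, analogous to estimation reliability
print(f"held-out R^2 = {r2:.2f}")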
31
Similarity structure in visual speech perception and optical phonetic signals. Percept Psychophys 2007; 69:1070-83. [PMID: 18038946 DOI: 10.3758/bf03193945]
Abstract
A complete understanding of visual phonetic perception (lipreading) requires linking perceptual effects to physical stimulus properties. However, the talking face is a highly complex stimulus, affording innumerable possible physical measurements. In the search for isomorphism between stimulus properties and phonetic effects, second-order isomorphism was examined between the perceptual similarities of video-recorded, perceptually identified speech syllables and the physical similarities among the stimuli. Four talkers produced the stimulus syllables comprising 23 initial consonants followed by one of three vowels. Six normal-hearing participants identified the syllables in a visual-only condition. Perceptual stimulus dissimilarity was quantified using the Euclidean distances between stimuli in perceptual spaces obtained via multidimensional scaling. Physical stimulus dissimilarity was quantified using face points recorded in three dimensions by an optical motion capture system. The variance accounted for in the relationship between the perceptual and the physical dissimilarities was evaluated using both the raw dissimilarities and the weighted dissimilarities. With weighting and the full set of 3-D optical data, the variance accounted for ranged between 46% and 66% across talkers and between 49% and 64% across vowels. The robust second-order relationship between the sparse 3-D point representation of visible speech and the perceptual effects suggests that the 3-D point representation is a viable basis for controlled studies of first-order relationships between visual phonetic perception and physical stimulus attributes.
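The second-order analysis amounts to correlating two dissimilarity matrices defined over the same syllables: one perceptual (distances in an MDS solution) and one physical (distances between optical measurements). The Python sketch below computes that correlation on simulated data and omits the multidimensional scaling and weighting steps described in the abstract; all values are invented.

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_syllables = 23
physical = rng.standard_normal((n_syllables, 9))                         # e.g., 3-D coordinates of three face points per syllable
perceptual = physical @ rng.standard_normal((9, 2)) + rng.standard_normal((n_syllables, 2))   # toy perceptual space

phys_d = pdist(physical)                                                 # pairwise physical dissimilarities
perc_d = pdist(perceptual)                                               # pairwise perceptual dissimilarities
r, _ = pearsonr(phys_d, perc_d)
print(f"variance accounted for = {100 * r ** 2:.0f}%")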
32
Wilson A, Wilson A, Ten Hove MW, Paré M, Munhall KG. Loss of Central Vision and Audiovisual Speech Perception. Vis Impair Res 2008; 10:23-34. [PMID: 19440249 PMCID: PMC2680551 DOI: 10.1080/13882350802053731]
Abstract
Communication impairments pose a major threat to an individual's quality of life. However, the impact of visual impairments on communication is not well understood, despite the important role that vision plays in the perception of speech. Here we present 2 experiments examining the impact of discrete central scotomas on speech perception. In the first experiment, 4 patients with central vision loss due to unilateral macular holes identified utterances with conflicting auditory-visual information, while simultaneously having their eye movements recorded. Each eye was tested individually. Three participants showed similar speech perception with both the impaired eye and the unaffected eye. For 1 participant, speech perception was disrupted by the scotoma because the participant did not shift gaze to avoid obscuring the talker's mouth with the scotoma. In the second experiment, 12 undergraduate students with gaze-contingent artificial scotomas (10 visual degrees in diameter) identified sentences in background noise. These larger scotomas disrupted speech perception, but some participants overcame this by adopting a gaze strategy whereby they shifted gaze to prevent obscuring important regions of the face such as the mouth. Participants who did not spontaneously adopt an adaptive gaze strategy did not learn to do so over the course of 5 days; however, participants who began with adaptive gaze strategies became more consistent in their gaze location. These findings confirm that peripheral vision is sufficient for perception of most visual information in speech, and suggest that training in gaze strategy may be worthwhile for individuals with communication deficits due to visual impairments.
Affiliation(s)
- Amanda Wilson
- Department of Psychology and Queen's Biological Communication Centre, Queen's University, Kingston, Ontario, Canada
33
Conrey B, Gold JM. An ideal observer analysis of variability in visual-only speech. Vision Res 2006; 46:3243-58. [PMID: 16725171 DOI: 10.1016/j.visres.2006.03.020]
Abstract
Normal-hearing observers typically have some ability to "lipread," or understand visual-only speech without an accompanying auditory signal. However, talkers vary in how easy they are to lipread. Such variability could arise from differences in the visual information available in talkers' speech, human perceptual strategies that are better suited to some talkers than others, or some combination of these factors. A comparison of human and ideal observer performance in a visual-only speech recognition task found that although talkers do vary in how much physical information they produce during speech, human perceptual strategies also play a role in talker variability.
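For a closed-set identification task with known templates and additive Gaussian noise, the ideal observer simply picks the template closest to the noisy stimulus. The Python sketch below implements that decision rule on synthetic stimuli so that human efficiency could, in principle, be compared against it; the stimulus representation and noise model are assumptions, not those of the paper.

import numpy as np

rng = np.random.default_rng(0)
n_tokens, n_dims, noise_sd = 10, 200, 2.0
templates = rng.standard_normal((n_tokens, n_dims))        # one noiseless "visual speech" template per token

def ideal_observer(stimulus, templates):
    # Maximum-likelihood decision under equal-variance Gaussian noise = nearest template.
    dists = np.sum((templates - stimulus) ** 2, axis=1)
    return int(np.argmin(dists))

n_trials = 2000
correct = 0
for _ in range(n_trials):
    truth = rng.integers(n_tokens)
    stimulus = templates[truth] + noise_sd * rng.standard_normal(n_dims)   # noisy presentation of the true token
    correct += ideal_observer(stimulus, templates) == truth
print(f"ideal observer accuracy = {correct / n_trials:.2f}")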
Affiliation(s)
- Brianna Conrey
- Department of Psychological and Brain Sciences, Indiana University, Bloomington, IN 47405, USA