1. Yu Y, Lado A, Zhang Y, Magnotti JF, Beauchamp MS. Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces. Front Neurosci 2024; 18:1379988. PMID: 38784097; PMCID: PMC11111898; DOI: 10.3389/fnins.2024.1379988.
Abstract
The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen, the real face of the talker, or synthetic talking faces generated by either DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces.
Affiliation(s)
- Yingjia Yu, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Anastasia Lado, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Yue Zhang, Department of Neurosurgery, Baylor College of Medicine, Houston, TX, United States
- John F. Magnotti, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Michael S. Beauchamp, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
2. Magnotti JF, Lado A, Zhang Y, Maasø A, Nath A, Beauchamp MS. Repeatedly experiencing the McGurk effect induces long-lasting changes in auditory speech perception. Commun Psychol 2024; 2:25. PMID: 39242734; PMCID: PMC11332120; DOI: 10.1038/s44271-024-00073-w.
Abstract
In the McGurk effect, presentation of incongruent auditory and visual speech evokes a fusion percept different from either component modality. We show that repeatedly experiencing the McGurk effect for 14 days induces a change in auditory-only speech perception: the auditory component of the McGurk stimulus begins to evoke the fusion percept, even when presented on its own without accompanying visual speech. This perceptual change, termed fusion-induced recalibration (FIR), was talker-specific and syllable-specific and persisted for a year or more in some participants without any additional McGurk exposure. Participants who did not experience the McGurk effect did not experience FIR, showing that recalibration was driven by multisensory prediction error. A causal inference model of speech perception incorporating multisensory cue conflict accurately predicted individual differences in FIR. Just as the McGurk effect demonstrates that visual speech can alter the perception of auditory speech, FIR shows that these alterations can persist for months or years. The ability to induce seemingly permanent changes in auditory speech perception will be useful for studying plasticity in brain networks for language and may provide new strategies for improving language learning.
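The abstract does not specify the causal inference model's equations. As an illustration only, here is a minimal sketch of the standard Gaussian causal-inference computation (Körding-style, which speech variants of the model build on): the posterior probability that an auditory and a visual cue share a common cause. The function name and parameters are illustrative assumptions, not the authors' implementation:

```python
import math

def p_common_cause(xa, xv, sigma_a, sigma_v, sigma_p, prior_common=0.5):
    """Posterior probability that auditory cue xa and visual cue xv
    arose from a single common cause, under Gaussian cue likelihoods
    and a Gaussian prior N(0, sigma_p^2) over cause locations."""
    # Likelihood of the cue pair given one shared cause (C = 1)
    var_c = (sigma_a**2 * sigma_v**2 + sigma_a**2 * sigma_p**2
             + sigma_v**2 * sigma_p**2)
    like_c = math.exp(-((xa - xv)**2 * sigma_p**2
                        + xa**2 * sigma_v**2
                        + xv**2 * sigma_a**2) / (2 * var_c))
    like_c /= 2 * math.pi * math.sqrt(var_c)
    # Likelihood given two independent causes (C = 2)
    var_a, var_v = sigma_a**2 + sigma_p**2, sigma_v**2 + sigma_p**2
    like_i = (math.exp(-xa**2 / (2 * var_a)) / math.sqrt(2 * math.pi * var_a)
              * math.exp(-xv**2 / (2 * var_v)) / math.sqrt(2 * math.pi * var_v))
    # Bayes' rule over the two causal structures
    return (like_c * prior_common
            / (like_c * prior_common + like_i * (1 - prior_common)))
```

Larger audiovisual conflict (|xa − xv|) lowers the common-cause posterior; on this kind of account, individual differences in cue noise or the common-cause prior predict who fuses the McGurk stimulus and, in turn, who recalibrates.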
Affiliation(s)
- John F. Magnotti, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Anastasia Lado, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Yue Zhang, Department of Neurosurgery, Baylor College of Medicine, Houston, TX, USA
- Arnt Maasø, Institute for Media and Communications, University of Oslo, Oslo, Norway
- Audrey Nath, Department of Neurosurgery, University of Texas Medical Branch, Galveston, TX, USA
- Michael S. Beauchamp, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
3. Yu Y, Lado A, Zhang Y, Magnotti JF, Beauchamp MS. The Effect on Speech-in-Noise Perception of Real Faces and Synthetic Faces Generated with either Deep Neural Networks or the Facial Action Coding System. bioRxiv [Preprint] 2024:2024.02.05.578468. PMID: 38370798; PMCID: PMC10871238; DOI: 10.1101/2024.02.05.578468.
Abstract
The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen, the real face of the talker, or synthetic talking faces generated by either DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces, providing a roadmap for improving communication in this rapidly developing domain.
Affiliation(s)
- Yingjia Yu, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
- Anastasia Lado, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
- Yue Zhang, Department of Neurosurgery, Baylor College of Medicine
- John F. Magnotti, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
- Michael S. Beauchamp, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
4. Ito T, Ogane R. Repetitive Exposure to Orofacial Somatosensory Inputs in Speech Perceptual Training Modulates Vowel Categorization in Speech Perception. Front Psychol 2022; 13:839087. PMID: 35558689; PMCID: PMC9088678; DOI: 10.3389/fpsyg.2022.839087.
Abstract
Orofacial somatosensory inputs may play a role in the link between speech perception and production. Given that speech motor learning, which involves paired auditory and somatosensory inputs, results in changes to speech perceptual representations, somatosensory inputs may also be involved in learning or adaptive processes of speech perception. Here we show that repetitive pairing of somatosensory inputs and sounds, such as occurs during speech production and motor learning, can also induce a change in speech perception. We examined whether the category boundary between /ε/ and /a/ was changed as a result of perceptual training with orofacial somatosensory inputs. The experiment consisted of three phases: Baseline, Training, and Aftereffect. In all phases, a vowel identification test was used to identify the perceptual boundary between /ε/ and /a/. In the Baseline and Aftereffect phases, an adaptive method based on the maximum-likelihood procedure was applied to detect the category boundary using a small number of trials. In the Training phase, we used the method of constant stimuli in order to expose participants to stimulus variants that covered the range between /ε/ and /a/ evenly. In this phase, to mimic the sensory input that accompanies speech production and learning, somatosensory stimulation was applied in the upward direction in an experimental group whenever the stimulus sound was presented. A control group (CTL) followed the same training procedure in the absence of somatosensory stimulation. When we compared category boundaries prior to and following paired auditory-somatosensory training, the boundary for participants in the experimental group reliably shifted in the direction of /ε/, indicating that the participants perceived /a/ more often than /ε/ as a consequence of training. In contrast, the CTL group did not show any change. Although a limited number of participants were tested, the perceptual shift was reduced and almost eliminated 1 week later. Our data suggest that repetitive exposure to somatosensory inputs, in a task that simulates the sensory pairing that occurs during speech production, changes the perceptual system, and they support the idea that somatosensory inputs play a role in speech perceptual adaptation, probably contributing to the formation of sound representations for speech perception.
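The abstract's maximum-likelihood boundary estimation is not spelled out. As a rough illustration of the underlying idea, here is a minimal maximum-likelihood fit of a logistic psychometric function to vowel identification responses, returning the 50% category boundary. The function name, grid ranges, and data format are illustrative assumptions, not the authors' procedure:

```python
import numpy as np

def category_boundary(stimulus_levels, responses):
    """Estimate the 50% point of a logistic psychometric function by
    maximum likelihood over a grid of boundary (alpha) and slope (beta)
    candidates. stimulus_levels: continuum steps; responses: 0/1 choices
    (e.g., 1 = heard /a/, 0 = heard an 'eh'-like vowel)."""
    x = np.asarray(stimulus_levels, float)
    y = np.asarray(responses, float)
    alphas = np.linspace(x.min(), x.max(), 121)   # candidate boundaries
    betas = np.linspace(0.2, 8.0, 40)             # candidate slopes
    A, B = np.meshgrid(alphas, betas, indexing="ij")
    # p[i, j, k]: P(respond 1) under alpha_i, beta_j for trial k
    p = 1.0 / (1.0 + np.exp(-B[..., None] * (x - A[..., None])))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    loglik = (y * np.log(p) + (1 - y) * np.log(1 - p)).sum(axis=-1)
    i, _ = np.unravel_index(np.argmax(loglik), loglik.shape)
    return alphas[i]                              # ML boundary estimate
```

Comparing this estimate before and after training is the kind of comparison the Baseline vs. Aftereffect contrast rests on; an adaptive procedure simply chooses each next stimulus to make this estimate converge in fewer trials.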
Affiliation(s)
- Takayuki Ito, Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France; Haskins Laboratories, New Haven, CT, United States
- Rintaro Ogane, Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France; Haskins Laboratories, New Haven, CT, United States
5. Chodroff E, Wilson C. Acoustic-phonetic and auditory mechanisms of adaptation in the perception of sibilant fricatives. Atten Percept Psychophys 2020; 82:2027-2048. PMID: 31875314; PMCID: PMC7297833; DOI: 10.3758/s13414-019-01894-2.
Abstract
Listeners are highly proficient at adapting to contextual variation when perceiving speech. In the present study, we examined the effects of brief speech and nonspeech contexts on the perception of sibilant fricatives. We explored three theoretically motivated accounts of contextual adaptation, based on phonetic cue calibration, phonetic covariation, and auditory contrast. Under the cue calibration account, listeners adapt by estimating a talker-specific average for each phonetic cue or dimension; under the cue covariation account, listeners adapt by exploiting consistencies in how the realization of speech sounds varies across talkers; under the auditory contrast account, adaptation results from (partial) masking of spectral components that are shared by adjacent stimuli. The spectral center of gravity, a phonetic cue to fricative identity, was manipulated for several types of context sound: /z/-initial syllables, /v/-initial syllables, and white noise matched in long-term average spectrum (LTAS) to the /z/-initial stimuli. Listeners' perception of the /s/-/ʃ/ contrast was significantly influenced by /z/-initial syllables and LTAS-matched white noise stimuli, but not by /v/-initial syllables. No significant difference in adaptation was observed between exposure to /z/-initial syllables and matched white noise stimuli, and speech did not have a considerable advantage over noise when the two were presented consecutively within a context. The pattern of findings is most consistent with the auditory contrast account of short-term perceptual adaptation. The cue covariation account makes accurate predictions for speech contexts, but not for nonspeech contexts or for the absence of a speech-versus-nonspeech difference.
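The spectral center of gravity manipulated here is a standard acoustic measure: the power-weighted mean frequency of a sound's spectrum. A minimal sketch (the function name and FFT-based estimator are illustrative choices, not the authors' exact analysis pipeline):

```python
import numpy as np

def spectral_center_of_gravity(signal, sample_rate):
    """Power-weighted mean frequency (Hz) of a signal's spectrum,
    a common acoustic cue to sibilant fricative identity."""
    power = np.abs(np.fft.rfft(signal)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * power) / np.sum(power))
```

/s/ typically has a higher center of gravity than /ʃ/, so shifting this value along a continuum is how ambiguous fricative stimuli of the kind used here can be constructed.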
Affiliation(s)
- Eleanor Chodroff, Department of Language and Linguistic Science, University of York, Heslington, York, YO10 5DD, UK
- Colin Wilson, Department of Cognitive Science, Johns Hopkins University, 3400 N. Charles St., Baltimore, MD, 21218, USA
6. Attentional resources contribute to the perceptual learning of talker idiosyncrasies in audiovisual speech. Atten Percept Psychophys 2019; 81:1006-1019. PMID: 30684204; DOI: 10.3758/s13414-018-01651-x.
Abstract
To recognize audiovisual speech, listeners evaluate and combine information obtained from the auditory and visual modalities. Listeners also use information from one modality to adjust their phonetic categories to a talker's idiosyncrasy encountered in the other modality. In this study, we examined whether the outcome of this cross-modal recalibration relies on attentional resources. In a standard recalibration experiment in Experiment 1, participants heard an ambiguous sound, disambiguated by the accompanying visual speech as either /p/ or /t/. Participants' primary task was to attend to the audiovisual speech while either monitoring a tone sequence for a target tone or ignoring the tones. Listeners subsequently categorized the steps of an auditory /p/-/t/ continuum more often in line with their exposure. The aftereffect of phonetic recalibration was reduced, but not eliminated, by attentional load during exposure. In Experiment 2, participants saw an ambiguous visual speech gesture that was disambiguated auditorily as either /p/ or /t/. At test, listeners categorized the steps of a visual /p/-/t/ continuum more often in line with the prior exposure. Imposing load in the auditory modality during exposure did not reduce the aftereffect of this type of cross-modal phonetic recalibration. Together, these results suggest that auditory attentional resources are needed for the processing of auditory speech and/or for the shifting of auditory phonetic category boundaries. Listeners thus need to dedicate attentional resources in order to accommodate talker idiosyncrasies in audiovisual speech.
7. Shahin AJ, Shen S, Kerlin JR. Tolerance for audiovisual asynchrony is enhanced by the spectrotemporal fidelity of the speaker's mouth movements and speech. Lang Cogn Neurosci 2017; 32:1102-1118. PMID: 28966930; PMCID: PMC5617130; DOI: 10.1080/23273798.2017.1283428.
Abstract
We examined the relationship between tolerance for audiovisual onset asynchrony (AVOA) and the spectrotemporal fidelity of the spoken words and the speaker's mouth movements. In two experiments that varied only in the temporal order of the sensory modalities, with visual speech leading (Experiment 1) or lagging (Experiment 2) the acoustic speech, participants watched intact and blurred videos of a speaker uttering trisyllabic words and nonwords that were noise-vocoded with 4, 8, 16, or 32 channels. They judged whether the speaker's mouth movements and the speech sounds were in-sync or out-of-sync. Individuals perceived synchrony (tolerated AVOA) on more trials when the acoustic speech was more speech-like (8 channels and higher vs. 4 channels) and when the visual speech was intact rather than blurred (Experiment 1 only). These findings suggest that enhanced spectrotemporal fidelity of the audiovisual (AV) signal prompts the brain to widen the window of integration, promoting the fusion of temporally distant AV percepts.
Affiliation(s)
- Antoine J. Shahin, Center for Mind and Brain, University of California, Davis, CA, 95618
- Stanley Shen, Center for Mind and Brain, University of California, Davis, CA, 95618
- Jess R. Kerlin, Center for Mind and Brain, University of California, Davis, CA, 95618
8. Rosenblum LD, Dorsi J, Dias JW. The Impact and Status of Carol Fowler's Supramodal Theory of Multisensory Speech Perception. Ecological Psychology 2016. DOI: 10.1080/10407413.2016.1230373.
9. Rosenblum LD, Dias JW, Dorsi J. The supramodal brain: implications for auditory perception. Journal of Cognitive Psychology 2016. DOI: 10.1080/20445911.2016.1181691.