1. Magnotti JF, Lado A, Beauchamp MS. The noisy encoding of disparity model predicts perception of the McGurk effect in native Japanese speakers. Front Neurosci 2024; 18:1421713. PMID: 38988770; PMCID: PMC11233445; DOI: 10.3389/fnins.2024.1421713.
Abstract
In the McGurk effect, visual speech from the face of the talker alters the perception of auditory speech. The diversity of human languages has prompted many intercultural studies of the effect in both Western and non-Western cultures, including native Japanese speakers. Studies of large samples of native English speakers have shown that the McGurk effect is characterized by high variability in the susceptibility of different individuals to the illusion and in the strength of different experimental stimuli to induce the illusion. The noisy encoding of disparity (NED) model of the McGurk effect uses principles from Bayesian causal inference to account for this variability, separately estimating the susceptibility and sensory noise for each individual and the strength of each stimulus. To determine whether variation in McGurk perception is similar between Western and non-Western cultures, we applied the NED model to data collected from 80 native Japanese-speaking participants. Fifteen different McGurk stimuli that varied in syllable content (unvoiced auditory "pa" + visual "ka" or voiced auditory "ba" + visual "ga") were presented interleaved with audiovisual congruent stimuli. The McGurk effect was highly variable across stimuli and participants, with the percentage of illusory fusion responses ranging from 3 to 78% across stimuli and from 0 to 91% across participants. Despite this variability, the NED model accurately predicted perception, predicting fusion rates for individual stimuli with 2.1% error and for individual participants with 2.4% error. Stimuli containing the unvoiced pa/ka pairing evoked more fusion responses than the voiced ba/ga pairing. Model estimates of sensory noise were correlated with participant age, with greater sensory noise in older participants. The NED model of the McGurk effect offers a principled way to account for individual and stimulus differences when examining the McGurk effect in different cultures.
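The NED model's core computation lends itself to a short sketch: each stimulus is summarized by a disparity value, each participant by a sensory-noise level and a disparity threshold, and a fusion response is predicted whenever the noisily encoded disparity falls below the threshold. The following is a minimal illustration in that spirit; the Gaussian-encoding assumption and the parameter values are ours, not figures from the paper.

```python
from scipy.stats import norm

def p_fusion(stim_disparity: float, noise: float, threshold: float) -> float:
    """Predicted probability of a McGurk fusion response.

    The encoded disparity is assumed Gaussian around the true stimulus
    disparity with participant-specific standard deviation `noise`; fusion
    is reported when the encoded value falls below the participant's
    disparity threshold."""
    return float(norm.cdf((threshold - stim_disparity) / noise))

# Hypothetical values: a strong (low-disparity) McGurk stimulus shown to a
# participant with moderate sensory noise.
print(p_fusion(stim_disparity=0.4, noise=0.5, threshold=0.6))  # ~0.66
```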
Affiliation(s)
- John F Magnotti, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Anastasia Lado, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Michael S Beauchamp, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
2. Yu Y, Lado A, Zhang Y, Magnotti JF, Beauchamp MS. Synthetic faces generated with the facial action coding system or deep neural networks improve speech-in-noise perception, but not as much as real faces. Front Neurosci 2024; 18:1379988. PMID: 38784097; PMCID: PMC11111898; DOI: 10.3389/fnins.2024.1379988.
Abstract
The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen; the real face of the talker; and synthetic talking faces generated either by DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces.
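The phoneme-level analysis and the "improve a few phonemes" modeling can be illustrated with a small sketch. The accuracies below are invented placeholders, not the study's data; the point is only to show how giving a synthetic face real-face accuracy on the lip-teeth phonemes would change its overall speech-in-noise benefit.

```python
# Illustrative per-phoneme proportions correct (blank screen, DNN face, real
# face); these numbers are placeholders, not values from the study.
acc = {
    "f":  (0.20, 0.25, 0.70),
    "v":  (0.18, 0.22, 0.65),
    "th": (0.15, 0.20, 0.60),
    "m":  (0.30, 0.55, 0.60),
    "b":  (0.28, 0.50, 0.58),
}

def benefit(idx: int, table: dict) -> float:
    """Mean accuracy gain of a condition over the blank-screen baseline."""
    return sum(row[idx] - row[0] for row in table.values()) / len(table)

# Hypothetical improvement: give the DNN face real-face accuracy on the
# lip-teeth phonemes /f/, /v/, /th/ and recompute its benefit.
patched = {ph: (b0, (real if ph in {"f", "v", "th"} else dnn), real)
           for ph, (b0, dnn, real) in acc.items()}

print(f"DNN benefit:       {benefit(1, acc):.2f}")      # 0.12
print(f"real-face benefit: {benefit(2, acc):.2f}")      # 0.40
print(f"patched DNN:       {benefit(1, patched):.2f}")  # 0.38
```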
Affiliation(s)
- Yingjia Yu, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Anastasia Lado, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Yue Zhang, Department of Neurosurgery, Baylor College of Medicine, Houston, TX, United States
- John F. Magnotti, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Michael S. Beauchamp, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
3. Yang Q, Bai Y, Liu F, Zhang W. Integrated visual transformer and flash attention for lip-to-speech generation GAN. Sci Rep 2024; 14:4525. PMID: 38402265; PMCID: PMC10894270; DOI: 10.1038/s41598-024-55248-6.
Abstract
Lip-to-Speech (LTS) generation is an emerging, highly visible, and rapidly evolving technology. LTS has a wide range of promising applications, including assisting people with speech impairments and improving speech interaction in virtual assistants and robots. However, the technique faces the following challenges: (1) recognition accuracy for Chinese lip-to-speech generation is poor, and (2) the wide variation in lip-speaking makes the generated speech poorly aligned with lip movements. Addressing these challenges will help advance LTS technology, enhance communication abilities, and improve quality of life for individuals with disabilities. Current lip-to-speech generation techniques usually employ a GAN architecture, but their primary shortcoming is insufficient joint modeling of local and global lip movements, which results in visual ambiguities and inadequate image representations. To solve these problems, we design Flash Attention GAN (FA-GAN) with the following features: (1) vision and audio are coded separately, and lip motion is jointly modeled to improve speech recognition accuracy; (2) a multilevel Swin transformer is introduced to improve image representation; (3) a hierarchical iterative generator is introduced to improve speech generation; and (4) a flash attention mechanism is introduced to improve computational efficiency. Extensive experiments indicate that FA-GAN recognizes Chinese and English datasets better than existing architectures; in particular, its recognition error rate on Chinese is only 43.19%, the lowest among comparable methods.
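As a rough illustration of the flash-attention component, PyTorch's scaled_dot_product_attention dispatches to a FlashAttention kernel when hardware and dtype allow, so a lip-frame encoder block might look like the sketch below. The module layout, dimensions, and names are assumptions for illustration, not the FA-GAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LipFrameAttention(nn.Module):
    """Self-attention over a sequence of lip-region frame embeddings;
    scaled_dot_product_attention uses a flash kernel when available."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape                       # (batch, frames, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, frames, head_dim)
        q, k, v = (m.view(b, t, self.heads, d // self.heads).transpose(1, 2)
                   for m in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))

# Hypothetical usage: two clips of 75 lip frames, 256-dim embeddings each.
frames = torch.randn(2, 75, 256)
print(LipFrameAttention()(frames).shape)  # torch.Size([2, 75, 256])
```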
Affiliation(s)
- Qiong Yang, School of Computer Science, Xi'an Polytechnic University, Xi'an, 710048, Shaanxi, China; Shaanxi Key Laboratory of Clothing Intelligence, School of Computer Science, Xi'an Polytechnic University, Xi'an, 710048, Shaanxi, China
- Yuxuan Bai, School of Computer Science, Xi'an Polytechnic University, Xi'an, 710048, Shaanxi, China
- Feng Liu, School of Computer Science, Xi'an Polytechnic University, Xi'an, 710048, Shaanxi, China
- Wei Zhang, China Mobile System Integration Co., Ltd., Xi'an, 710077, China
4. Yu Y, Lado A, Zhang Y, Magnotti JF, Beauchamp MS. The Effect on Speech-in-Noise Perception of Real Faces and Synthetic Faces Generated with either Deep Neural Networks or the Facial Action Coding System. bioRxiv 2024:2024.02.05.578468 [Preprint]. PMID: 38370798; PMCID: PMC10871238; DOI: 10.1101/2024.02.05.578468.
Abstract
The prevalence of synthetic talking faces in both commercial and academic environments is increasing as the technology to generate them grows more powerful and available. While it has long been known that seeing the face of the talker improves human perception of speech-in-noise, recent studies have shown that synthetic talking faces generated by deep neural networks (DNNs) are also able to improve human perception of speech-in-noise. However, in previous studies the benefit provided by DNN synthetic faces was only about half that of real human talkers. We sought to determine whether synthetic talking faces generated by an alternative method would provide a greater perceptual benefit. The facial action coding system (FACS) is a comprehensive system for measuring visually discernible facial movements. Because the action units that comprise FACS are linked to specific muscle groups, synthetic talking faces generated by FACS might have greater verisimilitude than DNN synthetic faces, which do not reference an explicit model of the facial musculature. We tested the ability of human observers to identify speech-in-noise accompanied by a blank screen; the real face of the talker; and synthetic talking faces generated either by DNN or FACS. We replicated previous findings of a large benefit for seeing the face of a real talker for speech-in-noise perception and a smaller benefit for DNN synthetic faces. FACS faces also improved perception, but only to the same degree as DNN faces. Analysis at the phoneme level showed that the performance of DNN and FACS faces was particularly poor for phonemes that involve interactions between the teeth and lips, such as /f/, /v/, and /th/. Inspection of single video frames revealed that the characteristic visual features for these phonemes were weak or absent in synthetic faces. Modeling the real vs. synthetic difference showed that increasing the realism of a few phonemes could substantially increase the overall perceptual benefit of synthetic faces, providing a roadmap for improving communication in this rapidly developing domain.
Affiliation(s)
- Yingjia Yu, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
- Anastasia Lado, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
- Yue Zhang, Department of Neurosurgery, Baylor College of Medicine
- John F Magnotti, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
- Michael S Beauchamp, Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania
5. What Do You Think of AI? Research on the Influence of AI News Anchor Image on Watching Intention. Behav Sci (Basel) 2022; 12(11):465. DOI: 10.3390/bs12110465.
Abstract
Since the concept of artificial intelligence was introduced in 1956, AI technology has gradually been applied in various fields, including journalism. This paper focuses on AI news anchors and uses two related experiments to examine audiences' perceived attractiveness of AI news anchors from a psychological perspective. Study 1 examines the variables that influence viewers' behavioral willingness to watch AI news anchors, while Study 2 examines the mediating and moderating variables that influence audiences' psychological changes. The results indicate that non-humanoid female AI news anchors who broadcast news with anthropomorphic voices obtain the highest perceived attractiveness among audiences. Additionally, both the mediating effect of perceived attractiveness and the negative moderating effect on the inherent impression of traditional news anchors are verified in the study. Based on these findings, implications and suggestions are discussed.
6. Thézé R, Giraud AL, Mégevand P. The phase of cortical oscillations determines the perceptual fate of visual cues in naturalistic audiovisual speech. Sci Adv 2020; 6(45):eabc6348. PMID: 33148648; PMCID: PMC7673697; DOI: 10.1126/sciadv.abc6348.
Abstract
When we see our interlocutor, our brain seamlessly extracts visual cues from their face and processes them along with the sound of their voice, making speech an intrinsically multimodal signal. Visual cues are especially important in noisy environments, when the auditory signal is less reliable. Neuronal oscillations might be involved in the cortical processing of audiovisual speech by selecting which sensory channel contributes more to perception. To test this, we designed computer-generated naturalistic audiovisual speech stimuli where one mismatched phoneme-viseme pair in a key word of sentences created bistable perception. Neurophysiological recordings (high-density scalp and intracranial electroencephalography) revealed that the precise phase angle of theta-band oscillations in posterior temporal and occipital cortex of the right hemisphere was crucial to select whether the auditory or the visual speech cue drove perception. We demonstrate that the phase of cortical oscillations acts as an instrument for sensory selection in audiovisual speech processing.
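The central measurement, the theta-band phase at the moment the ambiguous key word arrives, can be sketched with standard signal-processing tools: band-pass the trace in the theta range, take the analytic signal, and read off the instantaneous phase at the onset sample. The sampling rate, band edges, and channel choice below are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

FS = 1000            # sampling rate in Hz (assumed)
THETA = (4.0, 8.0)   # theta band edges in Hz (assumed)

def theta_phase_at(eeg: np.ndarray, onset_sample: int) -> float:
    """Instantaneous theta phase (radians) at the key-word onset for one
    channel, e.g. a posterior temporal or occipital electrode."""
    b, a = butter(4, THETA, btype="bandpass", fs=FS)
    filtered = filtfilt(b, a, eeg)
    return float(np.angle(hilbert(filtered))[onset_sample])

# Toy check: a pure 6 Hz oscillation, phase read out 0.5 s after trace onset.
t = np.arange(0, 2, 1 / FS)
print(theta_phase_at(np.sin(2 * np.pi * 6 * t), onset_sample=500))
```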
Affiliation(s)
- Raphaël Thézé, Department of Basic Neurosciences, Faculty of Medicine, University of Geneva, 1202 Geneva, Switzerland
- Anne-Lise Giraud, Department of Basic Neurosciences, Faculty of Medicine, University of Geneva, 1202 Geneva, Switzerland
- Pierre Mégevand, Department of Basic Neurosciences, Faculty of Medicine, University of Geneva, 1202 Geneva, Switzerland; Division of Neurology, Department of Clinical Neurosciences, Geneva University Hospitals, 1205 Geneva, Switzerland