1. Karthik G, Cao CZ, Demidenko MI, Jahn A, Stacey WC, Wasade VS, Brang D. Auditory cortex encodes lipreading information through spatially distributed activity. Curr Biol 2024; 34:4021-4032.e5. PMID: 39153482; PMCID: PMC11387126; DOI: 10.1016/j.cub.2024.07.073.
Abstract
Watching a speaker's face improves speech perception accuracy. This benefit is enabled, in part, by implicit lipreading abilities present in the general population. While it is established that lipreading can alter the perception of a heard word, it is unknown how these visual signals are represented in the auditory system or how they interact with auditory speech representations. One influential, but untested, hypothesis is that visual speech modulates the population-coded representations of phonetic and phonemic features in the auditory system. This model is largely supported by data showing that silent lipreading evokes activity in the auditory cortex, but these activations could alternatively reflect general effects of arousal or attention or the encoding of non-linguistic features such as visual timing information. This gap limits our understanding of how vision supports speech perception. To test the hypothesis that the auditory system encodes visual speech information, we acquired functional magnetic resonance imaging (fMRI) data from healthy adults and intracranial recordings from electrodes implanted in patients with epilepsy during auditory and visual speech perception tasks. Across both datasets, linear classifiers successfully decoded the identity of silently lipread words using the spatial pattern of auditory cortex responses. Examining the time course of classification using intracranial recordings, lipread words were classified at earlier time points relative to heard words, suggesting a predictive mechanism for facilitating speech. These results support a model in which the auditory system combines the joint neural distributions evoked by heard and lipread words to generate a more precise estimate of what was said.
Affiliation(s)
- Ganesan Karthik
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- Cody Zhewei Cao
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- Andrew Jahn
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- William C Stacey
- Department of Neurology, University of Michigan, Ann Arbor, MI 48109, USA
- Vibhangini S Wasade
- Henry Ford Hospital, Detroit, MI 48202, USA
- Department of Neurology, Wayne State University School of Medicine, Detroit, MI 48201, USA
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA

2. Çetinçelik M, Jordan-Barros A, Rowland CF, Snijders TM. The effect of visual speech cues on neural tracking of speech in 10-month-old infants. Eur J Neurosci 2024. PMID: 39188179; DOI: 10.1111/ejn.16492.
Abstract
While infants' sensitivity to visual speech cues and the benefit of these cues have been well-established by behavioural studies, there is little evidence on the effect of visual speech cues on infants' neural processing of continuous auditory speech. In this study, we investigated whether visual speech cues, such as the movements of the lips, jaw, and larynx, facilitate infants' neural speech tracking. Ten-month-old Dutch-learning infants watched videos of a speaker reciting passages in infant-directed speech while electroencephalography (EEG) was recorded. In the videos, either the full face of the speaker was displayed or the speaker's mouth and jaw were masked with a block, obstructing the visual speech cues. To assess neural tracking, speech-brain coherence (SBC) was calculated, focusing particularly on the stress and syllabic rates (1-1.75 and 2.5-3.5 Hz, respectively, in our stimuli). First, overall SBC was compared to surrogate data; then, differences in SBC between the two conditions were tested at the frequencies of interest. Our results indicated that infants show significant tracking at both stress and syllabic rates. However, no differences were identified between the two conditions, meaning that infants' neural tracking was not modulated further by the presence of visual speech cues. Furthermore, we demonstrated that infants' neural tracking of low-frequency information is related to their subsequent vocabulary development at 18 months. Overall, this study provides evidence that infants' neural tracking of speech is not necessarily impaired when visual speech cues are not fully visible and that neural tracking may be a potential mechanism in successful language acquisition.
Affiliation(s)
- Melis Çetinçelik
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Department of Experimental Psychology, Utrecht University, Utrecht, The Netherlands
- Cognitive Neuropsychology Department, Tilburg University, Tilburg, The Netherlands
- Antonia Jordan-Barros
- Centre for Brain and Cognitive Development, Department of Psychological Science, Birkbeck, University of London, London, UK
- Experimental Psychology, University College London, London, UK
- Caroline F Rowland
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
- Tineke M Snijders
- Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Cognitive Neuropsychology Department, Tilburg University, Tilburg, The Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands

3. Sato M. Audiovisual speech asynchrony asymmetrically modulates neural binding. Neuropsychologia 2024; 198:108866. PMID: 38518889; DOI: 10.1016/j.neuropsychologia.2024.108866.
Abstract
Previous psychophysical and neurophysiological studies in young healthy adults have provided evidence that audiovisual speech integration occurs with a large degree of temporal tolerance around true simultaneity. To further determine whether audiovisual speech asynchrony modulates auditory cortical processing and neural binding in young healthy adults, N1/P2 auditory evoked responses were compared using an additive model during a syllable categorization task, with or without an audiovisual asynchrony ranging from 240 ms visual lead to 240 ms auditory lead. Consistent with previous psychophysical findings, the observed results converge in favor of an asymmetric temporal integration window. Three main findings were observed: 1) predictive temporal and phonetic cues from pre-phonatory visual movements before the acoustic onset appeared essential for neural binding to occur, 2) audiovisual synchrony, with visual pre-phonatory movements predictive of the onset of the acoustic signal, was a prerequisite for N1 latency facilitation, and 3) P2 amplitude suppression and latency facilitation occurred even when visual pre-phonatory movements were predictive not of the acoustic onset but only of the syllable to come. Taken together, these findings help further clarify how audiovisual speech integration partly operates through two stages of visually-based temporal and phonetic predictions.
Affiliation(s)
- Marc Sato
- Laboratoire Parole et Langage, Centre National de la Recherche Scientifique, Aix-Marseille Université, Aix-en-Provence, France.

4. Zoefel B, Kösem A. Neural tracking of continuous acoustics: properties, speech-specificity and open questions. Eur J Neurosci 2024; 59:394-414. PMID: 38151889; DOI: 10.1111/ejn.16221.
Abstract
Human speech is a particularly relevant acoustic stimulus for our species, due to its role in information transmission during communication. Speech is inherently a dynamic signal, and a recent line of research has focused on neural activity following the temporal structure of speech. We review findings that characterise neural dynamics in the processing of continuous acoustics and that allow us to compare these dynamics with temporal aspects of human speech. We highlight properties and constraints that both neural and speech dynamics share, suggesting that auditory neural systems are optimised to process human speech. We then discuss the speech-specificity of neural dynamics and their potential mechanistic origins, and summarise open questions in the field.
Affiliation(s)
- Benedikt Zoefel
- Centre de Recherche Cerveau et Cognition (CerCo), CNRS UMR 5549, Toulouse, France
- Université de Toulouse III Paul Sabatier, Toulouse, France
- Anne Kösem
- Lyon Neuroscience Research Center (CRNL), INSERM U1028, Bron, France

5. Shahin AJ, Gonzales MG, Dimitrijevic A. Cross-Modal Tinnitus Remediation: A Tentative Theoretical Framework. Brain Sci 2024; 14:95. PMID: 38275515; PMCID: PMC10813772; DOI: 10.3390/brainsci14010095.
Abstract
Tinnitus is a prevalent hearing-loss deficit manifested as a phantom (internally generated by the brain) sound that is heard as a high-frequency tone in the majority of afflicted persons. Chronic tinnitus is debilitating, leading to distress, sleep deprivation, anxiety, and even suicidal thoughts. It has been theorized that, in the majority of afflicted persons, tinnitus can be attributed to the loss of high-frequency input from the cochlea to the auditory cortex, known as deafferentation. Deafferentation develops with age-related hearing loss, which progressively causes the tonotopic regions coding for the lost high frequencies to synchronize, leading to a phantom high-frequency sound sensation. Approaches to tinnitus remediation that have demonstrated promise include inhibitory drugs, the use of tinnitus-specific frequency notching to increase lateral inhibition to the deafferented neurons, and multisensory approaches (auditory-motor and audiovisual) that work by coupling multisensory stimulation to the deafferented neural populations. The goal of this review is to put forward a theoretical framework of a multisensory approach to remedy tinnitus. Our theoretical framework posits that, due to vision's modulatory (inhibitory, excitatory) influence on the auditory pathway, prolonged engagement in audiovisual activity, especially during daily discourse, as opposed to auditory-only activity/discourse, can progressively reorganize deafferented neural populations, resulting in reduced synchrony of the deafferented neurons and a reduction in tinnitus severity over time.
Affiliation(s)
- Antoine J. Shahin
- Department of Cognitive and Information Sciences, University of California, Merced, CA 95343, USA
- Health Science Research Institute, University of California, Merced, CA 95343, USA
- Mariel G. Gonzales
- Department of Cognitive and Information Sciences, University of California, Merced, CA 95343, USA
- Andrew Dimitrijevic
- Sunnybrook Research Institute, University of Toronto, Toronto, ON M4N 3M5, Canada

6. Tan SHJ, Kalashnikova M, Di Liberto GM, Crosse MJ, Burnham D. Seeing a Talking Face Matters: Gaze Behavior and the Auditory-Visual Speech Benefit in Adults' Cortical Tracking of Infant-directed Speech. J Cogn Neurosci 2023; 35:1741-1759. PMID: 37677057; DOI: 10.1162/jocn_a_02044.
Abstract
In face-to-face conversations, listeners gather visual speech information from a speaker's talking face that enhances their perception of the incoming auditory speech signal. This auditory-visual (AV) speech benefit is evident even in quiet environments but is stronger in situations that require greater listening effort such as when the speech signal itself deviates from listeners' expectations. One example is infant-directed speech (IDS) presented to adults. IDS has exaggerated acoustic properties that are easily discriminable from adult-directed speech (ADS). Although IDS is a speech register that adults typically use with infants, no previous neurophysiological study has directly examined whether adult listeners process IDS differently from ADS. To address this, the current study simultaneously recorded EEG and eye-tracking data from adult participants as they were presented with auditory-only (AO), visual-only, and AV recordings of IDS and ADS. Eye-tracking data were recorded because looking behavior to the speaker's eyes and mouth modulates the extent of AV speech benefit experienced. Analyses of cortical tracking accuracy revealed that cortical tracking of the speech envelope was significant in AO and AV modalities for IDS and ADS. However, the AV speech benefit [i.e., AV > (A + V)] was only present for IDS trials. Gaze behavior analyses indicated differences in looking behavior during IDS and ADS trials. Surprisingly, looking behavior to the speaker's eyes and mouth was not correlated with cortical tracking accuracy. Additional exploratory analyses indicated that attention to the whole display was negatively correlated with cortical tracking accuracy of AO and visual-only trials in IDS. Our results underscore the nuances involved in the relationship between neurophysiological AV speech benefit and looking behavior.
Affiliation(s)
- Sok Hui Jessica Tan
- The MARCS Institute of Brain, Behaviour and Development, Western Sydney University, Australia
- Science of Learning in Education Centre, Office of Education Research, National Institute of Education, Nanyang Technological University, Singapore
- Marina Kalashnikova
- The Basque Center on Cognition, Brain and Language
- IKERBASQUE, Basque Foundation for Science
- Giovanni M Di Liberto
- ADAPT Centre, School of Computer Science and Statistics, Trinity College Institute of Neuroscience, Trinity College, The University of Dublin, Ireland
- Michael J Crosse
- SEGOTIA, Galway, Ireland
- Trinity Center for Biomedical Engineering, Department of Mechanical, Manufacturing & Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
- Denis Burnham
- The MARCS Institute of Brain, Behaviour and Development, Western Sydney University, Australia

7. Jiang Z, An X, Liu S, Yin E, Yan Y, Ming D. Neural oscillations reflect the individual differences in the temporal perception of audiovisual speech. Cereb Cortex 2023; 33:10575-10583. PMID: 37727958; DOI: 10.1093/cercor/bhad304.
Abstract
Multisensory integration occurs within a limited time interval between multimodal stimuli. Multisensory temporal perception varies widely among individuals and involves perceptual synchrony and temporal sensitivity processes. Previous studies explored the neural mechanisms of individual differences for beep-flash stimuli, whereas no such study existed for speech. In this study, 28 subjects (16 male) performed an audiovisual speech /ba/ simultaneity judgment task while their electroencephalography was recorded. We examined the relationship between prestimulus neural oscillations (i.e. the pre-pronunciation movement-related oscillations) and temporal perception. Perceptual synchrony was quantified using the Point of Subjective Simultaneity and temporal sensitivity using the Temporal Binding Window. Our results revealed dissociated neural mechanisms for individual differences in the Temporal Binding Window and the Point of Subjective Simultaneity. The frontocentral delta power, reflecting top-down attention control, is positively related to the magnitude of individual auditory leading Temporal Binding Windows (LTBWs), whereas the parieto-occipital theta power, indexing bottom-up visual temporal attention specific to speech, is negatively associated with the magnitude of individual visual leading Temporal Binding Windows (RTBWs). In addition, increased left frontal and bilateral temporoparietal occipital alpha power, reflecting general attentional states, is associated with increased Points of Subjective Simultaneity. Strengthening attention abilities might improve the audiovisual temporal perception of speech and further impact speech integration.
Affiliation(s)
- Zeliang Jiang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Xingwei An
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Shuang Liu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Erwei Yin
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), 100071 Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), 300457 Tianjin, China
- Ye Yan
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), 100071 Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), 300457 Tianjin, China
- Dong Ming
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China

8. Chalas N, Omigie D, Poeppel D, van Wassenhove V. Hierarchically nested networks optimize the analysis of audiovisual speech. iScience 2023; 26:106257. PMID: 36909667; PMCID: PMC9993032; DOI: 10.1016/j.isci.2023.106257.
Abstract
In conversational settings, seeing the speaker's face elicits internal predictions about the upcoming acoustic utterance. Understanding how the listener's cortical dynamics tune to the temporal statistics of audiovisual (AV) speech is thus essential. Using magnetoencephalography, we explored how large-scale frequency-specific dynamics of human brain activity adapt to AV speech delays. First, we show that the amplitude of phase-locked responses parametrically decreases with natural AV speech synchrony, a pattern that is consistent with predictive coding. Second, we show that the temporal statistics of AV speech affect large-scale oscillatory networks at multiple spatial and temporal resolutions. We demonstrate a spatial nestedness of oscillatory networks during the processing of AV speech: these oscillatory hierarchies are such that high-frequency activity (beta, gamma) is contingent on the phase response of low-frequency (delta, theta) networks. Our findings suggest that the endogenous temporal multiplexing of speech processing confers adaptability within the temporal regimes that are essential for speech comprehension.
Affiliation(s)
- Nikos Chalas
- Institute for Biomagnetism and Biosignal Analysis, University of Münster, 48149 Münster, Germany
- CEA, DRF/Joliot, NeuroSpin, INSERM, Cognitive Neuroimaging Unit; CNRS; Université Paris-Saclay, 91191 Gif/Yvette, France
- School of Biology, Faculty of Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Corresponding author
- Diana Omigie
- Department of Psychology, Goldsmiths, University of London, London, UK
- David Poeppel
- Department of Psychology, New York University, New York, NY 10003, USA
- Ernst Struengmann Institute for Neuroscience, 60528 Frankfurt am Main, Germany
- Virginie van Wassenhove
- CEA, DRF/Joliot, NeuroSpin, INSERM, Cognitive Neuroimaging Unit; CNRS; Université Paris-Saclay, 91191 Gif/Yvette, France
- Corresponding author

9. Jiang Z, An X, Liu S, Wang L, Yin E, Yan Y, Ming D. The effect of prestimulus low-frequency neural oscillations on the temporal perception of audiovisual speech. Front Neurosci 2023; 17:1067632. PMID: 36816126; PMCID: PMC9935937; DOI: 10.3389/fnins.2023.1067632.
Abstract
Objective: Perceptual integration and segregation are modulated by the phase of ongoing neural oscillations whose frequency period is broader than the size of the temporal binding window (TBW). Studies have shown that abstract beep-flash stimuli with a TBW of about 100 ms were modulated by the alpha band phase. We therefore hypothesized that the temporal perception of speech, with a TBW of several hundred milliseconds, might be affected by the delta-theta phase. Methods: We conducted a speech-stimuli-based audiovisual simultaneity judgment (SJ) experiment. Twenty human participants (12 female) took part in this study while 62-channel EEG was recorded. Results: Behavioral results showed that the visual leading TBWs are broader than the auditory leading ones [273.37 ± 24.24 ms vs. 198.05 ± 19.28 ms (mean ± SEM)]. We used Phase Opposition Sum (POS) to quantify the differences in mean phase angles and phase concentrations between synchronous and asynchronous responses. The POS results indicated that the delta-theta phase was significantly different between synchronous and asynchronous responses in the A50V condition (50% synchronous responses at the auditory leading SOA). In the V50A condition (50% synchronous responses at the visual leading SOA), however, we only found a delta band effect. In neither condition did we find a consistency of phases over subjects for either perceptual response by the post hoc Rayleigh test (all ps > 0.05). These Rayleigh test results suggest that the phase might not reflect neuronal excitability, which would require the phases within a perceptual response to concentrate on the same angle across subjects rather than be uniformly distributed. A V-test nevertheless showed that the phase difference between synchronous and asynchronous responses across subjects had a significant phase opposition (all ps < 0.05), which is compatible with the POS result. Conclusion: These results indicate that speech temporal perception depends on the alignment of stimulus onset with an optimal phase of the neural oscillation whose frequency period might be broader than the size of the TBW. The role of the oscillatory phase might be to encode temporal information, which varies across subjects, rather than to index neuronal excitability. Given the enriched temporal structures of spoken language stimuli, the conclusion that phase encodes temporal information is plausible and valuable for future research.
Affiliation(s)
- Zeliang Jiang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Xingwei An
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Shuang Liu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Lu Wang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Erwei Yin
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China
- Ye Yan
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China
- Dong Ming
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China

10. Zamuner TS, Rabideau T, McDonald M, Yeung HH. Developmental change in children's speech processing of auditory and visual cues: An eyetracking study. J Child Lang 2023; 50:27-51. PMID: 36503546; DOI: 10.1017/s0305000921000684.
Abstract
This study investigates how children aged two to eight years (N = 129) and adults (N = 29) use auditory and visual speech for word recognition. The goal was to bridge the gap between apparent successes of visual speech processing in young children in visual-looking tasks and apparent difficulties of speech processing in older children on explicit behavioural measures. Participants were presented with familiar words in audio-visual (AV), audio-only (A-only) or visual-only (V-only) speech modalities, then presented with target and distractor images, and looking to targets was measured. Adults showed high accuracy, with slightly less target-image looking in the V-only modality. Developmentally, looking was above chance for both AV and A-only modalities, but not in the V-only modality until 6 years of age (earlier on /k/-initial words). Flexible use of visual cues for lexical access develops throughout childhood.
Affiliation(s)
- Margarethe McDonald
- Department of Linguistics, University of Ottawa, Canada
- School of Psychology, University of Ottawa, Canada
- H Henny Yeung
- Department of Linguistics, Simon Fraser University, Canada
- Integrative Neuroscience and Cognition Centre, UMR 8002, CNRS and University of Paris, France

11. Sato M. The timing of visual speech modulates auditory neural processing. Brain Lang 2022; 235:105196. PMID: 36343508; DOI: 10.1016/j.bandl.2022.105196.
Abstract
In face-to-face communication, visual information from a speaker's face and the time-varying kinematics of articulatory movements have been shown to fine-tune auditory neural processing and improve speech recognition. To further determine whether the timing of visual gestures modulates auditory cortical processing, three sets of syllables differing only in the onset and duration of silent prephonatory movements, before the acoustic speech signal, were contrasted using EEG. Despite similar visual recognition rates, an increase in the amplitude of P2 auditory evoked responses was observed from the longest to the shortest movements. Taken together, these results clarify how audiovisual speech perception partly operates through visually-based predictions and related processing time, with acoustic-phonetic neural processing paralleling the timing of visual prephonatory gestures.
Affiliation(s)
- Marc Sato
- Laboratoire Parole et Langage, Centre National de la Recherche Scientifique, Aix-Marseille Université, Aix-en-Provence, France.

12. Begau A, Arnau S, Klatt LI, Wascher E, Getzmann S. Using visual speech at the cocktail-party: CNV evidence for early speech extraction in younger and older adults. Hear Res 2022; 426:108636. DOI: 10.1016/j.heares.2022.108636.

13. Begau A, Klatt LI, Schneider D, Wascher E, Getzmann S. The role of informational content of visual speech in an audiovisual cocktail party: Evidence from cortical oscillations in young and old participants. Eur J Neurosci 2022; 56:5215-5234. PMID: 36017762; DOI: 10.1111/ejn.15811.
Abstract
Age-related differences in the processing of audiovisual speech in a multi-talker environment were investigated by analysing event-related spectral perturbations (ERSPs), focusing on theta, alpha and beta oscillations that are assumed to reflect conflict processing, multisensory integration and attentional mechanisms, respectively. Eighteen older and 21 younger healthy adults completed a two-alternative forced-choice word discrimination task, responding to audiovisual speech stimuli. In a cocktail-party scenario with two competing talkers (located at -15° and 15° azimuth), target words (/yes/ or /no/) appeared at a pre-defined (attended) position, distractor words at the other position. In two audiovisual conditions, acoustic speech was combined either with informative or uninformative visual speech. While a behavioural benefit for informative visual speech occurred for both age groups, differences between audiovisual conditions in the theta and beta band were only present for older adults. A stronger increase in theta perturbations for stimuli containing uninformative visual speech could be associated with early conflict processing, while a stronger suppression in beta perturbations for informative visual speech could be associated with audiovisual integration. Compared to the younger group, the older group showed generally stronger beta perturbations. No condition differences in the alpha band were found. Overall, the findings suggest age-related differences in audiovisual speech integration in a multi-talker environment. While the behavioural benefit of informative visual speech was unaffected by age, older adults had a stronger need for cognitive control when processing conflicting audiovisual speech input. Furthermore, mechanisms of audiovisual integration are differently activated depending on the informational content of the visual information.
Affiliation(s)
- Alexandra Begau
- Leibniz Research Centre for Working Environment and Human Factors, Dortmund, Germany
- Laura-Isabelle Klatt
- Leibniz Research Centre for Working Environment and Human Factors, Dortmund, Germany
- Daniel Schneider
- Leibniz Research Centre for Working Environment and Human Factors, Dortmund, Germany
- Edmund Wascher
- Leibniz Research Centre for Working Environment and Human Factors, Dortmund, Germany
- Stephan Getzmann
- Leibniz Research Centre for Working Environment and Human Factors, Dortmund, Germany

14. Modulation transfer functions for audiovisual speech. PLoS Comput Biol 2022; 18:e1010273. PMID: 35852989; PMCID: PMC9295967; DOI: 10.1371/journal.pcbi.1010273.
Abstract
Temporal synchrony between facial motion and acoustic modulations is a hallmark feature of audiovisual speech. The moving face and mouth during natural speech is known to be correlated with low-frequency acoustic envelope fluctuations (below 10 Hz), but the precise rates at which envelope information is synchronized with motion in different parts of the face are less clear. Here, we used regularized canonical correlation analysis (rCCA) to learn speech envelope filters whose outputs correlate with motion in different parts of the speaker's face. We leveraged recent advances in video-based 3D facial landmark estimation, allowing us to examine statistical envelope-face correlations across a large number of speakers (∼4000). Specifically, rCCA was used to learn modulation transfer functions (MTFs) for the speech envelope that significantly predict correlation with facial motion across different speakers. The AV analysis revealed bandpass speech envelope filters at distinct temporal scales. A first set of MTFs showed peaks around 3-4 Hz and were correlated with mouth movements. A second set of MTFs captured envelope fluctuations in the 1-2 Hz range correlated with more global face and head motion. These two distinctive timescales emerged only as a property of natural AV speech statistics across many speakers. A similar analysis of fewer speakers performing a controlled speech task highlighted only the well-known temporal modulations around 4 Hz correlated with orofacial motion. The different bandpass ranges of AV correlation align notably with the average rates at which syllables (3-4 Hz) and phrases (1-2 Hz) are produced in natural speech. Whereas periodicities at the syllable rate are evident in the envelope spectrum of the speech signal itself, slower 1-2 Hz regularities thus only become prominent when considering crossmodal signal statistics. This may indicate a motor origin of temporal regularities at the timescales of syllables and phrases in natural speech.

15. Gonzales MG, Backer KC, Yan Y, Miller LM, Bortfeld H, Shahin AJ. Audition controls the flow of visual time during multisensory perception. iScience 2022; 25:104671. PMID: 35845168; PMCID: PMC9283509; DOI: 10.1016/j.isci.2022.104671.
Abstract
Previous work addressing the influence of audition on visual perception has mainly been assessed using non-speech stimuli. Herein, we introduce the Audiovisual Time-Flow Illusion in spoken language, underscoring the role of audition in multisensory processing. When brief pauses were inserted into or brief portions were removed from an acoustic speech stream, individuals perceived the corresponding visual speech as “pausing” or “skipping”, respectively—even though the visual stimulus was intact. When the stimulus manipulation was reversed—brief pauses were inserted into, or brief portions were removed from the visual speech stream—individuals failed to perceive the illusion in the corresponding intact auditory stream. Our findings demonstrate that in the context of spoken language, people continually realign the pace of their visual perception based on that of the auditory input. In short, the auditory modality sets the pace of the visual modality during audiovisual speech processing.
Highlights:
- We describe the significance of the Audiovisual Time-Flow Illusion
- Temporal perturbations to auditory speech drive perception of visual speech
- However, perturbing visual speech stimuli does not affect auditory perception
- Auditory processing controls the temporal perception of the visual speech stream

16. Chalas N, Karagiorgis A, Bamidis P, Paraskevopoulos E. The impact of musical training in symbolic and non-symbolic audiovisual judgements of magnitude. PLoS One 2022; 17:e0266165. PMID: 35511806; PMCID: PMC9070945; DOI: 10.1371/journal.pone.0266165.
Abstract
Quantity estimation can be represented in either an analog or symbolic manner, and recent evidence now suggests that analog and symbolic representations of quantities interact. Nonetheless, those two representational forms of quantities may be enhanced by convergent multisensory information. Here, we elucidate those interactions using high-density electroencephalography (EEG) and an audiovisual oddball paradigm. Participants were presented with simultaneous audiovisual tokens in which the co-varying pitch of tones was combined with the embedded cardinality of dot patterns. Incongruencies were elicited independently from the symbolic and non-symbolic modality within the audio-visual percept, violating the newly acquired rule that “the higher the pitch of the tone, the larger the cardinality of the figure.” The effect of neural plasticity in symbolic and non-symbolic numerical representations of quantities was investigated through a cross-sectional design, comparing musicians to musically naïve controls. Individuals' cortical activity was reconstructed and statistically modeled for a predefined time-window of the evoked response (130–170 ms). To summarize, we show that symbolic and non-symbolic processing of magnitudes is re-organized in cortical space, with professional musicians showing altered activity in motor and temporal areas. Thus, we argue that the symbolic representation of quantities is altered through musical training.
Affiliation(s)
- Nikos Chalas
- Institute for Biomagnetism and Biosignal analysis, University of Münster, Münster, Germany
- School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Alexandros Karagiorgis
- School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Panagiotis Bamidis
- School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Evangelos Paraskevopoulos
- School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Department of Psychology, University of Cyprus, Nicosia, Cyprus

17. Sato M. Motor and visual influences on auditory neural processing during speaking and listening. Cortex 2022; 152:21-35. DOI: 10.1016/j.cortex.2022.03.013.

18. Gordon-Salant S, Schwartz MS, Oppler KA, Yeni-Komshian GH. Detection and Recognition of Asynchronous Auditory/Visual Speech: Effects of Age, Hearing Loss, and Talker Accent. Front Psychol 2022; 12:772867. PMID: 35153900; PMCID: PMC8832148; DOI: 10.3389/fpsyg.2021.772867.
Abstract
This investigation examined age-related differences in auditory-visual (AV) integration as reflected on perceptual judgments of temporally misaligned AV English sentences spoken by native English and native Spanish talkers. In the detection task, it was expected that slowed auditory temporal processing of older participants, relative to younger participants, would be manifest as a shift in the range over which participants would judge asynchronous stimuli as synchronous (referred to as the "AV simultaneity window"). The older participants were also expected to exhibit greater declines in speech recognition for asynchronous AV stimuli than younger participants. Talker accent was hypothesized to influence listener performance, with older listeners exhibiting a greater narrowing of the AV simultaneity window and much poorer recognition of asynchronous AV foreign-accented speech compared to younger listeners. Participant groups included younger and older participants with normal hearing and older participants with hearing loss. Stimuli were video recordings of sentences produced by native English and native Spanish talkers. The video recordings were altered in 50 ms steps by delaying either the audio or video onset. Participants performed a detection task in which they judged whether the sentences were synchronous or asynchronous, and performed a recognition task for multiple synchronous and asynchronous conditions. Both the detection and recognition tasks were conducted at the individualized signal-to-noise ratio (SNR) corresponding to approximately 70% correct speech recognition performance for synchronous AV sentences. Older listeners with and without hearing loss generally showed wider AV simultaneity windows than younger listeners, possibly reflecting slowed auditory temporal processing in auditory lead conditions and reduced sensitivity to asynchrony in auditory lag conditions. However, older and younger listeners were affected similarly by misalignment of auditory and visual signal onsets on the speech recognition task. This suggests that older listeners are negatively impacted by temporal misalignments for speech recognition, even when they do not notice that the stimuli are asynchronous. Overall, the findings show that when listener performance is equated for simultaneous AV speech signals, age effects are apparent in detection judgments but not in recognition of asynchronous speech.
Affiliation(s)
- Sandra Gordon-Salant
- Department of Hearing and Speech Sciences, University of Maryland, College Park, MD, United States

19. Heins N, Pomp J, Kluger DS, Vinbrüx S, Trempler I, Kohler A, Kornysheva K, Zentgraf K, Raab M, Schubotz RI. Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming. PLoS One 2021; 16:e0253130. PMID: 34293800; PMCID: PMC8298114; DOI: 10.1371/journal.pone.0253130.
Abstract
Auditory and visual percepts are integrated even when they are not perfectly temporally aligned with each other, especially when the visual signal precedes the auditory signal. This window of temporal integration for asynchronous audiovisual stimuli is relatively well examined in the case of speech, while other natural action-induced sounds have been widely neglected. Here, we studied the detection of audiovisual asynchrony in three different whole-body actions with natural action-induced sounds–hurdling, tap dancing and drumming. In Study 1, we examined whether audiovisual asynchrony detection, assessed by a simultaneity judgment task, differs as a function of sound production intentionality. Based on previous findings, we expected that auditory and visual signals should be integrated over a wider temporal window for actions creating sounds intentionally (tap dancing), compared to actions creating sounds incidentally (hurdling). While percentages of perceived synchrony differed in the expected way, we identified two further factors, namely high event density and low rhythmicity, to induce higher synchrony ratings as well. Therefore, we systematically varied event density and rhythmicity in Study 2, this time using drumming stimuli to exert full control over these variables, and the same simultaneity judgment tasks. Results suggest that high event density leads to a bias to integrate rather than segregate auditory and visual signals, even at relatively large asynchronies. Rhythmicity had a similar, albeit weaker effect, when event density was low. Our findings demonstrate that shorter asynchronies and visual-first asynchronies lead to higher synchrony ratings of whole-body action, pointing to clear parallels with audiovisual integration in speech perception. Overconfidence in the naturally expected, that is, synchrony of sound and sight, was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchronicity judgments simply because it makes the detection of audiovisual asynchrony more difficult. More studies using real-life audiovisual stimuli with varying event densities and rhythmicities are needed to fully uncover the general mechanisms of audiovisual integration.
Affiliation(s)
- Nina Heins
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Jennifer Pomp
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Daniel S. Kluger
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Institute for Biomagnetism and Biosignal Analysis, University Hospital Muenster, Muenster, Germany
- Stefan Vinbrüx
- Institute of Sport and Exercise Sciences, Human Performance and Training, University of Muenster, Muenster, Germany
- Ima Trempler
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Axel Kohler
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Katja Kornysheva
- School of Psychology and Bangor Neuroimaging Unit, Bangor University, Wales, United Kingdom
- Karen Zentgraf
- Department of Movement Science and Training in Sports, Institute of Sport Sciences, Goethe University Frankfurt, Frankfurt, Germany
- Markus Raab
- Institute of Psychology, German Sport University Cologne, Cologne, Germany
- School of Applied Sciences, London South Bank University, London, United Kingdom
- Ricarda I. Schubotz
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany

20. O'Sullivan AE, Crosse MJ, Di Liberto GM, de Cheveigné A, Lalor EC. Neurophysiological Indices of Audiovisual Speech Processing Reveal a Hierarchy of Multisensory Integration Effects. J Neurosci 2021; 41:4991-5003. PMID: 33824190; PMCID: PMC8197638; DOI: 10.1523/jneurosci.0906-20.2021.
Abstract
Seeing a speaker's face benefits speech comprehension, especially in challenging listening conditions. This perceptual benefit is thought to stem from the neural integration of visual and auditory speech at multiple stages of processing, whereby movement of a speaker's face provides temporal cues to auditory cortex, and articulatory information from the speaker's mouth can aid recognizing specific linguistic units (e.g., phonemes, syllables). However, it remains unclear how the integration of these cues varies as a function of listening conditions. Here, we sought to provide insight on these questions by examining EEG responses in humans (males and females) to natural audiovisual (AV), audio, and visual speech in quiet and in noise. We represented our speech stimuli in terms of their spectrograms and their phonetic features and then quantified the strength of the encoding of those features in the EEG using canonical correlation analysis (CCA). The encoding of both spectrotemporal and phonetic features was shown to be more robust in AV speech responses than what would have been expected from the summation of the audio and visual speech responses, suggesting that multisensory integration occurs at both spectrotemporal and phonetic stages of speech processing. We also found evidence to suggest that the integration effects may change with listening conditions; however, this was an exploratory analysis and future work will be required to examine this effect using a within-subject design. These findings demonstrate that integration of audio and visual speech occurs at multiple stages along the speech processing hierarchy. SIGNIFICANCE STATEMENT: During conversation, visual cues impact our perception of speech. Integration of auditory and visual speech is thought to occur at multiple stages of speech processing and vary flexibly depending on the listening conditions. Here, we examine audiovisual (AV) integration at two stages of speech processing using the speech spectrogram and a phonetic representation, and test how AV integration adapts to degraded listening conditions. We find significant integration at both of these stages regardless of listening conditions. These findings reveal neural indices of multisensory interactions at different stages of processing and provide support for the multistage integration framework.
Affiliation(s)
- Aisling E O'Sullivan
- School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
- Michael J Crosse
- X, The Moonshot Factory, Mountain View, CA
- Department of Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461
- Giovanni M Di Liberto
- Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, Paris Sciences et Lettres University, Centre National de la Recherche Scientifique, Paris 75005, France
- Alain de Cheveigné
- Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, Paris Sciences et Lettres University, Centre National de la Recherche Scientifique, Paris 75005, France
- University College London Ear Institute, University College London, London WC1X 8EE, United Kingdom
- Edmund C Lalor
- School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
- Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, New York 14627

21. Mégevand P, Mercier MR, Groppe DM, Zion Golumbic E, Mesgarani N, Beauchamp MS, Schroeder CE, Mehta AD. Crossmodal Phase Reset and Evoked Responses Provide Complementary Mechanisms for the Influence of Visual Speech in Auditory Cortex. J Neurosci 2020; 40:8530-8542. PMID: 33023923; PMCID: PMC7605423; DOI: 10.1523/jneurosci.0555-20.2020.
Abstract
Natural conversation is multisensory: when we can see the speaker's face, visual speech cues improve our comprehension. The neuronal mechanisms underlying this phenomenon remain unclear. The two main alternatives are visually mediated phase modulation of neuronal oscillations (excitability fluctuations) in auditory neurons and visual input-evoked responses in auditory neurons. Investigating this question using naturalistic audiovisual speech with intracranial recordings in humans of both sexes, we find evidence for both mechanisms. Remarkably, auditory cortical neurons track the temporal dynamics of purely visual speech using the phase of their slow oscillations and phase-related modulations in broadband high-frequency activity. Consistent with known perceptual enhancement effects, the visual phase reset amplifies the cortical representation of concomitant auditory speech. In contrast to this, and in line with earlier reports, visual input reduces the amplitude of evoked responses to concomitant auditory input. We interpret the combination of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable stimulus processing in the presence of congruent auditory and visual speech inputs. SIGNIFICANCE STATEMENT: Watching the speaker can facilitate our understanding of what is being said. The mechanisms responsible for this influence of visual cues on the processing of speech remain incompletely understood. We studied these mechanisms by recording the electrical activity of the human brain through electrodes implanted surgically inside the brain. We found that visual inputs can operate by directly activating auditory cortical areas, and also indirectly by modulating the strength of cortical responses to auditory input. Our results help to understand the mechanisms by which the brain merges auditory and visual speech into a unitary perception.
Affiliation(s)
- Pierre Mégevand
- Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
- Feinstein Institutes for Medical Research, Manhasset, New York 11030
- Department of Basic Neurosciences, Faculty of Medicine, University of Geneva, 1211 Geneva, Switzerland
- Manuel R Mercier
- Department of Neurology, Montefiore Medical Center, Bronx, New York 10467
- Department of Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461
- Institut de Neurosciences des Systèmes, Aix Marseille University, INSERM, 13005 Marseille, France
- David M Groppe
- Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
- Feinstein Institutes for Medical Research, Manhasset, New York 11030
- The Krembil Neuroscience Centre, University Health Network, Toronto, Ontario M5T 1M8, Canada
- Elana Zion Golumbic
- The Gonda Brain Research Center, Bar Ilan University, Ramat Gan 5290002, Israel
- Nima Mesgarani
- Department of Electrical Engineering, Columbia University, New York, New York 10027
- Michael S Beauchamp
- Department of Neurosurgery, Baylor College of Medicine, Houston, Texas 77030
- Charles E Schroeder
- Nathan S. Kline Institute, Orangeburg, New York 10962
- Department of Psychiatry, Columbia University, New York, New York 10032
- Ashesh D Mehta
- Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
- Feinstein Institutes for Medical Research, Manhasset, New York 11030

22. Audio-visual combination of syllables involves time-sensitive dynamics following from fusion failure. Sci Rep 2020; 10:18009. PMID: 33093570; PMCID: PMC7583249; DOI: 10.1038/s41598-020-75201-7.
Abstract
In face-to-face communication, audio-visual (AV) stimuli can be fused, combined or perceived as mismatching. While the left superior temporal sulcus (STS) is presumably the locus of AV integration, the process leading to combination is unknown. Based on previous modelling work, we hypothesize that combination results from a complex dynamic originating in a failure to integrate AV inputs, followed by a reconstruction of the most plausible AV sequence. In two different behavioural tasks and one MEG experiment, we observed that combination is more time-demanding than fusion. Using time- and source-resolved human MEG analyses with linear and dynamic causal models, we show that both fusion and combination involve early detection of AV incongruence in the STS, whereas combination is further associated with enhanced activity of AV asynchrony-sensitive regions (auditory and inferior frontal cortices). Based on neural signal decoding, we finally show that only combination can be decoded from activity in the inferior frontal gyrus (IFG) and that combination is decoded later than fusion in the STS. These results indicate that the AV speech integration outcome primarily depends on whether the STS converges or not onto an existing multimodal syllable representation, and that combination results from subsequent temporal processing, presumably the off-line re-ordering of incongruent AV stimuli.
Collapse
|
23
|
Responses to Visual Speech in Human Posterior Superior Temporal Gyrus Examined with iEEG Deconvolution. J Neurosci 2020; 40:6938-6948. [PMID: 32727820 PMCID: PMC7470920 DOI: 10.1523/jneurosci.0279-20.2020] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2020] [Revised: 06/01/2020] [Accepted: 06/02/2020] [Indexed: 12/22/2022] Open
Abstract
Experimentalists studying multisensory integration compare neural responses to multisensory stimuli with responses to the component modalities presented in isolation. This procedure is problematic for multisensory speech perception since audiovisual speech and auditory-only speech are easily intelligible but visual-only speech is not. To overcome this confound, we developed intracranial electroencephalography (iEEG) deconvolution. Individual stimuli always contained both auditory and visual speech, but jittering the onset asynchrony between modalities allowed the time course of the unisensory responses and the interaction between them to be independently estimated. We applied this procedure to electrodes implanted in human epilepsy patients (both male and female) over the posterior superior temporal gyrus (pSTG), a brain area known to be important for speech perception. iEEG deconvolution revealed sustained positive responses to visual-only speech and larger, phasic responses to auditory-only speech. Confirming results from scalp EEG, responses to audiovisual speech were weaker than responses to auditory-only speech, demonstrating a subadditive multisensory neural computation. Leveraging the spatial resolution of iEEG, we extended these results to show that subadditivity is most pronounced in more posterior aspects of the pSTG. Across electrodes, subadditivity correlated with visual responsiveness, supporting a model in which visual speech enhances the efficiency of auditory speech processing in pSTG. The ability to separate neural processes may make iEEG deconvolution useful for studying a variety of complex cognitive and perceptual tasks.

SIGNIFICANCE STATEMENT Understanding speech is one of the most important human abilities. Speech perception uses information from both the auditory and visual modalities. It has been difficult to study neural responses to visual speech because visual-only speech is difficult or impossible to comprehend, unlike auditory-only and audiovisual speech. We used intracranial electroencephalography deconvolution to overcome this obstacle. We found that visual speech evokes a positive response in the human posterior superior temporal gyrus, enhancing the efficiency of auditory speech processing.
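The deconvolution logic, jittered auditory and visual onsets entering a shared finite-impulse-response (FIR) design matrix whose least-squares solution recovers each modality's response time course, can be sketched as follows; the continuous trace, onset times, and lag window are invented for illustration.

    import numpy as np

    fs, n_lags = 100, 50                  # 10 ms bins; 500 ms response window
    rng = np.random.default_rng(2)
    n_samp = 60 * fs
    y = rng.standard_normal(n_samp)       # stand-in continuous broadband response
    a_on = rng.choice(np.arange(100, n_samp - n_lags), size=80, replace=False)
    v_on = a_on - rng.integers(5, 30, size=80)   # visual onsets lead by a jittered lag

    # FIR design matrix: one column per post-onset lag, per modality.
    X = np.zeros((n_samp, 2 * n_lags))
    for lag in range(n_lags):
        X[a_on + lag, lag] = 1.0
        X[v_on + lag, n_lags + lag] = 1.0

    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    auditory_irf, visual_irf = beta[:n_lags], beta[n_lags:]   # estimated time courses
    print(auditory_irf.shape, visual_irf.shape)

Because the asynchrony varies from trial to trial, the two sets of columns are not collinear, which is what lets the unisensory responses be estimated separately.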
Collapse
|
24
|
Randazzo M, Priefer R, Smith PJ, Nagler A, Avery T, Froud K. Neural Correlates of Modality-Sensitive Deviance Detection in the Audiovisual Oddball Paradigm. Brain Sci 2020; 10:brainsci10060328. [PMID: 32481538 PMCID: PMC7348766 DOI: 10.3390/brainsci10060328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2020] [Revised: 05/15/2020] [Accepted: 05/25/2020] [Indexed: 11/16/2022] Open
Abstract
The McGurk effect, an incongruent pairing of visual /ga/ with acoustic /ba/ that creates the fusion illusion /da/, is the cornerstone of research in audiovisual speech perception. Combination illusions occur when the input modalities are reversed (auditory /ga/ with visual /ba/), yielding the percept /bga/. A robust literature shows that fusion illusions in an oddball paradigm evoke a mismatch negativity (MMN) in the auditory cortex in the absence of changes to the acoustic stimuli. We compared fusion and combination illusions in a passive oddball paradigm to further examine the influence of visual and auditory aspects of incongruent speech stimuli on the audiovisual MMN. Participants viewed videos under two audiovisual illusion conditions, fusion with the visual aspect of the stimulus changing and combination with the auditory aspect of the stimulus changing, as well as two unimodal auditory-only and visual-only conditions. Fusion and combination deviants exerted similar influence in generating congruency predictions, with significant differences between standards and deviants in the N100 time window. The presence of the MMN in early and late time windows differentiated fusion from combination deviants. When the visual signal changes, a new percept is created, but when the visual signal is held constant and the auditory signal changes, the response is suppressed, evoking a later MMN. In alignment with models of predictive processing in audiovisual speech perception, we interpreted our results to indicate that visual information can both predict and suppress auditory speech perception.
Collapse
Affiliation(s)
- Melissa Randazzo
- Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA; (R.P.); (A.N.)
- Correspondence: ; Tel.: +1-516-877-4769
| | - Ryan Priefer
- Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA; (R.P.); (A.N.)
| | - Paul J. Smith
- Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA; (P.J.S.); (T.A.); (K.F.)
| | - Amanda Nagler
- Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA; (R.P.); (A.N.)
| | - Trey Avery
- Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA; (P.J.S.); (T.A.); (K.F.)
| | - Karen Froud
- Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA; (P.J.S.); (T.A.); (K.F.)
| |
Collapse
|
25
|
Zhou HY, Cheung EFC, Chan RCK. Audiovisual temporal integration: Cognitive processing, neural mechanisms, developmental trajectory and potential interventions. Neuropsychologia 2020; 140:107396. [PMID: 32087206 DOI: 10.1016/j.neuropsychologia.2020.107396] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2019] [Revised: 02/14/2020] [Accepted: 02/15/2020] [Indexed: 12/21/2022]
Abstract
To integrate auditory and visual signals into a unified percept, the paired stimuli must co-occur within a limited time window known as the Temporal Binding Window (TBW). The width of the TBW, a proxy of audiovisual temporal integration ability, has been found to be correlated with higher-order cognitive and social functions. A comprehensive review of studies investigating the audiovisual TBW reveals several findings: (1) a wide range of top-down processes and bottom-up features can modulate the width of the TBW, facilitating adaptation to a changing and multisensory external environment; (2) a large-scale brain network works in coordination to ensure successful detection of audiovisual (a)synchrony; (3) developmentally, the audiovisual TBW follows a U-shaped pattern across the lifespan, with a protracted developmental course into late adolescence and a rebound in size in late life; (4) an enlarged TBW is characteristic of a number of neurodevelopmental disorders; and (5) the width of the TBW is malleable and can be narrowed through perceptual and musical training. Interventions targeting the TBW may be able to improve multisensory function and ameliorate social communicative symptoms in clinical populations.
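A TBW estimate is typically obtained by fitting a symmetric curve to simultaneity judgments across stimulus onset asynchronies (SOAs); here is a minimal Python sketch, with illustrative SOAs, response rates, and a 75%-of-peak width criterion that are assumptions rather than values from the review.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical proportion of "synchronous" reports (negative SOA = auditory leads).
    soa = np.array([-400, -300, -200, -100, 0, 100, 200, 300, 400], float)
    p_sync = np.array([0.10, 0.25, 0.55, 0.85, 0.95, 0.90, 0.70, 0.35, 0.15])

    def gauss(x, amp, mu, sigma):
        return amp * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

    (amp, mu, sigma), _ = curve_fit(gauss, soa, p_sync, p0=[1.0, 0.0, 150.0])

    # One common convention: TBW = full width of the curve at 75% of its peak.
    tbw = 2 * sigma * np.sqrt(2 * np.log(1 / 0.75))
    print(f"peak at {mu:.0f} ms; estimated TBW = {tbw:.0f} ms")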
Collapse
Affiliation(s)
- Han-Yu Zhou
- Neuropsychology and Applied Cognitive Neuroscience Laboratory, CAS Key Laboratory of Mental Health, Institute of Psychology, Beijing, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing, China
| | | | - Raymond C K Chan
- Neuropsychology and Applied Cognitive Neuroscience Laboratory, CAS Key Laboratory of Mental Health, Institute of Psychology, Beijing, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing, China.
| |
Collapse
|
26
|
Hueber T, Tatulli E, Girin L, Schwartz JL. Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning. Neural Comput 2020; 32:596-625. [PMID: 31951798 DOI: 10.1162/neco_a_01264] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of the predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). These predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). The predictions appear to be efficient in a short temporal range (25-50 ms), capturing 50% to 75% of the variance of the incoming stimulus, which could save up to three-quarters of the processing power. They then quickly decrease and almost vanish after 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
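A toy version of this prediction-gain comparison, with ridge regression standing in for the paper's deep networks, might look like this; the synthetic envelope and lip signals and the 50 ms horizon are placeholders.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    horizon, past = 5, 20                 # 10 ms frames: predict 50 ms ahead from 200 ms
    rng = np.random.default_rng(3)
    env = np.convolve(rng.standard_normal(6000), np.ones(8) / 8, mode="same")
    lips = env + 0.5 * rng.standard_normal(env.size)    # correlated visual feature

    T = env.size - past - horizon
    Xa = np.stack([env[t:t + past] for t in range(T)])      # audio-only history
    Xv = np.stack([lips[t:t + past] for t in range(T)])     # visual history
    y = env[past + horizon:past + horizon + T]              # future envelope value

    for name, X in [("audio", Xa), ("audio+visual", np.hstack([Xa, Xv]))]:
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
        r2 = Ridge(alpha=1.0).fit(Xtr, ytr).score(Xte, yte)
        print(f"{name}: explained variance R^2 = {r2:.2f}")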
Collapse
Affiliation(s)
- Thomas Hueber
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
| | - Eric Tatulli
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
| | - Laurent Girin
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France, and Inria Grenoble-Rhône-Alpes, 38330 Montbonnot-Saint Martin, France
| | - Jean-Luc Schwartz
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
| |
Collapse
|
27
|
The impact of when, what and how predictions on auditory speech perception. Exp Brain Res 2019; 237:3143-3153. [DOI: 10.1007/s00221-019-05661-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/24/2019] [Accepted: 09/24/2019] [Indexed: 11/26/2022]
|
28
|
Karas PJ, Magnotti JF, Metzger BA, Zhu LL, Smith KB, Yoshor D, Beauchamp MS. The visual speech head start improves perception and reduces superior temporal cortex responses to auditory speech. eLife 2019; 8:e48116. [PMID: 31393261 PMCID: PMC6687434 DOI: 10.7554/elife.48116] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/02/2019] [Accepted: 07/17/2019] [Indexed: 12/30/2022] Open
Abstract
Visual information about speech content from the talker's mouth is often available before auditory information from the talker's voice. Here we examined perceptual and neural responses to words with and without this visual head start. For both types of words, perception was enhanced by viewing the talker's face, but the enhancement was significantly greater for words with a head start. Neural responses were measured from electrodes implanted over auditory association cortex in the posterior superior temporal gyrus (pSTG) of epileptic patients. The presence of visual speech suppressed responses to auditory speech, more so for words with a visual head start. We suggest that the head start inhibits representations of incompatible auditory phonemes, increasing perceptual accuracy and decreasing total neural responses. Together with previous work showing visual cortex modulation (Ozker et al., 2018b), these results from pSTG demonstrate that multisensory interactions are a powerful modulator of activity throughout the speech perception network.
Collapse
Affiliation(s)
- Patrick J Karas
- Department of Neurosurgery, Baylor College of Medicine, Houston, United States
| | - John F Magnotti
- Department of Neurosurgery, Baylor College of Medicine, Houston, United States
| | - Brian A Metzger
- Department of Neurosurgery, Baylor College of Medicine, Houston, United States
| | - Lin L Zhu
- Department of Neurosurgery, Baylor College of Medicine, Houston, United States
| | - Kristen B Smith
- Department of Neurosurgery, Baylor College of Medicine, Houston, United States
| | - Daniel Yoshor
- Department of Neurosurgery, Baylor College of Medicine, Houston, United States
| | | |
Collapse
|
29
|
Bayard C, Machart L, Strauß A, Gerber S, Aubanel V, Schwartz JL. Cued Speech Enhances Speech-in-Noise Perception. JOURNAL OF DEAF STUDIES AND DEAF EDUCATION 2019; 24:223-233. [PMID: 30809665 DOI: 10.1093/deafed/enz003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/26/2018] [Revised: 01/28/2019] [Accepted: 01/31/2019] [Indexed: 06/09/2023]
Abstract
Speech perception in noise remains challenging for Deaf/Hard of Hearing people (D/HH), even fitted with hearing aids or cochlear implants. The perception of sentences in noise by 20 implanted or aided D/HH subjects mastering Cued Speech (CS), a system of hand gestures complementing lip movements, was compared with that of 15 typically hearing (TH) controls in three conditions: audio only, audiovisual, and audiovisual + CS. Similar audiovisual scores were obtained for signal-to-noise ratios (SNRs) 11 dB higher in D/HH participants compared with TH ones. Adding CS information enabled D/HH participants to reach a mean score of 83% in the audiovisual + CS condition at a mean SNR of 0 dB, similar to the usual audio score for TH participants at this SNR. This confirms that the combination of lipreading and the Cued Speech system remains extremely important for persons with hearing loss, particularly in adverse hearing conditions.
Collapse
Affiliation(s)
| | | | - Antje Strauß
- Zukunftskolleg, FB Sprachwissenschaft, University of Konstanz
| | | | | | | |
Collapse
|
30
|
O'Sullivan AE, Lim CY, Lalor EC. Look at me when I'm talking to you: Selective attention at a multisensory cocktail party can be decoded using stimulus reconstruction and alpha power modulations. Eur J Neurosci 2019; 50:3282-3295. [PMID: 31013361 DOI: 10.1111/ejn.14425] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2018] [Revised: 03/25/2019] [Accepted: 04/17/2019] [Indexed: 11/30/2022]
Abstract
Recent work using electroencephalography has applied stimulus reconstruction techniques to identify the attended speaker in a cocktail party environment. The success of these approaches has been primarily based on the ability to detect cortical tracking of the acoustic envelope at the scalp level. However, most studies have ignored the effects of visual input, which is almost always present in naturalistic scenarios. In this study, we investigated the effects of visual input on envelope-based cocktail party decoding in two multisensory cocktail party situations: (a) congruent AV, facing the attended speaker while ignoring another speaker presented as an audio-only stream, and (b) incongruent AV (eavesdropping), attending to the audio-only speaker while looking at the unattended speaker. We trained and tested decoders for each condition separately and found that we can successfully decode attention to congruent audiovisual speech and can also decode attention when listeners were eavesdropping, i.e., looking at the face of the unattended talker. In addition, we found alpha power to be a reliable measure of attention to the visual speech. Using parieto-occipital alpha power, we could distinguish whether subjects were attending to or ignoring the speaker's face. Considering the practical applications of these methods, we demonstrate that with only six near-ear electrodes we can successfully determine the attended speech. This work extends the current framework for decoding attention to speech to more naturalistic scenarios and, in doing so, provides additional neural measures which may be incorporated to improve decoding accuracy.
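The backward-model logic can be sketched in a few lines: lagged EEG is mapped to the speech envelope, and attention is assigned to whichever stream the reconstruction matches better. All signals, the lag range, and the regularization value below are hypothetical.

    import numpy as np
    from sklearn.linear_model import Ridge

    fs, n_lags = 64, 16                       # EEG at 64 Hz; 0-250 ms of lags
    rng = np.random.default_rng(4)
    n_samp, n_chan = fs * 60, 32
    env_att = np.abs(rng.standard_normal(n_samp))     # attended envelope
    env_ign = np.abs(rng.standard_normal(n_samp))     # ignored envelope
    eeg = np.outer(env_att, rng.standard_normal(n_chan)) \
          + 2.0 * rng.standard_normal((n_samp, n_chan))   # EEG tracks the attended stream

    T = n_samp - n_lags
    X = np.stack([eeg[t:t + n_lags].ravel() for t in range(T)])  # lagged EEG features
    dec = Ridge(alpha=10.0).fit(X[:T // 2], env_att[:T][:T // 2])
    recon = dec.predict(X[T // 2:])

    r_att = np.corrcoef(recon, env_att[:T][T // 2:])[0, 1]
    r_ign = np.corrcoef(recon, env_ign[:T][T // 2:])[0, 1]
    print(f"attended r = {r_att:.2f}, ignored r = {r_ign:.2f}")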
Collapse
Affiliation(s)
- Aisling E O'Sullivan
- School of Engineering, Trinity Centre for Bioengineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
| | - Chantelle Y Lim
- Department of Biomedical Engineering, University of Rochester, Rochester, New York
| | - Edmund C Lalor
- School of Engineering, Trinity Centre for Bioengineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland; Department of Biomedical Engineering, University of Rochester, Rochester, New York; Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, New York
| |
Collapse
|
31
|
Simon DM, Wallace MT. Integration and Temporal Processing of Asynchronous Audiovisual Speech. J Cogn Neurosci 2018; 30:319-337. [DOI: 10.1162/jocn_a_01205] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
Multisensory integration of visual mouth movements with auditory speech is known to offer substantial perceptual benefits, particularly under challenging (i.e., noisy) acoustic conditions. Previous work characterizing this process has found that ERPs to auditory speech are of shorter latency and smaller magnitude in the presence of visual speech. We sought to determine the dependency of these effects on the temporal relationship between the auditory and visual speech streams using EEG. We found that reductions in ERP latency and suppression of ERP amplitude are maximal when the visual signal precedes the auditory signal by a small interval and that increasing amounts of asynchrony reduce these effects in a continuous manner. Time–frequency analysis revealed that these effects are found primarily in the theta (4–8 Hz) and alpha (8–12 Hz) bands, with a central topography consistent with auditory generators. Theta effects also persisted in the lower portion of the band (3.5–5 Hz), and this late activity was more frontally distributed. Importantly, the magnitude of these late theta oscillations not only differed with the temporal characteristics of the stimuli but also served to predict participants' task performance. Our analysis thus reveals that suppression of single-trial brain responses by visual speech depends strongly on the temporal concordance of the auditory and visual inputs. It further illustrates that processes in the lower theta band, which we suggest as an index of incongruity processing, might serve to reflect the neural correlates of individual differences in multisensory temporal perception.
Collapse
|
32
|
|
33
|
Van Ackeren MJ, Barbero FM, Mattioni S, Bottini R, Collignon O. Neuronal populations in the occipital cortex of the blind synchronize to the temporal dynamics of speech. eLife 2018; 7:e31640. [PMID: 29338838 PMCID: PMC5790372 DOI: 10.7554/elife.31640] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2017] [Accepted: 01/16/2018] [Indexed: 11/13/2022] Open
Abstract
The occipital cortex of early blind individuals (EB) activates during speech processing, challenging the notion of a hard-wired neurobiology of language. But, at what stage of speech processing do occipital regions participate in EB? Here we demonstrate that parieto-occipital regions in EB enhance their synchronization to acoustic fluctuations in human speech in the theta-range (corresponding to syllabic rate), irrespective of speech intelligibility. Crucially, enhanced synchronization to the intelligibility of speech was selectively observed in primary visual cortex in EB, suggesting that this region is at the interface between speech perception and comprehension. Moreover, EB showed overall enhanced functional connectivity between temporal and occipital cortices that are sensitive to speech intelligibility and altered directionality when compared to the sighted group. These findings suggest that the occipital cortex of the blind adopts an architecture that allows the tracking of speech material, and therefore does not fully abstract from the reorganized sensory inputs it receives.
Collapse
Affiliation(s)
| | - Francesca M Barbero
- Institute of Research in Psychology, University of Louvain, Louvain, Belgium
- Institute of Neuroscience, University of Louvain, Louvain, Belgium
| | | | - Roberto Bottini
- Center for Mind/Brain Studies, University of Trento, Trento, Italy
| | - Olivier Collignon
- Center for Mind/Brain Studies, University of Trento, Trento, Italy
- Institute of Research in Psychology, University of Louvain, Louvain, Belgium
- Institute of Neuroscience, University of Louvain, Louvain, Belgium
| |
Collapse
|
34
|
Treille A, Vilain C, Schwartz JL, Hueber T, Sato M. Electrophysiological evidence for Audio-visuo-lingual speech integration. Neuropsychologia 2018; 109:126-133. [DOI: 10.1016/j.neuropsychologia.2017.12.024] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2017] [Revised: 11/21/2017] [Accepted: 12/13/2017] [Indexed: 01/25/2023]
|
35
|
Sánchez-García C, Kandel S, Savariaux C, Soto-Faraco S. The Time Course of Audio-Visual Phoneme Identification: a High Temporal Resolution Study. Multisens Res 2018; 31:57-78. [DOI: 10.1163/22134808-00002560] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2016] [Accepted: 02/20/2017] [Indexed: 11/19/2022]
Abstract
Speech unfolds in time and, as a consequence, its perception requires temporal integration. Yet, studies addressing audio-visual speech processing have often overlooked this temporal aspect. Here, we address the temporal course of audio-visual speech processing in a phoneme identification task using a Gating paradigm. We created disyllabic Spanish word-like utterances (e.g., /pafa/, /paθa/, …) from high-speed camera recordings. The stimuli differed only in the middle consonant (/f/, /θ/, /s/, /r/, /g/), which varied in visual and auditory saliency. As in classical Gating tasks, the utterances were presented in fragments of increasing length (gates), here in 10 ms steps, for identification and confidence ratings. We measured correct identification as a function of time (at each gate) for each critical consonant in audio, visual and audio-visual conditions, and computed the Identification Point and Recognition Point scores. The results revealed that audio-visual identification is a time-varying process that depends on the relative strength of each modality (i.e., saliency). In some cases, audio-visual identification followed the pattern of one dominant modality (either A or V), when that modality was very salient. In other cases, both modalities contributed to identification, hence resulting in audio-visual advantage or interference with respect to unimodal conditions. Both unimodal dominance and audio-visual interaction patterns may arise within the course of identification of the same utterance, at different times. The outcome of this study suggests that audio-visual speech integration models should take into account the time-varying nature of visual and auditory saliency.
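The Identification Point computation reduces to a compact rule: the earliest gate from which responses remain correct. A minimal Python sketch with an invented response vector:

    import numpy as np

    gate_ms = np.arange(10, 310, 10)      # gate offsets in 10 ms steps
    # Hypothetical per-gate correctness for one consonant and one participant.
    correct = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1] + [1] * 20, dtype=bool)

    # Identification Point: first gate after which all responses stay correct.
    stay_correct = np.flip(np.logical_and.accumulate(np.flip(correct)))
    ip = gate_ms[np.argmax(stay_correct)] if stay_correct.any() else None
    print(f"Identification Point: {ip} ms")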
Collapse
Affiliation(s)
- Carolina Sánchez-García
- Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain
| | - Sonia Kandel
- Université Grenoble Alpes, GIPSA-lab (CNRS UMR 5216), Grenoble, France
| | | | - Salvador Soto-Faraco
- Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
| |
Collapse
|
36
|
Riecke L, Formisano E, Sorger B, Başkent D, Gaudrain E. Neural Entrainment to Speech Modulates Speech Intelligibility. Curr Biol 2017; 28:161-169.e5. [PMID: 29290557 DOI: 10.1016/j.cub.2017.11.033] [Citation(s) in RCA: 116] [Impact Index Per Article: 16.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2017] [Revised: 10/26/2017] [Accepted: 11/15/2017] [Indexed: 01/02/2023]
Abstract
Speech is crucial for communication in everyday life. Speech-brain entrainment, the alignment of neural activity to the slow temporal fluctuations (envelope) of acoustic speech input, is a ubiquitous element of current theories of speech processing. Associations between speech-brain entrainment and the acoustic speech signal, listening task, and speech intelligibility have been observed repeatedly. However, a methodological bottleneck has so far prevented clarifying whether speech-brain entrainment contributes functionally to (i.e., causes) speech intelligibility or is merely an epiphenomenon of it. To address this long-standing issue, we experimentally manipulated speech-brain entrainment without concomitant acoustic and task-related variations, using a brain stimulation approach that enables modulating listeners' neural activity with transcranial currents carrying speech-envelope information. Results from two experiments involving a cocktail-party-like scenario and a listening situation devoid of aural speech-amplitude envelope input reveal consistent effects on listeners' speech-recognition performance, demonstrating a causal role of speech-brain entrainment in speech intelligibility. Our findings imply that speech-brain entrainment is critical for auditory speech comprehension and suggest that transcranial stimulation with speech-envelope-shaped currents can be utilized to modulate speech comprehension in impaired listening conditions.
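A speech-envelope-shaped stimulation waveform of the kind described here could, in outline, be built as follows; the audio array, 8 Hz cutoff, and the +/- 1 mA scaling are illustrative assumptions, not the study's parameters.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    fs = 16000
    rng = np.random.default_rng(5)
    audio = rng.standard_normal(10 * fs)              # stand-in for a speech recording

    envelope = np.abs(hilbert(audio))                 # broadband amplitude envelope
    b, a = butter(2, 8.0, btype="lowpass", fs=fs)     # keep only slow fluctuations
    slow_env = filtfilt(b, a, envelope)

    slow_env -= slow_env.mean()                       # zero mean: no net DC current
    current = 1.0 * slow_env / np.abs(slow_env).max() # scale to +/- 1 mA peak
    print(current.min(), current.max())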
Collapse
Affiliation(s)
- Lars Riecke
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6229 EV Maastricht, the Netherlands.
| | - Elia Formisano
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6229 EV Maastricht, the Netherlands
| | - Bettina Sorger
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, 6229 EV Maastricht, the Netherlands
| | - Deniz Başkent
- Department of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, 9700 RB Groningen, the Netherlands
| | - Etienne Gaudrain
- Department of Otorhinolaryngology/Head and Neck Surgery, University Medical Center Groningen, University of Groningen, 9700 RB Groningen, the Netherlands; CNRS UMR 5292, Lyon Neuroscience Research Center, Auditory Cognition and Psychoacoustics, Inserm UMRS 1028, Université Claude Bernard Lyon 1, Université de Lyon, 69366 Lyon Cedex 07, France
| |
Collapse
|
37
|
Cope TE, Sohoglu E, Sedley W, Patterson K, Jones PS, Wiggins J, Dawson C, Grube M, Carlyon RP, Griffiths TD, Davis MH, Rowe JB. Evidence for causal top-down frontal contributions to predictive processes in speech perception. Nat Commun 2017; 8:2154. [PMID: 29255275 PMCID: PMC5735133 DOI: 10.1038/s41467-017-01958-7] [Citation(s) in RCA: 91] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2017] [Accepted: 10/27/2017] [Indexed: 11/09/2022] Open
Abstract
Perception relies on the integration of sensory information and prior expectations. Here we show that selective neurodegeneration of human frontal speech regions results in delayed reconciliation of predictions in temporal cortex. These temporal regions were not atrophic, displayed normal evoked magnetic and electrical power, and preserved neural sensitivity to manipulations of sensory detail. Frontal neurodegeneration does not prevent the perceptual effects of contextual information; instead, prior expectations are applied inflexibly. The precision of predictions correlates with beta power, in line with theoretical models of the neural instantiation of predictive coding. Fronto-temporal interactions are enhanced while participants reconcile prior predictions with degraded sensory signals. Excessively precise predictions can explain several challenging phenomena in frontal aphasias, including agrammatism and subjective difficulties with speech perception. This work demonstrates that higher-level frontal mechanisms for cognitive and behavioural flexibility make a causal functional contribution to the hierarchical generative models underlying speech perception.
Collapse
Affiliation(s)
- Thomas E Cope
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0SZ, UK.
| | - E Sohoglu
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, UK
| | - W Sedley
- Institute of Neuroscience, Newcastle University, Newcastle, NE1 7RU, UK
| | - K Patterson
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0SZ, UK
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, UK
| | - P S Jones
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0SZ, UK
| | - J Wiggins
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0SZ, UK
| | - C Dawson
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0SZ, UK
| | - M Grube
- Institute of Neuroscience, Newcastle University, Newcastle, NE1 7RU, UK
| | - R P Carlyon
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, UK
| | - T D Griffiths
- Institute of Neuroscience, Newcastle University, Newcastle, NE1 7RU, UK
| | - Matthew H Davis
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, UK
| | - James B Rowe
- Department of Clinical Neurosciences, University of Cambridge, Cambridge, CB2 0SZ, UK
- Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge, CB2 7EF, UK
| |
Collapse
|
38
|
De Niear MA, Gupta PB, Baum SH, Wallace MT. Perceptual training enhances temporal acuity for multisensory speech. Neurobiol Learn Mem 2017; 147:9-17. [PMID: 29107704 DOI: 10.1016/j.nlm.2017.10.016] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2017] [Revised: 10/19/2017] [Accepted: 10/27/2017] [Indexed: 11/30/2022]
Abstract
The temporal relationship between auditory and visual cues is a fundamental feature in determining whether these signals will be integrated. The temporal binding window (TBW) describes the epoch of time during which asynchronous auditory and visual stimuli are likely to be perceptually bound. Recently, a number of studies have demonstrated the capacity for perceptual training to enhance temporal acuity for audiovisual stimuli (i.e., narrow the TBW). These studies, however, have only examined multisensory perceptual learning that develops in response to feedback that is provided when making judgments on simple, low-level audiovisual stimuli (i.e., flashes and beeps). Here we sought to determine if perceptual training was capable of altering temporal acuity for audiovisual speech. Furthermore, we also explored whether perceptual training with simple or complex audiovisual stimuli generalized across levels of stimulus complexity. Using a simultaneity judgment (SJ) task, we measured individuals' temporal acuity (as estimated by the TBW) prior to, immediately following, and one week after four consecutive days of perceptual training. We report that temporal acuity for audiovisual speech stimuli is enhanced following perceptual training using speech stimuli. Additionally, we find that changes in temporal acuity following perceptual training do not generalize across the levels of stimulus complexity in this study. Overall, the results suggest that perceptual training is capable of enhancing temporal acuity for audiovisual speech in adults, and that the dynamics of the changes in temporal acuity following perceptual training differ between simple audiovisual stimuli and more complex audiovisual speech stimuli.
Collapse
Affiliation(s)
- Matthew A De Niear
- Medical Scientist Training Program, Vanderbilt University Medical School, Vanderbilt University, Nashville, TN 37235, USA; Vanderbilt Brain Institute, Vanderbilt University Medical School, Vanderbilt University, Nashville, TN 37235, USA.
| | - Pranjal B Gupta
- Undergraduate Neuroscience Program, Vanderbilt University Medical School, Vanderbilt University, Nashville, TN 37235, USA
| | - Sarah H Baum
- Department of Psychology, University of Washington, Seattle, WA 98195, USA
| | - Mark T Wallace
- Vanderbilt Brain Institute, Vanderbilt University Medical School, Vanderbilt University, Nashville, TN 37235, USA; Department of Hearing and Speech Sciences, Vanderbilt University Medical Center, Nashville, TN 37235, USA; Department of Psychology, Vanderbilt University, Nashville, TN 37235, USA; Department of Psychiatry, Vanderbilt University Medical Center, Nashville, TN 37235, USA
| |
Collapse
|
39
|
Eye Can Hear Clearly Now: Inverse Effectiveness in Natural Audiovisual Speech Processing Relies on Long-Term Crossmodal Temporal Integration. J Neurosci 2016; 36:9888-95. [PMID: 27656026 DOI: 10.1523/jneurosci.1396-16.2016] [Citation(s) in RCA: 81] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2016] [Accepted: 08/03/2016] [Indexed: 11/21/2022] Open
Abstract
Speech comprehension is improved by viewing a speaker's face, especially in adverse hearing conditions, a principle known as inverse effectiveness. However, the neural mechanisms that help to optimize how we integrate auditory and visual speech in such suboptimal conversational environments are not yet fully understood. Using human EEG recordings, we examined how visual speech enhances the cortical representation of auditory speech at a signal-to-noise ratio that maximized the perceptual benefit conferred by multisensory processing relative to unisensory processing. We found that the influence of visual input on the neural tracking of the audio speech signal was significantly greater in noisy than in quiet listening conditions, consistent with the principle of inverse effectiveness. Although envelope tracking during audio-only speech was greatly reduced by background noise at an early processing stage, it was markedly restored by the addition of visual speech input. In background noise, multisensory integration occurred at much lower frequencies and was shown to predict the multisensory gain in behavioral performance at a time lag of ∼250 ms. Critically, we demonstrated that inverse effectiveness, in the context of natural audiovisual (AV) speech processing, relies on crossmodal integration over long temporal windows. Our findings suggest that disparate integration mechanisms contribute to the efficient processing of AV speech in background noise.

SIGNIFICANCE STATEMENT The behavioral benefit of seeing a speaker's face during conversation is especially pronounced in challenging listening environments. However, the neural mechanisms underlying this phenomenon, known as inverse effectiveness, have not yet been established. Here, we examine this in the human brain using natural speech-in-noise stimuli that were designed specifically to maximize the behavioral benefit of audiovisual (AV) speech. We find that this benefit arises from our ability to integrate multimodal information over longer periods of time. Our data also suggest that the addition of visual speech restores early tracking of the acoustic speech signal during excessive background noise. These findings support and extend current mechanistic perspectives on AV speech perception.
Collapse
|
40
|
Gordon-Salant S, Yeni-Komshian GH, Fitzgibbons PJ, Willison HM, Freund MS. Recognition of asynchronous auditory-visual speech by younger and older listeners: A preliminary study. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2017; 142:151. [PMID: 28764460 PMCID: PMC5507703 DOI: 10.1121/1.4992026] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/28/2016] [Revised: 03/20/2017] [Accepted: 06/23/2017] [Indexed: 05/15/2023]
Abstract
This study examined the effects of age and hearing loss on recognition of speech presented when the auditory and visual speech information was misaligned in time (i.e., asynchronous). Prior research suggests that older listeners are less sensitive than younger listeners in detecting the presence of asynchronous speech for auditory-lead conditions, but recognition of speech in auditory-lead conditions has not yet been examined. Recognition performance was assessed for sentences and words presented in the auditory-visual modalities with varying degrees of auditory lead and lag. Detection of auditory-visual asynchrony for sentences was assessed to verify that listeners detected these asynchronies. The listeners were younger and older normal-hearing adults and older hearing-impaired adults. Older listeners (regardless of hearing status) exhibited a significant decline in performance in auditory-lead conditions relative to visual lead, unlike younger listeners whose recognition performance was relatively stable across asynchronies. Recognition performance was not correlated with asynchrony detection. However, one of the two cognitive measures assessed, processing speed, was identified in multiple regression analyses as contributing significantly to the variance in auditory-visual speech recognition scores. The findings indicate that, particularly in auditory-lead conditions, listener age has an impact on the ability to recognize asynchronous auditory-visual speech signals.
Collapse
Affiliation(s)
- Sandra Gordon-Salant
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
| | - Grace H Yeni-Komshian
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
| | - Peter J Fitzgibbons
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
| | - Hannah M Willison
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
| | - Maya S Freund
- Department of Hearing and Speech Sciences, University of Maryland, College Park, Maryland 20742, USA
| |
Collapse
|
41
|
Giordano BL, Ince RAA, Gross J, Schyns PG, Panzeri S, Kayser C. Contributions of local speech encoding and functional connectivity to audio-visual speech perception. eLife 2017; 6. [PMID: 28590903 PMCID: PMC5462535 DOI: 10.7554/elife.24763] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2016] [Accepted: 05/07/2017] [Indexed: 11/13/2022] Open
Abstract
Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR, strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker’s face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.

When listening to someone in a noisy environment, such as a cocktail party, we can understand the speaker more easily if we can also see his or her face. Movements of the lips and tongue convey additional information that helps the listener’s brain separate out syllables, words and sentences. However, exactly where in the brain this effect occurs and how it works remain unclear. To find out, Giordano et al. scanned the brains of healthy volunteers as they watched clips of people speaking. The clarity of the speech varied between clips. Furthermore, in some of the clips the lip movements of the speaker corresponded to the speech in question, whereas in others the lip movements were nonsense babble. As expected, the volunteers performed better on a word recognition task when the speech was clear and when the lip movements agreed with the spoken dialogue. Watching the video clips stimulated rhythmic activity in multiple regions of the volunteers’ brains, including areas that process sound and areas that plan movements. Speech is itself rhythmic, and the volunteers’ brain activity synchronized with the rhythms of the speech they were listening to. Seeing the speaker’s face increased this degree of synchrony. However, it also made it easier for sound-processing regions within the listeners’ brains to transfer information to one another. Notably, only the latter effect predicted improved performance on the word recognition task. This suggests that seeing a person’s face makes it easier to understand his or her speech by boosting communication between brain regions, rather than through effects on individual areas. Further work is required to determine where and how the brain encodes lip movements and speech sounds. The next challenge will be to identify where these two sets of information interact, and how the brain merges them together to generate the impression of specific words.
Collapse
Affiliation(s)
- Bruno L Giordano
- Institut de Neurosciences de la Timone UMR 7289, Aix Marseille Université - Centre National de la Recherche Scientifique, Marseille, France.,Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Robin A A Ince
- Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Joachim Gross
- Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Philippe G Schyns
- Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
| | - Stefano Panzeri
- Neural Computation Laboratory, Center for Neuroscience and Cognitive Systems, Istituto Italiano di Tecnologia, Rovereto, Italy
| | - Christoph Kayser
- Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
| |
Collapse
|
42
|
Being First Matters: Topographical Representational Similarity Analysis of ERP Signals Reveals Separate Networks for Audiovisual Temporal Binding Depending on the Leading Sense. J Neurosci 2017; 37:5274-5287. [PMID: 28450537 PMCID: PMC5456109 DOI: 10.1523/jneurosci.2926-16.2017] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2016] [Revised: 02/20/2017] [Accepted: 02/25/2017] [Indexed: 11/30/2022] Open
Abstract
In multisensory integration, processing in one sensory modality is enhanced by complementary information from other modalities. Intersensory timing is crucial in this process because only inputs reaching the brain within a restricted temporal window are perceptually bound. Previous research in the audiovisual field has investigated various features of the temporal binding window, revealing asymmetries in its size and plasticity depending on the leading input: auditory–visual (AV) or visual–auditory (VA). Here, we tested whether separate neuronal mechanisms underlie this AV–VA dichotomy in humans. We recorded high-density EEG while participants performed an audiovisual simultaneity judgment task including various AV–VA asynchronies and unisensory control conditions (visual-only, auditory-only) and tested whether AV and VA processing generate different patterns of brain activity. After isolating the multisensory components of AV–VA event-related potentials (ERPs) from the sum of their unisensory constituents, we ran a time-resolved topographical representational similarity analysis (tRSA) comparing the AV and VA ERP maps. Spatial cross-correlation matrices were built from real data to index the similarity between the AV and VA maps at each time point (500 ms window after stimulus) and then correlated with two alternative similarity model matrices: AVmaps = VAmaps versus AVmaps ≠ VAmaps. The tRSA results favored the AVmaps ≠ VAmaps model across all time points, suggesting that audiovisual temporal binding (indexed by synchrony perception) engages different neural pathways depending on the leading sense. The existence of such a dual route supports recent theoretical accounts proposing that multiple binding mechanisms are implemented in the brain to accommodate different information parsing strategies in auditory and visual sensory systems.

SIGNIFICANCE STATEMENT Intersensory timing is a crucial aspect of multisensory integration, determining whether and how inputs in one modality enhance stimulus processing in another modality. Our research demonstrates that evaluating the synchrony of auditory-leading (AV) versus visual-leading (VA) audiovisual stimulus pairs is characterized by two distinct patterns of brain activity. This suggests that audiovisual integration is not a unitary process and that different binding mechanisms are recruited in the brain based on the leading sense. These mechanisms may be relevant for supporting different classes of multisensory operations, for example, auditory enhancement of visual attention (AV) and visual enhancement of auditory speech (VA).
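A miniature version of the topographical analysis: correlate AV and VA scalp maps at every pair of time points and ask whether matched time points stand out, a crude stand-in for the paper's model-matrix correlation. Electrode counts and data are synthetic.

    import numpy as np

    rng = np.random.default_rng(6)
    n_elec, n_times = 64, 100
    av = rng.standard_normal((n_elec, n_times))   # AV multisensory ERP topographies
    va = rng.standard_normal((n_elec, n_times))   # VA multisensory ERP topographies

    # Spatial cross-correlation matrix: AV map at t1 versus VA map at t2.
    sim = np.array([[np.corrcoef(av[:, t1], va[:, t2])[0, 1]
                     for t2 in range(n_times)] for t1 in range(n_times)])

    # If AVmaps = VAmaps, the diagonal (matched time points) should dominate.
    diag = np.diag(sim).mean()
    off = sim[~np.eye(n_times, dtype=bool)].mean()
    print(f"matched-time similarity {diag:.3f} vs mismatched {off:.3f}")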
Collapse
|
43
|
Shahin AJ, Shen S, Kerlin JR. Tolerance for audiovisual asynchrony is enhanced by the spectrotemporal fidelity of the speaker's mouth movements and speech. LANGUAGE, COGNITION AND NEUROSCIENCE 2017; 32:1102-1118. [PMID: 28966930 PMCID: PMC5617130 DOI: 10.1080/23273798.2017.1283428] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/09/2016] [Accepted: 01/07/2017] [Indexed: 06/07/2023]
Abstract
We examined the relationship between tolerance for audiovisual onset asynchrony (AVOA) and the spectrotemporal fidelity of the spoken words and the speaker's mouth movements. In two experiments that varied only in the temporal order of the sensory modalities, with visual speech leading (experiment 1) or lagging (experiment 2) the acoustic speech, participants watched intact and blurred videos of a speaker uttering trisyllabic words and nonwords that were noise vocoded with 4, 8, 16, and 32 channels. They judged whether the speaker's mouth movements and the speech sounds were in-sync or out-of-sync. Individuals perceived synchrony (tolerated AVOA) on more trials when the acoustic speech was more speech-like (8 channels and higher vs. 4 channels) and when the visual speech was intact rather than blurred (experiment 1 only). These findings suggest that enhanced spectrotemporal fidelity of the audiovisual (AV) signal prompts the brain to widen the window of integration, promoting the fusion of temporally distant AV percepts.
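Noise vocoding, the degradation used here, can be sketched compactly: split the signal into log-spaced bands, extract each band's envelope, and reimpose those envelopes on band-limited noise. The band edges, filter order, and stand-in signal below are assumptions.

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def vocode(audio, fs, n_channels=8, lo=100.0, hi=6000.0):
        edges = np.geomspace(lo, hi, n_channels + 1)      # log-spaced band edges
        noise = np.random.default_rng(7).standard_normal(audio.size)
        out = np.zeros_like(audio)
        for f1, f2 in zip(edges[:-1], edges[1:]):
            b, a = butter(3, [f1, f2], btype="bandpass", fs=fs)
            env = np.abs(hilbert(filtfilt(b, a, audio)))  # band envelope
            out += env * filtfilt(b, a, noise)            # envelope-modulated noise
        return out / np.abs(out).max()

    fs = 16000
    speech = np.random.default_rng(8).standard_normal(2 * fs)  # stand-in utterance
    vocoded = vocode(speech, fs, n_channels=4)
    print(vocoded.shape)

Fewer channels leave less spectral detail, which is why 4-channel speech sounds less speech-like than 8- or 16-channel speech.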
Collapse
Affiliation(s)
- Antoine J Shahin
- Center for Mind and Brain, University of California, Davis, CA, 95618
| | - Stanley Shen
- Center for Mind and Brain, University of California, Davis, CA, 95618
| | - Jess R Kerlin
- Center for Mind and Brain, University of California, Davis, CA, 95618
| |
Collapse
|
44
|
O'Sullivan AE, Crosse MJ, Di Liberto GM, Lalor EC. Visual Cortical Entrainment to Motion and Categorical Speech Features during Silent Lipreading. Front Hum Neurosci 2017; 10:679. [PMID: 28123363 PMCID: PMC5225113 DOI: 10.3389/fnhum.2016.00679] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2016] [Accepted: 12/20/2016] [Indexed: 11/13/2022] Open
Abstract
Speech is a multisensory percept, comprising an auditory and visual component. While the content and processing pathways of audio speech have been well characterized, the visual component is less well understood. In this work, we expand current methodologies using system identification to introduce a framework that facilitates the study of visual speech in its natural, continuous form. Specifically, we use models based on the unheard acoustic envelope (E), the motion signal (M) and categorical visual speech features (V) to predict EEG activity during silent lipreading. Our results show that each of these models performs similarly at predicting EEG in visual regions and that respective combinations of the individual models (EV, MV, EM and EMV) provide an improved prediction of the neural activity over their constituent models. In comparing these different combinations, we find that the model incorporating all three types of features (EMV) outperforms the individual models, as well as both the EV and MV models, while it performs similarly to the EM model. Importantly, EM does not outperform EV and MV, which, considering the higher dimensionality of the V model, suggests that more data is needed to clarify this finding. Nevertheless, the performance of EMV, and comparisons of the subject performances for the three individual models, provides further evidence to suggest that visual regions are involved in both low-level processing of stimulus dynamics and categorical speech perception. This framework may prove useful for investigating modality-specific processing of visual speech under naturalistic conditions.
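A scaled-down version of the model comparison, with ridge regression over lagged features standing in for the paper's system-identification framework and only the E and M features included, might look like this; all signals are synthetic stand-ins.

    import numpy as np
    from sklearn.linear_model import Ridge

    fs, n_lags = 64, 24                    # features lagged over 0-375 ms
    rng = np.random.default_rng(9)
    n_samp = fs * 120
    E = np.abs(rng.standard_normal(n_samp))       # unheard acoustic envelope
    M = np.abs(rng.standard_normal(n_samp))       # lip motion signal
    eeg = np.roll(E, 10) + 0.5 * np.roll(M, 12) \
          + rng.standard_normal(n_samp)           # one synthetic EEG channel

    def lagged(x):
        T = x.shape[0] - n_lags
        return np.stack([x[t:t + n_lags].ravel() for t in range(T)])

    y = eeg[n_lags:]
    for name, X in [("E", lagged(E)), ("M", lagged(M)),
                    ("EM", lagged(np.column_stack([E, M])))]:
        half = len(y) // 2
        model = Ridge(alpha=1.0).fit(X[:half], y[:half])
        r = np.corrcoef(model.predict(X[half:]), y[half:])[0, 1]
        print(f"model {name}: prediction r = {r:.3f}")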
Collapse
Affiliation(s)
- Aisling E O'Sullivan
- School of Engineering, Trinity College Dublin, Dublin, Ireland; Trinity Centre for Bioengineering, Trinity College Dublin, Dublin, Ireland
| | - Michael J Crosse
- Department of Pediatrics and Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY, USA
| | - Giovanni M Di Liberto
- School of Engineering, Trinity College Dublin, Dublin, Ireland; Trinity Centre for Bioengineering, Trinity College Dublin, Dublin, Ireland
| | - Edmund C Lalor
- School of Engineering, Trinity College Dublin, Dublin, Ireland; Trinity Centre for Bioengineering, Trinity College Dublin, Dublin, Ireland; Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin, Ireland; Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, NY, USA
| |
Collapse
|
45
|
Lüttke CS, Ekman M, van Gerven MAJ, de Lange FP. McGurk illusion recalibrates subsequent auditory perception. Sci Rep 2016; 6:32891. [PMID: 27611960 PMCID: PMC5017187 DOI: 10.1038/srep32891] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2016] [Accepted: 08/08/2016] [Indexed: 11/09/2022] Open
Abstract
Visual information can alter auditory perception. This is clearly illustrated by the well-known McGurk illusion, where an auditory /aba/ and a visual /aga/ are merged into the percept ‘ada’. It is less clear, however, whether such a change in perception may recalibrate subsequent perception. Here we asked whether the altered auditory perception due to the McGurk illusion affects subsequent auditory perception, i.e. whether this process of fusion may cause a recalibration of the auditory boundaries between phonemes. Participants categorized auditory and audiovisual speech stimuli as /aba/, /ada/ or /aga/ while activity patterns in their auditory cortices were recorded using fMRI. Interestingly, following a McGurk illusion, an auditory /aba/ was more often misperceived as ‘ada’. Furthermore, we observed a neural counterpart of this recalibration in the early auditory cortex. When the auditory input /aba/ was perceived as ‘ada’, activity patterns bore stronger resemblance to activity patterns elicited by /ada/ sounds than when they were correctly perceived as /aba/. Our results suggest that upon experiencing the McGurk illusion, the brain shifts the neural representation of an /aba/ sound towards /ada/, culminating in a recalibration in perception of subsequent auditory input.
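The pattern comparison could be approximated with a simple template-correlation analysis; the voxel counts, noise levels, and template mixing below are invented to mimic the reported shift, not derived from the data.

    import numpy as np

    rng = np.random.default_rng(10)
    n_vox = 200
    aba_template = rng.standard_normal(n_vox)     # mean pattern for clear /aba/
    ada_template = rng.standard_normal(n_vox)     # mean pattern for clear /ada/

    # Hypothetical /aba/ trials: correctly heard vs. misperceived as 'ada'.
    heard_aba = aba_template + 0.8 * rng.standard_normal((30, n_vox))
    heard_ada = 0.5 * aba_template + 0.5 * ada_template \
                + 0.8 * rng.standard_normal((30, n_vox))

    def mean_r(trials, template):
        return np.mean([np.corrcoef(t, template)[0, 1] for t in trials])

    for label, trials in [("perceived 'aba'", heard_aba),
                          ("perceived 'ada'", heard_ada)]:
        print(label, "-> r(aba) =", round(mean_r(trials, aba_template), 2),
              " r(ada) =", round(mean_r(trials, ada_template), 2))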
Collapse
Affiliation(s)
- Claudia S Lüttke
- Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, the Netherlands
| | - Matthias Ekman
- Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, the Netherlands
| | - Marcel A J van Gerven
- Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, the Netherlands
| | - Floris P de Lange
- Radboud University Nijmegen, Donders Institute for Brain, Cognition and Behaviour, the Netherlands
| |
Collapse
|
46
|
Atypical audiovisual word processing in school-age children with a history of specific language impairment: an event-related potential study. J Neurodev Disord 2016; 8:33. [PMID: 27597881 PMCID: PMC5011345 DOI: 10.1186/s11689-016-9168-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/24/2016] [Accepted: 08/17/2016] [Indexed: 11/12/2022] Open
Abstract
Background: Visual speech cues influence different aspects of language acquisition. However, whether developmental language disorders may be associated with atypical processing of visual speech is unknown. In this study, we used behavioral and ERP measures to determine whether children with a history of specific language impairment (H-SLI) differ from their age-matched typically developing (TD) peers in the ability to match auditory words with corresponding silent visual articulations.
Methods: Nineteen 7–13-year-old H-SLI children and 19 age-matched TD children participated in the study. Children first heard a word and then saw a speaker silently articulating a word. In half of the trials, the articulated word matched the auditory word (congruent trials), while in the other half, it did not (incongruent trials). Children specified whether the auditory and the articulated words matched. We examined ERPs elicited by the onset of visual stimuli (visual P1, N1, and P2) as well as ERPs elicited by the articulatory movements themselves, namely, the N400 to incongruent articulations and the late positive complex (LPC) to congruent articulations. We also examined whether ERP measures of visual speech processing could predict (1) children’s linguistic skills and (2) the use of visual speech cues when listening to speech-in-noise (SIN).
Results: H-SLI children were less accurate in matching auditory words with visual articulations. They had a significantly reduced P1 to the talker’s face and a smaller N400 to incongruent articulations. In contrast, congruent articulations elicited LPCs of similar amplitude in both groups of children. The P1 and N400 amplitudes were significantly correlated with accuracy enhancement on the SIN task when seeing the talker’s face.
Conclusions: H-SLI children have poorly defined correspondences between speech sounds and the visually observed articulatory movements that produce them.
|
47
|
Baart M. Quantifying lip-read-induced suppression and facilitation of the auditory N1 and P2 reveals peak enhancements and delays. Psychophysiology 2016; 53:1295-306. [DOI: 10.1111/psyp.12683] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2015] [Accepted: 05/09/2016] [Indexed: 11/29/2022]
Affiliation(s)
- Martijn Baart, BCBL, Basque Center on Cognition, Brain and Language, Donostia-San Sebastián, Spain; Department of Cognitive Neuropsychology, Tilburg University, Tilburg, the Netherlands
|
48
|
Kaganovich N, Schumaker J, Rowland C. Matching heard and seen speech: An ERP study of audiovisual word recognition. Brain Lang 2016; 157-158:14-24. [PMID: 27155219 PMCID: PMC4915735 DOI: 10.1016/j.bandl.2016.04.010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/27/2015] [Revised: 03/23/2016] [Accepted: 04/10/2016] [Indexed: 06/05/2023]
Abstract
Seeing articulatory gestures while listening to speech-in-noise (SIN) significantly improves speech understanding. However, the degree of this improvement varies greatly among individuals. We examined the relationship between two distinct stages of visual articulatory processing and SIN accuracy by combining a cross-modal repetition priming task with ERP recordings. Participants first heard a word referring to a common object (e.g., pumpkin) and then decided whether the subsequently presented silent visual articulation matched the word they had just heard. Incongruent articulations elicited a significantly enhanced N400, indicative of mismatch detection at the pre-lexical level. Congruent articulations elicited a significantly larger LPC, indexing articulatory word recognition. Only the N400 difference between incongruent and congruent trials was significantly correlated with individuals' SIN accuracy improvement in the presence of the talker's face.
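The brain-behavior link reported here is a cross-subject correlation, which the following sketch reproduces with simulated values (sample size, effect sizes, and units are placeholders, not the study's): each participant contributes one N400 congruency effect and one audiovisual SIN gain, and the two are related with a Pearson correlation.

```python
# Hedged sketch: correlate each participant's N400 congruency effect
# (incongruent minus congruent mean amplitude, in microvolts) with their
# SIN gain (audiovisual minus auditory-only accuracy, in percent).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects = 24                               # assumed sample size

n400_effect = rng.normal(-2.0, 1.0, n_subjects)
# Simulated so that larger (more negative) N400 effects go with larger gains.
sin_gain = 10.0 - 2.0 * n400_effect + rng.normal(0, 2.0, n_subjects)

r, p = stats.pearsonr(n400_effect, sin_gain)
print(f"N400 effect vs. SIN gain: r = {r:.2f}, p = {p:.4g}")
```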
Affiliation(s)
- Natalya Kaganovich, Department of Speech, Language, and Hearing Sciences, Purdue University, 715 Clinic Drive, West Lafayette, IN 47907-2038, United States; Department of Psychological Sciences, Purdue University, 703 Third Street, West Lafayette, IN 47907-2038, United States
- Jennifer Schumaker, Department of Speech, Language, and Hearing Sciences, Purdue University, 715 Clinic Drive, West Lafayette, IN 47907-2038, United States
- Courtney Rowland, Department of Speech, Language, and Hearing Sciences, Purdue University, 715 Clinic Drive, West Lafayette, IN 47907-2038, United States
|
49
|
Park H, Kayser C, Thut G, Gross J. Lip movements entrain the observers' low-frequency brain oscillations to facilitate speech intelligibility. eLife 2016; 5. [PMID: 27146891 PMCID: PMC4900800 DOI: 10.7554/elife.14521] [Citation(s) in RCA: 78] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2016] [Accepted: 05/03/2016] [Indexed: 12/02/2022] Open
Abstract
During continuous speech, lip movements provide visual temporal signals that facilitate speech processing. Here, using MEG, we directly investigated how these visual signals interact with rhythmic brain activity in participants listening to and seeing the speaker. First, we investigated coherence between oscillatory brain activity and the speaker's lip movements and demonstrated significant entrainment in visual cortex. We then used partial coherence to remove contributions of the coherent auditory speech signal from the lip-brain coherence. Comparing this synchronization between different attention conditions revealed that attending to visual speech enhances the coherence between activity in visual cortex and the speaker's lips. Further, we identified a significant partial coherence between left motor cortex and lip movements, and this partial coherence directly predicted comprehension accuracy. Our results emphasize the importance of visually entrained and attention-modulated rhythmic brain activity for the enhancement of audiovisual speech processing.
People are able to communicate effectively with each other even in very noisy places where it is difficult to actually hear what others are saying. In a face-to-face conversation, people detect and respond to many physical cues, including body posture, facial expressions, head and eye movements, and gestures, alongside the sound cues. Lip movements are particularly important and contain enough information to allow trained observers to understand speech even if they cannot hear the speech itself. It is known that brain waves in listeners are synchronized with the rhythms in speech, especially the syllables. This is thought to establish a channel for communication, similar to tuning a radio to a certain frequency to listen to a certain radio station. Park et al. studied whether listeners' brain waves also align to the speaker's lip movements during continuous speech and whether this is important for understanding speech. The experiments reveal that a part of the brain that processes visual information, called the visual cortex, produces brain waves that are synchronized to the rhythm of syllables in continuous speech. This synchronization was more precise in a complex situation where lip movements would be more important for understanding speech. Park et al. also found that the area of the observer's brain that controls the lips (the motor cortex) also produced brain waves that were synchronized to lip movements. Volunteers whose motor cortex was more synchronized to the lip movements understood speech better. This supports the idea that brain areas used for producing speech are also important for understanding speech. Future challenges include understanding how the synchronization of brain waves with the rhythms of speech helps us to understand speech, and how the brain waves produced by the visual and motor areas interact.
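Partial coherence, the key quantity in this analysis, can be written directly in terms of auto- and cross-spectra: the contribution of a third signal z is regressed out of the cross-spectrum before coherence is formed. The sketch below does this with synthetic signals (a shared 4 Hz 'syllable' rhythm; sampling rate, durations, and noise levels are invented), not with the paper's MEG pipeline.

```python
# Hedged sketch: coherence between a "brain" signal x and a "lip" signal y,
# before and after spectrally removing the auditory envelope z.
import numpy as np
from scipy.signal import csd, welch

rng = np.random.default_rng(3)
fs, n = 250, 250 * 120                       # 250 Hz, 2 minutes of data
t = np.arange(n) / fs

rhythm = np.sin(2 * np.pi * 4 * t)           # shared 4 Hz syllable rhythm
z = rhythm + 0.5 * rng.standard_normal(n)    # auditory speech envelope
y = rhythm + 0.5 * rng.standard_normal(n)    # lip aperture
x = 0.7 * rhythm + rng.standard_normal(n)    # brain signal

nperseg = 2 * fs
f, Sxy = csd(x, y, fs=fs, nperseg=nperseg)
_, Sxz = csd(x, z, fs=fs, nperseg=nperseg)
_, Szy = csd(z, y, fs=fs, nperseg=nperseg)
_, Sxx = welch(x, fs=fs, nperseg=nperseg)
_, Syy = welch(y, fs=fs, nperseg=nperseg)
_, Szz = welch(z, fs=fs, nperseg=nperseg)

coh = np.abs(Sxy) ** 2 / (Sxx * Syy)         # ordinary lip-brain coherence

# Partial coherence: remove the part of the x-y coupling explained by z.
Sxy_z = Sxy - Sxz * Szy / Szz
Sxx_z = Sxx - np.abs(Sxz) ** 2 / Szz
Syy_z = Syy - np.abs(Szy) ** 2 / Szz
pcoh = np.abs(Sxy_z) ** 2 / (Sxx_z * Syy_z)

i = np.argmin(np.abs(f - 4))                 # inspect the 4 Hz syllable rate
print(f"coherence at 4 Hz: {coh[i]:.2f}, partial coherence: {pcoh[i]:.2f}")
```

In this synthetic case the lip-brain coupling is carried entirely by the shared rhythm, so partialling out the envelope collapses the 4 Hz coherence; in the study, it is the coupling that survives this step that indexes a genuinely visual contribution.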
Affiliation(s)
- Hyojin Park, Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
- Christoph Kayser, Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
- Gregor Thut, Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
- Joachim Gross, Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
|
50
|
Yovel G, O’Toole AJ. Recognizing People in Motion. Trends Cogn Sci 2016; 20:383-395. [DOI: 10.1016/j.tics.2016.02.005] [Citation(s) in RCA: 83] [Impact Index Per Article: 10.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2016] [Revised: 02/18/2016] [Accepted: 02/18/2016] [Indexed: 11/15/2022]
|