1
Corsini A, Tomassini A, Pastore A, Delis I, Fadiga L, D'Ausilio A. Speech perception difficulty modulates theta-band encoding of articulatory synergies. J Neurophysiol 2024; 131:480-491. [PMID: 38323331] [DOI: 10.1152/jn.00388.2023]
Abstract
The human brain tracks available speech acoustics and extrapolates missing information such as the speaker's articulatory patterns. However, the extent to which articulatory reconstruction supports speech perception remains unclear. This study explores the relationship between articulatory reconstruction and task difficulty. Participants listened to sentences and performed a speech-rhyming task. Real kinematic data of the speaker's vocal tract were recorded via electromagnetic articulography (EMA) and aligned to corresponding acoustic outputs. We extracted articulatory synergies from the EMA data with principal component analysis (PCA) and employed partial information decomposition (PID) to separate the electroencephalographic (EEG) encoding of acoustic and articulatory features into unique, redundant, and synergistic atoms of information. We median-split sentences into easy (ES) and hard (HS) based on participants' performance and found that greater task difficulty involved greater encoding of unique articulatory information in the theta band. We conclude that fine-grained articulatory reconstruction plays a complementary role in the encoding of speech acoustics, lending further support to the claim that motor processes support speech perception.

NEW & NOTEWORTHY: Top-down processes originating from the motor system contribute to speech perception through the reconstruction of the speaker's articulatory movement. This study investigates the role of such articulatory simulation under variable task difficulty. We show that more challenging listening tasks lead to increased encoding of articulatory kinematics in the theta band and suggest that, in such situations, fine-grained articulatory reconstruction complements acoustic encoding.
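The synergy-extraction step described above (PCA on multichannel EMA kinematics) can be sketched as follows. This is a minimal illustration on synthetic data, with made-up dimensions, noise levels, and variance threshold; it shows the technique, not the authors' pipeline.

```python
import numpy as np

# Synthetic stand-in for EMA recordings: 1000 time samples x 12
# articulator coordinates, driven by 3 underlying "synergies".
rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 3))           # hidden synergy time courses
mixing = rng.standard_normal((3, 12))             # how synergies drive sensors
ema = latent @ mixing + 0.1 * rng.standard_normal((1000, 12))

# PCA via SVD on the mean-centered data
ema_centered = ema - ema.mean(axis=0)
U, S, Vt = np.linalg.svd(ema_centered, full_matrices=False)
explained = S**2 / np.sum(S**2)                   # variance ratio per component

# Keep enough components to explain 95% of the variance (arbitrary cutoff)
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
synergies = ema_centered @ Vt[:k].T               # time courses of k synergies
print(k, synergies.shape)
```

With only three latent sources and weak noise, the retained components recover the low-dimensional synergy structure hidden in the twelve sensor channels.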
Affiliation(s)
- Alessandro Corsini
- Center for Translational Neurophysiology of Speech and Communication, Istituto Italiano di Tecnologia, Ferrara, Italy
- Department of Neuroscience and Rehabilitation, Università di Ferrara, Ferrara, Italy
- Alice Tomassini
- Center for Translational Neurophysiology of Speech and Communication, Istituto Italiano di Tecnologia, Ferrara, Italy
- Department of Neuroscience and Rehabilitation, Università di Ferrara, Ferrara, Italy
- Aldo Pastore
- Laboratorio NEST, Scuola Normale Superiore, Pisa, Italy
- Ioannis Delis
- School of Biomedical Sciences, University of Leeds, Leeds, United Kingdom
- Luciano Fadiga
- Center for Translational Neurophysiology of Speech and Communication, Istituto Italiano di Tecnologia, Ferrara, Italy
- Department of Neuroscience and Rehabilitation, Università di Ferrara, Ferrara, Italy
- Alessandro D'Ausilio
- Center for Translational Neurophysiology of Speech and Communication, Istituto Italiano di Tecnologia, Ferrara, Italy
- Department of Neuroscience and Rehabilitation, Università di Ferrara, Ferrara, Italy
2
Haider CL, Park H, Hauswald A, Weisz N. Neural Speech Tracking Highlights the Importance of Visual Speech in Multi-speaker Situations. J Cogn Neurosci 2024; 36:128-142. [PMID: 37977156] [DOI: 10.1162/jocn_a_02059]
Abstract
Visual speech plays a powerful role in facilitating auditory speech processing, and it became a publicly visible topic with the widespread use of face masks during the COVID-19 pandemic. In a previous magnetoencephalography study, we showed that occluding the mouth area significantly impairs neural speech tracking. To rule out the possibility that this deterioration is due to degraded sound quality, in the present follow-up study we presented participants with audiovisual (AV) and audio-only (A) speech. We further independently manipulated the trials by adding a face mask and a distractor speaker. Our results clearly show that face masks only affect speech tracking in AV conditions, not in A conditions. This shows that face masks primarily impact speech processing by blocking visual speech rather than by acoustic degradation. We further characterize how the spectrogram, lip movements, and lexical units are tracked at the sensor level, and find visual benefits for tracking the spectrogram, especially in the multi-speaker condition. While lip movements show an additional improvement and visual benefit over tracking of the spectrogram only in clear speech conditions, lexical units (phonemes and word onsets) show no visual enhancement at all. We hypothesize that in young, normal-hearing individuals, information from visual input is used less for specific feature extraction and acts more as a general resource for guiding attention.
Affiliation(s)
- Nathan Weisz
- Paris Lodron Universität Salzburg
- Paracelsus Medical University Salzburg
3
Ahmed F, Nidiffer AR, Lalor EC. The effect of gaze on EEG measures of multisensory integration in a cocktail party scenario. Front Hum Neurosci 2023; 17:1283206. [PMID: 38162285] [PMCID: PMC10754997] [DOI: 10.3389/fnhum.2023.1283206]
Abstract
Seeing the speaker's face greatly improves our speech comprehension in noisy environments. This is due to the brain's ability to combine the auditory and the visual information around us, a process known as multisensory integration. Selective attention also strongly influences what we comprehend in scenarios with multiple speakers, an effect known as the cocktail-party phenomenon. However, the interaction between attention and multisensory integration is not fully understood, especially when it comes to natural, continuous speech. In a recent electroencephalography (EEG) study, we explored this issue and showed that multisensory integration is enhanced when an audiovisual speaker is attended compared to when that speaker is unattended. Here, we extend that work to investigate how this interaction varies depending on a person's gaze behavior, which affects the quality of the visual information they have access to. To do so, we recorded EEG from 31 healthy adults as they performed selective attention tasks in several paradigms involving two concurrently presented audiovisual speakers. We then modeled how the recorded EEG related to the audio speech (envelope) of the presented speakers. Crucially, we compared two classes of model: one that assumed underlying multisensory integration (AV) and another that assumed two independent unisensory audio and visual processes (A+V). This comparison revealed evidence of strong attentional effects on multisensory integration when participants were looking directly at the face of an audiovisual speaker. This effect was not apparent when the speaker's face was in the peripheral vision of the participants. Overall, our findings suggest a strong influence of attention on multisensory integration when high fidelity visual (articulatory) speech information is available. More generally, this suggests that the interplay between attention and multisensory integration during natural audiovisual speech is dynamic and is adaptable based on the specific task and environment.
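The AV versus A+V comparison can be illustrated with a toy lagged-regression example. Everything here is synthetic and assumed (signal sizes, an explicit audio-by-visual interaction term standing in for multisensory integration); it demonstrates the comparison logic, not the study's actual model definitions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_lags = 2000, 10

def lagged(x, n_lags):
    """Design matrix of x at lags 0..n_lags-1 (circular, for simplicity)."""
    return np.column_stack([np.roll(x, k) for k in range(n_lags)])

audio = rng.standard_normal(n)
visual = rng.standard_normal(n)
# Synthetic "EEG" with an audio response, a visual response, and a genuine
# multisensory (audio*visual) component that only a joint model can capture.
eeg = (lagged(audio, n_lags) @ rng.standard_normal(n_lags)
       + lagged(visual, n_lags) @ rng.standard_normal(n_lags)
       + 0.5 * lagged(audio * visual, n_lags) @ rng.standard_normal(n_lags)
       + rng.standard_normal(n))

def fit_predict(X, y):
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ w

# A+V: two independent unisensory fits, predictions summed.
pred_apv = (fit_predict(lagged(audio, n_lags), eeg)
            + fit_predict(lagged(visual, n_lags), eeg))

# AV: one joint fit that also includes the interaction term.
X_av = np.hstack([lagged(audio, n_lags), lagged(visual, n_lags),
                  lagged(audio * visual, n_lags)])
pred_av = fit_predict(X_av, eeg)

r_apv = np.corrcoef(pred_apv, eeg)[0, 1]
r_av = np.corrcoef(pred_av, eeg)[0, 1]
print(r_av > r_apv)   # the joint model captures the multisensory component
```

In this toy setup the AV model predicts the signal better than A+V precisely because the data contain a genuine multisensory component; when no such component exists, the two models perform comparably on held-out data, which is the diagnostic the comparison relies on.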
Affiliation(s)
- Edmund C. Lalor
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY, United States
4
Di Liberto GM, Attaheri A, Cantisani G, Reilly RB, Ní Choisdealbha Á, Rocha S, Brusini P, Goswami U. Emergence of the cortical encoding of phonetic features in the first year of life. Nat Commun 2023; 14:7789. [PMID: 38040720] [PMCID: PMC10692113] [DOI: 10.1038/s41467-023-43490-x]
Abstract
Even prior to producing their first words, infants are developing a sophisticated speech processing system, with robust word recognition present by 4-6 months of age. These emergent linguistic skills, observed in behavioural investigations, are likely to rely on increasingly sophisticated neural underpinnings. The infant brain is known to robustly track the speech envelope; however, previous cortical tracking studies were unable to demonstrate the presence of phonetic feature encoding. Here we utilise temporal response functions computed from electrophysiological responses to nursery rhymes to investigate the cortical encoding of phonetic features in a longitudinal cohort of infants aged 4, 7, and 11 months, as well as in adults. The analyses reveal an increasingly detailed and acoustically invariant phonetic encoding emerging over the first year of life, providing neurophysiological evidence that the pre-verbal human cortex learns phonetic categories. By contrast, we found no credible evidence for age-related increases in cortical tracking of the acoustic spectrogram.
Affiliation(s)
- Giovanni M Di Liberto
- ADAPT Centre, School of Computer Science and Statistics, Trinity College, The University of Dublin, Dublin, Ireland
- Trinity College Institute of Neuroscience, Trinity College, The University of Dublin, Dublin, Ireland
- Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom
- Adam Attaheri
- Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom
- Giorgia Cantisani
- ADAPT Centre, School of Computer Science and Statistics, Trinity College, The University of Dublin, Dublin, Ireland
- Laboratoire des Systémes Perceptifs, Département d'études Cognitives, École normale supérieure, PSL University, CNRS, 75005, Paris, France
- Richard B Reilly
- Trinity College Institute of Neuroscience, Trinity College, The University of Dublin, Dublin, Ireland
- School of Engineering, Trinity Centre for Biomedical Engineering, Trinity College, The University of Dublin, Dublin, Ireland
- School of Medicine, Trinity College, The University of Dublin, Dublin, Ireland
- Áine Ní Choisdealbha
- Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom
- Sinead Rocha
- Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom
- Perrine Brusini
- Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom
- Usha Goswami
- Centre for Neuroscience in Education, Department of Psychology, University of Cambridge, Cambridge, United Kingdom
5
Brodbeck C, Das P, Gillis M, Kulasingham JP, Bhattasali S, Gaston P, Resnik P, Simon JZ. Eelbrain, a Python toolkit for time-continuous analysis with temporal response functions. eLife 2023; 12:e85012. [PMID: 38018501] [PMCID: PMC10783870] [DOI: 10.7554/eLife.85012]
Abstract
Even though human experience unfolds continuously in time, it is not strictly linear; instead, it entails cascading processes building hierarchical cognitive structures. For instance, during speech perception, humans transform a continuously varying acoustic signal into phonemes, words, and meaning, and these levels all have distinct but interdependent temporal structures. Time-lagged regression using temporal response functions (TRFs) has recently emerged as a promising tool for disentangling electrophysiological brain responses related to such complex models of perception. Here, we introduce the Eelbrain Python toolkit, which makes this kind of analysis easy and accessible. We demonstrate its use, using continuous speech as a sample paradigm, with a freely available EEG dataset of audiobook listening. A companion GitHub repository provides the complete source code for the analysis, from raw data to group-level statistics. More generally, we advocate a hypothesis-driven approach in which the experimenter specifies a hierarchy of time-continuous representations that are hypothesized to have contributed to brain responses, and uses those as predictor variables for the electrophysiological signal. This is analogous to a multiple regression problem, but with the addition of a time dimension. TRF analysis decomposes the brain signal into distinct responses associated with the different predictor variables by estimating a multivariate TRF (mTRF), quantifying the influence of each predictor on brain responses as a function of time lags. This allows asking two questions about the predictor variables: (1) Is there a significant neural representation corresponding to this predictor variable? And if so, (2) what are the temporal characteristics of the neural response associated with it? Thus, different predictor variables can be systematically combined and evaluated to jointly model neural processing at multiple hierarchical levels. We discuss applications of this approach, including the potential for linking algorithmic/representational theories at different cognitive levels to brain responses through computational models with appropriate linking hypotheses.
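The core estimation step the toolkit wraps, time-lagged ridge regression of a stimulus feature onto a brain signal, can be sketched in plain NumPy. This is a minimal illustration on simulated data (sampling rate, lag range, noise level, and regularization are all made up), not Eelbrain's API.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 64                                   # sampling rate, Hz
n = 4096
envelope = rng.standard_normal(n)         # stand-in for a speech envelope

# Ground-truth TRF: a response peaking roughly 100 ms after the stimulus
lags = np.arange(0, 24)                   # 0 to ~360 ms at 64 Hz
true_trf = np.exp(-0.5 * ((lags / fs - 0.1) / 0.03) ** 2)

# Time-lagged design matrix: envelope at each lag (circular, for simplicity)
X = np.column_stack([np.roll(envelope, k) for k in lags])
eeg = X @ true_trf + 0.5 * rng.standard_normal(n)   # simulated brain signal

# Ridge estimate: w = (X'X + lambda*I)^-1 X'y
lam = 1.0
w = np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)

peak_lag_ms = lags[np.argmax(w)] / fs * 1000
print(round(peak_lag_ms))                 # recovers the ~100 ms response peak
```

The estimated weight vector `w` is the TRF: the brain's impulse response to the predictor. Stacking lagged copies of several predictors side by side in `X` yields the multivariate (mTRF) case described above.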
Affiliation(s)
- Proloy Das
- Stanford University, Stanford, United States
- Philip Resnik
- University of Maryland, College Park, College Park, United States
6
Zhang Y, Ding R, Frassinelli D, Tuomainen J, Klavinskis-Whiting S, Vigliocco G. The role of multimodal cues in second language comprehension. Sci Rep 2023; 13:20824. [PMID: 38012193] [PMCID: PMC10682458] [DOI: 10.1038/s41598-023-47643-2]
Abstract
In face-to-face communication, multimodal cues such as prosody, gestures, and mouth movements can play a crucial role in language processing. While several studies have addressed how these cues contribute to native (L1) language processing, their impact on non-native (L2) comprehension is largely unknown. Comprehension of naturalistic language by L2 comprehenders may be supported by the presence of (at least some) multimodal cues, as these provide correlated and convergent information that may aid linguistic processing. However, it is also the case that multimodal cues may be less used by L2 comprehenders because linguistic processing is more demanding than for L1 comprehenders, leaving more limited resources for the processing of multimodal cues. In this study, we investigated how L2 comprehenders use multimodal cues in naturalistic stimuli (while participants watched videos of a speaker), as measured by electrophysiological responses (N400) to words, and whether there are differences between L1 and L2 comprehenders. We found that prosody, gestures, and informative mouth movements each reduced the N400 in L2, indexing easier comprehension. Nevertheless, L2 participants showed weaker effects for each cue compared to L1 comprehenders, with the exception of meaningful gestures and informative mouth movements. These results show that L2 comprehenders focus on specific multimodal cues - meaningful gestures that support meaningful interpretation and mouth movements that enhance the acoustic signal - while using multimodal cues to a lesser extent than L1 comprehenders overall.
Affiliation(s)
- Ye Zhang
- Experimental Psychology, University College London, London, UK
- Rong Ding
- Language and Computation in Neural Systems, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Diego Frassinelli
- Department of Linguistics, University of Konstanz, Konstanz, Germany
- Jyrki Tuomainen
- Speech, Hearing and Phonetic Sciences, University College London, London, UK
7
Wang B, Xu X, Niu Y, Wu C, Wu X, Chen J. EEG-based auditory attention decoding with audiovisual speech for hearing-impaired listeners. Cereb Cortex 2023; 33:10972-10983. [PMID: 37750333] [DOI: 10.1093/cercor/bhad325]
Abstract
Auditory attention decoding (AAD) is used to determine the attended speaker during an auditory selective attention task. However, the auditory factors modulating AAD remain unclear for hearing-impaired (HI) listeners. In this study, scalp electroencephalography (EEG) was recorded with an auditory selective attention paradigm, in which HI listeners were instructed to attend to one of two simultaneous speech streams with or without congruent visual input (articulation movements), and at a high or low target-to-masker ratio (TMR). Meanwhile, behavioral hearing tests (i.e., audiogram, speech reception threshold, temporal modulation transfer function) were used to assess listeners' individual auditory abilities. The results showed that both visual input and increasing TMR significantly enhanced the cortical tracking of the attended speech and AAD accuracy. Further analysis revealed that the audiovisual (AV) gain in attended-speech cortical tracking was significantly correlated with listeners' auditory amplitude modulation (AM) sensitivity, and the TMR gain in attended-speech cortical tracking was significantly correlated with listeners' hearing thresholds. Temporal response function analysis revealed that subjects with higher AM sensitivity demonstrated more AV gain over the right occipitotemporal and bilateral frontocentral scalp electrodes.
Affiliation(s)
- Bo Wang
- Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), School of Intelligence Science and Technology, Peking University, Beijing 100871, China
- Xiran Xu
- Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), School of Intelligence Science and Technology, Peking University, Beijing 100871, China
- Yadong Niu
- Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), School of Intelligence Science and Technology, Peking University, Beijing 100871, China
- Chao Wu
- School of Nursing, Peking University, Beijing 100191, China
- Xihong Wu
- Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), School of Intelligence Science and Technology, Peking University, Beijing 100871, China
- National Biomedical Imaging Center, College of Future Technology, Beijing 100871, China
- Jing Chen
- Speech and Hearing Research Center, Key Laboratory of Machine Perception (Ministry of Education), School of Intelligence Science and Technology, Peking University, Beijing 100871, China
- National Biomedical Imaging Center, College of Future Technology, Beijing 100871, China
8
Tan SHJ, Kalashnikova M, Di Liberto GM, Crosse MJ, Burnham D. Seeing a Talking Face Matters: Gaze Behavior and the Auditory-Visual Speech Benefit in Adults' Cortical Tracking of Infant-directed Speech. J Cogn Neurosci 2023; 35:1741-1759. [PMID: 37677057] [DOI: 10.1162/jocn_a_02044]
Abstract
In face-to-face conversations, listeners gather visual speech information from a speaker's talking face that enhances their perception of the incoming auditory speech signal. This auditory-visual (AV) speech benefit is evident even in quiet environments but is stronger in situations that require greater listening effort, such as when the speech signal itself deviates from listeners' expectations. One example is infant-directed speech (IDS) presented to adults. IDS has exaggerated acoustic properties that are easily discriminable from adult-directed speech (ADS). Although IDS is a speech register that adults typically use with infants, no previous neurophysiological study has directly examined whether adult listeners process IDS differently from ADS. To address this, the current study simultaneously recorded EEG and eye-tracking data from adult participants as they were presented with auditory-only (AO), visual-only, and AV recordings of IDS and ADS. Eye-tracking data were recorded because looking behavior toward the speaker's eyes and mouth modulates the extent of AV speech benefit experienced. Analyses of cortical tracking accuracy revealed that cortical tracking of the speech envelope was significant in AO and AV modalities for IDS and ADS. However, the AV speech benefit [i.e., AV > (A + V)] was only present for IDS trials. Gaze behavior analyses indicated differences in looking behavior during IDS and ADS trials. Surprisingly, looking behavior toward the speaker's eyes and mouth was not correlated with cortical tracking accuracy. Additional exploratory analyses indicated that attention to the whole display was negatively correlated with cortical tracking accuracy of AO and visual-only trials in IDS. Our results underscore the nuances involved in the relationship between neurophysiological AV speech benefit and looking behavior.
Affiliation(s)
- Sok Hui Jessica Tan
- The MARCS Institute of Brain, Behaviour and Development, Western Sydney University, Australia
- Science of Learning in Education Centre, Office of Education Research, National Institute of Education, Nanyang Technological University, Singapore
- Marina Kalashnikova
- The Basque Center on Cognition, Brain and Language
- IKERBASQUE, Basque Foundation for Science
- Giovanni M Di Liberto
- ADAPT Centre, School of Computer Science and Statistics, Trinity College Institute of Neuroscience, Trinity College, The University of Dublin, Ireland
- Michael J Crosse
- SEGOTIA, Galway, Ireland
- Trinity Center for Biomedical Engineering, Department of Mechanical, Manufacturing & Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
- Denis Burnham
- The MARCS Institute of Brain, Behaviour and Development, Western Sydney University, Australia
9
Guilleminot P, Graef C, Butters E, Reichenbach T. Audiotactile Stimulation Can Improve Syllable Discrimination through Multisensory Integration in the Theta Frequency Band. J Cogn Neurosci 2023; 35:1760-1772. [PMID: 37677062] [DOI: 10.1162/jocn_a_02045]
Abstract
Syllables are an essential building block of speech. We recently showed that tactile stimuli linked to the perceptual centers of syllables in continuous speech can improve speech comprehension. The rate of syllables lies in the theta frequency range, between 4 and 8 Hz, and the behavioral effect appears linked to multisensory integration in this frequency band. Because this neural activity may be oscillatory, we hypothesized that a behavioral effect may also occur not only while but also after this activity has been evoked or entrained through vibrotactile pulses. Here, we show that audiotactile integration regarding the perception of single syllables, both on the neural and on the behavioral level, is consistent with this hypothesis. We first stimulated participants with a series of vibrotactile pulses and then presented them with a syllable in background noise. We show that, at a delay of 200 msec after the last vibrotactile pulse, audiotactile integration still occurred in the theta band and syllable discrimination was enhanced. Moreover, the dependence of both the neural multisensory integration as well as of the behavioral discrimination on the delay of the audio signal with respect to the last tactile pulse was consistent with a damped oscillation. In addition, the multisensory gain is correlated with the syllable discrimination score. Our results therefore evidence the role of the theta band in audiotactile integration and provide evidence that these effects may involve oscillatory activity that still persists after the tactile stimulation.
10
Jiang Z, An X, Liu S, Yin E, Yan Y, Ming D. Neural oscillations reflect the individual differences in the temporal perception of audiovisual speech. Cereb Cortex 2023; 33:10575-10583. [PMID: 37727958] [DOI: 10.1093/cercor/bhad304]
Abstract
Multisensory integration occurs within a limited time interval between multimodal stimuli. Multisensory temporal perception varies widely among individuals and involves perceptual synchrony and temporal sensitivity processes. Previous studies explored the neural mechanisms of individual differences with beep-flash stimuli, but none had examined speech. In this study, 28 subjects (16 male) performed an audiovisual speech (/ba/) simultaneity judgment task while their electroencephalography was recorded. We examined the relationship between prestimulus neural oscillations (i.e., the pre-pronunciation movement-related oscillations) and temporal perception. Perceptual synchrony was quantified using the Point of Subjective Simultaneity, and temporal sensitivity using the Temporal Binding Window. Our results revealed dissociated neural mechanisms for individual differences in the Temporal Binding Window and the Point of Subjective Simultaneity. Frontocentral delta power, reflecting top-down attention control, is positively related to the magnitude of individual auditory leading Temporal Binding Windows (LTBWs), whereas parieto-occipital theta power, indexing bottom-up visual temporal attention specific to speech, is negatively associated with the magnitude of individual visual leading Temporal Binding Windows (RTBWs). In addition, increased left frontal and bilateral temporoparietal-occipital alpha power, reflecting general attentional states, is associated with increased Points of Subjective Simultaneity. Strengthening attention abilities might improve the audiovisual temporal perception of speech and further impact speech integration.
Affiliation(s)
- Zeliang Jiang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Xingwei An
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Shuang Liu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Erwei Yin
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), 100071 Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), 300457 Tianjin, China
- Ye Yan
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), 100071 Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), 300457 Tianjin, China
- Dong Ming
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
11
Ahmed F, Nidiffer AR, Lalor EC. The effect of gaze on EEG measures of multisensory integration in a cocktail party scenario. bioRxiv [Preprint] 2023:2023.08.23.554451. [PMID: 37662393] [PMCID: PMC10473711] [DOI: 10.1101/2023.08.23.554451]
Affiliation(s)
- Farhin Ahmed
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY 14627, USA
- Aaron R. Nidiffer
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY 14627, USA
- Edmund C. Lalor
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, and Center for Visual Science, University of Rochester, Rochester, NY 14627, USA
12
Jia Z, Xu C, Li J, Gao J, Ding N, Luo B, Zou J. Phase Property of Envelope-Tracking EEG Response Is Preserved in Patients with Disorders of Consciousness. eNeuro 2023; 10:ENEURO.0130-23.2023. [PMID: 37500493 PMCID: PMC10420405 DOI: 10.1523/eneuro.0130-23.2023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2023] [Revised: 07/16/2023] [Accepted: 07/20/2023] [Indexed: 07/29/2023] Open
Abstract
When listening to speech, the low-frequency cortical response below 10 Hz can track the speech envelope. Previous studies have demonstrated that the phase lag between speech envelope and cortical response can reflect the mechanism by which the envelope-tracking response is generated. Here, we analyze whether the mechanism to generate the envelope-tracking response is modulated by the level of consciousness, by studying how the stimulus-response phase lag is modulated by the disorder of consciousness (DoC). It is observed that DoC patients in general show less reliable neural tracking of speech. Nevertheless, the stimulus-response phase lag changes linearly with frequency between 3.5 and 8 Hz, for DoC patients who show reliable cortical tracking to speech, regardless of the consciousness state. The mean phase lag is also consistent across these DoC patients. These results suggest that the envelope-tracking response to speech can be generated by an automatic process that is barely modulated by the consciousness state.
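The linear change of phase lag with frequency reported above is the signature of a constant response latency, since a fixed delay tau produces a phase lag of 2*pi*f*tau at frequency f. A minimal sketch of how such a latency could be recovered from measured phase lags (synthetic values only, not the authors' code; assumes NumPy):

```python
import numpy as np

def latency_from_phase_lags(freqs_hz, phase_lags_rad):
    """If the envelope-tracking response lags the stimulus by a fixed time tau,
    the stimulus-response phase lag grows linearly with frequency:
    phi(f) = 2 * pi * f * tau. Fitting a line to the unwrapped phase lags
    therefore recovers tau (in seconds) from the slope."""
    phases = np.unwrap(np.asarray(phase_lags_rad, dtype=float))
    slope, _intercept = np.polyfit(np.asarray(freqs_hz, dtype=float), phases, deg=1)
    return slope / (2 * np.pi)

# Simulated phase lags over the 3.5-8 Hz range for a 120 ms latency
freqs = np.linspace(3.5, 8.0, 10)
tau_true = 0.120
lags = 2 * np.pi * freqs * tau_true
tau_hat = latency_from_phase_lags(freqs, lags)
```

In the study's logic, a latency that is consistent across patients regardless of consciousness state is what points to an automatic generation mechanism.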
Affiliation(s)
- Ziting Jia
- The Second Hospital, Cheeloo College of Medicine, Shandong University, Jinan 250033, China
- Chuan Xu
- Department of Neurology, Sir Run Run Shaw Hospital, Zhejiang University School of Medicine, Hangzhou 310019, China
- Jingqi Li
- Department of Rehabilitation, Hangzhou Mingzhou Brain Rehabilitation Hospital, Hangzhou 311215, China
- Jian Gao
- Department of Rehabilitation, Hangzhou Mingzhou Brain Rehabilitation Hospital, Hangzhou 311215, China
- Nai Ding
- Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou 310027, China
- Benyan Luo
- Department of Neurology, First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou 310003, China
- Jiajie Zou
- Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou 310027, China
13
Ahmed F, Nidiffer AR, O'Sullivan AE, Zuk NJ, Lalor EC. The integration of continuous audio and visual speech in a cocktail-party environment depends on attention. Neuroimage 2023; 274:120143. [PMID: 37121375 DOI: 10.1016/j.neuroimage.2023.120143] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 03/17/2023] [Accepted: 04/27/2023] [Indexed: 05/02/2023] Open
Abstract
In noisy environments, our ability to understand speech benefits greatly from seeing the speaker's face. This is attributed to the brain's ability to integrate audio and visual information, a process known as multisensory integration. In addition, selective attention plays an enormous role in what we understand, the so-called cocktail-party phenomenon. But how attention and multisensory integration interact remains incompletely understood, particularly in the case of natural, continuous speech. Here, we addressed this issue by analyzing EEG data recorded from participants who undertook a multisensory cocktail-party task using natural speech. To assess multisensory integration, we modeled the EEG responses to the speech in two ways. The first assumed that audiovisual speech processing is simply a linear combination of audio speech processing and visual speech processing (i.e., an A + V model), while the second allows for the possibility of audiovisual interactions (i.e., an AV model). Applying these models to the data revealed that EEG responses to attended audiovisual speech were better explained by an AV model, providing evidence for multisensory integration. In contrast, unattended audiovisual speech responses were best captured using an A + V model, suggesting that multisensory integration is suppressed for unattended speech. Follow up analyses revealed some limited evidence for early multisensory integration of unattended AV speech, with no integration occurring at later levels of processing. We take these findings as evidence that the integration of natural audio and visual speech occurs at multiple levels of processing in the brain, each of which can be differentially affected by attention.
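The A + V versus AV comparison described above can be illustrated with synthetic data: when the response contains a genuine audiovisual interaction, a jointly fitted AV model predicts it better than summed unisensory fits. A toy sketch, not the study's pipeline; the signals, lag structure, and explicit interaction term are all made up for illustration (assumes NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

def lagged(x, n_lags):
    """Design matrix of delayed copies of a stimulus (a crude TRF-style model)."""
    X = np.zeros((len(x), n_lags))
    for k in range(n_lags):
        X[k:, k] = x[:len(x) - k]
    return X

def fit(X, y, lam=1e-3):
    """Ridge-regularized least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

n, n_lags = 5000, 8
audio = rng.standard_normal(n)     # synthetic audio envelope
visual = rng.standard_normal(n)    # synthetic visual speech signal
# Simulated EEG: unisensory responses plus a genuine audiovisual interaction
eeg = (lagged(audio, n_lags) @ rng.standard_normal(n_lags)
       + lagged(visual, n_lags) @ rng.standard_normal(n_lags)
       + 0.5 * audio * visual
       + 0.1 * rng.standard_normal(n))

Xa, Xv = lagged(audio, n_lags), lagged(visual, n_lags)
# A + V: two independent unisensory models, predictions summed
pred_a_plus_v = Xa @ fit(Xa, eeg) + Xv @ fit(Xv, eeg)
# AV: one joint model that also carries an interaction regressor
Xav = np.column_stack([Xa, Xv, audio * visual])
pred_av = Xav @ fit(Xav, eeg)

def score(pred, y):
    return np.corrcoef(pred, y)[0, 1]

r_a_plus_v, r_av = score(pred_a_plus_v, eeg), score(pred_av, eeg)
```

For brevity the models here are fit and scored on the same data; the study compared cross-validated prediction accuracy.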
Affiliation(s)
- Farhin Ahmed
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY 14627, USA
- Aaron R Nidiffer
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY 14627, USA
- Aisling E O'Sullivan
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY 14627, USA; School of Engineering, Trinity Centre for Biomedical Engineering, and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
- Nathaniel J Zuk
- Edmond & Lily Safra Center for Brain Sciences, Hebrew University, Jerusalem, Israel
- Edmund C Lalor
- Department of Biomedical Engineering, Department of Neuroscience, and Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY 14627, USA; School of Engineering, Trinity Centre for Biomedical Engineering, and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland.
14
Fu X, Riecke L. Effects of continuous tactile stimulation on auditory-evoked cortical responses depend on the audio-tactile phase. Neuroimage 2023; 274:120140. [PMID: 37120042 DOI: 10.1016/j.neuroimage.2023.120140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Accepted: 04/27/2023] [Indexed: 05/01/2023] Open
Abstract
Auditory perception can benefit from stimuli in non-auditory sensory modalities, as for example in lip-reading. Compared with such visual influences, tactile influences are still poorly understood. It has been shown that single tactile pulses can enhance the perception of auditory stimuli depending on their relative timing, but whether and how such brief auditory enhancements can be stretched in time with more sustained, phase-specific periodic tactile stimulation is still unclear. To address this question, we presented tactile stimulation that fluctuated coherently and continuously at 4Hz with an auditory noise (either in-phase or anti-phase) and assessed its effect on the cortical processing and perception of an auditory signal embedded in that noise. Scalp-electroencephalography recordings revealed an enhancing effect of in-phase tactile stimulation on cortical responses phase-locked to the noise and a suppressive effect of anti-phase tactile stimulation on responses evoked by the auditory signal. Although these effects appeared to follow well-known principles of multisensory integration of discrete audio-tactile events, they were not accompanied by corresponding effects on behavioral measures of auditory signal perception. Our results indicate that continuous periodic tactile stimulation can enhance cortical processing of acoustically-induced fluctuations and mask cortical responses to an ongoing auditory signal. They further suggest that such sustained cortical effects can be insufficient for inducing sustained bottom-up auditory benefits.
Affiliation(s)
- Xueying Fu
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, the Netherlands.
- Lars Riecke
- Department of Cognitive Neuroscience, Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, the Netherlands
15
Chalas N, Omigie D, Poeppel D, van Wassenhove V. Hierarchically nested networks optimize the analysis of audiovisual speech. iScience 2023; 26:106257. [PMID: 36909667 PMCID: PMC9993032 DOI: 10.1016/j.isci.2023.106257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2022] [Revised: 12/22/2022] [Accepted: 02/17/2023] [Indexed: 02/22/2023] Open
Abstract
In conversational settings, seeing the speaker's face elicits internal predictions about the upcoming acoustic utterance. Understanding how the listener's cortical dynamics tune to the temporal statistics of audiovisual (AV) speech is thus essential. Using magnetoencephalography, we explored how large-scale frequency-specific dynamics of human brain activity adapt to AV speech delays. First, we show that the amplitude of phase-locked responses parametrically decreases with natural AV speech synchrony, a pattern that is consistent with predictive coding. Second, we show that the temporal statistics of AV speech affect large-scale oscillatory networks at multiple spatial and temporal resolutions. We demonstrate a spatial nestedness of oscillatory networks during the processing of AV speech: these oscillatory hierarchies are such that high-frequency activity (beta, gamma) is contingent on the phase response of low-frequency (delta, theta) networks. Our findings suggest that the endogenous temporal multiplexing of speech processing confers adaptability within the temporal regimes that are essential for speech comprehension.
Affiliation(s)
- Nikos Chalas
- Institute for Biomagnetism and Biosignal Analysis, University of Münster, 48149 Münster, Germany
- CEA, DRF/Joliot, NeuroSpin, INSERM, Cognitive Neuroimaging Unit; CNRS; Université Paris-Saclay, 91191 Gif/Yvette, France
- School of Biology, Faculty of Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Corresponding author
- Diana Omigie
- Department of Psychology, Goldsmiths University London, London, UK
- David Poeppel
- Department of Psychology, New York University, New York, NY 10003, USA
- Ernst Struengmann Institute for Neuroscience, 60528 Frankfurt am Main, Frankfurt, Germany
- Virginie van Wassenhove
- CEA, DRF/Joliot, NeuroSpin, INSERM, Cognitive Neuroimaging Unit; CNRS; Université Paris-Saclay, 91191 Gif/Yvette, France
- Corresponding author
16
Ryumin D, Ivanko D, Ryumina E. Audio-Visual Speech and Gesture Recognition by Sensors of Mobile Devices. SENSORS (BASEL, SWITZERLAND) 2023; 23:s23042284. [PMID: 36850882 PMCID: PMC9967234 DOI: 10.3390/s23042284] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Revised: 02/06/2023] [Accepted: 02/14/2023] [Indexed: 05/27/2023]
Abstract
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable speech recognition, particularly when audio is corrupted by noise. Additional visual information can be used for both automatic lip-reading and gesture recognition. Hand gestures are a form of non-verbal communication and can be an important part of modern human-computer interaction systems. Currently, audio and video modalities are easily accessible by the sensors of mobile devices. However, there is no out-of-the-box solution for automatic audio-visual speech and gesture recognition. This study introduces two deep neural network-based model architectures: one for AVSR and one for gesture recognition. The main novelty regarding audio-visual speech recognition lies in the fine-tuning strategies for both visual and acoustic features and in the proposed end-to-end model, which considers three modality fusion approaches: prediction-level, feature-level, and model-level. The main novelty in gesture recognition lies in a unique set of spatio-temporal features, including those that capture lip articulation information. As no datasets are available for the combined task, we evaluated our methods on two different large-scale corpora, LRW and AUTSL, and outperformed existing methods on both the audio-visual speech recognition and gesture recognition tasks. We achieved an AVSR accuracy of 98.76% on the LRW dataset and a gesture recognition rate of 98.56% on the AUTSL dataset. The results demonstrate not only the high performance of the proposed methodology but also the fundamental possibility of recognizing audio-visual speech and gestures with the sensors of mobile devices.
17
Jiang Z, An X, Liu S, Wang L, Yin E, Yan Y, Ming D. The effect of prestimulus low-frequency neural oscillations on the temporal perception of audiovisual speech. Front Neurosci 2023; 17:1067632. [PMID: 36816126 PMCID: PMC9935937 DOI: 10.3389/fnins.2023.1067632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Accepted: 01/17/2023] [Indexed: 02/05/2023] Open
Abstract
Objective Perceptual integration and segregation are modulated by the phase of ongoing neural oscillations whose period is longer than the temporal binding window (TBW). Studies have shown that abstract beep-flash stimuli, with a TBW of about 100 ms, are modulated by the alpha-band phase. We therefore hypothesized that the temporal perception of speech, whose TBW spans several hundred milliseconds, might be affected by the delta-theta phase. Methods We conducted a speech-stimuli-based audiovisual simultaneity judgment (SJ) experiment. Twenty participants (12 female) took part in this study while 62-channel EEG was recorded. Results Behaviorally, visual-leading TBWs were broader than auditory-leading ones [273.37 ± 24.24 ms vs. 198.05 ± 19.28 ms (mean ± SEM)]. We used the Phase Opposition Sum (POS) to quantify differences in mean phase angles and phase concentrations between synchronous and asynchronous responses. The POS results indicated that the delta-theta phase differed significantly between synchronous and asynchronous responses in the A50V condition (50% synchronous responses at the auditory-leading SOA), whereas in the V50A condition (50% synchronous responses at the visual-leading SOA) we found only a delta-band effect. In neither condition did post hoc Rayleigh tests reveal a consistent phase angle across subjects for either perceptual response (all ps > 0.05), arguing against the neuronal-excitability account, which predicts that phases for a given perceptual response should concentrate on the same angle across subjects rather than being uniformly distributed. However, a V-test showed that the phase difference between synchronous and asynchronous responses had a significant phase opposition across subjects (all ps < 0.05), consistent with the POS result.
Conclusion These results indicate that speech temporal perception depends on the alignment of stimulus onset with an optimal phase of neural oscillations whose period may exceed the TBW. Rather than indexing neuronal excitability, the oscillatory phase may encode temporal information that varies across subjects. Given the rich temporal structure of spoken language, the conclusion that phase encodes temporal information is plausible and valuable for future research.
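The Phase Opposition Sum used above contrasts the phase concentration within each response group against the concentration of the pooled trials. A minimal sketch with synthetic phase data (not the authors' code; the cluster parameters are made up for illustration, and NumPy is assumed):

```python
import numpy as np

def itc(phases):
    """Inter-trial coherence: length of the mean resultant vector of the phases."""
    return np.abs(np.mean(np.exp(1j * np.asarray(phases))))

def phase_opposition_sum(phases_a, phases_b):
    """POS = ITC(A) + ITC(B) - 2 * ITC(pooled). It is large when each response
    group clusters tightly but around opposite phase angles, and near zero
    when phase carries no information about the perceptual outcome."""
    pooled = np.concatenate([np.asarray(phases_a), np.asarray(phases_b)])
    return itc(phases_a) + itc(phases_b) - 2 * itc(pooled)

rng = np.random.default_rng(1)
# Synchronous / asynchronous trials clustered at opposite phases (0 vs. pi)
sync_phases = rng.vonmises(0.0, 5.0, size=200)
async_phases = rng.vonmises(np.pi, 5.0, size=200)
pos_opposed = phase_opposition_sum(sync_phases, async_phases)
# Null case: phases unrelated to the response
pos_null = phase_opposition_sum(rng.uniform(-np.pi, np.pi, 200),
                                rng.uniform(-np.pi, np.pi, 200))
```

In practice, the significance of a POS value is assessed with permutation tests that shuffle the trial labels.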
Affiliation(s)
- Zeliang Jiang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Xingwei An
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Shuang Liu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Lu Wang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
- Erwei Yin
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China; Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing, China; Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China
- Ye Yan
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China; Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing, China; Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin, China
- Dong Ming
- Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin, China
18
Saalasti S, Alho J, Lahnakoski JM, Bacha-Trams M, Glerean E, Jääskeläinen IP, Hasson U, Sams M. Lipreading a naturalistic narrative in a female population: Neural characteristics shared with listening and reading. Brain Behav 2023; 13:e2869. [PMID: 36579557 PMCID: PMC9927859 DOI: 10.1002/brb3.2869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/11/2022] [Revised: 11/29/2022] [Accepted: 12/06/2022] [Indexed: 12/30/2022] Open
Abstract
INTRODUCTION Few of us are skilled lipreaders while most struggle with the task. The neural substrates that enable comprehension of connected natural speech via lipreading are not yet well understood. METHODS We used a data-driven approach to identify brain areas underlying the lipreading of an 8-min narrative with participants whose lipreading skills varied extensively (range 6-100%, mean = 50.7%). The participants also listened to and read the same narrative. The similarity between individual participants' brain activity during the whole narrative, within and between conditions, was estimated by a voxel-wise comparison of the Blood Oxygenation Level Dependent (BOLD) signal time courses. RESULTS Inter-subject correlation (ISC) of the time courses revealed that lipreading, listening to, and reading the narrative were largely supported by the same brain areas in the temporal, parietal and frontal cortices, precuneus, and cerebellum. Additionally, listening to and reading connected naturalistic speech activated higher-level linguistic processing in the parietal and frontal cortices more consistently than lipreading did, probably paralleling the limited understanding obtained via lipreading. Importantly, higher lipreading test scores and subjective estimates of comprehension of the lipread narrative were associated with activity in the superior and middle temporal cortex. CONCLUSIONS Our new data illustrate that findings from prior studies using well-controlled repetitive speech stimuli and stimulus-driven data analyses are also valid for naturalistic connected speech. Our results may indicate an efficient use of brain areas dealing with phonological processing in skilled lipreaders.
Affiliation(s)
- Satu Saalasti
- Department of Psychology and Logopedics, University of Helsinki, Helsinki, Finland; Brain and Mind Laboratory, Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; Advanced Magnetic Imaging (AMI) Centre, Aalto NeuroImaging, School of Science, Aalto University, Espoo, Finland
- Jussi Alho
- Brain and Mind Laboratory, Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland
- Juha M Lahnakoski
- Brain and Mind Laboratory, Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; Independent Max Planck Research Group for Social Neuroscience, Max Planck Institute of Psychiatry, Munich, Germany; Institute of Neuroscience and Medicine, Brain & Behaviour (INM-7), Research Center Jülich, Jülich, Germany; Institute of Systems Neuroscience, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Mareike Bacha-Trams
- Brain and Mind Laboratory, Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland
- Enrico Glerean
- Brain and Mind Laboratory, Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; Department of Psychology and the Neuroscience Institute, Princeton University, Princeton, USA
- Iiro P Jääskeläinen
- Brain and Mind Laboratory, Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland
- Uri Hasson
- Department of Psychology and the Neuroscience Institute, Princeton University, Princeton, USA
- Mikko Sams
- Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland; Aalto Studios - MAGICS, Aalto University, Espoo, Finland
19
Neurodevelopmental oscillatory basis of speech processing in noise. Dev Cogn Neurosci 2022; 59:101181. [PMID: 36549148 PMCID: PMC9792357 DOI: 10.1016/j.dcn.2022.101181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Revised: 10/31/2022] [Accepted: 11/25/2022] [Indexed: 11/27/2022] Open
Abstract
Humans' extraordinary ability to understand speech in noise relies on multiple processes that develop with age. Using magnetoencephalography (MEG), we characterize the underlying neuromaturational basis by quantifying how cortical oscillations in 144 participants (aged 5-27 years) track phrasal and syllabic structures in connected speech mixed with different types of noise. While the extraction of prosodic cues from clear speech was stable during development, its maintenance in a multi-talker background matured rapidly up to age 9 and was associated with speech comprehension. Furthermore, while the extraction of subtler information provided by syllables matured at age 9, its maintenance in noisy backgrounds progressively matured until adulthood. Altogether, these results highlight distinct behaviorally relevant maturational trajectories for the neuronal signatures of speech perception. In accordance with grain-size proposals, neuromaturational milestones are reached increasingly late for linguistic units of decreasing size, with further delays incurred by noise.
20
Suess N, Hauswald A, Reisinger P, Rösch S, Keitel A, Weisz N. Cortical tracking of formant modulations derived from silently presented lip movements and its decline with age. Cereb Cortex 2022; 32:4818-4833. [PMID: 35062025 PMCID: PMC9627034 DOI: 10.1093/cercor/bhab518] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2021] [Revised: 12/15/2021] [Accepted: 12/16/2021] [Indexed: 11/26/2022] Open
Abstract
The integration of visual and auditory cues is crucial for successful processing of speech, especially under adverse conditions. Recent reports have shown that when participants watch muted videos of speakers, the phonological information about the acoustic speech envelope, which is associated with but independent from the speakers' lip movements, is tracked by the visual cortex. However, the speech signal also carries richer acoustic details, for example, about the fundamental frequency and the resonant frequencies, whose visuophonological transformation could aid speech processing. Here, we investigated the neural basis of the visuo-phonological transformation processes of these more fine-grained acoustic details and assessed how they change as a function of age. We recorded whole-head magnetoencephalographic (MEG) data while the participants watched silent normal (i.e., natural) and reversed videos of a speaker and paid attention to their lip movements. We found that the visual cortex is able to track the unheard natural modulations of resonant frequencies (or formants) and the pitch (or fundamental frequency) linked to lip movements. Importantly, only the processing of natural unheard formants decreases significantly with age in the visual and also in the cingulate cortex. This is not the case for the processing of the unheard speech envelope, the fundamental frequency, or the purely visual information carried by lip movements. These results show that unheard spectral fine details (along with the unheard acoustic envelope) are transformed from a mere visual to a phonological representation. Aging affects especially the ability to derive spectral dynamics at formant frequencies. As listening in noisy environments should capitalize on the ability to track spectral fine details, our results provide a novel focus on compensatory processes in such challenging situations.
Affiliation(s)
- Nina Suess
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Anne Hauswald
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Patrick Reisinger
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Sebastian Rösch
- Department of Otorhinolaryngology, Head and Neck Surgery, Paracelsus Medical University Salzburg, University Hospital Salzburg, Salzburg 5020, Austria
- Anne Keitel
- School of Social Sciences, University of Dundee, Dundee DD1 4HN, UK
- Nathan Weisz
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Department of Psychology, Neuroscience Institute, Christian Doppler University Hospital, Paracelsus Medical University, Salzburg 5020, Austria
21
Influence of linguistic properties and hearing impairment on visual speech perception skills in the German language. PLoS One 2022; 17:e0275585. [PMID: 36178907 PMCID: PMC9524625 DOI: 10.1371/journal.pone.0275585] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2021] [Accepted: 09/20/2022] [Indexed: 11/19/2022] Open
Abstract
Visual input is crucial for understanding speech under noisy conditions, but there are hardly any tools to assess the individual ability to lipread. With this study, we wanted to (1) investigate how linguistic characteristics of language on the one hand and hearing impairment on the other hand have an impact on lipreading abilities and (2) provide a tool to assess lipreading abilities for German speakers. 170 participants (22 prelingually deaf) completed the online assessment, which consisted of a subjective hearing impairment scale and silent videos in which different item categories (numbers, words, and sentences) were spoken. The task for our participants was to recognize the spoken stimuli just by visual inspection. We used different versions of one test and investigated the impact of item categories, word frequency in the spoken language, articulation, sentence frequency in the spoken language, sentence length, and differences between speakers on the recognition score. We found an effect of item categories, articulation, sentence frequency, and sentence length on the recognition score. With respect to hearing impairment we found that higher subjective hearing impairment is associated with higher test score. We did not find any evidence that prelingually deaf individuals show enhanced lipreading skills over people with postlingual acquired hearing impairment. However, we see an interaction with education only in the prelingual deaf, but not in the population with postlingual acquired hearing loss. This points to the fact that there are different factors contributing to enhanced lipreading abilities depending on the onset of hearing impairment (prelingual vs. postlingual). Overall, lipreading skills vary strongly in the general population independent of hearing impairment. Based on our findings we constructed a new and efficient lipreading assessment tool (SaLT) that can be used to test behavioral lipreading abilities in the German speaking population.
22
Ross LA, Molholm S, Butler JS, Bene VAD, Foxe JJ. Neural correlates of multisensory enhancement in audiovisual narrative speech perception: a fMRI investigation. Neuroimage 2022; 263:119598. [PMID: 36049699 DOI: 10.1016/j.neuroimage.2022.119598] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2022] [Revised: 08/26/2022] [Accepted: 08/28/2022] [Indexed: 11/25/2022] Open
Abstract
This fMRI study investigated the effect of seeing articulatory movements of a speaker while listening to a naturalistic narrative stimulus. It had the goal to identify regions of the language network showing multisensory enhancement under synchronous audiovisual conditions. We expected this enhancement to emerge in regions known to underlie the integration of auditory and visual information such as the posterior superior temporal gyrus as well as parts of the broader language network, including the semantic system. To this end we presented 53 participants with a continuous narration of a story in auditory alone, visual alone, and both synchronous and asynchronous audiovisual speech conditions while recording brain activity using BOLD fMRI. We found multisensory enhancement in an extensive network of regions underlying multisensory integration and parts of the semantic network as well as extralinguistic regions not usually associated with multisensory integration, namely the primary visual cortex and the bilateral amygdala. Analysis also revealed involvement of thalamic brain regions along the visual and auditory pathways more commonly associated with early sensory processing. We conclude that under natural listening conditions, multisensory enhancement not only involves sites of multisensory integration but many regions of the wider semantic network and includes regions associated with extralinguistic sensory, perceptual and cognitive processing.
Affiliation(s)
- Lars A Ross
- The Frederick J. and Marion A. Schindler Cognitive Neurophysiology Laboratory, The Ernest J. Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, New York, 14642, USA; Department of Imaging Sciences, University of Rochester Medical Center, University of Rochester School of Medicine and Dentistry, Rochester, New York, 14642, USA; The Cognitive Neurophysiology Laboratory, Departments of Pediatrics and Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, New York, 10461, USA.
- Sophie Molholm
- The Frederick J. and Marion A. Schindler Cognitive Neurophysiology Laboratory, The Ernest J. Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, New York, 14642, USA; The Cognitive Neurophysiology Laboratory, Departments of Pediatrics and Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, New York, 10461, USA
- John S Butler
- The Cognitive Neurophysiology Laboratory, Departments of Pediatrics and Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, New York, 10461, USA; School of Mathematical Sciences, Technological University Dublin, Kevin Street Campus, Dublin, Ireland
- Victor A Del Bene
- The Cognitive Neurophysiology Laboratory, Departments of Pediatrics and Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, New York, 10461, USA; University of Alabama at Birmingham, Heersink School of Medicine, Department of Neurology, Birmingham, Alabama, 35233, USA
- John J Foxe
- The Frederick J. and Marion A. Schindler Cognitive Neurophysiology Laboratory, The Ernest J. Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, New York, 14642, USA; The Cognitive Neurophysiology Laboratory, Departments of Pediatrics and Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, New York, 10461, USA.
23
Yu W, Zeiler S, Kolossa D. Reliability-Based Large-Vocabulary Audio-Visual Speech Recognition. SENSORS (BASEL, SWITZERLAND) 2022; 22:5501. [PMID: 35898005 PMCID: PMC9370936 DOI: 10.3390/s22155501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/06/2022] [Revised: 07/15/2022] [Accepted: 07/17/2022] [Indexed: 06/15/2023]
Abstract
Audio-visual speech recognition (AVSR) can significantly improve performance over audio-only recognition for small or medium vocabularies. However, current AVSR systems, whether hybrid or end-to-end (E2E), still do not appear to make optimal use of this secondary information stream, as performance remains clearly diminished in noisy conditions for large-vocabulary systems. We therefore propose a new fusion architecture: the decision fusion net (DFN). A broad range of time-variant reliability measures are used as an auxiliary input to improve performance. The DFN is used in both hybrid and E2E models. Our experiments on two large-vocabulary datasets, the Lip Reading Sentences 2 and 3 (LRS2 and LRS3) corpora, show highly significant improvements in performance over previous AVSR systems for large-vocabulary datasets. The hybrid model with the proposed DFN integration component even outperforms oracle dynamic stream weighting, which is considered to be the theoretical upper bound for conventional dynamic stream-weighting approaches. Compared to the hybrid audio-only model, the proposed DFN achieves a relative word error rate reduction of 51% on average, while the E2E-DFN model, with its more competitive audio-only baseline system, achieves a relative word error rate reduction of 43%, both showing the efficacy of our proposed fusion architecture.
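The oracle dynamic stream weighting used as a baseline above is, in essence, a per-frame convex combination of the two streams' class posteriors. A minimal sketch of that conventional approach on toy posteriors (the function name and example values are ours for illustration; this is not the DFN itself):

```python
import numpy as np

def fuse_stream_weighted(log_post_audio, log_post_video, weights):
    """Per-frame convex combination of audio and video log-posteriors.
    weights[t] in [0, 1] is the time-variant reliability of the audio stream."""
    w = weights[:, None]                       # (T, 1), broadcast over classes
    fused = w * log_post_audio + (1.0 - w) * log_post_video
    return fused.argmax(axis=1)                # per-frame class decision

# Toy example: 3 frames, 4 classes; audio is reliable in frame 0, video in frame 2
la = np.log(np.array([[0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25],
                      [0.40, 0.30, 0.20, 0.10]]))
lv = np.log(np.array([[0.30, 0.30, 0.20, 0.20],
                      [0.10, 0.70, 0.10, 0.10],
                      [0.10, 0.10, 0.10, 0.70]]))
w = np.array([0.9, 0.5, 0.1])                  # audio reliability per frame

print(fuse_stream_weighted(la, lv, w))         # prints [0 1 3]
```

The "oracle" variant would choose the weights per frame with knowledge of the ground truth, which is why it bounds what any conventional weighting scheme can achieve.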
24
Crosse MJ, Foxe JJ, Tarrit K, Freedman EG, Molholm S. Resolution of impaired multisensory processing in autism and the cost of switching sensory modality. Commun Biol 2022; 5:601. [PMID: 35773473 PMCID: PMC9246932 DOI: 10.1038/s42003-022-03519-1] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2021] [Accepted: 05/23/2022] [Indexed: 11/09/2022] Open
Abstract
Children with autism spectrum disorders (ASD) exhibit alterations in multisensory processing, which may contribute to the prevalence of social and communicative deficits in this population. Resolution of multisensory deficits has been observed in teenagers with ASD for complex, social speech stimuli; however, whether this resolution extends to more basic multisensory processing deficits remains unclear. Here, in a cohort of 364 participants we show using simple, non-social audiovisual stimuli that deficits in multisensory processing observed in high-functioning children and teenagers with ASD are not evident in adults with the disorder. Computational modelling indicated that multisensory processing transitions from a default state of competition to one of facilitation, and that this transition is delayed in ASD. Further analysis revealed group differences in how sensory channels are weighted, and how this is impacted by preceding cross-sensory inputs. Our findings indicate that there is a complex and dynamic interplay among the sensory systems that differs considerably in individuals with ASD.

Crosse et al. study a cohort of 364 participants with autism spectrum disorders (ASD) and matched controls, and show that deficits in multisensory processing observed in high-functioning children and teenagers with ASD are not evident in adults with the disorder. Using computational modelling they go on to demonstrate that there is a delayed transition of multisensory processing from a default state of competition to one of facilitation in ASD, as well as differences in sensory weighting and the ability to switch between sensory modalities, which sheds light on the interplay among sensory systems that differ in ASD individuals.
Affiliation(s)
- Michael J Crosse
- The Cognitive Neurophysiology Laboratory, Department of Pediatrics, Albert Einstein College of Medicine, Bronx, NY, USA; The Dominick P. Purpura Department of Neuroscience, Rose F. Kennedy Intellectual and Developmental Disabilities Research Center, Albert Einstein College of Medicine, Bronx, NY, USA; Trinity Centre for Biomedical Engineering, Department of Mechanical, Manufacturing & Biomedical Engineering, Trinity College Dublin, Dublin, Ireland.
- John J Foxe
- The Cognitive Neurophysiology Laboratory, Department of Pediatrics, Albert Einstein College of Medicine, Bronx, NY, USA; The Dominick P. Purpura Department of Neuroscience, Rose F. Kennedy Intellectual and Developmental Disabilities Research Center, Albert Einstein College of Medicine, Bronx, NY, USA; The Cognitive Neurophysiology Laboratory, Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Katy Tarrit
- The Cognitive Neurophysiology Laboratory, Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Edward G Freedman
- The Cognitive Neurophysiology Laboratory, Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Sophie Molholm
- The Cognitive Neurophysiology Laboratory, Department of Pediatrics, Albert Einstein College of Medicine, Bronx, NY, USA; The Dominick P. Purpura Department of Neuroscience, Rose F. Kennedy Intellectual and Developmental Disabilities Research Center, Albert Einstein College of Medicine, Bronx, NY, USA; The Cognitive Neurophysiology Laboratory, Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY, USA.
25
Jia J, Wang T, Chen S, Ding N, Fang F. Ensemble size perception: Its neural signature and the role of global interaction over individual items. Neuropsychologia 2022; 173:108290. [PMID: 35697088 DOI: 10.1016/j.neuropsychologia.2022.108290] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/28/2021] [Revised: 06/02/2022] [Accepted: 06/07/2022] [Indexed: 10/18/2022]
Abstract
To efficiently process complex visual scenes, the visual system often summarizes statistical information across individual items and represents them as an ensemble. However, due to the lack of techniques to disentangle the representation of the ensemble from that of the individual items constituting the ensemble, whether there exists a specialized neural mechanism for ensemble processing and how ensemble perception is computed in the brain remain unknown. To address these issues, we used a frequency-tagging EEG approach to track brain responses to periodically updated ensemble sizes. Neural responses tracking the ensemble size were detected in parieto-occipital electrodes, revealing a global and specialized neural mechanism of ensemble size perception. We then used the temporal response function to isolate neural responses to the individual sizes and their interactions. Notably, while the individual sizes and their local and global interactions were encoded in the EEG signals, only the global interaction contributed directly to the ensemble size perception. Finally, distributed attention to the global stimulus pattern enhanced the neural signature of the ensemble size, mainly by modulating the neural representation of the global interaction between all individual sizes. These findings advocate a specialized, global neural mechanism of ensemble size perception and suggest that global interaction between individual items contributes to ensemble perception.
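The frequency-tagging approach described here reads out the brain's response at the tagged presentation rate from the EEG amplitude spectrum, typically as a signal-to-noise ratio against neighboring frequency bins. A minimal sketch of that readout on synthetic data (the function name, parameters, and simulated signal are our own illustration, not the authors' pipeline):

```python
import numpy as np

def tagged_snr(eeg, fs, f_tag, n_neighbors=10):
    """SNR at the tagging frequency: spectral amplitude at f_tag divided by
    the mean amplitude of the surrounding bins (excluding f_tag itself)."""
    n = len(eeg)
    amp = np.abs(np.fft.rfft(eeg)) / n
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    k = int(np.argmin(np.abs(freqs - f_tag)))   # bin closest to the tag
    lo, hi = max(k - n_neighbors, 1), min(k + n_neighbors + 1, len(amp))
    neighbors = np.r_[amp[lo:k], amp[k + 1:hi]]
    return amp[k] / neighbors.mean()

# Synthetic "EEG": a 2 Hz tagged response buried in noise
fs, dur, f_tag = 250, 20.0, 2.0
t = np.arange(0, dur, 1.0 / fs)
rng = np.random.default_rng(0)
eeg = 0.5 * np.sin(2 * np.pi * f_tag * t) + rng.normal(0.0, 1.0, t.size)

print(tagged_snr(eeg, fs, f_tag))   # well above 1 when a tagged response exists
```

Choosing a recording length that puts the tag frequency exactly on an FFT bin (here 0.05 Hz resolution) avoids spectral leakage inflating the neighboring bins.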
Affiliation(s)
- Jianrong Jia
- Center for Cognition and Brain Disorders, The Affiliated Hospital of Hangzhou Normal University, Hangzhou, 311121, China; Institute of Psychological Sciences, Hangzhou Normal University, Hangzhou, 311121, China
- Tongyu Wang
- Center for Cognition and Brain Disorders, The Affiliated Hospital of Hangzhou Normal University, Hangzhou, 311121, China; Institute of Psychological Sciences, Hangzhou Normal University, Hangzhou, 311121, China
- Siqi Chen
- Center for Cognition and Brain Disorders, The Affiliated Hospital of Hangzhou Normal University, Hangzhou, 311121, China; Institute of Psychological Sciences, Hangzhou Normal University, Hangzhou, 311121, China
- Nai Ding
- Key Laboratory for Biomedical Engineering of Ministry of Education, College of Biomedical Engineering and Instrument Sciences, Zhejiang University, Hangzhou, 311121, China; Research Center for Advanced Artificial Intelligence Theory, Zhejiang Lab, Hangzhou, 311121, China
- Fang Fang
- School of Psychological and Cognitive Sciences and Beijing Key Laboratory of Behavior and Mental Health, Peking University, Beijing, 100871, China; IDG/McGovern Institute for Brain Research, Peking University, Beijing, 100871, China; Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, 100871, China; Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, 100871, China.
26
Bigelow J, Morrill RJ, Olsen T, Hasenstaub AR. Visual modulation of firing and spectrotemporal receptive fields in mouse auditory cortex. CURRENT RESEARCH IN NEUROBIOLOGY 2022; 3:100040. [PMID: 36518337 PMCID: PMC9743056 DOI: 10.1016/j.crneur.2022.100040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 04/26/2022] [Accepted: 05/06/2022] [Indexed: 10/18/2022] Open
Abstract
Recent studies have established significant anatomical and functional connections between visual areas and primary auditory cortex (A1), which may be important for cognitive processes such as communication and spatial perception. These studies have raised two important questions: First, which cell populations in A1 respond to visual input and/or are influenced by visual context? Second, which aspects of sound encoding are affected by visual context? To address these questions, we recorded single-unit activity across cortical layers in awake mice during exposure to auditory and visual stimuli. Neurons responsive to visual stimuli were most prevalent in the deep cortical layers and included both excitatory and inhibitory cells. The overwhelming majority of these neurons also responded to sound, indicating unimodal visual neurons are rare in A1. Other neurons for which sound-evoked responses were modulated by visual context were similarly excitatory or inhibitory but more evenly distributed across cortical layers. These modulatory influences almost exclusively affected sustained sound-evoked firing rate (FR) responses or spectrotemporal receptive fields (STRFs); transient FR changes at stimulus onset were rarely modified by visual context. Neuron populations with visually modulated STRFs and sustained FR responses were mostly non-overlapping, suggesting spectrotemporal feature selectivity and overall excitability may be differentially sensitive to visual context. The effects of visual modulation were heterogeneous, increasing and decreasing STRF gain in roughly equal proportions of neurons. Our results indicate visual influences are surprisingly common and diversely expressed throughout layers and cell types in A1, affecting nearly one in five neurons overall.
Affiliation(s)
- James Bigelow
- Coleman Memorial Laboratory, University of California, San Francisco, USA
- Department of Otolaryngology–Head and Neck Surgery, University of California, San Francisco, 94143, USA
- Ryan J. Morrill
- Coleman Memorial Laboratory, University of California, San Francisco, USA
- Neuroscience Graduate Program, University of California, San Francisco, USA
- Department of Otolaryngology–Head and Neck Surgery, University of California, San Francisco, 94143, USA
- Timothy Olsen
- Coleman Memorial Laboratory, University of California, San Francisco, USA
- Department of Otolaryngology–Head and Neck Surgery, University of California, San Francisco, 94143, USA
- Andrea R. Hasenstaub
- Coleman Memorial Laboratory, University of California, San Francisco, USA
- Neuroscience Graduate Program, University of California, San Francisco, USA
- Department of Otolaryngology–Head and Neck Surgery, University of California, San Francisco, 94143, USA
27
Zhang L, Du Y. Lip movements enhance speech representations and effective connectivity in auditory dorsal stream. Neuroimage 2022; 257:119311. [PMID: 35589000 DOI: 10.1016/j.neuroimage.2022.119311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2022] [Revised: 05/09/2022] [Accepted: 05/11/2022] [Indexed: 11/25/2022] Open
Abstract
Viewing a speaker's lip movements facilitates speech perception, especially under adverse listening conditions, but the neural mechanisms of this perceptual benefit at the phonemic and feature levels remain unclear. This fMRI study addressed this question by quantifying regional multivariate representation and network organization underlying audiovisual speech-in-noise perception. Behaviorally, valid lip movements improved recognition of place of articulation to aid phoneme identification. Meanwhile, lip movements enhanced neural representations of phonemes in left auditory dorsal stream regions, including frontal speech motor areas and the supramarginal gyrus (SMG). Moreover, neural representations of place of articulation and voicing features were promoted differentially by lip movements in these regions, with voicing enhanced in Broca's area and place of articulation better encoded in left ventral premotor cortex and SMG. Next, dynamic causal modeling (DCM) analysis showed that such local changes were accompanied by strengthened effective connectivity along the dorsal stream. Moreover, the neurite orientation dispersion of the left arcuate fasciculus, the structural backbone of the auditory dorsal stream, predicted the visual enhancements of neural representations and effective connectivity. Our findings provide novel insights for speech science: lip movements promote both local phonemic and feature encoding and network connectivity in the dorsal pathway, and this functional enhancement is mediated by the microstructural architecture of the circuit.
Affiliation(s)
- Lei Zhang
- CAS Key Laboratory of Behavioral Science, Institute of Psychology, Chinese Academy of Sciences, Beijing, China 100101; Department of Psychology, University of Chinese Academy of Sciences, Beijing, China 100049
- Yi Du
- CAS Key Laboratory of Behavioral Science, Institute of Psychology, Chinese Academy of Sciences, Beijing, China 100101; Department of Psychology, University of Chinese Academy of Sciences, Beijing, China 100049; CAS Center for Excellence in Brain Science and Intelligence Technology, Shanghai, China 200031; Chinese Institute for Brain Research, Beijing, China 102206.
28
Haider CL, Suess N, Hauswald A, Park H, Weisz N. Masking of the mouth area impairs reconstruction of acoustic speech features and higher-level segmentational features in the presence of a distractor speaker. Neuroimage 2022; 252:119044. [PMID: 35240298 DOI: 10.1016/j.neuroimage.2022.119044] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Revised: 02/26/2022] [Accepted: 02/27/2022] [Indexed: 11/29/2022] Open
Abstract
Multisensory integration enables stimulus representation even when the sensory input in a single modality is weak. In the context of speech, when confronted with a degraded acoustic signal, congruent visual inputs promote comprehension. When this input is masked, speech comprehension consequently becomes more difficult. However, it remains unclear which levels of speech processing are affected, and under which circumstances, by occluding the mouth area. To answer this question, we conducted an audiovisual (AV) multi-speaker experiment using naturalistic speech. In half of the trials, the target speaker wore a (surgical) face mask, while we measured the brain activity of normal-hearing participants via magnetoencephalography (MEG). We additionally added a distractor speaker in half of the trials in order to create an ecologically difficult listening situation. A decoding model trained on the clear AV speech was used to reconstruct crucial speech features in each condition. We found significant main effects of face masks on the reconstruction of acoustic features, such as the speech envelope and spectral speech features (i.e., pitch and formant frequencies), while reconstruction of higher-level features of speech segmentation (phoneme and word onsets) was especially impaired by masks in difficult listening situations. As we used surgical face masks in our study, which have only mild effects on speech acoustics, we interpret our findings as the result of the missing visual input. Our findings extend previous behavioural results by demonstrating the complex contextual effects of occluding relevant visual information on speech processing.
Affiliation(s)
- Chandra Leon Haider
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria.
- Nina Suess
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria
- Anne Hauswald
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria
- Hyojin Park
- School of Psychology & Centre for Human Brain Health (CHBH), University of Birmingham, Birmingham, UK
- Nathan Weisz
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria; Neuroscience Institute, Christian Doppler University Hospital, Paracelsus Medical University, Salzburg, Austria
29
Jessica Tan SH, Kalashnikova M, Di Liberto GM, Crosse MJ, Burnham D. Seeing a Talking Face Matters: The Relationship between Cortical Tracking of Continuous Auditory-Visual Speech and Gaze Behaviour in Infants, Children and Adults. Neuroimage 2022; 256:119217. [PMID: 35436614 DOI: 10.1016/j.neuroimage.2022.119217] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2021] [Revised: 04/09/2022] [Accepted: 04/14/2022] [Indexed: 11/24/2022] Open
Abstract
An auditory-visual speech benefit, the benefit that visual speech cues bring to auditory speech perception, is experienced from early on in infancy and continues to be experienced to an increasing degree with age. While there is both behavioural and neurophysiological evidence for children and adults, only behavioural evidence exists for infants, as no neurophysiological study has provided a comprehensive examination of the auditory-visual speech benefit in infants. It is also surprising that most studies on the auditory-visual speech benefit do not concurrently report looking behaviour, especially since the auditory-visual speech benefit rests on the assumption that listeners attend to a speaker's talking face and since there are meaningful individual differences in looking behaviour. To address these gaps, we simultaneously recorded electroencephalographic (EEG) and eye-tracking data of 5-month-olds, 4-year-olds and adults as they were presented with a speaker in auditory-only (AO), visual-only (VO), and auditory-visual (AV) modes. Cortical tracking analyses that involved forward encoding models of the speech envelope revealed that there was an auditory-visual speech benefit [i.e., AV > (A+V)], evident in 5-month-olds and adults but not 4-year-olds. Examination of cortical tracking accuracy in relation to looking behaviour showed that infants' relative attention to the speaker's mouth (vs. eyes) was positively correlated with cortical tracking accuracy of VO speech, whereas adults' attention to the display overall was negatively correlated with cortical tracking accuracy of VO speech. This study provides the first neurophysiological evidence of an auditory-visual speech benefit in infants, and our results suggest ways in which current models of speech processing can be fine-tuned.
Affiliation(s)
- S H Jessica Tan
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University.
- Marina Kalashnikova
- The Basque Center on Cognition, Brain and Language; IKERBASQUE, Basque Foundation for Science
- Michael J Crosse
- Trinity Center for Biomedical Engineering, Department of Mechanical, Manufacturing & Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
- Denis Burnham
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University
30
Asymmetrical cross-modal influence on neural encoding of auditory and visual features in natural scenes. Neuroimage 2022; 255:119182. [PMID: 35395403 DOI: 10.1016/j.neuroimage.2022.119182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2021] [Revised: 03/24/2022] [Accepted: 04/04/2022] [Indexed: 11/22/2022] Open
Abstract
Natural scenes contain multi-modal information, which is integrated to form a coherent perception. Previous studies have demonstrated that cross-modal information can modulate neural encoding of low-level sensory features. These studies, however, mostly focus on the processing of single sensory events or rhythmic sensory sequences. Here, we investigate how the neural encoding of basic auditory and visual features is modulated by cross-modal information when participants watch movie clips primarily composed of non-rhythmic events. We presented audiovisual congruent and audiovisual incongruent movie clips, and since attention can modulate cross-modal interactions, we separately analyzed high- and low-arousal movie clips. We recorded neural responses using electroencephalography (EEG) and employed the temporal response function (TRF) to quantify the neural encoding of auditory and visual features. The neural encoding of the sound envelope is enhanced in the audiovisual congruent condition relative to the incongruent condition, but this effect is only significant for high-arousal movie clips. In contrast, audiovisual congruency does not significantly modulate the neural encoding of visual features, e.g., luminance or visual motion. In summary, our findings demonstrate asymmetrical cross-modal interactions during the processing of natural scenes that lack rhythmicity: congruent visual information enhances low-level auditory processing, while congruent auditory information does not significantly modulate low-level visual processing.
31
Enhancement of speech-in-noise comprehension through vibrotactile stimulation at the syllabic rate. Proc Natl Acad Sci U S A 2022; 119:e2117000119. [PMID: 35312362 PMCID: PMC9060510 DOI: 10.1073/pnas.2117000119] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Syllables are important building blocks of speech. They occur at a rate between 4 and 8 Hz, corresponding to the theta frequency range of neural activity in the cerebral cortex. When listening to speech, the theta activity becomes aligned to the syllabic rhythm, presumably aiding in parsing a speech signal into distinct syllables. However, this neural activity can be influenced not only by sound, but also by somatosensory information. Here, we show that the presentation of vibrotactile signals at the syllabic rate can enhance the comprehension of speech in background noise. We further provide evidence that this multisensory enhancement of speech comprehension reflects the multisensory integration of auditory and tactile information in the auditory cortex.

Speech unfolds over distinct temporal scales, in particular those related to the rhythm of phonemes, syllables, and words. When a person listens to continuous speech, the syllabic rhythm is tracked by neural activity in the theta frequency range. The tracking plays a functional role in speech processing: influencing the theta activity through transcranial current stimulation, for instance, can impact speech perception. The theta-band activity in the auditory cortex can also be modulated through the somatosensory system, but the effect on speech processing has remained unclear. Here, we show that vibrotactile feedback presented at the rate of syllables can modulate and, in fact, enhance the comprehension of a speech signal in background noise. The enhancement occurs when vibrotactile pulses occur at the perceptual center of the syllables, whereas a temporal delay between the vibrotactile signals and the speech stream can lead to a lower level of speech comprehension. We further investigate the neural mechanisms underlying the audiotactile integration through electroencephalographic (EEG) recordings. We find that the audiotactile stimulation modulates the neural response to the speech rhythm, as well as the neural response to the vibrotactile pulses. The modulations of these neural activities reflect the behavioral effects on speech comprehension. Moreover, we demonstrate that speech comprehension can be predicted by particular aspects of the neural responses. Our results provide evidence for a role of vibrotactile information in speech processing and may have applications in future auditory prostheses.
32
Di Liberto GM, Hjortkjær J, Mesgarani N. Editorial: Neural Tracking: Closing the Gap Between Neurophysiology and Translational Medicine. Front Neurosci 2022; 16:872600. [PMID: 35368278 PMCID: PMC8966872 DOI: 10.3389/fnins.2022.872600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2022] [Accepted: 02/17/2022] [Indexed: 11/25/2022] Open
Affiliation(s)
- Giovanni M. Di Liberto
- School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
- ADAPT Centre, d-real, Trinity College Institute for Neuroscience, Dublin, Ireland
- Jens Hjortkjær
- Hearing Systems Group, Department of Health Technology, Technical University of Denmark, Kongens Lyngby, Denmark
- Nima Mesgarani
- Electrical Engineering Department, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, United States
33
Varano E, Vougioukas K, Ma P, Petridis S, Pantic M, Reichenbach T. Speech-Driven Facial Animations Improve Speech-in-Noise Comprehension of Humans. Front Neurosci 2022; 15:781196. [PMID: 35069100 PMCID: PMC8766421 DOI: 10.3389/fnins.2021.781196] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2021] [Accepted: 11/29/2021] [Indexed: 12/02/2022] Open
Abstract
Understanding speech becomes a demanding task when the environment is noisy. Comprehension of speech in noise can be substantially improved by looking at the speaker's face, and this audiovisual benefit is even more pronounced in people with hearing impairment. Recent advances in AI have made it possible to synthesize photorealistic talking faces from a speech recording and a still image of a person's face in an end-to-end manner. However, it has remained unknown whether such facial animations improve speech-in-noise comprehension. Here we consider facial animations produced by a recently introduced generative adversarial network (GAN), and show that humans cannot distinguish between the synthesized and the natural videos. Importantly, we then show that the end-to-end synthesized videos significantly aid humans in understanding speech in noise, although natural facial motions yield an even higher audiovisual benefit. We further find that an audiovisual speech recognizer (AVSR) benefits from the synthesized facial animations as well. Our results suggest that synthesizing facial motions from speech can be used to aid speech comprehension in difficult listening environments.
Affiliation(s)
- Enrico Varano
- Department of Bioengineering and Centre for Neurotechnology, Imperial College London, London, United Kingdom
- Pingchuan Ma
- Department of Computing, Imperial College London, London, United Kingdom
- Stavros Petridis
- Department of Computing, Imperial College London, London, United Kingdom
- Maja Pantic
- Department of Computing, Imperial College London, London, United Kingdom
- Tobias Reichenbach
- Department of Artificial Intelligence in Biomedical Engineering, Friedrich-Alexander-University Erlangen-Nuremberg, Erlangen, Germany
34
Bilinguals Show Proportionally Greater Benefit From Visual Speech Cues and Sentence Context in Their Second Compared to Their First Language. Ear Hear 2021; 43:1316-1326. [PMID: 34966162 DOI: 10.1097/aud.0000000000001182] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/27/2022]
Abstract
OBJECTIVES Speech perception in noise is challenging, but evidence suggests that it may be facilitated by visual speech cues (e.g., lip movements) and supportive sentence context in native speakers. Comparatively few studies have investigated speech perception in noise in bilinguals, and little is known about the impact of visual speech cues and supportive sentence context in a first language compared to a second language within the same individual. The current study addresses this gap by directly investigating the extent to which bilinguals benefit from visual speech cues and supportive sentence context under similarly noisy conditions in their first and second language.
DESIGN Thirty young adult English-French/French-English bilinguals were recruited from the undergraduate psychology program at Concordia University and from the Montreal community. They completed a speech perception in noise task during which they were presented with video-recorded sentences and instructed to repeat the last word of each sentence out loud. Sentences were presented in three different modalities: visual-only, auditory-only, and audiovisual. Additionally, sentences had one of two levels of context: moderate (e.g., "In the woods, the hiker saw a bear.") and low (e.g., "I had not thought about that bear."). Each participant completed this task in both their first and second language; crucially, the level of background noise was calibrated individually for each participant and was the same throughout the first language and second language (L2) portions of the experimental task.
RESULTS Overall, speech perception in noise was more accurate in bilinguals' first language compared to the second. However, participants benefited from visual speech cues and supportive sentence context to a proportionally greater extent in their second language compared to their first. At the individual level, performance during the speech perception in noise task was related to aspects of bilinguals' experience in their second language (i.e., age of acquisition, relative balance between the first and the second language).
CONCLUSIONS Bilinguals benefit from visual speech cues and sentence context in their second language during speech in noise and do so to a greater extent than in their first language given the same level of background noise. Together, this indicates that L2 speech perception can be conceptualized within an inverse effectiveness hypothesis framework with a complex interplay of sensory factors (i.e., the quality of the auditory speech signal and visual speech cues) and linguistic factors (i.e., presence or absence of supportive context and L2 experience of the listener).
|
35
|
Crosse MJ, Zuk NJ, Di Liberto GM, Nidiffer AR, Molholm S, Lalor EC. Linear Modeling of Neurophysiological Responses to Speech and Other Continuous Stimuli: Methodological Considerations for Applied Research. Front Neurosci 2021; 15:705621. [PMID: 34880719 PMCID: PMC8648261 DOI: 10.3389/fnins.2021.705621] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2021] [Accepted: 09/21/2021] [Indexed: 01/01/2023] Open
Abstract
Cognitive neuroscience, in particular research on speech and language, has seen an increase in the use of linear modeling techniques for studying the processing of natural, environmental stimuli. The availability of such computational tools has prompted similar investigations in many clinical domains, facilitating the study of cognitive and sensory deficits under more naturalistic conditions. However, studying clinical (and often highly heterogeneous) cohorts introduces an added layer of complexity to such modeling procedures, potentially leading to instability of such techniques and, as a result, inconsistent findings. Here, we outline some key methodological considerations for applied research, referring to a hypothetical clinical experiment involving speech processing and worked examples of simulated electrophysiological (EEG) data. In particular, we focus on experimental design, data preprocessing, stimulus feature extraction, model design, model training and evaluation, and interpretation of model weights. Throughout the paper, we demonstrate the implementation of each step in MATLAB using the mTRF-Toolbox and discuss how to address issues that could arise in applied research. In doing so, we hope to provide better intuition on these more technical points and provide a resource for applied and clinical researchers investigating sensory and cognitive processing using ecologically rich stimuli.
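The pipeline outlined above centres on the temporal response function (TRF): the EEG is regressed onto time-lagged copies of a stimulus feature, with ridge regularisation to stabilise the estimate. The paper's worked examples use the mTRF-Toolbox in MATLAB; as a language-neutral illustration only, the core computation can be sketched in NumPy (the function names and single-feature, single-channel setup are illustrative assumptions, not the toolbox API):

```python
import numpy as np

def lagged_design(stimulus, lags):
    """Build a time-lagged design matrix from a 1-D stimulus feature.
    Positive lags model the response occurring after the stimulus."""
    n = len(stimulus)
    X = np.zeros((n, len(lags)))
    for j, lag in enumerate(lags):
        if lag >= 0:
            X[lag:, j] = stimulus[:n - lag]
        else:
            X[:lag, j] = stimulus[-lag:]
    return X

def fit_trf(stimulus, eeg, lags, alpha=1.0):
    """Ridge-regularised forward model (TRF): eeg ≈ X @ w."""
    X = lagged_design(stimulus, lags)
    XtX = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ eeg)
```

In practice the ridge parameter `alpha` would be tuned by cross-validation per subject, which is one of the stability issues the paper discusses for heterogeneous clinical cohorts.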
Affiliation(s)
- Michael J. Crosse
  - Department of Mechanical, Manufacturing and Biomedical Engineering, Trinity Centre for Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
  - X, The Moonshot Factory, Mountain View, CA, United States
  - Department of Pediatrics, Albert Einstein College of Medicine, New York, NY, United States
  - Department of Neuroscience, Albert Einstein College of Medicine, New York, NY, United States
- Nathaniel J. Zuk
  - Department of Mechanical, Manufacturing and Biomedical Engineering, Trinity Centre for Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
  - Department of Biomedical Engineering, University of Rochester, Rochester, NY, United States
  - Department of Neuroscience, University of Rochester, Rochester, NY, United States
- Giovanni M. Di Liberto
  - Department of Mechanical, Manufacturing and Biomedical Engineering, Trinity Centre for Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
  - Centre for Biomedical Engineering, School of Electrical and Electronic Engineering, University College Dublin, Dublin, Ireland
  - School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland
- Aaron R. Nidiffer
  - Department of Biomedical Engineering, University of Rochester, Rochester, NY, United States
  - Department of Neuroscience, University of Rochester, Rochester, NY, United States
- Sophie Molholm
  - Department of Pediatrics, Albert Einstein College of Medicine, New York, NY, United States
  - Department of Neuroscience, Albert Einstein College of Medicine, New York, NY, United States
- Edmund C. Lalor
  - Department of Mechanical, Manufacturing and Biomedical Engineering, Trinity Centre for Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
  - Department of Biomedical Engineering, University of Rochester, Rochester, NY, United States
  - Department of Neuroscience, University of Rochester, Rochester, NY, United States
|
36
|
Fleming JT, Maddox RK, Shinn-Cunningham BG. Spatial alignment between faces and voices improves selective attention to audio-visual speech. J Acoust Soc Am 2021; 150:3085. [PMID: 34717460 DOI: 10.1121/10.0006415] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/27/2021] [Accepted: 09/01/2021] [Indexed: 06/13/2023]
Abstract
The ability to see a talker's face improves speech intelligibility in noise, provided that the auditory and visual speech signals are approximately aligned in time. However, the importance of spatial alignment between corresponding faces and voices remains unresolved, particularly in multi-talker environments. In a series of online experiments, we investigated this using a task that required participants to selectively attend a target talker in noise while ignoring a distractor talker. In experiment 1, we found improved task performance when the talkers' faces were visible, but only when corresponding faces and voices were presented in the same hemifield (spatially aligned). In experiment 2, we tested for possible influences of eye position on this result. In auditory-only conditions, directing gaze toward the distractor voice reduced performance, but this effect could not fully explain the cost of audio-visual (AV) spatial misalignment. Lowering the signal-to-noise ratio (SNR) of the speech from +4 to -4 dB increased the magnitude of the AV spatial alignment effect (experiment 3), but accurate closed-set lipreading caused a floor effect that influenced results at lower SNRs (experiment 4). Taken together, these results demonstrate that spatial alignment between faces and voices contributes to the ability to selectively attend AV speech.
Affiliation(s)
- Justin T Fleming
  - Speech and Hearing Bioscience and Technology Program, Harvard University, 243 Charles Street, Boston, Massachusetts 02114, USA
- Ross K Maddox
  - Department of Biomedical Engineering, University of Rochester, 430 Elmwood Avenue, Rochester, New York 14620, USA
- Barbara G Shinn-Cunningham
  - Neuroscience Institute, Carnegie Mellon University, 4825 Frew Street, Pittsburgh, Pennsylvania 15213, USA
|
37
|
Chen L, Liao HI. Microsaccadic Eye Movements but not Pupillary Dilation Response Characterizes the Crossmodal Freezing Effect. Cereb Cortex Commun 2021; 1:tgaa072. [PMID: 34296132 PMCID: PMC8153075 DOI: 10.1093/texcom/tgaa072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2020] [Revised: 09/24/2020] [Accepted: 09/25/2020] [Indexed: 11/14/2022] Open
Abstract
In typical spatial orienting tasks, the perception of crossmodal (e.g., audiovisual) stimuli evokes greater pupil dilation and microsaccade inhibition than unisensory stimuli (e.g., visual). These characteristic pupil dilation and microsaccade inhibition responses have been observed in response to "salient" events/stimuli. Although the "saliency" account is appealing in the spatial domain, whether it holds in the temporal context remains largely unknown. Here, on a brief temporal scale (within 1 s) engaging involuntary temporal attention, we investigated how eye-metric characteristics reflect the temporal dynamics of perceptual organization, with and without multisensory integration. We adopted the crossmodal freezing paradigm using the classical Ternus apparent motion. Results showed that synchronous beeps biased the perceptual report toward group motion and triggered prolonged sound-induced oculomotor inhibition (OMI), whereas the sound-induced OMI was not obvious in a crossmodal task-free scenario (visual localization without audiovisual integration). A general pupil dilation response was observed in the presence of sounds in both the visual Ternus motion categorization and visual localization tasks. This study provides the first empirical account of crossmodal integration captured by microsaccades on a brief temporal scale; OMI, but not the pupillary dilation response, characterizes task-specific audiovisual integration (shown by the crossmodal freezing effect).
Affiliation(s)
- Lihan Chen
  - Department of Brain and Cognitive Sciences, Schools of Psychological and Cognitive Sciences, Peking University, Beijing, 100871, China
- Hsin-I Liao
  - NTT Communication Science Laboratories, NTT Corporation, Atsugi, Kanagawa, 243-0198, Japan
|
38
|
O'Sullivan AE, Crosse MJ, Liberto GMD, de Cheveigné A, Lalor EC. Neurophysiological Indices of Audiovisual Speech Processing Reveal a Hierarchy of Multisensory Integration Effects. J Neurosci 2021; 41:4991-5003. [PMID: 33824190 PMCID: PMC8197638 DOI: 10.1523/jneurosci.0906-20.2021] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2020] [Revised: 03/16/2021] [Accepted: 03/22/2021] [Indexed: 12/27/2022] Open
Abstract
Seeing a speaker's face benefits speech comprehension, especially in challenging listening conditions. This perceptual benefit is thought to stem from the neural integration of visual and auditory speech at multiple stages of processing, whereby movement of a speaker's face provides temporal cues to auditory cortex, and articulatory information from the speaker's mouth can aid the recognition of specific linguistic units (e.g., phonemes, syllables). However, it remains unclear how the integration of these cues varies as a function of listening conditions. Here, we sought to provide insight into these questions by examining EEG responses in humans (males and females) to natural audiovisual (AV), audio, and visual speech in quiet and in noise. We represented our speech stimuli in terms of their spectrograms and their phonetic features and then quantified the strength of the encoding of those features in the EEG using canonical correlation analysis (CCA). The encoding of both spectrotemporal and phonetic features was more robust in AV speech responses than would be expected from the summation of the audio and visual speech responses, suggesting that multisensory integration occurs at both spectrotemporal and phonetic stages of speech processing. We also found evidence suggesting that the integration effects may change with listening conditions; however, this was an exploratory analysis, and future work will be required to examine this effect using a within-subject design. These findings demonstrate that integration of audio and visual speech occurs at multiple stages along the speech processing hierarchy. SIGNIFICANCE STATEMENT During conversation, visual cues impact our perception of speech. Integration of auditory and visual speech is thought to occur at multiple stages of speech processing and vary flexibly depending on the listening conditions. 
Here, we examine audiovisual (AV) integration at two stages of speech processing using the speech spectrogram and a phonetic representation, and test how AV integration adapts to degraded listening conditions. We find significant integration at both of these stages regardless of listening conditions. These findings reveal neural indices of multisensory interactions at different stages of processing and provide support for the multistage integration framework.
Affiliation(s)
- Aisling E O'Sullivan
  - School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
- Michael J Crosse
  - X, The Moonshot Factory, Mountain View, CA and Department of Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461
- Giovanni M Di Liberto
  - Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, Paris Sciences et Lettres University, Centre National de la Recherche Scientifique, Paris 75005, France
- Alain de Cheveigné
  - Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, Paris Sciences et Lettres University, Centre National de la Recherche Scientifique, Paris 75005, France
  - University College London Ear Institute, University College London, London WC1X 8EE, United Kingdom
- Edmund C Lalor
  - School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
  - Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, New York 14627
|
39
|
Effects of stimulus intensity on audiovisual integration in aging across the temporal dynamics of processing. Int J Psychophysiol 2021; 162:95-103. [PMID: 33529642 DOI: 10.1016/j.ijpsycho.2021.01.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2020] [Revised: 10/26/2020] [Accepted: 01/24/2021] [Indexed: 11/24/2022]
Abstract
Previous studies have drawn different conclusions about whether older adults benefit more from audiovisual integration, and such conflicts may be due to the stimulus features investigated in those studies, such as stimulus intensity. In the current study, using ERPs, we compared the effects of stimulus intensity on audiovisual integration between young and older adults. The results showed that inverse effectiveness, the phenomenon whereby lowering the effectiveness of sensory stimuli increases the benefits of multisensory integration, was observed in young adults at earlier processing stages but was absent in older adults. Moreover, at the earlier processing stages (60-90 ms and 110-140 ms), older adults exhibited significantly greater audiovisual integration than young adults (all ps < 0.05). However, at the later processing stages (220-250 ms and 340-370 ms), young adults exhibited significantly greater audiovisual integration than older adults (all ps < 0.001). The results suggest an age-related dissociation between early and late integration, indicating that different audiovisual processing mechanisms are at play in older versus young adults.
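ERP integration effects of this kind are commonly quantified with the additive model: the audiovisual response is compared against the sum of the unisensory responses within each time window, so a non-zero AV − (A + V) difference indicates super- or sub-additive integration. A schematic sketch of that comparison (array shapes, window convention, and the function name are assumptions for illustration):

```python
import numpy as np

def av_integration_index(erp_av, erp_a, erp_v, window):
    """Mean AV - (A + V) amplitude over a window of samples.
    Positive values = super-additive, negative = sub-additive."""
    diff = erp_av - (erp_a + erp_v)
    return diff[window].mean()
```

With a sampling rate of 1000 Hz, the 60-90 ms window reported above would correspond to `slice(60, 90)` relative to stimulus onset.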
|
40
|
Mégevand P, Mercier MR, Groppe DM, Zion Golumbic E, Mesgarani N, Beauchamp MS, Schroeder CE, Mehta AD. Crossmodal Phase Reset and Evoked Responses Provide Complementary Mechanisms for the Influence of Visual Speech in Auditory Cortex. J Neurosci 2020; 40:8530-8542. [PMID: 33023923 PMCID: PMC7605423 DOI: 10.1523/jneurosci.0555-20.2020] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2020] [Revised: 07/27/2020] [Accepted: 08/31/2020] [Indexed: 12/26/2022] Open
Abstract
Natural conversation is multisensory: when we can see the speaker's face, visual speech cues improve our comprehension. The neuronal mechanisms underlying this phenomenon remain unclear. The two main alternatives are visually mediated phase modulation of neuronal oscillations (excitability fluctuations) in auditory neurons and visual input-evoked responses in auditory neurons. Investigating this question using naturalistic audiovisual speech with intracranial recordings in humans of both sexes, we find evidence for both mechanisms. Remarkably, auditory cortical neurons track the temporal dynamics of purely visual speech using the phase of their slow oscillations and phase-related modulations in broadband high-frequency activity. Consistent with known perceptual enhancement effects, the visual phase reset amplifies the cortical representation of concomitant auditory speech. In contrast to this, and in line with earlier reports, visual input reduces the amplitude of evoked responses to concomitant auditory input. We interpret the combination of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable stimulus processing in the presence of congruent auditory and visual speech inputs.SIGNIFICANCE STATEMENT Watching the speaker can facilitate our understanding of what is being said. The mechanisms responsible for this influence of visual cues on the processing of speech remain incompletely understood. We studied these mechanisms by recording the electrical activity of the human brain through electrodes implanted surgically inside the brain. We found that visual inputs can operate by directly activating auditory cortical areas, and also indirectly by modulating the strength of cortical responses to auditory input. Our results help to understand the mechanisms by which the brain merges auditory and visual speech into a unitary perception.
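Phase tracking and phase reset of the kind reported here are typically quantified with inter-trial phase coherence (ITC): the length of the mean resultant vector of single-trial phases at a given time-frequency point, which approaches 1 when visual input consistently resets the oscillation. A minimal sketch (illustrative only, not the authors' intracranial analysis):

```python
import numpy as np

def inter_trial_coherence(phases):
    """ITC across trials: length of the mean resultant vector of
    per-trial phase angles (radians) at one time-frequency point.
    1 = perfectly consistent phase (strong reset), 0 = uniform phase."""
    return np.abs(np.mean(np.exp(1j * np.asarray(phases))))
```

In practice the phases come from a wavelet or Hilbert decomposition of each trial's signal, and ITC is tested against a trial-shuffled baseline.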
Affiliation(s)
- Pierre Mégevand
  - Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
  - Feinstein Institutes for Medical Research, Manhasset, New York 11030
  - Department of Basic Neurosciences, Faculty of Medicine, University of Geneva, 1211 Geneva, Switzerland
- Manuel R Mercier
  - Department of Neurology, Montefiore Medical Center, Bronx, New York 10467
  - Department of Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461
  - Institut de Neurosciences des Systèmes, Aix Marseille University, INSERM, 13005 Marseille, France
- David M Groppe
  - Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
  - Feinstein Institutes for Medical Research, Manhasset, New York 11030
  - The Krembil Neuroscience Centre, University Health Network, Toronto, Ontario M5T 1M8, Canada
- Elana Zion Golumbic
  - The Gonda Brain Research Center, Bar Ilan University, Ramat Gan 5290002, Israel
- Nima Mesgarani
  - Department of Electrical Engineering, Columbia University, New York, New York 10027
- Michael S Beauchamp
  - Department of Neurosurgery, Baylor College of Medicine, Houston, Texas 77030
- Charles E Schroeder
  - Nathan S. Kline Institute, Orangeburg, New York 10962
  - Department of Psychiatry, Columbia University, New York, New York 10032
- Ashesh D Mehta
  - Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
  - Feinstein Institutes for Medical Research, Manhasset, New York 11030
|
41
|
Destoky F, Bertels J, Niesen M, Wens V, Vander Ghinst M, Leybaert J, Lallier M, Ince RAA, Gross J, De Tiège X, Bourguignon M. Cortical tracking of speech in noise accounts for reading strategies in children. PLoS Biol 2020; 18:e3000840. [PMID: 32845876 PMCID: PMC7478533 DOI: 10.1371/journal.pbio.3000840] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2020] [Revised: 09/08/2020] [Accepted: 08/12/2020] [Indexed: 11/29/2022] Open
Abstract
Humans' propensity to acquire literacy relates to several factors, including the ability to understand speech in noise (SiN). Still, the nature of the relation between reading and SiN perception abilities remains poorly understood. Here, we dissect the interplay between (1) reading abilities, (2) classical behavioral predictors of reading (phonological awareness, phonological memory, and rapid automatized naming), and (3) electrophysiological markers of SiN perception in 99 elementary school children (26 with dyslexia). We demonstrate that, in typical readers, cortical representation of the phrasal content of SiN relates to the degree of development of the lexical (but not sublexical) reading strategy. In contrast, classical behavioral predictors of reading abilities and the ability to benefit from visual speech to represent the syllabic content of SiN account for global reading performance (i.e., speed and accuracy of lexical and sublexical reading). In individuals with dyslexia, we found preserved integration of visual speech information to optimize processing of syntactic information but not to sustain acoustic/phonemic processing. Finally, within children with dyslexia, measures of cortical representation of the phrasal content of SiN were negatively related to reading speed and positively related to the compromise between reading precision and reading speed, potentially owing to compensatory attentional mechanisms. These results clarify the nature of the relation between SiN perception and reading abilities in typical child readers and children with dyslexia and identify novel electrophysiological markers of emergent literacy.
Affiliation(s)
- Florian Destoky
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
- Julie Bertels
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
  - Consciousness, Cognition and Computation group, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
- Maxime Niesen
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
  - Service d'ORL et de chirurgie cervico-faciale, ULB-Hôpital Erasme, Université libre de Bruxelles (ULB), Brussels, Belgium
- Vincent Wens
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
  - Department of Functional Neuroimaging, Service of Nuclear Medicine, CUB Hôpital Erasme, Université libre de Bruxelles (ULB), Brussels, Belgium
- Marc Vander Ghinst
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
- Jacqueline Leybaert
  - Laboratoire Cognition Langage et Développement, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
- Marie Lallier
  - BCBL, Basque Center on Cognition, Brain and Language, San Sebastian, Spain
- Robin A. A. Ince
  - Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
- Joachim Gross
  - Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
  - Institute for Biomagnetism and Biosignal analysis, University of Muenster, Muenster, Germany
- Xavier De Tiège
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
  - Department of Functional Neuroimaging, Service of Nuclear Medicine, CUB Hôpital Erasme, Université libre de Bruxelles (ULB), Brussels, Belgium
- Mathieu Bourguignon
  - Laboratoire de Cartographie fonctionnelle du Cerveau, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
  - Laboratoire Cognition Langage et Développement, UNI–ULB Neuroscience Institute, Université libre de Bruxelles (ULB), Brussels, Belgium
  - BCBL, Basque Center on Cognition, Brain and Language, San Sebastian, Spain
|
42
|
Fletcher MD, Song H, Perry SW. Electro-haptic stimulation enhances speech recognition in spatially separated noise for cochlear implant users. Sci Rep 2020; 10:12723. [PMID: 32728109 PMCID: PMC7391652 DOI: 10.1038/s41598-020-69697-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2020] [Accepted: 07/14/2020] [Indexed: 11/10/2022] Open
Abstract
Hundreds of thousands of profoundly hearing-impaired people perceive sounds through electrical stimulation of the auditory nerve using a cochlear implant (CI). However, CI users are often poor at understanding speech in noisy environments and separating sounds that come from different locations. We provided missing speech and spatial hearing cues through haptic stimulation to augment the electrical CI signal. After just 30 min of training, we found this “electro-haptic” stimulation substantially improved speech recognition in multi-talker noise when the speech and noise came from different locations. Our haptic stimulus was delivered to the wrists at an intensity that can be produced by a compact, low-cost, wearable device. These findings represent a significant step towards the production of a non-invasive neuroprosthetic that can improve CI users’ ability to understand speech in realistic noisy environments.
Affiliation(s)
- Mark D Fletcher
  - University of Southampton Auditory Implant Service, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Haoheng Song
  - Faculty of Engineering and Physical Sciences, University of Southampton, University Road, Southampton, SO17 1BJ, UK
- Samuel W Perry
  - University of Southampton Auditory Implant Service, University of Southampton, University Road, Southampton, SO17 1BJ, UK
|
43
|
de Boer MJ, Başkent D, Cornelissen FW. Eyes on Emotion: Dynamic Gaze Allocation During Emotion Perception From Speech-Like Stimuli. Multisens Res 2020; 34:17-47. [PMID: 33706278 DOI: 10.1163/22134808-bja10029] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2019] [Accepted: 05/29/2020] [Indexed: 11/19/2022]
Abstract
The majority of emotional expressions used in daily communication are multimodal and dynamic in nature. Consequently, one would expect that human observers utilize specific perceptual strategies to process emotions and to handle the multimodal and dynamic nature of emotions. However, our present knowledge on these strategies is scarce, primarily because most studies on emotion perception have not fully covered this variation, and instead used static and/or unimodal stimuli with few emotion categories. To resolve this knowledge gap, the present study examined how dynamic emotional auditory and visual information is integrated into a unified percept. Since there is a broad spectrum of possible forms of integration, both eye movements and accuracy of emotion identification were evaluated while observers performed an emotion identification task in one of three conditions: audio-only, visual-only video, or audiovisual video. In terms of adaptations of perceptual strategies, eye movement results showed a shift in fixations toward the eyes and away from the nose and mouth when audio is added. Notably, in terms of task performance, audio-only performance was mostly significantly worse than video-only and audiovisual performances, but performance in the latter two conditions was often not different. These results suggest that individuals flexibly and momentarily adapt their perceptual strategies to changes in the available information for emotion recognition, and these changes can be comprehensively quantified with eye tracking.
Affiliation(s)
- Minke J de Boer
  - Research School of Behavioural and Cognitive Neurosciences (BCN), University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
  - Department of Otorhinolaryngology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
  - Laboratory for Experimental Ophthalmology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Deniz Başkent
  - Research School of Behavioural and Cognitive Neurosciences (BCN), University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
  - Department of Otorhinolaryngology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
- Frans W Cornelissen
  - Research School of Behavioural and Cognitive Neurosciences (BCN), University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
  - Laboratory for Experimental Ophthalmology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands
|
44
|
Greenlaw KM, Puschmann S, Coffey EBJ. Decoding of Envelope vs. Fundamental Frequency During Complex Auditory Stream Segregation. Neurobiol Lang (Camb) 2020; 1:268-287. [PMID: 37215227 PMCID: PMC10158587 DOI: 10.1162/nol_a_00013] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/26/2019] [Accepted: 04/25/2020] [Indexed: 05/24/2023]
Abstract
Hearing-in-noise perception is a challenging task that is critical to human function, but how the brain accomplishes it is not well understood. A candidate mechanism proposes that the neural representation of an attended auditory stream is enhanced relative to background sound via a combination of bottom-up and top-down mechanisms. To date, few studies have compared neural representation and its task-related enhancement across frequency bands that carry different auditory information, such as a sound's amplitude envelope (i.e., syllabic rate or rhythm; 1-9 Hz), and the fundamental frequency of periodic stimuli (i.e., pitch; >40 Hz). Furthermore, hearing-in-noise in the real world is frequently both messier and richer than the majority of tasks used in its study. In the present study, we use continuous sound excerpts that simultaneously offer predictive, visual, and spatial cues to help listeners separate the target from four acoustically similar simultaneously presented sound streams. We show that while both lower and higher frequency information about the entire sound stream is represented in the brain's response, the to-be-attended sound stream is strongly enhanced only in the slower, lower frequency sound representations. These results are consistent with the hypothesis that attended sound representations are strengthened progressively at higher level, later processing stages, and that the interaction of multiple brain systems can aid in this process. Our findings contribute to our understanding of auditory stream separation in difficult, naturalistic listening conditions and demonstrate that pitch and envelope information can be decoded from single-channel EEG data.
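The two frequency ranges contrasted here carry different information: slow amplitude-envelope modulations (1-9 Hz, syllabic rate) versus the fundamental frequency (>40 Hz, pitch). An envelope regressor of the kind commonly used in such single-channel decoding analyses can be sketched as follows (a generic recipe, not the authors' exact pipeline; the 9 Hz cutoff and filter order are assumptions matching the band named in the abstract):

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def amplitude_envelope(audio, fs, lp_cutoff=9.0):
    """Hilbert envelope of an audio signal, zero-phase low-pass
    filtered to keep the slow (syllabic-rate) modulations."""
    env = np.abs(hilbert(audio))                        # instantaneous amplitude
    b, a = butter(3, lp_cutoff / (fs / 2), btype="low")  # 9 Hz Butterworth
    return filtfilt(b, a, env)                           # zero-phase filtering
```

The resulting envelope (and, analogously, a band-passed >40 Hz representation for F0) would then serve as the stimulus feature in a decoding or encoding model.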
Affiliation(s)
- Keelin M. Greenlaw
  - Department of Psychology, Concordia University, Montreal, QC, Canada
  - International Laboratory for Brain, Music and Sound Research (BRAMS)
  - The Centre for Research on Brain, Language and Music (CRBLM)
|
45
|
Shaw LH, Freedman EG, Crosse MJ, Nicholas E, Chen AM, Braiman MS, Molholm S, Foxe JJ. Operating in a Multisensory Context: Assessing the Interplay Between Multisensory Reaction Time Facilitation and Inter-sensory Task-switching Effects. Neuroscience 2020; 436:122-135. [PMID: 32325100 DOI: 10.1016/j.neuroscience.2020.04.013] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2019] [Revised: 04/03/2020] [Accepted: 04/06/2020] [Indexed: 11/28/2022]
Abstract
Individuals respond faster to presentations of bisensory stimuli (e.g. audio-visual targets) than to presentations of either unisensory constituent in isolation (i.e. to the auditory-alone or visual-alone components of an audio-visual stimulus). This well-established multisensory speeding effect, termed the redundant signals effect (RSE), is not predicted by simple linear summation of the unisensory response time probability distributions. Rather, responses are typically faster than this prediction allows, leading researchers to ascribe the RSE to a so-called co-activation account. According to this account, multisensory neural processing occurs whereby the unisensory inputs are integrated to produce more effective sensory-motor activation. However, the typical paradigm used to test for the RSE involves random sequencing of unisensory and bisensory inputs in a mixed design, raising the possibility of an alternative attention-switching account. This intermixed design requires participants to switch between sensory modalities on many task trials (e.g. from responding to a visual stimulus to responding to an auditory stimulus). Here we show that much, if not all, of the RSE under this paradigm can be attributed to slowing of reaction times to unisensory stimuli resulting from modality switching, and is not in fact due to speeding of responses to audio-visual stimuli. As such, the present data do not support a co-activation account, but rather suggest that switching and mixing costs akin to those observed during classic task-switching paradigms account for the observed RSE.
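The race-model prediction discussed above (the bound obtained by summing the unisensory response-time distributions) can be sketched as follows. The synthetic reaction times and evaluation grid are hypothetical, not data from the study:

```python
import numpy as np

def ecdf(rts, t):
    """Empirical cumulative RT distribution evaluated at times t."""
    rts = np.sort(np.asarray(rts))
    return np.searchsorted(rts, t, side="right") / len(rts)

def race_model_violation(rt_a, rt_v, rt_av, t):
    """Positive values where P(RT_AV <= t) exceeds the race-model bound
    P(RT_A <= t) + P(RT_V <= t), i.e. evidence against a simple race."""
    bound = np.minimum(ecdf(rt_a, t) + ecdf(rt_v, t), 1.0)
    return ecdf(rt_av, t) - bound

# toy data: audiovisual RTs faster than either unisensory condition
rng = np.random.default_rng(0)
rt_a = rng.normal(350, 40, 500)   # auditory-alone RTs (ms)
rt_v = rng.normal(370, 40, 500)   # visual-alone RTs (ms)
rt_av = rng.normal(280, 30, 500)  # audio-visual RTs (ms)
t = np.linspace(200, 500, 61)
viol = race_model_violation(rt_a, rt_v, rt_av, t)
```

With these toy numbers the bound is violated at fast response times, the pattern classically taken to support co-activation; the abstract's point is that modality-switching costs in mixed designs can produce the same apparent speeding.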
Affiliation(s)
- Luke H Shaw
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA
- Edward G Freedman
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA
- Michael J Crosse
- The Cognitive Neurophysiology Laboratory, Department of Pediatrics & Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, NY 10461, USA
- Eric Nicholas
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA
- Allen M Chen
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA
- Matthew S Braiman
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA
- Sophie Molholm
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA; The Cognitive Neurophysiology Laboratory, Department of Pediatrics & Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, NY 10461, USA
- John J Foxe
- The Cognitive Neurophysiology Laboratory, The Del Monte Institute for Neuroscience, Department of Neuroscience, University of Rochester School of Medicine and Dentistry, Rochester, NY 14642, USA; The Cognitive Neurophysiology Laboratory, Department of Pediatrics & Neuroscience, Albert Einstein College of Medicine & Montefiore Medical Center, Bronx, NY 10461, USA.
46
Vanheusden FJ, Kegler M, Ireland K, Georga C, Simpson DM, Reichenbach T, Bell SL. Hearing Aids Do Not Alter Cortical Entrainment to Speech at Audible Levels in Mild-to-Moderately Hearing-Impaired Subjects. Front Hum Neurosci 2020; 14:109. [PMID: 32317951 PMCID: PMC7147120 DOI: 10.3389/fnhum.2020.00109] [Received: 11/21/2019] [Accepted: 03/11/2020] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Cortical entrainment to speech correlates with speech intelligibility and with attention to a speech stream in noisy environments. However, there is a lack of data on whether cortical entrainment can help in evaluating hearing aid fittings for subjects with mild to moderate hearing loss. One particular problem is that hearing aids may alter the speech stimulus during (pre-)processing steps, which might alter cortical entrainment to the speech. Here, the effect of hearing aid processing on cortical entrainment to running speech in hearing-impaired subjects was investigated. METHODOLOGY Seventeen native English-speaking subjects with mild-to-moderate hearing loss participated in the study. Hearing function and hearing aid fitting were evaluated using standard clinical procedures. Participants then listened to a 25-min audiobook under aided and unaided conditions at 70 dBA sound pressure level (SPL) in quiet. EEG data were collected using a 32-channel system. Cortical entrainment to speech was evaluated using decoders reconstructing the speech envelope from the EEG data. Null decoders, obtained from EEG and the time-reversed speech envelope, were used to assess chance-level reconstructions. Entrainment in the delta (1-4 Hz) and theta (4-8 Hz) bands, as well as in wideband (1-20 Hz) EEG data, was investigated. RESULTS Significant cortical responses could be detected for all but one subject in all three frequency bands under both aided and unaided conditions. However, no significant differences were found between the two conditions, either in the number of responses detected or in the strength of cortical entrainment. The results show that the relatively small change in speech input provided by the hearing aid was not sufficient to elicit a detectable change in cortical entrainment. CONCLUSION For subjects with mild to moderate hearing loss, cortical entrainment to speech in quiet at an audible level is not affected by hearing aids. These results clear the way for exploring the use of cortical entrainment to running speech for evaluating hearing aid fitting at lower speech intensities (which could be inaudible when unaided) or under speech-in-noise conditions.
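The envelope-reconstruction decoders and time-reversed null decoders described above can be sketched as a lagged ridge regression. The lag window, regularization value, and toy two-channel signal are illustrative assumptions, not the study's parameters:

```python
import numpy as np

def lag_matrix(eeg, lags):
    """Design matrix for a backward model: the stimulus at time t is
    reconstructed from EEG samples at t + L (EEG lags the sound)."""
    T, C = eeg.shape
    X = np.zeros((T, C * len(lags)))
    for i, L in enumerate(lags):
        shifted = np.roll(eeg, -L, axis=0)
        if L > 0:
            shifted[-L:] = 0  # drop samples that wrapped around
        X[:, i * C:(i + 1) * C] = shifted
    return X

def train_decoder(eeg, envelope, lags, lam=1.0):
    """Ridge regression mapping lagged EEG onto the speech envelope."""
    X = lag_matrix(eeg, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ envelope)

# toy data: channel 0 carries the envelope delayed by 5 samples, channel 1 is noise
rng = np.random.default_rng(1)
env = rng.standard_normal(2000)
eeg = np.stack([np.roll(env, 5), rng.standard_normal(2000)], axis=1)
eeg = eeg + 0.1 * rng.standard_normal(eeg.shape)

lags = list(range(0, 11))  # reconstruct from 0-10 samples of later EEG
w = train_decoder(eeg, env, lags)
recon = lag_matrix(eeg, lags) @ w
r = np.corrcoef(recon, env)[0, 1]  # reconstruction accuracy

# "null" decoder trained on the time-reversed envelope, as a chance-level control
w_null = train_decoder(eeg, env[::-1], lags)
r_null = np.corrcoef(lag_matrix(eeg, lags) @ w_null, env[::-1])[0, 1]
```

In this toy setting the true decoder reconstructs the envelope well while the time-reversed null decoder stays near chance, mirroring how the study separates genuine entrainment from spurious fits.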
Affiliation(s)
- Frederique J. Vanheusden
- Department of Engineering, School of Science and Technology, Nottingham Trent University, Nottingham, United Kingdom
- Institute of Sound and Vibration Research, Faculty of Engineering and Physical Sciences, University of Southampton, Southampton, United Kingdom
- Mikolaj Kegler
- Department of Bioengineering and Centre for Neurotechnology, Imperial College London, South Kensington Campus, London, United Kingdom
- Katie Ireland
- Audiology Department, Royal Berkshire NHS Foundation Trust, Reading, United Kingdom
- Constantina Georga
- Audiology Department, Royal Berkshire NHS Foundation Trust, Reading, United Kingdom
- David M. Simpson
- Institute of Sound and Vibration Research, Faculty of Engineering and Physical Sciences, University of Southampton, Southampton, United Kingdom
- Tobias Reichenbach
- Department of Bioengineering and Centre for Neurotechnology, Imperial College London, South Kensington Campus, London, United Kingdom
- Steven L. Bell
- Institute of Sound and Vibration Research, Faculty of Engineering and Physical Sciences, University of Southampton, Southampton, United Kingdom
47
Micheli C, Schepers IM, Ozker M, Yoshor D, Beauchamp MS, Rieger JW. Electrocorticography reveals continuous auditory and visual speech tracking in temporal and occipital cortex. Eur J Neurosci 2020; 51:1364-1376. [PMID: 29888819 PMCID: PMC6289876 DOI: 10.1111/ejn.13992] [Received: 08/14/2017] [Revised: 05/19/2018] [Accepted: 05/29/2018] [Indexed: 12/11/2022]
Abstract
During natural speech perception, humans must parse temporally continuous auditory and visual speech signals into sequences of words. However, most studies of speech perception present only single words or syllables. We used electrocorticography (subdural electrodes implanted on the brains of epileptic patients) to investigate the neural mechanisms for processing continuous audiovisual speech signals consisting of individual sentences. Using partial correlation analysis, we found that posterior superior temporal gyrus (pSTG) and medial occipital cortex tracked both the auditory and the visual speech envelopes. These same regions, as well as inferior temporal cortex, responded more strongly to a dynamic video of a talking face compared to auditory speech paired with a static face. Occipital cortex and pSTG carry temporal information about both auditory and visual speech dynamics. Visual speech tracking in pSTG may be a mechanism for enhancing perception of degraded auditory speech.
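The partial correlation analysis mentioned above (relating a neural signal to one speech envelope while controlling for the other) can be sketched as follows; the toy signals are hypothetical, not the study's ECoG data:

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after regressing out z from both."""
    Z = np.column_stack([np.ones_like(z), z])
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# toy data: a neural signal driven by the auditory envelope only,
# while the auditory and visual envelopes are themselves correlated
rng = np.random.default_rng(2)
aud = rng.standard_normal(1000)
vis = 0.7 * aud + 0.3 * rng.standard_normal(1000)
neural = aud + 0.5 * rng.standard_normal(1000)

r_vis = np.corrcoef(neural, vis)[0, 1]    # inflated by variance shared with aud
r_vis_p = partial_corr(neural, vis, aud)  # near zero once aud is removed
```

This illustrates why partial correlation matters here: auditory and visual speech envelopes co-vary, so raw correlations can falsely suggest visual tracking in a region driven only by the acoustics.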
Affiliation(s)
- Cristiano Micheli
- Department of Psychology, Carl von Ossietzky University, Oldenburg, Germany
- Donders Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, The Netherlands
- Inga M Schepers
- Department of Psychology, Carl von Ossietzky University, Oldenburg, Germany
- Research Center Neurosensory Science, Carl von Ossietzky University, Oldenburg, Germany
- Müge Ozker
- Department of Neurosurgery, Baylor College of Medicine, Houston, Texas
- Daniel Yoshor
- Department of Neurosurgery, Baylor College of Medicine, Houston, Texas
- Michael E. DeBakey Veterans Affairs Medical Center, Houston, Texas
- Jochem W Rieger
- Department of Psychology, Carl von Ossietzky University, Oldenburg, Germany
- Research Center Neurosensory Science, Carl von Ossietzky University, Oldenburg, Germany
48
Yuan Y, Wayland R, Oh Y. Visual analog of the acoustic amplitude envelope benefits speech perception in noise. J Acoust Soc Am 2020; 147:EL246. [PMID: 32237828 DOI: 10.1121/10.0000737] [Received: 10/14/2019] [Accepted: 01/29/2020] [Indexed: 06/11/2023]
Abstract
The nature of the visual input that integrates with the audio signal to yield speech processing advantages remains controversial. This study tests the hypothesis that the information extracted for audiovisual integration includes co-occurring suprasegmental dynamic changes in the acoustic and visual signal. English sentences embedded in multi-talker babble noise were presented to native English listeners in audio-only and audiovisual modalities. A significant intelligibility enhancement with the visual analogs congruent to the acoustic amplitude envelopes was observed. These results suggest that dynamic visual modulation provides speech rhythmic information that can be integrated online with the audio signal to enhance speech intelligibility.
Affiliation(s)
- Yi Yuan
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, Florida 32610, USA
- Ratree Wayland
- Department of Linguistics, University of Florida, Gainesville, Florida 32611, USA
- Yonghee Oh
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville, Florida 32610, USA
49
Shavit-Cohen K, Zion Golumbic E. The Dynamics of Attention Shifts Among Concurrent Speech in a Naturalistic Multi-speaker Virtual Environment. Front Hum Neurosci 2019; 13:386. [PMID: 31780911 PMCID: PMC6857110 DOI: 10.3389/fnhum.2019.00386] [Received: 05/02/2019] [Accepted: 10/16/2019] [Indexed: 12/18/2022] Open
Abstract
Focusing attention on one speaker against a background of other, irrelevant speech can be a challenging feat. A longstanding question in attention research is whether and how frequently individuals shift their attention towards task-irrelevant speech, arguably leading to occasional detection of words in a so-called unattended message. However, this has been difficult to gauge empirically, particularly when participants attend to continuous natural speech, due to the lack of appropriate metrics for detecting shifts in internal attention. Here we introduce a new experimental platform for studying the dynamic deployment of attention among concurrent speakers, utilizing a unique combination of Virtual Reality (VR) and eye-tracking technology. We created a Virtual Café in which participants sit across from and attend to the narrative of a target speaker. We manipulated the number and location of distractor speakers by placing additional characters throughout the Virtual Café. By monitoring participants' eye-gaze dynamics, we studied the patterns of overt attention shifts among concurrent speakers as well as the consequences of these shifts on speech comprehension. Our results reveal important individual differences in the gaze patterns displayed during selective attention to speech. While some participants stayed fixated on the target speaker throughout the entire experiment, approximately 30% of participants frequently shifted their gaze toward distractor speakers or other locations in the environment, regardless of the severity of audiovisual distraction. Critically, performing frequent gaze shifts negatively impacted the comprehension of target speech, and participants made more mistakes when looking away from the target speaker.
We also found that gaze-shifts occurred primarily during gaps in the acoustic input, suggesting that momentary reductions in acoustic masking prompt attention-shifts between competing speakers, in line with "glimpsing" theories of processing speech in noise. These results open a new window into understanding the dynamics of attention as they wax and wane over time, and the different listening patterns employed for dealing with the influx of sensory input in multisensory environments. Moreover, the novel approach developed here for tracking the locus of momentary attention in a naturalistic virtual-reality environment holds high promise for extending the study of human behavior and cognition and bridging the gap between the laboratory and real-life.
Affiliation(s)
- Elana Zion Golumbic
- The Gonda Multidisciplinary Brain Research Center, Bar Ilan University, Ramat Gan, Israel
50
Fu Z, Wu X, Chen J. Congruent audiovisual speech enhances auditory attention decoding with EEG. J Neural Eng 2019; 16:066033. [PMID: 31505476 DOI: 10.1088/1741-2552/ab4340] [Indexed: 11/12/2022]
Abstract
OBJECTIVE The auditory attention decoding (AAD) approach can be used to determine the identity of the attended speaker during an auditory selective attention task, by analyzing measurements of electroencephalography (EEG) data. The AAD approach has the potential to guide the design of speech enhancement algorithms in hearing aids, i.e. to identify the speech stream of the listener's interest so that hearing aid algorithms can amplify the target speech and attenuate other distracting sounds. This would consequently result in improved speech understanding and communication and reduced cognitive load. The present work aimed to investigate whether additional visual input (i.e. lipreading) would enhance the AAD performance for normal-hearing listeners. APPROACH In a two-talker scenario, where auditory stimuli of audiobooks narrated by two speakers were presented, multi-channel EEG signals were recorded while participants selectively attended to one speaker and ignored the other. Speakers' mouth movements were recorded during narration to provide the visual stimuli. Stimulus conditions included audio-only, visual input congruent with either (i.e. attended or unattended) speaker, and visual input incongruent with either speaker. The AAD approach was performed separately for each condition to evaluate the effect of additional visual input on AAD. MAIN RESULTS Relative to the audio-only condition, AAD performance improved with visual input only when it was congruent with the attended speech stream, and the improvement was about 14 percentage points in decoding accuracy. Cortical envelope-tracking activity in both auditory and visual cortex was stronger for the congruent audiovisual speech condition than for the other conditions. In addition, the congruent audiovisual condition showed higher AAD robustness, achieving higher accuracy than the audio-only condition with fewer channels and shorter trial durations. SIGNIFICANCE The present work complements previous studies and further demonstrates the feasibility of AAD-guided design of hearing aids for daily face-to-face conversations. It also provides guidance for designing a low-density EEG setup for the AAD approach.
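The AAD decision rule implied above, labeling the attended talker as the stream whose envelope best matches an EEG-based reconstruction, can be sketched as follows. The trial structure and noise level are illustrative assumptions, not the study's setup:

```python
import numpy as np

def decode_attention(recon, env_a, env_b):
    """Label the attended stream as the candidate envelope whose
    correlation with the EEG-reconstructed envelope is higher."""
    ra = np.corrcoef(recon, env_a)[0, 1]
    rb = np.corrcoef(recon, env_b)[0, 1]
    return "A" if ra > rb else "B"

# toy trials: the reconstruction is the attended envelope plus noise
rng = np.random.default_rng(3)
n_trials = 100
correct = 0
for _ in range(n_trials):
    env_a = rng.standard_normal(500)
    env_b = rng.standard_normal(500)
    recon = env_a + 1.5 * rng.standard_normal(500)  # listener attends A
    correct += decode_attention(recon, env_a, env_b) == "A"
accuracy = correct / n_trials
```

Decoding accuracy in this scheme depends on trial duration and reconstruction quality, which is why the abstract's finding of higher accuracy with shorter trials and fewer channels under congruent audiovisual input matters for practical hearing-aid use.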
Affiliation(s)
- Zhen Fu
- Department of Machine Intelligence, Speech and Hearing Research Center, and Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing 100871, People's Republic of China