1
Dorsi J, Lacey S, Sathian K. Multisensory and lexical information in speech perception. Front Hum Neurosci 2024; 17:1331129. PMID: 38259332; PMCID: PMC10800662; DOI: 10.3389/fnhum.2023.1331129.
Abstract
Both multisensory and lexical information are known to influence the perception of speech. However, an open question remains: is either source more fundamental to perceiving speech? In this perspective, we review the literature and argue that multisensory information plays a more fundamental role in speech perception than lexical information. Three sets of findings support this conclusion: first, reaction times and electroencephalographic signal latencies indicate that the effects of multisensory information on speech processing seem to occur earlier than the effects of lexical information. Second, non-auditory sensory input influences the perception of features that differentiate phonetic categories; thus, multisensory information determines what lexical information is ultimately processed. Finally, there is evidence that multisensory information helps form some lexical information as part of a phenomenon known as sound symbolism. These findings support a framework of speech perception that, while acknowledging the influential roles of both multisensory and lexical information, holds that multisensory information is more fundamental to the process.
Affiliation(s)
- Josh Dorsi
- Department of Neurology, Penn State College of Medicine, Hershey, PA, United States
- Simon Lacey
- Department of Neurology, Penn State College of Medicine, Hershey, PA, United States
- Department of Neural and Behavioral Sciences, Penn State College of Medicine, Hershey, PA, United States
- Department of Psychology, Penn State Colleges of Medicine and Liberal Arts, Hershey, PA, United States
- K. Sathian
- Department of Neurology, Penn State College of Medicine, Hershey, PA, United States
- Department of Neural and Behavioral Sciences, Penn State College of Medicine, Hershey, PA, United States
- Department of Psychology, Penn State Colleges of Medicine and Liberal Arts, Hershey, PA, United States
2
Ahn E, Majumdar A, Lee T, Brang D. Evidence for a Causal Dissociation of the McGurk Effect and Congruent Audiovisual Speech Perception via TMS. bioRxiv [Preprint] 2023:2023.11.27.568892. PMID: 38077093; PMCID: PMC10705272; DOI: 10.1101/2023.11.27.568892.
Abstract
Congruent visual speech improves speech perception accuracy, particularly in noisy environments. Conversely, mismatched visual speech can alter what is heard, leading to an illusory percept known as the McGurk effect. This illusion has been widely used to study audiovisual speech integration, illustrating that auditory and visual cues are combined in the brain to generate a single coherent percept. While prior transcranial magnetic stimulation (TMS) and neuroimaging studies have identified the left posterior superior temporal sulcus (pSTS) as a causal region involved in the generation of the McGurk effect, it remains unclear whether this region is critical only for this illusion or also for the more general benefits of congruent visual speech (e.g., increased accuracy and faster reaction times). Indeed, recent correlative research suggests that the benefits of congruent visual speech and the McGurk effect reflect largely independent mechanisms. To better understand how these different features of audiovisual integration are causally generated by the left pSTS, we used single-pulse TMS to temporarily impair processing while subjects were presented with either incongruent (McGurk) or congruent audiovisual combinations. Consistent with past research, we observed that TMS to the left pSTS significantly reduced the strength of the McGurk effect. Importantly, however, left pSTS stimulation did not affect the positive benefits of congruent audiovisual speech (increased accuracy and faster reaction times), demonstrating a causal dissociation between the two processes. Our results are consistent with models proposing that the pSTS is but one of multiple critical areas supporting audiovisual speech interactions. Moreover, these data add to a growing body of evidence suggesting that the McGurk effect is an imperfect surrogate measure for more general and ecologically valid audiovisual speech behaviors.
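As an illustration of the dissociation logic described in this abstract, the sketch below (not the authors' code; subject counts, condition names, and the simulated data are hypothetical placeholders) shows how one might test, within subjects, whether stimulation reduces the McGurk fusion rate while leaving the congruent audiovisual benefit intact.

```python
# Hedged sketch (not the authors' analysis): paired comparisons testing whether
# pSTS stimulation changes McGurk fusion but spares the congruent-AV benefit.
# Condition names and the data layout are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_subjects = 20

# Per-subject outcome measures (simulated placeholders).
fusion_sham = rng.uniform(0.4, 0.8, n_subjects)                   # McGurk fusion rate, control TMS
fusion_psts = fusion_sham - rng.uniform(0.05, 0.2, n_subjects)    # fusion rate, pSTS TMS
av_benefit_sham = rng.normal(0.15, 0.05, n_subjects)              # AV-minus-A accuracy gain, control TMS
av_benefit_psts = av_benefit_sham + rng.normal(0, 0.02, n_subjects)  # same gain, pSTS TMS

# Does pSTS TMS reduce the illusion?
t_fusion, p_fusion = ttest_rel(fusion_sham, fusion_psts)
# Does pSTS TMS reduce the congruent-AV benefit?
t_benefit, p_benefit = ttest_rel(av_benefit_sham, av_benefit_psts)

print(f"McGurk fusion: t={t_fusion:.2f}, p={p_fusion:.3f}")
print(f"Congruent AV benefit: t={t_benefit:.2f}, p={p_benefit:.3f}")
```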
Affiliation(s)
- EunSeon Ahn
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- Areti Majumdar
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- Taraz Lee
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
3
Nidiffer AR, Cao CZ, O'Sullivan A, Lalor EC. A representation of abstract linguistic categories in the visual system underlies successful lipreading. Neuroimage 2023; 282:120391. PMID: 37757989; DOI: 10.1016/j.neuroimage.2023.120391.
Abstract
There is considerable debate over how visual speech is processed in the absence of sound and whether neural activity supporting lipreading occurs in visual brain areas. Much of the ambiguity stems from a lack of behavioral grounding and neurophysiological analyses that cannot disentangle high-level linguistic and phonetic/energetic contributions from visual speech. To address this, we recorded EEG from human observers as they watched silent videos, half of which were novel and half of which were previously rehearsed with the accompanying audio. We modeled how the EEG responses to novel and rehearsed silent speech reflected the processing of low-level visual features (motion, lip movements) and a higher-level categorical representation of linguistic units, known as visemes. The ability of these visemes to account for the EEG - beyond the motion and lip movements - was significantly enhanced for rehearsed videos in a way that correlated with participants' trial-by-trial ability to lipread that speech. Source localization of viseme processing showed clear contributions from visual cortex, with no strong evidence for the involvement of auditory areas. We interpret this as support for the idea that the visual system produces its own specialized representation of speech that is (1) well-described by categorical linguistic features, (2) dissociable from lip movements, and (3) predictive of lipreading ability. We also suggest a reinterpretation of previous findings of auditory cortical activation during silent speech that is consistent with hierarchical accounts of visual and audiovisual speech perception.
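A minimal sketch of the encoding-model logic summarized above, assuming a generic time-lagged ridge regression rather than the authors' actual pipeline; all array shapes, feature definitions, and hyperparameters are illustrative.

```python
# Hedged sketch: a time-lagged ridge "encoding" model predicting EEG from stimulus
# features, asking whether categorical viseme regressors explain variance beyond
# motion and lip-movement features. Simulated data; not the published pipeline.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(1)
n_samples, n_channels = 5000, 32                  # e.g. EEG resampled to 64 Hz
motion = rng.standard_normal((n_samples, 2))      # optical flow, lip aperture (placeholders)
visemes = rng.integers(0, 2, (n_samples, 12)).astype(float)  # one-hot viseme features
eeg = rng.standard_normal((n_samples, n_channels))

def lag_features(X, lags):
    """Stack time-lagged copies of X (columns = feature x lag)."""
    return np.hstack([np.roll(X, lag, axis=0) for lag in lags])

lags = range(0, 16)  # roughly 0-250 ms at 64 Hz

def cv_prediction_corr(X, y, n_splits=5):
    """Mean cross-validated correlation between predicted and actual EEG."""
    scores = []
    for train, test in KFold(n_splits).split(X):
        model = Ridge(alpha=1e3).fit(X[train], y[train])
        pred = model.predict(X[test])
        r = [np.corrcoef(pred[:, ch], y[test][:, ch])[0, 1] for ch in range(y.shape[1])]
        scores.append(np.mean(r))
    return np.mean(scores)

X_low = lag_features(motion, lags)
X_full = lag_features(np.hstack([motion, visemes]), lags)

print("low-level features only:", cv_prediction_corr(X_low, eeg))
print("plus viseme features   :", cv_prediction_corr(X_full, eeg))
```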
Affiliation(s)
- Aaron R Nidiffer
- Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Cody Zhewei Cao
- Department of Psychology, University of Michigan, Ann Arbor, MI, USA
- Aisling O'Sullivan
- School of Engineering, Trinity College Institute of Neuroscience, Trinity Centre for Biomedical Engineering, Trinity College, Dublin, Ireland
- Edmund C Lalor
- Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- School of Engineering, Trinity College Institute of Neuroscience, Trinity Centre for Biomedical Engineering, Trinity College, Dublin, Ireland
4
Yamoah EN, Pavlinkova G, Fritzsch B. The Development of Speaking and Singing in Infants May Play a Role in Genomics and Dementia in Humans. Brain Sci 2023; 13:1190. PMID: 37626546; PMCID: PMC10452560; DOI: 10.3390/brainsci13081190.
Abstract
The development of the central auditory system, including the auditory cortex and other areas involved in processing sound, is shaped by genetic and environmental factors, enabling infants to learn how to speak. Before explaining hearing in humans, a short overview of auditory dysfunction is provided. Environmental factors such as exposure to sound and language can affect the development and function of auditory processing, including speech perception, singing, and language processing. Infants can hear before birth, and sound exposure sculpts the structure and functions of their developing auditory system. Exposing infants to singing and speaking can support their auditory and language development. In aging humans, the hippocampus and auditory nuclear centers are affected by neurodegenerative diseases such as Alzheimer's, resulting in memory and auditory processing difficulties. As the disease progresses, overt damage to auditory nuclear centers occurs, leading to problems in processing auditory information. In conclusion, the combination of memory and auditory processing difficulties substantially impairs people's ability to communicate and engage with society.
Affiliation(s)
- Ebenezer N. Yamoah
- Department of Physiology and Cell Biology, School of Medicine, University of Nevada, Reno, NV 89557, USA
- Bernd Fritzsch
- Department of Neurological Sciences, University of Nebraska Medical Center, Omaha, NE 68198, USA
5
Giovanelli E, Gianfreda G, Gessa E, Valzolgher C, Lamano L, Lucioli T, Tomasuolo E, Rinaldi P, Pavani F. The effect of face masks on sign language comprehension: performance and metacognitive dimensions. Conscious Cogn 2023; 109:103490. PMID: 36842317; DOI: 10.1016/j.concog.2023.103490.
Abstract
In spoken languages, face masks represent an obstacle to speech understanding and influence metacognitive judgments, reducing confidence and increasing effort while listening. To date, all studies on face masks and communication involved spoken languages and hearing participants, leaving us with no insight into how masked communication affects non-spoken languages. Here, we examined the effects of face masks on sign language comprehension and metacognition. In an online experiment, deaf participants (N = 60) watched three parts of a story signed without a mask, with a transparent mask, or with an opaque mask, and answered questions about story content, as well as their perceived effort, feeling of understanding, and confidence in their answers. Results showed that feeling of understanding and perceived effort worsened as the visual condition changed from no mask to transparent or opaque masks, while comprehension of the story was not significantly different across visual conditions. We propose that metacognitive effects could be due to the reduction of pragmatic, linguistic and para-linguistic cues from the lower face, hidden by the mask. This reduction could affect the perception of lower-face linguistic components, attitude attribution, and the classification of emotions and prosody in a conversation, driving the observed effects on metacognitive judgments while leaving sign language comprehension substantially unchanged, albeit at a higher effort. These results represent a novel step towards better understanding what drives metacognitive effects of face masks while communicating face to face and highlight the importance of including the metacognitive dimension in human communication research.
Affiliation(s)
- Elena Giovanelli
- Center for Mind/Brain Sciences - CIMeC, University of Trento, Rovereto, Italy.
- Gabriele Gianfreda
- Institute of Cognitive Sciences and Technologies (ISTC-CNR), Rome, Italy
- Elena Gessa
- Center for Mind/Brain Sciences - CIMeC, University of Trento, Rovereto, Italy
- Chiara Valzolgher
- Center for Mind/Brain Sciences - CIMeC, University of Trento, Rovereto, Italy
- Luca Lamano
- Institute of Cognitive Sciences and Technologies (ISTC-CNR), Rome, Italy
- Tommaso Lucioli
- Institute of Cognitive Sciences and Technologies (ISTC-CNR), Rome, Italy
- Elena Tomasuolo
- Institute of Cognitive Sciences and Technologies (ISTC-CNR), Rome, Italy
- Pasquale Rinaldi
- Institute of Cognitive Sciences and Technologies (ISTC-CNR), Rome, Italy
- Francesco Pavani
- Center for Mind/Brain Sciences - CIMeC, University of Trento, Rovereto, Italy
- Centro Interuniversitario di Ricerca "Cognizione, Linguaggio e Sordità" - CIRCLeS, Trento, Italy
6
Suess N, Hauswald A, Reisinger P, Rösch S, Keitel A, Weisz N. Cortical tracking of formant modulations derived from silently presented lip movements and its decline with age. Cereb Cortex 2022; 32:4818-4833. PMID: 35062025; PMCID: PMC9627034; DOI: 10.1093/cercor/bhab518.
Abstract
The integration of visual and auditory cues is crucial for successful processing of speech, especially under adverse conditions. Recent reports have shown that when participants watch muted videos of speakers, the phonological information about the acoustic speech envelope, which is associated with but independent from the speakers' lip movements, is tracked by the visual cortex. However, the speech signal also carries richer acoustic details, for example, about the fundamental frequency and the resonant frequencies, whose visuophonological transformation could aid speech processing. Here, we investigated the neural basis of the visuo-phonological transformation processes of these more fine-grained acoustic details and assessed how they change as a function of age. We recorded whole-head magnetoencephalographic (MEG) data while the participants watched silent normal (i.e., natural) and reversed videos of a speaker and paid attention to their lip movements. We found that the visual cortex is able to track the unheard natural modulations of resonant frequencies (or formants) and the pitch (or fundamental frequency) linked to lip movements. Importantly, only the processing of natural unheard formants decreases significantly with age in the visual and also in the cingulate cortex. This is not the case for the processing of the unheard speech envelope, the fundamental frequency, or the purely visual information carried by lip movements. These results show that unheard spectral fine details (along with the unheard acoustic envelope) are transformed from a mere visual to a phonological representation. Aging affects especially the ability to derive spectral dynamics at formant frequencies. As listening in noisy environments should capitalize on the ability to track spectral fine details, our results provide a novel focus on compensatory processes in such challenging situations.
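The notion of "tracking" an unheard formant contour can be illustrated with a simple spectral coherence measure; the sketch below is an assumed, simplified stand-in for the MEG analysis, with simulated signals and placeholder variable names.

```python
# Hedged illustration (assumed pipeline, not the authors' code): tracking of an
# unheard stimulus feature indexed by coherence between a cortical time series
# and that feature, e.g. the formant contour accompanying the silent lip movements.
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(2)
fs = 150.0                      # sampling rate in Hz
t = np.arange(0, 60, 1 / fs)    # one minute of data

formant_contour = rng.standard_normal(t.size)                          # unheard formant modulation
visual_cortex = 0.3 * formant_contour + rng.standard_normal(t.size)    # simulated "tracking" signal
control_signal = rng.standard_normal(t.size)                           # unrelated control signal

f, coh_tracked = coherence(visual_cortex, formant_contour, fs=fs, nperseg=512)
f, coh_control = coherence(control_signal, formant_contour, fs=fs, nperseg=512)

band = (f >= 1) & (f <= 7)  # low-frequency range where speech tracking is typically assessed
print("coherence (tracking signal):", coh_tracked[band].mean())
print("coherence (control signal) :", coh_control[band].mean())
```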
Affiliation(s)
- Nina Suess
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Anne Hauswald
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Patrick Reisinger
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Sebastian Rösch
- Department of Otorhinolaryngology, Head and Neck Surgery, Paracelsus Medical University Salzburg, University Hospital Salzburg, Salzburg 5020, Austria
- Anne Keitel
- School of Social Sciences, University of Dundee, Dundee DD1 4HN, UK
- Nathan Weisz
- Department of Psychology, Centre for Cognitive Neuroscience, University of Salzburg, Salzburg 5020, Austria
- Department of Psychology, Neuroscience Institute, Christian Doppler University Hospital, Paracelsus Medical University, Salzburg 5020, Austria
7
Fritzsch B, Elliott KL, Yamoah EN. Neurosensory development of the four brainstem-projecting sensory systems and their integration in the telencephalon. Front Neural Circuits 2022; 16:913480. PMID: 36213204; PMCID: PMC9539932; DOI: 10.3389/fncir.2022.913480.
Abstract
Somatosensory, taste, vestibular, and auditory information is first processed in the brainstem. From the brainstem, the respective information is relayed to specific regions within the cortex, where these inputs are further processed and integrated with other sensory systems to provide a comprehensive sensory experience. We provide the organization, genetics, and various neuronal connections of four sensory systems: trigeminal, taste, vestibular, and auditory systems. The development of trigeminal fibers is comparable to many sensory systems, for they project mostly contralaterally from the brainstem or spinal cord to the telencephalon. Taste bud information is primarily projected ipsilaterally through the thalamus to reach the insula. The vestibular fibers develop bilateral connections that eventually reach multiple areas of the cortex to provide a complex map. The auditory fibers project in a tonotopic contour to the auditory cortex. The spatial and tonotopic organization of trigeminal and auditory neuron projections are distinct from the taste and vestibular systems. The individual sensory projections within the cortex provide multi-sensory integration in the telencephalon that depends on context-dependent tertiary connections to integrate other cortical sensory systems across the four modalities.
Affiliation(s)
- Bernd Fritzsch
- Department of Biology, The University of Iowa, Iowa City, IA, United States
- Department of Otolaryngology, The University of Iowa, Iowa City, IA, United States
- Karen L. Elliott
- Department of Biology, The University of Iowa, Iowa City, IA, United States
- Ebenezer N. Yamoah
- Department of Physiology and Cell Biology, School of Medicine, University of Nevada, Reno, Reno, NV, United States
8
Haider CL, Suess N, Hauswald A, Park H, Weisz N. Masking of the mouth area impairs reconstruction of acoustic speech features and higher-level segmentational features in the presence of a distractor speaker. Neuroimage 2022; 252:119044. PMID: 35240298; DOI: 10.1016/j.neuroimage.2022.119044.
Abstract
Multisensory integration enables stimulus representation even when the sensory input in a single modality is weak. In the context of speech, when confronted with a degraded acoustic signal, congruent visual inputs promote comprehension. When this input is masked, speech comprehension consequently becomes more difficult. However, it remains unclear which levels of speech processing are affected by occluding the mouth area, and under which circumstances. To answer this question, we conducted an audiovisual (AV) multi-speaker experiment using naturalistic speech. In half of the trials, the target speaker wore a (surgical) face mask, while we measured the brain activity of normal-hearing participants via magnetoencephalography (MEG). We added a distractor speaker in half of the trials to create an ecologically valid, difficult listening situation. A decoding model was trained on the clear AV speech and used to reconstruct crucial speech features in each condition. We found significant main effects of face masks on the reconstruction of acoustic features, such as the speech envelope and spectral speech features (i.e., pitch and formant frequencies), while reconstruction of higher-level features of speech segmentation (phoneme and word onsets) was especially impaired by masks in difficult listening situations. As we used surgical face masks, which have only mild effects on speech acoustics, we interpret our findings as the result of the missing visual input. Our findings extend previous behavioural results by demonstrating the complex contextual effects of occluding relevant visual information on speech processing.
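The decoding (stimulus-reconstruction) approach mentioned above can be sketched roughly as follows; this is an assumed, generic implementation with simulated data, not the published MEG pipeline.

```python
# Hedged sketch: a backward "stimulus reconstruction" model trained on clear
# audiovisual data and used to reconstruct the speech envelope per condition;
# the reconstruction correlation is then compared across mask conditions.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_train, n_test, n_sensors, n_lags = 4000, 1000, 50, 20

def lagged(X, n_lags):
    """Concatenate time-lagged copies of the sensor data."""
    return np.hstack([np.roll(X, k, axis=0) for k in range(n_lags)])

envelope_train = rng.standard_normal(n_train)
meg_train = rng.standard_normal((n_train, n_sensors))

decoder = Ridge(alpha=1e2).fit(lagged(meg_train, n_lags), envelope_train)

for condition in ("no mask", "mask"):
    envelope_test = rng.standard_normal(n_test)          # placeholder test data
    meg_test = rng.standard_normal((n_test, n_sensors))
    recon = decoder.predict(lagged(meg_test, n_lags))
    r = np.corrcoef(recon, envelope_test)[0, 1]
    print(f"{condition}: reconstruction r = {r:.3f}")
```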
Affiliation(s)
- Chandra Leon Haider
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria.
- Nina Suess
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria
- Anne Hauswald
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria
- Hyojin Park
- School of Psychology & Centre for Human Brain Health (CHBH), University of Birmingham, Birmingham, UK
- Nathan Weisz
- Centre for Cognitive Neuroscience and Department of Psychology, University of Salzburg, Austria
- Neuroscience Institute, Christian Doppler University Hospital, Paracelsus Medical University, Salzburg, Austria
9
Rennig J, Beauchamp MS. Intelligibility of audiovisual sentences drives multivoxel response patterns in human superior temporal cortex. Neuroimage 2022; 247:118796. PMID: 34906712; PMCID: PMC8819942; DOI: 10.1016/j.neuroimage.2021.118796.
Abstract
Regions of the human posterior superior temporal gyrus and sulcus (pSTG/S) respond to the visual mouth movements that constitute visual speech and the auditory vocalizations that constitute auditory speech, and neural responses in pSTG/S may underlie the perceptual benefit of visual speech for the comprehension of noisy auditory speech. We examined this possibility through the lens of multivoxel pattern responses in pSTG/S. BOLD fMRI data was collected from 22 participants presented with speech consisting of English sentences presented in five different formats: visual-only; auditory with and without added auditory noise; and audiovisual with and without auditory noise. Participants reported the intelligibility of each sentence with a button press and trials were sorted post-hoc into those that were more or less intelligible. Response patterns were measured in regions of the pSTG/S identified with an independent localizer. Noisy audiovisual sentences with very similar physical properties evoked very different response patterns depending on their intelligibility. When a noisy audiovisual sentence was reported as intelligible, the pattern was nearly identical to that elicited by clear audiovisual sentences. In contrast, an unintelligible noisy audiovisual sentence evoked a pattern like that of visual-only sentences. This effect was less pronounced for noisy auditory-only sentences, which evoked similar response patterns regardless of intelligibility. The successful integration of visual and auditory speech produces a characteristic neural signature in pSTG/S, highlighting the importance of this region in generating the perceptual benefit of visual speech.
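The multivoxel pattern comparison described here amounts to correlating voxel-wise response patterns against condition templates; the sketch below illustrates that logic on simulated data and is not the authors' analysis code.

```python
# Hedged sketch: compare a condition's multivoxel pattern against "clear AV" and
# "visual-only" template patterns with Pearson correlation. Data are simulated.
import numpy as np

rng = np.random.default_rng(4)
n_voxels = 300

clear_av = rng.standard_normal(n_voxels)                  # template: clear AV sentences
visual_only = rng.standard_normal(n_voxels)               # template: visual-only sentences
intelligible_noisy = clear_av + 0.5 * rng.standard_normal(n_voxels)
unintelligible_noisy = visual_only + 0.5 * rng.standard_normal(n_voxels)

def pattern_similarity(a, b):
    """Correlation between two voxel-wise response patterns."""
    return np.corrcoef(a, b)[0, 1]

for name, pattern in [("intelligible noisy AV", intelligible_noisy),
                      ("unintelligible noisy AV", unintelligible_noisy)]:
    print(name,
          "| vs clear AV:", round(pattern_similarity(pattern, clear_av), 2),
          "| vs visual-only:", round(pattern_similarity(pattern, visual_only), 2))
```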
Affiliation(s)
- Johannes Rennig
- Division of Neuropsychology, Center of Neurology, Hertie-Institute for Clinical Brain Research, University of Tübingen, Tübingen, Germany
- Michael S Beauchamp
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Richards Medical Research Building, A607, 3700 Hamilton Walk, Philadelphia, PA 19104-6016, United States.
10
Kaganovich N, Christ S. Event-related potentials evidence for long-term audiovisual representations of phonemes in adults. Eur J Neurosci 2021; 54:7860-7875. PMID: 34750895; DOI: 10.1111/ejn.15519.
Abstract
The presence of long-term auditory representations for phonemes has been well-established. However, since speech perception is typically audiovisual, we hypothesized that long-term phoneme representations may also contain information on speakers' mouth shape during articulation. We used an audiovisual oddball paradigm in which, on each trial, participants saw a face and heard one of two vowels. One vowel occurred frequently (standard), while another occurred rarely (deviant). In one condition (neutral), the face had a closed, non-articulating mouth. In the other condition (audiovisual violation), the mouth shape matched the frequent vowel. Although in both conditions stimuli were audiovisual, we hypothesized that identical auditory changes would be perceived differently by participants. Namely, in the neutral condition, deviants violated only the audiovisual pattern specific to each block. By contrast, in the audiovisual violation condition, deviants additionally violated long-term representations for how a speaker's mouth looks during articulation. We compared the amplitude of mismatch negativity (MMN) and P3 components elicited by deviants in the two conditions. The MMN extended posteriorly over temporal and occipital sites even though deviants contained no visual changes, suggesting that deviants were perceived as interruptions in audiovisual, rather than auditory only, sequences. As predicted, deviants elicited larger MMN and P3 in the audiovisual violation compared to the neutral condition. The results suggest that long-term representations of phonemes are indeed audiovisual.
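A rough sketch of the underlying ERP computation, with simulated data: the MMN is taken as the deviant-minus-standard difference wave and its mean amplitude compared across conditions. Window limits, amplitudes, and condition labels are illustrative assumptions, not the study's parameters.

```python
# Hedged sketch (simulated data): compute the MMN difference wave and compare its
# mean amplitude in a typical 100-250 ms window between the two conditions.
import numpy as np

rng = np.random.default_rng(5)
fs = 500                                    # sampling rate in Hz
times = np.arange(-0.1, 0.5, 1 / fs)        # epoch from -100 to 500 ms

def simulate_erp(mmn_amplitude, n_trials=200):
    """Average of simulated single trials containing an MMN-like deflection."""
    mmn = mmn_amplitude * np.exp(-((times - 0.175) ** 2) / (2 * 0.03 ** 2))
    trials = mmn + rng.standard_normal((n_trials, times.size))
    return trials.mean(axis=0)

window = (times >= 0.1) & (times <= 0.25)
for condition, amp in [("neutral", -1.0), ("audiovisual violation", -2.0)]:
    standard = simulate_erp(0.0)
    deviant = simulate_erp(amp)
    mmn_wave = deviant - standard
    print(f"{condition}: mean MMN amplitude = {mmn_wave[window].mean():.2f} µV")
```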
Affiliation(s)
- Natalya Kaganovich
- Department of Speech, Language, and Hearing Sciences, Purdue University, West Lafayette, Indiana, USA
- Department of Psychological Sciences, Purdue University, West Lafayette, Indiana, USA
- Sharon Christ
- Department of Human Development and Family Studies, Purdue University, West Lafayette, Indiana, USA
- Department of Statistics, Purdue University, West Lafayette, Indiana, USA
11
Karthik G, Plass J, Beltz AM, Liu Z, Grabowecky M, Suzuki S, Stacey WC, Wasade VS, Towle VL, Tao JX, Wu S, Issa NP, Brang D. Visual speech differentially modulates beta, theta, and high gamma bands in auditory cortex. Eur J Neurosci 2021; 54:7301-7317. PMID: 34587350; DOI: 10.1111/ejn.15482.
Abstract
Speech perception is a central component of social communication. Although principally an auditory process, accurate speech perception in everyday settings is supported by meaningful information extracted from visual cues. Visual speech modulates activity in cortical areas subserving auditory speech perception including the superior temporal gyrus (STG). However, it is unknown whether visual modulation of auditory processing is a unitary phenomenon or, rather, consists of multiple functionally distinct processes. To explore this question, we examined neural responses to audiovisual speech measured from intracranially implanted electrodes in 21 patients with epilepsy. We found that visual speech modulated auditory processes in the STG in multiple ways, eliciting temporally and spatially distinct patterns of activity that differed across frequency bands. In the theta band, visual speech suppressed the auditory response from before auditory speech onset to after auditory speech onset (-93 to 500 ms) most strongly in the posterior STG. In the beta band, suppression was seen in the anterior STG from -311 to -195 ms before auditory speech onset and in the middle STG from -195 to 235 ms after speech onset. In high gamma, visual speech enhanced the auditory response from -45 to 24 ms only in the posterior STG. We interpret the visual-induced changes prior to speech onset as reflecting crossmodal prediction of speech signals. In contrast, modulations after sound onset may reflect a decrease in sustained feedforward auditory activity. These results are consistent with models that posit multiple distinct mechanisms supporting audiovisual speech perception.
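Band-specific effects like these are typically quantified from band-limited power; the sketch below shows one common way to compute it (band-pass filter plus Hilbert envelope) on simulated data, and is not the study's intracranial pipeline. Band edges follow common conventions and are assumptions here.

```python
# Hedged sketch: extract band-limited power with a band-pass filter and the
# Hilbert envelope, the kind of measure contrasted across conditions and time
# windows in each frequency band. Data are simulated.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

rng = np.random.default_rng(6)
fs = 1000.0
signal = rng.standard_normal(int(5 * fs))   # 5 s of one simulated STG channel

bands = {"theta": (4, 8), "beta": (13, 30), "high gamma": (70, 150)}

def band_power(x, low, high, fs):
    """Instantaneous power (squared Hilbert envelope) of a band-pass filtered signal."""
    b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, x)
    return np.abs(hilbert(filtered)) ** 2

for name, (low, high) in bands.items():
    power = band_power(signal, low, high, fs)
    print(f"{name:10s} mean power: {power.mean():.3f}")
```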
Affiliation(s)
- G Karthik
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
- John Plass
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
- Adriene M Beltz
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
- Zhongming Liu
- Department of Biomedical Engineering and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
- Marcia Grabowecky
- Department of Psychology, Northwestern University, Evanston, Illinois, USA
- Satoru Suzuki
- Department of Psychology, Northwestern University, Evanston, Illinois, USA
- William C Stacey
- Department of Neurology and Department of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan, USA
- Vibhangini S Wasade
- Department of Neurology, Henry Ford Hospital, Detroit, Michigan, USA
- Department of Neurology, Wayne State University School of Medicine, Detroit, Michigan, USA
- Vernon L Towle
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
- James X Tao
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
- Shasha Wu
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
- Naoum P Issa
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
12
O'Sullivan AE, Crosse MJ, Liberto GMD, de Cheveigné A, Lalor EC. Neurophysiological Indices of Audiovisual Speech Processing Reveal a Hierarchy of Multisensory Integration Effects. J Neurosci 2021; 41:4991-5003. PMID: 33824190; PMCID: PMC8197638; DOI: 10.1523/jneurosci.0906-20.2021.
Abstract
Seeing a speaker's face benefits speech comprehension, especially in challenging listening conditions. This perceptual benefit is thought to stem from the neural integration of visual and auditory speech at multiple stages of processing, whereby movement of a speaker's face provides temporal cues to auditory cortex, and articulatory information from the speaker's mouth can aid recognizing specific linguistic units (e.g., phonemes, syllables). However, it remains unclear how the integration of these cues varies as a function of listening conditions. Here, we sought to provide insight on these questions by examining EEG responses in humans (males and females) to natural audiovisual (AV), audio, and visual speech in quiet and in noise. We represented our speech stimuli in terms of their spectrograms and their phonetic features and then quantified the strength of the encoding of those features in the EEG using canonical correlation analysis (CCA). The encoding of both spectrotemporal and phonetic features was shown to be more robust in AV speech responses than what would have been expected from the summation of the audio and visual speech responses, suggesting that multisensory integration occurs at both spectrotemporal and phonetic stages of speech processing. We also found evidence to suggest that the integration effects may change with listening conditions; however, this was an exploratory analysis and future work will be required to examine this effect using a within-subject design. These findings demonstrate that integration of audio and visual speech occurs at multiple stages along the speech processing hierarchy.SIGNIFICANCE STATEMENT During conversation, visual cues impact our perception of speech. Integration of auditory and visual speech is thought to occur at multiple stages of speech processing and vary flexibly depending on the listening conditions. Here, we examine audiovisual (AV) integration at two stages of speech processing using the speech spectrogram and a phonetic representation, and test how AV integration adapts to degraded listening conditions. We find significant integration at both of these stages regardless of listening conditions. These findings reveal neural indices of multisensory interactions at different stages of processing and provide support for the multistage integration framework.
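A minimal sketch of how canonical correlation analysis can index stimulus encoding strength, assuming simulated data and a toy "A+V" construction; it illustrates the comparison logic rather than reproducing the authors' CCA pipeline.

```python
# Hedged sketch: CCA between stimulus features and EEG, where a stronger canonical
# correlation for audiovisual responses than for a summed audio + visual response
# is taken as evidence of integration. All data here are simulated placeholders.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
n_samples, n_features, n_channels = 3000, 20, 32

stimulus = rng.standard_normal((n_samples, n_features))   # spectrogram/phonetic features
eeg_av = stimulus @ rng.standard_normal((n_features, n_channels)) \
         + rng.standard_normal((n_samples, n_channels))
eeg_a_plus_v = eeg_av + 2.0 * rng.standard_normal((n_samples, n_channels))

def first_canonical_corr(X, Y):
    """Correlation of the first pair of canonical variates."""
    cca = CCA(n_components=1).fit(X, Y)
    x_scores, y_scores = cca.transform(X, Y)
    return np.corrcoef(x_scores[:, 0], y_scores[:, 0])[0, 1]

print("AV response :", first_canonical_corr(stimulus, eeg_av))
print("A+V response:", first_canonical_corr(stimulus, eeg_a_plus_v))
```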
Affiliation(s)
- Aisling E O'Sullivan
- School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
- Michael J Crosse
- X, The Moonshot Factory, Mountain View, CA
- Department of Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461
- Giovanni M Di Liberto
- Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, Paris Sciences et Lettres University, Centre National de la Recherche Scientifique, Paris 75005, France
- Alain de Cheveigné
- Laboratoire des Systèmes Perceptifs, Département d'Études Cognitives, École Normale Supérieure, Paris Sciences et Lettres University, Centre National de la Recherche Scientifique, Paris 75005, France
- University College London Ear Institute, University College London, London WC1X 8EE, United Kingdom
- Edmund C Lalor
- School of Engineering, Trinity Centre for Biomedical Engineering and Trinity College Institute of Neuroscience, Trinity College Dublin, Dublin 2, Ireland
- Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, New York 14627
13
Keitel A, Gross J, Kayser C. Shared and modality-specific brain regions that mediate auditory and visual word comprehension. eLife 2020; 9:e56972. PMID: 32831168; PMCID: PMC7470824; DOI: 10.7554/elife.56972.
Abstract
Visual speech carried by lip movements is an integral part of communication. Yet, it remains unclear to what extent visual and acoustic speech comprehension are mediated by the same brain regions. Using multivariate classification of full-brain MEG data, we first probed where the brain represents acoustically and visually conveyed word identities. We then tested where these sensory-driven representations are predictive of participants' trial-wise comprehension. The comprehension-relevant representations of auditory and visual speech converged only in anterior angular and inferior frontal regions and were spatially dissociated from those representations that best reflected the sensory-driven word identity. These results provide a neural explanation for the behavioural dissociation of acoustic and visual speech comprehension and suggest that cerebral representations encoding word identities may be more modality-specific than often assumed.
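The multivariate classification step can be illustrated with a generic cross-validated decoder; the sketch below uses simulated sensor patterns and word labels and is not the published analysis.

```python
# Hedged sketch: cross-validated multivariate classification of word identity from
# MEG sensor patterns, the kind of analysis used to ask where word identity is
# represented. Shapes, labels, and the classifier choice are illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)
n_trials, n_sensors, n_words = 240, 100, 4

word_labels = rng.integers(0, n_words, n_trials)
# Sensor pattern per trial: word-specific prototype plus noise.
prototypes = rng.standard_normal((n_words, n_sensors))
meg_patterns = prototypes[word_labels] + rng.standard_normal((n_trials, n_sensors))

clf = LinearDiscriminantAnalysis()
accuracy = cross_val_score(clf, meg_patterns, word_labels, cv=5).mean()
print(f"decoding accuracy: {accuracy:.2f} (chance = {1 / n_words:.2f})")
```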
Affiliation(s)
- Anne Keitel
- Psychology, University of Dundee, Dundee, United Kingdom
- Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
- Joachim Gross
- Institute of Neuroscience and Psychology, University of Glasgow, Glasgow, United Kingdom
- Institute for Biomagnetism and Biosignalanalysis, University of Münster, Münster, Germany
- Christoph Kayser
- Department for Cognitive Neuroscience, Faculty of Biology, Bielefeld University, Bielefeld, Germany
14
Plass J, Brang D, Suzuki S, Grabowecky M. Vision perceptually restores auditory spectral dynamics in speech. Proc Natl Acad Sci U S A 2020; 117:16920-16927. PMID: 32632010; PMCID: PMC7382243; DOI: 10.1073/pnas.2002887117.
Abstract
Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time-frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.
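The visual-to-acoustic correspondence at the heart of this account can be illustrated by regressing formant trajectories onto mouth-shape features; the sketch below does this on simulated data under assumed feature definitions, and is not the study's measurement pipeline.

```python
# Hedged sketch: predict formant (oral resonance) trajectories from visible
# mouth-shape features with a simple regression and score the prediction by the
# correlation between predicted and actual values. Data are simulated.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)
n_frames = 2000

# Mouth-shape features per video frame (e.g. lip aperture, width, area).
mouth_shape = rng.standard_normal((n_frames, 3))
# Formant frequencies (F1, F2) assumed here to covary with oral deformation.
formants = mouth_shape @ rng.standard_normal((3, 2)) + 0.5 * rng.standard_normal((n_frames, 2))

X_train, X_test, y_train, y_test = train_test_split(mouth_shape, formants, random_state=0)
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

for i, name in enumerate(["F1", "F2"]):
    r = np.corrcoef(pred[:, i], y_test[:, i])[0, 1]
    print(f"{name}: predicted-vs-actual r = {r:.2f}")
```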
Affiliation(s)
- John Plass
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- Department of Psychology, Northwestern University, Evanston, IL 60208
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- Satoru Suzuki
- Department of Psychology, Northwestern University, Evanston, IL 60208
- Interdepartmental Neuroscience Program, Northwestern University, Chicago, IL 60611
- Marcia Grabowecky
- Department of Psychology, Northwestern University, Evanston, IL 60208
- Interdepartmental Neuroscience Program, Northwestern University, Chicago, IL 60611