1
Maguinness C, Schall S, Mathias B, Schoemann M, von Kriegstein K. Prior multisensory learning can facilitate auditory-only voice-identity and speech recognition in noise. Q J Exp Psychol (Hove) 2024:17470218241278649. [PMID: 39164830 DOI: 10.1177/17470218241278649]
Abstract
Seeing the visual articulatory movements of a speaker, while hearing their voice, helps with understanding what is said. This multisensory enhancement is particularly evident in noisy listening conditions. Multisensory enhancement even occurs in auditory-only conditions: auditory-only speech and voice-identity recognition are superior for speakers previously learned with their face, compared to control learning; an effect termed the "face-benefit." Whether the face-benefit can assist in maintaining robust perception in increasingly noisy listening conditions, similar to concurrent multisensory input, is unknown. Here, in two behavioural experiments, we examined this question. In each experiment, participants learned a series of speakers' voices together with their dynamic face or control image. Following learning, participants listened to auditory-only sentences spoken by the same speakers and recognised the content of the sentences (speech recognition, Experiment 1) or the voice-identity of the speaker (Experiment 2) in increasing levels of auditory noise. For speech recognition, we observed that 14 of 30 participants (47%) showed a face-benefit; for voice-identity recognition, 19 of 25 participants (76%) did. For those participants who demonstrated a face-benefit, the face-benefit increased with auditory noise levels. Taken together, the results support an audio-visual model of auditory communication and suggest that the brain can develop a flexible system in which learned facial characteristics are used to deal with varying auditory uncertainty.
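The face-benefit reported here is, at its core, a per-participant difference score: recognition performance for face-learned speakers minus performance for control-learned speakers, examined at each noise level. A minimal sketch of how such a score could be computed (the data layout and column names are illustrative assumptions, not the authors' materials):

```python
import pandas as pd

# Hypothetical long-format data: one row per participant x learning condition x SNR,
# with mean recognition accuracy (proportion correct).
df = pd.DataFrame({
    "participant": [1, 1, 1, 1, 2, 2, 2, 2],
    "learning":    ["face", "control"] * 4,
    "snr_db":      [0, 0, -4, -4, 0, 0, -4, -4],
    "accuracy":    [0.82, 0.74, 0.61, 0.48, 0.79, 0.77, 0.55, 0.54],
})

# Face-benefit per participant and noise level:
# accuracy after face learning minus accuracy after control learning.
wide = df.pivot_table(index=["participant", "snr_db"],
                      columns="learning", values="accuracy")
wide["face_benefit"] = wide["face"] - wide["control"]
print(wide)

# A participant "shows a face-benefit" if the benefit averaged over SNRs is positive.
shows_benefit = wide.groupby(level="participant")["face_benefit"].mean() > 0
print(shows_benefit)
```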
Affiliation(s)
- Corrina Maguinness
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Sonja Schall
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Brian Mathias
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- School of Psychology, University of Aberdeen, Aberdeen, United Kingdom
- Martin Schoemann
- Chair of Psychological Methods and Cognitive Modelling, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- Katharina von Kriegstein
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, Dresden, Germany
- Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
2
Karthik G, Cao CZ, Demidenko MI, Jahn A, Stacey WC, Wasade VS, Brang D. Auditory cortex encodes lipreading information through spatially distributed activity. Curr Biol 2024; 34:4021-4032.e5. [PMID: 39153482 PMCID: PMC11387126 DOI: 10.1016/j.cub.2024.07.073]
Abstract
Watching a speaker's face improves speech perception accuracy. This benefit is enabled, in part, by implicit lipreading abilities present in the general population. While it is established that lipreading can alter the perception of a heard word, it is unknown how these visual signals are represented in the auditory system or how they interact with auditory speech representations. One influential, but untested, hypothesis is that visual speech modulates the population-coded representations of phonetic and phonemic features in the auditory system. This model is largely supported by data showing that silent lipreading evokes activity in the auditory cortex, but these activations could alternatively reflect general effects of arousal or attention or the encoding of non-linguistic features such as visual timing information. This gap limits our understanding of how vision supports speech perception. To test the hypothesis that the auditory system encodes visual speech information, we acquired functional magnetic resonance imaging (fMRI) data from healthy adults and intracranial recordings from electrodes implanted in patients with epilepsy during auditory and visual speech perception tasks. Across both datasets, linear classifiers successfully decoded the identity of silently lipread words using the spatial pattern of auditory cortex responses. Examining the time course of classification using intracranial recordings, lipread words were classified at earlier time points relative to heard words, suggesting a predictive mechanism for facilitating speech. These results support a model in which the auditory system combines the joint neural distributions evoked by heard and lipread words to generate a more precise estimate of what was said.
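The central analysis is a standard multivariate decoding step: a linear classifier is trained on the spatial pattern of auditory-cortex responses and asked to predict which word was silently lipread, with above-chance cross-validated accuracy taken as evidence that the information is present. A minimal, hypothetical sketch of that step in scikit-learn (the arrays are placeholders, not the authors' pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder data: one row per trial of a silently lipread word, one column per
# auditory-cortex voxel (fMRI) or electrode (intracranial recordings);
# y codes which word was lipread on each trial.
n_trials, n_features, n_words = 160, 120, 4
X = rng.normal(size=(n_trials, n_features))
y = rng.integers(0, n_words, size=n_trials)

# Linear classifier on the spatial response pattern, scored with cross-validation.
decoder = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
accuracy = cross_val_score(decoder, X, y, cv=5)
print(f"decoding accuracy: {accuracy.mean():.3f} (chance = {1 / n_words:.3f})")
```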
Affiliation(s)
- Ganesan Karthik
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- Cody Zhewei Cao
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- Andrew Jahn
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- William C Stacey
- Department of Neurology, University of Michigan, Ann Arbor, MI 48109, USA
- Vibhangini S Wasade
- Henry Ford Hospital, Detroit, MI 48202, USA; Department of Neurology, Wayne State University School of Medicine, Detroit, MI 48201, USA
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
3
Ahn E, Majumdar A, Lee TG, Brang D. Evidence for a Causal Dissociation of the McGurk Effect and Congruent Audiovisual Speech Perception via TMS to the Left pSTS. Multisens Res 2024; 37:341-363. [PMID: 39191410 PMCID: PMC11388023 DOI: 10.1163/22134808-bja10129]
Abstract
Congruent visual speech improves speech perception accuracy, particularly in noisy environments. Conversely, mismatched visual speech can alter what is heard, leading to an illusory percept that differs from the auditory and visual components, known as the McGurk effect. While prior transcranial magnetic stimulation (TMS) and neuroimaging studies have identified the left posterior superior temporal sulcus (pSTS) as a causal region involved in the generation of the McGurk effect, it remains unclear whether this region is critical only for this illusion or also for the more general benefits of congruent visual speech (e.g., increased accuracy and faster reaction times). Indeed, recent correlative research suggests that the benefits of congruent visual speech and the McGurk effect rely on largely independent mechanisms. To better understand how these different features of audiovisual integration are causally generated by the left pSTS, we used single-pulse TMS to temporarily disrupt processing within this region while subjects were presented with either congruent or incongruent (McGurk) audiovisual combinations. Consistent with past research, we observed that TMS to the left pSTS reduced the strength of the McGurk effect. Importantly, however, left pSTS stimulation had no effect on the positive benefits of congruent audiovisual speech (increased accuracy and faster reaction times), demonstrating a causal dissociation between the two processes. Our results are consistent with models proposing that the pSTS is but one of multiple critical areas supporting audiovisual speech interactions. Moreover, these data add to a growing body of evidence suggesting that the McGurk effect is an imperfect surrogate measure for more general and ecologically valid audiovisual speech behaviors.
Affiliation(s)
- EunSeon Ahn
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- Areti Majumdar
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- Taraz G Lee
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109, USA
4
Hong Y, Ryun S, Chung CK. Evoking artificial speech perception through invasive brain stimulation for brain-computer interfaces: current challenges and future perspectives. Front Neurosci 2024; 18:1428256. [PMID: 38988764 PMCID: PMC11234843 DOI: 10.3389/fnins.2024.1428256]
Abstract
Encoding artificial perceptions through brain stimulation, especially that of higher cognitive functions such as speech perception, is one of the most formidable challenges in brain-computer interfaces (BCI). Brain stimulation has been used for functional mapping in clinical practice for the last 70 years to treat various disorders affecting the nervous system, including epilepsy, Parkinson's disease, essential tremor, and dystonia. Recently, direct electrical stimulation has been used to evoke various forms of perception in humans, ranging from sensorimotor, auditory, and visual percepts to speech cognition. Successfully evoking and fine-tuning artificial perceptions could revolutionize communication for individuals with speech disorders and significantly enhance the capabilities of brain-computer interface technologies. However, despite the extensive literature on encoding various perceptions and the rising popularity of speech BCIs, inducing artificial speech perception is still largely unexplored, and its potential has yet to be determined. In this paper, we examine the various stimulation techniques used to evoke complex percepts and the target brain areas for the input of speech-like information. Finally, we discuss strategies to address the challenges of speech encoding and discuss the prospects of these approaches.
Affiliation(s)
- Yirye Hong
- Department of Brain and Cognitive Sciences, College of Natural Sciences, Seoul National University, Seoul, Republic of Korea
- Seokyun Ryun
- Neuroscience Research Institute, Seoul National University Medical Research Center, Seoul, Republic of Korea
- Chun Kee Chung
- Neuroscience Research Institute, Seoul National University Medical Research Center, Seoul, Republic of Korea
5
Sato M. Audiovisual speech asynchrony asymmetrically modulates neural binding. Neuropsychologia 2024; 198:108866. [PMID: 38518889 DOI: 10.1016/j.neuropsychologia.2024.108866]
Abstract
Previous psychophysical and neurophysiological studies in young healthy adults have provided evidence that audiovisual speech integration occurs with a large degree of temporal tolerance around true simultaneity. To further determine whether audiovisual speech asynchrony modulates auditory cortical processing and neural binding in young healthy adults, N1/P2 auditory evoked responses were compared using an additive model during a syllable categorization task, without or with an audiovisual asynchrony ranging from 240 ms visual lead to 240 ms auditory lead. Consistent with previous psychophysical findings, the observed results converge in favor of an asymmetric temporal integration window. Three main findings were observed: 1) predictive temporal and phonetic cues from pre-phonatory visual movements before the acoustic onset appeared essential for neural binding to occur, 2) audiovisual synchrony, with visual pre-phonatory movements predictive of the onset of the acoustic signal, was a prerequisite for N1 latency facilitation, and 3) P2 amplitude suppression and latency facilitation occurred even when visual pre-phonatory movements were not predictive of the acoustic onset but the syllable to come. Taken together, these findings help further clarify how audiovisual speech integration partly operates through two stages of visually-based temporal and phonetic predictions.
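The additive model mentioned here compares the audiovisual response against the sum of the unimodal responses; a reliable deviation of AV from A + V is taken as evidence of neural binding, and N1/P2 peaks are then measured in conventional latency windows. A minimal sketch of that comparison (array shapes, windows, and sampling rate are illustrative assumptions):

```python
import numpy as np

fs = 500                                  # sampling rate (Hz), assumed
t = np.arange(-0.1, 0.5, 1 / fs)          # epoch time axis (s)
n_channels = 32
rng = np.random.default_rng(1)

# Placeholder grand-average evoked responses (channels x time) for the
# auditory-only (A), visual-only (V) and audiovisual (AV) conditions.
erp_a = rng.normal(scale=1e-6, size=(n_channels, t.size))
erp_v = rng.normal(scale=1e-6, size=(n_channels, t.size))
erp_av = rng.normal(scale=1e-6, size=(n_channels, t.size))

# Additive criterion: neural binding is inferred when AV differs from A + V.
interaction = erp_av - (erp_a + erp_v)

def peak_in_window(erp, tmin, tmax, sign):
    """Latency (s) and amplitude of the most extreme deflection in a window."""
    mask = (t >= tmin) & (t <= tmax)
    trace = erp.mean(axis=0)[mask]        # average over channels
    idx = np.argmax(sign * trace)
    return t[mask][idx], trace[idx]

# N1 (negative, ~70-150 ms) and P2 (positive, ~150-250 ms) measures on the
# AV - (A + V) difference; the same windows can be applied per condition.
print("N1:", peak_in_window(interaction, 0.07, 0.15, sign=-1))
print("P2:", peak_in_window(interaction, 0.15, 0.25, sign=+1))
```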
Affiliation(s)
- Marc Sato
- Laboratoire Parole et Langage, Centre National de la Recherche Scientifique, Aix-Marseille Université, Aix-en-Provence, France.
6
Wang K, Fang Y, Guo Q, Shen L, Chen Q. Superior Attentional Efficiency of Auditory Cue via the Ventral Auditory-thalamic Pathway. J Cogn Neurosci 2024; 36:303-326. [PMID: 38010315 DOI: 10.1162/jocn_a_02090]
Abstract
Auditory commands are often executed more efficiently than visual commands. However, empirical evidence on the underlying behavioral and neural mechanisms remains scarce. In two experiments, we manipulated the delivery modality of informative cues and the prediction violation effect and found consistently enhanced RT benefits for the matched auditory cues compared with the matched visual cues. At the neural level, when the bottom-up perceptual input matched the prior prediction induced by the auditory cue, the auditory-thalamic pathway was significantly activated. Moreover, the stronger the auditory-thalamic connectivity, the higher the behavioral benefits of the matched auditory cue. When the bottom-up input violated the prior prediction induced by the auditory cue, the ventral auditory pathway was specifically involved. Moreover, the stronger the ventral auditory-prefrontal connectivity, the larger the behavioral costs caused by the violation of the auditory cue. In addition, the dorsal frontoparietal network showed a supramodal function in reacting to the violation of informative cues irrespective of the delivery modality of the cue. Taken together, the results reveal novel behavioral and neural evidence that the superior efficiency of the auditory cue is twofold: The auditory-thalamic pathway is associated with improvements in task performance when the bottom-up input matches the auditory cue, whereas the ventral auditory-prefrontal pathway is involved when the auditory cue is violated.
Affiliation(s)
- Ke Wang
- South China Normal University, Guangzhou, China
- Ying Fang
- South China Normal University, Guangzhou, China
- Qiang Guo
- Guangdong Sanjiu Brain Hospital, Guangzhou, China
- Lu Shen
- South China Normal University, Guangzhou, China
- Qi Chen
- South China Normal University, Guangzhou, China
7
Le Rhun L, Llorach G, Delmas T, Suied C, Arnal LH, Lazard DS. A standardised test to evaluate audio-visual speech intelligibility in French. Heliyon 2024; 10:e24750. [PMID: 38312568 PMCID: PMC10835303 DOI: 10.1016/j.heliyon.2024.e24750]
Abstract
Objective: Lipreading, which plays a major role in the communication of the hearing impaired, lacked a French standardised tool. Our aim was to create and validate an audio-visual (AV) version of the French Matrix Sentence Test (FrMST). Design: Video recordings were created by dubbing the existing audio files. Sample: Thirty-five young, normal-hearing participants were tested in auditory and visual modalities alone (Ao, Vo) and in AV conditions, in quiet, noise, and open- and closed-set response formats. Results: Lipreading ability (Vo) ranged from 1% to 77% word comprehension. The absolute AV benefit was 9.25 dB SPL in quiet and 4.6 dB SNR in noise. The response format did not influence the results in the AV noise condition, except during the training phase. Lipreading ability and AV benefit were significantly correlated. Conclusions: The French video material achieved AV benefits similar to those described in the literature for AV MSTs in other languages. For clinical purposes, we suggest targeting SRT80 to avoid ceiling effects, and performing two training lists in the AV condition in noise, followed by one AV list in noise, one Ao list in noise and one Vo list, in a randomised order, in an open- or closed-set format.
Affiliation(s)
- Loïc Le Rhun
- Institut Pasteur, Université Paris Cité, Inserm UA06, Institut de l’Audition, Paris, France
- Gerard Llorach
- Auditory Signal Processing, Dept. of Medical Physics and Acoustics, University of Oldenburg, Oldenburg, Germany
- Tanguy Delmas
- Institut Pasteur, Université Paris Cité, Inserm UA06, Institut de l’Audition, Paris, France
- ECLEAR, Audition Lefeuvre – Audition Marc Boulet, Athis-Mons, France
- Clara Suied
- Institut de Recherche Biomédicale des Armées, Département Neurosciences et Sciences Cognitives, Brétigny-sur-Orge, France
- Luc H. Arnal
- Institut Pasteur, Université Paris Cité, Inserm UA06, Institut de l’Audition, Paris, France
- Diane S. Lazard
- Institut Pasteur, Université Paris Cité, Inserm UA06, Institut de l’Audition, Paris, France
- Princess Grace Hospital, ENT & Maxillo-facial Surgery Department, Monaco
- Institut Arthur Vernes, ENT Surgery Department, Paris, France
8
Sato M. Competing influence of visual speech on auditory neural adaptation. Brain Lang 2023; 247:105359. [PMID: 37951157 DOI: 10.1016/j.bandl.2023.105359]
Abstract
Visual information from a speaker's face enhances auditory neural processing and speech recognition. To determine whether auditory memory can be influenced by visual speech, the degree of auditory neural adaptation of an auditory syllable preceded by an auditory, visual, or audiovisual syllable was examined using EEG. Consistent with previous findings and additional adaptation of auditory neurons tuned to acoustic features, stronger adaptation of N1, P2 and N2 auditory evoked responses was observed when the auditory syllable was preceded by an auditory compared to a visual syllable. However, adaptation was weaker when the auditory syllable was preceded by an audiovisual syllable than by an auditory syllable, although it remained stronger than when preceded by a visual syllable. In addition, longer N1 and P2 latencies were then observed. These results further demonstrate that visual speech acts on auditory memory but suggest competing visual influences in the case of audiovisual stimulation.
Affiliation(s)
- Marc Sato
- Laboratoire Parole et Langage, Centre National de la Recherche Scientifique, UMR 7309 CNRS & Aix-Marseille Université, 5 avenue Pasteur, Aix-en-Provence, France.
9
Ahn E, Majumdar A, Lee T, Brang D. Evidence for a Causal Dissociation of the McGurk Effect and Congruent Audiovisual Speech Perception via TMS. bioRxiv [Preprint] 2023:2023.11.27.568892. [PMID: 38077093 PMCID: PMC10705272 DOI: 10.1101/2023.11.27.568892]
Abstract
Congruent visual speech improves speech perception accuracy, particularly in noisy environments. Conversely, mismatched visual speech can alter what is heard, leading to an illusory percept known as the McGurk effect. This illusion has been widely used to study audiovisual speech integration, illustrating that auditory and visual cues are combined in the brain to generate a single coherent percept. While prior transcranial magnetic stimulation (TMS) and neuroimaging studies have identified the left posterior superior temporal sulcus (pSTS) as a causal region involved in the generation of the McGurk effect, it remains unclear whether this region is critical only for this illusion or also for the more general benefits of congruent visual speech (e.g., increased accuracy and faster reaction times). Indeed, recent correlative research suggests that the benefits of congruent visual speech and the McGurk effect reflect largely independent mechanisms. To better understand how these different features of audiovisual integration are causally generated by the left pSTS, we used single-pulse TMS to temporarily impair processing while subjects were presented with either incongruent (McGurk) or congruent audiovisual combinations. Consistent with past research, we observed that TMS to the left pSTS significantly reduced the strength of the McGurk effect. Importantly, however, left pSTS stimulation did not affect the positive benefits of congruent audiovisual speech (increased accuracy and faster reaction times), demonstrating a causal dissociation between the two processes. Our results are consistent with models proposing that the pSTS is but one of multiple critical areas supporting audiovisual speech interactions. Moreover, these data add to a growing body of evidence suggesting that the McGurk effect is an imperfect surrogate measure for more general and ecologically valid audiovisual speech behaviors.
Affiliation(s)
- EunSeon Ahn
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- Areti Majumdar
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- Taraz Lee
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
- David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
10
Nidiffer AR, Cao CZ, O'Sullivan A, Lalor EC. A representation of abstract linguistic categories in the visual system underlies successful lipreading. Neuroimage 2023; 282:120391. [PMID: 37757989 DOI: 10.1016/j.neuroimage.2023.120391]
Abstract
There is considerable debate over how visual speech is processed in the absence of sound and whether neural activity supporting lipreading occurs in visual brain areas. Much of the ambiguity stems from a lack of behavioral grounding and neurophysiological analyses that cannot disentangle high-level linguistic and phonetic/energetic contributions from visual speech. To address this, we recorded EEG from human observers as they watched silent videos, half of which were novel and half of which were previously rehearsed with the accompanying audio. We modeled how the EEG responses to novel and rehearsed silent speech reflected the processing of low-level visual features (motion, lip movements) and a higher-level categorical representation of linguistic units, known as visemes. The ability of these visemes to account for the EEG - beyond the motion and lip movements - was significantly enhanced for rehearsed videos in a way that correlated with participants' trial-by-trial ability to lipread that speech. Source localization of viseme processing showed clear contributions from visual cortex, with no strong evidence for the involvement of auditory areas. We interpret this as support for the idea that the visual system produces its own specialized representation of speech that is (1) well-described by categorical linguistic features, (2) dissociable from lip movements, and (3) predictive of lipreading ability. We also suggest a reinterpretation of previous findings of auditory cortical activation during silent speech that is consistent with hierarchical accounts of visual and audiovisual speech perception.
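The modelling step described here is a forward (encoding) model: EEG is regressed onto time-lagged stimulus features, and the added predictive value of viseme features is assessed over and above motion and lip-movement features. A minimal sketch of that comparison using lagged ridge regression (feature construction, dimensions, and regularisation are assumptions for illustration, not the authors' code):

```python
import numpy as np
from numpy.linalg import solve

def lag_matrix(features, max_lag):
    """Stack time-lagged copies of the stimulus features (time x features*lags)."""
    n_t, n_f = features.shape
    lagged = np.zeros((n_t, n_f * max_lag))
    for lag in range(max_lag):
        lagged[lag:, lag * n_f:(lag + 1) * n_f] = features[:n_t - lag]
    return lagged

def fit_forward_model(stim, eeg, max_lag=30, alpha=1e2):
    """Ridge-regularised mapping from lagged stimulus features to EEG."""
    X = lag_matrix(stim, max_lag)
    return solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ eeg)

def prediction_r(stim, eeg, weights, max_lag=30):
    """Mean correlation between predicted and recorded EEG across channels."""
    pred = lag_matrix(stim, max_lag) @ weights
    return np.mean([np.corrcoef(pred[:, c], eeg[:, c])[0, 1]
                    for c in range(eeg.shape[1])])

rng = np.random.default_rng(2)
n_time, n_chan = 5000, 64
eeg = rng.normal(size=(n_time, n_chan))       # placeholder EEG (time x channels)
motion_lip = rng.normal(size=(n_time, 2))     # low-level features: motion, lip aperture
visemes = rng.normal(size=(n_time, 10))       # categorical viseme features

# Does adding viseme features improve the model beyond motion and lip movements?
# (In practice this comparison is cross-validated; here both models are fit and
# scored on the same data purely to illustrate the structure of the analysis.)
w_low = fit_forward_model(motion_lip, eeg)
w_full = fit_forward_model(np.hstack([motion_lip, visemes]), eeg)
print("low-level model r:", prediction_r(motion_lip, eeg, w_low))
print("full model r:     ", prediction_r(np.hstack([motion_lip, visemes]), eeg, w_full))
```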
Affiliation(s)
- Aaron R Nidiffer
- Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Cody Zhewei Cao
- Department of Psychology, University of Michigan, Ann Arbor, MI, USA
- Aisling O'Sullivan
- School of Engineering, Trinity College Institute of Neuroscience, Trinity Centre for Biomedical Engineering, Trinity College, Dublin, Ireland
- Edmund C Lalor
- Department of Biomedical Engineering, Department of Neuroscience, Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA; School of Engineering, Trinity College Institute of Neuroscience, Trinity Centre for Biomedical Engineering, Trinity College, Dublin, Ireland.
11
Jeschke L, Mathias B, von Kriegstein K. Inhibitory TMS over Visual Area V5/MT Disrupts Visual Speech Recognition. J Neurosci 2023; 43:7690-7699. [PMID: 37848284 PMCID: PMC10634547 DOI: 10.1523/jneurosci.0975-23.2023]
Abstract
During face-to-face communication, the perception and recognition of facial movements can facilitate individuals' understanding of what is said. Facial movements are a form of complex biological motion. Separate neural pathways are thought to process (1) simple, nonbiological motion with an obligatory waypoint in the motion-sensitive visual middle temporal area (V5/MT); and (2) complex biological motion. Here, we present findings that challenge this dichotomy. Neuronavigated offline transcranial magnetic stimulation (TMS) over V5/MT on 24 participants (17 females and 7 males) led to increased response times in the recognition of simple, nonbiological motion as well as in visual speech recognition compared with TMS over the vertex, an active control region. TMS of area V5/MT also reduced practice effects on response times that are typically observed in both visual speech and motion recognition tasks over time. Our findings provide the first indication that area V5/MT causally influences the recognition of visual speech. SIGNIFICANCE STATEMENT: In everyday face-to-face communication, speech comprehension is often facilitated by viewing a speaker's facial movements. Several brain areas contribute to the recognition of visual speech. One area of interest is the motion-sensitive visual middle temporal area (V5/MT), which has been associated with the perception of simple, nonbiological motion such as moving dots, as well as more complex, biological motion such as visual speech. Here, we demonstrate using noninvasive brain stimulation that area V5/MT is causally relevant in recognizing visual speech. This finding provides new insights into the neural mechanisms that support the perception of human communication signals, which will help guide future research in typically developed individuals and populations with communication difficulties.
Affiliation(s)
- Lisa Jeschke
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, 01069 Dresden, Germany
- Brian Mathias
- School of Psychology, University of Aberdeen, Aberdeen AB24 3FX, United Kingdom
- Katharina von Kriegstein
- Chair of Cognitive and Clinical Neuroscience, Faculty of Psychology, Technische Universität Dresden, 01069 Dresden, Germany
12
Tan SHJ, Kalashnikova M, Di Liberto GM, Crosse MJ, Burnham D. Seeing a Talking Face Matters: Gaze Behavior and the Auditory-Visual Speech Benefit in Adults' Cortical Tracking of Infant-directed Speech. J Cogn Neurosci 2023; 35:1741-1759. [PMID: 37677057 DOI: 10.1162/jocn_a_02044]
Abstract
In face-to-face conversations, listeners gather visual speech information from a speaker's talking face that enhances their perception of the incoming auditory speech signal. This auditory-visual (AV) speech benefit is evident even in quiet environments but is stronger in situations that require greater listening effort such as when the speech signal itself deviates from listeners' expectations. One example is infant-directed speech (IDS) presented to adults. IDS has exaggerated acoustic properties that are easily discriminable from adult-directed speech (ADS). Although IDS is a speech register that adults typically use with infants, no previous neurophysiological study has directly examined whether adult listeners process IDS differently from ADS. To address this, the current study simultaneously recorded EEG and eye-tracking data from adult participants as they were presented with auditory-only (AO), visual-only, and AV recordings of IDS and ADS. Eye-tracking data were recorded because looking behavior to the speaker's eyes and mouth modulates the extent of AV speech benefit experienced. Analyses of cortical tracking accuracy revealed that cortical tracking of the speech envelope was significant in AO and AV modalities for IDS and ADS. However, the AV speech benefit [i.e., AV > (A + V)] was only present for IDS trials. Gaze behavior analyses indicated differences in looking behavior during IDS and ADS trials. Surprisingly, looking behavior to the speaker's eyes and mouth was not correlated with cortical tracking accuracy. Additional exploratory analyses indicated that attention to the whole display was negatively correlated with cortical tracking accuracy of AO and visual-only trials in IDS. Our results underscore the nuances involved in the relationship between neurophysiological AV speech benefit and looking behavior.
Affiliation(s)
- Sok Hui Jessica Tan
- The MARCS Institute of Brain, Behaviour and Development, Western Sydney University, Australia
- Science of Learning in Education Centre, Office of Education Research, National Institute of Education, Nanyang Technological University, Singapore
- Marina Kalashnikova
- The Basque Center on Cognition, Brain and Language
- IKERBASQUE, Basque Foundation for Science
- Giovanni M Di Liberto
- ADAPT Centre, School of Computer Science and Statistics, Trinity College Institute of Neuroscience, Trinity College, The University of Dublin, Ireland
- Michael J Crosse
- SEGOTIA, Galway, Ireland
- Trinity Center for Biomedical Engineering, Department of Mechanical, Manufacturing & Biomedical Engineering, Trinity College Dublin, Dublin, Ireland
- Denis Burnham
- The MARCS Institute of Brain, Behaviour and Development, Western Sydney University, Australia
13
Jiang Z, An X, Liu S, Yin E, Yan Y, Ming D. Neural oscillations reflect the individual differences in the temporal perception of audiovisual speech. Cereb Cortex 2023; 33:10575-10583. [PMID: 37727958 DOI: 10.1093/cercor/bhad304]
Abstract
Multisensory integration occurs within a limited time interval between multimodal stimuli. Multisensory temporal perception varies widely among individuals and involves perceptual synchrony and temporal sensitivity processes. Previous studies have explored the neural mechanisms of these individual differences for beep-flash stimuli, but not for speech. In this study, 28 subjects (16 male) performed an audiovisual speech /ba/ simultaneity judgment task while their electroencephalogram (EEG) was recorded. We examined the relationship between prestimulus neural oscillations (i.e. the pre-pronunciation movement-related oscillations) and temporal perception. Perceptual synchrony was quantified using the Point of Subjective Simultaneity and temporal sensitivity using the Temporal Binding Window. Our results revealed dissociated neural mechanisms for individual differences in the Temporal Binding Window and the Point of Subjective Simultaneity. Frontocentral delta power, reflecting top-down attention control, was positively related to the magnitude of individual auditory-leading Temporal Binding Windows (LTBWs), whereas parieto-occipital theta power, indexing bottom-up visual temporal attention specific to speech, was negatively associated with the magnitude of individual visual-leading Temporal Binding Windows (RTBWs). In addition, increased left frontal and bilateral temporoparietal occipital alpha power, reflecting general attentional states, was associated with increased Points of Subjective Simultaneity. Strengthening attention abilities might improve the audiovisual temporal perception of speech and further impact speech integration.
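The two individual-difference measures used here, the Point of Subjective Simultaneity (PSS) and the Temporal Binding Window (TBW), are typically derived by fitting a Gaussian-shaped psychometric function to the proportion of "simultaneous" responses across stimulus onset asynchronies. A minimal sketch of that fit (the SOA grid, response proportions, and criterion are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical simultaneity-judgment data: stimulus onset asynchronies (ms;
# negative = auditory lead, positive = visual lead) and the proportion of
# "simultaneous" responses at each SOA.
soa = np.array([-400, -300, -200, -100, 0, 100, 200, 300, 400], dtype=float)
p_simult = np.array([0.05, 0.15, 0.45, 0.80, 0.95, 0.90, 0.70, 0.35, 0.10])

def gaussian(x, amp, mu, sigma):
    return amp * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

(amp, mu, sigma), _ = curve_fit(gaussian, soa, p_simult, p0=[1.0, 0.0, 150.0])

# Point of Subjective Simultaneity: the SOA where "simultaneous" responses peak.
pss = mu
# One common Temporal Binding Window definition: the width over which the fitted
# curve stays above a criterion (here 75% of its peak).
half_width = abs(sigma) * np.sqrt(-2 * np.log(0.75))
print(f"PSS = {pss:.1f} ms, TBW (width at 75% of peak) = {2 * half_width:.1f} ms")
# Asymmetric fits are typically used so that the auditory-leading and
# visual-leading halves of the window can be estimated separately.
```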
Affiliation(s)
- Zeliang Jiang
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Xingwei An
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Shuang Liu
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Erwei Yin
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), 100071 Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), 300457 Tianjin, China
- Ye Yan
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
- Defense Innovation Institute, Academy of Military Sciences (AMS), 100071 Beijing, China
- Tianjin Artificial Intelligence Innovation Center (TAIIC), 300457 Tianjin, China
- Dong Ming
- Academy of Medical Engineering and Translational Medicine, Tianjin University, 300072 Tianjin, China
14
Hansmann D, Derrick D, Theys C. Hearing, seeing, and feeling speech: the neurophysiological correlates of trimodal speech perception. Front Hum Neurosci 2023; 17:1225976. [PMID: 37706173 PMCID: PMC10495990 DOI: 10.3389/fnhum.2023.1225976]
Abstract
Introduction: To perceive speech, our brains process information from different sensory modalities. Previous electroencephalography (EEG) research has established that audio-visual information provides an advantage compared to auditory-only information during early auditory processing. In addition, behavioral research showed that auditory speech perception is not only enhanced by visual information but also by tactile information, transmitted by puffs of air arriving at the skin and aligned with speech. The current EEG study aimed to investigate whether the behavioral benefits of bimodal audio-aerotactile and trimodal audio-visual-aerotactile speech presentation are reflected in cortical auditory event-related neurophysiological responses. Methods: To examine the influence of multimodal information on speech perception, 20 listeners completed a two-alternative forced-choice syllable identification task at three different signal-to-noise ratios. Results: Behavioral results showed increased syllable identification accuracy when auditory information was complemented with visual information, but did not show the same effect for the addition of tactile information. Similarly, EEG results showed an amplitude suppression for the auditory N1 and P2 event-related potentials for the audio-visual and audio-visual-aerotactile modalities compared to auditory and audio-aerotactile presentations of the syllable /pa/. No statistically significant difference was present between audio-aerotactile and auditory-only modalities. Discussion: Current findings are consistent with past EEG research showing a visually induced amplitude suppression during early auditory processing. In addition, the significant neurophysiological effect of audio-visual but not audio-aerotactile presentation is in line with the large benefit of visual information but comparatively much smaller effect of aerotactile information on auditory speech perception previously identified in behavioral research.
Affiliation(s)
- Doreen Hansmann
- School of Psychology, Speech and Hearing, University of Canterbury, Christchurch, New Zealand
- Donald Derrick
- New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand
- Catherine Theys
- School of Psychology, Speech and Hearing, University of Canterbury, Christchurch, New Zealand
- New Zealand Institute of Language, Brain and Behaviour, University of Canterbury, Christchurch, New Zealand
15
Harwood V, Baron A, Kleinman D, Campanelli L, Irwin J, Landi N. Event-Related Potentials in Assessing Visual Speech Cues in the Broader Autism Phenotype: Evidence from a Phonemic Restoration Paradigm. Brain Sci 2023; 13:1011. [PMID: 37508944 PMCID: PMC10377560 DOI: 10.3390/brainsci13071011]
Abstract
Audiovisual speech perception includes the simultaneous processing of auditory and visual speech. Deficits in audiovisual speech perception are reported in autistic individuals; however, less is known regarding audiovisual speech perception within the broader autism phenotype (BAP), which includes individuals with elevated, yet subclinical, levels of autistic traits. We investigate the neural indices of audiovisual speech perception in adults exhibiting a range of autism-like traits using event-related potentials (ERPs) in a phonemic restoration paradigm. In this paradigm, we consider conditions where the speech articulators (mouth and jaw) are either visible (AV condition) or obscured by a pixelated mask (PX condition). These two face conditions were included in both passive (simply viewing a speaking face) and active (participants were required to press a button for a specific consonant-vowel stimulus) experiments. The results revealed an N100 ERP component which was present for all listening contexts and conditions; however, it was attenuated in the active AV condition, where participants were able to view the speaker's face, including the mouth and jaw. The P300 ERP component was present within the active experiment only, and significantly greater within the AV condition compared to the PX condition. This suggests increased neural effort for detecting deviant stimuli when visible articulation was present, and an influence of visual information on perception. Finally, the P300 response was negatively correlated with autism-like traits, suggesting that higher autistic traits were associated with generally smaller P300 responses in the active AV and PX conditions. Together, these results suggest that atypical audiovisual processing may be characteristic of the BAP in adults.
Affiliation(s)
- Vanessa Harwood
- Department of Communicative Disorders, University of Rhode Island, Kingston, RI 02881, USA
- Alisa Baron
- Department of Communicative Disorders, University of Rhode Island, Kingston, RI 02881, USA
- Luca Campanelli
- Department of Communicative Disorders, University of Alabama, Tuscaloosa, AL 35487, USA
- Julia Irwin
- Haskins Laboratories, New Haven, CT 06519, USA
- Department of Psychology, Southern Connecticut State University, New Haven, CT 06515, USA
- Nicole Landi
- Haskins Laboratories, New Haven, CT 06519, USA
- Department of Psychological Sciences, University of Connecticut, Storrs, CT 06269, USA
16
Shi L, Liu C, Peng X, Cao Y, Levy DA, Xue G. The neural representations underlying asymmetric cross-modal prediction of words. Hum Brain Mapp 2023; 44:2418-2435. [PMID: 36715307 PMCID: PMC10028649 DOI: 10.1002/hbm.26219]
Abstract
Cross-modal prediction serves a crucial adaptive role in the multisensory world, yet the neural mechanisms underlying this prediction are poorly understood. The present study addressed this important question by combining a novel audiovisual sequence memory task, functional magnetic resonance imaging (fMRI), and multivariate neural representational analyses. Our behavioral results revealed a reliable asymmetric cross-modal predictive effect, with a stronger prediction from visual to auditory (VA) modality than auditory to visual (AV) modality. Mirroring the behavioral pattern, we found the superior parietal lobe (SPL) showed higher pattern similarity for VA than AV pairs, and the strength of the predictive coding in the SPL was positively correlated with the behavioral predictive effect in the VA condition. Representational connectivity analyses further revealed that the SPL mediated the neural pathway from the visual to the auditory cortex in the VA condition but was not involved in the auditory to visual cortex pathway in the AV condition. Direct neural pathways within the unimodal regions were found for the visual-to-visual and auditory-to-auditory predictions. Together, these results provide novel insights into the neural mechanisms underlying cross-modal sequence prediction.
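Pattern similarity in this context means correlating the multivoxel response patterns evoked by the two members of a cross-modal pair, separately for visual-to-auditory (VA) and auditory-to-visual (AV) pairs, and then testing the VA-versus-AV asymmetry. A minimal sketch of that comparison (pair counts, voxel counts, and the region of interest are illustrative assumptions, not the authors' pipeline):

```python
import numpy as np
from scipy.stats import pearsonr, ttest_rel

rng = np.random.default_rng(3)
n_pairs, n_voxels = 24, 200

# Placeholder ROI patterns (e.g., superior parietal lobe): one pattern per item.
cue_va, target_va = rng.normal(size=(2, n_pairs, n_voxels))   # visual cue -> auditory target
cue_av, target_av = rng.normal(size=(2, n_pairs, n_voxels))   # auditory cue -> visual target

def pair_similarity(cues, targets):
    """Pearson correlation between cue and target patterns, per pair."""
    return np.array([pearsonr(c, t)[0] for c, t in zip(cues, targets)])

sim_va = pair_similarity(cue_va, target_va)
sim_av = pair_similarity(cue_av, target_av)

# The asymmetry of interest: higher similarity for VA than for AV pairs.
t, p = ttest_rel(sim_va, sim_av)
print(f"mean r (VA) = {sim_va.mean():.3f}, mean r (AV) = {sim_av.mean():.3f}, "
      f"t = {t:.2f}, p = {p:.3f}")
```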
Affiliation(s)
- Liang Shi
- State Key Laboratory of Cognitive Neuroscience and Learning and IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, People's Republic of China
- Chuqi Liu
- State Key Laboratory of Cognitive Neuroscience and Learning and IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, People's Republic of China
- Xiaojing Peng
- State Key Laboratory of Cognitive Neuroscience and Learning and IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, People's Republic of China
- Yifei Cao
- State Key Laboratory of Cognitive Neuroscience and Learning and IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, People's Republic of China
- Daniel A Levy
- Baruch Ivcher School of Psychology, Interdisciplinary Center Herzliya, Herzliya, Israel
- Gui Xue
- State Key Laboratory of Cognitive Neuroscience and Learning and IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, People's Republic of China
17
Chalas N, Omigie D, Poeppel D, van Wassenhove V. Hierarchically nested networks optimize the analysis of audiovisual speech. iScience 2023; 26:106257. [PMID: 36909667 PMCID: PMC9993032 DOI: 10.1016/j.isci.2023.106257]
Abstract
In conversational settings, seeing the speaker's face elicits internal predictions about the upcoming acoustic utterance. Understanding how the listener's cortical dynamics tune to the temporal statistics of audiovisual (AV) speech is thus essential. Using magnetoencephalography, we explored how large-scale frequency-specific dynamics of human brain activity adapt to AV speech delays. First, we show that the amplitude of phase-locked responses parametrically decreases with natural AV speech synchrony, a pattern that is consistent with predictive coding. Second, we show that the temporal statistics of AV speech affect large-scale oscillatory networks at multiple spatial and temporal resolutions. We demonstrate a spatial nestedness of oscillatory networks during the processing of AV speech: these oscillatory hierarchies are such that high-frequency activity (beta, gamma) is contingent on the phase response of low-frequency (delta, theta) networks. Our findings suggest that the endogenous temporal multiplexing of speech processing confers adaptability within the temporal regimes that are essential for speech comprehension.
Affiliation(s)
- Nikos Chalas
- Institute for Biomagnetism and Biosignal Analysis, University of Münster, 48149 Münster, Germany
- CEA, DRF/Joliot, NeuroSpin, INSERM, Cognitive Neuroimaging Unit; CNRS; Université Paris-Saclay, 91191 Gif/Yvette, France
- School of Biology, Faculty of Sciences, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
- Corresponding author
- Diana Omigie
- Department of Psychology, Goldsmiths University London, London, UK
- David Poeppel
- Department of Psychology, New York University, New York, NY 10003, USA
- Ernst Struengmann Institute for Neuroscience, 60528 Frankfurt am Main, Germany
- Virginie van Wassenhove
- CEA, DRF/Joliot, NeuroSpin, INSERM, Cognitive Neuroimaging Unit; CNRS; Université Paris-Saclay, 91191 Gif/Yvette, France
- Corresponding author
18
Pourhashemi F, Baart M, van Laarhoven T, Vroomen J. Want to quickly adapt to distorted speech and become a better listener? Read lips, not text. PLoS One 2022; 17:e0278986. [PMID: 36580461 PMCID: PMC9799298 DOI: 10.1371/journal.pone.0278986]
Abstract
When listening to distorted speech, does one become a better listener by looking at the face of the speaker or by reading subtitles that are presented along with the speech signal? We examined this question in two experiments in which we presented participants with spectrally distorted speech (4-channel noise-vocoded speech). During short training sessions, listeners received auditorily distorted words or pseudowords that were partially disambiguated by concurrently presented lipread information or text. After each training session, listeners were tested with new degraded auditory words. Learning effects (based on proportions of correctly identified words) were stronger if listeners had trained with words rather than with pseudowords (a lexical boost), and adding lipread information during training was more effective than adding text (a lipread boost). Moreover, the advantage of lipread speech over text training was also found when participants were tested more than a month later. The current results thus suggest that lipread speech may have surprisingly long-lasting effects on adaptation to distorted speech.
Affiliation(s)
- Faezeh Pourhashemi
- Dept. of Cognitive Neuropsychology, Tilburg University, Tilburg, The Netherlands
- Martijn Baart
- Dept. of Cognitive Neuropsychology, Tilburg University, Tilburg, The Netherlands
- BCBL, Basque Center on Cognition, Brain, and Language, Donostia, Spain
- Thijs van Laarhoven
- Dept. of Cognitive Neuropsychology, Tilburg University, Tilburg, The Netherlands
- Jean Vroomen
- Dept. of Cognitive Neuropsychology, Tilburg University, Tilburg, The Netherlands
19
Sato M. The timing of visual speech modulates auditory neural processing. Brain Lang 2022; 235:105196. [PMID: 36343508 DOI: 10.1016/j.bandl.2022.105196]
Abstract
In face-to-face communication, visual information from a speaker's face and time-varying kinematics of articulatory movements have been shown to fine-tune auditory neural processing and improve speech recognition. To further determine whether the timing of visual gestures modulates auditory cortical processing, three sets of syllables only differing in the onset and duration of silent prephonatory movements, before the acoustic speech signal, were contrasted using EEG. Despite similar visual recognition rates, an increase in the amplitude of P2 auditory evoked responses was observed from the longest to the shortest movements. Taken together, these results clarify how audiovisual speech perception partly operates through visually-based predictions and related processing time, with acoustic-phonetic neural processing paralleling the timing of visual prephonatory gestures.
Affiliation(s)
- Marc Sato
- Laboratoire Parole et Langage, Centre National de la Recherche Scientifique, Aix-Marseille Université, Aix-en-Provence, France.
20
Van Engen KJ, Dey A, Sommers MS, Peelle JE. Audiovisual speech perception: Moving beyond McGurk. J Acoust Soc Am 2022; 152:3216. [PMID: 36586857 PMCID: PMC9894660 DOI: 10.1121/10.0015262]
Abstract
Although it is clear that sighted listeners use both auditory and visual cues during speech perception, the manner in which multisensory information is combined is a matter of debate. One approach to measuring multisensory integration is to use variants of the McGurk illusion, in which discrepant auditory and visual cues produce auditory percepts that differ from those based on unimodal input. Not all listeners show the same degree of susceptibility to the McGurk illusion, and these individual differences are frequently used as a measure of audiovisual integration ability. However, despite their popularity, we join the voices of others in the field to argue that McGurk tasks are ill-suited for studying real-life multisensory speech perception: McGurk stimuli are often based on isolated syllables (which are rare in conversations) and necessarily rely on audiovisual incongruence that does not occur naturally. Furthermore, recent data show that susceptibility to McGurk tasks does not correlate with performance during natural audiovisual speech perception. Although the McGurk effect is a fascinating illusion, truly understanding the combined use of auditory and visual information during speech perception requires tasks that more closely resemble everyday communication: namely, words, sentences, and narratives with congruent auditory and visual speech cues.
Affiliation(s)
- Kristin J Van Engen
- Department of Psychological and Brain Sciences, Washington University, St. Louis, Missouri 63130, USA
- Avanti Dey
- PLOS ONE, 1265 Battery Street, San Francisco, California 94111, USA
- Mitchell S Sommers
- Department of Psychological and Brain Sciences, Washington University, St. Louis, Missouri 63130, USA
- Jonathan E Peelle
- Department of Otolaryngology, Washington University, St. Louis, Missouri 63130, USA
21
Neurodevelopmental oscillatory basis of speech processing in noise. Dev Cogn Neurosci 2022; 59:101181. [PMID: 36549148 PMCID: PMC9792357 DOI: 10.1016/j.dcn.2022.101181]
Abstract
Humans' extraordinary ability to understand speech in noise relies on multiple processes that develop with age. Using magnetoencephalography (MEG), we characterize the underlying neuromaturational basis by quantifying how cortical oscillations in 144 participants (aged 5-27 years) track phrasal and syllabic structures in connected speech mixed with different types of noise. While the extraction of prosodic cues from clear speech was stable during development, its maintenance in a multi-talker background matured rapidly up to age 9 and was associated with speech comprehension. Furthermore, while the extraction of subtler information provided by syllables matured at age 9, its maintenance in noisy backgrounds progressively matured until adulthood. Altogether, these results highlight distinct behaviorally relevant maturational trajectories for the neuronal signatures of speech perception. In accordance with grain-size proposals, neuromaturational milestones are reached increasingly late for linguistic units of decreasing size, with further delays incurred by noise.
22
Pei C, Qiu Y, Li F, Huang X, Si Y, Li Y, Zhang X, Chen C, Liu Q, Cao Z, Ding N, Gao S, Alho K, Yao D, Xu P. The different brain areas occupied for integrating information of hierarchical linguistic units: a study based on EEG and TMS. Cereb Cortex 2022; 33:4740-4751. [PMID: 36178127 DOI: 10.1093/cercor/bhac376]
Abstract
Human language units are hierarchical, and reading acquisition involves integrating multisensory information (typically from auditory and visual modalities) to access meaning. However, it is unclear how the brain processes and integrates language information at different linguistic units (words, phrases, and sentences) provided simultaneously in auditory and visual modalities. To address the issue, we presented participants with sequences of short Chinese sentences through auditory, visual, or combined audio-visual modalities while electroencephalographic responses were recorded. With a frequency tagging approach, we analyzed the neural representations of basic linguistic units (i.e. characters/monosyllabic words) and higher-level linguistic structures (i.e. phrases and sentences) across the 3 modalities separately. We found that audio-visual integration occurs in all linguistic units, and the brain areas involved in the integration varied across different linguistic levels. In particular, the integration of sentences activated the local left prefrontal area. Therefore, we used continuous theta-burst stimulation to verify that the left prefrontal cortex plays a vital role in the audio-visual integration of sentence information. Our findings suggest the advantage of bimodal language comprehension at hierarchical stages in language-related information processing and provide evidence for the causal role of the left prefrontal regions in processing information of audio-visual sentences.
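In a frequency-tagging design, linguistic units are presented at fixed rates (for example, syllables/words at 4 Hz, phrases at 2 Hz, sentences at 1 Hz), and neural tracking of each level appears as a spectral peak at the corresponding frequency. A minimal sketch of how such peaks might be quantified (rates, durations, and signal properties are assumptions for illustration, not the study's parameters):

```python
import numpy as np

fs = 250                              # EEG sampling rate (Hz), assumed
duration = 40                         # seconds of steady-state presentation
t = np.arange(0, duration, 1 / fs)

# Placeholder single-channel EEG with tagged responses at the sentence (1 Hz),
# phrase (2 Hz), and word/syllable (4 Hz) rates buried in noise.
rng = np.random.default_rng(4)
eeg = (0.4 * np.sin(2 * np.pi * 1 * t) +
       0.3 * np.sin(2 * np.pi * 2 * t) +
       0.6 * np.sin(2 * np.pi * 4 * t) +
       rng.normal(scale=2.0, size=t.size))

# Amplitude spectrum; with a 40 s window, frequency resolution is 1/40 = 0.025 Hz.
spectrum = np.abs(np.fft.rfft(eeg)) / t.size
freqs = np.fft.rfftfreq(t.size, d=1 / fs)

def peak_snr(f0, n_neighbours=10):
    """Peak amplitude at f0 relative to the mean of neighbouring frequency bins."""
    idx = np.argmin(np.abs(freqs - f0))
    neighbours = np.r_[spectrum[idx - n_neighbours:idx],
                       spectrum[idx + 1:idx + 1 + n_neighbours]]
    return spectrum[idx] / neighbours.mean()

for label, f0 in [("sentence", 1.0), ("phrase", 2.0), ("word/syllable", 4.0)]:
    print(f"{label:>14s} rate ({f0:.0f} Hz): SNR = {peak_snr(f0):.2f}")
```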
Affiliation(s)
- Changfu Pei
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Yuan Qiu
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Fali Li
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Research Unit of Neuroscience, Chinese Academy of Medical Science, 2019RU035, Chengdu, China
- Xunan Huang
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Foreign Languages, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
- Yajing Si
- School of Psychology, Xinxiang Medical University, Xinxiang, 453003, China
- Yuqin Li
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Xiabing Zhang
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Chunli Chen
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Qiang Liu
- Institute of Brain and Psychological Sciences, Sichuan Normal University, Chengdu, Sichuan, 610066, China
- Zehong Cao
- STEM, Mawson Lakes Campus, University of South Australia, Adelaide, SA 5095, Australia
- Nai Ding
- College of Biomedical Engineering and Instrument Sciences, Key Laboratory for Biomedical Engineering of Ministry of Education, Zhejiang University, Hangzhou, 310007, China
- Shan Gao
- School of Foreign Languages, University of Electronic Science and Technology of China, Chengdu, Sichuan, 611731, China
- Kimmo Alho
- Department of Psychology and Logopedics, University of Helsinki, Helsinki, FI 00014, Finland
- Dezhong Yao
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Research Unit of Neuroscience, Chinese Academy of Medical Science, 2019RU035, Chengdu, China
- Peng Xu
- The Clinical Hospital of Chengdu Brain Science Institute, MOE Key Lab for Neuroinformation, University of Electronic Science and Technology of China, Chengdu, 611731, China
- School of Life Science and Technology, Center for Information in BioMedicine, University of Electronic Science and Technology of China, Chengdu, 611731, China
- Research Unit of Neuroscience, Chinese Academy of Medical Science, 2019RU035, Chengdu, China
- Radiation Oncology Key Laboratory of Sichuan Province, Chengdu, 610041, China
| |
|
23
|
Bröhl F, Keitel A, Kayser C. MEG Activity in Visual and Auditory Cortices Represents Acoustic Speech-Related Information during Silent Lip Reading. eNeuro 2022; 9:ENEURO.0209-22.2022. [PMID: 35728955 PMCID: PMC9239847 DOI: 10.1523/eneuro.0209-22.2022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Accepted: 06/06/2022] [Indexed: 11/21/2022] Open
Abstract
Speech is an intrinsically multisensory signal, and seeing the speaker's lips forms a cornerstone of communication in acoustically impoverished environments. Still, it remains unclear how the brain exploits visual speech for comprehension. Previous work debated whether lip signals are mainly processed along the auditory pathways or whether the visual system directly implements speech-related processes. To probe this, we systematically characterized dynamic representations of multiple acoustic and visual speech-derived features in source localized MEG recordings that were obtained while participants listened to speech or viewed silent speech. Using a mutual-information framework we provide a comprehensive assessment of how well temporal and occipital cortices reflect the physically presented signals and unique aspects of acoustic features that were physically absent but may be critical for comprehension. Our results demonstrate that both cortices feature a functionally specific form of multisensory restoration: during lip reading, they reflect unheard acoustic features, independent of co-existing representations of the visible lip movements. This restoration emphasizes the unheard pitch signature in occipital cortex and the speech envelope in temporal cortex and is predictive of lip-reading performance. These findings suggest that when seeing the speaker's lips, the brain engages both visual and auditory pathways to support comprehension by exploiting multisensory correspondences between lip movements and spectro-temporal acoustic cues.
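The mutual-information framework mentioned here asks how much information a cortical time course carries about an acoustic feature such as the physically absent speech envelope during lip reading. The authors used a more sophisticated estimator; the sketch below is a deliberately crude binned plug-in estimate on stand-in data, included only to make the quantity concrete.

import numpy as np

def binned_mutual_information(x, y, bins=8):
    """Crude plug-in mutual-information estimate (in bits) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
envelope = rng.standard_normal(5000)                        # unheard acoustic envelope (stand-in)
meg_source = 0.4 * envelope + rng.standard_normal(5000)     # temporal/occipital source signal (stand-in)
print(f"MI ~ {binned_mutual_information(meg_source, envelope):.3f} bits")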
Affiliation(s)
- Felix Bröhl
- Department for Cognitive Neuroscience, Faculty of Biology, Bielefeld University, Bielefeld 33615, Germany
| | - Anne Keitel
- Psychology, University of Dundee, Dundee DD1 4HN, United Kingdom
| | - Christoph Kayser
- Department for Cognitive Neuroscience, Faculty of Biology, Bielefeld University, Bielefeld 33615, Germany
| |
|
24
|
Sato M. Motor and visual influences on auditory neural processing during speaking and listening. Cortex 2022; 152:21-35. [DOI: 10.1016/j.cortex.2022.03.013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2021] [Revised: 02/02/2022] [Accepted: 03/15/2022] [Indexed: 11/03/2022]
|
25
|
Pattamadilok C, Sato M. How are visemes and graphemes integrated with speech sounds during spoken word recognition? ERP evidence for supra-additive responses during audiovisual compared to auditory speech processing. BRAIN AND LANGUAGE 2022; 225:105058. [PMID: 34929531 DOI: 10.1016/j.bandl.2021.105058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/30/2021] [Revised: 10/31/2021] [Accepted: 12/08/2021] [Indexed: 06/14/2023]
Abstract
Both visual articulatory gestures and orthography provide information on the phonological content of speech. This EEG study investigated the integration between speech and these two visual inputs. A comparison of skilled readers' brain responses elicited by a spoken word presented alone versus synchronously with a static image of a viseme or a grapheme of the spoken word's onset showed that while neither visual input induced audiovisual integration on the N1 acoustic component, both led to a supra-additive integration on the P2 component, with a stronger integration between speech and graphemes on left-anterior electrodes. This pattern persisted in the P350 time window and generalized to all electrodes. The finding suggests a strong impact of spelling knowledge on phonetic processing and lexical access. It also indirectly indicates that the dynamic and predictive value present in natural lip movements, but not in static visemes, is particularly critical to the contribution of visual articulatory gestures to speech processing.
Affiliation(s)
| | - Marc Sato
- Aix Marseille Univ, CNRS, LPL, Aix-en-Provence, France
| |
|
26
|
López Zunini RA, Baart M, Samuel AG, Armstrong BC. Lexico-semantic access and audiovisual integration in the aging brain: Insights from mixed-effects regression analyses of event-related potentials. Neuropsychologia 2021; 165:108107. [PMID: 34921819 DOI: 10.1016/j.neuropsychologia.2021.108107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2021] [Revised: 11/17/2021] [Accepted: 11/29/2021] [Indexed: 10/19/2022]
Abstract
We investigated how aging modulates lexico-semantic processes in the visual (seeing written items), auditory (hearing spoken items) and audiovisual (seeing written items while hearing congruent spoken items) modalities. Participants were young and older adults who performed a delayed lexical decision task (LDT) presented in blocks of visual, auditory, and audiovisual stimuli. Event-related potentials (ERPs) revealed differences between young and older adults despite older adults' ability to identify words and pseudowords as accurately as young adults. The observed differences included more focalized lexico-semantic access in the N400 time window in older relative to young adults, stronger re-instantiation and/or more widespread activity of the lexicality effect at the time of responding, and stronger multimodal integration for older relative to young adults. Our results offer new insights into how functional neural differences in older adults can result in efficient access to lexico-semantic representations across the lifespan.
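The mixed-effects regression approach referred to in the title models single-trial ERP amplitudes with fixed effects of group, modality and lexicality and random effects per participant. A minimal sketch with statsmodels follows; the column names, effect structure and simulated data are invented for illustration and do not reproduce the authors' models.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 1200
df = pd.DataFrame({
    "n400_amp": rng.normal(0, 2, n),                              # single-trial N400 amplitude (stand-in)
    "group": rng.choice(["young", "older"], n),                   # age group
    "modality": rng.choice(["visual", "auditory", "audiovisual"], n),
    "lexicality": rng.choice(["word", "pseudoword"], n),
    "subject": rng.integers(0, 40, n),                            # participant identifier
})

# Random intercept per participant; fixed effects of group, modality, lexicality and interactions
model = smf.mixedlm("n400_amp ~ group * modality * lexicality", df, groups=df["subject"])
result = model.fit()
print(result.summary())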
Affiliation(s)
| | - Martijn Baart
- BCBL, Basque Center on Cognition, Brain and Language, San Sebastian, Spain; Tilburg University, Dept. of Cognitive Neuropsychology, Tilburg, the Netherlands
| | - Arthur G Samuel
- BCBL, Basque Center on Cognition, Brain and Language, San Sebastian, Spain; IKERBASQUE, Basque Foundation for Science, Spain; Stony Brook University, Dept. of Psychology, Stony Brook, NY, United States
| | - Blair C Armstrong
- BCBL, Basque Center on Cognition, Brain and Language, San Sebastian, Spain; University of Toronto, Dept. of Psychology and Department of Language Studies, Toronto, ON, Canada
| |
|
27
|
Zhao T, Hu A, Su R, Lyu C, Wang L, Yan N. Phonetic versus spatial processes during motor-oriented imitations of visuo-labial and visuo-lingual speech: A functional near-infrared spectroscopy study. Eur J Neurosci 2021; 55:154-174. [PMID: 34854143 DOI: 10.1111/ejn.15550] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2021] [Revised: 10/21/2021] [Accepted: 11/23/2021] [Indexed: 12/28/2022]
Abstract
While a large amount of research has studied the facilitation of visual speech on auditory speech recognition, few studies have investigated the processing of visual speech gestures in motor-oriented tasks that focus on the spatial and motor features of the articulator actions instead of the phonetic features of auditory and visual speech. The current study examined the engagement of spatial and phonetic processing of visual speech in a motor-oriented speech imitation task. Functional near-infrared spectroscopy (fNIRS) was used to measure the haemodynamic activity related to spatial processing and audiovisual integration in the superior parietal lobe (SPL) and the posterior superior/middle temporal gyrus (pSTG/pMTG), respectively. In addition, visuo-labial and visuo-lingual speech were compared to examine the influence of visual familiarity and audiovisual association on the processes in question. fNIRS revealed significant activations in the SPL but found no supra-additive audiovisual activations in the pSTG/pMTG, suggesting that the processing of audiovisual speech stimuli was primarily focused on spatial processes related to action comprehension and preparation, whereas phonetic processes related to audiovisual integration were minimal. Comparisons between visuo-labial and visuo-lingual speech imitations revealed no significant difference in the activation of the SPL or the pSTG/pMTG, suggesting that a higher degree of visual familiarity and audiovisual association did not significantly influence how visuo-labial speech was processed compared with visuo-lingual speech. The current study offered insights into the pattern of visual-speech processing under a motor-oriented task objective and provided further evidence for the modulation of multimodal speech integration by voluntary selective attention and task objective.
Affiliation(s)
- Tinghao Zhao
- CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.,Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Anming Hu
- Department of Rehabilitation Medicine, Beijing Tiantan Hospital, Capital Medical University, Beijing, China
| | - Rongfeng Su
- CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.,Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Chengchen Lyu
- Institute of Software, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Lan Wang
- CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.,Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Nan Yan
- CAS Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China.,Guangdong-Hong Kong-Macao Joint Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| |
|
28
|
Karthik G, Plass J, Beltz AM, Liu Z, Grabowecky M, Suzuki S, Stacey WC, Wasade VS, Towle VL, Tao JX, Wu S, Issa NP, Brang D. Visual speech differentially modulates beta, theta, and high gamma bands in auditory cortex. Eur J Neurosci 2021; 54:7301-7317. [PMID: 34587350 DOI: 10.1111/ejn.15482] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Revised: 08/20/2021] [Accepted: 08/28/2021] [Indexed: 12/13/2022]
Abstract
Speech perception is a central component of social communication. Although principally an auditory process, accurate speech perception in everyday settings is supported by meaningful information extracted from visual cues. Visual speech modulates activity in cortical areas subserving auditory speech perception including the superior temporal gyrus (STG). However, it is unknown whether visual modulation of auditory processing is a unitary phenomenon or, rather, consists of multiple functionally distinct processes. To explore this question, we examined neural responses to audiovisual speech measured from intracranially implanted electrodes in 21 patients with epilepsy. We found that visual speech modulated auditory processes in the STG in multiple ways, eliciting temporally and spatially distinct patterns of activity that differed across frequency bands. In the theta band, visual speech suppressed the auditory response from before auditory speech onset to after auditory speech onset (-93 to 500 ms) most strongly in the posterior STG. In the beta band, suppression was seen in the anterior STG from -311 to -195 ms before auditory speech onset and in the middle STG from -195 to 235 ms after speech onset. In high gamma, visual speech enhanced the auditory response from -45 to 24 ms only in the posterior STG. We interpret the visual-induced changes prior to speech onset as reflecting crossmodal prediction of speech signals. In contrast, modulations after sound onset may reflect a decrease in sustained feedforward auditory activity. These results are consistent with models that posit multiple distinct mechanisms supporting audiovisual speech perception.
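The band-specific modulations reported here can be illustrated as differences in band-limited power between audiovisual and auditory-only trials. The sketch below filters stand-in trials into theta, beta and high-gamma bands and contrasts Hilbert power between conditions; the sampling rate, band edges, epoch length and the simple whole-epoch average are all simplifying assumptions rather than the authors' time-resolved analysis.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                      # assumed intracranial sampling rate (Hz)
bands = {"theta": (4, 8), "beta": (13, 30), "high_gamma": (70, 150)}

def band_power(trials, lo, hi, fs):
    """Trial-wise Hilbert power in a band; trials is (n_trials, n_samples)."""
    b, a = butter(4, [lo, hi], btype="band", fs=fs)
    filtered = filtfilt(b, a, trials, axis=-1)
    return np.abs(hilbert(filtered, axis=-1)) ** 2

rng = np.random.default_rng(4)
av_trials = rng.standard_normal((60, 2000))   # audiovisual trials from one STG channel (stand-in)
a_trials = rng.standard_normal((60, 2000))    # auditory-only trials (stand-in)

for name, (lo, hi) in bands.items():
    modulation = band_power(av_trials, lo, hi, fs).mean() - band_power(a_trials, lo, hi, fs).mean()
    print(f"{name}: AV minus A power = {modulation:+.3f}")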
Affiliation(s)
- G Karthik
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
| | - John Plass
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
| | - Adriene M Beltz
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
| | - Zhongming Liu
- Department of Biomedical Engineering and Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, USA
| | - Marcia Grabowecky
- Department of Psychology, Northwestern University, Evanston, Illinois, USA
| | - Satoru Suzuki
- Department of Psychology, Northwestern University, Evanston, Illinois, USA
| | - William C Stacey
- Department of Neurology and Department of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan, USA
| | - Vibhangini S Wasade
- Department of Neurology, Henry Ford Hospital, Detroit, Michigan, USA.,Department of Neurology, Wayne State University School of Medicine, Detroit, Michigan, USA
| | - Vernon L Towle
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
| | - James X Tao
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
| | - Shasha Wu
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
| | - Naoum P Issa
- Department of Neurology, The University of Chicago, Chicago, Illinois, USA
| | - David Brang
- Department of Psychology, University of Michigan, Ann Arbor, Michigan, USA
| |
|
29
|
Sun J, Wang Z, Tian X. Manual Gestures Modulate Early Neural Responses in Loudness Perception. Front Neurosci 2021; 15:634967. [PMID: 34539324 PMCID: PMC8440995 DOI: 10.3389/fnins.2021.634967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2020] [Accepted: 08/06/2021] [Indexed: 12/02/2022] Open
Abstract
How different sensory modalities interact to shape perception is a fundamental question in cognitive neuroscience. Previous studies in audiovisual interaction have focused on abstract levels such as categorical representation (e.g., McGurk effect). It is unclear whether the cross-modal modulation can extend to low-level perceptual attributes. This study used motional manual gestures to test whether and how the loudness perception can be modulated by visual-motion information. Specifically, we implemented a novel paradigm in which participants compared the loudness of two consecutive sounds whose intensity changes around the just noticeable difference (JND), with manual gestures concurrently presented with the second sound. In two behavioral experiments and two EEG experiments, we investigated our hypothesis that the visual-motor information in gestures would modulate loudness perception. Behavioral results showed that the gestural information biased the judgment of loudness. More importantly, the EEG results demonstrated that early auditory responses around 100 ms after sound onset (N100) were modulated by the gestures. These consistent results in four behavioral and EEG experiments suggest that visual-motor processing can integrate with auditory processing at an early perceptual stage to shape the perception of a low-level perceptual attribute such as loudness, at least under challenging listening conditions.
Affiliation(s)
- Jiaqiu Sun
- Division of Arts and Sciences, New York University Shanghai, Shanghai, China.,NYU-ECNU Institute of Brain and Cognitive Science, New York University Shanghai, Shanghai, China
| | - Ziqing Wang
- NYU-ECNU Institute of Brain and Cognitive Science, New York University Shanghai, Shanghai, China.,Shanghai Key Laboratory of Brain Functional Genomics, Ministry of Education, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China
| | - Xing Tian
- Division of Arts and Sciences, New York University Shanghai, Shanghai, China.,NYU-ECNU Institute of Brain and Cognitive Science, New York University Shanghai, Shanghai, China.,Shanghai Key Laboratory of Brain Functional Genomics, Ministry of Education, School of Psychology and Cognitive Science, East China Normal University, Shanghai, China
| |
|
30
|
Motor Circuit and Superior Temporal Sulcus Activities Linked to Individual Differences in Multisensory Speech Perception. Brain Topogr 2021; 34:779-792. [PMID: 34480635 DOI: 10.1007/s10548-021-00869-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2021] [Accepted: 08/24/2021] [Indexed: 10/20/2022]
Abstract
Integrating multimodal information into a unified perception is a fundamental human capacity. The McGurk effect is a remarkable multisensory illusion in which the percept differs from the incongruent auditory and visual syllables presented. However, not all listeners perceive the McGurk illusion to the same degree. The neural basis for individual differences in the modulation of multisensory integration and syllabic perception remains largely unclear. To probe the possible involvement of specific neural circuits in individual differences in multisensory speech perception, we first implemented a behavioral experiment to examine McGurk susceptibility. Then, functional magnetic resonance imaging was performed in 63 participants to measure the brain activity in response to non-McGurk audiovisual syllables. We revealed significant individual variability in McGurk illusion perception. Moreover, we found significant differential activations of the auditory and visual regions and the left superior temporal sulcus (STS), as well as multiple motor areas, between strong and weak McGurk perceivers. Importantly, the individual engagement of the STS and motor areas could specifically predict behavioral McGurk susceptibility, in contrast to the sensory regions. These findings suggest that distinct multimodal integration in the STS, as well as coordinated phonemic modulatory processes in motor circuits, may serve as a neural substrate for interindividual differences in multisensory speech perception.
|
31
|
Tremblay P, Basirat A, Pinto S, Sato M. Visual prediction cues can facilitate behavioural and neural speech processing in young and older adults. Neuropsychologia 2021; 159:107949. [PMID: 34228997 DOI: 10.1016/j.neuropsychologia.2021.107949] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2020] [Revised: 06/16/2021] [Accepted: 07/01/2021] [Indexed: 02/06/2023]
Abstract
The ability to process speech evolves over the course of the lifespan. Understanding speech at low acoustic intensity and in the presence of background noise becomes harder, and the ability of older adults to benefit from audiovisual speech also appears to decline. These difficulties can have important consequences for quality of life. Yet, a consensus on the cause of these difficulties is still lacking. The objective of this study was to examine the processing of speech in young and older adults under different modalities (i.e., auditory [A], visual [V], audiovisual [AV]) and in the presence of different visual prediction cues (i.e., no predictive cue (control), temporal predictive cue, phonetic predictive cue, and combined temporal and phonetic predictive cues). We focused on recognition accuracy and four auditory evoked potential (AEP) components: P1-N1-P2 and N2. Thirty-four right-handed French-speaking adults were recruited, including 17 younger adults (28 ± 2 years; 20-42 years) and 17 older adults (67 ± 3.77 years; 60-73 years). Participants completed a forced-choice speech identification task. The main findings of the study are: (1) the facilitatory effect of visual information was reduced, but present, in older compared to younger adults; (2) visual predictive cues facilitated speech recognition in younger and older adults alike; (3) age differences in AEPs were localized to later components (P2 and N2), suggesting that aging predominantly affects higher-order cortical processes related to speech processing rather than lower-level auditory processes; (4) specifically, AV facilitation of P2 amplitude was lower in older adults, the effect of the temporal predictive cue on N2 amplitude was reduced for older compared to younger adults, and P2 and N2 latencies were longer for older adults; and finally, (5) behavioural performance was associated with P2 amplitude in older adults. Our results indicate that aging affects speech processing at multiple levels, including audiovisual integration (P2) and auditory attentional processes (N2). These findings have important implications for understanding barriers to communication in older age, as well as for the development of compensation strategies for those with speech processing difficulties.
Affiliation(s)
- Pascale Tremblay
- Département de Réadaptation, Faculté de Médecine, Université Laval, Quebec City, Canada; Cervo Brain Research Centre, Quebec City, Canada.
| | - Anahita Basirat
- Univ. Lille, CNRS, UMR 9193 - SCALab - Sciences Cognitives et Sciences Affectives, Lille, France
| | - Serge Pinto
- France Aix Marseille Univ, CNRS, LPL, Aix-en-Provence, France
| | - Marc Sato
- France Aix Marseille Univ, CNRS, LPL, Aix-en-Provence, France
| |
|
32
|
Opoku-Baah C, Schoenhaut AM, Vassall SG, Tovar DA, Ramachandran R, Wallace MT. Visual Influences on Auditory Behavioral, Neural, and Perceptual Processes: A Review. J Assoc Res Otolaryngol 2021; 22:365-386. [PMID: 34014416 PMCID: PMC8329114 DOI: 10.1007/s10162-021-00789-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2020] [Accepted: 02/07/2021] [Indexed: 01/03/2023] Open
Abstract
In a naturalistic environment, auditory cues are often accompanied by information from other senses, which can be redundant with or complementary to the auditory information. Although the multisensory interactions derived from this combination of information and that shape auditory function are seen across all sensory modalities, our greatest body of knowledge to date centers on how vision influences audition. In this review, we attempt to capture the state of our understanding at this point in time regarding this topic. Following a general introduction, the review is divided into 5 sections. In the first section, we review the psychophysical evidence in humans regarding vision's influence in audition, making the distinction between vision's ability to enhance versus alter auditory performance and perception. Three examples are then described that serve to highlight vision's ability to modulate auditory processes: spatial ventriloquism, cross-modal dynamic capture, and the McGurk effect. The final part of this section discusses models that have been built based on available psychophysical data and that seek to provide greater mechanistic insights into how vision can impact audition. The second section reviews the extant neuroimaging and far-field imaging work on this topic, with a strong emphasis on the roles of feedforward and feedback processes, on imaging insights into the causal nature of audiovisual interactions, and on the limitations of current imaging-based approaches. These limitations point to a greater need for machine-learning-based decoding approaches toward understanding how auditory representations are shaped by vision. The third section reviews the wealth of neuroanatomical and neurophysiological data from animal models that highlights audiovisual interactions at the neuronal and circuit level in both subcortical and cortical structures. It also speaks to the functional significance of audiovisual interactions for two critically important facets of auditory perception-scene analysis and communication. The fourth section presents current evidence for alterations in audiovisual processes in three clinical conditions: autism, schizophrenia, and sensorineural hearing loss. These changes in audiovisual interactions are postulated to have cascading effects on higher-order domains of dysfunction in these conditions. The final section highlights ongoing work seeking to leverage our knowledge of audiovisual interactions to develop better remediation approaches to these sensory-based disorders, founded in concepts of perceptual plasticity in which vision has been shown to have the capacity to facilitate auditory learning.
Affiliation(s)
- Collins Opoku-Baah
- Neuroscience Graduate Program, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN, USA
| | - Adriana M Schoenhaut
- Neuroscience Graduate Program, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN, USA
| | - Sarah G Vassall
- Neuroscience Graduate Program, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN, USA
| | - David A Tovar
- Neuroscience Graduate Program, Vanderbilt University, Nashville, TN, USA
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN, USA
| | - Ramnarayan Ramachandran
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN, USA
- Department of Psychology, Vanderbilt University, Nashville, TN, USA
- Department of Hearing and Speech, Vanderbilt University Medical Center, Nashville, TN, USA
- Vanderbilt Vision Research Center, Nashville, TN, USA
| | - Mark T Wallace
- Vanderbilt Brain Institute, Vanderbilt University, Nashville, TN, USA.
- Department of Psychology, Vanderbilt University, Nashville, TN, USA.
- Department of Hearing and Speech, Vanderbilt University Medical Center, Nashville, TN, USA.
- Vanderbilt Vision Research Center, Nashville, TN, USA.
- Department of Psychiatry and Behavioral Sciences, Vanderbilt University Medical Center, Nashville, TN, USA.
- Department of Pharmacology, Vanderbilt University, Nashville, TN, USA.
| |
|
33
|
Lindborg A, Andersen TS. Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception. PLoS One 2021; 16:e0246986. [PMID: 33606815 PMCID: PMC7895372 DOI: 10.1371/journal.pone.0246986] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2020] [Accepted: 01/31/2021] [Indexed: 11/24/2022] Open
Abstract
Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is exemplified by the McGurk illusion, where an auditory stimulus such as "ba" dubbed onto a visual stimulus such as "ga" produces the illusion of hearing "da". Bayesian models of multisensory perception suggest that both the enhancement and the illusion case can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the information reliability of each sensory cue). However, no study to date has accounted for how each of these stages contributes to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and the fusion stages simultaneously. This is done by varying both temporal offset (binding) and auditory and visual signal-to-noise ratio (fusion). We fit two Bayesian models to the behavioural data and show that they can both account for the enhancement effect in congruent audiovisual speech, as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
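The fusion stage of the model class described here weights each cue by its reliability (inverse variance), while binding determines whether fusion is applied at all. The sketch below implements only the reliability-weighted fusion step on a one-dimensional phonetic axis, as a toy illustration rather than the authors' full binding-plus-fusion model; all numbers are invented.

import numpy as np

def fuse(mu_a, var_a, mu_v, var_v):
    """Reliability-weighted (inverse-variance) fusion of auditory and visual estimates."""
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    mu = w_a * mu_a + (1 - w_a) * mu_v
    var = 1 / (1 / var_a + 1 / var_v)
    return mu, var

# Example: noisy auditory evidence near "ba", reliable visual evidence near "ga" on a 1-D phonetic axis
mu_fused, var_fused = fuse(mu_a=0.2, var_a=1.0, mu_v=0.9, var_v=0.1)
print(f"fused estimate {mu_fused:.2f} (variance {var_fused:.2f})")  # pulled toward the more reliable visual cue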
Affiliation(s)
- Alma Lindborg
- Department of Psychology, University of Potsdam, Potsdam, Germany
- Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Tobias S. Andersen
- Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
| |
|
34
|
Sorati M, Behne DM. Considerations in Audio-Visual Interaction Models: An ERP Study of Music Perception by Musicians and Non-musicians. Front Psychol 2021; 11:594434. [PMID: 33551911 PMCID: PMC7854916 DOI: 10.3389/fpsyg.2020.594434] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2020] [Accepted: 12/03/2020] [Indexed: 11/13/2022] Open
Abstract
Previous research with speech and non-speech stimuli suggested that in audiovisual perception, visual information starting prior to the onset of corresponding sound can provide visual cues, and form a prediction about the upcoming auditory sound. This prediction leads to audiovisual (AV) interaction. Auditory and visual perception interact and induce suppression and speeding up of the early auditory event-related potentials (ERPs) such as N1 and P2. To investigate AV interaction, previous research examined N1 and P2 amplitudes and latencies in response to audio only (AO), video only (VO), audiovisual, and control (CO) stimuli, and compared AV with auditory perception based on four AV interaction models (AV vs. AO+VO, AV-VO vs. AO, AV-VO vs. AO-CO, AV vs. AO). The current study addresses how different models of AV interaction express N1 and P2 suppression in music perception. Furthermore, the current study took one step further and examined whether previous musical experience, which can potentially lead to higher N1 and P2 amplitudes in auditory perception, influenced AV interaction in different models. Musicians and non-musicians were presented the recordings (AO, AV, VO) of a keyboard /C4/ key being played, as well as CO stimuli. Results showed that AV interaction models differ in their expression of N1 and P2 amplitude and latency suppression. The calculation of model (AV-VO vs. AO) and (AV-VO vs. AO-CO) has consequences for the resulting N1 and P2 difference waves. Furthermore, while musicians, compared to non-musicians, showed higher N1 amplitude in auditory perception, suppression of amplitudes and latencies for N1 and P2 was similar for the two groups across the AV models. Collectively, these results suggest that when visual cues from finger and hand movements predict the upcoming sound in AV music perception, suppression of early ERPs is similar for musicians and non-musicians. Notably, the calculation differences across models do not lead to the same pattern of results for N1 and P2, demonstrating that the four models are not interchangeable and are not directly comparable.
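The four AV interaction models compared in this study reduce to different arithmetic on condition-averaged waveforms, which is why they yield different difference waves. The sketch below spells out the four contrasts on stand-in averages; the epoch length and the whole-epoch summary statistic are illustrative choices only.

import numpy as np

rng = np.random.default_rng(5)
n_samples = 600                        # e.g., a 600-sample epoch (illustrative)
AO = rng.standard_normal(n_samples)    # auditory-only average ERP (stand-in)
VO = rng.standard_normal(n_samples)    # video-only average ERP
AV = rng.standard_normal(n_samples)    # audiovisual average ERP
CO = rng.standard_normal(n_samples)    # control-stimulus average ERP

# The four comparisons discussed in the abstract, expressed as difference waves
models = {
    "AV vs AO+VO":    AV - (AO + VO),
    "AV-VO vs AO":    (AV - VO) - AO,
    "AV-VO vs AO-CO": (AV - VO) - (AO - CO),
    "AV vs AO":       AV - AO,
}
for name, wave in models.items():
    print(f"{name}: mean difference {wave.mean():+.3f}")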
Affiliation(s)
- Marzieh Sorati
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
| | - Dawn M Behne
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
| |
|
35
|
Auditory detection is modulated by theta phase of silent lip movements. CURRENT RESEARCH IN NEUROBIOLOGY 2021; 2:100014. [PMID: 36246505 PMCID: PMC9559921 DOI: 10.1016/j.crneur.2021.100014] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2020] [Revised: 05/12/2021] [Accepted: 05/19/2021] [Indexed: 11/23/2022] Open
Abstract
Audiovisual speech perception relies, among other things, on our expertise in mapping a speaker's lip movements onto speech sounds. This multimodal matching is facilitated by salient syllable features that align lip movements and acoustic envelope signals in the 4–8 Hz theta band. Although non-exclusive, the predominance of theta rhythms in speech processing has been firmly established by studies showing that neural oscillations track the acoustic envelope in the primary auditory cortex. Equivalently, theta oscillations in the visual cortex entrain to lip movements, and the auditory cortex is recruited during silent speech perception. These findings suggest that neuronal theta oscillations may play a functional role in organising information flow across visual and auditory sensory areas. We presented silent speech movies while participants performed a pure tone detection task to test whether entrainment to lip movements directs the auditory system and drives behavioural outcomes. We showed that auditory detection varied depending on the ongoing theta phase conveyed by lip movements in the movies. In a complementary experiment presenting the same movies while recording participants' electro-encephalogram (EEG), we found that silent lip movements entrained neural oscillations in the visual and auditory cortices with the visual phase leading the auditory phase. These results support the idea that the visual cortex entrained by lip movements filtered the sensitivity of the auditory cortex via theta phase synchronization.
Highlights:
- Subjects entrain to visual activity conveyed by speakers' lip movements.
- Visual entrainment modulates auditory perception and performance.
- Silent lip perception recruits both visual and auditory cortices.
- Visual and auditory cortices synchronize via theta phase coupling.
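The phase dependence reported here can be sketched by extracting the theta-band phase of the lip-aperture signal at each tone onset and binning detection performance by that phase. The code below does this on simulated data; the sampling rate, filter settings, number of phase bins and the random detection outcomes are all stand-ins, not the study's stimuli or results.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 100.0                                           # assumed lip-signal sampling rate (Hz)
rng = np.random.default_rng(6)
lip_aperture = rng.standard_normal(int(fs) * 120)    # stand-in lip-aperture time series

# Theta-band (4-8 Hz) phase of the lip signal via the Hilbert transform
b, a = butter(4, [4, 8], btype="band", fs=fs)
theta_phase = np.angle(hilbert(filtfilt(b, a, lip_aperture)))

# Simulated tone onsets and detection outcomes (stand-ins)
onsets = rng.integers(0, lip_aperture.size, 400)
hits = rng.random(400) < 0.6

# Hit rate as a function of lip-theta phase at tone onset
bins = np.linspace(-np.pi, np.pi, 9)
which_bin = np.digitize(theta_phase[onsets], bins) - 1
hit_rate_by_phase = [hits[which_bin == k].mean() for k in range(len(bins) - 1)]
print(np.round(hit_rate_by_phase, 2))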
|
36
|
Pan CC, Zhou AB. Altered Auditory Self-recognition in People with Schizophrenia. THE SPANISH JOURNAL OF PSYCHOLOGY 2020; 23:e52. [PMID: 33213608 DOI: 10.1017/sjp.2020.53] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Self-recognition is of great significance to our sense of self. To date, disturbances in visual self-recognition have been well studied in people with schizophrenia, whereas relatively few studies have focused on the processing of self in other domains, such as audition. Investigating auditory self-recognition helps to delineate self-related changes and the potential roots of the psychopathological features of schizophrenia. By applying a unimodal task and a multisensory test, this study investigated auditory self-recognition in people with schizophrenia under unimodal and bimodal conditions. Forty-six adults diagnosed with schizophrenia and thirty-two healthy controls took part in the study. Results suggested that people with schizophrenia had significantly lower perceptual sensitivity in detecting their own voice and also adopted stricter judgment criteria in self-voice decisions. Furthermore, when stimuli combined others' faces with one's own voice, people with schizophrenia mistakenly attributed others' voices as their own. In conclusion, auditory self-recognition is altered in people with schizophrenia.
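The "perceptual sensitivity" and "judgment criteria" reported here correspond to the signal-detection measures d' and c, derived from hit and false-alarm rates in the self-voice detection task. A minimal sketch with an illustrative set of response counts (not the study's data) follows.

from scipy.stats import norm

def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Signal-detection d' and criterion c with a standard log-linear correction."""
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z_hit, z_fa = norm.ppf(hit_rate), norm.ppf(fa_rate)
    return z_hit - z_fa, -0.5 * (z_hit + z_fa)

# Illustrative counts for one participant judging "is this my voice?"
d_prime, criterion = dprime_and_criterion(hits=30, misses=20, false_alarms=5, correct_rejections=45)
print(f"d' = {d_prime:.2f}, c = {criterion:.2f}")   # lower d' and higher c would match the reported pattern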
Affiliation(s)
| | - Aibao B Zhou
- Northwest Normal University of Anning District (China)
| |
|
37
|
Mégevand P, Mercier MR, Groppe DM, Zion Golumbic E, Mesgarani N, Beauchamp MS, Schroeder CE, Mehta AD. Crossmodal Phase Reset and Evoked Responses Provide Complementary Mechanisms for the Influence of Visual Speech in Auditory Cortex. J Neurosci 2020; 40:8530-8542. [PMID: 33023923 PMCID: PMC7605423 DOI: 10.1523/jneurosci.0555-20.2020] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2020] [Revised: 07/27/2020] [Accepted: 08/31/2020] [Indexed: 12/26/2022] Open
Abstract
Natural conversation is multisensory: when we can see the speaker's face, visual speech cues improve our comprehension. The neuronal mechanisms underlying this phenomenon remain unclear. The two main alternatives are visually mediated phase modulation of neuronal oscillations (excitability fluctuations) in auditory neurons and visual input-evoked responses in auditory neurons. Investigating this question using naturalistic audiovisual speech with intracranial recordings in humans of both sexes, we find evidence for both mechanisms. Remarkably, auditory cortical neurons track the temporal dynamics of purely visual speech using the phase of their slow oscillations and phase-related modulations in broadband high-frequency activity. Consistent with known perceptual enhancement effects, the visual phase reset amplifies the cortical representation of concomitant auditory speech. In contrast to this, and in line with earlier reports, visual input reduces the amplitude of evoked responses to concomitant auditory input. We interpret the combination of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable stimulus processing in the presence of congruent auditory and visual speech inputs.SIGNIFICANCE STATEMENT Watching the speaker can facilitate our understanding of what is being said. The mechanisms responsible for this influence of visual cues on the processing of speech remain incompletely understood. We studied these mechanisms by recording the electrical activity of the human brain through electrodes implanted surgically inside the brain. We found that visual inputs can operate by directly activating auditory cortical areas, and also indirectly by modulating the strength of cortical responses to auditory input. Our results help to understand the mechanisms by which the brain merges auditory and visual speech into a unitary perception.
Affiliation(s)
- Pierre Mégevand
- Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
- Feinstein Institutes for Medical Research, Manhasset, New York 11030
- Department of Basic Neurosciences, Faculty of Medicine, University of Geneva, 1211 Geneva, Switzerland
| | - Manuel R Mercier
- Department of Neurology, Montefiore Medical Center, Bronx, New York 10467
- Department of Neuroscience, Albert Einstein College of Medicine, Bronx, New York 10461
- Institut de Neurosciences des Systèmes, Aix Marseille University, INSERM, 13005 Marseille, France
| | - David M Groppe
- Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
- Feinstein Institutes for Medical Research, Manhasset, New York 11030
- The Krembil Neuroscience Centre, University Health Network, Toronto, Ontario M5T 1M8, Canada
| | - Elana Zion Golumbic
- The Gonda Brain Research Center, Bar Ilan University, Ramat Gan 5290002, Israel
| | - Nima Mesgarani
- Department of Electrical Engineering, Columbia University, New York, New York 10027
| | - Michael S Beauchamp
- Department of Neurosurgery, Baylor College of Medicine, Houston, Texas 77030
| | - Charles E Schroeder
- Nathan S. Kline Institute, Orangeburg, New York 10962
- Department of Psychiatry, Columbia University, New York, New York 10032
| | - Ashesh D Mehta
- Department of Neurosurgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York 11549
- Feinstein Institutes for Medical Research, Manhasset, New York 11030
| |
|
38
|
Michon M, Boncompte G, López V. Electrophysiological Dynamics of Visual Speech Processing and the Role of Orofacial Effectors for Cross-Modal Predictions. Front Hum Neurosci 2020; 14:538619. [PMID: 33192386 PMCID: PMC7653187 DOI: 10.3389/fnhum.2020.538619] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2020] [Accepted: 09/29/2020] [Indexed: 11/13/2022] Open
Abstract
The human brain generates predictions about future events. During face-to-face conversations, visemic information is used to predict upcoming auditory input. Recent studies suggest that the speech motor system plays a role in these cross-modal predictions; however, usually only audio-visual paradigms are employed. Here we tested whether speech sounds can be predicted on the basis of visemic information only, and to what extent interfering with orofacial articulatory effectors can affect these predictions. We recorded EEG and used the N400 as an index of such predictions. Our results show that N400 amplitude was strongly modulated by visemic salience, consistent with cross-modal speech predictions. Additionally, the N400 ceased to be evoked when syllables' visemes were presented backwards, suggesting that predictions occur only when the observed viseme matched an existing articuleme in the observer's speech motor system (i.e., the articulatory neural sequence required to produce a particular phoneme/viseme). Importantly, we found that interfering with the motor articulatory system strongly disrupted cross-modal predictions. We also observed a late P1000 that was evoked only for syllable-related visual stimuli, but whose amplitude was not modulated by interfering with the motor system. The present study provides further evidence of the importance of the speech production system for predicting speech sounds on the basis of visemic information at the pre-lexical level. The implications of these results are discussed in the context of a hypothesized trimodal repertoire for speech, in which speech perception is conceived as a highly interactive process that involves not only your ears but also your eyes, lips and tongue.
Affiliation(s)
- Maëva Michon
- Laboratorio de Neurociencia Cognitiva y Evolutiva, Escuela de Medicina, Pontificia Universidad Católica de Chile, Santiago, Chile
- Laboratorio de Neurociencia Cognitiva y Social, Facultad de Psicología, Universidad Diego Portales, Santiago, Chile
| | - Gonzalo Boncompte
- Laboratorio de Neurodinámicas de la Cognición, Escuela de Medicina, Pontificia Universidad Católica de Chile, Santiago, Chile
| | - Vladimir López
- Laboratorio de Psicología Experimental, Escuela de Psicología, Pontificia Universidad Católica de Chile, Santiago, Chile
| |
|
39
|
Audio-visual combination of syllables involves time-sensitive dynamics following from fusion failure. Sci Rep 2020; 10:18009. [PMID: 33093570 PMCID: PMC7583249 DOI: 10.1038/s41598-020-75201-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2020] [Accepted: 10/05/2020] [Indexed: 11/08/2022] Open
Abstract
In face-to-face communication, audio-visual (AV) stimuli can be fused, combined or perceived as mismatching. While the left superior temporal sulcus (STS) is presumably the locus of AV integration, the process leading to combination is unknown. Based on previous modelling work, we hypothesize that combination results from a complex dynamic originating in a failure to integrate AV inputs, followed by a reconstruction of the most plausible AV sequence. In two different behavioural tasks and one MEG experiment, we observed that combination is more time demanding than fusion. Using time-/source-resolved human MEG analyses with linear and dynamic causal models, we show that both fusion and combination involve early detection of AV incongruence in the STS, whereas combination is further associated with enhanced activity of AV asynchrony-sensitive regions (auditory and inferior frontal cortices). Based on neural signal decoding, we finally show that only combination can be decoded from the IFG activity and that combination is decoded later than fusion in the STS. These results indicate that the AV speech integration outcome primarily depends on whether the STS converges or not onto an existing multimodal syllable representation, and that combination results from subsequent temporal processing, presumably the off-line re-ordering of incongruent AV stimuli.
|
40
|
Plass J, Brang D, Suzuki S, Grabowecky M. Vision perceptually restores auditory spectral dynamics in speech. Proc Natl Acad Sci U S A 2020; 117:16920-16927. [PMID: 32632010 PMCID: PMC7382243 DOI: 10.1073/pnas.2002887117] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Visual speech facilitates auditory speech perception, but the visual cues responsible for these benefits and the information they provide remain unclear. Low-level models emphasize basic temporal cues provided by mouth movements, but these impoverished signals may not fully account for the richness of auditory information provided by visual speech. High-level models posit interactions among abstract categorical (i.e., phonemes/visemes) or amodal (e.g., articulatory) speech representations, but require lossy remapping of speech signals onto abstracted representations. Because visible articulators shape the spectral content of speech, we hypothesized that the perceptual system might exploit natural correlations between midlevel visual (oral deformations) and auditory speech features (frequency modulations) to extract detailed spectrotemporal information from visual speech without employing high-level abstractions. Consistent with this hypothesis, we found that the time-frequency dynamics of oral resonances (formants) could be predicted with unexpectedly high precision from the changing shape of the mouth during speech. When isolated from other speech cues, speech-based shape deformations improved perceptual sensitivity for corresponding frequency modulations, suggesting that listeners could exploit this cross-modal correspondence to facilitate perception. To test whether this type of correspondence could improve speech comprehension, we selectively degraded the spectral or temporal dimensions of auditory sentence spectrograms to assess how well visual speech facilitated comprehension under each degradation condition. Visual speech produced drastically larger enhancements during spectral degradation, suggesting a condition-specific facilitation effect driven by cross-modal recovery of auditory speech spectra. The perceptual system may therefore use audiovisual correlations rooted in oral acoustics to extract detailed spectrotemporal information from visual speech.
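The reported prediction of formant dynamics from the changing shape of the mouth is, computationally, a regression from frame-wise oral-shape descriptors onto time-aligned formant frequencies. The sketch below uses cross-validated ridge regression on simulated data; the three shape features and the synthetic F2 trajectory are invented for illustration and do not reflect the authors' measurements.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n_frames = 2000
# Stand-in frame-wise visual features: e.g., lip width, lip height, inter-lip area (all invented)
mouth_shape = rng.standard_normal((n_frames, 3))
# Stand-in F2 trajectory that partly depends on mouth shape
f2 = 1500 + 300 * mouth_shape[:, 1] - 150 * mouth_shape[:, 0] + 100 * rng.standard_normal(n_frames)

# How well does mouth shape predict the formant trajectory? (cross-validated R^2)
scores = cross_val_score(Ridge(alpha=1.0), mouth_shape, f2, cv=5, scoring="r2")
print(f"cross-validated R^2 = {scores.mean():.2f}")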
Affiliation(s)
- John Plass
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109;
- Department of Psychology, Northwestern University, Evanston, IL 60208
| | - David Brang
- Department of Psychology, University of Michigan, Ann Arbor, MI 48109
| | - Satoru Suzuki
- Department of Psychology, Northwestern University, Evanston, IL 60208
- Interdepartmental Neuroscience Program, Northwestern University, Chicago, IL 60611
| | - Marcia Grabowecky
- Department of Psychology, Northwestern University, Evanston, IL 60208
- Interdepartmental Neuroscience Program, Northwestern University, Chicago, IL 60611
| |
|
41
|
Sorati M, Behne DM. Audiovisual Modulation in Music Perception for Musicians and Non-musicians. Front Psychol 2020; 11:1094. [PMID: 32547458 PMCID: PMC7273518 DOI: 10.3389/fpsyg.2020.01094] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2020] [Accepted: 04/29/2020] [Indexed: 11/13/2022] Open
Abstract
In audiovisual music perception, visual information from a musical instrument being played is available prior to the onset of the corresponding musical sound and consequently allows a perceiver to form a prediction about the upcoming auditory music. This prediction in audiovisual music perception, compared to auditory music perception, leads to lower N1 and P2 amplitudes and latencies. Although previous research suggests that audiovisual experience, such as previous musical experience, may enhance this prediction, a remaining question is to what extent musical experience modifies N1 and P2 amplitudes and latencies. Furthermore, corresponding event-related phase modulations quantified as inter-trial phase coherence (ITPC) have not previously been reported for audiovisual music perception. In the current study, audio video recordings of a keyboard key being played were presented to musicians and non-musicians in audio only (AO), video only (VO), and audiovisual (AV) conditions. With predictive movements from playing the keyboard isolated from AV music perception (AV-VO), the current findings demonstrated that, compared to the AO condition, both groups had a similar decrease in N1 amplitude and latency, and P2 amplitude, along with correspondingly lower ITPC values in the delta, theta, and alpha frequency bands. However, while musicians showed lower ITPC values in the beta band in AV-VO compared to AO, non-musicians did not show this pattern. Findings indicate that AV perception may be broadly correlated with auditory perception, and differences between musicians and non-musicians further indicate musical experience to be a specific factor influencing AV perception. Predicting an upcoming sound in AV music perception may involve visual predictive processes, as well as beta-band oscillations, which may be influenced by years of musical training. This study highlights possible interconnectivity in AV perception as well as potential modulation with experience.
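Inter-trial phase coherence (ITPC), the measure quantified in this study, is the length of the mean resultant vector of single-trial phases; values near 1 indicate that phase is tightly aligned across trials. The sketch below computes a band-limited ITPC time course via Hilbert phases on stand-in epochs; the band edges, filter order and sampling rate are illustrative assumptions, and the authors' time-frequency decomposition may differ.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def itpc(trials, lo, hi, fs):
    """Inter-trial phase coherence over time for one frequency band.

    trials: array of shape (n_trials, n_samples). Returns an ITPC time course in [0, 1].
    """
    b, a = butter(4, [lo, hi], btype="band", fs=fs)
    analytic = hilbert(filtfilt(b, a, trials, axis=-1), axis=-1)
    phases = np.angle(analytic)
    return np.abs(np.mean(np.exp(1j * phases), axis=0))

fs = 500.0
rng = np.random.default_rng(8)
trials = rng.standard_normal((80, 1000))          # stand-in single-trial EEG epochs
for band, (lo, hi) in {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 12), "beta": (13, 30)}.items():
    print(band, round(itpc(trials, lo, hi, fs).mean(), 3))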
Affiliation(s)
- Marzieh Sorati
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
| | - Dawn Marie Behne
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
| |
|
42
|
Randazzo M, Priefer R, Smith PJ, Nagler A, Avery T, Froud K. Neural Correlates of Modality-Sensitive Deviance Detection in the Audiovisual Oddball Paradigm. Brain Sci 2020; 10:brainsci10060328. [PMID: 32481538 PMCID: PMC7348766 DOI: 10.3390/brainsci10060328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2020] [Revised: 05/15/2020] [Accepted: 05/25/2020] [Indexed: 11/16/2022] Open
Abstract
The McGurk effect, an incongruent pairing of visual /ga/ with acoustic /ba/, creates a fusion illusion /da/ and is the cornerstone of research in audiovisual speech perception. Combination illusions occur when the input modalities are reversed (auditory /ga/, visual /ba/), yielding the percept /bga/. A robust literature shows that fusion illusions in an oddball paradigm evoke a mismatch negativity (MMN) in the auditory cortex, in the absence of changes to the acoustic stimuli. We compared fusion and combination illusions in a passive oddball paradigm to further examine the influence of visual and auditory aspects of incongruent speech stimuli on the audiovisual MMN. Participants viewed videos under two audiovisual illusion conditions: fusion, in which the visual aspect of the stimulus changed, and combination, in which the auditory aspect changed, as well as two unimodal auditory-only and visual-only conditions. Fusion and combination deviants exerted similar influence in generating congruency predictions, with significant differences between standards and deviants in the N100 time window. The presence of the MMN in early and late time windows differentiated fusion from combination deviants. When the visual signal changes, a new percept is created, but when the visual signal is held constant and the auditory signal changes, the response is suppressed, evoking a later MMN. In alignment with models of predictive processing in audiovisual speech perception, we interpreted our results to indicate that visual information can both predict and suppress auditory speech perception.
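The MMN analysis described here rests on deviant-minus-standard difference waves evaluated in early and late time windows. The sketch below computes such a difference wave from stand-in condition averages; the epoch, the injected negativity around 180 ms and the window boundaries are illustrative assumptions, not the study's parameters.

import numpy as np

fs = 500.0
times = np.arange(-0.1, 0.5, 1 / fs)                    # epoch from -100 to 500 ms
rng = np.random.default_rng(9)
standard_erp = rng.standard_normal(times.size) * 0.5    # average ERP to standards (stand-in)
deviant_erp = standard_erp - 0.8 * np.exp(-((times - 0.18) ** 2) / 0.002)  # added negativity near 180 ms

mmn = deviant_erp - standard_erp                        # mismatch-negativity difference wave

def window_mean(wave, times, lo, hi):
    """Mean amplitude of a difference wave within a latency window (seconds)."""
    return wave[(times >= lo) & (times <= hi)].mean()

print("early window (100-250 ms):", round(window_mean(mmn, times, 0.10, 0.25), 3))
print("late window (250-400 ms): ", round(window_mean(mmn, times, 0.25, 0.40), 3))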
Collapse
Affiliation(s)
- Melissa Randazzo
- Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA; (R.P.); (A.N.)
- Correspondence: ; Tel.: +1-516-877-4769
| | - Ryan Priefer
- Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA; (R.P.); (A.N.)
| | - Paul J. Smith
- Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA; (P.J.S.); (T.A.); (K.F.)
| | - Amanda Nagler
- Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA; (R.P.); (A.N.)
| | - Trey Avery
- Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA; (P.J.S.); (T.A.); (K.F.)
| | - Karen Froud
- Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA; (P.J.S.); (T.A.); (K.F.)
| |
Collapse
|
43
|
Maddaluno O, Guidali G, Zazio A, Miniussi C, Bolognini N. Touch anticipation mediates cross-modal Hebbian plasticity in the primary somatosensory cortex. Cortex 2020; 126:173-181. [DOI: 10.1016/j.cortex.2020.01.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2019] [Revised: 12/15/2019] [Accepted: 01/07/2020] [Indexed: 12/28/2022]
|
44
|
Micheli C, Schepers IM, Ozker M, Yoshor D, Beauchamp MS, Rieger JW. Electrocorticography reveals continuous auditory and visual speech tracking in temporal and occipital cortex. Eur J Neurosci 2020; 51:1364-1376. [PMID: 29888819 PMCID: PMC6289876 DOI: 10.1111/ejn.13992] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2017] [Revised: 05/19/2018] [Accepted: 05/29/2018] [Indexed: 12/11/2022]
Abstract
During natural speech perception, humans must parse temporally continuous auditory and visual speech signals into sequences of words. However, most studies of speech perception present only single words or syllables. We used electrocorticography (subdural electrodes implanted on the brains of epileptic patients) to investigate the neural mechanisms for processing continuous audiovisual speech signals consisting of individual sentences. Using partial correlation analysis, we found that posterior superior temporal gyrus (pSTG) and medial occipital cortex tracked both the auditory and the visual speech envelopes. These same regions, as well as inferior temporal cortex, responded more strongly to a dynamic video of a talking face compared to auditory speech paired with a static face. Occipital cortex and pSTG carry temporal information about both auditory and visual speech dynamics. Visual speech tracking in pSTG may be a mechanism for enhancing perception of degraded auditory speech.
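The partial correlation analysis described here asks how strongly a neural signal tracks one speech envelope once the other envelope has been regressed out. A minimal sketch of that statistic via residualization is given below; the surrogate envelopes and the electrode signal are illustrative assumptions, not the study's data or pipeline.

import numpy as np

def partial_corr(x, y, control):
    # Pearson correlation between x and y after removing the part of each
    # that is linearly explained by the control signal.
    def residualize(a, b):
        design = np.column_stack([np.ones_like(b), b])
        coef, *_ = np.linalg.lstsq(design, a, rcond=None)
        return a - design @ coef
    return np.corrcoef(residualize(x, control), residualize(y, control))[0, 1]

# Toy usage: neural activity tracking the auditory envelope beyond the visual one.
rng = np.random.default_rng(2)
visual_env = rng.normal(size=5000)
audio_env = 0.6 * visual_env + rng.normal(size=5000)       # envelopes co-vary
neural = 0.5 * audio_env + 0.2 * visual_env + rng.normal(size=5000)
print(partial_corr(neural, audio_env, visual_env))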
Collapse
Affiliation(s)
- Cristiano Micheli
- Department of Psychology, Carl von Ossietzky University, Oldenburg, Germany
- Donders Centre for Cognitive Neuroimaging, Radboud University, Nijmegen, The Netherlands
| | - Inga M Schepers
- Department of Psychology, Carl von Ossietzky University, Oldenburg, Germany
- Research Center Neurosensory Science, Carl von Ossietzky University, Oldenburg, Germany
| | - Müge Ozker
- Department of Neurosurgery, Baylor College of Medicine, Houston, Texas
| | - Daniel Yoshor
- Department of Neurosurgery, Baylor College of Medicine, Houston, Texas
- Michael E. DeBakey Veterans Affairs Medical Center, Houston, Texas
| | | | - Jochem W Rieger
- Department of Psychology, Carl von Ossietzky University, Oldenburg, Germany
- Research Center Neurosensory Science, Carl von Ossietzky University, Oldenburg, Germany
| |
Collapse
|
45
|
Walsh KS, McGovern DP, Clark A, O'Connell RG. Evaluating the neurophysiological evidence for predictive processing as a model of perception. Ann N Y Acad Sci 2020; 1464:242-268. [PMID: 32147856 PMCID: PMC7187369 DOI: 10.1111/nyas.14321] [Citation(s) in RCA: 120] [Impact Index Per Article: 30.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2019] [Revised: 01/21/2020] [Accepted: 02/03/2020] [Indexed: 12/12/2022]
Abstract
For many years, the dominant theoretical framework guiding research into the neural origins of perceptual experience has been provided by hierarchical feedforward models, in which sensory inputs are passed through a series of increasingly complex feature detectors. However, the long-standing orthodoxy of these accounts has recently been challenged by a radically different set of theories that contend that perception arises from a purely inferential process supported by two distinct classes of neurons: those that transmit predictions about sensory states and those that signal sensory information that deviates from those predictions. Although these predictive processing (PP) models have become increasingly influential in cognitive neuroscience, they are also criticized for lacking the empirical support to justify their status. This limited evidence base partly reflects the considerable methodological challenges that are presented when trying to test the unique predictions of these models. However, a confluence of technological and theoretical advances has prompted a recent surge in human and nonhuman neurophysiological research seeking to fill this empirical gap. Here, we will review this new research and evaluate the degree to which its findings support the key claims of PP.
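The two unit classes at the heart of PP accounts (prediction units and prediction-error units) can be caricatured with a single-level simulation in which a prediction is repeatedly nudged by the error it leaves behind. The sketch below is a deliberately minimal illustration of that idea, not a reproduction of any specific model discussed in the review; the constant input and learning rate are arbitrary choices.

def predictive_coding_step(prediction, sensory_input, learning_rate=0.1):
    # The 'error unit' carries the mismatch between input and prediction;
    # the 'prediction unit' moves in the direction that reduces that mismatch.
    error = sensory_input - prediction
    return prediction + learning_rate * error, error

# With a constant input, the error signal shrinks as the prediction converges,
# mirroring the expectation-suppression effects reviewed in the paper.
prediction = 0.0
for _ in range(50):
    prediction, error = predictive_coding_step(prediction, sensory_input=1.0)
print(round(prediction, 3), round(error, 3))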
Collapse
Affiliation(s)
- Kevin S. Walsh
- Trinity College Institute of Neuroscience and School of PsychologyTrinity College DublinDublinIreland
| | - David P. McGovern
- Trinity College Institute of Neuroscience and School of PsychologyTrinity College DublinDublinIreland
- School of PsychologyDublin City UniversityDublinIreland
| | - Andy Clark
- Department of PhilosophyUniversity of SussexBrightonUK
- Department of InformaticsUniversity of SussexBrightonUK
| | - Redmond G. O'Connell
- Trinity College Institute of Neuroscience and School of PsychologyTrinity College DublinDublinIreland
| |
Collapse
|
47
|
Hueber T, Tatulli E, Girin L, Schwartz JL. Evaluating the Potential Gain of Auditory and Audiovisual Speech-Predictive Coding Using Deep Learning. Neural Comput 2020; 32:596-625. [PMID: 31951798 DOI: 10.1162/neco_a_01264] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/04/2022]
Abstract
Sensory processing is increasingly conceived in a predictive framework in which neurons constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). Those predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). The predictions appear to be accurate over a short temporal range (25-50 ms), capturing 50% to 75% of the variance of the incoming stimulus, which could potentially save up to three-quarters of the processing power. Prediction accuracy then decreases quickly and almost vanishes beyond 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and it is maximal for a delay of 75 ms between image and predicted sound.
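The reported gains are explained-variance figures: the fraction of the next audio frame's variance captured by a model that has only seen present and past frames. The sketch below shows that metric with a simple linear one-step predictor standing in for the paper's deep networks; the frame dimensionality, lag, and surrogate data are assumptions for illustration.

import numpy as np

def explained_variance(target, predicted):
    # 1 minus the ratio of residual variance to target variance (can be negative).
    return 1.0 - np.var(target - predicted) / np.var(target)

def fit_linear_predictor(frames, lag=1):
    # Least-squares mapping from the frame at time t to the frame at t + lag.
    past, future = frames[:-lag], frames[lag:]
    coef, *_ = np.linalg.lstsq(past, future, rcond=None)
    return coef

# Toy usage: 20-dimensional spectral-like frames with short-range temporal structure.
rng = np.random.default_rng(3)
noise = rng.normal(size=(2001, 20))
frames = noise[:-1] + 0.8 * noise[1:]       # neighbouring frames share structure
coef = fit_linear_predictor(frames[:1500])
test_past, test_future = frames[1500:-1], frames[1501:]
print(explained_variance(test_future, test_past @ coef))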
Collapse
Affiliation(s)
- Thomas Hueber
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
| | - Eric Tatulli
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
| | - Laurent Girin
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France, and Inria Grenoble-Rhône-Alpes, 38330 Montbonnot-Saint Martin, France
| | - Jean-Luc Schwartz
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
| |
Collapse
|
48
|
Ren Y, Xu Z, Wang T, Yang W. Age-related alterations in audiovisual integration: a brief overview. Psychologia 2020. [DOI: 10.2117/psysoc.2020-a002] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Affiliation(s)
- Yanna Ren
- Guizhou University of Chinese Medicine
- Zhihan Xu
- Okayama University
- Ningbo University of Technology
- Tao Wang
- Guizhou Light Industry Technical College
| | | |
Collapse
|
49
|
Lee JJ, Finch E, Rose T. Exploring the outcomes and perceptions of people with aphasia who conversed with speech pathology students via telepractice: a pilot study. SPEECH, LANGUAGE AND HEARING 2019. [DOI: 10.1080/2050571x.2019.1702241] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
Affiliation(s)
- Jeslyn J. Lee
- School of Health and Rehabilitation Sciences, The University of Queensland, Brisbane, QLD, Australia
| | - Emma Finch
- School of Health and Rehabilitation Sciences, The University of Queensland, Brisbane, QLD, Australia
- Speech Pathology Department, Princess Alexandra Hospital, Brisbane, QLD, Australia
- Centre for Functioning and Health Research, Metro South Health, Brisbane, QLD, Australia
| | - Tanya Rose
- School of Health and Rehabilitation Sciences, The University of Queensland, Brisbane, QLD, Australia
| |
Collapse
|
50
|
Sorati M, Behne DM. Musical Expertise Affects Audiovisual Speech Perception: Findings From Event-Related Potentials and Inter-trial Phase Coherence. Front Psychol 2019; 10:2562. [PMID: 31803107 PMCID: PMC6874039 DOI: 10.3389/fpsyg.2019.02562] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2019] [Accepted: 10/29/2019] [Indexed: 12/03/2022] Open
Abstract
In audiovisual speech perception, visual information from a talker's face during mouth articulation is available before the onset of the corresponding audio speech and thereby allows the perceiver to predict the upcoming audio. This prediction from phonetically congruent visual information modulates audiovisual speech perception and leads to a decrease in N1 and P2 amplitudes and latencies compared to the perception of audio speech alone. Whether audiovisual experience, such as musical training, influences this prediction is unclear, but if it does, it may explain some of the variation observed in previous research. The current study addresses whether audiovisual speech perception is affected by musical training, assessing N1 and P2 event-related potentials (ERPs) as well as inter-trial phase coherence (ITPC). Musicians and non-musicians were presented with the syllable /ba/ in audio-only (AO), video-only (VO), and audiovisual (AV) conditions. With the predictory effect of mouth movement isolated from the AV speech (AV-VO), results showed that, compared to audio speech, both groups had a lower N1 latency and lower P2 amplitude and latency. They also showed lower ITPCs in the delta, theta, and beta bands in audiovisual speech perception. However, musicians showed a significant suppression of N1 amplitude and desynchronization in the alpha band in audiovisual speech that were not present for non-musicians. Collectively, the current findings indicate that early sensory processing can be modified by musical experience, which in turn can explain some of the variation in previous AV speech perception research.
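The AV-VO step referred to above subtracts the video-only ERP from the audiovisual ERP so that the remaining waveform can be compared with the audio-only response, after which N1 and P2 are read off as peak latency and amplitude. The sketch below illustrates those two generic steps; the latency window and simulated waveforms are assumptions, not the study's parameters.

import numpy as np

def av_minus_vo(av_erp, vo_erp):
    # Remove the purely visual contribution from the audiovisual ERP.
    return av_erp - vo_erp

def peak_in_window(erp, times, window, polarity=-1):
    # Latency and amplitude of the most negative (polarity=-1) or positive (+1) point.
    lo, hi = window
    mask = (times >= lo) & (times <= hi)
    seg, seg_t = erp[mask], times[mask]
    idx = np.argmin(seg) if polarity < 0 else np.argmax(seg)
    return seg_t[idx], seg[idx]

# Hypothetical usage with simulated trial-averaged waveforms (an N1-like deflection).
fs = 500
times = np.arange(-0.1, 0.5, 1 / fs)
rng = np.random.default_rng(4)
av_erp = -1.0 * np.exp(-((times - 0.09) / 0.02) ** 2) + rng.normal(0, 0.05, times.size)
vo_erp = rng.normal(0, 0.05, times.size)
n1_latency, n1_amplitude = peak_in_window(av_minus_vo(av_erp, vo_erp), times, (0.07, 0.15))
print(n1_latency, n1_amplitude)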
Collapse
Affiliation(s)
- Marzieh Sorati
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
| | | |
Collapse
|