1. Alemi R, Wolfe J, Neumann S, Manning J, Towler W, Koirala N, Gracco VL, Deroche M. Audiovisual integration in children with cochlear implants revealed through EEG and fNIRS. Brain Res Bull 2023; 205:110817. [PMID: 37989460] [DOI: 10.1016/j.brainresbull.2023.110817]
Abstract
Sensory deprivation can offset the balance of audio versus visual information in multimodal processing. Such a phenomenon could persist for children born deaf, even after they receive cochlear implants (CIs), and could potentially explain why one modality is given priority over the other. Here, we recorded cortical responses to a single speaker uttering two syllables, presented in audio-only (A), visual-only (V), and audio-visual (AV) modes. Electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) were successively recorded in seventy-five school-aged children: twenty-five with normal hearing (NH) and fifty wearing CIs, of whom twenty-six had relatively high language abilities (HL) comparable to those of the NH children and twenty-four had low language abilities (LL). In the EEG data, visual-evoked potentials were captured in occipital regions in response to V and AV stimuli, and they were accentuated in the HL group compared to the LL group (with the NH group intermediate). Close to the vertex, auditory-evoked potentials were captured in response to A and AV stimuli and reflected a differential treatment of the two syllables, but only in the NH group. None of the EEG metrics revealed any interaction between group and modality. In the fNIRS data, each modality induced corresponding activity in visual or auditory regions, but no group difference was observed for A, V, or AV stimulation. The present study did not reveal any sign of abnormal AV integration in children with CIs. An efficient multimodal integrative network (at least for rudimentary speech materials) is clearly not a sufficient condition for good language and literacy.
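Note on method: the evoked potentials described above come from the standard epoch-and-average approach to stimulus-locked EEG. The minimal sketch below illustrates that computation in NumPy; the sampling rate, epoch window, baseline interval, and channel indices are placeholder assumptions, not the authors' actual pipeline.

```python
import numpy as np

def evoked_potential(eeg, onsets, fs, tmin=-0.1, tmax=0.5):
    """Average stimulus-locked EEG epochs into an evoked potential.

    eeg    : array (n_channels, n_samples) of continuous EEG
    onsets : stimulus onset times in seconds (e.g., V or AV trials)
    fs     : sampling rate in Hz
    """
    pre, post = int(tmin * fs), int(tmax * fs)
    epochs = []
    for t in onsets:
        i = int(round(t * fs))
        if i + pre >= 0 and i + post <= eeg.shape[1]:
            seg = eeg[:, i + pre:i + post]
            # Baseline-correct each epoch using the pre-stimulus interval.
            seg = seg - seg[:, :-pre].mean(axis=1, keepdims=True)
            epochs.append(seg)
    return np.mean(epochs, axis=0)  # shape: (n_channels, n_times)

# Hypothetical usage: a visual-evoked potential over occipital channels.
# fs = 1000.0
# vep = evoked_potential(eeg_data, visual_onsets, fs)
# occipital_vep = vep[[idx_O1, idx_O2]].mean(axis=0)
```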
Affiliation(s)
- Razieh Alemi
- Department of Psychology, Concordia University, 7141 Sherbrooke St. West, Montreal, Quebec H4B 1R6, Canada
- Jace Wolfe
- Oberkotter Foundation, Oklahoma City, OK, USA
- Sara Neumann
- Hearts for Hearing Foundation, 11500 Portland Av., Oklahoma City, OK 73120, USA
- Jacy Manning
- Hearts for Hearing Foundation, 11500 Portland Av., Oklahoma City, OK 73120, USA
- Will Towler
- Hearts for Hearing Foundation, 11500 Portland Av., Oklahoma City, OK 73120, USA
- Nabin Koirala
- Haskins Laboratories, 300 George St., New Haven, CT 06511, USA
- Mickael Deroche
- Department of Psychology, Concordia University, 7141 Sherbrooke St. West, Montreal, Quebec H4B 1R6, Canada
2. Sewell K, Brown VA, Farwell G, Rogers M, Zhang X, Strand JF. The effects of temporal cues, point-light displays, and faces on speech identification and listening effort. PLoS One 2023; 18:e0290826. [PMID: 38019831] [PMCID: PMC10686424] [DOI: 10.1371/journal.pone.0290826]
Abstract
Among the most robust findings in speech research is that the presence of a talking face improves the intelligibility of spoken language. Talking faces supplement the auditory signal by providing fine phonetic cues based on the placement of the articulators, as well as temporal cues to when speech is occurring. In this study, we varied the amount of information contained in the visual signal, ranging from temporal information alone to a natural talking face. Participants were presented with spoken sentences in energetic or informational masking in four different visual conditions: audio-only, a modulating circle providing temporal cues to salient features of the speech, a digitally rendered point-light display showing lip movement, and a natural talking face. We assessed both sentence identification accuracy and self-reported listening effort. Audiovisual benefit for intelligibility was observed for the natural face in both informational and energetic masking, but the digitally rendered point-light display only provided benefit in energetic masking. Intelligibility for speech accompanied by the modulating circle did not differ from the audio-only conditions in either masker type. Thus, the temporal cues used here were insufficient to improve speech intelligibility in noise, but some types of digital point-light displays may contain enough phonetic detail to produce modest improvements in speech identification in noise.
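Note on method: the "modulating circle" condition conveys only the temporal structure of the speech. A minimal sketch of how such a stimulus could be driven is shown below, mapping a smoothed amplitude envelope onto a circle radius; the cutoff frequency, frame rate, and radius range are illustrative assumptions, and the paper's actual rendering parameters may differ.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def envelope_to_radius(speech, fs, video_fps=60, cutoff_hz=10.0,
                       r_min=20.0, r_max=80.0):
    """Map the amplitude envelope of a speech signal to circle radii.

    Returns one radius (in arbitrary pixel units) per video frame.
    """
    # Broadband amplitude envelope via the analytic signal.
    env = np.abs(hilbert(speech))
    # Smooth the envelope; syllabic modulations sit well below ~10 Hz.
    b, a = butter(4, cutoff_hz / (fs / 2), btype="low")
    env = np.clip(filtfilt(b, a, env), 0.0, None)
    # Resample the envelope to the video frame rate.
    n_frames = int(len(speech) / fs * video_fps)
    frame_times = np.arange(n_frames) / video_fps
    env_frames = np.interp(frame_times, np.arange(len(env)) / fs, env)
    # Normalize and map onto the radius range.
    env_frames /= env_frames.max() + 1e-12
    return r_min + (r_max - r_min) * env_frames
```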
Affiliation(s)
- Katrina Sewell
- Department of Psychology, Carleton College, Northfield, MN, United States of America
- Violet A. Brown
- Department of Psychological & Brain Sciences, Washington University in St. Louis, St. Louis, MO, United States of America
- Grace Farwell
- Department of Psychology, Carleton College, Northfield, MN, United States of America
- Maya Rogers
- Department of Psychology, Carleton College, Northfield, MN, United States of America
- Xingyi Zhang
- Department of Psychology, Carleton College, Northfield, MN, United States of America
- Julia F. Strand
- Department of Psychology, Carleton College, Northfield, MN, United States of America
3. Cappelloni MS, Mateo VS, Maddox RK. Performance in an Audiovisual Selective Attention Task Using Speech-Like Stimuli Depends on the Talker Identities, But Not Temporal Coherence. Trends Hear 2023; 27:23312165231207235. [PMID: 37847849] [PMCID: PMC10586009] [DOI: 10.1177/23312165231207235]
Abstract
Audiovisual integration of speech can benefit the listener not only by improving comprehension of what a talker is saying but also by helping a listener select a particular talker's voice from a mixture of sounds. Binding, an early integration of auditory and visual streams that helps an observer allocate attention to a combined audiovisual object, is likely involved in processing audiovisual speech. Although temporal coherence of stimulus features across sensory modalities has been implicated as an important cue for non-speech stimuli (Maddox et al., 2015), the specific cues that drive binding in speech are not fully understood due to the challenges of studying binding in natural stimuli. Here we used speech-like artificial stimuli that allowed us to isolate three potential contributors to binding: temporal coherence (are the face and the voice changing synchronously?), articulatory correspondence (do the visual faces represent the correct phones?), and talker congruence (do the face and voice come from the same person?). In a trio of experiments, we examined the relative contributions of each of these cues. Normal-hearing listeners performed a dual task in which they were instructed to respond to events in a target auditory stream while ignoring events in a distractor auditory stream (auditory discrimination) and to detect flashes in a visual stream (visual detection). We found that viewing the face of a talker who matched the attended voice (i.e., talker congruence) offered a performance benefit. We found no effect of temporal coherence on performance in this task, prompting an important recontextualization of previous findings.
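Note on method: "temporal coherence" here refers to whether the visual feature trajectory tracks the amplitude modulations of the attended voice. One simple way to quantify that relationship is the correlation between the audio envelope and the visual trajectory, sketched below; this is an illustration of the concept under assumed inputs, not the study's stimulus-generation code.

```python
import numpy as np

def temporal_coherence(audio_env, visual_feature):
    """Pearson correlation between an audio amplitude envelope and a
    visual feature trajectory sampled on the same time base.

    A coherent audiovisual pair (visual feature driven by the attended
    voice) should yield a correlation near 1; an incoherent pair (driven
    by the distractor voice or by independent modulation) should not.
    """
    a = np.asarray(audio_env, dtype=float)
    v = np.asarray(visual_feature, dtype=float)
    a = (a - a.mean()) / (a.std() + 1e-12)
    v = (v - v.mean()) / (v.std() + 1e-12)
    return float(np.mean(a * v))

# Hypothetical usage:
# rho_match = temporal_coherence(target_envelope, face_trajectory)
# rho_mismatch = temporal_coherence(distractor_envelope, face_trajectory)
```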
Affiliation(s)
- Madeline S. Cappelloni
- Biomedical Engineering, University of Rochester, Rochester, NY, USA
- Center for Visual Science, University of Rochester, Rochester, NY, USA
- Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Vincent S. Mateo
- Audio and Music Engineering, University of Rochester, Rochester, NY, USA
- Ross K. Maddox
- Biomedical Engineering, University of Rochester, Rochester, NY, USA
- Center for Visual Science, University of Rochester, Rochester, NY, USA
- Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Neuroscience, University of Rochester, Rochester, NY, USA
4. Shan T, Wenner CE, Xu C, Duan Z, Maddox RK. Speech-In-Noise Comprehension is Improved When Viewing a Deep-Neural-Network-Generated Talking Face. Trends Hear 2022; 26:23312165221136934. [PMID: 36384325] [PMCID: PMC9677167] [DOI: 10.1177/23312165221136934]
Abstract
Listening in a noisy environment is challenging, but many previous studies have demonstrated that comprehension of speech can be substantially improved by looking at the talker's face. We recently developed a deep neural network (DNN) based system that generates movies of a talking face from speech audio and a single face image. In this study, we aimed to quantify the benefit that such a system can bring to speech comprehension, especially in noise. The target speech audio was masked at signal-to-noise ratios of -9, -6, -3, and 0 dB and was presented to subjects in three audio-visual (AV) stimulus conditions: (1) synthesized AV: audio with the synthesized talking-face movie; (2) natural AV: audio with the original movie from the corpus; and (3) audio-only: audio with a static image of the talker. Subjects were asked to type the sentences they heard in each trial, and keyword recognition was quantified for each condition. Overall, performance in the synthesized AV condition fell approximately halfway between the other two conditions, showing a marked improvement over the audio-only control but still falling short of the natural AV condition. Every subject showed some benefit from the synthetic AV stimulus. The results of this study support the idea that a DNN-based model that generates a talking face from speech audio can meaningfully enhance comprehension in noisy environments and has the potential to be used as a visual hearing aid.
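Note on method: presenting target speech masked at -9, -6, -3, and 0 dB amounts to scaling the masker so that the speech-to-masker power ratio hits each target value. The sketch below shows that arithmetic using a full-signal RMS definition of SNR; that definition is an assumption, since the study may compute SNR differently (e.g., over speech-active frames only).

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise at the requested SNR (in dB).

    SNR is defined here on full-signal RMS power:
        snr_db = 10 * log10(P_speech / P_noise_scaled)
    Assumes the masker is at least as long as the speech.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the masker to the power required by the target SNR.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# The four conditions reported above:
# mixtures = {snr: mix_at_snr(target, masker, snr) for snr in (-9, -6, -3, 0)}
```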
Affiliation(s)
- Tong Shan
- Department of Biomedical Engineering, University of Rochester, Rochester, NY, USA
- Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Center for Visual Science, University of Rochester, Rochester, NY, USA
- Casper E. Wenner
- Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
- Chenliang Xu
- Department of Computer Science, University of Rochester, Rochester, NY, USA
- Zhiyao Duan
- Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA
- Ross K. Maddox
- Department of Biomedical Engineering, University of Rochester, Rochester, NY, USA
- Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Center for Visual Science, University of Rochester, Rochester, NY, USA
- Department of Neuroscience, University of Rochester, Rochester, NY, USA