1
Magnotti JF, Lado A, Beauchamp MS. The noisy encoding of disparity model predicts perception of the McGurk effect in native Japanese speakers. Front Neurosci 2024; 18:1421713. PMID: 38988770; PMCID: PMC11233445; DOI: 10.3389/fnins.2024.1421713.
Abstract
In the McGurk effect, visual speech from the face of the talker alters the perception of auditory speech. The diversity of human languages has prompted many intercultural studies of the effect in both Western and non-Western cultures, including native Japanese speakers. Studies of large samples of native English speakers have shown that the McGurk effect is characterized by high variability in the susceptibility of different individuals to the illusion and in the strength of different experimental stimuli to induce the illusion. The noisy encoding of disparity (NED) model of the McGurk effect uses principles from Bayesian causal inference to account for this variability, separately estimating the susceptibility and sensory noise for each individual and the strength of each stimulus. To determine whether variation in McGurk perception is similar between Western and non-Western cultures, we applied the NED model to data collected from 80 native Japanese-speaking participants. Fifteen different McGurk stimuli that varied in syllable content (unvoiced auditory "pa" + visual "ka" or voiced auditory "ba" + visual "ga") were presented interleaved with audiovisual congruent stimuli. The McGurk effect was highly variable across stimuli and participants, with the percentage of illusory fusion responses ranging from 3 to 78% across stimuli and from 0 to 91% across participants. Despite this variability, the NED model accurately predicted perception, predicting fusion rates for individual stimuli with 2.1% error and for individual participants with 2.4% error. Stimuli containing the unvoiced pa/ka pairing evoked more fusion responses than the voiced ba/ga pairing. Model estimates of sensory noise were correlated with participant age, with greater sensory noise in older participants. The NED model of the McGurk effect offers a principled way to account for individual and stimulus differences when examining the McGurk effect in different cultures.
Affiliation(s)
- John F Magnotti: Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Anastasia Lado: Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Michael S Beauchamp: Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
2
Hisaizumi M, Tantam D. Enhanced sensitivity to pitch perception and its possible relation to language acquisition in autism. Autism & Developmental Language Impairments 2024; 9:23969415241248618. PMID: 38817731; PMCID: PMC11138189; DOI: 10.1177/23969415241248618.
Abstract
Background and aims Fascinations for or aversions to particular sounds are a familiar feature of autism, as is an ability to reproduce another person's utterances, precisely copying the other person's prosody as well as their words. Such observations seem to indicate not only that autistic people can pay close attention to what they hear, but also that they have the ability to perceive the finer details of auditory stimuli. This is consistent with the previously reported consensus that absolute pitch is more common in autistic individuals than in neurotypicals. We take this to suggest that autistic people have perception that allows them to pay attention to fine details. It is important to establish whether or not this is so as autism is often presented as a deficit rather than a difference. We therefore undertook a narrative literature review of studies of auditory perception, in autistic and nonautistic individuals, focussing on any differences in processing linguistic and nonlinguistic sounds. Main contributions We find persuasive evidence that nonlinguistic auditory perception in autistic children differs from that of nonautistic children. This is supported by the additional finding of a higher prevalence of absolute pitch and enhanced pitch discriminating abilities in autistic children compared to neurotypical children. Such abilities appear to stem from atypical perception, which is biased toward local-level information necessary for processing pitch and other prosodic features. Enhanced pitch discriminating abilities tend to be found in autistic individuals with a history of language delay, suggesting possible reciprocity. Research on various aspects of language development in autism also supports the hypothesis that atypical pitch perception may be accountable for observed differences in language development in autism. Conclusions The results of our review of previously published studies are consistent with the hypothesis that auditory perception, and particularly pitch perception, in autism are different from the norm but not always impaired. Detail-oriented pitch perception may be an advantage given the right environment. We speculate that unusually heightened sensitivity to pitch differences may be at the cost of the normal development of the perception of the sounds that contribute most to early language development. Implications The acquisition of speech and language may be a process that normally involves an enhanced perception of speech sounds at the expense of the processing of nonlinguistic sounds, but autistic children may not give speech sounds this same priority.
Affiliation(s)
- Digby Tantam: Middlesex University, Existential Academy, London, UK
3
Winn MB, Wright RA, Tucker BV. Reconsidering classic ideas in speech communication. The Journal of the Acoustical Society of America 2023; 153:1623. PMID: 37002094; DOI: 10.1121/10.0017487.
Abstract
The papers in this special issue provide a critical look at some historical ideas that have influenced research and teaching in the field of speech communication. They also examine widely used methodologies and address long-standing methodological challenges in the areas of speech perception and speech production. The goal is to reconsider these historical ideas and to evaluate where caution is needed or where they should be replaced by more modern results and methods. The contributions provide respectful historical context for the classic ideas, along with new original research or discussion that clarifies their limitations.
Affiliation(s)
- Matthew B Winn: Speech-Language-Hearing Sciences, University of Minnesota, Minneapolis, Minnesota 55455, USA
- Richard A Wright: Department of Linguistics, University of Washington, Seattle, Washington 98195, USA
- Benjamin V Tucker: Department of Communication Sciences and Disorders, Northern Arizona University, Flagstaff, Arizona 86011, USA
4
Abstract
Visual speech cues play an important role in speech recognition, and the McGurk effect is a classic demonstration of this. In the original McGurk & MacDonald (Nature, 264, 746-748, 1976) experiment, 98% of participants reported an illusory "fusion" percept of /d/ when listening to the spoken syllable /b/ and watching the visual speech movements for /g/. However, more recent work shows that subject and task differences influence the proportion of fusion responses. In the current study, we varied task (forced-choice vs. open-ended), stimulus set (including /d/ exemplars vs. not), and data collection environment (lab vs. Mechanical Turk) to investigate the robustness of the McGurk effect. Across experiments, using the same stimuli to elicit the McGurk effect, we found fusion responses ranging from 10% to 60%, thus showing large variability in the likelihood of experiencing the McGurk effect across factors that are unrelated to the perceptual information provided by the stimuli. Rather than a robust perceptual illusion, we therefore argue that the McGurk effect exists only for some individuals under specific task situations. Significance: This series of studies re-evaluates the classic McGurk effect, which shows the relevance of visual cues to speech perception. We highlight the importance of taking into account subject variables and task differences, and challenge future researchers to think carefully about the perceptual basis of the McGurk effect, how it is defined, and what it can tell us about audiovisual integration in speech.
5
Lindborg A, Andersen TS. Bayesian binding and fusion models explain illusion and enhancement effects in audiovisual speech perception. PLoS One 2021; 16:e0246986. PMID: 33606815; PMCID: PMC7895372; DOI: 10.1371/journal.pone.0246986.
Abstract
Speech is perceived with both the ears and the eyes. Adding congruent visual speech improves the perception of a faint auditory speech stimulus, whereas adding incongruent visual speech can alter the perception of the utterance. The latter phenomenon is the McGurk illusion, in which an auditory stimulus such as "ba" dubbed onto a visual stimulus such as "ga" produces the illusion of hearing "da". Bayesian models of multisensory perception suggest that both the enhancement and the illusion case can be described as a two-step process of binding (informed by prior knowledge) and fusion (informed by the information reliability of each sensory cue). However, to date no study has accounted for how these two stages each contribute to audiovisual speech perception. In this study, we expose subjects to both congruent and incongruent audiovisual speech, manipulating the binding and the fusion stages simultaneously. This is done by varying both temporal offset (binding) and auditory and visual signal-to-noise ratio (fusion). We fit two Bayesian models to the behavioural data and show that they can both account for the enhancement effect in congruent audiovisual speech, as well as the McGurk illusion. This modelling approach allows us to disentangle the effects of binding and fusion on behavioural responses. Moreover, we find that these models have greater predictive power than a forced fusion model. This study provides a systematic and quantitative approach to measuring audiovisual integration in the perception of the McGurk illusion as well as congruent audiovisual speech, which we hope will inform future work on audiovisual speech perception.
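As a rough formalization of that two-step idea (our notation and distributional choices, not the authors' implementation), the binding stage can be written as a Bayesian comparison of a "bound" versus "unbound" hypothesis given the audiovisual temporal offset, and the fusion stage as reliability-weighted averaging of the two cues:

```latex
% Binding: posterior probability that the auditory and visual streams share a
% source, given the temporal offset \Delta t (\pi_0 is a binding prior and U a
% broad density over offsets when the streams are unrelated).
P(\mathrm{bind}\mid \Delta t)=
  \frac{\pi_0\,\mathcal{N}(\Delta t;0,\sigma_t^2)}
       {\pi_0\,\mathcal{N}(\Delta t;0,\sigma_t^2)+(1-\pi_0)\,U(\Delta t)}

% Fusion: if bound, the cues x_A and x_V are combined in proportion to their
% reliabilities, which the experiment manipulates via auditory and visual SNR.
\hat{s}_{AV}=\frac{x_A/\sigma_A^2+x_V/\sigma_V^2}{1/\sigma_A^2+1/\sigma_V^2}
```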
Affiliation(s)
- Alma Lindborg: Department of Psychology, University of Potsdam, Potsdam, Germany; Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
- Tobias S. Andersen: Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark
6
Thézé R, Gadiri MA, Albert L, Provost A, Giraud AL, Mégevand P. Animated virtual characters to explore audio-visual speech in controlled and naturalistic environments. Sci Rep 2020; 10:15540. PMID: 32968127; PMCID: PMC7511320; DOI: 10.1038/s41598-020-72375-y.
Abstract
Natural speech is processed in the brain as a mixture of auditory and visual features. An example of the importance of visual speech is the McGurk effect and related perceptual illusions that result from mismatching auditory and visual syllables. Although the McGurk effect has widely been applied to the exploration of audio-visual speech processing, it relies on isolated syllables, which severely limits the conclusions that can be drawn from the paradigm. In addition, the extreme variability and the quality of the stimuli usually employed prevent comparability across studies. To overcome these limitations, we present an innovative methodology using 3D virtual characters with realistic lip movements synchronized with computer-synthesized speech. We used commercially accessible and affordable tools to facilitate reproducibility and comparability, and the set-up was validated on 24 participants performing a perception task. Within complete and meaningful French sentences, we paired a labiodental fricative viseme (i.e. /v/) with a bilabial occlusive phoneme (i.e. /b/). This audiovisual mismatch is known to induce the illusion of hearing /v/ in a proportion of trials. We tested the rate of the illusion while varying the magnitude of background noise and audiovisual lag. Overall, the effect was observed in 40% of trials. The proportion rose to about 50% with added background noise and up to 66% when controlling for phonetic features. Our results conclusively demonstrate that computer-generated speech stimuli are a judicious choice, and that they can supplement natural speech with higher control over stimulus timing and content.
Affiliation(s)
- Raphaël Thézé: Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202 Geneva, Switzerland
- Mehdi Ali Gadiri: Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202 Geneva, Switzerland
- Louis Albert: Human Neuroscience Platform, Fondation Campus Biotech Geneva, Geneva, Switzerland
- Antoine Provost: Human Neuroscience Platform, Fondation Campus Biotech Geneva, Geneva, Switzerland
- Anne-Lise Giraud: Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202 Geneva, Switzerland
- Pierre Mégevand: Department of Basic Neurosciences, University of Geneva, Campus Biotech, Chemin des Mines 9, 1202 Geneva, Switzerland; Division of Neurology, Geneva University Hospitals, Geneva, Switzerland
7
Randazzo M, Priefer R, Smith PJ, Nagler A, Avery T, Froud K. Neural Correlates of Modality-Sensitive Deviance Detection in the Audiovisual Oddball Paradigm. Brain Sci 2020; 10(6):328. PMID: 32481538; PMCID: PMC7348766; DOI: 10.3390/brainsci10060328.
Abstract
The McGurk effect, an incongruent pairing of visual /ga/–acoustic /ba/, creates a fusion illusion /da/ and is the cornerstone of research in audiovisual speech perception. Combination illusions occur given reversal of the input modalities—auditory /ga/-visual /ba/, and percept /bga/. A robust literature shows that fusion illusions in an oddball paradigm evoke a mismatch negativity (MMN) in the auditory cortex, in absence of changes to acoustic stimuli. We compared fusion and combination illusions in a passive oddball paradigm to further examine the influence of visual and auditory aspects of incongruent speech stimuli on the audiovisual MMN. Participants viewed videos under two audiovisual illusion conditions: fusion with visual aspect of the stimulus changing, and combination with auditory aspect of the stimulus changing, as well as two unimodal auditory- and visual-only conditions. Fusion and combination deviants exerted similar influence in generating congruency predictions with significant differences between standards and deviants in the N100 time window. Presence of the MMN in early and late time windows differentiated fusion from combination deviants. When the visual signal changes, a new percept is created, but when the visual is held constant and the auditory changes, the response is suppressed, evoking a later MMN. In alignment with models of predictive processing in audiovisual speech perception, we interpreted our results to indicate that visual information can both predict and suppress auditory speech perception.
Affiliation(s)
- Melissa Randazzo: Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA (corresponding author; Tel.: +1-516-877-4769)
- Ryan Priefer: Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA
- Paul J. Smith: Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA
- Amanda Nagler: Department of Communication Sciences and Disorders, Adelphi University, Garden City, NY 11530, USA
- Trey Avery: Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA
- Karen Froud: Neuroscience and Education, Department of Biobehavioral Sciences, Teachers College, Columbia University, New York, NY 10027, USA
8
Shahin AJ. Neural evidence accounting for interindividual variability of the McGurk illusion. Neurosci Lett 2019; 707:134322. PMID: 31181299; DOI: 10.1016/j.neulet.2019.134322.
Abstract
The McGurk illusion is experienced to various degrees among the general population. Previous studies have implicated the left superior temporal sulcus (STS) and auditory cortex (AC) as regions associated with this interindividual variability. We sought to further investigate the neurophysiology underlying this variability using a variant of the McGurk illusion design. Electroencephalography (EEG) was recorded while human subjects were presented with videos of a speaker uttering the consonant-vowels (CVs) /ba/ and /fa/, which were mixed and matched with audio of /ba/ and /fa/ to produce congruent and incongruent conditions. Subjects were also presented with unimodal stimuli of silent videos and audios of the CVs. They responded to whether they heard (or saw in the silent condition) /ba/ or /fa/. An illusion during the incongruent conditions was deemed successful when individuals heard the syllable conveyed by mouth movements. We hypothesized that individuals who experience the illusion more strongly should exhibit more robust desynchronization of alpha (7-12 Hz) at fronto-central and temporal sites, emphasizing more engagement of neural generators at the AC and STS. We found, however, that compared to weaker illusion perceivers, stronger illusion perceivers exhibited greater alpha synchronization at fronto-central and posterior temporal sites, which is consistent with inhibition of auditory representations. These findings suggest that stronger McGurk illusion perceivers possess more robust cross-modal sensory gating mechanisms whereby phonetic representations not conveyed by the visual system are inhibited, and in turn reinforcing perception of the visually targeted phonemes.
Affiliation(s)
- Antoine J Shahin: Department of Cognitive and Information Sciences, University of California, Merced, CA 95343, United States; Center for Mind and Brain, University of California, Davis, CA 95618, United States
9
Modelska M, Pourquié M, Baart M. No "Self" Advantage for Audiovisual Speech Aftereffects. Front Psychol 2019; 10:658. PMID: 30967827; PMCID: PMC6440388; DOI: 10.3389/fpsyg.2019.00658.
Abstract
Although the default state of the world is that we see and hear other people talking, there is evidence that seeing and hearing ourselves rather than someone else may lead to visual (i.e., lip-read) or auditory "self" advantages. We assessed whether there is a "self" advantage for phonetic recalibration (a lip-read driven cross-modal learning effect) and selective adaptation (a contrastive effect in the opposite direction of recalibration). We observed both aftereffects as well as an on-line effect of lip-read information on auditory perception (i.e., immediate capture), but there was no evidence for a "self" advantage in any of the tasks (as additionally supported by Bayesian statistics). These findings strengthen the emerging notion that recalibration reflects a general learning mechanism, and bolster the argument that adaptation depends on rather low-level auditory/acoustic features of the speech signal.
Affiliation(s)
- Maria Modelska: BCBL – Basque Center on Cognition, Brain and Language, Donostia, Spain
- Marie Pourquié: BCBL – Basque Center on Cognition, Brain and Language, Donostia, Spain; UPPA, IKER (UMR5478), Bayonne, France
- Martijn Baart: BCBL – Basque Center on Cognition, Brain and Language, Donostia, Spain; Department of Cognitive Neuropsychology, Tilburg University, Tilburg, Netherlands
10
Abstract
Speech research during recent years has moved progressively away from its traditional focus on audition toward a more multisensory approach. In addition to audition and vision, many somatosenses including proprioception, pressure, vibration and aerotactile sensation are all highly relevant modalities for experiencing and/or conveying speech. In this article, we review both long-standing cross-modal effects stemming from decades of audiovisual speech research as well as new findings related to somatosensory effects. Cross-modal effects in speech perception to date are found to be constrained by temporal congruence and signal relevance, but appear to be unconstrained by spatial congruence. Far from taking place in a one-, two- or even three-dimensional space, the literature reveals that speech occupies a highly multidimensional sensory space. We argue that future research in cross-modal effects should expand to consider each of these modalities both separately and in combination with other modalities in speech.
Affiliation(s)
- Megan Keough: Interdisciplinary Speech Research Lab, Department of Linguistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Donald Derrick: New Zealand Institute of Brain and Behaviour, University of Canterbury, Christchurch 8140, New Zealand; MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, New South Wales 2751, Australia
- Bryan Gick: Interdisciplinary Speech Research Lab, Department of Linguistics, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada; Haskins Laboratories, Yale University, New Haven, CT 06511, USA
11
Barnaud ML, Bessière P, Diard J, Schwartz JL. Reanalyzing neurocognitive data on the role of the motor system in speech perception within COSMO, a Bayesian perceptuo-motor model of speech communication. Brain and Language 2018; 187:19-32. PMID: 29241588; PMCID: PMC6286382; DOI: 10.1016/j.bandl.2017.12.003.
Abstract
While neurocognitive data provide clear evidence for the involvement of the motor system in speech perception, its precise role and the way motor information is involved in perceptual decisions remain unclear. In this paper, we discuss some recent experimental results in light of COSMO, a Bayesian perceptuo-motor model of speech communication. COSMO enables us to model both speech perception and speech production with probability distributions relating phonological units to sensory and motor variables. Speech perception is conceived as a sensory-motor architecture combining an auditory and a motor decoder through a Bayesian fusion process. We propose a sketch of a neuroanatomical architecture for COSMO, and we capitalize on properties of the auditory vs. motor decoders to address three neurocognitive studies from the literature. Altogether, this computational study reinforces functional arguments supporting the role of a motor decoding branch in the speech perception process.
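As a toy illustration of that decoder-fusion idea (the numbers are made up and this is not COSMO's actual parameterization), the auditory and motor posteriors over phonological units can be multiplied and renormalized:

```python
# Toy sketch of Bayesian fusion of an auditory and a motor decoder over
# phonological units: multiply the two posteriors and renormalize.
import numpy as np

units = ["ba", "da", "ga"]
p_auditory = np.array([0.6, 0.3, 0.1])   # hypothetical auditory decoder output
p_motor = np.array([0.3, 0.5, 0.2])      # hypothetical motor decoder output

fused = p_auditory * p_motor
fused /= fused.sum()                     # renormalize the product
print(dict(zip(units, fused.round(3))))
```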
Affiliation(s)
- Marie-Lou Barnaud: Univ. Grenoble Alpes, Gipsa-lab, F-38000 Grenoble, France; CNRS, Gipsa-lab, F-38000 Grenoble, France; Univ. Grenoble Alpes, LPNC, F-38000 Grenoble, France; CNRS, LPNC, F-38000 Grenoble, France
- Julien Diard: Univ. Grenoble Alpes, LPNC, F-38000 Grenoble, France; CNRS, LPNC, F-38000 Grenoble, France
- Jean-Luc Schwartz: Univ. Grenoble Alpes, Gipsa-lab, F-38000 Grenoble, France; CNRS, Gipsa-lab, F-38000 Grenoble, France
12
Stevenson RA, Sheffield SW, Butera IM, Gifford RH, Wallace MT. Multisensory Integration in Cochlear Implant Recipients. Ear Hear 2018; 38:521-538. PMID: 28399064; DOI: 10.1097/aud.0000000000000435.
Abstract
Speech perception is inherently a multisensory process involving integration of auditory and visual cues. Multisensory integration in cochlear implant (CI) recipients is a unique circumstance in that the integration occurs after auditory deprivation and the provision of hearing via the CI. Despite the clear importance of multisensory cues for perception, in general, and for speech intelligibility, specifically, the topic of multisensory perceptual benefits in CI users has only recently begun to emerge as an area of inquiry. We review the research that has been conducted on multisensory integration in CI users to date and suggest a number of areas needing further research. The overall pattern of results indicates that many CI recipients show at least some perceptual gain that can be attributable to multisensory integration. The extent of this gain, however, varies based on a number of factors, including age of implantation and specific task being assessed (e.g., stimulus detection, phoneme perception, word recognition). Although both children and adults with CIs obtain audiovisual benefits for phoneme, word, and sentence stimuli, neither group shows demonstrable gain for suprasegmental feature perception. Additionally, only early-implanted children and the highest performing adults obtain audiovisual integration benefits similar to individuals with normal hearing. Increasing age of implantation in children is associated with poorer gains resultant from audiovisual integration, suggesting a sensitive period in development for the brain networks that subserve these integrative functions, as well as length of auditory experience. This finding highlights the need for early detection of and intervention for hearing loss, not only in terms of auditory perception, but also in terms of the behavioral and perceptual benefits of audiovisual processing. Importantly, patterns of auditory, visual, and audiovisual responses suggest that underlying integrative processes may be fundamentally different between CI users and typical-hearing listeners. Future research, particularly in low-level processing tasks such as signal detection will help to further assess mechanisms of multisensory integration for individuals with hearing loss, both with and without CIs.
Affiliation(s)
- Ryan A Stevenson: (1) Department of Psychology, University of Western Ontario, London, Ontario, Canada; (2) Brain and Mind Institute, University of Western Ontario, London, Ontario, Canada; (3) Walter Reed National Military Medical Center, Audiology and Speech Pathology Center, Bethesda, MD, USA; (4) Vanderbilt Brain Institute, Nashville, Tennessee; (5) Vanderbilt Kennedy Center, Nashville, Tennessee; (6) Department of Psychology, Vanderbilt University, Nashville, Tennessee; (7) Department of Psychiatry, Vanderbilt University Medical Center, Nashville, Tennessee; (8) Department of Hearing and Speech Sciences, Vanderbilt University Medical Center, Nashville, Tennessee
13
Irwin J, Avery T, Brancazio L, Turcios J, Ryherd K, Landi N. Electrophysiological Indices of Audiovisual Speech Perception: Beyond the McGurk Effect and Speech in Noise. Multisens Res 2018; 31:39-56. PMID: 31264595; DOI: 10.1163/22134808-00002580.
Abstract
Visual information on a talker's face can influence what a listener hears. Commonly used approaches to study this include mismatched audiovisual stimuli (e.g., McGurk type stimuli) or visual speech in auditory noise. In this paper we discuss potential limitations of these approaches and introduce a novel visual phonemic restoration method. This method always presents the same visual stimulus (e.g., /ba/) dubbed with either a matched auditory stimulus (/ba/) or one in which the consonantal information has been weakened so that it sounds more /a/-like. When this reduced auditory stimulus (or /a/) is dubbed with the visual /ba/, a visual influence will result in effectively 'restoring' the weakened auditory cues so that the stimulus is perceived as a /ba/. An oddball design was used in which participants were asked to detect the /a/ among a stream of more frequently occurring /ba/s while viewing either a speaking face or a face with no visual speech. In addition, the same paradigm was presented for a second contrast in which participants detected /pa/ among /ba/s, a contrast which should be unaltered by the presence of visual speech. Behavioral and some ERP findings reflect the expected phonemic restoration for the /ba/ vs. /a/ contrast; specifically, we observed reduced accuracy and P300 response in the presence of visual speech. Further, we report an unexpected finding of reduced accuracy and P300 response for both speech contrasts in the presence of visual speech, suggesting overall modulation of the auditory signal in the presence of visual speech. Consistent with this, we observed a mismatch negativity (MMN) effect only for the /ba/ vs. /pa/ contrast, which was larger in the absence of visual speech. We discuss the potential utility of this paradigm for listeners who cannot respond actively, such as infants and individuals with developmental disabilities.
Affiliation(s)
- Julia Irwin: Haskins Laboratories, New Haven, CT, USA; Southern Connecticut State University, New Haven, CT, USA
- Trey Avery: Haskins Laboratories, New Haven, CT, USA
- Lawrence Brancazio: Haskins Laboratories, New Haven, CT, USA; Southern Connecticut State University, New Haven, CT, USA
- Jacqueline Turcios: Haskins Laboratories, New Haven, CT, USA; Southern Connecticut State University, New Haven, CT, USA
- Kayleigh Ryherd: Haskins Laboratories, New Haven, CT, USA; University of Connecticut, Storrs, CT, USA
- Nicole Landi: Haskins Laboratories, New Haven, CT, USA; University of Connecticut, Storrs, CT, USA
14
Alsius A, Paré M, Munhall KG. Forty Years After Hearing Lips and Seeing Voices: The McGurk Effect Revisited. Multisens Res 2018; 31:111-144. PMID: 31264597; DOI: 10.1163/22134808-00002565.
Abstract
Since its discovery 40 years ago, the McGurk illusion has usually been cited as a prototypical case of multisensory binding in humans, and has been extensively used in speech perception studies as a proxy measure for audiovisual integration mechanisms. Despite the well-established practice of using the McGurk illusion as a tool for studying the mechanisms underlying audiovisual speech integration, the magnitude of the illusion varies enormously across studies. Furthermore, the processing of McGurk stimuli differs from congruent audiovisual processing at both phenomenological and neural levels. This questions the suitability of this illusion as a tool to quantify the necessary and sufficient conditions under which audiovisual integration occurs in natural conditions. In this paper, we review some of the practical and theoretical issues related to the use of the McGurk illusion as an experimental paradigm. We believe that, without a richer understanding of the mechanisms involved in the processing of the McGurk effect, experimenters should be particularly cautious when generalizing data generated by McGurk stimuli to matching audiovisual speech events.
Affiliation(s)
- Agnès Alsius: Psychology Department, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, K7L 3N6, Canada
- Martin Paré: Psychology Department, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, K7L 3N6, Canada
- Kevin G Munhall: Psychology Department, Queen's University, Humphrey Hall, 62 Arch St., Kingston, Ontario, K7L 3N6, Canada
15
Baart M, Lindborg A, Andersen TS. Electrophysiological evidence for differences between fusion and combination illusions in audiovisual speech perception. Eur J Neurosci 2017; 46:2578-2583. PMID: 28976045; PMCID: PMC5725699; DOI: 10.1111/ejn.13734.
Abstract
Incongruent audiovisual speech stimuli can lead to perceptual illusions such as fusions or combinations. Here, we investigated the underlying audiovisual integration process by measuring ERPs. We observed that visual speech‐induced suppression of P2 amplitude (which is generally taken as a measure of audiovisual integration) for fusions was similar to suppression obtained with fully congruent stimuli, whereas P2 suppression for combinations was larger. We argue that these effects arise because the phonetic incongruency is solved differently for both types of stimuli.
Affiliation(s)
- Martijn Baart: Department of Cognitive Neuropsychology, Tilburg University, Warandelaan 2, 5000 LE Tilburg, The Netherlands; BCBL, Basque Center on Cognition, Brain and Language, Donostia, Spain
- Alma Lindborg: Section for Cognitive Systems, DTU Compute, Technical University of Denmark, Lyngby, Denmark
- Tobias S Andersen: Section for Cognitive Systems, DTU Compute, Technical University of Denmark, Lyngby, Denmark
16
Festa EK, Katz AP, Ott BR, Tremont G, Heindel WC. Dissociable Effects of Aging and Mild Cognitive Impairment on Bottom-Up Audiovisual Integration. J Alzheimers Dis 2017; 59:155-167. DOI: 10.3233/jad-161062.
Affiliation(s)
- Elena K. Festa: Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, RI, USA
- Andrew P. Katz: Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, RI, USA
- Brian R. Ott: Department of Neurology, Alpert Medical School, Brown University, Providence, RI, USA; Department of Neurology, Rhode Island Hospital, Providence, RI, USA
- Geoffrey Tremont: Department of Psychiatry and Human Behavior, Alpert Medical School, Brown University, Providence, RI, USA; Department of Psychiatry, Rhode Island Hospital, Providence, RI, USA
- William C. Heindel: Department of Cognitive, Linguistic, and Psychological Sciences, Brown University, Providence, RI, USA
17
Sight and sound persistently out of synch: stable individual differences in audiovisual synchronisation revealed by implicit measures of lip-voice integration. Sci Rep 2017; 7:46413. PMID: 28429784; PMCID: PMC5399466; DOI: 10.1038/srep46413.
Abstract
Are sight and sound out of synch? Signs that they are have been dismissed for over two centuries as an artefact of attentional and response bias, to which traditional subjective methods are prone. To avoid such biases, we measured performance on objective tasks that depend implicitly on achieving good lip-synch. We measured the McGurk effect (in which incongruent lip-voice pairs evoke illusory phonemes), and also identification of degraded speech, while manipulating audiovisual asynchrony. Peak performance was found at an average auditory lag of ~100 ms, but this varied widely between individuals. Participants’ individual optimal asynchronies showed trait-like stability when the same task was re-tested one week later, but measures based on different tasks did not correlate. This discounts the possible influence of common biasing factors, suggesting instead that our different tasks probe different brain networks, each subject to their own intrinsic auditory and visual processing latencies. Our findings call for renewed interest in the biological causes and cognitive consequences of individual sensory asynchronies, leading potentially to fresh insights into the neural representation of sensory timing. A concrete implication is that speech comprehension might be enhanced, by first measuring each individual’s optimal asynchrony and then applying a compensatory auditory delay.
18
Irwin J, DiBlasi L. Audiovisual speech perception: A new approach and implications for clinical populations. Language and Linguistics Compass 2017; 11:77-91. PMID: 29520300; PMCID: PMC5839512; DOI: 10.1111/lnc3.12237.
Abstract
This selected overview of audiovisual (AV) speech perception examines the influence of visible articulatory information on what is heard. Thought to be a cross-cultural phenomenon that emerges early in typical language development, variables that influence AV speech perception include properties of the visual and the auditory signal, attentional demands, and individual differences. A brief review of the existing neurobiological evidence on how visual information influences heard speech indicates potential loci, timing, and facilitatory effects of AV over auditory only speech. The current literature on AV speech in certain clinical populations (individuals with an autism spectrum disorder, developmental language disorder, or hearing loss) reveals differences in processing that may inform interventions. Finally, a new method of assessing AV speech that does not require obvious cross-category mismatch or auditory noise was presented as a novel approach for investigators.
Affiliation(s)
- Julia Irwin: LEARN Center, Haskins Laboratories Inc., USA
19
A Causal Inference Model Explains Perception of the McGurk Effect and Other Incongruent Audiovisual Speech. PLoS Comput Biol 2017; 13:e1005229. PMID: 28207734; PMCID: PMC5312805; DOI: 10.1371/journal.pcbi.1005229.
Abstract
Audiovisual speech integration combines information from auditory speech (talker’s voice) and visual speech (talker’s mouth movements) to improve perceptual accuracy. However, if the auditory and visual speech emanate from different talkers, integration decreases accuracy. Therefore, a key step in audiovisual speech perception is deciding whether auditory and visual speech have the same source, a process known as causal inference. A well-known illusion, the McGurk Effect, consists of incongruent audiovisual syllables, such as auditory “ba” + visual “ga” (AbaVga), that are integrated to produce a fused percept (“da”). This illusion raises two fundamental questions: first, given the incongruence between the auditory and visual syllables in the McGurk stimulus, why are they integrated; and second, why does the McGurk effect not occur for other, very similar syllables (e.g., AgaVba). We describe a simplified model of causal inference in multisensory speech perception (CIMS) that predicts the perception of arbitrary combinations of auditory and visual speech. We applied this model to behavioral data collected from 60 subjects perceiving both McGurk and non-McGurk incongruent speech stimuli. The CIMS model successfully predicted both the audiovisual integration observed for McGurk stimuli and the lack of integration observed for non-McGurk stimuli. An identical model without causal inference failed to accurately predict perception for either form of incongruent speech. The CIMS model uses causal inference to provide a computational framework for studying how the brain performs one of its most important tasks, integrating auditory and visual speech cues to allow us to communicate with others. During face-to-face conversations, we seamlessly integrate information from the talker’s voice with information from the talker’s face. This multisensory integration increases speech perception accuracy and can be critical for understanding speech in noisy environments with many people talking simultaneously. A major challenge for models of multisensory speech perception is thus deciding which voices and faces should be integrated. Our solution to this problem is based on the idea of causal inference—given a particular pair of auditory and visual syllables, the brain calculates the likelihood they are from a single vs. multiple talkers and uses this likelihood to determine the final speech percept. We compared our model with an alternative model that is identical, except that it always integrated the available cues. Using behavioral speech perception data from a large number of subjects, the model with causal inference better predicted how humans would (or would not) integrate audiovisual speech syllables. Our results suggest a fundamental role for a causal inference type calculation in multisensory speech perception.
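The core causal-inference step can be sketched numerically (a one-dimensional toy with made-up noise values; the published CIMS model uses a richer representational space and fitted parameters):

```python
# Toy sketch of causal inference for one audiovisual token: compare the
# likelihood of the two cues under a common cause vs. separate causes,
# then compute the posterior probability of a single source.
import numpy as np
from scipy.stats import norm

x_aud, x_vis = 0.0, 2.0               # encoded auditory and visual cues (1-D toy space)
sig_a, sig_v, sig_p = 1.0, 1.0, 2.0   # sensory noise and prior width (assumed)
prior_common = 0.5                    # prior probability of a single talker

s = np.linspace(-10.0, 10.0, 4001)    # grid over the latent syllable value
ds = s[1] - s[0]
prior_s = norm.pdf(s, 0.0, sig_p)

# p(x_aud, x_vis | one cause): both cues generated by the same latent value.
like_c1 = np.sum(norm.pdf(x_aud, s, sig_a) * norm.pdf(x_vis, s, sig_v) * prior_s) * ds
# p(x_aud, x_vis | two causes): each cue generated independently.
like_c2 = (np.sum(norm.pdf(x_aud, s, sig_a) * prior_s) * ds *
           np.sum(norm.pdf(x_vis, s, sig_v) * prior_s) * ds)

post_common = like_c1 * prior_common / (
    like_c1 * prior_common + like_c2 * (1.0 - prior_common))
print(f"posterior probability of a common cause: {post_common:.2f}")
```

A high posterior favors integrating the cues (as for McGurk pairs such as AbaVga), while a low posterior favors keeping them separate (as for pairs like AgaVba).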
20
Landry SP, Sharp A, Pagé S, Champoux F. Temporal and spectral audiotactile interactions in musicians. Exp Brain Res 2016; 235:525-532. PMID: 27803971; DOI: 10.1007/s00221-016-4813-3.
Abstract
Previous investigations have revealed that the complex sensory exposure of musical training alters audiovisual interactions. As of yet, there has been little evidence on the effects of musical training on audiotactile interactions at a behavioural level. Here, we tested audiotactile interaction in musicians using the audiotactile illusory flash and the parchment-skin illusion. Significant differences were only found between musicians and non-musicians for the audiotactile illusory flash. Both groups had similar task-relevant unisensory abilities, but unlike non-musicians, the number of auditory stimulations did not have a statistically important influence on the number of perceived tactile stimulations for musicians. Musicians and non-musicians similarly perceived the parchment-skin illusion. Spectral alterations of self-generated palmar sounds similarly altered the perception of wetness and dryness for both groups. These results suggest that musical training does not seem to alter multisensory interactions at large. The specificity of the sensory enhancement suggests that musical training specifically alters processes underlying the interaction of temporal audiotactile stimuli and not the global interaction between these modalities. These results are consistent with previous unisensory and multisensory investigations on sensory abilities related to audiotactile processing in musicians.
Affiliation(s)
- Simon P Landry: Faculté de médecine, École d'orthophonie et d'audiologie, Université de Montréal, C.P. 6128, Succursale Centre-Ville, Montreal, QC, H3C 3J7, Canada
- Andréanne Sharp: Faculté de médecine, École d'orthophonie et d'audiologie, Université de Montréal, C.P. 6128, Succursale Centre-Ville, Montreal, QC, H3C 3J7, Canada
- Sara Pagé: Faculté de médecine, École d'orthophonie et d'audiologie, Université de Montréal, C.P. 6128, Succursale Centre-Ville, Montreal, QC, H3C 3J7, Canada
- François Champoux: Faculté de médecine, École d'orthophonie et d'audiologie, Université de Montréal, C.P. 6128, Succursale Centre-Ville, Montreal, QC, H3C 3J7, Canada
21
Rishiq D, Rao A, Koerner T, Abrams H. Can a Commercially Available Auditory Training Program Improve Audiovisual Speech Performance? Am J Audiol 2016; 25:308-312. PMID: 27768194; DOI: 10.1044/2016_aja-16-0017.
Abstract
PURPOSE The goal of this study was to determine whether hearing aids in combination with computer-based auditory training improve audiovisual (AV) performance compared with the use of hearing aids alone. METHOD Twenty-four participants were randomized into an experimental group (hearing aids plus ReadMyQuips [RMQ] training) and a control group (hearing aids only). The Multimodal Lexical Sentence Test for Adults (Kirk et al., 2012) was used to measure auditory-only (AO) and AV speech perception performance at three signal-to-noise ratios (SNRs). Participants were tested at the time of hearing aid fitting (pretest), after 4 weeks of hearing aid use (posttest I), and again after 4 weeks of RMQ training (posttest II). RESULTS Results did not reveal an effect of training. As expected, interactions were found between (a) modality (AO vs. AV) and SNR and (b) test (pretest vs. posttests) and SNR. CONCLUSION Data do not show a significant effect of RMQ training on AO or AV performance as measured using the Multimodal Lexical Sentence Test for Adults.
Affiliation(s)
- Dania Rishiq: Audiology Section, Otorhinolaryngology Department, Mayo Clinic, Jacksonville, FL
- Aparna Rao: Department of Speech and Hearing Science, Arizona State University, Tempe
- Tess Koerner: Department of Speech-Language-Hearing Sciences, University of Minnesota, Minneapolis
- Harvey Abrams: Starkey Hearing Technologies, Eden Prairie, MN; University of South Florida, Tampa
22
Abstract
In the McGurk effect, incongruent auditory and visual syllables are perceived as a third, completely different syllable. This striking illusion has become a popular assay of multisensory integration for individuals and clinical populations. However, there is enormous variability in how often the illusion is evoked by different stimuli and how often the illusion is perceived by different individuals. Most studies of the McGurk effect have used only one stimulus, making it impossible to separate stimulus and individual differences. We created a probabilistic model to separately estimate stimulus and individual differences in behavioral data from 165 individuals viewing up to 14 different McGurk stimuli. The noisy encoding of disparity (NED) model characterizes stimuli by their audiovisual disparity and characterizes individuals by how noisily they encode the stimulus disparity and by their disparity threshold for perceiving the illusion. The model accurately described perception of the McGurk effect in our sample, suggesting that differences between individuals are stable across stimulus differences. The most important benefit of the NED model is that it provides a method to compare multisensory integration across individuals and groups without the confound of stimulus differences. An added benefit is the ability to predict frequency of the McGurk effect for stimuli never before seen by an individual.
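One way to sketch the model's core computation (the probit link and parameter names are our illustrative reading of the description above, not the authors' code):

```python
# Sketch of the noisy-encoding-of-disparity idea: each stimulus has a disparity D,
# each observer a sensory-noise level sigma and a disparity threshold T, and the
# illusion is reported when the noisily encoded disparity falls below the threshold.
from scipy.stats import norm

def p_fusion(disparity, threshold, sigma):
    """Probability of an illusory fusion response for one stimulus/observer pair."""
    return norm.cdf((threshold - disparity) / sigma)

# Illustrative values: one stimulus, two observers who differ only in sensory noise.
print(p_fusion(disparity=0.5, threshold=1.0, sigma=0.5))   # less noisy observer
print(p_fusion(disparity=0.5, threshold=1.0, sigma=2.0))   # noisier observer
```

Because stimulus disparity and observer parameters are estimated separately, the same observer parameters can be reused to predict responses to stimuli the observer has never seen.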
23
Knowland VCP, Evans S, Snell C, Rosen S. Visual Speech Perception in Children With Language Learning Impairments. Journal of Speech, Language, and Hearing Research 2016; 59:1-14. PMID: 26895558; DOI: 10.1044/2015_jslhr-s-14-0269.
Abstract
PURPOSE The purpose of the study was to assess the ability of children with developmental language learning impairments (LLIs) to use visual speech cues from the talking face. METHOD In this cross-sectional study, 41 typically developing children (mean age: 8 years 0 months, range: 4 years 5 months to 11 years 10 months) and 27 children with diagnosed LLI (mean age: 8 years 10 months, range: 5 years 2 months to 11 years 6 months) completed a silent speechreading task and a speech-in-noise task with and without visual support from the talking face. The speech-in-noise task involved the identification of a target word in a carrier sentence with a single competing speaker as a masker. RESULTS Children in the LLI group showed a deficit in speechreading when compared with their typically developing peers. Beyond the single-word level, this deficit became more apparent in older children. On the speech-in-noise task, a substantial benefit of visual cues was found regardless of age or group membership, although the LLI group showed an overall developmental delay in speech perception. CONCLUSION Although children with LLI were less accurate than their peers on the speechreading and speech-in-noise tasks, both groups were able to make equivalent use of visual cues to boost performance accuracy when listening in noise.
24
Gau R, Noppeney U. How prior expectations shape multisensory perception. Neuroimage 2016; 124:876-886. DOI: 10.1016/j.neuroimage.2015.09.045.
25
Mangin O, Filliat D, ten Bosch L, Oudeyer PY. MCA-NMF: Multimodal Concept Acquisition with Non-Negative Matrix Factorization. PLoS One 2015; 10:e0140732. PMID: 26489021; PMCID: PMC4619362; DOI: 10.1371/journal.pone.0140732.
Abstract
In this paper we introduce MCA-NMF, a computational model of the acquisition of multimodal concepts by an agent grounded in its environment. More precisely our model finds patterns in multimodal sensor input that characterize associations across modalities (speech utterances, images and motion). We propose this computational model as an answer to the question of how some class of concepts can be learnt. In addition, the model provides a way of defining such a class of plausibly learnable concepts. We detail why the multimodal nature of perception is essential to reduce the ambiguity of learnt concepts as well as to communicate about them through speech. We then present a set of experiments that demonstrate the learning of such concepts from real non-symbolic data consisting of speech sounds, images, and motions. Finally we consider structure in perceptual signals and demonstrate that a detailed knowledge of this structure, named compositional understanding can emerge from, instead of being a prerequisite of, global understanding. An open-source implementation of the MCA-NMF learner as well as scripts and associated experimental data to reproduce the experiments are publicly available.
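A minimal sketch of the general recipe (standard NMF on stacked modality features; the feature sizes and scikit-learn usage are illustrative, not the authors' implementation):

```python
# Toy multimodal NMF: each row concatenates non-negative feature vectors from
# three modalities; NMF factorizes the matrix into cross-modal components (H)
# and per-sample activations (W). Random data stand in for real features.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_samples, n_audio, n_image, n_motion = 50, 20, 30, 10
X = np.hstack([
    rng.random((n_samples, n_audio)),    # e.g. speech-sound histograms
    rng.random((n_samples, n_image)),    # e.g. visual bag-of-features
    rng.random((n_samples, n_motion)),   # e.g. motion descriptors
])

model = NMF(n_components=5, init="random", random_state=0, max_iter=500)
W = model.fit_transform(X)   # activation of each learnt "concept" per sample
H = model.components_        # each component spans all three modality blocks
print(W.shape, H.shape)      # (50, 5) (5, 60)
```

Because each learnt component spans all modality blocks, activating it from one modality (e.g. a spoken utterance) retrieves the associated visual and motion patterns, which is the sense in which the components act as multimodal concepts.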
Affiliation(s)
- Olivier Mangin: Flowers Team, Inria, Bordeaux, France; U2IS, ENSTA ParisTech, Université Paris Saclay, Saclay, France
- David Filliat: Flowers Team, Inria, Bordeaux, France; U2IS, ENSTA ParisTech, Université Paris Saclay, Saclay, France
- Louis ten Bosch: Centre for Language and Speech Technology, Radboud University, Nijmegen, Netherlands
- Pierre-Yves Oudeyer: Flowers Team, Inria, Bordeaux, France; U2IS, ENSTA ParisTech, Université Paris Saclay, Saclay, France
26
Andersen TS. The early maximum likelihood estimation model of audiovisual integration in speech perception. The Journal of the Acoustical Society of America 2015; 137:2884-2891. PMID: 25994715; DOI: 10.1121/1.4916691.
Abstract
Speech perception is facilitated by seeing the articulatory mouth movements of the talker. This is due to perceptual audiovisual integration, which also causes the McGurk-MacDonald illusion, and for which a comprehensive computational account is still lacking. Decades of research have largely focused on the fuzzy logical model of perception (FLMP), which provides excellent fits to experimental observations but also has been criticized for being too flexible, post hoc and difficult to interpret. The current study introduces the early maximum likelihood estimation (MLE) model of audiovisual integration to speech perception along with three model variations. In early MLE, integration is based on a continuous internal representation before categorization, which can make the model more parsimonious by imposing constraints that reflect experimental designs. The study also shows that cross-validation can evaluate models of audiovisual integration based on typical data sets taking both goodness-of-fit and model flexibility into account. All models were tested on a published data set previously used for testing the FLMP. Cross-validation favored the early MLE while more conventional error measures favored more complex models. This difference between conventional error measures and cross-validation was found to be indicative of over-fitting in more complex models such as the FLMP.
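As a hedged illustration of what "early" integration means here (the numbers and the one-dimensional representation are ours, not the paper's): the cues are fused on a continuous internal axis with reliability weights, and categorization happens only afterwards.

```python
# Early maximum-likelihood integration sketch: fuse continuous auditory and
# visual evidence by reliability weighting, then categorize the fused value.
x_aud, x_vis = 0.2, 1.4          # continuous internal representations of the cues
var_aud, var_vis = 0.4, 0.9      # modality-specific variances (illustrative)

w_aud = (1 / var_aud) / (1 / var_aud + 1 / var_vis)
fused = w_aud * x_aud + (1 - w_aud) * x_vis
var_fused = 1 / (1 / var_aud + 1 / var_vis)   # fused estimate is more reliable

# Categorization only after fusion: nearest phoneme prototype on the same axis.
prototypes = {"ba": 0.0, "da": 1.0, "ga": 2.0}
percept = min(prototypes, key=lambda k: abs(prototypes[k] - fused))
print(round(fused, 2), round(var_fused, 2), percept)
```

Constraining integration to happen before categorization is what makes the model more parsimonious than a late, category-level combination rule such as the FLMP.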
Affiliation(s)
- Tobias S Andersen: Section for Cognitive Systems, Department of Applied Mathematics and Computer Science, Technical University of Denmark, Richard Petersens Plads, Building 321, DK-2800 Kgs. Lyngby, Denmark
27
Nahorna O, Berthommier F, Schwartz JL. Audio-visual speech scene analysis: characterization of the dynamics of unbinding and rebinding the McGurk effect. The Journal of the Acoustical Society of America 2015; 137:362-377. PMID: 25618066; DOI: 10.1121/1.4904536.
Abstract
While audiovisual interactions in speech perception have long been considered automatic, recent data suggest that this is not the case. In a previous study, Nahorna et al. [(2012). J. Acoust. Soc. Am. 132, 1061-1077] showed that the McGurk effect is reduced by a preceding incoherent audiovisual context. This was interpreted as showing the existence of an audiovisual binding stage controlling the fusion process: incoherence would produce unbinding and decrease the weight of the visual input in fusion. The present paper explores the audiovisual binding system to characterize its dynamics. A first experiment assesses the dynamics of unbinding and shows that it is rapid: an incoherent context less than 0.5 s long (typically one syllable) suffices to produce a maximal reduction in the McGurk effect. A second experiment tests the rebinding process by presenting a short period of either coherent material or silence after the incoherent unbinding context. Coherence produces rebinding, with a recovery of the McGurk effect, while silence produces no rebinding and hence freezes the unbinding process. These experiments are interpreted in the framework of an audiovisual speech scene analysis process that assesses the perceptual organization of an audiovisual speech input before a decision is taken at a higher processing stage.
Collapse
Affiliation(s)
- Olha Nahorna
- GIPSA-Lab, Speech and Cognition Department, UMR 5216, CNRS, Grenoble University, Grenoble, France
| | - Frédéric Berthommier
- GIPSA-Lab, Speech and Cognition Department, UMR 5216, CNRS, Grenoble University, Grenoble, France
| | - Jean-Luc Schwartz
- GIPSA-Lab, Speech and Cognition Department, UMR 5216, CNRS, Grenoble University, Grenoble, France
| |
Collapse
|
28
|
Strelnikov K, Marx M, Lagleyre S, Fraysse B, Deguine O, Barone P. PET-imaging of brain plasticity after cochlear implantation. Hear Res 2014; 322:180-7. [PMID: 25448166 DOI: 10.1016/j.heares.2014.10.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/04/2014] [Revised: 09/05/2014] [Accepted: 10/01/2014] [Indexed: 10/24/2022]
Abstract
In this article, we review the PET neuroimaging literature, which highlights distinctive features of the brain networks involved in speech restoration after cochlear implantation. We consider data on implanted patients during stimulation as well as during resting state, which indicate basic, long-term reorganisation of the brain's functional architecture. On the basis of our analysis of the neuroimaging literature, and considering our own studies, we argue that auditory recovery in deaf patients after cochlear implantation partly relies on visual cues: the brain develops mechanisms of audio-visual integration as a strategy to achieve high levels of speech recognition. This neuroimaging evidence is in line with behavioural findings of better audiovisual integration in these patients. Thus, strong visually and audio-visually based rehabilitation during the first months after cochlear implantation should significantly improve and hasten the functional recovery of speech intelligibility and other auditory functions in these patients. We provide perspectives for further neuroimaging studies in cochlear-implanted patients, which would help understand how the brain reorganises to restore auditory cognitive processing and could suggest novel approaches for rehabilitation. This article is part of a Special Issue entitled "Lasker Award".
Collapse
Affiliation(s)
- K Strelnikov
- Université de Toulouse, Cerveau & Cognition, Université Paul Sabatier, Toulouse, France; CerCo, CNRS UMR 5549, Toulouse, France
| | - M Marx
- Service d'Oto-Rhino-Laryngologie, Hopital Purpan, Toulouse, France
| | - S Lagleyre
- Service d'Oto-Rhino-Laryngologie, Hopital Purpan, Toulouse, France
| | - B Fraysse
- Service d'Oto-Rhino-Laryngologie, Hopital Purpan, Toulouse, France
| | - O Deguine
- Université de Toulouse, Cerveau & Cognition, Université Paul Sabatier, Toulouse, France; CerCo, CNRS UMR 5549, Toulouse, France; Service d'Oto-Rhino-Laryngologie, Hopital Purpan, Toulouse, France
| | - P Barone
- Université de Toulouse, Cerveau & Cognition, Université Paul Sabatier, Toulouse, France; CerCo, CNRS UMR 5549, Toulouse, France.
| |
Collapse
|
29
|
Huyse A, Leybaert J, Berthommier F. Effects of aging on audio-visual speech integration. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2014; 136:1918-1931. [PMID: 25324091 DOI: 10.1121/1.4894685] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
This study investigated the impact of aging on audio-visual speech integration. A syllable identification task was presented in auditory-only, visual-only, and audio-visual congruent and incongruent conditions. Visual cues were either degraded or unmodified. Stimuli were embedded in stationary noise alternating with modulated noise. Fifteen young adults and 15 older adults participated in this study. Results showed that older adults had preserved lipreading abilities when the visual input was clear but not when it was degraded. The impact of aging on audio-visual integration also depended on the quality of the visual cues. In the visual clear condition, the audio-visual gain was similar in both groups and analyses in the framework of the fuzzy-logical model of perception confirmed that older adults did not differ from younger adults in their audio-visual integration abilities. In the visual reduction condition, the audio-visual gain was reduced in the older group, but only when the noise was stationary, suggesting that older participants could compensate for the loss of lipreading abilities by using the auditory information available in the valleys of the noise. The fuzzy-logical model of perception confirmed the significant impact of aging on audio-visual integration by showing an increased weight of audition in the older group.
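For reference, the fuzzy-logical model of perception used in this analysis framework reduces to a product-and-normalize rule over response alternatives. The sketch below uses made-up support values purely for illustration and is not the study's fitted model; in this framework, the increased auditory weight reported for older adults would surface as more extreme auditory support values.

```python
import numpy as np

def flmp(audio_support, visual_support):
    """Fuzzy-logical model of perception: audiovisual support for each
    response alternative is the product of the unimodal supports,
    normalized across alternatives."""
    av = np.asarray(audio_support) * np.asarray(visual_support)
    return av / av.sum()

# Illustrative supports over the alternatives (/ba/, /da/, /ga/):
audio = [0.70, 0.20, 0.10]    # audition favours /ba/
visual = [0.05, 0.45, 0.50]   # lipreading favours /da/ or /ga/
print(flmp(audio, visual))    # product rule: integration can favour /da/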
Collapse
Affiliation(s)
- Aurélie Huyse
- Université Libre de Bruxelles, Avenue F.D. Roosevelt, 50, CP 191, 1050 Brussels, Belgium
| | - Jacqueline Leybaert
- Université Libre de Bruxelles, Avenue F.D. Roosevelt, 50, CP 191, 1050 Brussels, Belgium
| | - Frédéric Berthommier
- Gipsa-Lab, Grenoble, France, Domaine Universitaire BP 46, 38402 Saint Martin d'Hères Cedex, France
| |
Collapse
|
30
|
Schwartz JL, Savariaux C. No, there is no 150 ms lead of visual speech on auditory speech, but a range of audiovisual asynchronies varying from small audio lead to large audio lag. PLoS Comput Biol 2014; 10:e1003743. [PMID: 25079216 PMCID: PMC4117430 DOI: 10.1371/journal.pcbi.1003743] [Citation(s) in RCA: 56] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2013] [Accepted: 06/09/2014] [Indexed: 11/18/2022] Open
Abstract
An increasing number of neuroscience papers capitalize on the assumption, published in this journal, that visual speech is typically 150 ms ahead of auditory speech. However, the estimate of audiovisual asynchrony in the reference paper is valid only in very specific cases: for isolated consonant-vowel syllables or at the beginning of a speech utterance, in what we call "preparatory gestures". When syllables are chained in sequences, as they typically are in most parts of a natural speech utterance, asynchrony should be defined differently, in terms of what we call "comodulatory gestures", which provide auditory and visual events more or less in synchrony. We provide audiovisual data on sequences of plosive-vowel syllables (pa, ta, ka, ba, da, ga, ma, na) showing that audiovisual synchrony is actually rather precise, varying between 20 ms audio lead and 70 ms audio lag. We show how more complex speech material should result in a range typically varying between 40 ms audio lead and 200 ms audio lag, and we discuss how this natural coordination is reflected in the so-called temporal integration window for audiovisual speech perception. Finally, we present a toy model of auditory and audiovisual predictive coding, showing that visual lead is actually not necessary for visual prediction.
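The point that visual lead is not required for visual prediction can be illustrated with a small simulation, independent of the authors' toy model: if audio and video are synchronous noisy readouts of the same slowly varying articulatory state, the synchronous visual signal still improves one-step-ahead prediction of the audio. The AR(1) state, noise levels, and least-squares predictor below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Shared articulatory state evolving smoothly (AR(1)); audio and visual
# streams are synchronous noisy readouts of it -- zero visual lead.
T, phi = 5000, 0.95
state = np.zeros(T)
for t in range(1, T):
    state[t] = phi * state[t - 1] + rng.normal(scale=0.3)
audio = state + rng.normal(scale=0.5, size=T)
visual = state + rng.normal(scale=0.5, size=T)

def one_step_error(features, target):
    """Mean squared error of a least-squares one-step-ahead prediction."""
    X = np.column_stack(features)[:-1]   # information available up to time t
    y = target[1:]                       # target signal at time t + 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ beta) ** 2)

err_a = one_step_error([audio], audio)            # auditory-only prediction
err_av = one_step_error([audio, visual], audio)   # plus synchronous visual
print(err_a, err_av)   # err_av < err_a: vision helps without any visual lead
```

The benefit arises because the synchronous visual sample sharpens the estimate of the current hidden state, which in turn improves the forecast of the next auditory sample.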
Collapse
Affiliation(s)
- Jean-Luc Schwartz
- GIPSA-Lab, Speech and Cognition Department, UMR 5216 CNRS Grenoble-Alps University, Grenoble, France
| | - Christophe Savariaux
- GIPSA-Lab, Speech and Cognition Department, UMR 5216 CNRS Grenoble-Alps University, Grenoble, France
| |
Collapse
|
31
|
Affiliation(s)
- Kaisa Tiippana
- Division of Cognitive Psychology and Neuropsychology, Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| |
Collapse
|
32
|
Bayard C, Colin C, Leybaert J. How is the McGurk effect modulated by Cued Speech in deaf and hearing adults? Front Psychol 2014; 5:416. [PMID: 24904451 PMCID: PMC4032946 DOI: 10.3389/fpsyg.2014.00416] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2014] [Accepted: 04/21/2014] [Indexed: 11/21/2022] Open
Abstract
Speech perception for both hearing and deaf people involves an integrative process between auditory and lip-reading information. In order to disambiguate information from the lips, manual cues from Cued Speech may be added. Cued Speech (CS) is a system of manual aids developed to help deaf people clearly and completely understand speech visually (Cornett, 1967). Within this system, both labial and manual information, as lone input sources, remain ambiguous; perceivers therefore have to combine both types of information in order to obtain one coherent percept. In this study, we examined how audio-visual (AV) integration is affected by the presence of manual cues and on which form of information (auditory, labial or manual) CS receivers primarily rely. To address this issue, we designed an experiment using AV McGurk stimuli (audio /pa/ and lip-reading /ka/) produced with or without manual cues. The manual cue was congruent with either the auditory information, the lip information or the expected fusion. Participants were asked to repeat the perceived syllable aloud. Their responses were then classified into four categories: audio (when the response was /pa/), lip-reading (when the response was /ka/), fusion (when the response was /ta/) and other (when the response was something other than /pa/, /ka/ or /ta/). Data were collected from hearing-impaired individuals who were experts in CS (all of whom had either cochlear implants or binaural hearing aids; N = 8), hearing individuals who were experts in CS (N = 14) and hearing individuals who were completely naïve to CS (N = 15). Results confirmed that, like hearing people, deaf people can merge auditory and lip-reading information into a single unified percept. Without manual cues, McGurk stimuli induced the same percentage of fusion responses in both groups. Results also suggest that manual cues can modify AV integration and that their impact differs between hearing and deaf people.
Collapse
Affiliation(s)
- Clémence Bayard
- Center for Research in Cognition and Neurosciences, Université Libre de Bruxelles, Brussels, Belgium
| | | | | |
Collapse
|
33
|
Sekiyama K, Soshi T, Sakamoto S. Enhanced audiovisual integration with aging in speech perception: a heightened McGurk effect in older adults. Front Psychol 2014; 5:323. [PMID: 24782815 PMCID: PMC3995044 DOI: 10.3389/fpsyg.2014.00323] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2014] [Accepted: 03/28/2014] [Indexed: 11/13/2022] Open
Abstract
Two experiments compared young and older adults to examine whether aging leads to a greater dependence on visual articulatory movements in auditory-visual speech perception. The experiments examined accuracy and response time in syllable identification for auditory-visual (AV) congruent and incongruent stimuli; there were also auditory-only (AO) and visual-only (VO) presentation modes. Data were analyzed only for participants with normal hearing. The older adults were more strongly influenced by visual speech than the younger adults at acoustically identical signal-to-noise ratios (SNRs) of auditory speech (Experiment 1). This was also confirmed when the SNRs of auditory speech were calibrated to equate AO accuracy between the two age groups (Experiment 2). There were no aging-related differences in VO lipreading accuracy. Combined with the response time data, this enhanced visual influence in older adults is likely to be associated with an aging-related delay in auditory processing.
Collapse
Affiliation(s)
- Kaoru Sekiyama
- Division of Cognitive Psychology, Faculty of Letters, Kumamoto University, Kumamoto, Japan; Division of Cognitive Psychology, School of Systems Information Science, Future University Hakodate, Japan
| | - Takahiro Soshi
- Division of Cognitive Psychology, Faculty of Letters, Kumamoto University, Kumamoto, Japan
| | | |
Collapse
|
34
|
Kushnerenko E, Tomalski P, Ballieux H, Ribeiro H, Potton A, Axelsson EL, Murphy E, Moore DG. Brain responses to audiovisual speech mismatch in infants are associated with individual differences in looking behaviour. Eur J Neurosci 2013; 38:3363-9. [PMID: 23889202 DOI: 10.1111/ejn.12317] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2013] [Revised: 05/22/2013] [Accepted: 06/19/2013] [Indexed: 11/30/2022]
Abstract
Research on audiovisual speech integration has reported high levels of individual variability, especially among young infants. In the present study we tested the hypothesis that this variability results from individual differences in the maturation of audiovisual speech processing during infancy. A developmental shift in selective attention to audiovisual speech has been demonstrated between 6 and 9 months, with an increase in the time spent looking at articulating mouths as compared to eyes (Lewkowicz & Hansen-Tift (2012) Proc. Natl. Acad. Sci. USA, 109, 1431-1436; Tomalski et al. (2012) Eur. J. Dev. Psychol., 1-14). Here we tested whether these maturational changes in looking behaviour are associated with differences in brain responses to audiovisual speech across this age range. We measured high-density event-related potentials (ERPs) in response to videos of audiovisually matching and mismatched syllables /ba/ and /ga/, and subsequently examined visual scanning of the same stimuli with eye-tracking. There were no clear age-specific changes in ERPs, but the amplitude of the audiovisual mismatch response (AVMMR) to the combination of visual /ba/ and auditory /ga/ was strongly negatively associated with looking time to the mouth in the same condition. These results have significant implications for our understanding of individual differences in neural signatures of audiovisual speech processing in infants, suggesting that these signatures are not strictly related to chronological age but are instead associated with the maturation of looking behaviour, and develop at individual rates in the second half of the first year of life.
Collapse
Affiliation(s)
- Elena Kushnerenko
- Institute for Research in Child Development, School of Psychology, University of East London, Water Lane, London, E15 4LZ, UK
| | | | | | | | | | | | | | | |
Collapse
|
35
|
Landry SP, Guillemot JP, Champoux F. Temporary deafness can impair multisensory integration: a study of cochlear-implant users. Psychol Sci 2013; 24:1260-8. [PMID: 23722977 DOI: 10.1177/0956797612471142] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/18/2022] Open
Abstract
Previous investigations suggest that temporary deafness can have a dramatic impact on audiovisual speech processing. The aim of this study was to test whether temporary deafness disturbs other multisensory processes in adults. A nonspeech task involving an audiotactile illusion was administered to a group of normally hearing individuals and a group of individuals who had been temporarily auditorily deprived. Members of this latter group had their auditory detection thresholds restored to normal levels through the use of a cochlear implant. Control conditions revealed that auditory and tactile discrimination capabilities were identical in the two groups. However, whereas normally hearing individuals integrated auditory and tactile information, so that they experienced the audiotactile illusion, individuals who had been temporarily deprived did not. Given the basic nature of the task, failure to integrate multisensory information could not be explained by the use of the cochlear implant. Thus, the results suggest that normally anticipated audiotactile interactions are disturbed following temporary deafness.
Collapse
Affiliation(s)
- Simon P Landry
- Centre de Recherche en Neuropsychologie et Cognition (CERNEC), Montréal, Québec, Canada
| | | | | |
Collapse
|
36
|
Nahorna O, Berthommier F, Schwartz JL. Binding and unbinding the auditory and visual streams in the McGurk effect. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2012; 132:1061-1077. [PMID: 22894226 DOI: 10.1121/1.4728187] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2023]
Abstract
Subjects presented with coherent auditory and visual streams generally fuse them into a single percept. This results in enhanced intelligibility in noise, or in visual modification of the auditory percept in the McGurk effect. It is classically considered that processing is done independently in the auditory and visual systems before interaction occurs at a certain representational stage, resulting in an integrated percept. However, some behavioral and neurophysiological data suggest the existence of a two-stage process: a first stage would bind together the appropriate pieces of audio and video information before fusion per se takes place in a second stage. If so, it should be possible to design experiments leading to unbinding. It is shown here that if a given McGurk stimulus is preceded by an incoherent audiovisual context, the McGurk effect is largely reduced. Various kinds of incoherent contexts (acoustic syllables dubbed onto video sentences, or phonetic or temporal modifications of the acoustic content of a regular sequence of audiovisual syllables) can significantly reduce the McGurk effect even when they are short (less than 4 s). The data are interpreted in the framework of a two-stage "binding and fusion" model for audiovisual speech perception.
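A schematic sketch of the two-stage idea is given below: a binding stage maps recent audiovisual coherence to a visual weight, and a fusion stage combines the unimodal phoneme evidence using that weight. The logistic mapping, the gain value, and the evidence vectors are illustrative assumptions, not the published model.

```python
import numpy as np

def binding_weight(coherence_history, gain=4.0):
    """Stage 1 (binding): map recent audiovisual coherence (0..1) to a
    visual weight; an incoherent context drives the weight toward 0."""
    c = float(np.mean(coherence_history))
    return 1.0 / (1.0 + np.exp(-gain * (c - 0.5)))   # logistic squashing

def fuse(p_audio, p_visual, w_visual):
    """Stage 2 (fusion): weighted log-linear combination of the unimodal
    phoneme evidence, renormalized over the response alternatives."""
    p = np.asarray(p_audio) * np.asarray(p_visual) ** w_visual
    return p / p.sum()

# McGurk-style evidence over (/ba/, /da/, /ga/): audio says /ba/, lips say /ga/.
audio, visual = [0.60, 0.35, 0.05], [0.05, 0.40, 0.55]
print(fuse(audio, visual, binding_weight([0.9, 0.9, 0.9])))  # coherent context: the fusion /da/ dominates
print(fuse(audio, visual, binding_weight([0.1, 0.2, 0.1])))  # incoherent context: unbinding, auditory /ba/ wins
```

In this sketch the incoherent context never alters the unimodal evidence itself; it only lowers the weight given to vision, which is the sense in which unbinding is said to control fusion.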
Collapse
Affiliation(s)
- Olha Nahorna
- GIPSA-Lab, Speech and Cognition Department, UMR 5216, CNRS, Grenoble University, France
| | | | | |
Collapse
|
37
|
Landry S, Bacon BA, Leybaert J, Gagné JP, Champoux F. Audiovisual segregation in cochlear implant users. PLoS One 2012; 7:e33113. [PMID: 22427963 PMCID: PMC3299746 DOI: 10.1371/journal.pone.0033113] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2011] [Accepted: 02/10/2012] [Indexed: 11/18/2022] Open
Abstract
It has traditionally been assumed that cochlear implant users de facto perform atypically in audiovisual tasks. However, a recent study that combined an auditory task with visual distractors suggests that only those cochlear implant users who are not proficient at recognizing speech sounds might show abnormal audiovisual interactions. The present study aims to reinforce this notion by investigating the audiovisual segregation abilities of cochlear implant users in a visual task with auditory distractors. Speechreading was assessed in two groups of cochlear implant users (proficient and non-proficient at sound recognition), as well as in normal controls. A visual speech recognition task (i.e., speechreading) was administered either in silence or in combination with three types of auditory distractors: i) noise, ii) reversed speech, and iii) non-altered speech. Cochlear implant users proficient at speech recognition performed like normal controls in all conditions, whereas non-proficient users showed significantly different audiovisual segregation patterns in both speech conditions. These results confirm that normal-like audiovisual segregation is possible in highly skilled cochlear implant users and, consequently, that proficient and non-proficient CI users cannot be lumped into a single group. This important feature must be taken into account in further studies of audiovisual interactions in cochlear implant users.
Collapse
Affiliation(s)
- Simon Landry
- Centre de Recherche en Neuropsychologie et Cognition (CERNEC), Montréal, Québec, Canada
| | - Benoit A. Bacon
- Centre de Recherche en Neuropsychologie et Cognition (CERNEC), Montréal, Québec, Canada
- Department of Psychology, Bishop's University, Sherbrooke, Québec, Canada
| | | | - Jean-Pierre Gagné
- Centre de recherche interdisciplinaire en réadaptation du Montréal métropolitain, Institut Raymond-Dewar, École d'orthophonie et d'audiologie, Université de Montréal, Montréal, Québec, Canada
| | - François Champoux
- Centre de Recherche en Neuropsychologie et Cognition (CERNEC), Montréal, Québec, Canada
- Centre de recherche interdisciplinaire en réadaptation du Montréal métropolitain, Institut Raymond-Dewar, École d'orthophonie et d'audiologie, Université de Montréal, Montréal, Québec, Canada
| |
Collapse
|
38
|
Neural correlates of interindividual differences in children's audiovisual speech perception. J Neurosci 2011; 31:13963-71. [PMID: 21957257 DOI: 10.1523/jneurosci.2605-11.2011] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/15/2023] Open
Abstract
Children use information from both the auditory and visual modalities to aid in understanding speech. A dramatic illustration of this multisensory integration is the McGurk effect, an illusion in which an auditory syllable is perceived differently when it is paired with an incongruent mouth movement. However, there are significant interindividual differences in McGurk perception: some children never perceive the illusion, while others always do. Because converging evidence suggests that the posterior superior temporal sulcus (STS) is a critical site for multisensory integration, we hypothesized that activity within the STS would predict susceptibility to the McGurk effect. To test this idea, we used BOLD fMRI in 17 children aged 6-12 years to measure brain responses to the following three audiovisual stimulus categories: McGurk incongruent, non-McGurk incongruent, and congruent syllables. Two separate analysis approaches, one using independent functional localizers and another using whole-brain voxel-based regression, showed differences in the left STS between perceivers and nonperceivers. The STS of McGurk perceivers responded significantly more than that of nonperceivers to McGurk syllables, but not to other stimuli, and perceivers' hemodynamic responses in the STS were significantly prolonged. In addition to the STS, weaker differences between perceivers and nonperceivers were observed in the fusiform face area and extrastriate visual cortex. These results suggest that the STS is an important source of interindividual variability in children's audiovisual speech perception.
Collapse
|
39
|
Cue integration in categorical tasks: insights from audio-visual speech perception. PLoS One 2011; 6:e19812. [PMID: 21637344 PMCID: PMC3102664 DOI: 10.1371/journal.pone.0019812] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2011] [Accepted: 04/04/2011] [Indexed: 11/19/2022] Open
Abstract
Previous cue integration studies have examined continuous perceptual dimensions (e.g., size) and have shown that human cue integration is well described by a normative model in which cues are weighted in proportion to their sensory reliability, as estimated from single-cue performance. However, this normative model may not be applicable to categorical perceptual dimensions (e.g., phonemes). In tasks defined over categorical perceptual dimensions, optimal cue weights should depend not only on the sensory variance affecting the perception of each cue but also on the environmental variance inherent in each task-relevant category. Here, we present a computational and experimental investigation of cue integration in a categorical audio-visual (articulatory) speech perception task. Our results show that human performance during audio-visual phonemic labeling is qualitatively consistent with the behavior of a Bayes-optimal observer. Specifically, we show that participants in our task are sensitive, on a trial-by-trial basis, to the sensory uncertainty associated with the auditory and visual cues during phonemic categorization. In addition, we show that while sensory uncertainty is a significant factor in determining cue weights, it is not the only one, and participants' performance is consistent with an optimal model in which environmental, within-category variability also plays a role in determining cue weights. Furthermore, we show that in our task the sensory variability affecting the visual modality during cue combination is not well estimated from single-cue performance, but can be estimated from multi-cue performance. The findings and computational principles described here represent a principled first step towards characterizing the mechanisms underlying human cue integration in categorical tasks.
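The normative weighting rule described here can be made concrete in a few lines: each cue's effective uncertainty is its sensory variance plus the within-category (environmental) variance along that cue's dimension, and the weights are proportional to the resulting reliabilities. The function name and the numbers below are illustrative assumptions, not the paper's fitted model.

```python
def categorical_cue_weights(sens_var_a, sens_var_v, cat_var_a, cat_var_v):
    """Normative weights for a categorical task: each cue's effective
    uncertainty is its sensory variance plus the within-category
    (environmental) variance along that cue's dimension."""
    rel_a = 1.0 / (sens_var_a + cat_var_a)
    rel_v = 1.0 / (sens_var_v + cat_var_v)
    return rel_a / (rel_a + rel_v), rel_v / (rel_a + rel_v)

# With equal sensory noise, a continuous-dimension model would weight the
# cues equally; adding a larger within-category spread on the auditory
# dimension shifts the optimal weight toward vision.
print(categorical_cue_weights(1.0, 1.0, 0.0, 0.0))   # (0.5, 0.5)
print(categorical_cue_weights(1.0, 1.0, 2.0, 0.5))   # (~0.33, ~0.67): vision weighted more
```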
Collapse
|