1. Cappelloni MS, Mateo VS, Maddox RK. Performance in an Audiovisual Selective Attention Task Using Speech-Like Stimuli Depends on the Talker Identities, But Not Temporal Coherence. Trends Hear 2023;27:23312165231207235. PMID: 37847849; PMCID: PMC10586009; DOI: 10.1177/23312165231207235.
Abstract
Audiovisual integration of speech can benefit the listener by not only improving comprehension of what a talker is saying but also helping a listener select a particular talker's voice from a mixture of sounds. Binding, an early integration of auditory and visual streams that helps an observer allocate attention to a combined audiovisual object, is likely involved in processing audiovisual speech. Although temporal coherence of stimulus features across sensory modalities has been implicated as an important cue for non-speech stimuli (Maddox et al., 2015), the specific cues that drive binding in speech are not fully understood due to the challenges of studying binding in natural stimuli. Here we used speech-like artificial stimuli that allowed us to isolate three potential contributors to binding: temporal coherence (are the face and the voice changing synchronously?), articulatory correspondence (do visual faces represent the correct phones?), and talker congruence (do the face and voice come from the same person?). In a trio of experiments, we examined the relative contributions of each of these cues. Normal-hearing listeners performed a dual task in which they were instructed to respond to events in a target auditory stream while ignoring events in a distractor auditory stream (auditory discrimination) and detecting flashes in a visual stream (visual detection). We found that viewing the face of a talker who matched the attended voice (i.e., talker congruence) offered a performance benefit. We found no effect of temporal coherence on performance in this task, prompting an important recontextualization of previous findings.
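The auditory dual task yields hits (responses to target-stream events) and false alarms (responses to distractor-stream events), which are conventionally summarized as a d' sensitivity score. A minimal sketch of that computation, assuming counts of events and responses; the abstract does not give the paper's exact scoring, so the function name and the log-linear correction are illustrative assumptions:

```python
from scipy.stats import norm

def dprime(n_hits, n_targets, n_false_alarms, n_distractors):
    """Sensitivity index d' from hit and false-alarm counts.

    Applies the log-linear correction (add 0.5 to each count and 1 to
    each total) so perfect or empty cells stay finite.
    """
    hit_rate = (n_hits + 0.5) / (n_targets + 1)
    fa_rate = (n_false_alarms + 0.5) / (n_distractors + 1)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical listener: 42 of 50 target events detected,
# 8 responses to 50 distractor events
print(f"d' = {dprime(42, 50, 8, 50):.2f}")  # d' = 1.93
```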
Affiliation(s)
- Madeline S. Cappelloni
- Biomedical Engineering, University of Rochester, Rochester, NY, USA
- Center for Visual Science, University of Rochester, Rochester, NY, USA
- Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Vincent S. Mateo
- Audio and Music Engineering, University of Rochester, Rochester, NY, USA
- Ross K. Maddox
- Biomedical Engineering, University of Rochester, Rochester, NY, USA
- Center for Visual Science, University of Rochester, Rochester, NY, USA
- Del Monte Institute for Neuroscience, University of Rochester, Rochester, NY, USA
- Neuroscience, University of Rochester, Rochester, NY, USA
2. Heins N, Pomp J, Kluger DS, Vinbrüx S, Trempler I, Kohler A, Kornysheva K, Zentgraf K, Raab M, Schubotz RI. Surmising synchrony of sound and sight: Factors explaining variance of audiovisual integration in hurdling, tap dancing and drumming. PLoS One 2021;16:e0253130. PMID: 34293800; PMCID: PMC8298114; DOI: 10.1371/journal.pone.0253130.
Abstract
Auditory and visual percepts are integrated even when they are not perfectly temporally aligned, especially when the visual signal precedes the auditory signal. This window of temporal integration for asynchronous audiovisual stimuli is relatively well examined for speech, while other natural action-induced sounds have been widely neglected. Here, we studied the detection of audiovisual asynchrony in three different whole-body actions with natural action-induced sounds: hurdling, tap dancing, and drumming. In Study 1, we examined whether audiovisual asynchrony detection, assessed by a simultaneity judgment task, differs as a function of sound production intentionality. Based on previous findings, we expected auditory and visual signals to be integrated over a wider temporal window for actions creating sounds intentionally (tap dancing) than for actions creating sounds incidentally (hurdling). While percentages of perceived synchrony differed in the expected way, we identified two further factors, high event density and low rhythmicity, that also induced higher synchrony ratings. We therefore systematically varied event density and rhythmicity in Study 2, this time using drumming stimuli to gain full control over these variables, with the same simultaneity judgment task. Results suggest that high event density biases observers to integrate rather than segregate auditory and visual signals, even at relatively large asynchronies. Rhythmicity had a similar, albeit weaker, effect when event density was low. Our findings demonstrate that shorter asynchronies and visual-first asynchronies lead to higher synchrony ratings of whole-body action, pointing to clear parallels with audiovisual integration in speech perception. Overconfidence in the naturally expected synchrony of sound and sight was stronger for intentional (vs. incidental) sound production and for movements with high (vs. low) rhythmicity, presumably because both encourage predictive processes. In contrast, high event density appears to increase synchrony judgments simply because it makes the detection of audiovisual asynchrony more difficult. More studies using real-life audiovisual stimuli with varying event densities and rhythmicities are needed to fully uncover the general mechanisms of audiovisual integration.
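Simultaneity-judgment data of this kind are commonly summarized by fitting a curve to the proportion of "synchronous" responses across stimulus onset asynchronies: the fitted peak location indexes any visual-lead bias and the width indexes the temporal integration window. A minimal sketch with placeholder data (the values and the scaled-Gaussian model are illustrative assumptions, not the study's results):

```python
import numpy as np
from scipy.optimize import curve_fit

def sj_curve(soa, amp, pss, sigma):
    """Proportion of 'synchronous' responses modeled as a scaled Gaussian."""
    return amp * np.exp(-((soa - pss) ** 2) / (2.0 * sigma**2))

# Placeholder data: SOA in ms (negative = video leads audio)
soa = np.array([-400.0, -300, -200, -100, 0, 100, 200, 300, 400])
p_sync = np.array([0.15, 0.40, 0.75, 0.95, 0.97, 0.80, 0.45, 0.20, 0.08])

(amp, pss, sigma), _ = curve_fit(sj_curve, soa, p_sync, p0=[1.0, -20.0, 150.0])
# pss < 0 reflects the usual tolerance for video-leading asynchronies;
# sigma is one conventional index of the window of temporal integration
print(f"peak = {amp:.2f}, PSS = {pss:.0f} ms, width (SD) = {sigma:.0f} ms")
```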
Affiliation(s)
- Nina Heins
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Jennifer Pomp
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Daniel S. Kluger
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Institute for Biomagnetism and Biosignal Analysis, University Hospital Muenster, Muenster, Germany
- Stefan Vinbrüx
- Institute of Sport and Exercise Sciences, Human Performance and Training, University of Muenster, Muenster, Germany
- Ima Trempler
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Axel Kohler
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
- Katja Kornysheva
- School of Psychology and Bangor Neuroimaging Unit, Bangor University, Wales, United Kingdom
- Karen Zentgraf
- Department of Movement Science and Training in Sports, Institute of Sport Sciences, Goethe University Frankfurt, Frankfurt, Germany
- Markus Raab
- Institute of Psychology, German Sport University Cologne, Cologne, Germany
- School of Applied Sciences, London South Bank University, London, United Kingdom
- Ricarda I. Schubotz
- Department of Psychology, University of Muenster, Muenster, Germany
- Otto Creutzfeldt Center for Cognitive and Behavioral Neuroscience, University of Muenster, Muenster, Germany
3. Li S, Ding Q, Yuan Y, Yue Z. Audio-Visual Causality and Stimulus Reliability Affect Audio-Visual Synchrony Perception. Front Psychol 2021;12:629996. PMID: 33679553; PMCID: PMC7930005; DOI: 10.3389/fpsyg.2021.629996.
Abstract
People can discriminate the synchrony of audio-visual scenes. However, the sensitivity of audio-visual synchrony perception can be affected by many factors. Using a simultaneity judgment task, the present study investigated whether the synchrony perception of complex audio-visual stimuli is affected by audio-visual causality and stimulus reliability. In Experiment 1, audio-visual causality increased sensitivity to audio-visual onset asynchrony (AVOA) for both action stimuli and speech stimuli. Moreover, participants were more tolerant of AVOA for speech stimuli than for action stimuli in the high-causality condition, whereas no significant difference between the two kinds of stimuli was found in the low-causality condition. In Experiment 2, the speech stimuli were manipulated to have either high or low stimulus reliability. The results revealed a significant interaction between audio-visual causality and stimulus reliability. Under the low-causality condition, the percentage of “synchronous” responses for audio-visual intact stimuli was significantly higher than for visual-intact/auditory-blurred stimuli and audio-visual blurred stimuli. In contrast, no significant difference across levels of stimulus reliability was observed under the high-causality condition. Our study supports a synergistic effect of top-down and bottom-up processing in audio-visual synchrony perception.
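The reported causality-by-reliability interaction on binary "synchronous" responses is the kind of effect typically tested with a trial-level logistic model. A minimal sketch using synthetic placeholder data; the column names, condition labels, and the statsmodels-based analysis are assumptions, not the study's actual pipeline:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600

# Synthetic trial table; real data would hold one row per SJ trial
df = pd.DataFrame({
    "causality": rng.choice(["high", "low"], n),
    "reliability": rng.choice(["intact", "blurred", "av_blurred"], n),
    "sync": rng.integers(0, 2, n),  # placeholder 0/1 'synchronous' responses
})

# Logistic model with the causality x reliability interaction
fit = smf.logit("sync ~ C(causality) * C(reliability)", data=df).fit(disp=0)
print(fit.summary())
```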
Affiliation(s)
- Shao Li
- Department of Psychology, Sun Yat-sen University, Guangzhou, China
- Qi Ding
- Department of Psychology, Sun Yat-sen University, Guangzhou, China
- Yichen Yuan
- Department of Psychology, Sun Yat-sen University, Guangzhou, China
- Zhenzhu Yue
- Department of Psychology, Sun Yat-sen University, Guangzhou, China
4. Spence C. Review of: Soto-Faraco S, Kvasova D, Biau E, Ikumi N, Ruzzoli M, Morís-Fernández L, Torralba M. Multisensory Interactions in the Real World. Perception 2019. DOI: 10.1177/0301006619896976.
5. Multisensory feature integration in (and out) of the focus of spatial attention. Atten Percept Psychophys 2019;82:363-376. DOI: 10.3758/s13414-019-01813-5.
6. Kuznetsova N, Verkhodanova V. Phonetic Realisation and Phonemic Categorisation of the Final Reduced Corner Vowels in the Finnic Languages of Ingria. Phonetica 2019;76:201-233. PMID: 31112960; DOI: 10.1159/000494927.
Abstract
Individual variability in sound change was explored at three stages of final vowel reduction and loss in the endangered Finnic varieties of Ingria (subdialects of Ingrian, Votic, and Ingrian Finnish). The correlation between the realisation of reduced vowels and their phonemic categorisation by speakers was studied. The results showed that when a vowel was still pronounced in more than 70% of cases, its incipient loss was not yet perceived (apart from certain frequent elements), whereas once loss exceeded 70%, the vowel was no longer perceived. A 50/50 split between vowel and loss in production correlated with the same split in categorisation. At the beginning of a sound change, production is therefore more innovative, but after reanalysis, categorisation becomes more innovative and leads the change. The vowel a was the most innovative in terms of loss, u and o were the most conservative, and i was intermediate, while consonantal palatalisation was more salient than labialisation. These differences are grounded in acoustics, articulation, and perception.
Affiliation(s)
- Natalia Kuznetsova
- Institute for Linguistic Studies, Department of the Languages of Russia, Russian Academy of Sciences, St. Petersburg, Russian Federation
- Dipartimento di Lingue e Letterature straniere e Culture moderne, Università degli Studi di Torino, Turin, Italy
7. Smith E, Zhang S, Bennetto L. Temporal synchrony and audiovisual integration of speech and object stimuli in autism. Res Autism Spectr Disord 2017;39:11-19. PMID: 30220908; PMCID: PMC6135104; DOI: 10.1016/j.rasd.2017.04.001.
Abstract
BACKGROUND: Individuals with Autism Spectrum Disorders (ASD) have been shown to have multisensory integration deficits, which may lead to problems perceiving complex, multisensory environments. For example, understanding audiovisual speech requires integration of visual information from the lips and face with auditory information from the voice, and audiovisual speech integration deficits can lead to impaired understanding and comprehension. While there is strong evidence for an audiovisual speech integration impairment in ASD, it is unclear whether this impairment is due to low-level perceptual processes that affect all types of audiovisual integration or if it is specific to speech processing.
METHOD: Here, we measure audiovisual integration of basic speech (i.e., consonant-vowel utterances) and object stimuli (i.e., a bouncing ball) in adolescents with ASD and well-matched controls. We calculate a temporal window of integration (TWI) using each individual's ability to identify which of two videos (one temporally aligned and one misaligned) matches auditory stimuli. The TWI measures tolerance for temporal asynchrony between the auditory and visual streams, and is an important feature of audiovisual perception.
RESULTS: While controls showed similar tolerance of asynchrony for the simple speech and object stimuli, individuals with ASD did not. Specifically, individuals with ASD showed less tolerance of asynchrony for speech stimuli compared to object stimuli. In individuals with ASD, decreased tolerance for asynchrony in speech stimuli was associated with higher ratings of autism symptom severity.
CONCLUSIONS: These results suggest that audiovisual perception in ASD may vary for speech and object stimuli beyond what can be accounted for by stimulus complexity.
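One common way to derive such a TWI, assumed here for illustration, is to find where matching accuracy crosses a criterion (e.g., 75% in a two-alternative task) on each side of zero asynchrony and take the span between the crossings; the numbers below are placeholders, not the study's data:

```python
import numpy as np

# Placeholder psychometric data: asynchrony in ms (negative = visual leads)
# and accuracy at picking the temporally aligned video (chance = 0.5)
asyn = np.array([-400.0, -300, -200, -100, 0, 100, 200, 300, 400])
acc = np.array([0.95, 0.85, 0.70, 0.55, 0.50, 0.58, 0.74, 0.88, 0.96])

criterion = 0.75  # below this, the misaligned video still passes as matching

# Interpolate the criterion crossing on each flank (np.interp needs
# ascending x values, so the visual-lead flank is reversed)
left_edge = np.interp(criterion, acc[:5][::-1], asyn[:5][::-1])
right_edge = np.interp(criterion, acc[4:], asyn[4:])

print(f"TWI spans {left_edge:.0f} to {right_edge:.0f} ms "
      f"(width {right_edge - left_edge:.0f} ms)")
```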
Affiliation(s)
- Elizabeth Smith
- Department of Clinical and Social Sciences in Psychology, University of Rochester, Rochester, NY, USA
- Shouling Zhang
- Department of Clinical and Social Sciences in Psychology, University of Rochester, Rochester, NY, USA
- Loisa Bennetto
- Department of Clinical and Social Sciences in Psychology, University of Rochester, Rochester, NY, USA
8. Shahin AJ, Shen S, Kerlin JR. Tolerance for audiovisual asynchrony is enhanced by the spectrotemporal fidelity of the speaker's mouth movements and speech. Lang Cogn Neurosci 2017;32:1102-1118. PMID: 28966930; PMCID: PMC5617130; DOI: 10.1080/23273798.2017.1283428.
Abstract
We examined the relationship between tolerance for audiovisual onset asynchrony (AVOA) and the spectrotemporal fidelity of spoken words and the speaker's mouth movements. In two experiments that varied only in the temporal order of the sensory modalities, with visual speech leading (Experiment 1) or lagging (Experiment 2) the acoustic speech, participants watched intact and blurred videos of a speaker uttering trisyllabic words and nonwords that were noise-vocoded with 4, 8, 16, and 32 channels. They judged whether the speaker's mouth movements and the speech sounds were in sync or out of sync. Individuals perceived synchrony (tolerated AVOA) on more trials when the acoustic speech was more speech-like (8 channels and higher vs. 4 channels), and when the visual speech was intact rather than blurred (Experiment 1 only). These findings suggest that enhanced spectrotemporal fidelity of the audiovisual (AV) signal prompts the brain to widen the window of integration, promoting the fusion of temporally distant AV percepts.
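Noise vocoding, the manipulation used here to vary spectrotemporal fidelity, replaces the fine structure within each frequency band with noise while preserving the band's temporal envelope. A minimal sketch; the filter order, band edges, and Hilbert-envelope method are illustrative choices, not the paper's exact parameters:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(x, fs, n_channels=8, fmin=100.0, fmax=6000.0):
    """Replace each band's fine structure with envelope-modulated noise."""
    edges = np.geomspace(fmin, fmax, n_channels + 1)  # log-spaced band edges
    noise = np.random.default_rng(0).standard_normal(len(x))
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)            # analysis band of the speech
        envelope = np.abs(hilbert(band))      # temporal envelope of the band
        out += envelope * sosfiltfilt(sos, noise)  # same-band noise carrier
    return out / (np.abs(out).max() + 1e-12)  # normalize to +/- 1

# Example: an 8-channel vocoded version of one second of a 440 Hz tone
fs = 16000
t = np.arange(fs) / fs
vocoded = noise_vocode(np.sin(2 * np.pi * 440 * t), fs, n_channels=8)
```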
Affiliation(s)
- Antoine J Shahin
- Center for Mind and Brain, University of California, Davis, CA 95618, USA
- Stanley Shen
- Center for Mind and Brain, University of California, Davis, CA 95618, USA
- Jess R Kerlin
- Center for Mind and Brain, University of California, Davis, CA 95618, USA
9. Venezia JH, Thurman SM, Matchin W, George SE, Hickok G. Timing in audiovisual speech perception: A mini review and new psychophysical data. Atten Percept Psychophys 2016;78:583-601. PMID: 26669309; PMCID: PMC4744562; DOI: 10.3758/s13414-015-1026-y.
Abstract
Recent influential models of audiovisual speech perception suggest that visual speech aids perception by generating predictions about the identity of upcoming speech sounds. These models place stock in the assumption that visual speech leads auditory speech in time. However, it is unclear whether and to what extent temporally leading visual speech information contributes to perception. Previous studies exploring audiovisual speech timing have relied on psychophysical procedures that require artificial manipulation of cross-modal alignment or stimulus duration. We introduce a classification procedure that tracks perceptually relevant visual speech information in time without requiring such manipulations. Participants were shown videos of a McGurk syllable (auditory /apa/ + visual /aka/ = perceptual /ata/) and asked to perform phoneme identification (/apa/ yes-no). The mouth region of the visual stimulus was overlaid with a dynamic transparency mask that obscured visual speech in some frames but not others, randomly across trials. Variability in participants' responses (~35% identification of /apa/, compared to ~5% in the absence of the masker) served as the basis for classification analysis. The outcome was a high-resolution spatiotemporal map of perceptually relevant visual features. We produced these maps for McGurk stimuli at different audiovisual temporal offsets (natural timing, 50-ms visual lead, and 100-ms visual lead). Briefly, temporally leading (~130 ms) visual information did influence auditory perception. Moreover, several visual features influenced the perception of a single speech sound, with the relative influence of each feature depending on both its temporal relation to the auditory signal and its informational content.
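The classification analysis pairs each trial's random transparency mask with the listener's response; averaging the masks by response and taking the difference yields a map of which frames and pixels drove the fused percept. A minimal reverse-correlation sketch, where the array shapes and response coding are assumptions:

```python
import numpy as np

def classification_image(masks, heard_apa):
    """Reverse correlation of stimulus masks against responses.

    masks:     (n_trials, n_frames, h, w) transparency values, 1 = the
               mouth region was visible in that frame
    heard_apa: (n_trials,) bool, True when auditory /apa/ was reported
               (i.e., the McGurk fusion to /ata/ failed)
    """
    masks = np.asarray(masks, dtype=float)
    heard_apa = np.asarray(heard_apa, dtype=bool)
    # Frames/pixels visible more often on fused (/ata/) trials than on
    # /apa/ trials carry the perceptually relevant visual information
    return masks[~heard_apa].mean(axis=0) - masks[heard_apa].mean(axis=0)

# Example with random placeholder data: 500 trials, 30 frames, 8x8 grid
rng = np.random.default_rng(0)
ci = classification_image(rng.random((500, 30, 8, 8)), rng.random(500) < 0.35)
print(ci.shape)  # (30, 8, 8): one spatial relevance map per video frame
```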
Affiliation(s)
- Jonathan H Venezia
- Department of Cognitive Sciences, University of California, Irvine, CA 92697, USA
- Steven M Thurman
- Department of Psychology, University of California, Los Angeles, CA, USA
- William Matchin
- Department of Linguistics, University of Maryland, Baltimore, MD, USA
- Sahara E George
- Department of Anatomy and Neurobiology, University of California, Irvine, CA, USA
- Gregory Hickok
- Department of Cognitive Sciences, University of California, Irvine, CA 92697, USA
10. Eg R, Behne DM. Perceived synchrony for realistic and dynamic audiovisual events. Front Psychol 2015;6:736. PMID: 26082738; PMCID: PMC4451240; DOI: 10.3389/fpsyg.2015.00736.
Abstract
In well-controlled laboratory experiments, researchers have found that humans can perceive delays between auditory and visual signals as short as 20 ms. Conversely, other experiments have shown that humans can tolerate audiovisual asynchrony that exceeds 200 ms. This seeming contradiction in human temporal sensitivity can be attributed to a number of factors such as experimental approaches and precedence of the asynchronous signals, along with the nature, duration, location, complexity and repetitiveness of the audiovisual stimuli, and even individual differences. In order to better understand how temporal integration of audiovisual events occurs in the real world, we need to close the gap between the experimental setting and the complex setting of everyday life. With this work, we aimed to contribute one brick to the bridge that will close this gap. We compared perceived synchrony for long-running and eventful audiovisual sequences to shorter sequences that contain a single audiovisual event, for three types of content: action, music, and speech. The resulting windows of temporal integration showed that participants were better at detecting asynchrony for the longer stimuli, possibly because the long-running sequences contain multiple corresponding events that offer audiovisual timing cues. Moreover, the points of subjective simultaneity differ between content types, suggesting that the nature of a visual scene could influence the temporal perception of events. An expected outcome from this type of experiment was the rich variation among participants' distributions and the derived points of subjective simultaneity. Hence, the designs of similar experiments call for more participants than traditional psychophysical studies. Heeding this caution, we conclude that existing theories on multisensory perception are ready to be tested on more natural and representative stimuli.
Affiliation(s)
- Dawn M Behne
- Department of Psychology, Norwegian University of Science and Technology, Trondheim, Norway
11. Shi Z, Müller HJ. Multisensory perception and action: development, decision-making, and neural mechanisms. Front Integr Neurosci 2013;7:81. PMID: 24319414; PMCID: PMC3836185; DOI: 10.3389/fnint.2013.00081.
Affiliation(s)
- Zhuanghua Shi
- Department of Psychology, Experimental Psychology, Ludwig-Maximilians-Universität München, Munich, Germany
12. Ten Oever S, Sack AT, Wheat KL, Bien N, van Atteveldt N. Audio-visual onset differences are used to determine syllable identity for ambiguous audio-visual stimulus pairs. Front Psychol 2013;4:331. PMID: 23805110; PMCID: PMC3693065; DOI: 10.3389/fpsyg.2013.00331.
Abstract
Content and temporal cues have been shown to interact during audio-visual (AV) speech identification. Typically, the most reliable unimodal cue is used more strongly to identify specific speech features; however, visual cues are only used if the AV stimuli are presented within a certain temporal window of integration (TWI). This suggests that temporal cues denote whether unimodal stimuli belong together, that is, whether they should be integrated. It is not known whether temporal cues also provide information about the identity of a syllable. Since spoken syllables have naturally varying AV onset asynchronies, we hypothesized that for suboptimal AV cues presented within the TWI, information about the natural AV onset differences can aid speech identification. To test this, we presented low-intensity auditory syllables concurrently with visual speech signals, and varied the stimulus onset asynchrony (SOA) of the AV pair, while participants were instructed to identify the auditory syllables. We found that specific speech features (e.g., voicing) were identified by relying primarily on one modality (e.g., auditory). Additionally, we observed a wide window in which visual information influenced auditory perception, which seemed even wider for congruent stimulus pairs. Finally, we found a specific response pattern across the SOA range for syllables that were not reliably identified by the unimodal cues, which we explain as the result of the use of natural onset differences between AV speech signals. This indicates that temporal cues not only provide information about the temporal integration of AV stimuli but additionally convey information about the identity of AV pairs. These results provide a detailed behavioral basis for further neuroimaging and stimulation studies to unravel the neurofunctional mechanisms of the audio-visual temporal interplay within speech perception.
Affiliation(s)
- Sanne Ten Oever
- Faculty of Psychology and Neuroscience, Maastricht University, Maastricht, Netherlands