1. Krason A, Zhang Y, Man H, Vigliocco G. Mouth and facial informativeness norms for 2276 English words. Behav Res Methods 2023. PMID: 37604959. DOI: 10.3758/s13428-023-02216-z.
Abstract
Mouth and facial movements are part and parcel of face-to-face communication. The primary way of assessing their role in speech perception has been by manipulating their presence (e.g., by blurring the area of a speaker's lips) or by looking at how informative different mouth patterns are for the corresponding phonemes (or visemes; e.g., /b/ is visually more salient than /g/). However, moving beyond informativeness of single phonemes is challenging due to coarticulation and language variations (to name just a few factors). Here, we present mouth and facial informativeness (MaFI) for words, i.e., how visually informative words are based on their corresponding mouth and facial movements. MaFI was quantified for 2276 English words, varying in length, frequency, and age of acquisition, using phonological distance between a word and participants' speechreading guesses. The results showed that the MaFI norms capture well the dynamic nature of mouth and facial movements per word, with words containing phonemes with roundness and frontness features, as well as visemes characterized by lower lip tuck, lip rounding, and lip closure, being visually more informative. We also showed that the more of these features there are in a word, the more informative it is based on mouth and facial movements. Finally, we demonstrated that the MaFI norms generalize across different varieties of English. The norms are freely accessible via the Open Science Framework (https://osf.io/mna8j/) and can benefit any language researcher using audiovisual stimuli (e.g., to control for the effect of speech-linked mouth and facial movements).
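The scoring principle described above, comparing each speechreading guess to the target word by phonological distance, can be sketched as follows. This is only an illustration under assumed conventions (ARPAbet-like phoneme symbols, phoneme-level Levenshtein distance, length normalization, and averaging over guesses); the function names and example guesses are hypothetical, not the authors' released pipeline.

```python
# Illustrative sketch of a phonological-distance informativeness score.
# Assumptions: phonemes are given as ARPAbet-like symbols, distance is a
# plain Levenshtein edit distance over phoneme lists, and the score is
# 1 - normalized distance, averaged over all speechreading guesses.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

def informativeness(target, guesses):
    """Mean similarity (1 - normalized distance) of guesses to the target."""
    scores = []
    for guess in guesses:
        dist = edit_distance(target, guess)
        scores.append(1.0 - dist / max(len(target), len(guess)))
    return sum(scores) / len(scores)

# Hypothetical example: target "big" with three speechreading guesses.
target = ["B", "IH", "G"]
guesses = [["B", "IH", "G"], ["P", "IH", "K"], ["M", "IH", "T"]]
print(round(informativeness(target, guesses), 2))  # 0.56
```

Under this convention, a word whose guesses match it phoneme for phoneme scores near 1, while a word that is consistently misread scores near 0.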
Affiliation(s)
- Anna Krason: Department of Experimental Psychology, University College London, 26 Bedford Way, London, WC1H 0AP, UK
- Ye Zhang: Department of Experimental Psychology, University College London, 26 Bedford Way, London, WC1H 0AP, UK
- Hillarie Man: Department of Experimental Psychology, University College London, 26 Bedford Way, London, WC1H 0AP, UK
- Gabriella Vigliocco: Department of Experimental Psychology, University College London, 26 Bedford Way, London, WC1H 0AP, UK
2. Intelligibility of speech produced by sighted and blind adults. PLoS One 2022; 17:e0272127. PMID: 36107945. PMCID: PMC9477328. DOI: 10.1371/journal.pone.0272127.
Abstract
Purpose: It is well known that speech uses both the auditory and visual modalities to convey information. In cases of congenital sensory deprivation, the feedback language learners have access to for mapping visible and invisible orofacial articulation is impoverished. Although the effects of blindness on the movements of the lips, jaw, and tongue have been documented in francophone adults, not much is known about their consequences for speech intelligibility. The objective of this study is to investigate the effects of congenital visual deprivation on vowel intelligibility in adult speakers of Canadian French. Method: Twenty adult listeners performed two perceptual identification tasks in which vowels produced by congenitally blind adults and sighted adults were used as stimuli. The vowels were presented in the auditory, visual, and audiovisual modalities (experiment 1) and at different signal-to-noise ratios in the audiovisual modality (experiment 2). Correct identification scores were calculated. Sequential information analyses were also conducted to assess the amount of information transmitted to the listeners along the three vowel features of height, place of articulation, and rounding. Results: The results showed that, although blind speakers did not differ from their sighted peers in the auditory modality, they had lower scores in the audiovisual and visual modalities. Some vowels produced by blind speakers were also less robust in noise than those produced by sighted speakers. Conclusion: Together, the results suggest that adult blind speakers have learned to adapt to their sensory loss so that they can successfully achieve intelligible vowel targets in non-noisy conditions but that they produce less intelligible speech in noisy conditions. Thus, the trade-off between visible (lips) and invisible (tongue) articulatory cues observed between vowels produced by blind and sighted speakers is not equivalent in terms of perceptual efficiency.
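For the sequential information analyses mentioned above, the information transmitted along a feature such as rounding is conventionally computed as the mutual information between stimulus and response categories in a confusion matrix (in the tradition of Miller and Nicely). A minimal sketch, assuming a toy two-category confusion matrix rather than this study's actual data:

```python
# Transmitted information (mutual information, in bits) from a confusion matrix.
# Rows = stimulus feature value, columns = response feature value.
import math

def transmitted_information(confusions):
    total = sum(sum(row) for row in confusions)
    p_stim = [sum(row) / total for row in confusions]
    p_resp = [sum(confusions[i][j] for i in range(len(confusions))) / total
              for j in range(len(confusions[0]))]
    t = 0.0
    for i, row in enumerate(confusions):
        for j, count in enumerate(row):
            if count == 0:
                continue
            p_joint = count / total
            t += p_joint * math.log2(p_joint / (p_stim[i] * p_resp[j]))
    return t

# Hypothetical rounding-feature confusions: rounded vs. unrounded vowels.
conf = [[45, 5],
        [10, 40]]
print(round(transmitted_information(conf), 2))  # 0.4 of a possible 1.0 bit
```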
3. Trudeau-Fisette P, Arnaud L, Ménard L. Visual Influence on Auditory Perception of Vowels by French-Speaking Children and Adults. Front Psychol 2022; 13:740271. PMID: 35282186. PMCID: PMC8913716. DOI: 10.3389/fpsyg.2022.740271.
Abstract
Audiovisual interaction in speech perception is well defined in adults. Despite the large body of evidence suggesting that children are also sensitive to visual input, very few empirical studies have been conducted. To further investigate whether visual inputs influence auditory perception of phonemes in preschoolers in the same way as in adults, we conducted an audiovisual identification test. The auditory stimuli (/e/-/ø/ continuum) were presented either in an auditory condition only or simultaneously with a visual presentation of the articulation of the vowel /e/ or /ø/. The results suggest that, although all participants experienced visual influence on auditory perception, substantial individual differences exist in the 5- to 6-year-old group. While additional work is required to confirm this hypothesis, we suggest that auditory and visual systems are developing at that age and that multisensory phonological categorization of the rounding contrast took place only in children whose sensory systems and sensorimotor representations were mature.
Affiliation(s)
- Paméla Trudeau-Fisette (corresponding author): Laboratoire de Phonétique, Université du Québec à Montréal, Montreal, QC, Canada; Centre for Research on Brain, Language and Music, Montreal, QC, Canada
- Laureline Arnaud: Centre for Research on Brain, Language and Music, Montreal, QC, Canada; Integrated Program in Neuroscience, McGill University, Montreal, QC, Canada
- Lucie Ménard: Laboratoire de Phonétique, Université du Québec à Montréal, Montreal, QC, Canada; Centre for Research on Brain, Language and Music, Montreal, QC, Canada
4. Cho S, Jongman A, Wang Y, Sereno JA. Multi-modal cross-linguistic perception of fricatives in clear speech. J Acoust Soc Am 2020; 147:2609. PMID: 32359282. DOI: 10.1121/10.0001140.
Abstract
Research shows that acoustic modifications in clearly enunciated fricative consonants (relative to the plain, conversational productions) facilitate auditory fricative perception, particularly for auditorily salient sibilant fricatives and for native perception. However, clear-speech effects on visual fricative perception have received less attention. A comparison of auditory and visual (facial) clear-fricative perception is particularly interesting since sibilant fricatives in English are more auditorily salient while non-sibilants are more visually salient. This study thus examines clear-speech effects on multi-modal perception of English sibilant and non-sibilant fricatives. Native English perceivers and non-native (Mandarin, Korean) perceivers with different fricative inventories in their native languages (L1s) identified clear and conversational fricative-vowel syllables in audio-only, visual-only, and audio-visual (AV) modes. The results reveal an overall positive clear-speech effect when visual information is involved. Considering the factor of AV saliency, clear speech benefits sibilants more in the auditory domain and non-sibilants more in the visual domain. With respect to language background, non-native (Mandarin and Korean) perceivers benefit from visual as well as auditory information, even for fricatives non-existent in their respective L1s, but the patterns of clear-speech gains are affected by the relative AV weighting and "nativeness" of the fricatives. These findings are discussed in terms of how saliency-enhancing and category-distinctive cues of speech sounds are adopted in AV perception to improve intelligibility.
Affiliation(s)
- Sylvia Cho: Language and Brain Lab, Department of Linguistics, Simon Fraser University, 8888 University Drive, Burnaby, British Columbia, V5A 1S6, Canada
- Allard Jongman: The University of Kansas Phonetics and Psycholinguistics Lab, Department of Linguistics, The University of Kansas, Lawrence, Kansas 66044, USA
- Yue Wang: Language and Brain Lab, Department of Linguistics, Simon Fraser University, 8888 University Drive, Burnaby, British Columbia, V5A 1S6, Canada
- Joan A Sereno: The University of Kansas Phonetics and Psycholinguistics Lab, Department of Linguistics, The University of Kansas, Lawrence, Kansas 66044, USA
5. Speakers are able to categorize vowels based on tongue somatosensation. Proc Natl Acad Sci U S A 2020; 117:6255-6263. PMID: 32123070. DOI: 10.1073/pnas.1911142117.
Abstract
Auditory speech perception enables listeners to access phonological categories from speech sounds. During speech production and speech motor learning, speakers experience matched auditory and somatosensory input. Accordingly, access to phonetic units might also be provided by somatosensory information. The present study assessed whether humans can identify vowels using somatosensory feedback, without auditory feedback. A tongue-positioning task was used in which participants were required to achieve different tongue postures within the /e, ε, a/ articulatory range, in a procedure that was entirely non-speech-like, involving distorted visual feedback of tongue shape. Tongue postures were measured using electromagnetic articulography. At the end of each tongue-positioning trial, subjects were required to whisper the corresponding vocal tract configuration with masked auditory feedback and to identify the vowel associated with the reached tongue posture. Masked auditory feedback ensured that vowel categorization was based on somatosensory feedback rather than auditory feedback. A separate group of subjects was required to auditorily classify the whispered sounds. In addition, we modeled the link between vowel categories and tongue postures in normal speech production with a Bayesian classifier based on the tongue postures recorded from the same speakers for several repetitions of the /e, ε, a/ vowels during a separate speech production task. Overall, our results indicate that vowel categorization is possible with somatosensory feedback alone, with an accuracy that is similar to the accuracy of the auditory perception of whispered sounds, and in congruence with normal speech articulation, as accounted for by the Bayesian classifier.
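The Bayesian classifier mentioned above can be written in roughly the following form; the Gaussian likelihood and the choice of tongue-posture features are stated here for illustration and may differ from the authors' exact implementation.

```latex
% Sketch (assumed form) of a Gaussian Bayesian classifier over tongue postures.
\[
  \hat{v}(\mathbf{x})
  \;=\; \arg\max_{v \in \{/e/,\; /\varepsilon/,\; /a/\}} P(v \mid \mathbf{x})
  \;=\; \arg\max_{v} \; \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_v,\, \boldsymbol{\Sigma}_v\right) P(v)
\]
```

Here x is the vector of measured tongue-posture features, and the per-vowel means and covariances are estimated from the same speakers' repetitions of /e, ε, a/ in the separate production task.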
6. Trudeau-Fisette P, Ito T, Ménard L. Auditory and Somatosensory Interaction in Speech Perception in Children and Adults. Front Hum Neurosci 2019; 13:344. PMID: 31636554. PMCID: PMC6788346. DOI: 10.3389/fnhum.2019.00344.
Abstract
Multisensory integration (MSI) allows us to link sensory cues from multiple sources and plays a crucial role in speech development. However, it is not clear whether humans have an innate ability to integrate sensory information in speech or whether repeated sensory input while the brain is maturing leads to efficient integration. We investigated the integration of auditory and somatosensory information in speech processing in a bimodal perceptual task in 15 young adults (age 19–30) and 14 children (age 5–6). The participants were asked to identify whether the perceived target was the sound /e/ or /ø/. Half of the stimuli were presented under a unimodal condition with only auditory input. The other stimuli were presented under a bimodal condition with both auditory input and somatosensory input consisting of facial skin stretches provided by a robotic device, which mimics the articulation of the vowel /e/. The results indicate that the effect of somatosensory information on sound categorization was larger in adults than in children. This suggests that integration of auditory and somatosensory information evolves throughout the course of development.
Affiliation(s)
- Paméla Trudeau-Fisette: Laboratoire de Phonétique, Université du Québec à Montréal, Montreal, QC, Canada; Centre for Research on Brain, Language and Music, Montreal, QC, Canada
- Takayuki Ito: GIPSA-Lab, CNRS, Grenoble INP, Université Grenoble Alpes, Grenoble, France; Haskins Laboratories, Yale University, New Haven, CT, United States
- Lucie Ménard: Laboratoire de Phonétique, Université du Québec à Montréal, Montreal, QC, Canada; Centre for Research on Brain, Language and Music, Montreal, QC, Canada
7. Feng J, Liu C, Li M, Chen H, Sun P, Xie R, Zhao Y, Wu X. Effect of blindness on mismatch responses to Mandarin lexical tones, consonants, and vowels. Hear Res 2018; 371:87-97. PMID: 30529909. DOI: 10.1016/j.heares.2018.11.010.
Abstract
According to the hypothesis of auditory compensation, blind listeners are more sensitive to auditory input than sighted listeners. In the current study, we employed the passive oddball paradigm to investigate the effect of blindness on listeners' mismatch responses to Mandarin lexical tones, consonants, and vowels. Twelve blind and twelve sighted age- and verbal IQ-matched adults with normal hearing participated in this study. Our results indicated that blind listeners possibly had more efficient pre-attentive processing (shorter MMN peak latency) of lexical tones in the tone-dominant hemisphere (i.e., the right hemisphere), and that they exhibited greater sensitivity (larger MMN amplitude) when processing phonemes (consonants and/or vowels) at the pre-attentive stage in both hemispheres compared with sighted individuals. However, we observed longer MMN and P3a peak latencies during phoneme processing in the blind versus control participants, indicating that blind listeners may be slower in terms of pre-attentive processing and involuntary attention switching when processing phonemes. This could be due to a lack of visual experience in the production and perception of phonemes. In sum, the current study revealed a two-sided influence of blindness on Mandarin speech perception.
Affiliation(s)
- Jie Feng, Hongjun Chen, Peng Sun, Ruibo Xie, Ying Zhao, Xinchun Wu: Beijing Key Laboratory of Applied Experimental Psychology, National Demonstration Center for Experimental Psychology Education, Faculty of Psychology, Beijing Normal University, Beijing, 100875, China
- Chang Liu, Mingshuang Li: Department of Communication Sciences and Disorders, The University of Texas at Austin, 1 University Station A1100, Austin, TX, 78712, USA
8. Garnier M, Ménard L, Alexandre B. Hyper-articulation in Lombard speech: An active communicative strategy to enhance visible speech cues? J Acoust Soc Am 2018; 144:1059. PMID: 30180713. DOI: 10.1121/1.5051321.
Abstract
This study investigates the hypothesis that speakers make active use of the visual modality in production to improve their speech intelligibility in noisy conditions. Six native speakers of Canadian French produced speech in quiet conditions and in 85 dB of babble noise, in three situations: interacting face-to-face with the experimenter (AV), using the auditory modality only (AO), or reading aloud (NI, no interaction). The audio signal was recorded along with the three-dimensional movements of the lips and tongue, using electromagnetic articulography. All the speakers reacted similarly to the presence vs absence of communicative interaction, showing significant speech modifications with noise exposure in both interactive and non-interactive conditions, not only for parameters directly related to voice intensity or for lip movements (very visible) but also for tongue movements (less visible); greater adaptation was observed in interactive conditions, though. However, speakers reacted differently to the availability or unavailability of visual information: only four speakers enhanced their visible articulatory movements more in the AV condition. These results support the idea that the Lombard effect is at least partly a listener-oriented adaptation. However, to clarify their speech in noisy conditions, only some speakers appear to make active use of the visual modality.
Affiliation(s)
- Maëva Garnier: Centre National de la Recherche Scientifique, Laboratoire Grenoble Images Parole Signal Automatique, 11 rue des Mathématiques, Grenoble Campus, Boîte Postale 46, F-38402 Saint Martin d'Hères Cedex, France
- Lucie Ménard: Département de Linguistique, Laboratoire de Phonétique, Center for Research on Brain, Language, and Music, Université du Québec à Montréal, 320, Ste-Catherine Est, Montréal, Quebec H2X 1L7, Canada
- Boris Alexandre: Centre National de la Recherche Scientifique, Laboratoire Grenoble Images Parole Signal Automatique, 11 rue des Mathématiques, Grenoble Campus, Boîte Postale 46, F-38402 Saint Martin d'Hères Cedex, France
9. Hennequin A, Rochet-Capellan A, Gerber S, Dohen M. Does the Visual Channel Improve the Perception of Consonants Produced by Speakers of French With Down Syndrome? J Speech Lang Hear Res 2018; 61:957-972. PMID: 29635399. DOI: 10.1044/2017_jslhr-h-17-0112.
Abstract
Purpose: This work evaluates whether seeing the speaker's face could improve the speech intelligibility of adults with Down syndrome (DS). This is not straightforward because DS induces a number of anatomical and motor anomalies affecting the orofacial zone. Method: A speech-in-noise perception test was used to evaluate the intelligibility of 16 consonants (Cs) produced in a vowel-consonant-vowel context (Vo = /a/) by 4 speakers with DS and 4 control speakers. Forty-eight naïve participants were asked to identify the stimuli in 3 modalities: auditory (A), visual (V), and auditory-visual (AV). The probability of correct responses was analyzed, as well as AV gain, confusions, and transmitted information as a function of modality and phonetic features. Results: The probability of correct response follows the trend AV > A > V, with smaller values for the DS than the control speakers in A and AV but not in V. This trend depended on the C: the V information particularly improved the transmission of place of articulation and to a lesser extent of manner, whereas voicing remained specifically altered in DS. Conclusions: The results suggest that the V information is intact in the speech of people with DS and improves the perception of some phonetic features in Cs in a similar way as for control speakers. This result has implications for further studies, rehabilitation protocols, and specific training of caregivers. Supplemental material: https://doi.org/10.23641/asha.6002267
Affiliation(s)
- Silvain Gerber: Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
- Marion Dohen: Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France
10. Sánchez-García C, Kandel S, Savariaux C, Soto-Faraco S. The Time Course of Audio-Visual Phoneme Identification: a High Temporal Resolution Study. Multisens Res 2018; 31:57-78. DOI: 10.1163/22134808-00002560.
Abstract
Speech unfolds in time and, as a consequence, its perception requires temporal integration. Yet, studies addressing audio-visual speech processing have often overlooked this temporal aspect. Here, we address the temporal course of audio-visual speech processing in a phoneme identification task using a Gating paradigm. We created disyllabic Spanish word-like utterances (e.g., /pafa/, /paθa/, …) from high-speed camera recordings. The stimuli differed only in the middle consonant (/f/, /θ/, /s/, /r/, /g/), which varied in visual and auditory saliency. As in classical Gating tasks, the utterances were presented in fragments of increasing length (gates), here in 10 ms steps, for identification and confidence ratings. We measured correct identification as a function of time (at each gate) for each critical consonant in audio, visual and audio-visual conditions, and computed the Identification Point and Recognition Point scores. The results revealed that audio-visual identification is a time-varying process that depends on the relative strength of each modality (i.e., saliency). In some cases, audio-visual identification followed the pattern of one dominant modality (either A or V), when that modality was very salient. In other cases, both modalities contributed to identification, hence resulting in audio-visual advantage or interference with respect to unimodal conditions. Both unimodal dominance and audio-visual interaction patterns may arise within the course of identification of the same utterance, at different times. The outcome of this study suggests that audio-visual speech integration models should take into account the time-varying nature of visual and auditory saliency.
Affiliation(s)
- Carolina Sánchez-García: Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain
- Sonia Kandel: Université Grenoble Alpes, GIPSA-lab (CNRS UMR 5216), Grenoble, France
- Salvador Soto-Faraco: Departament de Tecnologies de la Informació i les Comunicacions, Universitat Pompeu Fabra, Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
11. Havy M, Zesiger P. Learning Spoken Words via the Ears and Eyes: Evidence from 30-Month-Old Children. Front Psychol 2017; 8:2122. PMID: 29276493. PMCID: PMC5727082. DOI: 10.3389/fpsyg.2017.02122.
Abstract
From the very first moments of their lives, infants are able to link specific movements of the visual articulators to auditory speech signals. However, recent evidence indicates that infants focus primarily on auditory speech signals when learning new words. Here, we ask whether 30-month-old children are able to learn new words based solely on visible speech information, and whether information from both auditory and visual modalities is available after learning in only one modality. To test this, children were taught new lexical mappings. One group of children experienced the words in the auditory modality (i.e., acoustic form of the word with no accompanying face). Another group experienced the words in the visual modality (seeing a silent talking face). Lexical recognition was tested in either the learning modality or in the other modality. Results revealed successful word learning in either modality. Results further showed cross-modal recognition following an auditory-only, but not a visual-only, experience of the words. Together, these findings suggest that visible speech becomes increasingly informative for the purpose of lexical learning, but that an auditory-only experience evokes a cross-modal representation of the words.
Affiliation(s)
- Mélanie Havy: Faculty of Psychology and Educational Sciences, University of Geneva, Geneva, Switzerland
12. Trudeau-Fisette P, Tiede M, Ménard L. Compensations to auditory feedback perturbations in congenitally blind and sighted speakers: Acoustic and articulatory data. PLoS One 2017; 12:e0180300. PMID: 28678819. PMCID: PMC5498050. DOI: 10.1371/journal.pone.0180300.
Abstract
This study investigated the effects of visual deprivation on the relationship between speech perception and production by examining compensatory responses to real-time perturbations in auditory feedback. Specifically, acoustic and articulatory data were recorded while sighted and congenitally blind French speakers produced several repetitions of the vowel /ø/. At the acoustic level, blind speakers produced larger compensatory responses to altered vowels than their sighted peers. At the articulatory level, blind speakers also produced larger displacements of the upper lip, the tongue tip, and the tongue dorsum in compensatory responses. These findings suggest that blind speakers tolerate less discrepancy between actual and expected auditory feedback than sighted speakers. The study also suggests that sighted speakers have acquired more constrained somatosensory goals through the influence of visual cues perceived in face-to-face conversation, leading them to tolerate less discrepancy between expected and altered articulatory positions compared to blind speakers and thus resulting in smaller observed compensatory responses.
Affiliation(s)
- Paméla Trudeau-Fisette: Laboratoire de Phonétique, Université du Québec à Montréal, Center for Research on Brain, Language, and Music, Montreal, Quebec, Canada
- Mark Tiede: Haskins Laboratories, New Haven, Connecticut, United States
- Lucie Ménard: Laboratoire de Phonétique, Université du Québec à Montréal, Center for Research on Brain, Language, and Music, Montreal, Quebec, Canada
13. Havy M, Foroud A, Fais L, Werker JF. The Role of Auditory and Visual Speech in Word Learning at 18 Months and in Adulthood. Child Dev 2017; 88:2043-2059. PMID: 28124795. DOI: 10.1111/cdev.12715.
Abstract
Visual information influences speech perception in both infants and adults. It is still unknown whether lexical representations are multisensory. To address this question, we exposed 18-month-old infants (n = 32) and adults (n = 32) to new word-object pairings: Participants either heard the acoustic form of the words or saw the talking face in silence. They were then tested on recognition in the same or the other modality. Both 18-month-old infants and adults learned the lexical mappings when the words were presented auditorily and recognized the mapping at test when the word was presented in either modality, but only adults learned new words in a visual-only presentation. These results suggest developmental changes in the sensory format of lexical representations.
Affiliation(s)
- Mélanie Havy: University of British Columbia; Université de Genève
14. Jaekl P, Pesquita A, Alsius A, Munhall K, Soto-Faraco S. The contribution of dynamic visual cues to audiovisual speech perception. Neuropsychologia 2015; 75:402-10. PMID: 26100561. DOI: 10.1016/j.neuropsychologia.2015.06.025.
Abstract
Seeing a speaker's facial gestures can significantly improve speech comprehension, especially in noisy environments. However, the nature of the visual information from the speaker's facial movements that is relevant for this enhancement is still unclear. Like auditory speech signals, visual speech signals unfold over time and contain both dynamic configural information and luminance-defined local motion cues; two information sources that are thought to engage anatomically and functionally separate visual systems. Whereas some past studies have highlighted the importance of local, luminance-defined motion cues in audiovisual speech perception, the contribution of dynamic configural information signalling changes in form over time has not yet been assessed. We therefore attempted to single out the contribution of dynamic configural information to audiovisual speech processing. To this aim, we measured word identification performance in noise using unimodal auditory stimuli, and with audiovisual stimuli. In the audiovisual condition, speaking faces were presented as point-light displays achieved via motion capture of the original talker. Point-light displays could be isoluminant, to minimise the contribution of effective luminance-defined local motion information, or with added luminance contrast, allowing the combined effect of dynamic configural cues and local motion cues. Audiovisual enhancement was found in both the isoluminant and contrast-based luminance conditions compared to an auditory-only condition, demonstrating, for the first time, the specific contribution of dynamic configural cues to audiovisual speech improvement. These findings imply that globally processed changes in a speaker's facial shape contribute significantly towards the perception of articulatory gestures and the analysis of audiovisual speech.
Affiliation(s)
- Philip Jaekl: Center for Visual Science and Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY, USA
- Ana Pesquita: UBC Vision Lab, Department of Psychology, University of British Columbia, Vancouver, BC, Canada
- Agnes Alsius: Department of Psychology, Queen's University, Kingston, ON, Canada
- Kevin Munhall: Department of Psychology, Queen's University, Kingston, ON, Canada
- Salvador Soto-Faraco: Centre for Brain and Cognition, Department of Information Technology and Communications, Universitat Pompeu Fabra, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Spain
15. Wallace MT, Stevenson RA. The construct of the multisensory temporal binding window and its dysregulation in developmental disabilities. Neuropsychologia 2014; 64:105-23. PMID: 25128432. PMCID: PMC4326640. DOI: 10.1016/j.neuropsychologia.2014.08.005.
Abstract
Behavior, perception and cognition are strongly shaped by the synthesis of information across the different sensory modalities. Such multisensory integration often results in performance and perceptual benefits that reflect the additional information conferred by having cues from multiple senses providing redundant or complementary information. The spatial and temporal relationships of these cues provide powerful statistical information about how these cues should be integrated or "bound" in order to create a unified perceptual representation. Much recent work has examined the temporal factors that are integral in multisensory processing, with much of it focused on the construct of the multisensory temporal binding window: the epoch of time within which stimuli from different modalities are likely to be integrated and perceptually bound. Emerging evidence suggests that this temporal window is altered in a series of neurodevelopmental disorders, including autism, dyslexia and schizophrenia. In addition to their role in sensory processing, these deficits in multisensory temporal function may play an important role in the perceptual and cognitive weaknesses that characterize these clinical disorders. Within this context, focus on improving the acuity of multisensory temporal function may have important implications for the amelioration of the "higher-order" deficits that serve as the defining features of these disorders.
Affiliation(s)
- Mark T Wallace: Vanderbilt Brain Institute, Vanderbilt University, 465 21st Avenue South, Nashville, TN 37232, USA; Department of Hearing & Speech Sciences, Vanderbilt University, Nashville, TN, USA; Department of Psychology, Vanderbilt University, Nashville, TN, USA; Department of Psychiatry, Vanderbilt University, Nashville, TN, USA
- Ryan A Stevenson: Department of Psychology, University of Toronto, Toronto, ON, Canada
16. Ménard L, Toupin C, Baum SR, Drouin S, Aubin J, Tiede M. Acoustic and articulatory analysis of French vowels produced by congenitally blind adults and sighted adults. J Acoust Soc Am 2013; 134:2975-2987. PMID: 24116433. DOI: 10.1121/1.4818740.
Abstract
In a previous paper [Ménard et al., J. Acoust. Soc. Am. 126, 1406-1414 (2009)], it was demonstrated that, despite enhanced auditory discrimination abilities for synthesized vowels, blind adult French speakers produced vowels that were closer together in the acoustic space than those produced by sighted adult French speakers, suggesting finer control of speech production in the sighted speakers. The goal of the present study is to further investigate the articulatory effects of visual deprivation on vowels produced by 11 blind and 11 sighted adult French speakers. Synchronous ultrasound, acoustic, and video recordings of the participants articulating the ten French oral vowels were made. Results show that sighted speakers produce vowels that are spaced significantly farther apart in the acoustic vowel space than blind speakers. Furthermore, blind speakers use smaller differences in lip protrusion but larger differences in tongue position and shape than their sighted peers to produce rounding and place of articulation contrasts. Trade-offs between lip and tongue positions were examined. Results are discussed in the light of the perception-for-action control theory.
Affiliation(s)
- Lucie Ménard: Laboratoire de Phonétique, Université du Québec à Montréal, Department of Linguistics, 320, Sainte-Catherine East, Montréal, Quebec H2X 1L7, Canada
17. Valkenier B, Duyne JY, Andringa TC, Baskent D. Audiovisual perception of congruent and incongruent Dutch front vowels. J Speech Lang Hear Res 2012; 55:1788-1801. PMID: 22992710. DOI: 10.1044/1092-4388(2012/11-0227).
Abstract
Purpose: Auditory perception of vowels in background noise is enhanced when combined with visually perceived speech features. The objective of this study was to investigate whether the influence of visual cues on vowel perception extends to incongruent vowels, in a manner similar to the McGurk effect observed with consonants. Method: Identification of Dutch front vowels /i, y, e, Y/ that share all features other than height and lip-rounding was measured for congruent and incongruent audiovisual conditions. The audio channel was systematically degraded by adding noise, increasing the reliance on visual cues. Results: The height feature was more robustly carried over through the auditory channel and the lip-rounding feature through the visual channel. Hence, congruent audiovisual presentation enhanced identification, while incongruent presentation led to perceptual fusions and thus decreased identification. Conclusions: Visual cues influence the identification of congruent as well as incongruent audiovisual vowels. Incongruent visual information results in perceptual fusions, demonstrating that the McGurk effect can be instigated by long phonemes such as vowels. This result extends to the incongruent presentation of the visually less reliably perceived height. The findings stress the importance of audiovisual congruency in communication devices, such as cochlear implants and videoconferencing tools, where the auditory signal could be degraded.
18. Vatakis A, Maragos P, Rodomagoulakis I, Spence C. Assessing the effect of physical differences in the articulation of consonants and vowels on audiovisual temporal perception. Front Integr Neurosci 2012; 6:71. PMID: 23060756. PMCID: PMC3461522. DOI: 10.3389/fnint.2012.00071.
Abstract
We investigated how the physical differences associated with the articulation of speech affect the temporal aspects of audiovisual speech perception. Video clips of consonants and vowels uttered by three different speakers were presented. The video clips were analyzed using an auditory-visual signal saliency model in order to compare signal saliency and behavioral data. Participants made temporal order judgments (TOJs) regarding which speech stream (auditory or visual) had been presented first. The sensitivity of participants' TOJs and the point of subjective simultaneity (PSS) were analyzed as a function of the place, manner of articulation, and voicing for consonants, and the height/backness of the tongue and lip-roundedness for vowels. We expected that in the case of the place of articulation and roundedness, where the visual speech signal is more salient, temporal perception of speech would be modulated by the visual speech signal. No such effect was expected for the manner of articulation or height. The results demonstrate that for place and manner of articulation, participants' temporal percept was affected (although not always significantly) by highly salient speech signals, with the visual signals requiring smaller visual leads at the PSS. This was not the case when height was evaluated. These findings suggest that in the case of audiovisual speech perception, a highly salient visual speech signal may lead to higher probabilities regarding the identity of the auditory signal, which modulate the temporal window of multisensory integration of the speech stimulus.
Affiliation(s)
- Petros Maragos: Computer Vision, Speech Communication and Signal Processing Group, National Technical University of Athens, Athens, Greece
- Isidoros Rodomagoulakis: Computer Vision, Speech Communication and Signal Processing Group, National Technical University of Athens, Athens, Greece
- Charles Spence: Crossmodal Research Laboratory, Department of Experimental Psychology, University of Oxford, UK
19. Audiovisual Asynchrony Detection and Speech Intelligibility in Noise With Moderate to Severe Sensorineural Hearing Impairment. Ear Hear 2011; 32:582-92. DOI: 10.1097/aud.0b013e31820fca23.
20. Kim J, Sironic A, Davis C. Hearing Speech in Noise: Seeing a Loud Talker is Better. Perception 2011; 40:853-62. DOI: 10.1068/p6941.
Abstract
Seeing the talker improves the intelligibility of speech degraded by noise (a visual speech benefit). Given that talkers exaggerate spoken articulation in noise, this set of two experiments examined whether the visual speech benefit was greater for speech produced in noise than in quiet. We first examined the extent to which spoken articulation was exaggerated in noise by measuring the motion of face markers as four people uttered 10 sentences either in quiet or in babble-speech noise (these renditions were also filmed). The tracking results showed that articulated motion in speech produced in noise was greater than that produced in quiet and was more highly correlated with speech acoustics. Speech intelligibility was tested in a second experiment using a speech-perception-in-noise task under auditory-visual and auditory-only conditions. The results showed that the visual speech benefit was greater for speech recorded in noise than for speech recorded in quiet. Furthermore, the amount of articulatory movement was related to performance on the perception task, indicating that the enhanced gestures made when speaking in noise function to make speech more intelligible.
Affiliation(s)
- Amanda Sironic: Department of Psychology, The University of Melbourne, Australia
21. The temporal distribution of information in audiovisual spoken-word identification. Atten Percept Psychophys 2010; 72:209-25. PMID: 20045890. DOI: 10.3758/app.72.1.209.
Abstract
In the present study, we examined the distribution and processing of information over time in auditory and visual speech as it is used in unimodal and bimodal word recognition. English consonant-vowel-consonant words representing all possible initial consonants were presented as auditory, visual, or audiovisual speech in a gating task. The distribution of information over time varied across and within features. Visual speech information was generally fully available early during the phoneme, whereas auditory information was still being accumulated. An audiovisual benefit was therefore already found early during the phoneme. The nature of the audiovisual recognition benefit changed, however, as more of the phoneme was presented. More features benefited at short gates than at longer ones. Visual speech information therefore plays a more important role early during the phoneme than later. The results of the study showed the complex interplay of information across modalities and time, which is essential in determining the time course of audiovisual spoken-word recognition.
22. Schwartz JL. A reanalysis of McGurk data suggests that audiovisual fusion in speech perception is subject-dependent. J Acoust Soc Am 2010; 127:1584-1594. PMID: 20329858. DOI: 10.1121/1.3293001.
Abstract
Audiovisual perception of conflicting stimuli displays a large level of intersubject variability, generally larger than pure auditory or visual data. However, it is not clear whether this actually reflects differences in integration per se or just the consequence of slight differences in unisensory perception. It is argued that the debate has been blurred by methodological problems in the analysis of experimental data, particularly when using the fuzzy-logical model of perception (FLMP) [Massaro, D. W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry (Laurence Erlbaum Associates, London)] shown to display overfitting abilities with McGurk stimuli [Schwartz, J. L. (2006). J. Acoust. Soc. Am. 120, 1795-1798]. A large corpus of McGurk data is reanalyzed, using a methodology based on (1) comparison of FLMP and a variant with subject-dependent weights of the auditory and visual inputs in the fusion process, the weighted FLMP (WFLMP); (2) use of a Bayesian model selection criterion instead of a root mean square error fit in model assessment; and (3) systematic exploration of the number of useful parameters in the models to compare, attempting to discard poorly explicative parameters. It is shown that WFLMP performs significantly better than FLMP, suggesting that audiovisual fusion is indeed subject-dependent, some subjects being more "auditory," and others more "visual." Intersubject variability has important consequences for the theoretical understanding of the fusion process and for the rehabilitation of hearing-impaired people.
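For reference, the standard FLMP combines unimodal supports multiplicatively, and one common way to make fusion subject-dependent is to weight the two modalities with an exponent; the exact parameterization used in this reanalysis may differ from the sketch below.

```latex
% FLMP fusion rule and a weighted (subject-dependent) variant, for illustration.
\[
  P_{\mathrm{FLMP}}(r \mid A, V) = \frac{a_r \, v_r}{\sum_{k} a_k \, v_k},
  \qquad
  P_{\mathrm{WFLMP}}(r \mid A, V) =
  \frac{a_r^{\lambda} \, v_r^{\,1-\lambda}}{\sum_{k} a_k^{\lambda} \, v_k^{\,1-\lambda}},
  \qquad 0 \le \lambda \le 1
\]
```

Here a_r and v_r are the unimodal supports for response r given the auditory and visual inputs, and λ captures how strongly a given subject weights the auditory channel, with λ = 0.5 giving the two modalities equal weight.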
Affiliation(s)
- Jean-Luc Schwartz: Department of Speech and Cognition/Institut de la Communication Parlée, GIPSA-Lab, UMR 5216, CNRS, Grenoble University, 38402 Saint Martin d'Hères Cedex, France
23. Kim J, Davis C, Groot C. Speech identification in noise: Contribution of temporal, spectral, and visual speech cues. J Acoust Soc Am 2009; 126:3246-3257. PMID: 20000938. DOI: 10.1121/1.3250425.
Abstract
This study investigated the degree to which two types of reduced auditory signals (cochlear implant simulations) and visual speech cues combined for speech identification. The auditory speech stimuli were filtered to have only amplitude envelope cues or both amplitude envelope and spectral cues and were presented with/without visual speech. In Experiment 1, IEEE sentences were presented in quiet and noise. For in-quiet presentation, speech identification was enhanced by the addition of both spectral and visual speech cues. Due to a ceiling effect, the degree to which these effects combined could not be determined. In noise, these facilitation effects were more marked and were additive. Experiment 2 examined consonant and vowel identification in the context of CVC or VCV syllables presented in noise. For consonants, both spectral and visual speech cues facilitated identification and these effects were additive. For vowels, the effect of combined cues was underadditive, with the effect of spectral cues reduced when presented with visual speech cues. Analysis indicated that without visual speech, spectral cues facilitated the transmission of place information and vowel height, whereas with visual speech, they facilitated lip rounding, with little impact on the transmission of place information.
Affiliation(s)
- Jeesun Kim: MARCS Auditory Laboratories, University of Western Sydney, NSW 1797, Australia
24. Ménard L, Dupont S, Baum SR, Aubin J. Production and perception of French vowels by congenitally blind adults and sighted adults. J Acoust Soc Am 2009; 126:1406-14. PMID: 19739754. DOI: 10.1121/1.3158930.
Abstract
The goal of this study is to investigate the production and perception of French vowels by blind and sighted speakers. Twelve blind adults and 12 sighted adults served as subjects. The auditory-perceptual abilities of each subject were evaluated by discrimination tests (AXB). At the production level, ten repetitions of the ten French oral vowels were recorded. Formant values and fundamental frequency values were extracted from the acoustic signal. Measures of contrasts between vowel categories were computed and compared for each feature (height, place of articulation, roundedness) and group (blind, sighted). The results reveal a significant effect of group (blind vs sighted) on production, with sighted speakers producing vowels that are spaced further apart in the vowel space than those of blind speakers. A group effect emerged for a subset of the perceptual contrasts examined, with blind speakers having higher peak discrimination scores than sighted speakers. Results suggest an important role of visual input in determining speech goals.
Affiliation(s)
- Lucie Ménard: Département de Linguistique et de Didactique des Langues, Laboratoire de Phonétique, Center for Research on Language, Mind, and Brain, Université du Québec à Montréal, Montreal, Quebec, Canada
25. Sodoyer D, Rivet B, Girin L, Savariaux C, Schwartz JL, Jutten C. A study of lip movements during spontaneous dialog and its application to voice activity detection. J Acoust Soc Am 2009; 125:1184-1196. PMID: 19206891. DOI: 10.1121/1.3050257.
Abstract
This paper presents a quantitative and comprehensive study of the lip movements of a given speaker in different speech/nonspeech contexts, with a particular focus on silences (i.e., when no sound is produced by the speaker). The aim is to characterize the relationship between "lip activity" and "speech activity" and then to use visual speech information as a voice activity detector (VAD). To this aim, an original audiovisual corpus was recorded with two speakers engaged in a spontaneous face-to-face dialog while located in separate rooms. Each speaker communicated with the other using a microphone, a camera, a screen, and headphones. This system was used to capture separate audio stimuli for each speaker and to synchronously monitor the speaker's lip movements. A comprehensive analysis was carried out on the lip shapes and lip movements in either silence or nonsilence (i.e., speech+nonspeech audible events). A single visual parameter, defined to characterize the lip movements, was shown to be efficient for the detection of silence sections. This results in a visual VAD that can be used in any kind of environment noise, including intricate and highly nonstationary noises, e.g., multiple and/or moving noise sources or competing speech signals.
Affiliation(s)
- David Sodoyer: Department of Speech and Cognition, GIPSA-lab, UMR 5216 CNRS, Grenoble-INP, Université Stendhal, Université Joseph Fourier, Grenoble, France
26. Richie C, Kewley-Port D. The effects of auditory-visual vowel identification training on speech recognition under difficult listening conditions. J Speech Lang Hear Res 2008; 51:1607-1619. PMID: 18695021. DOI: 10.1044/1092-4388(2008/07-0069).
Abstract
Purpose: The effective use of visual cues to speech provides benefit for adults with normal hearing in noisy environments and for adults with hearing loss in everyday communication. The purpose of this study was to examine the effects of a computer-based, auditory-visual vowel identification training program on sentence recognition under difficult listening conditions. Method: Normal-hearing adults were trained and tested under auditory-visual conditions, in noise designed to simulate the effects of a hearing loss. After initial tests of vowel, word, and sentence recognition, one group of participants received training on identification of 10 American English vowels in CVC context. Another group of participants received no training. All participants were then retested on vowel, word, and sentence recognition. Results: Improvements were seen for trained compared with untrained participants in auditory-visual speech recognition under difficult listening conditions, for vowels in monosyllables and key words in sentences. Conclusions: Results from this study suggest that benefit may be gained from this computer-based, auditory-visual vowel identification training method.
28. Schwartz JL, Berthommier F, Savariaux C. Seeing to hear better: evidence for early audio-visual interactions in speech identification. Cognition 2004; 93:B69-78. PMID: 15147940. DOI: 10.1016/j.cognition.2004.01.006.
Abstract
Lip reading is the ability to partially understand speech by looking at the speaker's lips. It improves the intelligibility of speech in noise when audio-visual perception is compared with audio-only perception. A recent set of experiments showed that seeing the speaker's lips also enhances sensitivity to acoustic information, decreasing the auditory detection threshold of speech embedded in noise [J. Acoust. Soc. Am. 109 (2001) 2272; J. Acoust. Soc. Am. 108 (2000) 1197]. However, detection is different from comprehension, and it remains to be seen whether improved sensitivity also results in an intelligibility gain in audio-visual speech perception. In this work, we use an original paradigm to show that seeing the speaker's lips enables the listener to hear better and hence to understand better. The audio-visual stimuli used here could not be differentiated by lip reading per se since they contained exactly the same lip gesture matched with different compatible speech sounds. Nevertheless, the noise-masked stimuli were more intelligible in the audio-visual condition than in the audio-only condition due to the contribution of visual information to the extraction of acoustic cues. Replacing the lip gesture by a non-speech visual input with exactly the same time course, providing the same temporal cues for extraction, removed the intelligibility benefit. This early contribution to audio-visual speech identification is discussed in relation to recent neurophysiological data on audio-visual perception.
Affiliation(s)
- Jean-Luc Schwartz: Institut de la Communication Parlée, CNRS-INPG-Université Stendhal, 46 Av. Félix Viallet, 38031 Grenoble 1, France
29. Girin L, Schwartz JL, Feng G. Audio-visual enhancement of speech in noise. J Acoust Soc Am 2001; 109:3007-3020. PMID: 11425143. DOI: 10.1121/1.1358887.
Abstract
A key problem for telecommunication or human-machine communication systems concerns speech enhancement in noise. In this domain, a certain number of techniques exist, all of them based on an acoustic-only approach, that is, the processing of the corrupted audio signal using audio information only (from the corrupted signal itself or additional audio information). In this paper, an audio-visual approach to the problem is considered, since it has been demonstrated in several studies that viewing the speaker's face improves message intelligibility, especially in noisy environments. A speech enhancement prototype system that takes advantage of visual inputs is developed. A filtering process approach is proposed that uses enhancement filters estimated with the help of lip shape information. The estimation process is based on linear regression or simple neural networks using a training corpus. A set of experiments assessed by Gaussian classification and perceptual tests demonstrates that it is indeed possible to enhance simple stimuli (vowel-plosive-vowel sequences) embedded in white Gaussian noise.
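The estimation step described here (mapping lip shape to enhancement filters with linear regression) can be illustrated with a minimal sketch; the feature set, dimensions, and random stand-in data below are placeholders for a real training corpus, not the paper's actual system.

```python
# Minimal sketch: learn a linear map from lip-shape parameters to per-band
# filter gains on a training corpus, then predict gains from lips alone.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in training corpus: per-frame lip features (e.g., width, height, area)
# and the per-band gains the enhancement filter should ideally apply.
lip_train = rng.random((500, 3))      # 500 frames x 3 lip parameters
gain_train = rng.random((500, 20))    # 500 frames x 20 frequency bands

# Least-squares linear regression with a bias term.
X = np.hstack([lip_train, np.ones((lip_train.shape[0], 1))])
W, *_ = np.linalg.lstsq(X, gain_train, rcond=None)

# At run time: predict filter gains for a new frame from its lip features only.
lip_new = rng.random((1, 3))
gain_pred = np.hstack([lip_new, np.ones((1, 1))]) @ W
print(gain_pred.shape)  # (1, 20): one gain per frequency band
```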
Affiliation(s)
- L Girin: Institut de la Communication Parlée, INPG/Université Stendhal/CNRS UMR 5009, Grenoble, France