1. Deng X, McClay E, Jastrzebski E, Wang Y, Yeung HH. Visual scanning patterns of a talking face when evaluating phonetic information in a native and non-native language. PLoS One 2024; 19:e0304150. PMID: 38805447; PMCID: PMC11132507; DOI: 10.1371/journal.pone.0304150.
Abstract
When comprehending speech, listeners can use information encoded in visual cues from a face to enhance auditory speech comprehension. For example, prior work has shown that mouth movements reflect articulatory features of speech segments and durational information, while pitch and speech amplitude are primarily cued by eyebrow and head movements. Little is known about how the visual perception of segmental and prosodic speech information is influenced by linguistic experience. Using eye-tracking, we studied how perceivers' visual scanning of different regions on a talking face predicts accuracy in a task targeting segmental versus prosodic information, and asked how this was influenced by language familiarity. Twenty-four native English perceivers heard two audio sentences in either English or Mandarin (an unfamiliar, non-native language), which sometimes differed in segmental or prosodic information (or both). Perceivers then saw a silent video of a talking face and judged whether that video matched either the first or second audio sentence (or whether both sentences were the same). First, increased looking to the mouth predicted correct responses only in non-native language trials. Second, the start of a successful search for speech information in the mouth area was significantly delayed in non-native versus native trials, but only when the auditory sentences differed in prosodic information, not when they differed in segmental information. Third, in correct trials, saccade amplitude was significantly greater in native language trials than in non-native trials, indicating more intensely focused fixations in the latter. Taken together, these results suggest that mouth-looking was generally more evident when processing a non-native versus a native language in all analyses, but that, when measuring perceivers' latency to fixate the mouth, this language effect was largest in trials where only prosodic information was useful for the task.
Affiliation(s)
- Xizi Deng, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada
- Elise McClay, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada
- Erin Jastrzebski, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada
- Yue Wang, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada
- H. Henny Yeung, Department of Linguistics, Simon Fraser University, Burnaby, BC, Canada
2. Mitchel AD, Lusk LG, Wellington I, Mook AT. Segmenting Speech by Mouth: The Role of Oral Prosodic Cues for Visual Speech Segmentation. Language and Speech 2023; 66:819-832. PMID: 36448317; DOI: 10.1177/00238309221137607.
Abstract
Adults are able to use visual prosodic cues in the speaker's face to segment speech. Furthermore, eye-tracking data suggest that learners shift their gaze to the mouth during visual speech segmentation. Although these findings suggest that the mouth may be viewed more than the eyes or nose during visual speech segmentation, no study has examined the direct functional importance of individual features; thus, it is unclear which visual prosodic cues are important for word segmentation. In this study, we examined the impact of first removing (Experiment 1) and then isolating (Experiment 2) individual facial features on visual speech segmentation. Segmentation performance was above chance in all conditions except when the visual display was restricted to the eye region (the eyes-only condition in Experiment 2). This suggests that participants were able to segment speech when they could visually access the mouth, but not when the mouth was completely removed from the visual display, providing evidence that the visual prosodic cues conveyed by the mouth are sufficient, and likely necessary, for visual speech segmentation.
Affiliation(s)
- Laina G Lusk, Bucknell University, USA; Children's Hospital of Philadelphia, USA
- Ian Wellington, Bucknell University, USA; University of Connecticut, USA
3. Kawase S, Davis C, Kim J. A Visual Speech Intelligibility Benefit Based on Speech Rhythm. Brain Sci 2023; 13:932. PMID: 37371410; DOI: 10.3390/brainsci13060932.
Abstract
This study examined whether visual speech provides speech-rhythm information that perceivers can use in speech perception. This was tested using speech that naturally varied in the familiarity of its rhythm. Thirty Australian English L1 listeners performed a speech-perception-in-noise task with English sentences produced by three speakers: an English L1 speaker (familiar rhythm); an experienced English L2 speaker with a weak foreign accent (familiar rhythm); and an inexperienced English L2 speaker with a strong foreign accent (unfamiliar rhythm). The spoken sentences were presented in three conditions: Audio-Only (AO), Audio-Visual with the mouth covered (AVm), and Audio-Visual (AV). Speech was best recognized in the AV condition regardless of the degree of foreign accent. However, speech recognition in AVm was better than in AO for the speech with no foreign accent and with a weak accent, but not for the speech with a strong accent. A follow-up experiment using only the speech with a strong foreign accent, under more audible conditions, again showed no difference between the AVm and AO conditions, indicating that the null effect was not due to a floor effect. We propose that speech rhythm is conveyed by the motion of the jaw opening and closing, and that perceivers use this information to better perceive speech in noise.
Affiliation(s)
- Saya Kawase, The MARCS Institute, Western Sydney University, Penrith, NSW 2751, Australia
- Chris Davis, The MARCS Institute, Western Sydney University, Penrith, NSW 2751, Australia
- Jeesun Kim, The MARCS Institute, Western Sydney University, Penrith, NSW 2751, Australia
4. Żygis M, Fuchs S. Communicative constraints affect oro-facial gestures and acoustics: Whispered vs normal speech. J Acoust Soc Am 2023; 153:613. PMID: 36732243; DOI: 10.1121/10.0015251.
Abstract
This paper investigates the relationship between the acoustic signal and oro-facial expressions (gestures) when speakers (i) speak normally or whisper, (ii) do or do not see each other, and (iii) produce questions as opposed to statements. To this end, we conducted a motion-capture experiment with 17 native speakers of German. The results provide partial support for the hypothesis that the most intensified oro-facial expressions occur when speakers whisper, do not see each other, and produce questions. The results are interpreted in terms of two hypotheses, the "hand-in-hand" and "trade-off" hypotheses, and the relationship between acoustic properties and gestures does not provide straightforward support for either one. Depending on the condition, speakers used more pronounced gestures and longer durations to compensate for the lack of fundamental frequency (supporting the trade-off hypothesis), but since gestures were also enhanced when the listener was invisible, we conclude that they are not produced solely for the needs of the listener (supporting the hand-in-hand hypothesis); rather, they seem to help the speaker achieve an overarching communicative goal.
Affiliation(s)
- Marzena Żygis, Leibniz-Zentrum Allgemeine Sprachwissenschaft, 10117 Berlin, Germany
- Susanne Fuchs, Leibniz-Zentrum Allgemeine Sprachwissenschaft, 10117 Berlin, Germany
5. Esteve-Gibert N, Guellaï B. Prosody in the Auditory and Visual Domains: A Developmental Perspective. Front Psychol 2018; 9:338. PMID: 29615944; PMCID: PMC5868325; DOI: 10.3389/fpsyg.2018.00338.
Abstract
The development of body movements such as hand or head gestures, or facial expressions, seems to go hand-in-hand with the development of speech abilities. We know that very young infants rely on the movements of their caregivers' mouths to segment the speech stream, that infants' canonical babbling is temporally related to rhythmic hand movements, that narrative abilities emerge at a similar time in speech and gesture, and that children make use of both modalities to access complex pragmatic intentions. Prosody has emerged as a key linguistic component in this speech-gesture relationship, yet its exact role in the development of multimodal communication is still not well understood. For example, it is not clear what the relative weights of speech prosody and body gestures are in language acquisition, whether both modalities develop at the same time, or whether one modality needs to be in place before the other can emerge. The present paper reviews the existing literature on the interactions between speech prosody and body movements from a developmental perspective in order to shed light on these issues.
Affiliation(s)
- Núria Esteve-Gibert, Departament de Llengües i Literatures Modernes i d’Estudis Anglesos, Universitat de Barcelona (UB), Barcelona, Spain
- Bahia Guellaï, Laboratoire Ethologie, Cognition, Développement, Université Paris Nanterre, Nanterre, France
6.
7. Lusk LG, Mitchel AD. Differential Gaze Patterns on Eyes and Mouth During Audiovisual Speech Segmentation. Front Psychol 2016; 7:52. PMID: 26869959; PMCID: PMC4735377; DOI: 10.3389/fpsyg.2016.00052.
Abstract
Speech is inextricably multisensory: both auditory and visual components provide critical information for all aspects of speech processing, including speech segmentation, whose visual components have been the target of a growing number of studies. In particular, a recent study (Mitchel and Weiss, 2014) established that adults can use facial cues (i.e., visual prosody) to identify word boundaries in fluent speech. The current study expanded upon these results, using an eye tracker to identify the most attended facial features of the audiovisual display used in Mitchel and Weiss (2014). Subjects spent the most time watching the eyes and mouth, and a significant trend in gaze durations was found, with the longest gaze duration on the mouth, followed by the eyes and then the nose. In addition, eye-gaze patterns changed across familiarization as subjects learned the word boundaries, showing decreased attention to the mouth in later blocks while attention to other facial features remained consistent. These findings highlight the importance of the visual component of speech processing and suggest that the mouth may play a critical role in visual speech segmentation.
Affiliation(s)
- Laina G Lusk, Neuroscience Program, Bucknell University, Lewisburg, PA, USA
- Aaron D Mitchel, Neuroscience Program, Bucknell University, Lewisburg, PA, USA; Department of Psychology, Bucknell University, Lewisburg, PA, USA
8. Experience with a talker can transfer across modalities to facilitate lipreading. Atten Percept Psychophys 2013; 75:1359-1365. PMID: 23955059; DOI: 10.3758/s13414-013-0534-x.
Abstract
Rosenblum, Miller, and Sanchez (Psychological Science, 18, 392-396, 2007) found that subjects first trained to lip-read a particular talker were then better able to perceive the auditory speech of that same talker, as compared with that of a novel talker. This suggests that the talker experience a perceiver gains in one sensory modality can be transferred to another modality to make that speech easier to perceive. An experiment was conducted to examine whether this cross-sensory transfer of talker experience could occur (1) from auditory to lip-read speech, (2) with subjects not screened for adequate lipreading skill, (3) when both a familiar and an unfamiliar talker are presented during lipreading, and (4) for both old (presentation set) and new words. Subjects were first asked to identify a set of words from a talker. They were then asked to perform a lipreading task from two faces, one of which was of the same talker they heard in the first phase of the experiment. Results revealed that subjects who lip-read from the same talker they had heard performed better than those who lip-read a different talker, regardless of whether the words were old or new. These results add further evidence that learning of amodal talker information can facilitate speech perception across modalities and also suggest that this information is not restricted to previously heard words.
9. van der Zande P, Jesse A, Cutler A. Lexically guided retuning of visual phonetic categories. J Acoust Soc Am 2013; 134:562-571. PMID: 23862831; DOI: 10.1121/1.4807814.
Abstract
Listeners retune the boundaries between phonetic categories to adjust to individual speakers' productions. Lexical information, for example, indicates what an unusual sound is supposed to be, and boundary retuning then enables the speaker's sound to be included in the appropriate auditory phonetic category. This study investigated whether lexical knowledge, which is known to guide the retuning of auditory phonetic categories, can also retune visual phonetic categories. In Experiment 1, exposure to a visual idiosyncrasy in ambiguous, audiovisually presented target words in a lexical decision task indeed resulted in retuning of the visual category boundary based on the disambiguating lexical context. Experiment 2 tested whether lexical information retunes visual categories directly, or indirectly through generalization from retuned auditory phonetic categories. Here, participants were exposed to auditory-only versions of the same ambiguous target words as in Experiment 1. Auditory phonetic categories were retuned by lexical knowledge, but no shifts were observed for the visual phonetic categories. Lexical knowledge can therefore guide retuning of visual phonetic categories, but lexically guided retuning of auditory phonetic categories does not generalize to visual categories. Rather, listeners adjust auditory and visual phonetic categories to talker idiosyncrasies separately.
Affiliation(s)
- Patrick van der Zande, Max Planck Institute for Psycholinguistics, P.O. Box 310, 6500 AH Nijmegen, The Netherlands