1. Cohn M, Barreda S, Graf Estes K, Yu Z, Zellou G. Children and adults produce distinct technology- and human-directed speech. Sci Rep 2024;14:15611. PMID: 38971806; PMCID: PMC11227501; DOI: 10.1038/s41598-024-66313-5.
Abstract
This study compares how English-speaking adults and children from the United States adapt their speech when talking to a real person versus a smart speaker (Amazon Alexa) in a psycholinguistic experiment. Overall, participants produced more effortful speech when talking to the device (longer duration and higher pitch). These differences also varied by age: children produced even higher pitch in device-directed speech, suggesting a stronger expectation of being misunderstood by the system. In support of this, after a staged recognition error by the device, children increased their pitch even more. Furthermore, adults and children displayed the same degree of variation in their responses to whether "Alexa seems like a real person or not", further indicating that children's register adjustments were shaped by their conceptualization of the system's competence rather than by heightened anthropomorphism. This work speaks to models of the mechanisms underlying speech production and to human-computer interaction frameworks, providing support for routinized theories of spoken interaction with technology.
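The two acoustic measures at issue, utterance duration and fundamental frequency (pitch), are easy to extract in practice. Below is a minimal illustrative sketch, not the authors' pipeline, using Parselmouth (a Python interface to Praat); the file names are hypothetical placeholders.

```python
# Sketch: compare mean F0 and duration between human- and device-directed
# recordings. Illustrative only; file paths are hypothetical.
import numpy as np
import parselmouth

def f0_and_duration(wav_path):
    """Return (mean F0 in Hz over voiced frames, duration in seconds)."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()                   # Praat's default pitch tracker
    f0 = pitch.selected_array["frequency"]   # 0.0 marks unvoiced frames
    voiced = f0[f0 > 0]
    return (voiced.mean() if voiced.size else np.nan), snd.duration

human_f0, human_dur = f0_and_duration("human_directed.wav")
device_f0, device_dur = f0_and_duration("device_directed.wav")
print(f"device minus human: {device_f0 - human_f0:+.1f} Hz, "
      f"{device_dur - human_dur:+.2f} s")
```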
Affiliation(s)
- Michelle Cohn
- Phonetics Laboratory, Department of Linguistics, University of California, Davis, Davis, USA.
- Santiago Barreda
- Phonetics Laboratory, Department of Linguistics, University of California, Davis, Davis, USA
- Katharine Graf Estes
- Language Learning Lab, Department of Psychology, University of California, Davis, Davis, USA
- Zhou Yu
- Natural Language Processing (NLP) Lab, Department of Computer Science, Columbia University, New York, USA
- Georgia Zellou
- Phonetics Laboratory, Department of Linguistics, University of California, Davis, Davis, USA
2. Kim J, Hazan V, Tuomainen O, Davis C. Partner-directed gaze and co-speech hand gestures: effects of age, hearing loss and noise. Front Psychol 2024;15:1324667. PMID: 38882511; PMCID: PMC11178134; DOI: 10.3389/fpsyg.2024.1324667.
Abstract
Research on the adaptations talkers make to different communication conditions during interactive conversations has primarily focused on speech signals. We extended this type of investigation to two other important communicative signals, i.e., partner-directed gaze and iconic co-speech hand gestures, with the aim of determining whether the adaptations made by older adults differ from those of younger adults across communication conditions. We recruited 57 pairs of participants, comprising 57 primary talkers and 57 secondary talkers. Primary talkers consisted of three groups: 19 older adults with mild hearing loss (older adult-HL); 17 older adults with normal hearing (older adult-NH); and 21 younger adults. The DiapixUK "spot the difference" conversation-based task was used to elicit conversations in participant pairs. One easy (No Barrier: NB) and three difficult communication conditions were tested. In two of the difficult conditions the primary talker could hear clearly but the secondary talker could not, due to multi-talker babble noise (BAB1) or a less familiar hearing loss simulation (HLS); in the third, both primary and secondary talkers heard each other in babble noise (BAB2). For primary talkers, we measured the mean number of partner-directed gazes, the mean total gaze duration, and the mean number of co-speech hand gestures. We found robust effects of communication condition that interacted with participant group. Effects of age were found for both gaze and gesture in BAB1: older adult-NH participants looked and gestured less than younger adults did when the secondary talker experienced babble noise. For hearing status, a difference in gaze between older adult-NH and older adult-HL was found in the BAB1 condition; for gesture, this difference was significant in all three difficult communication conditions (older adult-HL participants gazed and gestured more). We propose that the age effect may be due to a decline in older adults' attention to cues signalling how well a conversation is progressing. To explain the hearing status effect, we suggest that this attentional decline is offset by hearing loss because these participants have learned to pay greater attention to visual cues for understanding speech.
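For readers analysing similar designs, group-by-condition interactions of this kind are typically tested with mixed-effects models. The sketch below is a hedged illustration, not the authors' analysis; the variable names and synthetic data are invented.

```python
# Sketch: random-intercept model of gaze counts with a group x condition
# interaction, grouped by talker pair. Data below are synthetic stand-ins.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
groups = ["younger", "older_NH", "older_HL"]
conds = ["NB", "BAB1", "HLS", "BAB2"]
rows = [{"pair": p, "group": g, "condition": c,
         "gaze_count": int(rng.poisson(20))}
        for p, g in enumerate(np.repeat(groups, 19))   # 57 primary talkers
        for c in conds]
df = pd.DataFrame(rows)

model = smf.mixedlm("gaze_count ~ group * condition", df, groups=df["pair"])
print(model.fit().summary())
```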
Affiliation(s)
- Jeesun Kim
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Sydney, NSW, Australia
- Valerie Hazan
- Speech Hearing and Phonetic Sciences, University College London, London, United Kingdom
- Outi Tuomainen
- Department of Linguistics, University of Potsdam, Potsdam, Germany
- Chris Davis
- The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Sydney, NSW, Australia
3. Sekine K, Özyürek A. Children benefit from gestures to understand degraded speech but to a lesser extent than adults. Front Psychol 2024;14:1305562. PMID: 38303780; PMCID: PMC10832995; DOI: 10.3389/fpsyg.2023.1305562.
Abstract
The present study investigated to what extent children, compared to adults, benefit from gestures in disambiguating degraded speech, by manipulating the speech signal and the manual modality. Dutch-speaking adults (N = 20) and 6- and 7-year-old children (N = 15) were presented with a series of video clips in which an actor produced a Dutch action verb with or without an accompanying iconic gesture. Participants were then asked to repeat what they had heard. The speech signal was either clear or degraded into 4-band or 8-band noise-vocoded speech. Children had more difficulty than adults in disambiguating degraded speech in the speech-only condition. However, when presented with both speech and gestures, children reached a level of accuracy comparable to that of adults in the degraded-speech-only condition. Furthermore, for adults the enhancement from gestures was greater in the 4-band condition than in the 8-band condition, whereas children showed the opposite pattern. Gestures thus help children to disambiguate degraded speech, but children need more phonological information than adults do to benefit from them. Children's multimodal language integration needs to develop further to adapt flexibly to challenging situations such as the degraded speech tested in our study, or everyday instances where speech is heard against environmental noise or through a face mask.
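Noise-vocoding replaces the fine spectral structure in each frequency band with band-limited noise shaped by that band's amplitude envelope, so fewer bands leave less phonological information intact. The following is a minimal sketch of the general technique; the band edges and filter settings are my assumptions, not the study's stimulus parameters.

```python
# Sketch: n-band noise vocoder. Assumes x is a mono signal at a sampling
# rate fs high enough for the chosen band edges (e.g. 16 kHz).
import numpy as np
from scipy.signal import butter, filtfilt

def noise_vocode(x, fs, n_bands=4, lo=100.0, hi=7000.0, env_cut=30.0):
    edges = np.geomspace(lo, hi, n_bands + 1)            # log-spaced bands
    env_b, env_a = butter(2, env_cut / (fs / 2), "low")  # envelope smoother
    noise = np.random.default_rng(0).standard_normal(len(x))
    out = np.zeros(len(x))
    for f1, f2 in zip(edges[:-1], edges[1:]):
        b, a = butter(3, [f1 / (fs / 2), f2 / (fs / 2)], "bandpass")
        band = filtfilt(b, a, x)                         # band-limit speech
        env = filtfilt(env_b, env_a, np.abs(band))       # amplitude envelope
        out += filtfilt(b, a, noise) * np.clip(env, 0.0, None)
    return out / (np.max(np.abs(out)) + 1e-9)            # normalise
```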
Affiliation(s)
- Kazuki Sekine
- Faculty of Human Sciences, Waseda University, Tokorozawa, Japan
- Aslı Özyürek
- Centre for Language Studies, Radboud University, Nijmegen, Netherlands
- Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
- Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
4. Elie B, Šimko J, Turk A. Optimization-based modeling of Lombard speech articulation: supraglottal characteristics. JASA Express Lett 2024;4:015204. PMID: 38206126; DOI: 10.1121/10.0024364.
Abstract
This paper shows that a highly simplified model of speech production, based on optimizing a trade-off between articulatory effort and intelligibility, can account for some observed articulatory consequences of changes in signal-to-noise ratio. Simulations of static vowels in the presence of various background noise levels show that the model predicts articulatory and acoustic modifications of the type observed in Lombard speech. These features were obtained only when the constraint applied to articulatory effort was allowed to decrease as the level of background noise increased. These results support the hypothesis that Lombard speech is listener-oriented and that speakers adapt their articulation in noisy environments.
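To make the trade-off concrete, here is a deliberately toy formulation, my own construction rather than the paper's articulatory model: a scalar effort level minimises an effort penalty minus an intelligibility benefit, with the effort weight shrinking as noise rises, which is the condition the authors found necessary for Lombard-like behaviour to emerge.

```python
# Toy sketch: optimal effort rises with background noise only because the
# effort weight decreases with noise. All constants are illustrative.
import numpy as np
from scipy.optimize import minimize_scalar

def intelligibility(effort, noise_db):
    # Logistic function of an effective SNR: more effort, clearer speech.
    return 1.0 / (1.0 + np.exp(-0.5 * (10.0 * effort - noise_db + 50.0)))

def effort_weight(noise_db):
    # Key assumption mirroring the paper's result: the constraint on
    # articulatory effort relaxes as background noise increases.
    return 1.0 / (1.0 + 0.05 * noise_db)

def optimal_effort(noise_db):
    cost = lambda e: effort_weight(noise_db) * e - intelligibility(e, noise_db)
    return minimize_scalar(cost, bounds=(0.0, 5.0), method="bounded").x

for noise in (40, 60, 80):    # quiet -> loud background, in dB
    print(f"noise {noise} dB -> optimal effort {optimal_effort(noise):.1f}")
```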
Affiliation(s)
- Benjamin Elie
- Linguistics and English Language, School of Philosophy, Psychology and Language Sciences, The University of Edinburgh, Edinburgh, Scotland, United Kingdom
- Juraj Šimko
- Department of Digital Humanities, Faculty of Arts, University of Helsinki, Helsinki, Finland
| | - Alice Turk
- Linguistics and English Language, School of Philosophy, Psychology and Language Sciences, The University of Edinburgh, Edinburgh, Scotland, United Kingdom
5. Kąkol K, Korvel G, Kostek B. Noise profiling for speech enhancement employing machine learning models. J Acoust Soc Am 2022;152:3595. PMID: 36586827; DOI: 10.1121/10.0016495.
Abstract
This paper proposes a noise profiling method based on machine learning (ML) that can be performed in near real time. To address the challenges of noise profiling effectively, we start with a critical review of the background literature. We then outline an experiment in two parts. The first part concerns a noise recognition model built upon several baseline classifiers and noise signal features derived from the Aurora noise dataset, with the aim of selecting the best-performing classifier for noise profiling. We therefore compare all classifier outcomes using effectiveness metrics and present confusion matrices for all tested models. The second part of the experiment takes the best-scoring algorithm, Naive Bayes, which achieved an accuracy of 96.76%, and uses it in a noise-type recognition model to demonstrate that it performs stably. Classification results are derived from real-life recordings made in momentary and averaging modes. The key contribution concerns speech intelligibility improvements in the presence of noise, where identifying the type of noise is crucial. Finally, we summarize the overall findings and outline directions for future work.
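Schematically, the first part of such a pipeline looks like the sketch below. This is a hedged illustration under my own assumptions: the feature set is generic, and the random arrays stand in for labelled Aurora-style noise recordings.

```python
# Sketch: frame-level spectral features -> Gaussian Naive Bayes noise-type
# classifier. Features and data are illustrative stand-ins.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def spectral_features(frame, fs):
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    p = spec / (spec.sum() + 1e-12)
    centroid = (freqs * p).sum()                          # spectral centroid
    rolloff = freqs[np.searchsorted(np.cumsum(p), 0.85)]  # 85% rolloff
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)    # zero crossings
    return [centroid, rolloff, zcr, np.log(spec.mean() + 1e-12)]

rng = np.random.default_rng(0)
frames = rng.standard_normal((200, 512))   # stand-in for real noise frames
X = np.array([spectral_features(f, 16000) for f in frames])
y = rng.integers(0, 4, size=200)           # stand-in noise-type labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GaussianNB().fit(X_tr, y_tr)
print(accuracy_score(y_te, model.predict(X_te)))
print(confusion_matrix(y_te, model.predict(X_te)))
```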
Affiliation(s)
- Gražina Korvel
- Institute of Data Science and Digital Technologies, Vilnius University, Vilnius, 08412, Lithuania
- Bożena Kostek
- Audio Acoustics Laboratory, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdańsk, 80-233, Poland
6. Pouw W, Fuchs S. Origins of vocal-entangled gesture. Neurosci Biobehav Rev 2022;141:104836. PMID: 36031008; DOI: 10.1016/j.neubiorev.2022.104836.
Abstract
Gestures during speaking are typically understood in a representational framework: they represent absent or distal states of affairs by means of pointing, resemblance, or symbolic replacement. However, humans also gesture along with the rhythm of speaking, which is amenable to a non-representational perspective. Such a perspective centers on the phenomenon of vocal-entangled gestures and builds on evidence showing that when an upper limb with a certain mass accelerates or decelerates sufficiently, it yields impulses on the body that cascade in various ways into the respiratory-vocal system. It entails a physical entanglement between body motions, respiration, and vocal activities. It is shown that vocal-entangled gestures are realized in infant vocal-motor babbling before any representational use of gesture develops. Similarly, an overview is given of vocal-entangled processes in non-human animals; they can frequently be found in rats, bats, birds, and a range of other species that branched off even earlier in the phylogenetic tree. Thus, the origins of human gesture lie in biomechanics, emerging early in ontogeny and running deep in phylogeny.
Affiliation(s)
- Wim Pouw
- Donders Institute for Brain, Cognition, and Behaviour, Radboud University, Nijmegen, the Netherlands.
- Susanne Fuchs
- Leibniz Center General Linguistics, Berlin, Germany.
7. Trujillo JP, Levinson SC, Holler J. A multi-scale investigation of the human communication system's response to visual disruption. R Soc Open Sci 2022;9:211489. PMID: 35425638; PMCID: PMC9006025; DOI: 10.1098/rsos.211489.
Abstract
In human communication, when speech is disrupted, the visual channel (e.g. manual gestures) can compensate to ensure successful communication. Whether speech also compensates when the visual channel is disrupted is an open question, and one that bears significantly on the status of the gestural modality. We test whether gesture and speech are dynamically co-adapted to meet communicative needs. To this end, we parametrically reduce visibility during casual conversational interaction and measure the effects on speakers' communicative behaviour, using motion tracking and manual annotation for kinematic and acoustic analyses. We found that visual signalling effort was flexibly adapted in response to a decrease in visual quality (especially motion energy, gesture rate, size, velocity and hold-time). Interestingly, speech was also affected: speech intensity increased in response to reduced visual quality (particularly in speech-gesture utterances, but independently of kinematics). Our findings highlight that multimodal communicative behaviours are flexibly adapted at multiple scales of measurement and question the notion that gesture plays a role subordinate to speech.
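The kinematic measures named here can be recovered from tracked keypoints in a few lines. Below is a minimal sketch with my own simplified definitions, not the authors' pipeline; "motion energy" is approximated as total path length, and the thresholds are arbitrary.

```python
# Sketch: peak speed, motion energy, and hold-time from a tracked wrist
# trajectory. Definitions are simplified illustrations.
import numpy as np

def gesture_kinematics(xyz, fs, hold_speed=0.02, min_hold=0.2):
    """xyz: (n_frames, 3) wrist positions in metres; fs: frame rate in Hz."""
    speed = np.linalg.norm(np.diff(xyz, axis=0), axis=1) * fs   # m/s
    motion_energy = speed.sum() / fs        # total path length in metres
    still = speed < hold_speed              # near-stationary frames
    hold_time, run = 0.0, 0
    for s in np.append(still, False):       # sentinel flushes the last run
        if s:
            run += 1
        elif run / fs >= min_hold:          # count runs of at least min_hold s
            hold_time, run = hold_time + run / fs, 0
        else:
            run = 0
    return {"peak_speed": float(speed.max()),
            "motion_energy": float(motion_energy),
            "hold_time": float(hold_time)}

# Synthetic check: 2 s of movement then a 1 s hold, sampled at 100 Hz
t = np.linspace(0, 2, 200)
move = np.column_stack([np.sin(t), np.cos(t), t])
hold = np.repeat(move[-1:], 100, axis=0)
print(gesture_kinematics(np.vstack([move, hold]), fs=100))
```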
Affiliation(s)
- James P. Trujillo
- Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands
- Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525XD Nijmegen, The Netherlands
- Stephen C. Levinson
- Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525XD Nijmegen, The Netherlands
- Judith Holler
- Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands
- Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525XD Nijmegen, The Netherlands