1. Shahid MS, French AP, Valstar MF, Yakubov GE. Research in methodologies for modelling the oral cavity. Biomed Phys Eng Express 2024; 10:032001. [PMID: 38350128] [DOI: 10.1088/2057-1976/ad28cc]
Abstract
The paper aims to explore the current state of understanding surrounding in silico oral modelling. This involves exploring methodologies, technologies and approaches pertaining to the modelling of the whole oral cavity: both internally and externally visible structures that may be relevant or appropriate to oral actions. Such a model could be referred to as a 'complete model', one that considers a full set of facial features (i.e. not only the mouth) as well as synergistic stimuli such as audio and facial thermal data. 3D modelling technologies capable of accurately and efficiently capturing a complete representation of the mouth for an individual have broad applications in the study of oral actions, owing to their cost-effectiveness and time efficiency. This review delves into the field of clinical phonetics to classify oral actions pertaining to both speech and non-speech movements, identifying how the various vocal organs play a role in the articulatory and masticatory processes. Vitally, it provides a summation of 12 articulatory recording methods, forming a tool researchers can use to identify which recording method is appropriate for their work. After addressing the cost and resource-intensive limitations of existing methods, a new system of modelling is proposed that leverages external-to-internal correlation modelling techniques to create more efficient models of the oral cavity. The vision is that the outcomes will be applicable to a broad spectrum of oral functions related to physiology, health and wellbeing, including speech, oral processing of foods and dental health. The applications may span from speech correction to designing foods for the ageing population, whilst in the dental field information about patients' oral actions could become part of creating a personalised dental treatment plan.
Affiliation(s)
- Andrew P French
- School of Computer Science, University of Nottingham, NG8 1BB, United Kingdom
- School of Biosciences, University of Nottingham, LE12 5RD, United Kingdom
- Michel F Valstar
- School of Computer Science, University of Nottingham, NG8 1BB, United Kingdom
- Gleb E Yakubov
- School of Biosciences, University of Nottingham, LE12 5RD, United Kingdom
2. Kuo C, Berry J. The Relationship Between Acoustic and Kinematic Vowel Space Areas With and Without Normalization for Speakers With and Without Dysarthria. Am J Speech Lang Pathol 2023; 32:1923-1937. [PMID: 37105919] [PMCID: PMC10561967] [DOI: 10.1044/2023_ajslp-22-00158]
Abstract
PURPOSE Few studies have reported on the vowel space area (VSA) in both acoustic and kinematic domains. This study examined acoustic and kinematic VSAs for speakers with and without dysarthria and evaluated the effects of normalization on acoustic and kinematic VSAs and on the relationship between these measures. METHOD Vowel data from 12 speakers with and without dysarthria, presenting with a range of speech abilities, were examined. The speakers included four speakers with Parkinson's disease (PD), four speakers with brain injury (BI), and four neurotypical (NT) speakers. Speech acoustic and kinematic data were acquired simultaneously using electromagnetic articulography during a passage reading task. Raw and normalized VSAs calculated from the corner vowels /i/, /æ/, /ɑ/, and /u/ were evaluated. Normalization was achieved through z-score transformations of the acoustic and kinematic data. The effect of normalization on variability within and across groups was evaluated. Regression analysis was used across speakers to assess the association between acoustic and kinematic VSAs for both raw and normalized data. RESULTS When the speakers were evaluated as three groups (i.e., PD, BI, and NT), normalization reduced the standard deviations within each group and changed the relative differences in average magnitude between groups. Regression analysis revealed a significant relationship between normalized, but not raw, acoustic and kinematic VSAs after the exclusion of an outlier speaker. CONCLUSIONS Normalization reduces variability across speakers within groups and changes average magnitudes, affecting speaker-group comparisons. Normalization also influences the correlation between acoustic and kinematic measures. Further investigation of the impact of normalization techniques on acoustic and kinematic measures is warranted. SUPPLEMENTAL MATERIAL https://doi.org/10.23641/asha.22669747.
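For readers who want to reproduce the basic measure, a minimal sketch of the VSA computation follows: the quadrilateral area enclosed by the four corner vowels (shoelace formula), before and after z-score normalization. The corner-vowel formant values are hypothetical placeholders, and the z-scoring here uses only the four corner means for brevity; the study normalized the full acoustic and kinematic data sets, so this illustrates the idea rather than the authors' implementation.

```python
import numpy as np

def polygon_area(points):
    """Shoelace formula: area of a simple polygon from ordered vertices."""
    x, y = np.asarray(points, dtype=float).T
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

# Hypothetical mean (F1, F2) values in Hz for the corner vowels,
# ordered /i/ -> /ae/ -> /a/ -> /u/ so they trace the quadrilateral.
corners = [(300, 2300),   # /i/
           (700, 1800),   # /ae/
           (750, 1100),   # /a/
           (320, 900)]    # /u/

raw_vsa = polygon_area(corners)  # in Hz^2

# z-score normalization: standardize each formant dimension, then
# recompute the area (now unitless). Four corner means are used here
# for brevity; per-speaker normalization would use all vowel tokens.
pts = np.asarray(corners, dtype=float)
z_pts = (pts - pts.mean(axis=0)) / pts.std(axis=0)
norm_vsa = polygon_area(z_pts)

print(f"raw VSA: {raw_vsa:.0f} Hz^2  normalized VSA: {norm_vsa:.2f}")
```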
Affiliation(s)
- Christina Kuo
- Department of Communication Sciences and Disorders, James Madison University, Harrisonburg, VA
- Jeffrey Berry
- Department of Speech Pathology and Audiology, Marquette University, Milwaukee, WI
3. Nault DR, Mitsuya T, Purcell DW, Munhall KG. Perturbing the consistency of auditory feedback in speech. Front Hum Neurosci 2022; 16:905365. [PMID: 36092651] [PMCID: PMC9453207] [DOI: 10.3389/fnhum.2022.905365]
Abstract
Sensory information, including auditory feedback, is used by talkers to maintain fluent speech articulation. Current models of speech motor control posit that speakers continually adjust their motor commands based on discrepancies between the sensory predictions made by a forward model and the sensory consequences of their speech movements. Here, in two within-subject design experiments, we used a real-time formant manipulation system to explore how reliant speech articulation is on the accuracy or predictability of auditory feedback information. This involved introducing random formant perturbations during vowel production that varied systematically in their location in formant space (Experiment 1) and in their temporal consistency (Experiment 2). Our results indicate that, on average, speakers' responses to auditory feedback manipulations varied based on the relevance and degree of the error that was introduced in the various feedback conditions. In Experiment 1, speakers' average production was not reliably influenced by random perturbations to the first (F1) and second (F2) formants, introduced on every utterance at various locations of formant space, that averaged 0 Hz overall. However, when perturbations were applied that had a mean of +100 Hz in F1 and -125 Hz in F2, speakers demonstrated reliable compensatory responses that reflected the average magnitude of the applied perturbations. In Experiment 2, speakers did not significantly compensate for perturbations of varying magnitudes that were held constant for one and three trials at a time. Speakers' average productions did, however, significantly deviate from a control condition when perturbations were held constant for six trials. Within the context of these conditions, our findings provide evidence that the control of speech movements is, at least in part, dependent upon the reliability and stability of the sensory information that it receives over time.
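A minimal sketch of how perturbation schedules like those described above might be generated is given below. The uniform distribution, the spread of the random draws, and the trial counts are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_schedule(mean_f1, mean_f2, spread, n_trials):
    """Per-utterance (F1, F2) perturbations in Hz drawn around a mean
    shift; the uniform spread is an assumed distribution."""
    f1 = rng.uniform(mean_f1 - spread, mean_f1 + spread, n_trials)
    f2 = rng.uniform(mean_f2 - spread, mean_f2 + spread, n_trials)
    return np.column_stack([f1, f2])

def held_schedule(block_len, spread, n_trials):
    """Experiment 2 style: each random perturbation is held constant
    for a block of consecutive trials (e.g., 1, 3, or 6)."""
    n_blocks = -(-n_trials // block_len)       # ceiling division
    blocks = rng.uniform(-spread, spread, n_blocks)
    return np.repeat(blocks, block_len)[:n_trials]

zero_mean = random_schedule(0, 0, 200, 100)    # averages to ~0 Hz
biased = random_schedule(100, -125, 200, 100)  # +100 Hz F1, -125 Hz F2
held6 = held_schedule(6, 200, 100)             # constant for 6 trials

print(zero_mean.mean(axis=0), biased.mean(axis=0), held6[:12])
```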
Affiliation(s)
- Daniel R. Nault
- Department of Psychology, Queen’s University, Kingston, ON, Canada
- Takashi Mitsuya
- School of Communication Sciences and Disorders, Western University, London, ON, Canada
- National Centre for Audiology, Western University, London, ON, Canada
- David W. Purcell
- School of Communication Sciences and Disorders, Western University, London, ON, Canada
- National Centre for Audiology, Western University, London, ON, Canada
- Kevin G. Munhall
- Department of Psychology, Queen’s University, Kingston, ON, Canada
4. Modern Responses to Traditional Pitfalls in Gender Affirming Behavioral Voice Modification. Otolaryngol Clin North Am 2022; 55:727-738. [PMID: 35752493] [DOI: 10.1016/j.otc.2022.05.001]
Abstract
Gender-affirming behavioral voice modification has primarily been directed by cisgender clinicians who do not actively live or master the process of voice modification themselves but instead observe it from the outside looking in. The lack of a "lived experience" among cisgender instructors naturally leaves gaps and oversights that may reduce the effective potential of voice training. Input from transgender people who have learned voice modification techniques is key to providing the best possible care. Ear training, direct vocal modeling, and mastery of gender-modification techniques are crucial elements that are less emphasized in the current system.
5. Roessig S, Winter B, Mücke D. Tracing the Phonetic Space of Prosodic Focus Marking. Front Artif Intell 2022; 5:842546. [PMID: 35664509] [PMCID: PMC9160369] [DOI: 10.3389/frai.2022.842546]
Abstract
Focus is known to be expressed by a wide range of phonetic cues, but only a few studies have explicitly compared different phonetic variables within the same experiment. We therefore present results from an analysis of 19 phonetic variables conducted on a German data set that comprises the opposition of unaccented (background) vs. accented (in focus), as well as different focus types with the nuclear accent on the same syllable (broad, narrow, and contrastive focus). The phonetic variables are measures of the acoustic and articulographic signals of a target syllable. Overall, our results provide the highest number of reliable effects and the largest effect sizes for accentuation (unaccented vs. accented), while the differentiation of focus types with accented target syllables (broad, narrow, and contrastive focus) is more subtle. The most important phonetic variables across all conditions are measures of the fundamental frequency. The articulatory variables and their corresponding acoustic formants reveal lower tongue positions for both vowels /o, a/ and larger lip openings for the vowel /a/ under increased prosodic prominence, with the strongest effects for accentuation. While duration exhibits consistently mid-ranked results for both accentuation and the differentiation of focus types, measures related to intensity are particularly important for accentuation. Furthermore, voice quality and spectral tilt are affected by accentuation but also play a role in the differentiation of focus types. Our results confirm that focus is realized via multiple phonetic cues. Additionally, the present analysis allows a comparison of the relative importance of different measures, contributing to a better understanding of the phonetic space of focus marking.
Affiliation(s)
- Simon Roessig
- IfL-Phonetik, University of Cologne, Cologne, Germany
- Bodo Winter
- Department of English Language and Linguistics, University of Birmingham, Birmingham, United Kingdom
- Doris Mücke
- IfL-Phonetik, University of Cologne, Cologne, Germany
6. Tilsen S, Kim SE, Wang C. Localizing category-related information in speech with multi-scale analyses. PLoS One 2021; 16:e0258178. [PMID: 34597350] [PMCID: PMC8486085] [DOI: 10.1371/journal.pone.0258178]
Abstract
Measurements of the physical outputs of speech (vocal tract geometry and acoustic energy) are high-dimensional, but linguistic theories posit a low-dimensional set of categories such as phonemes and phrase types. How can it be determined when and where in high-dimensional articulatory and acoustic signals there is information related to theoretical categories? For a variety of reasons, it is problematic to directly quantify mutual information between hypothesized categories and signals. To address this issue, a multi-scale analysis method is proposed for localizing category-related information in an ensemble of speech signals using machine learning algorithms. By analyzing how classification accuracy on unseen data varies as the temporal extent of the training input is systematically restricted, inferences can be drawn regarding the temporal distribution of category-related information. The method can also be used to investigate redundancy between subsets of signal dimensions. Two types of theoretical categories are examined in this paper: phonemic/gestural categories and syntactic relative-clause categories. Two different machine learning algorithms were also examined: linear discriminant analysis and neural networks with long short-term memory units. Both algorithms detected category-related information earlier and later in the signals than would be expected given standard theoretical assumptions about when linguistic categories should influence speech. The neural network algorithm was able to identify category-related information to a greater extent than the discriminant analyses.
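The core of the multi-scale method, restricting the temporal extent of the classifier's input and tracking held-out accuracy, can be illustrated with a short sketch. The data here are synthetic stand-ins with category information injected into a known window, and scikit-learn's linear discriminant analysis stands in for the paper's classifiers; the window size and step are arbitrary choices.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic ensemble: 200 tokens x 50 time samples x 4 channels, with
# category-related information injected only in samples 20-30.
n_tokens, n_time, n_chan = 200, 50, 4
labels = rng.integers(0, 2, n_tokens)
signals = rng.normal(size=(n_tokens, n_time, n_chan))
signals[:, 20:30, 0] += 0.8 * labels[:, None]

# Slide a window over time, train on only that temporal extent, and
# track cross-validated accuracy; above-chance windows are inferred
# to carry category-related information.
win = 10
for start in range(0, n_time - win + 1, 5):
    X = signals[:, start:start + win, :].reshape(n_tokens, -1)
    acc = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=5).mean()
    print(f"samples {start:2d}-{start + win:2d}: accuracy = {acc:.2f}")
```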
Affiliation(s)
- Sam Tilsen
- Department of Linguistics, Cornell University, Ithaca, New York, United States of America
- Seung-Eun Kim
- Department of Linguistics, Cornell University, Ithaca, New York, United States of America
- Claire Wang
- Department of Linguistics, Cornell University, Ithaca, New York, United States of America
7. The Role of Acoustic Similarity and Non-Native Categorisation in Predicting Non-Native Discrimination: Brazilian Portuguese Vowels by English vs. Spanish Listeners. Languages 2021. [DOI: 10.3390/languages6010044]
Abstract
This study tests whether Australian English (AusE) and European Spanish (ES) listeners differ in their categorisation and discrimination of Brazilian Portuguese (BP) vowels. In particular, we investigate two theoretically relevant measures of vowel category overlap (acoustic vs. perceptual categorisation) as predictors of non-native discrimination difficulty. We also investigate whether the individual listener's own native vowel productions predict non-native vowel perception better than group averages. The results showed comparable performance for AusE and ES participants in their perception of the BP vowels. In particular, discrimination patterns were largely dependent on contrast-specific learning scenarios, which were similar across AusE and ES. We also found that acoustic similarity between individuals' own native productions and the BP stimuli was largely consistent with the participants' patterns of non-native categorisation. Furthermore, the results indicated that both acoustic and perceptual overlap successfully predict discrimination performance. However, accuracy in discrimination was better explained by perceptual similarity for ES listeners and by acoustic similarity for AusE listeners. Interestingly, we also found that for ES listeners, group averages explained discrimination accuracy better than predictions based on individual production data, whereas the AusE group showed no such difference.
8. Magnotti JF, Dzeda KB, Wegner-Clemens K, Rennig J, Beauchamp MS. Weak observer-level correlation and strong stimulus-level correlation between the McGurk effect and audiovisual speech-in-noise: A causal inference explanation. Cortex 2020; 133:371-383. [PMID: 33221701] [DOI: 10.1016/j.cortex.2020.10.002]
Abstract
The McGurk effect is a widely used measure of multisensory integration during speech perception. Two observations have raised questions about the validity of the effect as a tool for understanding speech perception. First, there is high variability in perception of the McGurk effect across different stimuli and observers. Second, across observers there is low correlation between McGurk susceptibility and recognition of visual speech paired with auditory speech-in-noise, another common measure of multisensory integration. Using the framework of the causal inference of multisensory speech (CIMS) model, we explored the relationship between the McGurk effect, syllable perception, and sentence perception in seven experiments with a total of 296 different participants. Perceptual reports revealed a relationship between the efficacy of different McGurk stimuli created from the same talker and perception of the auditory component of the McGurk stimuli presented in isolation, both with and without added noise. The CIMS model explained this strong stimulus-level correlation using the principles of noisy sensory encoding followed by optimal cue combination within a common representational space across speech types. Because the McGurk effect (but not speech-in-noise) requires the resolution of conflicting cues between modalities, there is an additional source of individual variability that can explain the weak observer-level correlation between McGurk and noisy speech. Power calculations show that detecting this weak correlation requires studies with many more participants than those conducted to date. Perception of the McGurk effect and other types of speech can be explained by a common theoretical framework that includes causal inference, suggesting that the McGurk effect is a valid and useful experimental tool.
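The "optimal cue combination" step that the abstract invokes is, in its simplest textbook form, inverse-variance-weighted averaging of the unisensory estimates. The sketch below illustrates that principle under independent Gaussian noise assumptions; it is not the full CIMS model, which additionally infers whether the auditory and visual cues share a common cause, and the numbers are hypothetical.

```python
def fuse(mu_a, var_a, mu_v, var_v):
    """Inverse-variance-weighted fusion of auditory and visual estimates
    under independent Gaussian noise (textbook optimal combination)."""
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    mu = w_a * mu_a + (1 - w_a) * mu_v
    var = 1 / (1 / var_a + 1 / var_v)  # fused estimate is more reliable
    return mu, var

# Hypothetical positions on a 1-D representational axis ("ba" = 0,
# "da" = 1): a "ba"-like auditory cue combined with conflicting,
# noisier visual evidence yields an intermediate percept.
mu, var = fuse(mu_a=0.1, var_a=0.04, mu_v=0.9, var_v=0.09)
print(f"fused percept: {mu:.2f} (variance {var:.3f})")
```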
9. Krivokapić J, Styler W, Parrell B. Pause Postures: The relationship between articulation and cognitive processes during pauses. J Phon 2020; 79:100953. [PMID: 32218635] [PMCID: PMC7098615] [DOI: 10.1016/j.wocn.2019.100953]
Abstract
Studies examining the articulatory characteristics of pauses have identified language-specific postures of the vocal tract in inter-utterance pauses and different articulatory patterns in grammatical and non-grammatical pauses. Pause postures, specific articulatory movements that occur during pauses at strong prosodic boundaries, have been identified for Greek and German. However, the cognitive function of these articulations has not yet been examined. We begin to address this question by investigating the effects of (1) utterance type and (2) planning on pause posture occurrence and properties in American English. We first examine whether pause postures exist in American English. In an electromagnetic articulometry study, seven participants produced sentences varying in linguistic structure (stress, boundary, sentence type). To determine the presence of pause postures, as well as to lay the groundwork for their future automatic annotation and detection, a support vector machine classifier was built to identify them. Results show that pause postures exist for all speakers in this study but that their frequency of occurrence is speaker-dependent. Across participants, we find a stable relationship between the pause posture and other events (boundary tones and vowels) at prosodic boundaries, parallel to previous work on Greek. We find that the occurrence of pause postures is not systematically related to utterance type. Lastly, pause postures increase in frequency and duration as utterance length increases, suggesting that they are at least partially related to speech-planning processes.
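A minimal sketch of a pause-posture detector in the spirit of the classifier described above follows. The kinematic features, labels, and data are synthetic placeholders, not the study's annotations; it shows only the general shape of an SVM-based detection pipeline.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-ins for per-pause kinematic features (hypothetical:
# lip aperture, jaw height, tongue-tip height, pause duration) and
# binary labels (1 = pause posture present).
n_pauses = 300
X = rng.normal(size=(n_pauses, 4))
y = rng.integers(0, 2, n_pauses)
X[y == 1, 0] += 1.2  # inject separability so the demo is non-trivial

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated detection accuracy: {acc:.2f}")
```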
Affiliation(s)
- Jelena Krivokapić
- University of Michigan Department of Linguistics, 421 Lorch Hall, 611 Tappan Street, Ann Arbor, MI 48109-1220
- Haskins Laboratories, 300 George St 9th Fl, New Haven, CT 06511-6624
- Will Styler
- University of California, San Diego Department of Linguistics, 9500 Gilman Drive #0108, La Jolla, CA 92093-0108
- Benjamin Parrell
- University of Wisconsin-Madison Department of Communication Sciences and Disorders, Goodnight Hall, 1975 Willow Drive, Madison, WI 53706
10. Xu Y, Prom-on S. Economy of Effort or Maximum Rate of Information? Exploring Basic Principles of Articulatory Dynamics. Front Psychol 2019; 10:2469. [PMID: 31824364] [PMCID: PMC6886388] [DOI: 10.3389/fpsyg.2019.02469]
Abstract
Economy of effort, a popular notion in contemporary speech research, predicts that dynamic extremes such as the maximum speed of articulatory movement are avoided as much as possible and that approaching the dynamic extremes is necessary only when there is a need to enhance linguistic contrast, as in the case of stress or clear speech. Empirical data, however, do not always support these predictions. In the present study, we considered an alternative principle: maximum rate of information, which assumes that speech dynamics are ultimately driven by the pressure to transmit information as quickly and accurately as possible. For empirical data, we asked speakers of American English to produce repetitive syllable sequences such as wawawawawa as fast as possible by imitating recordings of the same sequences that had been artificially accelerated and to produce meaningful sentences containing the same syllables at normal and fast speaking rates. Analysis of formant trajectories shows that dynamic extremes in meaningful speech sometimes even exceeded those in the nonsense syllable sequences but that this happened more often in unstressed syllables than in stressed syllables. We then used a target approximation model based on a mass-spring system of varying orders to simulate the formant kinematics. The results show that the kind of formant kinematics found in the present study and in previous studies can only be generated by a dynamical system operating with maximal muscular force under strong time pressure and that the dynamics of this operation may hold the solution to the long-standing enigma of greater stiffness in unstressed than in stressed syllables. We conclude, therefore, that maximum rate of information can coherently explain both current and previous empirical data and could therefore be a fundamental principle of motor control in speech production.
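The target approximation idea can be sketched with the lowest-order case the abstract mentions: a critically damped mass-spring system driven toward a formant target, where a larger rate constant (stiffness) corresponds to faster target attainment under time pressure. The parameter values below are illustrative, not fitted values from the study.

```python
import numpy as np

def target_approximation(x0, target, lam, duration, dt=0.001):
    """Critically damped second-order mass-spring system driven toward a
    constant target: x'' = -2*lam*x' - lam**2 * (x - target). This is the
    lowest-order case; the study also explores higher-order systems."""
    n = int(duration / dt)
    x, v = float(x0), 0.0
    traj = np.empty(n)
    for i in range(n):
        a = -2 * lam * v - lam ** 2 * (x - target)
        v += a * dt
        x += v * dt
        traj[i] = x
    return traj

# F2 moving from a /w/-like value toward an /a/-like target: a larger
# rate constant lam (stiffness) reaches the target sooner, which is how
# strong time pressure shows up in the simulated kinematics.
slow = target_approximation(800.0, 1300.0, lam=30.0, duration=0.15)
fast = target_approximation(800.0, 1300.0, lam=60.0, duration=0.15)
print(f"F2 after 150 ms: slow = {slow[-1]:.0f} Hz, fast = {fast[-1]:.0f} Hz")
```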
Affiliation(s)
- Yi Xu
- Department of Speech, Hearing and Phonetic Sciences, University College London, London, United Kingdom
- Santitham Prom-on
- Department of Computer Engineering, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand