1
Badin P, Sawallis TR, Tabain M, Lamalle L. Bilinguals from Larynx to Lips: Exploring Bilingual Articulatory Strategies with Anatomic MRI Data. Language and Speech 2024:238309231224790. PMID: 38680040. DOI: 10.1177/00238309231224790.
Abstract
The goal of this article is to illustrate the use of MRI for exploring bi- and multi-lingual articulatory strategies. One male and one female speaker recorded sets of static midsagittal MRIs of the whole vocal tract, producing vowels as well as consonants in various vowel contexts in either the male's two or the female's three languages. Both speakers were native speakers of English (American and Australian English, respectively), and both were fluent L2 speakers of French. In addition, the female speaker was a heritage speaker of Croatian. Articulatory contours extracted from the MRIs were subsequently used at three progressively more compact and abstract levels of analysis. (1) Direct comparison of overlaid contours was used to assess whether phones analogous across L1 and L2 are similar or dissimilar, both overall and in specific vocal tract regions. (2) Consonant contour variability along the vocal tract due to vowel context was determined using dispersion ellipses and used to explore the variable resistance to coarticulation for non-analogous rhotics and analogous laterals in Australian, French, and Croatian. (3) Articulatory modeling was used to focus on specific articulatory gestures (tongue position and shape, lip protrusion, laryngeal height, etc.) and then to explore the articulatory strategies in the speakers' interlanguages for production of the French front rounded vowel series. This revealed that the Australian and American speakers used different strategies to produce the non-analogous French vowel series. We conclude that MRI-based articulatory data constitute a very rich and underused source of information that amply deserves applications to the study of L2 articulation and bilingual and multi-lingual speech.
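Analysis level (2) rests on dispersion ellipses fitted to contour points across vowel contexts. As a rough sketch of how such an ellipse can be derived from a 2D point cloud (an illustration only, not the authors' code; the function and parameter names are assumptions):

```python
import numpy as np

def dispersion_ellipse(points, n_std=2.0):
    """Fit a dispersion ellipse to 2D points (e.g. one tongue-contour
    landmark tracked across vowel contexts).

    Returns (centre, axis_lengths, angle_deg): the mean position, the
    semi-axis lengths at n_std standard deviations, and the orientation
    of the major axis in degrees.
    """
    pts = np.asarray(points, dtype=float)
    centre = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)          # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = eigvals.argsort()[::-1]          # put the major axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    axis_lengths = n_std * np.sqrt(eigvals)  # semi-axes of the ellipse
    angle_deg = np.degrees(np.arctan2(eigvecs[1, 0], eigvecs[0, 0]))
    return centre, axis_lengths, angle_deg
```

The spread of such ellipses along the vocal tract is one way to quantify a consonant's resistance to coarticulation.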
Affiliation(s)
- Pierre Badin
- Institute of Engineering, Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France
- Marija Tabain
- Department of Languages and Linguistics, La Trobe University, Australia
- Laurent Lamalle
- Université Grenoble Alpes and CHU de Grenoble, Inserm US 17, CNRS UMS 3552, UMS IRMaGe, France
2
Shahid MS, French AP, Valstar MF, Yakubov GE. Research in methodologies for modelling the oral cavity. Biomed Phys Eng Express 2024;10:032001. PMID: 38350128. DOI: 10.1088/2057-1976/ad28cc.
Abstract
The paper aims to explore the current state of understanding surrounding in silico oral modelling. This involves exploring methodologies, technologies and approaches pertaining to the modelling of the whole oral cavity: both internally and externally visible structures that may be relevant or appropriate to oral actions. Such a model could be referred to as a 'complete model', which includes consideration of a full set of facial features (i.e. not only the mouth) as well as synergistic stimuli such as audio and facial thermal data. 3D modelling technologies capable of accurately and efficiently capturing a complete representation of the mouth for an individual have broad applications in the study of oral actions, due to their cost-effectiveness and time efficiency. This review delves into the field of clinical phonetics to classify oral actions pertaining to both speech and non-speech movements, identifying how the various vocal organs play a role in the articulatory and masticatory process. Vitally, it provides a summary of 12 articulatory recording methods, forming a tool researchers can use to identify which recording method is appropriate for their work. After addressing the cost and resource-intensive limitations of existing methods, a new system of modelling is proposed that leverages external-to-internal correlation modelling techniques to create more efficient models of the oral cavity. The vision is that the outcomes will be applicable to a broad spectrum of oral functions related to physiology, health and wellbeing, including speech, oral processing of foods and dental health. The applications may span from speech correction to designing foods for the ageing population, whilst in the dental field information about a patient's oral actions could feed into a personalised dental treatment plan.
Affiliation(s)
- Andrew P French
- School of Computer Science, University of Nottingham, NG8 1BB, United Kingdom
- School of Biosciences, University of Nottingham, LE12 5RD, United Kingdom
- Michel F Valstar
- School of Computer Science, University of Nottingham, NG8 1BB, United Kingdom
- Gleb E Yakubov
- School of Biosciences, University of Nottingham, LE12 5RD, United Kingdom
3
Belyk M, Carignan C, McGettigan C. An open-source toolbox for measuring vocal tract shape from real-time magnetic resonance images. Behav Res Methods 2024;56:2623-2635. PMID: 37507650. PMCID: PMC10990993. DOI: 10.3758/s13428-023-02171-9.
Abstract
Real-time magnetic resonance imaging (rtMRI) is a technique that provides high-contrast videographic data of human anatomy in motion. Applied to the vocal tract, it is a powerful method for capturing the dynamics of speech and other vocal behaviours by imaging structures internal to the mouth and throat. These images provide a means of studying the physiological basis for speech, singing, expressions of emotion, and swallowing that are otherwise not accessible for external observation. However, taking quantitative measurements from these images is notoriously difficult. We introduce a signal processing pipeline that produces outlines of the vocal tract from the lips to the larynx as a quantification of the dynamic morphology of the vocal tract. Our approach performs simple tissue classification, but constrained to a researcher-specified region of interest. This combination facilitates feature extraction while retaining the domain-specific expertise of a human analyst. We demonstrate that this pipeline generalises well across datasets covering behaviours such as speech, vocal size exaggeration, laughter, and whistling, as well as producing reliable outcomes across analysts, particularly among users with domain-specific expertise. With this article, we make this pipeline available for immediate use by the research community, and further suggest that it may contribute to the continued development of fully automated methods based on deep learning algorithms.
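The pipeline's key combination is simple tissue classification constrained to a researcher-specified region of interest. A minimal, hypothetical illustration of that idea (not the toolbox's actual implementation; the threshold heuristic and names are assumptions) might look like:

```python
import numpy as np

def classify_tissue(frame, roi_mask, threshold=None):
    """Binary air/tissue classification of one rtMRI frame, restricted
    to a researcher-drawn region of interest (ROI).

    frame     : 2D array of MRI pixel intensities
    roi_mask  : boolean array, True inside the ROI
    threshold : intensity cut-off; if None, a crude two-class split is
                used (midpoint between the darkest-quartile and
                brightest-quartile means inside the ROI)
    """
    vals = frame[roi_mask]
    if threshold is None:
        lo = vals[vals <= np.quantile(vals, 0.25)].mean()  # air-like pixels
        hi = vals[vals >= np.quantile(vals, 0.75)].mean()  # tissue-like pixels
        threshold = 0.5 * (lo + hi)
    tissue = np.zeros_like(frame, dtype=bool)
    tissue[roi_mask] = frame[roi_mask] > threshold  # classify only inside ROI
    return tissue
```

Restricting the classification to the ROI is what lets the analyst's domain expertise (where the vocal tract is) steer an otherwise fully automatic step.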
Affiliation(s)
- Michel Belyk
- Department of Psychology, Edge Hill University, Ormskirk, UK.
- Christopher Carignan
- Department of Speech Hearing and Phonetic Sciences, University College London, London, UK
- Carolyn McGettigan
- Department of Speech Hearing and Phonetic Sciences, University College London, London, UK
4
Ruthven M, Peplinski AM, Adams DM, King AP, Miquel ME. Real-time speech MRI datasets with corresponding articulator ground-truth segmentations. Sci Data 2023;10:860. PMID: 38042857. PMCID: PMC10693552. DOI: 10.1038/s41597-023-02766-z.
Abstract
The use of real-time magnetic resonance imaging (rt-MRI) of speech is increasing in clinical practice and speech science research. Analysis of such images often requires segmentation of articulators and the vocal tract, and the community is turning to deep-learning-based methods to perform this segmentation. While there are publicly available rt-MRI datasets of speech, these do not include ground-truth (GT) segmentations, a key requirement for the development of deep-learning-based segmentation methods. To begin to address this barrier, this work presents rt-MRI speech datasets of five healthy adult volunteers with corresponding GT segmentations and velopharyngeal closure patterns. The images were acquired using standard clinical MRI scanners, coils and sequences to facilitate acquisition of similar images in other centres. The datasets include manually created GT segmentations of six anatomical features including the tongue, soft palate and vocal tract. In addition, this work makes code and instructions to implement a current state-of-the-art deep-learning-based method to segment rt-MRI speech datasets publicly available, thus providing the community and others with a starting point for developing such methods.
Affiliation(s)
- Matthieu Ruthven
- Clinical Physics, Barts Health NHS Trust, West Smithfield, London, EC1A 7BE, UK
- School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London, SE1 7EH, UK
- David M Adams
- Clinical Physics, Barts Health NHS Trust, West Smithfield, London, EC1A 7BE, UK
- Andrew P King
- School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London, SE1 7EH, UK
- Marc Eric Miquel
- Clinical Physics, Barts Health NHS Trust, West Smithfield, London, EC1A 7BE, UK.
- Digital Environment Research Institute (DERI), Empire House, 67-75 New Road, Queen Mary University of London, London, E1 1HH, UK.
- Advanced Cardiovascular Imaging, Barts NIHR BRC, Queen Mary University of London, London, EC1M 6BQ, UK.
5
Mofakham AA, Helenbrook BT, Erath BD, Ferro AR, Ahmed T, Brown DM, Ahmadi G. Influence of two-dimensional expiratory airflow variations on respiratory particle propagation during pronunciation of the fricative [f]. J Aerosol Sci 2023;173:106179. PMID: 37069899. PMCID: PMC10088289. DOI: 10.1016/j.jaerosci.2023.106179.
Abstract
Propagation of respiratory particles, potentially containing viable viruses, plays a significant role in the transmission of respiratory diseases (e.g., COVID-19) from infected people. Particles are produced in the upper respiratory system and exit the mouth during expiratory events such as sneezing, coughing, talking, and singing. The importance of considering speaking and singing as vectors of particle transmission has been recognized by researchers. Recently, in a companion paper, dynamics of expiratory flow during fricative utterances were explored, and significant variations of airflow jet trajectories were reported. This study focuses on respiratory particle propagation during fricative productions and the effect of airflow variations on particle transport and dispersion as a function of particle size. The commercial ANSYS-Fluent computational fluid dynamics (CFD) software was employed to quantify the fluid flow and particle dispersion from a two-dimensional mouth model of sustained fricative [f] utterance as well as a horizontal jet flow model. The fluid velocity field and particle distributions estimated from the mouth model were compared with those of the horizontal jet flow model. The significant effects of the airflow jet trajectory variations on the pattern of particle transport and dispersion during fricative utterances were studied. Distinct differences between the estimations of the horizontal jet model for particle propagation and those of the mouth model were observed. The importance of considering the vocal tract geometry, and the failure of a horizontal jet model to properly estimate the expiratory airflow and respiratory particle propagation during the production of fricative utterances, were emphasized.
Affiliation(s)
- Amir A Mofakham
- Department of Mechanical and Aerospace Engineering, Clarkson University, Potsdam, NY 13699, United States of America
- Brian T Helenbrook
- Department of Mechanical and Aerospace Engineering, Clarkson University, Potsdam, NY 13699, United States of America
- Byron D Erath
- Department of Mechanical and Aerospace Engineering, Clarkson University, Potsdam, NY 13699, United States of America
- Andrea R Ferro
- Department of Civil and Environmental Engineering, Clarkson University, Potsdam, NY 13699, United States of America
- Tanvir Ahmed
- Department of Mechanical and Aerospace Engineering, Clarkson University, Potsdam, NY 13699, United States of America
- Deborah M Brown
- Joint Educational Programs, Trudeau Institute, Saranac Lake, NY 12983, United States of America
- Goodarz Ahmadi
- Department of Mechanical and Aerospace Engineering, Clarkson University, Potsdam, NY 13699, United States of America
6
Willett FR, Kunz E, Fan C, Avansino D, Wilson G, Choi EY, Kamdar F, Hochberg LRH, Druckmann S, Shenoy K, Henderson J. A high-performance speech neuroprosthesis. bioRxiv 2023:2023.01.21.524489. PMID: 36711591. PMCID: PMC9882398. DOI: 10.1101/2023.01.21.524489.
Abstract
Speech brain-computer interfaces (BCIs) have the potential to restore rapid communication to people with paralysis by decoding neural activity evoked by attempted speaking movements into text or sound. Early demonstrations, while promising, have not yet achieved accuracies high enough for communication of unconstrained sentences from a large vocabulary. Here, we demonstrate the first speech-to-text BCI that records spiking activity from intracortical microelectrode arrays. Enabled by these high-resolution recordings, our study participant, who can no longer speak intelligibly due to amyotrophic lateral sclerosis (ALS), achieved a 9.1% word error rate on a 50-word vocabulary (2.7 times fewer errors than the prior state-of-the-art speech BCI) and a 23.8% word error rate on a 125,000-word vocabulary (the first successful demonstration of large-vocabulary decoding). Our BCI decoded speech at 62 words per minute, which is 3.4 times faster than the prior record for any kind of BCI and begins to approach the speed of natural conversation (160 words per minute). Finally, we highlight two aspects of the neural code for speech that are encouraging for speech BCIs: spatially intermixed tuning to speech articulators that makes accurate decoding possible from only a small region of cortex, and a detailed articulatory representation of phonemes that persists years after paralysis. These results show a feasible path forward for using intracortical speech BCIs to restore rapid communication to people with paralysis who can no longer speak.
7
Masapollo M, Nittrouer S. Interarticulator Speech Coordination: Timing Is of the Essence. J Speech Lang Hear Res 2023;66:901-915. PMID: 36827516. DOI: 10.1044/2022_jslhr-22-00594.
Abstract
PURPOSE: In skilled speech production, sets of articulators, such as the jaw, tongue, and lips, work cooperatively to achieve task-specific movement goals, despite rampant contextual variation. Efforts to understand these functional units, termed coordinative structures, have focused on identifying the essential control parameters responsible for allowing articulators to achieve these goals, with some research focusing on temporal parameters (relative timing of movements) and other research focusing on spatiotemporal parameters (phase angle of movement onset for one articulator, relative to another). Here, both types of parameters were investigated and compared in detail.
METHOD: Ten talkers recorded nonsense, disyllabic /tV#Cat/ utterances using electromagnetic articulography, with alternating V (/ɑ/-/ɛ/) and C (/t/-/d/), across variation in rate (fast-slow) and stress (first syllable stressed-unstressed). Two measures were obtained: (a) the timing of tongue-tip raising onset for medial C, relative to jaw opening-closing cycles, and (b) the angle of tongue-tip raising onset, relative to the jaw phase plane.
RESULTS: Results showed that any manipulation that shortened the jaw opening-closing cycle reduced both the relative timing and phase angle of the tongue-tip movement onset, but relative timing of tongue-tip movement onset scaled more consistently with jaw opening-closing across rate and stress variation.
CONCLUSION: These findings suggest the existence of an intrinsic timing mechanism (or "central clock") that is the primary control parameter for coordinative structures, with online compensation then allowing these structures to achieve their goals spatially.
SUPPLEMENTAL MATERIAL: https://doi.org/10.23641/asha.22144259
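The two measures above (relative timing within the jaw cycle, and phase angle on the jaw's position-velocity phase plane) can be sketched numerically. This is a hypothetical illustration only; the signal names and the normalisation assumption are ours, not the study's analysis code:

```python
import numpy as np

def onset_phase_angle(jaw_pos, jaw_vel, onset_idx):
    """Phase angle (degrees) of tongue-tip raising onset, read off the
    jaw's position-velocity phase plane. Both signals are assumed to be
    normalised so the phase-plane orbit is roughly circular."""
    return float(np.degrees(np.arctan2(jaw_vel[onset_idx], jaw_pos[onset_idx])))

def relative_timing(onset_t, cycle_start_t, cycle_dur):
    """Tongue-tip onset time as a fraction of the jaw opening-closing cycle."""
    return (onset_t - cycle_start_t) / cycle_dur
```

On a sinusoidal jaw cycle, an onset a quarter-cycle in sits at a phase angle of -90 degrees and a relative timing of 0.25, which is the sense in which the two measures describe the same event in different coordinates.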
Affiliation(s)
- Matthew Masapollo
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville
- Susan Nittrouer
- Department of Speech, Language, and Hearing Sciences, University of Florida, Gainesville
8
Nair NP, Sharma V, Dixit A, Kaushal D, Soni K, Choudhury B, Goyal A. Future Solutions for Voice Rehabilitation in Laryngectomees: A Review of Technologies Based on Electrophysiological Signals. Indian J Otolaryngol Head Neck Surg 2022;74:5082-5090. PMID: 36742837. PMCID: PMC9895460. DOI: 10.1007/s12070-021-02765-9.
Abstract
Loss of voice is a serious concern for a laryngectomee and should be addressed before planning the procedure; patients must be educated about voice rehabilitation options before surgery. Even though many devices are in use, each has its limitations. We searched for probable future technologies for voice rehabilitation in laryngectomees, in order to familiarise the ENT fraternity with them. We performed a bibliographic search using title/abstract searches and Medical Subject Headings (MeSH) where appropriate, of Medline, CINAHL, EMBASE, Web of Science and Google Scholar, for publications from January 1985 to January 2020. Results with scope for the development of a device for speech rehabilitation were included in the review. A total of 1036 articles were identified and screened; after careful scrutiny, 40 articles were included in this study. The silent speech interface is one topic that is being studied extensively. It is based on various electrophysiological biosignals such as non-audible murmur, electromyography, ultrasound characteristics of the vocal folds, optical imaging of the lips and tongue, electro-articulography and electroencephalography. Electromyographic signals have been studied in laryngectomised patients. The silent speech interface may be the answer for the future of voice rehabilitation in laryngectomees. However, all these technologies are at a primitive stage, with the potential to mature into a speech device.
Affiliation(s)
- Vidhu Sharma
- Department of Otorhinolaryngology, AIIMS, Jodhpur, Rajasthan 342005, India
- Abhinav Dixit
- Department of Physiology, AIIMS, Jodhpur, Rajasthan 342005, India
- Darwin Kaushal
- Department of Otorhinolaryngology, AIIMS, Bilaspur, Himachal Pradesh, India
- Kapil Soni
- Department of Otorhinolaryngology, AIIMS, Jodhpur, Rajasthan 342005, India
- Bikram Choudhury
- Department of Otorhinolaryngology, AIIMS, Jodhpur, Rajasthan 342005, India
- Amit Goyal
- Department of Otorhinolaryngology, AIIMS, Jodhpur, Rajasthan 342005, India
9
Belyk M, McGettigan C. Real-time magnetic resonance imaging reveals distinct vocal tract configurations during spontaneous and volitional laughter. Philos Trans R Soc Lond B Biol Sci 2022;377:20210511. PMID: 36126659. PMCID: PMC9489295. DOI: 10.1098/rstb.2021.0511.
Abstract
A substantial body of acoustic and behavioural evidence points to the existence of two broad categories of laughter in humans: spontaneous laughter that is emotionally genuine and somewhat involuntary, and volitional laughter that is produced on demand. In this study, we tested the hypothesis that these are also physiologically distinct vocalizations, by measuring and comparing them using real-time magnetic resonance imaging (rtMRI) of the vocal tract. Following Ruch and Ekman (Ruch and Ekman 2001 In Emotions, qualia, and consciousness (ed. A Kaszniak), pp. 426-443), we further predicted that spontaneous laughter should be relatively less speech-like (i.e. less articulate) than volitional laughter. We collected rtMRI data from five adult human participants during spontaneous laughter, volitional laughter and spoken vowels. We report distinguishable vocal tract shapes during the vocalic portions of these three vocalization types, where volitional laughs were intermediate between spontaneous laughs and vowels. Inspection of local features within the vocal tract across the different vocalization types offers some additional support for Ruch and Ekman's predictions. We discuss our findings in light of a dual pathway hypothesis for the neural control of human volitional and spontaneous vocal behaviours, identifying tongue shape and velum lowering as potential biomarkers of spontaneous laughter to be investigated in future research. This article is part of the theme issue 'Cracking the laugh code: laughter through the lens of biology, psychology and neuroscience'.
Affiliation(s)
- Michel Belyk
- Department of Psychology, Edge Hill University, Ormskirk L39 4QP, UK
- Department of Speech, Hearing and Phonetic Sciences, University College London, London WC1N 1PF, UK
- Carolyn McGettigan
- Department of Speech, Hearing and Phonetic Sciences, University College London, London WC1N 1PF, UK
10
Kröger BJ. Computer-Implemented Articulatory Models for Speech Production: A Review. Front Robot AI 2022;9:796739. PMID: 35494539. PMCID: PMC9040071. DOI: 10.3389/frobt.2022.796739.
Abstract
Modeling speech production and speech articulation is still an evolving research topic. Some current core questions are: What is the underlying (neural) organization for controlling speech articulation? How should speech articulators like the lips and tongue, and their movements, be modeled in an efficient but also biologically realistic way? How can high-quality articulatory-acoustic models be developed that lead to high-quality articulatory speech synthesis? On the one hand, computer modeling will help us unfold the underlying biological as well as acoustic-articulatory concepts of speech production; on the other hand, further modeling efforts will help us reach the goal of high-quality articulatory-acoustic speech synthesis based on more detailed knowledge of vocal tract acoustics and speech articulation. Currently, articulatory models are not able to reach the quality level of corpus-based speech synthesis. Moreover, biomechanical and neuromuscular approaches are complex and still not usable for sentence-level speech synthesis. This paper lists many computer-implemented articulatory models and provides criteria for dividing articulatory models into different categories. A recent major research question, namely how to control articulatory models in a neurobiologically adequate manner, is discussed in detail. It can be concluded that there is a strong need to further develop articulatory-acoustic models in order to test quantitative neurobiologically based control concepts for speech articulation and to uncover the remaining details in human articulatory and acoustic signal generation. Furthermore, these efforts may help us approach the goal of establishing high-quality articulatory-acoustic as well as neurobiologically grounded speech synthesis.
11
Ahmed T, Wendling HE, Mofakham AA, Ahmadi G, Helenbrook BT, Ferro AR, Brown DM, Erath BD. Variability in expiratory trajectory angles during consonant production by one human subject and from a physical mouth model: Application to respiratory droplet emission. Indoor Air 2021;31:1896-1912. PMID: 34297885. PMCID: PMC8447379. DOI: 10.1111/ina.12908.
Abstract
The COVID-19 pandemic has highlighted the need to improve understanding of droplet transport during expiratory emissions. While historical emphasis has been placed on violent events such as coughing and sneezing, the recognition of asymptomatic and presymptomatic spread has identified the need to consider other modalities, such as speaking. Accurate prediction of infection risk produced by speaking requires knowledge of both the droplet size distributions that are produced, as well as the expiratory flow fields that transport the droplets into the surroundings. This work demonstrates that the expiratory flow field produced by consonant productions is highly unsteady, exhibiting extremely broad inter- and intra-consonant variability, with mean ejection angles varying from ≈+30° to -30°. Furthermore, implementation of a physical mouth model to quantify the expiratory flow fields for fricative pronunciation of [f] and [θ] demonstrates that flow velocities at the lips are higher than previously predicted, reaching 20-30 m/s, and that the resultant trajectories are unstable. Because both large and small droplet transport are directly influenced by the magnitude and trajectory of the expirated air stream, these findings indicate that prior investigations of the flow dynamics during speech have largely underestimated the fluid penetration distances that can be achieved for particular consonant utterances.
Affiliation(s)
- Tanvir Ahmed
- Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York, USA
- Hannah E. Wendling
- Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York, USA
- Amir A. Mofakham
- Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York, USA
- Goodarz Ahmadi
- Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York, USA
- Brian T. Helenbrook
- Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York, USA
- Andrea R. Ferro
- Department of Civil and Environmental Engineering, Clarkson University, Potsdam, New York, USA
- Deborah M. Brown
- Joint Educational Programs, Trudeau Institute, Saranac Lake, New York, USA
- Byron D. Erath
- Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York, USA
12
Isaieva K, Laprie Y, Leclère J, Douros IK, Felblinger J, Vuissoz PA. Multimodal dataset of real-time 2D and static 3D MRI of healthy French speakers. Sci Data 2021;8:258. PMID: 34599194. PMCID: PMC8486854. DOI: 10.1038/s41597-021-01041-3.
Abstract
The study of articulatory gestures has a wide spectrum of applications, notably in speech production and recognition. Sets of phonemes, as well as their articulation, are language-specific; however, existing MRI databases mostly include English speakers. In our present work, we introduce a dataset acquired with MRI from 10 healthy native French speakers. A corpus consisting of synthetic sentences was used to ensure a good coverage of the French phonetic context. A real-time MRI technology with temporal resolution of 20 ms was used to acquire vocal tract images of the participants speaking. The sound was recorded simultaneously with MRI, denoised and temporally aligned with the images. The speech was transcribed to obtain phoneme-wise segmentation of sound. We also acquired static 3D MR images for a wide list of French phonemes. In addition, we include annotations of spontaneous swallowing.
Measurement(s): vocal tract images; speech
Technology type(s): magnetic resonance imaging; microphone device
Sample characteristic (organism): Homo sapiens
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.16404453
Affiliation(s)
- Karyna Isaieva
- Université de Lorraine, INSERM, IADI, Nancy, F-54000, France.
- Yves Laprie
- Université de Lorraine, CNRS, Inria, LORIA, Nancy, F-54000, France
- Justine Leclère
- Université de Lorraine, INSERM, IADI, Nancy, F-54000, France
- Oral Medicine Department, University Hospital of Reims, 45 rue Cognacq-Jay, 51092 Reims Cedex, France
- Ioannis K Douros
- Université de Lorraine, INSERM, IADI, Nancy, F-54000, France
- Université de Lorraine, CNRS, Inria, LORIA, Nancy, F-54000, France
- Jacques Felblinger
- Université de Lorraine, INSERM, IADI, Nancy, F-54000, France
- CIC-IT, INSERM, CHRU de Nancy, Nancy, F-54000, France
13
Temporal Convolution Network Based Joint Optimization of Acoustic-to-Articulatory Inversion. Applied Sciences (Basel) 2021. DOI: 10.3390/app11199056.
Abstract
Articulatory features have proved effective in speech recognition and speech synthesis. However, acquiring articulatory features has always been difficult, and a lightweight and accurate articulatory inversion model is therefore of significant value. In this study, we propose a novel temporal convolutional network-based acoustic-to-articulatory inversion system. The acoustic feature is converted into a high-dimensional hidden-space feature map through temporal convolution, with frame-level feature correlations taken into account. Meanwhile, we construct a two-part target function combining the prediction's root mean square error (RMSE) and the sequences' Pearson correlation coefficient (PCC) to jointly optimize the performance of the inversion model from both aspects. We also analyse the impact of the weight between the two parts on the final performance of the inversion model. Extensive experiments show that our temporal convolutional network (TCN) model outperformed a bidirectional long short-term memory model by 1.18 mm in RMSE and 0.845 in PCC with 14 model parameters when optimizing evenly with the RMSE and PCC aspects.
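The two-part target function described above can be sketched as follows. This is an illustrative reconstruction from the abstract, not the paper's code: the exact weighting scheme, sign conventions, and names are assumptions.

```python
import numpy as np

def joint_objective(pred, target, alpha=0.5):
    """Two-part target for acoustic-to-articulatory inversion: a weighted
    sum of RMSE (lower is better) and 1 - Pearson correlation (lower is
    better), so minimising the combined score optimises pointwise
    accuracy and trajectory shape jointly.

    alpha : weight on the RMSE term; (1 - alpha) weights the PCC term
    """
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    rmse = np.sqrt(np.mean((pred - target) ** 2))   # articulator-position error
    pcc = np.corrcoef(pred, target)[0, 1]           # trajectory-shape agreement
    return alpha * rmse + (1.0 - alpha) * (1.0 - pcc)
```

With alpha = 0.5 the two aspects are weighted evenly, matching the "optimizing evenly" condition the abstract reports; sweeping alpha is one way to study the weight's impact on the final model.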
14
Lim Y, Toutios A, Bliesener Y, Tian Y, Lingala SG, Vaz C, Sorensen T, Oh M, Harper S, Chen W, Lee Y, Töger J, Monteserin ML, Smith C, Godinez B, Goldstein L, Byrd D, Nayak KS, Narayanan SS. A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images. Sci Data 2021; 8:187. [PMID: 34285240 PMCID: PMC8292336 DOI: 10.1038/s41597-021-00976-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/15/2021] [Accepted: 06/22/2021] [Indexed: 12/11/2022]
Abstract
Real-time magnetic resonance imaging (RT-MRI) of human speech production is enabling significant advances in speech science, linguistics, bio-inspired speech technology development, and clinical applications. Easy access to RT-MRI is, however, limited, and comprehensive datasets with broad access are needed to catalyze research across numerous domains. The imaging of the rapidly moving articulators and dynamic airway shaping during speech demands high spatio-temporal resolution and robust reconstruction methods. Further, while reconstructed images have been published, to date there is no open dataset providing raw multi-coil RT-MRI data from an optimized speech production experimental setup. Such datasets could enable new and improved methods for dynamic image reconstruction, artifact correction, feature extraction, and direct extraction of linguistically relevant biomarkers. The present dataset offers a unique corpus of 2D sagittal-view RT-MRI videos with synchronized audio for 75 participants performing linguistically motivated speech tasks, alongside the corresponding public-domain raw RT-MRI data. The dataset also includes 3D volumetric vocal tract MRI during sustained speech sounds and high-resolution static anatomical T2-weighted upper airway MRI for each participant.
Affiliation(s)
- Yongwan Lim
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Asterios Toutios
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Yannick Bliesener
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Ye Tian
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Sajan Goud Lingala
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Colin Vaz
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Tanner Sorensen
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Miran Oh
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Sarah Harper
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Weiyi Chen
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Yoonjeong Lee
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Johannes Töger
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Mairym Lloréns Monteserin
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Caitlin Smith
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Bianca Godinez
- Department of Linguistics, California State University Long Beach, Long Beach, California, USA
- Louis Goldstein
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Dani Byrd
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Krishna S Nayak
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Shrikanth S Narayanan
- Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
15
Fu J, He F, Yin H, He L. Automatic detection of pharyngeal fricatives in cleft palate speech using acoustic features based on the vocal tract area spectrum. COMPUT SPEECH LANG 2021. [DOI: 10.1016/j.csl.2021.101203] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/22/2022]
16
Lynn E, Narayanan SS, Lammert AC. Dark tone quality and vocal tract shaping in soprano song production: Insights from real-time MRI. JASA EXPRESS LETTERS 2021; 1:075202. [PMID: 34291230 PMCID: PMC8273971 DOI: 10.1121/10.0005109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2021] [Accepted: 05/10/2021] [Indexed: 06/13/2023]
Abstract
Tone quality termed "dark" is an aesthetically important property of Western classical voice performance and has been associated with lowered formant frequencies, lowered larynx, and widened pharynx. The present study uses real-time magnetic resonance imaging with synchronous audio recordings to investigate dark tone quality in four professionally trained sopranos with enhanced ecological validity and a relatively complete view of the vocal tract. Findings differ from traditional accounts, indicating that labial narrowing may be the primary driver of dark tone quality across performers, while many other aspects of vocal tract shaping are shown to differ significantly in a performer-specific way.
Affiliation(s)
- Elisabeth Lynn
- Department of Biomedical Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01690, USA
- Shrikanth S Narayanan
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, California 90089, USA
- Adam C Lammert
- Department of Biomedical Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01690, USA
17
A deep neural network based correction scheme for improved air-tissue boundary prediction in real-time magnetic resonance imaging video. COMPUT SPEECH LANG 2021. [DOI: 10.1016/j.csl.2020.101160] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022]
18
Harper S, Goldstein L, Narayanan S. Variability in individual constriction contributions to third formant values in American English /ɹ/. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2020; 147:3905. [PMID: 32611162 PMCID: PMC7297543 DOI: 10.1121/10.0001413] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/31/2019] [Revised: 05/23/2020] [Accepted: 05/28/2020] [Indexed: 06/11/2023]
Abstract
Although substantial variability is observed in the articulatory implementation of the constriction gestures involved in /ɹ/ production, studies of articulatory-acoustic relations in /ɹ/ have largely ignored the potential for subtle variation in the implementation of these gestures to affect salient acoustic dimensions. This study examines how variation in the articulation of American English /ɹ/ influences the relative sensitivity of the third formant to variation in palatal, pharyngeal, and labial constriction degree. Simultaneously recorded articulatory and acoustic data from six speakers in the USC-TIMIT corpus were analyzed to determine how variation in the implementation of each constriction across tokens of /ɹ/ relates to variation in third formant values. Results show that third formant values are differentially affected by constriction degree for the different constrictions used to produce /ɹ/. Additionally, interspeaker variation is observed in the relative effect of different constriction gestures on third formant values, most notably in a division between speakers exhibiting relatively equal effects of palatal and pharyngeal constriction degree on F3 and speakers exhibiting a stronger palatal effect. This division among speakers mirrors interspeaker differences in mean constriction length and location, suggesting that individual differences in /ɹ/ production lead to variation in articulatory-acoustic relations.
Affiliation(s)
- Sarah Harper
- Department of Linguistics, University of Southern California, Los Angeles, California 90089, USA
- Louis Goldstein
- Department of Linguistics, University of Southern California, Los Angeles, California 90089, USA
- Shrikanth Narayanan
- Signal Analysis and Interpretation Laboratory, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, California 90089, USA
19
Abstract
Emotional speech production has previously been studied using fleshpoint tracking data in speaker-specific experimental setups. The present study introduces a real-time magnetic resonance imaging database of emotional speech production from 10 speakers and presents articulatory analyses of emotional expression in speech using the database. Midsagittal vocal tract parameters (midsagittal distances and vocal tract length) were parameterized based on a two-dimensional grid-line system, using image segmentation software. The principal feature analysis technique was applied to the grid-line system in order to find the major movement locations. Results reveal both speaker-dependent and speaker-independent variation patterns. For example, sad speech, a low-arousal emotion, tends to show a smaller opening for low vowels in the front cavity than the high-arousal emotions do, and does so more consistently than in other regions of the vocal tract. Happiness shows a significantly shorter vocal tract length than anger and sadness in most speakers. Further details of speaker-dependent and speaker-independent articulatory variation in emotional expression, and their implications, are described.
20
Serrurier A, Badin P, Lamalle L, Neuschaefer-Rube C. Characterization of inter-speaker articulatory variability: A two-level multi-speaker modelling approach based on MRI data. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:2149. [PMID: 31046321 DOI: 10.1121/1.5096631] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/04/2018] [Accepted: 03/14/2019] [Indexed: 06/09/2023]
Abstract
Speech communication relies on articulatory and acoustic codes shared between speakers and listeners despite inter-individual differences in morphology and idiosyncratic articulatory strategies. This study addresses the long-standing problem of characterizing and modelling speaker-independent articulatory strategies and inter-speaker articulatory variability. It explores a multi-speaker modelling approach based on two levels: statistically-based linear articulatory models, which capture the speaker-specific articulatory variability on the one hand, are in turn controlled by a speaker model, which captures the inter-speaker variability on the other hand. A low dimensionality speaker model is obtained by taking advantage of the inter-speaker correlations between morphology and strategy. To validate this approach, contours of the vocal tract articulators were manually segmented on midsagittal MRI data recorded from 11 French speakers uttering 62 vowels and consonants. Using these contours, multi-speaker models with 14 articulatory components and two morphology and strategy components led to overall variance explanations of 66%-69% and root-mean-square errors of 0.36-0.38 cm obtained in leave-one-out procedure over the speakers. Results suggest that inter-speaker variability is more related to the morphology than to the idiosyncratic strategies and illustrate the adaptation of the articulatory components to the morphology.
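The statistically-based linear articulatory models this abstract describes can be illustrated with a PCA-style sketch. The data layout (one row of flattened contour coordinates per item) and function names are assumptions for illustration, not the study's actual pipeline:

```python
import numpy as np

def fit_linear_articulatory_model(contours, n_components):
    """Fit a statistically-based linear model to flattened contour data.

    contours: (n_items, n_coords) array, one row per vowel/consonant item,
    each row the concatenated (x, y) coordinates of the segmented contours.
    Returns the mean shape, the principal articulatory components, the
    per-item control parameters, and the fraction of variance explained."""
    mean = contours.mean(axis=0)
    centered = contours - mean
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]      # basis deformation shapes
    scores = centered @ components.T    # control parameters per item
    var_explained = (s[:n_components] ** 2).sum() / (s ** 2).sum()
    return mean, components, scores, var_explained

def reconstruct(mean, components, scores):
    """Rebuild contours from the model: mean shape plus weighted components."""
    return mean + scores @ components
```

In the two-level scheme, a second model of the same form would then be fitted across speakers to the speaker-specific means and components, capturing the morphology and strategy dimensions.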
Affiliation(s)
- Antoine Serrurier
- Clinic for Phoniatrics, Pedaudiology & Communication Disorders, University Hospital and Medical Faculty of the RWTH Aachen University, Aachen, Germany
- Pierre Badin
- Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, Grenoble, France
- Laurent Lamalle
- Inserm US 17-CNRS UMS 3552- Université Grenoble Alpes & CHU Grenoble Alpes, UMS IRMaGe, Grenoble, France
- Christiane Neuschaefer-Rube
- Clinic for Phoniatrics, Pedaudiology & Communication Disorders, University Hospital and Medical Faculty of the RWTH Aachen University, Aachen, Germany
21
Sorensen T, Toutios A, Goldstein L, Narayanan S. Task-dependence of articulator synergies. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2019; 145:1504. [PMID: 31067947 PMCID: PMC6910022 DOI: 10.1121/1.5093538] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/18/2018] [Revised: 02/15/2019] [Accepted: 02/19/2019] [Indexed: 06/09/2023]
Abstract
In speech production, the motor system organizes articulators such as the jaw, tongue, and lips into synergies whose function is to produce speech sounds by forming constrictions at the phonetic places of articulation. The present study tests whether synergies for different constriction tasks differ in terms of inter-articulator coordination. The test is conducted on utterances [ɑpɑ], [ɑtɑ], [ɑiɑ], and [ɑkɑ] with a real-time magnetic resonance imaging biomarker that is computed using a statistical model of the forward kinematics of the vocal tract. The present study is the first to estimate the forward kinematics of the vocal tract from speech production data. Using the imaging biomarker, the study finds that the jaw contributes least to the velar stop for [k], more to pharyngeal approximation for [ɑ], still more to palatal approximation for [i], and most to the coronal stop for [t]. Additionally, the jaw contributes more to the coronal stop for [t] than to the bilabial stop for [p]. Finally, the study investigates how this pattern of results varies by participant. The study identifies differences in inter-articulator coordination by constriction task, which support the claim that inter-articulator coordination differs depending on the active articulator synergy.
Affiliation(s)
- Tanner Sorensen
- Signal Analysis and Interpretation Laboratory, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, California 90089, USA
- Asterios Toutios
- Signal Analysis and Interpretation Laboratory, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, California 90089, USA
- Louis Goldstein
- Department of Linguistics, University of Southern California, Los Angeles, California 90089, USA
- Shrikanth Narayanan
- Signal Analysis and Interpretation Laboratory, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, California 90089, USA
22
Kim YC. Fast upper airway magnetic resonance imaging for assessment of speech production and sleep apnea. PRECISION AND FUTURE MEDICINE 2018. [DOI: 10.23838/pfm.2018.00100] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 02/02/2023]
23
Lin J, Xia J, Zhao HS, Hou R, Talukder M, Yu L, Guo JY, Li JL. Lycopene Triggers Nrf2-AMPK Cross Talk to Alleviate Atrazine-Induced Nephrotoxicity in Mice. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY 2018; 66:12385-12394. [PMID: 30360616 DOI: 10.1021/acs.jafc.8b04341] [Citation(s) in RCA: 67] [Impact Index Per Article: 11.2] [Indexed: 06/08/2023]
Abstract
Atrazine (ATR), an environmentally persistent and bioaccumulative herbicide, has been associated with environmental nephrosis. Lycopene (LYC) exhibits important nephroprotective properties, but there are limited data on the specific underlying mechanism. The primary objective of this study was to explore the therapeutic effect of LYC on ATR-induced nephrotoxicity in mice. The mice were divided randomly into 6 groups and treated as follows: control group (C), 5 mg/kg LYC group (L), 50 mg/kg ATR group (A1), 200 mg/kg ATR group (A2), 50 mg/kg ATR plus 5 mg/kg LYC group (A1+L), and 200 mg/kg ATR plus 5 mg/kg LYC group (A2+L), by oral gavage administration for 21 days. We found that pretreatment with LYC significantly suppressed ATR-induced renal tubular epithelial cell swelling. Furthermore, LYC mitigated ATR-induced dysregulation of oxidative stress markers by reducing MDA and H2O2 levels and increasing SOD, GPx, and CAT concentrations and Nrf2 activation. Moreover, LYC activated autophagic flux, as shown by detectable changes in autophagy-related genes (Beclin-1 and ATGs) and proteins (p62/SQSTM1), by the formation of autophagic vacuoles (AV), and by LC3 aggregation, in parallel with AMPK activation (pAMPK/AMPK). ATR up-regulated nuclear factor erythroid 2-related factor 2 (Nrf2) expression and Nrf2-regulated redox genes, including NAD(P)H quinone oxidoreductase-1 (NQO1) and heme oxygenase-1 (HO-1), whereas LYC down-regulated these genes. In addition, LYC suppressed ATR-induced activation of autophagy (increased LC3II/LC3I, ATGs, Beclin-1, and p62, in parallel with increased AMPK activation). Collectively, our findings identify cross talk between AMPK-activated autophagy and the Nrf2 signaling pathway in LYC-mediated nephroprotection against ATR-induced toxicity in the mouse kidney.
Affiliation(s)
- Jia Lin
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Jun Xia
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Hua-Shan Zhao
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Rui Hou
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Milton Talukder
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Department of Physiology and Pharmacology, Faculty of Animal Science and Veterinary Medicine, Patuakhali Science and Technology University, Barishal 8210, Bangladesh
- Lei Yu
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Jian-Ying Guo
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
- Jin-Long Li
- College of Veterinary Medicine, Key Laboratory of the Provincial Education Department of Heilongjiang for Common Animal Disease Prevention and Treatment, and Heilongjiang Key Laboratory for Laboratory Animals and Comparative Medicine, Northeast Agricultural University, Harbin 150030, P.R. China
24
Ramanarayanan V, Tilsen S, Proctor M, Töger J, Goldstein L, Nayak KS, Narayanan S. Analysis of speech production real-time MRI. COMPUT SPEECH LANG 2018. [DOI: 10.1016/j.csl.2018.04.002] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Indexed: 12/27/2022]
25
Oh M, Lee Y. ACT: An Automatic Centroid Tracking tool for analyzing vocal tract actions in real-time magnetic resonance imaging speech production data. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2018; 144:EL290. [PMID: 30404513 PMCID: PMC6192793 DOI: 10.1121/1.5057367] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/10/2018] [Revised: 08/28/2018] [Accepted: 09/11/2018] [Indexed: 06/08/2023]
Abstract
Real-time magnetic resonance imaging (MRI) speech production data have expanded the understanding of vocal tract actions. This letter presents an Automatic Centroid Tracking tool, ACT, which obtains both spatial and temporal information characterizing multi-directional articulatory movement. ACT auto-segments an articulatory object composed of connected pixels in a real-time MRI video, by finding its intensity centroids over time and returns kinematic profiles including direction and magnitude information of the object. This letter discusses the utility of ACT, which outperforms other similar object tracking techniques, by demonstrating its successful online tracking of vertical larynx movement. ACT can be deployed generally for dynamic image processing and analysis.
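The intensity-centroid computation at the core of a tool like ACT might be sketched as below. This assumes pre-segmented per-frame object masks and hypothetical function names; ACT's automatic segmentation of connected pixels is not reproduced here:

```python
import numpy as np

def intensity_centroid(frame, mask):
    """Intensity-weighted centroid (row, col) of a segmented object.

    frame: 2D array of pixel intensities; mask: boolean array marking the
    connected pixels of the articulatory object in this frame."""
    weights = frame * mask
    total = weights.sum()
    rows, cols = np.indices(frame.shape)
    return (rows * weights).sum() / total, (cols * weights).sum() / total

def track_centroids(video, masks):
    """Per-frame centroid positions plus frame-to-frame displacement
    vectors, giving direction and magnitude of the object's movement."""
    positions = np.array([intensity_centroid(f, m) for f, m in zip(video, masks)])
    return positions, np.diff(positions, axis=0)
```

For vertical larynx tracking, the row component of the displacement vectors would give the kinematic profile of raising and lowering.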
Affiliation(s)
- Miran Oh
- Department of Linguistics, University of Southern California, Los Angeles, California 90089, USA
- Yoonjeong Lee
- Department of Linguistics, University of Southern California, Los Angeles, California 90089, USA
26
Abstract
Speech motor actions are performed quickly, while simultaneously maintaining a high degree of accuracy. Are speed and accuracy in conflict during speech production? Speed-accuracy tradeoffs have been shown in many domains of human motor action, but have not been directly examined in the domain of speech production. The present work seeks evidence for Fitts' law, a rigorous formulation of this fundamental tradeoff, in speech articulation kinematics by analyzing USC-TIMIT, a real-time magnetic resonance imaging data set of speech production. A theoretical framework for considering Fitts' law with respect to models of speech motor control is elucidated. Methodological challenges in seeking relationships consistent with Fitts' law are addressed, including the operational definitions and measurement of key variables in real-time MRI data. Results suggest the presence of speed-accuracy tradeoffs for certain types of speech production actions, with wide variability across syllable position, and substantial variability also across subjects. Coda consonant targets immediately following the syllabic nucleus show the strongest evidence of this tradeoff, with correlations as high as 0.72 between speed and accuracy. A discussion is provided concerning the potentially limited applicability of Fitts' law in the context of speech production, as well as the theoretical context for interpreting the results.
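Fitts' law, the tradeoff tested in this abstract, is compact enough to state directly. The sketch below uses the classic formulation MT = a + b · log2(2A/W); the coefficient values are illustrative defaults, not values fitted to the speech data discussed here:

```python
import math

def index_of_difficulty(amplitude, width):
    """Fitts' index of difficulty in bits: ID = log2(2A / W), where A is the
    movement amplitude and W the target width (the accuracy demand)."""
    return math.log2(2.0 * amplitude / width)

def movement_time(amplitude, width, a=0.05, b=0.1):
    """Fitts' law prediction MT = a + b * ID; a and b are empirically fitted
    constants (the values here are placeholders for illustration)."""
    return a + b * index_of_difficulty(amplitude, width)
```

Under this formulation, halving the target width W (demanding more accuracy) adds one bit of difficulty and hence a fixed increment b to the predicted movement time.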
27
Whalen DH, Chen WR, Tiede MK, Nam H. Variability of articulator positions and formants across nine English vowels. JOURNAL OF PHONETICS 2018; 68:1-14. [PMID: 30034052 PMCID: PMC6053058 DOI: 10.1016/j.wocn.2018.01.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 05/13/2023]
Abstract
Speech, though communicative, is quite variable both in articulation and acoustics, and it has often been claimed that articulation is more variable. Here we compared variability in articulation and acoustics for 32 speakers in the x-ray microbeam database (XRMB; Westbury, 1994). Variability in tongue, lip and jaw positions for nine English vowels (/u, ʊ, æ, ɑ, ʌ, ɔ, ε, ɪ, i/) was compared to that of the corresponding formant values. The domains were made comparable by creating three-dimensional spaces for each: the first three principal components from an analysis of a 14-dimensional space for articulation, and an F1xF2xF3 space for acoustics. More variability occurred in the articulation than the acoustics for half of the speakers, while the reverse was true for the other half. Individual tokens were further from the articulatory median than the acoustic median for 40-60% of tokens across speakers. A separate analysis of three non-low front vowels (/ε, ɪ, i/, for which the XRMB system provides the most direct articulatory evidence) did not differ from the omnibus analysis. Speakers tended to be either more or less variable consistently across vowels. Across speakers, there was a positive correlation between articulatory and acoustic variability, both for all vowels and for just the three non-low front vowels. Although the XRMB is an incomplete representation of articulation, it nonetheless provides data for direct comparisons between articulatory and acoustic variability that have not been reported previously. The results indicate that articulation is not more variable than acoustics, that speakers had relatively consistent variability across vowels, and that articulatory and acoustic variability were related for the vowels themselves.
Affiliation(s)
- D H Whalen
- Haskins Laboratories
- City University of New York
- Yale University
28

29
Pattem AK, Illa A, Afshan A, Ghosh PK. Optimal sensor placement in electromagnetic articulography recording for speech production study. COMPUT SPEECH LANG 2018. [DOI: 10.1016/j.csl.2017.07.008] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/17/2022]
30
Töger J, Sorensen T, Somandepalli K, Toutios A, Lingala SG, Narayanan S, Nayak K. Test-retest repeatability of human speech biomarkers from static and real-time dynamic magnetic resonance imaging. THE JOURNAL OF THE ACOUSTICAL SOCIETY OF AMERICA 2017; 141:3323. [PMID: 28599561 PMCID: PMC5436977 DOI: 10.1121/1.4983081] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 05/06/2023]
Abstract
Static anatomical and real-time dynamic magnetic resonance imaging (RT-MRI) of the upper airway is a valuable method for studying speech production in research and clinical settings. The test-retest repeatability of quantitative imaging biomarkers is an important parameter, since it limits the effect sizes and intragroup differences that can be studied. Therefore, this study aims to present a framework for determining the test-retest repeatability of quantitative speech biomarkers from static MRI and RT-MRI, and apply the framework to healthy volunteers. Subjects (n = 8, 4 females, 4 males) are imaged in two scans on the same day, including static images and dynamic RT-MRI of speech tasks. The inter-study agreement is quantified using intraclass correlation coefficient (ICC) and mean within-subject standard deviation (σe). Inter-study agreement is strong to very strong for static measures (ICC: min/median/max 0.71/0.89/0.98, σe: 0.90/2.20/6.72 mm), poor to strong for dynamic RT-MRI measures of articulator motion range (ICC: 0.26/0.75/0.90, σe: 1.6/2.5/3.6 mm), and poor to very strong for velocities (ICC: 0.21/0.56/0.93, σe: 2.2/4.4/16.7 cm/s). In conclusion, this study characterizes repeatability of static and dynamic MRI-derived speech biomarkers using state-of-the-art imaging. The introduced framework can be used to guide future development of speech biomarkers. Test-retest MRI data are provided free for research use.
Affiliation(s)
- Johannes Töger
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
- Tanner Sorensen
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
- Krishna Somandepalli
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
- Asterios Toutios
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
- Sajan Goud Lingala
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
- Shrikanth Narayanan
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
- Krishna Nayak
- Ming Hsieh Department of Electrical Engineering, University of Southern California, 3740 McClintock Avenue, EEB 400, Los Angeles, California 90089-2560, USA
31
Poddar S, Jacob M. Dynamic MRI Using SmooThness Regularization on Manifolds (SToRM). IEEE TRANSACTIONS ON MEDICAL IMAGING 2016; 35:1106-15. [PMID: 26685228 PMCID: PMC5334465 DOI: 10.1109/tmi.2015.2509245] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.0] [Indexed: 05/25/2023]
Abstract
We introduce a novel algorithm to recover real-time dynamic MR images from highly under-sampled k-t space measurements. The proposed scheme models the images in the dynamic dataset as points on a smooth, low-dimensional manifold in high-dimensional space. We propose to exploit the non-linear and non-local redundancies in the dataset by posing its recovery as a manifold-smoothness regularized optimization problem. A navigator acquisition scheme is used to determine the structure of the manifold, or equivalently the associated graph Laplacian matrix. The estimated Laplacian matrix is then used to recover the dataset from undersampled measurements. The utility of the proposed scheme is demonstrated by comparisons with state-of-the-art methods in multi-slice real-time cardiac and speech imaging applications.
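The manifold-smoothness idea can be shown in a toy closed-form version. This assumes fully sampled noisy frames and a known similarity matrix W; the real SToRM scheme recovers undersampled k-t data with a measurement operator, which this sketch does not attempt:

```python
import numpy as np

def graph_laplacian(weights):
    """Graph Laplacian L = D - W from a symmetric frame-similarity matrix W
    (in SToRM, W is estimated from navigator acquisitions)."""
    return np.diag(weights.sum(axis=1)) - weights

def storm_denoise(frames, weights, lam=1.0):
    """Toy manifold-smoothness recovery: minimize over X
        ||X - Y||_F^2 + lam * tr(X L X^T),
    where Y holds noisy frames as columns (pixels x frames). Frames that are
    neighbours on the manifold (large W_ij) are pulled toward each other.
    Closed form: X = Y (I + lam * L)^(-1)."""
    lap = graph_laplacian(weights)
    n = weights.shape[0]
    return frames @ np.linalg.inv(np.eye(n) + lam * lap)
```

Because L has zero row sums, the regularization redistributes intensity between similar frames without changing their overall sum, and larger `lam` pulls neighbouring frames closer together.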
32
Toutios A, Narayanan SS. Advances in real-time magnetic resonance imaging of the vocal tract for speech science and technology research. APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING 2016; 5:e6. [PMID: 27833745 PMCID: PMC5100697 DOI: 10.1017/atsip.2016.5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Real-time magnetic resonance imaging (rtMRI) of the moving vocal tract during running speech production is an important emerging tool for speech production research, providing dynamic information about a speaker's upper airway from the entire midsagittal plane or any other scan plane of interest. There have been several advances in the development of speech rtMRI and corresponding analysis tools, and in their application to domains such as phonetics and phonological theory, articulatory modeling, and speaker characterization. An important recent development has been the open release of a database that includes speech rtMRI data from five male and five female speakers of American English, each producing 460 phonetically balanced sentences. The purpose of the present paper is to give an overview of, and outlook on, the advances in rtMRI as a tool for speech research and technology development.
Affiliation(s)
- Asterios Toutios
- Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California (USC), 3740 McClintock Avenue, Los Angeles, CA 90089, USA
- Shrikanth S Narayanan
- Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California (USC), 3740 McClintock Avenue, Los Angeles, CA 90089, USA
33
34
Lingala SG, Zhu Y, Kim YC, Toutios A, Narayanan S, Nayak KS. A fast and flexible MRI system for the study of dynamic vocal tract shaping. Magn Reson Med 2016; 77:112-125. [PMID: 26778178 DOI: 10.1002/mrm.26090] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2015] [Revised: 11/06/2015] [Accepted: 11/24/2015] [Indexed: 11/07/2022]
Abstract
PURPOSE The aim of this work was to develop and evaluate an MRI-based system for the study of dynamic vocal tract shaping during speech production that provides high spatial and temporal resolution. METHODS The proposed system utilizes (a) custom eight-channel upper-airway coils with high sensitivity to upper-airway regions of interest, (b) two-dimensional golden-angle spiral gradient-echo acquisition, (c) on-the-fly view-sharing reconstruction, and (d) off-line temporal finite-difference constrained reconstruction. The system also provides simultaneous, noise-cancelled, temporally aligned audio. The system was evaluated in three healthy volunteers and one tongue cancer patient with a broad range of speech tasks. RESULTS We report spatiotemporal resolutions of 2.4 × 2.4 mm² every 12 ms for single-slice imaging and 2.4 × 2.4 mm² every 36 ms for three-slice imaging, which reflects roughly 7-fold acceleration over Nyquist sampling. The system demonstrates improved temporal fidelity in capturing rapid vocal tract shaping for tasks such as producing consonant clusters and beat-boxing sounds. Novel acoustic-articulatory analysis was also demonstrated. CONCLUSION A synergistic combination of custom coils, spiral acquisitions, and constrained reconstruction enables visualization of rapid speech with high spatiotemporal resolution in multiple planes. Magn Reson Med 77:112-125, 2017. © 2016 Wiley Periodicals, Inc.
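The on-the-fly view-sharing reconstruction mentioned in the abstract can be sketched with a toy model: each repetition acquires one interleave of k-space, and a frame is formed from the most recent copy of every interleave. This sketch uses cycling Cartesian-style interleaves and integer placeholder data purely for illustration (the paper uses golden-angle spirals and gridded k-space; all names and numbers below are hypothetical):

```python
import numpy as np

n_interleaves = 8          # interleaves needed for one fully sampled frame
n_trs = 32                 # total repetitions acquired
samples_per_leaf = 4

# Each TR acquires one interleave, cycling 0, 1, ..., 7, 0, 1, ...
# The "data" is just the TR index, so freshness is easy to inspect.
acquired = [(tr % n_interleaves, np.full(samples_per_leaf, tr, float))
            for tr in range(n_trs)]

def view_share(tr_index):
    """Frame at tr_index: the most recent copy of every interleave seen so far."""
    frame = {}
    for leaf, data in acquired[: tr_index + 1]:
        frame[leaf] = data          # later TRs overwrite older copies
    return frame

frame = view_share(20)
# Which TR contributed each interleave of this frame:
newest_tr = {leaf: int(d[0]) for leaf, d in frame.items()}
```

The frame at TR 20 draws its eight interleaves from TRs 13-20, i.e. the temporal footprint equals one full set of interleaves, which is the trade-off view-sharing makes between frame rate and true temporal resolution.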
Affiliation(s)
- Sajan Goud Lingala
- Electrical Engineering, University of Southern California, Los Angeles, CA
- Yinghua Zhu
- Electrical Engineering, University of Southern California, Los Angeles, CA
- Asterios Toutios
- Electrical Engineering, University of Southern California, Los Angeles, CA
- Krishna S Nayak
- Electrical Engineering, University of Southern California, Los Angeles, CA
35
Li M, Kim J, Lammert A, Ghosh PK, Ramanarayanan V, Narayanan S. Speaker verification based on the fusion of speech acoustics and inverted articulatory signals. COMPUT SPEECH LANG 2015; 36:196-211. [PMID: 28496292 DOI: 10.1016/j.csl.2015.05.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
We propose a practical feature-level and score-level fusion approach that combines acoustic and estimated articulatory information for both text-independent and text-dependent speaker verification. From a practical point of view, we study how to improve speaker verification performance by combining dynamic articulatory information with conventional acoustic features. For text-independent speaker verification, we find that concatenating articulatory features obtained from measured speech production data with conventional Mel-frequency cepstral coefficients (MFCCs) improves performance dramatically. However, since directly measuring articulatory data is not feasible in many real-world applications, we also experiment with estimated articulatory features obtained through acoustic-to-articulatory inversion. We explore both feature-level and score-level fusion methods and find that overall system performance is significantly enhanced even with estimated articulatory features. Such a performance boost could be due to the inter-speaker variation information embedded in the estimated articulatory features. Since the dynamics of articulation contain important information, we also include inverted articulatory trajectories in text-dependent speaker verification. We demonstrate that the articulatory constraints introduced by inverted articulatory features help to reject wrong-password trials and improve performance after score-level fusion. We evaluate the proposed methods on the X-ray Microbeam database and the RSR 2015 database, respectively, for the two tasks. Experimental results show more than 15% relative equal error rate reduction for both speaker verification tasks.
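The two fusion levels the abstract contrasts differ only in where the streams are combined. A minimal sketch, with synthetic vectors and cosine scoring standing in for MFCC/articulatory features and the paper's verification back-ends (the fusion weight alpha is a hypothetical choice, not a value from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    """Cosine similarity as a stand-in verification score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic enrollment vectors for the "acoustic" and "articulatory" streams.
enroll_ac, enroll_ar = rng.normal(size=20), rng.normal(size=8)
# A same-speaker test trial: enrollment plus a little noise.
test_ac = enroll_ac + 0.1 * rng.normal(size=20)
test_ar = enroll_ar + 0.1 * rng.normal(size=8)

# Feature-level fusion: concatenate the streams, then score once.
feat_score = cosine(np.concatenate([enroll_ac, enroll_ar]),
                    np.concatenate([test_ac, test_ar]))

# Score-level fusion: score each stream separately, then combine with weight alpha.
alpha = 0.7
score_fused = (alpha * cosine(enroll_ac, test_ac)
               + (1 - alpha) * cosine(enroll_ar, test_ar))
```

Feature-level fusion lets the back-end model cross-stream correlations; score-level fusion keeps the streams' models independent and only needs a combination weight, which is why both are worth exploring when one stream (here, inverted articulation) is noisier than the other.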
Affiliation(s)
- Ming Li
- Sun Yat-Sen University-Carnegie Mellon University Joint Institute of Engineering, Sun Yat-Sen University, China; Sun Yat-Sen University-Carnegie Mellon University Shunde International Joint Research Institute, Shunde, China; School of Mobile Information Engineering, Sun Yat-Sen University, China
- Jangwon Kim
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- Adam Lammert
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- Prasanta Kumar Ghosh
- Department of Electrical Engineering, Indian Institute of Science (IISc), Bangalore, India
- Vikram Ramanarayanan
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
- Shrikanth Narayanan
- Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA
36
Gibert G, Olsen KN, Leung Y, Stevens CJ. Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis. COMPUTATIONAL COGNITIVE SCIENCE 2015; 1:7. [PMID: 27980889 PMCID: PMC5125409 DOI: 10.1186/s40469-015-0007-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/22/2015] [Accepted: 08/30/2015] [Indexed: 12/04/2022]
Abstract
Background Virtual humans have become part of our everyday life (movies, the internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are usually limited, and are often incoherent and/or asynchronous with the corresponding acoustic signal. Methods We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. Speech articulation cannot be accurately replicated by interpolation between key frames; talking heads with good speech capabilities are instead derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for the visible speech articulators (jaw and lips) synchronous with acoustics. To access tongue trajectories (a partially occluded speech articulator), electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide ‘normalized’ animation (i.e., articulatory) parameters. In addition, semi-automatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was created; it is composed of the variation of the articulatory parameters between all successive stable allophones. Results The avatar’s facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips, and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database.
Conclusions We describe a method to transform an ECA with a generic tongue model and key-frame animation into a talking head that displays naturalistic tongue, jaw, and lip motions. Thanks to a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA’s facial movements, enhancing its speech capabilities.
Affiliation(s)
- Guillaume Gibert
- The MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia; INSERM U846, 18 avenue Doyen Lépine, 69500 Bron, France; Stem Cell and Brain Research Institute, 69500 Bron, France; Université de Lyon, Université Lyon 1, 69003 Lyon, France
- Kirk N Olsen
- The MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia
- Yvonne Leung
- The MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia
- Catherine J Stevens
- The MARCS Institute, University of Western Sydney, Locked Bag 1797, Penrith, NSW 2751, Australia