1
Anikin A, Barreda S, Reby D. A practical guide to calculating vocal tract length and scale-invariant formant patterns. Behav Res Methods 2024; 56:5588-5604. [PMID: 38158551 DOI: 10.3758/s13428-023-02288-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/02/2023] [Indexed: 01/03/2024]
Abstract
Formants (vocal tract resonances) are increasingly analyzed not only by phoneticians in speech but also by behavioral scientists studying diverse phenomena such as acoustic size exaggeration and articulatory abilities of non-human animals. This often involves estimating vocal tract length acoustically and producing scale-invariant representations of formant patterns. We present a theoretical framework and practical tools for carrying out this work, including open-source software solutions included in R packages soundgen and phonTools. Automatic formant measurement with linear predictive coding is error-prone, but formant_app provides an integrated environment for formant annotation and correction with visual and auditory feedback. Once measured, formants can be normalized using a single recording (intrinsic methods) or multiple recordings from the same individual (extrinsic methods). Intrinsic speaker normalization can be as simple as taking formant ratios and calculating the geometric mean as a measure of overall scale. The regression method implemented in the function estimateVTL calculates the apparent vocal tract length assuming a single-tube model, while its residuals provide a scale-invariant vowel space based on how far each formant deviates from equal spacing (the schwa function). Extrinsic speaker normalization provides more accurate estimates of speaker- and vowel-specific scale factors by pooling information across recordings with simple averaging or mixed models, which we illustrate with example datasets and R code. The take-home messages are to record several calls or vowels per individual, measure at least three or four formants, check formant measurements manually, treat uncertain values as missing, and use the statistical tools best suited to each modeling context.
Affiliation(s)
- Andrey Anikin
  - Division of Cognitive Science, Department of Philosophy, Lund University, Box 192, SE-221 00, Lund, Sweden
  - ENES Bioacoustics Research Laboratory, CRNL Center for Research in Neuroscience in Lyon, University of Saint Étienne, 42023, St-Étienne, France
- Santiago Barreda
  - Department of Linguistics, University of California, Davis, Davis, CA, USA
- David Reby
  - ENES Bioacoustics Research Laboratory, CRNL Center for Research in Neuroscience in Lyon, University of Saint Étienne, 42023, St-Étienne, France
  - Institut Universitaire de France, 75005, Paris, France
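The single-tube regression idea summarized in the abstract of entry 1 can be illustrated with a minimal Python sketch. This is not the `estimateVTL` implementation from soundgen; the function name `estimate_vtl`, the default speed of sound (35400 cm/s), and the choice to report deviations as proportions are assumptions made for illustration. It relies only on the textbook model of a uniform tube closed at one end, where the n-th resonance is F_n = (2n - 1) / 2 * dF and the formant spacing is dF = c / (2 * VTL).

```python
def estimate_vtl(formants_hz, speed_sound_cm_s=35400.0):
    """Apparent vocal tract length (cm) from formants F1, F2, ... in Hz.

    Hypothetical helper for illustration. Entries may be None when a
    formant measurement is uncertain and is treated as missing.
    """
    # Predictor for the n-th formant in a uniform tube closed at one end:
    # F_n = (2n - 1) / 2 * dF, with dF = c / (2 * VTL).
    points = [((2 * n - 1) / 2, f)
              for n, f in enumerate(formants_hz, start=1)
              if f is not None]
    if not points:
        raise ValueError("need at least one measured formant")

    # Least-squares slope of a regression through the origin gives dF.
    spacing_hz = sum(x * f for x, f in points) / sum(x * x for x, _ in points)
    vtl_cm = speed_sound_cm_s / (2 * spacing_hz)

    # Proportional deviation of each formant from its equally spaced
    # (schwa-like) position; 0 means exactly on the single-tube prediction.
    deviations = [None if f is None
                  else f / (spacing_hz * (2 * n - 1) / 2) - 1
                  for n, f in enumerate(formants_hz, start=1)]
    return vtl_cm, spacing_hz, deviations


# An ideal 17.5 cm tube with c = 35000 cm/s has formants at 500, 1500, 2500 Hz.
vtl, spacing, deviations = estimate_vtl([500.0, 1500.0, 2500.0],
                                        speed_sound_cm_s=35000.0)
print(round(vtl, 2), round(spacing, 1), deviations)
```

For the idealized input above, the sketch recovers a length of about 17.5 cm with all deviations at zero. With real vowels the deviations are non-zero and describe vowel quality independently of overall scale, which is the role the abstract assigns to the regression residuals.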
2
Badin P, Sawallis TR, Tabain M, Lamalle L. Bilinguals from Larynx to Lips: Exploring Bilingual Articulatory Strategies with Anatomic MRI Data. Language and Speech 2024:238309231224790. [PMID: 38680040 DOI: 10.1177/00238309231224790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/01/2024]
Abstract
The goal of this article is to illustrate the use of MRI for exploring bi- and multi-lingual articulatory strategies. One male and one female speaker recorded sets of static midsagittal MRIs of the whole vocal tract, producing vowels as well as consonants in various vowel contexts in either the male's two or the female's three languages. Both speakers were native speakers of English (American and Australian English, respectively), and both were fluent L2 speakers of French. In addition, the female speaker was a heritage speaker of Croatian. Articulatory contours extracted from the MRIs were subsequently used at three progressively more compact and abstract levels of analysis. (1) Direct comparison of overlaid contours was used to assess whether phones analogous across L1 and L2 are similar or dissimilar, both overall and in specific vocal tract regions. (2) Consonant contour variability along the vocal tract due to vowel context was determined using dispersion ellipses and used to explore the variable resistance to coarticulation for non-analogous rhotics and analogous laterals in Australian, French, and Croatian. (3) Articulatory modeling was used to focus on specific articulatory gestures (tongue position and shape, lip protrusion, laryngeal height, etc.) and then to explore the articulatory strategies in the speakers' interlanguages for production of the French front rounded vowel series. This revealed that the Australian and American speakers used different strategies to produce the non-analogous French vowel series. We conclude that MRI-based articulatory data constitute a very rich and underused source of information that amply deserves applications to the study of L2 articulation and bilingual and multi-lingual speech.
Affiliation(s)
- Pierre Badin
  - Institute of Engineering, Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France
- Marija Tabain
  - Department of Languages and Linguistics, La Trobe University, Australia
- Laurent Lamalle
  - Université Grenoble Alpes and CHU de Grenoble, Inserm US 17, CNRS UMS 3552, UMS IRMaGe, France
3
Chatterjee M, Gajre S, Kulkarni AM, Barrett KC, Limb CJ. Predictors of Emotional Prosody Identification by School-Age Children With Cochlear Implants and Their Peers With Normal Hearing. Ear Hear 2024; 45:411-424. [PMID: 37811966 PMCID: PMC10922148 DOI: 10.1097/aud.0000000000001436] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2023]
Abstract
OBJECTIVES: Children with cochlear implants (CIs) vary widely in their ability to identify emotions in speech. The causes of this variability are unknown, but this knowledge will be crucial if we are to design improvements in technological or rehabilitative interventions that are effective for individual patients. The objective of this study was to investigate how well factors such as age at implantation, duration of device experience (hearing age), nonverbal cognition, vocabulary, and socioeconomic status predict prosody-based emotion identification in children with CIs, and how the key predictors in this population compare to children with normal hearing who are listening to either normal emotional speech or to degraded speech.
DESIGN: We measured vocal emotion identification in 47 school-age CI recipients aged 7 to 19 years in a single-interval, 5-alternative forced-choice task. None of the participants had usable residual hearing based on parent/caregiver report. Stimuli consisted of a set of semantically emotion-neutral sentences that were recorded by 4 talkers in child-directed and adult-directed prosody corresponding to five emotions: neutral, angry, happy, sad, and scared. Twenty-one children with normal hearing were also tested in the same tasks; they listened to both original speech and to versions that had been noise-vocoded to simulate CI information processing.
RESULTS: Group comparison confirmed the expected deficit in CI participants' emotion identification relative to participants with normal hearing. Within the CI group, increasing hearing age (correlated with developmental age) and nonverbal cognition outcomes predicted emotion recognition scores. Stimulus-related factors such as talker and emotional category also influenced performance and were involved in interactions with hearing age and cognition. Age at implantation was not predictive of emotion identification. Unlike the CI participants, neither cognitive status nor vocabulary predicted outcomes in participants with normal hearing, whether listening to original speech or CI-simulated speech. Age-related improvements in outcomes were similar in the two groups. Participants with normal hearing listening to original speech showed the greatest differences in their scores for different talkers and emotions. Participants with normal hearing listening to CI-simulated speech showed significant deficits compared with their performance with original speech materials, and their scores also showed the least effect of talker- and emotion-based variability. CI participants showed more variation in their scores with different talkers and emotions than participants with normal hearing listening to CI-simulated speech, but less so than participants with normal hearing listening to original speech.
CONCLUSIONS: Taken together, these results confirm previous findings that pediatric CI recipients have deficits in emotion identification based on prosodic cues, but they improve with age and experience at a rate that is similar to peers with normal hearing. Unlike participants with normal hearing, nonverbal cognition played a significant role in CI listeners' emotion identification. Specifically, nonverbal cognition predicted the extent to which individual CI users could benefit from some talkers being more expressive of emotions than others, and this effect was greater in CI users who had less experience with their device (or were younger) than CI users who had more experience with their device (or were older). Thus, in young prelingually deaf children with CIs performing an emotional prosody identification task, cognitive resources may be harnessed to a greater degree than in older prelingually deaf children with CIs or than children with normal hearing.
Affiliation(s)
- Monita Chatterjee
  - Auditory Prostheses & Perception Laboratory, Center for Hearing Research, Boys Town National Research Hospital, 555 N 30 St., Omaha, NE 68131, USA
- Shivani Gajre
  - Auditory Prostheses & Perception Laboratory, Center for Hearing Research, Boys Town National Research Hospital, 555 N 30 St., Omaha, NE 68131, USA
- Aditya M Kulkarni
  - Auditory Prostheses & Perception Laboratory, Center for Hearing Research, Boys Town National Research Hospital, 555 N 30 St., Omaha, NE 68131, USA
- Karen C Barrett
  - Department of Otolaryngology-Head and Neck Surgery, University of California, San Francisco, San Francisco, California, USA
- Charles J Limb
  - Department of Otolaryngology-Head and Neck Surgery, University of California, San Francisco, San Francisco, California, USA
4
Ruthven M, Peplinski AM, Adams DM, King AP, Miquel ME. Real-time speech MRI datasets with corresponding articulator ground-truth segmentations. Sci Data 2023; 10:860. [PMID: 38042857 PMCID: PMC10693552 DOI: 10.1038/s41597-023-02766-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 11/20/2023] [Indexed: 12/04/2023] Open
Abstract
The use of real-time magnetic resonance imaging (rt-MRI) of speech is increasing in clinical practice and speech science research. Analysis of such images often requires segmentation of articulators and the vocal tract, and the community is turning to deep-learning-based methods to perform this segmentation. While there are publicly available rt-MRI datasets of speech, these do not include ground-truth (GT) segmentations, a key requirement for the development of deep-learning-based segmentation methods. To begin to address this barrier, this work presents rt-MRI speech datasets of five healthy adult volunteers with corresponding GT segmentations and velopharyngeal closure patterns. The images were acquired using standard clinical MRI scanners, coils and sequences to facilitate acquisition of similar images in other centres. The datasets include manually created GT segmentations of six anatomical features including the tongue, soft palate and vocal tract. In addition, this work makes code and instructions to implement a current state-of-the-art deep-learning-based method to segment rt-MRI speech datasets publicly available, thus providing the community and others with a starting point for developing such methods.
Affiliation(s)
- Matthieu Ruthven
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London, EC1A 7BE, UK
  - School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London, SE1 7EH, UK
- David M Adams
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London, EC1A 7BE, UK
- Andrew P King
  - School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London, SE1 7EH, UK
- Marc Eric Miquel
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London, EC1A 7BE, UK
  - Digital Environment Research Institute (DERI), Empire House, 67-75 New Road, Queen Mary University of London, London, E1 1HH, UK
  - Advanced Cardiovascular Imaging, Barts NIHR BRC, Queen Mary University of London, London, EC1M 6BQ, UK
5
Ruthven M, Miquel ME, King AP. A segmentation-informed deep learning framework to register dynamic two-dimensional magnetic resonance images of the vocal tract during speech. Biomed Signal Process Control 2023; 80:104290. [PMID: 36743699 PMCID: PMC9746295 DOI: 10.1016/j.bspc.2022.104290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2022] [Revised: 09/29/2022] [Accepted: 10/08/2022] [Indexed: 11/06/2022]
Abstract
Objective: Dynamic magnetic resonance (MR) imaging enables visualisation of articulators during speech. There is growing interest in quantifying articulator motion in two-dimensional MR images of the vocal tract, to better understand speech production and potentially inform patient management decisions. Image registration is an established way to achieve this quantification. Recently, segmentation-informed deformable registration frameworks have been developed and have achieved state-of-the-art accuracy. This work aims to adapt such a framework and optimise it for estimating displacement fields between dynamic two-dimensional MR images of the vocal tract during speech.
Methods: A deep-learning-based registration framework was developed and compared with current state-of-the-art registration methods and frameworks (two traditional methods and three deep-learning-based frameworks, two of which are segmentation informed). The accuracy of the methods and frameworks was evaluated using the Dice coefficient (DSC), average surface distance (ASD) and a metric based on velopharyngeal closure. The metric evaluated if the fields captured a clinically relevant and quantifiable aspect of articulator motion.
Results: The segmentation-informed frameworks achieved higher DSCs and lower ASDs and captured more velopharyngeal closures than the traditional methods and the framework that was not segmentation informed. All segmentation-informed frameworks achieved similar DSCs and ASDs. However, the proposed framework captured the most velopharyngeal closures.
Conclusions: A framework was successfully developed and found to more accurately estimate articulator motion than five current state-of-the-art methods and frameworks.
Significance: The first deep-learning-based framework specifically for registering dynamic two-dimensional MR images of the vocal tract during speech has been developed and evaluated.
Affiliation(s)
- Matthieu Ruthven
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom
  - School of Biomedical Engineering & Imaging Sciences, King’s College London, King’s Health Partners, St Thomas’ Hospital, London SE1 7EH, United Kingdom
  - Corresponding author at: Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom
- Marc E. Miquel
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom
  - Digital Environment Research Institute (DERI), Empire House, 67-75 New Road, Queen Mary University of London, London E1 1HH, United Kingdom
  - Advanced Cardiovascular Imaging, Barts NIHR BRC, Queen Mary University of London, London EC1M 6BQ, United Kingdom
- Andrew P. King
  - School of Biomedical Engineering & Imaging Sciences, King’s College London, King’s Health Partners, St Thomas’ Hospital, London SE1 7EH, United Kingdom
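The Dice coefficient (DSC) named in the abstract of entry 5 as an accuracy metric is defined as twice the overlap of two regions divided by the sum of their sizes. The following Python sketch shows that definition on small binary masks. The function name `dice_coefficient`, the nested-list mask format, and the convention of returning 1.0 for two empty masks are assumptions for illustration, not details taken from the paper.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks (nested lists).

    Hypothetical helper for illustration: DSC = 2 * |A and B| / (|A| + |B|).
    """
    flat_a = [bool(v) for row in mask_a for v in row]
    flat_b = [bool(v) for row in mask_b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("masks must contain the same number of pixels")

    overlap = sum(a and b for a, b in zip(flat_a, flat_b))
    total = sum(flat_a) + sum(flat_b)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * overlap / total


# Toy 2x2 example: the predicted region covers one of two ground-truth pixels.
ground_truth = [[1, 1],
                [0, 0]]
prediction = [[1, 0],
              [0, 0]]
print(dice_coefficient(ground_truth, prediction))
```

In the toy example one pixel overlaps and the two masks contain three foreground pixels in total, so the score is 2/3. A score of 1 means identical regions and 0 means no overlap.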
6
Anikin A, Pisanski K, Reby D. Static and dynamic formant scaling conveys body size and aggression. Royal Society Open Science 2022; 9:211496. [PMID: 35242348 PMCID: PMC8753157 DOI: 10.1098/rsos.211496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2021] [Accepted: 12/09/2021] [Indexed: 05/03/2023]
Abstract
When producing intimidating aggressive vocalizations, humans and other animals often extend their vocal tracts to lower their voice resonance frequencies (formants) and thus sound big. Is acoustic size exaggeration more effective when the vocal tract is extended before, or during, the vocalization, and how do listeners interpret within-call changes in apparent vocal tract length? We compared perceptual effects of static and dynamic formant scaling in aggressive human speech and nonverbal vocalizations. Acoustic manipulations corresponded to elongating or shortening the vocal tract either around (Experiment 1) or from (Experiment 2) its resting position. Gradual formant scaling that preserved average frequencies conveyed the impression of smaller size and greater aggression, regardless of the direction of change. Vocal tract shortening from the original length conveyed smaller size and less aggression, whereas vocal tract elongation conveyed larger size and more aggression, and these effects were stronger for static than for dynamic scaling. Listeners familiarized with the speaker's natural voice were less often 'fooled' by formant manipulations when judging speaker size, but paid more attention to formants when judging aggressive intent. Thus, within-call vocal tract scaling conveys emotion, but a better way to sound large and intimidating is to keep the vocal tract consistently extended.
Affiliation(s)
- Andrey Anikin
  - Division of Cognitive Science, Lund University, Lund, Sweden
  - ENES Sensory Neuro-Ethology lab, CRNL, Jean Monnet University of Saint Étienne, UMR 5293, 42023, St-Étienne, France
- Katarzyna Pisanski
  - ENES Sensory Neuro-Ethology lab, CRNL, Jean Monnet University of Saint Étienne, UMR 5293, 42023, St-Étienne, France
- David Reby
  - ENES Sensory Neuro-Ethology lab, CRNL, Jean Monnet University of Saint Étienne, UMR 5293, 42023, St-Étienne, France
7
Mexican Emotional Speech Database Based on Semantic, Frequency, Familiarity, Concreteness, and Cultural Shaping of Affective Prosody. Data 2021. [DOI: 10.3390/data6120130] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/09/2023] Open
Abstract
In this paper, the Mexican Emotional Speech Database (MESD) that contains single-word emotional utterances for anger, disgust, fear, happiness, neutral and sadness with adult (male and female) and child voices is described. To validate the emotional prosody of the uttered words, a cubic Support Vector Machines classifier was trained on the basis of prosodic, spectral and voice quality features for each case study: (1) male adult, (2) female adult and (3) child. In addition, cultural, semantic, and linguistic shaping of emotional expression was assessed by statistical analysis. This study was registered at BioMed Central and is part of the implementation of a published study protocol. Mean emotional classification accuracies yielded 93.3%, 89.4% and 83.3% for male, female and child utterances respectively. Statistical analysis emphasized the shaping of emotional prosodies by semantic and linguistic features. A cultural variation in emotional expression was highlighted by comparing the MESD with the INTERFACE for Castilian Spanish database. The MESD provides reliable content for linguistic emotional prosody shaped by the Mexican cultural environment. In order to facilitate further investigations, a corpus controlled for linguistic features and emotional semantics, as well as one containing words repeated across voices and emotions are provided. The MESD is made freely available.
8
Lim Y, Toutios A, Bliesener Y, Tian Y, Lingala SG, Vaz C, Sorensen T, Oh M, Harper S, Chen W, Lee Y, Töger J, Monteserin ML, Smith C, Godinez B, Goldstein L, Byrd D, Nayak KS, Narayanan SS. A multispeaker dataset of raw and reconstructed speech production real-time MRI video and 3D volumetric images. Sci Data 2021; 8:187. [PMID: 34285240 PMCID: PMC8292336 DOI: 10.1038/s41597-021-00976-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 02/15/2021] [Accepted: 06/22/2021] [Indexed: 12/11/2022] Open
Abstract
Real-time magnetic resonance imaging (RT-MRI) of human speech production is enabling significant advances in speech science, linguistics, bio-inspired speech technology development, and clinical applications. Easy access to RT-MRI is however limited, and comprehensive datasets with broad access are needed to catalyze research across numerous domains. The imaging of the rapidly moving articulators and dynamic airway shaping during speech demands high spatio-temporal resolution and robust reconstruction methods. Further, while reconstructed images have been published, to-date there is no open dataset providing raw multi-coil RT-MRI data from an optimized speech production experimental setup. Such datasets could enable new and improved methods for dynamic image reconstruction, artifact correction, feature extraction, and direct extraction of linguistically-relevant biomarkers. The present dataset offers a unique corpus of 2D sagittal-view RT-MRI videos along with synchronized audio for 75 participants performing linguistically motivated speech tasks, alongside the corresponding public domain raw RT-MRI data. The dataset also includes 3D volumetric vocal tract MRI during sustained speech sounds and high-resolution static anatomical T2-weighted upper airway MRI for each participant.
Affiliation(s)
- Yongwan Lim
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Asterios Toutios
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Yannick Bliesener
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Ye Tian
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Sajan Goud Lingala
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Colin Vaz
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Tanner Sorensen
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Miran Oh
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Sarah Harper
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Weiyi Chen
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Yoonjeong Lee
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Johannes Töger
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Mairym Lloréns Monteserin
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Caitlin Smith
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Bianca Godinez
  - Department of Linguistics, California State University Long Beach, Long Beach, California, USA
- Louis Goldstein
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Dani Byrd
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
- Krishna S Nayak
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
- Shrikanth S Narayanan
  - Ming Hsieh Department of Electrical and Computer Engineering, Viterbi School of Engineering, University of Southern California, Los Angeles, California, USA
  - Department of Linguistics, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, California, USA
9
Ruthven M, Miquel ME, King AP. Deep-learning-based segmentation of the vocal tract and articulators in real-time magnetic resonance images of speech. Computer Methods and Programs in Biomedicine 2021; 198:105814. [PMID: 33197740 PMCID: PMC7732702 DOI: 10.1016/j.cmpb.2020.105814] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/08/2020] [Accepted: 10/19/2020] [Indexed: 06/01/2023]
Abstract
BACKGROUND AND OBJECTIVE: Magnetic resonance (MR) imaging is increasingly used in studies of speech as it enables non-invasive visualisation of the vocal tract and articulators, thus providing information about their shape, size, motion and position. Extraction of this information for quantitative analysis is achieved using segmentation. Methods have been developed to segment the vocal tract; however, none of these also fully segment any articulators. The objective of this work was to develop a method to fully segment multiple groups of articulators as well as the vocal tract in two-dimensional MR images of speech, thus overcoming the limitations of existing methods.
METHODS: Five speech MR image sets (392 MR images in total), each of a different healthy adult volunteer, were used in this work. A fully convolutional network with an architecture similar to the original U-Net was developed to segment the following six regions in the image sets: the head, soft palate, jaw, tongue, vocal tract and tooth space. A five-fold cross-validation was performed to investigate the segmentation accuracy and generalisability of the network. The segmentation accuracy was assessed using standard overlap-based metrics (Dice coefficient and general Hausdorff distance) and a novel clinically relevant metric based on velopharyngeal closure.
RESULTS: The segmentations created by the method had a median Dice coefficient of 0.92 and a median general Hausdorff distance of 5 mm. The method segmented the head most accurately (median Dice coefficient of 0.99), and the soft palate and tooth space least accurately (median Dice coefficients of 0.92 and 0.93 respectively). The segmentations created by the method correctly showed 90% (27 out of 30) of the velopharyngeal closures in the MR image sets.
CONCLUSIONS: An automatic method to fully segment multiple groups of articulators as well as the vocal tract in two-dimensional MR images of speech was successfully developed. The method is intended for use in clinical and non-clinical speech studies which involve quantitative analysis of the shape, size, motion and position of the vocal tract and articulators. In addition, a novel clinically relevant metric for assessing the accuracy of vocal tract and articulator segmentation methods was developed.
Affiliation(s)
- Matthieu Ruthven
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom
  - School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London SE1 7EH, United Kingdom
- Marc E Miquel
  - Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom
  - Centre for Advanced Cardiovascular Imaging, NIHR Barts Biomedical Research Centre, William Harvey Institute, Queen Mary University of London, London EC1M 6BQ, United Kingdom
- Andrew P King
  - School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London SE1 7EH, United Kingdom