1
Veltrup R, Angerer S, Gessner E, Matheis F, Sümmerer E, Henningson JO, Döllinger M, Semmler M. Three-Dimensional Analysis of Vocal Fold Oscillations: Correlating Superior and Medial Surface Dynamics Using Ex Vivo Human Hemilarynges. Bioengineering (Basel) 2024; 11:977. [PMID: 39451353 PMCID: PMC11505270 DOI: 10.3390/bioengineering11100977] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 09/24/2024] [Accepted: 09/26/2024] [Indexed: 10/26/2024] Open
Abstract
The primary acoustic signal of the voice is generated by the complex oscillation of the vocal folds (VFs), whereby physicians can barely examine the medial VF surface due to its anatomical inaccessibility. In this study, we investigated possibilities to infer medial surface dynamics by analyzing correlations in the oscillatory behavior of the superior and medial VF surfaces of four human hemilarynges, each in 24 different combinations of flow rate, VF adduction, and elongation. The two surfaces were recorded synchronously during sustained phonation using two high-speed camera setups and were subsequently 3D-reconstructed. The 3D surface parameters of mean and maximum velocities and displacements and general phonation parameters were calculated. The VF oscillations were also analyzed using empirical eigenfunctions (EEFs) and mucosal wave propagation, calculated from medial surface trajectories. Strong linear correlations were found between the 3D parameters of the superior and medial VF surfaces, ranging from 0.8 to 0.95. The linear regressions showed similar values for the maximum velocities at all hemilarynges (0.69-0.9), indicating the most promising parameter for predicting the medial surface. Since excessive VF velocities are suspected to cause phono-trauma and VF polyps, this parameter could provide added value to laryngeal diagnostics in the future.
Affiliation(s)
- Reinhard Veltrup
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Susanne Angerer
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Elena Gessner
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Friederike Matheis
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Emily Sümmerer
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Jann-Ole Henningson
  - Department of Computer Science, Friedrich-Alexander-University Erlangen-Nürnberg, 91054 Erlangen, Germany
- Michael Döllinger
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
- Marion Semmler
  - University Hospital Erlangen, Medical School, Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology Head and Neck Surgery, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91054 Erlangen, Germany
2
Zhang Y, Pu T, Zhou C, Cai H. An Improved Glottal Flow Model Based on Seq2Seq LSTM for Simulation of Vocal Fold Vibration. J Voice 2024; 38:983-992. [PMID: 35534328 DOI: 10.1016/j.jvoice.2022.03.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 03/29/2022] [Accepted: 03/30/2022] [Indexed: 10/18/2022]
Abstract
OBJECTIVES: An improved data-driven glottal flow model for fluid-structure interaction (FSI) simulation of the vocal fold vibration is proposed in this paper. This model aims to improve the prediction performance of the previously developed deep neural network (DNN) based empirical flow model (EFM) [1] on accuracy and efficiency. METHODS: A Seq2Seq long short-term memory (LSTM) network is employed in the present model to infer the flow rate and pressure distribution from the subglottal pressure and cross-section area distribution of the glottis. The training data is collected from the generalized glottal shape library generated in Zhang et al. [1]. RESULTS AND CONCLUSIONS: Compared to the EFM, the present model not only discards the time-consuming optimization process, but also drastically reduces the errors, therefore the prediction performance can be greatly improved. The present model is evaluated by coupling with a solid dynamics solver for FSI simulation, and the results demonstrate a great improvement on accuracy and efficiency.
Affiliation(s)
- Yang Zhang
  - College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
- Tianmei Pu
  - College of General Aviation and Flight, Nanjing University of Aeronautics and Astronautics, Nanjing 213300, China
- Chunhua Zhou
  - Department of Aerodynamics, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
- Hongming Cai
  - College of Astronautics, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
3
Ghasemzadeh H, Hillman RE, Mehta DD. Toward Generalizable Machine Learning Models in Speech, Language, and Hearing Sciences: Estimating Sample Size and Reducing Overfitting. Journal of Speech, Language, and Hearing Research: JSLHR 2024; 67:753-781. [PMID: 38386017 PMCID: PMC11005022 DOI: 10.1044/2023_jslhr-23-00273] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2023] [Revised: 08/29/2023] [Accepted: 12/19/2023] [Indexed: 02/23/2024]
Abstract
PURPOSE: Many studies using machine learning (ML) in speech, language, and hearing sciences rely upon cross-validations with single data splitting. This study's first purpose is to provide quantitative evidence that would incentivize researchers to instead use the more robust data splitting method of nested k-fold cross-validation. The second purpose is to present methods and MATLAB code to perform power analysis for ML-based analysis during the design of a study. METHOD: First, the significant impact of different cross-validations on ML outcomes was demonstrated using real-world clinical data. Then, Monte Carlo simulations were used to quantify the interactions among the employed cross-validation method, the discriminative power of features, the dimensionality of the feature space, the dimensionality of the model, and the sample size. Four different cross-validation methods (single holdout, 10-fold, train-validation-test, and nested 10-fold) were compared based on the statistical power and confidence of the resulting ML models. Distributions of the null and alternative hypotheses were used to determine the minimum required sample size for obtaining a statistically significant outcome (5% significance) with 80% power. Statistical confidence of the model was defined as the probability of correct features being selected for inclusion in the final model. RESULTS: ML models generated based on the single holdout method had very low statistical power and confidence, leading to overestimation of classification accuracy. Conversely, the nested 10-fold cross-validation method resulted in the highest statistical confidence and power while also providing an unbiased estimate of accuracy. The required sample size using the single holdout method could be 50% higher than what would be needed if nested k-fold cross-validation were used. Statistical confidence in the model based on nested k-fold cross-validation was as much as four times higher than the confidence obtained with the single holdout-based model. A computational model, MATLAB code, and lookup tables are provided to assist researchers with estimating the minimum sample size needed during study design. CONCLUSION: The adoption of nested k-fold cross-validation is critical for unbiased and robust ML studies in the speech, language, and hearing sciences. SUPPLEMENTAL MATERIAL: https://doi.org/10.23641/asha.25237045.
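As a rough illustration of the nested k-fold scheme this abstract recommends, the Python sketch below uses outer folds for performance estimation and inner folds for model selection, so the outer test fold never influences the choice of model. It is not the authors' MATLAB code: the function names (`k_fold_indices`, `nested_k_fold`), the `fit_score` callback, and the toy threshold classifier are hypothetical and chosen only for this example.

```python
import random
from statistics import mean


def k_fold_indices(indices, k, rng):
    """Shuffle indices and deal them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_k_fold(n_samples, candidates, fit_score, k_outer=10, k_inner=10, seed=0):
    """Return one held-out score per outer fold.

    fit_score(train_idx, test_idx, candidate) -> score is a hypothetical
    callback that trains with one hyperparameter candidate and scores it.
    """
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), k_outer, rng)
    outer_scores = []
    for i, test_idx in enumerate(outer_folds):
        train_idx = [j for f, fold in enumerate(outer_folds) if f != i for j in fold]
        # Inner loop: choose the candidate using the outer-training data only.
        inner_folds = k_fold_indices(train_idx, k_inner, rng)

        def inner_score(candidate):
            scores = []
            for m, val_idx in enumerate(inner_folds):
                fit_idx = [j for f, fold in enumerate(inner_folds) if f != m for j in fold]
                scores.append(fit_score(fit_idx, val_idx, candidate))
            return mean(scores)

        best = max(candidates, key=inner_score)
        # The outer test fold is used once, after model selection is finished.
        outer_scores.append(fit_score(train_idx, test_idx, best))
    return outer_scores


# Toy data (assumed for illustration): one feature, label is 1 above 0.5.
data_rng = random.Random(1)
X = [data_rng.random() for _ in range(100)]
y = [int(x > 0.5) for x in X]


def threshold_accuracy(train_idx, test_idx, threshold):
    """Accuracy of a fixed-threshold classifier; train_idx is unused in this toy."""
    return mean(int((X[j] > threshold) == bool(y[j])) for j in test_idx)


scores = nested_k_fold(len(X), [0.1, 0.3, 0.5, 0.7, 0.9], threshold_accuracy,
                       k_outer=5, k_inner=4)
print(f"nested CV accuracy: {mean(scores):.2f}")
```

A single holdout split would correspond to running only one outer iteration with no inner loop, which is the setting the abstract reports as having the lowest statistical power and confidence.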
Affiliation(s)
- Hamzeh Ghasemzadeh
  - Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital, Boston
  - Department of Surgery, Harvard Medical School, Boston, MA
  - Department of Communicative Sciences and Disorders, Michigan State University, East Lansing
- Robert E. Hillman
  - Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital, Boston
  - Department of Surgery, Harvard Medical School, Boston, MA
  - Speech and Hearing Bioscience and Technology, Division of Medical Sciences, Harvard Medical School, Boston, MA
  - MGH Institute of Health Professions, Boston, MA
- Daryush D. Mehta
  - Center for Laryngeal Surgery and Voice Rehabilitation, Massachusetts General Hospital, Boston
  - Department of Surgery, Harvard Medical School, Boston, MA
  - Speech and Hearing Bioscience and Technology, Division of Medical Sciences, Harvard Medical School, Boston, MA
  - MGH Institute of Health Professions, Boston, MA
4
Donhauser J, Tur B, Döllinger M. Neural network-based estimation of biomechanical vocal fold parameters. Front Physiol 2024; 15:1282574. [PMID: 38449783 PMCID: PMC10916882 DOI: 10.3389/fphys.2024.1282574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2023] [Accepted: 01/09/2024] [Indexed: 03/08/2024] Open
Abstract
Vocal fold (VF) vibrations are the primary source of human phonation. High-speed video (HSV) endoscopy enables the computation of descriptive VF parameters for assessment of physiological properties of laryngeal dynamics, i.e., the vibration of the VFs. However, underlying biomechanical factors responsible for physiological and disordered VF vibrations cannot be accessed. In contrast, physically based numerical VF models reveal insights into the organ's oscillations, which remain inaccessible through endoscopy. To estimate biomechanical properties, previous research has fitted subglottal pressure-driven mass-spring-damper systems, as inverse problem to the HSV-recorded VF trajectories, by global optimization of the numerical model. A neural network trained on the numerical model may be used as a substitute for computationally expensive optimization, yielding a fast evaluating surrogate of the biomechanical inverse problem. This paper proposes a convolutional recurrent neural network (CRNN)-based architecture trained on regression of a physiological-based biomechanical six-mass model (6 MM). To compare with previous research, the underlying biomechanical factor "subglottal pressure" prediction was tested against 288 HSV ex vivo porcine recordings. The contributions of this work are two-fold: first, the presented CRNN with the 6 MM handles multiple trajectories along the VFs, which allows for investigations on local changes in VF characteristics. Second, the network was trained to reproduce further important biomechanical model parameters like VF mass and stiffness on synthetic data. Unlike in a previous work, the network in this study is therefore an entire surrogate of the inverse problem, which allowed for explicit computation of the fitted model using our approach. 
The presented approach achieves a best-case mean absolute error (MAE) of 133 Pa (13.9%) in subglottal pressure prediction with 76.6% correlation on experimental data and a re-estimated fundamental frequency MAE of 15.9 Hz (9.9%). In-detail training analysis revealed subglottal pressure as the most learnable parameter. With the physiological-based model design and advances in fast parameter prediction, this work is a next step in biomechanical VF model fitting and the estimation of laryngeal kinematics.
Affiliation(s)
- Jonas Donhauser
  - Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
5
Zhang Z. Voice Feature Selection to Improve Performance of Machine Learning Models for Voice Production Inversion. J Voice 2023; 37:479-485. [PMID: 33849760 PMCID: PMC8502179 DOI: 10.1016/j.jvoice.2021.03.004] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/03/2020] [Revised: 02/24/2021] [Accepted: 03/01/2021] [Indexed: 11/19/2022]
Abstract
OBJECTIVE: Estimation of physiological control parameters of the vocal system from the produced voice outcome has important applications in clinical management of voice disorders. Previously we developed a simulation-based neural network for estimation of vocal fold geometry, mechanical properties, and subglottal pressure from voice outcome features that characterize the acoustics of the produced voice. The goals of this study are to (1) explore the possibility of improving the estimation accuracy of physiological control parameters by including voice outcome features characterizing vocal fold vibration; and (2) identify voice feature sets that optimize both estimation accuracy and robustness to measurement noise. METHODS: Feedforward neural networks are trained to solve the inversion problem of estimating the physiological control parameters of a three-dimensional body-cover vocal fold model from different sets of voice outcome features that characterize the simulated voice acoustics, glottal flow, and vocal fold vibration. A sensitivity analysis is then performed to evaluate the contribution of individual voice features to the overall performance of the neural networks in estimating the physiologic control parameters. RESULTS AND CONCLUSIONS: While including voice outcome features characterizing vocal fold vibration increases estimation accuracy, it also reduces the network's robustness to measurement noise, due to high sensitivity of network performance to voice outcome features measuring the absolute amplitudes of the glottal flow and area waveforms, which are also difficult to measure accurately in practical applications. By excluding such glottal flow-based features and replacing glottal area-based features by their normalized counterparts, we are able to significantly improve both estimation accuracy and robustness to noise. We further show that similar estimation accuracy and robustness can be achieved with an even smaller set of voice outcome features by excluding features of small sensitivity.
Affiliation(s)
- Zhaoyan Zhang
  - Department of Head and Neck Surgery, University of California, Los Angeles, 31-24 Rehabilitation Center, Los Angeles, California
6
Zhang Z. Estimating subglottal pressure and vocal fold adduction from the produced voice in a single-subject study (L). The Journal of the Acoustical Society of America 2022; 151:1337. [PMID: 35232110 PMCID: PMC9013286 DOI: 10.1121/10.0009616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Revised: 01/31/2022] [Accepted: 02/02/2022] [Indexed: 06/14/2023]
Abstract
We previously reported a simulation-based neural network for estimating vocal fold properties and subglottal pressure from the produced voice. This study aims to validate this neural network in a single-human subject study. The results showed reasonable accuracy of the neural network in estimating the subglottal pressure in this particular human subject. The neural network was also able to qualitatively differentiate soft and loud speech conditions regarding differences in the subglottal pressure and degree of vocal fold adduction. This simulation-based neural network has potential applications in identifying unhealthy vocal behavior and monitoring progress of voice therapy or vocal training.
Affiliation(s)
- Zhaoyan Zhang
  - Department of Head and Neck Surgery, University of California, Los Angeles, 31-24 Rehab Center, 1000 Veteran Avenue, Los Angeles, California 90095-1794, USA
7
B T B, Kapoor S, Chen JM. Estimating vocal tract geometry from acoustic impedance using deep neural network. JASA Express Letters 2022; 2:034801. [PMID: 36154632 DOI: 10.1121/10.0009599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/16/2023]
Abstract
A data-driven approach using artificial neural networks is proposed to address the classic inverse area function problem, i.e., to determine the vocal tract geometry (modelled as a tube of nonuniform cylindrical cross-sections) from the vocal tract acoustic impedance spectrum. The predicted cylindrical radii and the actual radii were found to have high correlation in the three- and four-cylinder model (Pearson coefficient (ρ) and Lin concordance coefficient (ρc) exceeded 95%); however, for the six-cylinder model, the correlation was low (ρ around 75% and ρc around 69%). Upon standardizing the impedance value, the correlation improved significantly for all cases (ρ and ρc exceeded 90%).
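The two agreement measures named in this abstract differ in what they penalize: Pearson's ρ measures linear association only, whereas Lin's concordance coefficient, ρc = 2·s_xy / (s_x² + s_y² + (x̄ − ȳ)²), also penalizes offsets between predicted and actual values. The Python sketch below illustrates this with made-up radii; the sample values and function names are assumptions for the example, not data or code from the paper.

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical radii in cm: predictions are perfectly linear in the actual
# values but shifted by a constant, so Pearson stays at 1 while the
# concordance coefficient drops below 1.
actual = [0.5, 1.0, 1.5, 2.0]
predicted = [r + 0.5 for r in actual]
print(pearson_r(actual, predicted), lin_ccc(actual, predicted))
```

Reporting both coefficients, as the abstract does, therefore separates "the predictions track the truth" from "the predictions match the truth".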
Affiliation(s)
- Balamurali B T
  - Singapore University of Technology and Design, Singapore
- Saumitra Kapoor
  - Singapore University of Technology and Design, Singapore
- Jer-Ming Chen
  - Singapore University of Technology and Design, Singapore
8
Ibarra EJ, Parra JA, Alzamendi GA, Cortés JP, Espinoza VM, Mehta DD, Hillman RE, Zañartu M. Estimation of Subglottal Pressure, Vocal Fold Collision Pressure, and Intrinsic Laryngeal Muscle Activation From Neck-Surface Vibration Using a Neural Network Framework and a Voice Production Model. Front Physiol 2021; 12:732244. [PMID: 34539451 PMCID: PMC8440844 DOI: 10.3389/fphys.2021.732244] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/28/2021] [Accepted: 08/09/2021] [Indexed: 11/23/2022] Open
Abstract
The ambulatory assessment of vocal function can be significantly enhanced by having access to physiologically based features that describe underlying pathophysiological mechanisms in individuals with voice disorders. This type of enhancement can improve methods for the prevention, diagnosis, and treatment of behaviorally based voice disorders. Unfortunately, the direct measurement of important vocal features such as subglottal pressure, vocal fold collision pressure, and laryngeal muscle activation is impractical in laboratory and ambulatory settings. In this study, we introduce a method to estimate these features during phonation from a neck-surface vibration signal through a framework that integrates a physiologically relevant model of voice production and machine learning tools. The signal from a neck-surface accelerometer is first processed using subglottal impedance-based inverse filtering to yield an estimate of the unsteady glottal airflow. Seven aerodynamic and acoustic features are extracted from the neck surface accelerometer and an optional microphone signal. A neural network architecture is selected to provide a mapping between the seven input features and subglottal pressure, vocal fold collision pressure, and cricothyroid and thyroarytenoid muscle activation. This non-linear mapping is trained solely with 13,000 Monte Carlo simulations of a voice production model that utilizes a symmetric triangular body-cover model of the vocal folds. The performance of the method was compared against laboratory data from synchronous recordings of oral airflow, intraoral pressure, microphone, and neck-surface vibration in 79 vocally healthy female participants uttering consecutive /pæ/ syllable strings at comfortable, loud, and soft levels. 
The mean absolute error and root-mean-square error for estimating the mean subglottal pressure were 191 Pa (1.95 cm H2O) and 243 Pa (2.48 cm H2O), respectively, which are comparable with previous studies but with the key advantage of not requiring subject-specific training and yielding more output measures. The validation of vocal fold collision pressure and laryngeal muscle activation was performed with synthetic values as reference. These initial results provide valuable insight for further vocal fold model refinement and constitute a proof of concept that the proposed machine learning method is a feasible option for providing physiologically relevant measures for laboratory and ambulatory assessment of vocal function.
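The error metrics and unit pairs quoted in this abstract can be reproduced with a few lines of Python. The sketch below defines the mean absolute error and root-mean-square error and checks the Pa to cm H2O conversion using the conventional factor of 98.0665 Pa per cm H2O; the sample pressure values are invented for illustration and are not data from the study.

```python
from math import sqrt

PA_PER_CM_H2O = 98.0665  # conventional centimetre of water, in pascals


def mae(estimates, references):
    """Mean absolute error between paired estimates and references."""
    return sum(abs(e - r) for e, r in zip(estimates, references)) / len(estimates)


def rmse(estimates, references):
    """Root-mean-square error between paired estimates and references."""
    return sqrt(sum((e - r) ** 2 for e, r in zip(estimates, references)) / len(estimates))


def pa_to_cm_h2o(pascals):
    """Convert a pressure from pascals to centimetres of water."""
    return pascals / PA_PER_CM_H2O


# Unit check on the two error values reported in the abstract.
print(round(pa_to_cm_h2o(191), 2), round(pa_to_cm_h2o(243), 2))  # 1.95 2.48

# Hypothetical subglottal pressures in Pa (not data from the study).
estimated = [800.0, 1000.0, 650.0]
reference = [700.0, 1100.0, 600.0]
print(mae(estimated, reference), rmse(estimated, reference))
```

Because RMSE weights large errors more heavily, it is never smaller than MAE on the same data, which is consistent with the 243 Pa versus 191 Pa figures above.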
Affiliation(s)
- Emiro J. Ibarra
  - Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
  - School of Electrical Engineering, University of the Andes, Mérida, Venezuela
- Jesús A. Parra
  - Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
- Gabriel A. Alzamendi
  - Institute for Research and Development on Bioengineering and Bioinformatics, Consejo Nacional de Investigaciones Científicas y Técnicas - Universidad Nacional de Entre Ríos, Oro Verde, Argentina
- Juan P. Cortés
  - Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
  - Center for Laryngeal Surgery and Voice Rehabilitation Laboratory, Massachusetts General Hospital–Harvard Medical School, Boston, MA, United States
- Víctor M. Espinoza
  - Department of Sound, Faculty of Arts, University of Chile, Santiago, Chile
- Daryush D. Mehta
  - Center for Laryngeal Surgery and Voice Rehabilitation Laboratory, Massachusetts General Hospital–Harvard Medical School, Boston, MA, United States
- Robert E. Hillman
  - Center for Laryngeal Surgery and Voice Rehabilitation Laboratory, Massachusetts General Hospital–Harvard Medical School, Boston, MA, United States
- Matías Zañartu
  - Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile
9
Hadwin PJ, Erath BD, Peterson SD. The influence of flow model selection on finite element model parameter estimation using Bayesian inference. JASA Express Letters 2021; 1:045204. [PMID: 34136884 PMCID: PMC8182970 DOI: 10.1121/10.0004260] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/21/2021] [Accepted: 03/18/2021] [Indexed: 06/12/2023]
Abstract
Recently, Bayesian estimation coupled with finite element modeling has been demonstrated as a viable tool for estimating vocal fold material properties from kinematic information obtained via high-speed video recordings. In this article, the sensitivity of the parameter estimations to the employed fluid model is explored by considering Bernoulli and one-dimensional viscous fluid flow models. Simulation results indicate that prescribing an ad hoc separation location for the Bernoulli flow model can lead to large estimate biases, whereas including the separation location as an estimated parameter leads to results comparable to that of the viscous fluid flow model.
Affiliation(s)
- Paul J Hadwin
  - Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
- Byron D Erath
  - Department of Mechanical and Aeronautical Engineering, Clarkson University, Potsdam, New York 13699, USA
- Sean D Peterson
  - Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada
10
Li Z, Chen Y, Chang S, Rousseau B, Luo H. A one-dimensional flow model enhanced by machine learning for simulation of vocal fold vibration. The Journal of the Acoustical Society of America 2021; 149:1712. [PMID: 33765799 PMCID: PMC7954577 DOI: 10.1121/10.0003561] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/02/2020] [Revised: 01/25/2021] [Accepted: 02/01/2021] [Indexed: 06/02/2023]
Abstract
A one-dimensional (1D) unsteady and viscous flow model that is derived from the momentum and mass conservation equations is described, and to enhance this physics-based model, a machine learning approach is used to determine the unknown modeling parameters. Specifically, an idealized larynx model is constructed and ten cases of three-dimensional (3D) fluid-structure interaction (FSI) simulations are performed. The flow data are then extracted to train the 1D flow model using a sparse identification approach for nonlinear dynamical systems. As a result of training, we obtain the analytical expressions for the entrance effect and pressure loss in the glottis, which are then incorporated in the flow model to conveniently handle different glottal shapes due to vocal fold vibration. We apply the enhanced 1D flow model in the FSI simulation of both idealized vocal fold geometries and subject-specific anatomical geometries reconstructed from the magnetic resonance imaging images of rabbits' larynges. The 1D flow model is evaluated in both of these setups and shown to have robust performance. Therefore, it provides a fast simulation tool that is superior to the previous 1D models.
Affiliation(s)
- Zheng Li
  - Department of Mechanical Engineering, Vanderbilt University, 2301 Vanderbilt Place, Nashville, Tennessee 37235-1592, USA
- Ye Chen
  - Department of Mechanical Engineering, Vanderbilt University, 2301 Vanderbilt Place, Nashville, Tennessee 37235-1592, USA
- Siyuan Chang
  - Department of Mechanical Engineering, Vanderbilt University, 2301 Vanderbilt Place, Nashville, Tennessee 37235-1592, USA
- Bernard Rousseau
  - Department of Communication Science and Disorders, University of Pittsburgh, Pittsburgh, Pennsylvania 15213, USA
- Haoxiang Luo
  - Department of Mechanical Engineering, Vanderbilt University, 2301 Vanderbilt Place, Nashville, Tennessee 37235-1592, USA
11
Gómez P, Kist AM, Schlegel P, Berry DA, Chhetri DK, Dürr S, Echternach M, Johnson AM, Kniesburges S, Kunduk M, Maryn Y, Schützenberger A, Verguts M, Döllinger M. BAGLS, a multihospital Benchmark for Automatic Glottis Segmentation. Sci Data 2020; 7:186. [PMID: 32561845 PMCID: PMC7305104 DOI: 10.1038/s41597-020-0526-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 6.5] [Received: 09/09/2019] [Accepted: 05/15/2020] [Indexed: 02/06/2023] Open
Abstract
Laryngeal videoendoscopy is one of the main tools in clinical examinations for voice disorders and voice research. Using high-speed videoendoscopy, it is possible to fully capture the vocal fold oscillations; however, processing the recordings typically involves a time-consuming segmentation of the glottal area by trained experts. Even though automatic methods have been proposed and the task is particularly suited for deep learning methods, there are no public datasets and benchmarks available to compare methods and to allow training of generalizing deep learning models. In an international collaboration of researchers from seven institutions from the EU and USA, we have created BAGLS, a large, multihospital dataset of 59,250 high-speed videoendoscopy frames with individually annotated segmentation masks. The frames are based on 640 recordings of healthy and disordered subjects that were recorded with varying technical equipment by numerous clinicians. The BAGLS dataset will allow an objective comparison of glottis segmentation methods and will enable interested researchers to train their own models and compare their methods.
Collapse
Affiliation(s)
- Pablo Gómez
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany.
| | - Andreas M Kist
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany.
| | - Patrick Schlegel
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany
| | - David A Berry
- Department of Head and Neck Surgery, David Geffen School of Medicine at the University of California, Los Angeles, Los Angeles, California, USA
- Dinesh K Chhetri
- Department of Head and Neck Surgery, David Geffen School of Medicine at the University of California, Los Angeles, Los Angeles, California, USA
- Stephan Dürr
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany
- Matthias Echternach
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Munich University Hospital (LMU), Munich, Germany
- Aaron M Johnson
- NYU Voice Center, Department of Otolaryngology - Head and Neck Surgery, New York University School of Medicine, New York, New York, USA
- Stefan Kniesburges
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany
- Melda Kunduk
- Department of Communication Sciences and Disorders, Louisiana State University, Baton Rouge, Louisiana, USA
- Youri Maryn
- European Institute for ORL-HNS, Department of Otorhinolaryngology and Head & Neck Surgery, Sint-Augustinus GZA, Wilrijk, Belgium
- Department of Speech, Language and Hearing Sciences, University of Ghent, Ghent, Belgium
- Faculty of Education, Health and Social Work, University College Ghent, Ghent, Belgium
- Faculty of Psychology and Educational Sciences, School of Logopedics, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
- Faculty of Medicine and Health Sciences, University of Antwerp, Antwerp, Belgium
- Anne Schützenberger
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany
- Monique Verguts
- European Institute for ORL-HNS, Department of Otorhinolaryngology and Head & Neck Surgery, Sint-Augustinus GZA, Wilrijk, Belgium
- Department of Otorhinolaryngology and Voice Disorders, Diest General Hospital, Diest, Belgium
- Michael Döllinger
- Division of Phoniatrics and Pediatric Audiology, Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Waldstraße 1, 91054, Erlangen, Germany
|
12
|
Zhang Z. Estimation of vocal fold physiology from voice acoustics using machine learning. The Journal of the Acoustical Society of America 2020; 147:EL264. [PMID: 32237804 PMCID: PMC7075716 DOI: 10.1121/10.0000927] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 12/15/2019] [Revised: 03/01/2020] [Accepted: 03/03/2020] [Indexed: 05/27/2023]
Abstract
The goal of this study is to estimate vocal fold geometry, stiffness, position, and subglottal pressure from voice acoustics, toward clinical and other voice technology applications. Unlike previous voice inversion research that often uses lumped-element models of phonation, this study explores the feasibility of voice inversion using data generated from a three-dimensional voice production model. Neural networks are trained to estimate vocal fold properties and subglottal pressure from voice features extracted from the simulation data. Results show reasonably good estimation accuracy, particularly for vocal fold properties with a consistent global effect on voice production, and reasonable agreement with excised human larynx experiment.
Affiliation(s)
- Zhaoyan Zhang
- Department of Head and Neck Surgery, University of California, Los Angeles, 31-24 Rehab Center, 1000 Veteran Avenue, Los Angeles, California 90095-1794,
|
13
|
Abstract
This review provides a comprehensive compilation, from a digital image processing point of view, of the most important techniques currently developed to characterize and quantify the vibration behaviour of the vocal folds, along with a detailed description of the laryngeal image modalities currently used in the clinic. The review presents an overview of the most significant glottal-gap segmentation and facilitative playback techniques used in the literature for this purpose, and shows the drawbacks and challenges that remain unsolved in developing robust tools for analysing vocal fold vibration based on digital image processing.
|
14
|
Deng JJ, Hadwin PJ, Peterson SD. The effect of high-speed videoendoscopy configuration on reduced-order model parameter estimates by Bayesian inference. The Journal of the Acoustical Society of America 2019; 146. [PMID: 31472542 PMCID: PMC6715443 DOI: 10.1121/1.5124256#suppl] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/09/2023]
Abstract
Bayesian inference has been previously demonstrated as a viable inverse analysis tool for estimating subject-specific reduced-order model parameters and uncertainties. However, previous studies have relied upon simulated glottal area waveforms with superimposed random noise as the measurement. In practice, high-speed videoendoscopy is used to measure glottal area, which introduces practical imaging effects not captured in simulated data, such as viewing angle, frame rate, and camera resolution. Herein, high-speed videos of the vocal folds were approximated by recording the trajectories of physical vocal fold models controlled by a symmetric body-cover model. Twenty videos were recorded, varying subglottal pressure, cricothyroid activation, and viewing angle, with frame rate and video resolution varied by digital video manipulation. Bayesian inference was used to estimate subglottal pressure and cricothyroid activation from glottal area waveforms extracted from the videos. The resulting estimates show off-axis viewing of 10° can lead to a 10% bias in the estimated subglottal pressure. A viewing model is introduced such that viewing angle can be included as an estimated parameter, which alleviates estimate bias. Frame rate and pixel resolution were found to primarily affect uncertainty of parameter estimates up to a limit where spatial and temporal resolutions were too poor to resolve the glottal area. Since many high-speed cameras have the ability to sacrifice spatial for temporal resolution, the findings herein suggest that Bayesian inference studies employing high-speed video should increase temporal resolutions at the expense of spatial resolution for reduced estimate uncertainties.
Affiliation(s)
- Jonathan J Deng
- Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
- Paul J Hadwin
- Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
- Sean D Peterson
- Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
|
15
|
Deng JJ, Hadwin PJ, Peterson SD. The effect of high-speed videoendoscopy configuration on reduced-order model parameter estimates by Bayesian inference. The Journal of the Acoustical Society of America 2019; 146:1492. [PMID: 31472542 PMCID: PMC6715443 DOI: 10.1121/1.5124256] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/13/2019] [Revised: 08/07/2019] [Accepted: 08/09/2019] [Indexed: 06/10/2023]
Abstract
Bayesian inference has been previously demonstrated as a viable inverse analysis tool for estimating subject-specific reduced-order model parameters and uncertainties. However, previous studies have relied upon simulated glottal area waveforms with superimposed random noise as the measurement. In practice, high-speed videoendoscopy is used to measure glottal area, which introduces practical imaging effects not captured in simulated data, such as viewing angle, frame rate, and camera resolution. Herein, high-speed videos of the vocal folds were approximated by recording the trajectories of physical vocal fold models controlled by a symmetric body-cover model. Twenty videos were recorded, varying subglottal pressure, cricothyroid activation, and viewing angle, with frame rate and video resolution varied by digital video manipulation. Bayesian inference was used to estimate subglottal pressure and cricothyroid activation from glottal area waveforms extracted from the videos. The resulting estimates show off-axis viewing of 10° can lead to a 10% bias in the estimated subglottal pressure. A viewing model is introduced such that viewing angle can be included as an estimated parameter, which alleviates estimate bias. Frame rate and pixel resolution were found to primarily affect uncertainty of parameter estimates up to a limit where spatial and temporal resolutions were too poor to resolve the glottal area. Since many high-speed cameras have the ability to sacrifice spatial for temporal resolution, the findings herein suggest that Bayesian inference studies employing high-speed video should increase temporal resolutions at the expense of spatial resolution for reduced estimate uncertainties.
Affiliation(s)
- Jonathan J Deng
- Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
- Paul J Hadwin
- Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
- Sean D Peterson
- Department of Mechanical and Mechatronics Engineering, University of Waterloo, Ontario N2L 3G1, Canada
|
16
|
Bayesian Inference of Vocal Fold Material Properties from Glottal Area Waveforms Using a 2D Finite Element Model. Applied Sciences-Basel 2019; 9. [PMID: 34046213 PMCID: PMC8153513 DOI: 10.3390/app9132735] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 02/06/2023]
Abstract
Bayesian estimation has been previously demonstrated as a viable method for developing subject-specific vocal fold models from observations of the glottal area waveform. These prior efforts, however, have been restricted to lumped-element fitting models and synthetic observation data. The indirect relationship between the lumped-element parameters and physical tissue properties renders extracting the latter from the former difficult. Herein we propose a finite element fitting model, which treats the vocal folds as a viscoelastic deformable body comprised of three layers. Using the glottal area waveforms generated by self-oscillating silicone vocal folds we directly estimate the elastic moduli, density, and other material properties of the silicone folds using a Bayesian importance sampling approach. Estimated material properties agree with the “ground truth” experimental values to within 3% for most parameters. By considering cases with varying subglottal pressure and medial compression we demonstrate that the finite element model coupled with Bayesian estimation is sufficiently sensitive to distinguish between experimental configurations. Additional information not available experimentally, namely, contact pressures, are extracted from the developed finite element models. The contact pressures are found to increase with medial compression and subglottal pressure, in agreement with expectation.
|