1
Chacon AM, Nguyen DD, Holik J, Döllinger M, Arias-Vergara T, Madill CJ. Vowel onset measures and their reliability, sensitivity and specificity: A systematic literature review. PLoS One 2024; 19:e0301786. [PMID: 38696537 PMCID: PMC11065290 DOI: 10.1371/journal.pone.0301786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2023] [Accepted: 03/21/2024] [Indexed: 05/04/2024]
Abstract
OBJECTIVE: To systematically evaluate the evidence for the reliability, sensitivity and specificity of existing measures of vowel-initial voice onset. METHODS: A literature search was conducted across electronic databases for published studies (MEDLINE, EMBASE, Scopus, Web of Science, CINAHL, PubMed Central, IEEE Xplore) and grey literature (ProQuest for unpublished dissertations) measuring vowel onset. Eligibility criteria included research of any study design type or context focused on measuring human voice onset on an initial vowel. Two independent reviewers were involved at each stage of title and abstract screening, data extraction and analysis. Data extracted included the measures used and their reliability, sensitivity and specificity. Risk of bias and certainty of evidence were assessed using GRADE as the data of interest were extracted. RESULTS: The search retrieved 6,983 records. Titles and abstracts were screened against the inclusion criteria by two independent reviewers, with a third reviewer responsible for conflict resolution. Thirty-five papers were included in the review, which identified five categories of voice onset measurement: auditory perceptual, acoustic, aerodynamic, physiological and visual imaging. Reliability was explored in 14 papers with varied reliability ratings, while sensitivity was rarely assessed, and no assessment of specificity was conducted across any of the included records. Certainty of evidence ranged from very low to moderate, with high variability in methodology and voice onset measures used. CONCLUSIONS: A range of vowel-initial voice onset measurements have been applied throughout the literature; however, there is a lack of evidence regarding their sensitivity, specificity and reliability in the detection and discrimination of voice onset types. Heterogeneity in study populations and methods used precludes conclusions on the most valid measures. There is a clear need for standardisation of research methodology, and for future studies to examine the practicality of these measures in research and clinical settings.
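For orientation, the two diagnostic-accuracy terms evaluated in this review can be computed from a 2x2 confusion table, as in the minimal sketch below. The function name and all counts are hypothetical and chosen for illustration only; they are not taken from any study included in the review.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-table counts.

    tp/fn: target onset type correctly detected / missed by the measure.
    tn/fp: non-target onsets correctly rejected / wrongly flagged.
    """
    sensitivity = tp / (tp + fn)  # share of true target onsets detected
    specificity = tn / (tn + fp)  # share of non-target onsets rejected
    return sensitivity, specificity


# Hypothetical counts, for illustration only.
sens, spec = sensitivity_specificity(tp=40, fn=10, tn=45, fp=5)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reliability, the third property reviewed, concerns agreement across raters or repeated measurements and is not covered by this sketch.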
Affiliation(s)
- Antonia Margarita Chacon
  - Voice Research Laboratory/ Doctor Liang Voice Program, Discipline of Speech Pathology, Faculty of Medicine and Health, Sydney School of Health Sciences, The University of Sydney, Sydney, NSW, Australia
- Duy Duong Nguyen
  - Voice Research Laboratory/ Doctor Liang Voice Program, Discipline of Speech Pathology, Faculty of Medicine and Health, Sydney School of Health Sciences, The University of Sydney, Sydney, NSW, Australia
- John Holik
  - Voice Research Laboratory/ Doctor Liang Voice Program, Discipline of Speech Pathology, Faculty of Medicine and Health, Sydney School of Health Sciences, The University of Sydney, Sydney, NSW, Australia
- Michael Döllinger
  - Division of Phoniatrics and Paediatric Audiology at the Department of Otorhinolaryngology Head & Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Bavaria, Germany
- Tomás Arias-Vergara
  - Division of Phoniatrics and Paediatric Audiology at the Department of Otorhinolaryngology Head & Neck Surgery, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Bavaria, Germany
  - Department of Computer Science, Chair of Computer Science 5, Friedrich-Alexander-University Erlangen-Nürnberg, Erlangen, Bavaria, Germany
- Catherine Jeanette Madill
  - Voice Research Laboratory/ Doctor Liang Voice Program, Discipline of Speech Pathology, Faculty of Medicine and Health, Sydney School of Health Sciences, The University of Sydney, Sydney, NSW, Australia
2
Ren Z, Shang F, Zheng Y, Wu N, Ma L, Zhou X. The Role of EGG in Identifying Prevocalic Glottal Stop. J Voice 2024:S0892-1997(24)00020-1. [PMID: 38402112 DOI: 10.1016/j.jvoice.2024.01.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/26/2024] [Accepted: 01/26/2024] [Indexed: 02/26/2024]
Abstract
OBJECTIVE: To investigate the use of the incidence and characteristics of the prevocalic electroglottographic signal (PVES), derived from electroglottography (EGG), in characterizing glottal stops (GS) in cleft palate speech. METHODS: Mandarin nonaspirated monosyllabic first-tone words were used for the speech sampling procedure. A total of 1680 utterances (from 83 patients with repaired cleft palates) were divided into three categories based on auditory-perceptual evaluation of the recorded speech sounds by three independent reviewers: Category A (absence of GS agreed by all three reviewers; n = 1192 tokens), Category B (two out of three reviewers agreed on the presence of a GS; n = 181 tokens) and Category C (all three reviewers agreed on the presence of a GS; n = 307 tokens). The EGG signals of the 1680 utterances were analyzed using a MATLAB program to automatically mark the instances of PVES (amplitude and time interval) in the GS utterances. RESULTS: The results showed that the incidence of EGG PVES had a good positive correlation with auditory-perceptual evaluation (r = 0.703, P<0.000). Statistical analysis revealed a significant difference in mean PVES amplitude among the groups (P<0.05). There was a significant difference in the time interval between groups A and B, as well as between groups A and C (P<0.05). CONCLUSIONS: The study suggests that PVES can be an objective means of identifying GS in cleft palate speech. It also indicates that the proportion of amplitude and the time interval of PVES tend to be positively correlated with subjective assessment.
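As a rough illustration of the kind of association reported above, the sketch below computes a Pearson correlation between a perceptual category code and a binary PVES indicator. The coding scheme (A = 0, B = 1, C = 2), the toy data and the choice of Pearson's r are assumptions made for this example; the abstract does not state which correlation statistic or coding the authors used.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical tokens: perceptual category (0 = A, 1 = B, 2 = C) and
# whether a PVES was detected in the EGG trace (1 = yes, 0 = no).
category = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
pves_present = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

r = pearson_r(category, pves_present)
print(f"r = {r:.3f}")
```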
Affiliation(s)
- Zhen Ren
  - Department of Oral & Maxillofacial Surgery, Peking University School and Hospital of Stomatology, Beijing, China
- Feifei Shang
  - Department of Oral & Maxillofacial Surgery, Peking University School and Hospital of Stomatology, Beijing, China
- Yafeng Zheng
  - Department of Oral & Maxillofacial Surgery, Peking University School and Hospital of Stomatology, Beijing, China
- Nankai Wu
  - Department of Chinese Language and Literature, Jinan University, Guangzhou, China
- Lian Ma
  - Department of Oral & Maxillofacial Surgery, Peking University School and Hospital of Stomatology, Beijing, China
- Xia Zhou
  - Department of Oral & Maxillofacial Surgery, Peking University School and Hospital of Stomatology, Beijing, China
3
Serry MA, Stepp CE, Peterson SD. Exploring the mechanics of fundamental frequency variation during phonation onset. Biomech Model Mechanobiol 2023; 22:339-356. [PMID: 36370231 PMCID: PMC10369356 DOI: 10.1007/s10237-022-01652-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2022] [Accepted: 10/20/2022] [Indexed: 11/15/2022]
Abstract
Fundamental frequency patterns during phonation onset have received renewed interest due to their promising application in objective classification of normal and pathological voices. However, the associated underlying mechanisms producing the wide array of patterns observed in different phonetic contexts are not yet fully understood. Herein, we employ theoretical and numerical analyses in an effort to elucidate the potential mechanisms driving opposing frequency patterns for initial/isolated vowels versus vowels preceded by voiceless consonants. Utilizing deterministic lumped-mass oscillator models of the vocal folds, we systematically explore the roles of collision and muscle activation in the dynamics of phonation onset. We find that an increasing trend in fundamental frequency, as observed for initial/isolated vowels, arises naturally through a progressive increase in system stiffness as collision intensifies as onset progresses, without the need for time-varying vocal fold tension or changes in aerodynamic loading. In contrast, reduction in cricothyroid muscle activation during onset is required to generate the decrease in fundamental frequency observed for vowels preceded by voiceless consonants. For such phonetic contexts, our analysis shows that the magnitude of reduction in the cricothyroid muscle activation and the activation level of the thyroarytenoid muscle are potential factors underlying observed differences in (relative) fundamental frequency between speakers with healthy and hyperfunctional voices. This work highlights the roles of sometimes competing laryngeal factors in producing the complex array of observed fundamental frequency patterns during phonation onset.
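The stiffening effect of collision described above can be illustrated with a deliberately simplified toy model. The sketch below is not the authors' lumped-mass model: it is an undamped, unforced single mass on a linear spring, with an extra contact spring that engages only once the mass passes a contact plane. Every parameter value and name is hypothetical. It shows only that the oscillation frequency rises as the amplitude, and hence the contact, grows, with no change in the tissue spring.

```python
import math


def oscillation_frequency(amplitude, f_linear=100.0, contact_ratio=3.0,
                          gap=1.0, dt=1e-6, duration=0.05):
    """Estimate the oscillation frequency (Hz) of a toy one-mass model.

    The mass is released from rest at `amplitude`. A contact spring
    (contact_ratio times the tissue stiffness) acts only when x < -gap.
    All quantities are per unit mass and in arbitrary length units.
    """
    k = (2.0 * math.pi * f_linear) ** 2  # tissue stiffness per unit mass
    k_contact = contact_ratio * k        # extra stiffness during collision
    x, v = amplitude, 0.0
    crossings = []                       # times of upward zero crossings
    for step in range(round(duration / dt)):
        overlap = min(0.0, x + gap)      # negative only while in contact
        accel = -k * x - k_contact * overlap
        v += accel * dt                  # semi-implicit Euler update
        x_next = x + v * dt
        if x < 0.0 <= x_next:            # interpolate the crossing time
            crossings.append((step + (-x) / (x_next - x)) * dt)
        x = x_next
    return (len(crossings) - 1) / (crossings[-1] - crossings[0])


# Amplitudes below the gap never collide; larger ones collide harder.
f_small = oscillation_frequency(0.5)
f_mid = oscillation_frequency(2.0)
f_large = oscillation_frequency(4.0)
print(f_small, f_mid, f_large)
```

The model omits aerodynamic loading and muscle activation, so it says nothing about the decreasing frequency pattern that the abstract attributes to reduced cricothyroid activation.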
Affiliation(s)
- Mohamed A Serry
  - Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
- Cara E Stepp
  - Department of Speech, Language and Hearing Sciences, Boston University, Boston, MA, 02215, USA
- Sean D Peterson
  - Department of Mechanical and Mechatronics Engineering, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
4
Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12094338] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 01/27/2023]
Abstract
Speech emotion recognition (SER) is an important component of emotion computation and signal processing. Recently, many works have applied abundant acoustic features and complex model architectures to enhance the model’s performance, but these works sacrifice the portability of the model. To address this problem, we propose a model utilizing only the fundamental frequency from electroglottograph (EGG) signals. EGG signals are a sort of physiological signal that can directly reflect the movement of the vocal cord. Under the assumption that different acoustic features share similar representations in the internal emotional state, we propose cross-modal emotion distillation (CMED) to train the EGG-based SER model by transferring robust speech emotion representations from the log-Mel-spectrogram-based model. Utilizing the cross-modal emotion distillation, we achieve an increase of recognition accuracy from 58.98% to 66.80% on the S70 subset of the Chinese Dual-mode Emotional Speech Database (CDESD 7-classes) and 32.29% to 42.71% on the EMO-DB (7-classes) dataset, which shows that our proposed method achieves a comparable result with the human subjective experiment and realizes a trade-off between model complexity and performance.
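The abstract does not specify the CMED loss, so the sketch below shows only the generic, widely used form of knowledge distillation: a temperature-softened KL divergence between teacher and student outputs, blended with the usual cross-entropy on the true label. Treating the log-Mel-spectrogram model as teacher and the EGG model as student follows the abstract; the function names, the temperature, the weighting `alpha` and the logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Blend soft-target KL divergence with hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student))
    # Standard cross-entropy of the student against the true emotion label.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


# Hypothetical 3-class logits: teacher (log-Mel model), student (EGG model).
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.5]
loss = distillation_loss(student, teacher, label=0)
print(f"loss = {loss:.4f}")
```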
5
Chen L, Ren J, Chen P, Mao X, Zhao Q. Limited text speech synthesis with electroglottograph based on Bi-LSTM and modified Tacotron-2. APPL INTELL 2022. [DOI: 10.1007/s10489-021-03075-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
Abstract
This paper proposes a framework that uses only the EGG signal for speech synthesis in scenarios with a limited set of content categories. EGG is a physiological signal that reflects the trends of vocal cord movement. Because EGG is acquired differently from speech signals, we explore its application to speech synthesis in two scenarios: (1) synthesizing speech in high-noise conditions, where clean speech signals are unavailable; and (2) enabling people who cannot speak but retain vocal cord vibration to speak again. Our study consists of two stages, EGG to text and text to speech. The first is a text content recognition model based on Bi-LSTM, which converts each EGG signal sample into the corresponding text within a limited class of contents. This model achieves 91.12% accuracy on the validation set in a 20-class content recognition experiment. The second stage synthesizes speech from the corresponding text and the EGG signal. Based on a modified Tacotron-2, our model attains a Mel cepstral distortion (MCD) of 5.877 and a mean opinion score (MOS) of 3.87, which is comparable with state-of-the-art performance and achieves an improvement by 0.42 and a relatively smaller model size than the original Tacotron-2. To introduce the speaker characteristics contained in EGG into the final synthesized speech, we put forward a fine-grained fundamental frequency modification method, which adjusts the fundamental frequency according to the EGG signals and achieves a lower MCD of 5.781 and a higher MOS of 3.94 than without modification.
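For readers unfamiliar with the MCD figures quoted above, the sketch below implements the commonly used definition of Mel cepstral distortion, averaged over time-aligned frames. Whether the paper uses exactly this variant (for example, which coefficients are included and how frames are aligned) is not stated in the abstract, so this is an assumption; the function name and the coefficient values are made up.

```python
import math


def mel_cepstral_distortion(reference, synthesized):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Each argument is a list of frames; each frame is a list of coefficients.
    The 0th (energy) coefficient is conventionally excluded, so the caller
    is assumed to have dropped it already.
    """
    if len(reference) != len(synthesized):
        raise ValueError("frame sequences must be time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref_frame, syn_frame in zip(reference, synthesized):
        squared = sum((r - s) ** 2 for r, s in zip(ref_frame, syn_frame))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(reference)


# Hypothetical two-frame example with three coefficients per frame.
ref = [[1.0, 0.5, -0.2], [0.9, 0.4, -0.1]]
syn = [[1.1, 0.4, -0.2], [0.7, 0.5, 0.0]]
print(f"MCD = {mel_cepstral_distortion(ref, syn):.3f} dB")
```

Lower values indicate synthesized speech that is spectrally closer to the reference.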
6
Lung volume affects the decay of oscillations at the end of a vocal emission. Biomed Signal Process Control 2020. [DOI: 10.1016/j.bspc.2020.102148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
7
DeJonckere P, Lebacq J. Intraglottal Aerodynamics at Vocal Fold Vibration Onset. J Voice 2019; 35:156.e23-156.e32. [PMID: 31481279 DOI: 10.1016/j.jvoice.2019.08.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 05/28/2019] [Revised: 08/01/2019] [Accepted: 08/02/2019] [Indexed: 11/27/2022]
Abstract
The most frequently observed type of voice onset in spontaneous speech in normal subjects is the soft onset, and it may be considered as the "physiological" onset. It starts from an immobile narrow glottal slit crossed by a continuous airflow, and then a few oscillations (even a single one in some cases) precede the first glottal closure. It is a transient event, during which the acting forces (lung pressure, intraglottal pressure, myoelastic tension of the vocal fold (VF) oscillator and inertance of the supraglottal vocal tract) interact to progressively reach the steady state of a sustained oscillation. Combined measurements of flow, area, and pressure provide a detailed qualitative and quantitative analysis of the intraglottal mechanical events at the precise moment of starting oscillation in a physiological (soft or soft/breathy) onset. Our in vivo measurements of airflow and glottal area show that the very first oscillation occurs exactly at the time when turbulence appears at the level of the glottal narrowing, i.e., when the Reynolds number reaches its critical value. The turbulence may be assumed to trigger an oscillator consisting of the ensemble of the VFs and the air of the vocal tract, which is known to be weakly damped. Turbulence can act here as a nonspecific flick, triggering the oscillator, the frequency of oscillation being determined by its mechanical properties. Furthermore, the first noticeable glottal oscillations are sinusoidal: the VFs are neither steeply sucked together by a negative Bernoulli pressure, nor burst apart by the lung pressure. Our measurements show that, at the critical time, the rising positive lung pressure is balanced by the rising negative Bernoulli pressure generated by the transglottal flow.
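The Reynolds number mentioned above is the standard ratio Re = ρvD/μ, with the mean velocity taken as flow divided by area. The sketch below evaluates it for a glottal slit under loudly stated assumptions: the air properties are rough textbook values, and the flow, area and characteristic length are hypothetical numbers rather than measurements from the paper. The abstract does not report the critical value, so none is assumed here.

```python
def reynolds_number(flow_m3_s, area_m2, length_m,
                    density=1.2, viscosity=1.8e-5):
    """Reynolds number for flow through a narrowing.

    flow_m3_s: volume flow (m^3/s); area_m2: cross-sectional area (m^2);
    length_m: characteristic length, e.g. a hydraulic diameter (m).
    density (kg/m^3) and dynamic viscosity (Pa*s) are approximate values
    for air and are assumptions of this sketch.
    """
    velocity = flow_m3_s / area_m2  # mean particle velocity, m/s
    return density * velocity * length_m / viscosity


# Hypothetical onset conditions: 200 mL/s through a 10 mm^2 slit,
# with a 1 mm characteristic length.
re = reynolds_number(flow_m3_s=2.0e-4, area_m2=1.0e-5, length_m=1.0e-3)
print(f"Re = {re:.0f}")
```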
Affiliation(s)
- Jean Lebacq
  - Neurosciences Institute, University of Louvain, Brussels, Belgium
8
DeJonckere PH, Lebacq J. In Vivo Quantification of the Intraglottal Pressure: Modal Phonation and Voice Onset. J Voice 2019; 34:645.e19-645.e39. [PMID: 30658875 DOI: 10.1016/j.jvoice.2019.01.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/31/2018] [Revised: 12/30/2018] [Accepted: 01/02/2019] [Indexed: 11/24/2022]
Abstract
Intraglottal pressure is the driving force of vocal fold vibration. Its time course during the open phase of the vibratory cycle is essential in the mechanics of phonation, but measuring it directly is difficult and may hinder spontaneous voicing. However, it can be computed from the in vivo measured transglottal flow and glottal area (hence the air particle velocity) on the basis of the Bernoulli energy law and the interaction with the inertance of the vocal tract. As to sustained modal phonation, calculations are presented for the two possible shapes of glottal duct: convergent and divergent, including absolute calibration in order to obtain quantitative physical values. Whatever the glottal duct configuration, the calculations based on measured values of glottal area and air flow show that the integrated intraglottal pressure during the opening phase systematically exceeds that during the closing phase, which is the basic condition for sustaining vocal fold oscillation. The key point is that the airflow curve is skewed to the right relative to the glottal area curve. The skewing results from air compressibility and vocal tract inertance. The intraglottal pressure becomes negative during the closing phase. As to the soft (or physiological) voice onset, a similar approach shows that the integrated pressure differences (opening phase - closing phase) actually increase as the onset progresses, and this applies to the results based on Bernoulli's energy law as well as to those based on the interaction with the inertance of the vocal tract. Furthermore and similarly, the phase lead of the pressure wave with respect to the glottal opening progressively increases. The underlying explanation lies in the progressively increasing skewing of the airflow curve to the right with respect to the glottal area curve.
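The Bernoulli part of the calculation described above can be sketched as follows: the particle velocity is flow divided by glottal area, and the pressure in the glottis is the upstream pressure minus the dynamic term ½ρv². The sketch compares two instants with the same glottal area, one in the opening phase and one in the closing phase, where the right-skewed flow curve gives the closing instant the larger flow. All numbers and names are hypothetical, the vocal tract inertance term used in the paper is omitted, and this is not the authors' calibrated procedure.

```python
AIR_DENSITY = 1.2  # kg/m^3, approximate value assumed for this sketch


def bernoulli_pressure(upstream_pa, flow_m3_s, area_m2):
    """Pressure (Pa) in the glottis from Bernoulli's energy law."""
    velocity = flow_m3_s / area_m2  # air particle velocity, m/s
    return upstream_pa - 0.5 * AIR_DENSITY * velocity ** 2


# Same glottal area at both instants; flow lags area (skewed to the right),
# so it is smaller while opening and larger while closing.
area = 1.0e-5  # m^2 (10 mm^2), hypothetical
p_opening = bernoulli_pressure(800.0, flow_m3_s=1.5e-4, area_m2=area)
p_closing = bernoulli_pressure(800.0, flow_m3_s=2.5e-4, area_m2=area)
print(p_opening, p_closing)
```

With these made-up values the pressure is higher at the opening instant than at the closing instant, which is the direction of asymmetry the abstract identifies as the condition for sustaining oscillation.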
Affiliation(s)
- Philippe H DeJonckere
  - Federal Agency for Occupational Risks, Brussels and Department of Neurosciences KULeuven, University of Leuven, Leuven, Belgium
- Jean Lebacq
  - Neurosciences Institute, University of Louvain, Brussels, Belgium