1
Liu S, Shao J. [Current methods of acoustic analysis of voice: a review]. Lin Chuang Er Bi Yan Hou Tou Jing Wai Ke Za Zhi 2022; 36:966-970, 976. [PMID: 36543409 PMCID: PMC10128270 DOI: 10.13201/j.issn.2096-7993.2022.12.016]
Abstract
Acoustic analysis of the voice, as an objective, quantitative, non-invasive, and reproducible method for evaluating voice quality, can be used to detect and analyze the acoustic characteristics of normal, artistic, or pathological voices. With developments in medicine, physics, statistics, and artificial intelligence, the study of voice acoustic analysis has advanced, especially in terms of acoustic parameters. In addition, artificial neural networks can perform complex multi-parameter analysis, which greatly improves the efficiency of acoustic analysis. This paper provides an overview of acoustic analysis methods and their latest developments.
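Jitter and shimmer are among the classical acoustic parameters such reviews cover. As a minimal illustration of how they are computed, the sketch below assumes the per-cycle periods and peak amplitudes have already been extracted from a sustained vowel (the values shown are made up):

```python
import numpy as np

# Hypothetical per-cycle measurements from a sustained vowel:
# glottal cycle periods (seconds) and peak amplitudes.
periods = np.array([0.0050, 0.0051, 0.0049, 0.0052, 0.0050])
amps = np.array([0.81, 0.79, 0.83, 0.80, 0.82])

# Local jitter: mean absolute difference between consecutive periods,
# normalized by the mean period, expressed as a percentage.
jitter = np.mean(np.abs(np.diff(periods))) / periods.mean() * 100

# Local shimmer: the same statistic applied to cycle peak amplitudes.
shimmer = np.mean(np.abs(np.diff(amps))) / amps.mean() * 100

print(f"jitter = {jitter:.2f}%, shimmer = {shimmer:.2f}%")
```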
Affiliation(s)
- Siwei Liu
- Department of Otolaryngology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China
- Jun Shao
- Department of Otolaryngology, Eye & ENT Hospital, Fudan University, Shanghai 200031, China
2
Zhao D, Zhou C, Zhu X, Zhang X, Tao Z. [Pathological voice detection based on gammatone short-time spectral self-similarity]. Sheng Wu Yi Xue Gong Cheng Xue Za Zhi 2022; 39:694-701. [PMID: 36008333 PMCID: PMC10957350 DOI: 10.7507/1001-5515.202107037]
Abstract
Acoustic detection based on machine learning and signal processing is an important approach to pathological voice detection, and feature extraction is one of its most important steps. Widely used features, however, depend on fundamental frequency extraction, are easily affected by noise, and are computationally expensive. To address these shortcomings, a new pathological voice detection method based on multi-band analysis and chaotic analysis is proposed. A gammatone filter bank was used to simulate the auditory characteristics of the human ear and decompose the signal into different frequency bands. Because the turbulence noise caused by chaos in the voice degrades spectral convergence, a short-time Fourier transform was applied to each frequency band, the gammatone short-time spectral self-similarity (GSTS) feature was extracted, and the degree of chaos in each band was analyzed to distinguish normal from pathological voices. Experimental results showed that, combined with traditional machine learning methods, GSTS reached an accuracy of 99.50% on the pathological voice database of the Massachusetts Eye and Ear Infirmary (MEEI), an improvement of 3.46% over the best existing features. Moreover, extracting GSTS took far less time than extracting traditional nonlinear features. These results show that GSTS offers higher extraction efficiency and better recognition performance than existing features.
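The paper's exact GSTS definition is not given in the abstract; the sketch below only mirrors the described pipeline (gammatone filter bank, per-band short-time Fourier transform, frame-to-frame spectral similarity) under assumed band centers and window sizes, with cosine similarity standing in for the paper's self-similarity measure:

```python
import numpy as np
from scipy import signal

def gsts(x, fs, centers=(250, 500, 1000, 2000, 4000)):
    """Per-band short-time spectral self-similarity sketch.

    Assumes fs is high enough that all band centers lie below fs/2.
    """
    feats = []
    for fc in centers:
        # 4th-order IIR gammatone band-pass, mimicking auditory filtering
        b, a = signal.gammatone(fc, 'iir', fs=fs)
        band = signal.lfilter(b, a, x)
        # Short-time spectrum of the band signal
        _, _, Z = signal.stft(band, fs=fs, nperseg=256)
        S = np.abs(Z) + 1e-12
        # Cosine similarity between consecutive spectral frames:
        # chaotic (turbulent) voices show lower frame-to-frame similarity
        num = np.sum(S[:, 1:] * S[:, :-1], axis=0)
        den = (np.linalg.norm(S[:, 1:], axis=0)
               * np.linalg.norm(S[:, :-1], axis=0))
        feats.append(np.mean(num / den))
    return np.array(feats)   # one similarity score per band
```

The per-band feature vector would then be fed to a traditional classifier, as the abstract describes.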
Affiliation(s)
- Denghuang Zhao
- School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu 215000, P. R. China
- Changwei Zhou
- School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu 215000, P. R. China
- Xincheng Zhu
- School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu 215000, P. R. China
- Xiaojun Zhang
- School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu 215000, P. R. China
- Zhi Tao
- School of Optoelectronic Science and Engineering, Soochow University, Suzhou, Jiangsu 215000, P. R. China
3
Liu B, Zhang F, Chen L, Silverman MA, Liu H, Fu D, Huang Y, Pan J, Jiang JJ. Chaos Behavior Analysis of Alaryngeal Voices Including Esophageal and Tracheoesophageal Voices. Folia Phoniatr Logop 2022; 74:431-440. [PMID: 35051938 PMCID: PMC9296702 DOI: 10.1159/000521222]
Abstract
HYPOTHESIS/OBJECTIVES This study's objective was to develop a method to evaluate the chaotic characteristics of alaryngeal speech. The proposed method should distinguish between normal and alaryngeal voices, including esophageal (SE) and tracheoesophageal (TE) voices. It has been previously shown that alaryngeal voices exhibit chaotic characteristics due to the aperiodicity of their signals. The proposed method will be applied in future work to quantify both chaos behavior (CB) and the difference between SE and TE voices. STUDY DESIGN A total of 74 voice recordings, including 34 normal and 40 alaryngeal (26 SE and 14 TE), were used in the study. Voice samples were analyzed to distinguish alaryngeal voices from normal voices and to investigate the differing chaotic characteristics of SE and TE speech. METHODS A chaotic distribution detection-based method was used to investigate the CB of alaryngeal voices, and the quantified CB parameter was used to detect the difference between SE and TE voice types. Statistical analyses compared the CB results for the SE and TE voices. RESULTS Statistical analysis revealed that CB effectively differentiated between all normal and alaryngeal voice types (p < 0.01). Subsequent multiclass receiver operating characteristic (ROC) analysis demonstrated that CB achieved greater classification accuracy (area under the curve) than correlation dimension (D2). CONCLUSIONS The CB metric shows strong promise as an accurate, useful metric for objective differentiation among normal, SE, and TE voice types. As expected, SE voices showed significantly more CB than TE voices, a substantial improvement over previous methods and the first method to classify SE and TE voices. This metric can help clinicians obtain additional acoustic information when monitoring treatment efficacy for patients undergoing total laryngectomy.
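The CB parameter itself is specific to this paper, but the correlation dimension D2 used as the comparison baseline is commonly computed with the Grassberger-Procaccia algorithm. A rough sketch, with embedding dimension, delay, subsampling, and radius range chosen arbitrarily:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(x, dim=5, tau=10, step=5):
    """Grassberger-Procaccia estimate of the correlation dimension D2."""
    # Time-delay embedding of the scalar voice signal
    n = len(x) - (dim - 1) * tau
    emb = np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])
    # Pairwise Chebyshev distances (subsampled to keep this tractable)
    d = pdist(emb[::step], metric="chebyshev")
    # Correlation sum C(r) over a range of radii in the scaling region
    r_vals = np.logspace(np.log10(np.percentile(d, 1)),
                         np.log10(np.percentile(d, 50)), 12)
    C = np.array([np.mean(d < r) for r in r_vals])
    # D2 is the slope of log C(r) versus log r
    slope, _ = np.polyfit(np.log(r_vals), np.log(C), 1)
    return slope
```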
Affiliation(s)
- Boquan Liu
- School of Humanities, Shanghai Jiao Tong University, Shanghai, China
- Fan Zhang
- Department of Otorhinolaryngology, Eye & ENT Hospital, Fudan University
- Ling Chen
- Department of Otorhinolaryngology, Eye & ENT Hospital, Fudan University
- Matthew A. Silverman
- Department of Surgery, Division of Otolaryngology-Head and Neck Surgery, University of Wisconsin-Madison, Madison, Wisconsin
- Dehui Fu
- ENT Department, The 2nd Hospital of Tianjin Medical University
- Yongwang Huang
- ENT Department, The 2nd Hospital of Tianjin Medical University
- Jing Pan
- ENT Department, The 2nd Hospital of Tianjin Medical University
- Jack J. Jiang
- Department of Surgery, Division of Otolaryngology-Head and Neck Surgery, University of Wisconsin-Madison, Madison, Wisconsin
4
Miramont JM, Restrepo JF, Codino J, Jackson-Menaldi C, Schlotthauer G. Voice Signal Typing Using a Pattern Recognition Approach. J Voice 2020; 36:34-42. [PMID: 32376059 DOI: 10.1016/j.jvoice.2020.03.006]
Abstract
Voice signal classification into three types according to degree of periodicity, a task known as signal typing, is a relevant preprocessing step before computing any perturbation measures. However, it is a time-consuming and subjective activity, which has given rise to interest in automatic systems that use objective measures to distinguish among the signal types. The purpose of this paper is twofold: first, to propose a pattern recognition approach to automatic voice signal typing based on a multi-class linear support vector machine, using well-known parameters such as jitter, shimmer, harmonics-to-noise ratio, and cepstral peak prominence in combination with nonlinear dynamics measures; two novel features are also proposed as objective parameters. Second, to validate this approach on a large number of signals from two well-known corpora, using cross-dataset experiments to assess the generalizability of the system. A total of 1262 signals labeled by professional voice pathologists were used for this purpose. Statistically significant differences between all types were found for all features. Accuracies over 82.71% were estimated in all intra-dataset and inter-dataset experiments using cross-validation. Finally, the use of posterior probabilities is proposed as a measure of the reliability of the assigned type, which could help clinicians make a more informed decision about the type assigned to a voice. These outcomes suggest that the proposed approach can successfully discriminate among the three voice types, paving the way to a fully automatic tool for voice signal typing.
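A minimal sketch of the classification stage, assuming the perturbation and nonlinear features have already been computed per signal (placeholder random data below; the paper's exact feature set and SVM configuration may differ). Setting probability=True yields Platt-scaled posterior probabilities of the kind the authors propose as a reliability measure:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row per voice, columns = features such as jitter, shimmer,
# HNR, and cepstral peak prominence (placeholder data here);
# y: signal type 1, 2, or 3 assigned by voice pathologists.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(120, 6)), rng.integers(1, 4, size=120)

# Multi-class linear SVM with feature standardization
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="linear", probability=True, random_state=0))
clf.fit(X, y)
print(clf.predict_proba(X[:1]))   # posterior over the three types
```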
Affiliation(s)
- J. M. Miramont
- Instituto de Investigación y Desarrollo en Bioingeniería y Bioinformática, UNER-CONICET, Oro Verde, Entre Ríos, Argentina
- Juan F. Restrepo
- Instituto de Investigación y Desarrollo en Bioingeniería y Bioinformática, UNER-CONICET, Oro Verde, Entre Ríos, Argentina
- J. Codino
- Lakeshore Professional Voice Center, Lakeshore Ear, Nose and Throat Center, St. Clair Shores, Michigan
- C. Jackson-Menaldi
- Lakeshore Professional Voice Center, Lakeshore Ear, Nose and Throat Center, St. Clair Shores, Michigan; Department of Otolaryngology, School of Medicine, Wayne State University, Detroit, Michigan
- G. Schlotthauer
- Instituto de Investigación y Desarrollo en Bioingeniería y Bioinformática, UNER-CONICET, Oro Verde, Entre Ríos, Argentina
5
Van Hirtum A, Bouvet A, Pelorson X. Quantifying the auto-oscillation complexity following water spraying with interest for phonation. Phys Rev E 2019; 100:043111. [PMID: 31770960 DOI: 10.1103/PhysRevE.100.043111]
Abstract
Human voiced sound production, or phonation, results from a fluid-structure instability in the larynx that leads to vocal fold auto-oscillation. In this paper, the effect of surface hydration following water spraying (0 to 5 ml) on an ongoing auto-oscillation is studied experimentally using several deformable mechanical vocal fold replicas. The complexity of the oscillation is quantified from the upstream pressure signal by phase-space recurrence and complexity analysis. It is shown that (1) the ratio γ of the degree of determinism to the recurrence rate of the phase-space states and (2) the estimated correlation dimension D2 are suitable parameters for capturing the effect of hydration on the oscillation pattern. After hydration, the oscillation regime can either remain deterministic or approach a chaotic regime, depending on initial conditions prior to water spraying such as elasticity, glottal aperture, and oscillation complexity.
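A rough numpy sketch of the two recurrence measures behind γ, under assumed embedding parameters and recurrence threshold (the paper's exact settings are not stated in the abstract): the recurrence rate RR, the determinism DET, and γ = DET/RR.

```python
import numpy as np

def rqa_gamma(x, dim=3, tau=5, eps_q=10.0, lmin=2):
    """Recurrence analysis of a scalar signal (e.g., upstream pressure).

    Returns (RR, DET, gamma) with gamma = DET / RR.
    """
    # Time-delay embedding
    n = len(x) - (dim - 1) * tau
    emb = np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])
    # Recurrence matrix: state pairs closer than a radius are recurrent;
    # the radius is set to the eps_q-th percentile of all distances.
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    R = (d < np.percentile(d, eps_q)).astype(int)
    np.fill_diagonal(R, 0)            # exclude the line of identity
    rr = R.mean()                     # recurrence rate
    # Determinism: fraction of recurrent points lying on diagonal lines
    # of length >= lmin (deterministic dynamics trace long diagonals).
    diag_pts = 0
    for k in range(1, n):             # off-diagonals; R is symmetric
        line = np.concatenate(([0], np.diagonal(R, offset=k), [0]))
        edges = np.flatnonzero(np.diff(line)).reshape(-1, 2)
        lengths = edges[:, 1] - edges[:, 0]
        diag_pts += 2 * lengths[lengths >= lmin].sum()
    det = diag_pts / R.sum()
    return rr, det, det / rr
```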
Affiliation(s)
- A. Van Hirtum
- LEGI, UMR CNRS 5519, Grenoble Alpes University, Grenoble, France
- A. Bouvet
- LEGI, UMR CNRS 5519, Grenoble Alpes University, Grenoble, France
- X. Pelorson
- LEGI, UMR CNRS 5519, Grenoble Alpes University, Grenoble, France
6
Palaparthi A, Smith S, Titze IR. Mapping Thyroarytenoid and Cricothyroid Activations to Postural and Acoustic Features in a Fiber-Gel Model of the Vocal Folds. Appl Sci (Basel) 2019; 9:4671. [PMID: 35265343 PMCID: PMC8903205 DOI: 10.3390/app9214671]
Abstract
Any specific vowel sound that humans produce can be represented in terms of four perceptual features in addition to the vowel category: pitch, loudness, brightness, and roughness. The corresponding acoustic features chosen here are fundamental frequency (fo), sound pressure level (SPL), normalized spectral centroid (NSC), and approximate entropy (ApEn). In this study, thyroarytenoid (TA) and cricothyroid (CT) activations were varied computationally to study their relationship with these four acoustic features. Additionally, postural and material property variables, such as vocal fold length (L) and fiber stress (σ) in the three vocal fold tissue layers, were also calculated. A fiber-gel finite element model developed at the National Center for Voice and Speech was used for this purpose. Muscle activation plots were generated to obtain the dependency of postural and acoustic features on TA and CT muscle activations, and these relationships were compared against data from previous in vivo human larynx studies and canine laryngeal studies. The general trends are that fo and SPL increase with CT activation, while NSC decreases when CT activation is raised above 20%. With TA activation, the acoustic features show no uniform trends, except that SPL increases uniformly with TA when it co-varies with CT activation. Trends for the postural variables and material properties are also discussed in terms of activation levels.
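Of the four acoustic features, approximate entropy is the one quantifying roughness. A compact version of Pincus' ApEn, with the tolerance conventionally set to a fraction of the signal's standard deviation (the paper's exact settings are not stated in the abstract):

```python
import numpy as np

def approx_entropy(x, m=2, r_frac=0.2):
    """Approximate entropy (ApEn): regularity of a time series."""
    x = np.asarray(x, dtype=float)
    r = r_frac * x.std()                      # tolerance radius

    def phi(m):
        n = len(x) - m + 1
        blocks = np.array([x[i:i + m] for i in range(n)])
        # Chebyshev distance between every pair of length-m blocks
        d = np.max(np.abs(blocks[:, None, :] - blocks[None, :, :]), axis=-1)
        c = (d <= r).mean(axis=1)             # self-matches included, per Pincus
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)
```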
Affiliation(s)
- Anil Palaparthi
- National Center for Voice and Speech, The University of Utah, 1901 S Campus Dr, Suite 2120, Salt Lake City, UT 84112, USA
- Department of Bioengineering, The University of Utah, Salt Lake City, UT 84112, USA
- Simeon Smith
- National Center for Voice and Speech, The University of Utah, 1901 S Campus Dr, Suite 2120, Salt Lake City, UT 84112, USA
- Ingo R. Titze
- National Center for Voice and Speech, The University of Utah, 1901 S Campus Dr, Suite 2120, Salt Lake City, UT 84112, USA
- Department of Bioengineering, The University of Utah, Salt Lake City, UT 84112, USA
7
Liu B, Polce E, Raj H, Jiang J. Quantification of Voice Type Components Present in Human Phonation Using a Modified Diffusive Chaos Technique. Ann Otol Rhinol Laryngol 2019; 128:921-931. [DOI: 10.1177/0003489419848451]
Abstract
Purpose: Signal typing has been used to categorize healthy and disordered voices; however, human voices are likely composed of differing proportions of periodic type 1 elements, type 2 elements that are periodic with modulations, aperiodic type 3 elements, and stochastic type 4 elements. A novel diffusive chaos method is presented to detect the distribution of voice types within a signal, with the goal of providing an objective and clinically useful tool for evaluating the voice. It was predicted that continuous calculation of the diffusive chaos parameter throughout the voice sample would allow construction of comprehensive voice type component profiles (VTCPs). Methods: One hundred thirty-five voice samples of sustained /a/ vowels were randomly selected from the Disordered Voice Database Model 4337. All samples were classified according to the voice type paradigm using spectrogram analysis, yielding 34 type 1, 35 type 2, 42 type 3, and 24 type 4 voice samples. All samples were then analyzed using the diffusive chaos method, and VTCPs were generated to show the distribution of the 4 voice type components (VTCs). Results: The proportions of VTC1 varied significantly between the majority of the traditional voice types (P < .001). Three of the 4 VTCs of type 3 voices were significantly different from the VTCs of type 4 voices (P < .001). These results were compared with calculations of the spectrum convergence ratio, which did not vary significantly between voice types 1 and 2 or between types 2 and 3. Conclusion: The diffusive chaos method is proficient at generating comprehensive VTCPs for disordered voices of varying severity. In contrast to acoustic parameters that provide a single measure of disorder, VTCPs can detect subtler changes by tracking variations in each VTC over time. The method also offers the advantage of quantifying stochastic noise components due to breathiness in the voice.
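The modified diffusive chaos parameter is particular to this paper, so the sketch below only illustrates the profiling idea: slide a window over the signal, score each window with some chaos-sensitive measure, and bin the scores into four component proportions. The measure, window sizes, and thresholds here are all hypothetical stand-ins, not the paper's values:

```python
import numpy as np

def voice_type_profile(x, fs, chaos_measure, win_s=0.04, hop_s=0.01,
                       bins=(0.5, 1.0, 1.5)):
    """Sliding-window voice type component profile (VTCP) sketch.

    chaos_measure: any per-window scalar (e.g., approximate entropy);
    bins: hypothetical thresholds splitting the measure into the four
    voice type components; the paper's diffusive chaos parameter and
    its cutoffs are not reproduced here.
    """
    win, hop = int(win_s * fs), int(hop_s * fs)
    vals = np.array([chaos_measure(x[i:i + win])
                     for i in range(0, len(x) - win, hop)])
    comp = np.digitize(vals, bins)          # 0..3 -> VTC1..VTC4
    return np.bincount(comp, minlength=4) / len(vals)
```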
Affiliation(s)
- Boquan Liu
- Department of Surgery-Division of Otolaryngology, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
- Evan Polce
- Department of Surgery-Division of Otolaryngology, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
- Hayley Raj
- Department of Surgery-Division of Otolaryngology, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
- Jack Jiang
- Department of Surgery-Division of Otolaryngology, University of Wisconsin School of Medicine and Public Health, Madison, WI, USA
8
Croake DJ, Andreatta RD, Stemple JC. Descriptive Analysis of the Interactive Patterning of the Vocalization Subsystems in Healthy Participants: A Dynamic Systems Perspective. J Speech Lang Hear Res 2019; 62:215-228. [PMID: 30950696 DOI: 10.1044/2018_JSLHR-S-17-0466]
Abstract
Purpose Normative data for many objective voice measures are routinely used in clinical voice assessment; however, normative data reflect vocal output, not the vocalization process. The physiologic processes underlying healthy phonation have been shown to be nonlinear and are thus likely to differ across individuals. Dynamic systems theory postulates that performance behaviors emerge from the nonlinear interplay of multiple physiologic components and that certain patterns are preferred and loosely governed by the interactions of physiology, task, and environment. The purpose of this study was to descriptively characterize the interactive nature of the vocalization subsystem triad in participants with healthy voices and to determine whether distinct subgroups could be delineated to better understand how healthy voicing is physiologically generated. Method Respiratory kinematic, aerodynamic, and acoustic formant data were obtained from 29 individuals with healthy voices (21 female, 8 male). Multivariate analyses were used to descriptively characterize the interactions among the subsystems that contributed to healthy voicing. Results Group data revealed representative measures of the 3 subsystems to be generally within the boundaries of established normative data. Despite this, 3 distinct clusters were delineated, representing 3 subgroups of individuals with differing subsystem patterning. Seven of the 9 measured variables were significantly different across at least 1 of the 3 subgroups, indicating differing physiologic processes across individuals. Conclusion Vocal output in healthy individuals appears to be generated by distinct and preferred physiologic processes, represented here by 3 subgroups, indicating that the process of vocalization differs among individuals but is not entirely idiosyncratic. Possible explanations for these differences are explored within the framework of dynamic systems theory and the dynamics of emergent behaviors. A revised physiologic model of phonation that accounts for differences within and among the vocalization subsystems is described. Supplemental Material https://doi.org/10.23641/asha.7616462.
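The abstract describes a multivariate cluster analysis that delineated 3 subgroups from 9 subsystem measures. A sketch of that style of analysis with scikit-learn, using placeholder data and k-means (the paper's actual variables and clustering algorithm may differ):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# X: rows = participants, columns = the nine respiratory-kinematic,
# aerodynamic, and acoustic-formant measures (placeholder data here).
rng = np.random.default_rng(0)
X = rng.normal(size=(29, 9))

Z = StandardScaler().fit_transform(X)       # z-score each measure
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
for k in range(3):
    print(f"subgroup {k}: n = {np.sum(labels == k)}")
```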
Affiliation(s)
- Daniel J Croake
- Department of Communication Sciences and Disorders, University of Kentucky, Lexington
- Richard D Andreatta
- Department of Communication Sciences and Disorders, University of Kentucky, Lexington
- Joseph C Stemple
- Department of Communication Sciences and Disorders, University of Kentucky, Lexington