1. Verde L, Marulli F, De Fazio R, Campanile L, Marrone S. HEAR set: A ligHtwEight acoustic paRameters set to assess mental health from voice analysis. Comput Biol Med 2024;182:109021. [PMID: 39236660] [DOI: 10.1016/j.compbiomed.2024.109021]
Abstract
BACKGROUND Voice analysis has significant potential to aid healthcare professionals in detecting and diagnosing pathologies and in personalising treatment. It is an objective and non-intrusive tool for supporting the detection and monitoring of specific pathologies. By calculating various acoustic features, voice analysis extracts valuable information with which to assess voice quality; the choice of these parameters is crucial for an accurate assessment. METHOD In this paper, we propose a lightweight acoustic parameter set, named HEAR, for evaluating voice quality in order to assess mental health. The set consists of jitter, spectral centroid, Mel-frequency cepstral coefficients (MFCCs), and their derivatives. The choice of parameters was guided by the explainable significance of each acoustic parameter in the voice production process. RESULTS The reliability of the proposed acoustic set for detecting early symptoms of mental disorders was evaluated experimentally. Voices of subjects suffering from different mental pathologies, selected from available databases, were analysed. The performance obtained with the HEAR features was compared with that obtained using features selected from toolkits widely used in the literature, as well as with features obtained by learned procedures. The best performance in terms of MAE and RMSE was achieved for the detection of depression (5.32 and 6.24, respectively). For the detection of psychogenic dysphonia and anxiety, the highest accuracy rates were about 75% and 97%, respectively. CONCLUSIONS A comparative evaluation was carried out to assess the performance of the proposed approach, demonstrating a reliable capability to highlight the affective and physiological alterations of voice quality caused by the considered mental disorders.
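Editor's note: as a rough illustration of how such a compact feature set might be computed for a single recording, the sketch below uses librosa for the spectral features and Praat (via parselmouth) for jitter. The frame settings, pitch range, and the mean/std summarisation are assumptions, not the HEAR configuration.

import numpy as np
import librosa
import parselmouth
from parselmouth.praat import call

def hear_like_features(wav_path, n_mfcc=13):
    """Rough approximation of a jitter + spectral centroid + MFCC(+delta) feature vector."""
    y, sr = librosa.load(wav_path, sr=None)

    # MFCCs and their first derivatives (the set also includes derivatives).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    d_mfcc = librosa.feature.delta(mfcc)

    # Frame-level spectral centroid.
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)

    # Local jitter from Praat; the pitch floor/ceiling and jitter arguments are Praat defaults.
    snd = parselmouth.Sound(wav_path)
    point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
    jitter_local = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)

    def stats(m):
        # Summarise each frame-level feature with its mean and standard deviation.
        return np.hstack([m.mean(axis=1), m.std(axis=1)])

    return np.hstack([stats(mfcc), stats(d_mfcc), stats(centroid), jitter_local])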
Affiliation(s)
- Laura Verde, Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Lincoln 5, Caserta, 81100, Italy
- Fiammetta Marulli, Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Lincoln 5, Caserta, 81100, Italy
- Roberta De Fazio, Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Lincoln 5, Caserta, 81100, Italy
- Lelio Campanile, Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Lincoln 5, Caserta, 81100, Italy
- Stefano Marrone, Department of Mathematics and Physics, University of Campania "Luigi Vanvitelli", Viale Lincoln 5, Caserta, 81100, Italy
2. Laukkanen AM, Kadiri SR, Narayanan S, Alku P. Can a Machine Distinguish High and Low Amount of Social Creak in Speech? J Voice 2024:S0892-1997(24)00342-4. [PMID: 39455325] [DOI: 10.1016/j.jvoice.2024.09.050]
Abstract
OBJECTIVES An increased prevalence of social creak, particularly among female speakers, has been reported in several studies. Social creak has previously been studied by combining perceptual evaluation of speech with conventional acoustic parameters such as the harmonics-to-noise ratio and cepstral peak prominence. In the current study, machine learning (ML) was used to automatically distinguish speech with a low amount of social creak from speech with a high amount. METHODS The amount of creak in continuous speech samples produced in Finnish by 90 female speakers was first perceptually assessed by two voice specialists. Based on their assessments, the speech samples were divided into two categories (low vs. high amount of creak). Using the speech signals and their creak labels, seven different ML models were trained, with three spectral representations used as features for each model. RESULTS The best performance (accuracy of 71.1%) was obtained by two systems: an AdaBoost classifier using the mel-spectrogram feature and a decision tree classifier using the mel-frequency cepstral coefficient (MFCC) feature. CONCLUSIONS The study of social creak is becoming increasingly popular in sociolinguistic and vocological research. Conventional human perceptual assessment of the amount of creak is laborious, so ML technology could be used to assist researchers studying social creak. The classification systems reported in this study can be considered baselines for future ML-based studies on social creak.
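Editor's note: a hedged sketch of the kind of baseline the abstract describes, with frame-level mel-spectrogram or MFCC features summarised per utterance and fed to an AdaBoost or decision-tree classifier. Feature dimensions, summarisation, and hyperparameters below are assumptions, not the paper's settings.

import numpy as np
import librosa
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def utterance_features(path, kind="mel"):
    """One fixed-length vector per utterance: time-averaged mel-spectrogram or MFCC."""
    y, sr = librosa.load(path, sr=16000)
    if kind == "mel":
        feat = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40))
    else:  # "mfcc"
        feat = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Collapse the time axis with mean and std so every utterance has the same size.
    return np.hstack([feat.mean(axis=1), feat.std(axis=1)])

def evaluate(paths, labels):
    """paths: wav files; labels: 0 = low creak, 1 = high creak (binary, as in the study)."""
    X_mel = np.vstack([utterance_features(p, "mel") for p in paths])
    X_mfcc = np.vstack([utterance_features(p, "mfcc") for p in paths])
    y = np.asarray(labels)

    ada = AdaBoostClassifier(n_estimators=200, random_state=0)
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)

    print("AdaBoost + mel-spectrogram:", cross_val_score(ada, X_mel, y, cv=5).mean())
    print("Decision tree + MFCC:      ", cross_val_score(tree, X_mfcc, y, cv=5).mean())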
Affiliation(s)
- Sudarsana Reddy Kadiri, Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles
- Shrikanth Narayanan, Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles
- Paavo Alku, Department of Information and Communications Engineering, Aalto University, Espoo, Finland
3. Tomaszewska JZ, Georgakis A. Electroglottography in Medical Diagnostics of Vocal Tract Pathologies: A Systematic Review. J Voice 2023:S0892-1997(23)00388-0. [PMID: 38143204] [DOI: 10.1016/j.jvoice.2023.12.004]
Abstract
Electroglottography (EGG) is a technology developed for measuring the vocal fold contact area during human voice production. Although considered subjective and unreliable as a sole diagnostic method, with the correct application of relevant computational methods it can become one of the most promising non-invasive voice disorder diagnostic tools, in the form of a digital vocal tract pathology classifier. The aim of this study is to gather and evaluate currently existing digital voice quality assessment systems and vocal tract abnormality classification systems that rely on electroglottographic bio-impedance signals. To frame the findings of this review, the subject of EGG is first introduced: we summarise the most relevant existing research on EGG, with a particular focus on its application in diagnostics. We then move on to the focal point of this work, describing and comparing the existing EGG-based digital voice pathology classification systems. Applying the PRISMA model, 13 articles were chosen and analysed in detail. Direct comparison of the chosen studies led to pivotal conclusions, described in Section 5 of the report. Certain limitations arising from the literature were also identified, such as a questionable understanding of the nature of EGG bio-impedance signals. Recommendations for future work were made, including the application of different methods for EGG feature extraction and the need for continued development of EGG datasets containing signals gathered under various conditions and with different equipment.
4. Zhao L. The application of graphic language in animation visual guidance system under intelligent environment. Journal of Intelligent Systems 2022. [DOI: 10.1515/jisys-2022-0074]
Abstract
With the continuous development of society, the role of the visual guidance system in animation design has evolved over its long history, reflecting changes in modern aesthetic values. In the field of modern social and cultural design, the visual guidance system in animation design has a distinct regional character and cultural influence. Its visual language should correspond to the visual environment and be easy for people to understand and recognize. It combines animation conception and design technology to capture the cultural charm, aesthetics, values, and behavioral norms of people in different fields. This article studies and analyzes the visual orientation of graphic language in the design of animation visual guidance systems and incorporates orientation-bearing graphic language into animation design, so that the design is more in line with the characteristics of the times, better adapted to emerging media, and better able to convey information between the enterprise and the audience. To further understand the audience's preferences regarding elements of graphic expression, the article analyzes respondents' subjective perceptions of the importance of color selection, calligraphy fonts, graphic expression, and modeling meaning. The results showed that respondents aged 21-35 paid the most attention to the choice of graphic colors, with the largest such group numbering 69.
Affiliation(s)
- Luning Zhao, College of Design and Creativity, Xiamen University Tan Kah Kee College, Zhangzhou 363123, Fujian, China
5. Włodarczak M, Ludusan B, Sundberg J, Heldner M. Classification of voice quality using neck-surface acceleration: Comparison with glottal flow and radiated sound. J Voice 2022:S0892-1997(22)00198-9. [PMID: 36028369] [DOI: 10.1016/j.jvoice.2022.06.034]
Abstract
OBJECTIVES The aim of the present study is to investigate the usefulness of features extracted from miniature accelerometers attached to the speaker's tracheal wall below the glottis for classification of phonation type. The performance of the accelerometer features is evaluated relative to features obtained from the inverse-filtered and radiated sound. While the former is a good proxy for the voice source, obtaining robust voice source features from the latter is considered difficult since it also contains information about the vocal tract filter. By contrast, the accelerometer signal is largely unaffected by the vocal tract, and although it is shaped by subglottal resonances and the transfer properties of the neck tissue, these properties remain constant within a speaker. For this reason, we expect it to provide a better approximation of the voice source than the raw audio. We also investigate which aspects of the voice source are derivable from the accelerometer and microphone signals. METHODS Five trained singers (two females and three males) were recorded producing the syllable [pæ:] in three voice qualities (neutral, breathy and pressed) and at three pitch levels determined by the participants' personal preference. Features extracted from the three signals were used for classification of phonation type with a random forest classifier. In addition, the accelerometer and microphone features with the highest correlation with the voice source features were identified. RESULTS The three signals showed comparable classification error rates, with considerable differences across speakers both in overall performance and in the importance of individual features. The speaker-specific differences notwithstanding, variation of phonation type had consistent effects on the voice source, accelerometer and audio signals. With regard to the voice source, AQ, NAQ, L1L2 and CQ all showed a monotonic variation along the breathy-neutral-pressed continuum. Several features were also found to vary systematically in the accelerometer and audio signals: HRF, L1L2 and CPPS (in both the accelerometer and the audio), as well as the sound level (in the audio). The random forest analysis revealed that all of these features were also among the most important for the classification of voice quality. CONCLUSION Both the accelerometer and the audio signals were found to discriminate between phonation types with an accuracy approaching that of the voice source. Thus, the accelerometer signal, which is largely uncontaminated by vocal tract resonances, offered no advantage over the signal collected with a normal microphone.
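Editor's note: as a rough illustration of the classification setup described above, the sketch below trains a random forest on a per-token feature table and reports feature importances. The column names (NAQ, AQ, L1L2, CQ, HRF, CPPS) are placeholders for whichever signal's features are being tested; the study's actual feature extraction and cross-validation scheme are not reproduced.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

def classify_phonation(df: pd.DataFrame):
    """df: one row per [pae:] token, with assumed feature columns and a 'phonation' label."""
    feature_cols = ["NAQ", "AQ", "L1L2", "CQ", "HRF", "CPPS"]   # assumed column names
    X = df[feature_cols].to_numpy()
    y = df["phonation"].to_numpy()          # "breathy" / "neutral" / "pressed"

    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    pred = cross_val_predict(rf, X, y, cv=5)
    print(classification_report(y, pred))

    # Feature importances, analogous to ranking features by their contribution.
    rf.fit(X, y)
    for name, imp in sorted(zip(feature_cols, rf.feature_importances_), key=lambda t: -t[1]):
        print(f"{name}: {imp:.3f}")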
Affiliation(s)
- Bogdan Ludusan, Faculty of Linguistics and Literary Studies, Bielefeld University, Germany
- Johan Sundberg, Department of Speech, Music and Hearing, KTH Royal Institute of Technology, Sweden
6. Electroglottograph-Based Speech Emotion Recognition via Cross-Modal Distillation. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12094338]
Abstract
Speech emotion recognition (SER) is an important component of affective computing and signal processing. Recently, many works have applied abundant acoustic features and complex model architectures to enhance performance, but they sacrifice the portability of the model. To address this problem, we propose a model utilizing only the fundamental frequency from electroglottograph (EGG) signals. EGG signals are physiological signals that directly reflect the movement of the vocal cords. Under the assumption that different acoustic features share similar representations of the internal emotional state, we propose cross-modal emotion distillation (CMED) to train the EGG-based SER model by transferring robust speech emotion representations from the log-Mel-spectrogram-based model. Using cross-modal emotion distillation, we increase recognition accuracy from 58.98% to 66.80% on the S70 subset of the Chinese Dual-mode Emotional Speech Database (CDESD, 7 classes) and from 32.29% to 42.71% on the EMO-DB (7 classes) dataset, which shows that the proposed method achieves a result comparable with the human subjective experiment and realizes a trade-off between model complexity and performance.
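Editor's note: the cross-modal distillation idea can be sketched as a standard soft-label knowledge-distillation loss, with the log-Mel-spectrogram model as teacher and the EGG-F0 model as student. This is a generic distillation formulation, not the paper's exact CMED objective; the temperature and weighting are assumptions.

import torch
import torch.nn.functional as F

def cmed_style_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """Hard-label cross-entropy plus soft-label distillation from the teacher."""
    # Ordinary supervised loss on the emotion labels.
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between temperature-softened teacher and student distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kd

# Typical use: the teacher (log-Mel model) is frozen, the student (EGG-F0 model) is trained.
# with torch.no_grad():
#     teacher_logits = teacher(log_mel_batch)
# loss = cmed_style_loss(student(egg_f0_batch), teacher_logits, emotion_labels)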
7. Chen L, Ren J, Chen P, Mao X, Zhao Q. Limited text speech synthesis with electroglottograph based on Bi-LSTM and modified Tacotron-2. Appl Intell 2022. [DOI: 10.1007/s10489-021-03075-x]
Abstract
This paper proposes a framework that applies only the EGG signal to speech synthesis in a scenario with a limited set of content categories. EGG is a physiological signal that reflects the movement of the vocal cords. Because EGG is acquired differently from speech signals, we exploit its application to speech synthesis under two scenarios: (1) synthesizing speech in high-noise conditions, where clean speech signals are unavailable; and (2) enabling people who cannot speak but retain vocal cord vibration to speak again. Our study consists of two stages, EGG-to-text and text-to-speech. The first is a text content recognition model based on a Bi-LSTM, which converts each EGG signal sample into the corresponding text from a limited set of content classes; this model achieves 91.12% accuracy on the validation set in a 20-class content recognition experiment. The second stage synthesizes speech from the corresponding text and the EGG signal. Based on a modified Tacotron-2, our model attains a Mel cepstral distortion (MCD) of 5.877 and a mean opinion score (MOS) of 3.87, which is comparable with state-of-the-art performance, with an improvement of 0.42 and a smaller model size relative to the original Tacotron-2. To carry the speaker characteristics contained in the EGG into the final synthesized speech, we put forward a fine-grained fundamental-frequency modification method, which adjusts the fundamental frequency according to the EGG signal and achieves a lower MCD of 5.781 and a higher MOS of 3.94 than the version without modification.
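Editor's note: a minimal sketch of the first stage (EGG-to-text as a closed-set content classifier) using a bidirectional LSTM. The input dimensionality, hidden size, and mean pooling are assumptions rather than the paper's architecture.

import torch
import torch.nn as nn

class EggContentClassifier(nn.Module):
    """Bi-LSTM that maps an EGG sequence to one of a small set of content classes."""
    def __init__(self, input_dim=1, hidden=128, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                 # x: (batch, time, input_dim)
        out, _ = self.lstm(x)
        pooled = out.mean(dim=1)          # average over time (one simple choice)
        return self.head(pooled)

# Example shapes: a batch of 8 EGG excerpts, 4000 samples (or frames) each.
model = EggContentClassifier()
logits = model(torch.randn(8, 4000, 1))  # -> (8, 20)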
8. Kadiri SR, Alku P. Glottal features for classification of phonation type from speech and neck surface accelerometer signals. Comput Speech Lang 2021. [DOI: 10.1016/j.csl.2021.101232]
9. Stasak B, Huang Z, Razavi S, Joachim D, Epps J. Automatic Detection of COVID-19 Based on Short-Duration Acoustic Smartphone Speech Analysis. Journal of Healthcare Informatics Research 2021;5:201-217. [PMID: 33723525] [PMCID: PMC7948650] [DOI: 10.1007/s41666-020-00090-4]
Abstract
Currently, there is an increasing global need for COVID-19 screening to help reduce the rate of infection and the at-risk patient workload at hospitals. Smartphone-based screening for COVID-19 along with other respiratory illnesses offers excellent potential due to its rapid-rollout remote platform, user convenience, symptom tracking, comparatively low cost, and prompt result processing timeframe. In particular, speech-based analysis embedded in smartphone app technology can measure physiological effects relevant to COVID-19 screening that are not yet digitally available at scale in the healthcare field. Using a selection of the Sonde Health COVID-19 2020 dataset, this study examines the speech of COVID-19-negative participants exhibiting mild and moderate COVID-19-like symptoms as well as that of COVID-19-positive participants with mild to moderate symptoms. Our study investigates the classification potential of acoustic features (e.g., glottal, prosodic, spectral) from short-duration speech segments (e.g., held vowel, pataka phrase, nasal phrase) for automatic COVID-19 classification using machine learning. Experimental results indicate that certain feature-task combinations can produce COVID-19 classification accuracy of up to 80%, compared with the all-acoustic-feature baseline (68%). Further, with brute-force n-best feature selection and speech-task fusion, automatic COVID-19 classification accuracy of 82-86% was achieved, depending on whether the COVID-19-negative participants had mild or moderate COVID-19-like symptom severity.
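Editor's note: a hedged sketch of what brute-force n-best feature selection could look like, exhaustively scoring small feature subsets with cross-validation and keeping the best ones. The classifier, scoring metric, and fusion rule used in the study are not specified here and are assumptions.

from itertools import combinations
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

def n_best_subsets(X, y, feature_names, subset_size=3, n_best=5):
    """Exhaustively score all subsets of `subset_size` features and return the n best."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    scored = []
    for idx in combinations(range(X.shape[1]), subset_size):
        score = cross_val_score(clf, X[:, idx], y, cv=5).mean()
        scored.append((score, [feature_names[i] for i in idx]))
    return sorted(scored, reverse=True)[:n_best]

# Speech-task fusion could then be approximated by, for example, averaging per-task
# class probabilities or concatenating the best per-task feature subsets.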
Affiliation(s)
- Brian Stasak, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, NSW, Australia
- Zhaocheng Huang, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, NSW, Australia
- Julien Epps, School of Electrical Engineering & Telecommunications, University of New South Wales, Sydney, NSW, Australia
10. Performance of Different Acoustic Measures to Discriminate Individuals With and Without Voice Disorders. J Voice 2020;36:487-498. [PMID: 32798120] [DOI: 10.1016/j.jvoice.2020.07.008]
Abstract
The goal of this study is to compare and combine different acoustic features for discriminating between subjects with and without voice disorders. A database of 484 adult subjects was used in the research. All subjects recorded a sustained vowel /ε/ and underwent a laryngoscopic examination of the larynx. Based on the laryngeal examination performed by a physician and the auditory-perceptual judgment performed by a Speech-Language Pathologist, the subjects were allocated to groups with (n = 52) and without (n = 432) a voice disorder. Four types of acoustic features were used: traditional measures, cepstral measures, nonlinear measures, and recurrence quantification measures, all computed from the recorded vowel /ε/. Quadratic discriminant analysis was used as the classifier. Individual features among the traditional, cepstral, and recurrence quantification measures achieved an acceptable performance of ≥70%. Combining measures improved the classifier performance, and the best classification result (86.43% accuracy) was obtained by combining traditional linear and recurrence quantification measures. The results show that traditional, cepstral, and recurrence quantification measures are promising features that capture meaningful information about voice production and provide good classification performance. The findings of this study can be used to develop a computational tool for voice disorder diagnosis and monitoring.
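Editor's note: a compact sketch of the classification step, applying quadratic discriminant analysis to a combined feature matrix. The study's train/test protocol and any handling of the 52 vs 432 class imbalance are not reproduced; the cross-validation and scoring choices below are assumptions.

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_feature_set(X, y):
    """X: combined traditional + recurrence features; y: 1 = voice disorder, 0 = control."""
    qda = QuadraticDiscriminantAnalysis()
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    # Balanced accuracy is one reasonable metric given the strong class imbalance.
    scores = cross_val_score(qda, X, y, cv=cv, scoring="balanced_accuracy")
    return scores.mean(), scores.std()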
11. Lei Z, Kennedy E, Fasanella L, Li-Jessen NYK, Mongeau L. Discrimination between Modal, Breathy and Pressed Voice for Single Vowels Using Neck-Surface Vibration Signals. Applied Sciences (Basel) 2019;9:1505. [PMID: 32133204] [PMCID: PMC7055909] [DOI: 10.3390/app9071505]
Abstract
The purpose of this study was to investigate the feasibility of using neck-surface acceleration signals to discriminate between modal, breathy and pressed voice. Voice data for five English single vowels were collected from 31 female native Canadian English speakers using a portable neck-surface accelerometer (NSA) and a condenser microphone. First, auditory-perceptual ratings were conducted by five clinically certified Speech-Language Pathologists (SLPs), who categorized voice type from the audio recordings. Intra- and inter-rater analyses were used to determine the SLPs' reliability for the perceptual categorization task. Mixed-type samples were screened out, and congruent samples were kept for the subsequent classification task. Second, features such as spectral harmonics, jitter, shimmer and spectral entropy were extracted from the NSA data. Supervised learning algorithms were used to map feature vectors to voice type categories, and a feature wrapper strategy was used to evaluate the contribution of each feature or feature combination to the classification between voice types. The results showed that the highest classification accuracy on the full feature set was 82.5%. The classification accuracy for breathy voice was notably greater (by approximately 12%) than those of the other two voice types. Shimmer and spectral entropy were the features that correlated best with classification accuracy.
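Editor's note: spectral entropy, one of the best-performing features reported above, can be sketched as the normalised Shannon entropy of the power spectral density. The Welch estimator and window length below are assumptions, not the paper's settings.

import numpy as np
from scipy.signal import welch

def spectral_entropy(x, fs, nperseg=1024):
    """Normalised Shannon entropy of the Welch power spectral density (range 0..1)."""
    _, psd = welch(x, fs=fs, nperseg=nperseg)
    p = psd / np.sum(psd)                    # treat the PSD as a probability mass function
    h = -np.sum(p * np.log2(p + 1e-12))      # Shannon entropy in bits
    return h / np.log2(len(p))               # normalise by the maximum possible entropy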
Affiliation(s)
- Zhengdong Lei, Department of Mechanical Engineering, McGill University, Montreal, QC H3A 0G4, Canada
- Evan Kennedy, School of Communication Sciences and Disorders, McGill University, Montreal, QC H3A 0G4, Canada
- Laura Fasanella, Department of Mechanical Engineering, McGill University, Montreal, QC H3A 0G4, Canada
- Luc Mongeau, Department of Mechanical Engineering, McGill University, Montreal, QC H3A 0G4, Canada