1. Bandela SR, Siva Priyanka S, Sunil Kumar K, Vijay Bhaskar Reddy Y, Berhanu AA. Stressed Speech Emotion Recognition Using Teager Energy and Spectral Feature Fusion with Feature Optimization. Computational Intelligence and Neuroscience 2023; 2023:5765760. [PMID: 37868755; PMCID: PMC10586421; DOI: 10.1155/2023/5765760]
Abstract
The objective of speech emotion recognition (SER) is to enhance the man-machine interface. It can also be used to reveal the physiological state of a person in critical situations, and SER has recently found applications in medicine and forensics. A new feature extraction technique based on the Teager energy operator (TEO), the Teager energy-autocorrelation envelope (TEO-Auto-Env), is proposed for the detection of stressed emotions. TEO is designed to emphasize the energy of stressed speech signals, whose energy is reduced during speech production, and is therefore well suited to this analysis. A stressed speech emotion recognition (SSER) system is developed using TEO-Auto-Env in combination with spectral features: Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), and relative spectra-perceptual linear prediction (RASTA-PLP). The EMO-DB (German), EMOVO (Italian), IITKGP (Telugu), and EMA (English) databases are used in this analysis. Emotions are classified with a k-nearest neighbor (k-NN) classifier for gender-dependent (GD) and speaker-independent (SI) cases, with average recall used for performance evaluation. The proposed SSER system improves accuracy over existing systems. The highest classification accuracy is achieved with the combination of TEO-Auto-Env, MFCC, and LPCC features: 91.4% (SI), 91.4% (GD-male), and 93.1% (GD-female) for EMO-DB; 68.5% (SI), 68.5% (GD-male), and 74.6% (GD-female) for EMOVO; 90.6% (SI), 91% (GD-male), and 92.3% (GD-female) for EMA; and 95.1% (GD-female) for the IITKGP female database.
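The abstract describes a pipeline of TEO-based feature extraction fused with spectral features, classified with k-NN. As a hedged illustration only (the paper's exact TEO-Auto-Env formulation, frame parameters, and choice of k are not given here, and LPCC is omitted for brevity), a minimal Python sketch using librosa and scikit-learn might look like this:

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def teager_energy(x):
    # Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def teo_auto_env(x, frame_len=512, hop=256, n_coeffs=12):
    # Illustrative stand-in for TEO-Auto-Env (the paper's exact formulation
    # is not reproduced here): frame the Teager energy profile, take each
    # frame's normalized autocorrelation, and average the leading coefficients.
    psi = teager_energy(x)
    frames = librosa.util.frame(psi, frame_length=frame_len, hop_length=hop)
    envs = []
    for f in frames.T:
        ac = np.correlate(f, f, mode="full")[len(f) - 1:]
        envs.append(ac[:n_coeffs] / (ac[0] + 1e-12))
    return np.mean(envs, axis=0)

def fused_features(x, sr):
    # Fuse TEO-based and spectral (MFCC) features, echoing the best-performing
    # TEO-Auto-Env + MFCC + LPCC combination reported above.
    mfcc = librosa.feature.mfcc(y=x, sr=sr, n_mfcc=13).mean(axis=1)
    return np.concatenate([teo_auto_env(x), mfcc])

# Classification with k-NN (k = 5 is an arbitrary placeholder):
# X = np.vstack([fused_features(x, sr) for x, sr in utterances])
# model = KNeighborsClassifier(n_neighbors=5).fit(X, labels)
```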
Affiliation(s)
- Afework Aemro Berhanu
- Department of Environmental Engineering, College of Biological and Chemical Engineering, Addis Ababa Science and Technology University, Addis Ababa, Ethiopia
2. Smart voice recognition based on deep learning for depression diagnosis. Artificial Life and Robotics 2023. [DOI: 10.1007/s10015-023-00852-4]
3. Little B, Alshabrawy O, Stow D, Ferrier IN, McNaney R, Jackson DG, Ladha K, Ladha C, Ploetz T, Bacardit J, Olivier P, Gallagher P, O'Brien JT. Deep learning-based automated speech detection as a marker of social functioning in late-life depression. Psychol Med 2021; 51:1441-1450. [PMID: 31944174; PMCID: PMC8311821; DOI: 10.1017/s0033291719003994]
Abstract
BACKGROUND Late-life depression (LLD) is associated with poor social functioning. However, previous research has relied on bias-prone self-report scales to measure social functioning, and a more objective measure is lacking. We tested a novel wearable device that measures the speech participants encounter as an indicator of social interaction. METHODS Twenty-nine participants with LLD and 29 age-matched controls wore a wrist-worn device continuously for seven days, recording their acoustic environment. Acoustic data were automatically analysed using deep learning models that had been developed and validated on an independent speech dataset. Total speech activity and the proportion of speech produced by the device wearer were both detected whilst maintaining participants' privacy. Participants underwent a neuropsychological test battery and clinical and self-report scales measuring severity of depression and general and social functioning. RESULTS Compared to controls, participants with LLD showed poorer self-reported social and general functioning. Total speech activity was much lower for participants with LLD than for controls, with no overlap between groups. The proportion of speech produced by the participant was smaller for LLD than for controls. In LLD, both speech measures correlated with attention and psychomotor speed performance but not with depression severity or self-reported social functioning. CONCLUSIONS Using this device, LLD was associated with lower levels of speech than in controls, and speech activity was related to psychomotor retardation. We have demonstrated that speech activity measured by wearable technology differentiated LLD from controls with high precision and, in this study, provided an objective measure of an aspect of real-world social functioning in LLD.
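The study's deep learning speech detector is not described in this abstract. Purely as an illustration of the kind of "total speech activity" measure involved, a naive energy-threshold voice activity sketch (not the authors' model; the frame sizes and threshold are arbitrary assumptions) could be:

```python
import numpy as np
import librosa

def speech_activity_ratio(x, sr, frame_s=0.025, hop_s=0.010, thresh_db=-35.0):
    # Naive energy-based voice activity detection -- NOT the paper's deep
    # learning model: the fraction of frames whose RMS level, relative to
    # the recording's peak, exceeds a fixed threshold.
    frame_len = int(frame_s * sr)
    hop_len = int(hop_s * sr)
    rms = librosa.feature.rms(y=x, frame_length=frame_len, hop_length=hop_len)[0]
    level_db = librosa.amplitude_to_db(rms, ref=np.max)
    return float(np.mean(level_db > thresh_db))  # proportion of "active" frames
```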
Affiliation(s)
- Bethany Little
- Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK
- Ossama Alshabrawy
- Interdisciplinary Computing and Complex BioSystems (ICOS) group, School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Faculty of Science, Damietta University, New Damietta, Egypt
- Daniel Stow
- Institute of Health and Society, Newcastle University, Newcastle upon Tyne, UK
- I. Nicol Ferrier
- Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK
- Daniel G. Jackson
- Open Lab, School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Karim Ladha
- Open Lab, School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Thomas Ploetz
- School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA, USA
- Jaume Bacardit
- Interdisciplinary Computing and Complex BioSystems (ICOS) group, School of Computing, Newcastle University, Newcastle upon Tyne, UK
- Patrick Olivier
- Faculty of Information Technology, Monash University, Melbourne, Australia
- Peter Gallagher
- Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK
- John T. O'Brien
- Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK
- Department of Psychiatry, University of Cambridge, Cambridge, UK
4. Wang J, Zhang L, Liu T, Pan W, Hu B, Zhu T. Acoustic differences between healthy and depressed people: a cross-situation study. BMC Psychiatry 2019; 19:300. [PMID: 31615470; PMCID: PMC6794822; DOI: 10.1186/s12888-019-2300-7]
Abstract
BACKGROUND Abnormalities in vocal expression during a depressed episode have frequently been reported in people with depression, but less is known about whether these abnormalities exist only in particular situations. In addition, previous studies did not control for the impact of irrelevant demographic variables on voice. This study therefore compares vocal differences between depressed and healthy people across various situations, treating irrelevant variables as covariates. METHODS To examine whether the vocal abnormalities of people with depression exist only in particular situations, this study compared healthy people and patients with unipolar depression across 12 situations (speech scenarios). Positive, negative, and neutral voice expressions of depressed and healthy people were compared in four tasks. Multivariate analysis of covariance (MANCOVA) was used to evaluate the main effect of group (depressed vs. healthy) on acoustic features. Acoustic features were evaluated by both statistical significance and magnitude of effect size. RESULTS Significant differences between the two groups were observed in all 12 speech scenarios. Although the significant acoustic features varied across scenarios, three features (loudness, MFCC5, and MFCC7) consistently differed between people with and without depression, with large effect magnitudes. CONCLUSIONS Vocal differences between depressed and healthy people exist across all 12 scenarios. Acoustic features including loudness, MFCC5, and MFCC7 have the potential to serve as indicators for identifying depression via voice analysis. These findings support the view that the voices of depressed people exhibit both situation-specific and cross-situational patterns of acoustic features.
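As a hedged sketch of this statistical approach (MANCOVA on acoustic features with demographic covariates), statsmodels' MANOVA interface accepts covariates on the right-hand side of the formula. All column names and the synthetic data below are illustrative assumptions, not the study's data:

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Synthetic stand-in data; real work would use per-subject acoustic features.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "group": rng.integers(0, 2, n),    # 0 = healthy, 1 = depressed (assumed coding)
    "age": rng.normal(35.0, 10.0, n),  # demographic covariate
    "loudness": rng.normal(60.0, 5.0, n),
    "mfcc5": rng.normal(0.0, 1.0, n),
    "mfcc7": rng.normal(0.0, 1.0, n),
})

# Multiple dependent acoustic features ~ group effect, adjusted for a covariate.
mv = MANOVA.from_formula("loudness + mfcc5 + mfcc7 ~ group + age", data=df)
print(mv.mv_test())  # Wilks' lambda, Pillai's trace, etc. per model term
```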
Affiliation(s)
- Jingying Wang
- Institute of Psychology, Chinese Academy of Sciences, Beijing, China
- Lei Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
- Tianli Liu
- Institute of Population Research, Peking University, Beijing, China
- Wei Pan
- Institute of Psychology, Chinese Academy of Sciences, Beijing, China
- Bin Hu
- School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu Province, China
- Tingshao Zhu
- Institute of Psychology, Chinese Academy of Sciences, Beijing, China
5. Detecting Depression Using an Ensemble Logistic Regression Model Based on Multiple Speech Features. Computational and Mathematical Methods in Medicine 2018; 2018:6508319. [PMID: 30344616; PMCID: PMC6174772; DOI: 10.1155/2018/6508319]
Abstract
Early intervention for depression is very important for easing the disease burden, but current diagnostic methods remain limited. This study investigated automatic classification of depressed speech in a sample of 170 native Chinese subjects (85 healthy controls and 85 depressed patients). The performance of prosodic, spectral, and glottal speech features in recognizing depression was analyzed. We propose an ensemble logistic regression model for detecting depression (ELRDD) in speech. Logistic regression, which was superior in recognizing depression, was selected as the base classifier. The ensemble extracts speech features of several different types, ensuring diversity among the base classifiers. ELRDD provided better classification results than the other classifiers compared. A technique for identifying depression based on ELRDD, called ELRDD-E, was then proposed and tested. It yielded encouraging results, with accuracy of 75.00% for females and 81.82% for males, and favorable sensitivity/specificity of 79.25%/70.59% for females and 78.13%/85.29% for males.
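As a sketch of the ensemble idea only (the paper's exact feature groups, regularization, and fusion rule are not specified in this abstract), one logistic regression base classifier can be trained per feature group and their probabilities averaged:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_ensemble_lr(X, y, feature_groups):
    # One logistic-regression base classifier per feature group (e.g. lists of
    # column indices for prosodic, spectral, and glottal features) -- an
    # assumed re-creation of the ELRDD idea, not the authors' code.
    models = []
    for cols in feature_groups:
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        clf.fit(X[:, cols], y)
        models.append((cols, clf))
    return models

def predict_ensemble_lr(models, X, threshold=0.5):
    # Average the base classifiers' probabilities of the "depressed" class.
    probs = np.mean(
        [clf.predict_proba(X[:, cols])[:, 1] for cols, clf in models], axis=0
    )
    return (probs >= threshold).astype(int)
```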
6. Martinelli E, Mencattini A, Daprati E, Di Natale C. Strength Is in Numbers: Can Concordant Artificial Listeners Improve Prediction of Emotion from Speech? PLoS One 2016; 11:e0161752. [PMID: 27563724; PMCID: PMC5001724; DOI: 10.1371/journal.pone.0161752]
Abstract
Humans can communicate their emotions by modulating facial expressions or the tone of their voice. Although numerous applications enable machines to read facial emotions and recognize the content of verbal messages, methods for speech emotion recognition are still in their infancy. Yet fast and reliable applications for emotion recognition are the obvious next step for present ‘intelligent personal assistants’ and may have countless applications in diagnostics, rehabilitation, and research. Taking inspiration from the dynamics of human group decision-making, we devised a novel speech emotion recognition system that applies, for the first time, a semi-supervised prediction model based on consensus. Three tests were carried out to compare this algorithm with traditional approaches, and labeling performance on a public database of spontaneous speech is reported. The novel system appears to be fast, robust, and less computationally demanding than traditional methods, allowing for easier implementation in portable voice analyzers (as used in rehabilitation, research, industry, etc.) and for applications in the research domain (such as real-time pairing of stimuli to participants’ emotional state, or selective/differential data collection based on emotional content).
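The consensus mechanism is only sketched in this abstract. As an assumed illustration, an ensemble of "artificial listeners" can abstain whenever their continuous emotion predictions disagree beyond a tolerance:

```python
import numpy as np

def consensus_predict(listeners, X, tol=0.1):
    # Each fitted "listener" (any regressor with a .predict method) scores the
    # samples; a prediction is returned only where the listeners' spread falls
    # within a tolerance, mimicking group agreement. The tolerance and the
    # abstention rule are illustrative assumptions, not the paper's algorithm.
    preds = np.stack([m.predict(X) for m in listeners])  # (n_listeners, n_samples)
    spread = preds.max(axis=0) - preds.min(axis=0)
    consensus = preds.mean(axis=0)
    consensus[spread > tol] = np.nan                     # no consensus -> abstain
    return consensus
```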
Affiliation(s)
- Eugenio Martinelli
- Department of Electronic Engineering, University of Rome Tor Vergata, Rome, Italy
- Arianna Mencattini
- Department of Electronic Engineering, University of Rome Tor Vergata, Rome, Italy
- Elena Daprati
- Department of System Medicine and CBMS, University of Rome Tor Vergata, Rome, Italy
- Corrado Di Natale
- Department of Electronic Engineering, University of Rome Tor Vergata, Rome, Italy