1. Zhang Z, Zhang S, Ni D, Wei Z, Yang K, Jin S, Huang G, Liang Z, Zhang L, Li L, Ding H, Zhang Z, Wang J. Multimodal Sensing for Depression Risk Detection: Integrating Audio, Video, and Text Data. Sensors (Basel) 2024; 24:3714. [PMID: 38931497; PMCID: PMC11207438; DOI: 10.3390/s24123714]
Abstract
Depression is a major psychological disorder with a growing impact worldwide. Traditional methods for detecting the risk of depression, predominantly reliant on psychiatric evaluations and self-assessment questionnaires, are often criticized for their inefficiency and lack of objectivity. Advancements in deep learning have paved the way for innovations in depression risk detection methods that fuse multimodal data. This paper introduces a novel framework, the Audio, Video, and Text Fusion-Three Branch Network (AVTF-TBN), designed to amalgamate auditory, visual, and textual cues for a comprehensive analysis of depression risk. Our approach encompasses three dedicated branches (Audio Branch, Video Branch, and Text Branch), each responsible for extracting salient features from the corresponding modality. These features are subsequently fused through a multimodal fusion (MMF) module, yielding a robust feature vector that feeds into a predictive modeling layer. To further our research, we devised an emotion elicitation paradigm based on two distinct tasks (reading and interviewing), implemented to gather a rich, sensor-based depression risk detection dataset. The sensory equipment, such as cameras, captures subtle facial expressions and vocal characteristics essential for our analysis. The research thoroughly investigates the data generated by varying emotional stimuli and evaluates the contribution of different tasks to emotion evocation. In our experiments, the AVTF-TBN model performs best when the data from the two tasks are used together for detection, achieving an F1 score of 0.78, precision of 0.76, and recall of 0.81. Our experimental results confirm the validity of the paradigm and demonstrate the efficacy of the AVTF-TBN model in detecting depression risk, showcasing the crucial role of sensor-based data in mental health detection.
Affiliations
- Zhenwei Zhang: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Shengming Zhang: Affiliated Mental Health Center, Southern University of Science and Technology, Shenzhen 518055, China
- Dong Ni: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Zhaoguo Wei: Shenzhen Kangning Hospital, Shenzhen 518020, China; Shenzhen Mental Health Center, Shenzhen 518020, China
- Kongjun Yang: Shenzhen Kangning Hospital, Shenzhen 518020, China; Shenzhen Mental Health Center, Shenzhen 518020, China
- Shan Jin: Shenzhen Kangning Hospital, Shenzhen 518020, China; Shenzhen Mental Health Center, Shenzhen 518020, China
- Gan Huang: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Zhen Liang: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Li Zhang: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Linling Li: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Huijun Ding: School of Biomedical Engineering, Health Science Center, Shenzhen University, Shenzhen 518060, China; Guangdong Provincial Key Laboratory of Biomedical Measurements and Ultrasound Imaging, Shenzhen 518060, China
- Zhiguo Zhang: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen 518055, China; Peng Cheng Laboratory, Shenzhen 518055, China
- Jianhong Wang: Shenzhen Kangning Hospital, Shenzhen 518020, China; Shenzhen Mental Health Center, Shenzhen 518020, China
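The three-branch fusion design summarized in entry 1 can be illustrated with a minimal sketch: one encoder per modality whose outputs are concatenated and passed to a prediction head. This is a toy model under stated assumptions (layer sizes, concatenation-based fusion, a sigmoid risk output), not the published AVTF-TBN architecture.

```python
# Minimal sketch of a three-branch audio/video/text fusion classifier (PyTorch).
# Layer sizes and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

class ThreeBranchFusion(nn.Module):
    def __init__(self, audio_dim=128, video_dim=256, text_dim=768, hidden=64):
        super().__init__()
        self.audio_branch = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_branch = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        self.text_branch = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        # Multimodal fusion: concatenate branch features, then predict risk.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, audio, video, text):
        fused = torch.cat(
            [self.audio_branch(audio), self.video_branch(video), self.text_branch(text)],
            dim=-1,
        )
        return torch.sigmoid(self.classifier(fused))  # depression-risk probability

model = ThreeBranchFusion()
prob = model(torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 768))
print(prob.shape)  # torch.Size([4, 1])
```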
2. Zolnoori M, Zolnour A, Topaz M. ADscreen: A speech processing-based screening system for automatic identification of patients with Alzheimer's disease and related dementia. Artif Intell Med 2023; 143:102624. [PMID: 37673583; PMCID: PMC10483114; DOI: 10.1016/j.artmed.2023.102624]
Abstract
Alzheimer's disease and related dementias (ADRD) present a looming public health crisis, affecting roughly 5 million people and 11% of older adults in the United States. Despite nationwide efforts for timely diagnosis of patients with ADRD, more than 50% of them are not diagnosed and are unaware of their disease. To address this challenge, we developed ADscreen, an innovative speech-processing-based ADRD screening algorithm for the proactive identification of patients with ADRD. ADscreen consists of five major components: (i) noise reduction to remove background noise from the audio-recorded patient speech, (ii) modeling the patient's ability in phonetic motor planning using acoustic parameters of the patient's voice, (iii) modeling the patient's ability at the semantic and syntactic levels of language organization using linguistic parameters of the patient speech, (iv) extracting vocal and semantic psycholinguistic cues from the patient speech, and (v) building and evaluating the screening algorithm. To identify important speech parameters (features) associated with ADRD, we used Joint Mutual Information Maximization (JMIM), an effective feature selection method for high-dimensional, small-sample-size datasets. The relationship between speech parameters and the outcome variable (presence/absence of ADRD) was modeled using three different machine learning (ML) architectures capable of joining informative acoustic and linguistic parameters with contextual word embedding vectors obtained from DistilBERT (a distilled version of Bidirectional Encoder Representations from Transformers). We evaluated the performance of ADscreen on audio-recorded patient speech (verbal descriptions) from the Cookie-Theft picture description task, which is publicly available in DementiaBank. The joint fusion of acoustic and linguistic parameters with contextual word embedding vectors of DistilBERT achieved an F1-score of 84.64 (standard deviation [std] = ±3.58) and AUC-ROC of 92.53 (std = ±3.34) on the training dataset, and an F1-score of 89.55 and AUC-ROC of 93.89 on the test dataset. In summary, ADscreen has strong potential to be integrated into clinical workflows to address the need for an ADRD screening tool so that patients with cognitive impairment can receive appropriate and timely care.
Affiliations
- Maryam Zolnoori: Columbia University Medical Center, New York, NY, United States of America; School of Nursing, Columbia University, New York, NY, United States of America
- Ali Zolnour: School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
- Maxim Topaz: Columbia University Medical Center, New York, NY, United States of America; School of Nursing, Columbia University, New York, NY, United States of America
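A rough sketch of the fusion step described in entry 2 (acoustic and linguistic parameters joined with DistilBERT contextual embeddings) might look as follows. The feature names, pooling strategy, downstream classifier, and data are placeholder assumptions, not the authors' exact pipeline.

```python
# Hedged sketch: fuse a DistilBERT transcript embedding with acoustic features
# and fit a simple classifier. Feature names, dimensions, and labels are
# hypothetical placeholders, not the ADscreen pipeline or its data.
import numpy as np
import torch
from transformers import DistilBertModel, DistilBertTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
bert = DistilBertModel.from_pretrained("distilbert-base-uncased")

def embed(transcript: str) -> np.ndarray:
    """Mean-pooled DistilBERT embedding of one picture-description transcript."""
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (1, n_tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Toy data: acoustic/linguistic parameters plus one shared transcript embedding.
# In practice each participant has their own transcript and feature vector.
acoustic = np.random.rand(20, 12)                  # e.g. pause rate, jitter, ...
text_emb = embed("the boy is reaching for the cookie jar while the stool tips")
labels = np.random.randint(0, 2, size=20)          # ADRD vs. control (toy labels)

X = np.hstack([acoustic, np.tile(text_emb, (20, 1))])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy (toy):", clf.score(X, labels))
```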
3. Yang W, Liu J, Cao P, Zhu R, Wang Y, Liu JK, Wang F, Zhang X. Attention guided learnable time-domain filterbanks for speech depression detection. Neural Netw 2023; 165:135-149. [PMID: 37285730; DOI: 10.1016/j.neunet.2023.05.041]
Abstract
Depression, as a global mental health problem, lacks effective screening methods that can help with early detection and treatment. This paper aims to facilitate the large-scale screening of depression by focusing on the speech depression detection (SDD) task. Currently, direct modeling on the raw signal yields a large number of parameters, and the existing deep learning-based SDD models mainly use fixed Mel-scale spectral features as input. However, these features are not designed for depression detection, and the manual settings limit the exploration of fine-grained feature representations. In this paper, we learn effective representations of the raw signals from an interpretable perspective. Specifically, we present a joint learning framework with attention-guided learnable time-domain filterbanks for depression classification (DALF), which combines a depression filterbanks features learning (DFBL) module and a multi-scale spectral attention learning (MSSA) module. DFBL is capable of producing biologically meaningful acoustic features by employing learnable time-domain filters, and MSSA is used to guide the learnable filters to better retain the useful frequency sub-bands. We collect a new dataset, the Neutral Reading-based Audio Corpus (NRAC), to facilitate research in depression analysis, and we evaluate the performance of DALF on the NRAC and the public DAIC-WOZ datasets. The experimental results demonstrate that our method outperforms the state-of-the-art SDD methods with an F1 of 78.4% on the DAIC-WOZ dataset. In particular, DALF achieves F1 scores of 87.3% and 81.7% on the two parts of the NRAC dataset. By analyzing the filter coefficients, we find that the most important frequency range identified by our method is 600-700 Hz, which corresponds to the Mandarin vowels /e/ and /ê/ and can be considered an effective biomarker for the SDD task. Taken together, our DALF model provides a promising approach to depression detection.
Affiliations
- Wenju Yang: College of Computer Science and Engineering, Northeastern University, Shenyang, 110819, Liaoning, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang, 110819, Liaoning, China
- Jiankang Liu: College of Computer Science and Engineering, Northeastern University, Shenyang, 110819, Liaoning, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang, 110819, Liaoning, China
- Peng Cao: College of Computer Science and Engineering, Northeastern University, Shenyang, 110819, Liaoning, China; Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang, 110819, Liaoning, China
- Rongxin Zhu: Early Intervention Unit, Department of Psychiatry, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing, 210096, China
- Yang Wang: Early Intervention Unit, Department of Psychiatry, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing, 210096, China
- Jian K Liu: School of Computing, University of Leeds, Leeds, LS2 9JT, United Kingdom
- Fei Wang: Early Intervention Unit, Department of Psychiatry, Affiliated Nanjing Brain Hospital, Nanjing Medical University, Nanjing, 210096, China
- Xizhe Zhang: School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, China
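The core idea behind the learnable time-domain filterbank in entry 3 can be sketched generically as a strided 1-D convolution over the raw waveform whose kernels are trained as band-pass filters. This toy module omits DALF's attention guidance and its specific filter design; all sizes are assumptions.

```python
# Toy learnable time-domain filterbank: a 1-D convolution over raw audio whose
# kernels play the role of learned band-pass filters. A generic sketch, not the
# published DFBL/DALF module.
import torch
import torch.nn as nn

class LearnableFilterbank(nn.Module):
    def __init__(self, n_filters=40, kernel_size=401, stride=160):
        super().__init__()
        # Each output channel is one learnable time-domain filter.
        self.filters = nn.Conv1d(1, n_filters, kernel_size,
                                 stride=stride, padding=kernel_size // 2)

    def forward(self, waveform):                 # waveform: (batch, samples)
        x = self.filters(waveform.unsqueeze(1))  # (batch, n_filters, frames)
        return torch.log1p(x.abs())              # compressed "learned spectrogram"

fb = LearnableFilterbank()
feats = fb(torch.randn(2, 16000))                # 1 s of 16 kHz audio
print(feats.shape)                               # torch.Size([2, 40, 100])
```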
4. Applications of Speech Analysis in Psychiatry. Harv Rev Psychiatry 2023; 31:1-13. [PMID: 36608078; DOI: 10.1097/hrp.0000000000000356]
Abstract
The need for objective measurement in psychiatry has stimulated interest in alternative indicators of the presence and severity of illness. Speech may offer a source of information that bridges the subjective and objective in the assessment of mental disorders. We systematically reviewed the literature for articles exploring speech analysis for psychiatric applications. The utility of speech analysis depends on how accurately speech features represent clinical symptoms within and across disorders. We identified four domains of the application of speech analysis in the literature: diagnostic classification, assessment of illness severity, prediction of onset of illness, and prognosis and treatment outcomes. We discuss the findings in each of these domains, with a focus on how types of speech features characterize different aspects of psychopathology. Models that bring together multiple speech features can distinguish speakers with psychiatric disorders from healthy controls with high accuracy. Differentiating between types of mental disorders and between symptom dimensions is a more complex problem that exposes the transdiagnostic nature of speech features. Convergent progress in speech research and computer science opens avenues for implementing speech analysis to enhance the objectivity of assessment in clinical practice. Application of speech analysis will need to address issues of ethics and equity, including the potential to perpetuate discriminatory bias through models that learn from clinical assessment data. Methods that mitigate bias are available and should play a key role in the implementation of speech analysis.
5. Wu P, Wang R, Lin H, Zhang F, Tu J, Sun M. Automatic depression recognition by intelligent speech signal processing: A systematic survey. CAAI Trans Intell Technol 2022. [DOI: 10.1049/cit2.12113]
Affiliations
- Pingping Wu: Jiangsu Key Laboratory of Public Project Audit, School of Engineering Audit, Nanjing Audit University, Nanjing, China
- Ruihao Wang: School of Information Engineering, Nanjing Audit University, Nanjing, China
- Han Lin: Jiangsu Key Laboratory of Public Project Audit, School of Engineering Audit, Nanjing Audit University, Nanjing, China
- Fanlong Zhang: School of Information Engineering, Nanjing Audit University, Nanjing, China
- Juan Tu: Key Laboratory of Modern Acoustics (MOE), School of Physics, Nanjing University, Nanjing, China
- Miao Sun: Faculty of Electrical Engineering, Mathematics & Computer Science, Delft University of Technology, Delft, The Netherlands
6. Lin RF, Leung TK, Liu YP, Hu KR. Disclosing Critical Voice Features for Discriminating between Depression and Insomnia—A Preliminary Study for Developing a Quantitative Method. Healthcare (Basel) 2022; 10:935. [PMID: 35628071; PMCID: PMC9142030; DOI: 10.3390/healthcare10050935]
Abstract
Background: Depression and insomnia are highly related: insomnia is a common symptom among depression patients, and insomnia can result in depression. Although depression patients and insomnia patients should be treated with different approaches, the lack of practical biological markers makes it difficult to discriminate between depression and insomnia effectively. Purpose: This study aimed to disclose critical vocal features for discriminating between depression and insomnia. Methods: Four groups of patients participated in this preliminary study, in which their speaking voices were recorded: six severe-depression patients, four moderate-depression patients, ten insomnia patients, and four patients with chronic pain disorder (CPD). The open-source software openSMILE was applied to extract 384 voice features. Analysis of variance was used to analyze the effects of the four patient statuses on these voice features. Results: Statistical analyses showed significant relationships between patient status and voice features. Patients with severe depression, moderate depression, insomnia, and CPD differed on certain voice features. Critical voice features were reported based on these statistical relationships. Conclusions: This preliminary study shows the potential of developing models that discriminate between depression and insomnia using voice features. Future studies should recruit an adequate number of patients to confirm these voice features and collect more data for developing a quantitative method.
Affiliations
- Ray F. Lin: Department of Industrial Engineering and Management, Yuan Ze University, Taoyuan 32003, Taiwan (corresponding author)
- Ting-Kai Leung: Department of Radiology, Taoyuan General Hospital, Ministry of Health and Welfare, No. 1492, Zhongshan Rd., Taoyuan City 33004, Taiwan; Graduate Institute of Biomedical Materials and Tissue Engineering, College of Biomedical Engineering, Taipei Medical University, Taipei 11031, Taiwan
- Yung-Ping Liu: Department of Industrial Engineering and Management, Chaoyang University of Technology, Taichung 413310, Taiwan
- Kai-Rong Hu: Department of Industrial Engineering and Management, Yuan Ze University, Taoyuan 32003, Taiwan
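Entry 6's workflow (openSMILE feature extraction followed by analysis of variance across patient groups) can be approximated with the openSMILE Python wrapper and SciPy. Note that the study used a 384-dimensional feature set, whereas this sketch uses the readily available eGeMAPS functionals; the file names and group labels are hypothetical.

```python
# Hedged sketch of "extract openSMILE functionals, then ANOVA across groups".
# Paths and group assignments are hypothetical placeholders.
import opensmile
import pandas as pd
from scipy.stats import f_oneway

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

recordings = {  # hypothetical file -> patient group (several files per group)
    "sd_01.wav": "severe_depression", "sd_02.wav": "severe_depression",
    "md_01.wav": "moderate_depression", "md_02.wav": "moderate_depression",
    "in_01.wav": "insomnia", "in_02.wav": "insomnia",
    "cp_01.wav": "chronic_pain", "cp_02.wav": "chronic_pain",
}
rows = []
for path, group in recordings.items():
    feats = smile.process_file(path)        # one row of functionals per file
    feats["group"] = group
    rows.append(feats)
df = pd.concat(rows).reset_index(drop=True)

# One-way ANOVA per voice feature across the four patient groups.
for col in df.columns.drop("group"):
    samples = [df.loc[df["group"] == g, col] for g in df["group"].unique()]
    stat, p = f_oneway(*samples)
    if p < 0.05:
        print(f"{col}: F={stat:.2f}, p={p:.3f}")
```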
7. Ye J, Yu Y, Wang Q, Li W, Liang H, Zheng Y, Fu G. Multi-modal depression detection based on emotional audio and evaluation text. J Affect Disord 2021; 295:904-913. [PMID: 34706461; DOI: 10.1016/j.jad.2021.08.090]
Abstract
BACKGROUND Early detection of depression is very important for treatment. Given the inefficiency of current screening methods, depression identification technology is a challenging research problem with clear application value. METHODS We propose a new experimental method for depression detection based on audio and text; 160 Chinese subjects were investigated in this study. Notably, we propose a text-reading experiment designed to elicit rapid emotional change in subjects, referred to below as the Segmental Emotional Speech Experiment (SESE). We extract 384-dimensional low-level audio features to examine differences across the emotional changes in SESE. We also propose a multi-modal fusion method based on DeepSpectrum features and word-vector features to detect depression using deep learning. RESULTS Our experiments show that SESE can improve the recognition accuracy of depression and reveal differences in low-level audio features; results were verified across case and control groups as well as gender and age groups. The multi-modal fusion model achieves an accuracy of 0.912 and an F1 score of 0.906. CONCLUSIONS Our contribution is twofold. First, we propose and verify SESE, which provides a new experimental paradigm for follow-up researchers. Second, we propose a new, efficient multi-modal depression recognition model.
Affiliations
- Jiayu Ye: School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Yanhong Yu: College of Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine, Jinan 250355, China
- Qingxiang Wang: School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Wentao Li: School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Hu Liang: School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Gang Fu: School of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
8. Wang J, Lv K, Liu C, Nie X, Gowda D, Luan S. Automatic Assessment for Severe Self-Reported Depressive Symptoms Using Speech Cues. IEEE Trans Cogn Dev Syst 2021. [DOI: 10.1109/tcds.2020.3002512]
9. Flanagan O, Chan A, Roop P, Sundram F. Using Acoustic Speech Patterns From Smartphones to Investigate Mood Disorders: Scoping Review. JMIR Mhealth Uhealth 2021; 9:e24352. [PMID: 34533465; PMCID: PMC8486998; DOI: 10.2196/24352]
Abstract
Background: Mood disorders are commonly underrecognized and undertreated, as diagnosis is reliant on self-reporting and clinical assessments that are often not timely. Speech characteristics of those with mood disorders differ from those of healthy individuals. With the wide use of smartphones and the emergence of machine learning approaches, smartphones can be used to monitor speech patterns to support the diagnosis and monitoring of mood disorders. Objective: The aim of this review is to synthesize research on using speech patterns from smartphones to diagnose and monitor mood disorders. Methods: Literature searches of major databases (Medline, PsycInfo, EMBASE, and CINAHL) initially identified 832 relevant articles using the search terms "mood disorders", "smartphone", "voice analysis", and their variants. Only 13 studies met the inclusion criteria: use of a smartphone for capturing voice data, focus on diagnosing or monitoring a mood disorder(s), clinical populations recruited prospectively, and publication in English. Articles were assessed by 2 reviewers, and the data extracted included data type, classifiers used, methods of capture, and study results. Studies were analyzed using a narrative synthesis approach. Results: Studies showed that voice data alone had reasonable accuracy in predicting mood states and mood fluctuations based on objectively monitored speech patterns. While a fusion of different sensor modalities revealed the highest accuracy (97.4%), nearly 80% of the included studies were pilot trials or feasibility studies without control groups and had small sample sizes ranging from 1 to 73 participants. Studies were also carried out over short or varying timeframes and had significant heterogeneity of methods in terms of the types of audio data captured, environmental contexts, classifiers, and measures to control for privacy and ambient noise. Conclusions: Approaches that allow smartphone-based monitoring of speech patterns in mood disorders are rapidly growing. The current body of evidence supports the value of speech patterns to monitor, classify, and predict mood states in real time. However, many challenges remain around the robustness, cost-effectiveness, and acceptability of such an approach, and further work is required to build on current research, reduce the heterogeneity of methodologies, and clinically evaluate the benefits and risks of such approaches.
Affiliations
- Olivia Flanagan: Department of Psychological Medicine, Faculty of Medical and Health Sciences, University of Auckland, Auckland, New Zealand
- Amy Chan: School of Pharmacy, Faculty of Medical and Health Sciences, University of Auckland, Auckland, New Zealand
- Partha Roop: Faculty of Engineering, University of Auckland, Auckland, New Zealand
- Frederick Sundram: Department of Psychological Medicine, Faculty of Medical and Health Sciences, University of Auckland, Auckland, New Zealand
10. Niu M, Liu B, Tao J, Li Q. A time-frequency channel attention and vectorization network for automatic depression level prediction. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.04.056]
11. Shin D, Cho WI, Park CHK, Rhee SJ, Kim MJ, Lee H, Kim NS, Ahn YM. Detection of Minor and Major Depression through Voice as a Biomarker Using Machine Learning. J Clin Med 2021; 10:3046. [PMID: 34300212; PMCID: PMC8303477; DOI: 10.3390/jcm10143046]
Abstract
Both minor and major depression have high prevalence and are important causes of social burden worldwide; however, there is still no objective indicator to detect minor depression. This study aimed to examine if voice could be used as a biomarker to detect minor and major depression. Ninety-three subjects were classified into three groups: the not depressed group (n = 33), the minor depressive episode group (n = 26), and the major depressive episode group (n = 34), based on current depressive status as a dimension. Twenty-one voice features were extracted from semi-structured interview recordings. A three-group comparison was performed through analysis of variance. Seven voice indicators showed differences between the three groups, even after adjusting for age, BMI, and drugs taken for non-psychiatric disorders. Among the machine learning methods, the best performance was obtained using the multi-layer processing method, and an AUC of 65.9%, sensitivity of 65.6%, and specificity of 66.2% were shown. This study further revealed voice differences in depressive episodes and confirmed that not depressed groups and participants with minor and major depression could be accurately distinguished through machine learning. Although this study is limited by a small sample size, it is the first study on voice change in minor depression and suggests the possibility of detecting minor depression through voice.
Affiliations
- Daun Shin: Department of Psychiatry, Seoul National University College of Medicine, Seoul 03080, Korea; Department of Neuropsychiatry, Seoul National University Hospital, Seoul 13620, Korea
- Won Ik Cho: Department of Electrical and Computer Engineering and INMC, Seoul National University College of Engineering, Seoul 08826, Korea
- Sang Jin Rhee: Department of Neuropsychiatry, Seoul National University Hospital, Seoul 13620, Korea
- Min Ji Kim: Department of Neuropsychiatry, Seoul National University Hospital, Seoul 13620, Korea
- Hyunju Lee: Department of Psychiatry, Seoul National University College of Medicine, Seoul 03080, Korea; Department of Neuropsychiatry, Seoul National University Hospital, Seoul 13620, Korea
- Nam Soo Kim: Department of Electrical and Computer Engineering and INMC, Seoul National University College of Engineering, Seoul 08826, Korea
- Yong Min Ahn: Department of Psychiatry, Seoul National University College of Medicine, Seoul 03080, Korea; Department of Neuropsychiatry, Seoul National University Hospital, Seoul 13620, Korea; Institute of Human Behavioral Medicine, Seoul National University Medical Research Center, Seoul 03087, Korea (corresponding author)
12. Goldberg SB, Flemotomos N, Martinez VR, Tanana MJ, Kuo PB, Pace BT, Villatte JL, Georgiou PG, Van Epps J, Imel ZE, Narayanan SS, Atkins DC. Machine learning and natural language processing in psychotherapy research: Alliance as example use case. J Couns Psychol 2020; 67:438-448. [PMID: 32614225; PMCID: PMC7393999; DOI: 10.1037/cou0000382]
Abstract
Artificial intelligence generally and machine learning specifically have become deeply woven into the lives and technologies of modern life. Machine learning is dramatically changing scientific research and industry and may also hold promise for addressing limitations encountered in mental health care and psychotherapy. The current paper introduces machine learning and natural language processing as related methodologies that may prove valuable for automating the assessment of meaningful aspects of treatment. Prediction of therapeutic alliance from session recordings is used as a case in point. Recordings from 1,235 sessions of 386 clients seen by 40 therapists at a university counseling center were processed using automatic speech recognition software. Machine learning algorithms learned associations between client ratings of therapeutic alliance exclusively from session linguistic content. Using a portion of the data to train the model, machine learning algorithms modestly predicted alliance ratings from session content in an independent test set (Spearman's ρ = .15, p < .001). These results highlight the potential to harness natural language processing and machine learning to predict a key psychotherapy process variable that is relatively distal from linguistic content. Six practical suggestions for conducting psychotherapy research using machine learning are presented along with several directions for future research. Questions of dissemination and implementation may be particularly important to explore as machine learning improves in its ability to automate assessment of psychotherapy process and outcome. (PsycInfo Database Record (c) 2020 APA, all rights reserved).
13. Drimalla H, Scheffer T, Landwehr N, Baskow I, Roepke S, Behnia B, Dziobek I. Towards the automatic detection of social biomarkers in autism spectrum disorder: introducing the simulated interaction task (SIT). NPJ Digit Med 2020; 3:25. [PMID: 32140568; PMCID: PMC7048784; DOI: 10.1038/s41746-020-0227-5]
Abstract
Social interaction deficits are evident in many psychiatric conditions and specifically in autism spectrum disorder (ASD), but hard to assess objectively. We present a digital tool to automatically quantify biomarkers of social interaction deficits: the simulated interaction task (SIT), which entails a standardized 7-min simulated dialog via video and the automated analysis of facial expressions, gaze behavior, and voice characteristics. In a study with 37 adults with ASD without intellectual disability and 43 healthy controls, we show the potential of the tool as a diagnostic instrument and for better description of ASD-associated social phenotypes. Using machine-learning tools, we detected individuals with ASD with an accuracy of 73%, sensitivity of 67%, and specificity of 79%, based on their facial expressions and vocal characteristics alone. Especially reduced social smiling and facial mimicry as well as a higher voice fundamental frequency and harmony-to-noise-ratio were characteristic for individuals with ASD. The time-effective and cost-effective computer-based analysis outperformed a majority vote and performed equal to clinical expert ratings.
Affiliations
- Hanna Drimalla: Department of Psychology, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany; Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany; Digital Health Center, Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
- Tobias Scheffer: Institute of Computer Science, University of Potsdam, Am Neuen Palais 10, 14469 Potsdam, Germany
- Niels Landwehr: Institute of Computer Science, University of Potsdam, Am Neuen Palais 10, 14469 Potsdam, Germany; Leibniz Institute for Agricultural Engineering and Bioeconomy, Max-Eyth-Allee 100, 14469 Potsdam, Germany
- Irina Baskow: Department of Psychology, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany; Department of Psychiatry, Charité-Universitätsmedizin Berlin, Campus Benjamin Franklin, Hindenburgdamm 30, 12203 Berlin, Germany
- Stefan Roepke: Department of Psychiatry, Charité-Universitätsmedizin Berlin, Campus Benjamin Franklin, Hindenburgdamm 30, 12203 Berlin, Germany
- Behnoush Behnia: Department of Psychiatry, Charité-Universitätsmedizin Berlin, Campus Benjamin Franklin, Hindenburgdamm 30, 12203 Berlin, Germany
- Isabel Dziobek: Department of Psychology, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany; Berlin School of Mind and Brain, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany
14
15. Harati S, Crowell A, Mayberg H, Nemati S. Depression Severity Classification from Speech Emotion. Annu Int Conf IEEE Eng Med Biol Soc 2018; 2018:5763-5766. [PMID: 30441645; DOI: 10.1109/embc.2018.8513610]
Abstract
Major Depressive Disorder (MDD) is a common psychiatric illness. Automatically classifying depression severity using audio analysis can help clinical management decisions during Deep Brain Stimulation (DBS) treatment of MDD patients. Leveraging the link between short-term emotions and long-term depressed mood states, we build our predictive model on top of emotion-based features. Because acquiring emotion labels of MDD patients is a challenging task, we propose to use an auxiliary emotion dataset to train a Deep Neural Network (DNN) model. The DNN is then applied to audio recordings of MDD patients to find a low-dimensional representation to be used in the classification algorithm. Our preliminary results indicate that the proposed approach, in comparison to the alternatives, effectively classifies depressed and improved phases of DBS treatment with an AUC of 0.80.
16. Wang J, Zhang L, Liu T, Pan W, Hu B, Zhu T. Acoustic differences between healthy and depressed people: a cross-situation study. BMC Psychiatry 2019; 19:300. [PMID: 31615470; PMCID: PMC6794822; DOI: 10.1186/s12888-019-2300-7]
Abstract
BACKGROUND Abnormalities in vocal expression during a depressed episode have frequently been reported in people with depression, but less is known about whether these abnormalities exist only in specific situations. In addition, the impact of irrelevant demographic variables on voice was not controlled in previous studies. Therefore, this study compares the vocal differences between depressed and healthy people under various situations, with irrelevant variables treated as covariates. METHODS To examine whether the vocal abnormalities in people with depression exist only in specific situations, this study compared the vocal differences between healthy people and patients with unipolar depression in 12 situations (speech scenarios). Positive, negative and neutral voice expressions of depressed and healthy people were compared across four tasks. Multivariate analysis of covariance (MANCOVA) was used to evaluate the main effects of group (depressed vs. healthy) on acoustic features. The importance of acoustic features was evaluated by both statistical significance and the magnitude of effect size. RESULTS The results of the multivariate analysis of covariance showed that significant differences between the two groups were observed in all 12 speech scenarios. Although the significant acoustic features were not the same across scenarios, we found that three acoustic features (loudness, MFCC5 and MFCC7) consistently differed between people with and without depression, with large effect magnitudes. CONCLUSIONS Vocal differences between depressed and healthy people exist across 12 scenarios. Acoustic features including loudness, MFCC5 and MFCC7 have the potential to serve as indicators for identifying depression via voice analysis. These findings support the view that depressed people's voices include both situation-specific and cross-situational patterns of acoustic features.
Affiliations
- Jingying Wang: Institute of Psychology, Chinese Academy of Sciences, Beijing, China
- Lei Zhang: Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
- Tianli Liu: Institute of Population Research, Peking University, Beijing, China
- Wei Pan: Institute of Psychology, Chinese Academy of Sciences, Beijing, China
- Bin Hu: School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu Province, China
- Tingshao Zhu: Institute of Psychology, Chinese Academy of Sciences, Beijing, China
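The group-comparison design in entry 16 (acoustic features compared between depressed and healthy speakers while controlling covariates) can be sketched with librosa for feature extraction and a statsmodels MANOVA. The feature helper, the simulated table, and the single age covariate are illustrative assumptions, not the study's dataset or full covariate set.

```python
# Hedged sketch: acoustic features (a loudness proxy plus MFCC coefficients)
# compared between groups with a covariate, via librosa + statsmodels MANOVA.
import numpy as np
import pandas as pd
import librosa
from statsmodels.multivariate.manova import MANOVA

def voice_features(path: str) -> dict:
    """Per-recording features (requires a real audio file)."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    rms = float(librosa.feature.rms(y=y).mean())    # crude loudness proxy
    return {"loudness": rms, "mfcc5": float(mfcc[5]), "mfcc7": float(mfcc[7])}

# Simulated feature table standing in for extracted recordings.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "loudness": rng.normal(0.10, 0.02, 60),
    "mfcc5": rng.normal(0.0, 1.0, 60),
    "mfcc7": rng.normal(0.0, 1.0, 60),
    "group": ["depressed"] * 30 + ["healthy"] * 30,
    "age": rng.integers(18, 60, 60),
})
# Multivariate test of the group effect with age entered as a covariate.
res = MANOVA.from_formula("loudness + mfcc5 + mfcc7 ~ group + age", data=df)
print(res.mv_test())
```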
17. Marmar CR, Brown AD, Qian M, Laska E, Siegel C, Li M, Abu-Amara D, Tsiartas A, Richey C, Smith J, Knoth B, Vergyri D. Speech-based markers for posttraumatic stress disorder in US veterans. Depress Anxiety 2019; 36:607-616. [PMID: 31006959; PMCID: PMC6602854; DOI: 10.1002/da.22890]
Abstract
BACKGROUND The diagnosis of posttraumatic stress disorder (PTSD) is usually based on clinical interviews or self-report measures. Both approaches are subject to under- and over-reporting of symptoms. An objective test is lacking. We have developed a classifier of PTSD based on objective speech-marker features that discriminate PTSD cases from controls. METHODS Speech samples were obtained from warzone-exposed veterans, 52 cases with PTSD and 77 controls, assessed with the Clinician-Administered PTSD Scale. Individuals with major depressive disorder (MDD) were excluded. Audio recordings of clinical interviews were used to obtain 40,526 speech features which were input to a random forest (RF) algorithm. RESULTS The selected RF used 18 speech features and the receiver operating characteristic curve had an area under the curve (AUC) of 0.954. At a probability of PTSD cut point of 0.423, Youden's index was 0.787, and overall correct classification rate was 89.1%. The probability of PTSD was higher for markers that indicated slower, more monotonous speech, less change in tonality, and less activation. Depression symptoms, alcohol use disorder, and TBI did not meet statistical tests to be considered confounders. CONCLUSIONS This study demonstrates that a speech-based algorithm can objectively differentiate PTSD cases from controls. The RF classifier had a high AUC. Further validation in an independent sample and appraisal of the classifier to identify those with MDD only compared with those with PTSD comorbid with MDD is required.
Affiliations
- Charles R. Marmar: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York
- Adam D. Brown: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York; Department of Psychology, New School for Social Research, New York, New York
- Meng Qian: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York
- Eugene Laska: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York
- Carole Siegel: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York
- Meng Li: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York
- Duna Abu-Amara: Department of Psychiatry, New York University School of Medicine, New York, New York; Steven and Alexandra Cohen Veterans Center for the Study of Post-Traumatic Stress and Traumatic Brain Injury, New York, New York
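Entry 17's evaluation recipe (a random-forest classifier over speech features, ROC AUC, and a probability cut point chosen by Youden's index) can be reproduced on synthetic data as follows; the feature count and hyperparameters are assumptions, not the study's configuration.

```python
# Hedged sketch: random forest on speech-feature vectors, AUC, and the
# Youden-optimal probability cut point. Data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=40, n_informative=18, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
proba = rf.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, proba))

fpr, tpr, thresholds = roc_curve(y_te, proba)
j = tpr - fpr                                   # Youden's J at each threshold
best = np.argmax(j)
print("best cut point:", thresholds[best], "Youden's J:", j[best])
```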
18. Pan W, Flint J, Shenhav L, Liu T, Liu M, Hu B, Zhu T. Re-examining the robustness of voice features in predicting depression: Compared with baseline of confounders. PLoS One 2019; 14:e0218172. [PMID: 31220113; PMCID: PMC6586278; DOI: 10.1371/journal.pone.0218172]
Abstract
A large proportion of patients with depressive disorder do not receive an effective diagnosis, which makes it necessary to find a more objective assessment to facilitate a more rapid and accurate diagnosis of depression. Speech data are easy to acquire clinically, and their association with depression has been studied, although the actual predictive effect of voice features has not been examined. Thus, we do not have a general understanding of the extent to which voice features contribute to the identification of depression. In this study, we investigated the significance of the association between voice features and depression using binary logistic regression, and the actual classification effect of voice features on depression was re-examined through classification modeling. Nearly 1000 Chinese females participated in this study, and several different datasets were included as test sets. We found that 4 voice features (PC1, PC6, PC17, PC24; P < 0.05, corrected) made significant contributions to depression, and that the contribution of the voice features alone reached 35.65% (Nagelkerke's R2). In classification modeling, the voice-based model had consistently higher predictive accuracy (F-measure) than the baseline model of demographic data when tested on different datasets, even across different emotional contexts. The F-measure of voice features alone reached 81%, consistent with existing data. These results demonstrate that voice features are effective in predicting depression and indicate that more sophisticated models based on voice features can be built to help in clinical diagnosis.
Affiliations
- Wei Pan: Institute of Psychology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Jonathan Flint: Center for Neurobehavioral Genetics, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles, Los Angeles, United States of America
- Liat Shenhav: Department of Computer Science, University of California Los Angeles, Los Angeles, United States of America
- Tianli Liu: Institute of Population Research, Peking University, Beijing, China
- Mingming Liu: Institute of Psychology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
- Bin Hu: School of Information Science & Engineering, Lanzhou University, Lanzhou, China
- Tingshao Zhu: Institute of Psychology, Chinese Academy of Sciences, Beijing, China (corresponding author)
19. Guidi A, Gentili C, Scilingo E, Vanello N. Analysis of speech features and personality traits. Biomed Signal Process Control 2019. [DOI: 10.1016/j.bspc.2019.01.027]
20. de Barbaro K. Automated sensing of daily activity: A new lens into development. Dev Psychobiol 2019; 61:444-464. [PMID: 30883745; PMCID: PMC7343175; DOI: 10.1002/dev.21831]
Abstract
Rapidly maturing technologies for sensing and activity recognition can provide unprecedented access to the complex structure of daily activity and interaction, promising new insight into the mechanisms by which experience shapes developmental outcomes. Motion data, autonomic activity, and "snippets" of audio and video recordings can be conveniently logged by wearable sensors (Lazer et al., 2009). Machine learning algorithms can process these signals into meaningful markers, from child and parent behavior to outcomes such as depression or teenage drinking. Theoretically motivated aspects of daily activity can be combined and synchronized to examine reciprocal effects between children's behaviors and their environments or internal processes. Captured over longitudinal time, such data provide a new opportunity to study the processes by which individual differences emerge and stabilize. This paper introduces the reader to developments in sensing and activity recognition with implications for developmental phenomena across the lifespan, sketching a framework for leveraging mobile sensors for transactional analyses that bridge micro- and longitudinal timescales of development. It finishes by detailing resources and best practices to help the next generation of developmentalists contribute to this emerging area.
Affiliations
- Kaya de Barbaro: Department of Psychology, The University of Texas at Austin, Austin, Texas
21. Rana R, Latif S, Gururajan R, Gray A, Mackenzie G, Humphris G, Dunn J. Automated screening for distress: A perspective for the future. Eur J Cancer Care (Engl) 2019; 28:e13033. [DOI: 10.1111/ecc.13033]
Affiliations
- Rajib Rana: University of Southern Queensland, Springfield, Queensland, Australia
- Siddique Latif: University of Southern Queensland, Springfield, Queensland, Australia
- Raj Gururajan: University of Southern Queensland, Springfield, Queensland, Australia
- Anthony Gray: University of Southern Queensland, Springfield, Queensland, Australia
- Jeff Dunn: University of Southern Queensland, Springfield, Queensland, Australia; Griffith University, Brisbane, Queensland, Australia; University of Technology Sydney, Sydney, New South Wales, Australia
22. Gillespie S, Laures-Gore J, Moore E, Farina M, Russell S, Haaland B. Identification of Affective State Change in Adults With Aphasia Using Speech Acoustics. J Speech Lang Hear Res 2018; 61:2906-2916. [PMID: 30481797; PMCID: PMC6440307; DOI: 10.1044/2018_jslhr-s-17-0057]
Abstract
Purpose The current study aimed to identify objective acoustic measures related to affective state change in the speech of adults with post-stroke aphasia. Method The speech of 20 post-stroke adults with aphasia was recorded during picture description and administration of the Western Aphasia Battery-Revised (Kertesz, 2006). In addition, participants completed the Self-Assessment Manikin (Bradley & Lang, 1994) and the Stress Scale (Tobii Dynavox, 1981-2016) before and after the language tasks. Speech from each participant was used to detect a change in affective state test scores between the beginning and ending speech. Results Machine learning revealed moderate success in classifying depression, minimal success in predicting depression and stress numeric scores, and minimal success in classifying changes in affective state class between the beginning and ending speech. Conclusions The results suggest the existence of objectively measurable aspects of speech that may be used to identify changes in acute affect from adults with aphasia. This work is exploratory and hypothesis-generating; more work will be needed to make conclusive claims. Further work in this area could lead to automated tools to assist clinicians with their diagnoses of stress, depression, and other forms of affect in adults with aphasia.
Affiliations
- Stephanie Gillespie: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta
- Elliot Moore: School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta
- Matthew Farina: Communication Disorders Program, Georgia State University, Atlanta
- Scott Russell: Department of Speech-Language Pathology, Grady Memorial Hospital, Atlanta, GA
- Benjamin Haaland: Department of Population Health Sciences, University of Utah, Salt Lake City
23. Detecting Depression Using an Ensemble Logistic Regression Model Based on Multiple Speech Features. Comput Math Methods Med 2018; 2018:6508319. [PMID: 30344616; PMCID: PMC6174772; DOI: 10.1155/2018/6508319]
Abstract
Early intervention for depression is very important to ease the disease burden, but current diagnostic methods are still limited. This study investigated automatic depressed-speech classification in a sample of 170 native Chinese subjects (85 healthy controls and 85 depressed patients). The classification performance of prosodic, spectral, and glottal speech features was analyzed for the recognition of depression. We propose an ensemble logistic regression model for detecting depression (ELRDD) in speech. Logistic regression, which was superior in the recognition of depression, was selected as the base classifier. This ensemble model extracted many speech features from different aspects and ensured diversity of the base classifiers. ELRDD provided better classification results than the other compared classifiers. A technique for identifying depression based on ELRDD, called ELRDD-E, was then proposed and tested. It offered encouraging outcomes, revealing a high accuracy level of 75.00% for females and 81.82% for males, as well as an advantageous sensitivity/specificity ratio of 79.25%/70.59% for females and 78.13%/85.29% for males.
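A hedged sketch of the ensemble-of-logistic-regressions idea in entry 23: train one logistic regression per speech-feature family and average their probabilities. The column split into prosodic/spectral/glottal groups, the averaging rule, and the data are illustrative assumptions, not the published ELRDD specification.

```python
# Hedged sketch: ensemble of logistic regressions, one per feature family.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=170, n_features=30, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Assumed split of the columns into three feature families.
subsets = {"prosodic": slice(0, 10), "spectral": slice(10, 20), "glottal": slice(20, 30)}
models = {name: LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
          for name, cols in subsets.items()}

# Average the base classifiers' probabilities and threshold at 0.5.
avg = np.mean([models[name].predict_proba(X_te[:, cols])[:, 1]
               for name, cols in subsets.items()], axis=0)
print("ensemble accuracy:", float(np.mean((avg > 0.5) == y_te)))
```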
24. Pan Z, Gui C, Zhang J, Zhu J, Cui D. Detecting Manic State of Bipolar Disorder Based on Support Vector Machine and Gaussian Mixture Model Using Spontaneous Speech. Psychiatry Investig 2018; 15:695-700. [PMID: 29969852; PMCID: PMC6056700; DOI: 10.30773/pi.2017.12.15]
Abstract
OBJECTIVE This study aimed to compare the accuracy of the Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) in the detection of the manic state of bipolar disorder (BD) for single patients and for multiple patients. METHODS 21 hospitalized BD patients (14 females, average age 34.5±15.3) were recruited after admission. Spontaneous speech was collected through a preloaded smartphone. First, speech features [pitch, formants, mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), gammatone frequency cepstral coefficients (GFCC), etc.] were preprocessed and extracted. Speech features were then selected based on between-class variance and within-class variance. The manic state of patients was then detected by the SVM and GMM methods. RESULTS LPCC demonstrated the best discrimination efficiency. The accuracy of manic-state detection for single patients was much better using the SVM method than the GMM method. The detection accuracy for multiple patients was higher using the GMM method than the SVM method. CONCLUSION SVM provided an appropriate tool for detecting the manic state of single patients, whereas GMM worked better for multiple patients' manic-state detection. Both could help doctors and patients with diagnosis and mood-state monitoring in different situations.
Affiliations
- Zhongde Pan: Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Key Laboratory of Forensic Medicine, Institute of Forensic Science, Ministry of Justice, Shanghai, China
- Chao Gui: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Jing Zhang: Jiading District Mental Health Center, Shanghai, China
- Jie Zhu: School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China
- Donghong Cui: Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Brain Science and Technology Research Center, Shanghai Jiao Tong University, Shanghai, China
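The two detectors compared in entry 24 can be contrasted on toy feature vectors: a discriminative SVM over labelled utterance features versus one generative Gaussian mixture per mood state scored by log-likelihood. Feature extraction (LPCC/MFCC) is not shown, and the data are synthetic stand-ins, not the patients' recordings.

```python
# Hedged sketch: SVM vs. per-state GMM classification of mood state.
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
manic = rng.normal(1.0, 1.0, size=(100, 12))      # stand-in feature vectors, manic state
euthymic = rng.normal(-1.0, 1.0, size=(100, 12))  # stand-in vectors, remission
X = np.vstack([manic, euthymic])
y = np.array([1] * 100 + [0] * 100)

# SVM: one decision function over labelled feature vectors.
svm = SVC(kernel="rbf").fit(X, y)
print("SVM accuracy:", svm.score(X, y))

# GMM: one generative model per state; pick the state with higher likelihood.
gmm_manic = GaussianMixture(n_components=4, random_state=0).fit(manic)
gmm_euthymic = GaussianMixture(n_components=4, random_state=0).fit(euthymic)
pred = (gmm_manic.score_samples(X) > gmm_euthymic.score_samples(X)).astype(int)
print("GMM accuracy:", float(np.mean(pred == y)))
```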
25. He L, Cao C. Automated depression analysis using convolutional neural networks from speech. J Biomed Inform 2018; 83:103-111. [PMID: 29852317; DOI: 10.1016/j.jbi.2018.05.007]
Abstract
To help clinicians efficiently diagnose the severity of a person's depression, the affective computing and artificial intelligence communities have shown growing interest in designing automated systems. Speech features carry useful information for the diagnosis of depression. However, manual design and domain knowledge are still required for feature selection, which makes the process labor-intensive and subjective. In recent years, deep-learned features based on neural networks have outperformed hand-crafted features in various areas. In this paper, to overcome these difficulties, we propose a combination of hand-crafted and deep-learned features that can effectively measure the severity of depression from speech. In the proposed method, deep convolutional neural networks (DCNN) are first built to learn deep features from spectrograms and raw speech waveforms. We then manually extract a state-of-the-art texture descriptor, the median robust extended local binary pattern (MRELBP), from the spectrograms. To capture the complementary information in the hand-crafted and deep-learned features, we propose joint fine-tuning layers that combine the raw-waveform and spectrogram DCNNs to boost depression recognition performance. Moreover, to address the problem of small sample sizes, a data augmentation method is proposed. Experiments conducted on the AVEC2013 and AVEC2014 depression databases show that our approach is robust and effective for the diagnosis of depression compared with state-of-the-art audio-based methods.
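A minimal PyTorch sketch of the joint idea, assuming illustrative shapes and layer sizes: one branch ingests spectrogram patches, another ingests hand-crafted descriptors, and shared layers fuse them into a single depression-severity output. It is not the paper's architecture.

```python
import torch
import torch.nn as nn

class JointDepressionNet(nn.Module):
    def __init__(self, n_handcrafted=64):
        super().__init__()
        self.cnn = nn.Sequential(                        # deep-learned spectrogram features
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.mlp = nn.Sequential(nn.Linear(n_handcrafted, 16), nn.ReLU())
        self.joint = nn.Sequential(                      # fusion ("joint") layers
            nn.Linear(16 + 16, 32), nn.ReLU(),
            nn.Linear(32, 1),                            # one severity score per sample
        )

    def forward(self, spectrogram, handcrafted):
        return self.joint(torch.cat([self.cnn(spectrogram), self.mlp(handcrafted)], dim=1))

model = JointDepressionNet()
spec = torch.randn(4, 1, 128, 128)       # batch of log-spectrogram patches (placeholder)
feats = torch.randn(4, 64)               # e.g. MRELBP-style texture descriptors (placeholder)
print(model(spec, feats).shape)          # torch.Size([4, 1])
```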
Affiliation(s)
- Lang He
- NPU-VUB joint AVSP Research Lab, School of Computer Science, Northwestern Polytechnical University (NPU), Xi'an, China
- Cui Cao
- Moscow Institute of Arts, Weinan Normal University, Weinan, China
26
Zhang J, Pan Z, Gui C, Xue T, Lin Y, Zhu J, Cui D. Analysis on speech signal features of manic patients. J Psychiatr Res 2018; 98:59-63. [PMID: 29291581 DOI: 10.1016/j.jpsychires.2017.12.012] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/06/2017] [Revised: 12/15/2017] [Accepted: 12/18/2017] [Indexed: 10/18/2022]
Abstract
Given the lack of effective biological markers for early diagnosis of bipolar mania, and the tendency for the voice to fluctuate during transitions between mood states, this study investigated the speech features of manic patients to identify a potential set of biomarkers for the diagnosis of bipolar mania. 30 manic patients and 30 healthy controls were recruited, and their speech features were collected during natural dialogue using the Automatic Voice Collecting System. The Bech-Rafaelsen Mania Rating Scale (BRMS) and the Clinical Global Impression scale (CGI) were used to assess illness. Speech features were compared within two groupings: a mood grouping (mania vs remission) and a bipolar grouping (manic patients vs healthy individuals). We found that the characteristic speech signals differed between the mood groups and between the bipolar groups. The fourth formant (F4) and the linear prediction coefficients (LPC) differed significantly (P < .05) when patients transitioned from the manic to the remission state. The first formant (F1), the second formant (F2), and LPC (P < .05) also played key roles in distinguishing patients from healthy individuals. In addition, there was a significant correlation between LPC and BRMS, indicating that LPC may play an important role in the diagnosis of bipolar mania. In this study we traced speech features of bipolar mania during natural dialogue (conversation), which is an accessible approach in clinical practice. Such specific indicators may serve as promising biomarkers for the diagnosis and clinical therapeutic evaluation of bipolar mania.
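The features compared above (formants and LPC) can be sketched in a few lines of Python; the synthetic frame, window, and model order below are illustrative assumptions.

```python
# Sketch: LPC coefficients and formant candidates for one voiced frame.
import numpy as np
import librosa

sr = 16000
t = np.arange(512) / sr
# Synthetic stand-in for a voiced frame (a mix of tones keeps the sketch runnable).
frame = (np.sin(2 * np.pi * 120 * t)
         + 0.3 * np.sin(2 * np.pi * 700 * t)
         + 0.2 * np.sin(2 * np.pi * 1200 * t)) * np.hamming(512)

order = 12
a = librosa.lpc(frame, order=order)            # LPC polynomial coefficients
roots = np.roots(a)
roots = roots[np.imag(roots) > 0]              # keep one root per conjugate pair
formants = np.sort(np.angle(roots) * sr / (2 * np.pi))   # root angles -> frequencies (Hz)
print("LPC coefficients:", np.round(a, 3))
print("Formant candidates (Hz):", np.round(formants))
```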
Affiliation(s)
- Jing Zhang
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Shanghai Jiading Mental Health Center, Shanghai, China
- Zhongde Pan
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Chao Gui
- Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China
- Ting Xue
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Yezhe Lin
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Jie Zhu
- Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China
- Donghong Cui
- Shanghai Key Laboratory of Psychotic Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China; Brain Science and Technology Research Center, Shanghai Jiao Tong University, China
27
Chien YR, Mehta DD, Guðnason J, Zañartu M, Quatieri TF. Evaluation of Glottal Inverse Filtering Algorithms Using a Physiologically Based Articulatory Speech Synthesizer. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 2017; 25:1718-1730. [PMID: 34268444 PMCID: PMC8279087 DOI: 10.1109/taslp.2017.2714839] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Glottal inverse filtering aims to estimate the glottal airflow signal from a speech signal for applications such as speaker recognition and clinical voice assessment. Nonetheless, evaluating inverse filtering algorithms has been challenging because of the practical difficulty of directly measuring glottal airflow. In addition, it is acknowledged that the performance of many methods degrades in voice conditions of great interest, such as breathiness, high pitch, soft voice, and running speech. This paper presents a comprehensive, objective, and comparative evaluation of state-of-the-art inverse filtering algorithms that takes advantage of speech and glottal airflow signals generated by a physiological speech synthesizer. The synthesizer provides a physics-based simulation of the voice production process and thus an adequate test bed for revealing the temporal and spectral performance characteristics of each algorithm. The synthetic data include continuous speech utterances and sustained vowels produced with multiple voice qualities (pressed, slightly pressed, modal, slightly breathy, and breathy), fundamental frequencies, and subglottal pressures to simulate the natural variations of real speech. In evaluating the accuracy of a glottal flow estimate, multiple error measures are used, including an error in the estimated signal that measures overall waveform deviation, as well as errors in several clinically relevant features extracted from the glottal flow estimate. Waveform errors from the glottal flow estimation experiments had mean values of around 30% of the amplitude of the true glottal flow derivative for sustained vowels and around 40% for continuous speech. Closed-phase approaches showed remarkable stability across different voice qualities and subglottal pressures. The algorithms of choice, as suggested by significance tests, are closed-phase covariance analysis for the analysis of sustained vowels and sparse linear prediction for the analysis of continuous speech. Data subset analysis suggests that the analysis of close rounded vowels is an additional challenge in glottal flow estimation.
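To make the basic operation concrete, here is a deliberately crude Python sketch of LPC-based inverse filtering, the simplest relative of the algorithms evaluated above (IAIF, closed-phase covariance analysis, and others are far more careful); the signal and model order are assumptions, and no claim is made that this matches the paper's test bed.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

sr = 16000
t = np.arange(sr) / sr
# Stand-in for one second of voiced speech.
speech = np.sin(2 * np.pi * 110 * t) + 0.4 * np.sin(2 * np.pi * 800 * t)

order = int(sr / 1000) + 2                     # rule-of-thumb LPC order (18 at 16 kHz)
a = librosa.lpc(speech, order=order)           # all-pole vocal-tract model A(z)
residual = lfilter(a, [1.0], speech)           # inverse filtering: A(z) applied to speech,
                                               # a rough proxy for the glottal flow derivative
print("residual / speech energy ratio:", np.sum(residual**2) / np.sum(speech**2))
```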
Affiliation(s)
- Yu-Ren Chien
- Center for Analysis and Design of Intelligent Agents, Reykjavik University, Menntavegur 1, Iceland
- Daryush D Mehta
- Center for Laryngeal Surgery and Voice Rehabilitation, and Institute of Health Professions, Massachusetts General Hospital, Boston, MA 02114, USA; Department of Surgery, Harvard Medical School, Boston, MA 02115, USA; MIT Lincoln Laboratory, Lexington, MA, USA
- Jón Guðnason
- Center for Analysis and Design of Intelligent Agents, Reykjavik University, Menntavegur 1, Iceland
- Matías Zañartu
- Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso 2390123, Chile
28
Guidi A, Schoentgen J, Bertschy G, Gentili C, Scilingo E, Vanello N. Features of vocal frequency contour and speech rhythm in bipolar disorder. Biomed Signal Process Control 2017. [DOI: 10.1016/j.bspc.2017.01.017] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
29
Liu Z, Hu B, Li X, Liu F, Wang G, Yang J. Detecting Depression in Speech Under Different Speaking Styles and Emotional Valences. Brain Inform 2017. [DOI: 10.1007/978-3-319-70772-3_25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
30
Gros A, Bensamoun D, Manera V, Fabre R, Zacconi-Cauvin AM, Thummler S, Benoit M, Robert P, David R. Recommendations for the Use of ICT in Elderly Populations with Affective Disorders. Front Aging Neurosci 2016; 8:269. [PMID: 27877126 PMCID: PMC5099137 DOI: 10.3389/fnagi.2016.00269] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2016] [Accepted: 10/24/2016] [Indexed: 12/27/2022] Open
Abstract
Objective: Affective disorders are frequently encountered among elderly populations, and information and communication technologies (ICT) could add value to their recognition and assessment alongside current clinical methods. However, the diversity of, and lack of consensus in, the emerging field of ICT strongly limits its widespread use in daily practice. The aim of the present article is to provide recommendations for the use of ICT in the assessment and management of affective disorders among elderly populations with or without dementia. Methods: A Delphi panel was organized to gather recommendations from experts in the domain. A set of initial general questions on the use of ICT in affective disorders guided the discussion of the expert panel and the analysis of the Strengths, Weaknesses, Opportunities, and Threats (SWOT) of employing ICT in elderly populations with affective disorders. Based on the results of this first round, a web survey was sent to local general practitioners (GPs) and to all psychiatry interns in France. Results: The first round indicated that ICT may offer very useful tools for practitioners involved in the diagnosis and management of affective disorders. However, the web survey showed a need to better explain the utility of ICT to current and future practitioners, especially for people living with dementia.
Affiliation(s)
- Auriane Gros
- Département de Neurologie, Centre Mémoire de Ressources et de Recherche, Centre Hospitalier Universitaire de Dijon, Dijon, France; CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France; Centre Edmond et Lily Safra pour la Recherche sur la Maladie d'Alzheimer, Centre Mémoire de Ressources et de Recherche, Institut Claude Pompidou, Centre Hospitalier Universitaire de Nice, Nice, France
- David Bensamoun
- CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France; Département de Psychiatrie, Hôpital Pasteur, Centre Hospitalier Universitaire de Nice, Nice, France
- Valeria Manera
- CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France
- Roxane Fabre
- Centre Edmond et Lily Safra pour la Recherche sur la Maladie d'Alzheimer, Centre Mémoire de Ressources et de Recherche, Institut Claude Pompidou, Centre Hospitalier Universitaire de Nice, Nice, France; Département de Santé Publique, Hôpital L'Archet, Centre Hospitalier Universitaire de Nice, Nice, France
- Susanne Thummler
- CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France
- Michel Benoit
- CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France; Département de Psychiatrie, Hôpital Pasteur, Centre Hospitalier Universitaire de Nice, Nice, France
- Philippe Robert
- CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France; Centre Edmond et Lily Safra pour la Recherche sur la Maladie d'Alzheimer, Centre Mémoire de Ressources et de Recherche, Institut Claude Pompidou, Centre Hospitalier Universitaire de Nice, Nice, France
- Renaud David
- CoBTek (Cognition-Behaviour-Technology), University of Nice Sophia Antipolis, Nice, France; Centre Edmond et Lily Safra pour la Recherche sur la Maladie d'Alzheimer, Centre Mémoire de Ressources et de Recherche, Institut Claude Pompidou, Centre Hospitalier Universitaire de Nice, Nice, France
31
Guidi A, Schoentgen J, Bertschy G, Gentili C, Landini L, Scilingo EP, Vanello N. Voice quality in patients suffering from bipolar disease. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2016; 2015:6106-9. [PMID: 26737685 DOI: 10.1109/embc.2015.7319785] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
Bipolar disorder is increasingly common. The condition can severely affect patients' lives through wide, and sometimes extreme, mood swings. Biosignals can be very useful for understanding this disease; in particular, speech-related features have been found to differ between depressed people and healthy subjects. Prosodic, spectral, and energy-related features are usually studied, but further information can be obtained by studying voice quality. According to Laver's model, voice quality depends both on anatomical/physiological factors and on long-term muscular adjustments of the larynx and the supraglottal vocal tract. A pilot study of bipolar patients and healthy control subjects, based on the Long-Term Average Spectrum (LTAS), is presented, and the effects of an F0-correction procedure on LTAS estimation are discussed. Pairwise statistical comparisons were performed between subjects in euthymic and depressed states and between euthymic and hypomanic states, and significant differences were found in some frequency intervals in both cases. The F0-correction procedure modified the values of the significant frequency intervals in the euthymic/depressed comparison, which was also characterized by a change in F0. Notably, no statistically significant differences were found in control subjects recorded in the same mood state. Although the number of subjects is small, the results are encouraging given their coherence across patients and the lack of differences in the control group. Overall, this work suggests that particular vocal settings might be involved in different mood states.
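As a rough illustration of the LTAS itself (not the authors' F0-correction procedure), the sketch below averages the power spectrum over a recording with Welch's method and compares two frequency bands; the signal and band edges are placeholder assumptions.

```python
import numpy as np
from scipy.signal import welch

sr = 16000
x = np.random.default_rng(2).normal(size=10 * sr)        # stand-in for a speech recording
freqs, ltas = welch(x, fs=sr, nperseg=1024)               # power spectrum averaged over the recording

# Compare, e.g., the mean level in a low and a high frequency interval (dB).
low = 10 * np.log10(ltas[(freqs >= 0) & (freqs < 1000)].mean())
high = 10 * np.log10(ltas[(freqs >= 4000) & (freqs < 5000)].mean())
print(f"0-1 kHz: {low:.1f} dB, 4-5 kHz: {high:.1f} dB")
```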
32
Maxhuni A, Muñoz-Meléndez A, Osmani V, Perez H, Mayora O, Morales EF. Classification of bipolar disorder episodes based on analysis of voice and motor activity of patients. PERVASIVE AND MOBILE COMPUTING 2016; 31:50-66. [DOI: 10.1016/j.pmcj.2016.01.008] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/25/2023]
33
Faurholt-Jepsen M, Busk J, Frost M, Vinberg M, Christensen EM, Winther O, Bardram JE, Kessing LV. Voice analysis as an objective state marker in bipolar disorder. Transl Psychiatry 2016; 6:e856. [PMID: 27434490 PMCID: PMC5545710 DOI: 10.1038/tp.2016.123] [Citation(s) in RCA: 114] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/25/2016] [Revised: 04/04/2016] [Accepted: 05/05/2016] [Indexed: 12/30/2022] Open
Abstract
Changes in speech have been suggested as sensitive and valid measures of depression and mania in bipolar disorder. The present study aimed to investigate (1) voice features collected during phone calls as objective markers of affective states in bipolar disorder and (2) whether combining voice features with automatically generated objective smartphone data on behavioral activities (for example, the number of text messages and phone calls per day) and electronically self-monitored data (mood) on illness activity would increase the accuracy of the marker of affective states. Using smartphones, voice features, automatically generated objective smartphone data on behavioral activities, and electronically self-monitored data were collected from 28 outpatients with bipolar disorder in naturalistic settings on a daily basis during a period of 12 weeks. Depressive and manic symptoms were assessed using the 17-item Hamilton Depression Rating Scale and the Young Mania Rating Scale, respectively, by a researcher blinded to the smartphone data. Data were analyzed using random forest algorithms. Affective states were classified using voice features extracted during everyday phone calls. Voice features were found to be more accurate, sensitive, and specific in the classification of manic or mixed states, with an area under the curve (AUC) of 0.89, compared with an AUC of 0.78 for the classification of depressive states. Combining voice features with automatically generated objective smartphone data on behavioral activities and electronically self-monitored data slightly increased the accuracy, sensitivity, and specificity of the classification of affective states. Voice features collected in naturalistic settings using smartphones may be used as objective state markers in patients with bipolar disorder.
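A minimal sketch of this analysis pattern in Python, assuming a synthetic per-call feature matrix: a random forest is cross-validated and summarized with an AUC, analogous to the reported classification of affective states.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 20))          # e.g. voice features per phone call (placeholder)
y = rng.integers(0, 2, size=300)        # 1 = manic/mixed state, 0 = euthymic (hypothetical labels)

# Out-of-fold probabilities from 5-fold cross-validation, then AUC.
proba = cross_val_predict(RandomForestClassifier(n_estimators=200, random_state=0),
                          X, y, cv=5, method="predict_proba")[:, 1]
print("AUC:", round(roc_auc_score(y, proba), 2))
```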
Affiliation(s)
- M Faurholt-Jepsen
- Psychiatric Center Copenhagen, Rigshospitalet, Blegdamsvej 9, DK-2100 Copenhagen, Denmark
- J Busk
- DTU Compute, Technical University of Denmark (DTU), Lyngby, Denmark
- M Frost
- The Pervasive Interaction Laboratory, IT University of Copenhagen, Copenhagen, Denmark
- M Vinberg
- Psychiatric Center Copenhagen, Rigshospitalet, Copenhagen, Denmark
- E M Christensen
- Psychiatric Center Copenhagen, Rigshospitalet, Copenhagen, Denmark
- O Winther
- DTU Compute, Technical University of Denmark (DTU), Lyngby, Denmark
- J E Bardram
- DTU Compute, Technical University of Denmark (DTU), Lyngby, Denmark
- L V Kessing
- Psychiatric Center Copenhagen, Rigshospitalet, Copenhagen, Denmark
34
Chaspari T, Soldatos C, Maragos P. The development of the Athens Emotional States Inventory (AESI): collection, validation and automatic processing of emotionally loaded sentences. World J Biol Psychiatry 2016; 16:312-22. [PMID: 25797829 DOI: 10.3109/15622975.2015.1012228] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
Abstract
OBJECTIVES To develop ecologically valid procedures for collecting reliable and unbiased emotional data for computer interfaces with social and affective intelligence targeting patients with mental disorders. METHODS The Athens Emotional States Inventory (AESI), presented here, covers the design, recording, and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness, and neutral. The items of the AESI consist of sentences whose content is indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants using a questionnaire following a Latin square design. The emotional sentences correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of the AESI was performed through automatic emotion recognition experiments from speech. RESULTS The resulting database contains 696 utterances in Greek recorded by 20 native speakers, with a total duration of approximately 28 min. Speech classification yields accuracy of up to 75.15% for automatically recognizing the emotions in the AESI. CONCLUSIONS These results indicate the usefulness of our approach for collecting emotional data with reliable content, balanced across classes, and with reduced environmental variability.
Affiliation(s)
- Theodora Chaspari
- University of Southern California, Ming Hsieh Department of Electrical Engineering, Los Angeles, CA, USA
35
Guidi A, Salvi S, Ottaviano M, Gentili C, Bertschy G, de Rossi D, Scilingo EP, Vanello N. Smartphone Application for the Analysis of Prosodic Features in Running Speech with a Focus on Bipolar Disorders: System Performance Evaluation and Case Study. SENSORS 2015; 15:28070-87. [PMID: 26561811 PMCID: PMC4701269 DOI: 10.3390/s151128070] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/31/2015] [Revised: 09/26/2015] [Accepted: 10/26/2015] [Indexed: 11/16/2022]
Abstract
Bipolar disorder is one of the most common mood disorders, characterized by large and invalidating mood swings. Several projects focus on the development of decision support systems that monitor and advise patients as well as clinicians. Voice monitoring and speech signal analysis can be exploited to reach this goal. In this study, an Android application was designed for analyzing running speech using a smartphone device. The application can record audio samples and estimate the speech fundamental frequency, F0, and its changes. F0-related features are estimated locally on the smartphone, which offers advantages over remote processing approaches in terms of privacy protection and reduced upload costs. The raw features can be sent to a central server and further processed. The quality of the audio recordings, the reliability of the algorithm, and the performance of the overall system were evaluated in terms of voiced segment detection and feature estimation. The results demonstrate that the mean F0 of each voiced segment can be reliably estimated, describing prosodic features across the speech sample. In contrast, features related to F0 variability within each voiced segment performed poorly. A case study performed on a bipolar patient is presented.
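A hedged sketch of the per-segment F0 summary described above, using librosa's pYIN tracker on a synthetic tone rather than smartphone audio; the grouping of consecutive voiced frames is a simplification of the application's voiced-segment detection.

```python
import numpy as np
import librosa

sr = 16000
t = np.arange(2 * sr) / sr
y = np.sin(2 * np.pi * 150 * t)                          # stand-in for recorded speech

f0, voiced_flag, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)

# Mean F0 within each run of consecutive voiced frames.
segment_means, current = [], []
for f, v in zip(f0, voiced_flag):
    if v:
        current.append(f)
    elif current:
        segment_means.append(np.mean(current))
        current = []
if current:
    segment_means.append(np.mean(current))
print("per-segment mean F0 (Hz):", np.round(segment_means, 1))
```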
Affiliation(s)
- Andrea Guidi
- Dipartimento di Ingegneria dell'Informazione, University of Pisa, Via G. Caruso 16, Pisa 56122, Italy; Research Center "E. Piaggio", University of Pisa, Largo L. Lazzarino 1, Pisa 56122, Italy
- Sergio Salvi
- Life Supporting Technologies, Universidad Politécnica de Madrid, Avd. Complutense 30, Madrid 28040, Spain
- Manuel Ottaviano
- Life Supporting Technologies, Universidad Politécnica de Madrid, Avd. Complutense 30, Madrid 28040, Spain
- Claudio Gentili
- Department of Surgical, Medical, Molecular Pathology and Critical Care, University of Pisa, Via Savi 10, Pisa 56126, Italy; Department of General Psychology, University of Padua, Via Venezia 8, Padua 35131, Italy
- Gilles Bertschy
- Department of Psychiatry and Mental Health, Strasbourg University Hospital, INSERM U1114, Translational Medicine Federation, University of Strasbourg, Strasbourg 67000, France
- Danilo de Rossi
- Dipartimento di Ingegneria dell'Informazione, University of Pisa, Via G. Caruso 16, Pisa 56122, Italy; Research Center "E. Piaggio", University of Pisa, Largo L. Lazzarino 1, Pisa 56122, Italy
- Enzo Pasquale Scilingo
- Dipartimento di Ingegneria dell'Informazione, University of Pisa, Via G. Caruso 16, Pisa 56122, Italy; Research Center "E. Piaggio", University of Pisa, Largo L. Lazzarino 1, Pisa 56122, Italy
- Nicola Vanello
- Dipartimento di Ingegneria dell'Informazione, University of Pisa, Via G. Caruso 16, Pisa 56122, Italy; Research Center "E. Piaggio", University of Pisa, Largo L. Lazzarino 1, Pisa 56122, Italy
36
Solomon C, Valstar MF, Morriss RK, Crowe J. Objective Methods for Reliable Detection of Concealed Depression. Front ICT 2015. [DOI: 10.3389/fict.2015.00005] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022]
37
Muthusamy H, Polat K, Yaacob S. Particle swarm optimization based feature enhancement and feature selection for improved emotion recognition in speech and glottal signals. PLoS One 2015; 10:e0120344. [PMID: 25799141 PMCID: PMC4370637 DOI: 10.1371/journal.pone.0120344] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2014] [Accepted: 01/20/2015] [Indexed: 11/20/2022] Open
Abstract
In recent years, many studies have used speech-related features for speech emotion recognition; however, recent work shows that there is a strong correlation between emotional states and glottal features. In this work, mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features, and relative wavelet packet energy and entropy features were extracted from emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discriminating ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were used to evaluate the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared with previously published work.
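Since the classifier here is an extreme learning machine, a minimal numpy sketch of that model is given below (random hidden layer, output weights solved by least squares); the PSO-based clustering and wrapper selection steps are not reproduced, and the data are synthetic.

```python
import numpy as np

class TinyELM:
    """Bare-bones extreme learning machine for multi-class classification."""
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden, self.rng = n_hidden, np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        T = np.eye(n_classes)[y]                       # one-hot targets
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = np.tanh(X @ self.W + self.b)               # random hidden-layer activations
        self.beta = np.linalg.pinv(H) @ T              # output weights by least squares
        return self

    def predict(self, X):
        return (np.tanh(X @ self.W + self.b) @ self.beta).argmax(axis=1)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 30))                         # e.g. speech + glottal features (placeholder)
y = rng.integers(0, 4, size=200)                       # four emotion classes (hypothetical)
print("training accuracy:", (TinyELM().fit(X, y).predict(X) == y).mean())
```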
Affiliation(s)
- Hariharan Muthusamy
- School of Mechatronic Engineering, Universiti Malaysia Perlis (UniMAP), Campus Pauh Putra, 02600 Arau, Perlis, Malaysia
- Kemal Polat
- Department of Electrical and Electronics Engineering, Faculty of Engineering and Architecture, Abant Izzet Baysal University, 14280 Bolu, Turkey
- Sazali Yaacob
- Universiti Kuala Lumpur Malaysian Spanish Institute, Kulim Hi-Tech Park, 09000 Kulim, Kedah, Malaysia
38
Guidi A, Vanello N, Bertschy G, Gentili C, Landini L, Scilingo E. Automatic analysis of speech F0 contour for the characterization of mood changes in bipolar patients. Biomed Signal Process Control 2015. [DOI: 10.1016/j.bspc.2014.10.011] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2022]
39
Prediction of major depression in adolescents using an optimized multi-channel weighted speech classification system. Biomed Signal Process Control 2014. [DOI: 10.1016/j.bspc.2014.08.006] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
40
Drugman T, Alku P, Alwan A, Yegnanarayana B. Glottal source processing: From analysis to applications. COMPUT SPEECH LANG 2014. [DOI: 10.1016/j.csl.2014.03.003] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
41
Asgari M, Shafran I, Sheeber LB. INFERRING CLINICAL DEPRESSION FROM SPEECH AND SPOKEN UTTERANCES. IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING : [PROCEEDINGS]. IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING 2014; 2014:10.1109/mlsp.2014.6958856. [PMID: 33288990 PMCID: PMC7719299 DOI: 10.1109/mlsp.2014.6958856] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
In this paper, we investigate the problem of detecting depression from recordings of subjects' speech using speech processing and machine learning. There has been considerable interest in this problem in recent years because of the potential for developing objective assessments from real-world behaviors, which may provide valuable supplementary clinical information or be useful in screening. The cues for depression may be present in "what is said" (content) and "how it is said" (prosody). Given the limited amount of text data, even in this relatively large study, it is difficult to employ the standard method of learning models from n-gram features. Instead, we learn models using word representations in an alternative feature space of valence and arousal. This is akin to embedding words into a real vector space, albeit with manual ratings instead of representations learned with deep neural networks [1]. For extracting prosody, we employ standard feature extractors such as those implemented in openSMILE and compare them with features extracted from harmonic models that we have been developing in recent years. Our experiments show that the harmonic model features detect depression from spoken utterances better than the alternatives. The content features provide additional improvements, achieving an accuracy of about 74%, sufficient to be useful in screening applications.
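A small sketch of the content-side idea, assuming a made-up valence/arousal lexicon: transcript words are mapped to manually rated coordinates and averaged into utterance-level features.

```python
import numpy as np

lexicon = {                      # word -> (valence, arousal); ratings are entirely made up
    "happy": (0.9, 0.6), "tired": (-0.4, -0.5),
    "sad": (-0.8, -0.3), "angry": (-0.7, 0.8),
}

def affect_features(transcript):
    """Mean valence and arousal over the words covered by the lexicon."""
    coords = [lexicon[w] for w in transcript.lower().split() if w in lexicon]
    if not coords:
        return np.zeros(2)
    return np.mean(coords, axis=0)

print(affect_features("I feel sad and tired today"))   # -> [-0.6 -0.4]
```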
Affiliation(s)
- Meysam Asgari
- Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon
- Izhak Shafran
- Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon
42
Karam ZN, Provost EM, Singh S, Montgomery J, Archer C, Harrington G, Mcinnis MG. ECOLOGICALLY VALID LONG-TERM MOOD MONITORING OF INDIVIDUALS WITH BIPOLAR DISORDER USING SPEECH. PROCEEDINGS OF THE ... IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING. ICASSP (CONFERENCE) 2014; 2014:4858-4862. [PMID: 27630535 PMCID: PMC5019119 DOI: 10.1109/icassp.2014.6854525] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Speech patterns are modulated by the emotional and neurophysiological state of the speaker. There exists a growing body of work that computationally examines this modulation in patients suffering from depression, autism, and post-traumatic stress disorder. However, the majority of the work in this area focuses on the analysis of structured speech collected in controlled environments. Here we expand on the existing literature by examining bipolar disorder (BP). BP is characterized by mood transitions, varying from a healthy euthymic state to states characterized by mania or depression. The speech patterns associated with these mood states provide a unique opportunity to study the modulations characteristic of mood variation. We describe methodology to collect unstructured speech continuously and unobtrusively via the recording of day-to-day cellular phone conversations. Our pilot investigation suggests that manic and depressive mood states can be recognized from this speech data, providing new insight into the feasibility of unobtrusive, unstructured, and continuous speech-based wellness monitoring for individuals with BP.
Affiliation(s)
- Zahi N Karam
- Department of Computer Science and Engineering, University of Michigan
- Satinder Singh
- Department of Computer Science and Engineering, University of Michigan
43
Lech M, He L. Stress and Emotion Recognition Using Acoustic Speech Analysis. MENTAL HEALTH INFORMATICS 2014. [DOI: 10.1007/978-3-642-38550-6_9] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
44
Vanello N, Guidi A, Gentili C, Werner S, Bertschy G, Valenza G, Lanata A, Scilingo EP. Speech analysis for mood state characterization in bipolar patients. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2013; 2012:2104-7. [PMID: 23366336 DOI: 10.1109/embc.2012.6346375] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
Bipolar disorders are characterized by unpredictable behavior, resulting in depressive, hypomanic, or manic episodes alternating with euthymic states. A multi-parametric approach can be followed to estimate mood states by integrating information from different physiological signals and from the analysis of voice. In this work we propose an algorithm to estimate speech features from running speech with the aim of characterizing the mood state of bipolar patients. The algorithm is based on an automatic segmentation of the speech signal to detect voiced segments, and on a spectral matching approach to estimate pitch and pitch changes. In particular, the average pitch, jitter, and pitch standard deviation within each voiced segment are estimated. The performance of the algorithm is evaluated on a speech database that includes an electroglottographic signal. A preliminary analysis of subjects affected by bipolar disorders is performed and the results are discussed.
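The per-segment descriptors named above reduce to simple statistics once pitch periods are available; the sketch below computes mean F0, F0 standard deviation, and a local jitter measure from fabricated period estimates for one voiced segment.

```python
import numpy as np

periods = np.array([7.1, 7.3, 7.0, 7.2, 7.4, 7.1]) * 1e-3   # pitch periods in seconds (fabricated)
f0 = 1.0 / periods

mean_f0 = f0.mean()
std_f0 = f0.std()
# Local jitter: mean absolute difference of consecutive periods over the mean period.
jitter = np.abs(np.diff(periods)).mean() / periods.mean()
print(f"mean F0 {mean_f0:.1f} Hz, F0 SD {std_f0:.1f} Hz, jitter {100 * jitter:.2f}%")
```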
Affiliation(s)
- Nicola Vanello
- Department of Information Engineering, University of Pisa, Pisa, Italy
45
Ooi KEB, Lech M, Allen NB. Multichannel weighted speech classification system for prediction of major depression in adolescents. IEEE Trans Biomed Eng 2012. [PMID: 23192475 DOI: 10.1109/tbme.2012.2228646] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
Early identification of adolescents at high imminent risk for clinical depression could significantly reduce the burden of the disease. This study demonstrated that acoustic speech analysis and classification can be used to determine early signs of major depression in adolescents, up to two years before they meet clinical diagnostic criteria for the full-blown disorder. Individual contributions of four different types of acoustic parameters [prosodic, glottal, Teager's energy operator (TEO), and spectral] to depression-related changes of speech characteristics were examined. A new computational methodology for the early prediction of depression in adolescents was developed and tested. The novel aspect of this methodology is in the introduction of multichannel classification with a weighted decision procedure. It was observed that single-channel classification was effective in predicting depression with a desirable specificity-to-sensitivity ratio and accuracy higher than chance level only when using glottal or prosodic features. The best prediction performance was achieved with the new multichannel method, which used four features (prosodic, glottal, TEO, and spectral). In the case of the person-based approach with two sets of weights, the new multichannel method provided a high accuracy level of 73% and the sensitivity-to-specificity ratio of 79%/67% for predicting future depression.
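A hedged sketch of multichannel classification with weighted decision fusion: one classifier per acoustic feature family whose probabilistic outputs are combined with per-channel weights. The features, labels, and weights below are synthetic placeholders, not the paper's tuned values.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
y = rng.integers(0, 2, size=120)                      # 1 = later developed depression (hypothetical)
channels = {                                          # prosodic / glottal / TEO / spectral blocks
    "prosodic": rng.normal(size=(120, 10)),
    "glottal":  rng.normal(size=(120, 8)),
    "teo":      rng.normal(size=(120, 6)),
    "spectral": rng.normal(size=(120, 20)),
}
weights = {"prosodic": 0.3, "glottal": 0.3, "teo": 0.25, "spectral": 0.15}  # illustrative only

scores = np.zeros(len(y))
for name, X in channels.items():
    clf = SVC(probability=True, random_state=0).fit(X, y)
    scores += weights[name] * clf.predict_proba(X)[:, 1]   # weighted soft vote per channel

pred = (scores >= 0.5).astype(int)
print("training accuracy of the fused decision:", (pred == y).mean())
```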
Affiliation(s)
- Kuan Ee Brian Ooi
- School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology, Melbourne, Victoria, Australia
46
Berke EM, Choudhury T, Ali S, Rabbi M. Objective measurement of sociability and activity: mobile sensing in the community. Ann Fam Med 2011; 9:344-50. [PMID: 21747106 PMCID: PMC3133582 DOI: 10.1370/afm.1266] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/09/2022] Open
Abstract
PURPOSE Automated systems able to infer detailed measures of a person's social interactions and physical activities in their natural environments could lead to better understanding of factors influencing well-being. We assessed the feasibility of a wireless mobile device in measuring sociability and physical activity in older adults, and compared results with those of traditional questionnaires. METHODS This pilot observational study was conducted among a convenience sample of 8 men and women aged 65 years or older in a continuing care retirement community. Participants wore a waist-mounted device containing sensors that continuously capture data pertaining to behavior and environment (accelerometer, microphone, barometer, and sensors for temperature, humidity, and light). The sensors measured time spent walking level, up or down an elevation, and stationary (sitting or standing), and time spent speaking with 1 or more other people. The participants also completed 4 questionnaires: the 36-Item Short Form Health Survey (SF-36), the Yale Physical Activity Survey (YPAS), the Center for Epidemiologic Studies-Depression (CES-D) scale, and the Friendship Scale. RESULTS Men spent 21.3% of their time walking and 64.4% stationary. Women spent 20.7% of their time walking and 62.0% stationary. Sensed physical activity was correlated with aggregate YPAS scores (r(2)=0.79, P=.02). Sensed time speaking was positively correlated with the mental component score of the SF-36 (r(2)=0.86, P = .03), and social interaction as assessed with the Friendship Scale (r(2)=0.97, P = .002), and showed a trend toward association with CES-D score (r(2)=-0.75, P = .08). In adjusted models, sensed time speaking was associated with SF-36 mental component score (P = .08), social interaction measured with the Friendship Scale (P = .045), and CES-D score (P=.04). CONCLUSIONS Mobile sensing of sociability and activity is well correlated with traditional measures and less prone to biases associated with questionnaires that rely on recall. Using mobile devices to collect data from and monitor older adult patients has the potential to improve detection of changes in their health.
Affiliation(s)
- Ethan M Berke
- Center for Population Health, The Dartmouth Institute for Health Policy and Clinical Practice, Lebanon, New Hampshire, USA
47
Rabbi M, Ali S, Choudhury T, Berke E. Passive and In-situ Assessment of Mental and Physical Well-being using Mobile Sensors. PROCEEDINGS OF THE ... ACM INTERNATIONAL CONFERENCE ON UBIQUITOUS COMPUTING . UBICOMP (CONFERENCE) 2011; 2011:385-394. [PMID: 25285324 DOI: 10.1145/2030112.2030164] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
Abstract
The idea of continuously monitoring well-being using mobile-sensing systems is gaining popularity. In-situ measurement of human behavior has the potential to overcome the shortcomings of the gold-standard surveys that the medical community has used for decades. However, current sensing systems have mainly focused on tracking physical health; some have approximated aspects of mental health based on proximity measurements, but these have not been compared against medically accepted screening instruments. In this paper, we show the feasibility of a multi-modal mobile sensing system that simultaneously assesses mental and physical health. By continuously capturing fine-grained motion and privacy-sensitive audio data, we are able to derive different metrics that reflect the results of surveys commonly used by the medical community to assess well-being. In addition, we present a case study that highlights how assessment errors caused by the subjective nature of survey responses could potentially be avoided by continuous sensing and inference of social interactions and physical activities.
Affiliation(s)
- Shahid Ali
- Community and Family Medicine, Dartmouth Medical School
- Ethan Berke
- Community and Family Medicine, Dartmouth Medical School
48
Low LSA, Maddage NC, Lech M, Sheeber LB, Allen NB. Detection of clinical depression in adolescents' speech during family interactions. IEEE Trans Biomed Eng 2010; 58:574-86. [PMID: 21075715 DOI: 10.1109/tbme.2010.2091640] [Citation(s) in RCA: 128] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
The properties of acoustic speech have previously been investigated as possible cues for depression in adults. However, these studies were restricted to small populations of patients and the speech recordings were made during patients' clinical interviews or fixed-text reading sessions. Symptoms of depression often first appear during adolescence at a time when the voice is changing, in both males and females, suggesting that specific studies of these phenomena in adolescent populations are warranted. This study investigated acoustic correlates of depression in a large sample of 139 adolescents (68 clinically depressed and 71 controls). Speech recordings were made during naturalistic interactions between adolescents and their parents. Prosodic, cepstral, spectral, and glottal features, as well as features derived from the Teager energy operator (TEO), were tested within a binary classification framework. Strong gender differences in classification accuracy were observed. The TEO-based features clearly outperformed all other features and feature combinations, providing classification accuracy ranging between 81%-87% for males and 72%-79% for females. Close, but slightly less accurate, results were obtained by combining glottal features with prosodic and spectral features (67%-69% for males and 70%-75% for females). These findings indicate the importance of nonlinear mechanisms associated with the glottal flow formation as cues for clinical depression.
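Since the best-performing features here derive from the Teager energy operator, its discrete form is easy to show; the sketch below applies psi[n] = x[n]^2 - x[n-1]*x[n+1] to a synthetic frame.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

t = np.arange(0, 0.02, 1 / 16000)
x = np.sin(2 * np.pi * 200 * t)            # stand-in for one speech frame
psi = teager_energy(x)
# For a pure tone the operator is approximately constant (A^2 * sin^2(omega)).
print(psi[:5])
```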
Affiliation(s)
- Lu-Shih Alex Low
- School of Electrical and Computer Engineering, Royal Melbourne Institute of Technology, Vic. 3001, Australia
49
Spoken emotion recognition through optimum-path forest classification using glottal features. COMPUT SPEECH LANG 2010. [DOI: 10.1016/j.csl.2009.02.005] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022]
50
Torres JF, Moore E, Bryant E. A study of glottal waveform features for deceptive speech classification. Proceedings of ICASSP 2008. [DOI: 10.1109/icassp.2008.4518653] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]