1
Jalaeian Zaferani E, Teshnehlab M, Khodadadian A, Heitzinger C, Vali M, Noii N, Wick T. Hyper-Parameter Optimization of Stacked Asymmetric Auto-Encoders for Automatic Personality Traits Perception. Sensors (Basel) 2022; 22:6206. [PMID: 36015967] [PMCID: PMC9413006] [DOI: 10.3390/s22166206]
Abstract
In this work, a method for automatic hyper-parameter tuning of the stacked asymmetric auto-encoder is proposed. Previous work showed that deep learning can extract personality perception from speech, but hyper-parameter tuning was done by trial and error, which is time-consuming and requires machine learning expertise. Obtaining suitable hyper-parameter values is therefore challenging and limits the use of deep learning. To address this challenge, researchers have applied optimization methods. Although there have been successes, the search space is very large due to the large number of deep learning hyper-parameters, which increases the probability of getting stuck in local optima. Researchers have therefore also focused on improving global optimization methods. In this regard, we propose a novel global optimization method based on the cultural algorithm, a multi-island structure, and the concept of parallelism to search this large space efficiently. We first evaluated our method on three well-known optimization benchmarks and compared the results with recently published papers. The results indicate that the proposed method converges faster, owing to its ability to escape local optima, and that the precision of the results improves markedly. We then applied the method to optimize five hyper-parameters of an asymmetric auto-encoder for automatic personality perception. Since inappropriate hyper-parameters lead the network to over-fit or under-fit, we used a novel cost function to prevent both. The unweighted average recall (accuracy) improved by 6.52% (9.54%) compared to our previous work, with remarkable outcomes compared to other published personality perception works.
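To make the multi-island idea concrete, below is a deliberately simplified, hypothetical sketch of an island-model search over a bounded hyper-parameter space: several sub-populations explore independently and periodically exchange their best candidates, which is what helps such methods escape local optima. It does not reproduce the paper's cultural-algorithm operators or its cost function; `cost` and `bounds` are assumed user-supplied.

```python
# Simplified, hypothetical island-model search over a bounded hyper-parameter
# space; not the paper's cultural algorithm. `cost` (lower is better) and
# `bounds` (list of (low, high) pairs) are assumed user-supplied.
import random

def random_point(bounds):
    return [random.uniform(lo, hi) for lo, hi in bounds]

def island_search(cost, bounds, n_islands=4, pop=10, gens=50, migrate_every=10):
    islands = [[random_point(bounds) for _ in range(pop)] for _ in range(n_islands)]
    for g in range(1, gens + 1):
        for island in islands:
            # Local exploration: Gaussian mutation, keep the better candidate.
            for i, x in enumerate(island):
                y = [min(max(v + random.gauss(0, 0.1 * (hi - lo)), lo), hi)
                     for v, (lo, hi) in zip(x, bounds)]
                if cost(y) < cost(x):
                    island[i] = y
        if g % migrate_every == 0:
            # Migration: each island receives its neighbour's best candidate,
            # which helps the overall search escape local optima.
            bests = [min(island, key=cost) for island in islands]
            for k, island in enumerate(islands):
                island[random.randrange(pop)] = bests[(k - 1) % n_islands]
    return min((min(island, key=cost) for island in islands), key=cost)

# Example usage on a toy 2-D objective:
# best = island_search(lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2, [(-5, 5), (-5, 5)])
```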
Affiliation(s)
- Effat Jalaeian Zaferani
- Electrical & Computer Engineering Faculty, K. N. Toosi University of Technology, Tehran 19967-15433, Iran
- Mohammad Teshnehlab
- Electrical & Computer Engineering Faculty, K. N. Toosi University of Technology, Tehran 19967-15433, Iran
- Amirreza Khodadadian
- Institute of Applied Mathematics, Leibniz University of Hannover, 30167 Hannover, Germany
- Clemens Heitzinger
- Institute of Analysis and Scientific Computing, TU Wien, 1040 Vienna, Austria
- Center for Artificial Intelligence and Machine Learning (CAIML), TU Wien, 1040 Vienna, Austria
- Mansour Vali
- Electrical & Computer Engineering Faculty, K. N. Toosi University of Technology, Tehran 19967-15433, Iran
- Nima Noii
- Institute of Continuum Mechanics, Leibniz University of Hannover, 30823 Garbsen, Germany
- Thomas Wick
- Institute of Applied Mathematics, Leibniz University of Hannover, 30167 Hannover, Germany
2
Chen F, Yang C, Khishe M. Diagnose Parkinson’s disease and cleft lip and palate using deep convolutional neural networks evolved by IP-based chimp optimization algorithm. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103688]
3
Automatic Identification of Emotional Information in Spanish TV Debates and Human–Machine Interactions. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12041902]
Abstract
Automatic emotion detection is a very attractive field of research that can help build more natural human–machine interaction systems. However, several issues arise when real scenarios are considered, such as the tendency toward neutrality, which makes it difficult to obtain balanced datasets, or the lack of standards for the annotation of emotional categories. Moreover, the intrinsic subjectivity of emotional information increases the difficulty of obtaining valuable data to train machine learning-based algorithms. In this work, two different real scenarios were tackled: human–human interactions in TV debates and human–machine interactions with a virtual agent. For comparison purposes, an analysis of the emotional information was conducted in both. Thus, a profiling of the speakers associated with each task was carried out. Furthermore, different classification experiments show that deep learning approaches can be useful for detecting speakers’ emotional information, mainly for arousal, valence, and dominance levels, reaching an F1-score of 0.7.
4
Huang Y, Zhai D, Song J, Rao X, Sun X, Tang J. Mental states and personality based on real-time physical activity and facial expression recognition. Front Psychiatry 2022; 13:1019043. [PMID: 36699483] [PMCID: PMC9868243] [DOI: 10.3389/fpsyt.2022.1019043]
Abstract
INTRODUCTION To explore a quick and non-invasive way to measure individual psychological states, this study developed interview-based scales, and multi-modal information was collected from 172 participants.
METHODS We developed the Interview Psychological Symptom Inventory (IPSI) which eventually retained 53 items with nine main factors. All of them performed well in terms of reliability and validity. We used optimized convolutional neural networks and original detection algorithms for the recognition of individual facial expressions and physical activity based on Russell's circumplex model and the five factor model.
RESULTS We found that there was a significant correlation between the developed scale and the participants' scores on each factor in the Symptom Checklist-90 (SCL-90) and Big Five Inventory (BFI-2) [r = (-0.257, 0.632), p < 0.01]. Among the multi-modal data, the arousal of facial expressions was significantly correlated with the interval of validity (p < 0.01), valence was significantly correlated with IPSI and SCL-90, and physical activity was significantly correlated with gender, age, and factors of the scales.
DISCUSSION Our research demonstrates that mental health can be monitored and assessed remotely by collecting and analyzing multimodal data from individuals captured by digital tools.
Affiliation(s)
- Yating Huang
- School of Mental Health and Psychological Sciences, Anhui Medical University, Hefei, China
- Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei, China
- Dengyue Zhai
- Department of Neurology, The Third Affiliated Hospital of Anhui Medical University, Hefei, China
- Jingze Song
- Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei, China
- ZhongJuYuan Intelligent Technology Co., Ltd., Hefei, China
- Xuanheng Rao
- Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei, China
- Xiao Sun
- School of Mental Health and Psychological Sciences, Anhui Medical University, Hefei, China
- Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei, China
- Jin Tang
- School of Mental Health and Psychological Sciences, Anhui Medical University, Hefei, China
- Hefei Comprehensive National Science Center, Institute of Artificial Intelligence, Hefei, China
5
Pommée T, Balaguer M, Mauclair J, Pinquier J, Woisard V. Intelligibility and comprehensibility: A Delphi consensus study. International Journal of Language & Communication Disorders 2022; 57:21-41. [PMID: 34558145] [DOI: 10.1111/1460-6984.12672]
Abstract
BACKGROUND Intelligibility and comprehensibility in speech disorders can be assessed both perceptually and instrumentally, but a lack of consensus exists regarding the terminology and related speech measures in both the clinical and scientific fields.
AIMS To draw up a more consensual definition of intelligibility and comprehensibility and to define which assessment methods relate to both concepts, as part of their definition.
METHODS & PROCEDURES A three-round modified Delphi consensus study was carried out among clinicians, researchers and lecturers engaged in activities in speech disorders.
OUTCOMES & RESULTS Forty international experts from different fields (mainly clinicians, linguists and computer scientists) participated in the elaboration of a comprehensive definition of intelligibility and comprehensibility and their assessment. While both concepts are linked and contribute to functional human communication, they relate to two different reconstruction levels of the transmitted speech material. Intelligibility refers to the acoustic-phonetic decoding of the utterance, while comprehensibility relates to the reconstruction of the meaning of the message. Consequently, the perceptual assessment of intelligibility requires the use of unpredictable speech material (pseudo-words, minimal word pairs, unpredictable sentences), whereas comprehensibility assessment is meaning and context related and entails more functional speech stimuli and tasks.
CONCLUSION & IMPLICATIONS This consensus study provides the scientific and clinical communities with a better understanding of intelligibility and comprehensibility. A comprehensive definition was drafted, including specifications regarding the tasks that best fit their assessment. The outcome has implications for both clinical practice and scientific research, as the disambiguation improves communication between professionals and thereby increases the efficiency of patient assessment and care and benefits the progress of research as well as research translation.
WHAT THIS PAPER ADDS
What is already known on the subject: Intelligibility and comprehensibility in speech disorders can be assessed both perceptually and instrumentally, but a lack of consensus exists regarding the terminology and related speech measures in both the clinical and scientific fields.
What this paper adds to existing knowledge: This consensus study allowed for a more consensual and comprehensive definition of intelligibility and comprehensibility and their assessment, for clinicians and researchers. The terminological disambiguation helps to improve communication between experts in the field of speech disorders and thereby benefits the progress of research as well as research translation.
What are the potential or actual clinical implications of this work? Unambiguous communication between professionals, for example, in a multidisciplinary team, allows for the improvement in the efficiency of patient care. Furthermore, this study allowed the assessment tasks that best fit the definition of both intelligibility and comprehensibility to be specified, thereby providing valuable information to improve speech disorder assessment and its standardization.
Affiliation(s)
- Timothy Pommée
- IRIT, CNRS, Paul Sabatier University Toulouse III, Toulouse, France
- Mathieu Balaguer
- IRIT, CNRS, Paul Sabatier University Toulouse III, Toulouse, France
- ENT Department, University Hospital of Toulouse Larrey, Toulouse, France
- Julie Mauclair
- IRIT, CNRS, Paul Sabatier University Toulouse III, Toulouse, France
- Julien Pinquier
- IRIT, CNRS, Paul Sabatier University Toulouse III, Toulouse, France
- Virginie Woisard
- ENT Department, University Hospital of Toulouse Larrey, Toulouse, France
- Oncorehabilitation Unit, University Cancer Institute of Toulouse Oncopole, Toulouse, France
- Laboratoire Octogone Lordat, Jean Jaurès University Toulouse II, Toulouse, France
6
Fernández Carbonell M, Boman M, Laukka P. Comparing supervised and unsupervised approaches to multimodal emotion recognition. PeerJ Comput Sci 2021; 7:e804. [PMID: 35036530] [PMCID: PMC8725659] [DOI: 10.7717/peerj-cs.804]
Abstract
We investigated emotion classification from brief video recordings from the GEMEP database wherein actors portrayed 18 emotions. Vocal features consisted of acoustic parameters related to frequency, intensity, spectral distribution, and durations. Facial features consisted of facial action units. We first performed a series of person-independent supervised classification experiments. Best performance (AUC = 0.88) was obtained by merging the output from the best unimodal vocal (Elastic Net, AUC = 0.82) and facial (Random Forest, AUC = 0.80) classifiers using a late fusion approach and the product rule method. All 18 emotions were recognized with above-chance recall, although recognition rates varied widely across emotions (e.g., high for amusement, anger, and disgust; and low for shame). Multimodal feature patterns for each emotion are described in terms of the vocal and facial features that contributed most to classifier performance. Next, a series of exploratory unsupervised classification experiments were performed to gain more insight into how emotion expressions are organized. Solutions from traditional clustering techniques were interpreted using decision trees in order to explore which features underlie clustering. Another approach utilized various dimensionality reduction techniques paired with inspection of data visualizations. Unsupervised methods did not cluster stimuli in terms of emotion categories, but several explanatory patterns were observed. Some could be interpreted in terms of valence and arousal, but actor and gender specific aspects also contributed to clustering. Identifying explanatory patterns holds great potential as a meta-heuristic when unsupervised methods are used in complex classification tasks.
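As a minimal sketch of the late-fusion step described above, the snippet below merges the per-class probabilities of two unimodal classifiers with the product rule and renormalizes them; the input arrays stand in for the outputs of the vocal and facial models, which are not reproduced here, so names and shapes are assumptions rather than the authors' implementation.

```python
# Minimal sketch of late fusion with the product rule: per-class probabilities
# from two unimodal classifiers are multiplied element-wise and renormalized.
# Inputs are assumed to have shape (n_samples, n_classes).
import numpy as np

def product_rule_fusion(proba_vocal: np.ndarray, proba_facial: np.ndarray) -> np.ndarray:
    fused = proba_vocal * proba_facial
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize per sample

# Hypothetical usage with two pre-trained classifiers exposing predict_proba:
# predicted = product_rule_fusion(vocal_clf.predict_proba(X_vocal),
#                                 facial_clf.predict_proba(X_facial)).argmax(axis=1)
```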
Affiliation(s)
- Marcos Fernández Carbonell
- Department of Software and Computer Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
- Magnus Boman
- Department of Software and Computer Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden
- Department of Learning, Informatics, Management and Ethics (LIME), Karolinska Institutet, Stockholm, Sweden
- Petri Laukka
- Department of Psychology, Stockholm University, Stockholm, Sweden
7
Qian K, Schmitt M, Zheng H, Koike T, Han J, Liu J, Ji W, Duan J, Song M, Yang Z, Ren Z, Liu S, Zhang Z, Yamamoto Y, Schuller BW. Computer Audition for Fighting the SARS-CoV-2 Corona Crisis-Introducing the Multitask Speech Corpus for COVID-19. IEEE Internet of Things Journal 2021; 8:16035-16046. [PMID: 35782182] [PMCID: PMC8768988] [DOI: 10.1109/jiot.2021.3067605]
Abstract
Computer audition (CA) has developed rapidly in the past decades by leveraging advanced signal processing and machine learning techniques. In particular, owing to its noninvasive and ubiquitous nature, CA-based applications in healthcare have increasingly attracted attention in recent years. During the tough time of the global crisis caused by the coronavirus disease 2019 (COVID-19), scientists and engineers in data science have collaborated on novel ways of supporting the prevention, diagnosis, treatment, tracking, and management of this global pandemic. On the one hand, we have witnessed the power of 5G, the Internet of Things, big data, computer vision, and artificial intelligence in applications such as epidemiology modeling, drug and/or vaccine discovery and design, fast CT screening, and quarantine management. On the other hand, studies exploring the capacity of CA are extremely scarce and its potential underestimated. To this end, we propose a novel multitask speech corpus for COVID-19 research. We collected in-the-wild speech data from 51 confirmed COVID-19 patients in Wuhan, China. We define three main tasks in this corpus, namely three-category classification tasks for evaluating the physical and/or mental status of patients: sleep quality, fatigue, and anxiety. Benchmarks are given using both classic machine learning methods and state-of-the-art deep learning techniques. We believe this study and corpus can facilitate not only the ongoing research on using data science to fight against COVID-19, but also the monitoring of contagious diseases in general.
Collapse
Affiliation(s)
- Kun Qian
- Educational Physiology Laboratory, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
- Maximilian Schmitt
- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany
- Huaiyuan Zheng
- Department of Hand Surgery, Wuhan Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China
- Tomoya Koike
- Educational Physiology Laboratory, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
- Jing Han
- Mobile Systems Group, University of Cambridge, Cambridge CB2 1TN, U.K.
- Juan Liu
- Department of Plastic Surgery, Central Hospital of Wuhan, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China
- Wei Ji
- Department of Plastic Surgery, Wuhan Third Hospital and Tongren Hospital of Wuhan University, Wuhan 430072, China
- Junjun Duan
- Department of Plastic Surgery, Central Hospital of Wuhan, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China
- Meishu Song
- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany
- Zijiang Yang
- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany
- Zhao Ren
- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany
- Shuo Liu
- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany
- Zixing Zhang
- GLAM - the Group on Language, Audio, and Music, Imperial College London, London SW7 2BU, U.K.
- Yoshiharu Yamamoto
- Educational Physiology Laboratory, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
- Björn W. Schuller
- Chair of Embedded Intelligence for Health Care and Wellbeing, University of Augsburg, 86159 Augsburg, Germany
- GLAM - the Group on Language, Audio, and Music, Imperial College London, London SW7 2BU, U.K.
8
Roldan-Vasco S, Orozco-Duque A, Suarez-Escudero JC, Orozco-Arroyave JR. Machine learning based analysis of speech dimensions in functional oropharyngeal dysphagia. Computer Methods and Programs in Biomedicine 2021; 208:106248. [PMID: 34260973] [DOI: 10.1016/j.cmpb.2021.106248]
Abstract
BACKGROUND AND OBJECTIVE The normal swallowing process requires a complex coordination of anatomical structures driven by sensory and cranial nerves. Alterations in this coordination cause swallowing malfunctions, namely dysphagia. Dysphagia screening methods are quite subjective and experience-dependent. Bearing in mind that the swallowing process and speech production share some anatomical structures and mechanisms of neurological control, this work aims to evaluate the suitability of automatic speech processing and machine learning techniques for the screening of functional dysphagia.
METHODS Speech recordings were collected from 46 patients with functional oropharyngeal dysphagia of neurological origin and 46 healthy controls. The dimensions of speech, including phonation, articulation, and prosody, were considered through different speech tasks. Specific features per dimension were extracted and analyzed using statistical tests. Machine learning models were applied per dimension via nested cross-validation. Hyperparameters were selected using the AUC-ROC as the optimization criterion.
RESULTS The Random Forest on the articulation-related speech tasks yielded the highest performance measures (AUC=0.86±0.10, sensitivity=0.91±0.12) for the individual analysis of dimensions. In addition, combining the speech dimensions with a voting ensemble improved the results, which suggests that the feature sets extracted from the different speech dimensions contribute complementary information under dysphagia conditions.
CONCLUSIONS The proposed approach based on speech-related models is suitable for the automatic discrimination between dysphagic and healthy individuals. These findings seem to have potential use in the screening of functional oropharyngeal dysphagia in a non-invasive and inexpensive way.
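A minimal sketch of the nested cross-validation scheme described above, assuming a precomputed feature matrix `X` and binary labels `y` (dysphagia vs. control); the Random Forest grid is illustrative rather than the authors' configuration. The inner loop tunes hyperparameters with AUC-ROC as the criterion, and the outer loop estimates the tuned model's performance.

```python
# Minimal sketch of nested cross-validation with ROC-AUC as the selection
# criterion; `X` and `y` are assumed precomputed, and the grid is illustrative.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def nested_cv_auc(X, y):
    inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # tuning folds
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluation folds
    grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
    search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                          scoring="roc_auc", cv=inner)
    # The outer loop yields an unbiased estimate of the tuned model's AUC.
    return cross_val_score(search, X, y, scoring="roc_auc", cv=outer)
```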
Affiliation(s)
- Sebastian Roldan-Vasco
- Faculty of Engineering, Instituto Tecnológico Metropolitano, Medellín, Colombia
- Faculty of Engineering, Universidad de Antioquia, Medellín, Colombia
- Andres Orozco-Duque
- Faculty of Pure and Applied Sciences, Instituto Tecnológico Metropolitano, Medellín, Colombia
- Juan Camilo Suarez-Escudero
- School of Health Sciences, Faculty of Medicine, Universidad Pontificia Bolivariana, Medellín, Colombia
- Faculty of Pure and Applied Sciences, Instituto Tecnológico Metropolitano, Medellín, Colombia
- Juan Rafael Orozco-Arroyave
- Faculty of Engineering, Universidad de Antioquia, Medellín, Colombia
- Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
9
Abstract
Speech emotion recognition is a challenging and widely examined research topic in the field of speech processing. Existing models achieve limited accuracy on speech emotion recognition tasks and generalize poorly. Since the feature set and the model design directly affect recognition accuracy, research on both features and models is important. Because emotional expression is often correlated with global features, local features, and the model design, it is difficult to find a universal solution for effective speech emotion recognition. The main purpose of this paper is therefore to generate general emotion features from speech signals from different angles and to use an ensemble learning model to perform the emotion recognition task. The work is divided into the following aspects: (1) Three expert roles for speech emotion recognition are designed. Expert 1 focuses on three-dimensional feature extraction of local signals; expert 2 focuses on the extraction of comprehensive information in local data; and expert 3 emphasizes global features: acoustic feature descriptors (low-level descriptors (LLDs)), high-level statistics functionals (HSFs), and local features and their timing relationships. A single-/multiple-level deep learning model that matches each expert's characteristics is designed, including a convolutional neural network (CNN), bi-directional long short-term memory (BLSTM), and gated recurrent unit (GRU); a convolutional recurrent neural network (CRNN) combined with an attention mechanism is used for the internal training of the experts. (2) An ensemble learning model is designed so that each expert can play to its own advantages and evaluate speech emotions from a different focus. (3) Through experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, the performance of the individual experts and the ensemble learning model in emotion recognition is compared and the validity of the proposed model is verified.
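As one simple instance of the ensemble idea sketched above, the snippet below fuses the class-probability outputs of several trained "experts" by weighted averaging; the expert networks themselves (CNN, BLSTM, GRU/CRNN with attention) are assumed to be trained elsewhere, and the weighting is an assumption rather than the paper's exact combination rule.

```python
# Minimal sketch of fusing class-probability outputs of several trained experts
# by weighted averaging; expert models and weights are assumed, not the paper's
# exact combination rule.
import numpy as np

def ensemble_predict(expert_probs, weights=None):
    """expert_probs: list of (n_samples, n_classes) arrays, one per expert."""
    stacked = np.stack(expert_probs)                      # (n_experts, n, c)
    w = np.ones(len(expert_probs)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    fused = np.tensordot(w, stacked, axes=1)              # weighted average over experts
    return fused.argmax(axis=1)                           # predicted emotion labels
```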
10
Discriminating Emotions in the Valence Dimension from Speech Using Timbre Features. Applied Sciences (Basel) 2019. [DOI: 10.3390/app9122470]
Abstract
The most widely used and well-known acoustic features of a speech signal, the Mel frequency cepstral coefficients (MFCC), cannot sufficiently characterize emotions in speech when classifying both discrete emotions (i.e., anger, happiness, sadness, and neutral) and emotions in the valence dimension (positive and negative). The main reason is that some discrete emotions, such as anger and happiness, share similar acoustic features in the arousal dimension (high and low) but differ in the valence dimension. Timbre is a sound quality that can discriminate between two sounds even when they have the same pitch and loudness. In this paper, we analyzed timbre acoustic features to improve the classification performance for discrete emotions as well as emotions in the valence dimension. Sequential forward selection (SFS) was used to find the most relevant features among the timbre acoustic features. The experiments were carried out on the Berlin Emotional Speech Database and the Interactive Emotional Dyadic Motion Capture Database. A support vector machine (SVM) and a long short-term memory recurrent neural network (LSTM-RNN) were used to classify emotions. Significant classification performance improvements were achieved using a combination of the baseline and the most relevant timbre acoustic features, which were found by applying SFS to the classification of emotions on the Berlin Emotional Speech Database. From extensive experiments, it was found that timbre acoustic features can sufficiently characterize emotions in speech in the valence dimension.
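A minimal sketch of the feature-selection-plus-classifier setup described above, using scikit-learn's sequential forward selection wrapped around an SVM; the feature matrix `X`, labels `y`, and the target number of features are assumptions, not the authors' exact configuration.

```python
# Minimal sketch: sequential forward selection (SFS) of acoustic features
# followed by an SVM classifier; feature count and data are assumed.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

sfs = SequentialFeatureSelector(SVC(kernel="rbf"), n_features_to_select=10,
                                direction="forward", cv=5)
model = make_pipeline(StandardScaler(), sfs, SVC(kernel="rbf"))
# Hypothetical usage with precomputed splits:
# model.fit(X_train, y_train); model.score(X_test, y_test)
```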