1
Akinpelu S, Viriri S, Adegun A. An enhanced speech emotion recognition using vision transformer. Sci Rep 2024; 14:13126. [PMID: 38849422] [PMCID: PMC11161461] [DOI: 10.1038/s41598-024-63776-4]
Abstract
In human-computer interaction systems, speech emotion recognition (SER) plays a crucial role because it enables computers to understand and react to users' emotions. In the past, SER placed significant emphasis on acoustic properties extracted from speech signals. Recent developments in deep learning and computer vision, however, have made it possible to use visual representations to enhance SER performance. This work proposes a novel method for improving speech emotion recognition based on a lightweight Vision Transformer (ViT) model. We leverage the ViT model's capability to capture spatial dependencies and high-level features, which are adequate indicators of emotional state, from mel spectrograms fed into the model as images. To determine the efficiency of the proposed approach, we conduct comprehensive experiments on two benchmark speech emotion datasets, the Toronto English Speech Set (TESS) and the Berlin Emotional Database (EMODB). The results demonstrate a considerable improvement in speech emotion recognition accuracy and attest to the method's generalizability: it achieves 98% on TESS, 91% on EMODB, and 93% on the combined TESS-EMODB set. The comparative experiments show that the non-overlapping patch-based feature extraction method substantially improves speech emotion recognition performance. Our research indicates the potential of integrating vision transformer models into SER systems, with advantages over other state-of-the-art techniques, opening up fresh opportunities for real-world applications that require accurate emotion recognition from speech.
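As a rough sketch of the pipeline this abstract describes (a mel spectrogram rendered as an image and encoded through non-overlapping ViT patches), the following Python example uses librosa and PyTorch. The patch size, embedding width, depth, seven-class head, and the file name `utterance.wav` are illustrative assumptions, not the authors' configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def mel_spectrogram_image(path, n_mels=128, size=224):
    """Render an utterance as a normalized log-mel 'image' for the ViT."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    img = torch.tensor(librosa.power_to_db(mel, ref=np.max))[None, None].float()
    img = F.interpolate(img, size=(size, size), mode="bilinear")  # (1, 1, 224, 224)
    return (img - img.min()) / (img.max() - img.min() + 1e-8)    # scale to [0, 1]

class TinyViT(nn.Module):
    def __init__(self, n_classes=7, patch=16, dim=192, depth=4, heads=4):
        super().__init__()
        # A strided convolution implements non-overlapping patch embedding.
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, (224 // patch) ** 2, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                                   # x: (B, 1, 224, 224)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, n_patches, dim)
        t = self.encoder(t + self.pos)                      # self-attention over patches
        return self.head(t.mean(dim=1))                     # mean-pool, then classify

logits = TinyViT()(mel_spectrogram_image("utterance.wav"))
```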
Affiliation(s)
- Samson Akinpelu
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa
- Serestina Viriri
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa
- Adekanmi Adegun
- School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Durban, 4001, South Africa
2
Ke X, Mak MW, Meng HM. Automatic selection of spoken language biomarkers for dementia detection. Neural Netw 2024; 169:191-204. [PMID: 37898051] [DOI: 10.1016/j.neunet.2023.10.018]
Abstract
This paper analyzes diverse features extracted from spoken language to select the most discriminative ones for dementia detection. We present a two-step feature selection (FS) approach: Step 1 utilizes filter methods to pre-screen features, and Step 2 uses a novel feature ranking (FR) method, referred to as dual dropout ranking (DDR), to rank the screened features and select spoken language biomarkers. The proposed DDR is based on a dual-net architecture that separates FS and dementia detection into two neural networks (namely, the operator and the selector). The operator is trained on features obtained from the selector to reduce classification or regression loss. The selector is optimized to predict the operator's performance based on automatic regularization. Results show that the approach significantly reduces feature dimensionality while identifying small feature subsets that achieve comparable or superior performance relative to the full, default feature set. The Python code is available at https://github.com/kexquan/dual-dropout-ranking.
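The authors' actual DDR code is in the repository above; as a much-simplified, hedged sketch of the underlying idea (a "selector" holds learnable per-feature dropout keep-probabilities while an "operator" network trains on the masked features, and features are ranked by their learned keep-probability), one might write:

```python
import torch
import torch.nn as nn

class DropoutGate(nn.Module):
    """Learnable per-feature keep-probabilities (a stand-in for the selector net)."""
    def __init__(self, n_features):
        super().__init__()
        self.logit_p = nn.Parameter(torch.zeros(n_features))

    def forward(self, x, temp=0.1):
        p = torch.sigmoid(self.logit_p)
        if self.training:  # relaxed-Bernoulli ("concrete") dropout mask
            u = torch.rand_like(x)
            noise = torch.log(u) - torch.log(1 - u)
            mask = torch.sigmoid((noise + torch.log(p) - torch.log(1 - p)) / temp)
        else:
            mask = p
        return x * mask, p

n_features = 64                                # illustrative dimensionality
gate = DropoutGate(n_features)                 # "selector"
operator = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(list(gate.parameters()) + list(operator.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, sparsity=1e-2):
    gated, p = gate(x)
    # Classification loss plus a penalty for keeping many features.
    loss = loss_fn(operator(gated), y) + sparsity * p.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, rank features by learned keep-probability:
# ranking = torch.argsort(torch.sigmoid(gate.logit_p), descending=True)
```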
Affiliation(s)
- Xiaoquan Ke
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region.
- Man Wai Mak
- Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong Special Administrative Region.
- Helen M Meng
- Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong Special Administrative Region.
3
Daneshfar F, Jamshidi MB. An octonion-based nonlinear echo state network for speech emotion recognition in Metaverse. Neural Netw 2023; 163:108-121. [PMID: 37030275] [DOI: 10.1016/j.neunet.2023.03.026]
Abstract
While the Metaverse is becoming a popular trend and drawing much attention from academia, society, and business, the processing cores used in its infrastructure need to be improved, particularly for signal processing and pattern recognition. Speech emotion recognition (SER) therefore plays a crucial role in making Metaverse platforms more usable and enjoyable for their users. However, existing SER methods face two significant problems in the online environment. The first is the shortage of adequate engagement and customization between avatars and users; the second is the complexity of SER in the Metaverse, where we face both people and their digital twins or avatars. Developing efficient machine learning (ML) techniques tailored to hypercomplex signal processing is therefore essential to enhance the impressiveness and tangibility of Metaverse platforms. Echo state networks (ESNs), a powerful ML tool for SER, are an appropriate technique for strengthening the Metaverse's foundations in this area. Nevertheless, ESNs have technical issues that prevent precise and reliable analysis, especially for high-dimensional data; their most significant limitation is the high memory consumption caused by the reservoir structure when processing high-dimensional signals. To address these problems, we propose a novel ESN structure empowered by octonion algebra, called NO2GESNet. Octonion numbers have eight dimensions, represent high-dimensional data compactly, and improve the network's precision and performance compared with conventional ESNs. The proposed network also overcomes the weakness of ESNs in presenting higher-order statistics to the output layer by equipping it with a multidimensional bilinear filter. Three comprehensive scenarios for using the proposed network in the Metaverse are designed and analyzed; they demonstrate the accuracy and performance of the proposed approach and illustrate how SER can be employed in Metaverse platforms.
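The octonion-valued reservoir itself is beyond a short example, but the classical echo state network recurrence that NO2GESNet generalizes can be sketched in a few lines of NumPy; the reservoir size, leak rate, and spectral radius below are illustrative defaults, not the paper's settings.

```python
import numpy as np

class EchoStateNetwork:
    """Minimal real-valued ESN; the paper lifts this recurrence to octonion algebra."""
    def __init__(self, n_in, n_res=500, spectral_radius=0.9, leak=0.3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.uniform(-0.5, 0.5, (n_res, n_res))
        # Rescale so the spectral radius is below 1 (echo state property).
        W *= spectral_radius / max(abs(np.linalg.eigvals(W)))
        self.W, self.leak = W, leak

    def states(self, U):                       # U: (T, n_in) input sequence
        x = np.zeros(self.W.shape[0])
        collected = []
        for u in U:                            # leaky-integrator reservoir update
            x = (1 - self.leak) * x + self.leak * np.tanh(self.W_in @ u + self.W @ x)
            collected.append(x.copy())
        return np.array(collected)             # (T, n_res) reservoir trajectory

# The readout is typically fit by ridge regression on the collected states X:
# W_out = np.linalg.solve(X.T @ X + 1e-6 * np.eye(X.shape[1]), X.T @ Y).T
```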
Affiliation(s)
- Fatemeh Daneshfar
- Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran.
4
Doğdu C, Kessler T, Schneider D, Shadaydeh M, Schweinberger SR. A comparison of machine learning algorithms and feature sets for automatic vocal emotion recognition in speech. Sensors (Basel) 2022; 22:7561. [PMID: 36236658] [PMCID: PMC9571288] [DOI: 10.3390/s22197561]
Abstract
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research, and human-computer interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets have been used with machine learning (ML) algorithms for discrete emotion classification, but there is no consensus on which low-level descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of ML algorithms across several different feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG), and Multinomial Logistic Regression (MLR), with 10-fold cross-validation using four openSMILE feature sets (IS-09, emobase, GeMAPS, and eGeMAPS). Results indicated that SMO, MLP, and LOG perform better (reaching 87.85%, 84.00%, and 83.74% accuracy, respectively) than RF, DT, MLR, and KNN (with minimum accuracies of 73.46%, 53.08%, 70.65%, and 58.69%, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention, and HCI.
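For readers wanting to reproduce this kind of comparison outside Weka, audEERING's opensmile Python package exposes the same feature families; a minimal scikit-learn analogue (a linear SVC standing in for SMO, with `wav_paths` and `labels` as hypothetical lists of file paths and emotion labels) might look like:

```python
import numpy as np
import opensmile
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Extract eGeMAPS functionals: one fixed-length vector per recording.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
X = np.vstack([smile.process_file(p).to_numpy().ravel() for p in wav_paths])

# 10-fold cross-validation, as in the study.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X, labels, cv=10)
print(f"mean accuracy: {scores.mean():.4f}")
```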
Affiliation(s)
- Cem Doğdu
- Department of Social Psychology, Institute of Psychology, Friedrich Schiller University Jena, Humboldtstraße 26, 07743 Jena, Germany
- Michael Stifel Center Jena for Data-Driven and Simulation Science, Friedrich Schiller University Jena, 07743 Jena, Germany
- Social Potential in Autism Research Unit, Friedrich Schiller University Jena, 07743 Jena, Germany
- Thomas Kessler
- Department of Social Psychology, Institute of Psychology, Friedrich Schiller University Jena, Humboldtstraße 26, 07743 Jena, Germany
- Dana Schneider
- Department of Social Psychology, Institute of Psychology, Friedrich Schiller University Jena, Humboldtstraße 26, 07743 Jena, Germany
- Michael Stifel Center Jena for Data-Driven and Simulation Science, Friedrich Schiller University Jena, 07743 Jena, Germany
- Social Potential in Autism Research Unit, Friedrich Schiller University Jena, 07743 Jena, Germany
- DFG Scientific Network “Understanding Others”, 10117 Berlin, Germany
- Maha Shadaydeh
- Michael Stifel Center Jena for Data-Driven and Simulation Science, Friedrich Schiller University Jena, 07743 Jena, Germany
- Computer Vision Group, Department of Mathematics and Computer Science, Friedrich Schiller University Jena, 07743 Jena, Germany
- Stefan R. Schweinberger
- Michael Stifel Center Jena for Data-Driven and Simulation Science, Friedrich Schiller University Jena, 07743 Jena, Germany
- Social Potential in Autism Research Unit, Friedrich Schiller University Jena, 07743 Jena, Germany
- Department of General Psychology and Cognitive Neuroscience, Friedrich Schiller University Jena, Am Steiger 3/Haus 1, 07743 Jena, Germany
- German Center for Mental Health (DZPG), Site Jena-Magdeburg-Halle, 07743 Jena, Germany
5
User identity protection in automatic emotion recognition through disguised speech. AI 2021. [DOI: 10.3390/ai2040038]
Abstract
Ambient Assisted Living (AAL) technologies are being developed which could assist elderly people to live healthy and active lives. These technologies have been used to monitor people's daily exercise, calorie consumption, and sleep patterns, and to provide coaching interventions that foster positive behaviour. Speech and audio processing can complement such AAL technologies by informing interventions for healthy ageing through analysis of speech data captured in the user's home. However, collecting data in home settings presents challenges, one of the most pressing being how to manage privacy and data protection. To address this issue, we propose a low-cost system for recording disguised speech signals that protects user identity through pitch shifting. The disguised speech so recorded can then be used to train machine learning models for affective behaviour monitoring. Affective behaviour could provide an indicator of the onset of mental health issues such as depression and cognitive impairment, and help develop clinical tools for automatically detecting and monitoring disease progression. In this article, acoustic features extracted from non-disguised and disguised speech are evaluated in an affect recognition task using six different machine learning classification methods. The results of transfer learning from non-disguised to disguised speech are also demonstrated. We identify sets of acoustic features that are not affected by the pitch-shifting algorithm and evaluate them in affect recognition. We found that, while the non-disguised speech signal gives the best Unweighted Average Recall (UAR) of 80.01%, the disguised speech signal causes only a slight degradation in performance, reaching 76.29%. Transfer learning from non-disguised to disguised speech results in a reduced UAR (65.13%); however, feature selection improves it (68.32%). This approach forms part of a larger project that includes health and wellbeing monitoring and coaching.
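The pitch-shifting disguise step can be sketched in a few lines with librosa; the four-semitone shift and file names below are placeholders, not necessarily the parameters used in the article.

```python
import librosa
import soundfile as sf

def disguise(path, out_path, n_steps=-4):
    """Pitch-shift a recording to mask speaker identity before feature extraction."""
    y, sr = librosa.load(path, sr=None)       # keep the native sample rate
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    sf.write(out_path, y_shifted, sr)

disguise("session.wav", "session_disguised.wav")  # shift down four semitones
```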
6
Amjad A, Khan L, Chang HT. Effect on speech emotion classification of a feature selection approach using a convolutional neural network. PeerJ Comput Sci 2021; 7:e766. [PMID: 34805511] [PMCID: PMC8576551] [DOI: 10.7717/peerj-cs.766]
Abstract
Speech emotion recognition (SER) is a challenging problem because it is not clear which features are effective for classification. Emotion-related features are typically extracted from speech signals, and handcrafted features in particular are widely used for emotion identification from audio. However, such features are not sufficient to correctly identify the speaker's emotional state. The proposed work investigates the advantages of a deep convolutional neural network (DCNN): a pretrained framework is used to extract features from speech emotion databases, and we adopt a feature selection (FS) approach to find the most discriminative and important features for SER. We use random forest (RF), decision tree (DT), support vector machine (SVM), multilayer perceptron (MLP), and k-nearest neighbors (KNN) classifiers to classify seven emotions. All experiments are performed on four publicly accessible databases. Our method obtains accuracies of 92.02%, 88.77%, 93.61%, and 77.23% on Emo-DB, SAVEE, RAVDESS, and IEMOCAP, respectively, for speaker-dependent (SD) recognition with the feature selection method. Furthermore, compared with current handcrafted-feature-based SER methods, the proposed method shows the best results for speaker-independent SER. For Emo-DB, all classifiers attain an accuracy of more than 80% with or without the feature selection technique.
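A hedged sketch of the pipeline outlined here: a pretrained CNN as a fixed feature extractor, a univariate filter for feature selection, and a conventional classifier on the reduced set. ResNet-18, the ANOVA F-score criterion, k=128, and the `train_images`/`y_train` variables are illustrative assumptions, not the paper's exact choices.

```python
import torch
import torchvision.models as models
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

# Pretrained CNN as a fixed feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()              # drop the ImageNet classification head
backbone.eval()

@torch.no_grad()
def deep_features(images):                     # images: (B, 3, 224, 224) spectrograms
    return backbone(images).numpy()            # (B, 512) pooled deep features

X_train = deep_features(train_images)          # hypothetical spectrogram-image tensors
X_test = deep_features(test_images)

# Keep the k most discriminative features (ANOVA F-score), then classify.
selector = SelectKBest(f_classif, k=128).fit(X_train, y_train)
clf = SVC().fit(selector.transform(X_train), y_train)
print("accuracy:", clf.score(selector.transform(X_test), y_test))
```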
Affiliation(s)
- Ammar Amjad
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
- Lal Khan
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
- Hsien-Tsung Chang
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan, Taiwan
- Department of Physical Medicine and Rehabilitation, Chang Gung Memorial Hospital, Taoyuan, Taiwan
- Artificial Intelligence Research Center, Chang Gung University, Taoyuan, Taiwan
- Bachelor Program in Artificial Intelligence, Chang Gung University, Taoyuan, Taiwan
7
de la Fuente Garcia S, Haider F, Luz S. COVID-19: Affect recognition through voice analysis during the winter lockdown in Scotland. Annu Int Conf IEEE Eng Med Biol Soc 2021; 2021:2326-2329. [PMID: 34890322] [DOI: 10.1109/embc46164.2021.9630833]
Abstract
The COVID-19 pandemic has led to unprecedented restrictions on people's lifestyles, which have affected their psychological wellbeing. In this context, this paper investigates the use of social signal processing techniques for remote assessment of emotions. It presents a machine learning method for affect recognition applied to recordings taken during the COVID-19 winter lockdown in Scotland (UK). The method is based exclusively on acoustic features extracted from voice recordings collected through home and mobile devices (i.e. phones and tablets), thus providing insight into the feasibility of monitoring people's psychological wellbeing remotely, automatically, and at scale. The proposed model predicts affect with a concordance correlation coefficient of 0.4230 (using Random Forest) for arousal and 0.3354 (using Decision Trees) for valence. Clinical relevance: In 2018/2019, 12% and 14% of Scottish adults reported depression and anxiety symptoms, respectively. Remote emotion recognition through home devices would support the detection of these difficulties, which are often underdiagnosed and, if untreated, may lead to temporary or chronic disability.
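The concordance correlation coefficient (CCC) used above measures agreement, penalizing both poor correlation and systematic bias. A short sketch of one affect dimension's evaluation, with `X_train`, `arousal_train`, and related arrays as hypothetical acoustic feature data and the hyperparameters as illustrative values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient."""
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mu_t - mu_p) ** 2)

# One regressor per affect dimension (arousal shown; valence is analogous).
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, arousal_train)
print("arousal CCC:", ccc(arousal_test, model.predict(X_test)))
```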
8
Farooq M, Hussain F, Baloch NK, Raja FR, Yu H, Zikria YB. Impact of feature selection algorithm on speech emotion recognition using deep convolutional neural network. Sensors (Basel) 2020; 20:6008. [PMID: 33113907] [PMCID: PMC7660211] [DOI: 10.3390/s20216008]
Abstract
Speech emotion recognition (SER) plays a significant role in human–machine interaction. Recognizing emotion from speech and classifying it precisely is a challenging task because a machine cannot understand its context. For accurate emotion classification, emotionally relevant features must be extracted from the speech data. Traditionally, handcrafted features were used for emotional classification from speech signals; however, they are not efficient enough to accurately depict the emotional states of the speaker. In this study, the benefits of a deep convolutional neural network (DCNN) for SER are explored. For this purpose, a pretrained network is used to extract features from state-of-the-art speech emotion datasets. Subsequently, a correlation-based feature selection technique is applied to the extracted features to select the most appropriate and discriminative features for SER. For the classification of emotions, we utilize support vector machines, random forests, the k-nearest neighbors algorithm, and neural network classifiers. Experiments are performed for speaker-dependent and speaker-independent SER using four publicly available datasets: the Berlin Dataset of Emotional Speech (Emo-DB), Surrey Audio Visual Expressed Emotion (SAVEE), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and the Ryerson Audio Visual Dataset of Emotional Speech and Song (RAVDESS). Our proposed method achieves accuracies of 95.10% for Emo-DB, 82.10% for SAVEE, 83.80% for IEMOCAP, and 81.30% for RAVDESS in speaker-dependent SER experiments. Moreover, our method yields the best results for speaker-independent SER compared with existing handcrafted-feature-based SER approaches.
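A correlation-based filter of the kind mentioned here keeps features that correlate strongly with the class label but weakly with each other. The greedy strategy and thresholds below are assumptions for illustration, not the paper's exact criterion:

```python
import numpy as np
import pandas as pd

def correlation_select(X, y, label_min=0.2, redundancy_max=0.8):
    """Greedy correlation-based feature selection (CFS-style filter)."""
    df = pd.DataFrame(X)
    relevance = df.apply(lambda col: abs(np.corrcoef(col, y)[0, 1]))
    selected = []
    for f in relevance.sort_values(ascending=False).index:
        if relevance[f] < label_min:
            break  # remaining features are too weakly related to the label
        # Reject features that are redundant with an already selected one.
        if all(abs(df[f].corr(df[g])) < redundancy_max for g in selected):
            selected.append(f)
    return selected

# kept = correlation_select(deep_features, emotion_labels)  # indices of retained features
```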
Affiliation(s)
- Misbah Farooq
- Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan; (M.F.); (F.H.); (N.K.B.)
- Fawad Hussain
- Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan; (M.F.); (F.H.); (N.K.B.)
- Naveed Khan Baloch
- Department of Computer Engineering, University of Engineering and Technology, Taxila 47050, Pakistan; (M.F.); (F.H.); (N.K.B.)
- Fawad Riasat Raja
- Department of Software Engineering, University of Engineering and Technology, Taxila 47050, Pakistan;
- Machine Intelligence and Pattern Analysis Laboratory, Griffith University, Nathan QLD 4111, Australia
- Heejung Yu
- Department of Electronics and Information Engineering, Korea University, Sejong 30019, Korea
- Correspondence: (H.Y.); (Y.B.Z.)
- Yousaf Bin Zikria
- Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38541, Korea
- Correspondence: (H.Y.); (Y.B.Z.)