1
Munsif M, Sajjad M, Ullah M, Tarekegn AN, Cheikh FA, Tsakanikas P, Muhammad K. Optimized efficient attention-based network for facial expressions analysis in neurological health care. Comput Biol Med 2024; 179:108822. PMID: 38986286. DOI: 10.1016/j.compbiomed.2024.108822.
Abstract
Facial Expression Analysis (FEA) plays a vital role in diagnosing and treating early-stage neurological disorders (NDs) such as Alzheimer's and Parkinson's. Manual FEA is hindered by expertise, time, and training requirements, while automatic methods struggle with the unavailability of real patient data, high computational cost, and irrelevant feature extraction. To address these challenges, this paper proposes a novel approach: an efficient, lightweight convolutional block attention module (CBAM)-based deep learning network (DLN) to aid doctors in diagnosing ND patients. The method comprises two stages: data collection from real ND patients, and pre-processing involving face detection followed by an attention-enhanced DLN for feature extraction and refinement. Extensive experiments with validation on real patient data show compelling performance, achieving an accuracy of up to 73.2%. Despite its efficacy, the proposed model is lightweight, occupying only 3 MB, making it suitable for deployment on resource-constrained mobile healthcare devices. Moreover, the method exhibits significant advancements over existing FEA approaches, holding considerable promise for effectively diagnosing and treating ND patients. By accurately recognizing emotions and extracting relevant features, this approach empowers medical professionals in early ND detection and management, overcoming the challenges of manual analysis and heavy models. In conclusion, this research presents a significant step forward in FEA, promising to enhance ND diagnosis and care. The code and data used in this work are available at: https://github.com/munsif200/Neurological-Health-Care.
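As a rough orientation for readers unfamiliar with the attention module named in the abstract, the sketch below shows a generic CBAM block in PyTorch; the reduction ratio, kernel size, and feature dimensions are illustrative assumptions and are not taken from the paper's released code.

```python
# Minimal sketch of a CBAM block (channel attention followed by spatial
# attention); sizes are illustrative, not the authors' configuration.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))          # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))           # global max pooling branch
        scale = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * scale

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)           # channel-wise average map
        mx = x.amax(dim=1, keepdim=True)            # channel-wise max map
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale

class CBAMBlock(nn.Module):
    """Refines a CNN feature map before expression classification."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))

feats = torch.randn(1, 64, 28, 28)                  # dummy face feature map
refined = CBAMBlock(64)(feats)                       # same shape, attention-refined
```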
Affiliation(s)
- Muhammad Sajjad
- Digital Image Processing Lab, Department of Computer Science, Islamia College, Peshawar, 25000, Pakistan; Department of Computer Science, Norwegian University of Science and Technology, 2815, Gjøvik, Norway
- Mohib Ullah
- Intelligent Systems and Analytics Research Group (ISA), Department of Computer Science, Norwegian University of Science and Technology, 2815, Gjøvik, Norway
- Adane Nega Tarekegn
- Department of Computer Science, Norwegian University of Science and Technology, 2815, Gjøvik, Norway
- Faouzi Alaya Cheikh
- Department of Computer Science, Norwegian University of Science and Technology, 2815, Gjøvik, Norway
- Panagiotis Tsakanikas
- Institute of Communication and Computer Systems, National Technical University of Athens, 15773 Athens, Greece
- Khan Muhammad
- Visual Analytics for Knowledge Laboratory (VIS2KNOW Lab), Department of Applied Artificial Intelligence, School of Convergence, College of Computing and Informatics, Sungkyunkwan University, Seoul 03063, Republic of Korea
2
Zheng D. Transfer learning-based English translation text classification in a multimedia network environment. PeerJ Comput Sci 2024; 10:e1842. PMID: 38435557. PMCID: PMC10909173. DOI: 10.7717/peerj-cs.1842.
Abstract
In recent years, with the rapid development of the Internet and multimedia technology, English translation text classification has played an important role in various industries. However, English translation remains a complex and difficult problem, and finding an efficient and accurate English translation method is an urgent problem to be solved. The study first elucidated the potential for developing transfer learning technology in multimedia environments. It then comprehensively reviewed previous research on this issue, together with the Bidirectional Encoder Representations from Transformers (BERT) model, the attention mechanism with bidirectional long short-term memory (Att-BiLSTM) model, and the transfer-learning-based cross-domain model (TLCM), along with their theoretical foundations. Through the application of transfer learning in multimedia network technology, we deconstructed and integrated these methods and established a new fused text classification model, the BATCL transfer learning model. We analyzed its requirements and label classification methods, proposed a data preprocessing method, and conducted experiments to analyze different influencing factors. The results indicate that the classification system obtained in this study follows a trend similar to that of the BERT model at the macro level, and that the proposed classification method can surpass the BERT model by up to 28%. The classification accuracy of the Att-BiLSTM model improves over time, but it does not exceed the classification accuracy of the method proposed in this study. This study not only helps to improve the accuracy of English translation, but also enhances the efficiency of machine learning algorithms, providing a new approach to solving English translation problems.
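For readers who want a concrete picture of the Att-BiLSTM component named in the abstract, here is a minimal PyTorch sketch of an attention-pooled bidirectional LSTM text classifier; the vocabulary size, hidden dimensions, and class count are assumptions, not the BATCL configuration.

```python
# Illustrative attention-based BiLSTM (Att-BiLSTM) text classifier sketch.
import torch
import torch.nn as nn

class AttBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=64, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)      # scores each time step
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):
        h, _ = self.lstm(self.embed(token_ids))   # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)         # attention-pooled sentence vector
        return self.fc(context)

model = AttBiLSTM(vocab_size=30522)
logits = model(torch.randint(1, 30522, (2, 16)))   # two dummy 16-token sentences
```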
Affiliation(s)
- Danyang Zheng
- School of Foreign Languages, Tianjin University, Tianjin, China
3
Mamieva D, Abdusalomov AB, Kutlimuratov A, Muminov B, Whangbo TK. Multimodal Emotion Detection via Attention-Based Fusion of Extracted Facial and Speech Features. Sensors (Basel) 2023; 23:5475. PMID: 37420642. PMCID: PMC10304130. DOI: 10.3390/s23125475.
Abstract
Methods for detecting emotions that employ several modalities at the same time have been found to be more accurate and resilient than those that rely on a single modality. This is because sentiments may be conveyed in a wide range of modalities, each of which offers a different and complementary window into the thoughts and emotions of the speaker. In this way, a more complete picture of a person's emotional state may emerge through the fusion and analysis of data from several modalities. This research proposes a new attention-based approach to multimodal emotion recognition. The technique integrates facial and speech features extracted by independent encoders in order to pick the most informative aspects; it increases the system's accuracy by processing speech and facial features of various sizes and focusing on the most useful parts of the input. A more comprehensive representation of facial expressions is extracted through the use of both low- and high-level facial features. These modalities are combined using a fusion network to create a multimodal feature vector, which is then fed to a classification layer for emotion recognition. The developed system is evaluated on two datasets, IEMOCAP and CMU-MOSEI, and shows superior performance compared to existing models, achieving a weighted accuracy (WA) of 74.6% and an F1 score of 66.1% on the IEMOCAP dataset and a WA of 80.7% and an F1 score of 73.7% on the CMU-MOSEI dataset.
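A hedged sketch of the kind of attention-based fusion head the abstract describes, operating on feature vectors that independent speech and face encoders would produce; the dimensionalities and the gating form are assumptions rather than the authors' architecture.

```python
# Attention-weighted fusion of pre-extracted speech and facial feature vectors.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, speech_dim=128, face_dim=256, fused_dim=128, num_emotions=4):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, fused_dim)
        self.face_proj = nn.Linear(face_dim, fused_dim)
        self.gate = nn.Linear(2 * fused_dim, 2)    # one attention weight per modality
        self.classifier = nn.Linear(fused_dim, num_emotions)

    def forward(self, speech_feat, face_feat):
        s = torch.tanh(self.speech_proj(speech_feat))
        f = torch.tanh(self.face_proj(face_feat))
        alpha = torch.softmax(self.gate(torch.cat([s, f], dim=-1)), dim=-1)
        fused = alpha[:, 0:1] * s + alpha[:, 1:2] * f   # weighted multimodal vector
        return self.classifier(fused)

logits = AttentionFusion()(torch.randn(2, 128), torch.randn(2, 256))
```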
Affiliation(s)
- Dilnoza Mamieva
- Department of Computer Engineering, Gachon University, Seongnam-si 13120, Republic of Korea
- Alpamis Kutlimuratov
- Department of AI Software, Gachon University, Seongnam-si 13120, Republic of Korea
- Bahodir Muminov
- Department of Artificial Intelligence, Tashkent State University of Economics, Tashkent 100066, Uzbekistan
- Taeg Keun Whangbo
- Department of Computer Engineering, Gachon University, Seongnam-si 13120, Republic of Korea
4
Olatinwo DD, Abu-Mahfouz A, Hancke G, Myburgh H. IoT-Enabled WBAN and Machine Learning for Speech Emotion Recognition in Patients. Sensors (Basel) 2023; 23:2948. PMID: 36991659. PMCID: PMC10056097. DOI: 10.3390/s23062948.
Abstract
Internet of things (IoT)-enabled wireless body area networks (WBANs) are an emerging technology that combines medical devices, wireless devices, and non-medical devices for healthcare management applications. Speech emotion recognition (SER) is an active research field in the healthcare domain and machine learning; it is a technique that can be used to automatically identify speakers' emotions from their speech. However, SER systems, especially in the healthcare domain, are confronted with several challenges, for example low prediction accuracy, high computational complexity, delays in real-time prediction, and the difficulty of identifying appropriate features from speech. Motivated by these research gaps, we proposed an emotion-aware IoT-enabled WBAN system within the healthcare framework in which data processing and long-range data transmission are performed by an edge AI system for real-time prediction of patients' speech emotions and for capturing changes in emotions before and after treatment. Additionally, we investigated the effectiveness of different machine learning and deep learning algorithms in terms of classification performance, feature extraction methods, and normalization methods. We developed a hybrid deep learning model, i.e., a convolutional neural network (CNN) combined with bidirectional long short-term memory (BiLSTM), and a regularized CNN model. We combined the models with different optimization strategies and regularization techniques to improve the prediction accuracy, reduce generalization error, and reduce the computational complexity of the neural networks in terms of their computational time, power, and space. Different experiments were performed to check the efficiency and effectiveness of the proposed machine learning and deep learning algorithms. The proposed models are compared with a related existing model for evaluation and validation using standard performance metrics such as prediction accuracy, precision, recall, F1 score, the confusion matrix, and the differences between the actual and predicted values. The experimental results showed that one of the proposed models outperformed the existing model with an accuracy of about 98%.
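The sketch below illustrates one plausible shape of the hybrid CNN-BiLSTM mentioned in the abstract, reading MFCC-like frames and emitting emotion logits; the filter counts, dropout rate, and input shape are assumptions and not the paper's exact model.

```python
# Rough CNN-BiLSTM speech-emotion classifier sketch over MFCC sequences.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_mfcc=40, num_emotions=8):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Dropout(0.3),                       # regularization to curb overfitting
        )
        self.bilstm = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 64, num_emotions)

    def forward(self, mfcc):                       # mfcc: (batch, n_mfcc, frames)
        feats = self.cnn(mfcc).transpose(1, 2)     # (batch, frames/2, 64)
        _, (h, _) = self.bilstm(feats)
        return self.fc(torch.cat([h[0], h[1]], dim=-1))

logits = CNNBiLSTM()(torch.randn(4, 40, 200))      # four dummy utterances
```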
Affiliation(s)
- Damilola D. Olatinwo
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
- Adnan Abu-Mahfouz
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
- Council for Scientific and Industrial Research (CSIR), Pretoria 0184, South Africa
- Gerhard Hancke
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
- Hermanus Myburgh
- Department of Electrical, Electronic and Computer Engineering, University of Pretoria, Pretoria 0001, South Africa
5
Kim J, Lee D. Facial Expression Recognition Robust to Occlusion and to Intra-Similarity Problem Using Relevant Subsampling. Sensors (Basel) 2023; 23:2619. PMID: 36904823. PMCID: PMC10007059. DOI: 10.3390/s23052619.
Abstract
This paper proposes facial expression recognition (FER) for in-the-wild data sets, chiefly addressing two issues: occlusion and the intra-similarity problem. An attention mechanism enables the use of the most relevant areas of facial images for specific expressions, and a triplet loss function addresses the intra-similarity problem, in which the same expression from different faces sometimes fails to be aggregated (and vice versa). The proposed approach is robust to occlusion and uses a spatial transformer network (STN) with an attention mechanism to utilize the specific facial regions that contribute most (i.e., that are most relevant) to particular facial expressions, e.g., anger, contempt, disgust, fear, joy, sadness, and surprise. In addition, the STN model is coupled with the triplet loss function to improve the recognition rate, outperforming existing approaches that employ cross-entropy or rely only on deep neural networks or classical methods. The triplet loss module alleviates the intra-similarity problem, leading to further improvement in classification. Experimental results substantiate the proposed approach, which improves the recognition rate in more practical cases, e.g., under occlusion. Quantitatively, the approach achieves accuracy more than 2.09% higher than existing FER results on the CK+ data set and 0.48% higher than a modified ResNet model on the FER2013 data set.
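To make the triplet-loss idea concrete, the snippet below shows how embeddings of the same expression (anchor and positive) are pulled together while a different expression (negative) is pushed apart; the toy embedding network stands in for the paper's STN-based model and is purely illustrative.

```python
# Triplet loss over face embeddings to counter the intra-similarity problem.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(48 * 48, 128))   # toy face embedder
criterion = nn.TripletMarginLoss(margin=1.0)

anchor = embed(torch.randn(8, 1, 48, 48))     # faces with expression A
positive = embed(torch.randn(8, 1, 48, 48))   # different faces, same expression A
negative = embed(torch.randn(8, 1, 48, 48))   # faces with another expression B

loss = criterion(anchor, positive, negative)  # typically added to a cross-entropy term
loss.backward()
```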
6
Gupta MV, Vaikole S, Oza AD, Patel A, Burduhos-Nergis DP, Burduhos-Nergis DD. Audio-Visual Stress Classification Using Cascaded RNN-LSTM Networks. Bioengineering (Basel) 2022; 9:510. PMID: 36290478. PMCID: PMC9598122. DOI: 10.3390/bioengineering9100510.
Abstract
The purpose of this research is to emphasize the importance of mental health and contribute to overall well-being by detecting stress. Stress is a state of mental or physical strain; it can result from any event or thought that frustrates, angers, or unnerves a person, and it is the body's response to a demand or challenge. Stress affects people on a daily basis and can be regarded as a hidden pandemic. Long-term (chronic) stress results in ongoing activation of the stress response, which wears down the body over time, with symptoms manifesting as behavioral, emotional, and physical effects. The most common assessment method involves administering brief self-report questionnaires such as the Perceived Stress Scale; however, self-report questionnaires frequently lack item specificity and validity, and interview-based measures can be time-consuming and costly. In this research, a novel method to detect human mental stress by processing audio-visual data is proposed, with a focus on audio-visual stress identification. Using the cascaded RNN-LSTM strategy, we achieved 91% accuracy on the RAVDESS dataset, classifying eight emotions and, from them, stressed and unstressed states.
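A minimal sketch of a cascaded RNN-LSTM classifier over fused audio-visual feature sequences, followed by the emotion-to-stress mapping implied by the abstract; the feature size, hidden width, and the grouping of RAVDESS emotions into "stressed" are assumptions.

```python
# Cascaded RNN -> LSTM emotion classifier, then a coarse stress mapping.
import torch
import torch.nn as nn

class CascadedRNNLSTM(nn.Module):
    def __init__(self, feat_dim=180, hidden=128, num_emotions=8):
        super().__init__()
        self.rnn = nn.RNN(feat_dim, hidden, batch_first=True)   # first stage
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)   # cascaded second stage
        self.fc = nn.Linear(hidden, num_emotions)

    def forward(self, seq):                       # seq: (batch, frames, feat_dim)
        h, _ = self.rnn(seq)
        _, (last, _) = self.lstm(h)
        return self.fc(last[-1])

EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]
STRESSED = {"angry", "fearful", "sad", "disgust"}   # assumed grouping of RAVDESS labels

logits = CascadedRNNLSTM()(torch.randn(2, 100, 180))
pred = [EMOTIONS[i] for i in logits.argmax(dim=1)]
stress = ["stressed" if e in STRESSED else "unstressed" for e in pred]
```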
Affiliation(s)
- Megha V. Gupta
- Department of Computer Engineering, New Horizon Institute of Technology and Management, University of Mumbai, Mumbai 400615, Maharashtra, India
- Shubhangi Vaikole
- Department of Computer Engineering, Datta Meghe College of Engineering, University of Mumbai, Mumbai 400708, Maharashtra, India
- Ankit D. Oza
- Department of Computer Sciences and Engineering, Institute of Advanced Research, Gandhinagar 382426, Gujarat, India
- Amisha Patel
- Department of Mathematics, Institute of Technology, Ahmedabad 382481, Gujarat, India
- Dumitru Doru Burduhos-Nergis
- Faculty of Materials Science and Engineering, Gheorghe Asachi Technical University of Iasi, 700050 Iasi, Romania
7
Zeng X, Zhong Z. Multimodal Sentiment Analysis of Online Product Information Based on Text Mining Under the Influence of Social Media. J Organ End User Comput 2022. DOI: 10.4018/joeuc.314786.
Abstract
Currently, with the dramatic increase in social media users and the greater variety of online product information, manual processing of this information is time-consuming and labour-intensive. Therefore, based on text mining of online information, this paper analyzes text representation methods for online information, discusses the long short-term memory network, and constructs an interactive attention graph convolutional network (IAGCN) model based on a graph convolutional neural network (GCNN) and an attention mechanism to study multimodal sentiment analysis (MSA) of online product information. The results show that the IAGCN model improves accuracy by 4.78% and the F1 value by 29.25% compared with a pure interactive attention network. Meanwhile, the model performs best when the GCNN has two layers and uses syntactic position attention. This research has important practical significance for MSA of online product information in social media.
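The snippet below sketches the two-layer graph-convolution-plus-attention idea behind a model like IAGCN: token features are propagated over a dependency-style adjacency matrix and attention-pooled for sentiment classification; all dimensions, the normalization, and the pooling choice are assumptions, not the published model.

```python
# Two GCN layers over a token graph, attention-pooled into a sentiment vector.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):                     # x: (nodes, in_dim), adj: (nodes, nodes)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin(adj @ x / deg)) # mean-aggregate neighbours, transform

class TinySentimentGCN(nn.Module):
    def __init__(self, dim=64, num_classes=3):
        super().__init__()
        self.gcn1, self.gcn2 = GCNLayer(dim, dim), GCNLayer(dim, dim)
        self.attn = nn.Linear(dim, 1)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x, adj):
        h = self.gcn2(self.gcn1(x, adj), adj)      # two GCN layers, as found optimal
        w = torch.softmax(self.attn(h), dim=0)     # attention over tokens
        return self.fc((w * h).sum(dim=0))

tokens, adj = torch.randn(10, 64), torch.eye(10)   # dummy sentence graph
logits = TinySentimentGCN()(tokens, adj)
```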
Affiliation(s)
- Xiao Zeng
- Huazhong University of Science and Technology, China
- Ziqi Zhong
- The London School of Economics and Political Science, UK
8
Bekmanova G, Yergesh B, Sharipbay A, Mukanova A. Emotional Speech Recognition Method Based on Word Transcription. Sensors (Basel) 2022; 22:1937. PMID: 35271083. PMCID: PMC8915129. DOI: 10.3390/s22051937.
Abstract
The emotional speech recognition method presented in this article was applied to recognize the emotions of students during online exams in distance learning necessitated by COVID-19. The purpose of this method is to recognize emotions in spoken speech through a knowledge base of emotionally charged words, which is stored as a code book. The method analyzes human speech for the presence of emotions. To assess the quality of the method, an experiment was conducted on 420 audio recordings; the accuracy of the proposed method is 79.7% for the Kazakh language. The method can be used for different languages and consists of the following tasks: capturing a signal, detecting speech in it, recognizing speech words in a simplified transcription, determining word boundaries, comparing the simplified transcription with the code book, and constructing a hypothesis about the degree of speech emotionality. If emotions are present, full word recognition is performed and the emotions in the speech are identified. An advantage of this method is that it can be used widely, since it is not demanding on computational resources. The described method can be applied when there is a need to recognize positive and negative emotions in a crowd, in public transport, schools, universities, etc. The experiment carried out has shown the effectiveness of this method. The results obtained will make it possible in the future to develop devices that record and recognize a speech signal when, for example, negative emotions are detected in speech and, if necessary, transmit a message about potential threats or unrest.
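A toy illustration of the code-book lookup step described above: recognized words from a simplified transcription are matched against emotionally charged entries to form a hypothesis about the utterance's emotionality. The entries and threshold are invented English stand-ins; the paper's code book is for Kazakh.

```python
# Code-book lookup over a simplified transcription to hypothesize emotionality.
EMOTION_CODEBOOK = {
    "awful": "negative",
    "terrible": "negative",
    "wonderful": "positive",
    "great": "positive",
}

def emotionality_hypothesis(words, threshold=1):
    """Return (is_emotional, matched emotions) for a simplified transcription."""
    hits = [EMOTION_CODEBOOK[w] for w in words if w in EMOTION_CODEBOOK]
    return len(hits) >= threshold, hits

print(emotionality_hypothesis(["the", "exam", "was", "terrible"]))
# (True, ['negative']) -> proceed to full word recognition and emotion labelling
```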
Affiliation(s)
- Gulmira Bekmanova
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Banu Yergesh
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Altynbek Sharipbay
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Assel Mukanova
- Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Nur-Sultan 010008, Kazakhstan
- Higher School of Information Technology and Engineering, Astana International University, Nur-Sultan 010000, Kazakhstan
9
A Proposal for Multimodal Emotion Recognition Using Aural Transformers and Action Units on RAVDESS Dataset. Appl Sci (Basel) 2021. DOI: 10.3390/app12010327.
Abstract
Emotion recognition is attracting the attention of the research community due to its multiple applications in different fields, such as medicine or autonomous driving. In this paper, we proposed an automatic emotion recognition system that consists of a speech emotion recognizer (SER) and a facial emotion recognizer (FER). For the SER, we evaluated a pre-trained xlsr-Wav2Vec2.0 transformer using two transfer-learning techniques: embedding extraction and fine-tuning. The best accuracy was achieved when we fine-tuned the whole model with a multilayer perceptron appended on top of it, confirming that training is more robust when it does not start from scratch and the network's prior knowledge is similar to the target task. For the facial emotion recognizer, we extracted the Action Units of the videos and compared the performance of static models against sequential models; sequential models beat static models by a narrow margin. Error analysis indicated that the visual systems could improve with a detector of high-emotional-load frames, opening a new line of research into ways of learning from videos. Finally, combining the two modalities with a late fusion strategy, we achieved 86.70% accuracy on the RAVDESS dataset in a subject-wise 5-CV evaluation, classifying eight emotions. The results demonstrate that these modalities carry relevant information about the user's emotional state and that combining them improves the final system performance.
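As a small illustration of the late-fusion step, the snippet below combines per-emotion probabilities from a speech model (e.g., a fine-tuned xlsr-Wav2Vec2.0 head) and an Action-Unit-based facial model after each has made its own prediction; the fusion weight is an assumption, not the paper's tuned value.

```python
# Late fusion: weighted average of modality-specific emotion probabilities.
import torch

EMOTIONS = ["neutral", "calm", "happy", "sad", "angry", "fearful", "disgust", "surprised"]

def late_fusion(speech_probs, face_probs, w_speech=0.6):
    """Weighted average of the two modality-specific probability vectors."""
    fused = w_speech * speech_probs + (1 - w_speech) * face_probs
    return EMOTIONS[int(fused.argmax())]

speech_probs = torch.softmax(torch.randn(8), dim=0)   # stand-in SER output
face_probs = torch.softmax(torch.randn(8), dim=0)     # stand-in FER output
print(late_fusion(speech_probs, face_probs))
```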