1. Wang J. Evaluation and analysis of visual perception using attention-enhanced computation in multimedia affective computing. Front Neurosci 2024;18:1449527. PMID: 39170679; PMCID: PMC11335721; DOI: 10.3389/fnins.2024.1449527.
Abstract
Facial expression recognition (FER) plays a crucial role in affective computing, enhancing human-computer interaction by enabling machines to understand and respond to human emotions. Despite advancements in deep learning, current FER systems often struggle with challenges such as occlusions, head pose variations, and motion blur in natural environments. These challenges highlight the need for more robust FER solutions. To address these issues, we propose the Attention-Enhanced Multi-Layer Transformer (AEMT) model, which integrates a dual-branch Convolutional Neural Network (CNN), an Attentional Selective Fusion (ASF) module, and a Multi-Layer Transformer Encoder (MTE) with transfer learning. The dual-branch CNN captures detailed texture and color information by processing RGB and Local Binary Pattern (LBP) features separately. The ASF module selectively enhances relevant features by applying global and local attention mechanisms to the extracted features. The MTE captures long-range dependencies and models the complex relationships between features, collectively improving feature representation and classification accuracy. Our model was evaluated on the RAF-DB and AffectNet datasets. Experimental results demonstrate that the AEMT model achieved an accuracy of 81.45% on RAF-DB and 71.23% on AffectNet, significantly outperforming existing state-of-the-art methods. These results indicate that our model effectively addresses the challenges of FER in natural environments, providing a more robust and accurate solution. The AEMT model significantly advances the field of FER by improving the robustness and accuracy of emotion recognition in complex real-world scenarios. This work not only enhances the capabilities of affective computing systems but also opens new avenues for future research in improving model efficiency and expanding multimodal data integration.
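To make the dual-branch idea concrete, the sketch below derives an LBP map with scikit-image and fuses it with an RGB branch by simple concatenation. It is a minimal illustration under assumed layer sizes, not the AEMT configuration; the paper's ASF and MTE stages are omitted.

```python
# Illustrative dual-branch RGB + LBP front end (PyTorch + scikit-image); sizes are assumptions.
import numpy as np
import torch
import torch.nn as nn
from skimage.feature import local_binary_pattern

def lbp_map(gray: np.ndarray, P: int = 8, R: int = 1) -> np.ndarray:
    """Uniform LBP codes rescaled to [0, 1] for network input."""
    codes = local_binary_pattern(gray, P, R, method="uniform")
    return (codes / codes.max()).astype(np.float32)

class DualBranchCNN(nn.Module):
    def __init__(self):
        super().__init__()
        def branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
        self.rgb_branch = branch(3)   # colour and texture from RGB
        self.lbp_branch = branch(1)   # micro-texture from LBP codes

    def forward(self, rgb, lbp):
        f_rgb = self.rgb_branch(rgb).flatten(1)   # (B, 64)
        f_lbp = self.lbp_branch(lbp).flatten(1)   # (B, 64)
        return torch.cat([f_rgb, f_lbp], dim=1)   # fused descriptor for later stages

gray = np.random.rand(64, 64)
lbp = torch.from_numpy(lbp_map(gray))[None, None]   # (1, 1, 64, 64)
fused = DualBranchCNN()(torch.randn(1, 3, 64, 64), lbp)
```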
Affiliation(s)
- Jingyi Wang
- School of Mass-communication and Advertising, Tongmyong University, Busan, Republic of Korea
2. Aina J, Akinniyi O, Rahman MM, Odero-Marah V, Khalifa F. A Hybrid Learning-Architecture for Mental Disorder Detection Using Emotion Recognition. IEEE Access 2024;12:91410-91425. PMID: 39054996; PMCID: PMC11270886; DOI: 10.1109/access.2024.3421376.
Abstract
Mental illness has grown to become a prevalent global health concern that affects individuals across various demographics. Timely detection and accurate diagnosis of mental disorders are crucial for effective treatment and support, as late diagnosis can lead to suicidal or harmful behaviors and, ultimately, death. To this end, the present study introduces a novel pipeline for the analysis of facial expressions, leveraging both the AffectNet and 2013 Facial Emotion Recognition (FER) datasets. This research goes beyond traditional diagnostic methods by contributing a system capable of generating a comprehensive mental disorder dataset and concurrently predicting mental disorders based on facial emotional cues. In particular, we introduce a hybrid architecture for mental disorder detection that leverages the state-of-the-art object detection algorithm YOLOv8 to detect and classify visual cues associated with specific mental disorders. To achieve accurate predictions, an integrated learning architecture based on the fusion of Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) models is developed to form an ensemble classifier that predicts the presence of mental illness (e.g., depression, anxiety, and other mental disorders). The overall accuracy is improved to about 81% using the proposed ensemble technique. To ensure transparency and interpretability, we integrate techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) and saliency maps to highlight the regions of the input image that contribute most to the model's predictions. This provides healthcare professionals with a clear understanding of the features influencing the system's decisions, enhancing trust and supporting a more informed diagnostic process.
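Grad-CAM, which the authors use for interpretability, fits in a few lines of PyTorch. The sketch below uses a ResNet-18 backbone and its last convolutional block as placeholders; the paper's YOLOv8 and CNN-ViT ensemble is not reproduced.

```python
# Minimal Grad-CAM sketch; the backbone and target layer are illustrative placeholders.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None).eval()
target_layer = model.layer4                    # last conv block
acts, grads = {}, {}
target_layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(x: torch.Tensor, class_idx: int) -> torch.Tensor:
    logits = model(x)
    model.zero_grad()
    logits[0, class_idx].backward()
    w = grads["v"].mean(dim=(2, 3), keepdim=True)     # channel importance weights
    cam = F.relu((w * acts["v"]).sum(dim=1))          # weighted activation map
    cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]
    return cam / (cam.max() + 1e-8)                   # normalised heat map

heat = grad_cam(torch.randn(1, 3, 224, 224), class_idx=0)
```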
Affiliation(s)
- Joseph Aina
- Electrical and Computer Engineering Department, School of Engineering, Morgan State University, Baltimore, MD 21251, USA
- Oluwatunmise Akinniyi
- Electrical and Computer Engineering Department, School of Engineering, Morgan State University, Baltimore, MD 21251, USA
- Md Mahmudur Rahman
- Department of Computer Science, School of Computer, Mathematical and Natural Sciences, Morgan State University, Baltimore, MD 21251, USA
- Valerie Odero-Marah
- Center for Urban Health Disparities Research and Innovation, Department of Biology, Morgan State University, Baltimore, MD 21251, USA
- Fahmi Khalifa
- Electrical and Computer Engineering Department, School of Engineering, Morgan State University, Baltimore, MD 21251, USA
- Electronics and Communications Engineering Department, Mansoura University, Mansoura 35516, Egypt
3. Li N, Huang Y, Wang Z, Fan Z, Li X, Xiao Z. Enhanced Hybrid Vision Transformer with Multi-Scale Feature Integration and Patch Dropping for Facial Expression Recognition. Sensors (Basel) 2024;24:4153. PMID: 39000930; PMCID: PMC11243949; DOI: 10.3390/s24134153.
Abstract
Convolutional neural networks (CNNs) have made significant progress in the field of facial expression recognition (FER). However, due to challenges such as occlusion, lighting variations, and changes in head pose, facial expression recognition in real-world environments remains highly challenging. At the same time, methods based solely on CNNs rely heavily on local spatial features, lack global information, and struggle to balance computational complexity against recognition accuracy, so CNN-based models still fall short of addressing FER adequately. To address these issues, we propose a lightweight facial expression recognition method based on a hybrid vision transformer. This method captures multi-scale facial features through an improved attention module, achieving richer feature integration, enhancing the network's perception of key facial expression regions, and improving feature extraction capabilities. Additionally, to further enhance the model's performance, we designed the patch dropping (PD) module. This module emulates the attention allocation mechanism of the human visual system for local features, guiding the network to focus on the most discriminative features, reducing the influence of irrelevant features, and directly lowering computational costs. Extensive experiments demonstrate that our approach significantly outperforms other methods, achieving an accuracy of 86.51% on RAF-DB and nearly 70% on FER2013, with a model size of only 3.64 MB. These results show that our method provides a new perspective for the field of facial expression recognition.
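The patch-dropping idea, keeping only the most salient patch tokens so later layers do less work, can be sketched as a top-k selection. The saliency scores and keep ratio below are illustrative assumptions, not the paper's PD module.

```python
# Illustrative patch dropping: keep the k most salient tokens per image.
import torch

def drop_patches(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.7):
    """tokens: (B, N, D) patch embeddings; scores: (B, N) saliency, e.g. CLS attention."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = scores.topk(k, dim=1).indices           # most discriminative patches
    idx = idx.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D) gather index
    return torch.gather(tokens, 1, idx)           # (B, k, D): cheaper downstream attention

kept = drop_patches(torch.randn(2, 196, 64), torch.rand(2, 196))   # 137 of 196 survive
```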
Affiliation(s)
- Nianfeng Li
- College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
- Yongyuan Huang
- College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
- Zhenyan Wang
- College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
- Ziyao Fan
- College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
- Xinyuan Li
- College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
- Zhiguo Xiao
- College of Computer Science and Technology, Changchun University, No. 6543, Satellite Road, Changchun 130022, China
- School of Computer Science Technology, Beijing Institute of Technology, Beijing 100811, China
4. Ramzani Shahrestani M, Motamed S, Yamaghani M. Recognition of facial emotion based on SOAR model. Front Neurosci 2024;18:1374112. PMID: 38826778; PMCID: PMC11140482; DOI: 10.3389/fnins.2024.1374112.
Abstract
Introduction Expressing emotions plays a special role in daily communication, and one of the most essential ways of detecting emotion is to detect facial emotional states. The recognition of facial expressions, and the creation of feedback according to the perceived emotion, is therefore a crucial aspect of natural human-machine interaction. This article presents an efficient method for recognizing emotional states from facial images based on a mixed deep learning and cognitive model called SOAR. Methods The model is implemented in two main steps. The first step reads the video, converts it to images, and preprocesses them. The next step uses a combination of a 3D convolutional neural network (3DCNN) and learning automata (LA) to classify facial emotions and measure the recognition rate. We chose a 3DCNN because no dimension is removed from the images, and incorporating the temporal information in dynamic images leads to more efficient and better classification. In addition, the backpropagation error used to train the 3DCNN is adjusted by the LA, which both increases the efficiency of the proposed model and implements the working-memory part of the SOAR model. Results and discussion The objectives of the proposed model include learning the temporal order of frames in a video and better representing visual features, thereby increasing the recognition rate. The proposed model recognizes facial emotional states with an accuracy of 85.3%. Comparisons with competing models show that the proposed model performs better than the alternatives.
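A 3D CNN of the kind described, convolving jointly over time and space so that neither dimension is discarded, might look like the minimal sketch below. The layer sizes are assumptions, and the learning-automata adjustment of backpropagation is not reproduced.

```python
# Minimal 3D-CNN sketch for clip-level expression classification; sizes are assumptions.
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                    # pool space, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, clip):                            # clip: (B, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

logits = Tiny3DCNN()(torch.randn(2, 3, 16, 64, 64))     # two 16-frame clips
```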
Affiliation(s)
- Sara Motamed
- Department of Computer Engineering, Fouman and Shaft Branch, Islamic Azad University, Fouman, Iran
- Mohammadreza Yamaghani
- Department of Computer Engineering, Lahijan Branch, Islamic Azad University, Lahijan, Iran
5. Xie W, Peng Z, Shen L, Lu W, Zhang Y, Song S. Cross-Layer Contrastive Learning of Latent Semantics for Facial Expression Recognition. IEEE Trans Image Process 2024;33:2514-2529. PMID: 38530732; DOI: 10.1109/tip.2024.3378459.
Abstract
Convolutional neural networks (CNNs) have achieved significant improvement on the task of facial expression recognition. However, current training still suffers from inconsistent learning intensities among different layers: the feature representations in the shallow layers are not sufficiently learned compared with those in deep layers. To this end, this work proposes a contrastive learning framework to align the feature semantics of shallow and deep layers, followed by an attention module that represents the multi-scale features in a weight-adaptive manner. The proposed algorithm has three main merits. First, the learning intensity, defined as the magnitude of the backpropagation gradient, of the features in the shallow layers is enhanced by cross-layer contrastive learning. Second, the latent semantics in the shallow-layer and deep-layer features are explored and aligned during contrastive learning, so the fine-grained characteristics of expressions can be taken into account in representation learning. Third, by integrating the multi-scale features from multiple layers with an attention module, our algorithm achieves state-of-the-art performance (92.21%, 89.50%, and 62.82%) on three in-the-wild expression databases (RAF-DB, FERPlus, and SFEW) and the second-best performance (65.29%) on the AffectNet dataset. Our codes will be made publicly available.
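The cross-layer alignment can be illustrated with an InfoNCE-style loss that treats the shallow and deep projections of the same image as a positive pair. This is a generic sketch, not the paper's exact formulation; the temperature is an assumed hyperparameter.

```python
# Sketch of a cross-layer InfoNCE loss aligning shallow- and deep-layer embeddings.
import torch
import torch.nn.functional as F

def cross_layer_nce(shallow: torch.Tensor, deep: torch.Tensor, tau: float = 0.1):
    """shallow, deep: (B, D) projections of the same images; positives on the diagonal."""
    s = F.normalize(shallow, dim=1)
    d = F.normalize(deep, dim=1)
    logits = s @ d.t() / tau                     # (B, B) similarity matrix
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)      # pull matching layer pairs together

loss = cross_layer_nce(torch.randn(8, 128), torch.randn(8, 128))
```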
6. Lim H, Joo Y, Ha E, Song Y, Yoon S, Shin T. Brain Age Prediction Using Multi-Hop Graph Attention Combined with Convolutional Neural Network. Bioengineering (Basel) 2024;11:265. PMID: 38534539; DOI: 10.3390/bioengineering11030265.
Abstract
Convolutional neural networks (CNNs) have been used widely to predict biological brain age based on brain magnetic resonance (MR) images. However, CNNs focus mainly on spatially local features and their aggregates and barely on the connective information between distant regions. To overcome this issue, we propose a novel multi-hop graph attention (MGA) module that exploits both the local and global connections of image features when combined with CNNs. After insertion between convolutional layers, MGA first converts the convolution-derived feature map into graph-structured data by using patch embedding and embedding-distance-based scoring. Multi-hop connections between the graph nodes are modeled by using the Markov chain process. After performing multi-hop graph attention, MGA re-converts the graph into an updated feature map and transfers it to the next convolutional layer. We combined the MGA module with sSE (spatial squeeze and excitation)-ResNet18 for our final prediction model (MGA-sSE-ResNet18) and performed various hyperparameter evaluations to identify the optimal parameter combinations. With 2788 three-dimensional T1-weighted MR images of healthy subjects, we verified the effectiveness of MGA-sSE-ResNet18 with comparisons to four established, general-purpose CNNs and two representative brain age prediction models. The proposed model yielded an optimal performance with a mean absolute error of 2.822 years and Pearson's correlation coefficient (PCC) of 0.968, demonstrating the potential of the MGA module to improve the accuracy of brain age prediction.
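The Markov-chain view of multi-hop connectivity can be sketched with powers of a row-stochastic transition matrix built from embedding similarities. Attention heads and the re-conversion to a feature map are omitted, and all names below are illustrative.

```python
# Sketch of multi-hop mixing via powers of a row-stochastic transition matrix.
import torch
import torch.nn.functional as F

def multi_hop_mix(nodes: torch.Tensor, n_hops: int = 3) -> torch.Tensor:
    """nodes: (N, D) patch embeddings; similarity -> transition matrix -> k-hop context."""
    sim = nodes @ nodes.t()                      # embedding-similarity scoring
    P = F.softmax(sim, dim=1)                    # one-step transition probabilities
    out, P_k = nodes, P
    for _ in range(n_hops):                      # accumulate 1..k hop neighbourhoods
        out = out + P_k @ nodes
        P_k = P_k @ P                            # next hop of the Markov chain
    return out / (n_hops + 1)

mixed = multi_hop_mix(torch.randn(49, 64))       # e.g. a 7x7 patch grid
```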
Affiliation(s)
- Heejoo Lim
- Division of Mechanical and Biomedical Engineering, Ewha W. University, Seoul 03760, Republic of Korea
- Graduate Program in Smart Factory, Ewha W. University, Seoul 03760, Republic of Korea
- Yoonji Joo
- Ewha Brain Institute, Ewha W. University, Seoul 03760, Republic of Korea
- Eunji Ha
- Ewha Brain Institute, Ewha W. University, Seoul 03760, Republic of Korea
- Yumi Song
- Ewha Brain Institute, Ewha W. University, Seoul 03760, Republic of Korea
- Department of Brain and Cognitive Sciences, Ewha W. University, Seoul 03760, Republic of Korea
- Sujung Yoon
- Ewha Brain Institute, Ewha W. University, Seoul 03760, Republic of Korea
- Department of Brain and Cognitive Sciences, Ewha W. University, Seoul 03760, Republic of Korea
- Taehoon Shin
- Division of Mechanical and Biomedical Engineering, Ewha W. University, Seoul 03760, Republic of Korea
- Graduate Program in Smart Factory, Ewha W. University, Seoul 03760, Republic of Korea
7. Tao H, Duan Q. Hierarchical attention network with progressive feature fusion for facial expression recognition. Neural Netw 2024;170:337-348. PMID: 38006736; DOI: 10.1016/j.neunet.2023.11.033.
Abstract
Facial expression recognition (FER) in the wild is challenging due to the disturbing factors including pose variation, occlusions, and illumination variation. The attention mechanism can relieve these issues by enhancing expression-relevant information and suppressing expression-irrelevant information. However, most methods utilize the same attention mechanism on feature tensors with varying spatial and channel sizes across different network layers, disregarding the dynamically changing sizes of these tensors. To solve this issue, this paper proposes a hierarchical attention network with progressive feature fusion for FER. Specifically, first, to aggregate diverse complementary features, a diverse feature extraction module based on several feature aggregation blocks is designed to exploit both local context and global context features, both low-level and high-level features, as well as the gradient features that are robust to illumination variation. Second, to effectively fuse the above diverse features, a hierarchical attention module (HAM) is designed to progressively enhance discriminative features from key parts of the facial images and suppress task-irrelevant features from disturbing facial regions. Extensive experiments show that our model achieves the best performance among existing FER methods.
Affiliation(s)
- Huanjie Tao
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, PR China; Engineering Research Center of Embedded System Integration, Ministry of Education, Xi'an 710129, PR China; National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, Xi'an 710129, PR China
- Qianyue Duan
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, PR China
8. B A, Sarkar A, Behera PR, Shukla J. Multi-source transfer learning for facial emotion recognition using multivariate correlation analysis. Sci Rep 2023;13:21004. PMID: 38017241; PMCID: PMC10684585; DOI: 10.1038/s41598-023-48250-x.
Abstract
Deep learning techniques have proven effective in solving the facial emotion recognition (FER) problem. However, they demand a significant amount of supervised data, which is often unavailable due to privacy and ethical concerns. In this paper, we present a novel approach to the FER problem using multi-source transfer learning. The proposed method leverages knowledge from multiple data sources of similar domains to inform the model on a related task. The approach involves optimizing the aggregate multivariate correlation among the source tasks trained on the source dataset, thus controlling the transfer of information to the target task. The hypothesis is validated on benchmark datasets for facial emotion recognition and image classification tasks, and the results demonstrate the effectiveness of the proposed method in capturing the group correlation among features, as well as its robustness to negative transfer and good performance in few-shot multi-source adaptation. With respect to the state-of-the-art methods MCW and DECISION, our approach shows improvements of 7% and 15%, respectively.
Affiliation(s)
- Ashwini B
- Human-Machine Interaction Lab, Indraprastha Institute of Information Technology, New Delhi, India
- Arka Sarkar
- Human-Machine Interaction Lab, Indraprastha Institute of Information Technology, New Delhi, India
- Pruthivi Raj Behera
- Human-Machine Interaction Lab, Indraprastha Institute of Information Technology, New Delhi, India
- Jainendra Shukla
- Human-Machine Interaction Lab, Indraprastha Institute of Information Technology, New Delhi, India
9. Wen H. Webcast marketing platform optimization via 6G R&D and the impact on brand content creation. PLoS One 2023;18:e0292394. PMID: 37856448; PMCID: PMC10586639; DOI: 10.1371/journal.pone.0292394.
Abstract
This work aims to investigate the development and management of cosmetics webcast marketing platforms, offering novel approaches for building and sustaining commercial brands. Firstly, an analysis of the current utilization of cosmetics webcast marketing platforms is conducted, identifying operational challenges associated with these platforms. Secondly, optimization strategies are proposed to address the identified issues by leveraging advancements in 6th Generation (6G) communication technology. Subsequently, a conceptual framework is established, employing big data interaction to examine the influence of webcast marketing platform experiences on brand fit. Multiple hypotheses are formulated to explore the relationship between platform experiences and brand fit. Finally, empirical analysis is performed within the context of the 5th Generation (5G) Mobile Communication Technology and extended to incorporate the 6G Mobile Communication Technology landscape. The results of the validation indicate the following: (1) the content generated by the webcast marketing platform has a positive impact on brand fit (β = 0.46, p<0.01; β = 0.31, p<0.05); (2) in the 6G network environment, a webcast marketing platform with high traffic transmission rates may enhance brand fit (β = 0.51, p<0.001); (3) the content generated by the webcast marketing platform exhibits significant positive regulatory effects on information-based and co-generated content (β = 0.42, p<0.01; β = 0.02, p<0.001). The findings of this work offer valuable insights for other scholars and researchers seeking to optimize webcast marketing platforms.
Affiliation(s)
- Hui Wen
- School of Management, Henan Institute of Economics and Trade, Zhengzhou, Henan, China
10. Li Y, Huang J, Lu S, Zhang Z, Lu G. Cross-Domain Facial Expression Recognition via Contrastive Warm up and Complexity-Aware Self-Training. IEEE Trans Image Process 2023;32:5438-5450. PMID: 37773906; DOI: 10.1109/tip.2023.3318955.
Abstract
Unsupervised cross-domain Facial Expression Recognition (FER) aims to transfer the knowledge from a labeled source domain to an unlabeled target domain. Existing methods strive to reduce the discrepancy between the source and target domains but cannot effectively explore the abundant semantic information of the target domain due to the absence of target labels. To this end, we propose a novel framework via Contrastive Warm up and Complexity-aware Self-Training (CWCST), which facilitates source knowledge transfer and target semantic learning jointly. Specifically, we formulate a contrastive warm-up strategy via features, momentum features, and learnable category centers to concurrently learn discriminative representations and narrow the domain gap, which benefits domain adaptation by generating more accurate target pseudo labels. Moreover, to deal with the inevitable noise in pseudo labels, we develop complexity-aware self-training with a label selection module based on prediction entropy, which iteratively generates pseudo labels and adaptively chooses the reliable ones for training, ultimately yielding effective target semantics exploration. By jointly using these two components, our framework effectively exploits both source knowledge and target semantic information through source-target co-training. In addition, our framework can be easily incorporated into other baselines with consistent performance improvements. Extensive experimental results on seven databases show the superior performance of the proposed method against various baselines.
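The entropy-based label selection can be illustrated as below; the entropy threshold is an assumed hyperparameter, not a value from the paper.

```python
# Sketch of complexity-aware pseudo-label selection by prediction entropy.
import torch
import torch.nn.functional as F

def select_pseudo_labels(logits: torch.Tensor, max_entropy: float = 0.5):
    """logits: (B, C) target-domain predictions; keep only low-entropy (reliable) samples."""
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)   # per-sample uncertainty
    keep = entropy < max_entropy
    return probs.argmax(dim=1)[keep], keep                        # labels + reliability mask

labels, mask = select_pseudo_labels(torch.randn(16, 7))
```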
11. Zhao Y, Zhu H, Chen X, Luo F, Li M, Zhou J, Chen S, Pan Y. Pose-invariant and occlusion-robust neonatal facial pain assessment. Comput Biol Med 2023;165:107462. PMID: 37716244; DOI: 10.1016/j.compbiomed.2023.107462.
Abstract
Neonatal Facial Pain Assessment (NFPA) is essential to improving neonatal pain management. Pose variation and occlusion, which can significantly alter facial appearance, are two major and still unstudied barriers to NFPA. We bridge this gap in terms of both method and dataset. Techniques that tackle these challenges in other tasks either design pose/occlusion-invariant deep learning methods or first generate a normalized version of the input image before feature extraction. Combining these ideas, we argue that it is more effective to perform adversarial learning and end-to-end classification jointly, for their mutual benefit. To this end, we propose a Pose-invariant Occlusion-robust Pain Assessment (POPA) framework with two novelties. First, we incorporate adversarial-learning-based disturbance mitigation for end-to-end pain-level classification and propose a novel composite loss function for facial representation learning. Second, in contrast to the vanilla discriminator that determines occlusion and pose conditions implicitly, we propose a multi-scale discriminator that determines them explicitly, incorporating local discriminators to enhance the discrimination of key regions. For a comprehensive evaluation, we built the first neonatal pain dataset with disturbance annotations, involving 1091 neonates, and also applied the proposed POPA to the facial expression recognition task. Extensive qualitative and quantitative experiments prove the superiority of POPA.
Affiliation(s)
- Yisheng Zhao
- College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
- Huaiyu Zhu
- College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
- Xiaofei Chen
- Nursing Department, The Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China
- Feixiang Luo
- Nursing Department, The Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China
- Mengting Li
- Nursing Department, The Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China
- Jinyan Zhou
- Nursing Department, The Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China
- Shuohui Chen
- Hospital Infection-Control Department, The Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou 310052, China
- Yun Pan
- College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
12. Chen Y, Liu S, Zhao D, Ji W. Occlusion facial expression recognition based on feature fusion residual attention network. Front Neurorobot 2023;17:1250706. PMID: 37663762; PMCID: PMC10472272; DOI: 10.3389/fnbot.2023.1250706.
Abstract
Recognizing occluded facial expressions in the wild poses a significant challenge. Most previous approaches rely solely on either global or local feature-based methods, leading to the loss of relevant expression features. To address these issues, a feature fusion residual attention network (FFRA-Net) is proposed. FFRA-Net consists of a multi-scale module, a local attention module, and a feature fusion module. The multi-scale module divides the intermediate feature map into several equal sub-feature maps along the channel dimension and applies a convolution operation to each, yielding diverse global features. The local attention module divides the intermediate feature map into several sub-feature maps along the spatial dimension and applies a convolution operation to each, extracting local key features through the attention mechanism. The feature fusion module integrates global and local expression features and establishes residual links between inputs and outputs to compensate for the loss of fine-grained features. Finally, two occlusion expression datasets (FM_RAF-DB and SG_RAF-DB) were constructed based on the RAF-DB dataset. Extensive experiments demonstrate that the proposed FFRA-Net achieves excellent results on four datasets: FM_RAF-DB, SG_RAF-DB, RAF-DB, and FERPlus, with accuracies of 77.87%, 79.50%, 88.66%, and 88.97%, respectively. The approach thus demonstrates strong applicability to occluded facial expression recognition (FER).
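The multi-scale module's equal channel split can be sketched as below. Using dilation to vary the receptive field per chunk is an illustrative choice, since the abstract only specifies a convolution per sub-feature map.

```python
# Sketch of a channel-split multi-scale block: equal chunks, per-chunk convs, concat.
import torch
import torch.nn as nn

class ChannelSplitMultiScale(nn.Module):
    def __init__(self, channels: int = 64, n_groups: int = 4):
        super().__init__()
        c = channels // n_groups
        # one conv per chunk; growing dilation gives each chunk a different scale
        self.convs = nn.ModuleList(
            nn.Conv2d(c, c, 3, padding=d, dilation=d) for d in range(1, n_groups + 1)
        )

    def forward(self, x):
        chunks = x.chunk(len(self.convs), dim=1)   # equal split along channels
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)

y = ChannelSplitMultiScale()(torch.randn(1, 64, 28, 28))   # same spatial size out
```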
Affiliation(s)
- Shuaishi Liu
- School of Electrical and Electronic Engineering, Changchun University of Technology, Changchun, China
13. Fang B, Zhao Y, Han G, He J. Expression-Guided Deep Joint Learning for Facial Expression Recognition. Sensors (Basel) 2023;23:7148. PMID: 37631685; PMCID: PMC10457757; DOI: 10.3390/s23167148.
Abstract
In recent years, convolutional neural networks (CNNs) have played a dominant role in facial expression recognition. While CNN-based methods have achieved remarkable success, they are notorious for having an excessive number of parameters and for relying on a large amount of manually annotated data. To address this challenge, we expand the number of training samples by learning expressions from a face recognition dataset, reducing the impact of small sample sizes on network training. In the proposed deep joint learning framework, the deep features of the face recognition dataset are clustered while the parameters of an efficient CNN are learned simultaneously, thereby labeling the data for network training automatically and efficiently. Specifically, we first develop a new efficient CNN based on the proposed affinity convolution (AC) module, with much lower computational overhead, for deep feature learning and expression classification. Then, we develop an expression-guided deep facial clustering approach to cluster the deep features and generate abundant expression labels from the face recognition dataset. Finally, the AC-based CNN is fine-tuned using an updated training set and a combined loss function. Our framework is evaluated on several challenging facial expression recognition datasets as well as a self-collected dataset. In the context of facial expression recognition applied to education, our proposed method achieved an impressive accuracy of 95.87% on the self-collected dataset, surpassing existing methods.
Affiliation(s)
- Bei Fang
- Key Laboratory of Modern Teaching Technology, Ministry of Education, Shaanxi Normal University, Xi’an 710062, China
- Yujie Zhao
- Department of Information Construction and Management, Shaanxi Normal University, Xi’an 710061, China
- Guangxin Han
- Key Laboratory of Modern Teaching Technology, Ministry of Education, Shaanxi Normal University, Xi’an 710062, China
- Juhou He
- Key Laboratory of Modern Teaching Technology, Ministry of Education, Shaanxi Normal University, Xi’an 710062, China
14. Bellamkonda S, Gopalan NP, Mala C, Settipalli L. Facial expression recognition on partially occluded faces using component based ensemble stacked CNN. Cogn Neurodyn 2023;17:985-1008. PMID: 37522034; PMCID: PMC10374495; DOI: 10.1007/s11571-022-09879-y.
Abstract
Facial Expression Recognition (FER) is the basis for many applications, including human-computer interaction and surveillance. While developing such applications, it is imperative to understand human emotions for better interaction with machines. Among the many FER models developed so far, Ensemble Stacked Convolutional Neural Networks (ES-CNN) have shown an empirical impact in improving the performance of FER on static images. However, existing ES-CNN-based FER models, trained with features extracted from the entire face, are unable to address ambient factors such as pose, illumination, and occlusion. To mitigate the reduced performance of ES-CNN on partially occluded faces, a Component-based ES-CNN (CES-CNN) is proposed. CES-CNN applies ES-CNN to action units of individual face components, such as the eyes, eyebrows, nose, cheeks, mouth, and glabella, each as one subnet of the network. A max-voting-based ensemble classifier combines the decisions of the subnets to obtain optimized recognition accuracy. The proposed CES-CNN is validated by experiments on benchmark datasets, and its performance is compared with state-of-the-art models. The experimental results show that the proposed model significantly enhances recognition accuracy compared to existing models.
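Max-voting over component subnets reduces to a mode over per-subnet class predictions, as in this sketch (the subnet count and outputs are placeholders):

```python
# Sketch of max-voting over per-component subnet predictions.
import torch

def max_vote(subnet_logits: list) -> torch.Tensor:
    """subnet_logits: list of (B, C) outputs, one per facial component (eyes, nose, ...)."""
    votes = torch.stack([l.argmax(dim=1) for l in subnet_logits], dim=1)  # (B, n_subnets)
    return votes.mode(dim=1).values                                       # majority class

preds = max_vote([torch.randn(4, 7) for _ in range(6)])   # six component subnets
```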
Affiliation(s)
- Sivaiah Bellamkonda
- Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India
- N. P. Gopalan
- Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India
- C. Mala
- Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India
- Lavanya Settipalli
- Department of Computer Applications, National Institute of Technology, Tiruchirappalli, Tamilnadu 620015, India
15. Yao H, Yang X, Chen D, Wang Z, Tian Y. Facial Expression Recognition Based on Fine-Tuned Channel-Spatial Attention Transformer. Sensors (Basel) 2023;23:6799. PMID: 37571582; PMCID: PMC10422316; DOI: 10.3390/s23156799.
Abstract
Facial expressions help individuals convey their emotions. In recent years, thanks to the development of computer vision technology, facial expression recognition (FER) has become a research hotspot and made remarkable progress. However, human faces in real-world environments are affected by various unfavorable factors, such as facial occlusion and head pose changes, which are seldom encountered in controlled laboratory settings and often reduce expression recognition accuracy. Inspired by the recent success of transformers in many computer vision tasks, we propose the fine-tuned channel-spatial attention transformer (FT-CSAT) to improve FER accuracy in the wild. FT-CSAT consists of two crucial components: a channel-spatial attention module and a fine-tuning module. In the channel-spatial attention module, the feature map passes through the channel attention module and the spatial attention module sequentially, so the final output feature map effectively incorporates both channel and spatial information. Consequently, the network becomes adept at focusing on relevant and meaningful features associated with facial expressions. To further improve performance while controlling the number of additional parameters, we employ a fine-tuning method. Extensive experimental results demonstrate that FT-CSAT outperforms state-of-the-art methods on two benchmark datasets, RAF-DB and FERPlus, with recognition accuracies of 88.61% and 89.26%, respectively. Furthermore, to evaluate the robustness of FT-CSAT under facial occlusion and head pose changes, we test on the Occlusion-RAF-DB and Pose-RAF-DB datasets; the results also show the superior recognition performance of the proposed method under such conditions.
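A sequential channel-then-spatial attention block of the kind described can be sketched as follows; the reduction ratio and kernel size are conventional CBAM-style defaults, assumed rather than taken from FT-CSAT.

```python
# Sketch of sequential channel -> spatial attention; hyperparameters are assumptions.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels: int = 64, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # channel attention from average- and max-pooled descriptors
        gate = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * gate[:, :, None, None]
        # spatial attention from channel-pooled maps
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

y = ChannelSpatialAttention()(torch.randn(2, 64, 14, 14))
```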
Affiliation(s)
- Yuan Tian
- Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
16. Chen X, Zheng X, Sun K, Liu W, Zhang Y. Self-supervised vision transformer-based few-shot learning for facial expression recognition. Inf Sci (N Y) 2023. DOI: 10.1016/j.ins.2023.03.105.
17. Raimundo A, Pavia JP, Sebastião P, Postolache O. YOLOX-Ray: An Efficient Attention-Based Single-Staged Object Detector Tailored for Industrial Inspections. Sensors (Basel) 2023;23:4681. PMID: 37430595; DOI: 10.3390/s23104681.
Abstract
Industrial inspection is crucial for maintaining quality and safety in industrial processes. Deep learning models have recently demonstrated promising results in such tasks. This paper proposes YOLOX-Ray, an efficient new deep learning architecture tailored for industrial inspection. YOLOX-Ray is based on the You Only Look Once (YOLO) object detection algorithms and integrates the SimAM attention mechanism for improved feature extraction in the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN). Moreover, it also employs the Alpha-IoU cost function for enhanced small-scale object detection. YOLOX-Ray's performance was assessed in three case studies: hotspot detection, infrastructure crack detection and corrosion detection. The architecture outperforms all other configurations, achieving mAP50 values of 89%, 99.6% and 87.7%, respectively. For the most challenging metric, mAP50:95, the achieved values were 44.7%, 66.1% and 51.8%, respectively. A comparative analysis demonstrated the importance of combining the SimAM attention mechanism with Alpha-IoU loss function for optimal performance. In conclusion, YOLOX-Ray's ability to detect and to locate multi-scale objects in industrial environments presents new opportunities for effective, efficient and sustainable inspection processes across various industries, revolutionizing the field of industrial inspections.
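SimAM is parameter-free and compact enough to sketch in full; the code below follows the published SimAM energy formula, with the commonly used λ default assumed.

```python
# Parameter-free SimAM attention (sketch of the published formula).
import torch

def simam(x: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """x: (B, C, H, W); weights each activation by an inverse energy score."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # deviation from channel mean
    v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance estimate
    e_inv = d / (4 * (v + lam)) + 0.5                   # inverse energy per position
    return x * torch.sigmoid(e_inv)

y = simam(torch.randn(1, 32, 40, 40))
```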
Affiliation(s)
- António Raimundo
- Instituto de Telecomunicações (IT), Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
- Department of Information Science and Technology, Iscte-Instituto Universitário de Lisboa, Av. das Forças Armadas, 1649-026 Lisboa, Portugal
- João Pedro Pavia
- Instituto de Telecomunicações (IT), Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
- COPELABS, Universidade Lusófona, Campo Grande 376, 1749-024 Lisboa, Portugal
- Pedro Sebastião
- Instituto de Telecomunicações (IT), Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
- Department of Information Science and Technology, Iscte-Instituto Universitário de Lisboa, Av. das Forças Armadas, 1649-026 Lisboa, Portugal
- Octavian Postolache
- Instituto de Telecomunicações (IT), Av. Rovisco Pais, 1, 1049-001 Lisboa, Portugal
- Department of Information Science and Technology, Iscte-Instituto Universitário de Lisboa, Av. das Forças Armadas, 1649-026 Lisboa, Portugal
18. Qu Z, Niu D. Leveraging ResNet and label distribution in advanced intelligent systems for facial expression recognition. Math Biosci Eng 2023;20:11101-11115. PMID: 37322973; DOI: 10.3934/mbe.2023491.
Abstract
With the development of artificial intelligence (AI), facial expression recognition (FER) is a hot topic in computer vision. Many existing works employ a single label for FER and therefore do not consider the label distribution problem; in addition, some discriminative features cannot be captured well. To overcome these problems, we propose a novel framework, ResFace, for FER. It has the following modules: 1) a local feature extraction module, in which ResNet-18 and ResNet-50 are used to extract local features for the subsequent feature aggregation; 2) a channel feature aggregation module, in which a channel-spatial feature aggregation method is adopted to learn high-level features for FER; 3) a compact feature aggregation module, in which several convolutional operations are used to learn the label distributions that interact with the softmax layer. Extensive experiments conducted on the FER+ and Real-world Affective Faces databases demonstrate that the proposed approach obtains comparable performance: 89.87% and 88.38%, respectively.
Affiliation(s)
- Zhenggeng Qu
- College of Mathematics and Computer Application, Shangluo University, Shaanxi 726000, China
- Engineering Research Center of Qinling Health Welfare Big Data, Shaanxi 726000, China
- Danying Niu
- Shangluo Central Hospital, Shaanxi 726000, China
19. Liao J, Lin Y, Ma T, He S, Liu X, He G. Facial Expression Recognition Methods in the Wild Based on Fusion Feature of Attention Mechanism and LBP. Sensors (Basel) 2023;23:4204. PMID: 37177408; PMCID: PMC10180539; DOI: 10.3390/s23094204.
Abstract
Facial expression recognition methods play a vital role in human-computer interaction and other fields, but recognition in the wild must contend with occlusion, illumination, and pose changes, as well as category imbalance across datasets, which lead to large variations in recognition rates and low accuracy for some expression categories. This study introduces RCL-Net, a method for recognizing facial expressions in the wild based on an attention mechanism and LBP feature fusion. The structure consists of two main branches: a ResNet-CBAM residual attention branch and a local binary pattern (LBP) feature extraction branch. First, by merging the residual network with a hybrid attention mechanism, the residual attention network emphasizes the local detail information of facial expressions; significant characteristics are retrieved from both the channel and spatial dimensions to build the residual attention classification model. Second, we present a locally improved residual network attention model: LBP features are introduced in the feature extraction stage to capture texture information from expression images, emphasizing facial feature information and enhancing the recognition accuracy of the model. Finally, experimental validation is performed on the FER2013, FERPlus, CK+, and RAF-DB datasets, and the results demonstrate that the proposed method has better generalization capability and robustness in both laboratory-controlled and in-the-wild environments than recent methods.
Affiliation(s)
- Jun Liao
- Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- College of Mechanical Engineering, Chongqing University of Technology, Chongqing 400054, China
- Chongqing Key Laboratory of Artificial Intelligence and Service Robot Control Technology, Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Yuanchang Lin
- Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Chongqing Key Laboratory of Artificial Intelligence and Service Robot Control Technology, Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Tengyun Ma
- Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Chongqing Key Laboratory of Artificial Intelligence and Service Robot Control Technology, Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Songxiying He
- Chongqing Key Laboratory of Artificial Intelligence and Service Robot Control Technology, Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Xiaofang Liu
- Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Chongqing Key Laboratory of Artificial Intelligence and Service Robot Control Technology, Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Guotian He
- Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
- Chongqing Key Laboratory of Artificial Intelligence and Service Robot Control Technology, Chongqing Institute of Green Intelligent Technology, Chinese Academy of Sciences, Chongqing 400714, China
20. Rasmussen SHR, Ludeke SG, Klemmensen R. Using deep learning to predict ideology from facial photographs: expressions, beauty, and extra-facial information. Sci Rep 2023;13:5257. PMID: 37002240; PMCID: PMC10066183; DOI: 10.1038/s41598-023-31796-1.
Abstract
Deep learning techniques can use public data such as facial photographs to predict sensitive personal information, but little is known about what information contributes to the predictive success of these techniques. This lack of knowledge limits both the public's ability to protect against revealing unintended information as well as the scientific utility of deep learning results. We combine convolutional neural networks, heat maps, facial expression coding, and classification of identifiable features such as masculinity and attractiveness in our study of political ideology in 3323 Danes. Predictive accuracy from the neural network was 61% in each gender. Model-predicted ideology correlated with aspects of both facial expressions (happiness vs neutrality) and morphology (specifically, attractiveness in females). Heat maps highlighted the informativeness of areas both on and off the face, pointing to methodological refinements and the need for future research to better understand the significance of certain facial areas.
Affiliation(s)
- Steven G. Ludeke
- Department of Psychology, University of Southern Denmark, Odense, Denmark
- Robert Klemmensen
- Department of Political Science, Lund University, Lund, Sweden
21. Qiu S, Zhao G, Li X, Wang X. Facial Expression Recognition Using Local Sliding Window Attention. Sensors (Basel) 2023;23:3424. PMID: 37050483; PMCID: PMC10098964; DOI: 10.3390/s23073424.
Abstract
There are problems associated with facial expression recognition (FER), such as facial occlusion and head pose variations. These two problems lead to incomplete facial information in images, making feature extraction extremely difficult. Most current methods use prior knowledge or fixed-size patches to perform local cropping, thereby enhancing the ability to acquire fine-grained features. However, the former requires extra data processing work and is prone to errors, while the latter destroys the integrity of local features. In this paper, we propose a local Sliding Window Attention Network (SWA-Net) for FER. Specifically, we propose a sliding window strategy for feature-level cropping, which preserves the integrity of local features and does not require complex preprocessing. Moreover, a local feature enhancement module mines fine-grained features with intraclass semantics through a multiscale deep network, and an adaptive local feature selection module prompts the model to find the most essential local features. Extensive experiments demonstrate that our SWA-Net model achieves performance comparable to that of state-of-the-art methods, with scores of 90.03% on RAF-DB, 89.22% on FERPlus, and 63.97% on AffectNet.
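Feature-level sliding-window cropping maps naturally onto torch.Tensor.unfold; the window size and stride below are illustrative assumptions.

```python
# Sketch of feature-level sliding-window cropping with Tensor.unfold.
import torch

def sliding_windows(fmap: torch.Tensor, win: int = 4, stride: int = 2) -> torch.Tensor:
    """fmap: (B, C, H, W) -> (B, n_windows, C, win, win) overlapping local crops."""
    patches = fmap.unfold(2, win, stride).unfold(3, win, stride)  # (B, C, nH, nW, win, win)
    B, C, nH, nW, _, _ = patches.shape
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(B, nH * nW, C, win, win)

crops = sliding_windows(torch.randn(2, 64, 14, 14))   # 36 windows per feature map
```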
Affiliation(s)
- Shuang Qiu
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Beijing Key Laboratory of Robot Bionics and Function Research, Beijing 100044, China
- Guangzhe Zhao
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Beijing Key Laboratory of Robot Bionics and Function Research, Beijing 100044, China
- Xiao Li
- School of Electronics and Information Engineering, Zhongyuan University of Technology, Zhengzhou 450007, China
- Xueping Wang
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
- Beijing Key Laboratory of Robot Bionics and Function Research, Beijing 100044, China
22. Shahid AR, Yan H. SqueezExpNet: Dual-stage convolutional neural network for accurate facial expression recognition with attention mechanism. Knowl Based Syst 2023. DOI: 10.1016/j.knosys.2023.110451.
23. Kim J, Lee D. Facial Expression Recognition Robust to Occlusion and to Intra-Similarity Problem Using Relevant Subsampling. Sensors (Basel) 2023;23:2619. PMID: 36904823; PMCID: PMC10007059; DOI: 10.3390/s23052619.
Abstract
This paper proposes a facial expression recognition (FER) method for in-the-wild datasets. In particular, it addresses two issues: occlusion and the intra-similarity problem. The attention mechanism enables the network to use the most relevant areas of facial images for specific expressions, and the triplet loss function addresses the intra-similarity problem, in which the same expression on different faces sometimes fails to be aggregated (and vice versa). The proposed approach is robust to occlusion, using a spatial transformer network (STN) with an attention mechanism to focus on the facial regions that contribute most to particular expressions, e.g., anger, contempt, disgust, fear, joy, sadness, and surprise. In addition, the STN model is combined with the triplet loss function to improve the recognition rate, outperforming existing approaches that employ cross-entropy or that rely solely on deep neural networks or classical methods. The triplet loss module alleviates the limitations of the intra-similarity problem, leading to further improvement in classification. Experimental results substantiate the proposed approach, which outperforms existing recognition rates in more practical cases such as occlusion: its accuracy is more than 2.09% higher than existing FER results on the CK+ dataset and 0.48% higher than a modified ResNet model on the FER2013 dataset.
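Triplet-loss training of the kind described is supported directly by PyTorch; the embeddings below are random placeholders for STN outputs.

```python
# Sketch of triplet-loss training: same expression pulled together, different pushed apart.
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)           # margin is an assumed hyperparameter

anchor   = torch.randn(8, 128, requires_grad=True)   # embeddings of an expression
positive = torch.randn(8, 128)                       # same expression, different face
negative = torch.randn(8, 128)                       # different expression
loss = triplet(anchor, positive, negative)
loss.backward()                                      # gradients flow to the embedding network
```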
24. SoftClusterMix: learning soft boundaries for empirical risk minimization. Neural Comput Appl 2023. DOI: 10.1007/s00521-023-08338-x.
25. Gupta S, Kumar P, Tekchandani R. A multimodal facial cues based engagement detection system in e-learning context using deep learning approach. Multimed Tools Appl 2023;82:1-27. PMID: 36789011; PMCID: PMC9911959; DOI: 10.1007/s11042-023-14392-3.
Abstract
Due to the COVID-19 crisis, the education sector has shifted to a virtual environment. Monitoring engagement levels and providing regular feedback during e-classes is a major concern, as this facility is missing in the e-learning environment, where the teacher cannot physically observe students. The present study proposes an engagement detection system that ensures students receive immediate feedback during e-learning. The proposed system analyses the student's behaviour throughout the e-learning session. The novel approach evaluates three modalities based on the student's behaviour, namely facial expression, eye blink count, and head movement, from live video streams to predict student engagement. The system is implemented with deep learning approaches such as VGG-19 and ResNet-50 for facial emotion recognition and a facial-landmark approach for eye-blink and head-movement detection. The results from the different modalities are combined to determine an engagement index (EI), from which an engaged or disengaged state is predicted. The study suggests that the proposed facial-cues-based multimodal system accurately determines student engagement in real time. The experiments achieved an accuracy of 92.58%, showing that the proposed engagement detection approach significantly outperforms existing approaches.
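Blink counting from facial landmarks is commonly done with the eye aspect ratio (EAR); assuming that is the landmark measure intended here (the abstract does not name one), a sketch:

```python
# Sketch of blink counting via the eye aspect ratio (EAR) on six eye landmarks.
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye: (6, 2) landmark coordinates ordered around the eye contour."""
    v1 = np.linalg.norm(eye[1] - eye[5])   # vertical distances
    v2 = np.linalg.norm(eye[2] - eye[4])
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return (v1 + v2) / (2.0 * h)           # drops sharply when the eye closes

def count_blinks(ear_series, closed_thresh: float = 0.21) -> int:
    closed = [e < closed_thresh for e in ear_series]
    # count open -> closed transitions as blinks
    return sum(1 for a, b in zip(closed, closed[1:]) if b and not a)

blinks = count_blinks([0.3, 0.3, 0.18, 0.17, 0.3, 0.3, 0.19, 0.3])   # -> 2
```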
Collapse
Affiliation(s)
- Swadha Gupta
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, 147001 Punjab India
| | - Parteek Kumar
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, 147001 Punjab India
| | - Rajkumar Tekchandani
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, 147001 Punjab India
| |
Collapse
|
26
|
Zhang Z, Tian X, Zhang Y, Guo K, Xu X. Enhanced Discriminative Global-Local Feature Learning with Priority for Facial Expression Recognition. Inf Sci (N Y) 2023. [DOI: 10.1016/j.ins.2023.02.056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/17/2023]
|
27
|
Zhang X, Yan X. Predicting collision cases at unsignalized intersections using EEG metrics and driving simulator platform. ACCIDENT; ANALYSIS AND PREVENTION 2023; 180:106910. [PMID: 36525717 DOI: 10.1016/j.aap.2022.106910] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/09/2022] [Revised: 10/16/2022] [Accepted: 11/25/2022] [Indexed: 06/17/2023]
Abstract
Collisions at unsignalized intersections are among the most dangerous accidents worldwide. Identifying road hazards and predicting potential intersection collisions ahead are challenging problems in traffic safety. This paper studies the feasibility of EEG metrics for forecasting road hazards and presents an improved neural network model to predict intersection collisions based on EEG metrics and driving behavior. EEG metrics show significant differences between collision and non-collision cases, indicating that they can serve as effective indicators of collision probability. Drivers with higher relative power in the fast frequency bands (alpha and beta) and lower relative power in the slow frequency bands (delta and theta) are more likely to have conflicts. Predictions from three machine learning models (multi-layer perceptron (MLP), logistic regression (LR), and random forest (RF)) based on three input datasets (EEG metrics only, driving behavior only, and EEG metrics combined with driving behavior) are compared. The results show that for single-time-point prediction, the MLP model has the highest accuracy of the three, and the model based solely on EEG metrics is more accurate than those based on driving behavior or the combined dataset. For multi-time-point prediction, however, the accuracy of the MLP is only 73.9%, worse than LR and RF. We improved the MLP model by adding an attention mechanism layer and using a random forest model to select important features; as a consequence, the accuracy is greatly improved and reaches 88%. This study demonstrates the importance and feasibility of EEG signals for identifying unsafe drivers ahead of time. The improved neural network model can help reduce intersection accidents and improve traffic safety.
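The two reported improvements, random-forest feature selection feeding a neural classifier, can be sketched as follows; the synthetic data, feature count, and hyperparameters are placeholders rather than the study's settings, and the attention layer itself is omitted for brevity.

```python
# A rough sketch: random-forest feature selection followed by an MLP,
# on synthetic stand-in data (NOT the study's EEG recordings).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 32))    # 32 EEG-band + driving-behavior features
y = rng.integers(0, 2, size=400)  # collision vs. non-collision labels

# Step 1: rank features by random-forest importance, keep the top 10
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:10]

# Step 2: train the MLP on the reduced feature set
Xtr, Xte, ytr, yte = train_test_split(X[:, top], y, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(Xtr, ytr)
print("held-out accuracy:", mlp.score(Xte, yte))
```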
Collapse
Affiliation(s)
- Xinran Zhang
- China North Artificial Intelligence & Innovation Research Institute, Beijing 100072, China.
| | - Xuedong Yan
- School of Traffic and Transportation, Beijing Jiaotong University, Beijing 100044, China.
| |
Collapse
|
28
|
Mixing Global and Local Features for Long-Tailed Expression Recognition. INFORMATION 2023. [DOI: 10.3390/info14020083] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023] Open
Abstract
Large-scale facial expression datasets are primarily composed of real-world facial expressions. Expression occlusion and large-angle faces are two important problems affecting the accuracy of expression recognition. Moreover, because facial expression data in natural scenes commonly follow a long-tailed distribution, trained models tend to recognize the majority classes well while recognizing the minority classes with low accuracy. To improve the robustness and accuracy of expression recognition networks in uncontrolled environments, this paper proposes an efficient network structure based on an attention mechanism that fuses global and local features (AM-FGL). We use a channel-spatial model and local-feature convolutional neural networks to perceive the global and local features of the face, respectively. Because in-the-wild expression datasets commonly follow a long-tailed distribution in which neutral and happy expressions dominate the head classes, trained models exhibit low recognition accuracy for tail expressions such as fear and disgust. Building on CutMix, a novel data enhancement method proposed in other fields, a simple and effective data-balancing method is proposed (BC-EDB). The key idea is to paste key pixels (around the eyes, mouth, and nose), which reduces the influence of overfitting. Our proposed method focuses on the recognition of tail expressions, occluded expressions, and large-angle faces, and achieves state-of-the-art results on Occlusion-RAF-DB, 30° Pose-RAF-DB, and 45° Pose-RAF-DB with accuracies of 86.96%, 89.74%, and 88.53%, respectively.
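The BC-EDB balancing step can be illustrated with a small CutMix-style sketch: a key facial region from a tail-class sample is pasted onto another image and labels are mixed by area. The fixed region coordinates below are assumptions for an aligned 48x48 crop, not the paper's settings.

```python
# An illustrative CutMix-style paste of a key facial region from a
# tail-class sample; the eye-region box is an assumption.
import numpy as np

def paste_key_region(base_img, tail_img, box=(10, 22, 14, 34)):
    """Copy the pixels inside `box` (y0, y1, x0, x1) from the tail-class
    image onto the base image, returning the mixed sample and the area
    ratio that CutMix-style label mixing would use."""
    y0, y1, x0, x1 = box
    mixed = base_img.copy()
    mixed[y0:y1, x0:x1] = tail_img[y0:y1, x0:x1]
    lam = 1.0 - ((y1 - y0) * (x1 - x0)) / float(base_img.shape[0] * base_img.shape[1])
    return mixed, lam  # label = lam * base_label + (1 - lam) * tail_label

base = np.zeros((48, 48), dtype=np.uint8)
tail = np.full((48, 48), 255, dtype=np.uint8)
mixed, lam = paste_key_region(base, tail)
print(lam)  # fraction of the base image retained (~0.896 here)
```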
Collapse
|
29
|
Eyiokur FI, Kantarcı A, Erakın ME, Damer N, Ofli F, Imran M, Križaj J, Salah AA, Waibel A, Štruc V, Ekenel HK. A survey on computer vision based human analysis in the COVID-19 era. IMAGE AND VISION COMPUTING 2023; 130:104610. [PMID: 36540857 PMCID: PMC9755265 DOI: 10.1016/j.imavis.2022.104610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/07/2022] [Accepted: 12/11/2022] [Indexed: 06/17/2023]
Abstract
The emergence of COVID-19 has had a global and profound impact, not only on society as a whole, but also on the lives of individuals. Various prevention measures were introduced around the world to limit the transmission of the disease, including face masks, mandates for social distancing and regular disinfection in public spaces, and the use of screening applications. These developments also triggered the need for novel and improved computer vision techniques capable of (i) providing support to the prevention measures through an automated analysis of visual data, on the one hand, and (ii) facilitating normal operation of existing vision-based services, such as biometric authentication schemes, on the other. Especially important here are computer vision techniques that focus on the analysis of people and faces in visual data and have been affected the most by the partial occlusions introduced by the mandates for facial masks. Such computer vision based human analysis techniques include face and face-mask detection approaches, face recognition techniques, crowd counting solutions, age and expression estimation procedures, models for detecting face-hand interactions and many others, and have seen considerable attention over recent years. The goal of this survey is to provide an introduction to the problems induced by COVID-19 into such research and to present a comprehensive review of the work done in the computer vision based human analysis field. Particular attention is paid to the impact of facial masks on the performance of various methods and recent solutions to mitigate this problem. Additionally, a detailed review of existing datasets useful for the development and evaluation of methods for COVID-19 related applications is also provided. Finally, to help advance the field further, a discussion on the main open challenges and future research directions is given at the end of the survey. This work is intended to have a broad appeal and be useful not only for computer vision researchers but also for the general public.
Collapse
Affiliation(s)
- Fevziye Irem Eyiokur
- Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Alperen Kantarcı
- Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey
| | - Mustafa Ekrem Erakın
- Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey
| | - Naser Damer
- Fraunhofer Institute for Computer Graphics Research IGD, Darmstadt, Germany
- Department of Computer Science, TU Darmstadt, Darmstadt, Germany
| | - Ferda Ofli
- Qatar Computing Research Institute, HBKU, Doha, Qatar
| | | | - Janez Križaj
- Faculty of Electrical Engineering, University of Ljubljana, Tržaška cesta 25, 1000 Ljubljana, Slovenia
| | - Albert Ali Salah
- Department of Information and Computing Sciences, Utrecht University, Utrecht, The Netherlands
- Department of Computer Engineering, Boğaziçi University, Istanbul, Turkey
| | - Alexander Waibel
- Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Carnegie Mellon University, Pittsburgh, United States
| | - Vitomir Štruc
- Faculty of Electrical Engineering, University of Ljubljana, Tržaška cesta 25, 1000 Ljubljana, Slovenia
| | - Hazım Kemal Ekenel
- Department of Computer Engineering, Istanbul Technical University, Istanbul, Turkey
| |
Collapse
|
30
|
Fang J, Lin X, Liu W, An Y, Sun H. Triple attention feature enhanced pyramid network for facial expression recognition. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2023. [DOI: 10.3233/jifs-222252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023]
Abstract
The purpose of facial expression recognition is to capture facial expression features from static pictures or videos and to provide the most intuitive information about changes in human emotion for artificial intelligence devices to use effectively in human-computer interaction. The main current challenges are the excessive loss of locally valid information and the irreversible degradation of information at different expression semantic scales as network depth increases. To address these problems, an enhanced pyramidal network model combining triple attention mechanisms is designed in this paper. First, three attention mechanism modules, i.e., CBAM, SK, and SE, are embedded into the backbone network in stages, and key features are sensed by mining spatial or channel information, which effectively reduces the information loss caused by network depth. Then, a pyramid network is used as an extension of the backbone to obtain semantic information about expression features across scales. The recognition accuracy reaches 96.25% and 73.61% on the CK+ and FER2013 expression datasets, respectively. Furthermore, comparison with other current advanced methods shows that the proposed architecture, combining the triple attention mechanism with multi-scale cross-information fusion, can simultaneously maintain and improve the information-mining ability and recognition accuracy of the facial expression recognition model.
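Of the three attention modules named above, the squeeze-and-excitation (SE) block is the simplest to sketch; a minimal PyTorch version follows, with channel count and reduction ratio chosen for illustration rather than taken from the paper.

```python
# A compact sketch of an SE channel-attention block, one of the three
# attention modules named above; dimensions are illustrative.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global context
        self.fc = nn.Sequential(             # excitation: channel gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # reweight channels

x = torch.randn(2, 64, 12, 12)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 12, 12])
```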
Collapse
Affiliation(s)
- Jian Fang
- School of Mechanical and Electrical Engineering, Changchun University of Technology, Changchun, China
- Jilin Communications Polytechnic, Changchun, China
| | - Xiaomei Lin
- School of Electronics and Electrical Engineering, Changchun University of Technology, Changchun, China
| | - Weida Liu
- School of Electrical and Information Engineering, Jilin Engineering Normal University, Changchun, China
| | - Yi An
- School of Electrical and Information Engineering, Jilin Engineering Normal University, Changchun, China
| | - Haoran Sun
- Jilin Communications Polytechnic, Changchun, China
| |
Collapse
|
31
|
Mukhiddinov M, Djuraev O, Akhmedov F, Mukhamadiyev A, Cho J. Masked Face Emotion Recognition Based on Facial Landmarks and Deep Learning Approaches for Visually Impaired People. SENSORS (BASEL, SWITZERLAND) 2023; 23:1080. [PMID: 36772117 PMCID: PMC9921901 DOI: 10.3390/s23031080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/30/2022] [Revised: 01/10/2023] [Accepted: 01/15/2023] [Indexed: 06/18/2023]
Abstract
Current artificial intelligence systems for determining a person's emotions rely heavily on lip and mouth movement and on other facial features such as the eyebrows, eyes, and forehead. Furthermore, low-light images are typically classified incorrectly because of the dark region around the eyes and eyebrows. In this work, we propose a facial emotion recognition method for masked facial images that uses low-light image enhancement and feature analysis of the upper face with a convolutional neural network. The proposed approach employs the AffectNet image dataset, which includes eight types of facial expressions and 420,299 images. Initially, the lower part of the input facial image is covered by a synthetic mask. Boundary and regional representation methods are used to indicate the head and the upper features of the face. Secondly, we adopt a feature extraction strategy based on facial landmark detection using the features of the partially covered masked face. Finally, the extracted features, the coordinates of the identified landmarks, and histograms of oriented gradients are incorporated into the classification procedure using a convolutional neural network. An experimental evaluation shows that the proposed method surpasses others, achieving an accuracy of 69.3% on the AffectNet dataset.
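The feature-assembly step described here, landmark coordinates concatenated with histogram-of-oriented-gradients descriptors, might look roughly like the sketch below; the random "landmarks" stand in for a real detector's output, and the HOG parameters are assumptions.

```python
# A hedged sketch of assembling landmark + HOG features for the visible
# upper face; the landmark array is a placeholder for a real detector.
import numpy as np
from skimage.feature import hog

upper_face = np.random.rand(24, 48)  # top half of a 48x48 face crop

# HOG descriptor of the visible (upper) region
hog_vec = hog(upper_face, orientations=8, pixels_per_cell=(8, 8),
              cells_per_block=(2, 2))

# placeholder for eye/eyebrow landmark (x, y) pairs from a landmark model
landmarks = np.random.rand(20, 2).ravel()

# concatenated vector fed to the downstream classifier
features = np.concatenate([landmarks, hog_vec])
print(features.shape)
```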
Collapse
Affiliation(s)
- Mukhriddin Mukhiddinov
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| | - Oybek Djuraev
- Department of Hardware and Software of Control Systems in Telecommunication, Tashkent University of Information Technologies Named after Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan
| | - Farkhod Akhmedov
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| | - Abdinabi Mukhamadiyev
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| | - Jinsoo Cho
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| |
Collapse
|
32
|
Zhou S, Wu X, Jiang F, Huang Q, Huang C. Emotion Recognition from Large-Scale Video Clips with Cross-Attention and Hybrid Feature Weighting Neural Networks. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2023; 20:1400. [PMID: 36674161 PMCID: PMC9859118 DOI: 10.3390/ijerph20021400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Revised: 01/06/2023] [Accepted: 01/07/2023] [Indexed: 06/17/2023]
Abstract
Human emotion is an important indicator of mental state, e.g., satisfaction or stress, and recognizing emotion from different media is essential for sequence analysis and for applications such as mental health assessment, job stress estimation, and tourist satisfaction assessment. Emotion recognition based on computer vision techniques, an important method for detecting emotion from visual media (e.g., images or videos) of human behavior with plentiful emotional cues, has been extensively investigated because of its significant applications. However, most existing models neglect inter-feature interaction and use simple concatenation for feature fusion, failing to capture the crucial complementary gains between face and context information in video clips, which is significant for addressing emotion confusion and emotion misunderstanding. Accordingly, to fully exploit the complementary information between face and context features, we present a novel cross-attention and hybrid feature weighting network for accurate emotion recognition from large-scale video clips; the proposed model consists of a dual-branch encoding (DBE) network, a hierarchical-attention encoding (HAE) network, and a deep fusion (DF) block. Specifically, the face and context encoding blocks in the DBE network generate the respective shallow features. The HAE network then uses the cross-attention (CA) block to capture the complementarity between facial expression features and their contexts via a cross-channel attention operation. An element recalibration (ER) block is introduced to revise the feature map of each channel by embedding global information. Moreover, the adaptive-attention (AA) block in the HAE network infers optimal feature fusion weights and obtains adaptive emotion features via a hybrid feature weighting operation. Finally, the DF block integrates these adaptive emotion features to predict an individual's emotional state. Extensive experimental results on the CAER-S dataset demonstrate the effectiveness of our method, showing its potential for analyzing tourist reviews with video clips, estimating job stress levels from visual emotional evidence, and assessing mental health with visual media.
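The cross-attention (CA) idea, face tokens attending to context tokens, can be sketched with PyTorch's stock multi-head attention; the dimensions and token counts below are illustrative, not the paper's.

```python
# A minimal sketch of cross-attention between face and context branches,
# using PyTorch's built-in multi-head attention; sizes are illustrative.
import torch
import torch.nn as nn

dim, heads = 128, 4
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

face = torch.randn(2, 49, dim)      # 7x7 face feature map as 49 tokens
context = torch.randn(2, 196, dim)  # 14x14 scene-context tokens

# queries come from the face branch, keys/values from the context branch
fused, _ = cross_attn(query=face, key=context, value=context)
print(fused.shape)  # torch.Size([2, 49, 128])
```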
Collapse
Affiliation(s)
| | | | | | - Qionghao Huang
- Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Jinhua 321004, China
| | | |
Collapse
|
33
|
CLC-Net: Contextual and Local Collaborative Network for Lesion Segmentation in Diabetic Retinopathy Images. Neurocomputing 2023. [DOI: 10.1016/j.neucom.2023.01.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
|
34
|
Zhong J, Chen T, Yi L. Face expression recognition based on NGO-BILSTM model. Front Neurorobot 2023; 17:1155038. [PMID: 37025255 PMCID: PMC10072256 DOI: 10.3389/fnbot.2023.1155038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Accepted: 03/03/2023] [Indexed: 04/08/2023] Open
Abstract
Introduction: Facial expression recognition has always been a hot topic in computer vision and artificial intelligence. In recent years, deep learning models have achieved good results in accurately recognizing facial expressions; the BiLSTM network is one such model. However, the BiLSTM network's performance depends largely on its hyperparameters, which makes their optimization a challenge. Methods: In this paper, a Northern Goshawk optimization (NGO) algorithm is proposed to optimize the hyperparameters of the BiLSTM network for facial expression recognition. The proposed methods were evaluated and compared with other methods on the FER2013, FERPlus, and RAF-DB datasets, taking into account factors such as cultural background, race, and gender. Results: The results show that the recognition accuracy of the model on the FER2013 and FERPlus datasets is much higher than that of the traditional VGG16 network. The recognition accuracy is 89.72% on the RAF-DB dataset, which is 5.45%, 9.63%, 7.36%, and 3.18% higher than that of the facial expression recognition algorithms DLP-CNN, gACNN, pACNN, and LDL-ALSG proposed in the last two years, respectively. Discussion: In conclusion, the NGO algorithm effectively optimized the hyperparameters of the BiLSTM network, improved facial expression recognition performance, and provides a new method for hyperparameter optimization of BiLSTM networks for facial expression recognition.
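To make the search concrete, here is a heavily simplified stand-in for the hyperparameter optimization loop; the Northern Goshawk update rules are replaced by plain random search, and the searched BiLSTM settings and the toy fitness function are assumptions, not the paper's protocol.

```python
# A simplified stand-in for population-based BiLSTM hyperparameter search;
# NGO's exploration/exploitation updates are replaced by random search.
import random
import torch
import torch.nn as nn

def build_bilstm(hidden, layers, dropout):
    return nn.LSTM(input_size=64, hidden_size=hidden, num_layers=layers,
                   dropout=dropout if layers > 1 else 0.0,
                   bidirectional=True, batch_first=True)

def fitness(hidden, layers, dropout):
    # stand-in for validation accuracy after a short training run
    model = build_bilstm(hidden, layers, dropout)
    out, _ = model(torch.randn(4, 10, 64))
    return -out.var().item()  # placeholder score, NOT real accuracy

best, best_score = None, float("-inf")
for _ in range(10):  # one candidate per "goshawk" per iteration
    cand = (random.choice([64, 128, 256]),
            random.choice([1, 2]),
            random.uniform(0.0, 0.5))
    score = fitness(*cand)
    if score > best_score:
        best, best_score = cand, score
print("best (hidden, layers, dropout):", best)
```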
Collapse
|
35
|
Gao H, Wu M, Chen Z, Li Y, Wang X, An S, Li J, Liu C. SSA-ICL: Multi-domain adaptive attention with intra-dataset continual learning for Facial expression recognition. Neural Netw 2023; 158:228-238. [PMID: 36473290 DOI: 10.1016/j.neunet.2022.11.025] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Revised: 09/26/2022] [Accepted: 11/15/2022] [Indexed: 11/27/2022]
Abstract
Facial expression recognition (FER) is a kind of affective computing that identifies the emotional state represented in facial photographs. Various methods have been developed for this critical task. In spite of this progress, three significant obstacles are not well addressed: the interaction between spatial action units, the inadequacy of semantic information about spectral expressions, and the unbalanced data distribution. In this work, we propose SSA-ICL, a novel approach for FER that solves these three difficulties within a coherent framework. To address the first two challenges, we develop a Spectral and Spatial Attention (SSA) module that integrates spectral semantics with spatial locations to improve model performance. We provide an Intra-dataset Continual Learning (ICL) module to combat the long-tail distribution of FER datasets. By subdividing a single long-tail dataset into multiple sub-datasets, ICL repeatedly trains well-balanced representations on each subset and finally develops an independent classifier. We performed extensive experiments on two publicly available datasets, AffectNet and RAF-DB. In comparison to existing attention modules, our SSA achieves an accuracy improvement of 3.8%∼6.7% in testing. Meanwhile, the proposed SSA-ICL achieves performance superior or comparable to state-of-the-art FER methods (65.78% on AffectNet and 89.44% on RAF-DB).
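The ICL partitioning idea, splitting a long-tailed dataset into better-balanced sub-datasets, can be sketched as follows; the chunking rule below is an assumption, and the paper's exact partitioning scheme may differ.

```python
# An illustrative split of a long-tailed label set into sub-datasets whose
# per-class counts are capped, so each subset is better balanced.
from collections import defaultdict

def balanced_subsets(labels, cap=100):
    """Group sample indices so each subset holds at most `cap` samples per
    class; head classes spill into later subsets, tail classes appear early."""
    per_class = defaultdict(list)
    for idx, y in enumerate(labels):
        per_class[y].append(idx)

    subsets = []
    round_i = 0
    while any(len(v) > round_i * cap for v in per_class.values()):
        chunk = []
        for idxs in per_class.values():
            chunk.extend(idxs[round_i * cap:(round_i + 1) * cap])
        subsets.append(chunk)
        round_i += 1
    return subsets

labels = [0] * 500 + [1] * 120 + [2] * 30  # head, middle, tail classes
parts = balanced_subsets(labels)
print([len(p) for p in parts])  # [230, 120, 100, 100, 100]
```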
Collapse
Affiliation(s)
- Hongxiang Gao
- State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China; Institute for Infocomm Research, A*STAR, Singapore, 138632, Singapore
| | - Min Wu
- Institute for Infocomm Research, A*STAR, Singapore, 138632, Singapore.
| | - Zhenghua Chen
- Institute for Infocomm Research, A*STAR, Singapore, 138632, Singapore
| | - Yuwen Li
- State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China
| | - Xingyao Wang
- State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China
| | - Shan An
- State Key Lab of Software Development Environment, Beihang University, JD Health International Inc., Beijing, 100191, China
| | - Jianqing Li
- State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China; School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, 211166, China
| | - Chengyu Liu
- State Key Laboratory of Bioelectronics, School of Instrument Science and Engineering, Southeast University, Nanjing, 210096, China.
| |
Collapse
|
36
|
Facial Expression Recognition Based on Dual-Channel Fusion with Edge Features. Symmetry (Basel) 2022. [DOI: 10.3390/sym14122651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/23/2022] Open
Abstract
In the era of artificial intelligence, emotion recognition is key work in human–computer interaction. Expressions contain plentiful information about human emotion. We found that the Canny edge detector can significantly improve facial expression recognition performance. A Canny-edge-detector-based dual-channel network using the OI-Network and EI-Net is proposed, which adds no redundant network layers or additional training. We discuss the fusion parameters α and β through ablation experiments. The method was verified on the CK+, FER2013, and RAF-DB datasets and achieved good results.
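A hedged sketch of the dual-channel fusion follows: one branch sees the original image (OI), the other its Canny edge map (EI), and their outputs are blended with weights α and β. The toy classifiers and the weight values are assumptions; only the α/β fusion step follows the abstract.

```python
# An illustrative dual-channel fusion: original image vs. Canny edge map,
# with alpha/beta blending of the two branches' outputs.
import cv2
import numpy as np

img = (np.random.rand(48, 48) * 255).astype(np.uint8)   # grayscale face
edges = cv2.Canny(img, threshold1=100, threshold2=200)  # EI-Net input

def toy_logits(x, seed):
    # placeholder for a branch network's class scores (7 expressions)
    rng = np.random.default_rng(seed + int(x.sum()) % 1000)
    return rng.normal(size=7)

alpha, beta = 0.6, 0.4  # fusion weights, chosen by ablation in the paper
fused = alpha * toy_logits(img, 0) + beta * toy_logits(edges, 1)
print("predicted class:", int(fused.argmax()))
```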
Collapse
|
37
|
Zhu Y, Wei L, Lang C, Li S, Feng S, Li Y. Fine-grained facial expression recognition via relational reasoning and hierarchical relation optimization. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.10.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
|
38
|
Yang B, Wu J, Ikeda K, Hattori G, Sugano M, Iwasawa Y, Matsuo Y. Face-mask-aware Facial Expression Recognition based on Face Parsing and Vision Transformer. Pattern Recognit Lett 2022; 164:173-182. [PMID: 36407855 PMCID: PMC9645067 DOI: 10.1016/j.patrec.2022.11.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2022] [Revised: 10/05/2022] [Accepted: 11/04/2022] [Indexed: 11/11/2022]
Abstract
As wearing face masks has become an everyday practice due to the COVID-19 pandemic, facial expression recognition (FER) that takes face masks into account is now a problem that needs to be solved. In this paper, we propose a face parsing and vision Transformer-based method to improve the accuracy of face-mask-aware FER. First, to more precisely distinguish the unobstructed facial region from the parts of the face covered by a mask, we re-train a face-mask-aware face parsing model on an existing face parsing dataset automatically relabeled with face-mask pixel labels. Second, we propose a vision Transformer with a cross-attention mechanism as the FER classifier, capable of taking both occluded and non-occluded facial regions into account and reweighting these two parts automatically to obtain the best recognition performance. The proposed method outperforms existing state-of-the-art face-mask-aware FER methods, as well as other occlusion-aware FER methods, on two datasets that contain three kinds of emotions (M-LFW-FER and M-KDDI-FER) and two datasets that contain seven kinds of emotions (M-FER-2013 and M-CK+).
Collapse
Affiliation(s)
- Bo Yang
- KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan
- The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8654 Japan
| | - Jianming Wu
- KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan
| | - Kazushi Ikeda
- KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan
| | - Gen Hattori
- KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan
| | - Masaru Sugano
- KDDI Research, Inc., 2-1-15 Ohara, Fujimino-shi, Saitama, 356-8502, Japan
| | - Yusuke Iwasawa
- The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8654 Japan
| | - Yutaka Matsuo
- The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8654 Japan
| |
Collapse
|
39
|
Zhang Z, Sun X, Li J, Wang M. MAN: Mining Ambiguity and Noise for Facial Expression Recognition in the Wild. Pattern Recognit Lett 2022. [DOI: 10.1016/j.patrec.2022.10.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
|
40
|
Liu P, Lin Y, Meng Z, Lu L, Deng W, Zhou JT, Yang Y. Point Adversarial Self-Mining: A Simple Method for Facial Expression Recognition. IEEE TRANSACTIONS ON CYBERNETICS 2022; 52:12649-12660. [PMID: 34197333 DOI: 10.1109/tcyb.2021.3085744] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
In this article, we propose a simple yet effective approach, called point adversarial self-mining (PASM), to improve recognition accuracy in facial expression recognition (FER). Unlike previous works that focus on designing specific architectures or loss functions to solve this problem, PASM boosts network capability by simulating human learning processes: providing updated learning materials and guidance from more capable teachers. Specifically, to generate new learning materials, PASM leverages a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task, generating harder learning samples to refine the network. The searched position is highly adaptive since it considers both the statistical information of each sample and the capability of the teacher network. Besides receiving new learning materials, the student network also receives guidance from the teacher network. After the student network finishes training, it changes roles and acts as a teacher, generating new learning materials and providing stronger guidance to train a better student network. The adaptive generation of learning materials and the teacher/student update can be conducted more than once, improving network capability iteratively. Extensive experimental results validate the efficacy of our method over the existing state of the art for FER.
Collapse
|
41
|
Su C, Wei J, Lin D, Kong L. Using attention LSGB network for facial expression recognition. Pattern Anal Appl 2022. [DOI: 10.1007/s10044-022-01124-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
|
42
|
Huo H, Yu Y, Liu Z. Facial expression recognition based on improved depthwise separable convolutional network. MULTIMEDIA TOOLS AND APPLICATIONS 2022; 82:18635-18652. [PMID: 36467439 PMCID: PMC9686458 DOI: 10.1007/s11042-022-14066-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/03/2022] [Revised: 08/29/2022] [Accepted: 10/10/2022] [Indexed: 06/17/2023]
Abstract
A single network model cannot extract sufficiently complex and rich effective features. Meanwhile, such network structures are usually huge, with many parameters and large space consumption. Therefore, combining multiple network models to extract complementary features has attracted extensive attention. To solve the problems in the prior art, namely that network models cannot extract high spatial-depth features and suffer from redundant structural parameters and weak generalization ability, this paper builds a neural network from two components, the Xception module and the inverted residual structure. On this basis, a facial expression recognition method based on an improved depthwise separable convolutional network is proposed. Firstly, Gaussian filtering is performed with the Canny operator to remove noise, and the result is combined with two original pixel feature maps to form a three-channel image. Secondly, the inverted residual structure of the MobileNetV2 model is introduced into the network. Finally, the extracted features are classified by a Softmax classifier, and the entire network uses ReLU6 as the nonlinear activation function. The experimental results show a recognition rate of 70.76% on the FER2013 dataset (Facial Expression Recognition 2013) and 97.92% on the CK+ dataset (Extended Cohn-Kanade). This method not only effectively mines deeper and more abstract image features but also prevents network over-fitting and improves generalization ability.
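The inverted residual building block borrowed from MobileNetV2, with ReLU6 and a depthwise convolution, is easy to sketch in PyTorch; channel sizes and the expansion factor below are illustrative, not the paper's architecture.

```python
# A compact sketch of a MobileNetV2-style inverted residual built from a
# depthwise separable convolution and ReLU6; sizes are illustrative.
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c, expand=4):
        super().__init__()
        hidden = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, hidden, 1, bias=False),     # expand (pointwise)
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1,  # depthwise conv
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c, 1, bias=False),     # project (linear)
            nn.BatchNorm2d(c),
        )

    def forward(self, x):
        return x + self.block(x)  # residual over the inverted bottleneck

x = torch.randn(1, 32, 24, 24)
print(InvertedResidual(32)(x).shape)  # torch.Size([1, 32, 24, 24])
```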
Collapse
Affiliation(s)
- Hua Huo
- Engineering Technology Research Center of Big Data and Computational Intelligence, Henan University of Science and Technology, Kaiyuan Avenue, Luoyang, 471003 Henan China
| | - YaLi Yu
- Engineering Technology Research Center of Big Data and Computational Intelligence, Henan University of Science and Technology, Kaiyuan Avenue, Luoyang, 471003 Henan China
| | - ZhongHua Liu
- Information Engineering College, Henan University of Science and Technology, Kaiyuan Avenue, Luoyang, 471003 Henan China
| |
Collapse
|
43
|
Gong W, Qian Y, Fan Y. MPCSAN: multi-head parallel channel-spatial attention network for facial expression recognition in the wild. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-08040-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
|
44
|
Patch Attention Convolutional Vision Transformer for Facial Expression Recognition with Occlusion. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.11.068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|
45
|
Zhou J, Wang Y, Zhang C, Wu W, Ji Y, Zou Y. Eyebirds: Enabling the Public to Recognize Water Birds at Hand. Animals (Basel) 2022; 12:3000. [PMID: 36359124 PMCID: PMC9658372 DOI: 10.3390/ani12213000] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2022] [Revised: 10/25/2022] [Accepted: 10/26/2022] [Indexed: 09/29/2023] Open
Abstract
Enabling the public to easily recognize water birds has a positive effect on wetland bird conservation. However, classifying water birds requires advanced ornithological knowledge, which makes it very difficult for the public to recognize water bird species in daily life. To break the knowledge barrier of water bird recognition for the public, we construct a water bird recognition system (Eyebirds) using deep learning, implemented as a smartphone app. Eyebirds consists of three main modules: (1) a water bird image dataset; (2) an attention mechanism-based deep convolutional neural network for water bird recognition (AM-CNN); and (3) an app for smartphone users. The water bird image dataset currently covers 48 families, 203 genera and 548 species of water birds worldwide and is used to train our recognition model. The AM-CNN model employs an attention mechanism to enhance the shallow features of bird images and boost classification performance. Experimental results on the North American bird dataset (CUB200-2011) show that the AM-CNN model achieves an average classification accuracy of 85%. On our self-built water bird image dataset, the AM-CNN model also works well, with classification accuracies of 94.0%, 93.6% and 86.4% at the family, genus and species levels, respectively. The user-side app is a WeChat applet deployed on smartphones. With the app, users can easily recognize water birds on expeditions, while camping or sightseeing, or even in daily life. In summary, our system can bring not only fun but also water bird knowledge to the public, inspiring their interest and further promoting their participation in bird ecological conservation.
Collapse
Affiliation(s)
- Jiaogen Zhou
- Jiangsu Provincial Engineering Research Center for Intelligent Monitoring and Ecological Management of Pond and Reservoir Water Environment, Huaiyin Normal University, Huaian 223300, China
| | - Yang Wang
- Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
| | - Caiyun Zhang
- Jiangsu Provincial Engineering Research Center for Intelligent Monitoring and Ecological Management of Pond and Reservoir Water Environment, Huaiyin Normal University, Huaian 223300, China
| | - Wenbo Wu
- Research Center of Information Technology, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
| | - Yanzhu Ji
- Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China
| | - Yeai Zou
- Dongting Lake Station for Wetland Ecosystem Research, Institute of Subtropical Agriculture, Chinese Academy of Sciences, Changsha 410125, China
| |
Collapse
|
46
|
Xu X, Zong Y, Lu C, Jiang X. Enhanced Sample Self-Revised Network for Cross-Dataset Facial Expression Recognition. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1475. [PMID: 37420495 DOI: 10.3390/e24101475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/07/2022] [Revised: 10/04/2022] [Accepted: 10/10/2022] [Indexed: 07/09/2023]
Abstract
Recently, cross-dataset facial expression recognition (FER) has attracted wide attention from researchers. Thanks to the emergence of large-scale facial expression datasets, cross-dataset FER has made great progress. Nevertheless, facial images in large-scale datasets can suffer from low quality, subjective annotation, severe occlusion, and rare subject identity, which leads to outlier samples in facial expression datasets. These outlier samples are usually far from the clustering center of the dataset in feature space, resulting in considerable differences in feature distribution that severely restrict the performance of most cross-dataset facial expression recognition methods. To eliminate the influence of outlier samples on cross-dataset FER, we propose the enhanced sample self-revised network (ESSRN) with a novel outlier-handling mechanism, which first seeks out outlier samples and then suppresses them when dealing with cross-dataset FER. To evaluate the proposed ESSRN, we conduct extensive cross-dataset experiments across the RAF-DB, JAFFE, CK+, and FER2013 datasets. Experimental results demonstrate that the proposed outlier-handling mechanism effectively reduces the negative impact of outlier samples on cross-dataset FER and that our ESSRN outperforms classic deep unsupervised domain adaptation (UDA) methods and recent state-of-the-art cross-dataset FER results.
Collapse
Affiliation(s)
- Xiaolin Xu
- Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China
- School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Yuan Zong
- Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China
- School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| | - Cheng Lu
- Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China
| | - Xingxun Jiang
- Key Laboratory of Child Development and Learning Science of Ministry of Education, Southeast University, Nanjing 210096, China
- School of Biological Science and Medical Engineering, Southeast University, Nanjing 210096, China
| |
Collapse
|
47
|
CNN-LSTM Facial Expression Recognition Method Fused with Two-Layer Attention Mechanism. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:7450637. [DOI: 10.1155/2022/7450637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Accepted: 09/29/2022] [Indexed: 11/17/2022]
Abstract
In exploring facial expression recognition methods, we found that existing algorithms make insufficient use of information about the key facial regions that express emotion. To address this problem, on the basis of a convolutional neural network and long short-term memory (CNN-LSTM), we propose a facial expression recognition method that incorporates an attention mechanism (CNN-ALSTM). Compared with the general CNN-LSTM algorithm, it mines the information of important regions more effectively. Furthermore, a CNN-LSTM facial expression recognition method incorporating a two-layer attention mechanism (ACNN-ALSTM) is proposed. We conducted comparative experiments on the FER2013 and processed CK+ datasets with the CNN-ALSTM, ACNN-ALSTM, patch-based ACNN (pACNN), facial expression recognition with attention net (FERAtt), and other networks. The results show that the proposed ACNN-ALSTM hybrid neural network model is superior to related work in expression recognition.
Collapse
|
48
|
Kuruvayil S, Palaniswamy S. Emotion recognition from facial images with simultaneous occlusion, pose and illumination variations using meta-learning. JOURNAL OF KING SAUD UNIVERSITY - COMPUTER AND INFORMATION SCIENCES 2022. [DOI: 10.1016/j.jksuci.2021.06.012] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
|
49
|
Gupta S, Kumar P, Tekchandani RK. Facial emotion recognition based real-time learner engagement detection system in online learning context using deep learning models. MULTIMEDIA TOOLS AND APPLICATIONS 2022; 82:11365-11394. [PMID: 36105662 PMCID: PMC9461440 DOI: 10.1007/s11042-022-13558-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/02/2021] [Revised: 05/14/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]
Abstract
The dramatic impact of the COVID-19 pandemic has resulted in the closure of physical classrooms, with teaching shifted to the online medium. To make the online learning environment as interactive as traditional offline classrooms, it is essential to ensure proper student engagement during online learning sessions. This paper proposes a deep learning-based approach using facial emotions to detect the real-time engagement of online learners. This is done by analysing students' facial expressions to classify their emotions throughout the online learning session. The facial emotion recognition information is used to calculate an engagement index (EI) that predicts two engagement states, "Engaged" and "Disengaged". Different deep learning models, Inception-V3, VGG19 and ResNet-50, are evaluated and compared to find the best predictive classification model for real-time engagement detection. Varied benchmark datasets such as FER-2013, CK+ and RAF-DB are used to gauge the overall performance and accuracy of the proposed system. Experimental results showed that the proposed system achieves accuracies of 89.11%, 90.14% and 92.32% for Inception-V3, VGG19 and ResNet-50, respectively, on the benchmark datasets and our own created dataset. ResNet-50 outperforms all others with an accuracy of 92.32% for facial emotion classification in real-time learning scenarios.
Collapse
Affiliation(s)
- Swadha Gupta
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India
| | - Parteek Kumar
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India
| | - Raj Kumar Tekchandani
- Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India
| |
Collapse
|
50
|
Cai Q, An JP, Li HY, Guo JY, Gao ZK. Cross-subject emotion recognition using visibility graph and genetic algorithm-based convolution neural network. CHAOS (WOODBURY, N.Y.) 2022; 32:093110. [PMID: 36182360 DOI: 10.1063/5.0098454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/09/2022] [Accepted: 08/01/2022] [Indexed: 06/16/2023]
Abstract
An efficient emotion recognition model is an important research branch in electroencephalogram (EEG)-based brain-computer interfaces. However, the input of the emotion recognition model is often a whole set of EEG channels obtained by electrodes placed on subjects. The unnecessary information produced by redundant channels affects the recognition rate and depletes computing resources, thereby hindering the practical applications of emotion recognition. In this work, we aim to optimize the input of EEG channels using a visibility graph (VG) and genetic algorithm-based convolutional neural network (GA-CNN). First, we design an experiment to evoke three types of emotion states using movies and collect the multi-channel EEG signals of each subject under different emotion states. Then, we construct VGs for each EEG channel and derive nonlinear features representing each EEG channel. We employ the genetic algorithm (GA) to find the optimal subset of EEG channels for emotion recognition and use the recognition results of the CNN as fitness values. The experimental results show that the recognition performance of the proposed method using a subset of EEG channels is superior to that of the CNN using all channels for each subject. Last, based on the subset of EEG channels searched by the GA-CNN, we perform cross-subject emotion recognition tasks employing leave-one-subject-out cross-validation. These results demonstrate the effectiveness of the proposed method in recognizing emotion states using fewer EEG channels and further enrich the methods of EEG classification using nonlinear features.
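The GA-driven channel selection can be sketched as a small evolutionary loop over binary channel masks; the population size, rates, and especially the stand-in fitness function (which replaces the CNN's validation accuracy) are assumptions for illustration.

```python
# A simplified sketch of GA-based EEG channel selection: binary masks
# evolve by crossover/mutation; a toy fitness replaces CNN accuracy.
import random

N_CHANNELS, POP, GENS = 32, 12, 20

def fitness(mask):
    # placeholder: real use would train/evaluate the CNN on these channels;
    # here we just prefer small subsets that keep "informative" channels
    informative = set(range(0, 8))
    hits = sum(1 for i, b in enumerate(mask) if b and i in informative)
    return hits - 0.05 * sum(mask)

def crossover(a, b):
    cut = random.randrange(1, N_CHANNELS)
    return a[:cut] + b[cut:]

def mutate(m, rate=0.05):
    return [1 - g if random.random() < rate else g for g in m]

pop = [[random.randint(0, 1) for _ in range(N_CHANNELS)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:POP // 2]                 # keep the fitter half
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print("selected channels:", [i for i, b in enumerate(best) if b])
```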
Collapse
Affiliation(s)
- Qing Cai
- School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
| | - Jian-Peng An
- School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
| | - Hao-Yu Li
- School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
| | - Jia-Yi Guo
- School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
| | - Zhong-Ke Gao
- School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
| |
Collapse
|