1
Zhou S, Xu H, Bai Z, Du Z, Zeng J, Wang Y, Wang Y, Li S, Wang M, Li Y, Li J, Xu J. A multidimensional feature fusion network based on MGSE and TAAC for video-based human action recognition. Neural Netw 2023;168:496-507. PMID: 37827068. DOI: 10.1016/j.neunet.2023.09.031.
Abstract
With the maturity of intelligent technologies such as human-computer interaction, human action recognition (HAR) has been widely applied in virtual reality, video surveillance, and other fields. However, current video-based HAR methods still cannot fully extract abstract action features, and action collection and recognition remain scarce for special populations such as prisoners and elderly people living alone. To address these problems, this paper proposes a multidimensional feature fusion network, called P-MTSC3D, a parallel network based on context modeling and temporal adaptive attention modules. It consists of three branches. The first branch serves as the basic network branch and extracts basic feature information. The second branch consists of a feature pre-extraction layer and two multiscale-convolution-based global context modeling combined squeeze and excitation (MGSE) modules, which extract spatial and channel features. The third branch consists of two temporal adaptive attention units based on convolution (TAAC) that extract temporal features. To verify the validity of the proposed network, experiments are conducted on the University of Central Florida (UCF) 101 dataset and the human motion database (HMDB) 51 dataset. The recognition accuracy of the proposed P-MTSC3D network is 97.92% on UCF101 and 75.59% on HMDB51. The FLOPs of the P-MTSC3D network are 30.85G, and the test time is 2.83 s per 16 samples on UCF101. The experimental results demonstrate that the P-MTSC3D network achieves better overall performance than state-of-the-art networks. In addition, a prison action (PA) dataset is constructed to verify the network's effectiveness in real-world scenarios.
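As a rough illustration of the three-branch design this abstract describes, the sketch below builds a parallel 3D-CNN whose second and third branches apply channel attention and temporal attention before the branch outputs are fused. It is a minimal sketch, not the authors' implementation: the module internals, channel sizes, and summation fusion are assumptions standing in for the details of the MGSE and TAAC modules, which the abstract does not specify.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style channel attention (stand-in for MGSE)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                          # x: (N, C, T, H, W)
        w = x.mean(dim=(2, 3, 4))                  # squeeze over time and space
        w = self.fc(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                               # excite: reweight channels

class TemporalAttention(nn.Module):
    """Convolutional attention over the frame axis (stand-in for TAAC)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (N, C, T, H, W)
        t = x.mean(dim=(3, 4))                     # (N, C, T) temporal descriptor
        w = torch.sigmoid(self.conv(t)).unsqueeze(-1).unsqueeze(-1)
        return x * w                               # reweight frames

class ParallelFusionNet(nn.Module):
    """Three parallel branches fused by summation, as in the abstract."""
    def __init__(self, in_ch=3, ch=64, num_classes=101):
        super().__init__()
        conv = lambda: nn.Sequential(              # shared stem pattern
            nn.Conv3d(in_ch, ch, 3, padding=1), nn.BatchNorm3d(ch), nn.ReLU())
        self.base = conv()                         # branch 1: basic features
        self.spatial = nn.Sequential(conv(),       # branch 2: two MGSE stand-ins
                                     ChannelAttention3D(ch),
                                     ChannelAttention3D(ch))
        self.temporal = nn.Sequential(conv(),      # branch 3: two TAAC stand-ins
                                      TemporalAttention(ch),
                                      TemporalAttention(ch))
        self.head = nn.Linear(ch, num_classes)

    def forward(self, x):                          # x: (N, 3, T, H, W)
        f = self.base(x) + self.spatial(x) + self.temporal(x)  # fuse branches
        return self.head(f.mean(dim=(2, 3, 4)))   # global pool + classify

clip = torch.randn(2, 3, 16, 112, 112)             # two 16-frame clips
print(ParallelFusionNet()(clip).shape)             # torch.Size([2, 101])
```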
Affiliation(s)
- Shuang Zhou, Hongji Xu, Zhiquan Bai, Zhengfeng Du, Jiaqi Zeng, Yang Wang, Yuhao Wang, Shijie Li, Mengmeng Wang, Yiran Li, Jianjun Li, Jie Xu
- School of Information Science and Engineering, Shandong University, 72 Binhai Road, Qingdao, 266237, Shandong, China (all authors)
3
Wang C, Wang Z. Unsupervised Facial Action Representation Learning by Temporal Prediction. Front Neurorobot 2022;16:851847. PMID: 35370591. PMCID: PMC8965886. DOI: 10.3389/fnbot.2022.851847.
Abstract
Due to the cumbersome and expensive data collection process, facial action unit (AU) datasets are generally much smaller than those in other computer vision fields, so AU detection models trained on such insufficient data tend to overfit. Despite recent progress in AU detection, deployment of these models has been impeded by their limited generalization to unseen subjects and facial poses. In this paper, we propose to learn discriminative facial AU representations in a self-supervised manner. Considering that facial AUs show temporal consistency and evolution across consecutive facial frames, we develop a self-supervised pseudo signal based on temporally predictive coding (TPC) to capture these temporal characteristics. To further learn per-frame discriminativeness between sibling facial frames, we naturally incorporate frame-wise temporal contrastive learning into the self-supervised paradigm. The proposed TPC can be trained without AU annotations, which allows us to use a large number of unlabeled facial videos to learn AU representations that are robust to undesired nuisances such as facial identity and pose. Contrary to previous AU detection works, our method requires neither manually selected key facial regions nor explicit modeling of AU relations. Experimental results show that TPC improves AU detection precision on several popular AU benchmark datasets compared with other self-supervised AU detection methods.
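The core training signal described here, predicting the embedding of the next frame and contrasting it against sibling frames, can be sketched as an InfoNCE-style loss. This is a minimal sketch under assumptions: the encoder output shape, the linear predictor, and the choice to treat all other frames in the batch as negatives are illustrative, not the paper's exact formulation of TPC.

```python
import torch
import torch.nn.functional as F

def tpc_loss(frame_feats, predictor, temperature=0.1):
    """frame_feats: (N, T, D) per-frame embeddings from any encoder.
    predictor: module mapping a frame embedding to the next one."""
    context = frame_feats[:, :-1, :]            # frames 0 .. T-2
    target = frame_feats[:, 1:, :]              # frames 1 .. T-1
    pred = predictor(context)                   # predict next-frame embedding
    pred = F.normalize(pred.flatten(0, 1), dim=-1)    # (N*(T-1), D)
    target = F.normalize(target.flatten(0, 1), dim=-1)
    logits = pred @ target.t() / temperature    # each prediction vs all frames
    labels = torch.arange(logits.size(0))       # positive = true next frame
    # InfoNCE: pull the matching frame closer, push sibling frames of the
    # same clip and frames from other clips away.
    return F.cross_entropy(logits, labels)

predictor = torch.nn.Linear(128, 128)           # simplest possible predictor
feats = torch.randn(4, 8, 128)                  # 4 clips x 8 frames x 128-d
print(tpc_loss(feats, predictor))
```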
Affiliation(s)
- Chongwen Wang
- School of Computer Science, Beijing Institute of Technology, Beijing, China
4
Wang C, Wang Z. Progressive Multi-Scale Vision Transformer for Facial Action Unit Detection. Front Neurorobot 2022;15:824592. PMID: 35095460. PMCID: PMC8790567. DOI: 10.3389/fnbot.2021.824592.
Abstract
Facial action unit (AU) detection is an important task in affective computing and has attracted extensive attention in computer vision and artificial intelligence. Previous studies on AU detection usually encode complex regional feature representations with manually defined facial landmarks and model the relationships among AUs via graph neural networks. Although some progress has been achieved, existing methods still struggle to capture the exclusive and concurrent relationships among different combinations of facial AUs. To circumvent this issue, we propose a new progressive multi-scale vision transformer (PMVT) that captures the complex relationships among different AUs for a wide range of expressions in a data-driven fashion. PMVT is based on a multi-scale self-attention mechanism that can flexibly attend to a sequence of image patches to encode the critical cues for AUs. Compared with previous AU detection methods, the benefits of PMVT are twofold: (i) PMVT does not rely on manually defined facial landmarks to extract regional representations, and (ii) PMVT encodes facial regions with adaptive receptive fields, thus representing different AUs flexibly. Experimental results show that PMVT improves AU detection accuracy on the popular BP4D and DISFA datasets and obtains consistent improvements over other state-of-the-art AU detection methods. Visualization results show that PMVT automatically perceives the discriminative facial regions for robust AU detection.
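A minimal sketch of the multi-scale self-attention idea follows: patches are embedded at two granularities, each scale runs its own transformer encoder, and the pooled features are fused to produce AU logits. The patch sizes, depths, and fusion by concatenation are assumptions, and the progressive aspect of PMVT is not modeled here.

```python
import torch
import torch.nn as nn

class MultiScaleViT(nn.Module):
    """Multi-scale attention over patches; a sketch, not PMVT itself.
    (Positional embeddings are omitted for brevity.)"""
    def __init__(self, patches=(16, 8), dim=128, num_aus=12):
        super().__init__()
        # One patch-embedding conv per scale: kernel = stride = patch size.
        self.embeds = nn.ModuleList(
            nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patches)
        encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.stages = nn.ModuleList(encoder() for _ in patches)
        self.head = nn.Linear(dim * len(patches), num_aus)

    def forward(self, x):                       # x: (N, 3, H, W)
        feats = []
        for embed, stage in zip(self.embeds, self.stages):
            tokens = embed(x).flatten(2).transpose(1, 2)  # (N, P, dim)
            feats.append(stage(tokens).mean(dim=1))       # pool tokens
        return self.head(torch.cat(feats, dim=-1))        # AU logits

x = torch.randn(2, 3, 112, 112)
print(MultiScaleViT()(x).shape)                 # torch.Size([2, 12])
```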
Affiliation(s)
- Chongwen Wang
- School of Computer Science, Beijing Institute of Technology, Beijing, China
5
Gao J, Zhao Y. TFE: A Transformer Architecture for Occlusion Aware Facial Expression Recognition. Front Neurorobot 2021;15:763100. PMID: 34759808. PMCID: PMC8573424. DOI: 10.3389/fnbot.2021.763100.
Abstract
Facial expression recognition (FER) in uncontrolled environments is challenging due to various unconstrained conditions. Although existing deep learning-based FER approaches are quite promising at recognizing frontal faces, they still struggle to accurately identify expressions on faces that are partly occluded in unconstrained scenarios. To mitigate this issue, we propose a transformer-based FER method (TFE) that adaptively focuses on the most important and unoccluded facial regions. TFE is based on the multi-head self-attention mechanism, which can flexibly attend to a sequence of image patches to encode the critical cues for FER. Compared with the traditional transformer, the novelty of TFE is twofold: (i) to effectively select discriminative facial regions, we integrate the attention weights from all transformer layers into a single attention map that guides the network to the important facial regions; and (ii) given an occluded facial image, we use a decoder to reconstruct the corresponding non-occluded face, so TFE can infer the occluded regions and better recognize the expression. We evaluate the proposed TFE on two prevalent in-the-wild facial expression datasets (AffectNet and RAF-DB) and their modifications with artificial occlusions. Experimental results show that TFE improves recognition accuracy on both non-occluded and occluded faces and obtains consistent improvements over other state-of-the-art FER methods. Visualization results show that TFE automatically focuses on the discriminative, non-occluded facial regions for robust FER.
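The two novelties claimed here can be sketched independently: fusing per-layer attention weights into a single map, and adding a reconstruction loss for the non-occluded face. Attention rollout is used below as one plausible way to integrate the layer-wise weights; both it and the joint loss weighting are assumptions, not the paper's exact method.

```python
import torch

def fuse_attention(attn_per_layer):
    """attn_per_layer: list of (N, heads, P, P) attention matrices,
    one per transformer layer. Returns one fused (N, P, P) map."""
    fused = None
    for attn in attn_per_layer:
        a = attn.mean(dim=1)                    # average over heads
        a = a + torch.eye(a.size(-1))           # account for residual paths
        a = a / a.sum(dim=-1, keepdim=True)     # renormalize rows
        fused = a if fused is None else a @ fused  # compose across layers
    return fused

def tfe_losses(logits, labels, recon, clean_face, ce, l1, alpha=0.5):
    """Joint objective sketch: expression classification on the occluded
    input plus reconstruction of its non-occluded counterpart."""
    return ce(logits, labels) + alpha * l1(recon, clean_face)

# Six layers of softmax-normalized attention over 49 patches, 4 heads.
attn = [torch.softmax(torch.randn(2, 4, 49, 49), dim=-1) for _ in range(6)]
print(fuse_attention(attn).shape)               # torch.Size([2, 49, 49])
```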
Affiliation(s)
- Jixun Gao
- Department of Computer Science, Henan University of Engineering, Zhengzhou, China
- Yuanyuan Zhao
- Department of Computer Science, Zhengzhou University of Technology, Zhengzhou, China