1. Su C, Wei J, Lin D, Kong L. Using attention LSGB network for facial expression recognition. Pattern Anal Appl 2022. DOI: 10.1007/s10044-022-01124-w.
2. Emotion Recognition of Down Syndrome People Based on the Evaluation of Artificial Intelligence and Statistical Analysis Methods. Symmetry (Basel) 2022. DOI: 10.3390/sym14122492.
Abstract
This article presents a study evaluating different techniques to automatically recognize the basic emotions of people with Down syndrome, such as anger, happiness, sadness, surprise, and neutrality, together with a statistical analysis of the Facial Action Coding System to determine the symmetry of the Action Units present in each emotion and to identify the facial features that characterize this group of people. First, a dataset of images of faces of people with Down syndrome, classified according to their emotions, is built. Then, the characteristics of the facial micro-expressions (Action Units) present in the emotions of the target group are evaluated through statistical analysis. This analysis uses the intensity values of the most representative exclusive Action Units to classify people’s emotions. Subsequently, the collected dataset was evaluated using machine learning and deep learning techniques to recognize emotions. Among the supervised learning techniques tested, the Support Vector Machine obtained the best precision, with a value of 66.20%. Among the deep learning methods, the mini-Xception convolutional neural network, used to recognize the emotions of people with typical development, obtained an accuracy of 74.8%.
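A minimal, hedged sketch of the supervised-learning step this abstract describes: classifying emotions from Action Unit intensity features with a support-vector machine. The feature values, label encoding, and number of AUs below are synthetic placeholders, not the study's data or pipeline.

```python
# Sketch: emotion classification from AU intensity features with an SVM (toy data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
emotions = ["anger", "happiness", "sadness", "surprise", "neutral"]
X = rng.random((250, 17))                         # 17 AU intensity values per face image (toy)
y = rng.integers(0, len(emotions), size=250)      # emotion label per image (toy)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X_train, y_train)
print(f"accuracy: {clf.score(X_test, y_test):.3f}")
```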
3. Liu W, Wang H, Shen X, Tsang IW. The Emerging Trends of Multi-Label Learning. IEEE Trans Pattern Anal Mach Intell 2022; 44:7955-7974. PMID: 34637378. DOI: 10.1109/tpami.2021.3119334.
Abstract
Exabytes of data are generated daily by humans, leading to a growing need for new efforts to deal with the grand challenges that big data brings to multi-label learning. For example, extreme multi-label classification is an active and rapidly growing research area that deals with classification tasks involving an extremely large number of classes or labels, and utilizing massive data with limited supervision to build multi-label classification models is becoming valuable for practical applications. Beyond these, there are tremendous efforts on how to harness the strong learning capability of deep learning to better capture label dependencies in multi-label learning, which is key for deep learning to address real-world classification tasks. However, there has been a lack of systematic studies that focus explicitly on analyzing the emerging trends and new challenges of multi-label learning in the era of big data. It is imperative to call for a comprehensive survey to fulfil this mission and delineate future research directions and new applications.
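For readers new to the setting this survey covers, the toy example below shows the basic multi-label formulation (binary relevance: one binary classifier per label) on synthetic data; it is only an illustrative baseline, not a method from the survey, and extreme multi-label settings need more specialised techniques.

```python
# Sketch: binary-relevance multi-label classification on synthetic data.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=500, n_features=20, n_classes=8, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)

# One logistic-regression classifier per label; Y is a binary indicator matrix.
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)
print("micro-F1:", round(f1_score(Y_te, clf.predict(X_te), average="micro"), 3))
```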
4. Zhao J, Wu M, Zhou L, Wang X, Jia J. Cognitive psychology-based artificial intelligence review. Front Neurosci 2022; 16:1024316. PMID: 36278021. PMCID: PMC9582153. DOI: 10.3389/fnins.2022.1024316.
Abstract
Most current development of artificial intelligence is based on brain cognition; however, this replication of biology cannot simulate the subjective emotional and mental-state changes of human beings. Given the imperfections of existing artificial intelligence, this manuscript argues that combining artificial intelligence systems with cognitive psychology is a key research direction for the field. The aim is to promote the development of artificial intelligence and give computers advanced human cognitive abilities, so that computers can recognize emotions, understand human feelings, and eventually achieve dialog and empathy with humans and other artificial intelligences. This paper emphasizes the potential and importance of artificial intelligence that can understand, possess, and discriminate human mental states, and argues for its application value with three typical examples of human–computer interaction: face attraction, affective computing, and music emotion, which is conducive to further, higher-level artificial intelligence research.
Affiliation(s)
- Jian Zhao, Mengqing Wu, Liyun Zhou, Xuezhu Wang: School of Information Science and Technology, Northwest University, Xi’an, China
- Jian Jia (corresponding author): Medical Big Data Research Center and School of Mathematics, Northwest University, Xi’an, China
5. Li Y, Zeng J, Shan S. Learning Representations for Facial Actions From Unlabeled Videos. IEEE Trans Pattern Anal Mach Intell 2022; 44:302-317. PMID: 32750828. DOI: 10.1109/tpami.2020.3011063.
Abstract
Facial actions are usually encoded as anatomy-based action units (AUs), the labelling of which demands expertise and thus is time-consuming and expensive. To alleviate the labelling demand, we propose to leverage the large number of unlabelled videos by introducing a twin-cycle autoencoder (TAE) that learns discriminative representations for facial actions. TAE is inspired by the fact that facial actions are embedded in the pixel-wise displacements between two sequential face images (hereinafter, source and target) in a video. Therefore, learning representations of facial actions can be achieved by learning representations of the displacements. However, the displacements induced by facial actions are entangled with those induced by head motion. TAE is thus trained to disentangle the two kinds of movements by evaluating the quality of the synthesized images when either the facial actions or the head pose is changed, aiming to reconstruct the target image. Experiments on AU detection show that TAE can achieve accuracy comparable to other existing AU detection methods, including some supervised methods, thus validating the discriminative capacity of the representations learned by TAE. TAE's ability to decouple action-induced and pose-induced movements is also validated by visualizing the generated images and by analyzing the facial image retrieval results qualitatively and quantitatively.
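As a rough illustration of the displacement-disentangling idea (not the authors' TAE architecture), the sketch below encodes a source/target pair into separate "AU" and "pose" codes, decodes each into a dense displacement field, and trains by warping the source with both fields to reconstruct the target. All layer sizes, the 64x64 input assumption, and the warping scheme are illustrative assumptions; the paper's cycle-consistency terms are omitted.

```python
# Sketch (PyTorch): disentangling AU-induced and pose-induced displacement fields.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinDisplacementAE(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 4, 2, 1), nn.ReLU(),   # source and target stacked as 2 channels
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 2 * emb_dim),
        )
        # Two decoders: one for AU-induced motion, one for head-pose-induced motion.
        self.dec_au = nn.Linear(emb_dim, 64 * 64 * 2)
        self.dec_pose = nn.Linear(emb_dim, 64 * 64 * 2)

    def forward(self, source, target):
        z = self.encoder(torch.cat([source, target], dim=1))
        z_au, z_pose = z.chunk(2, dim=1)
        flow_au = self.dec_au(z_au).view(-1, 64, 64, 2)
        flow_pose = self.dec_pose(z_pose).view(-1, 64, 64, 2)
        return z_au, flow_au, flow_pose

def warp(image, flow):
    """Warp `image` (N,1,H,W) by a dense displacement `flow` (N,H,W,2) via grid_sample."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    return F.grid_sample(image, base + flow, align_corners=True)

# Self-supervised objective (sketch): the two flows composed should reconstruct the target.
model = TwinDisplacementAE()
src, tgt = torch.rand(4, 1, 64, 64), torch.rand(4, 1, 64, 64)
z_au, flow_au, flow_pose = model(src, tgt)
recon = warp(warp(src, flow_au), flow_pose)        # apply AU motion, then pose motion
loss = F.l1_loss(recon, tgt)                       # reconstruction term only; TAE adds cycle terms
loss.backward()
```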
6. Chen Y, Wu H, Wang T, Wang Y, Liang Y. Cross-Modal Representation Learning for Lightweight and Accurate Facial Action Unit Detection. IEEE Robot Autom Lett 2021. DOI: 10.1109/lra.2021.3098944.
7. Li Y, Huang X, Zhao G. Micro-expression action unit detection with spatial and channel attention. Neurocomputing 2021. DOI: 10.1016/j.neucom.2021.01.032.
8. Masson A, Cazenave G, Trombini J, Batt M. The current challenges of automatic recognition of facial expressions: A systematic review. AI Commun 2020. DOI: 10.3233/aic-200631.
Abstract
In recent years, due to its great economic and social potential, the recognition of facial expressions linked to emotions has become one of the most flourishing applications in the field of artificial intelligence, and has been the subject of many developments. However, despite significant progress, this field is still subject to many theoretical debates and technical challenges. It therefore seems important to make a general inventory of the different lines of research and to present a synthesis of recent results in this field. To this end, we have carried out a systematic review of the literature according to the guidelines of the PRISMA method. A search of 13 documentary databases identified a total of 220 references over the period 2014–2019. After a global presentation of the current systems and their performance, we grouped and analyzed the selected articles in the light of the main problems encountered in the field of automated facial expression recognition. The conclusion of this review highlights the strengths, limitations and main directions for future research in this field.
Affiliation(s)
- Audrey Masson: Interpsy – GRC, University of Lorraine, France; Two-I, France
- Martine Batt: Interpsy – GRC, University of Lorraine, France
9.
10. Zhang F, Zhang T, Mao Q, Xu C. A Unified Deep Model for Joint Facial Expression Recognition, Face Synthesis, and Face Alignment. IEEE Trans Image Process 2020; 29:6574-6589. PMID: 32396088. DOI: 10.1109/tip.2020.2991549.
Abstract
Facial expression recognition, face synthesis, and face alignment are three coherently related tasks that can be solved in a joint framework. To achieve this goal, in this paper we propose a novel end-to-end deep learning model that exploits the expression code, geometry code, and generated data jointly for simultaneous pose-invariant facial expression recognition, face image synthesis, and face alignment. The proposed deep model enjoys several merits. First, to the best of our knowledge, this is the first work to address these three tasks jointly in a unified deep model so that they complement and enhance each other. Second, the proposed model can effectively disentangle the global and local identity representation from different expression and geometry codes. As a result, it can automatically generate facial images with different expressions under arbitrary geometry codes. Third, the three tasks can further boost one another's performance via our model. Extensive experimental results on three standard benchmarks demonstrate that the proposed deep model performs favorably against state-of-the-art methods on all three tasks.
11. The neural representation of facial-emotion categories reflects conceptual structure. Proc Natl Acad Sci U S A 2019; 116:15861-15870. PMID: 31332015. DOI: 10.1073/pnas.1816408116.
Abstract
Humans reliably categorize configurations of facial actions into specific emotion categories, leading some to argue that this process is invariant between individuals and cultures. However, growing behavioral evidence suggests that factors such as emotion-concept knowledge may shape the way emotions are visually perceived, leading to variability-rather than universality-in facial-emotion perception. Understanding variability in emotion perception is only emerging, and the neural basis of any impact from the structure of emotion-concept knowledge remains unknown. In a neuroimaging study, we used a representational similarity analysis (RSA) approach to measure the correspondence between the conceptual, perceptual, and neural representational structures of the six emotion categories Anger, Disgust, Fear, Happiness, Sadness, and Surprise. We found that subjects exhibited individual differences in their conceptual structure of emotions, which predicted their own unique perceptual structure. When viewing faces, the representational structure of multivoxel patterns in the right fusiform gyrus was significantly predicted by a subject's unique conceptual structure, even when controlling for potential physical similarity in the faces themselves. Finally, cross-cultural differences in emotion perception were also observed, which could be explained by individual differences in conceptual structure. Our results suggest that the representational structure of emotion expressions in visual face-processing regions may be shaped by idiosyncratic conceptual understanding of emotion categories.
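The core RSA logic described above can be pictured with a small, hedged example: build a dissimilarity structure over the six emotion categories from conceptual ratings, build another from multivoxel patterns, and rank-correlate the two. The data below are random toys, not the study's ratings or fMRI patterns.

```python
# Sketch: representational similarity analysis (RSA) with toy conceptual and neural data.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emotions = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

# Toy inputs: conceptual ratings (6 categories x K rating dimensions) and
# neural patterns (6 categories x V voxels), e.g., from a face-selective region.
conceptual_ratings = rng.normal(size=(6, 10))
neural_patterns = rng.normal(size=(6, 200))

# Representational dissimilarity matrices as condensed vectors of pairwise distances.
conceptual_rdm = pdist(conceptual_ratings, metric="correlation")
neural_rdm = pdist(neural_patterns, metric="correlation")

# RSA statistic: rank correlation between the two dissimilarity structures.
rho, p = spearmanr(conceptual_rdm, neural_rdm)
print(f"conceptual-neural RDM correlation: rho={rho:.3f}, p={p:.3f}")
```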
12. Ma Z, Lai Y, Kleijn WB, Song YZ, Wang L, Guo J. Variational Bayesian Learning for Dirichlet Process Mixture of Inverted Dirichlet Distributions in Non-Gaussian Image Feature Modeling. IEEE Trans Neural Netw Learn Syst 2019; 30:449-463. PMID: 29994731. DOI: 10.1109/tnnls.2018.2844399.
Abstract
In this paper, we develop a novel variational Bayesian learning method for the Dirichlet process (DP) mixture of inverted Dirichlet distributions, which has been shown to be very flexible for modeling vectors with positive elements. The recently proposed extended variational inference (EVI) framework is adopted to derive an analytically tractable solution. The convergence of the proposed algorithm is theoretically guaranteed by introducing a single lower-bound approximation to the original objective function in the EVI framework. In principle, the proposed model can be viewed as an infinite inverted Dirichlet mixture model that allows the number of mixture components to be determined automatically from data. Therefore, the problem of predetermining the optimal number of mixture components is overcome. Moreover, the problems of overfitting and underfitting are avoided by the Bayesian estimation approach. Compared with several recently proposed DP-related methods and conventional applied methods, the good performance and effectiveness of the proposed method are demonstrated with evaluations on both synthesized and real data.
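To show the practical effect of a DP mixture prior (automatic pruning of unused components), here is a small stand-in using scikit-learn's variational BayesianGaussianMixture on toy positive-valued features. Note the hedge: the paper's model uses inverted Dirichlet components and extended variational inference, whereas this sketch uses Gaussian components only to illustrate the behavior.

```python
# Sketch: Dirichlet-process mixture behavior (effective number of components inferred from data).
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy positive-valued features drawn from three clusters.
X = np.vstack([rng.gamma(shape=k, scale=1.0, size=(200, 4)) for k in (2.0, 6.0, 12.0)])

dpgmm = BayesianGaussianMixture(
    n_components=10,                                   # truncation level, not the final count
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

# Components with negligible posterior weight are effectively pruned.
print("effective components:", np.sum(dpgmm.weights_ > 0.01))
```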
13. Chu WS, De la Torre F, Cohn JF. Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling. Image Vis Comput 2019; 81:1-14. PMID: 30524157. PMCID: PMC6277040. DOI: 10.1016/j.imavis.2018.10.002.
Abstract
Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated into a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNNs and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and obtained increased accuracy for AU detection. To address class imbalance within and between batches during training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
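A minimal sketch of the kind of CNN + LSTM hybrid described above: per-frame spatial features from a small CNN, temporal modeling with an LSTM, and a fusion head emitting multi-label AU logits per frame, trained with a class-weighted multi-label loss. All layer sizes, the frame size, and the loss weighting are illustrative assumptions, not the paper's configuration.

```python
# Sketch (PyTorch): CNN features per frame -> LSTM over time -> fused per-frame multi-AU logits.
import torch
import torch.nn as nn

class CnnLstmAuDetector(nn.Module):
    def __init__(self, num_aus=12, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        # Fusion: concatenate per-frame CNN features with LSTM temporal features.
        self.head = nn.Linear(feat_dim + hidden, num_aus)

    def forward(self, clips):                 # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        temporal, _ = self.lstm(feats)
        return self.head(torch.cat([feats, temporal], dim=-1))   # (batch, time, num_aus) logits

model = CnnLstmAuDetector()
x = torch.rand(2, 8, 1, 64, 64)               # 2 clips of 8 frames each
labels = torch.randint(0, 2, (2, 8, 12)).float()
# Multi-label loss; pos_weight is one simple way to counter sparse (imbalanced) AUs.
loss = nn.BCEWithLogitsLoss(pos_weight=torch.full((12,), 3.0))(model(x), labels)
loss.backward()
```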
Affiliation(s)
- Wen-Sheng Chu: Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Jeffrey F Cohn: Department of Psychology, University of Pittsburgh, Pittsburgh, USA
14. Wang S, Peng G, Chen S, Ji Q. Weakly Supervised Facial Action Unit Recognition With Domain Knowledge. IEEE Trans Cybern 2018; 48:3265-3276. PMID: 30273163. DOI: 10.1109/tcyb.2018.2868194.
Abstract
Current facial action unit (AU) recognition typically relies on supervised training, which requires fully AU-annotated training images. Due to the nuances of facial appearance and individual differences, AU annotation is a time-consuming, expensive, and error-prone process. Facial expressions are relatively simple to label, since they describe facial behavior globally and the number of expressions appearing on a face is much smaller than the number of AUs. Furthermore, there exist strong dependencies between AUs and expressions, referred to as domain knowledge; such knowledge is inherent in facial anatomy and facial behavior. Therefore, in this paper, we propose a novel weakly supervised AU recognition method that jointly learns multiple AU classifiers from expression annotations, without any AU annotations, by leveraging this domain knowledge. Specifically, we first summarize the expression-dependent AU ranking from the domain knowledge of conditional probabilities of AUs given expressions. Then, we formulate weakly supervised AU recognition as a multi-label ranking problem and propose an efficient learning algorithm to solve it. Furthermore, we extend the proposed method to a semi-supervised learning scenario in which partially AU-labeled samples are available. Experimental results on three benchmark databases demonstrate that the proposed method successfully exploits domain knowledge for multiple AU recognition and thus outperforms both state-of-the-art weakly supervised and semi-supervised AU recognition methods.
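One way to picture the multi-label ranking formulation: for an image labeled only with an expression, the model's AU scores should respect the expression-dependent AU ordering derived from domain knowledge. The toy pairwise hinge loss below is a hedged sketch of that constraint, with made-up scores and a hypothetical AU ordering for happiness; it is not the paper's learning algorithm.

```python
# Sketch: pairwise ranking loss encouraging AU scores to follow a prior ordering.
import torch

def ranking_loss(scores, au_order, margin=0.1):
    """scores: (num_aus,) predicted AU scores; au_order: AU indices ordered from most to
    least probable under the image's expression label (domain knowledge)."""
    loss = scores.new_zeros(())
    for hi, lo in zip(au_order[:-1], au_order[1:]):
        loss = loss + torch.clamp(margin - (scores[hi] - scores[lo]), min=0.0)
    return loss

scores = torch.tensor([0.2, 0.9, 0.4, 0.1], requires_grad=True)   # e.g., AU6, AU12, AU25, AU4
order_for_happiness = [1, 2, 0, 3]   # hypothetical ranking: AU12 > AU25 > AU6 > AU4
ranking_loss(scores, order_for_happiness).backward()
```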
15. Wang S, Hao L, Ji Q. Facial Action Unit Recognition and Intensity Estimation Enhanced through Label Dependencies. IEEE Trans Image Process 2018; 28:1428-1442. PMID: 30371371. DOI: 10.1109/tip.2018.2878339.
Abstract
The inherent dependencies among facial action units (AUs), caused by the underlying anatomical mechanism, are essential for the proper recognition of AUs and the estimation of intensity levels, but they have not been exploited to their full potential. We propose novel methods to recognize AUs and estimate their intensity via hybrid Bayesian networks. The upper two layers are latent regression Bayesian networks (LRBNs), and the lower layers are Bayesian networks (BNs). The visible nodes of the LRBN layers are representations of ground-truth AU occurrences or AU intensities. Through the directed connections between the latent and visible layers, an LRBN can successfully represent relationships between multiple AUs or AU intensities. The lower layers include Bayesian networks with two nodes for AU recognition and Bayesian networks with three nodes for AU intensity estimation. The bottom layers incorporate measurements from facial images with AU dependencies for intensity estimation and AU recognition. Efficient learning algorithms for the hybrid Bayesian networks are proposed for AU recognition as well as intensity estimation. Furthermore, the proposed hybrid Bayesian network models are extended for facial-expression-assisted AU recognition and intensity estimation, as AU relationships are closely related to facial expressions. We test our methods on three benchmark databases for AU recognition and two benchmark databases for intensity estimation. The results demonstrate that the proposed approaches faithfully model the complex and global inherent AU dependencies, and that expression labels available only during training can boost the estimation of AU dependencies for both AU recognition and intensity estimation.
16. Zhao K, Chu WS, Martinez AM. Learning Facial Action Units from Web Images with Scalable Weakly Supervised Clustering. Proc IEEE Conf Comput Vis Pattern Recognit (CVPR) 2018; 2018:2090-2099. PMID: 31244515. PMCID: PMC6594709. DOI: 10.1109/cvpr.2018.00223.
Abstract
We present a scalable weakly supervised clustering approach to learn facial action units (AUs) from large collections of freely available web images. Unlike most existing methods (e.g., CNNs) that rely on fully annotated data, our method exploits web images with inaccurate annotations. Specifically, we derive a weakly supervised spectral algorithm that learns an embedding space coupling image appearance and semantics. The algorithm has an efficient gradient update and scales to large quantities of images with a stochastic extension. With the learned embedding space, we adopt rank-order clustering to identify groups of visually and semantically similar images, and re-annotate these groups for training AU classifiers. Evaluation on the one-million-image EmotioNet dataset demonstrates the effectiveness of our approach: (1) our learned annotations reach on average 91.3% agreement with human annotations on 7 common AUs, (2) classifiers trained with re-annotated images perform comparably to, and sometimes even better than, their supervised CNN-based counterparts, and (3) our method offers intuitive outlier/noise pruning instead of forcing an annotation onto every image. Code is available.
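A loose sketch of the cluster-then-reannotate idea: embed images, group visually similar ones, and give every image in a group the group's majority (noisy) web label. Scikit-learn's SpectralEmbedding and AgglomerativeClustering are used below only as generic stand-ins for the paper's weakly supervised spectral algorithm and rank-order clustering; features and labels are random toys.

```python
# Sketch: re-annotating noisy web labels via clustering in an embedding space.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(0)
features = rng.normal(size=(300, 64))            # toy appearance features for 300 web images
noisy_labels = rng.integers(0, 2, size=300)      # noisy web annotation for one AU

embedded = SpectralEmbedding(n_components=8, random_state=0).fit_transform(features)
groups = AgglomerativeClustering(n_clusters=20).fit_predict(embedded)

# Re-annotate each group with its majority label; small/ambiguous groups could be pruned.
reannotated = np.empty_like(noisy_labels)
for g in np.unique(groups):
    members = groups == g
    reannotated[members] = np.bincount(noisy_labels[members]).argmax()
```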
Affiliation(s)
- Kaili Zhao: School of Communication and Information Engineering, Beijing University of Posts and Telecommunications
- Aleix M Martinez: Department of Electrical and Computer Engineering, The Ohio State University
17. Ertugrul IO, Jeni LA, Cohn JF. FACSCaps: Pose-Independent Facial Action Coding with Capsules. IEEE Conf Comput Vis Pattern Recognit Workshops (CVPRW) 2018; 2018:2211-2220. PMID: 30944768. PMCID: PMC6443417. DOI: 10.1109/cvprw.2018.00287.
Abstract
Most automated facial expression analysis methods treat the face as a 2D object, flat like a sheet of paper. That works well provided images are frontal or nearly so. In real-world conditions, moderate to large head rotation is common, and system performance in recognizing expressions degrades. Multi-view Convolutional Neural Networks (CNNs) have been proposed to increase robustness to pose, but they require larger models and may generalize poorly across views that are not included in the training set. We propose the FACSCaps architecture to handle multi-view and multi-label facial action unit (AU) detection within a single model that can generalize to novel views. Additionally, FACSCaps's ability to synthesize faces gives insight into what is learned by the model. FACSCaps models video frames using matrix capsules, in which hierarchical pose relationships between face parts are built into the internal representations. The model is trained by jointly optimizing a multi-label loss and the reconstruction accuracy. FACSCaps was evaluated on the FERA 2017 facial expression dataset, which includes spontaneous facial expressions in a wide range of head orientations. FACSCaps outperformed both state-of-the-art CNNs and their temporal extensions.
Affiliation(s)
- László A Jeni: Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Jeffrey F Cohn: Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA; Department of Psychology, University of Pittsburgh, Pittsburgh, PA, USA
18. Oyedotun OK, Demisse G, Shabayek AER, Aouada D, Ottersten B. Facial Expression Recognition via Joint Deep Learning of RGB-Depth Map Latent Representations. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW) 2017. DOI: 10.1109/iccvw.2017.374.
19. Confidence-Weighted Local Expression Predictions for Occlusion Handling in Expression Recognition and Action Unit Detection. Int J Comput Vis 2017. DOI: 10.1007/s11263-017-1010-1.
20. Chu WS, De la Torre F, Cohn JF. Selective Transfer Machine for Personalized Facial Expression Analysis. IEEE Trans Pattern Anal Mach Intell 2017; 39:529-545. PMID: 28113267. PMCID: PMC5400741. DOI: 10.1109/tpami.2016.2547397.
Abstract
Automatic facial action unit (AU) and expression detection from videos is a long-standing problem. The problem is challenging in part because classifiers must generalize to previously unseen subjects that differ markedly in behavior and facial morphology (e.g., heavy versus delicate brows, smooth versus deeply etched wrinkles) from those on which the classifiers are trained. While some progress has been achieved through improvements in choices of features and classifiers, the challenge occasioned by individual differences among people remains. Person-specific classifiers would be a possible solution, but sufficient training data for person-specific classifiers typically is unavailable. This paper addresses the problem of how to personalize a generic classifier without additional labels from the test subject. We propose a transductive learning method, which we refer to as a Selective Transfer Machine (STM), to personalize a generic classifier by attenuating person-specific mismatches. STM achieves this effect by simultaneously learning a classifier and re-weighting the training samples that are most relevant to the test subject. We compared STM to both generic classifiers and cross-domain learning methods on four benchmarks: CK+, GEMEP-FERA, RUFACS, and GFT. STM outperformed generic classifiers on all four.
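A simplified sketch of the "personalize by re-weighting training samples" idea: weight each training sample by its kernel similarity to the unlabeled test subject's frames, then train a sample-weighted SVM. STM itself solves a joint kernel-mean-matching and SVM objective; the stand-in below only captures the re-weighting intuition, and all data and the gamma value are toy assumptions.

```python
# Sketch: instance re-weighting toward a test subject, then a weighted SVM.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 30))                   # pooled training subjects' features
y_train = rng.integers(0, 2, size=400)                 # AU present / absent
X_test_subject = rng.normal(loc=0.5, size=(80, 30))    # unlabeled frames of the test subject

# Similarity of each training sample to the test subject's distribution.
weights = rbf_kernel(X_train, X_test_subject, gamma=0.05).mean(axis=1)
weights = weights / weights.mean()                     # normalize around 1

personalized = SVC(kernel="rbf").fit(X_train, y_train, sample_weight=weights)
```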
21. Eleftheriadis S, Rudovic O, Pantic M. Joint Facial Action Unit Detection and Feature Fusion: A Multi-conditional Learning Approach. IEEE Trans Image Process 2016; 25:5727-5742. PMID: 28113501. DOI: 10.1109/tip.2016.2615288.
Abstract
Automated analysis of facial expressions can benefit many domains, from marketing to clinical diagnosis of neurodevelopmental disorders. Facial expressions are typically encoded as a combination of facial muscle activations, i.e., action units. Depending on context, these action units co-occur in specific patterns, and rarely in isolation. Yet, most existing methods for automatic action unit detection fail to exploit dependencies among them, and the corresponding facial features. To address this, we propose a novel multi-conditional latent variable model for simultaneous fusion of facial features and joint action unit detection. Specifically, the proposed model performs feature fusion in a generative fashion via a low-dimensional shared subspace, while simultaneously performing action unit detection using a discriminative classification approach. We show that by combining the merits of both approaches, the proposed methodology outperforms existing purely discriminative/generative methods for the target task. To reduce the number of parameters, and avoid overfitting, a novel Bayesian learning approach based on Monte Carlo sampling is proposed, to integrate out the shared subspace. We validate the proposed method on posed and spontaneous data from three publicly available datasets (CK+, DISFA and Shoulder-pain), and show that both feature fusion and joint learning of action units leads to improved performance compared to the state-of-the-art methods for the task.
22. De la Torre F, Cohn JF. Confidence Preserving Machine for Facial Action Unit Detection. IEEE Trans Image Process 2016; 25:4753-4767. PMID: 27479964. PMCID: PMC5272912. DOI: 10.1109/tip.2016.2594486.
Abstract
Facial action unit (AU) detection from video has been a long-standing problem in automated facial expression analysis. While progress has been made, accurate detection of facial AUs remains challenging due to ubiquitous sources of errors, such as inter-personal variability, pose, and low-intensity AUs. In this paper, we refer to samples causing such errors as hard samples, and the remaining ones as easy samples. To address learning with the hard samples, we propose the confidence preserving machine (CPM), a novel two-stage learning framework that combines multiple classifiers following an "easy-to-hard" strategy. During the training stage, CPM learns two confident classifiers. Each classifier focuses on separating easy samples of one class from all else, and thus preserves confidence on predicting each class. During the test stage, the confident classifiers provide "virtual labels" for easy test samples. Given the virtual labels, we propose a quasi-semi-supervised (QSS) learning strategy to learn a person-specific classifier. The QSS strategy employs a spatio-temporal smoothness that encourages similar predictions for samples within a spatio-temporal neighborhood. In addition, to further improve detection performance, we introduce two CPM extensions: iterative CPM, which iteratively augments training samples to train the confident classifiers, and kernel CPM, which kernelizes the original CPM model to promote nonlinearity. Experiments on four spontaneous datasets (GFT, BP4D, DISFA, and RU-FACS) illustrate the benefits of the proposed CPM models over baseline methods and state-of-the-art semi-supervised learning and transfer learning methods.
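A rough sketch of the easy-to-hard idea: a confident classifier labels only the test samples it is sure about (virtual labels), and those are used to fit a person-specific classifier applied to the remaining hard samples. The models, thresholds, and data below are illustrative assumptions, and the spatio-temporal smoothness term of the QSS step is omitted.

```python
# Sketch: two-stage easy-to-hard classification with virtual labels on easy test samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 20)), rng.integers(0, 2, size=500)
X_test = rng.normal(size=(200, 20))             # one test subject, unlabeled

confident = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = confident.predict_proba(X_test)[:, 1]
easy = (proba > 0.9) | (proba < 0.1)            # high-confidence predictions = easy samples

if easy.sum() >= 20 and len(np.unique(proba[easy] > 0.5)) == 2:
    virtual_labels = (proba[easy] > 0.5).astype(int)
    person_specific = LogisticRegression(max_iter=1000).fit(X_test[easy], virtual_labels)
    hard_predictions = person_specific.predict(X_test[~easy])
else:
    hard_predictions = confident.predict(X_test[~easy])   # fall back to the generic model
```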
23. Kaltwang S, Todorovic S, Pantic M. Doubly Sparse Relevance Vector Machine for Continuous Facial Behavior Estimation. IEEE Trans Pattern Anal Mach Intell 2016; 38:1748-1761. PMID: 26595911. DOI: 10.1109/tpami.2015.2501824.
Abstract
Certain inner feelings and physiological states like pain are subjective states that cannot be directly measured, but can be estimated from spontaneous facial expressions. Since they are typically characterized by subtle movements of facial parts, analysis of the facial details is required. To this end, we formulate a new regression method for continuous estimation of the intensity of facial behavior interpretation, called Doubly Sparse Relevance Vector Machine (DSRVM). DSRVM enforces double sparsity by jointly selecting the most relevant training examples (a.k.a. relevance vectors) and the most important kernels associated with facial parts relevant for interpretation of observed facial expressions. This advances prior work on multi-kernel learning, where sparsity of relevant kernels is typically ignored. Empirical evaluation on challenging Shoulder Pain videos, and the benchmark DISFA and SEMAINE datasets demonstrate that DSRVM outperforms competing approaches with a multi-fold reduction of running times in training and testing.
24. De la Torre F, Cohn JF. Joint Patch and Multi-label Learning for Facial Action Unit and Holistic Expression Recognition. IEEE Trans Image Process 2016; 25:3931-3946. PMID: 28113424. DOI: 10.1109/tip.2016.2570550.
Abstract
Most action unit (AU) detection methods use one-versus-all classifiers without considering dependencies between features or AUs. In this paper, we introduce a joint patch and multi-label learning (JPML) framework that models the structured joint dependencies among features, AUs, and their interplay. In particular, JPML leverages group sparsity to identify important facial patches and learns a multi-label classifier constrained by the likelihood of co-occurring AUs. To describe this likelihood, we derive two AU relations, positive correlation and negative competition, by statistically analyzing more than 350,000 video frames annotated with multiple AUs. To the best of our knowledge, this is the first work that jointly addresses patch learning and multi-label learning for AU detection. In addition, we show that JPML can be extended to recognize holistic expressions by learning common and specific patches, which afford a more compact representation than standard expression recognition methods. We evaluate JPML on three benchmark datasets, CK+, BP4D, and GFT, using within- and cross-dataset scenarios. In four of five experiments, JPML achieved the highest averaged F1 scores in comparison with baseline and alternative methods that use either patch learning or multi-label learning alone.