1. Chu WS, De la Torre F, Cohn JF. Learning Facial Action Units with Spatiotemporal Cues and Multi-label Sampling. Image and Vision Computing 2019; 81:1-14. [PMID: 30524157] [PMCID: PMC6277040] [DOI: 10.1016/j.imavis.2018.10.002]
Abstract
Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations, and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated into a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to a standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and increased accuracy for AU detection. To address class imbalance within and between batches while training the network, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualizations of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
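For intuition, a minimal sketch of the hybrid architecture this abstract describes: per-frame CNN features, an LSTM over time, and a fusion head producing per-frame multi-label AU scores. All layer sizes, module names, and the toy input below are our assumptions for illustration, not the authors' configuration.

```python
# Illustrative sketch only; layer sizes are assumptions, not the paper's setup.
import torch
import torch.nn as nn

class HybridAUNet(nn.Module):
    def __init__(self, num_aus=12, feat_dim=128, hidden_dim=64):
        super().__init__()
        # CNN learns per-frame spatial representations.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # LSTM models temporal dependencies among the spatial features.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Fusion head aggregates CNN and LSTM outputs into per-frame AU scores.
        self.fusion = nn.Linear(feat_dim + hidden_dim, num_aus)

    def forward(self, clips):                      # clips: (B, T, 3, H, W)
        B, T = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1))      # (B*T, feat_dim)
        feats = feats.view(B, T, -1)
        temporal, _ = self.lstm(feats)             # (B, T, hidden_dim)
        logits = self.fusion(torch.cat([feats, temporal], dim=-1))
        return logits                              # (B, T, num_aus); train with BCEWithLogitsLoss

model = HybridAUNet()
scores = torch.sigmoid(model(torch.randn(2, 8, 3, 64, 64)))  # per-frame AU probabilities
```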
Affiliation(s)
- Wen-Sheng Chu
- Robotics Institute, Carnegie Mellon University, Pittsburgh, USA
- Jeffrey F Cohn
- Department of Psychology, University of Pittsburgh, Pittsburgh, USA
2. Wu Y, Hassner T, Kim K, Medioni G, Natarajan P. Facial Landmark Detection with Tweaked Convolutional Neural Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018; 40:3067-3074. [PMID: 29990138] [DOI: 10.1109/tpami.2017.2787130]
Abstract
This paper concerns the problem of facial landmark detection. We provide a new analysis of the features produced at intermediate layers of a convolutional neural network (CNN) trained to regress facial landmark coordinates. This analysis shows that while being processed by the CNN, face images can be partitioned in an unsupervised manner into subsets containing faces in similar poses (i.e., 3D views) and with similar facial properties (e.g., presence or absence of eye-wear). Based on this finding, we describe a novel CNN architecture, specialized to regress the facial landmark coordinates of faces in specific poses and appearances. To address the shortage of training data, particularly in extreme profile poses, we additionally present data augmentation techniques designed to provide sufficient training examples for each of these specialized sub-networks. The proposed Tweaked CNN (TCNN) architecture is shown to outperform existing landmark detection methods in an extensive battery of tests on the AFW, AFLW, and 300W benchmarks. Finally, to promote reproducibility of our results, we make code and trained models publicly available through our project webpage.
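A rough sketch of the "tweaking" idea, under stated assumptions: cluster intermediate features (which the analysis finds separate faces by pose without supervision), then route each face to a landmark-regression head specialized to its cluster. The trunk, K = 4, and head shapes are hypothetical, and sklearn's KMeans stands in for the paper's unsupervised partitioning.

```python
# Hypothetical routing sketch; not the authors' TCNN implementation.
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K, FEAT, N_LANDMARKS = 4, 256, 68
trunk = nn.Sequential(nn.Flatten(), nn.Linear(3 * 40 * 40, FEAT), nn.ReLU())
heads = nn.ModuleList(nn.Linear(FEAT, 2 * N_LANDMARKS) for _ in range(K))

faces = torch.randn(32, 3, 40, 40)
with torch.no_grad():
    feats = trunk(faces)                           # intermediate CNN-like features
    # Unsupervised partition of faces into pose-like subsets.
    clusters = KMeans(n_clusters=K, n_init=10).fit_predict(feats.numpy())
    landmarks = torch.empty(len(faces), 2 * N_LANDMARKS)
    for k in range(K):
        idx = torch.as_tensor(np.flatnonzero(clusters == k))
        if len(idx):
            landmarks[idx] = heads[k](feats[idx])  # specialized head per subset
```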
3. Blended Emotion in-the-Wild: Multi-label Facial Expression Recognition Using Crowdsourced Annotations and Deep Locality Feature Learning. International Journal of Computer Vision 2018. [DOI: 10.1007/s11263-018-1131-1]
4. Zhao K, Chu WS, Martinez AM. Learning Facial Action Units from Web Images with Scalable Weakly Supervised Clustering. Proceedings, IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2018; 2018:2090-2099. [PMID: 31244515] [PMCID: PMC6594709] [DOI: 10.1109/cvpr.2018.00223]
Abstract
We present a scalable weakly supervised clustering approach to learn facial action units (AUs) from large, freely available web images. Unlike most existing methods (e.g., CNNs) that rely on fully annotated data, our method exploits web images with inaccurate annotations. Specifically, we derive a weakly-supervised spectral algorithm that learns an embedding space to couple image appearance and semantics. The algorithm has an efficient gradient update, and scales up to large quantities of images with a stochastic extension. With the learned embedding space, we adopt rank-order clustering to identify groups of visually and semantically similar images, and re-annotate these groups for training AU classifiers. Evaluation on the 1 million EmotioNet dataset demonstrates the effectiveness of our approach: (1) our learned annotations reach on average 91.3% agreement with human annotations on 7 common AUs, (2) classifiers trained with re-annotated images perform comparably to, and sometimes even better than, their supervised CNN-based counterparts, and (3) our method offers intuitive outlier/noise pruning instead of forcing an annotation onto every image. Code is available.
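A rough sketch of the pipeline's overall shape: embed images, group visually and semantically similar ones, and re-annotate each group by majority vote over its noisy web labels. Here sklearn's unsupervised SpectralEmbedding and AgglomerativeClustering stand in for the paper's weakly-supervised spectral algorithm and rank-order clustering, and the data is synthetic.

```python
# Shape-of-the-pipeline sketch with stand-in components and synthetic data.
import numpy as np
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 64))        # image appearance features
noisy_labels = rng.integers(0, 2, size=500)  # inaccurate web annotations for one AU

embedded = SpectralEmbedding(n_components=8).fit_transform(features)
groups = AgglomerativeClustering(n_clusters=20).fit_predict(embedded)

# Re-annotate: every image in a group receives the group's majority label;
# small or label-ambiguous groups could instead be pruned as noise/outliers.
relabels = np.empty_like(noisy_labels)
for g in np.unique(groups):
    members = groups == g
    relabels[members] = np.bincount(noisy_labels[members]).argmax()
```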
Affiliation(s)
- Kaili Zhao
- School of Comm. and Info. Engineering, Beijing University of Posts and Telecommunications
- Aleix M Martinez
- Dept. of Electrical and Computer Engineering, The Ohio State University
5. Pei W, Dibeklioglu H, Tax DMJ, van der Maaten L. Multivariate Time-Series Classification Using the Hidden-Unit Logistic Model. IEEE Transactions on Neural Networks and Learning Systems 2018; 29:920-931. [PMID: 28141534] [DOI: 10.1109/tnnls.2017.2651018]
Abstract
We present a new model for multivariate time-series classification, called the hidden-unit logistic model (HULM), that uses binary stochastic hidden units to model latent structure in the data. The hidden units are connected in a chain structure that models temporal dependencies in the data. Compared with prior models for time-series classification, such as the hidden conditional random field, our model can represent very complex decision boundaries, because the number of latent states grows exponentially with the number of hidden units. We demonstrate the strong performance of our model in experiments on a variety of (computer vision) tasks, including handwritten character recognition, speech recognition, facial expression recognition, and action recognition. We also present a state-of-the-art system for facial action unit detection based on the HULM.
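For readers unfamiliar with this model family, a hedged sketch of the general hidden-unit conditional form (our notation; the paper's exact parameterization may differ):

```latex
% y is the class label, x_{1:T} the time series, z_t \in \{0,1\}^H the binary
% stochastic hidden units chained over time.
p(y \mid x_{1:T}) \;=\;
  \frac{\sum_{z_{1:T}} \exp E\!\left(y, z_{1:T}, x_{1:T}\right)}
       {\sum_{y'} \sum_{z_{1:T}} \exp E\!\left(y', z_{1:T}, x_{1:T}\right)}
```

With H binary hidden units per time step the model has 2^H joint hidden states, which is the exponential growth the abstract refers to; the sums over z nonetheless stay tractable because, conditioned on the label, the hidden chains decouple and each can be marginalized by a forward-backward pass.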
6. Ko BC. A Brief Review of Facial Emotion Recognition Based on Visual Information. Sensors (Basel, Switzerland) 2018; 18:E401. [PMID: 29385749] [PMCID: PMC5856145] [DOI: 10.3390/s18020401]
Abstract
Facial emotion recognition (FER) is an important topic in the fields of computer vision and artificial intelligence owing to its significant academic and commercial potential. Although FER can be conducted using multiple sensors, this review focuses on studies that exclusively use facial images, because visual expressions are one of the main information channels in interpersonal communication. This paper provides a brief review of research in the field of FER conducted over the past decades. First, conventional FER approaches are described, along with a summary of the representative categories of FER systems and their main algorithms. Deep-learning-based FER approaches using deep networks that enable "end-to-end" learning are then presented. This review also covers an up-to-date hybrid deep-learning approach that combines a convolutional neural network (CNN) for the spatial features of an individual frame with long short-term memory (LSTM) for the temporal features of consecutive frames. In the later part of this paper, a brief review of publicly available evaluation metrics is given, together with a comparison against benchmark results, which serve as a standard for quantitative comparison of FER research. This review can serve as a brief guidebook for newcomers to the field of FER, providing basic knowledge and a general understanding of the latest state-of-the-art studies, as well as a reference for experienced researchers looking for productive directions for future work.
Affiliation(s)
- Byoung Chul Ko
- Department of Computer Engineering, Keimyung University, Daegu 42601, Korea
7
|
Wen-Sheng Chu, De la Torre F, Cohn JF. Selective Transfer Machine for Personalized Facial Expression Analysis. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 2017; 39:529-545. [PMID: 28113267 PMCID: PMC5400741 DOI: 10.1109/tpami.2016.2547397] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/13/2023]
Abstract
Automatic facial action unit (AU) and expression detection from videos is a long-standing problem. The problem is challenging in part because classifiers must generalize to previously unknown subjects who differ markedly in behavior and facial morphology (e.g., heavy versus delicate brows, smooth versus deeply etched wrinkles) from those on which the classifiers were trained. While some progress has been achieved through improved choices of features and classifiers, the challenge posed by individual differences among people remains. Person-specific classifiers would be a possible solution, but sufficient training data for them is typically unavailable. This paper addresses the problem of how to personalize a generic classifier without additional labels from the test subject. We propose a transductive learning method, which we refer to as a Selective Transfer Machine (STM), to personalize a generic classifier by attenuating person-specific mismatches. STM achieves this effect by simultaneously learning a classifier and re-weighting the training samples that are most relevant to the test subject. We compared STM to both generic classifiers and cross-domain learning methods on four benchmarks: CK+, GEMEP-FERA, RUFACS, and GFT. STM outperformed generic classifiers on all four.
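A toy sketch of the personalization idea: re-weight training samples toward the (unlabeled) test subject's distribution, then fit a weighted classifier. The paper solves the classifier and the weights jointly; here a crude one-shot kernel density-ratio supplies the weights, so this is only a loose approximation of STM.

```python
# Loose approximation of the re-weighting idea; not the authors' joint solver.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

def personalize(X_train, y_train, X_test, gamma=0.5):
    # Weight each training sample by its average similarity to the test
    # subject's samples, normalized by within-training similarity
    # (a crude density-ratio estimate).
    w = rbf_kernel(X_train, X_test, gamma=gamma).mean(axis=1)
    w /= rbf_kernel(X_train, X_train, gamma=gamma).mean(axis=1) + 1e-8
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train, sample_weight=w)  # generic data, personalized emphasis
    return clf

rng = np.random.default_rng(0)
Xtr, ytr = rng.normal(size=(200, 10)), rng.integers(0, 2, 200)
Xte = rng.normal(loc=0.3, size=(50, 10))        # unlabeled frames of the test subject
clf = personalize(Xtr, ytr, Xte)
```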
8. Chu WS, De la Torre F, Cohn JF, Messinger DS. A Branch-and-Bound Framework for Unsupervised Common Event Discovery. International Journal of Computer Vision 2017; 123:372-391. [PMID: 28943718] [DOI: 10.1007/s11263-017-0989-7]
Abstract
Event discovery aims to discover a temporal segment of interest, such as human behavior, actions, or activities. Most approaches to event discovery within or between time series use supervised learning. This becomes problematic when some relevant event labels are unknown or difficult to detect, or when not all possible combinations of events have been anticipated. To overcome these problems, this paper explores Common Event Discovery (CED), a new problem that aims to discover common events of variable-length segments in an unsupervised manner. A potential solution to CED is searching over all possible pairs of segments, which would incur a prohibitive quartic cost. In this paper, we propose an efficient branch-and-bound (B&B) framework that avoids exhaustive search while guaranteeing a globally optimal solution. To this end, we derive novel bounding functions for various commonality measures and provide extensions to multiple commonality discovery and accelerated search. The B&B framework takes as input any multidimensional signal that can be quantified into histograms. A generalization of the framework can be readily applied to discover events at the same or different times (synchrony and event commonality, respectively). We also consider extensions to video search and supervised event detection. The effectiveness of the B&B framework is evaluated on motion capture of deliberate behavior and on video of spontaneous facial behavior in diverse interpersonal contexts: interviews, small groups of young adults, and parent-infant face-to-face interaction.
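A generic best-first branch-and-bound skeleton over segment pairs, to make the search structure concrete: a node is a box of candidate (start, end) positions in both sequences, and `bound(node)` must upper-bound the commonality score of every segment pair inside the node. The paper derives such bounds for specific histogram-based commonality measures; the separable toy score below is only a self-contained demo of the mechanics.

```python
# Generic B&B skeleton; the demo score/bound are placeholders, not the
# paper's commonality measures.
import heapq

def branch_and_bound(bound, score, ranges):
    """Best-first B&B. `ranges` is a tuple of half-open intervals over
    (b1, e1, b2, e2); bound(node) upper-bounds score over the node."""
    heap = [(-bound(ranges), ranges)]
    best_val, best_tuple = float("-inf"), None
    while heap:
        neg_ub, node = heapq.heappop(heap)
        if -neg_ub <= best_val:
            break                                   # incumbent is provably optimal
        if all(hi - lo == 1 for lo, hi in node):    # leaf: one concrete segment pair
            val = score(tuple(lo for lo, _ in node))
            if val > best_val:
                best_val, best_tuple = val, tuple(lo for lo, _ in node)
            continue
        # Branch: split the widest coordinate range in half.
        i = max(range(4), key=lambda j: node[j][1] - node[j][0])
        lo, hi = node[i]
        mid = (lo + hi) // 2
        for part in ((lo, mid), (mid, hi)):
            child = node[:i] + (part,) + node[i + 1:]
            heapq.heappush(heap, (-bound(child), child))
    return best_val, best_tuple

TARGET = (3, 7, 2, 9)                               # toy optimum for the demo

def demo_score(t):                                  # separable toy score
    return -sum(abs(v - c) for v, c in zip(t, TARGET))

def demo_bound(node):                               # exact box-wise upper bound
    total = 0
    for (lo, hi), c in zip(node, TARGET):
        if c < lo: total += lo - c
        elif c >= hi: total += c - (hi - 1)
    return -total

print(branch_and_bound(demo_bound, demo_score, tuple((0, 16) for _ in range(4))))
```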
Affiliation(s)
- Jeffrey F Cohn
- Robotics Institute, Carnegie Mellon University, USA
- Department of Psychology, University of Pittsburgh, USA
9
|
De la Torre F, Cohn JF. Confidence Preserving Machine for Facial Action Unit Detection. IEEE TRANSACTIONS ON IMAGE PROCESSING : A PUBLICATION OF THE IEEE SIGNAL PROCESSING SOCIETY 2016; 25:4753-4767. [PMID: 27479964 PMCID: PMC5272912 DOI: 10.1109/tip.2016.2594486] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/26/2023]
Abstract
Facial action unit (AU) detection from video has been a long-standing problem in automated facial expression analysis. While progress has been made, accurate detection of facial AUs remains challenging due to ubiquitous sources of error, such as inter-personal variability, pose, and low-intensity AUs. In this paper, we refer to samples causing such errors as hard samples, and to the remaining samples as easy samples. To address learning with hard samples, we propose the Confidence Preserving Machine (CPM), a novel two-stage learning framework that combines multiple classifiers following an "easy-to-hard" strategy. During the training stage, CPM learns two confident classifiers. Each classifier focuses on separating easy samples of one class from all else, and thus preserves confidence in predicting each class. During the test stage, the confident classifiers provide "virtual labels" for easy test samples. Given the virtual labels, we propose a quasi-semi-supervised (QSS) learning strategy to learn a person-specific classifier. The QSS strategy employs spatio-temporal smoothness, which encourages similar predictions for samples within a spatio-temporal neighborhood. In addition, to further improve detection performance, we introduce two CPM extensions: iterative CPM, which iteratively augments the training samples used to train the confident classifiers, and kernel CPM, which kernelizes the original CPM model to promote nonlinearity. Experiments on four spontaneous datasets, GFT, BP4D, DISFA, and RU-FACS, illustrate the benefits of the proposed CPM models over baseline methods and state-of-the-art semi-supervised and transfer learning methods.
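A toy sketch of the easy-to-hard idea: a confident classifier labels only the easy (high-confidence) test samples with "virtual labels", and a person-specific classifier is then trained on those and applied to the hard samples. This simplification collapses the paper's pair of confident classifiers into one and omits the QSS spatio-temporal smoothness term.

```python
# Simplified easy-to-hard sketch; omits the paper's QSS smoothness stage.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cpm_like(X_train, y_train, X_test, margin=0.9):
    # Stage 1: a generic classifier provides virtual labels for easy test
    # samples only, i.e., those predicted with high confidence.
    confident = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = confident.predict_proba(X_test)[:, 1]
    easy = (proba >= margin) | (proba <= 1 - margin)
    virtual_y = (proba[easy] > 0.5).astype(int)
    if len(np.unique(virtual_y)) < 2:               # too few easy samples to adapt
        return confident.predict(X_test)
    # Stage 2: a person-specific classifier is trained on the virtual labels
    # and applied to all frames, including the hard ones.
    personal = LogisticRegression(max_iter=1000).fit(X_test[easy], virtual_y)
    return personal.predict(X_test)
```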
10. De la Torre F, Cohn JF. Joint Patch and Multi-label Learning for Facial Action Unit and Holistic Expression Recognition. IEEE Transactions on Image Processing 2016; 25:3931-3946. [PMID: 28113424] [DOI: 10.1109/tip.2016.2570550]
Abstract
Most action unit (AU) detection methods use one-versus-all classifiers without considering dependencies between features or AUs. In this paper, we introduce a joint patch and multi-label learning (JPML) framework that models the structured joint dependencies among features, AUs, and their interplay. In particular, JPML leverages group sparsity to identify important facial patches, and learns a multi-label classifier constrained by the likelihood of co-occurring AUs. To describe such likelihood, we derive two AU relations, positive correlation and negative competition, by statistically analyzing more than 350,000 video frames annotated with multiple AUs. To the best of our knowledge, this is the first work that jointly addresses patch learning and multi-label learning for AU detection. In addition, we show that JPML can be extended to recognize holistic expressions by learning common and specific patches, which afford a more compact representation than standard expression recognition methods. We evaluate JPML on three benchmark datasets, CK+, BP4D, and GFT, using within- and cross-dataset scenarios. In four of five experiments, JPML achieved the highest average F1 scores in comparison with baseline and alternative methods that use either patch learning or multi-label learning alone.
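A loose illustration of the group-sparse patch-selection idea (which also underlies the conference version in entry 12): MultiTaskLasso zeroes out whole feature rows shared across all labels, so features (here, per-patch descriptors) are kept or dropped jointly across AUs. It is only a stand-in for JPML's patch-level group sparsity; it uses continuous targets for the demo and omits JPML's positive-correlation/negative-competition constraints. All sizes and the data are synthetic.

```python
# Stand-in for group-sparse joint patch selection; not the JPML solver.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n_frames, n_patches, feat_per_patch, n_aus = 1000, 16, 8, 6
X = rng.normal(size=(n_frames, n_patches * feat_per_patch))
Y = rng.normal(size=(n_frames, n_aus))           # continuous AU targets, demo only

model = MultiTaskLasso(alpha=0.1).fit(X, Y)
row_active = np.any(model.coef_ != 0, axis=0)    # coef_: (n_aus, n_features)
patch_active = row_active.reshape(n_patches, feat_per_patch).any(axis=1)
print("selected facial patches:", np.flatnonzero(patch_active))
```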
11. Sariyanidi E, Gunes H, Cavallaro A. Automatic Analysis of Facial Affect: A Survey of Registration, Representation, and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2015; 37:1113-1133. [PMID: 26357337] [DOI: 10.1109/tpami.2014.2366127]
Abstract
Automatic affect analysis has attracted great interest in various contexts, including the recognition of action units and basic or non-basic emotions. In spite of major efforts, several open questions remain about which cues are important for interpreting facial expressions and how to encode them. In this paper, we review the progress across a range of affect recognition applications to shed light on these fundamental questions. We analyse the state-of-the-art solutions by decomposing their pipelines into fundamental components, namely face registration, representation, dimensionality reduction, and recognition. We discuss the role of these components and highlight the models and new trends that are followed in their design. Moreover, we provide a comprehensive analysis of facial representations by uncovering their advantages and limitations; we elaborate on the type of information they encode and discuss how they deal with the key challenges of illumination variations, registration errors, head-pose variations, occlusions, and identity bias. This survey allows us to identify open issues and to define future directions for designing real-world affect recognition systems.
12. Zhao K, Chu WS, De la Torre F, Cohn JF, Zhang H. Joint Patch and Multi-label Learning for Facial Action Unit Detection. Proceedings, IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2015; 2015:2207-2216. [PMID: 27382243] [PMCID: PMC4930865] [DOI: 10.1109/cvpr.2015.7298833]
Abstract
The face is one of the most powerful channels of nonverbal communication. The most commonly used taxonomy to describe facial behaviour is the Facial Action Coding System (FACS). FACS segments the visible effects of facial muscle activation into 30+ action units (AUs). AUs, which may occur alone or in thousands of combinations, can describe nearly all possible facial expressions. Most existing methods for automatic AU detection treat the problem using one-vs-all classifiers and fail to exploit dependencies among AUs and facial features. We introduce joint patch and multi-label learning (JPML) to address these issues. JPML leverages group sparsity by selecting a sparse subset of facial patches while learning a multi-label classifier. In four of five comparisons on three diverse datasets, CK+, GFT, and BP4D, JPML produced the highest average F1 scores in comparison with the state of the art.
Affiliation(s)
- Kaili Zhao
- School of Comm. and Info. Engineering, Beijing University of Posts and Telecommunications, Beijing, China
- Wen-Sheng Chu
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
- Jeffrey F Cohn
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213; Department of Psychology, University of Pittsburgh, Pittsburgh, PA 15260
- Honggang Zhang
- School of Comm. and Info. Engineering, Beijing University of Posts and Telecommunications, Beijing, China
13. De la Torre F, Chu WS, Xiong X, Vicente F, Ding X, Cohn J. IntraFace. IEEE International Conference on Automatic Face & Gesture Recognition and Workshops 2015; 1. [PMID: 27346987] [PMCID: PMC4918819] [DOI: 10.1109/fg.2015.7163082]
Abstract
Within the last 20 years, there has been increasing interest in the computer vision community in automated facial image analysis algorithms. This has been driven by applications in animation, market research, autonomous driving, surveillance, and facial editing, among others. To date, there exist several commercial packages for specific facial image analysis tasks such as facial expression recognition, facial attribute analysis, or face tracking. However, free and easy-to-use software that incorporates all these functionalities has been unavailable. This paper presents IntraFace (IF), a publicly available software package for automated facial feature tracking, head pose estimation, facial attribute recognition, and facial expression analysis from video. In addition, IF includes a newly developed technique for unsupervised synchrony detection to discover correlated facial behavior between two or more persons, a relatively unexplored problem in facial image analysis. In tests, IF achieved state-of-the-art results for emotion expression and action unit detection on three databases, FERA, CK+, and RU-FACS; measured audience reaction to a talk given by one of the authors; and discovered synchrony for smiling in videos of parent-infant interaction. IF is free of charge for academic use at http://www.humansensing.cs.cmu.edu/intraface/.
Affiliation(s)
- Wen-Sheng Chu
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
- Xuehan Xiong
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
- Francisco Vicente
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
- Xiaoyu Ding
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213
- Jeffrey Cohn
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213; Department of Psychology, University of Pittsburgh, Pittsburgh, PA 15260