1
Fu J, Gao J, Xu C. Semantic and Temporal Contextual Correlation Learning for Weakly-Supervised Temporal Action Localization. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:12427-12443. [PMID: 37335790] [DOI: 10.1109/tpami.2023.3287208]
Abstract
Weakly-supervised temporal action localization (WSTAL) aims to automatically identify and localize action instances in untrimmed videos with only video-level labels as supervision. This task poses two challenges: (1) how to accurately discover the action categories in an untrimmed video (what to discover); and (2) how to focus precisely on the integral temporal interval of each action instance (where to focus). Empirically, discovering the action categories requires extracting discriminative semantic information, while robust temporal contextual information is beneficial for complete action localization. However, most existing WSTAL methods neglect to explicitly and jointly model the semantic and temporal contextual correlation information needed for these two challenges. In this article, a Semantic and Temporal Contextual Correlation Learning Network (STCL-Net) with semantic (SCL) and temporal contextual correlation learning (TCL) modules is proposed, which achieves both accurate action discovery and complete action localization by modeling the semantic and temporal contextual correlation information for each snippet in inter- and intra-video manners, respectively. Notably, the two proposed modules are both designed in a unified dynamic correlation-embedding paradigm. Extensive experiments are performed on different benchmarks. On all benchmarks, the proposed method exhibits superior or comparable performance relative to existing state-of-the-art models, achieving gains as high as 7.2% in average mAP on THUMOS-14. Comprehensive ablation studies also verify the effectiveness and robustness of each component of the model.
2
Liu Z, Wu S, Jin S, Ji S, Liu Q, Lu S, Cheng L. Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023; 45:681-697. [PMID: 34982672] [DOI: 10.1109/tpami.2021.3139918]
Abstract
Predicting human motion from a historical pose sequence is crucial for a machine to succeed in intelligent interactions with humans. One aspect that has been overlooked so far is that how we represent the skeletal pose has a critical impact on prediction results, yet no existing effort systematically investigates different pose representation schemes. We conduct an in-depth study of various pose representations with a focus on their effects on the motion prediction task. Moreover, recent approaches build upon off-the-shelf RNN units for motion prediction. These approaches process the input pose sequence sequentially and inherently have difficulty capturing long-term dependencies. In this paper, we propose a novel RNN architecture termed AHMR (Attentive Hierarchical Motion Recurrent network) for motion prediction, which simultaneously models local motion contexts and a global context. We further explore a geodesic loss and a forward kinematics loss for the motion prediction task, which have more geometric significance than the widely employed L2 loss. Interestingly, we apply our method to a range of articulated objects including humans, fish, and mice. Empirical results show that our approach outperforms state-of-the-art methods in short-term prediction and achieves much better long-term prediction proficiency, such as retaining natural, human-like motions over predictions of 50 seconds. Our code is released.
3
Szczapa B, Daoudi M, Berretti S, Pala P, Del Bimbo A, Hammal Z. Automatic Estimation of Self-Reported Pain by Trajectory Analysis in the Manifold of Fixed Rank Positive Semi-Definite Matrices. IEEE Transactions on Affective Computing 2022; 13:1813-1826. [PMID: 36452255] [PMCID: PMC9708064] [DOI: 10.1109/taffc.2022.3207001]
Abstract
We propose an automatic method to estimate self-reported pain based on facial landmarks extracted from videos. For each video sequence, we decompose the face into four regions, and pain intensity is measured by modeling the dynamics of facial movement using the landmarks of these regions. A formulation based on Gram matrices is used to represent the trajectory of landmarks on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. A curve fitting algorithm is used to smooth the trajectories, and temporal alignment is performed to compute the similarity between trajectories on the manifold. A Support Vector Regression model is then trained to map the extracted trajectories to pain intensity levels consistent with self-reported pain intensity measurements. Finally, a late fusion of the per-region estimates yields the final predicted pain level. The proposed approach is evaluated on two publicly available datasets, the UNBC-McMaster Shoulder Pain Archive and the BioVid Heat Pain dataset. We compare our method to the state-of-the-art on both datasets using different testing protocols, showing the competitiveness of the proposed approach.
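The Gram-matrix representation described above can be sketched concretely: centering the landmarks removes translation, and G = ZZᵀ is a positive semi-definite matrix of rank at most 2, i.e., a point on the fixed-rank PSD manifold. A minimal numpy illustration (array shapes and names are ours, not the paper's code):

```python
import numpy as np

def gram_matrix(landmarks):
    """Map an (n, 2) landmark configuration to an n x n Gram matrix.

    Centering removes translation; G = Z Z^T is positive semi-definite
    with rank at most 2, i.e., a point on the fixed-rank PSD manifold.
    """
    z = landmarks - landmarks.mean(axis=0)   # remove translation
    return z @ z.T

def trajectory(sequence):
    """Represent a video as the trajectory of per-frame Gram matrices."""
    return [gram_matrix(frame) for frame in sequence]

rng = np.random.default_rng(0)
seq = rng.standard_normal((30, 66, 2))       # 30 frames, 66 2D landmarks (illustrative sizes)
traj = trajectory(seq)
g = traj[0]                                  # symmetric PSD, rank <= 2
```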
Affiliation(s)
- Benjamin Szczapa
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Mohamed Daoudi
- IMT Nord Europe, Institut Mines-Télécom, Univ. Lille, Centre for Digital Systems, F-59000 Lille, France, and Univ. Lille, CNRS, Centrale Lille, Institut Mines-Télécom, UMR 9189 CRIStAL, F-59000 Lille, France
- Stefano Berretti
- Department of Information Engineering, University of Florence, Italy
- Pietro Pala
- Department of Information Engineering, University of Florence, Italy
- Alberto Del Bimbo
- Department of Information Engineering, University of Florence, Italy
- Zakia Hammal
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
4
Abstract
In industrial production, accidents caused by the unsafe behavior of operators often bring serious economic losses. Therefore, how to use artificial intelligence technology to monitor the unsafe behavior of operators in a production area in real time has become a research topic of great concern. Based on the YOLOv5 framework, this paper proposes an improved YOLO network to detect unsafe behaviors such as not wearing safety helmets and smoking in industrial places. First, the proposed network uses a novel adaptive self-attention embedding (ASAE) model to improve the backbone network and reduce the loss of context information in the high-level feature map by reducing the number of feature channels. Second, a new weighted feature pyramid network (WFPN) module is used to replace the original enhanced feature-extraction network PANet to alleviate the loss of feature information caused by too many network layers. Finally, the experimental results on the self-constructed behavior dataset show that the proposed framework has higher detection accuracy than traditional methods. The average detection accuracy of smoking increased by 3.3%, and the average detection accuracy of not wearing a helmet increased by 3.1%.
5
A Framework for Short Video Recognition Based on Motion Estimation and Feature Curves on SPD Manifolds. Applied Sciences (Basel) 2022. [DOI: 10.3390/app12094669]
Abstract
Given the prosperity of video media such as TikTok and YouTube, the need for short video recognition is becoming increasingly urgent. A significant feature of short videos is that they contain few scene switches, and the target (e.g., the face of the key person in the video) often persists throughout. This paper presents a new short video recognition framework that transforms a short video into a family of feature curves on the symmetric positive definite (SPD) manifold as the basis of recognition. To date, no similar algorithm has been reported. Experimental results suggest that our method performs better on three challenging databases than seven other related algorithms published in top venues.
6
Otberdout N, Daoudi M, Kacem A, Ballihi L, Berretti S. Dynamic Facial Expression Generation on Hilbert Hypersphere With Conditional Wasserstein Generative Adversarial Nets. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:848-863. [PMID: 32750786] [DOI: 10.1109/tpami.2020.3002500]
Abstract
In this work, we propose a novel approach for generating videos of the six basic facial expressions given a neutral face image. We exploit face geometry by modeling facial landmark motion as curves encoded as points on a hypersphere. By proposing a conditional version of a manifold-valued Wasserstein generative adversarial network (GAN) for motion generation on the hypersphere, we learn the distribution of facial expression dynamics of different classes, from which we synthesize new facial expression motions. The resulting motions can be transformed into sequences of landmarks and then into image sequences by editing the texture information using another conditional GAN. To the best of our knowledge, this is the first work that explores manifold-valued representations with GANs to address dynamic facial expression generation. We evaluate the proposed approach both quantitatively and qualitatively on two public datasets: Oulu-CASIA and MUG Facial Expression. Our experimental results demonstrate the effectiveness of the approach in generating realistic videos with continuous motion, realistic appearance, and identity preservation. We also show the efficiency of our framework for dynamic facial expression generation, dynamic facial expression transfer, and data augmentation for training improved emotion recognition models.
7
Szczapa B, Daoudi M, Berretti S, Pala P, Del Bimbo A, Hammal Z. Automatic Estimation of Self-Reported Pain by Interpretable Representations of Motion Dynamics. Proceedings of the ... IAPR International Conference on Pattern Recognition 2021; 2020. [PMID: 34651145] [DOI: 10.1109/icpr48806.2021.9412292]
Abstract
We propose an automatic method for pain intensity measurement from video. For each video, pain intensity was measured from the dynamics of facial movement using 66 facial points. A Gram matrix formulation was used to represent facial point trajectories on the Riemannian manifold of symmetric positive semi-definite matrices of fixed rank. Curve fitting and temporal alignment were then used to smooth the extracted trajectories. A Support Vector Regression model was then trained to encode the extracted trajectories into ten pain intensity levels consistent with the Visual Analogue Scale for pain intensity measurement. The proposed approach was evaluated using the UNBC-McMaster Shoulder Pain Archive and compared to the state-of-the-art on the same data. Using both 5-fold cross-validation and leave-one-subject-out cross-validation, our results are competitive with state-of-the-art methods.
Affiliation(s)
- Benjamin Szczapa
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France; Department of Information Engineering, University of Florence, Italy
- Mohamed Daoudi
- IMT Lille Douai, Univ. Lille, CNRS, UMR 9189 CRIStAL, F-59000 Lille, France
- Stefano Berretti
- Department of Information Engineering, University of Florence, Italy
- Pietro Pala
- Department of Information Engineering, University of Florence, Italy
- Alberto Del Bimbo
- Department of Information Engineering, University of Florence, Italy
- Zakia Hammal
- Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
8
Youssfi Alaoui A, Tabii Y, Oulad Haj Thami R, Daoudi M, Berretti S, Pala P. Fall Detection of Elderly People Using the Manifold of Positive Semidefinite Matrices. Journal of Imaging 2021; 7:109. [PMID: 39080897] [PMCID: PMC8321381] [DOI: 10.3390/jimaging7070109] [Received: 04/20/2021] [Revised: 06/14/2021] [Accepted: 06/23/2021]
Abstract
Falls are one of the most critical health care risks for elderly people, being, in some adverse circumstances, an indirect cause of death. Furthermore, demographic forecasts show a growing elderly population worldwide. In this context, models for automatic fall detection and prediction are of paramount relevance, especially AI applications that use ambient sensors or computer vision. In this paper, we present an approach for fall detection using computer vision techniques. Video sequences of a person in a closed environment are used as inputs to our algorithm. Our approach involves four steps: (1) the 2D body skeleton is detected by the V2V-PoseNet model in each frame; (2) the skeleton joints are mapped onto the Riemannian manifold of positive semidefinite matrices of fixed rank 2 to build time-parameterized trajectories; (3) a temporal warping is performed on the trajectories, providing a (dis-)similarity measure between them; (4) finally, a pairwise-proximity-function SVM is used to classify them as fall or non-fall, incorporating the (dis-)similarity measure into the kernel function. We evaluated our approach on two publicly available datasets, URFD and Charfi. The results are competitive with state-of-the-art methods, while only involving 2D body skeletons.
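Step (3) above performs a temporal warping to compare trajectories of different lengths; the specific alignment is not given in this abstract, but dynamic time warping (DTW) is the standard construction. A hedged numpy sketch using a Frobenius ground distance between per-frame matrices:

```python
import numpy as np

def dtw_distance(a, b, dist):
    """Dynamic time warping cost between sequences a and b.

    `dist(x, y)` is the ground distance between two frames; the DP
    recurrence allows match, insertion, and deletion steps.
    """
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def frob(x, y):
    """Frobenius ground distance between two per-frame matrices."""
    return np.linalg.norm(x - y)

# A sequence aligned with a time-stretched copy of itself has zero cost.
seq = [np.eye(2) * t for t in range(5)]
stretched = [seq[0], seq[0]] + seq[1:]
```

Because the warping absorbs differences in execution speed, the resulting (dis-)similarity can be plugged into a kernel, as in step (4) of the paper.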
Affiliation(s)
- Abdessamad Youssfi Alaoui
- ADMIR Laboratory, Rabat IT Center, IRDA Team, ENSIAS, Mohammed V University in Rabat, Rabat 10000, Morocco
- Youness Tabii
- ADMIR Laboratory, Rabat IT Center, IRDA Team, ENSIAS, Mohammed V University in Rabat, Rabat 10000, Morocco
- Rachid Oulad Haj Thami
- ADMIR Laboratory, Rabat IT Center, IRDA Team, ENSIAS, Mohammed V University in Rabat, Rabat 10000, Morocco
- Mohamed Daoudi
- IMT Lille Douai, Institut Mines-Télécom, Centre for Digital Systems, F-59000 Lille, France
- CNRS, Centrale Lille, Institut Mines-Télécom, UMR 9189 CRIStAL, University Lille, F-59000 Lille, France
- Stefano Berretti
- Department of Information Engineering, University of Florence, 50121 Florence, Italy
- Pietro Pala
- Department of Information Engineering, University of Florence, 50121 Florence, Italy
9
Ullah FUM, Obaidat MS, Muhammad K, Ullah A, Baik SW, Cuzzolin F, Rodrigues JJPC, Albuquerque VHC. An intelligent system for complex violence pattern analysis and detection. International Journal of Intelligent Systems 2021. [DOI: 10.1002/int.22537]
Affiliation(s)
- Mohammad S. Obaidat
- Founding Dean and Professor, College of Computing and Informatics, University of Sharjah, Sharjah, UAE
- King Abdullah II School of Information Technology, University of Jordan, Amman, Jordan
- University of Science and Technology Beijing, China
- Khan Muhammad
- School of Convergence, College of Computing and Informatics, Sungkyunkwan University, Seoul, South Korea
- Fabio Cuzzolin
- School of Engineering, Computing and Mathematics, Oxford Brookes University, Oxford, UK
- Joel J. P. C. Rodrigues
- Federal University of Piauí (UFPI), Teresina, Brazil
- Instituto de Telecomunicações, Covilhã, Portugal
- Victor Hugo C. Albuquerque
- Graduate Program on Teleinformatics Engineering, Federal University of Ceará, Fortaleza/CE, Brazil
- Graduate Program on Electrical Engineering, Federal University of Ceará, Fortaleza/CE, Brazil
Collapse
|
10
|
Tanfous AB, Drira H, Amor BB. Sparse Coding of Shape Trajectories for Facial Expression and Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020; 42:2594-2607. [PMID: 31395537] [DOI: 10.1109/tpami.2019.2932979]
Abstract
The detection and tracking of human landmarks in video streams has gained in reliability, partly due to the availability of affordable RGB-D sensors. The analysis of such time-varying geometric data plays an important role in automatic human behavior understanding. However, suitable shape representations, as well as their temporal evolution, termed trajectories, often lie on nonlinear manifolds. This puts an additional constraint (i.e., nonlinearity) on the use of conventional machine learning techniques. As a solution, this paper adapts the well-known Sparse Coding and Dictionary Learning approach to study time-varying shapes on the Kendall shape spaces of 2D and 3D landmarks. We illustrate effective coding of 3D skeletal sequences for action recognition and of 2D facial landmark sequences for macro- and micro-expression recognition. To overcome the inherent nonlinearity of the shape spaces, intrinsic and extrinsic solutions were explored. As main results, shape trajectories give rise to more discriminative time series with suitable computational properties, including sparsity and a vector space structure. Extensive experiments conducted on commonly used datasets demonstrate the competitiveness of the proposed approaches with respect to the state-of-the-art.
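Kendall shape spaces quotient out translation, scale, and rotation; the first two are removed by projecting each landmark configuration onto the preshape sphere. A minimal numpy sketch of that normalization (rotation alignment is omitted here, and the function name is ours):

```python
import numpy as np

def preshape(landmarks):
    """Project a (k, d) landmark configuration onto the preshape sphere.

    Centering removes translation and unit-norm scaling removes scale;
    rotation is quotiented out separately in the Kendall shape space.
    """
    z = landmarks - landmarks.mean(axis=0)
    norm = np.linalg.norm(z)
    if norm == 0:
        raise ValueError("degenerate configuration")
    return z / norm

x = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0]])
p = preshape(x)
# Any translated and rescaled copy maps to the same preshape point.
q = preshape(3.0 * x + 5.0)
```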
11
Otberdout N, Kacem A, Daoudi M, Ballihi L, Berretti S. Automatic Analysis of Facial Expressions Based on Deep Covariance Trajectories. IEEE Transactions on Neural Networks and Learning Systems 2020; 31:3892-3905. [PMID: 31725395] [DOI: 10.1109/tnnls.2019.2947244]
Abstract
In this article, we propose a new approach for facial expression recognition (FER) using deep covariance descriptors. The solution is based on the idea of encoding local and global deep convolutional neural network (DCNN) features extracted from still images into compact local and global covariance descriptors. The space geometry of the covariance matrices is that of symmetric positive definite (SPD) matrices. By classifying static facial expressions using a support vector machine (SVM) with a valid Gaussian kernel on the SPD manifold, we show that deep covariance descriptors are more effective than standard classification with fully connected layers and softmax. In addition, we propose a novel solution that models the temporal dynamics of facial expressions as deep trajectories on the SPD manifold. As an extension of the classification pipeline of covariance descriptors, we apply SVM with valid positive definite kernels derived from global alignment for deep covariance trajectory classification. Through extensive experiments on the Oulu-CASIA, CK+, static facial expressions in the wild (SFEW), and acted facial expressions in the wild (AFEW) datasets, we show that both the proposed static and dynamic approaches achieve state-of-the-art performance for FER, outperforming many recent approaches.
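The covariance-descriptor step can be sketched as follows: stack local DCNN feature vectors, compute their covariance, and regularize it to be strictly positive definite so it lies on the SPD manifold. The log-Euclidean distance below is one common choice of SPD geometry, shown only for illustration (the paper builds a Gaussian kernel on the manifold; shapes and names here are our assumptions):

```python
import numpy as np

def covariance_descriptor(features, eps=1e-5):
    """Covariance descriptor of a set of feature vectors.

    `features` has shape (n, d): n local feature vectors of dimension d.
    A small ridge `eps * I` makes the matrix strictly positive definite,
    so it lies on the SPD manifold.
    """
    z = features - features.mean(axis=0)
    c = (z.T @ z) / (len(features) - 1)
    return c + eps * np.eye(features.shape[1])

def log_euclidean(c1, c2):
    """Log-Euclidean distance, a common surrogate for SPD geometry."""
    def logm(c):
        # Matrix logarithm via the eigendecomposition of an SPD matrix.
        w, v = np.linalg.eigh(c)
        return (v * np.log(w)) @ v.T
    return np.linalg.norm(logm(c1) - logm(c2))

rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 8))   # e.g., 50 spatial positions, 8 channels
c = covariance_descriptor(feats)       # symmetric positive definite
```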