1
Zhang Y, Li X, Xie H, Zhuang W, Guo S, Li Z. Multi-Label Action Anticipation for Real-World Videos With Scene Understanding. IEEE Trans Image Process 2024; 33:3242-3255. [PMID: 38662558] [DOI: 10.1109/tip.2024.3391692]
Abstract
With human action anticipation becoming an essential tool for many practical applications, there has been an increasing trend toward developing more accurate anticipation models in recent years. Most existing methods target standard action anticipation datasets, on which they produce promising results by learning action-level contextual patterns. However, the over-simplified scenarios of standard datasets often do not hold in reality, which hinders these methods from being applied to real-world applications. To address this, we propose SEAD, a novel scene-graph-based model that learns action anticipation at a high semantic level rather than at the action level. The proposed model is composed of two main modules: 1) the scene prediction module, which predicts future scene graphs using a grammar dictionary, and 2) the action anticipation module, which predicts future actions with an LSTM network that takes the observed and predicted scene graphs as input. We evaluate our model on two real-world video datasets (Charades and Home Action Genome) as well as a standard action anticipation dataset (CAD-120) to verify its efficacy. The experimental results show that SEAD outperforms existing methods by large margins on the two real-world datasets while also yielding stable predictions on the standard dataset. In particular, our proposed model surpasses the state-of-the-art methods with mean average precision improvements consistently higher than 65% on the Charades dataset and an average improvement of 40.6% on the Home Action Genome dataset.
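To make the two-stage idea above concrete, the following is a minimal, hypothetical sketch of an LSTM-based anticipation head operating on pooled scene-graph embeddings. The module names, feature sizes, and pooling are assumptions for illustration, not details taken from the paper.

# Hypothetical sketch: anticipate future multi-label actions from a sequence of
# scene-graph embeddings (observed + predicted). Names and shapes are assumptions.
import torch
import torch.nn as nn

class ActionAnticipationLSTM(nn.Module):
    def __init__(self, graph_dim=256, hidden_dim=512, num_actions=157):
        super().__init__()
        self.lstm = nn.LSTM(graph_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_actions)  # multi-label logits

    def forward(self, scene_graph_embeddings):
        # scene_graph_embeddings: (batch, time, graph_dim), one pooled vector per
        # observed or predicted scene graph
        _, (h_n, _) = self.lstm(scene_graph_embeddings)
        return self.classifier(h_n[-1])  # one logit per future action class

model = ActionAnticipationLSTM()
logits = model(torch.randn(2, 10, 256))   # 2 clips, 10 scene graphs each
probs = torch.sigmoid(logits)             # independent per-action probabilities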
2
Yu X, Yi H, Tang Q, Huang K, Hu W, Zhang S, Wang X. Graph-based social relation inference with multi-level conditional attention. Neural Netw 2024; 173:106216. [PMID: 38442650] [DOI: 10.1016/j.neunet.2024.106216]
Abstract
Social relation inference intrinsically requires high-level semantic understanding. To accurately infer the relations of persons in images, one needs not only to understand the scenes and objects in images but also to adaptively attend to important clues. Unlike prior works that classify social relations using attention on detected objects, we propose a MUlti-level Conditional Attention (MUCA) mechanism for social relation inference, which attends to scenes, objects, and human interactions conditioned on each person pair. We then develop a transformer-style network to realize the MUCA mechanism. The novel network, named Graph-based Relation Inference Transformer (GRIT), consists of two modules: a Conditional Query Module (CQM) and a Relation Attention Module (RAM). Specifically, we design a graph-based CQM to generate informative relation queries for all person pairs, fusing local features and global context for each pair. Moreover, we take full advantage of transformer-style networks in RAM for multi-level attention in classifying social relations. To the best of our knowledge, GRIT is the first method to infer social relations with multi-level conditional attention. GRIT is end-to-end trainable and significantly outperforms existing methods on two benchmark datasets, with performance improvements of 7.8% on PIPA and 9.6% on PISC.
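As an illustration of the conditional-attention idea described above, here is a minimal, hypothetical sketch in which a per-pair relation query cross-attends over scene, object, and interaction tokens. The layer sizes, the single attention layer, and the concatenation-based query fusion are assumptions rather than the paper's actual GRIT design.

# Hypothetical sketch: a relation query conditioned on a person pair attends
# over multi-level context tokens (scene, objects, interactions).
import torch
import torch.nn as nn

class PairConditionedAttention(nn.Module):
    def __init__(self, dim=256, heads=8, num_relations=6):
        super().__init__()
        self.query_proj = nn.Linear(2 * dim, dim)            # fuse the two person features
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_relations)

    def forward(self, person_a, person_b, context_tokens):
        # person_a, person_b: (batch, dim); context_tokens: (batch, n_tokens, dim)
        query = self.query_proj(torch.cat([person_a, person_b], dim=-1)).unsqueeze(1)
        attended, _ = self.cross_attn(query, context_tokens, context_tokens)
        return self.classifier(attended.squeeze(1))          # social-relation logits

layer = PairConditionedAttention()
logits = layer(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 20, 256))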
Affiliation(s)
- Xiaotian Yu
- Department of AI Technology Center, Shenzhen Intellifusion Ltd., China
- Hanling Yi
- Department of AI Technology Center, Shenzhen Intellifusion Ltd., China
- Qie Tang
- Department of AI Technology Center, Shenzhen Intellifusion Ltd., China
- Kun Huang
- Department of AI Technology Center, Shenzhen Intellifusion Ltd., China
- Wenze Hu
- Department of AI Technology Center, Shenzhen Intellifusion Ltd., China
- Shiliang Zhang
- Department of Computer Science, Peking University, China
- Xiaoyu Wang
- Department of AI Technology Center, Shenzhen Intellifusion Ltd., China; The Chinese University of Hong Kong (Shenzhen), China
3
Bai J, Qin H, Lai S, Guo J, Guo Y. GLPanoDepth: Global-to-Local Panoramic Depth Estimation. IEEE Trans Image Process 2024; 33:2936-2949. [PMID: 38619939] [DOI: 10.1109/tip.2024.3386403]
Abstract
Depth estimation is a fundamental task in many vision applications. With the growing popularity of omnidirectional cameras, tackling this problem in spherical space has become a new trend. In this paper, we propose a learning-based method for predicting dense depth values of a scene from a monocular omnidirectional image. An omnidirectional image has a full field of view, providing a much more complete description of the scene than perspective images. However, the fully convolutional networks that most current solutions rely on fail to capture rich global contexts from the panorama. To address this issue, as well as the distortion of equirectangular projection in the panorama, we propose Cubemap Vision Transformers (CViT), a new transformer-based architecture that can model long-range dependencies and extract distortion-free global features from the panorama. We show that cubemap vision transformers have a global receptive field at every stage and can provide globally coherent predictions for spherical signals. As a general architecture, it removes restrictions imposed on the panorama by many other monocular panoramic depth estimation methods. To preserve important local features, we further design a convolution-based branch in our pipeline (dubbed GLPanoDepth) and fuse global features from the cubemap vision transformers at multiple scales. This global-to-local strategy allows us to fully exploit useful global and local features in the panorama, achieving state-of-the-art performance in panoramic depth estimation.
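The global-to-local fusion described above can be pictured with the following minimal, hypothetical sketch, which merges a transformer-branch (global) feature map with a convolution-branch (local) feature map at a single scale. The channel count and the concatenate-then-convolve fusion are assumptions; the cubemap projection and multi-scale design are omitted.

# Hypothetical sketch: fuse global (transformer-branch) and local (conv-branch)
# features on the equirectangular grid at one scale.
import torch
import torch.nn as nn

class GlobalLocalFusion(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, global_feat, local_feat):
        # both: (batch, channels, H, W)
        return self.fuse(torch.cat([global_feat, local_feat], dim=1))

fusion = GlobalLocalFusion()
out = fusion(torch.randn(1, 64, 128, 256), torch.randn(1, 64, 128, 256))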
4
Lin WK, Zhang HB, Fan Z, Liu JH, Yang LJ, Lei Q, Du J. Point-Based Learnable Query Generator for Human-Object Interaction Detection. IEEE Trans Image Process 2023; 32:6469-6484. [PMID: 37995177] [DOI: 10.1109/tip.2023.3334100]
Abstract
Transformer-based and interaction point-based methods have demonstrated promising performance and potential in human-object interaction (HOI) detection. However, due to differences in structure and properties, directly integrating these two types of models is not feasible. Recent Transformer-based methods divide the decoder into two branches: an instance decoder for human-object pair detection and a classification decoder for interaction recognition. While the attention mechanism within the Transformer enhances the connection between localization and classification, this paper focuses on further improving HOI detection performance by increasing the intrinsic correlation between instance and action features. To this end, this paper proposes a novel Transformer-based HOI detection framework. In the proposed method, the decoder contains three parts: a learnable query generator, an instance decoder, and an interaction classifier. The learnable query generator builds an effective query to guide the instance decoder and interaction classifier toward learning more accurate instance and interaction features, which are then used to update the query generator for the next layer. In particular, inspired by interaction point-based HOI and object detection methods, this paper introduces prior bounding boxes, keypoint detection, and spatial relation features to build the novel learnable query generator. Finally, the proposed method is verified on the HICO-DET and V-COCO datasets. The experimental results show that the proposed method achieves better performance than state-of-the-art methods.
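As a rough illustration of a point-based learnable query generator, the following hypothetical sketch projects prior bounding boxes, detected keypoints, and a spatial relation feature into decoder queries. The feature dimensions and the additive fusion are assumptions, not the paper's design.

# Hypothetical sketch: build decoder queries from box, keypoint, and spatial
# relation features for each candidate human-object pair.
import torch
import torch.nn as nn

class QueryGenerator(nn.Module):
    def __init__(self, dim=256, num_keypoints=17):
        super().__init__()
        self.box_proj = nn.Linear(8, dim)                   # human + object boxes (2 x 4)
        self.kpt_proj = nn.Linear(num_keypoints * 2, dim)   # 2D keypoints, flattened
        self.spatial_proj = nn.Linear(4, dim)               # e.g., relative offset + scale

    def forward(self, boxes, keypoints, spatial_rel):
        # boxes: (batch, queries, 8); keypoints: (batch, queries, K*2); spatial_rel: (batch, queries, 4)
        return self.box_proj(boxes) + self.kpt_proj(keypoints) + self.spatial_proj(spatial_rel)

gen = QueryGenerator()
queries = gen(torch.randn(2, 100, 8), torch.randn(2, 100, 34), torch.randn(2, 100, 4))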
5
Amaral P, Silva F, Santos V. Recognition of Grasping Patterns Using Deep Learning for Human-Robot Collaboration. Sensors (Basel) 2023; 23:8989. [PMID: 37960688] [PMCID: PMC10650364] [DOI: 10.3390/s23218989]
Abstract
Recent advances in the field of collaborative robotics aim to endow industrial robots with prediction and anticipation abilities. In many shared tasks, the robot's ability to accurately perceive and recognize the objects being manipulated by the human operator is crucial to make predictions about the operator's intentions. In this context, this paper proposes a novel learning-based framework to enable an assistive robot to recognize the object grasped by the human operator based on the pattern of the hand and finger joints. The framework combines the strengths of the commonly available software MediaPipe in detecting hand landmarks in an RGB image with a deep multi-class classifier that predicts the manipulated object from the extracted keypoints. This study focuses on the comparison between two deep architectures, a convolutional neural network and a transformer, in terms of prediction accuracy, precision, recall and F1-score. We test the performance of the recognition system on a new dataset collected with different users and in different sessions. The results demonstrate the effectiveness of the proposed methods, while providing valuable insights into the factors that limit the generalization ability of the models.
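The recognition pipeline described above lends itself to a short sketch: MediaPipe Hands extracts 21 hand landmarks from an RGB frame, and a small multi-class network maps the flattened keypoints to an object label. This is a minimal, hypothetical example; the classifier architecture, the class count, and the file name are assumptions, not the authors' trained models.

# Hypothetical sketch: MediaPipe hand landmarks -> flattened keypoints -> object class.
import cv2
import mediapipe as mp
import torch
import torch.nn as nn

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

classifier = nn.Sequential(                     # placeholder multi-class head
    nn.Linear(21 * 3, 128), nn.ReLU(), nn.Linear(128, 10)
)

image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
result = hands.process(image)
if result.multi_hand_landmarks:
    lm = result.multi_hand_landmarks[0].landmark            # 21 landmarks with x, y, z
    feats = torch.tensor([[v for p in lm for v in (p.x, p.y, p.z)]], dtype=torch.float32)
    predicted_object = classifier(feats).argmax(dim=1)      # index of the predicted object class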
Affiliation(s)
- Pedro Amaral
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, 3810-193 Aveiro, Portugal
- Filipe Silva
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, 3810-193 Aveiro, Portugal
- Vítor Santos
- Department of Mechanical Engineering (DEM), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, 3810-193 Aveiro, Portugal
6
Selva J, Johansen AS, Escalera S, Nasrollahi K, Moeslund TB, Clapes A. Video Transformers: A Survey. IEEE Trans Pattern Anal Mach Intell 2023; 45:12922-12943. [PMID: 37022830] [DOI: 10.1109/tpami.2023.3243465]
Abstract
Transformer models have shown great success in handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated by the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with lower computational complexity.
7
Auletta F, Kallen RW, di Bernardo M, Richardson MJ. Predicting and understanding human action decisions during skillful joint-action using supervised machine learning and explainable-AI. Sci Rep 2023; 13:4992. [PMID: 36973473] [PMCID: PMC10042997] [DOI: 10.1038/s41598-023-31807-1]
Abstract
This study investigated the utility of supervised machine learning (SML) and explainable artificial intelligence (AI) techniques for modeling and understanding human decision-making during multiagent task performance. Long short-term memory (LSTM) networks were trained to predict the target selection decisions of expert and novice players completing a multiagent herding task. The results revealed that the trained LSTM models could not only accurately predict the target selection decisions of expert and novice players, but that these predictions could be made at timescales that preceded a player's conscious intent. Importantly, the models were also expertise specific, in that models trained to predict the target selection decisions of experts could not accurately predict the target selection decisions of novices (and vice versa). To understand what differentiated expert and novice target selection decisions, we employed the explainable-AI technique SHapley Additive exPlanations (SHAP) to identify which informational features (variables) most influenced model predictions. The SHAP analysis revealed that experts relied more on information about target heading direction and the location of co-herders (i.e., other players) than novices did. The implications and assumptions underlying the use of SML and explainable-AI techniques for investigating and understanding human decision-making are discussed.
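A minimal, hypothetical sketch of this kind of analysis is shown below: an LSTM classifier predicts which target a player selects next, and SHAP attributes the prediction to the input features. The feature count, trial length, and the use of shap.GradientExplainer (rather than the authors' exact setup) are assumptions for illustration.

# Hypothetical sketch: LSTM target-selection classifier + SHAP feature attributions.
import torch
import torch.nn as nn
import shap

class TargetSelectionLSTM(nn.Module):
    def __init__(self, num_features=12, hidden=64, num_targets=5):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_targets)

    def forward(self, x):                      # x: (batch, time, num_features)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])              # logits over candidate targets

model = TargetSelectionLSTM().eval()
background = torch.randn(50, 30, 12)           # reference trials for the explainer
trials = torch.randn(5, 30, 12)                # trials to explain

explainer = shap.GradientExplainer(model, background)
shap_values = explainer.shap_values(trials)    # per-feature, per-timestep attributions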
Affiliation(s)
- Fabrizia Auletta
- School of Psychological Sciences, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, NSW, Australia
- Department of Engineering Mathematics, University of Bristol, Bristol, UK
- Rachel W Kallen
- School of Psychological Sciences, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, NSW, Australia
- Center for Elite Performance, Expertise and Training, Macquarie University, Sydney, NSW, Australia
- Mario di Bernardo
- Department of Electrical Engineering and Information Technology, University of Naples Federico II, Naples, Italy
- Scuola Superiore Meridionale, Naples, Italy
- Michael J Richardson
- School of Psychological Sciences, Faculty of Medicine, Health and Human Sciences, Macquarie University, Sydney, NSW, Australia
- Center for Elite Performance, Expertise and Training, Macquarie University, Sydney, NSW, Australia