1. Jiang Y, Ding W, Li H, Chi Z. Multi-Person Pose Tracking With Sparse Key-Point Flow Estimation and Hierarchical Graph Distance Minimization. IEEE Transactions on Image Processing 2024; 33:3590-3605. PMID: 38819968. DOI: 10.1109/TIP.2024.3405339.
Abstract
In this paper, we propose a novel framework for multi-person pose estimation and tracking in challenging scenarios. Because occlusions and motion blur hinder pose tracking, we model humans as graphs and perform pose estimation and tracking by concentrating on the visible parts of human bodies, which remain informative about complete skeletons under incomplete observations. Specifically, the proposed framework involves three parts: (i) a Sparse Key-point Flow Estimating Module (SKFEM) and a Hierarchical Graph Distance Minimizing Module (HGMM) estimate pixel-level and human-level motion, respectively; (ii) pixel-level appearance consistency and human-level structural consistency are combined to measure the visibility scores of body joints. These scores guide the pose estimator to predict complete skeletons from the high-visibility parts, under the assumption that visible and invisible parts are inherently correlated in human part graphs; the pose estimator is iteratively fine-tuned to achieve this capability; (iii) multiple historical frames are combined to benefit tracking, which is implemented with HGMM. The proposed approach not only achieves state-of-the-art performance on the PoseTrack datasets but also yields significant improvements in other tasks such as human-related anomaly detection.
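As a rough sketch of how such visibility scores could gate training, the Python snippet below blends the two consistency cues into per-joint weights and uses them to down-weight occluded joints in a simple pose loss. The linear blending rule, the weight `alpha`, and the loss form are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def visibility_scores(appearance_consistency, structural_consistency, alpha=0.5):
    # Blend pixel-level appearance consistency (e.g., from sparse key-point
    # flow) with human-level structural consistency (e.g., from graph
    # distances) into per-joint visibility scores in [0, 1]. `alpha` is a
    # hypothetical blending weight; the paper's fusion rule may differ.
    return alpha * appearance_consistency + (1.0 - alpha) * structural_consistency

def visibility_weighted_loss(pred_joints, gt_joints, visibility):
    # Down-weight low-visibility joints so the estimator learns to infer
    # complete skeletons from the high-visibility parts.
    per_joint_err = np.linalg.norm(pred_joints - gt_joints, axis=-1)
    return float(np.sum(visibility * per_joint_err) / (np.sum(visibility) + 1e-8))

# Toy usage with 17 COCO-style joints.
rng = np.random.default_rng(0)
vis = visibility_scores(rng.uniform(size=17), rng.uniform(size=17))
loss = visibility_weighted_loss(rng.normal(size=(17, 2)),
                                rng.normal(size=(17, 2)), vis)
print(round(loss, 4))
```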
2. Yang B, Yu C, Yu JG, Gao C, Sang N. Pose-Guided Hierarchical Semantic Decomposition and Composition for Human Parsing. IEEE Transactions on Cybernetics 2023; 53:1641-1652. PMID: 34506295. DOI: 10.1109/TCYB.2021.3107544.
Abstract
Human parsing is a fine-grained semantic segmentation task that requires understanding human semantic parts. Most existing methods treat human parsing as generic semantic segmentation, ignoring the inherent relationships among hierarchical human parts. In this work, we propose a pose-guided hierarchical semantic decomposition and composition framework for human parsing. Specifically, our method includes a semantic maintained decomposition and composition (SMDC) module and a pose distillation (PD) module. SMDC progressively disassembles the human body to focus on more concise regions of interest in the decomposition stage and then gradually assembles human parts under the guidance of pose information in the composition stage. Notably, SMDC maintains the atomic semantic labels during both stages to avoid the error-propagation issue of the hierarchical structure. To further exploit the relationships among human parts, we introduce pose information as explicit guidance for the composition. However, the discrete structure prediction in pose estimation conflicts with the continuous regions required by human parsing. To this end, we design the PD module to broadcast the maximum responses of pose estimation into continuous structures by means of knowledge distillation. Experimental results on the Look into Person (LIP) and PASCAL-Person-Part datasets demonstrate the superiority of our method over state-of-the-art methods: 55.21% mean Intersection over Union (mIoU) on LIP and 69.88% mIoU on PASCAL-Person-Part.
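A minimal sketch of the broadcast idea follows: each joint heatmap is reduced to its peak response, the peak is smeared with a Gaussian to form a continuous region, and a KL-style distillation term pulls the parsing branch toward this guidance. The Gaussian broadcast, the matching channel counts, and the KL loss are assumptions for illustration; the paper's exact PD formulation may differ.

```python
import torch
import torch.nn.functional as F

def broadcast_pose_maxima(pose_heatmaps, kernel_size=11, sigma=3.0):
    # pose_heatmaps: (B, K, H, W). Keep only each joint's peak response and
    # smear it with a Gaussian to form a continuous region (a stand-in for
    # the paper's broadcast operation).
    B, K, H, W = pose_heatmaps.shape
    flat = pose_heatmaps.reshape(B, K, -1)
    idx = flat.argmax(dim=-1, keepdim=True)                  # (B, K, 1)
    peaks = torch.zeros_like(flat).scatter_(-1, idx, 1.0).reshape(B, K, H, W)
    coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    kernel = (g[:, None] * g[None, :]) / g.sum() ** 2        # (ks, ks), sums to 1
    kernel = kernel.expand(K, 1, kernel_size, kernel_size).contiguous()
    return F.conv2d(peaks, kernel, padding=kernel_size // 2, groups=K)

def pose_distillation_loss(parsing_logits, pose_guidance):
    # KL-style distillation: pull the parsing distribution toward the
    # broadcast pose guidance (channel counts are assumed to match).
    log_p = F.log_softmax(parsing_logits, dim=1)
    q = pose_guidance + 1e-8
    q = q / q.sum(dim=1, keepdim=True)
    return F.kl_div(log_p, q, reduction="batchmean")

heat = torch.rand(2, 16, 64, 64)
loss = pose_distillation_loss(torch.randn(2, 16, 64, 64),
                              broadcast_pose_maxima(heat))
print(loss.item())
```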
3. Xu Y, Wang W, Liu T, Liu X, Xie J, Zhu SC. Monocular 3D Pose Estimation via Pose Grammar and Data Augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:6327-6344. PMID: 34106844. DOI: 10.1109/TPAMI.2021.3087695.
Abstract
In this paper, we propose a pose grammar to tackle the problem of 3D human pose estimation from a monocular RGB image. Our model takes an estimated 2D pose as input and learns a generalized 2D-to-3D mapping function that lifts it to a 3D pose. The proposed model consists of a base network, which efficiently captures pose-aligned features, and a hierarchy of Bi-directional RNNs (BRNNs) on top, which explicitly incorporates knowledge of human body configuration (i.e., kinematics, symmetry, motor coordination). The model thus enforces high-level constraints over human poses. For learning, we develop a data augmentation algorithm to further improve robustness against appearance variations and cross-view generalization. We validate our method on public 3D human pose benchmarks and propose a new evaluation protocol for the cross-view setting to verify the generalization capability of different methods. We empirically observe that most state-of-the-art methods encounter difficulty under this setting, while our method handles such challenges well.
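The hierarchy-of-BRNNs idea can be pictured with a single bidirectional GRU run along one kinematic chain, as in this minimal PyTorch sketch; the layer sizes, the single-chain ordering, and the GRU choice are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class KinematicChainBRNN(nn.Module):
    # Lift 2D joints to 3D with a bidirectional RNN run along a kinematic
    # chain, loosely following the idea of encoding body-configuration
    # knowledge (kinematics, symmetry) as sequential structure.
    def __init__(self, hidden=64):
        super().__init__()
        self.base = nn.Sequential(nn.Linear(2, hidden), nn.ReLU())
        self.brnn = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, 3)   # per-joint 3D coordinates

    def forward(self, joints_2d, chain_order):
        # joints_2d: (B, J, 2); chain_order: permutation visiting joints along
        # a kinematic chain (e.g., pelvis -> spine -> head -> arms -> legs).
        x = self.base(joints_2d)[:, chain_order]   # reorder along the chain
        h, _ = self.brnn(x)                        # messages flow both ways
        out = self.head(h)
        inverse = torch.argsort(torch.as_tensor(chain_order))
        return out[:, inverse]                     # restore original joint order

model = KinematicChainBRNN()
pose3d = model(torch.randn(4, 16, 2), chain_order=list(range(16)))
print(pose3d.shape)  # torch.Size([4, 16, 3])
```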
4. Tang Z, Huang J. DRFormer: Learning dual relations using Transformer for pedestrian attribute recognition. Neurocomputing 2022. DOI: 10.1016/j.neucom.2022.05.028.
5. Zhang X, Chen Y, Tang M, Lei Z, Wang J. Grammar-Induced Wavelet Network for Human Parsing. IEEE Transactions on Image Processing 2022; 31:4502-4514. PMID: 35700249. DOI: 10.1109/TIP.2022.3181486.
Abstract
Most existing methods of human parsing still face a challenge: how to effectively extract an accurate foreground from similar or cluttered scenes. In this paper, we propose a Grammar-induced Wavelet Network (GWNet) to deal with this challenge. GWNet mainly consists of two modules: a blended grammar-induced module and a wavelet prediction module. We design the blended grammar-induced module to exploit the relationships among different human parts and the inherent hierarchical structure of the human body by means of grammar rules applied in both cascaded and parallel manners. In this way, conspicuous parts, which are easily distinguished from the background, can amend the segmentation of inconspicuous ones, improving foreground extraction. We also design a Part-aware Convolutional Recurrent Neural Network (PCRNN) to pass the messages generated by the grammar rules. To further improve performance, we propose a wavelet prediction module that captures the basic structure and edge details of a person by decomposing features into low-frequency and high-frequency components: the low-frequency component represents smooth structures, while the high-frequency components describe fine details. We conduct extensive experiments to evaluate GWNet on the PASCAL-Person-Part, LIP, and PPSS datasets, on which it obtains state-of-the-art performance.
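The low-/high-frequency split at the heart of the wavelet prediction module can be illustrated with a single-level 2D Haar transform over feature maps, as below; the actual GWNet filters and fusion scheme may differ.

```python
import torch
import torch.nn.functional as F

def haar_dwt(feat):
    # Single-level 2D Haar transform of a feature map, separating smooth
    # structure (LL) from edge detail (LH, HL, HH). feat: (B, C, H, W) with
    # even H and W.
    B, C, H, W = feat.shape
    k = 0.5 * torch.tensor([
        [[1.,  1.], [ 1.,  1.]],   # LL: low-frequency structure
        [[1.,  1.], [-1., -1.]],   # LH: horizontal edges
        [[1., -1.], [ 1., -1.]],   # HL: vertical edges
        [[1., -1.], [-1.,  1.]],   # HH: diagonal detail
    ])
    weight = k.unsqueeze(1).repeat(C, 1, 1, 1)           # (4C, 1, 2, 2)
    out = F.conv2d(feat, weight, stride=2, groups=C)     # per-channel bands
    out = out.view(B, C, 4, H // 2, W // 2)
    ll, lh, hl, hh = out.unbind(dim=2)
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_dwt(torch.randn(2, 8, 32, 32))
print(ll.shape, hh.shape)  # smooth structure vs. diagonal detail, each (2, 8, 16, 16)
```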
6. MCFL: Multi-label contrastive focal loss for deep imbalanced pedestrian attribute recognition. Neural Computing and Applications 2022. DOI: 10.1007/s00521-022-07300-7.
7. Zhou T, Qi S, Wang W, Shen J, Zhu SC. Cascaded Parsing of Human-Object Interaction Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022; 44:2827-2840. PMID: 33400648. DOI: 10.1109/TPAMI.2021.3049156.
Abstract
This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images. Considering the intrinsic complexity and structural nature of the task, we introduce a cascaded parsing network (CP-HOI) for multi-stage, structured HOI understanding. At each cascade stage, an instance detection module progressively refines HOI proposals and feeds them into a structured interaction reasoning module. Each of the two modules is also connected to its predecessor in the previous stage, enabling efficient cross-stage information propagation. The structured interaction reasoning module is built upon a graph parsing neural network (GPNN), which models potential HOI structures as graphs and mines rich context for comprehensive relation understanding. In particular, GPNN infers a parse graph that (i) interprets meaningful HOI structures through a learnable adjacency matrix and (ii) predicts action (edge) labels. Within an end-to-end message-passing framework, GPNN blends learning and inference, iteratively parsing HOI structures and reasoning over HOI representations (i.e., instance and relation features). Beyond relation detection at the bounding-box level, our framework can flexibly perform fine-grained pixel-wise relation segmentation, offering a new glimpse into better relation modeling. A preliminary version of our CP-HOI model reached first place in the ICCV 2019 Person in Context Challenge on both relation detection and segmentation. In addition, CP-HOI shows promising results on two popular HOI recognition benchmarks, i.e., V-COCO and HICO-DET.
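A minimal sketch of GPNN-style message passing follows: a soft adjacency matrix is scored from pairwise node features, messages are aggregated through it, and node states are updated with a GRU cell. The dimensions, the MLP edge scorer, and the GRU update are illustrative assumptions rather than the exact CP-HOI design.

```python
import torch
import torch.nn as nn

class GraphParsingStep(nn.Module):
    # One message-passing step: infer a soft adjacency matrix from pairwise
    # node features, aggregate messages through it, and update node states.
    def __init__(self, dim=128):
        super().__init__()
        self.edge_scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                         nn.Linear(dim, 1))
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes):
        # nodes: (N, dim) features for detected humans and objects.
        N, dim = nodes.shape
        pair = torch.cat([nodes.unsqueeze(1).expand(N, N, dim),
                          nodes.unsqueeze(0).expand(N, N, dim)], dim=-1)
        adjacency = torch.sigmoid(self.edge_scorer(pair)).squeeze(-1)  # (N, N)
        messages = adjacency @ nodes / (adjacency.sum(-1, keepdim=True) + 1e-8)
        return self.update(messages, nodes), adjacency

step = GraphParsingStep()
nodes = torch.randn(5, 128)
for _ in range(3):                 # iterative parsing, as in cascade stages
    nodes, adj = step(nodes)
print(adj.shape)                   # learned (5, 5) soft adjacency
```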
8. Akula AR, Wang K, Liu C, Saba-Sadiya S, Lu H, Todorovic S, Chai J, Zhu SC. CX-ToM: Counterfactual explanations with theory-of-mind for enhancing human trust in image recognition models. iScience 2022; 25:103581. PMID: 35036861. PMCID: PMC8753121. DOI: 10.1016/j.isci.2021.103581.
Abstract
We propose CX-ToM, short for counterfactual explanations with theory-of-mind, a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN). In contrast to current XAI methods that generate explanations as a single-shot response, we pose explanation as an iterative communication process, i.e., a dialogue between the machine and the human user. More concretely, our CX-ToM framework generates a sequence of explanations in a dialogue by mediating the differences between the minds of the machine and the human user. To do this, we use Theory of Mind (ToM), which helps us explicitly model the human's intention, the machine's mind as inferred by the human, and the human's mind as inferred by the machine. Moreover, most state-of-the-art XAI frameworks provide attention (or heat map) based explanations. In our work, we show that these attention-based explanations are not sufficient for increasing human trust in the underlying CNN model. In CX-ToM, we instead use counterfactual explanations called fault-lines, defined as follows: given an input image I for which a CNN classification model M predicts class c_pred, a fault-line identifies the minimal semantic-level features (e.g., stripes on a zebra), referred to as explainable concepts, that need to be added to or deleted from I to alter the classification of I by M to another specified class c_alt. Extensive experiments verify our hypotheses, demonstrating that CX-ToM significantly outperforms state-of-the-art XAI models.
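The fault-line definition suggests a search for a minimal concept edit that flips the prediction. The toy sketch below does this greedily over a concept-activation vector with a stand-in linear classifier; real fault-lines operate on semantic image features, so the zeroing-as-deletion step and the brute-force subset search are simplifying assumptions.

```python
import itertools
import torch

def find_fault_line(model, concepts, target_class, max_edits=3):
    # Find a minimal set of explainable concepts whose deletion (zeroing)
    # flips the model's prediction to `target_class`. `model` maps a concept
    # activation vector to class logits.
    base_pred = model(concepts).argmax().item()
    for k in range(1, max_edits + 1):
        for subset in itertools.combinations(range(concepts.numel()), k):
            edited = concepts.clone()
            edited[list(subset)] = 0.0          # delete these concepts
            pred = model(edited).argmax().item()
            if pred == target_class and pred != base_pred:
                return subset                    # minimal flipping set found
    return None

# Toy linear "classifier" over 6 concept activations and 3 classes.
torch.manual_seed(0)
W = torch.randn(3, 6)
model = lambda c: W @ c
concepts = torch.rand(6)
print(find_fault_line(model, concepts, target_class=1))
```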
Affiliation(s)
- Arjun R. Akula, Department of Statistics, UCLA, Los Angeles, CA 90024, USA
- Keze Wang, Department of Statistics, UCLA, Los Angeles, CA 90024, USA
- Changsong Liu, Department of Statistics, UCLA, Los Angeles, CA 90024, USA
- Sari Saba-Sadiya, Department of Computer Science, University of Michigan, Ann Arbor, MI 48109, USA
- Hongjing Lu, Department of Statistics, UCLA, Los Angeles, CA 90024, USA
- Sinisa Todorovic, Department of Computer Science, Oregon State University, Corvallis, OR 97331, USA
- Joyce Chai, Department of Computer Science, University of Michigan, Ann Arbor, MI 48109, USA
- Song-Chun Zhu, Beijing Institute for General AI (BIGAI), Tsinghua University, Peking University, Beijing 100871, China
9. Bratch A, Chen Y, Engel SA, Kersten DJ. Visual adaptation selective for individual limbs reveals hierarchical human body representation. Journal of Vision 2021; 21(5):18. PMID: 34007989. PMCID: PMC8142707. DOI: 10.1167/jov.21.5.18.
Abstract
The spatial relationships between body parts are a rich source of information for person perception, with even simple pairs of parts providing highly valuable information. Computation of these relationships would benefit from a hierarchical representation in which body parts are represented individually. We hypothesized that the human visual system makes use of such representations. To test this hypothesis, we used adaptation to determine whether observers were sensitive to changes in the length of one body part relative to another. Observers viewed forearm/upper-arm pairs in which the forearm had been either lengthened or shortened, and judged the perceived length of the forearm. Observers then adapted to a variety of stimuli (e.g., arms, non-limb objects) in different orientations and visual field locations. We found that following adaptation to distorted limbs, but not to non-limb objects, observers experienced a shift in perceived forearm length. Furthermore, this effect partially transferred across different orientations and visual field locations. Taken together, these results suggest that the effect arises in high-level mechanisms specialized for specific body parts, providing evidence for a representation of bodies based on parts and their relationships.
Affiliation(s)
- Alexander Bratch, Department of Psychology and Department of Biomedical Engineering, University of Minnesota, Minneapolis, MN, USA
- Yixiong Chen, Department of Psychology, University of Minnesota, Minneapolis, MN, USA
- Stephen A Engel, Department of Psychology, University of Minnesota, Minneapolis, MN, USA
- Daniel J Kersten, Department of Psychology, University of Minnesota, Minneapolis, MN, USA
10. Gu X, Gao F, Tan M, Peng P. Fashion analysis and understanding with artificial intelligence. Information Processing & Management 2020. DOI: 10.1016/j.ipm.2020.102276.
11.
Abstract
Human Attribute Recognition (HAR) is a highly active research field in the computer vision and pattern recognition domains, with various applications such as surveillance and fashion. Several approaches have been proposed to tackle the particular challenges of HAR. However, these approaches have changed dramatically over the last decade, mainly due to the improvements brought by deep learning. To provide insights for future algorithm design and dataset collection, in this survey, (1) we provide an in-depth analysis of existing HAR techniques, focusing on the advances proposed to address HAR's main challenges; (2) we provide a comprehensive discussion of the publicly available datasets for developing and evaluating novel HAR approaches; and (3) we outline the applications and typical evaluation metrics used in the HAR context.
12. Liang X, Gong K, Shen X, Lin L. Look into Person: Joint Body Parsing & Pose Estimation Network and a New Benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018; 41:871-885. PMID: 29994083. DOI: 10.1109/TPAMI.2018.2820063.
Abstract
Human parsing and pose estimation have recently received considerable interest due to their substantial application potential. However, existing datasets have limited numbers of images and annotations and lack variety in human appearance and coverage of challenging cases in unconstrained environments. In this paper, we introduce a new benchmark named "Look into Person (LIP)" that provides a significant advance in terms of scalability, diversity, and difficulty, which are crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels and 16 body joints, captured from a broad range of viewpoints, occlusions, and background complexities. Using these rich annotations, we perform detailed analyses of the leading human parsing and pose estimation approaches, thereby gaining insights into the successes and failures of these methods. To further explore and take advantage of the semantic correlation between these two tasks, we propose a novel joint human parsing and pose estimation network with efficient context modeling, which can simultaneously predict parsing and pose with extremely high quality. Furthermore, we simplify the network to solve human parsing alone via a novel self-supervised structure-sensitive learning approach, which imposes human pose structures on the parsing results without resorting to extra supervision. The datasets, code, and models are available at http://www.sysu-hcp.net/lip/.
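One way to picture structure-sensitive learning is to rescale the parsing cross-entropy by a pose-structure mismatch computed from the parsing maps themselves, with soft part centers acting as pseudo-joints. The sketch below follows that intuition; the center-of-mass pseudo-joints and the 1 + mismatch weighting are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def part_centers(prob):
    # Soft centers of mass for each part channel; these act as pseudo-joints
    # implied by a parsing map. prob: (B, C, H, W), non-negative.
    B, C, H, W = prob.shape
    ys = torch.linspace(0, 1, H).view(1, 1, H, 1)
    xs = torch.linspace(0, 1, W).view(1, 1, 1, W)
    mass = prob.sum(dim=(2, 3)) + 1e-8
    cy = (prob * ys).sum(dim=(2, 3)) / mass
    cx = (prob * xs).sum(dim=(2, 3)) / mass
    return torch.stack([cx, cy], dim=-1)          # (B, C, 2)

def structure_sensitive_loss(logits, gt, num_parts=20):
    # Cross-entropy rescaled by the distance between pseudo-joints of the
    # predicted and ground-truth parsing maps (structure mismatch penalty).
    prob = logits.softmax(dim=1)
    gt_onehot = F.one_hot(gt, num_parts).permute(0, 3, 1, 2).float()
    mismatch = (part_centers(prob) - part_centers(gt_onehot)).norm(dim=-1)
    joint_weight = 1.0 + mismatch.mean()
    return joint_weight * F.cross_entropy(logits, gt)

logits = torch.randn(2, 20, 64, 64)               # 19 part labels + background
gt = torch.randint(0, 20, (2, 64, 64))
print(structure_sensitive_loss(logits, gt).item())
```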