1. An L, Ren J, Yu T, Hai T, Jia Y, Liu Y. Three-dimensional surface motion capture of multiple freely moving pigs using MAMMAL. Nat Commun 2023; 14:7727. [PMID: 38001106] [PMCID: PMC10673844] [DOI: 10.1038/s41467-023-43483-w]
Abstract
Understanding the three-dimensional social behaviors of freely moving large mammals is valuable for both agriculture and life science, yet challenging due to occlusions during close interactions. Although existing animal pose estimation methods capture keypoint trajectories, they ignore the deformable body surface, which carries geometric information essential for predicting social interactions and for handling occlusions. In this study, we develop a Multi-Animal Mesh Model Alignment (MAMMAL) system based on an articulated surface mesh model. Our MAMMAL algorithms automatically align multi-view images to the mesh model and capture the 3D surface motions of multiple animals, performing better under severe occlusion than traditional triangulation and enabling complex social analysis. Using MAMMAL, we quantitatively analyze the locomotion, postures, animal-scene interactions, social interactions, and detailed tail motions of pigs. Furthermore, experiments on mice and Beagle dogs demonstrate the generalizability of MAMMAL across different environments and mammal species.
Affiliation(s)
- Liang An
  - Department of Automation, Tsinghua University, Beijing, China
- Jilong Ren
  - State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - Beijing Farm Animal Research Center, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Tao Yu
  - Department of Automation, Tsinghua University, Beijing, China
  - Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China
- Tang Hai
  - State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
  - Beijing Farm Animal Research Center, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Yichang Jia
  - School of Medicine, Tsinghua University, Beijing, China
  - IDG/McGovern Institute for Brain Research at Tsinghua, Beijing, China
  - Tsinghua Laboratory of Brain and Intelligence, Beijing, China
- Yebin Liu
  - Department of Automation, Tsinghua University, Beijing, China
  - Institute for Brain and Cognitive Sciences, Tsinghua University, Beijing, China
2. Shin S, Li Z, Halilaj E. Markerless Motion Tracking With Noisy Video and IMU Data. IEEE Trans Biomed Eng 2023; 70:3082-3092. [PMID: 37171931] [DOI: 10.1109/tbme.2023.3275775]
Abstract
OBJECTIVE: Marker-based motion capture, considered the gold standard in human motion analysis, is expensive and requires trained personnel. Advances in inertial sensing and computer vision offer new opportunities to obtain research-grade assessments in clinics and natural environments. A challenge that discourages clinical adoption, however, is the need for careful sensor-to-body alignment, which slows the data collection process in clinics and is prone to errors when patients take the sensors home.
METHODS: We propose deep learning models to estimate human movement with noisy data from videos (VideoNet), inertial sensors (IMUNet), and a combination of the two (FusionNet), obviating the need for careful calibration. The video and inertial sensing data used to train the models were generated synthetically from a marker-based motion capture dataset of a broad range of activities and augmented to account for sensor-misplacement and camera-occlusion errors. The models were tested using real data that included walking, jogging, squatting, sit-to-stand, and other activities.
RESULTS: On calibrated data, IMUNet was as accurate as state-of-the-art models, while VideoNet and FusionNet reduced mean ± std root-mean-squared errors by 7.6 ± 5.4° and 5.9 ± 3.3°, respectively. Importantly, all the newly proposed models were less sensitive to noise than existing approaches, reducing errors by up to 14.0 ± 5.3° for sensor-misplacement errors of up to 30.0 ± 13.7° and by up to 7.4 ± 5.5° for joint-center-estimation errors of up to 101.1 ± 11.2 mm, across joints.
CONCLUSION: These tools offer clinicians and patients the opportunity to estimate movement with research-grade accuracy, without the need for time-consuming calibration steps or the high costs associated with commercial products such as Theia3D or Xsens, helping democratize the diagnosis, prognosis, and treatment of neuromusculoskeletal conditions.
3. Pearl O, Shin S, Godura A, Bergbreiter S, Halilaj E. Fusion of video and inertial sensing data via dynamic optimization of a biomechanical model. J Biomech 2023; 155:111617. [PMID: 37220709] [DOI: 10.1016/j.jbiomech.2023.111617]
Abstract
Inertial sensing and computer vision are promising alternatives to traditional optical motion tracking, but until now these data sources have been explored either in isolation or fused via unconstrained optimization, which may not take full advantage of their complementary strengths. By adding physiological plausibility and dynamical robustness to a proposed solution, biomechanical modeling may enable better fusion than unconstrained optimization. To test this hypothesis, we fused video and inertial sensing data via dynamic optimization with a nine degree-of-freedom model and investigated when this approach outperforms video-only, inertial-sensing-only, and unconstrained-fusion methods. We used both experimental and synthetic data that mimicked different ranges of video and inertial measurement unit (IMU) data noise. Fusion with a dynamically constrained model significantly improved estimation of lower-extremity kinematics over the video-only approach and estimation of joint centers over the IMU-only approach. It consistently outperformed single-modality approaches across different noise profiles. When the quality of video data was high and that of inertial data was low, dynamically constrained fusion improved estimation of joint kinematics and joint centers over unconstrained fusion, while unconstrained fusion was advantageous in the opposite scenario. These findings indicate that complementary modalities and techniques can improve motion tracking by clinically meaningful margins and that data quality and computational complexity must be considered when selecting the most appropriate method for a particular application.
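The weighted least-squares flavor of video-IMU fusion that this line of work builds on can be illustrated in a few lines. The sketch below is a static toy, not the paper's dynamic optimization (which additionally enforces a dynamically constrained biomechanical model over time): a planar two-joint limb whose ankle position is observed by "video" and whose shank orientation is observed by an "IMU". Segment lengths, noise levels, and weights are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import least_squares

LEN_THIGH, LEN_SHANK = 0.45, 0.40   # assumed segment lengths (m)

def fk(q):
    """Planar forward kinematics: hip at origin, returns knee and ankle."""
    hip, knee = q
    p_knee = np.array([LEN_THIGH * np.sin(hip), -LEN_THIGH * np.cos(hip)])
    p_ankle = p_knee + np.array([LEN_SHANK * np.sin(hip + knee),
                                 -LEN_SHANK * np.cos(hip + knee)])
    return p_knee, p_ankle

def residuals(q, ankle_video, shank_angle_imu, w_video=1.0, w_imu=1.0):
    """Stack video (keypoint) and IMU (orientation) residuals with weights."""
    _, p_ankle = fk(q)
    r_video = w_video * (p_ankle - ankle_video)
    r_imu = w_imu * np.array([q[0] + q[1] - shank_angle_imu])
    return np.concatenate([r_video, r_imu])

rng = np.random.default_rng(0)
q_true = np.array([0.3, 0.6])                       # hip, knee angles (rad)
_, ankle_true = fk(q_true)
ankle_video = ankle_true + rng.normal(0, 0.01, 2)   # noisy video keypoint
shank_imu = q_true[0] + q_true[1] + rng.normal(0, 0.02)  # noisy IMU angle

sol = least_squares(residuals, x0=np.zeros(2), args=(ankle_video, shank_imu))
print("estimated joint angles:", sol.x, "true:", q_true)
```

Tuning `w_video` and `w_imu` is the simplest expression of the paper's finding that the better-quality modality should dominate the fused estimate.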
Affiliation(s)
- Owen Pearl
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Soyong Shin
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Ashwin Godura
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Sarah Bergbreiter
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Eni Halilaj
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
4. Huang B, Zhang T, Wang Y. Object-Occluded Human Shape and Pose Estimation With Probabilistic Latent Consistency. IEEE Trans Pattern Anal Mach Intell 2023; 45:5010-5026. [PMID: 35976842] [DOI: 10.1109/tpami.2022.3199449]
Abstract
Occlusions between humans and objects, especially during human-object interactions, are very common in practical applications. However, most existing approaches for 3D human shape and pose estimation require that human bodies be well captured without occlusions or with only minor self-occlusions. In this paper, we focus on directly estimating object-occluded human shape and pose from single color images. Our key idea is to use a partial UV map to represent an object-occluded human body, converting full 3D human shape estimation into an image inpainting problem. We propose a novel two-branch network architecture that trains an end-to-end regressor via latent distribution consistency and includes a novel visible-feature sub-net to extract human information from object-occluded color images. To supervise network training, we further build a novel dataset named 3DOH50K. Several experiments demonstrate that the proposed method achieves state-of-the-art performance compared with previous methods. The dataset and code are publicly available at https://www.yangangwang.com/papers/ZHANG-OOH-2020-03.html.
5. Zhang M, Zhou Z, Tao X, Zhang N, Deng M. Hand pose estimation based on fish skeleton CNN: application in gesture recognition. J Intell Fuzzy Syst 2023. [DOI: 10.3233/jifs-224271]
Abstract
The modern world contains a significant number of computer vision applications in which human-computer interaction plays a crucial role, and hand pose estimation is a key approach in this field. However, previous approaches suffer from an inability to accurately measure position in real-world scenes, difficulty in handling targets of different sizes, complex network structures, and a lack of applications. In recent years, deep learning techniques have produced state-of-the-art outcomes, but challenges remain before this technology can be fully exploited. In this research, a fish skeleton CNN (FS-HandNet) is proposed for hand pose estimation from a monocular RGB image. To obtain hand pose information, a fish skeleton network structure is used for the first time. In particular, bidirectional pyramid structures (BiPS) effectively reduce the loss of feature information during downsampling and can extract features from targets of different sizes. A distribution-aware coordinate representation is then employed to adjust the position information of the hand, and finally a convex hull algorithm and the hand pose information are applied to recognize multiple gestures. Extensive studies on three publicly available hand pose benchmarks demonstrate that our method performs nearly as well as the state of the art in hand pose estimation. Additionally, we have applied hand pose estimation to gesture recognition.
Affiliation(s)
- Mingyue Zhang
  - School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
  - Key Laboratory of Big Data and Intelligent Robot, South China University of Technology, Guangzhou, China
- Zhiheng Zhou
  - School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
  - Key Laboratory of Big Data and Intelligent Robot, South China University of Technology, Guangzhou, China
- Xiyuan Tao
  - School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
  - Key Laboratory of Big Data and Intelligent Robot, South China University of Technology, Guangzhou, China
- Na Zhang
  - Guangdong Science Center, Guangzhou, China
- Ming Deng
  - School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
  - Key Laboratory of Big Data and Intelligent Robot, South China University of Technology, Guangzhou, China
6. Zhang Y, Wang C, Wang X, Liu W, Zeng W. VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild. IEEE Trans Pattern Anal Mach Intell 2023; 45:2613-2626. [PMID: 35427220] [DOI: 10.1109/tpami.2022.3163709]
Abstract
We present VoxelTrack for multi-person 3D pose estimation and tracking from a few cameras separated by wide baselines. It employs a multi-branch network to jointly estimate 3D poses and re-identification (Re-ID) features for all people in the environment. In contrast to previous efforts, which need to establish cross-view correspondence from noisy 2D pose estimates, it directly estimates and tracks 3D poses from a 3D voxel-based representation constructed from multi-view images. We first discretize the 3D space into regular voxels and compute a feature vector for each voxel by averaging the body-joint heatmaps inversely projected from all views. We estimate 3D poses from the voxel representation by predicting whether each voxel contains a particular body joint. Similarly, a Re-ID feature is computed for each voxel and used to track the estimated 3D poses over time. The main advantage of the approach is that it avoids making hard decisions based on individual images, so it can robustly estimate and track 3D poses even when people are severely occluded in some cameras. It outperforms state-of-the-art methods by a large margin on four public datasets: Shelf, Campus, Human3.6M, and CMU Panoptic.
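The voxel-feature construction described in this abstract is straightforward to sketch: project each voxel center into every view and average the sampled joint-heatmap values. The toy below uses two synthetic cameras and a fabricated heatmap, so it illustrates the representation only, not the authors' network.

```python
import numpy as np

def project(P, X):
    """Project Nx3 world points with a 3x4 camera matrix to pixel coords."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = (P @ Xh.T).T
    return x[:, :2] / x[:, 2:3]

# Voxel grid over a 2 m cube, 16^3 voxels
g = np.linspace(-1, 1, 16)
voxels = np.stack(np.meshgrid(g, g, g, indexing="ij"), -1).reshape(-1, 3)

H, W = 64, 64
cams, heatmaps = [], []
for yaw in (0.0, np.pi / 2):                  # two synthetic views
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    t = np.array([0.0, 0.0, 4.0])             # camera 4 m from the origin
    K = np.array([[80, 0, W / 2], [0, 80, H / 2], [0, 0, 1]])
    cams.append(K @ np.hstack([R, t[:, None]]))
    hm = np.zeros((H, W)); hm[28:36, 28:36] = 1.0   # fake joint heatmap
    heatmaps.append(hm)

feat = np.zeros(len(voxels))
for P, hm in zip(cams, heatmaps):
    uv = np.round(project(P, voxels)).astype(int)
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    feat[ok] += hm[uv[ok, 1], uv[ok, 0]]
feat /= len(cams)                             # average over views
print("a voxel with maximal joint likelihood:", voxels[feat.argmax()])
```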
7. Yoon JS, Yu Z, Park J, Park HS. HUMBI: A Large Multiview Dataset of Human Body Expressions and Benchmark Challenge. IEEE Trans Pattern Anal Mach Intell 2023; 45:623-640. [PMID: 34962862] [DOI: 10.1109/tpami.2021.3138762]
Abstract
This paper presents a new large multiview dataset called HUMBI for human body expressions with natural clothing. The goal of HUMBI is to facilitate modeling the view-specific appearance and geometry of five primary body signals (gaze, face, hand, body, and garment) from assorted people. 107 synchronized HD cameras are used to capture 772 distinct subjects across gender, ethnicity, age, and style. With the multiview image streams, we reconstruct the geometry of body expressions using 3D mesh models, which allows representing view-specific appearance. We demonstrate that HUMBI is highly effective in learning and reconstructing a complete human model and is complementary to existing datasets of human body expressions with limited views and subjects, such as MPII-Gaze, Multi-PIE, Human3.6M, and Panoptic Studio. Based on HUMBI, we formulate a new benchmark challenge of a pose-guided appearance rendering task that aims to substantially extend photorealism in modeling diverse human expressions in 3D, which is the key enabling factor of authentic social telepresence. HUMBI is publicly available at http://humbi-data.net.
8. Dong J, Fang Q, Jiang W, Yang Y, Huang Q, Bao H, Zhou X. Fast and Robust Multi-Person 3D Pose Estimation and Tracking From Multiple Views. IEEE Trans Pattern Anal Mach Intell 2022; 44:6981-6992. [PMID: 34283712] [DOI: 10.1109/tpami.2021.3098052]
Abstract
This paper addresses the problem of reconstructing the 3D poses of multiple people from a few calibrated camera views. The main challenge is to find cross-view correspondences among noisy and incomplete 2D pose predictions. Most previous methods address this challenge by reasoning directly in 3D using a pictorial structure model, which is inefficient due to the huge state space. We propose a fast and robust approach: our key idea is to use a multi-way matching algorithm to cluster the detected 2D poses in all views. Each resulting cluster encodes the 2D poses of the same person across different views with consistent correspondences across keypoints, from which the 3D pose of each person can be effectively inferred. The proposed convex-optimization-based multi-way matching algorithm is efficient and robust against missing and false detections, without knowing the number of people in the scene. Moreover, we propose to combine geometric and appearance cues for cross-view matching. Finally, an efficient tracking method is proposed to track the detected 3D poses across the multi-view video. The proposed approach achieves state-of-the-art performance on the Campus and Shelf datasets while being efficient enough for real-time applications.
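Once the multi-way matching has grouped 2D joints across views, the final lifting step in pipelines like this one is classic direct linear transformation (DLT) triangulation. A minimal, self-contained sketch with two synthetic cameras (the camera parameters below are arbitrary, for illustration only):

```python
import numpy as np

def triangulate_dlt(Ps, uvs):
    """Linear triangulation of one 3D point from >=2 views (DLT)."""
    A = []
    for P, (u, v) in zip(Ps, uvs):
        A.append(u * P[2] - P[0])       # each view contributes two rows
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                          # right null vector of A
    return X[:3] / X[3]

def proj(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Synthetic check: two cameras with a 1 m baseline observe a point
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])
X_true = np.array([0.2, -0.1, 5.0])
uvs = [proj(P1, X_true), proj(P2, X_true)]
print(triangulate_dlt([P1, P2], uvs), "vs", X_true)
```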
9. Liu H, Wu J, He R. Center point to pose: Multiple views 3D human pose estimation for multi-person. PLoS One 2022; 17:e0274450. [PMID: 36099276] [PMCID: PMC9469997] [DOI: 10.1371/journal.pone.0274450]
Abstract
3D human pose estimation has always been an important task in computer vision, especially in crowded scenes where multiple people interact with each other. Many state-of-the-art methods exist for single-view object detection, but recovering people's locations in crowded and occluded scenes is complicated by the lack of depth information in a single view, which limits robustness. Multi-view, multi-person human pose estimation has therefore become an effective approach. Previous multi-view 3D human pose estimation methods follow a strategy of associating the joints of the same person across 2D pose estimates, yet incompleteness and noise in the 2D poses are inevitable, and the association itself is challenging. To solve this issue, we propose a CTP (Center Point to Pose) network based on multiple views that operates directly in 3D space. The 2D joint features from all cameras are projected into a 3D voxel space. Our CTP network regresses the center of each person as their location and a 3D bounding box as their activity area, and then estimates a detailed 3D pose for each bounding box. Moreover, the center-regression stage is Non-Maximum-Suppression free, which makes the network more efficient and simpler. Our method performs competitively on several public datasets, which shows the efficacy of our center-point-to-pose network representation.
Affiliation(s)
- Huan Liu
  - The State Key Laboratory of Automotive Simulation and Control, Jilin University, Changchun, China
- Jian Wu
  - The State Key Laboratory of Automotive Simulation and Control, Jilin University, Changchun, China
- Rui He
  - The State Key Laboratory of Automotive Simulation and Control, Jilin University, Changchun, China
10. Huang B, Zhang T, Wang Y. Pose2UV: Single-Shot Multiperson Mesh Recovery With Deep UV Prior. IEEE Trans Image Process 2022; 31:4679-4692. [PMID: 35793292] [DOI: 10.1109/tip.2022.3187294]
Abstract
In this work, we focus on multi-person mesh recovery from a single color image, where the key issue is to tackle pixel-level ambiguities caused by inter-person occlusions. There are two main technical challenges in addressing these ambiguities: how to extract valid target features under occlusion, and how to reconstruct reasonable human meshes from only a handful of body cues. To deal with these problems, our key idea is to use the predicted 2D poses to locate and separate the target person, and to reconstruct the person with a novel learning-based UV prior. Specifically, we propose a visible pose-mask module to help extract valid target features, then train a dense body-mesh prior to promote the reconstruction of natural meshes represented by the UV position map. To evaluate the performance of our method under occlusion, we further build an in-the-wild 3D multi-person benchmark named 3DMPB. Experimental results demonstrate that our method achieves state-of-the-art performance compared with previous methods. The dataset and code are publicly available on our website.
11. Pagnon D, Domalain M, Reveret L. Pose2Sim: An End-to-End Workflow for 3D Markerless Sports Kinematics—Part 2: Accuracy. Sensors 2022; 22:2712. [PMID: 35408326] [PMCID: PMC9002957] [DOI: 10.3390/s22072712]
Abstract
Two-dimensional deep-learning pose estimation algorithms can suffer from biases in joint pose localizations, which are reflected in triangulated coordinates, and then in 3D joint angle estimation. Pose2Sim, our robust markerless kinematics workflow, comes with a physically consistent OpenSim skeletal model, meant to mitigate these errors. Its accuracy was concurrently validated against a reference marker-based method. Lower-limb joint angles were estimated over three tasks (walking, running, and cycling) performed multiple times by one participant. When averaged over all joint angles, the coefficient of multiple correlation (CMC) remained above 0.9 in the sagittal plane, except for the hip in running, which suffered from a systematic 15° offset (CMC = 0.65), and for the ankle in cycling, which was partially occluded (CMC = 0.75). When averaged over all joint angles and all degrees of freedom, mean errors were 3.0°, 4.1°, and 4.0°, in walking, running, and cycling, respectively; and range of motion errors were 2.7°, 2.3°, and 4.3°, respectively. Given the magnitude of error traditionally reported in joint angles computed from a marker-based optoelectronic system, Pose2Sim is deemed accurate enough for the analysis of lower-body kinematics in walking, cycling, and running.
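The coefficient of multiple correlation (CMC) reported above is commonly computed with a Kadaba-style formula comparing waveforms across measurement protocols. The sketch below is an illustrative implementation against synthetic joint-angle curves, not Pose2Sim's validation code.

```python
import numpy as np

def cmc(waveforms):
    """Coefficient of multiple correlation; waveforms: G x F array
    (G protocols or trials, F time frames)."""
    Y = np.asarray(waveforms, dtype=float)
    G, F = Y.shape
    frame_mean = Y.mean(axis=0)          # mean curve across protocols
    grand_mean = Y.mean()
    within = ((Y - frame_mean) ** 2).sum() / (F * (G - 1))
    total = ((Y - grand_mean) ** 2).sum() / (G * F - 1)
    return np.sqrt(1.0 - within / total)

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 101)
marker_based = 30 * np.sin(2 * np.pi * t)           # reference angle curve
markerless = marker_based + rng.normal(0, 2, t.size)  # noisy estimate
print("CMC:", cmc(np.stack([marker_based, markerless])))
```

Values near 1 indicate that the two waveforms agree well; the 0.65 reported for the hip in running reflects the systematic offset the abstract mentions.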
Affiliation(s)
- David Pagnon
  - Laboratoire Jean Kuntzmann, CNRS UMR 5224, Université Grenoble Alpes, 38400 Saint Martin d’Hères, France
  - Institut Pprime, CNRS UPR 3346, Université de Poitiers, 86360 Chasseneuil-du-Poitou, France
- Mathieu Domalain
  - Institut Pprime, CNRS UPR 3346, Université de Poitiers, 86360 Chasseneuil-du-Poitou, France
- Lionel Reveret
  - Laboratoire Jean Kuntzmann, CNRS UMR 5224, Université Grenoble Alpes, 38400 Saint Martin d’Hères, France
  - INRIA Grenoble Rhône-Alpes, 38330 Montbonnot-Saint-Martin, France
12. Krajnik W, Markiewicz Ł, Sitnik R. sSfS: Segmented Shape from Silhouette Reconstruction of the Human Body. Sensors 2022; 22:925. [PMID: 35161670] [PMCID: PMC8840191] [DOI: 10.3390/s22030925]
Abstract
Three-dimensional (3D) shape estimation of the human body has a growing number of applications in medicine, anthropometry, special effects, and many other fields. Therefore, the demand for high-quality acquisition of a complete and accurate body model is increasing. In this paper, a short survey of current state-of-the-art solutions is provided. One of the most commonly used approaches is the Shape-from-Silhouette (SfS) method, which is capable of reconstructing dynamic and challenging-to-capture objects. This paper proposes a novel approach that extends the conventional voxel-based SfS method with silhouette segmentation: segmented Shape from Silhouette (sSfS). It reconstructs body segments separately, which yields significantly better human body shape estimation, especially in concave areas. For validation, a dataset representing the human body in 20 complex poses was created and assessed with quality metrics against a ground-truth photogrammetric reconstruction. The number of invalid reconstruction voxels for sSfS was 1.7 times lower than for the state-of-the-art SfS approach, and the root-mean-square (RMS) error of the distance to the reference surface was 1.22 times lower.
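The baseline voxel-based SfS idea that sSfS extends can be shown in a toy: a voxel is kept only if it falls inside the silhouette in every view. The sketch below assumes orthographic projections and unit-disk silhouettes for brevity; sSfS would additionally carve each segmented body-part silhouette separately and merge the segment volumes.

```python
import numpy as np

# Voxel grid over a 3 m cube, 40^3 voxels
g = np.linspace(-1.5, 1.5, 40)
vox = np.stack(np.meshgrid(g, g, g, indexing="ij"), -1).reshape(-1, 3)

def in_silhouette(p2d):
    """Toy silhouette test: inside the unit disk."""
    return (p2d ** 2).sum(axis=1) <= 1.0

# Orthographic views along +x (keeps y,z) and +y (keeps x,z)
views = [vox[:, [1, 2]], vox[:, [0, 2]]]

# A voxel survives carving only if every view sees it inside the silhouette
occupied = np.logical_and.reduce([in_silhouette(p) for p in views])
print("visual hull keeps", occupied.sum(), "of", len(vox), "voxels")
```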
Affiliation(s)
- Wiktor Krajnik
  - Mnemosis S. A., 8 Józefa Str., 31-056 Krakow, Poland
  - Institute of Micromechanics and Photonics, Warsaw University of Technology, 8 Sw. Andrzeja Boboli Str., 02-525 Warsaw, Poland
- Łukasz Markiewicz
  - Mnemosis S. A., 8 Józefa Str., 31-056 Krakow, Poland
  - Institute of Micromechanics and Photonics, Warsaw University of Technology, 8 Sw. Andrzeja Boboli Str., 02-525 Warsaw, Poland
- Robert Sitnik
  - Mnemosis S. A., 8 Józefa Str., 31-056 Krakow, Poland
  - Institute of Micromechanics and Photonics, Warsaw University of Technology, 8 Sw. Andrzeja Boboli Str., 02-525 Warsaw, Poland
  - Correspondence: Tel.: +48-222348283
13. Yan J, Zhou M, Pan J, Yin M, Fang B. Recent Advances in 3D Human Pose Estimation: From Optimization to Implementation and Beyond. Int J Pattern Recognit Artif Intell 2022. [DOI: 10.1142/s0218001422550035]
Abstract
3D human pose estimation is the task of estimating the 3D articulated structure of a person from an image or a video. The technology has massive potential because it enables tracking people and analyzing motion in real time. Much research has been conducted on optimizing human pose estimation, but few works have reviewed 3D human pose estimation specifically. In this paper, we offer a comprehensive survey of state-of-the-art methods for 3D human pose estimation, covering pose estimation solutions, implementations on images or videos containing different numbers of people, and advanced 3D human pose estimation techniques. Furthermore, the algorithms are subdivided into sub-categories and compared in light of their different methodologies. To the best of our knowledge, this is the first comprehensive survey of recent progress in 3D human pose estimation, and we hope it will facilitate the completion, refinement, and application of 3D human pose estimation methods.
Affiliation(s)
- Jielu Yan
  - State Key Lab of Internet of Things for Smart City, University of Macau, Taipa, Macau 999078, P. R. China
- MingLiang Zhou
  - School of Computer Science, Chongqing University, Chongqing 400044, P. R. China
- Jinli Pan
  - TMS Measurement and Control Technology Co., Ltd., P. R. China
- Meng Yin
  - Chongqing Pharmaceutical Data Information Technology Co., Ltd., Building 3, Block B, Administration Centre, Nanan District, Chongqing, P. R. China
- Bin Fang
  - School of Computer Science, Chongqing University, Chongqing 400044, P. R. China
14. Lee SE, Shibata K, Nonaka S, Nobuhara S, Nishino K. Extrinsic Camera Calibration From a Moving Person. IEEE Robot Autom Lett 2022. [DOI: 10.1109/lra.2022.3192629]
Affiliation(s)
- Sang-Eun Lee
  - Graduate School of Informatics, Kyoto University
- Keisuke Shibata
  - Graduate School of Informatics, Kyoto University
- Soma Nonaka
  - Graduate School of Informatics, Kyoto University
- Shohei Nobuhara
  - Graduate School of Informatics, Kyoto University
- Ko Nishino
  - Graduate School of Informatics, Kyoto University
15. Halilaj E, Shin S, Rapp E, Xiang D. American Society of Biomechanics Early Career Achievement Award 2020: Toward portable and modular biomechanics labs: How video and IMU fusion will change gait analysis. J Biomech 2021; 129:110650. [PMID: 34644610] [DOI: 10.1016/j.jbiomech.2021.110650]
Abstract
The field of biomechanics is at a turning point, with marker-based motion capture set to be replaced by portable and inexpensive hardware, rapidly improving markerless tracking algorithms, and open datasets that will turn these new technologies into field-wide team projects. Despite progress, several challenges inhibit both inertial and vision-based motion tracking from reaching the high accuracies that many biomechanics applications require. Their complementary strengths, however, could be harnessed toward better solutions than those offered by either modality alone. The drift from inertial measurement units (IMUs) could be corrected by video data, while occlusions in videos could be corrected by inertial data. To expedite progress in this direction, we have collected the CMU Panoptic Dataset 2.0, which contains 86 subjects captured with 140 VGA cameras, 31 HD cameras, and 15 IMUs, performing on average 6.5 min of activities, including range of motion activities and tasks of daily living. To estimate ground-truth kinematics, we imposed simultaneous consistency with the video and IMU data. Three-dimensional joint centers were first computed by geometrically triangulating proposals from a convolutional neural network applied to each video independently. A statistical meshed model parametrized in terms of body shape and pose was then fit through a top-down optimization approach that enforced consistency with both the video-based joint centers and IMU data. As proof of concept, we used this dataset to benchmark pose estimation from a sparse set of sensors, showing that incorporation of complementary modalities is a promising frontier that can be further strengthened through physics-informed frameworks.
Affiliation(s)
- Eni Halilaj
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
  - Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Soyong Shin
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Eric Rapp
  - Department of Mechanical Engineering, Carnegie Mellon University, Pittsburgh, PA, USA
- Donglai Xiang
  - Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, USA
16. Dong X, Yang Y, Wei SE, Weng X, Sheikh Y, Yu SI. Supervision by Registration and Triangulation for Landmark Detection. IEEE Trans Pattern Anal Mach Intell 2021; 43:3681-3694. [PMID: 32248096] [DOI: 10.1109/tpami.2020.2983935]
Abstract
We present supervision by registration and triangulation (SRT), an unsupervised approach that utilizes unlabeled multi-view video to improve the accuracy and precision of landmark detectors. Being able to utilize unlabeled data enables our detectors to learn from the massive amounts of unlabeled data freely available, rather than being limited by the quality and quantity of manual human annotations. To utilize unlabeled data, there are two key observations: (I) the detections of the same landmark in adjacent frames should be coherent with registration, i.e., optical flow; (II) the detections of the same landmark in multiple synchronized and geometrically calibrated views should correspond to a single 3D point, i.e., multi-view consistency. Registration and multi-view consistency are sources of supervision that do not require manual labeling and can thus be leveraged to augment existing training data during detector training. End-to-end training is made possible by differentiable registration and 3D triangulation modules. Experiments with 11 datasets and a newly proposed metric to measure precision demonstrate accuracy and precision improvements in landmark detection on both images and video.
17. Li Z, Oskarsson M, Heyden A. Detailed 3D human body reconstruction from multi-view images combining voxel super-resolution and learned implicit representation. Appl Intell 2021. [DOI: 10.1007/s10489-021-02783-8]
Abstract
The task of reconstructing detailed 3D human body models from images is interesting but challenging in computer vision due to the high degrees of freedom of human bodies. This work proposes a coarse-to-fine method to reconstruct a detailed 3D human body from multi-view images, combining Voxel Super-Resolution (VSR) with a learned implicit representation. First, coarse 3D models are estimated by learning a Pixel-aligned Implicit Function based on Multi-scale Features (MF-PIFu), which are extracted from the multi-view images by multi-stage hourglass networks. Then, taking the low-resolution voxel grids generated from the coarse 3D models as input, VSR is implemented by learning an implicit function through a multi-stage 3D convolutional neural network. Finally, refined detailed 3D human body models are produced by VSR, which preserves detail and reduces the false reconstructions of the coarse 3D models. Benefiting from the implicit representation, the training process is memory efficient, and the detailed 3D human body produced from multi-view images is a continuous decision boundary with high-resolution geometry. In addition, the coarse-to-fine method based on MF-PIFu and VSR removes false reconstructions while preserving appearance details in the final reconstruction. In experiments, our method quantitatively and qualitatively achieves competitive 3D human body models from images with various poses and shapes on both real and synthetic datasets.
18. Vo M, Yumer E, Sunkavalli K, Hadap S, Sheikh Y, Narasimhan SG. Self-Supervised Multi-View Person Association and its Applications. IEEE Trans Pattern Anal Mach Intell 2021; 43:2794-2808. [PMID: 32086193] [DOI: 10.1109/tpami.2020.2974726]
Abstract
Reliable markerless motion tracking of people participating in a complex group activity from multiple moving cameras is challenging due to frequent occlusions, strong viewpoint and appearance variations, and asynchronous video streams. To solve this problem, reliable association of the same person across distant viewpoints and temporal instances is essential. We present a self-supervised framework to adapt a generic person appearance descriptor to unlabeled videos by exploiting motion tracking, mutual exclusion constraints, and multi-view geometry. The adapted discriminative descriptor is used in a tracking-by-clustering formulation. We validate the effectiveness of our descriptor learning on WILDTRACK (T. Chavdarova et al., "WILDTRACK: A multi-camera HD dataset for dense unscripted pedestrian detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5030-5039) and on three new complex social scenes captured by multiple cameras with up to 60 people "in the wild". We report significant improvement in association accuracy (up to 18 percent) and stable and coherent 3D human skeleton tracking (5 to 10 times) over the baseline. Using the reconstructed 3D skeletons, we cut the input videos into a multi-angle video in which the image of a specified person is shown from the best visible front-facing camera. Our algorithm detects inter-human occlusion to determine the camera switching moment while maintaining the flow of the action. Website: http://www.cs.cmu.edu/~ILIM/projects/IM/Association4Tracking.
19. Rapczyński M, Werner P, Handrich S, Al-Hamadi A. A Baseline for Cross-Database 3D Human Pose Estimation. Sensors 2021; 21:3769. [PMID: 34071704] [PMCID: PMC8198914] [DOI: 10.3390/s21113769]
Abstract
Vision-based 3D human pose estimation approaches are typically evaluated on datasets that are limited in diversity regarding many factors, e.g., subjects, poses, cameras, and lighting. However, for real-life applications, it would be desirable to create systems that work under arbitrary conditions (“in-the-wild”). To advance towards this goal, we investigated the commonly used datasets HumanEva-I, Human3.6M, and Panoptic Studio, discussed their biases (that is, their limitations in diversity), and illustrated them in cross-database experiments (for which we used a surrogate for roughly estimating in-the-wild performance). For this purpose, we first harmonized the differing skeleton joint definitions of the datasets, reducing the biases and systematic test errors in cross-database experiments. We further proposed a scale normalization method that significantly improved generalization across camera viewpoints, subjects, and datasets. In additional experiments, we investigated the effect of using more or less cameras, training with multiple datasets, applying a proposed anatomy-based pose validation step, and using OpenPose as the basis for the 3D pose estimation. The experimental results showed the usefulness of the joint harmonization, of the scale normalization, and of augmenting virtual cameras to significantly improve cross-database and in-database generalization. At the same time, the experiments showed that there were dataset biases that could not be compensated and call for new datasets covering more diversity. We discussed our results and promising directions for future work.
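One plausible form of the scale normalization discussed above is to root-center each pose and divide by the subject's total skeleton length, so poses from different subjects and datasets share a common scale. The bone list and joint indices below are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

# Hypothetical skeleton edges over a 5-joint toy skeleton
BONES = [(0, 1), (1, 2), (0, 3), (3, 4)]

def scale_normalize(joints3d, root=0):
    """joints3d: J x 3 pose; returns a root-centered, scale-normalized pose."""
    rel = joints3d - joints3d[root]
    skel_len = sum(np.linalg.norm(rel[a] - rel[b]) for a, b in BONES)
    return rel / skel_len

pose = np.random.default_rng(1).random((5, 3))
print(scale_normalize(pose))
```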
20. Zhang Z, Wang C, Qin W. Semantically Synchronizing Multiple-Camera Systems with Human Pose Estimation. Sensors 2021; 21:2464. [PMID: 33918255] [PMCID: PMC8038137] [DOI: 10.3390/s21072464]
Abstract
Multiple-camera systems can expand coverage and mitigate occlusion problems. However, temporal synchronization remains a problem for budget cameras and capture devices. We propose an out-of-the-box framework to temporally synchronize multiple cameras using semantic human pose estimation from the videos. Human pose predictions are obtained with an off-the-shelf pose estimator for each camera. Our method first synchronizes each pair of cameras by minimizing an energy function based on epipolar distances. We also propose a simple yet effective multiple-person association algorithm across cameras and a score-regularized energy function for improved performance. Second, we integrate the synchronized camera pairs into a graph and derive the optimal temporal-displacement configuration for the multiple-camera system. We evaluate our method on four public benchmark datasets and demonstrate robust sub-frame synchronization accuracy on all of them.
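The pairwise energy can be sketched as a sum of symmetric epipolar distances between corresponding joints at each candidate frame offset; the minimizing offset synchronizes the two cameras. The demo below is entirely synthetic (integer offsets, one tracked joint, known cameras) and omits the paper's person association, score regularization, and graph-level step.

```python
import numpy as np

def skew(v):
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def fundamental_from_cams(P1, P2):
    """F = [e2]_x P2 P1^+, with e2 the epipole of camera 1's center."""
    _, _, Vt = np.linalg.svd(P1)
    C1 = Vt[-1]                        # camera 1 center (right null vector)
    e2 = P2 @ C1
    return skew(e2) @ P2 @ np.linalg.pinv(P1)

def epi_dist(F, x1, x2):
    """Symmetric point-to-epipolar-line distance; x1, x2: N x 3 homogeneous."""
    l2 = (F @ x1.T).T
    l1 = (F.T @ x2.T).T
    d2 = np.abs(np.sum(l2 * x2, 1)) / np.hypot(l2[:, 0], l2[:, 1])
    d1 = np.abs(np.sum(l1 * x1, 1)) / np.hypot(l1[:, 0], l1[:, 1])
    return d1 + d2

def project(P, X):
    x = (P @ np.hstack([X, np.ones((len(X), 1))]).T).T
    return x / x[:, 2:3]               # homogeneous pixels (u, v, 1)

# Two synthetic calibrated views with a 1 m baseline
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])
F = fundamental_from_cams(P1, P2)

# A "joint" moving over 70 frames; camera 2 started 7 frames earlier
t = np.arange(70)
traj = np.stack([0.3 * np.sin(t / 7), 0.2 * np.cos(t / 9),
                 4 + 0.5 * np.sin(t / 11)], 1)
track1 = project(P1, traj[7:67])       # camera 1 sees frames 7..66
track2 = project(P2, traj)             # camera 2 sees frames 0..69

energies = [epi_dist(F, track1, track2[k:k + 60]).sum() for k in range(10)]
print("recovered offset:", int(np.argmin(energies)))   # expect 7
```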
Affiliation(s)
- Zhe Zhang
  - School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China
- Chunyu Wang
  - Microsoft Research Asia, Beijing 100089, China
- Wenhu Qin
  - School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China
21. Yang F, Gao Y, Ma R, Zojaji S, Castellano G, Peters C. A dataset of human and robot approach behaviors into small free-standing conversational groups. PLoS One 2021; 16:e0247364. [PMID: 33630908] [PMCID: PMC7906375] [DOI: 10.1371/journal.pone.0247364]
Abstract
The analysis and simulation of the interactions that occur in group situations is important when humans and artificial agents, physical or virtual, must coordinate when inhabiting similar spaces or even collaborate, as in the case of human-robot teams. Artificial systems should adapt to the natural interfaces of humans rather than the other way around. Such systems should be sensitive to human behaviors, which are often social in nature, and account for human capabilities when planning their own behaviors. A limiting factor relates to our understanding of how humans behave with respect to each other and with artificial embodiments, such as robots. To this end, we present CongreG8 (pronounced 'con-gre-gate'), a novel dataset containing the full-body motions of free-standing conversational groups of three humans and a newcomer that approaches the groups with the intent of joining them. The aim has been to collect an accurate and detailed set of positioning, orienting and full-body behaviors when a newcomer approaches and joins a small group. The dataset contains trials from human and robot newcomers. Additionally, it includes questionnaires about the personality of participants (BFI-10), their perception of robots (Godspeed), and custom human/robot interaction questions. An overview and analysis of the dataset is also provided, which suggests that human groups are more likely to alter their configuration to accommodate a human newcomer than a robot newcomer. We conclude by providing three use cases that the dataset has already been applied to in the domains of behavior detection and generation in real and virtual environments. A sample of the CongreG8 dataset is available at https://zenodo.org/record/4537811.
Affiliation(s)
- Fangkai Yang
  - Department of Computational Science and Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Yuan Gao
  - Department of Information Technology, Uppsala University, Uppsala, Sweden
- Ruiyang Ma
  - Department of Computational Science and Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Sahba Zojaji
  - Department of Computational Science and Technology, KTH Royal Institute of Technology, Stockholm, Sweden
- Ginevra Castellano
  - Department of Information Technology, Uppsala University, Uppsala, Sweden
- Christopher Peters
  - Department of Computational Science and Technology, KTH Royal Institute of Technology, Stockholm, Sweden
22. Stamm O, Heimann-Steinert A. Accuracy of Monocular Two-Dimensional Pose Estimation Compared With a Reference Standard for Kinematic Multiview Analysis: Validation Study. JMIR Mhealth Uhealth 2020; 8:e19608. [PMID: 33346739] [PMCID: PMC7781802] [DOI: 10.2196/19608]
Abstract
BACKGROUND: Expensive optoelectronic systems, considered the gold standard, require a laboratory environment and the attachment of markers, and they are therefore rarely used in everyday clinical practice. Two-dimensional (2D) human pose estimation for clinical purposes allows kinematic analyses to be carried out via a camera-based smartphone app. Since clinical specialists depend heavily on the validity of information, there is a need to evaluate the accuracy of 2D pose estimation apps.
OBJECTIVE: The aim of the study was to investigate the accuracy of the 2D pose estimation of a mobility analysis app (Lindera-v2), using the PanopticStudio Toolbox dataset as a reference standard. The study aimed to assess the differences in joint angles obtained from 2D video information generated with the Lindera-v2 algorithm and from the reference standard. The results can provide an important assessment of the adequacy of the app for clinical use.
METHODS: To evaluate the accuracy of the Lindera-v2 algorithm, 10 video sequences were analyzed. Accuracy was evaluated by assessing a total of 30,000 data pairs for each joint (10 joints in total), comparing the angle data obtained from the Lindera-v2 algorithm with those of the reference standard. The mean differences of the angles were calculated for each joint, and a comparison was made between the estimated values and the reference standard values. Furthermore, the mean absolute error (MAE), root-mean-square error, and symmetric mean absolute percentage error of the 2D angles were calculated. Agreement between the two measurement methods was calculated using the intraclass correlation coefficient (ICC[A,2]). A cross-correlation was calculated for the time series to verify whether there was a temporal shift in the data.
RESULTS: The mean difference of the Lindera-v2 data in the right hip was the closest to the reference standard, with a mean value difference of -0.05° (SD 6.06°). The greatest difference in comparison with the baseline was found in the neck, with a measurement of -3.07° (SD 6.43°). The MAE of the angle measurement closest to the baseline was observed in the pelvis (1.40°, SD 1.48°). In contrast, the largest MAE was observed in the right shoulder (6.48°, SD 8.43°). The medians of all acquired joints ranged in difference from 0.19° to 3.17° compared with the reference standard. The ICC values ranged from 0.951 (95% CI 0.914-0.969) in the neck to 0.997 (95% CI 0.997-0.997) in the left elbow joint. The cross-correlation showed that the Lindera-v2 algorithm had no temporal lag.
CONCLUSIONS: The results of the study indicate that 2D pose estimation by means of a smartphone app can show excellent agreement with a validated reference standard. An assessment of kinematic variables can be performed with the analyzed algorithm, showing only minimal deviations compared with data from a massive multiview system.
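The agreement statistics reported here (MAE, RMSE, sMAPE, and a two-way random-effects, absolute-agreement ICC) follow standard formulas; the sketch below is an illustrative implementation on synthetic angle data, not the study's analysis code.

```python
import numpy as np

def mae(a, b):
    return np.mean(np.abs(a - b))

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

def smape(a, b):
    """Symmetric mean absolute percentage error, in percent."""
    return 100 * np.mean(2 * np.abs(a - b) / (np.abs(a) + np.abs(b)))

def icc_a2(X):
    """ICC(A,k): n targets x k raters, two-way random, absolute agreement,
    average measures, via the standard ANOVA mean squares."""
    n, k = X.shape
    grand = X.mean()
    msr = k * ((X.mean(1) - grand) ** 2).sum() / (n - 1)   # rows (targets)
    msc = n * ((X.mean(0) - grand) ** 2).sum() / (k - 1)   # columns (raters)
    sse = ((X - X.mean(1, keepdims=True) - X.mean(0) + grand) ** 2).sum()
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

rng = np.random.default_rng(0)
ref = rng.normal(30, 10, 300)            # reference-standard joint angles
est = ref + rng.normal(0, 2, 300)        # app-estimated joint angles
print(f"MAE={mae(ref, est):.2f}  RMSE={rmse(ref, est):.2f}  "
      f"sMAPE={smape(ref, est):.1f}%  "
      f"ICC={icc_a2(np.column_stack([ref, est])):.3f}")
```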
Affiliation(s)
- Oskar Stamm
  - Geriatrics Research Group, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany
- Anika Heimann-Steinert
  - Geriatrics Research Group, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany
23. Zhang Z, Wang C, Qiu W, Qin W, Zeng W. AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild. Int J Comput Vis 2020. [DOI: 10.1007/s11263-020-01398-9]
24. A Comprehensive Study on Deep Learning-Based 3D Hand Pose Estimation Methods. Appl Sci (Basel) 2020. [DOI: 10.3390/app10196850]
Abstract
The field of 3D hand pose estimation has been gaining a lot of attention recently, due to its significance in several applications that require human-computer interaction (HCI). The utilization of technological advances, such as cost-efficient depth cameras coupled with the explosive progress of Deep Neural Networks (DNNs), has led to a significant boost in the development of robust markerless 3D hand pose estimation methods. Nonetheless, finger occlusions and rapid motions still pose significant challenges to the accuracy of such methods. In this survey, we provide a comprehensive study of the most representative deep learning-based methods in the literature and propose a new taxonomy based primarily on input data modality: RGB, depth, or multimodal information. Finally, we demonstrate results on the most popular RGB- and depth-based datasets and discuss potential research directions in this rapidly growing field.
25. Bu F, Le T, Du X, Vasudevan R, Johnson-Roberson M. Pedestrian Planar LiDAR Pose (PPLP) Network for Oriented Pedestrian Detection Based on Planar LiDAR and Monocular Images. IEEE Robot Autom Lett 2020. [DOI: 10.1109/lra.2019.2962358]
26. Liu Y, Chen J, Hu C, Ma Y, Ge D, Miao S, Xue Y, Li L. Vision-Based Method for Automatic Quantification of Parkinsonian Bradykinesia. IEEE Trans Neural Syst Rehabil Eng 2019; 27:1952-1961. [PMID: 31502982] [DOI: 10.1109/tnsre.2019.2939596]
Abstract
Non-volitional discontinuation of motion, namely bradykinesia, is a common motor symptom among patients with Parkinson's disease (PD). Evaluating bradykinesia severity is an important part of clinical examinations of PD patients in both the diagnosis and monitoring phases. However, subjective evaluations from different clinicians often show low consistency, and research exploring objective quantification of bradykinesia has mostly been based on highly integrated sensors. Although these sensor-based methods demonstrate applaudable performance, it is unrealistic to promote them for wide use because the special devices they require are far from commonplace in daily life. In this paper, we take advantage of computer vision and machine learning, proposing a vision-based method to automatically and objectively quantify bradykinesia severity. Three bradykinesia-related items are investigated in our study: finger tapping, hand clasping, and hand pro/supination. In our method, human pose estimation is used to extract kinematic characteristics, and supervised-learning-based classifiers are employed to generate score ratings. A clinical experiment on 60 patients shows that the scoring accuracy of our method over 360 examination videos is 89.7%, which is competitive with related work. The only devices our method requires are a camera for instrumentation and a laptop for data processing. Therefore, our method can produce reliable assessments of Parkinsonian bradykinesia with minimal device requirements, showing great potential for long-term remote monitoring of patients' condition.
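The kind of kinematic characteristic such a method extracts from pose keypoints can be illustrated simply: from a thumb-index distance signal, estimate tap amplitude and dominant frequency, features that would then feed a supervised classifier to produce a severity score. The signal and feature choices below are assumptions, not the paper's exact feature set.

```python
import numpy as np

fps = 30.0
rng = np.random.default_rng(0)
t = np.arange(0, 10, 1 / fps)
# Synthetic thumb-index distance from a finger-tapping video, ~3 Hz tapping
dist = 0.5 + 0.4 * np.sin(2 * np.pi * 3.0 * t) + rng.normal(0, 0.02, t.size)

amplitude = dist.max() - dist.min()                  # tap amplitude feature
spectrum = np.abs(np.fft.rfft(dist - dist.mean()))   # drop the DC component
freqs = np.fft.rfftfreq(dist.size, 1 / fps)
dominant_hz = freqs[spectrum.argmax()]               # tap frequency feature
print(f"tap amplitude={amplitude:.2f}, frequency={dominant_hz:.2f} Hz")
```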