1. Lim J, Luo C, Lee S, Song YE, Jung H. Action Recognition of Taekwondo Unit Actions Using Action Images Constructed with Time-Warped Motion Profiles. Sensors (Basel) 2024; 24:2595. [PMID: 38676211; PMCID: PMC11055144; DOI: 10.3390/s24082595]
Abstract
Taekwondo has evolved from a traditional martial art into an official Olympic sport. This study introduces a novel action recognition model tailored for Taekwondo unit actions, utilizing joint-motion data acquired via wearable inertial measurement unit (IMU) sensors. The utilization of IMU sensor-measured motion data facilitates the capture of the intricate and rapid movements characteristic of Taekwondo techniques. The model, underpinned by a conventional convolutional neural network (CNN)-based image classification framework, synthesizes action images to represent individual Taekwondo unit actions. These action images are generated by mapping joint-motion profiles onto the RGB color space, thus encapsulating the motion dynamics of a single unit action within a solitary image. To further refine the representation of rapid movements within these images, a time-warping technique was applied, adjusting motion profiles in relation to the velocity of the action. The effectiveness of the proposed model was assessed using a dataset compiled from 40 Taekwondo experts, yielding remarkable outcomes: an accuracy of 0.998, a precision of 0.983, a recall of 0.982, and an F1 score of 0.982. These results underscore this time-warping technique's contribution to enhancing feature representation, as well as the proposed method's scalability and effectiveness in recognizing Taekwondo unit actions.
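The core encoding step above (per-joint motion profiles mapped to an RGB action image, resampled according to movement speed) can be sketched in a few lines. The array shapes, the speed-proportional warping rule, and the min-max colour scaling below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def time_warp(profile, n_out=64, eps=1e-6):
    """Resample a (T, C) motion profile so that fast segments get more rows.

    Illustrative assumption: sampling density is made proportional to the
    instantaneous speed of the profile, so rapid movements are stretched.
    """
    speed = np.linalg.norm(np.diff(profile, axis=0), axis=1) + eps
    cdf = np.concatenate([[0.0], np.cumsum(speed)])
    cdf /= cdf[-1]                                  # monotone warp in [0, 1]
    t_new = np.interp(np.linspace(0, 1, n_out), cdf, np.arange(len(profile)))
    idx = np.clip(np.round(t_new).astype(int), 0, len(profile) - 1)
    return profile[idx]

def action_image(joint_profiles, n_out=64):
    """Map per-joint 3-axis motion profiles (T, J, 3) to one RGB image.

    Rows index warped time, columns index joints, and the three motion
    axes are scaled to the R, G, B channels.
    """
    T, J, _ = joint_profiles.shape
    warped = time_warp(joint_profiles.reshape(T, -1), n_out).reshape(n_out, J, 3)
    lo, hi = warped.min(axis=(0, 1)), warped.max(axis=(0, 1))
    return ((warped - lo) / (hi - lo + 1e-6) * 255).astype(np.uint8)

# Example: 200 IMU samples, 12 joints, 3 axes -> one 64 x 12 RGB image
img = action_image(np.random.randn(200, 12, 3))
print(img.shape)  # (64, 12, 3)
```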
Affiliation(s)
- Junghwan Lim: Department of Motion, Torooc Co., Ltd., Seoul 04585, Republic of Korea
- Chenglong Luo: Department of Mechanical Engineering, Konkuk University, Seoul 05029, Republic of Korea
- Seunghun Lee: School of Mechanical and Aerospace Engineering, Seoul National University, Seoul 08826, Republic of Korea
- Young Eun Song: Department of Autonomous Mobility, Korea University, Sejong 30019, Republic of Korea
- Hoeryong Jung: Department of Mechanical Engineering, Konkuk University, Seoul 05029, Republic of Korea
2. Chen Z, Huang W, Liu H, Wang Z, Wen Y, Wang S. ST-TGR: Spatio-Temporal Representation Learning for Skeleton-Based Teaching Gesture Recognition. Sensors (Basel) 2024; 24:2589. [PMID: 38676207; PMCID: PMC11054209; DOI: 10.3390/s24082589]
Abstract
Teaching gesture recognition is a technique used to recognize the hand movements of teachers in classroom teaching scenarios. This technology is widely used in education, including for classroom teaching evaluation, enhancing online teaching, and assisting special education. However, current research on gesture recognition in teaching mainly focuses on detecting the static gestures of individual students and analyzing their classroom behavior. To analyze the teacher's gestures and mitigate the difficulty of single-target dynamic gesture recognition in multi-person teaching scenarios, this paper proposes ST-TGR, a skeleton-based teaching gesture recognition method built on spatio-temporal representation learning. This method uses the human pose estimation technique RTMPose to extract the coordinates of the keypoints of the teacher's skeleton and then inputs the recognized teacher skeleton sequence into the MoGRU action recognition network to classify gesture actions. The MoGRU action recognition module learns the spatio-temporal representation of target actions by stacking multi-scale bidirectional gated recurrent units (BiGRU) and using improved attention mechanism modules. To validate the generalization of the action recognition network model, we conducted comparative experiments on the NTU RGB+D 60, UT-Kinect Action3D, SBU Kinect Interaction, and Florence 3D datasets. The results indicate that, compared with most existing baseline models, the proposed model achieves better recognition accuracy and speed.
Affiliation(s)
- Zengzhao Chen: Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China; National Engineering Research Center for E-Learning, Central China Normal University, Wuhan 430079, China
- Wenkai Huang: Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
- Hai Liu: Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China; National Engineering Research Center for E-Learning, Central China Normal University, Wuhan 430079, China
- Zhuo Wang: Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan 430079, China
- Yuqun Wen: Faculty of Literature and Journalism, Xiangtan University, Xiangtan 411105, China
- Shengming Wang: National Engineering Research Center of Big Data, Central China Normal University, Wuhan 430079, China
3. Enkhbat A, Shih TK, Cheewaprakobkit P. Human Action Recognition and Note Recognition: A Deep Learning Approach Using STA-GCN. Sensors (Basel) 2024; 24:2519. [PMID: 38676137; PMCID: PMC11054163; DOI: 10.3390/s24082519]
Abstract
Human action recognition (HAR) is a growing area of machine learning with a wide range of applications. One challenging aspect of HAR is recognizing human actions while playing music, further complicated by the need to recognize the musical notes being played. This paper proposes a deep learning-based method for simultaneous HAR and musical note recognition in music performances. We conducted experiments on performances of the Morin khuur, a traditional Mongolian instrument. The proposed method consists of two stages. First, we created a new dataset of Morin khuur performances. We used motion capture systems and depth sensors to collect data that includes hand keypoints, instrument segmentation information, and detailed movement information. We then analyzed RGB images, depth images, and motion data to determine which type of data provides the most valuable features for recognizing actions and notes in music performances. The second stage utilizes a Spatial Temporal Attention Graph Convolutional Network (STA-GCN) to recognize musical notes as continuous gestures. The STA-GCN model is designed to learn the relationships between hand keypoints and instrument segmentation information, which are crucial for accurate recognition. Evaluation on our dataset demonstrates that our model outperforms the traditional ST-GCN model, achieving an accuracy of 81.4%.
Affiliation(s)
- Avirmed Enkhbat: Department of Computer Science and Information Engineering, National Central University, Taoyuan City 32001, Taiwan
- Timothy K Shih: Department of Computer Science and Information Engineering, National Central University, Taoyuan City 32001, Taiwan
- Pimpa Cheewaprakobkit: Department of Computer Science and Information Engineering, National Central University, Taoyuan City 32001, Taiwan; Department of Information Technology, Asia-Pacific International University, Saraburi 18180, Thailand
4. Gu J, Yi Y, Li Q. Motion sensitive network for action recognition in control and decision-making of autonomous systems. Front Neurosci 2024; 18:1370024. [PMID: 38591065; PMCID: PMC11000707; DOI: 10.3389/fnins.2024.1370024]
Abstract
Spatial-temporal modeling is crucial for action recognition in videos within the field of artificial intelligence. However, robustly extracting motion information remains a primary challenge due to temporal deformations of appearances and variations in motion frequencies between different actions. To address these issues, we propose an innovative and effective method called the Motion Sensitive Network (MSN), incorporating the theories of artificial neural networks and key concepts of autonomous system control and decision-making. Specifically, we employ a Spatial-Temporal Pyramid Motion Extraction (STP-ME) module, adjusting convolution kernel sizes and time intervals synchronously to gather motion information at different temporal scales, aligning with the learning and prediction characteristics of artificial neural networks. Additionally, we introduce a new module called Variable Scale Motion Excitation (DS-ME), utilizing a differential model to capture motion information in resonance with the flexibility of autonomous system control. In particular, we employ a multi-scale deformable convolutional network to alter the motion scale of the target object before computing temporal differences across consecutive frames, providing theoretical support for the flexibility of autonomous systems. Temporal modeling is a crucial step in understanding environmental changes and actions within autonomous systems, and MSN, by integrating the advantages of artificial neural networks (ANN) in this task, provides an effective framework for the future use of artificial neural networks in autonomous systems. We evaluate our proposed method on three challenging action recognition datasets (Kinetics-400, Something-Something V1, and Something-Something V2). The results indicate an improvement in accuracy ranging from 1.1% to 2.2% on the test set. Compared with state-of-the-art (SOTA) methods, the proposed approach achieves a maximum performance of 89.90%. In ablation experiments, the module yields performance gains ranging from 2% to 5.3%. The introduced Motion Sensitive Network (MSN) demonstrates significant potential in various challenging scenarios, providing an initial exploration into integrating artificial neural networks into the domain of autonomous systems.
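As a rough illustration of the temporal-difference idea behind such motion excitation modules, here is a minimal sketch in the spirit of DS-ME; the (N, T, C, H, W) layout, the squeeze-and-expand channel attention, and the last-frame padding are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    """Sketch of a difference-based motion excitation block: channel
    attention driven by temporal differences between consecutive frames."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.squeeze = nn.Conv2d(channels, mid, 1)   # reduce channels
        self.expand = nn.Conv2d(mid, channels, 1)    # restore channels
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                            # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        feat = self.squeeze(x.reshape(n * t, c, h, w)).reshape(n, t, -1, h, w)
        diff = feat[:, 1:] - feat[:, :-1]            # frame-to-frame motion
        diff = torch.cat([diff, diff[:, -1:]], dim=1)  # pad back to T frames
        att = torch.sigmoid(self.expand(self.pool(diff.flatten(0, 1))))
        return x * att.reshape(n, t, c, 1, 1)        # motion-gated features

x = torch.randn(2, 8, 32, 14, 14)
print(MotionExcitation(32)(x).shape)  # torch.Size([2, 8, 32, 14, 14])
```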
Affiliation(s)
- Jialiang Gu: Computer Science and Engineering, Sun Yat-sen University, Guangdong, China
5. Oh Y. Data Augmentation Techniques for Accurate Action Classification in Stroke Patients with Hemiparesis. Sensors (Basel) 2024; 24:1618. [PMID: 38475154; DOI: 10.3390/s24051618]
Abstract
Stroke survivors with hemiparesis require extensive home-based rehabilitation. Deep learning-based classifiers can detect actions and provide feedback based on patient data; however, this is difficult owing to data sparsity and heterogeneity. In this study, we investigate data augmentation and model training strategies to address this problem. Three transformations are tested with varying data volumes to analyze the changes in the classification performance of individual data. Moreover, the impact of transfer learning from a pre-trained one-dimensional convolutional neural network (Conv1D) and of training with an advanced InceptionTime model is estimated with data augmentation. With Conv1D, joint training on data from non-disabled (ND) participants and doubly rotation-augmented data from stroke patients outperforms the baseline in terms of F1-score (60.9% vs. 47.3%). Transfer learning pre-trained with ND data exhibits 60.3% accuracy, whereas joint training with InceptionTime exhibits 67.2% accuracy under the same conditions. Our results indicate that rotational augmentation is more effective than other techniques for individual data with initially lower performance and for subsets with smaller numbers of participants, suggesting that joint training on rotationally augmented ND and stroke data enhances classification performance, particularly in cases with sparse data and lower initial performance.
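A minimal sketch of the kind of rotational augmentation described above, applied to 3-axis inertial data: one random small 3D rotation is applied to every time step, mimicking a slightly different sensor orientation on the body. The angle range and single-rotation scheme are illustrative assumptions (the paper's "double" rotational variant presumably composes two such rotations):

```python
import numpy as np

def rotate_imu(sample, max_deg=15.0, rng=None):
    """Rotate a (T, 3) inertial signal by one random rotation.

    max_deg bounds each Euler angle; the value is an illustrative choice.
    """
    if rng is None:
        rng = np.random.default_rng()
    a, b, g = np.deg2rad(rng.uniform(-max_deg, max_deg, size=3))
    Rx = np.array([[1, 0, 0], [0, np.cos(a), -np.sin(a)], [0, np.sin(a), np.cos(a)]])
    Ry = np.array([[np.cos(b), 0, np.sin(b)], [0, 1, 0], [-np.sin(b), 0, np.cos(b)]])
    Rz = np.array([[np.cos(g), -np.sin(g), 0], [np.sin(g), np.cos(g), 0], [0, 0, 1]])
    return sample @ (Rz @ Ry @ Rx).T  # same rotation for all time steps

augmented = rotate_imu(np.random.randn(500, 3))
print(augmented.shape)  # (500, 3)
```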
Affiliation(s)
- Youngmin Oh: School of Computing, Gachon University, Seongnam 13120, Republic of Korea
6. Enoki M, Watanabe K, Noguchi H. Single Person Identification and Activity Estimation in a Room from Waist-Level Contours Captured by 2D Light Detection and Ranging. Sensors (Basel) 2024; 24:1272. [PMID: 38400430; PMCID: PMC10892201; DOI: 10.3390/s24041272]
Abstract
To develop socially assistive robots for monitoring older adults at home, a sensor is required to identify residents and capture activities within the room without violating privacy. We focused on 2D Light Detection and Ranging (2D-LIDAR) capable of robustly measuring human contours in a room. While horizontal 2D contour data can provide human location, identifying humans and activities from these contours is challenging. To address this issue, we developed novel methods using deep learning techniques. This paper proposes methods for person identification and activity estimation in a room using contour point clouds captured by a single 2D-LIDAR at hip height. In this approach, human contours were extracted from 2D-LIDAR data using density-based spatial clustering of applications with noise. Subsequently, the person and activity within a 10-s interval were estimated employing deep learning techniques. Two deep learning models, namely Long Short-Term Memory (LSTM) and image classification (VGG16), were compared. In the experiment, a total of 120 min of walking data and 100 min of additional activities (door opening, sitting, and standing) were collected from four participants. The LSTM-based and VGG16-based methods achieved accuracies of 65.3% and 89.7%, respectively, for person identification among the four individuals. Furthermore, these methods demonstrated accuracies of 94.2% and 97.9%, respectively, for the estimation of the four activities. Despite the 2D-LIDAR point clouds at hip height containing small features related to gait, the results indicate that the VGG16-based method has the capability to identify individuals and accurately estimate their activities.
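The first stage (clustering a hip-height scan into candidate human contours) maps directly onto scikit-learn's DBSCAN; the eps and min_samples values below are illustrative, not the paper's settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def extract_contours(scan_xy, eps=0.1, min_samples=5):
    """Cluster one 2D-LIDAR scan (N, 2), in metres, into candidate
    human contours with DBSCAN; label -1 marks noise points."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scan_xy)
    return [scan_xy[labels == k] for k in set(labels) if k != -1]

scan = np.random.rand(300, 2)          # stand-in for one hip-height scan
clusters = extract_contours(scan)
print(len(clusters), "candidate contours")
```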
Affiliation(s)
- Mizuki Enoki: Graduate School of Engineering, Osaka City University, Osaka 558-8585, Japan
- Kai Watanabe: Graduate School of Engineering, Osaka City University, Osaka 558-8585, Japan
- Hiroshi Noguchi: Graduate School of Engineering, Osaka Metropolitan University, Osaka 558-8585, Japan
7. Yuan L, He Z, Wang Q, Xu L. Advancing Human Motion Recognition with SkeletonCLIP++: Weighted Video Feature Integration and Enhanced Contrastive Sample Discrimination. Sensors (Basel) 2024; 24:1189. [PMID: 38400347; PMCID: PMC10892604; DOI: 10.3390/s24041189]
Abstract
This paper introduces 'SkeletonCLIP++', an extension of our prior work in human action recognition, emphasizing the use of semantic information beyond traditional label-based methods. The innovation, 'Weighted Frame Integration' (WFI), shifts video feature computation from averaging to a weighted frame approach, enabling a more nuanced representation of human movements in line with semantic relevance. Another key development, 'Contrastive Sample Identification' (CSI), introduces a novel discriminative task within the model. This task involves identifying the most similar negative sample among positive ones, enhancing the model's ability to distinguish between closely related actions. Incorporating the 'BERT Text Encoder Integration' (BTEI) leverages the pre-trained BERT model as our text encoder to refine the model's performance. Empirical evaluations on HMDB-51, UCF-101, and NTU RGB+D 60 datasets illustrate positive improvements, especially in smaller datasets. 'SkeletonCLIP++' thus offers a refined approach to human action recognition, ensuring semantic integrity and detailed differentiation in video data analysis.
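The WFI idea, replacing mean pooling of per-frame features with a learned weighting, can be sketched as follows; the feature dimension and the linear scoring head are assumptions for illustration:

```python
import torch
import torch.nn as nn

class WeightedFrameIntegration(nn.Module):
    """Pool per-frame features with learned softmax weights instead of a
    plain average, so semantically relevant frames contribute more."""
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)          # per-frame relevance score

    def forward(self, frames):                  # frames: (N, T, dim)
        w = torch.softmax(self.score(frames), dim=1)   # (N, T, 1)
        return (w * frames).sum(dim=1)                 # (N, dim)

video_feat = WeightedFrameIntegration()(torch.randn(4, 16, 512))
print(video_feat.shape)  # torch.Size([4, 512])
```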
Affiliation(s)
- Qiang Wang: Department of Control Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
8. Shi L, Wang R, Zhao J, Zhang J, Kuang Z. Detection of Rehabilitation Training Effect of Upper Limb Movement Disorder Based on MPL-CNN. Sensors (Basel) 2024; 24:1105. [PMID: 38400263; PMCID: PMC10892837; DOI: 10.3390/s24041105]
Abstract
Stroke represents a medical emergency and can lead to the development of movement disorders such as abnormal muscle tone, limited range of motion, or abnormalities in coordination and balance. In order to help stroke patients recover as soon as possible, rehabilitation training methods employ various movement modes such as ordinary movements and joint reactions to induce active reactions in the limbs and gradually restore normal functions. Rehabilitation effect evaluation can help physicians understand the rehabilitation needs of different patients, determine effective treatment methods and strategies, and improve treatment efficiency. To achieve real-time, accurate action detection, this article uses Mediapipe's detection algorithm and proposes a model based on MPL-CNN. Mediapipe can be used to identify key point features of the patient's upper limbs and simultaneously identify key point features of the hand. In order to detect the effect of rehabilitation training for upper limb movement disorders, LSTM and CNN are combined to form a new LSTM-CNN model, which is used to identify the action features of upper limb rehabilitation training extracted by Mediapipe. The MPL-CNN model can effectively identify the accuracy of rehabilitation movements during upper limb rehabilitation training for stroke patients. In order to ensure the scientific validity and unified standards of rehabilitation training movements, this article employs the postures in the Fugl-Meyer Upper Limb Rehabilitation Training Functional Assessment Form (FMA) and establishes an FMA upper limb rehabilitation dataset for experimental verification. Experimental results show that in each stage of the Fugl-Meyer upper limb rehabilitation training evaluation effect detection, the MPL-CNN-based method's recognition accuracy of upper limb rehabilitation training actions reached 95%. At the same time, the average accuracy rate across various upper limb rehabilitation training actions reaches 97.54%. This shows that the model is highly robust across different action categories and proves that the MPL-CNN model is an effective and feasible solution. This method based on MPL-CNN can provide a high-precision detection method for the evaluation of rehabilitation effects of upper limb movement disorders after stroke, helping clinicians evaluate the patient's rehabilitation progress and adjust the rehabilitation plan based on the evaluation results. This will help improve the personalization and precision of rehabilitation treatment and promote patient recovery.
Affiliation(s)
- Lijuan Shi: College of Electronic Information Engineering, Changchun University, Changchun 130012, China; Jilin Provincial Key Laboratory of Human Health Status Identification Function & Enhancement, Changchun 130022, China; Key Laboratory of Intelligent Rehabilitation and Barrier-Free for the Disabled, Changchun University, Ministry of Education, Changchun 130012, China
- Runmin Wang: College of Electronic Information Engineering, Changchun University, Changchun 130012, China
- Jian Zhao: Jilin Provincial Key Laboratory of Human Health Status Identification Function & Enhancement, Changchun 130022, China; Key Laboratory of Intelligent Rehabilitation and Barrier-Free for the Disabled, Changchun University, Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Changchun University, Changchun 130022, China
- Jing Zhang: College of Electronic Information Engineering, Changchun University, Changchun 130012, China
- Zhejun Kuang: Jilin Provincial Key Laboratory of Human Health Status Identification Function & Enhancement, Changchun 130022, China; Key Laboratory of Intelligent Rehabilitation and Barrier-Free for the Disabled, Changchun University, Ministry of Education, Changchun 130012, China; College of Computer Science and Technology, Changchun University, Changchun 130022, China
9. Karakose-Akbiyik S, Sussman O, Wurm MF, Caramazza A. The Role of Agentive and Physical Forces in the Neural Representation of Motion Events. J Neurosci 2024; 44:e1363232023. [PMID: 38050107; PMCID: PMC10860628; DOI: 10.1523/jneurosci.1363-23.2023]
Abstract
How does the brain represent information about motion events in relation to agentive and physical forces? In this study, we investigated the neural activity patterns associated with observing animated actions of agents (e.g., an agent hitting a chair) in comparison to similar movements of inanimate objects that were either shaped solely by the physics of the scene (e.g., gravity causing an object to fall down a hill and hit a chair) or initiated by agents (e.g., a visible agent causing an object to hit a chair). Using an fMRI-based multivariate pattern analysis (MVPA), this design allowed testing where in the brain the neural activity patterns associated with motion events change as a function of, or are invariant to, agentive versus physical forces behind them. A total of 29 human participants (nine male) participated in the study. Cross-decoding revealed a shared neural representation of animate and inanimate motion events that is invariant to agentive or physical forces in regions spanning frontoparietal and posterior temporal cortices. In contrast, the right lateral occipitotemporal cortex showed a higher sensitivity to agentive events, while the left dorsal premotor cortex was more sensitive to information about inanimate object events that were solely shaped by the physics of the scene.
Affiliation(s)
- Oliver Sussman: Department of Psychology, Harvard University, Cambridge, Massachusetts 02138
- Moritz F Wurm: Center for Mind/Brain Sciences - CIMeC, University of Trento, 38068 Rovereto, Italy
- Alfonso Caramazza: Department of Psychology, Harvard University, Cambridge, Massachusetts 02138; Center for Mind/Brain Sciences - CIMeC, University of Trento, 38068 Rovereto, Italy
10. Croom S, Zhou H, Firestone C. Seeing and understanding epistemic actions. Proc Natl Acad Sci U S A 2023; 120:e2303162120. [PMID: 37983484; DOI: 10.1073/pnas.2303162120]
Abstract
Many actions have instrumental aims, in which we move our bodies to achieve a physical outcome in the environment. However, we also perform actions with epistemic aims, in which we move our bodies to acquire information and learn about the world. A large literature on action recognition investigates how observers represent and understand the former class of actions; but what about the latter class? Can one person tell, just by observing another person's movements, what they are trying to learn? Here, five experiments explore epistemic action understanding. We filmed volunteers playing a "physics game" consisting of two rounds: Players shook an opaque box and attempted to determine i) the number of objects hidden inside, or ii) the shape of the objects inside. Then, independent subjects watched these videos and were asked to determine which videos came from which round: Who was shaking for number and who was shaking for shape? Across several variations, observers successfully determined what an actor was trying to learn, based only on their actions (i.e., how they shook the box), even when the box's contents were identical across rounds. These results demonstrate that humans can infer epistemic intent from physical behaviors, adding a new dimension to research on action understanding.
Affiliation(s)
- Sholei Croom: Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218
- Hanbei Zhou: Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218
- Chaz Firestone: Department of Psychological and Brain Sciences, Johns Hopkins University, Baltimore, MD 21218
11. Zhuang T, Kabulska Z, Lingnau A. The Representation of Observed Actions at the Subordinate, Basic, and Superordinate Level. J Neurosci 2023; 43:8219-8230. [PMID: 37798129; PMCID: PMC10697398; DOI: 10.1523/jneurosci.0700-22.2023]
Abstract
Actions can be planned and recognized at different hierarchical levels, ranging from very specific (e.g., to swim backstroke) to very broad (e.g., locomotion). Understanding the corresponding neural representation is an important prerequisite to reveal how our brain flexibly assigns meaning to the world around us. To address this question, we conducted an event-related fMRI study in male and female human participants in which we examined distinct representations of observed actions at the subordinate, basic and superordinate level. Using multiple regression representational similarity analysis (RSA) in predefined regions of interest, we found that the three different taxonomic levels were best captured by patterns of activations in bilateral lateral occipitotemporal cortex (LOTC), showing the highest similarity with the basic level model. A whole-brain multiple regression RSA revealed that information unique to the basic level was captured by patterns of activation in dorsal and ventral portions of the LOTC and in parietal regions. By contrast, the unique information for the subordinate level was limited to bilateral occipitotemporal cortex, while no single cluster was obtained that captured unique information for the superordinate level. The behaviorally established action space was best captured by patterns of activation in the LOTC and superior parietal cortex, and the corresponding neural patterns of activation showed the highest similarity with patterns of activation corresponding to the basic level model. Together, our results suggest that occipitotemporal cortex shows a preference for the basic level model, with flexible access across the subordinate and the basic level. SIGNIFICANCE STATEMENT: The human brain captures information at varying levels of abstraction. It is debated which brain regions host representations across different hierarchical levels, with some studies emphasizing parietal and premotor regions, while other studies highlight the role of the lateral occipitotemporal cortex (LOTC). To shed light on this debate, here we examined the representation of observed actions at the three taxonomic levels suggested by Rosch et al. (1976). Our results highlight the role of the LOTC, which hosts a shared representation across the subordinate and the basic level, with the highest similarity with the basic level model. These results shed new light on the hierarchical organization of observed actions and provide insights into the neural basis underlying the basic level advantage.
Affiliation(s)
- Tonghe Zhuang: Faculty of Human Sciences, Institute of Psychology, Chair of Cognitive Neuroscience, University of Regensburg, 93053 Regensburg, Germany
- Zuzanna Kabulska: Faculty of Human Sciences, Institute of Psychology, Chair of Cognitive Neuroscience, University of Regensburg, 93053 Regensburg, Germany
- Angelika Lingnau: Faculty of Human Sciences, Institute of Psychology, Chair of Cognitive Neuroscience, University of Regensburg, 93053 Regensburg, Germany
12. Vannuscorps G, Caramazza A. Effector-specific motor simulation supplements core action recognition processes in adverse conditions. Soc Cogn Affect Neurosci 2023; 18:nsad046. [PMID: 37688518; PMCID: PMC10576201; DOI: 10.1093/scan/nsad046]
Abstract
Observing other people acting activates imitative motor plans in the observer. Whether, and if so when and how, such 'effector-specific motor simulation' contributes to action recognition remains unclear. We report that individuals born without upper limbs (IDs)-who cannot covertly imitate upper-limb movements-are significantly less accurate at recognizing degraded (but not intact) upper-limb than lower-limb actions (i.e. point-light animations). This finding emphasizes the need to reframe the current controversy regarding the role of effector-specific motor simulation in action recognition: instead of focusing on the dichotomy between motor and non-motor theories, the field would benefit from new hypotheses specifying when and how effector-specific motor simulation may supplement core action recognition processes to accommodate the full variety of action stimuli that humans can recognize.
Affiliation(s)
- Gilles Vannuscorps: Psychological Sciences Research Institute, Université catholique de Louvain, Place Cardinal Mercier 10, 1348 Louvain-la-Neuve, Belgium; Institute of Neuroscience, Université catholique de Louvain, Avenue E. Mounier 53, Brussels 1200, Belgium; Department of Psychology, Harvard University, Kirkland Street 33, Cambridge, MA 02138, USA
- Alfonso Caramazza: Department of Psychology, Harvard University, Kirkland Street 33, Cambridge, MA 02138, USA; CIMEC (Center for Mind-Brain Sciences), University of Trento, Via delle Regole 101, Mattarello TN 38123, Italy
13. Craighero L, Granziol U, Sartori L. Digital Intentions in the Fingers: I Know What You Are Doing with Your Smartphone. Brain Sci 2023; 13:1418. [PMID: 37891787; PMCID: PMC10605869; DOI: 10.3390/brainsci13101418]
Abstract
Every day, we make thousands of finger movements on the touchscreen of our smartphones. The same movements might be directed at various distal goals. We can type "What is the weather in Rome?" in Google to acquire information from a weather site, or we may type it on WhatsApp to decide whether to visit Rome with a friend. In this study, we show that by watching an agent's typing hands, an observer can infer whether the agent is typing on the smartphone to obtain information or to share it with others. The probability of answering correctly varies with age and typing style. According to embodied cognition, we propose that the recognition process relies on detecting subtle differences in the agent's movement, a skill that grows with sensorimotor competence. We expect that this preliminary work will serve as a starting point for further research on sensorimotor representations of digital actions.
Affiliation(s)
- Laila Craighero: Department of Neuroscience and Rehabilitation, University of Ferrara, via Fossato di Mortara 19, 44121 Ferrara, Italy
- Umberto Granziol: Department of General Psychology, University of Padova, 35131 Padova, Italy
- Luisa Sartori: Department of General Psychology, University of Padova, 35131 Padova, Italy
14. Romero A, Carvalho P, Côrte-Real L, Pereira A. Synthesizing Human Activity for Data Generation. J Imaging 2023; 9:204. [PMID: 37888311; PMCID: PMC10607066; DOI: 10.3390/jimaging9100204]
Abstract
Gathering sufficiently representative data, such as data about human actions, shapes, and facial expressions, for training robust models is costly and time-consuming. This has led to the creation of techniques such as transfer learning and data augmentation. However, these are often insufficient. To address this, we propose a semi-automated mechanism that allows the generation and editing of visual scenes with synthetic humans performing various actions, with features such as background modification and manual adjustments of the 3D avatars to allow users to create data with greater variability. We also propose a two-fold evaluation methodology for assessing the results obtained using our method: (i) the usage of an action classifier on the output data resulting from the mechanism and (ii) the generation of masks of the avatars and the actors to compare them through segmentation. The avatars were robust to occlusion, and their actions were recognizable and accurate to their respective input actors. The results also showed that even though the action classifier concentrates on the pose and movement of the synthetic humans, it strongly depends on contextual information to precisely recognize the actions. Generating the avatars for complex activities also proved problematic for action recognition and the clean and precise formation of the masks.
Affiliation(s)
- Ana Romero: Faculdade de Engenharia, Universidade do Porto, 4200-465 Porto, Portugal
- Pedro Carvalho: Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência, 4200-465 Porto, Portugal; School of Engineering, Polytechnic of Porto, 4200-072 Porto, Portugal
- Luís Côrte-Real: Faculdade de Engenharia, Universidade do Porto, 4200-465 Porto, Portugal; Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência, 4200-465 Porto, Portugal
- Américo Pereira: Faculdade de Engenharia, Universidade do Porto, 4200-465 Porto, Portugal; Instituto de Engenharia de Sistemas e Computadores, Tecnologia e Ciência, 4200-465 Porto, Portugal
15. Hussain T, Memon ZA, Qureshi R, Alam T. EMO-MoviNet: Enhancing Action Recognition in Videos with EvoNorm, Mish Activation, and Optimal Frame Selection for Efficient Mobile Deployment. Sensors (Basel) 2023; 23:8106. [PMID: 37836936; PMCID: PMC10574851; DOI: 10.3390/s23198106]
Abstract
The primary goal of this study is to develop a deep neural network for action recognition that enhances accuracy and minimizes computational costs. In this regard, we propose a modified EMO-MoviNet-A2* architecture that integrates Evolving Normalization (EvoNorm), Mish activation, and optimal frame selection to improve the accuracy and efficiency of action recognition tasks in videos. The asterisk notation indicates that this model also incorporates the stream buffer concept. The Mobile Video Network (MoviNet) is a member of the memory-efficient architectures discovered through Neural Architecture Search (NAS), which balances accuracy and efficiency by integrating spatial, temporal, and spatio-temporal operations. Our research implements the MoviNet model on the UCF101 and HMDB51 datasets, pre-trained on the Kinetics dataset. Upon implementation on the UCF101 dataset, a generalization gap was observed, with the model performing better on the training set than on the testing set. To address this issue, we replaced batch normalization with EvoNorm, which unifies normalization and activation functions. Another area that required improvement was key-frame selection. We also developed a novel technique called Optimal Frame Selection (OFS) to identify key-frames within videos more effectively than random or dense frame selection methods. Combining OFS with Mish nonlinearity resulted in a 0.8-1% improvement in accuracy in our UCF101 20-classes experiment. The EMO-MoviNet-A2* model consumes 86% fewer FLOPs and approximately 90% fewer parameters on the UCF101 dataset, with a trade-off of 1-2% accuracy. Additionally, it achieves 5-7% higher accuracy on the HMDB51 dataset while requiring seven times fewer FLOPs and ten times fewer parameters compared to the reference model, Motion-Augmented RGB Stream (MARS).
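Of the components named above, the Mish activation has a fixed, well-known definition, x · tanh(softplus(x)), sketched below; the details of EvoNorm and OFS are specific to the paper and are not reproduced here:

```python
import torch
import torch.nn.functional as F

def mish(x):
    """Mish activation: x * tanh(softplus(x)). Smooth and non-monotonic,
    often used as a drop-in replacement for ReLU."""
    return x * torch.tanh(F.softplus(x))

x = torch.linspace(-4, 4, 9)
print(mish(x))  # small negative dip below zero, near-identity for large x
```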
Affiliation(s)
- Tarique Hussain: Fast School of Computing, National University of Computer and Emerging Sciences, Karachi Campus, Karachi 75030, Pakistan
- Zulfiqar Ali Memon: Fast School of Computing, National University of Computer and Emerging Sciences, Karachi Campus, Karachi 75030, Pakistan
- Rizwan Qureshi: Fast School of Computing, National University of Computer and Emerging Sciences, Karachi Campus, Karachi 75030, Pakistan
- Tanvir Alam: College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar
16. Luo C, Kim SW, Park HY, Lim K, Jung H. Viewpoint-Agnostic Taekwondo Action Recognition Using Synthesized Two-Dimensional Skeletal Datasets. Sensors (Basel) 2023; 23:8049. [PMID: 37836879; PMCID: PMC10575175; DOI: 10.3390/s23198049]
Abstract
Issues of fairness and consistency in Taekwondo poomsae evaluation have often occurred due to the lack of an objective evaluation method. This study proposes a three-dimensional (3D) convolutional neural network-based action recognition model for an objective evaluation of Taekwondo poomsae. The model exhibits robust recognition performance regardless of variations in the viewpoints by reducing the discrepancy between the training and test images. It uses 3D skeletons of poomsae unit actions collected using a full-body motion-capture suit to generate synthesized two-dimensional (2D) skeletons from desired viewpoints. The 2D skeletons obtained from diverse viewpoints form the training dataset, on which the model is trained to ensure consistent recognition performance regardless of the viewpoint. The performance of the model was evaluated against various test datasets, including projected 2D skeletons and RGB images captured from diverse viewpoints. Comparison of the performance of the proposed model with those of previously reported action recognition models demonstrated the superiority of the proposed model, underscoring its effectiveness in recognizing and classifying Taekwondo poomsae actions.
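The viewpoint-synthesis step (rotating a captured 3D skeleton and projecting it to 2D from a chosen camera angle) can be sketched as below; the orthographic projection and the yaw/pitch parameterization are simplifying assumptions (a pinhole camera model would add a perspective divide):

```python
import numpy as np

def project_skeleton(joints3d, yaw_deg=30.0, pitch_deg=10.0):
    """Rotate a 3D skeleton (J, 3) by a chosen camera yaw/pitch and
    orthographically project it to 2D, yielding one synthesized view."""
    y, p = np.deg2rad([yaw_deg, pitch_deg])
    Ry = np.array([[np.cos(y), 0, np.sin(y)], [0, 1, 0], [-np.sin(y), 0, np.cos(y)]])
    Rx = np.array([[1, 0, 0], [0, np.cos(p), -np.sin(p)], [0, np.sin(p), np.cos(p)]])
    rotated = joints3d @ (Rx @ Ry).T
    return rotated[:, :2]                # drop depth -> (J, 2)

# One training sample expanded into eight synthetic camera viewpoints
views = [project_skeleton(np.random.randn(17, 3), yaw) for yaw in range(0, 360, 45)]
print(len(views), views[0].shape)  # 8 (17, 2)
```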
Affiliation(s)
- Chenglong Luo: Division of Mechanical and Aerospace Engineering, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
- Sung-Woo Kim: Physical Activity and Performance Institute, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
- Hun-Young Park: Physical Activity and Performance Institute, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea; Department of Sports Medicine and Science, Graduate School, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
- Kiwon Lim: Physical Activity and Performance Institute, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea; Department of Sports Medicine and Science, Graduate School, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea; Department of Physical Education, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
- Hoeryong Jung: Division of Mechanical and Aerospace Engineering, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea; Department of Sports Medicine and Science, Graduate School, Konkuk University, 120 Neungdong-ro, Gwangjin-gu, Seoul 05029, Republic of Korea
17. Zhang D, Deng H, Zhi Y. Enhanced Adjacency Matrix-Based Lightweight Graph Convolution Network for Action Recognition. Sensors (Basel) 2023; 23:6397. [PMID: 37514691; PMCID: PMC10386035; DOI: 10.3390/s23146397]
Abstract
Graph convolutional networks (GCNs), which extend convolutional neural networks (CNNs) to non-Euclidean structures, have been utilized to promote skeleton-based human action recognition research and have made substantial progress in doing so. However, there are still some challenges in the construction of recognition models based on GCNs. In this paper, we propose an enhanced adjacency matrix-based graph convolutional network with a combinatorial attention mechanism (CA-EAMGCN) for skeleton-based action recognition. Firstly, an enhanced adjacency matrix is constructed to expand the model's perceptive field of global node features. Secondly, a feature selection fusion module (FSFM) is designed to provide an optimal fusion ratio for multiple input features of the model. Finally, a combinatorial attention mechanism is devised. Specifically, our spatial-temporal (ST) attention module and limb attention module (LAM) are integrated into a multi-input branch and a mainstream network of the proposed model, respectively. Extensive experiments on three large-scale datasets, namely the NTU RGB+D 60, NTU RGB+D 120 and UAV-Human datasets, show that the proposed model takes into account both requirements of light weight and recognition accuracy. This demonstrates the effectiveness of our method.
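One common way to build such an enhanced adjacency matrix is to add multi-hop links before normalizing; the sketch below (powers of A up to k, then symmetric normalization) illustrates the general idea and is not necessarily the paper's exact construction:

```python
import numpy as np

def enhanced_adjacency(A, k=2):
    """Enlarge a skeleton graph's receptive field: add self-loops and
    multi-hop links (powers of A up to k), then apply the symmetric
    normalization D^(-1/2) A_hat D^(-1/2) used in standard GCNs."""
    n = A.shape[0]
    A_hat = np.eye(n) + sum(np.linalg.matrix_power(A, i) for i in range(1, k + 1))
    A_hat = (A_hat > 0).astype(float)        # keep connectivity pattern only
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

A = np.zeros((5, 5)); A[[0, 1, 2, 3], [1, 2, 3, 4]] = 1; A += A.T  # chain graph
print(enhanced_adjacency(A).round(2))
```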
Affiliation(s)
- Daqing Zhang: School of Electronics and Information Engineering, Sichuan University, Chengdu 610064, China
- Hongmin Deng: School of Electronics and Information Engineering, Sichuan University, Chengdu 610064, China
- Yong Zhi: School of Electronics and Information Engineering, Sichuan University, Chengdu 610064, China
18. Van Dam EA, Noldus LPJJ, Van Gerven MAJ. Disentangling rodent behaviors to improve automated behavior recognition. Front Neurosci 2023; 17:1198209. [PMID: 37496740; PMCID: PMC10366600; DOI: 10.3389/fnins.2023.1198209]
Abstract
Automated observation and analysis of behavior is important to facilitate progress in many fields of science. Recent developments in deep learning have enabled progress in object detection and tracking, but rodent behavior recognition struggles to exceed 75-80% accuracy for ethologically relevant behaviors. We investigate the main reasons why and distinguish three aspects of behavior dynamics that are difficult to automate. We isolate these aspects in an artificial dataset and reproduce effects with the state-of-the-art behavior recognition models. Having an endless amount of labeled training data with minimal input noise and representative dynamics will enable research to optimize behavior recognition architectures and get closer to human-like recognition performance for behaviors with challenging dynamics.
Affiliation(s)
- Elsbeth A. Van Dam: Department of Artificial Intelligence, Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands; Noldus Information Technology BV, Wageningen, Netherlands
- Lucas P. J. J. Noldus: Noldus Information Technology BV, Wageningen, Netherlands; Department of Biophysics, Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
- Marcel A. J. Van Gerven: Department of Artificial Intelligence, Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
19. Wang K, Deng H. TFC-GCN: Lightweight Temporal Feature Cross-Extraction Graph Convolutional Network for Skeleton-Based Action Recognition. Sensors (Basel) 2023; 23:5593. [PMID: 37420759; DOI: 10.3390/s23125593]
Abstract
For skeleton-based action recognition, graph convolutional networks (GCN) have clear advantages. Existing state-of-the-art (SOTA) methods tend to focus on extracting and identifying features from all bones and joints, but they ignore many potential input features that could still be discovered. Moreover, many GCN-based action recognition models do not pay sufficient attention to the extraction of temporal features, and most models have bloated structures due to excessive parameter counts. To solve these problems, a temporal feature cross-extraction graph convolutional network (TFC-GCN) with a small number of parameters is proposed. Firstly, we propose a feature extraction strategy based on the relative displacements of joints, which captures each joint's displacement between its previous and subsequent frames. Then, TFC-GCN uses a temporal feature cross-extraction block with gated information filtering to excavate high-level representations of human actions. Finally, we propose a stitching spatial-temporal attention (SST-Att) block that assigns different weights to different joints so as to obtain favorable results for classification. The FLOPs and parameter count of TFC-GCN reach 1.90 G and 0.18 M, respectively. Its superiority has been verified on three large-scale public datasets, namely NTU RGB+D 60, NTU RGB+D 120 and UAV-Human.
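The relative-displacement input described above can be computed directly from a joint sequence; the edge-padding at the sequence boundaries and the channel concatenation below are illustrative assumptions:

```python
import numpy as np

def relative_displacements(seq):
    """For a joint sequence of shape (T, J, C), compute each frame's
    displacement relative to its previous and subsequent frames;
    boundary frames are edge-padded."""
    padded = np.pad(seq, ((1, 1), (0, 0), (0, 0)), mode="edge")
    prev_disp = seq - padded[:-2]    # displacement from the previous frame
    next_disp = padded[2:] - seq     # displacement to the subsequent frame
    return np.concatenate([prev_disp, next_disp], axis=-1)  # (T, J, 2C)

feats = relative_displacements(np.random.randn(64, 25, 3))
print(feats.shape)  # (64, 25, 6)
```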
Affiliation(s)
- Kaixuan Wang: College of Electronics and Information Engineering, Sichuan University, No. 24, Section 1, First Ring Road, Wuhou District, Chengdu 610041, China
- Hongmin Deng: College of Electronics and Information Engineering, Sichuan University, No. 24, Section 1, First Ring Road, Wuhou District, Chengdu 610041, China
20. Rodrigues NRP, da Costa NMC, Melo C, Abbasi A, Fonseca JC, Cardoso P, Borges J. Fusion Object Detection and Action Recognition to Predict Violent Action. Sensors (Basel) 2023; 23:5610. [PMID: 37420776; DOI: 10.3390/s23125610]
Abstract
In the context of Shared Autonomous Vehicles, the need to monitor the environment inside the car will be crucial. This article focuses on the application of deep learning algorithms to present a fusion monitoring solution combining three different algorithms: a violent action detection system, which recognizes violent behaviors between passengers, a violent object detection system, and a lost items detection system. Public datasets (COCO and TAO) were used to train state-of-the-art object detection algorithms such as YOLOv5. For violent action detection, the MoLa InCar dataset was used to train state-of-the-art algorithms such as I3D, R(2+1)D, SlowFast, TSN, and TSM. Finally, an embedded automotive solution was used to demonstrate that these methods run in real time.
Affiliation(s)
- Nelson R P Rodrigues: Engineering School, University of Minho, 4800-058 Guimarães, Portugal; Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal; Polytechnic Institute of Cávado and Ave, 4750-810 Barcelos, Portugal
- Nuno M C da Costa: Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal; Polytechnic Institute of Cávado and Ave, 4750-810 Barcelos, Portugal; 2Ai-School of Technology, Polytechnic Institute of Cávado and Ave, 4750-810 Barcelos, Portugal
- César Melo: Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal; Polytechnic Institute of Cávado and Ave, 4750-810 Barcelos, Portugal
- Ali Abbasi: Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal
- Jaime C Fonseca: Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal
- Paulo Cardoso: Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal
- João Borges: Algoritmi Center, University of Minho, 4800-058 Guimarães, Portugal; Polytechnic Institute of Cávado and Ave, 4750-810 Barcelos, Portugal; 2Ai-School of Technology, Polytechnic Institute of Cávado and Ave, 4750-810 Barcelos, Portugal
21. Zhang H, Zhang X, Yu D, Guan L, Wang D, Zhou F, Zhang W. Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition. Sensors (Basel) 2023; 23:5414. [PMID: 37420580; DOI: 10.3390/s23125414]
Abstract
Graph convolutional networks are widely used in skeleton-based action recognition because of their good fitting ability to non-Euclidean data. While conventional multi-scale temporal convolution uses several fixed-size convolution kernels or dilation rates at each layer of the network, we argue that different layers and datasets require different receptive fields. We use multi-scale adaptive convolution kernels and dilation rates to optimize traditional multi-scale temporal convolution with a simple and effective self-attention mechanism, allowing different network layers to adaptively select convolution kernels of different sizes and dilation rates instead of keeping them fixed. Moreover, the effective receptive field of a simple residual connection is limited, and deep residual networks contain considerable redundancy, which leads to a loss of context when aggregating spatio-temporal information. This article introduces a feature fusion mechanism that replaces the residual connection between initial features and temporal module outputs, effectively solving the problems of context aggregation and initial feature fusion. We propose a multi-modality adaptive feature fusion framework (MMAFF) to simultaneously increase the receptive field in both the spatial and temporal dimensions. Concretely, we input the features extracted by the spatial module into the adaptive temporal fusion module to simultaneously extract multi-scale skeleton features in both the spatial and temporal parts. In addition, based on the current multi-stream approach, we use the limb stream to uniformly process correlated data from multiple modalities. Extensive experiments show that our model obtains competitive results with state-of-the-art methods on the NTU-RGB+D 60 and NTU-RGB+D 120 datasets.
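A minimal sketch of multi-scale temporal convolution with adaptively weighted dilation branches, in the spirit of the adaptive temporal module described above; the branch set, kernel size, and softmax fusion are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Temporal convolution over skeleton features (N, C, T, V): parallel
    branches with different dilation rates, fused by learned softmax
    weights so each layer can favour the receptive field it needs."""
    def __init__(self, channels, dilations=(1, 2, 3, 4), kernel=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, (kernel, 1),
                      padding=(d * (kernel - 1) // 2, 0), dilation=(d, 1))
            for d in dilations)
        self.weights = nn.Parameter(torch.zeros(len(dilations)))  # learned mix

    def forward(self, x):
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * b(x) for wi, b in zip(w, self.branches))

out = MultiScaleTemporalConv(64)(torch.randn(2, 64, 32, 25))
print(out.shape)  # torch.Size([2, 64, 32, 25])
```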
Affiliation(s)
- Haiping Zhang: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310005, China; School of Information Engineering, Hangzhou Dianzi University, Hangzhou 310005, China
- Xinhao Zhang: School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310005, China
- Dongjin Yu: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310005, China
- Liming Guan: School of Information Engineering, Hangzhou Dianzi University, Hangzhou 310005, China
- Dongjing Wang: School of Computer Science, Hangzhou Dianzi University, Hangzhou 310005, China
- Fuxing Zhou: School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310005, China
- Wanjun Zhang: School of Information Engineering, Hangzhou Dianzi University, Hangzhou 310005, China
22. Sarker NH, Hakim ZA, Dabouei A, Uddin MR, Freyberg Z, MacWilliams A, Kangas J, Xu M. Detecting anomalies from liquid transfer videos in automated laboratory setting. Front Mol Biosci 2023; 10:1147514. [PMID: 37214339; PMCID: PMC10192699; DOI: 10.3389/fmolb.2023.1147514]
Abstract
In this work, we address the problem of detecting anomalies in a certain laboratory automation setting. At first, we collect video images of liquid transfer in automated laboratory experiments. We mimic the real-world challenges of developing an anomaly detection model by considering two points. First, the size of the collected dataset is set to be relatively small compared to large-scale video datasets. Second, the dataset has a class imbalance problem where the majority of the collected videos are from abnormal events. Consequently, the existing learning-based video anomaly detection methods do not perform well. To this end, we develop a practical human-engineered feature extraction method to detect anomalies from the liquid transfer video images. Our simple yet effective method outperforms state-of-the-art anomaly detection methods with a notable margin. In particular, the proposed method provides 19% and 76% average improvement in AUC and Equal Error Rate, respectively. Our method also quantifies the anomalies and provides significant benefits for deployment in the real-world experimental setting.
Collapse
Affiliation(s)
- Najibul Haque Sarker
- Computer Science and Engineering Department, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - Zaber Abdul Hakim
- Computer Science and Engineering Department, Bangladesh University of Engineering and Technology, Dhaka, Bangladesh
| | - Ali Dabouei
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, United States
| | - Mostofa Rafid Uddin
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, United States
| | - Zachary Freyberg
- Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA, United States
| | - Andy MacWilliams
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, United States
| | - Joshua Kangas
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, United States
| | - Min Xu
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, United States
| |
Collapse
|
23
|
Host K, Pobar M, Ivasic-Kos M. Analysis of Movement and Activities of Handball Players Using Deep Neural Networks. J Imaging 2023; 9:jimaging9040080. [PMID: 37103231 PMCID: PMC10144022 DOI: 10.3390/jimaging9040080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Revised: 04/01/2023] [Accepted: 04/11/2023] [Indexed: 04/28/2023] Open
Abstract
This paper focuses on image and video content analysis of handball scenes, applying deep learning methods to detect and track the players and recognize their activities. Handball is an indoor team sport played by two teams with a ball, with well-defined goals and rules. The game is dynamic, with fourteen players moving quickly throughout the field in different directions, changing positions and roles from defensive to offensive, and performing different techniques and actions. Such dynamic team sports present challenging and demanding scenarios for object detectors, tracking algorithms, and other computer vision tasks, such as action recognition and localization, with much room for improvement of existing algorithms. The aim of the paper is to explore computer vision-based solutions for recognizing player actions that can be applied in unconstrained handball scenes, with no additional sensors and with modest requirements, allowing broader adoption of computer vision applications in both professional and amateur settings. This paper presents the semi-manual creation of a custom handball action dataset based on automatic player detection and tracking, together with models for handball action recognition and localization using Inflated 3D Networks (I3D). For the task of player and ball detection, different configurations of You Only Look Once (YOLO) and Mask Region-Based Convolutional Neural Network (Mask R-CNN) models fine-tuned on custom handball datasets are compared with the original YOLOv7 model to select the best detector for use in tracking-by-detection algorithms. For player tracking, the DeepSORT and Bag of Tricks for SORT (BoT-SORT) algorithms with Mask R-CNN and YOLO detectors were tested and compared. For action recognition, an I3D multi-class model and an ensemble of binary I3D models were trained with different input frame lengths and frame selection strategies, and the best solution is proposed for handball action recognition. The obtained action recognition models perform well on a test set with nine handball action classes, with average F1 measures of 0.69 and 0.75 for the ensemble and multi-class classifiers, respectively, and can be used to automatically index handball videos to facilitate retrieval. Finally, open issues and challenges in applying deep learning methods in such a dynamic sports environment, along with directions for future development, are discussed.
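One distinctive design choice above is comparing a single multi-class I3D head against an ensemble of per-class binary I3D models. A minimal sketch of the two decision rules follows; the aggregation shown for the ensemble is an assumption, not the paper's exact rule.

```python
# Illustrative comparison of the two recognition heads evaluated above: a
# single multi-class I3D model vs. an ensemble of per-class binary I3D models.
import numpy as np

def multiclass_decision(logits):
    """logits: (num_classes,) output of the multi-class I3D head."""
    return int(np.argmax(logits))

def binary_ensemble_decision(binary_probs, threshold=0.5):
    """binary_probs: (num_classes,), P(action k | clip) from the k-th binary model."""
    best = int(np.argmax(binary_probs))
    # If no per-class detector is confident, report "no recognized action".
    return best if binary_probs[best] >= threshold else -1
```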
Collapse
Affiliation(s)
- Kristina Host
- Faculty of Informatics and Digital Technologies, University of Rijeka, 51000 Rijeka, Croatia
- Centre for Artificial Intelligence and Cybersecurity, University of Rijeka, 51000 Rijeka, Croatia
| | - Miran Pobar
- Faculty of Informatics and Digital Technologies, University of Rijeka, 51000 Rijeka, Croatia
- Centre for Artificial Intelligence and Cybersecurity, University of Rijeka, 51000 Rijeka, Croatia
| | - Marina Ivasic-Kos
- Faculty of Informatics and Digital Technologies, University of Rijeka, 51000 Rijeka, Croatia
- Centre for Artificial Intelligence and Cybersecurity, University of Rijeka, 51000 Rijeka, Croatia
| |
Collapse
|
24
|
Chen B, Meng F, Tang H, Tong G. Two-Level Attention Module Based on Spurious-3D Residual Networks for Human Action Recognition. Sensors (Basel) 2023; 23:1707. [PMID: 36772770 PMCID: PMC9919151 DOI: 10.3390/s23031707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Revised: 01/19/2023] [Accepted: 02/02/2023] [Indexed: 06/18/2023]
Abstract
In recent years, deep learning techniques have excelled in video action recognition. However, commonly used video action recognition models underweight the importance of different video frames and of spatial regions within specific frames, which makes it difficult for the models to adequately extract spatiotemporal features from video data. In this paper, an action recognition method based on improved residual convolutional neural networks (CNNs), with video frame and spatial attention modules, is proposed to address this problem. The network can guide what and where to emphasize or suppress, at essentially negligible computational cost, using the video frame attention module and the spatial attention module. It employs a two-level attention module to emphasize feature information along the temporal and spatial dimensions, respectively, highlighting the more important frames in the overall video sequence and the more important spatial regions in specific frames. Specifically, we create the video frame and spatial attention maps by successively applying the video frame attention module and the spatial attention module, aggregating the temporal and spatial dimensions of the intermediate feature maps of the CNNs to obtain different feature descriptors, thus directing the network to focus on the important video frames and the spatial regions that contribute most. The experimental results further show that the network performs well on the UCF-101 and HMDB-51 datasets.
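A compact rendering of the two-level idea, in the spirit of the description above: a frame-level attention vector re-weights whole frames, then a spatial map re-weights locations inside each frame, both computed from pooled statistics at negligible cost. Layer shapes and pooling choices below are assumptions, not the paper's exact design.

```python
# Hedged sketch of a two-level (frame + spatial) attention module.
import torch
import torch.nn as nn

class TwoLevelAttention(nn.Module):
    def __init__(self, channels, num_frames):
        super().__init__()
        self.frame_fc = nn.Sequential(
            nn.Linear(num_frames, num_frames), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):                        # x: (N, C, T, H, W)
        # Frame attention: one weight per frame from globally pooled features.
        t_desc = x.mean(dim=(1, 3, 4))           # (N, T)
        x = x * self.frame_fc(t_desc)[:, None, :, None, None]
        # Spatial attention: per-location weight from channel avg/max maps.
        n, c, t, h, w = x.shape
        y = x.transpose(1, 2).reshape(n * t, c, h, w)
        s = torch.cat([y.mean(1, keepdim=True), y.max(1, keepdim=True).values], 1)
        y = y * self.spatial_conv(s)
        return y.reshape(n, t, c, h, w).transpose(1, 2)
```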
Collapse
Affiliation(s)
- Bo Chen
- Science and Technology on Microsystem Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Fangzhou Meng
- Science and Technology on Microsystem Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Hongying Tang
- Science and Technology on Microsystem Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
| | - Guanjun Tong
- Science and Technology on Microsystem Laboratory, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai 201800, China
| |
Collapse
|
25
|
Sun R, Zhang T, Wan Y, Zhang F, Wei J. WLiT: Windows and Linear Transformer for Video Action Recognition. Sensors (Basel) 2023; 23:1616. [PMID: 36772658 PMCID: PMC9919352 DOI: 10.3390/s23031616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Revised: 01/28/2023] [Accepted: 01/29/2023] [Indexed: 06/18/2023]
Abstract
The emergence of the Transformer has led to the rapid development of video understanding, but it also brings the problem of high computational complexity. Previous methods divide the feature maps into windows along the spatiotemporal dimensions and then calculate the attention, or perform down-sampling during attention computation to reduce the spatiotemporal resolution of the features. Although the complexity is effectively reduced, there is still room for further optimization. Thus, we present the Windows and Linear Transformer (WLiT) for efficient video action recognition, combining Spatial-Windows attention with Linear attention. We first divide the feature maps into multiple windows along the spatial dimensions and calculate the attention separately inside each window. Our model therefore further reduces the computational complexity compared with previous methods. However, the receptive field of Spatial-Windows attention is small, and global spatiotemporal information cannot be obtained. To address this problem, we then calculate Linear attention along the channel dimension so that the model can capture complete spatiotemporal information. Through this mechanism, our method achieves better recognition accuracy with less computational complexity. We conduct extensive experiments on four public datasets, namely Something-Something V2 (SSV2), Kinetics400 (K400), UCF101, and HMDB51. On the SSV2 dataset, our method reduces the computational complexity by 28% and improves the recognition accuracy by 1.6% compared to the state-of-the-art (SOTA) method. On K400 and the two other datasets, our method achieves SOTA-level accuracy while reducing the complexity by about 49%.
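The linear-attention half of the design admits a short generic sketch: with a positive feature map, attention can be computed through a channel-by-channel context matrix in O(L·C²) rather than the O(L²·C) of token-token attention. The kernel (elu + 1) and normalization below are common choices assumed for illustration; the paper's exact formulation may differ.

```python
# Generic linear attention along the channel dimension.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    """q, k, v: (N, L, C) token features; cost is O(L * C^2) instead of O(L^2 * C)."""
    q = F.elu(q) + 1            # positive feature map keeps attention non-negative
    k = F.elu(k) + 1
    kv = torch.einsum('nlc,nld->ncd', k, v)            # (N, C, C) channel context
    z = 1.0 / (torch.einsum('nlc,nc->nl', q, k.sum(dim=1)) + 1e-6)
    return torch.einsum('nlc,ncd,nl->nld', q, kv, z)    # (N, L, C)
```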
Collapse
Affiliation(s)
- Ruoxi Sun
- Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
- School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Tianzhao Zhang
- Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
- School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Yong Wan
- State Key Laboratory of Geomechanics and Geotechnical Engineering, Institute of Rock and Soil Mechanics, Chinese Academy of Sciences, Wuhan 430071, China
| | - Fuping Zhang
- Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
| | - Jianming Wei
- Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China
| |
Collapse
|
26
|
Hollaus B, Reiter B, Volmer JC. Catch Recognition in Automated American Football Training Using Machine Learning. Sensors (Basel) 2023; 23:840. [PMID: 36679637 PMCID: PMC9864489 DOI: 10.3390/s23020840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/22/2022] [Revised: 01/04/2023] [Accepted: 01/05/2023] [Indexed: 06/17/2023]
Abstract
In order to train receivers in American football in a targeted and individual manner, the strengths and weaknesses of the athletes must be evaluated precisely. As human resources are limited, it is beneficial to do this in an automated way. Automated passing machines already exist, so the motivation is to design a computer-based system that records and automatically evaluates an athlete's catch attempts. The most fundamental evaluation is whether the athlete caught the pass successfully or not. An experiment was carried out to gather data on catch attempts that potentially contain information about their outcome. The experiment used a fully automated passing machine that can release passes on command. After a pass was released, audio and video sequences of the specific catch attempt were recorded. For this purpose, an audio-visual recording system was developed and integrated into the passing machine. This system was used to create an audio and video dataset of 2276 recorded catch attempts. A Convolutional Neural Network (CNN) is used for feature extraction, with a downstream Long Short-Term Memory (LSTM) network to classify the video data. Classification of the audio data is performed using a one-dimensional CNN. With the chosen neural network architecture, an accuracy of 92.19% was achieved in detecting whether a pass had been caught or not. This result confirms the feasibility of automatically classifying catch attempts during automated catch training.
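The video branch described above (CNN features per frame, LSTM aggregation over time, binary caught/not-caught output) can be sketched as follows; the ResNet-18 backbone and hidden size are assumptions, not the authors' reported architecture.

```python
# Hedged sketch of a CNN + LSTM catch classifier for short video clips.
import torch
import torch.nn as nn
from torchvision import models

class CatchClassifier(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # 512-d per-frame embeddings
        self.cnn = backbone
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # caught vs. dropped

    def forward(self, clips):                     # clips: (N, T, 3, H, W)
        n, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(n, t, -1)
        _, (h, _) = self.lstm(feats)
        return self.head(h[-1])                   # logits per clip
```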
Collapse
Affiliation(s)
- Bernhard Hollaus
- Department of Medical, Health & Sports Engineering, MCI, 6020 Innsbruck, Austria
| | | | | |
Collapse
|
27
|
Moutik O, Sekkat H, Tigani S, Chehri A, Saadane R, Tchakoucht TA, Paul A. Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data? Sensors (Basel) 2023; 23:734. [PMID: 36679530 PMCID: PMC9862752 DOI: 10.3390/s23020734] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/10/2022] [Revised: 01/01/2023] [Accepted: 01/04/2023] [Indexed: 06/17/2023]
Abstract
Understanding actions in videos remains a significant challenge in computer vision and has been the subject of considerable research over the last decades. Convolutional neural networks (CNNs) are a significant component of this topic and played a crucial role in the rise of deep learning. Inspired by the human visual system, CNNs have been applied to visual data and have solved various challenges in computer vision tasks and video/image analysis, including action recognition (AR). However, following the success of the Transformer in natural language processing (NLP), Transformers have begun to set new trends in vision tasks, creating a discussion around whether Vision Transformer models (ViT) will replace CNNs for action recognition in video clips. This paper treats this trending topic in detail, studying CNNs and Transformers for action recognition separately and providing a comparative study of the accuracy-complexity trade-off. Finally, based on the outcome of the performance analysis, the question of whether CNNs or Vision Transformers will win the race is discussed.
Collapse
Affiliation(s)
- Oumaima Moutik
- Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco
| | - Hiba Sekkat
- Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco
| | - Smail Tigani
- Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco
| | - Abdellah Chehri
- Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON K7K 7B4, Canada
| | - Rachid Saadane
- SIRC-LaGeS, Hassania School of Public Works, Casablanca 8108, Morocco
| | - Taha Ait Tchakoucht
- Engineering Unit, Euromed Research Center, Euro-Mediterranean University, Fes 30030, Morocco
| | - Anand Paul
- School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Republic of Korea
| |
Collapse
|
28
|
Hu Z, Mao J, Yao J, Bi S. 3D network with channel excitation and knowledge distillation for action recognition. Front Neurorobot 2023; 17:1050167. [PMID: 37033413 PMCID: PMC10076829 DOI: 10.3389/fnbot.2023.1050167] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2022] [Accepted: 02/28/2023] [Indexed: 04/11/2023] Open
Abstract
Modern action recognition techniques frequently employ two networks: a spatial stream, which takes RGB frames as input, and a temporal stream, which takes optical flow as input. Recent research uses 3D convolutional neural networks that employ spatiotemporal filters on both streams. Although mixing flow with RGB enhances performance, accurate optical flow computation is expensive and adds delay to action recognition. In this study, we present a method for training a 3D CNN using RGB frames that replicates the motion stream and, as a result, does not require flow calculation during testing. First, in contrast to the SE block, we propose a channel excitation module (CE module). Experiments show that the CE module can improve the feature extraction capability of a 3D network and that its effect is superior to that of the SE block. Second, for action recognition training, we adopt a linear combination of a knowledge distillation loss and the standard cross-entropy loss to effectively leverage appearance and motion information. The stream trained with this combined loss is called the Intensified Motion RGB Stream (IMRS). IMRS surpasses RGB or flow as a single stream; for example, on HMDB51 it achieves 73.5% accuracy, while the RGB and flow streams achieve 65.6% and 69.1% accuracy, respectively. Extensive experiments confirm the effectiveness of our proposed method, and the comparison with other models shows that it is competitive in action recognition.
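The combined objective is the standard distillation recipe: cross-entropy on the labels plus a temperature-softened KL term toward the (flow) teacher's predictions. A sketch, with alpha and the temperature tau as assumed hyperparameters:

```python
# Sketch of the combined loss: cross-entropy on labels plus a distillation
# term pushing the RGB student toward the flow teacher's soft predictions.
import torch.nn.functional as F

def combined_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=4.0):
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=1),
                  F.softmax(teacher_logits / tau, dim=1),
                  reduction='batchmean') * tau * tau   # rescale soft-target gradient
    return (1 - alpha) * ce + alpha * kd
```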
Collapse
Affiliation(s)
- Zhengping Hu
- School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
- Hebei Key Laboratory of Information Transmission and Signal Processing, Qinhuangdao, China
- *Correspondence: Zhengping Hu
| | - Jianzeng Mao
- School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
| | - Jianxin Yao
- School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
| | - Shuai Bi
- School of Information Science and Engineering, Yanshan University, Qinhuangdao, China
| |
Collapse
|
29
|
Suzuki T, Aoki Y. Efficient Transformer-Based Compressed Video Modeling via Informative Patch Selection. Sensors (Basel) 2022; 23:244. [PMID: 36616842 PMCID: PMC9823838 DOI: 10.3390/s23010244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/01/2022] [Revised: 12/16/2022] [Accepted: 12/19/2022] [Indexed: 06/17/2023]
Abstract
Recently, Transformer-based video recognition models have achieved state-of-the-art results on major video recognition benchmarks. However, their high inference cost significantly limits research speed and practical use. In video compression, methods that treat small motions and residuals as less informative and assign them short code lengths (e.g., MPEG4) have successfully reduced the redundancy of videos. Inspired by this idea, we propose Informative Patch Selection (IPS), which efficiently reduces the inference cost by excluding redundant patches from the input of the Transformer-based video model. The redundancy of each patch is calculated from the motions and residuals obtained while decoding a compressed video. The proposed method is simple and effective in that it can dynamically reduce the inference cost depending on the input, without any policy model or additional loss term. Extensive experiments on action recognition demonstrate that our method can significantly improve the trade-off between accuracy and inference cost for Transformer-based video models. Although the method requires no policy model or additional loss term, its performance approaches that of existing methods that do require them.
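A minimal sketch of the selection step follows: each patch is scored from the motion-vector and residual signals recovered during decoding, and only the most informative patches are kept as Transformer input. The scoring rule below is an assumption for illustration; the paper derives redundancy from the codec's own signals.

```python
# Minimal sketch of Informative Patch Selection: keep the top-k patches by
# motion and residual energy; low-score patches are treated as redundant.
import numpy as np

def select_patches(motion_mag, residual_energy, k):
    """motion_mag, residual_energy: (P,) per-patch statistics; returns kept indices."""
    score = motion_mag + residual_energy        # low score => redundant patch
    return np.argsort(score)[-k:]               # indices of the k most informative patches
```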
Collapse
|
30
|
Kim SB, Jung C, Kim BI, Ko BC. Lightweight Semantic-Guided Neural Networks Based on Single Head Attention for Action Recognition. Sensors (Basel) 2022; 22:9249. [PMID: 36501952 PMCID: PMC9737289 DOI: 10.3390/s22239249] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/24/2022] [Revised: 11/22/2022] [Accepted: 11/24/2022] [Indexed: 06/17/2023]
Abstract
Skeleton-based action recognition can achieve relatively high performance by transforming the human skeleton structure in an image into a graph and recognizing actions from structural changes in the body. Among the many graph convolutional network (GCN) approaches used in skeleton-based action recognition, semantic-guided neural networks (SGNs) are fast action recognition algorithms that hierarchically learn spatial and temporal features by applying a GCN. However, because an SGN focuses on global rather than local feature learning owing to its structural characteristics, it is limited for action recognition in which the dependencies between neighbouring nodes are important. To solve these problems while simultaneously achieving real-time action recognition on low-end devices, this study proposes a single head attention (SHA) mechanism that can overcome the limitations of an SGN, and presents a new SGN-SHA model that combines SHA with an SGN. In experiments on various action recognition benchmark datasets, the proposed SGN-SHA model significantly reduced computational complexity while exhibiting performance similar to that of the existing SGN and other state-of-the-art methods.
Collapse
|
31
|
Joefrie YY, Aono M. Video Action Recognition Using Motion and Multi-View Excitation with Temporal Aggregation. Entropy (Basel) 2022; 24:1663. [PMID: 36421524 PMCID: PMC9689149 DOI: 10.3390/e24111663] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/25/2022] [Revised: 11/11/2022] [Accepted: 11/11/2022] [Indexed: 06/16/2023]
Abstract
Spatiotemporal and motion feature representations are the key to video action recognition. Typical previous approaches utilize 3D CNNs to cope with both spatial and temporal features, but they suffer from huge computational cost. Other approaches utilize (1+2)D CNNs to learn spatial and temporal features efficiently, but they neglect the importance of motion representations. To overcome the problems of previous approaches, we propose a novel block that captures spatial and temporal features more faithfully and learns motion features more efficiently. This proposed block includes Motion Excitation (ME), Multi-view Excitation (MvE), and Densely Connected Temporal Aggregation (DCTA). The purpose of ME is to encode feature-level frame differences; MvE is designed to adaptively enrich spatiotemporal features with multiple view representations; and DCTA models long-range temporal dependencies. We inject the proposed building block, which we refer to as the META block (or simply "META"), into 2D ResNet-50. Through extensive experiments, we demonstrate that our proposed architecture outperforms previous CNN-based methods in terms of "Val Top-1 %" on the Something-Something v1 and Jester datasets, while META yields competitive results on the Moments-in-Time Mini dataset.
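Of the three components, Motion Excitation has the most standard form: feature-level differences between adjacent frames are squeezed, pooled, and turned into a channel gate that amplifies motion-salient channels. The sketch below follows that general recipe; the tensor layout (N x T x C x H x W), reduction ratio, and exact operations are assumptions, not the paper's precise block.

```python
# Hedged sketch of a Motion Excitation block: frame-to-frame feature
# differences gate the channels so motion-salient channels are amplified.
import torch
import torch.nn as nn

class MotionExcitation(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        r = channels // reduction
        self.squeeze = nn.Conv2d(channels, r, 1)
        self.expand = nn.Conv2d(r, channels, 1)

    def forward(self, x):                        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        f = self.squeeze(x.flatten(0, 1)).view(n, t, -1, h, w)
        diff = f[:, 1:] - f[:, :-1]              # feature-level frame differences
        diff = torch.cat([diff, torch.zeros_like(diff[:, :1])], dim=1)
        gate = torch.sigmoid(self.expand(diff.flatten(0, 1).mean((2, 3), keepdim=True)))
        return x + x * gate.view(n, t, c, 1, 1)  # residual motion-gated features
```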
Collapse
Affiliation(s)
- Yuri Yudhaswana Joefrie
- Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Tenpaku-cho, Toyohashi 441-8580, Japan
- Department of Information Technology, Universitas Tadulako (UNTAD), Palu 94118, Indonesia
| | - Masaki Aono
- Department of Computer Science and Engineering, Toyohashi University of Technology, 1-1 Tenpaku-cho, Toyohashi 441-8580, Japan
| |
Collapse
|
32
|
Ciampi L, Foszner P, Messina N, Staniszewski M, Gennaro C, Falchi F, Serao G, Cogiel M, Golba D, Szczęsna A, Amato G. Bus Violence: An Open Benchmark for Video Violence Detection on Public Transport. Sensors (Basel) 2022; 22:8345. [PMID: 36366043 PMCID: PMC9658862 DOI: 10.3390/s22218345] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/20/2022] [Revised: 10/25/2022] [Accepted: 10/28/2022] [Indexed: 06/16/2023]
Abstract
The automatic detection of violent actions in public places through video analysis is difficult because the Artificial Intelligence-based techniques employed often suffer from generalization problems. Indeed, these algorithms hinge on large quantities of annotated data and usually experience a drastic drop in performance in scenarios never seen during the supervised learning phase. In this paper, we introduce and publicly release the Bus Violence benchmark, the first large-scale collection of video clips for violence detection on public transport, in which actors simulated violent actions inside a moving bus under changing conditions, such as background and lighting. Moreover, we conduct a performance analysis of several state-of-the-art video violence detectors, pre-trained on general violence detection databases, on this newly established use case. The moderate performance achieved reveals how difficult it is for these popular methods to generalize, indicating the need for this new collection of labeled data, which is beneficial for specializing them in this new scenario.
Collapse
Affiliation(s)
- Luca Ciampi
- Institute of Information Science and Technologies, National Research Council, Via G. Moruzzi 1, 56124 Pisa, Italy
| | - Paweł Foszner
- Department of Computer Graphics, Vision and Digital Systems, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 2A, 44-100 Gliwice, Poland
| | - Nicola Messina
- Institute of Information Science and Technologies, National Research Council, Via G. Moruzzi 1, 56124 Pisa, Italy
| | - Michał Staniszewski
- Department of Computer Graphics, Vision and Digital Systems, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 2A, 44-100 Gliwice, Poland
| | - Claudio Gennaro
- Institute of Information Science and Technologies, National Research Council, Via G. Moruzzi 1, 56124 Pisa, Italy
| | - Fabrizio Falchi
- Institute of Information Science and Technologies, National Research Council, Via G. Moruzzi 1, 56124 Pisa, Italy
| | - Gianluca Serao
- Department of Information Engineering, University of Pisa, Via Girolamo Caruso, 16, 56122 Pisa, Italy
| | - Michał Cogiel
- Blees sp. z o.o., Zygmunta Starego 24a/10, 44-100 Gliwice, Poland
| | - Dominik Golba
- Blees sp. z o.o., Zygmunta Starego 24a/10, 44-100 Gliwice, Poland
| | - Agnieszka Szczęsna
- Department of Computer Graphics, Vision and Digital Systems, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Akademicka 2A, 44-100 Gliwice, Poland
| | - Giuseppe Amato
- Institute of Information Science and Technologies, National Research Council, Via G. Moruzzi 1, 56124 Pisa, Italy
| |
Collapse
|
33
|
Htet Y, Zin TT, Tin P, Tamura H, Kondo K, Chosa E. HMM-Based Action Recognition System for Elderly Healthcare by Colorizing Depth Map. Int J Environ Res Public Health 2022; 19:ijerph191912055. [PMID: 36231351 PMCID: PMC9566476 DOI: 10.3390/ijerph191912055] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2022] [Revised: 09/16/2022] [Accepted: 09/19/2022] [Indexed: 05/13/2023]
Abstract
Addressing the problems facing the elderly, whether living independently or in managed care facilities, is one of the most important applications of action recognition research. However, existing systems are not ready for automation or for effective use in continuous operation. We have therefore developed theoretical and practical foundations for a new real-time action recognition system. This system is based on a Hidden Markov Model (HMM) together with colorized depth maps. The use of depth cameras provides privacy protection, and colorizing the depth images in the hue color space enables compressing and visualizing the depth data and detecting persons. The specific detector used for person detection is You Only Look Once (YOLOv5). Appearance and motion features are extracted from the depth map sequences and represented with a Histogram of Oriented Gradients (HOG). These HOG feature vectors are used as the observation sequences fed into the HMM. Finally, the Viterbi algorithm is applied to recognize the sequential actions. The system has been tested on real-world data featuring three participants in a care center. We tried three combinations of the HMM with classification algorithms and found that fusion with a Support Vector Machine (SVM) gave the best average results, achieving an accuracy rate of 84.04%.
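The final decoding step is the textbook Viterbi algorithm: given the HMM's initial, transition, and emission probabilities and a sequence of quantized observations, it recovers the most likely sequence of action states. A minimal log-space implementation:

```python
# Classic Viterbi decoding in log space for numerical stability.
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi: (S,) initial; log_A: (S, S) transitions; log_B: (S, O) emissions; obs: (T,) symbols."""
    T, S = len(obs), len(log_pi)
    delta = np.zeros((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A       # (S, S): previous -> current
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                    # backtrack the best path
        path.append(int(psi[t, path[-1]]))
    return path[::-1]                                # most likely state sequence
```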
Collapse
Affiliation(s)
- Ye Htet
- Interdisciplinary Graduate School of Agriculture and Engineering, University of Miyazaki, Miyazaki 889-2192, Japan
| | - Thi Thi Zin
- Graduate School of Engineering, University of Miyazaki, Miyazaki 889-2192, Japan
- Correspondence:
| | - Pyke Tin
- International Relation Center, University of Miyazaki, Miyazaki 889-2192, Japan
| | - Hiroki Tamura
- Graduate School of Engineering, University of Miyazaki, Miyazaki 889-2192, Japan
| | - Kazuhiro Kondo
- Faculty of Medicine, University of Miyazaki, Miyazaki 889-1692, Japan
| | - Etsuo Chosa
- Faculty of Medicine, University of Miyazaki, Miyazaki 889-1692, Japan
| |
Collapse
|
34
|
Nan M, Florea AM. Fast Temporal Graph Convolutional Model for Skeleton-Based Action Recognition. Sensors (Basel) 2022; 22:7117. [PMID: 36236213 PMCID: PMC9570854 DOI: 10.3390/s22197117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 09/15/2022] [Accepted: 09/16/2022] [Indexed: 06/16/2023]
Abstract
Human action recognition has a wide range of applications, including Ambient Intelligence systems and user assistance. Starting from the recognized actions performed by the user, better human-computer interaction can be achieved, and improved assistance can be provided by social robots in real-time scenarios. In this context, the performance of the prediction system is a key aspect. The purpose of this paper is to introduce a neural network approach, based on various types of convolutional layers, that achieves good action recognition performance at a high inference speed. The experimental results show that our solution, based on a combination of graph convolutional networks (GCNs) and temporal convolutional networks (TCNs), is a suitable approach that reaches the proposed goal. In addition to the neural network model, we design a pipeline with two stages for obtaining relevant geometric features, data augmentation, and data preprocessing, which also contribute to increased performance.
Collapse
|
35
|
Zhang Y. MEST: An Action Recognition Network with Motion Encoder and Spatio-Temporal Module. Sensors (Basel) 2022; 22:6595. [PMID: 36081054 PMCID: PMC9460449 DOI: 10.3390/s22176595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/22/2022] [Revised: 08/12/2022] [Accepted: 08/23/2022] [Indexed: 06/15/2023]
Abstract
As a sub-field of video content analysis, action recognition, which aims to recognize human actions in videos, has received extensive attention in recent years. Compared with a single image, video has a temporal dimension, so extracting spatio-temporal information from videos is of great significance for action recognition. In this paper, an efficient network that extracts spatio-temporal information at relatively low computational cost (dubbed MEST) is proposed. First, a motion encoder is developed to capture short-term motion cues between consecutive frames, followed by a channel-wise spatio-temporal module to model long-term feature information. Moreover, weight standardization is applied to the convolution layers that are followed by batch normalization layers, to expedite the training process and facilitate convergence. Experiments are conducted on five public action recognition datasets, Something-Something-V1 and -V2, Jester, UCF101 and HMDB51, where MEST exhibits competitive performance compared with other popular methods. The results demonstrate the effectiveness of our network in terms of accuracy, computational cost, and network scale.
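Weight standardization itself is a small, well-defined trick: each convolution filter is normalized to zero mean and unit variance before use, which smooths optimization when the layer is followed by batch normalization. A sketch as a drop-in Conv2d subclass:

```python
# Weight standardization applied to a convolution layer.
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)   # per-filter statistics
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```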
Collapse
Affiliation(s)
- Yi Zhang
- Department of Computer Science, Sichuan University, Chengdu 610017, China
| |
Collapse
|
36
|
Hwang PJ, Hsu CC, Chou PY, Wang WY, Lin CH. Vision-Based Learning from Demonstration System for Robot Arms. Sensors (Basel) 2022; 22:2678. [PMID: 35408292 PMCID: PMC9002941 DOI: 10.3390/s22072678] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/02/2022] [Revised: 03/25/2022] [Accepted: 03/29/2022] [Indexed: 06/14/2023]
Abstract
Robotic arms have been widely used in various industries and bring cost savings, high productivity, and efficiency. Although robotic arms are good at increasing efficiency in repetitive tasks, they still need to be re-programmed and optimized when new tasks are deployed, resulting in detrimental downtime and high cost. The objective of this paper is therefore to present a learning from demonstration (LfD) robotic system that provides a more intuitive way for robots to efficiently perform tasks by learning from human demonstration, on the basis of two major components: understanding the human demonstration and reproduction by the robot arm. To understand the human demonstration, we propose a vision-based spatial-temporal action detection method that detects human actions in real time, focusing on meticulous hand movements, to establish an action base. An object trajectory inductive method is then proposed to obtain a key path for objects manipulated by the human through multiple demonstrations. In robot reproduction, we integrate the sequence of actions in the action base with the key path derived by the object trajectory inductive method for motion planning, to reproduce the task demonstrated by the human user. Because of this capability of learning from demonstration, the robot can reproduce the demonstrated tasks in unseen contexts with the help of vision sensors.
Collapse
|
37
|
Vandevoorde K, Vollenkemper L, Schwan C, Kohlhase M, Schenck W. Using Artificial Intelligence for Assistance Systems to Bring Motor Learning Principles into Real World Motor Tasks. Sensors (Basel) 2022; 22:s22072481. [PMID: 35408094 PMCID: PMC9002555 DOI: 10.3390/s22072481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/18/2022] [Revised: 03/18/2022] [Accepted: 03/20/2022] [Indexed: 11/03/2022]
Abstract
Humans learn movements naturally, but it takes a lot of time and training to achieve expert performance in motor skills. In this review, we show how modern technologies can support people in learning new motor skills. First, we introduce important concepts in motor control, motor learning, and motor skill learning. We also give an overview of the rapid expansion of machine learning algorithms and sensor technologies for human motion analysis. The integration of motor learning principles, machine learning algorithms, and recent sensor technologies has the potential to yield AI-guided assistance systems for motor skill training. We give our perspective on this integration of different fields, with the aim of transitioning from motor learning research in laboratory settings to real-world environments and real-world motor tasks, and propose a stepwise approach to facilitate this transition.
Collapse
|
38
|
Lee GC, Loo CK. On the Post Hoc Explainability of Optimized Self-Organizing Reservoir Network for Action Recognition. Sensors (Basel) 2022; 22:1905. [PMID: 35271052 PMCID: PMC8914683 DOI: 10.3390/s22051905] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/19/2022] [Revised: 02/18/2022] [Accepted: 02/19/2022] [Indexed: 06/14/2023]
Abstract
This work proposes a novel unsupervised self-organizing network, called the Self-Organizing Convolutional Echo State Network (SO-ConvESN), for learning node centroids and interconnectivity maps compatible with the deterministic initialization of Echo State Network (ESN) input and reservoir weights, in the context of human action recognition (HAR). To ensure stability and the echo state property in the reservoir, Recurrence Plot (RP) and Recurrence Quantification Analysis (RQA) techniques are exploited to explain and characterize the reservoir dynamics and hence tune the ESN hyperparameters. The optimized self-organizing reservoirs are cascaded with a Convolutional Neural Network (CNN), so that the activation of the internal echo state representations (ESRs) echoes the topological qualities and temporal features of the input time series, and the CNN efficiently learns the dynamics and multiscale temporal features from the ESRs for action recognition. Hyperparameter optimization (HPO) algorithms are additionally adopted to optimize the CNN stage of SO-ConvESN. Experimental results on the HAR problem using several publicly available 3D-skeleton-based action datasets demonstrate the value of the RP and RQA techniques for examining the explainability of reservoir dynamics when designing stable self-organizing reservoirs, and the usefulness of HPO in SO-ConvESN for the HAR task. The proposed SO-ConvESN exhibits competitive recognition accuracy.
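At the core of any ESN variant is a fixed random reservoir whose leaky state update produces the echo state representations; the CNN stage then reads these off. A minimal NumPy sketch of the reservoir follows (the leak rate and weight scaling are assumed; SO-ConvESN additionally self-organizes the reservoir topology):

```python
# Minimal leaky-integrator echo state network reservoir.
import numpy as np

def reservoir_states(u, W_in, W, leak=0.3):
    """u: (T, D) input series; W_in: (R, D); W: (R, R) with spectral radius < 1."""
    T, R = u.shape[0], W.shape[0]
    x = np.zeros(R)
    states = np.zeros((T, R))
    for t in range(T):
        pre = W_in @ u[t] + W @ x
        x = (1 - leak) * x + leak * np.tanh(pre)   # leaky state update
        states[t] = x
    return states                                   # ESRs fed to the CNN stage
```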
Collapse
Affiliation(s)
- Gin Chong Lee
- Faculty of Engineering and Technology, Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia;
| | - Chu Kiong Loo
- Department of Artificial Intelligence, Faculty of Computer Science and Information Technology, Universiti Malaya, Kuala Lumpur 50603, Malaysia
| |
Collapse
|
39
|
Jiao SJ, Liu LY, Liu Q. A Hybrid Deep Learning Model for Recognizing Actions of Distracted Drivers. Sensors (Basel) 2021; 21:s21217424. [PMID: 34770728 PMCID: PMC8588220 DOI: 10.3390/s21217424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/09/2021] [Revised: 10/30/2021] [Accepted: 11/04/2021] [Indexed: 11/17/2022]
Abstract
With the rapid spread of in-vehicle information systems such as smartphones, navigation systems, and radios, the number of traffic accidents caused by driver distraction shows an increasing trend. Timely identification and warning are crucial for distracted driving, and the establishment of driver assistance systems is of great value. However, almost all research on recognizing drivers' distracted actions using computer vision methods has neglected the importance of temporal information for action recognition. This paper proposes a hybrid deep learning model for recognizing the actions of distracted drivers. Specifically, we used OpenPose to obtain skeleton information of the human body and then constructed the vector angles and modulus ratios of the human body structure as features to describe the driver's actions, thereby fusing deep network features with hand-crafted features and improving the information density of the spatial features. The K-means clustering algorithm was used to preselect the original frames, and inter-frame comparison was used to obtain the final keyframe sequence by comparing the Euclidean distance between manually constructed vectors representing the frames and the vector representing the cluster center. Finally, we constructed a two-layer long short-term memory network to obtain more effective spatiotemporal features, and a softmax layer to identify the distracted driver's action. Experimental results on the collected dataset prove the effectiveness of this framework, which can provide a theoretical basis for the establishment of vehicle distraction warning systems.
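The keyframe selection step admits a compact sketch: per-frame pose vectors are clustered with K-means, and for each cluster the frame closest to the cluster center (by Euclidean distance) is kept. The cluster count below is an assumed hyperparameter.

```python
# Hedged sketch of K-means keyframe selection over per-frame pose features.
import numpy as np
from sklearn.cluster import KMeans

def select_keyframes(frame_vecs, k=10):
    """frame_vecs: (T, D) hand-crafted pose features per frame."""
    km = KMeans(n_clusters=k, n_init=10).fit(frame_vecs)
    keyframes = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(frame_vecs[idx] - km.cluster_centers_[c], axis=1)
        keyframes.append(int(idx[d.argmin()]))   # frame nearest the centroid
    return sorted(keyframes)                      # keyframe sequence for the LSTM
```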
Collapse
|
40
|
Wurm MF, Caramazza A. Two 'what' pathways for action and object recognition. Trends Cogn Sci 2021; 26:103-116. [PMID: 34702661 DOI: 10.1016/j.tics.2021.10.003] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2021] [Revised: 09/03/2021] [Accepted: 10/01/2021] [Indexed: 10/20/2022]
Abstract
The ventral visual stream is conceived as a pathway for object recognition. However, we also recognize the actions an object can be involved in. Here, we show that action recognition critically depends on a pathway in lateral occipitotemporal cortex, partially overlapping and topographically aligned with object representations that are precursors for action recognition. By contrast, object features that are more relevant for object recognition, such as color and texture, are typically found in ventral occipitotemporal cortex. We argue that occipitotemporal cortex contains similarly organized lateral and ventral 'what' pathways for action and object recognition, respectively. This account explains a number of observed phenomena, such as the duplication of object domains and the specific representational profiles in lateral and ventral cortex.
Collapse
Affiliation(s)
- Moritz F Wurm
- Center for Mind/Brain Sciences - CIMeC, University of Trento, Corso Bettini 31, 38068 Rovereto, Italy.
| | - Alfonso Caramazza
- Center for Mind/Brain Sciences - CIMeC, University of Trento, Corso Bettini 31, 38068 Rovereto, Italy; Department of Psychology, Harvard University, 33 Kirkland St, Cambridge, MA 02138, USA
| |
Collapse
|
41
|
Kim D, Lee I, Kim D, Lee S. Action Recognition Using Close-Up of Maximum Activation and ETRI-Activity3D LivingLab Dataset. Sensors (Basel) 2021; 21:s21206774. [PMID: 34695988 PMCID: PMC8539691 DOI: 10.3390/s21206774] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/06/2021] [Revised: 09/24/2021] [Accepted: 10/07/2021] [Indexed: 11/16/2022]
Abstract
The development of action recognition models has shown great performance on various video datasets. Nevertheless, because existing datasets lack rich data on target actions, they are insufficient for the action recognition applications required by industry. To satisfy this requirement, datasets composed of highly available target actions have been created, but it is difficult to capture the varied characteristics of actual environments because the video data are generated in one specific environment. In this paper, we introduce the new ETRI-Activity3D-LivingLab dataset, which provides action sequences in actual environments and helps to handle the network generalization issue caused by dataset shift. When an action recognition model is trained on the ETRI-Activity3D and KIST SynADL datasets and evaluated on the ETRI-Activity3D-LivingLab dataset, performance can be severely degraded because the datasets were captured in different environment domains. To reduce this dataset shift between training and testing, we propose a close-up of maximum activation, which magnifies the most activated part of a video input in detail. In addition, we present various experimental results and analyses that illustrate the dataset shift and demonstrate the effectiveness of the proposed method.
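The proposed close-up operation can be pictured as a crop-and-zoom around the peak of an activation map, so the network re-examines its most responsive region at higher resolution. A minimal sketch (the window size and the mapping from activation to image coordinates are assumptions):

```python
# Minimal sketch of "close-up of maximum activation": crop a window around
# the spatial peak of an activation map.
import numpy as np

def close_up(frame, act_map, win=112):
    """frame: (H, W, 3) image; act_map: (h, w) activation from the network."""
    h, w = act_map.shape
    y, x = np.unravel_index(act_map.argmax(), act_map.shape)
    H, W = frame.shape[:2]
    cy, cx = int(y * H / h), int(x * W / w)       # map peak back to image coords
    y0, x0 = max(cy - win // 2, 0), max(cx - win // 2, 0)
    return frame[y0:y0 + win, x0:x0 + win]        # magnified input patch
```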
Collapse
Affiliation(s)
- Doyoung Kim
- Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea; (D.K.); (I.L.)
| | - Inwoong Lee
- Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea; (D.K.); (I.L.)
| | - Dohyung Kim
- Intelligent Robotics Research Division, Electronics and Telecommunications Research Institute, Daejeon 34129, Korea;
| | - Sanghoon Lee
- Department of Electrical and Electronic Engineering, Yonsei University, Seoul 03722, Korea; (D.K.); (I.L.)
- Department of Radiology, College of Medicine, Yonsei University, Seoul 03722, Korea
- Correspondence:
| |
Collapse
|
42
|
Liu D, Xu H, Wang J, Lu Y, Kong J, Qi M. Adaptive Attention Memory Graph Convolutional Networks for Skeleton-Based Action Recognition. Sensors (Basel) 2021; 21:6761. [PMID: 34695972 PMCID: PMC8538327 DOI: 10.3390/s21206761] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Revised: 10/06/2021] [Accepted: 10/08/2021] [Indexed: 11/17/2022]
Abstract
Graph Convolutional Networks (GCNs) have attracted considerable attention and shown remarkable performance for action recognition in recent years. To improve recognition accuracy, the key problems for this kind of method are how to build the graph structure adaptively, select key frames, and extract discriminative features. In this work, we propose novel Adaptive Attention Memory Graph Convolutional Networks (AAM-GCN) for human action recognition using skeleton data. We adopt a GCN to adaptively model the spatial configuration of skeletons and employ a Gated Recurrent Unit (GRU) to construct an attention-enhanced memory for capturing temporal features. With the memory module, our model can not only remember what happened in the past but also exploit information from the future through multiple bidirectional GRU layers. Furthermore, to extract discriminative temporal features, an attention mechanism is employed to select key frames from the skeleton sequence. Extensive experiments on the Kinetics, NTU RGB+D, and HDM05 datasets show that the proposed network achieves better performance than some state-of-the-art methods.
Collapse
Affiliation(s)
- Di Liu
- College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China; (D.L.); (H.X.); (J.W.); (Y.L.)
| | - Hui Xu
- College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China; (D.L.); (H.X.); (J.W.); (Y.L.)
| | - Jianzhong Wang
- College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China; (D.L.); (H.X.); (J.W.); (Y.L.)
| | - Yinghua Lu
- College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China; (D.L.); (H.X.); (J.W.); (Y.L.)
| | - Jun Kong
- Institute for Intelligent Elderly Care, Changchun Humanities and Sciences College, Changchun 130117, China
- Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun 130024, China
| | - Miao Qi
- College of Information Sciences and Technology, Northeast Normal University, Changchun 130117, China; (D.L.); (H.X.); (J.W.); (Y.L.)
| |
Collapse
|
43
|
Guan Y, Wang N, Yang C. An Improvement of Robot Stiffness-Adaptive Skill Primitive Generalization Using the Surface Electromyography in Human-Robot Collaboration. Front Neurosci 2021; 15:694914. [PMID: 34594181 PMCID: PMC8478287 DOI: 10.3389/fnins.2021.694914] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2021] [Accepted: 08/06/2021] [Indexed: 11/29/2022] Open
Abstract
Learning from Demonstration in robotics has proved its efficiency in robot skill learning. The generalization goals of most skill representation models in real scenarios are specified by humans or associated with other perceptual data. Our proposed framework uses Probabilistic Movement Primitive (ProMP) modeling to resolve the shortcomings of previous work: the coupling between stiffness and motion is inherently established in a single model. Such a framework can use a small amount of incomplete observation data to infer the entire skill primitive, and it can serve as an intuitive tool for sending generalization commands, achieving collaboration between humans and robots with human-like stiffness modulation strategies on either side. Experiments (human-robot hand-over, object matching, pick-and-place) were conducted to prove the effectiveness of the work. A Myo armband and a Leap Motion camera were used as the surface electromyography (sEMG) and motion capture sensors, respectively, in the experiments. The experiments also show that the proposed framework strengthens the ability to distinguish actions with similar movements under observation noise by introducing the sEMG signal into the ProMP model. The use of the mixture model opens possibilities for automating multiple collaborative tasks.
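The ability to infer a whole primitive from a few incomplete observations comes from standard ProMP conditioning: the skill is a Gaussian over basis-function weights, and observing part of a trajectory updates that Gaussian in closed form. A NumPy sketch of the update (variable names are generic; the paper's model couples stiffness and motion in the same weight vector):

```python
# Standard ProMP conditioning on a partial observation y_t at time t.
import numpy as np

def condition_promp(mu_w, Sigma_w, Psi_t, y_t, Sigma_y):
    """mu_w: (K,), Sigma_w: (K, K) weight prior; Psi_t: (D, K) basis at time t;
    y_t: (D,) observation; Sigma_y: (D, D) observation noise."""
    S = Psi_t @ Sigma_w @ Psi_t.T + Sigma_y
    K_gain = Sigma_w @ Psi_t.T @ np.linalg.inv(S)        # Kalman-style gain
    mu_new = mu_w + K_gain @ (y_t - Psi_t @ mu_w)
    Sigma_new = Sigma_w - K_gain @ Psi_t @ Sigma_w
    return mu_new, Sigma_new                              # updated primitive belief
```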
Collapse
Affiliation(s)
- Yuan Guan
- Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
| | - Ning Wang
- Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
| | - Chenguang Yang
- Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
| |
Collapse
|
44
|
Wang J, Cao D, Wang J, Liu C. Action Recognition of Lower Limbs Based on Surface Electromyography Weighted Feature Method. Sensors (Basel) 2021; 21:6147. [PMID: 34577352 PMCID: PMC8470121 DOI: 10.3390/s21186147] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/25/2021] [Revised: 09/03/2021] [Accepted: 09/08/2021] [Indexed: 11/16/2022]
Abstract
To improve the recognition rate of lower limb actions based on surface electromyography (sEMG), an effective weighted feature method is proposed and an improved genetic algorithm support vector machine (IGA-SVM) is designed in this paper. First, to address the high feature redundancy and low discrimination in sEMG feature extraction, the weighted feature method is proposed based on the correlation between muscles and actions. Second, to prevent the genetic algorithm's selection operator from easily falling into a local optimum, the improved genetic algorithm support vector machine is designed with a tournament ("championship") selection with sorting. Finally, the proposed method is used to recognize six types of lower limb actions, and the average recognition rate reaches 94.75%. Experimental results indicate that the proposed method has clear potential for lower limb action recognition.
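The modified selection operator is described as a championship (tournament) scheme; a generic tournament selection, shown below as an assumption-level sketch, conveys the idea: a few random contenders compete and the fittest one is selected, which maintains population diversity better than always picking the global best.

```python
# Generic tournament selection over a population of candidate solutions
# (e.g., SVM hyperparameter pairs), as an illustrative stand-in for the
# paper's championship-with-sorting operator.
import random

def tournament_select(population, fitness, tour_size=3):
    """population: list of candidates; fitness: parallel list of scores."""
    contenders = random.sample(range(len(population)), tour_size)
    best = max(contenders, key=lambda i: fitness[i])   # fittest contender wins
    return population[best]
```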
Collapse
Affiliation(s)
- Jiashuai Wang
- School of Engineering, Qufu Normal University, Rizhao 276826, China; (J.W.); (J.W.)
| | - Dianguo Cao
- School of Engineering, Qufu Normal University, Rizhao 276826, China; (J.W.); (J.W.)
| | - Jinqiang Wang
- School of Engineering, Qufu Normal University, Rizhao 276826, China; (J.W.); (J.W.)
| | - Chengyu Liu
- School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China;
| |
Collapse
|
45
|
Zin TT, Htet Y, Akagi Y, Tamura H, Kondo K, Araki S, Chosa E. Real-Time Action Recognition System for Elderly People Using Stereo Depth Camera. Sensors (Basel) 2021; 21:5895. [PMID: 34502783 DOI: 10.3390/s21175895] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/28/2021] [Revised: 08/28/2021] [Accepted: 08/30/2021] [Indexed: 11/16/2022]
Abstract
Smart technologies are necessary for ambient assisted living (AAL) to help family members, caregivers, and health-care professionals provide care for elderly people living independently. Among these technologies, the current work proposes a computer vision-based solution that can monitor the elderly by recognizing actions using a stereo depth camera. In this work, we introduce a system that fuses feature extraction methods from previous works in a novel combination for action recognition. Using the depth frame sequences provided by the depth camera, the system localizes people by extracting different regions of interest (ROI) from UV-disparity maps. As feature vectors, the spatial-temporal features of two action representation maps (depth motion appearance (DMA) and depth motion history (DMH), with a histogram of oriented gradients (HOG) descriptor) are used in combination with distance-based features, and fused with an automatic rounding method for action recognition over continuous long frame sequences. The experimental results, obtained on random frame sequences from a dataset collected at an elder care center, demonstrate that the proposed system can detect various actions in real time with reasonable recognition rates, regardless of the length of the image sequences.
Collapse
|
46
|
Sun Y, Shen Y, Ma L. MSST-RT: Multi-Stream Spatial-Temporal Relative Transformer for Skeleton-Based Action Recognition. Sensors (Basel) 2021; 21:5339. [PMID: 34450781 DOI: 10.3390/s21165339] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/30/2021] [Revised: 07/25/2021] [Accepted: 08/03/2021] [Indexed: 11/17/2022]
Abstract
Skeleton-based human action recognition has made great progress, especially with the development of graph convolution networks (GCNs). The most important work is ST-GCN, which automatically learns both spatial and temporal patterns from skeleton sequences. However, this method still has some imperfections: due to the limited receptive field of graph convolution, only short-range correlations are captured, whereas long-range dependencies are essential for recognizing human action. In this work, we propose a spatial-temporal relative transformer (ST-RT) to overcome these defects. By introducing relay nodes, ST-RT avoids the transformer architecture breaking the inherent skeleton topology in the spatial dimension and the order of the skeleton sequence in the temporal dimension. Furthermore, we mine the dynamic information contained in motion at different scales. Finally, four ST-RTs, which extract spatial-temporal features from four kinds of skeleton sequence, are fused to form the final model, the multi-stream spatial-temporal relative transformer (MSST-RT), to enhance performance. Extensive experiments evaluate the proposed methods on three benchmarks for skeleton-based action recognition: NTU RGB+D, NTU RGB+D 120 and UAV-Human. The results demonstrate that MSST-RT is on par with the state of the art in terms of performance.
|
47
|
Lin FC, Ngo HH, Dow CR, Lam KH, Le HL. Student Behavior Recognition System for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection. Sensors (Basel) 2021; 21:5314. [PMID: 34450754 DOI: 10.3390/s21165314] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/08/2021] [Revised: 07/31/2021] [Accepted: 08/04/2021] [Indexed: 11/19/2022]
Abstract
Human action recognition has attracted considerable research attention in the field of computer vision, especially for classroom environments. However, most relevant studies have focused on a single specific student behavior. This paper therefore proposes a student behavior recognition system based on skeleton pose estimation and person detection. First, consecutive frames captured with a classroom camera were used as the input images of the proposed system, and skeleton data were collected using the OpenPose framework. An error correction scheme based on the pose estimation and person detection techniques was proposed to reduce incorrect connections in the skeleton data, and several joints with little effect on behavior classification were then removed from the preprocessed data. Second, feature extraction was performed to generate feature vectors that represent human postures; the adopted features included normalized joint locations, joint distances, and bone angles. Finally, behavior classification was conducted with a deep neural network to recognize student behaviors, and the proposed system was also able to count the number of students in a classroom. A system prototype was implemented to verify the feasibility of the approach. The experimental results indicated that the proposed scheme outperformed a purely skeleton-based scheme in complex situations, achieving 15.15% higher average precision and 12.15% higher average recall.
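The three feature families named above (normalized joint locations, joint distances, and bone angles) are straightforward to compute from 2D keypoints. The sketch below illustrates one plausible encoding; the bone list, the centroid/bounding-box normalization, and the angle reference axis are illustrative choices, not the paper's exact definitions.

```python
import numpy as np

def posture_features(joints, bones):
    """Build a posture feature vector from 2D keypoints, combining the
    three feature families named in the abstract."""
    joints = np.asarray(joints, dtype=np.float32)      # (num_joints, 2)
    # 1. Normalized joint locations: translate to the centroid and
    #    scale by the pose's bounding-box diagonal.
    centered = joints - joints.mean(axis=0)
    scale = np.linalg.norm(joints.max(axis=0) - joints.min(axis=0)) + 1e-6
    locations = (centered / scale).ravel()
    # 2. Pairwise joint distances (upper triangle only), scale-normalized.
    diffs = joints[:, None, :] - joints[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(joints), k=1)
    distances = dists[iu] / scale
    # 3. Bone angles relative to the image x-axis.
    angles = np.array([np.arctan2(*(joints[c] - joints[p])[::-1])
                       for p, c in bones], dtype=np.float32)
    return np.concatenate([locations, distances, angles])

# Example with a toy 5-joint skeleton (head, shoulders, hands)
bones = [(0, 1), (0, 2), (1, 3), (2, 4)]
vec = posture_features(np.random.rand(5, 2) * 100, bones)
```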
|
48
|
Zhang Y, Po LM, Xiong J, Rehman YAU, Cheung KW. ASNet: Auto-Augmented Siamese Neural Network for Action Recognition. Sensors (Basel) 2021; 21:4720. [PMID: 34300460 PMCID: PMC8309510 DOI: 10.3390/s21144720] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/26/2021] [Revised: 07/03/2021] [Accepted: 07/07/2021] [Indexed: 11/16/2022]
Abstract
Human action recognition methods for videos based on deep convolutional neural networks usually use random cropping or its variants for data augmentation. However, this traditional augmentation approach may generate many non-informative samples (video patches covering only a small part of the foreground, or only the background) that are unrelated to a specific action. Such samples can be regarded as noisy samples with incorrect labels, which reduces overall action recognition performance. In this paper, we attempt to mitigate the impact of noisy samples by proposing an Auto-augmented Siamese Neural Network (ASNet). In this framework, we backpropagate salient patches and randomly cropped samples in the same iteration, performing gradient compensation that alleviates the adverse gradient effects of non-informative samples. Salient patches are samples containing information critical for human action recognition. Their generation is formulated as a Markov decision process, and a reinforcement learning agent called the Salient Patch Agent (SPA) is introduced to extract patches in a weakly supervised manner without extra labels. Extensive experiments were conducted on two well-known datasets, UCF-101 and HMDB-51, to verify the effectiveness of the proposed SPA and ASNet.
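The gradient-compensation idea reduces to backpropagating two views of the same video through the shared network in one optimization step. A minimal sketch, assuming a standard classifier and equal weighting of the two losses (the weighting is an assumption, not taken from the paper):

```python
import torch.nn.functional as F

def training_step(model, random_crop, salient_patch, label, optimizer):
    """One step of the gradient-compensation idea: a randomly cropped
    clip and a salient patch of the same video are backpropagated in
    the same iteration, so the informative patch offsets the gradient
    of a possibly non-informative random crop."""
    optimizer.zero_grad()
    loss_random = F.cross_entropy(model(random_crop), label)     # possibly noisy view
    loss_salient = F.cross_entropy(model(salient_patch), label)  # informative view
    loss = loss_random + loss_salient  # gradients accumulate through both paths
    loss.backward()
    optimizer.step()
    return loss.item()
```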
Affiliation(s)
- Yujia Zhang
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China; (Y.Z.); (J.X.)
| | - Lai-Man Po
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China; (Y.Z.); (J.X.)
| | - Jingjing Xiong
- Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China; (Y.Z.); (J.X.)
| | | | - Kwok-Wai Cheung
- School of Communication, The Hang Seng University of Hong Kong, Hong Kong, China;
| |
|
49
|
Abstract
Classification of human actions is an ongoing research problem in computer vision. This review aims to scope the current literature on data fusion and action recognition techniques and to identify gaps and future research directions. Success in producing cost-effective and portable vision-based sensors has dramatically increased the number and size of datasets. This increase intersects with advances in deep learning architectures and computational support, both of which offer significant research opportunities. Naturally, each action-data modality, such as RGB, depth, skeleton, and infrared (IR), has distinct characteristics; therefore, it is important to exploit the value of each modality for better action recognition. In this paper, we focus solely on data fusion and recognition techniques in the context of vision with an RGB-D perspective. We conclude by discussing research challenges, emerging trends, and possible future research directions.
|
50
|
Lee H, Youm S. Development of a Wearable Camera and AI Algorithm for Medication Behavior Recognition. Sensors (Basel) 2021; 21:3594. [PMID: 34064177 PMCID: PMC8196696 DOI: 10.3390/s21113594] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/06/2021] [Revised: 05/01/2021] [Accepted: 05/18/2021] [Indexed: 11/22/2022]
Abstract
As many as 40% to 50% of patients do not adhere to long-term medications for managing chronic conditions such as diabetes or hypertension. From the perspective of health professionals, the limited opportunity for medication monitoring is a major problem: prompt medication error reports would allow health professionals to intervene immediately, and would let clinical researchers adjust experiments easily and predict health outcomes based on medication compliance. This study proposes a method in which videos of patients taking medications are recorded with a camera image sensor integrated into a wearable device, and the collected data are used as a training dataset based on the latest convolutional neural network (CNN) techniques. As the artificial intelligence (AI) algorithm for analyzing medication behavior, we constructed an object detection model (Model 1) using the faster region-based CNN (Faster R-CNN) technique and an action recognition model (Model 2) that uses the combined feature values. A total of 50,000 images were collected from 89 participants and labeled across different data categories to train the algorithm. The combination of the object detection model (Model 1) and the action recognition model (Model 2) achieved an accuracy of 92.7%, which is notably high for medication behavior recognition. This study is expected to enable rapid intervention by providers treating patients through prompt reporting of medication errors.
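The two-stage design (detection, then action classification over the detected objects' features) can be sketched as follows. This is a hypothetical reconstruction: an off-the-shelf torchvision Faster R-CNN stands in for Model 1, and the feature encoding and classifier for Model 2 are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Stage 1 (Model 1): pretrained Faster R-CNN as a stand-in detector;
# in the study it would be fine-tuned on medication-related classes
# (e.g., hand, pill, bottle).
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

# Stage 2 (Model 2): a small classifier over per-frame detection
# features (top-k box coordinates and confidence scores).
TOP_K, NUM_ACTIONS = 5, 3  # e.g., reaching, taking, drinking (illustrative)
classifier = nn.Sequential(
    nn.Linear(TOP_K * 5, 64), nn.ReLU(), nn.Linear(64, NUM_ACTIONS))

def frame_action_logits(frame):
    """frame: float tensor (3, H, W) with values in [0, 1]."""
    with torch.no_grad():
        det = detector([frame])[0]        # dict with boxes, labels, scores
    feats = torch.zeros(TOP_K, 5)
    n = min(TOP_K, det["boxes"].size(0))  # detections come sorted by score
    feats[:n, :4] = det["boxes"][:n]      # box coordinates
    feats[:n, 4] = det["scores"][:n]      # detection confidence
    return classifier(feats.flatten())

logits = frame_action_logits(torch.rand(3, 480, 640))
```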
Affiliation(s)
| | - Sekyoung Youm
- Department of Industrial and Systems Engineering and Gerontechnology Research Center, Dongguk University, Seoul 40620, Korea
- Correspondence: ; Tel.: +82-2-2260-3377
| |
|