1. Vahdani E, Tian Y. Deep Learning-Based Action Detection in Untrimmed Videos: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2023;45:4302-4320. [PMID: 35877805] [DOI: 10.1109/tpami.2022.3193611]
Abstract
Understanding human behavior and activity facilitates the advancement of numerous real-world applications and is critical for video analysis. Despite the progress of action recognition algorithms on trimmed videos, the majority of real-world videos are lengthy and untrimmed, with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundaries of actions and classify the action categories. Temporal activity detection has been investigated under full and limited supervision, depending on the availability of action annotations. This article provides an extensive overview of deep learning-based algorithms for temporal action detection in untrimmed videos at different supervision levels, including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, it reviews advances in spatio-temporal action detection, where actions are localized in both the temporal and spatial dimensions. Action detection in the online setting is also reviewed, where the goal is to detect actions in each frame of a live video stream without considering any future context. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of state-of-the-art methods is compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.
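To make the evaluation protocol mentioned in the abstract concrete, below is a minimal sketch (mine, not the survey's) of temporal intersection-over-union, the overlap measure on which the commonly reported mAP metrics are based; the function name and the 0.5 threshold are illustrative assumptions.

```python
# Hedged sketch: temporal IoU between a predicted and a ground-truth action
# segment, the overlap measure underlying the mAP metrics used by most
# temporal action detection benchmarks.
from typing import Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction is typically counted as a true positive when its tIoU with an
# unmatched ground-truth segment exceeds a threshold such as 0.5.
print(temporal_iou((2.0, 7.5), (3.0, 8.0)))  # 0.75
```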
2. Dong X, Chen Z, Liu YJ, Yao J, Guo X. GPU-Based Supervoxel Generation With a Novel Anisotropic Metric. IEEE Transactions on Image Processing 2021;30:8847-8860. [PMID: 34694996] [DOI: 10.1109/tip.2021.3120878]
Abstract
Video over-segmentation into supervoxels is an important pre-processing technique for many computer vision tasks. Videos are an order of magnitude larger than images, and most existing methods for generating supervoxels are either memory- or time-inefficient, which limits their application in subsequent video processing tasks. In this paper, we present an anisotropic supervoxel method that is memory-efficient and can be executed on the graphics processing unit (GPU), so our algorithm achieves a good balance among segmentation quality, memory usage, and processing time. In order to segment moving objects in video accurately, we use optical flow information to design a new non-Euclidean metric that computes anisotropic distances between seeds and voxels. To evaluate this metric efficiently, we adapt the classic jump flooding algorithm (designed for parallel execution on the GPU) to generate an anisotropic Voronoi tessellation in the combined color and spatio-temporal space. We compare our method with representative supervoxel algorithms in terms of segmentation performance, computation speed, and memory efficiency, and we apply the supervoxel results to foreground propagation in videos to test performance on a practical problem. Experiments show that our algorithm is much faster than existing methods and achieves a good balance between segmentation quality and efficiency.
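The following is a minimal sketch of the kind of anisotropic, flow-aware seed-to-voxel distance the abstract describes; the feature layout and weights are assumptions for illustration, not the paper's exact metric.

```python
# Hedged sketch of an anisotropic seed-to-voxel distance over color,
# space-time, and optical flow. Flow disagreement stretches the metric
# across motion boundaries, which is what makes it anisotropic.
import numpy as np

def anisotropic_distance(seed, voxel, w_color=1.0, w_space=0.5, w_flow=2.0):
    """seed/voxel: dicts with 'lab' (3,), 'xyt' (3,), 'flow' (2,) arrays."""
    d_color = np.linalg.norm(seed["lab"] - voxel["lab"])
    d_space = np.linalg.norm(seed["xyt"] - voxel["xyt"])
    d_flow = np.linalg.norm(seed["flow"] - voxel["flow"])
    return np.sqrt(w_color * d_color**2 + w_space * d_space**2 + w_flow * d_flow**2)

seed = {"lab": np.array([50.0, 10.0, 10.0]), "xyt": np.array([5.0, 5.0, 0.0]),
        "flow": np.array([1.0, 0.0])}
voxel = {"lab": np.array([52.0, 11.0, 9.0]), "xyt": np.array([7.0, 4.0, 1.0]),
         "flow": np.array([0.0, 0.5])}
print(anisotropic_distance(seed, voxel))
```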
3.
4. Yi R, Ye Z, Zhao W, Yu M, Lai YK, Liu YJ. Feature-Aware Uniform Tessellations on Video Manifold for Content-Sensitive Supervoxels. IEEE Transactions on Pattern Analysis and Machine Intelligence 2021;43:3183-3195. [PMID: 32167886] [DOI: 10.1109/tpami.2020.2979714]
Abstract
Over-segmenting a video into supervoxels has strong potential to reduce the complexity of downstream computer vision applications. Content-sensitive supervoxels (CSSs) are typically smaller in content-dense regions (i.e., with high variation of appearance and/or motion) and larger in content-sparse regions. In this paper, we propose to compute feature-aware CSSs (FCSSs), which are regularly shaped 3D primitive volumes well aligned with local object/region/motion boundaries in video. To compute FCSSs, we map a video to a 3D manifold embedded in a combined color and spatiotemporal space, in which the volume elements of the video manifold give a good measure of video content density; any uniform tessellation on the video manifold then induces CSSs in the video. Our idea is that, among all possible uniform tessellations on the video manifold, FCSS finds one whose cell boundaries align well with local video feature boundaries. To achieve this goal, we propose a novel restricted centroidal Voronoi tessellation method that simultaneously minimizes the tessellation energy (leading to uniform cells in the tessellation) and maximizes the average boundary distance (leading to good local feature alignment). Theoretically, our method has an optimal competitive ratio O(1), and its time and space complexities are O(NK) and O(N+K) for computing K supervoxels in an N-voxel video. We also present a simple extension of FCSS to streaming FCSS for processing long videos that cannot be loaded into main memory at once. We evaluate FCSS, streaming FCSS, and ten representative supervoxel methods on four video datasets and two novel video applications. The results show that our method simultaneously achieves state-of-the-art performance with respect to various evaluation criteria.
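As a rough illustration of the trade-off the abstract describes (my notation, not the paper's), the two goals can be written as a single objective over cells C_k with centroids c_k in the combined color and spatiotemporal embedding, where B(C_k) stands for the boundary-distance term the abstract refers to and lambda balances the two parts:

```latex
\min_{\{C_k\}_{k=1}^{K}} \;
\underbrace{\sum_{k=1}^{K} \sum_{v \in C_k} \bigl\| \phi(v) - \mathbf{c}_k \bigr\|^{2}}_{\text{tessellation energy (uniform cells)}}
\;-\; \lambda \,
\underbrace{\frac{1}{K} \sum_{k=1}^{K} B(C_k)}_{\text{average boundary distance (feature alignment)}}
```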
5. Wang B, Chen Y, Liu W, Qin J, Du Y, Han G, He S. Real-Time Hierarchical Supervoxel Segmentation via a Minimum Spanning Tree. IEEE Transactions on Image Processing 2020;29:9665-9677. [PMID: 33074808] [DOI: 10.1109/tip.2020.3030502]
Abstract
Supervoxel segmentation is widely used as a preprocessing step for many vision tasks. However, existing supervoxel segmentation algorithms cannot generate hierarchical supervoxel segmentations that preserve spatiotemporal boundaries well in real time, which prevents downstream applications from processing videos accurately and efficiently. In this paper, we propose a real-time hierarchical supervoxel segmentation algorithm based on the minimum spanning tree (MST), which achieves state-of-the-art accuracy while being at least 11× faster than existing methods. In particular, we introduce a dynamic graph updating operation into the iterative construction of the MST, which geometrically decreases the number of vertices and edges. In this way, the proposed method is able to generate arbitrary scales of supervoxels on the fly. We prove that our algorithm produces hierarchical supervoxels in O(n) time, where n denotes the number of voxels in the input video. Quantitative and qualitative evaluations on public benchmarks demonstrate that our algorithm significantly outperforms state-of-the-art algorithms in terms of supervoxel segmentation accuracy and computational efficiency. Furthermore, we demonstrate the effectiveness of the proposed method on a downstream application of video object segmentation.
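A minimal sketch of MST-style hierarchical grouping in the spirit of the abstract: merge voxels along increasing edge weights (Kruskal order) and read off labelings at several thresholds to obtain a hierarchy. The toy graph, weights, and thresholds are illustrative assumptions, not the paper's dynamic graph updating scheme.

```python
# Hedged sketch: hierarchical labelings from an MST over a voxel graph.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def hierarchical_labels(n_voxels, edges, thresholds):
    """edges: (weight, u, v) tuples; returns one labeling per threshold."""
    edges = sorted(edges)                 # Kruskal order
    uf, levels, i = UnionFind(n_voxels), [], 0
    for t in sorted(thresholds):
        while i < len(edges) and edges[i][0] <= t:
            _, u, v = edges[i]
            uf.union(u, v)                # only MST edges cause actual merges
            i += 1
        levels.append([uf.find(x) for x in range(n_voxels)])
    return levels

edges = [(0.1, 0, 1), (0.2, 1, 2), (0.9, 2, 3), (0.3, 3, 4)]
print(hierarchical_labels(5, edges, thresholds=[0.25, 1.0]))
```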
6. Wang B, Liu W, Han G, He S. Learning Long-Term Structural Dependencies for Video Salient Object Detection. IEEE Transactions on Image Processing 2020;29:9017-9031. [PMID: 32941135] [DOI: 10.1109/tip.2020.3023591]
Abstract
Existing video salient object detection (VSOD) methods focus on exploring either short-term or long-term temporal information. However, temporal information is typically exploited at a global frame level or over a regular grid structure, neglecting inter-frame structural dependencies. In this paper, we propose to learn long-term structural dependencies with a structure-evolving graph convolutional network (GCN). In particular, we construct a graph for the entire video using a fast supervoxel segmentation method, in which nodes are connected according to spatio-temporal structural similarity. We infer the inter-frame structural dependencies of salient objects using convolutional operations on the graph. To prune redundant connections in the graph and better adapt to moving salient objects, we present an adaptive graph pooling operation that evolves the graph structure by dynamically merging similar nodes, learning better hierarchical representations of the graph. Experiments on six public datasets show that our method outperforms all other state-of-the-art methods. Furthermore, we demonstrate that the proposed adaptive graph pooling effectively improves the supervoxel algorithm in terms of segmentation accuracy.
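A minimal sketch of a single graph-convolution step over supervoxel nodes, to make the idea of convolving on a video graph concrete; the symmetric normalization, toy affinity graph, and feature sizes are assumptions, not the paper's structure-evolving design.

```python
# Hedged sketch: one GCN layer over supervoxel nodes of a video graph.
import numpy as np

def gcn_layer(A, H, W):
    """A: (N,N) adjacency over supervoxel nodes, H: (N,F) node features,
    W: (F,F') weights. Returns ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-connections
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = (rng.random((6, 6)) > 0.6).astype(float)  # toy supervoxel affinity graph
A = np.maximum(A, A.T)                        # symmetric spatio-temporal links
H = rng.standard_normal((6, 8))               # per-node appearance features
W = rng.standard_normal((8, 4))
print(gcn_layer(A, H, W).shape)               # (6, 4)
```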
7.
Abstract
Object segmentation and object tracking are fundamental research areas in the computer vision community. Both face common challenges such as occlusion, deformation, motion blur, and scale variation. Segmentation must additionally contend with heterogeneous objects, interacting objects, edge ambiguity, and shape complexity, while tracking struggles with fast motion, out-of-view targets, and real-time processing. Combining the two problems into Video Object Segmentation and Tracking (VOST) can mitigate their respective difficulties and improve performance. VOST can be widely applied to many practical applications such as video summarization, high-definition video compression, human-computer interaction, and autonomous vehicles. This survey aims to provide a comprehensive review of state-of-the-art VOST methods, classify them into different categories, and identify new trends. First, we broadly categorize VOST methods into Video Object Segmentation (VOS) and Segmentation-based Object Tracking (SOT), further classify each category into various types based on the segmentation and tracking mechanism, and present representative VOS and SOT methods from each period. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video datasets and provide a variety of evaluation metrics. Finally, we point out a set of interesting future directions and draw our own conclusions.
Affiliation(s)
- Rui Yao
- School of Computer Science and Technology, China University of Mining and Technology, China; Engineering Research Center of Mine Digitization, Ministry of Education of the People’s Republic of China, China; The Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Xuzhou, China
- Shixiong Xia
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
- Jiaqi Zhao
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
- Yong Zhou
- School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China
8. Kamranian Z, Naghsh Nilchi AR, Sadeghian H, Tombari F, Navab N. Joint motion boundary detection and CNN-based feature visualization for video object segmentation. Neural Computing and Applications 2020. [DOI: 10.1007/s00521-019-04448-7]
9. Xiong B, Jain SD, Grauman K. Pixel Objectness: Learning to Segment Generic Objects Automatically in Images and Videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 2019;41:2677-2692. [PMID: 30130176] [DOI: 10.1109/tpami.2018.2865794]
Abstract
We propose an end-to-end learning framework for segmenting generic objects in both images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented using a deep fully convolutional network. When applied to a video, our model further incorporates a motion stream, and the network learns to combine appearance and motion to extract all prominent objects, whether they are moving or not. Beyond the core model, a second contribution of our approach is how it leverages varying strengths of training annotations. Pixel-level annotations are quite difficult to obtain, yet crucial for training a deep segmentation network. We therefore propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing many object category examples with only image-level labels together with relatively few images carrying boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given our high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/.
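A minimal sketch of the appearance-plus-motion idea: per-pixel object/background logits from the two streams are combined into one foreground probability map. The element-wise max fusion rule here is an illustrative assumption, not the paper's learned fusion.

```python
# Hedged sketch: fusing appearance and motion stream outputs per pixel.
import numpy as np

def fuse_streams(appearance_logits, motion_logits):
    """Both inputs: (H, W) object-vs-background logits; returns (H, W) probs."""
    fused = np.maximum(appearance_logits, motion_logits)  # keep the stronger cue
    return 1.0 / (1.0 + np.exp(-fused))                   # sigmoid

app = np.array([[2.0, -1.0], [0.5, -3.0]])
mot = np.array([[-0.5, 1.5], [0.2, -2.0]])
mask = fuse_streams(app, mot) > 0.5                       # binary foreground mask
print(mask)
```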
10.
11.
12. Liu L, Zhou Y, Shao L. Deep Action Parsing in Videos With Large-Scale Synthesized Data. IEEE Transactions on Image Processing 2018;27:2869-2882. [PMID: 29570088] [DOI: 10.1109/tip.2018.2813530]
Abstract
Action parsing in videos with complex scenes is an interesting but challenging task in computer vision. In this paper, we propose a generic 3D convolutional neural network trained in a multi-task learning manner for effective Deep Action Parsing (DAP3D-Net) in videos. In the training phase, action localization, classification, and attribute learning are jointly optimized on our appearance-motion data via DAP3D-Net. For a test video, each individual action can then be described simultaneously in terms of where the action occurs, what the action is, and how it is performed. To demonstrate the effectiveness of DAP3D-Net, we also contribute a new Numerous-category Aligned Synthetic Action dataset, NASA, which consists of 200,000 action clips spanning over 300 categories with 33 pre-defined action attributes on two hierarchical levels (low-level attributes of basic body-part movements and high-level attributes related to action motion). We train DAP3D-Net on the NASA dataset and then evaluate it on our collected Human Action Understanding dataset and the public THUMOS dataset. Experimental results show that our approach can accurately localize, categorize, and describe multiple actions in realistic videos.
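A minimal sketch of the "where / what / how" multi-task output the abstract describes, using plain linear heads on a pooled clip feature; the feature size, head shapes, and thresholding are assumptions, not the DAP3D-Net architecture.

```python
# Hedged sketch: separate heads for localization, classification, and
# attributes on top of a shared (pooled) 3D-CNN clip feature.
import numpy as np

rng = np.random.default_rng(0)
feat = rng.standard_normal(256)                    # pooled clip feature (toy)

W_loc, W_cls, W_att = (rng.standard_normal((256, k)) for k in (2, 300, 33))
where = feat @ W_loc                               # "where": start/end offsets
what = np.argmax(feat @ W_cls)                     # "what": one of ~300 classes
how = (1 / (1 + np.exp(-(feat @ W_att)))) > 0.5    # "how": 33 binary attributes
print(where.shape, what, how.sum())
```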
13.
14.
15. Galteri L, Seidenari L, Bertini M, Bimbo AD. Spatio-Temporal Closed-Loop Object Detection. IEEE Transactions on Image Processing 2017;26:1253-1263. [PMID: 28092538] [DOI: 10.1109/tip.2017.2651367]
Abstract
Object detection is one of the most important tasks of computer vision. It is usually performed by evaluating a subset of the possible locations in an image that are more likely to contain the object of interest. Exhaustive approaches have now been superseded by object proposal methods. The interplay of detectors and proposal algorithms has not been fully analyzed and exploited up to now, although this is a very relevant problem for object detection in video sequences. We propose to connect detectors and object proposal generators in a closed loop, exploiting the ordered and continuous nature of video sequences. Unlike tracking, we only require the previous frame to improve both proposals and detections: no prediction based on local motion is performed, thus avoiding tracking errors. We obtain three to four points of improvement in mAP and a detection time lower than that of Faster R-CNN (Faster Regions with CNN features), the fastest convolutional neural network (CNN)-based generic object detector at the time.
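A minimal sketch of the closed-loop idea: detections from frame t-1 are fed back as additional proposals for frame t, with no motion prediction. The proposal representation and the fixed jitter margin are illustrative assumptions.

```python
# Hedged sketch: feed previous-frame detections back as extra proposals.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x1, y1, x2, y2

def closed_loop_proposals(generic_proposals: List[Box],
                          prev_detections: List[Box],
                          jitter: float = 8.0) -> List[Box]:
    """Augment generic proposals with slightly enlarged previous detections."""
    fed_back = [(x1 - jitter, y1 - jitter, x2 + jitter, y2 + jitter)
                for (x1, y1, x2, y2) in prev_detections]
    return generic_proposals + fed_back

props = closed_loop_proposals([(10, 10, 50, 60)], [(12, 11, 52, 58)])
print(len(props))  # generic proposals plus one fed-back detection
```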
16. Fu H, Xu D, Lin S. Object-Based Multiple Foreground Segmentation in RGBD Video. IEEE Transactions on Image Processing 2017;26:1418-1427. [PMID: 28092539] [DOI: 10.1109/tip.2017.2651369]
Abstract
We present an RGB and Depth (RGBD) video segmentation method that takes advantage of depth data and can extract multiple foregrounds in the scene. Video segmentation is addressed as an object proposal selection problem formulated on a fully connected graph, where a flexible number of foregrounds may be chosen. In our graph, each node represents a proposal, and the edges model intra-frame and inter-frame constraints on the solution. The proposals are selected based on an RGBD video saliency map, in which depth-based features are utilized to enhance the identification of foregrounds. Experiments show that the proposed multiple foreground segmentation method outperforms related techniques, and that the depth cue serves as a helpful complement to RGB features. Moreover, our method provides performance comparable to state-of-the-art RGB video segmentation techniques on regular RGB videos with estimated depth maps.
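A minimal sketch of proposal selection with unary saliency scores and pairwise overlap penalties, in the spirit of the graph formulation above; the greedy solver and weights are assumptions, not the paper's inference method.

```python
# Hedged sketch: greedily select proposals that are salient but do not
# overlap too much with already-selected proposals.
import numpy as np

def select_proposals(saliency, overlap, max_k=3, penalty=1.0):
    """saliency: (N,) scores; overlap: (N,N) pairwise IoU; returns indices."""
    chosen = []
    for _ in range(max_k):
        best, best_gain = None, 0.0
        for i in range(len(saliency)):
            if i in chosen:
                continue
            gain = saliency[i] - penalty * sum(overlap[i, j] for j in chosen)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break                      # no remaining proposal has positive gain
        chosen.append(best)
    return chosen

sal = np.array([0.9, 0.8, 0.3, 0.7])
ov = np.array([[1.0, 0.6, 0.1, 0.0],
               [0.6, 1.0, 0.2, 0.0],
               [0.1, 0.2, 1.0, 0.0],
               [0.0, 0.0, 0.0, 1.0]])
print(select_proposals(sal, ov))       # [0, 3, 1] with these toy values
```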
17.