1. Ding B, Xie J, Nie J, Wu Y, Cao J. C2BG-Net: Cross-modality and cross-scale balance network with global semantics for multi-modal 3D object detection. Neural Netw 2024;179:106535. PMID: 39047336. DOI: 10.1016/j.neunet.2024.106535.
Abstract
Multi-modal 3D object detection is instrumental in identifying and localizing objects within 3D space. It combines RGB images from cameras with point-cloud data from LiDAR sensors and serves as a fundamental technology for autonomous driving. Current methods commonly employ simple element-wise additions or multiplications to aggregate the multi-modal features extracted from point clouds and images. While these methods improve detection accuracy, such basic operations make it difficult to balance the relative importance of the two modalities and can introduce noise and irrelevant information during feature aggregation. Additionally, the multi-level features extracted from images exhibit imbalanced receptive fields. To tackle these challenges, we propose two networks: a cross-modality balance network (CMN) and a cross-scale balance network (CSN). CMN incorporates cross-modality attention mechanisms and introduces an auxiliary 2D detection head to balance the importance of the two modalities. CSN leverages cross-scale attention mechanisms to mitigate the gap in receptive fields between different image levels. We further introduce a novel Local with Global Voxel Attention Encoder (LGVAE) that captures global semantics by encoding more comprehensive point-level information into voxel-level features. We perform comprehensive experiments on three challenging public benchmarks: KITTI, Dense and nuScenes. The results consistently show improvements across multiple 3D object detection frameworks, confirming the effectiveness and versatility of the proposed method. Notably, our approach achieves an absolute gain of 3.1% over the MVXNet baseline on the challenging Hard set of the Dense test set.
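The abstract contrasts plain element-wise addition or multiplication with attention-based fusion that balances the two modalities. As an illustration only (the paper's actual CMN design is not given here, and all layer names and shapes below are assumptions), a minimal cross-modality gate in PyTorch could learn per-pixel weights before aggregating image and LiDAR features:

```python
import torch
import torch.nn as nn

class CrossModalityGate(nn.Module):
    """Toy cross-modality fusion: learn per-pixel weights that balance
    image features against LiDAR features before aggregation.
    Illustrative sketch only, not the CMN from the paper."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a [0, 1] gate from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, img_feat: torch.Tensor, lidar_feat: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([img_feat, lidar_feat], dim=1))
        # Weighted sum instead of a plain element-wise addition.
        return w * img_feat + (1.0 - w) * lidar_feat

# Example: fuse 64-channel, spatially aligned image and LiDAR feature maps.
fuse = CrossModalityGate(channels=64)
img = torch.randn(2, 64, 100, 100)
pts = torch.randn(2, 64, 100, 100)
out = fuse(img, pts)  # shape (2, 64, 100, 100)
```

The gate collapses to plain addition when w is constant at 0.5, which is why a learned weighting is a strict generalization of the element-wise baselines mentioned in the abstract.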
Affiliation(s)
- Bonan Ding
- School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China
- Jin Xie
- School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China
- Jing Nie
- School of Microelectronics and Communication Engineering, Chongqing University, Chongqing, 400044, China
- Yulong Wu
- School of Big Data and Software Engineering, Chongqing University, Chongqing, 400044, China
- Jiale Cao
- School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China; Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
2. Shao Y, Tan A, Wang B, Yan T, Sun Z, Zhang Y, Liu J. MS23D: A 3D object detection method using multi-scale semantic feature points to construct 3D feature layer. Neural Netw 2024;179:106623. PMID: 39154419. DOI: 10.1016/j.neunet.2024.106623.
Abstract
LiDAR point clouds can effectively depict the motion and posture of objects in three-dimensional space. Many studies perform 3D object detection by voxelizing point clouds. However, in autonomous driving scenarios, the sparsity and hollowness of point clouds create difficulties for voxel-based methods: sparsity makes it challenging to describe the geometric features of objects, while hollowness hinders the aggregation of 3D features. We propose a two-stage 3D object detection framework called MS23D. (1) We construct the 3D feature layer from voxel feature points drawn from multiple branches, yielding a relatively compact 3D feature layer with rich semantic features. We also propose a distance-weighted sampling method that reduces the loss of foreground points caused by downsampling, allowing the 3D feature layer to retain more foreground points. (2) To address the hollowness of point clouds, we predict offsets between deep-level feature points and the object's centroid, moving them as close as possible to the centroid so that these semantically rich feature points can be aggregated. Shallow-level feature points are retained on the object's surface to describe its geometric features. We validate the effectiveness of our approach on both the KITTI and ONCE datasets.
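The abstract mentions distance-weighted sampling that keeps more foreground points during downsampling but does not spell out the weighting. One plausible reading, sketched below in NumPy, biases the sampling probability toward distant (and therefore sparser) points; the exact weighting function, the function name and the example sizes are assumptions, not the paper's definition.

```python
import numpy as np

def distance_weighted_sample(points: np.ndarray, num_samples: int, rng=None) -> np.ndarray:
    """Downsample an (N, 3+) point cloud, biasing selection toward distant
    points, which are sparser and easier to lose with uniform sampling.
    Illustrative only; the weighting used in MS23D may differ."""
    rng = np.random.default_rng() if rng is None else rng
    dist = np.linalg.norm(points[:, :3], axis=1)      # range from the sensor origin
    weights = dist / (dist.sum() + 1e-8)              # farther point -> higher keep probability
    idx = rng.choice(len(points), size=num_samples, replace=False, p=weights)
    return points[idx]

# Example: keep 4096 of 20000 points (x, y, z, intensity).
cloud = np.random.randn(20000, 4).astype(np.float32) * 30.0
kept = distance_weighted_sample(cloud, 4096)
```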
Affiliation(s)
- Yongxin Shao
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
- Aihong Tan
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
- Binrui Wang
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
- Tianhong Yan
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
- Zhetao Sun
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
- Yiyang Zhang
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
- Jiaxin Liu
- The School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou, China
3. Wang J, Qi Y. Multi-level feature fusion and joint refinement for simultaneous object pose estimation and camera localization. Neural Netw 2024;174:106238. PMID: 38508048. DOI: 10.1016/j.neunet.2024.106238.
Abstract
Object pose estimation and camera localization are critical in various applications. However, achieving algorithm universality, i.e., category-level pose estimation and scene-independent camera localization, remains challenging for both techniques. Although the two tasks are closely related through spatial geometric constraints, they require distinct feature extraction. This paper presents a unified RGB-D-based framework that simultaneously performs category-level object pose estimation and scene-independent camera localization. The framework consists of a pose estimation branch (SLO-ObjNet), a localization branch (SLO-LocNet), a pose confidence calculation process and an object-level optimization. We first obtain initial camera and object results from SLO-LocNet and SLO-ObjNet, in which three-level feature fusion modules and a dedicated loss function achieve feature sharing between the two tasks. A confidence calculation process then determines the accuracy of the obtained object poses. Finally, an object-level Bundle Adjustment (BA) optimization further improves the precision of both tasks; the BA establishes relationships among feature points, objects, and cameras through camera-point, camera-object, and object-point terms. We evaluate the approach on localization and pose estimation datasets including REAL275, CAMERA25, LineMOD, YCB-Video, 7 Scenes, ScanNet and TUM RGB-D. The results show that it outperforms existing methods in both estimation and localization accuracy. Furthermore, SLO-LocNet and SLO-ObjNet trained on ScanNet and tested on 7 Scenes and TUM RGB-D demonstrate the method's universality. We also highlight the positive contributions of the fusion modules, loss function, confidence process and BA to the overall performance.
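The object-level BA couples cameras, objects and feature points through camera-point, camera-object and object-point terms. The abstract does not give the exact cost, but a generic objective of this shape (all symbols below are assumed notation, not the paper's) would be:

```latex
\min_{\{T_c\},\,\{T_o\},\,\{p_i\}}
  \sum_{c,i} \rho\!\left(\lVert \pi(T_c^{-1} p_i) - u_{ci} \rVert^2\right)                                   % camera--point (reprojection)
+ \sum_{c,o} \rho\!\left(\lVert \log\!\big(\hat{T}_{co}^{-1}\, T_c^{-1} T_o\big)^{\vee} \rVert^2\right)       % camera--object (relative pose)
+ \sum_{o,i} \rho\!\left(\lVert T_o^{-1} p_i - \hat{q}_{oi} \rVert^2\right)                                   % object--point (model consistency)
```

Here $T_c$ and $T_o$ denote camera and object poses, $p_i$ the 3D feature points, $u_{ci}$ their image observations, $\hat{T}_{co}$ a per-frame object-pose measurement, $\hat{q}_{oi}$ a point's coordinate in the object frame, $\pi(\cdot)$ the camera projection, and $\rho$ a robust kernel.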
Affiliation(s)
- Junyi Wang
- School of Computer Science and Technology, Shandong University, Qingdao, China; Qingdao Research Institute of Beihang University, Qingdao, China.
- Yue Qi
- State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China; Qingdao Research Institute of Beihang University, Qingdao, China.
4. Zhou W, Zheng F, Zhao Y, Pang Y, Yi J. MSDCNN: A multiscale dilated convolution neural network for fine-grained 3D shape classification. Neural Netw 2024;172:106141. PMID: 38301340. DOI: 10.1016/j.neunet.2024.106141.
Abstract
Multi-view deep neural networks have shown excellent performance on 3D shape classification tasks. However, global features aggregated from multi-view data often lack content information and spatial relationships, which makes it difficult to identify the small variances among subcategories within the same category. To solve this problem, this paper proposes a novel multiscale dilated convolutional neural network, termed MSDCNN, for multi-view fine-grained 3D shape classification. First, a sequence of views is rendered from 12 viewpoints around the input 3D shape by the sequential view capturing module. Then, the first 22 convolution layers of ResNeXt50 are employed to extract the semantic features of each view, and a global mixed feature map is obtained through an element-wise maximum over the 12 output feature maps. Furthermore, an attention dilated module (ADM), which combines four concatenated attention dilated blocks (ADBs), is designed to extract larger-receptive-field features from the global mixed feature map and enhance context information among the views. Specifically, each ADB consists of an attention mechanism module and a dilated convolution with a different dilation rate. In addition, a prediction module with label smoothing, comprising a 3 × 3 convolution and adaptive average pooling, is proposed to classify the features. The performance of our method is validated experimentally on the ModelNet10, ModelNet40 and FG3D datasets. Experimental results demonstrate the effectiveness and superiority of the proposed MSDCNN framework for fine-grained 3D shape classification.
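Only the element-wise maximum over 12 view feature maps and the use of attention plus different dilation rates are taken from the abstract; the block internals, channel counts and dilation values in the PyTorch sketch below are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DilatedBlock(nn.Module):
    """One attention-dilated-style block: simple channel attention followed
    by a dilated 3x3 convolution. The real ADB design is not specified here."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.attn = nn.Sequential(                      # channel attention (assumed form)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.conv = nn.Conv2d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.attn(x)
        return torch.relu(self.conv(x))

def aggregate_views(view_feats: torch.Tensor) -> torch.Tensor:
    """Element-wise maximum over the view axis: (B, V, C, H, W) -> (B, C, H, W)."""
    return view_feats.max(dim=1).values

# Example: 12 rendered views, 256-channel backbone features, four dilation rates.
views = torch.randn(2, 12, 256, 14, 14)
mixed = aggregate_views(views)
blocks = nn.Sequential(*[DilatedBlock(256, d) for d in (1, 2, 4, 8)])
out = blocks(mixed)   # (2, 256, 14, 14)
```

The increasing dilation rates enlarge the receptive field without extra downsampling, which is the motivation the abstract gives for the ADM.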
Affiliation(s)
- Wei Zhou
- College of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, PR China.
- Fujian Zheng
- College of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, PR China; College of Optoelectronic Engineering, Chongqing University, Chongqing 400030, PR China.
- Yiheng Zhao
- College of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, PR China.
- Yiran Pang
- Department of Computer & Electrical Engineering and Computer Science, Florida Atlantic University, FL 33431, United States of America.
- Jun Yi
- College of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing 401331, PR China.
5. Wang H, Chen T, Ji X, Qian F, Ma Y, Wang S. LiDAR-camera-system-based unsupervised and weakly supervised 3D object detection. J Opt Soc Am A Opt Image Sci Vis 2023;40:1849-1860. PMID: 37855540. DOI: 10.1364/josaa.494980.
Abstract
LiDAR-camera systems are becoming an important part of 3D object detection for autonomous driving. Due to limits on time and resources, only a few critical frames of the synchronized camera data and acquired LiDAR points may be annotated, yet a large amount of unannotated data remains in practical applications. We therefore propose a LiDAR-camera-system-based unsupervised and weakly supervised (LCUW) network as a novel 3D object-detection method. For unannotated data, we propose an independent learning mode, an unsupervised data preprocessing module. For detection tasks with high accuracy requirements, we propose an Accompany Construction mode, a weakly supervised data preprocessing module that requires only a small amount of annotated data; high-quality training data are then generated from the remaining unlabeled data. We also propose a full aggregation bridge block in the feature-extraction part, which uses a stepwise fusion and deepening-representation strategy to improve accuracy. Comparative, ablation, and runtime experiments show that the proposed method performs well while advancing the application of LiDAR-camera systems.
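The full aggregation bridge block is described only as a stepwise fusion and deepening-representation strategy. A generic progressive fusion of multi-scale feature maps is sketched below for illustration; every module name, shape and ordering here is an assumption rather than the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepwiseFusion(nn.Module):
    """Fuse multi-scale feature maps one step at a time, refining the running
    representation after each merge (illustrative sketch only)."""

    def __init__(self, channels: int, num_scales: int):
        super().__init__()
        self.refine = nn.ModuleList(
            [nn.Conv2d(channels, channels, kernel_size=3, padding=1)
             for _ in range(num_scales - 1)]
        )

    def forward(self, feats):
        # feats are ordered coarse -> fine; upsample and merge step by step.
        x = feats[0]
        for conv, f in zip(self.refine, feats[1:]):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
            x = torch.relu(conv(x + f))
        return x

# Example: three feature scales with 128 channels each.
fusion = StepwiseFusion(channels=128, num_scales=3)
feats = [torch.randn(1, 128, 25, 25),
         torch.randn(1, 128, 50, 50),
         torch.randn(1, 128, 100, 100)]
out = fusion(feats)   # (1, 128, 100, 100)
```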