1
Ma D, Su J, Li S, Xian Y. AerialIRGAN: unpaired aerial visible-to-infrared image translation with dual-encoder structure. Sci Rep 2024; 14:22105. [PMID: 39333306] [PMCID: PMC11436762] [DOI: 10.1038/s41598-024-73381-0]
Abstract
Due to the high cost of equipment and the constraints of shooting conditions, obtaining aerial infrared images of specific targets is very challenging. Most methods that use Generative Adversarial Networks to translate visible images into infrared images depend heavily on registered data and struggle to handle the diversity and complexity of aerial scenes. This paper proposes a one-sided, end-to-end, unpaired aerial visible-to-infrared image translation algorithm, termed AerialIRGAN. AerialIRGAN introduces a dual-encoder structure, in which one encoder, built on the Segment Anything Model, extracts deep semantic features from visible images, while the other, built on UniRepLKNet, captures small-scale and sparse patterns. AerialIRGAN then constructs a bridging module to deeply integrate the features of both encoders with their corresponding decoders. Finally, AerialIRGAN proposes a structural appearance consistency loss that guides the synthetic infrared images to preserve the structure of the source image while exhibiting distinct infrared characteristics. Experimental results show that, compared with existing representative infrared image generation algorithms, the proposed method generates higher-quality infrared images and performs better in both subjective visual assessment and objective metric evaluation.
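The abstract does not spell out the structural appearance consistency loss; the sketch below is one plausible reading, pairing a gradient-based structure term (tied to the visible input) with an unpaired intensity-statistics term (tied to real infrared data). All function names and weights here are assumptions, not the authors' formulation.

import torch
import torch.nn.functional as F

def gradient_map(x):
    # Horizontal/vertical intensity differences as a cheap structure descriptor.
    return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

def structural_appearance_loss(fake_ir, visible, real_ir, w_struct=1.0, w_app=0.1):
    # Structure term: the synthetic IR image should preserve the edge layout
    # of the visible source (computed on a grayscale proxy of the RGB input).
    fdx, fdy = gradient_map(fake_ir)
    vdx, vdy = gradient_map(visible.mean(dim=1, keepdim=True))
    struct = F.l1_loss(fdx, vdx) + F.l1_loss(fdy, vdy)
    # Appearance term: match only global intensity statistics of real IR data,
    # so it can be computed on unpaired images.
    app = (fake_ir.mean() - real_ir.mean()).abs() + (fake_ir.std() - real_ir.std()).abs()
    return w_struct * struct + w_app * app

Because the appearance term touches only global statistics, a loss of this shape stays usable in the unpaired setting the paper targets.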
Affiliation(s)
- Decao Ma
- Xi'an Research Institute of High Technology, 710025, Xi'an, China
- Juan Su
- Xi'an Research Institute of High Technology, 710025, Xi'an, China
- Shaopeng Li
- Xi'an Research Institute of High Technology, 710025, Xi'an, China
- Department of Automation, Tsinghua University, 100084, Beijing, China
- Yong Xian
- Xi'an Research Institute of High Technology, 710025, Xi'an, China
2
Xiao X, Xiong X, Meng F, Chen Z. Multi-Scale Feature Interactive Fusion Network for RGBT Tracking. Sensors (Basel) 2023; 23:3410. [PMID: 37050470] [PMCID: PMC10098685] [DOI: 10.3390/s23073410]
Abstract
Fusion tracking with RGB and thermal infrared images (RGBT) has attracted wide attention because of the complementary advantages of the two modalities. Currently, most algorithms obtain modality weights through attention mechanisms to integrate multi-modality information. They do not fully exploit multi-scale information and ignore the rich contextual information among features, which limits tracking performance to some extent. To solve this problem, this work proposes a new multi-scale feature interactive fusion network (MSIFNet) for RGBT tracking. Specifically, we use different convolution branches for multi-scale feature extraction and aggregate them adaptively through a feature selection module. At the same time, a Transformer interactive fusion module is proposed to build long-distance dependencies and further enhance semantic representation. Finally, a global feature fusion module is designed to adjust global information adaptively. Extensive experiments on the publicly available GTOT, RGBT234, and LasHeR datasets show that our algorithm outperforms current mainstream tracking algorithms.
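As a loose PyTorch-style illustration of the multi-scale-branches-plus-selection idea (not the published MSIFNet code; the module name, kernel sizes, and gating design are assumptions):

import torch
import torch.nn as nn

class MultiScaleSelect(nn.Module):
    # Parallel conv branches at different receptive fields, fused by learned
    # per-branch weights standing in for the feature selection module.
    def __init__(self, channels, scales=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in scales)
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, len(scales), 1))  # one logit per branch

    def forward(self, x):
        feats = torch.stack([b(x) for b in self.branches], dim=1)  # (B,S,C,H,W)
        w = torch.softmax(self.gate(x), dim=1)                     # (B,S,1,1)
        return (feats * w.unsqueeze(2)).sum(dim=1)                 # (B,C,H,W)

A module like this can replace a plain convolution block at each backbone stage, with the softmax gate adaptively re-weighting the scales per input.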
Affiliation(s)
- Xianbing Xiao
- School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
- Xingzhong Xiong
- Artificial Intelligence Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644000, China
- Fanqin Meng
- Artificial Intelligence Key Laboratory of Sichuan Province, Sichuan University of Science and Engineering, Yibin 644000, China
- Zhen Chen
- School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China
3
Cui Z, Zhou L, Wang C, Xu C, Yang J. Visual Micro-Pattern Propagation. IEEE Trans Pattern Anal Mach Intell 2023; 45:1267-1286. [PMID: 35104215] [DOI: 10.1109/tpami.2022.3147974]
Abstract
Statistical observations demonstrate that visual feature patterns and structure patterns recur frequently within and across homogeneous/heterogeneous images. Motivated by these interdependencies of visual patterns, we propose visual micro-pattern propagation (VMPP) to facilitate universal visual pattern learning. In particular, we present a graph framework that unifies conventional micro-pattern propagation in the spatial, temporal, cross-modal, and cross-task domains. A general formulation of pattern propagation, named the cross-graph model, is presented under this framework, and a factorized version is derived for more efficient computation as well as better understanding. To correlate homogeneous and heterogeneous patterns, the cross-graph model introduces two types of pattern relations, at the feature level and at the structure level. The structure pattern relation defines second-order visual connections for heterogeneous patterns by measuring first-order visual relations of homogeneous feature patterns. By virtue of the constructed first- and second-order connections, we design feature pattern diffusion and structure pattern diffusion to support various pattern propagation cases. Further, to realize the different pattern diffusions involved, we study two fundamental visual problems in depth, multi-task pixel-level prediction and online dual-modal object tracking, and propose two end-to-end pattern propagation networks by encapsulating and integrating the necessary diffusion modules. We conduct extensive experiments, dissecting every diffusion component and comparing numerous advanced methods. The experiments validate the effectiveness of the proposed pattern diffusion schemes and report state-of-the-art results on the two representative visual problems.
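The core diffusion step can be pictured as one round of affinity-weighted message passing between two pattern sets; a toy numpy rendering (the cross-graph model and its factorized version are considerably more elaborate; the temperature and normalization below are assumptions):

import numpy as np

def feature_pattern_diffusion(X, Y, tau=0.1):
    # One step of cross-graph feature diffusion: patterns in Y (m, d) are
    # propagated onto the nodes of X (n, d) through a row-stochastic
    # feature-affinity matrix built from first-order feature relations.
    sim = X @ Y.T / tau
    A = np.exp(sim - sim.max(axis=1, keepdims=True))  # numerically stable softmax
    A /= A.sum(axis=1, keepdims=True)
    return A @ Y  # diffused features for the n target nodes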
4

5
Li J, Fang B, Zhou M. Multi-Modal Sparse Tracking by Jointing Timing and Modal Consistency. Int J Pattern Recognit Artif Intell 2022. [DOI: 10.1142/s0218001422510089]
Abstract
In this paper, we propose a multi-modal sparse tracker that jointly exploits timing and modal consistency to locate the target using the similarity of multiple local appearances. First, we propose an alignable patching strategy for the red-green-blue (RGB) mode and the thermal infrared mode to adapt to local changes of the target. Second, we propose a consistency expression over the corresponding aligned patches between the modes, together with a Gaussian-mapping correlation within each mode, to reconstruct the target judgment likelihood function. Finally, we propose an updating scheme based on timing correlation and modal sparsity to keep pace with target changes. Experimental results show that, on average, significant improvements in tracking accuracy are achieved compared with state-of-the-art algorithms. The source code of our algorithm is available at https://github.com/Liincq/tracker.
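To make the sparse-coding machinery concrete, the sketch below pairs a generic ISTA solver for the l1-regularized coding problem with a toy patch-pair score that penalizes reconstruction error and cross-modal code disagreement. The dictionaries, weights, and exact likelihood in the paper will differ.

import numpy as np

def ista_sparse_code(D, y, lam=0.05, n_iter=200):
    # Solve min_a 0.5*||y - D a||^2 + lam*||a||_1 by ISTA.
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a - D.T @ (D @ a - y) / L                          # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft threshold
    return a

def patch_likelihood(D_rgb, D_tir, y_rgb, y_tir):
    # Joint score of one aligned patch pair: small reconstruction error plus
    # consistent codes across modes -> more target-like (illustrative only).
    a_rgb = ista_sparse_code(D_rgb, y_rgb)
    a_tir = ista_sparse_code(D_tir, y_tir)
    err = (np.linalg.norm(y_rgb - D_rgb @ a_rgb)
           + np.linalg.norm(y_tir - D_tir @ a_tir))
    consistency = np.linalg.norm(a_rgb - a_tir)  # modal-consistency penalty
    return np.exp(-(err + consistency))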
Affiliation(s)
- Jiajun Li
- College of Computer Science, Chongqing University, Chongqing 400044, P. R. China
- Bin Fang
- College of Computer Science, Chongqing University, Chongqing 400044, P. R. China
- Mingliang Zhou
- College of Computer Science, Chongqing University, Chongqing 400044, P. R. China
6
Li C, Xue W, Jia Y, Qu Z, Luo B, Tang J, Sun D. LasHeR: A Large-Scale High-Diversity Benchmark for RGBT Tracking. IEEE Trans Image Process 2021; 31:392-404. [PMID: 34874855] [DOI: 10.1109/tip.2021.3130533]
Abstract
RGBT tracking has received a surge of interest in the computer vision community, but the field lacks a large-scale, high-diversity benchmark dataset, which is essential both for training deep RGBT trackers and for comprehensively evaluating RGBT tracking methods. To this end, we present a Large-scale High-diversity benchmark for short-term RGBT tracking (LasHeR). LasHeR consists of 1224 visible and thermal infrared video pairs with more than 730K frame pairs in total. Each frame pair is spatially aligned and manually annotated with a bounding box, making the dataset well and densely annotated. LasHeR is highly diverse, covering a broad range of object categories, camera viewpoints, scene complexities, and environmental factors across seasons, weather conditions, and day and night. We conduct a comprehensive performance evaluation of 12 RGBT tracking algorithms on the LasHeR dataset and present a detailed analysis. In addition, we release an unaligned version of LasHeR to attract research interest in alignment-free RGBT tracking, which is a more practical task in real-world applications. The datasets and evaluation protocols are available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
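Benchmarks of this kind are typically scored with precision and success plots computed from per-frame overlap; below is a generic sketch of the success-plot AUC (the threshold grid and details may differ from the official LasHeR toolkit):

import numpy as np

def iou(a, b):
    # IoU of axis-aligned boxes in (x, y, w, h) format; a, b are (N, 4) arrays.
    x1 = np.maximum(a[:, 0], b[:, 0])
    y1 = np.maximum(a[:, 1], b[:, 1])
    x2 = np.minimum(a[:, 0] + a[:, 2], b[:, 0] + b[:, 2])
    y2 = np.minimum(a[:, 1] + a[:, 3], b[:, 1] + b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    return inter / (a[:, 2] * a[:, 3] + b[:, 2] * b[:, 3] - inter)

def success_auc(pred, gt, thresholds=np.linspace(0, 1, 21)):
    # Fraction of frames whose overlap exceeds each threshold, averaged.
    overlaps = iou(pred, gt)
    return np.mean([(overlaps > t).mean() for t in thresholds])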
7
Tu Z, Lin C, Zhao W, Li C, Tang J. M5L: Multi-Modal Multi-Margin Metric Learning for RGBT Tracking. IEEE Trans Image Process 2021; 31:85-98. [PMID: 34784275] [DOI: 10.1109/tip.2021.3125504]
Abstract
Classifying hard samples in the course of RGBT tracking is a challenging problem. Existing methods focus only on enlarging the boundary between positive and negative samples and ignore the relations among multilevel hard samples, which are crucial for robust hard-sample classification. To handle this problem, we propose a novel Multi-Modal Multi-Margin Metric Learning framework, named M5L, for RGBT tracking. In particular, we divide all samples into four groups, normal positive, normal negative, hard positive, and hard negative, and leverage their relations to improve the robustness of the feature embeddings; for example, normal positive samples should lie closer to the ground truth than hard positive ones. To this end, we design a multi-modal multi-margin structural loss that preserves the relations of multilevel hard samples during training. In addition, we introduce an attention-based fusion module to achieve quality-aware integration of the different source data. Extensive experiments on large-scale datasets show that our framework clearly improves tracking performance and performs favorably against state-of-the-art RGBT trackers.
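The abstract gives the intended sample ordering but not the loss itself; one common way to encode such an ordering is a chain of hinge terms over embedding distances. The margins, distance choice, and grouping below are assumptions, not the published formulation.

import torch.nn.functional as F

def multi_margin_loss(anchor, norm_pos, hard_pos, norm_neg, hard_neg,
                      m_small=0.2, m_large=0.5):
    # Enforce d(norm_pos) + m < d(hard_pos), d(hard_pos) + M < d(hard_neg),
    # and d(hard_neg) + m < d(norm_neg), so the four groups stay ordered by
    # distance to the anchor (ground-truth) embedding.
    d = lambda x: F.pairwise_distance(anchor, x)
    loss = (F.relu(d(norm_pos) - d(hard_pos) + m_small)
            + F.relu(d(hard_pos) - d(hard_neg) + m_large)
            + F.relu(d(hard_neg) - d(norm_neg) + m_small))
    return loss.mean()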
8
Lu A, Li C, Yan Y, Tang J, Luo B. RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence Loss. IEEE Trans Image Process 2021; 30:5613-5625. [PMID: 34125675] [DOI: 10.1109/tip.2021.3087341]
Abstract
RGBT tracking has attracted increasing attention because RGB and thermal infrared data have strongly complementary advantages, which could enable trackers to work all day and in all weather. Existing works usually focus on extracting modality-shared or modality-specific information, but the potential of these two cues is not well explored and exploited in RGBT tracking. In this paper, we propose a novel multi-adapter network that jointly performs modality-shared, modality-specific, and instance-aware target representation learning for RGBT tracking. To this end, we design three kinds of adapters within an end-to-end deep learning framework. Specifically, we use a modified VGG-M as the generality adapter to extract modality-shared target representations. To extract modality-specific features while reducing computational complexity, we design a modality adapter that adds a small block to the generality adapter in each layer and each modality in parallel. Such a design can learn multilevel modality-specific representations with a modest number of parameters, since the vast majority of parameters are shared with the generality adapter. We also design an instance adapter to capture the appearance properties and temporal variations of a specific target. Moreover, to enhance the shared and specific features, we employ a multiple-kernel maximum mean discrepancy loss to measure the distribution divergence between the two modalities' features and integrate it into each layer for more robust representation learning. Extensive experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker against state-of-the-art methods.
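A standard multiple-kernel MMD estimator over two feature batches can be written in a few lines (the bandwidth bank is illustrative; the paper's kernel choices and per-layer weighting are not given in the abstract):

import torch

def mk_mmd(x, y, sigmas=(1.0, 2.0, 4.0)):
    # Biased MK-MMD estimate between batches x (n, d) and y (m, d), averaging
    # a small bank of Gaussian kernels at different bandwidths.
    def k(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas) / len(sigmas)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

A term like this quantifies how far apart the two modalities' feature distributions are; the abstract describes integrating such a measure into each layer.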
9
Bayoudh K, Knani R, Hamdaoui F, Mtibaa A. A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets. Vis Comput 2021; 38:2939-2970. [PMID: 34131356] [PMCID: PMC8192112] [DOI: 10.1007/s00371-021-02166-7]
Abstract
Research in multimodal learning has progressed rapidly over the last decade in several areas, especially in computer vision. The growing potential of multimodal data streams and deep learning algorithms has contributed to the increasing ubiquity of deep multimodal learning, which involves developing models capable of processing and analyzing multimodal information in a unified manner. Unstructured real-world data can inherently take many forms, also known as modalities, often including visual and textual content. Extracting relevant patterns from this kind of data is still a motivating goal for researchers in deep learning. In this paper, we seek to improve the understanding of key concepts and algorithms of deep multimodal learning for the computer vision community by exploring how to generate deep models that consider the integration and combination of heterogeneous visual cues across sensory modalities. In particular, we summarize six perspectives from the current literature on deep multimodal learning, namely: multimodal data representation, multimodal fusion (i.e., both traditional and deep learning-based schemes), multitask learning, multimodal alignment, multimodal transfer learning, and zero-shot learning. We also survey current multimodal applications and present a collection of benchmark datasets for solving problems in various vision domains. Finally, we highlight the limitations and challenges of deep multimodal learning and provide insights and directions for future research.
Affiliation(s)
- Khaled Bayoudh
- Electrical Department, National Engineering School of Monastir (ENIM), Laboratory of Electronics and Micro-electronics (LR99ES30), Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia
- Raja Knani
- Physics Department, Laboratory of Electronics and Micro-electronics (LR99ES30), Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia
- Fayçal Hamdaoui
- Electrical Department, National Engineering School of Monastir (ENIM), Laboratory of Control, Electrical Systems and Environment (LASEE), National Engineering School of Monastir, University of Monastir, Monastir, Tunisia
- Abdellatif Mtibaa
- Electrical Department, National Engineering School of Monastir (ENIM), Laboratory of Electronics and Micro-electronics (LR99ES30), Faculty of Sciences of Monastir (FSM), University of Monastir, Monastir, Tunisia
10
Ashiba MI, Tolba MS, El-Fishawy AS, El-Samie FEA. Hybrid enhancement of infrared night vision imaging system. Multimed Tools Appl 2020; 79:6085-6108. [DOI: 10.1007/s11042-019-7510-y]
11
Li H, Wu XJ, Kittler J. MDLatLRR: A novel decomposition method for infrared and visible image fusion. IEEE Trans Image Process 2020; 29:4733-4746. [PMID: 32142438] [DOI: 10.1109/tip.2020.2975984]
Abstract
Image decomposition is crucial for many image processing tasks, as it allows salient features to be extracted from source images. A good image decomposition method can lead to better performance, especially in image fusion tasks. We propose a multi-level image decomposition method based on latent low-rank representation (LatLRR), called MDLatLRR. This decomposition method is applicable to many image processing fields; in this paper, we focus on the image fusion task. We build a novel image fusion framework based on MDLatLRR, which decomposes source images into detail parts (salient features) and base parts. A nuclear-norm-based fusion strategy is used to fuse the detail parts, while the base parts are fused by averaging. Compared with other state-of-the-art fusion methods, the proposed algorithm exhibits better fusion performance in both subjective and objective evaluation.
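The nuclear-norm strategy can be sketched per patch: the detail layer whose patch has the larger nuclear norm (sum of singular values, a proxy for salient structure) receives the larger fusion weight. The patch size and the soft weighting below are assumptions.

import numpy as np

def fuse_details_nuclear(d1, d2, patch=8):
    # Patch-wise soft fusion of two detail layers by relative nuclear norm.
    out = np.empty_like(d1)
    H, W = d1.shape
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            p1 = d1[i:i + patch, j:j + patch]
            p2 = d2[i:i + patch, j:j + patch]
            n1 = np.linalg.norm(p1, 'nuc')  # sum of singular values
            n2 = np.linalg.norm(p2, 'nuc')
            w1 = n1 / (n1 + n2 + 1e-12)
            out[i:i + patch, j:j + patch] = w1 * p1 + (1 - w1) * p2
    return out

The base parts, as the abstract states, are simply averaged: fused_base = 0.5 * (base_a + base_b).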
12
Kang B, Liang D, Ding W, Zhou H, Zhu WP. Grayscale-Thermal Tracking via Inverse Sparse Representation based Collaborative Encoding. IEEE Trans Image Process 2019; 29:3401-3415. [PMID: 31880552] [DOI: 10.1109/tip.2019.2959912]
Abstract
Grayscale-thermal tracking has attracted a great deal of attention due to its capability of fusing two different yet complementary target observations. Existing methods often treat extracting discriminative target information and exploring the target correlation among different images as two separate issues, ignoring their interdependence, which may cause tracking drift on challenging video pairs. This paper presents a collaborative encoding model called joint correlation and discriminant analysis based inverse sparse representation (JCDA-InvSR) to jointly encode the target candidates in the grayscale and thermal video sequences. In particular, we develop a multi-objective program that integrates feature selection and multi-view correlation analysis into a unified optimization problem in JCDA-InvSR, which can simultaneously highlight the special characteristics of the grayscale and thermal targets by alternately optimizing two aspects: the target discrimination within a given image and the target correlation across different images. For robust grayscale-thermal tracking, we also incorporate the prior knowledge of target candidate codes into an SVM-based target classifier to overcome the overfitting caused by limited training labels. Extensive experiments on the GTOT and RGBT234 datasets illustrate the promising performance of our tracking framework.
13

14
Computational Imaging Method with a Learned Plug-and-Play Prior for Electrical Capacitance Tomography. Cognit Comput 2019. [DOI: 10.1007/s12559-019-09682-8]
15
Mask Sparse Representation Based on Semantic Features for Thermal Infrared Target Tracking. Remote Sens 2019. [DOI: 10.3390/rs11171967]
Abstract
Thermal infrared (TIR) target tracking is a challenging task, as it entails learning an effective model to identify the target under poor target visibility and cluttered backgrounds. Sparse representation, as a typical appearance modeling approach, has been successfully exploited in TIR target tracking. However, the discriminative information of the target and its surrounding background is usually neglected in the sparse coding process. To address this issue, we propose a mask sparse representation (MaskSR) model that combines sparse coding with high-level semantic features for TIR target tracking. We first obtain the pixel-wise labeling of the target and its surrounding background in the last frame, and then use these results to train target-specific deep networks in a supervised manner. From the output features of the deep networks, a high-level pixel-wise discriminative map of the target area is obtained. We introduce the binarized discriminative map as a mask template in the sparse representation and develop a novel algorithm to collaboratively represent the reliable and unreliable target parts partitioned by the mask template, which explicitly indicates their different discriminative capabilities by labels 1 and 0. The proposed MaskSR model controls the dominance of the reliable target part in the reconstruction process via a weighting scheme. We solve this multi-parameter constrained problem with a customized alternating direction method of multipliers (ADMM). The model is applied to TIR target tracking within the particle filter framework. To improve sampling effectiveness while decreasing computational cost, a discriminative particle selection strategy based on a kernelized correlation filter is proposed to replace random sampling when searching for useful candidates. Our tracking method was tested on the VOT-TIR2016 benchmark. The experimental results show that the proposed method significantly outperforms various state-of-the-art methods in TIR target tracking.
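At its core, the mask-weighted reconstruction reduces to a residual in which reliable pixels carry more weight; a minimal sketch (the weights are illustrative, and the full coding problem the paper solves with ADMM is omitted):

import numpy as np

def masked_residual(D, y, a, mask, w_reliable=1.0, w_unreliable=0.3):
    # Weighted squared reconstruction error under a binary semantic mask:
    # pixels labeled reliable target (mask == 1) dominate the residual,
    # unreliable pixels (mask == 0) are down-weighted.
    r = (y - D @ a) ** 2
    w = np.where(mask == 1, w_reliable, w_unreliable)
    return float(np.sum(w * r))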
16