Fan H, Yu Z, Wang Q, Fan B, Tang Y. QueryTrack: Joint-Modality Query Fusion Network for RGBT Tracking. IEEE Transactions on Image Processing. 2024;33:3187-3199. [PMID: 38687651] [DOI: 10.1109/TIP.2024.3393298]
Abstract
Existing RGB-Thermal (RGB-T) trackers usually treat intra-modal feature extraction and inter-modal feature fusion as two separate processes, so the mutual promotion between extraction and fusion is neglected: the complementary advantages of RGB-T fusion are not fully exploited, and the independent feature extraction cannot adapt to modal-quality fluctuation during tracking. To address these limitations, we design a joint-modality query fusion network in which intra-modal feature extraction and inter-modal fusion are coupled and promote each other via joint-modality queries. The queries are initialized from the multimodal features of the current frame, making the subsequent fusion adaptive to modal-quality fluctuation during tracking. The joint-modality query fusion (JQF) module then uses the queries to interact with RGB-T features, unifying intra-modal enhancement and inter-modal interaction for mutual promotion. In this way, JQF can distinguish and enhance complementary modality features while filtering out redundant information. For real-time tracking, we propose regional cross-attention for cross-modal interaction, which reduces computational cost. Our end-to-end tracker sets a new state of the art on multiple RGB-T tracking benchmarks, including LasHeR, VTUAV, RGBT234, and GTOT, while running at real-time speed.
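The idea behind regional cross-attention can be illustrated with a minimal sketch: joint-modality queries attend to the RGB and thermal tokens, but each query is restricted to a local region of the token sequence rather than attending globally. Everything below is an assumption for illustration (token shapes, the query-to-region alignment, and the window size are not specified in the abstract); it is not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def regional_cross_attention(queries, rgb_tokens, tir_tokens, window=4):
    """Toy sketch of regional cross-attention (assumed design, not the
    paper's code): each joint-modality query attends only to a local
    window of the concatenated RGB/thermal tokens, so the attention cost
    scales with the window size instead of the full token count."""
    tokens = np.concatenate([rgb_tokens, tir_tokens], axis=0)  # (M, d)
    n, d = queries.shape
    m = tokens.shape[0]
    out = np.zeros_like(queries)
    for i in range(n):
        # Hypothetical alignment: map query i to a region of tokens.
        center = int(i * m / n)
        lo, hi = max(0, center - window), min(m, center + window + 1)
        region = tokens[lo:hi]                               # (w, d)
        attn = softmax(queries[i] @ region.T / np.sqrt(d))   # (w,)
        out[i] = attn @ region                               # weighted sum
    return out

# Queries initialized from current-frame multimodal features, as the
# abstract describes (here: a simple average, an illustrative choice).
rng = np.random.default_rng(0)
rgb = rng.normal(size=(16, 8))
tir = rng.normal(size=(16, 8))
queries = (rgb[:8] + tir[:8]) / 2.0
fused = regional_cross_attention(queries, rgb, tir)
```

Restricting each query to a window of `2*window + 1` tokens is what makes the interaction cheap enough for real-time use; a full cross-attention would score every query against all `M` tokens of both modalities.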