RGBT tracking, or RGB-Thermal tracking, is a computer vision task in which objects are tracked using both RGB (Red, Green, Blue) and thermal infrared imagery. By fusing information from the two modalities, RGB-T trackers improve detection and tracking performance in challenging environments where lighting is poor or variable: thermal imagery compensates for the failure of RGB cameras in low light and adverse weather, while RGB imagery supplies the color and texture detail that thermal sensors lack. RGB-T tracking algorithms therefore center on fusion techniques that combine the two sensor streams, enabling robust and accurate tracking in scenarios ranging from surveillance and security to autonomous vehicles and search and rescue operations.
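As a minimal illustration of what modality fusion can look like at the feature level, the sketch below concatenates RGB and thermal feature maps and reduces them with a 1x1 convolution. All module and parameter names are hypothetical, and real trackers use far more elaborate fusion designs.

```python
import torch
import torch.nn as nn

class SimpleRGBTFusion(nn.Module):
    """Illustrative concatenation-based fusion of RGB and thermal feature maps."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # Two modality-specific backbones would normally produce these features;
        # here we only model the fusion step itself.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_rgb: torch.Tensor, feat_tir: torch.Tensor) -> torch.Tensor:
        # Concatenate along the channel dimension, then reduce back to `channels`.
        return self.fuse(torch.cat([feat_rgb, feat_tir], dim=1))

# Example: fuse 256-channel feature maps from both modalities.
rgb = torch.randn(1, 256, 22, 22)
tir = torch.randn(1, 256, 22, 22)
fused = SimpleRGBTFusion()(rgb, tir)  # shape: (1, 256, 22, 22)
```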
Different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking.
A new dynamic modality-aware filter generation module (MFGNet) is proposed to boost message passing between visible and thermal data by adaptively adjusting the convolutional kernels for different input images during practical tracking.
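MFGNet's exact architecture is described in the paper; the sketch below only illustrates the generic idea of dynamic filter generation, where per-sample convolution kernels are predicted from one modality's features and applied to the other. Names and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterSketch(nn.Module):
    """Generate input-dependent depthwise kernels from one modality and apply
    them to the other (a generic dynamic-filter sketch, not the MFGNet design)."""
    def __init__(self, channels: int = 64, ksize: int = 3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        # Predict one k*k kernel per channel from globally pooled guidance features.
        self.kernel_head = nn.Linear(channels, channels * ksize * ksize)

    def forward(self, target: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
        b, c, h, w = target.shape
        pooled = F.adaptive_avg_pool2d(guidance, 1).flatten(1)            # (B, C)
        kernels = self.kernel_head(pooled).view(b * c, 1, self.ksize, self.ksize)
        # Depthwise convolution with per-sample kernels via the grouped-conv trick.
        out = F.conv2d(target.reshape(1, b * c, h, w), kernels,
                       padding=self.ksize // 2, groups=b * c)
        return out.view(b, c, h, w)

rgb_feat = torch.randn(2, 64, 20, 20)
tir_feat = torch.randn(2, 64, 20, 20)
out = DynamicFilterSketch()(rgb_feat, tir_feat)  # (2, 64, 20, 20)
```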
This work designs five attribute-specific fusion branches that integrate RGB and thermal features under the challenges of thermal crossover, illumination variation, scale variation, occlusion, and fast motion, respectively, and proposes a novel Attribute-based Progressive Fusion Network (APFNet) that increases fusion capacity with a small number of parameters while reducing dependence on large-scale training data.
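A rough sketch of attribute-specific fusion follows: one lightweight branch per challenge attribute, with the branch outputs aggregated by a simple mean. This is an illustration only; APFNet's progressive fusion and aggregation scheme is more involved, and all names below are assumptions.

```python
import torch
import torch.nn as nn

class AttributeFusionSketch(nn.Module):
    """Illustrative attribute-specific fusion: one lightweight branch per challenge
    attribute, with outputs averaged (a simplification of the APFNet idea)."""
    def __init__(self, channels: int = 128, num_attributes: int = 5):
        super().__init__()
        # e.g. thermal crossover, illumination variation, scale variation,
        # occlusion, fast motion -> one small fusion branch each.
        self.branches = nn.ModuleList([
            nn.Conv2d(2 * channels, channels, kernel_size=1)
            for _ in range(num_attributes)
        ])

    def forward(self, feat_rgb: torch.Tensor, feat_tir: torch.Tensor) -> torch.Tensor:
        x = torch.cat([feat_rgb, feat_tir], dim=1)
        # Aggregate the attribute-specific fusions (here a simple mean).
        return torch.stack([branch(x) for branch in self.branches], dim=0).mean(dim=0)

fused = AttributeFusionSketch()(torch.randn(1, 128, 16, 16), torch.randn(1, 128, 16, 16))
```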
A novel approach suppresses background effects in RGB-T tracking by integrating soft cross-modality consistency into the ranking model, allowing sparse inconsistency to account for the differing properties of the two modalities.
This work proposes an end-to-end tracking framework for fusing the RGB and TIR modalities in RGB-T tracking, and evaluates the effectiveness of modality fusion in each of the main components of DiMP, i.e., the feature extractor, the target estimation network, and the classifier.
The unaligned version of LasHeR is released to attract research interest in alignment-free RGBT tracking, which is a more practical task in real-world applications.
A large-scale, highly diverse benchmark for visible-thermal UAV tracking (VTUAV) is constructed, comprising 500 sequences with 1.7 million high-resolution (1920×1080 pixels) frame pairs, together with coarse-to-fine attribute annotation in which frame-level attributes are supplied to exploit the potential of challenge-specific trackers.
RGB-T tracking aims to leverage the mutual enhancement and complement ability of RGB and TIR modalities for improving the tracking process in various scenarios, where cross-modal interaction is the key component. Some previous methods concatenate the RGB and TIR search region features directly to perform a coarse interaction process with redundant background noises introduced. Many other methods sample candidate boxes from search frames and conduct various fusion approaches on isolated pairs of RGB and TIR boxes, which limits the cross-modal interaction within local regions and brings about inadequate context modeling. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module which exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. Original templates are also updated with enriched multimodal contexts from the template medium. Our TBSI module is inserted into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate our method achieves new state-of-the-art performances. Code is available at https://github.com/RyanHTR/TBSI.
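The released TBSI code is linked above; the sketch below merely illustrates the gather-then-distribute pattern of using template tokens as a bridge between RGB and TIR search-region tokens. Dimensions and module names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TemplateBridgeSketch(nn.Module):
    """Rough sketch of template tokens bridging RGB and TIR search-region tokens
    (inspired by the TBSI idea, not the released code)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.gather = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.distribute = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template, search_rgb, search_tir):
        # 1) Gather: template tokens attend to both search regions to collect
        #    target-relevant context from each modality.
        ctx = torch.cat([search_rgb, search_tir], dim=1)
        bridged, _ = self.gather(template, ctx, ctx)
        # 2) Distribute: each search region attends back to the enriched template,
        #    so cross-modal context flows RGB <-> TIR through the template medium.
        rgb_out, _ = self.distribute(search_rgb, bridged, bridged)
        tir_out, _ = self.distribute(search_tir, bridged, bridged)
        return bridged, rgb_out, tir_out

z = torch.randn(1, 64, 256)        # template tokens
x_rgb = torch.randn(1, 256, 256)   # RGB search-region tokens
x_tir = torch.randn(1, 256, 256)   # TIR search-region tokens
bridged, rgb_out, tir_out = TemplateBridgeSketch()(z, x_rgb, x_tir)
```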
Inspired by the recent success of prompt learning in language models, ViPT is developed, which learns modal-relevant prompts to adapt a frozen pre-trained foundation model to various downstream multi-modal tracking tasks and achieves state-of-the-art performance while remaining parameter-efficient.
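A minimal sketch of the modality-prompt-tuning idea: the pre-trained backbone stays frozen and only a small prompt generator, conditioned on the auxiliary (thermal) input, is trained. The backbone stand-in and all names below are hypothetical and do not reflect ViPT's actual architecture.

```python
import torch
import torch.nn as nn

class PromptedTrackerSketch(nn.Module):
    """Modality-prompt tuning sketch: a frozen backbone receives a few learnable
    tokens generated from the thermal input (the general idea, not ViPT itself)."""
    def __init__(self, backbone: nn.Module, dim: int = 256, num_prompts: int = 4):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # keep the pre-trained model frozen
        # Only the prompt generator is trainable -> parameter-efficient adaptation.
        self.prompt_proj = nn.Linear(dim, dim)
        self.num_prompts = num_prompts

    def forward(self, rgb_tokens: torch.Tensor, tir_tokens: torch.Tensor) -> torch.Tensor:
        # Derive a handful of prompt tokens from the thermal stream.
        prompts = self.prompt_proj(tir_tokens[:, : self.num_prompts])   # (B, P, dim)
        return self.backbone(torch.cat([prompts, rgb_tokens], dim=1))

# Example with a stand-in transformer encoder as the "frozen foundation model".
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)
tracker = PromptedTrackerSketch(encoder)
out = tracker(torch.randn(1, 320, 256), torch.randn(1, 320, 256))
```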