Multi-Modal Fusion for End-to-End RGB-T Tracking

Cited by: 111
Authors
Zhang, Lichao [1]
Danelljan, Martin [2]
Gonzalez-Garcia, Abel [1]
van de Weijer, Joost [1]
Khan, Fahad Shahbaz [3]
Affiliations
[1] Univ Autonoma Barcelona, Comp Vis Ctr, Barcelona, Spain
[2] Swiss Fed Inst Technol, Comp Vis Lab, Zurich, Switzerland
[3] Incept Inst Artificial Intelligence, Abu Dhabi, U Arab Emirates
Source
2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS (ICCVW) | 2019
Keywords
OBJECT TRACKING;
DOI
10.1109/ICCVW.2019.00278
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose an end-to-end tracking framework for fusing the RGB and TIR modalities in RGB-T tracking. Our baseline tracker is DiMP (Discriminative Model Prediction), which employs a carefully designed target prediction network trained end-to-end using a discriminative loss. We analyze the effectiveness of modality fusion in each of the main components of DiMP, i.e., the feature extractor, the target estimation network, and the classifier. We consider several fusion mechanisms acting at different levels of the framework: pixel-level, feature-level, and response-level. Our tracker is trained end-to-end, enabling the components to learn how to fuse the information from both modalities. To train our model, we generate a large-scale RGB-T dataset by taking an annotated RGB tracking dataset (GOT-10k) and synthesizing paired TIR images using an image-to-image translation approach. We perform extensive experiments on the VOT-RGBT2019 and RGBT210 datasets, evaluating each type of modality fusion on each model component. The results show that the proposed fusion mechanisms improve on their single-modality counterparts. We obtain our best results when fusing at the feature level on both the IoU-Net and the model predictor, reaching an EAO score of 0.391 on VOT-RGBT2019. With this fusion mechanism we achieve state-of-the-art performance on RGBT210.
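The three fusion levels named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the fusion operators, so channel concatenation, a 1x1 channel projection, and a weighted average are assumed here purely for illustration, and all function names, shapes, and the `alpha` parameter are hypothetical.

```python
import numpy as np


def pixel_level_fusion(rgb, tir):
    """Assumed pixel-level fusion: stack an (H, W, 3) RGB image and an
    (H, W, 1) TIR image into a single (H, W, 4) network input."""
    return np.concatenate([rgb, tir], axis=-1)


def feature_level_fusion(feat_rgb, feat_tir, weight):
    """Assumed feature-level fusion: concatenate two (C, H, W) feature maps
    along the channel axis, then mix channels with a 1x1 projection.
    `weight` has shape (C_out, 2C) and would be learned in practice."""
    stacked = np.concatenate([feat_rgb, feat_tir], axis=0)  # (2C, H, W)
    return np.einsum("oc,chw->ohw", weight, stacked)        # (C_out, H, W)


def response_level_fusion(resp_rgb, resp_tir, alpha=0.5):
    """Assumed response-level fusion: weighted average of the two
    per-modality response (score) maps."""
    return alpha * resp_rgb + (1.0 - alpha) * resp_tir


# Toy inputs with arbitrary sizes, only to show the shapes involved.
rng = np.random.default_rng(0)
fused_input = pixel_level_fusion(rng.random((8, 8, 3)), rng.random((8, 8, 1)))
fused_feat = feature_level_fusion(
    rng.random((4, 8, 8)), rng.random((4, 8, 8)), rng.random((4, 8))
)
fused_resp = response_level_fusion(np.ones((8, 8)), np.zeros((8, 8)), alpha=0.25)
```

In an end-to-end tracker the projection weights (and any mixing coefficient) would be trained jointly with the rest of the network rather than fixed as in this toy example.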
Pages: 2252-2261
Page count: 10