C2MR: Continual Cross-Modal Retrieval for Streaming Multi-modal Data

Cited by: 2
Authors
Zhang, Huaiwen [1]
Yang, Yang [1]
Qi, Fan [2]
Qian, Shengsheng [3]
Xu, Changsheng [3]
Affiliations
[1] Inner Mongolia Univ, Hohhot, Peoples R China
[2] Tianjin Univ Technol, Tianjin, Peoples R China
[3] Chinese Acad Sci, Inst Automat, Beijing, Peoples R China
Source
PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023 | 2023
Funding
Beijing Natural Science Foundation; National Natural Science Foundation of China
Keywords
Continual Learning; Cross-modal Retrieval
DOI
10.1145/3581783.3611919
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Massive numbers of new images are uploaded to the internet every day, yet existing cross-modal retrieval (CMR) approaches struggle to accommodate this continuously growing data. The prevalent practice is to periodically retrain or fine-tune a new model on the accumulated data, which invalidates billions of indexed features extracted by the previous model and incurs a substantial computational cost to re-extract features for the entire data archive. Is it possible to develop a retrieval model that effectively captures the knowledge of upcoming sessions while preserving the discriminative power of features extracted in previous sessions? In this paper, we propose an online continual learning setup, OC-CMR, to formalize the data-incremental growth challenge faced by cross-modal retrieval systems. It consists of two key settings: 1) as in real-world scenarios, the streaming multi-modal data arrives once per session; 2) to account for computational costs, each instance of archived data has its feature extracted only once, by the model of the session in which it arrived. Under OC-CMR, we perform in-depth evaluations of state-of-the-art cross-modal retrieval methods and observe that they suffer from representational shift and collapse caused by catastrophic forgetting. To address this issue, we propose the Continual Cross-Modal Retrieval (C2MR) approach, which learns a common space shared not only across modalities but also across sessions, and maintains relationships between samples from distinct sessions via cross-modal relational coherence and semantic representation coordination. We construct two new benchmarks by adapting the MS-COCO and Flickr30K datasets to the OC-CMR setting, providing a more challenging evaluation framework for CMR tasks. Experimental results demonstrate that our method effectively alleviates forgetting and significantly outperforms combinations of prior cross-modal retrieval and continual learning methods.
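
The two OC-CMR settings in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `make_encoder`, `Archive`, and `run_stream` are hypothetical names, and a simple scaling function stands in for the model trained in each session. The sketch only shows the bookkeeping, namely that data arrives once per session and that each archived item keeps the feature produced by its own session's model, while queries are encoded by the latest model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Encoder = Callable[[Sequence[float]], Vector]


def make_encoder(scale: float) -> Encoder:
    """Hypothetical stand-in for the model obtained in one session."""
    def encode(x: Sequence[float]) -> Vector:
        return tuple(scale * v for v in x)
    return encode


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Archive:
    """Feature index in which every item is encoded exactly once."""
    entries: List[Tuple[str, int, Vector]] = field(default_factory=list)

    def add_session(self, session_id: int, items, encoder: Encoder) -> None:
        # Setting 2: the feature comes from the model of the item's own
        # session and is never re-extracted by a later model.
        for item_id, raw in items:
            self.entries.append((item_id, session_id, encoder(raw)))

    def search(self, query_feature: Vector, top_k: int = 1) -> List[str]:
        # Rank the mixed-session archive by dot-product similarity.
        ranked = sorted(self.entries,
                        key=lambda e: dot(e[2], query_feature),
                        reverse=True)
        return [item_id for item_id, _, _ in ranked[:top_k]]


def run_stream(sessions):
    """Setting 1: each session's data is seen once, in arrival order."""
    archive = Archive()
    encoder = None
    for session_id, (scale, items) in enumerate(sessions):
        encoder = make_encoder(scale)  # placeholder for training on this session
        archive.add_session(session_id, items, encoder)
    return archive, encoder


sessions = [
    (1.0, [("a", (1.0, 0.0))]),   # session 0
    (2.0, [("b", (0.0, 1.0))]),   # session 1
]
archive, latest = run_stream(sessions)
result = archive.search(latest((1.0, 0.0)))
```

Because old features are frozen, a query encoded by the latest model is compared against features from every earlier model, which is where the representational shift described in the abstract would appear in a real system.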
Pages: 8963-8974
Page count: 12