VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning

Cited by: 0
Authors
Fei, Nanyi [1 ]
Jiang, Hao [2 ]
Lu, Haoyu [3 ]
Long, Jinqiang [3 ]
Dai, Yanqi [3 ]
Fan, Tuo [2 ]
Cao, Zhao [2 ]
Lu, Zhiwu [3 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[2] Huawei Poisson Lab, Hangzhou, Zhejiang, Peoples R China
[3] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Source
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I | 2024, Vol. 14608
Funding
National Natural Science Foundation of China;
Keywords
multi-modal model; multi-task learning; cross-modal search;
DOI
10.1007/978-3-031-56027-9_4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal search is a fundamental task in multi-modal learning, yet hardly any work aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because we integrate cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because sub-modules of our flexible network architecture can be freely assembled for the corresponding tasks. Moreover, to offer more choices in the effectiveness-efficiency trade-off when performing cross-modal semantic search, we place multiple encoder exits. Experimental results show the effectiveness of our VEMO, which uses only 37.6% of the network parameters needed for uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning.
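The "multiple encoder exits" mentioned in the abstract describe an early-exit design: a search embedding can be read out after an intermediate encoder layer, trading retrieval quality for latency. Below is a minimal sketch of that idea in PyTorch; it is not the authors' implementation, and the module names, sizes, and mean-pooling readout are all assumptions for illustration.

# Minimal sketch of a multi-exit encoder (NOT the authors' code): an
# embedding can be read out after any of the first k Transformer layers,
# so shallow exits cost less compute and deep exits give full quality.
import torch
import torch.nn as nn

class MultiExitEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, num_heads: int = 4):
        super().__init__()
        # Independent Transformer layers, one per depth.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                        batch_first=True)
             for _ in range(num_layers)]
        )
        # One projection head per exit, mapping pooled tokens to an embedding.
        self.exit_heads = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, exit_at: int) -> torch.Tensor:
        # Run only the first `exit_at` layers, then read out an embedding.
        for layer in self.layers[:exit_at]:
            x = layer(x)
        pooled = x.mean(dim=1)  # mean pooling over tokens (assumed)
        return self.exit_heads[exit_at - 1](pooled)

# Usage: pick the exit to trade effectiveness for efficiency at query time.
encoder = MultiExitEncoder()
tokens = torch.randn(2, 32, 256)      # (batch, sequence length, dim)
fast = encoder(tokens, exit_at=2)     # early exit: lower latency
full = encoder(tokens, exit_at=6)     # final exit: full quality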
Pages: 56-72 (17 pages)