VEMO: A Versatile Elastic Multi-modal Model for Search-Oriented Multi-task Learning

Cited by: 0
Authors
Fei, Nanyi [1 ]
Jiang, Hao [2 ]
Lu, Haoyu [3 ]
Long, Jinqiang [3 ]
Dai, Yanqi [3 ]
Fan, Tuo [2 ]
Cao, Zhao [2 ]
Lu, Zhiwu [3 ]
Affiliations
[1] Renmin Univ China, Sch Informat, Beijing, Peoples R China
[2] Huawei Poisson Lab, Hangzhou, Zhejiang, Peoples R China
[3] Renmin Univ China, Gaoling Sch Artificial Intelligence, Beijing, Peoples R China
Source
ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I | 2024, Vol. 14608
Funding
National Natural Science Foundation of China;
Keywords
multi-modal model; multi-task learning; cross-modal search;
DOI
10.1007/978-3-031-56027-9_4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Cross-modal search is a fundamental task in multi-modal learning, yet hardly any work aims to solve multiple cross-modal search tasks at once. In this work, we propose a novel Versatile Elastic Multi-mOdal (VEMO) model for search-oriented multi-task learning. VEMO is versatile because we integrate cross-modal semantic search, named entity recognition, and scene text spotting into a unified framework, where the latter two can be further adapted to entity- and character-based image search tasks. VEMO is also elastic because sub-modules of our flexible network architecture can be freely assembled for the corresponding tasks. Moreover, to offer more choices in the effectiveness-efficiency trade-off when performing cross-modal semantic search, we place multiple encoder exits. Experimental results show the effectiveness of our VEMO, which uses only 37.6% of the network parameters needed for uni-task training. Further evaluations on entity- and character-based image search tasks also validate the superiority of search-oriented multi-task learning.
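The "multiple encoder exits" mentioned in the abstract describe an early-exit design: a search embedding can be read out after an intermediate encoder layer, trading retrieval quality for latency. Below is a minimal sketch of that idea in PyTorch; it is not the authors' implementation, and the module names, sizes, and mean-pooling readout are all assumptions for illustration.

# Minimal sketch of a multi-exit encoder (NOT the authors' code): an
# embedding can be read out after any of the first k Transformer layers,
# so shallow exits cost less compute and deep exits give full quality.
import torch
import torch.nn as nn

class MultiExitEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 6, num_heads: int = 4):
        super().__init__()
        # Independent Transformer layers, one per depth.
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                        batch_first=True)
             for _ in range(num_layers)]
        )
        # One projection head per exit, mapping pooled tokens to an embedding.
        self.exit_heads = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, exit_at: int) -> torch.Tensor:
        # Run only the first `exit_at` layers, then read out an embedding.
        for layer in self.layers[:exit_at]:
            x = layer(x)
        pooled = x.mean(dim=1)  # mean pooling over tokens (assumed)
        return self.exit_heads[exit_at - 1](pooled)

# Usage: pick the exit to trade effectiveness for efficiency at query time.
encoder = MultiExitEncoder()
tokens = torch.randn(2, 32, 256)      # (batch, sequence length, dim)
fast = encoder(tokens, exit_at=2)     # early exit: lower latency
full = encoder(tokens, exit_at=6)     # final exit: full quality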
Pages: 56-72 (17 pages)