Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Cited by: 5
Authors
Li, Yehao [1 ]
Fan, Jiahao [2 ]
Pan, Yingwei [1 ]
Yao, Ting [1 ]
Lin, Weiyao [2 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, 8 Beichen West St, Beijing 100105, Peoples R China
[2] Shanghai Jiao Tong Univ, 800 Dongchuan Rd, Shanghai 200240, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
Vision-language pre-training; encoder-decoder networks
DOI
10.1145/3473140
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Subject Classification Code
0812
Abstract
Vision-language pre-training has been an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) that facilitates both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: an object encoder and a sentence encoder that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation through inter-modal interaction. Considering that the linguistic representation of an image can span multiple granularities, from simple to comprehensive (an individual label, a phrase, and a natural sentence), we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
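To make the described layout concrete, here is a minimal PyTorch sketch of the abstract's structure: two single-modal Transformer encoders feeding a sentence decoder, with one head per proxy task. Every specific here (module names, dimensions, layer counts, vocabulary and class sizes, the pooled matching head, the omission of positional encodings) is an illustrative assumption, not the authors' released configuration.

```python
# Hypothetical sketch of the Uni-EDEN layout from the abstract: object and
# sentence encoders, a sentence decoder for inter-modal interaction, and one
# head per pre-training proxy task. All sizes are illustrative assumptions;
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniEDENSketch(nn.Module):
    def __init__(self, vocab_size=30522, num_obj_classes=1600,
                 region_dim=2048, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)    # region features -> model dim
        self.word_embed = nn.Embedding(vocab_size, d_model)  # word tokens -> model dim

        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.object_encoder = nn.TransformerEncoder(enc_layer, num_layers)    # image stream
        self.sentence_encoder = nn.TransformerEncoder(enc_layer, num_layers)  # text stream

        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.sentence_decoder = nn.TransformerDecoder(dec_layer, num_layers)  # inter-modal interaction

        # One head per proxy task named in the abstract.
        self.obj_cls_head = nn.Linear(d_model, num_obj_classes)  # Masked Object Classification
        self.match_head = nn.Linear(d_model, 2)                  # Image-Sentence Matching
        self.lm_head = nn.Linear(d_model, vocab_size)            # Masked Sentence / Region Phrase Generation

    def forward(self, regions, tokens):
        obj = self.object_encoder(self.region_proj(regions))    # (B, R, D) contextualized regions
        sent = self.sentence_encoder(self.word_embed(tokens))   # (B, T, D) contextualized words
        # Causal mask keeps the decoder usable for left-to-right generation.
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        dec = self.sentence_decoder(sent, obj, tgt_mask=causal) # cross-attends to image regions
        return {
            "obj_logits": self.obj_cls_head(obj),          # per-region label scores
            "match_logits": self.match_head(dec.mean(1)),  # pooled joint representation
            "lm_logits": self.lm_head(dec),                # per-position word scores
        }

# Dummy multi-task step: 2 images with 36 regions each, 20-token sentences.
model = UniEDENSketch()
out = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 20)))
loss = (F.cross_entropy(out["lm_logits"].transpose(1, 2),   # generation-style tasks
                        torch.randint(0, 30522, (2, 20)))
        + F.cross_entropy(out["match_logits"],              # matching task
                          torch.randint(0, 2, (2,))))
loss.backward()
```

In this reading, the causal mask is what would let the same decoder serve both multi-modal reasoning over a full sentence during pre-training and left-to-right caption generation after fine-tuning; the actual task heads and masking strategy in the paper may differ.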
Pages: 16