Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training

Cited by: 5
Authors
Li, Yehao [1 ]
Fan, Jiahao [2 ]
Pan, Yingwei [1 ]
Yao, Ting [1 ]
Lin, Weiyao [2 ]
Mei, Tao [1 ]
Affiliations
[1] JD AI Res, 8 Beichen West St, Beijing 100105, Peoples R China
[2] Shanghai Jiao Tong Univ, 800 Dongchuan Rd, Shanghai 200240, Peoples R China
Funding
National Key Research and Development Program of China
Keywords
Vision-language pre-training; encoder-decoder networks
DOI
10.1145/3473140
Chinese Library Classification (CLC)
TP [Automation and Computer Technology]
Subject Classification Code
0812
Abstract
Vision-language pre-training has been an emerging and fast-developing research topic that transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) that facilitates both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: an object encoder and a sentence encoder that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation through inter-modal interaction. Considering that the linguistic representation of an image can span multiple granularities, from simple to comprehensive (an individual label, a phrase, and a natural sentence), we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
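To make the described layout concrete, here is a minimal PyTorch sketch of the abstract's structure: two single-modal Transformer encoders feeding a sentence decoder, with one head per proxy task. Every specific here (module names, dimensions, layer counts, vocabulary and class sizes, the pooled matching head, the omission of positional encodings) is an illustrative assumption, not the authors' released configuration.

```python
# Hypothetical sketch of the Uni-EDEN layout from the abstract: object and
# sentence encoders, a sentence decoder for inter-modal interaction, and one
# head per pre-training proxy task. All sizes are illustrative assumptions;
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniEDENSketch(nn.Module):
    def __init__(self, vocab_size=30522, num_obj_classes=1600,
                 region_dim=2048, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, d_model)    # region features -> model dim
        self.word_embed = nn.Embedding(vocab_size, d_model)  # word tokens -> model dim

        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.object_encoder = nn.TransformerEncoder(enc_layer, num_layers)    # image stream
        self.sentence_encoder = nn.TransformerEncoder(enc_layer, num_layers)  # text stream

        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.sentence_decoder = nn.TransformerDecoder(dec_layer, num_layers)  # inter-modal interaction

        # One head per proxy task named in the abstract.
        self.obj_cls_head = nn.Linear(d_model, num_obj_classes)  # Masked Object Classification
        self.match_head = nn.Linear(d_model, 2)                  # Image-Sentence Matching
        self.lm_head = nn.Linear(d_model, vocab_size)            # Masked Sentence / Region Phrase Generation

    def forward(self, regions, tokens):
        obj = self.object_encoder(self.region_proj(regions))    # (B, R, D) contextualized regions
        sent = self.sentence_encoder(self.word_embed(tokens))   # (B, T, D) contextualized words
        # Causal mask keeps the decoder usable for left-to-right generation.
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        dec = self.sentence_decoder(sent, obj, tgt_mask=causal) # cross-attends to image regions
        return {
            "obj_logits": self.obj_cls_head(obj),          # per-region label scores
            "match_logits": self.match_head(dec.mean(1)),  # pooled joint representation
            "lm_logits": self.lm_head(dec),                # per-position word scores
        }

# Dummy multi-task step: 2 images with 36 regions each, 20-token sentences.
model = UniEDENSketch()
out = model(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 20)))
loss = (F.cross_entropy(out["lm_logits"].transpose(1, 2),   # generation-style tasks
                        torch.randint(0, 30522, (2, 20)))
        + F.cross_entropy(out["match_logits"],              # matching task
                          torch.randint(0, 2, (2,))))
loss.backward()
```

In this reading, the causal mask is what would let the same decoder serve both multi-modal reasoning over a full sentence during pre-training and left-to-right caption generation after fine-tuning; the actual task heads and masking strategy in the paper may differ.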
Pages: 16