IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Cited by: 3
Authors
Huang, Xinyu [1 ]
Zhang, Youcai [2 ]
Cheng, Ying [3 ]
Tian, Weiwei [3 ]
Zhao, Ruiwei [3 ]
Feng, Rui [1 ,3 ,4 ,5 ]
Zhang, Yuejie [1 ]
Li, Yaqian [2 ]
Guo, Yandong [2 ]
Zhang, Xiaobo [4 ]
Affiliations
[1] Fudan Univ, Sch Comp Sci, Shanghai Key Lab Intelligent Informat Proc, Shanghai, Peoples R China
[2] OPPO Res Inst, Chengdu, Peoples R China
[3] Fudan Univ, Acad Engn & Technol, Shanghai, Peoples R China
[4] Fudan Univ, Childrens Hosp, Natl Childrens Med Ctr, Shanghai, Peoples R China
[5] Shanghai Collaborat Innovat Ctr Intelligent Visual, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022 | 2022
Funding
National Natural Science Foundation of China
Keywords
Vision-Language Pre-training; Multi-Label Recognition; Natural Language Supervision; Vision-Language Intelligence;
DOI
10.1145/3503161.3548108
Chinese Library Classification (CLC)
TP39 [Computer Applications]
Discipline Codes
081203; 0835
Abstract
Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, image-text pairs co-occurring on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to exploit additional image tag information; however, object detection is time-consuming and can only identify pre-defined object categories, limiting model capacity. Inspired by the observation that the paired texts incorporate incomplete yet fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA significantly boosts performance on multiple downstream datasets at a small extra computational cost. Public code is available at: https://github.com/xinyu1205/IDEA-pytorch.
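The following is a minimal PyTorch-style sketch of the idea described in the abstract, not the authors' released implementation: parse image tags from the paired captions online and optimize a multi-label recognition loss jointly with a standard image-text contrastive objective. The tag vocabulary, caption matching, feature dimensions, and equal loss weighting below are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical toy tag vocabulary; in practice the tag set would be built from the caption corpus.
TAG_VOCAB = ["dog", "ball", "grass", "person", "car"]
TAG2ID = {t: i for i, t in enumerate(TAG_VOCAB)}

def tags_from_caption(caption: str) -> torch.Tensor:
    """Build a multi-hot target online by matching vocabulary words in the caption."""
    target = torch.zeros(len(TAG_VOCAB))
    for word in caption.lower().split():
        if word in TAG2ID:
            target[TAG2ID[word]] = 1.0
    return target

class JointVLPLoss(nn.Module):
    """Image-text contrastive loss plus a multi-label tag-recognition loss."""
    def __init__(self, dim: int = 256, num_tags: int = len(TAG_VOCAB)):
        super().__init__()
        self.tag_head = nn.Linear(dim, num_tags)  # multi-label classifier on image features

    def forward(self, img_feat, txt_feat, tag_targets, temperature: float = 0.07):
        img = F.normalize(img_feat, dim=-1)
        txt = F.normalize(txt_feat, dim=-1)
        logits = img @ txt.t() / temperature            # batch-wise image-text similarities
        labels = torch.arange(img.size(0), device=img.device)
        itc = 0.5 * (F.cross_entropy(logits, labels) +
                     F.cross_entropy(logits.t(), labels))
        mlr = F.binary_cross_entropy_with_logits(self.tag_head(img_feat), tag_targets)
        return itc + mlr                                 # both objectives optimized jointly

# Usage, with random features standing in for encoder outputs:
criterion = JointVLPLoss()
img_feat, txt_feat = torch.randn(2, 256), torch.randn(2, 256)
tag_targets = torch.stack([tags_from_caption("a dog chasing a ball on grass"),
                           tags_from_caption("a person standing next to a car")])
loss = criterion(img_feat, txt_feat, tag_targets)

The tag supervision here is only a stand-in for IDEA's online multi-label recognition; the point it illustrates is that the extra tag objective adds little beyond a linear head on top of the existing image encoder, consistent with the small extra computational cost reported in the abstract.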
Pages: 4573-4583
Number of pages: 11