Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Citations: 0
Authors
Kim, Geewook [1 ,2 ]
Lee, Hodong [1 ,3 ]
Kim, Daehee [1 ]
Jung, Haeji [3 ]
Park, Sanghee [1 ]
Kim, Yoonsik [1 ]
Yun, Sangdoo [4 ]
Kim, Taeho [1 ]
Lee, Bado [1 ]
Park, Seunghyun [1 ]
Affiliations
[1] NAVER Cloud AI, Seoul, South Korea
[2] KAIST AI, Daejeon, South Korea
[3] Korea University, Seoul, South Korea
[4] NAVER AI Lab, Seoul, South Korea
Source
2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), 2023
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural conversations, their performance on text-rich images still requires improvement. In this paper, we introduce Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details that are often overlooked in existing methods. Cream combines vision and auxiliary encoders, fortified by a contrastive feature alignment technique, to achieve a more effective comprehension of language information in visually situated contexts within the images. Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants. Through rigorous evaluations across diverse visually-situated language understanding tasks that demand reasoning capabilities, we demonstrate the compelling performance of Cream, positioning it as a prominent model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream.
Pages: 11989-12010
Page count: 22
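
The abstract describes a contrastive feature alignment technique that ties the vision encoder to an auxiliary encoder. Below is a minimal, hypothetical sketch of what such an alignment objective can look like, written as a symmetric InfoNCE-style loss in PyTorch; the function name, tensor shapes, and temperature value are illustrative assumptions and do not reproduce the exact Cream objective (see the paper and the linked repository for the actual formulation).

# Minimal sketch of a contrastive feature-alignment objective between a vision
# encoder and an auxiliary (e.g., OCR/text) encoder. This is an illustrative
# InfoNCE-style formulation, NOT the exact Cream objective; the names
# `vision_feats`, `aux_feats`, and `temperature` are assumptions for the sketch.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(vision_feats: torch.Tensor,
                               aux_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss pulling paired vision/auxiliary features together.

    vision_feats: (batch, dim) pooled features from the vision encoder.
    aux_feats:    (batch, dim) pooled features from the auxiliary encoder.
    """
    # L2-normalize so dot products become cosine similarities.
    v = F.normalize(vision_feats, dim=-1)
    a = F.normalize(aux_feats, dim=-1)

    # Pairwise similarity matrix; the diagonal holds the matched pairs.
    logits = v @ a.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Contrast in both directions (vision->aux and aux->vision) and average.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder outputs.
    vision = torch.randn(8, 256)
    auxiliary = torch.randn(8, 256)
    print(contrastive_alignment_loss(vision, auxiliary).item())

The symmetric form (averaging both contrast directions) is a common choice for vision-text alignment losses; whether Cream uses this exact symmetric variant, or a different granularity of features, is not specified in the abstract.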