Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

Cited by: 0
Authors
Kim, Geewook [1, 2]
Lee, Hodong [1, 3]
Kim, Daehee [1]
Jung, Haeji [3]
Park, Sanghee [1]
Kim, Yoonsik [1]
Yun, Sangdoo [4]
Kim, Taeho [1]
Lee, Bado [1]
Park, Seunghyun [1]
Affiliations
[1] NAVER Cloud AI, Seoul, South Korea
[2] KAIST AI, Daejeon, South Korea
[3] Korea Univ, Seoul, South Korea
[4] NAVER AI Lab, Seoul, South Korea
Source
2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) | 2023
Keywords: (none listed)
DOI: not available
Chinese Library Classification: TP18 (Artificial Intelligence Theory)
Discipline codes: 081104; 0812; 0835; 1405
Abstract
Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural conversations, their performance on text-rich images still requires improvement. In this paper, we introduce Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details that are often overlooked in existing methods. Cream combines vision and auxiliary encoders, fortified by a contrastive feature alignment technique, to achieve a more effective comprehension of language information in visually situated contexts within the images. Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants. Through rigorous evaluations across diverse visually-situated language understanding tasks that demand reasoning capabilities, we demonstrate the compelling performance of Cream, positioning it as a prominent model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream.
Pages: 11989-12010 (22 pages)