Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models

Cited by: 0

Authors:

Kao, Chang-Sheng [1]

Chen, Yun-Nung [1]

Affiliations:

[1] Natl Taiwan Univ, Taipei, Taiwan

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024 | 2024

DOI:

Not available

Abstract:

Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment not only improves overall communicative efficacy but also enhances the quality of conversational experiences. However, existing methods for dialogue-to-image retrieval face limitations due to the constraints of pretrained vision language models (VLMs) in comprehending complex dialogues accurately. To address this, we present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors, facilitating seamless connection with images. Extensive experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors, leading to significant enhancements in dialogue-to-image retrieval performance. Furthermore, our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets, underscoring its practicality and potential impact in real-world applications.

Pages: 11777-11788

Page count: 12
