Distilling implicit multimodal knowledge into large language models for zero-resource dialogue generation

Cited by: 0

Authors:

Zhang, Bo [1]

Ma, Hui [2]

Ding, Jian [1]

Wang, Jian [1]

Xu, Bo [1]

Lin, Hongfei [1]

Affiliations:

[1] Dalian Univ Technol, Sch Comp Sci & Technol, Dalian 116024, Liaoning, Peoples R China

[2] Hefei Univ Technol, Sch Comp Sci & Informat Engn, Hefei 230601, Anhui, Peoples R China

Source:

INFORMATION FUSION | 2025 / Volume 118

Keywords:

Large language models; Multimodal fusion; Zero resource; Dialogue generation;

DOI:

10.1016/j.inffus.2025.102985

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Integrating multimodal knowledge into large language models (LLMs) represents a significant advancement in dialogue generation capabilities. However, the effective incorporation of such knowledge in zero-resource scenarios remains a substantial challenge due to the scarcity of diverse, high-quality dialogue datasets. To address this, we propose the Visual Implicit Knowledge Distillation Framework (VIKDF), an innovative approach aimed at enhancing LLMs for enriched dialogue generation in zero-resource contexts by leveraging implicit multimodal knowledge. VIKDF comprises two main stages: knowledge distillation, using an Implicit Query Transformer to extract and encode visual implicit knowledge from image-text pairs into knowledge vectors; and knowledge integration, employing a novel Bidirectional Variational Information Fusion technique to seamlessly integrate these distilled vectors into LLMs. This enables the LLMs to generate dialogues that are not only coherent and engaging but also exhibit a deep understanding of the context through implicit multimodal cues, effectively overcoming the limitations of zero-resource scenarios. Our extensive experimentation across two dialogue datasets shows that VIKDF outperforms existing state-of-the-art models in generating high-quality dialogues. The code is available at https://github.com/zhangbo-nlp/VIKDF.

Pages: 11
