Multi-Modal Latent Space Learning for Chain-of-Thought Reasoning in Language Models

Times Cited: 0
Authors
He, Liqi [1 ]
Li, Zuchao [1 ]
Cai, Xiantao [1 ]
Wang, Ping [2 ,3 ]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Natl Engn Res Ctr Multimedia Software, Wuhan 430072, Peoples R China
[2] Wuhan Univ, Ctr Studies Informat Resources, Wuhan 430072, Peoples R China
[3] Wuhan Univ, Sch Informat Management, Wuhan 430072, Peoples R China
Source
THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16 | 2024
Funding
National Natural Science Foundation of China;
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Chain-of-thought (CoT) reasoning has exhibited impressive performance in language models for solving complex tasks and answering questions. However, many real-world questions require multi-modal information, such as text and images. Previous research on multi-modal CoT has primarily focused on extracting fixed image features from off-the-shelf vision models and then fusing them with text using attention mechanisms. This approach is limited because those vision models were not designed for complex reasoning tasks and do not align well with language thoughts. To overcome this limitation, we introduce a novel approach for multi-modal CoT reasoning that uses latent space learning via diffusion processes to generate effective image features aligned with language thoughts. Our method fuses image features and text representations at a deep level and improves the complex reasoning ability of multi-modal CoT. We demonstrate the efficacy of the proposed method on multi-modal ScienceQA and machine translation benchmarks, achieving state-of-the-art performance on ScienceQA. Overall, our approach offers a more robust and effective solution for multi-modal reasoning in language models, enhancing their ability to tackle complex real-world problems.
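The abstract describes generating image features by denoising a latent variable conditioned on the text representation, then fusing the two modalities at a deep level. A minimal NumPy sketch of that general idea follows; the dimensions, the stand-in denoiser `W`, the blending schedule, and the concatenation-based fusion are all illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16      # shared latent dimension (hypothetical)
STEPS = 10  # number of reverse-diffusion steps (hypothetical)

def denoise_step(z, text, t, W):
    """One illustrative reverse-diffusion step: nudge the noisy image
    latent z toward a text-conditioned estimate. W stands in for a
    learned denoiser; here it is a fixed random projection."""
    pred = np.tanh(W @ np.concatenate([z, text]))  # text-conditioned estimate
    alpha = (t + 1) / STEPS                        # toy blending schedule
    return alpha * z + (1 - alpha) * pred

def generate_image_latent(text, W):
    """Start from Gaussian noise and iteratively denoise, conditioning
    on the text representation so the result aligns with it."""
    z = rng.standard_normal(D)
    for t in reversed(range(STEPS)):
        z = denoise_step(z, text, t, W)
    return z

def fuse(z_img, z_text):
    """Placeholder for deep fusion: concatenate the two modalities."""
    return np.tanh(np.concatenate([z_img, z_text]))

text = rng.standard_normal(D)                       # mock text representation
W = rng.standard_normal((D, 2 * D)) / np.sqrt(2 * D)
z_img = generate_image_latent(text, W)
fused = fuse(z_img, text)
print(z_img.shape, fused.shape)  # (16,) (32,)
```

Because the image latent is produced by a process conditioned on the text representation rather than taken from a frozen vision encoder, it can in principle be shaped to match the reasoning signal, which is the motivation the abstract gives for moving away from fixed off-the-shelf image features.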
Pages: 18180-18187
Page count: 8