Multimodal retrieval-augmented generation for financial documents: image-centric analysis of charts and tables with large language models

Cited by: 0

Authors:

Jiang, Cheng ^{[1]}

Zhang, Pengle ^{[1,2]}

Ni, Ying ^{[3]}

Wang, Xiaoli ^{[3]}

Peng, Hanghang ^{[3]}

Liu, Sen ^{[4]}

Fei, Mengdi ^{[1]}

He, Yuxin ^{[1]}

Xiao, Yaxuan ^{[1]}

Huang, Jin ^{[1,2]}

Ma, Xingyu ^{[1]}

Yang, Tian ^{[1]}

Affiliations:

[1] Wuhan Text Univ, State Key Lab New Text Mat & Adv Proc Technol, Wuhan, Peoples R China

[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China

[3] Guotai Asset Management Co Ltd, Shanghai, Peoples R China

[4] Univ Penn, Philadelphia, PA USA

Source:

VISUAL COMPUTER | 2025

Keywords:

Financial documents; Chart and table image data; Retrieval performance; Retrieval-augmented generation (RAG); Large language models (LLMs);

DOI:

10.1007/s00371-025-03829-5

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

In the financial domain, retrieval-augmented generation (RAG) enables large language models (LLMs) to draw on external financial documents during generation, which is crucial for financial decision-making. However, current RAG systems handle the visual data in financial documents inadequately, and their retrieval results are insufficiently relevant to the corresponding queries. To address these issues, we present a novel approach that integrates multimodal RAG with LLMs to enhance financial document analysis, focusing in particular on the interpretation of tables and charts. We propose a method that converts chart and table image data into Markdown format for integration with textual data. With this method, we achieve comprehensive parsing of text, tables, and charts in financial documents. To improve retrieval performance, we use a hybrid retrieval method that combines a vector database and a graph database. In addition, we used LLMs for annotation and refined the annotations through careful human review to compile a rich financial dataset focused on table and chart images, intended for evaluating retrieval efficiency and generation quality in RAG. Our analysis of both the retrieval and generation processes demonstrates the potential of this method to revolutionize financial data visualization and decision-making processes. More information and our code are available at our GitHub repository: https://github.com/ChengZ2003/multimodal_RAG.


Pages: 14

