Multimodal retrieval-augmented generation for financial documents: image-centric analysis of charts and tables with large language models

Cited by: 0
Authors
Jiang, Cheng [1 ]
Zhang, Pengle [1 ,2 ]
Ni, Ying [3 ]
Wang, Xiaoli [3 ]
Peng, Hanghang [3 ]
Liu, Sen [4 ]
Fei, Mengdi [1 ]
He, Yuxin [1 ]
Xiao, Yaxuan [1 ]
Huang, Jin [1 ,2 ]
Ma, Xingyu [1 ]
Yang, Tian [1 ]
Affiliations
[1] Wuhan Text Univ, State Key Lab New Text Mat & Adv Proc Technol, Wuhan, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[3] Guotai Asset Management Co Ltd, Shanghai, Peoples R China
[4] Univ Penn, Philadelphia, PA USA
Keywords
Financial documents; Chart and table image data; Retrieval performance; Retrieval-augmented generation (RAG); Large language models (LLMs)
DOI
10.1007/s00371-025-03829-5
CLC number
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
In the financial domain, retrieval-augmented generation (RAG) enables large language models (LLMs) to leverage external financial documents for generation, which is crucial for financial decision-making. However, current RAG systems handle the visual data in financial documents inadequately and suffer from insufficient relevance between retrieval results and the corresponding queries. To address these issues, we present a novel approach that integrates multimodal RAG with LLMs to enhance financial document analysis, focusing in particular on the interpretation of tables and charts. We propose a method that converts chart and table image data into Markdown format for integration with textual data, enabling comprehensive parsing of the text, tables, and charts in financial documents. To improve retrieval performance, we adopt a hybrid retrieval method that combines a vector database with a graph database. In addition, we use LLMs for annotation and refine the annotations through careful human review to compile a rich financial dataset focused on table and chart images, intended to evaluate retrieval efficiency and generation quality in RAG. Analysis of both the retrieval and generation processes demonstrates the potential of this method to revolutionize financial data visualization and decision-making. More information and access to our code are available at our GitHub repository: https://github.com/ChengZ2003/multimodal_RAG.
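As a rough illustration of the hybrid retrieval idea mentioned in the abstract, the sketch below merges dense vector search over parsed Markdown chunks with one-hop expansion in a simple entity graph. It is a minimal sketch under stated assumptions: the function names, toy embeddings, and graph edges are invented for illustration and do not reflect the authors' actual implementation, which presumably uses production vector and graph databases.

```python
# Hypothetical sketch of hybrid retrieval: dense vector search plus graph expansion.
# All identifiers and toy data are illustrative assumptions, not the paper's code.
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query_vec, chunk_vecs, top_k=3):
    """Rank Markdown chunks (parsed text/tables/charts) by embedding similarity."""
    scored = [(cosine(query_vec, v), cid) for cid, v in chunk_vecs.items()]
    return [cid for _, cid in sorted(scored, reverse=True)[:top_k]]

def graph_expand(seed_chunks, edges, hops=1):
    """Add chunks linked to the seed hits in a simple entity/relation graph."""
    frontier, seen = set(seed_chunks), set(seed_chunks)
    for _ in range(hops):
        frontier = {nbr for c in frontier for nbr in edges.get(c, ())} - seen
        seen |= frontier
    return seen

def hybrid_retrieve(query_vec, chunk_vecs, edges, top_k=3):
    """Hybrid retrieval: dense hits first, then graph neighbors as extra context."""
    dense_hits = vector_search(query_vec, chunk_vecs, top_k)
    return graph_expand(dense_hits, edges)

# Toy example: three chunks with 2-d "embeddings" and one graph edge.
chunk_vecs = {"chart_q3_revenue": [0.9, 0.1], "table_costs": [0.2, 0.8], "text_outlook": [0.7, 0.3]}
edges = defaultdict(set, {"chart_q3_revenue": {"table_costs"}})
print(hybrid_retrieve([1.0, 0.0], chunk_vecs, edges, top_k=1))  # {'chart_q3_revenue', 'table_costs'}
```

In this toy run, the chart chunk wins the vector search and the graph edge pulls in the linked cost table, mimicking how graph-based context can surface related evidence that pure embedding similarity would miss.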
Pages: 14