Multimodal retrieval-augmented generation for financial documents: image-centric analysis of charts and tables with large language models

Cited by: 0
Authors
Jiang, Cheng [1 ]
Zhang, Pengle [1 ,2 ]
Ni, Ying [3 ]
Wang, Xiaoli [3 ]
Peng, Hanghang [3 ]
Liu, Sen [4 ]
Fei, Mengdi [1 ]
He, Yuxin [1 ]
Xiao, Yaxuan [1 ]
Huang, Jin [1 ,2 ]
Ma, Xingyu [1 ]
Yang, Tian [1 ]
Affiliations
[1] Wuhan Text Univ, State Key Lab New Text Mat & Adv Proc Technol, Wuhan, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[3] Guotai Asset Management Co Ltd, Shanghai, Peoples R China
[4] Univ Penn, Philadelphia, PA USA
Keywords
Financial documents; Chart and table image data; Retrieval performance; Retrieval-augmented generation (RAG); Large language models (LLMs);
DOI
10.1007/s00371-025-03829-5
CLC number (Chinese Library Classification)
TP31 [Computer software];
Discipline classification codes
081202 ; 0835 ;
Abstract
In the financial domain, retrieval-augmented generation (RAG) enables large language models (LLMs) to draw on external financial documents during generation, which is crucial for financial decision-making. However, current RAG systems handle the visual data in financial documents inadequately and often return retrieval results that are insufficiently relevant to the corresponding queries. To address these issues, we present a novel approach that integrates multimodal RAG with LLMs to enhance financial document analysis, focusing in particular on the interpretation of tables and charts. We propose a method that converts chart and table image data into Markdown format so that it can be integrated with textual data, achieving comprehensive parsing of the text, tables, and charts in financial documents. To improve retrieval performance, we employ a hybrid retrieval method that combines a vector database with a graph database. In addition, we use LLMs for annotation, refined through careful human review, to compile a rich financial dataset focused on table and chart images, intended to evaluate retrieval efficiency and generation quality in RAG. Analysis of both the retrieval and generation processes demonstrates the potential of this method to transform financial data visualization and decision-making. More information and access to our code are available at our GitHub repository: https://github.com/ChengZ2003/multimodal_RAG.
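The hybrid retrieval described in the abstract can be sketched in miniature. The snippet below is illustrative only, not the authors' implementation: it assumes toy in-memory stands-ins for the vector database (a dict of embeddings scored by cosine similarity) and the graph database (an adjacency dict expanded from query-linked seed nodes), and fuses the two channels with a hypothetical weight `alpha`.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_vec, doc_vecs, graph, seed_ids, alpha=0.7, top_k=3):
    """Fuse vector similarity with graph-neighborhood evidence.

    doc_vecs: {chunk_id: embedding} (toy vector store)
    graph:    {chunk_id: [related chunk_ids]} (toy graph store)
    seed_ids: chunks already linked to the query via the graph
    """
    # Vector channel: similarity of each chunk to the query embedding.
    vec_scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    # Graph channel: full credit for seeds, partial credit for neighbors.
    graph_scores = {d: 0.0 for d in doc_vecs}
    for s in seed_ids:
        if s in graph_scores:
            graph_scores[s] = 1.0
        for n in graph.get(s, []):
            if n in graph_scores:
                graph_scores[n] = max(graph_scores[n], 0.5)
    # Weighted fusion of the two channels, ranked descending.
    fused = {d: alpha * vec_scores[d] + (1 - alpha) * graph_scores[d]
             for d in doc_vecs}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

# Toy usage: chunk "a" wins on embedding similarity, while "c" is
# boosted because it neighbors the graph-linked seed "b".
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
graph = {"b": ["c"]}
ranked = hybrid_retrieve([1.0, 0.0], doc_vecs, graph, seed_ids=["b"], top_k=2)
```

In a real system the vector channel would be served by an embedding index and the graph channel by entity links extracted from the parsed Markdown tables and charts; only the score-fusion step is shown here.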
Pages: 14