Multimodal retrieval-augmented generation for financial documents: image-centric analysis of charts and tables with large language models

Cited by: 0
Authors
Jiang, Cheng [1 ]
Zhang, Pengle [1 ,2 ]
Ni, Ying [3 ]
Wang, Xiaoli [3 ]
Peng, Hanghang [3 ]
Liu, Sen [4 ]
Fei, Mengdi [1 ]
He, Yuxin [1 ]
Xiao, Yaxuan [1 ]
Huang, Jin [1 ,2 ]
Ma, Xingyu [1 ]
Yang, Tian [1 ]
Affiliations
[1] Wuhan Text Univ, State Key Lab New Text Mat & Adv Proc Technol, Wuhan, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[3] Guotai Asset Management Co Ltd, Shanghai, Peoples R China
[4] Univ Penn, Philadelphia, PA USA
Keywords
Financial documents; Chart and table image data; Retrieval performance; Retrieval-augmented generation (RAG); Large language models (LLMs);
DOI
10.1007/s00371-025-03829-5
CLC number (Chinese Library Classification)
TP31 [Computer software];
Discipline classification codes
081202 ; 0835 ;
Abstract
In the financial domain, retrieval-augmented generation (RAG) enables large language models (LLMs) to draw on external financial documents during generation, which is crucial for financial decision-making. However, current RAG systems handle the visual data in financial documents inadequately and often return retrieval results that are insufficiently relevant to the corresponding queries. To address these issues, we present a novel approach that integrates multimodal RAG with LLMs to enhance financial document analysis, focusing in particular on the interpretation of tables and charts. We propose a method that converts chart and table image data into Markdown format so that it can be integrated with textual data, achieving comprehensive parsing of the text, tables, and charts in financial documents. To improve retrieval performance, we employ a hybrid retrieval method that combines a vector database with a graph database. In addition, we use LLMs for annotation, refined through careful human review, to compile a rich financial dataset focused on table and chart images, intended to evaluate retrieval efficiency and generation quality in RAG. Analysis of both the retrieval and generation processes demonstrates the potential of this method to transform financial data visualization and decision-making. More information and access to our code are available at our GitHub repository: https://github.com/ChengZ2003/multimodal_RAG.
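The hybrid retrieval described in the abstract can be sketched in miniature. The snippet below is illustrative only, not the authors' implementation: it assumes toy in-memory stands-ins for the vector database (a dict of embeddings scored by cosine similarity) and the graph database (an adjacency dict expanded from query-linked seed nodes), and fuses the two channels with a hypothetical weight `alpha`.

```python
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(query_vec, doc_vecs, graph, seed_ids, alpha=0.7, top_k=3):
    """Fuse vector similarity with graph-neighborhood evidence.

    doc_vecs: {chunk_id: embedding} (toy vector store)
    graph:    {chunk_id: [related chunk_ids]} (toy graph store)
    seed_ids: chunks already linked to the query via the graph
    """
    # Vector channel: similarity of each chunk to the query embedding.
    vec_scores = {d: cosine(query_vec, v) for d, v in doc_vecs.items()}
    # Graph channel: full credit for seeds, partial credit for neighbors.
    graph_scores = {d: 0.0 for d in doc_vecs}
    for s in seed_ids:
        if s in graph_scores:
            graph_scores[s] = 1.0
        for n in graph.get(s, []):
            if n in graph_scores:
                graph_scores[n] = max(graph_scores[n], 0.5)
    # Weighted fusion of the two channels, ranked descending.
    fused = {d: alpha * vec_scores[d] + (1 - alpha) * graph_scores[d]
             for d in doc_vecs}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]

# Toy usage: chunk "a" wins on embedding similarity, while "c" is
# boosted because it neighbors the graph-linked seed "b".
doc_vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
graph = {"b": ["c"]}
ranked = hybrid_retrieve([1.0, 0.0], doc_vecs, graph, seed_ids=["b"], top_k=2)
```

In a real system the vector channel would be served by an embedding index and the graph channel by entity links extracted from the parsed Markdown tables and charts; only the score-fusion step is shown here.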
Pages: 14