Multimodal retrieval-augmented generation for financial documents: image-centric analysis of charts and tables with large language models

Cited by: 0
Authors
Jiang, Cheng [1 ]
Zhang, Pengle [1 ,2 ]
Ni, Ying [3 ]
Wang, Xiaoli [3 ]
Peng, Hanghang [3 ]
Liu, Sen [4 ]
Fei, Mengdi [1 ]
He, Yuxin [1 ]
Xiao, Yaxuan [1 ]
Huang, Jin [1 ,2 ]
Ma, Xingyu [1 ]
Yang, Tian [1 ]
Affiliations
[1] Wuhan Text Univ, State Key Lab New Text Mat & Adv Proc Technol, Wuhan, Peoples R China
[2] Shanghai Jiao Tong Univ, Shanghai, Peoples R China
[3] Guotai Asset Management Co Ltd, Shanghai, Peoples R China
[4] Univ Penn, Philadelphia, PA USA
Keywords
Financial documents; Chart and table image data; Retrieval performance; Retrieval-augmented generation (RAG); Large language models (LLMs)
DOI
10.1007/s00371-025-03829-5
CLC number
TP31 [Computer software]
Discipline codes
081202; 0835
Abstract
In the financial domain, retrieval-augmented generation (RAG) enables large language models (LLMs) to leverage external financial documents for generation, which is crucial for financial decision-making. However, current RAG systems handle the visual data in financial documents inadequately and suffer from insufficient relevance between retrieval results and the corresponding queries. To address these issues, we present a novel approach that integrates multimodal RAG with LLMs to enhance financial document analysis, focusing in particular on the interpretation of tables and charts. We propose a method that converts chart and table image data into Markdown format for integration with textual data, enabling comprehensive parsing of the text, tables, and charts in financial documents. To improve retrieval performance, we adopt a hybrid retrieval method that combines a vector database with a graph database. In addition, we use LLMs for annotation and refine the annotations through careful human review to compile a rich financial dataset focused on table and chart images, intended to evaluate retrieval efficiency and generation quality in RAG. Analysis of both the retrieval and generation processes demonstrates the potential of this method to revolutionize financial data visualization and decision-making. More information and access to our code are available at our GitHub repository: https://github.com/ChengZ2003/multimodal_RAG.
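As a rough illustration of the hybrid retrieval idea mentioned in the abstract, the sketch below merges dense vector search over parsed Markdown chunks with one-hop expansion in a simple entity graph. It is a minimal sketch under stated assumptions: the function names, toy embeddings, and graph edges are invented for illustration and do not reflect the authors' actual implementation, which presumably uses production vector and graph databases.

```python
# Hypothetical sketch of hybrid retrieval: dense vector search plus graph expansion.
# All identifiers and toy data are illustrative assumptions, not the paper's code.
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def vector_search(query_vec, chunk_vecs, top_k=3):
    """Rank Markdown chunks (parsed text/tables/charts) by embedding similarity."""
    scored = [(cosine(query_vec, v), cid) for cid, v in chunk_vecs.items()]
    return [cid for _, cid in sorted(scored, reverse=True)[:top_k]]

def graph_expand(seed_chunks, edges, hops=1):
    """Add chunks linked to the seed hits in a simple entity/relation graph."""
    frontier, seen = set(seed_chunks), set(seed_chunks)
    for _ in range(hops):
        frontier = {nbr for c in frontier for nbr in edges.get(c, ())} - seen
        seen |= frontier
    return seen

def hybrid_retrieve(query_vec, chunk_vecs, edges, top_k=3):
    """Hybrid retrieval: dense hits first, then graph neighbors as extra context."""
    dense_hits = vector_search(query_vec, chunk_vecs, top_k)
    return graph_expand(dense_hits, edges)

# Toy example: three chunks with 2-d "embeddings" and one graph edge.
chunk_vecs = {"chart_q3_revenue": [0.9, 0.1], "table_costs": [0.2, 0.8], "text_outlook": [0.7, 0.3]}
edges = defaultdict(set, {"chart_q3_revenue": {"table_costs"}})
print(hybrid_retrieve([1.0, 0.0], chunk_vecs, edges, top_k=1))  # {'chart_q3_revenue', 'table_costs'}
```

In this toy run, the chart chunk wins the vector search and the graph edge pulls in the linked cost table, mimicking how graph-based context can surface related evidence that pure embedding similarity would miss.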
Pages: 14