A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1]
Zhao, Ruohan [1]
Li, Zhongyang [1]
Wang, Zheng [1]
Wu, Yizhen [1]
Gou, Junwei [1]
Zhu, Zhifeng [1]
Affiliations
[1] Huawei, Shanghai, People's Republic of China
Source
PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM(CUBE)A 2024 | 2024
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. Text samples produced by large unsupervised models are often of inadequate quality, which leads to poor results. The inadequacy arises from the model's limited capacity to emulate the underlying structure of the data without direct supervision, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates, which allows quick data generation and reduces token consumption. FDGS is robust, performing stably and reliably under various conditions, and it uses fewer tokens than traditional methods that work on a per-instance basis without templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, which makes it well suited to multi-modal data generation tasks. It relies on the basic functions of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.
Pages: 36-44
Page count: 9
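
Illustrative sketch (not from the paper)
The abstract describes clustering input types and handling each cluster through a template, instead of working per instance. The record gives no implementation details, so the following is only a minimal sketch of one possible reading: records are grouped by a structural signature (field names and value types), one prompt template is built per group, and the template is filled locally for each record. All names here (`signature`, `cluster_by_structure`, `build_template`, `build_prompts`) and the toy records are hypothetical, and no model is called.

```python
from collections import defaultdict


def signature(record):
    """Structural signature of a record: sorted (field name, value type) pairs."""
    return tuple(sorted((key, type(value).__name__) for key, value in record.items()))


def cluster_by_structure(records):
    """Group records that share the same signature (assumed clustering rule)."""
    clusters = defaultdict(list)
    for record in records:
        clusters[signature(record)].append(record)
    return dict(clusters)


def build_template(sig):
    """Build one prompt template per cluster, with a placeholder per field."""
    fields = ", ".join(f"{name}={{{name}}}" for name, _ in sig)
    return f"Write a question and its answer about an item with {fields}."


def build_prompts(records):
    """Return the per-cluster templates and the filled prompt for every record."""
    templates = {}
    prompts = []
    for sig, members in cluster_by_structure(records).items():
        templates[sig] = build_template(sig)
        # Filling the template is plain string formatting, done locally.
        prompts.extend(templates[sig].format(**member) for member in members)
    return templates, prompts


# Toy input: two records share one structure, the third has another.
records = [
    {"name": "widget", "size": 3},
    {"name": "gadget", "size": 5},
    {"title": "report", "pages": 9, "kind": "pdf"},
]

templates, prompts = build_prompts(records)
for prompt in prompts:
    print(prompt)
```

In this sketch the template is authored once per cluster rather than once per record, which is the kind of saving the abstract attributes to template-based generation. How FDGS actually forms clusters, writes templates, or runs its query-answer bidirectional step is not stated in this record.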