A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1 ]
Zhao, Ruohan [1 ]
Li, Zhongyang [1 ]
Wang, Zheng [1 ]
Wu, Yizhen [1 ]
Gou, Junwei [1 ]
Zhu, Zhifeng [1 ]
Affiliations
[1] Huawei, Shanghai, People's Republic of China
Source
PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM³A 2024 | 2024
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model
DOI
10.1145/3688866.3689127
CLC Classification
TP18 [Theory of Artificial Intelligence]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
With the rapid advancement of large language model technology, the data used to train these models has become increasingly important. Text samples produced by large models without supervision are often of inadequate quality: lacking direct supervision, a model cannot precisely reproduce the underlying structure of the data, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training-data generation for specific language generation tasks, this paper proposes a Fast Data Generation System (FDGS) that handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input types through shared templates, which speeds up generation and reduces token consumption. It is robust, delivering stable and reliable performance under varied conditions, and is more token-efficient than traditional per-instance methods that do not use templates. By abstracting and clustering different input types, FDGS can generate data from large models efficiently, and its adaptability makes it well suited to multi-modal data generation tasks. It relies only on the basic capabilities of general-purpose large language models and employs a query-answer bidirectional generation mechanism to achieve rapid data amplification.
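The paper itself publishes no code, so the following Python sketch is only an illustrative reconstruction, under assumptions, of the two mechanisms the abstract names: one prompt template shared per input-type cluster (the claimed token saving over per-instance prompting) and query-answer bidirectional generation (source to question, then question to answer). All identifiers here (TEMPLATES, call_llm, amplify) are hypothetical, not FDGS's actual API.

from collections import defaultdict

# Hypothetical templates: one shared prompt per input-type cluster, so the
# instructions are written once per cluster rather than once per instance.
TEMPLATES = {
    "table": "Given the table:\n{sample}\nWrite one question this table answers.",
    "text": "Given the passage:\n{sample}\nWrite one question this passage answers.",
}
ANSWER_TEMPLATE = (
    "Answer the question using only the source.\n"
    "Source:\n{sample}\nQuestion: {question}\nAnswer:"
)

def call_llm(prompt: str) -> str:
    """Stand-in for any general-purpose LLM API; replace with a real client."""
    raise NotImplementedError("plug in an LLM client here")

def amplify(samples):
    """Cluster (modality, sample) pairs by modality, then run the
    bidirectional step per cluster: source -> question, question -> answer."""
    clusters = defaultdict(list)
    for modality, sample in samples:
        clusters[modality].append(sample)

    pairs = []
    for modality, items in clusters.items():
        template = TEMPLATES[modality]  # one shared template per cluster
        for sample in items:
            question = call_llm(template.format(sample=sample))
            answer = call_llm(ANSWER_TEMPLATE.format(sample=sample, question=question))
            pairs.append({"modality": modality, "question": question,
                          "answer": answer, "source": sample})
    return pairs

Each source sample thus yields a (question, answer) training pair at the cost of two model calls plus one short template per cluster, rather than a hand-written prompt per instance.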
Pages: 36-44
Number of pages: 9