A Method for Efficient Structured Data Generation with Large Language Models

Times cited: 0

Authors:

Hou, Zongzhi [1]

Zhao, Ruohan [1]

Li, Zhongyang [1]

Wang, Zheng [1]

Wu, Yizhen [1]

Gou, Junwei [1]

Zhu, Zhifeng [1]

Affiliation:

[1] Huawei, Shanghai, People's Republic of China

Source:

PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2024 | 2024

Keywords:

Multi-modality; Data Generation; Artificial Intelligence; Large Language Model

DOI:

10.1145/3688866.3689127

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

With the rapid advancement of large language models, the data used to train them has become increasingly important. Text samples produced by large unsupervised models are often of inadequate quality: without direct supervision, a model has limited capacity to reproduce the underlying structure of the data, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training-data generation for specific language generation tasks, this paper proposes the Fast Data Generation System (FDGS), which handles multi-modal and structured data generation. FDGS uses clustering abstraction to handle multiple input data types through templates, which allows quick data generation and reduces consumption. FDGS is robust, giving stable and reliable performance under various conditions, and it is more cost-effective in token usage than traditional methods that work on a per-instance basis and do not use templates. By abstracting and clustering different input types, FDGS can efficiently generate data from large models. The system is highly adaptable, which makes it well suited to multi-modal data generation tasks. It relies on the basic functions of general large language models and employs a query-answer bidirectional generation mechanism to achieve fast data amplification.
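Illustrative note (not part of the paper): the abstract does not give implementation details, so the following Python sketch only shows the general idea of clustering inputs by type and reusing one template per cluster instead of making one model call per instance. Everything in it is an assumption made for illustration: the field-name signature used as the cluster key, the hypothetical `draft_template` stand-in for a model call, and the sample records. The query-answer bidirectional mechanism is not covered.

```python
from collections import defaultdict
from string import Template


def signature(record):
    """Cluster key (assumed): the sorted field names of a record."""
    return tuple(sorted(record))


def cluster_records(records):
    """Group records that share the same set of field names."""
    clusters = defaultdict(list)
    for record in records:
        clusters[signature(record)].append(record)
    return dict(clusters)


def draft_template(fields, call_log):
    """Hypothetical stand-in for ONE model call per cluster.

    A real system would ask a language model to write the template;
    here it is built deterministically so the sketch stays runnable.
    """
    call_log.append(fields)
    return Template("; ".join(f"{name}: ${name}" for name in fields))


def generate(records):
    """Fill one template per cluster; return the texts and the call count."""
    call_log = []
    outputs = []
    for fields, members in cluster_records(records).items():
        template = draft_template(fields, call_log)
        outputs.extend(template.substitute(member) for member in members)
    return outputs, len(call_log)


if __name__ == "__main__":
    sample = [
        {"id": 1, "label": "a"},
        {"id": 2, "label": "b"},
        {"title": "FDGS", "year": 2024},
    ]
    texts, calls = generate(sample)
    print(calls, texts)
```

In this sketch three records fall into two clusters, so the stand-in is invoked twice rather than three times; the saving grows with the number of records per cluster, which is the kind of per-instance cost the abstract contrasts against.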


Pages: 36-44

Page count: 9
