A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1 ]
Zhao, Ruohan [1 ]
Li, Zhongyang [1 ]
Wang, Zheng [1 ]
Wu, Yizhen [1 ]
Gou, Junwei [1 ]
Zhu, Zhifeng [1 ]
Affiliations
[1] Huawei, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM³A 2024 | 2024
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model;
DOI
10.1145/3688866.3689127
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid advancement of large language models, the data used to train them has become increasingly important. Text samples generated by large models without supervision are often of inadequate quality: lacking direct supervision, a model cannot precisely reproduce the underlying structure of the data, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training-data generation for specific language generation tasks, this paper proposes a Fast Data Generation System (FDGS) that handles multi-modal and structured data generation. FDGS clusters and abstracts multiple input data types into templates, enabling rapid data generation while reducing resource consumption. It is robust, delivering stable and reliable performance under varied conditions, and is more token-efficient than traditional per-instance methods that do not use templates. Building only on the basic capabilities of general-purpose large language models, FDGS employs a query-answer bidirectional generation mechanism to achieve fast data amplification, making it highly adaptable to multi-modal data generation tasks.
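The paper itself ships no code, so the following is a minimal Python sketch of the two mechanisms the abstract names: cluster abstraction into shared templates and query-answer bidirectional amplification. Every identifier here (signature, make_template, amplify, qa_pair_prompts) and the field-layout clustering heuristic are illustrative assumptions, not the authors' implementation; the llm argument stands in for any LLM completion API.

```python
# A minimal sketch (not the authors' code) of FDGS-style template-based
# generation: cluster structured samples by layout, abstract each cluster
# into one prompt template, and amortize the prompt's token cost over many
# generated records. All names and heuristics here are assumptions.
import json
from collections import defaultdict
from typing import Callable

def signature(sample: dict) -> tuple:
    # Assumed clustering heuristic: records sharing the same field layout
    # fall into the same cluster and share one template.
    return tuple(sorted(sample))

def make_template(fields: tuple, exemplars: list[dict], n: int) -> str:
    # One template per cluster, versus one prompt per instance in
    # traditional pipelines: the instruction tokens are paid once.
    shown = "\n".join(json.dumps(e, ensure_ascii=False) for e in exemplars[:3])
    return (
        f"Each line below is a JSON record with fields {', '.join(fields)}:\n"
        f"{shown}\n"
        f"Generate {n} new, diverse records in the same format, one per line."
    )

def amplify(samples: list[dict], llm: Callable[[str], str],
            per_cluster: int = 20) -> list[dict]:
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for s in samples:
        clusters[signature(s)].append(s)
    out = []
    for fields, members in clusters.items():
        raw = llm(make_template(fields, members, per_cluster))
        for line in raw.splitlines():
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # discard malformed lines
            if isinstance(rec, dict) and tuple(sorted(rec)) == fields:
                out.append(rec)  # keep only schema-conformant rows
    return out

def qa_pair_prompts(record: dict) -> tuple[str, str]:
    # Sketch of the query-answer bidirectional idea: from one seed record,
    # ask the model both for a query given the answer and for an answer
    # given the query, roughly doubling the yield per seed.
    body = json.dumps(record, ensure_ascii=False)
    return (f"Write a user query that this record answers: {body}",
            f"Write the answer this record implies for its query: {body}")
```

Because one template prompt serves an entire cluster, its fixed instruction tokens are amortized over per_cluster generated records, which is the source of the token-efficiency claim over per-instance prompting.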
Pages: 36 - 44
Number of pages: 9