A Method for Efficient Structured Data Generation with Large Language Models

Cited by: 0
Authors
Hou, Zongzhi [1 ]
Zhao, Ruohan [1 ]
Li, Zhongyang [1 ]
Wang, Zheng [1 ]
Wu, Yizhen [1 ]
Gou, Junwei [1 ]
Zhu, Zhifeng [1 ]
Affiliations
[1] Huawei, Shanghai, Peoples R China
Source
PROCEEDINGS OF THE 2ND WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM³A 2024 | 2024
Keywords
Multi-modality; Data Generation; Artificial Intelligence; Large Language Model;
DOI
10.1145/3688866.3689127
CLC Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the rapid advancement of large language models, the data used to train them has become increasingly important. Text samples generated by large models without supervision are often of inadequate quality: lacking direct supervision, a model cannot precisely reproduce the underlying structure of the data, so its outputs may lack fidelity and relevance to the authentic data distribution. To overcome these shortcomings in training-data generation for specific language generation tasks, this paper proposes a Fast Data Generation System (FDGS) that handles multi-modal and structured data generation. FDGS clusters and abstracts multiple input data types into templates, enabling rapid data generation while reducing resource consumption. It is robust, delivering stable and reliable performance under varied conditions, and is more token-efficient than traditional per-instance methods that do not use templates. Building only on the basic capabilities of general-purpose large language models, FDGS employs a query-answer bidirectional generation mechanism to achieve fast data amplification, making it highly adaptable to multi-modal data generation tasks.
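The paper itself ships no code, so the following is a minimal Python sketch of the two mechanisms the abstract names: cluster abstraction into shared templates and query-answer bidirectional amplification. Every identifier here (signature, make_template, amplify, qa_pair_prompts) and the field-layout clustering heuristic are illustrative assumptions, not the authors' implementation; the llm argument stands in for any LLM completion API.

```python
# A minimal sketch (not the authors' code) of FDGS-style template-based
# generation: cluster structured samples by layout, abstract each cluster
# into one prompt template, and amortize the prompt's token cost over many
# generated records. All names and heuristics here are assumptions.
import json
from collections import defaultdict
from typing import Callable

def signature(sample: dict) -> tuple:
    # Assumed clustering heuristic: records sharing the same field layout
    # fall into the same cluster and share one template.
    return tuple(sorted(sample))

def make_template(fields: tuple, exemplars: list[dict], n: int) -> str:
    # One template per cluster, versus one prompt per instance in
    # traditional pipelines: the instruction tokens are paid once.
    shown = "\n".join(json.dumps(e, ensure_ascii=False) for e in exemplars[:3])
    return (
        f"Each line below is a JSON record with fields {', '.join(fields)}:\n"
        f"{shown}\n"
        f"Generate {n} new, diverse records in the same format, one per line."
    )

def amplify(samples: list[dict], llm: Callable[[str], str],
            per_cluster: int = 20) -> list[dict]:
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for s in samples:
        clusters[signature(s)].append(s)
    out = []
    for fields, members in clusters.items():
        raw = llm(make_template(fields, members, per_cluster))
        for line in raw.splitlines():
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # discard malformed lines
            if isinstance(rec, dict) and tuple(sorted(rec)) == fields:
                out.append(rec)  # keep only schema-conformant rows
    return out

def qa_pair_prompts(record: dict) -> tuple[str, str]:
    # Sketch of the query-answer bidirectional idea: from one seed record,
    # ask the model both for a query given the answer and for an answer
    # given the query, roughly doubling the yield per seed.
    body = json.dumps(record, ensure_ascii=False)
    return (f"Write a user query that this record answers: {body}",
            f"Write the answer this record implies for its query: {body}")
```

Because one template prompt serves an entire cluster, its fixed instruction tokens are amortized over per_cluster generated records, which is the source of the token-efficiency claim over per-instance prompting.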
Pages: 36 - 44
Number of pages: 9