Improving Text Classification with Large Language Model-Based Data Augmentation

Times Cited: 11
Authors
Zhao, Huanhuan [1 ]
Chen, Haihua [2 ]
Ruggles, Thomas A. [3 ]
Feng, Yunhe [4 ]
Singh, Debjani [3 ]
Yoon, Hong-Jun [5 ]
Affiliations
[1] Univ Tennessee, Data Sci & Engn, Knoxville, TN 37996 USA
[2] Univ North Texas, Dept Informat Sci, Denton, TX 76203 USA
[3] Oak Ridge Natl Lab, Environm Sci Div, Oak Ridge, TN 37830 USA
[4] Univ North Texas, Computat Sci & Engn, Denton, TX 76203 USA
[5] Oak Ridge Natl Lab, Computat Sci & Engn Div, Oak Ridge, TN 37830 USA
Keywords
data augmentation; large language model; ChatGPT; imbalanced data; text classification; natural language processing; machine learning; artificial intelligence;
DOI
10.3390/electronics13132535
Chinese Library Classification
TP [Automation Technology; Computer Technology]
Chinese Discipline Code
0812
Abstract
Large Language Models (LLMs) such as ChatGPT possess advanced capabilities in understanding and generating text. These capabilities enable ChatGPT to create text based on specific instructions, which can serve as augmented data for text classification tasks. Previous studies have approached data augmentation (DA) either by rewriting the existing dataset with ChatGPT or by generating entirely new data from scratch. However, without a direct comparison, it is unclear which method is more effective. This study applies both methods to two datasets: a general-topic dataset (Reuters news data) and a domain-specific dataset (Mitigation dataset). Our findings indicate that: 1. New data generated by ChatGPT consistently enhanced the model's classification results on both datasets. 2. Generating new data generally outperforms rewriting existing data, though prompts must be crafted carefully to extract the most valuable information from ChatGPT, particularly for domain-specific data. 3. The size of the augmentation data affects the effectiveness of DA; however, we observed a plateau after incorporating 10 samples. 4. Combining rewritten samples with newly generated samples can further improve the model's performance.
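The two augmentation strategies compared in the abstract (rewriting existing samples vs. generating new ones from scratch) can be sketched as prompt templates around an LLM call. This is an illustrative sketch only, not the paper's implementation; the prompt wording, the `call_llm` stand-in, and the `augment` helper are all assumptions introduced here for clarity.

```python
# Illustrative sketch of LLM-based data augmentation for text classification.
# `call_llm` is a hypothetical placeholder for a ChatGPT API call; the prompt
# templates are assumed, not taken from the paper.

REWRITE_PROMPT = (
    "Rewrite the following text while preserving its meaning and its "
    "class label '{label}':\n{text}"
)
GENERATE_PROMPT = (
    "Write a new, realistic example of a document belonging to the "
    "class '{label}'."
)

def call_llm(prompt: str) -> str:
    # Placeholder: in practice this would query an LLM API and return
    # the generated text.
    return f"<LLM output for: {prompt[:40]}...>"

def augment(samples: list[str], label: str, n_new: int = 10) -> list[str]:
    """Augment a class with rewritten and newly generated samples.

    Defaults to 10 new samples per class, since the abstract reports a
    plateau in gains after incorporating about 10 samples.
    """
    # Strategy 1: rewrite each existing sample (finding 1/2).
    rewritten = [call_llm(REWRITE_PROMPT.format(label=label, text=t))
                 for t in samples]
    # Strategy 2: generate entirely new samples from scratch (finding 2).
    generated = [call_llm(GENERATE_PROMPT.format(label=label))
                 for _ in range(n_new)]
    # Finding 4: combining both kinds of augmented data can help further.
    return samples + rewritten + generated

aug = augment(["Oil prices rose sharply on Monday."], label="crude")
print(len(aug))  # 1 original + 1 rewritten + 10 generated = 12
```

The augmented list would then be fed to an ordinary text classifier's training set; only the prompt design changes between the general-topic and domain-specific settings.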
Pages: 14