Data augmentation strategies to improve text classification: a use case in smart cities

Cited by: 0
Authors
Bencke, Luciana [1]
Moreira, Viviane Pereira [1]
Affiliations
[1] Federal University of Rio Grande do Sul (UFRGS), Institute of Informatics, Porto Alegre, RS, Brazil
Keywords
Data augmentation; Text classification; Low-resources; Smart cities
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Text classification is a common and important task in Natural Language Processing. In many domains and real-world settings, a few labeled instances are the only resource available to train classifiers. Models trained on small datasets tend to overfit and produce inaccurate results. Data augmentation (DA) techniques offer a way to mitigate this problem: DA generates synthetic instances that can be fed to the classification algorithm during training. In this article, we explore a variety of DA methods, including back translation, paraphrasing, and text generation. We assess the impact of the DA methods in simulated low-data scenarios using well-known public datasets in English, with classifiers built by fine-tuning BERT models. We describe how to adapt these DA methods to augment a small Portuguese dataset containing tweets labeled with smart city dimensions (e.g., transportation, energy, water). Our experiments showed that some classes were noticeably improved by DA, with an improvement of 43% in terms of F1 compared to the baseline with no augmentation. In a qualitative analysis, we observed that the DA methods were able to preserve the label but failed to preserve the semantics in some cases, and that generative models were able to produce high-quality synthetic instances.
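As an illustration only, the general workflow the abstract describes (simulate a low-data scenario, then add label-preserving synthetic instances before training) might be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names `sample_low_data`, `augment`, and `toy_transform` are hypothetical, and `toy_transform` is a trivial word swap standing in for a real back-translation, paraphrasing, or text-generation model.

```python
import random
from collections import defaultdict
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def sample_low_data(dataset: List[Example], per_class: int, seed: int = 0) -> List[Example]:
    """Keep at most `per_class` examples per label to simulate a low-data scenario."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in dataset:
        by_label[label].append(text)
    subset = []
    for label in sorted(by_label):  # sorted for a reproducible order
        texts = by_label[label]
        chosen = rng.sample(texts, min(per_class, len(texts)))
        subset.extend((text, label) for text in chosen)
    return subset


def augment(dataset: List[Example], transform: Callable[[str], str], copies: int = 1) -> List[Example]:
    """Return the originals plus synthetic variants; each variant inherits its source label."""
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies):
            synthetic = transform(text)
            if synthetic != text:  # skip no-op transformations
                augmented.append((synthetic, label))
    return augmented


def toy_transform(text: str) -> str:
    """Hypothetical placeholder for a DA method: swaps the first two words."""
    words = text.split()
    if len(words) < 2:
        return text
    words[0], words[1] = words[1], words[0]
    return " ".join(words)


# Usage: subsample a toy labeled set, then augment it before training a classifier.
toy_data = [
    ("bus line delayed", "transportation"),
    ("metro station closed", "transportation"),
    ("power outage downtown", "energy"),
    ("water pipe burst", "water"),
]
small_train = sample_low_data(toy_data, per_class=1)
augmented_train = augment(small_train, toy_transform)
```

Assigning the source label to every synthetic instance assumes the transformation is label-preserving; the abstract notes that semantics were not always preserved, so a real pipeline would likely need some quality check on the generated text.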
Pages: 36