Fine-Tuning Transformer-Based Representations in Active Learning for Labelling Crisis Dataset of Tweets

Cited by: 0
Authors
Paul N.R. [1]
Balabantaray R.C. [1]
Sahoo D. [2]
Affiliations
[1] Department of Computer Science and Engineering, IIIT Bhubaneswar, Gothapatna, Odisha, Bhubaneswar
[2] Department of Faculty of Emerging Technologies, Sri Sri University, Godisahi, Odisha, Cuttack
Keywords
Active learning; BERT; Transformer; Tweet labelling; Word embeddings
DOI
10.1007/s42979-023-02061-z
Abstract
Supervised machine learning models are generally used to classify crisis-related tweets, and training them requires a labelled tweet dataset. Manually labelling large quantities of text is time-consuming and costly. Active learning reduces this effort by making it possible to exploit large volumes of unlabelled data without labelling them in full. The representation strategy used for tweets during active learning has a substantial impact on the effectiveness of the process. Representations such as Bag-of-Words and those based on pre-trained word embeddings such as GloVe have been used in active learning and have proven effective for representing tweets. Pre-trained transformer-based models such as BERT, XLNet, and GPT-2 are now prevalent in natural language processing tasks. These models can also produce embeddings of tweets, but they have not yet been fully explored as an alternative to the other embeddings used in active learning. This work offers a comprehensive evaluation of the usefulness of representations based on transformer-based language models for active learning. It demonstrates that transformer-based models, particularly BERT-like models, which have yet to be widely used in active learning, outperform more commonly used vector representations such as Bag-of-Words and traditional word embeddings such as GloVe. It also compares representations based on the “[CLS]” token with aggregated representations generated using BERT-like models, and investigates the effectiveness of representations from different BERT variants such as DistilBERT, RoBERTa, and ALBERT. Finally, we propose a method called adaptive fine-tuning active learning, which fine-tunes the representations produced by BERT-like models during the active learning process. The results show that the limited label information gained through active learning can be used not only to train a classifier but also to adaptively improve the embeddings produced by BERT-like transformer-based language models.
© 2023, The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd.
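As a rough illustration of two ideas the abstract mentions, the sketch below contrasts a "[CLS]"-style representation with an aggregated one and then picks the pool items to send for labelling. The abstract does not say which aggregation or query strategy the authors use, so mean pooling and least-confidence sampling are assumptions here, and the function names and toy numbers are hypothetical, not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def cls_representation(token_vectors: Sequence[Vector]) -> Vector:
    """Take the vector at position 0, where BERT-like models place [CLS]."""
    return list(token_vectors[0])


def mean_pooled_representation(token_vectors: Sequence[Vector]) -> Vector:
    """Aggregate by averaging all token vectors dimension by dimension.

    Mean pooling is an assumed aggregation; the abstract does not name one.
    """
    n = len(token_vectors)
    return [sum(dim) / n for dim in zip(*token_vectors)]


def least_confident(probabilities: Sequence[float], k: int) -> List[int]:
    """Return indices of the k pool items whose predicted positive-class
    probability is closest to 0.5 (assumed query strategy, binary case)."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return ranked[:k]


# Toy per-token vectors for one tweet; row 0 stands in for the [CLS] token.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
cls_vec = cls_representation(tokens)
mean_vec = mean_pooled_representation(tokens)

# Hypothetical classifier probabilities for four unlabelled tweets.
pool_probs = [0.95, 0.52, 0.10, 0.45]
query = least_confident(pool_probs, 2)

print(cls_vec, mean_vec, query)
```

In a real pipeline the token vectors would come from a pre-trained model and the probabilities from a classifier trained on the tweets labelled so far. The selected tweets would then be labelled, added to the training set, and the loop repeated.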