Domain Effect Investigation for BERT Models Fine-Tuned on Different Text Categorization Tasks

Cited by: 4
Authors
Coban, Onder [1 ]
Yaganoglu, Mete [1 ]
Bozkurt, Ferhat [1 ]
Affiliation
[1] Ataturk University, Department of Computer Engineering, Faculty of Engineering, Erzurum, Türkiye
Keywords
User-generated content; Text categorization; Deep learning; Bidirectional Encoder Representations from Transformers
DOI
10.1007/s13369-023-08142-8
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
Text categorization (TC) is one of the most useful automatic tools for organizing huge volumes of text data. It is widely used by practitioners to classify texts for different purposes, including sentiment analysis, authorship detection, spam detection, and so on. However, applying TC to a new field can be challenging, since it requires training a separate model on a large labeled data set specific to that field. This is very time-consuming, and creating a large, labeled, domain-specific data set is often very hard. To overcome this problem, language models have recently been employed to transfer information learned from large corpora to downstream tasks. Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular language models and has been shown to provide very good results on TC tasks. Hence, in this study, we use four pretrained BERT models trained on formal text data as well as our own BERT models trained on Facebook messages. We then fine-tune these BERT models on downstream data sets collected from different domains, such as Twitter and Instagram. We aim to investigate whether fine-tuned BERT models can provide satisfactory results on downstream tasks from different domains via transfer learning. The results of our extensive experiments show that BERT models provide very satisfactory results and that selecting both the BERT model and the downstream task's data from the same or a similar domain tends to further improve performance. This shows that a well-trained language model can remove the need for a separate training process for each downstream TC task within the online social network (OSN) domain.
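The abstract describes fine-tuning pretrained BERT models on domain-specific text categorization data sets. As a concrete illustration of that general workflow, the following is a minimal sketch using the Hugging Face Transformers library; the checkpoint name, toy data, and hyperparameters are illustrative assumptions and are not values reported in the paper.

# Minimal sketch (not the authors' exact pipeline): fine-tune a pretrained BERT
# model for binary text categorization with Hugging Face Transformers.
# The checkpoint, toy data set, and hyperparameters below are assumptions.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

# Hypothetical downstream data set, e.g., sentiment-labeled social media posts.
train_data = Dataset.from_dict({
    "text": ["great product, highly recommend", "terrible experience, never again"],
    "label": [1, 0],
})

checkpoint = "bert-base-uncased"  # assumed; the paper also uses domain-specific BERT models
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

def tokenize(batch):
    # Truncate and pad each post to a fixed sequence length for batching.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-tc-finetune",
    num_train_epochs=3,               # illustrative; tune per downstream task
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

# Fine-tune all BERT weights together with a freshly initialized classification head.
Trainer(model=model, args=args, train_dataset=train_data).train()

In this transfer-learning setup, the pretrained encoder supplies the language knowledge learned from a large corpus, and only the comparatively small labeled downstream data set is needed to adapt it to the target domain.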
Pages: 3685-3702
Number of pages: 18