A review of semi-supervised learning for text classification

Cited by: 53
Authors
Duarte, Jose Marcio [1]
Berton, Lilian [1]
Affiliations
[1] Univ Fed Sao Paulo, Sci & Technol Dept, Cesare Mansueto Giulio Lattes Ave 1201, BR-12247014 Sao Jose Dos Campos, SP, Brazil
Keywords
Natural language processing; Text classification; Machine learning; Semi-supervised learning; Sentiment analysis; Information; Selection
DOI
10.1007/s10462-023-10393-8
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A huge amount of data is generated daily, leading to big data challenges. One of them is related to text mining, especially text classification. Performing this task usually requires a large set of labeled data, which can be expensive, time-consuming, or difficult to obtain. In this scenario, semi-supervised learning (SSL), the branch of machine learning concerned with using both labeled and unlabeled data, has expanded in volume and scope. Since no recent survey gives an overview of how SSL has been used in text classification, we aim to fill this gap and present an up-to-date review of SSL for text classification. We retrieved 1794 works from the last 5 years from IEEE Xplore, ACM Digital Library, Science Direct, and Springer, and selected 157 articles for inclusion in this review. We present the application domains, datasets, and languages employed in the works, as well as the text representations and machine learning algorithms. We also summarize and organize the works following a recent taxonomy of SSL. We analyze the percentage of labeled data used, the evaluation metrics, and the results obtained. Lastly, we present some limitations and future trends in the area. We aim to provide researchers and practitioners with an outline of the area as well as useful information for their current research.
Pages: 9401-9469
Page count: 69