Category-aware self-training for extremely weakly supervised text classification

Times cited: 0
Author
Su, Jing [1 ,2 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Comp Sci, Xian, Peoples R China
[2] Northwestern Polytech Univ, Key Lab Big Data Storage & Management, Minist Ind & Informat Technol, Xian, Shaanxi, Peoples R China
Keywords
Text classification; Extremely weak supervised learning; Self-training; Category-aware;
DOI
10.1016/j.eswa.2025.126431
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Text classification under extremely weak supervision commonly follows a two-phase pipeline: pseudo-labeled data are first generated, and a self-training phase then refines the model. In the pseudo-labeling phase, initial pseudo-labels are typically obtained from pre-trained models or existing unsupervised methods. Because the resulting pseudo-labeled training set still contains noise, the self-training process plays a key role in determining final performance. However, we observed that during self-training many existing approaches struggle to capture the key information in documents without any guidance, and consequently acquire erroneous knowledge. This paper therefore proposes a category-aware self-training process that guides the model to purposefully acquire key information from documents. Specifically, rather than merely using text-based word-vector features as previous models do, it takes the category perspective, directing the model to learn both category-related word-vector features and statistical attributes. This allows the model to refit on these coarse-grained features together with label information, giving it more opportunities to locate key information in documents. Furthermore, applying Gaussian processing to the intermediate features further enhances the model's robustness. Finally, we conducted extensive experiments on multiple publicly available datasets. The results demonstrate that the proposed method outperforms existing approaches for extremely weakly supervised text classification.
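The pipeline the abstract describes can be illustrated in a few lines. The following is a minimal, hypothetical Python sketch of one self-training round: unlabeled documents are scored, only high-confidence pseudo-labels are kept for refitting, and Gaussian noise is applied to the (here, raw) features to stand in for the paper's Gaussian processing of intermediate features. The scorer, threshold, and noise scale are illustrative stand-ins, not the paper's actual implementation.

```python
import random

random.seed(0)  # deterministic noise for reproducibility

def gaussian_perturb(features, sigma=0.1):
    """Add Gaussian noise to a feature vector (an assumed stand-in for
    the paper's Gaussian processing of intermediate features)."""
    return [f + random.gauss(0.0, sigma) for f in features]

def self_training_round(model_score, unlabeled, threshold=0.9):
    """One self-training round: pseudo-label unlabeled documents and
    keep only predictions whose confidence clears the threshold.
    `model_score` maps a feature vector to (label, confidence)."""
    pseudo = []
    for x in unlabeled:
        label, conf = model_score(gaussian_perturb(x))
        if conf >= threshold:
            pseudo.append((x, label))
    return pseudo  # confident pseudo-labeled pairs for refitting

# Toy scorer: label by the sign of the mean feature value,
# confidence by its magnitude (clipped to 1.0).
def toy_scorer(x):
    m = sum(x) / len(x)
    return (1 if m > 0 else 0), min(1.0, abs(m))

docs = [[2.0, 1.5], [-1.8, -2.2], [0.05, -0.02]]
confident = self_training_round(toy_scorer, docs)
# The two clearly polarized documents survive; the ambiguous third
# is filtered out rather than injected as noisy supervision.
```

In a real system the scorer would be the classifier being trained, and each round would refit it on the surviving pseudo-labeled pairs before re-scoring the pool.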
Pages: 13