Effective Text Classification Through Supervised Rough Set-Based Term Weighting

Cited by: 0
Authors
Cekik, Rasim [1]
Affiliations
[1] Sirnak Univ, Fac Engn, Dept Comp Engn, TR-73000 Sirnak, Turkiye
Source
SYMMETRY-BASEL | 2025 / Vol. 17 / Issue 01
Keywords
text classification; term weighting; rough set; supervised learning; natural language processing;
DOI
10.3390/sym17010090
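Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code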
07 ; 0710 ; 09 ;
Abstract
This research presents an innovative approach to text mining based on rough set theory. The study uses the concept of symmetry from rough set theory to construct indiscernibility matrices and to model uncertainty in data analysis, so that both the methodological structure and the solution process remain symmetric. The effective management and analysis of large-scale textual data relies heavily on automated text classification, in which term weighting plays a crucial role in determining classification performance. In particular, supervised term weighting methods that use class information have emerged as the most effective approaches. However, the optimal representation of class-term relationships remains an area requiring further research. This study proposes the Rough Multivariate Weighting Scheme (RMWS) and its mathematical derivative, the Square Root Rough Multivariate Weighting Scheme (SRMWS). The RMWS model employs rough sets to identify information-carrying documents within the document-term-class space and adopts a computation that incorporates alpha, beta, and gamma coefficients. It also effectively reveals how a term is distributed among classes. Comprehensive experiments were conducted on three datasets with imbalanced-multiclass, balanced-multiclass, and imbalanced-binary class structures to evaluate the model's effectiveness. The results show that RMWS and SRMWS outperform existing approaches on both balanced and imbalanced datasets and are not affected by class imbalance or the number of classes. Furthermore, SRMWS is found to be the most effective method for SVM and KNN classifiers, while RMWS achieves the best results for NB classifiers. These results indicate that the proposed methods significantly improve text classification performance.
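The abstract does not give the RMWS formula, so the following is only a minimal sketch of the standard rough set notions it builds on, not the paper's method. It assumes documents are indiscernible when they agree on whether a term occurs, and computes the lower and upper approximations of a class under that relation. The function name `approximations` and the toy data are hypothetical.

```python
from collections import defaultdict


def approximations(docs, labels, term, target_class):
    """Return (lower, upper) approximations of target_class as sets of doc indices.

    Assumption: two documents are indiscernible when they agree on
    whether `term` occurs in them. This is an illustration of rough set
    approximations, not the RMWS weighting scheme itself.
    """
    # Partition document indices into equivalence classes by term presence.
    blocks = defaultdict(set)
    for i, tokens in enumerate(docs):
        blocks[term in tokens].add(i)

    # Indices of the documents that belong to the target class.
    target = {i for i, y in enumerate(labels) if y == target_class}

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # block lies entirely inside the class
            lower |= block
        if block & target:    # block overlaps the class
            upper |= block
    return lower, upper


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "goal"}]
labels = ["sport", "sport", "politics", "politics"]

lower, upper = approximations(docs, labels, "match", "sport")
print(lower, upper)
```

Under this assumption, a term whose lower approximation is non-empty picks out documents that belong to the class with certainty, which is one plausible reading of "information-carrying documents"; how RMWS combines such sets with the alpha, beta, and gamma coefficients is not stated in the abstract.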
Pages: 29
Related papers
31 in total
[1]  
ASUNCION A., 2007, UCI MACHINE LEARNING
[2]  
Bojanowski P, 2017, T ASSOC COMPUT LING, V5, P135, DOI 10.1162/tacl_a_00051
[3]   YAKE! Keyword extraction from single documents using multiple local features [J].
Campos, Ricardo ;
Mangaravite, Vitor ;
Pasquali, Arian ;
Jorge, Alipio ;
Nunes, Celia ;
Jatowt, Adam .
INFORMATION SCIENCES, 2020, 509 :257-289
[4]  
Çekik R, 2023, Gazi University Journal of Science Part A Engineering and Innovation, V10, P472, DOI 10.54287/gujsa.1379024
[5]   A new metric for feature selection on short text datasets [J].
Cekik, Rasim ;
Uysal, Alper Kursat .
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2022, 34 (13)
[6]   A novel filter feature selection method using rough set for short text data [J].
Cekik, Rasim ;
Uysal, Alper Kursat .
EXPERT SYSTEMS WITH APPLICATIONS, 2020, 160
[7]   A new classification method based on rough sets theory [J].
Cekik, Rasim ;
Telceken, Sedat .
SOFT COMPUTING, 2018, 22 (06) :1881-1889
[8]   Turning from TF-IDF to TF-IGM for term weighting in text classification [J].
Chen, Kewen ;
Zhang, Zuping ;
Long, Jun ;
Zhang, Hao .
EXPERT SYSTEMS WITH APPLICATIONS, 2016, 66 :245-260
[9]   Context-Aware Term Weighting For First Stage Passage Retrieval [J].
Dai, Zhuyun ;
Callan, Jamie .
PROCEEDINGS OF THE 43RD INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '20), 2020, :1533-1536
[10]  
Debole F, 2004, STUD FUZZ SOFT COMP, V138, P81