A high-quality feature selection method based on frequent and correlated items for text classification

Cited by: 66
Authors
Farghaly, Heba Mamdouh [1]
Abd El-Hafeez, Tarek [1, 2]
Affiliations
[1] Minia Univ, Fac Sci, Dept Comp Sci, El Minia, Egypt
[2] Deraya Univ, Comp Sci Unit, El Minia, Egypt
Keywords
Feature selection; Dimensionality reduction; Text classification; Association rule mining; Feature interaction
DOI
10.1007/s00500-023-08587-x
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection is a significant challenge in pattern recognition, especially for classification tasks. The quality of the selected features is critical to building effective models, and poor-quality data makes selection harder. This work explores association analysis from data mining as a way to select meaningful features while addressing duplicated information among the selected features. It proposes a novel feature selection technique for text classification based on frequent and correlated items. The method considers both relevance and feature interactions, using association as a metric to evaluate the relationship between the target and the features. The technique was tested on the SMS Spam Collection dataset from the UCI Machine Learning Repository and compared with well-known feature selection methods. The results showed that it effectively reduced redundant information while achieving high accuracy (95.155%) using only 6% of the features.
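The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea, not the authors' algorithm. It assumes each message is a set of words, keeps a word when the rule "word implies class" is both frequent (support) and reliable (confidence), and then drops a word whose occurrences largely coincide with an already selected word. The function name `select_features` and the thresholds `min_support`, `min_confidence`, and `max_overlap` are hypothetical and chosen for illustration only.

```python
from collections import Counter


def select_features(docs, labels, min_support=0.3, min_confidence=0.8,
                    max_overlap=0.8):
    """Toy association-based feature selection (illustrative assumption)."""
    n = len(docs)
    # Map each term to the set of document indices that contain it.
    postings = {}
    for i, doc in enumerate(docs):
        for term in set(doc.lower().split()):
            postings.setdefault(term, set()).add(i)

    # Relevance step: keep terms whose rule term -> class is frequent
    # (support) and reliable (confidence).
    candidates = []
    for term, ids in postings.items():
        support = len(ids) / n
        if support < min_support:
            continue
        best_class, hits = Counter(labels[i] for i in ids).most_common(1)[0]
        confidence = hits / len(ids)
        if confidence >= min_confidence:
            candidates.append((confidence, support, term, best_class))

    # Strongest rules first; ties are broken alphabetically for determinism.
    candidates.sort(key=lambda c: (-c[0], -c[1], c[2]))

    # Redundancy step: skip a term whose documents overlap heavily
    # (Jaccard similarity) with those of an already selected term.
    selected = []
    for _, _, term, cls in candidates:
        ids = postings[term]
        redundant = any(
            len(ids & postings[kept]) / len(ids | postings[kept]) >= max_overlap
            for kept, _ in selected
        )
        if not redundant:
            selected.append((term, cls))
    return selected


if __name__ == "__main__":
    # Made-up messages, not taken from the SMS Spam Collection dataset.
    docs = [
        "free prize call now",
        "free prize claim now",
        "win free prize",
        "see you at lunch",
        "lunch at noon",
        "call me at noon",
    ]
    labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
    for term, cls in select_features(docs, labels):
        print(term, "->", cls)
```

In this toy data, "free" and "prize" occur in exactly the same messages, so only one of them survives the redundancy step; that is the kind of duplicated information the abstract refers to. A real pipeline would also need tokenization, stop-word handling, and thresholds tuned on held-out data.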
Pages: 11259-11274
Page count: 16