An enhanced feature selection method for text classification

Cited: 0
Authors
Kang, Jinbeom [1 ]
Lee, Eunshil [1 ]
Hong, Kwanghee [1 ]
Park, Jeahyun [1 ]
Kim, Taehwan [1 ]
Park, Juyoung [1 ]
Choi, Joongmin [1 ]
Yang, Jaeyoung [1 ]
Affiliations
[1] Hanyang Univ, Dept Comp Sci & Engn, Ansan, Kyunggi Do, South Korea
Keywords
feature selection; impurity of words; unbalanced distribution; machine learning; text classification
DOI
Not available
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Feature selection in machine learning is the task of identifying a set of representative terms, or features, from a document collection, mainly for use in text classification. Existing feature selection methods, including information gain and the chi-square (χ²) test, focus on features that are useful across all topics, and consequently lack the power to select features that truly represent a particular topic (or class). These methods also assume that the distribution of documents over classes is balanced. This assumption harms classification accuracy, because real-world document collections rarely have a balanced distribution, and it is difficult to prepare a training set with an equal number of documents for each class. To resolve this problem, we propose a new feature selection method for text classification based on the purity of a word, which emphasizes the word's representativeness for a particular class. Our method also assumes an unbalanced distribution of documents over multiple classes, and combines feature values with weight factors that reflect the number of training documents in each class. In summary, we obtain feature candidates using word purity and then select features under the unbalanced document distribution. Experiments demonstrate that our method outperforms existing methods.
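The abstract does not give the paper's exact formulas, but the two ideas it describes — scoring a word by how concentrated it is in one class (purity), and weighting counts to offset unbalanced class sizes — can be sketched as follows. This is a minimal illustrative sketch under assumed definitions (inverse-class-frequency weights, purity as the dominant class's weighted share), not the authors' actual method; the function name `purity_scores` is hypothetical.

```python
from collections import Counter, defaultdict

def purity_scores(docs, labels):
    """Illustrative sketch: score each term by class purity, with
    per-class weights that offset unbalanced class sizes.
    (Assumed definitions, not the paper's exact formula.)"""
    class_sizes = Counter(labels)
    total = len(docs)
    # Assumed weight factor: inverse class frequency, so terms from
    # small classes are not drowned out by the majority class.
    weight = {c: total / (len(class_sizes) * n) for c, n in class_sizes.items()}
    # Per term, count how many documents of each class contain it.
    df = defaultdict(Counter)  # term -> class -> document frequency
    for doc, c in zip(docs, labels):
        for term in set(doc.split()):
            df[term][c] += 1
    scores = {}
    for term, per_class in df.items():
        weighted = {c: n * weight[c] for c, n in per_class.items()}
        # Purity: fraction of the term's weighted occurrences that
        # fall in its dominant class (1.0 = appears in one class only).
        scores[term] = max(weighted.values()) / sum(weighted.values())
    return scores
```

A term appearing only in one class gets purity 1.0 regardless of class size, while a term spread across classes scores lower; ranking terms by this score and keeping the top-k would yield the "feature candidates" the abstract mentions.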
Pages: 36-41
Page count: 6