Feature selection for text classification: A review

Cited by: 193
Authors
Deng, Xuelian [1]
Li, Yuqing [1]
Weng, Jian [2]
Zhang, Jilian [3]
Affiliations
[1] Guangxi Univ Chinese Med, Coll Publ Hlth & Management, Guangxi, Peoples R China
[2] Jinan Univ, Coll Informat Sci & Technol, Guangzhou, Guangdong, Peoples R China
[3] Jinan Univ, Coll Cyber Secur, Guangzhou, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Feature Selection; Text classification; Text classifiers; Multimedia; HYBRID FEATURE-SELECTION; GENETIC ALGORITHM; SIMILARITY MEASURE; NAIVE BAYES; IMAGE; CATEGORIZATION; DISTANCE; REGRESSION;
DOI
10.1007/s11042-018-6083-5
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Big multimedia data is inherently heterogeneous: it may be a mixture of video, audio, text, and images. This is due to the prevalence of novel applications in recent years, such as social media, video sharing, and location-based services (LBS). In many multimedia applications, for example video/image tagging and multimedia recommendation, text classification techniques have been used extensively to facilitate multimedia data processing. In this paper, we give a comprehensive review of feature selection techniques for text classification. We begin by introducing some popular document representation schemes and the similarity measures used in text classification. Then, we review the most popular text classifiers, including the Nearest Neighbor (NN) method, Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Neural Networks. Next, we survey four feature selection models, namely filter, wrapper, embedded, and hybrid, and discuss the pros and cons of state-of-the-art feature selection approaches. Finally, we conclude the paper and briefly introduce some interesting feature selection work that does not belong to any of the four models.
Pages: 3797-3816
Page count: 20