Toward Optimal Feature Selection in Naive Bayes for Text Categorization

Cited by: 179
Authors
Tang, Bo [1]
Kay, Steven [1]
He, Haibo [1]
Affiliation
[1] Univ Rhode Isl, Dept Elect Comp & Biomed Engn, Kingston, RI 02881 USA
Funding
National Science Foundation (USA)
Keywords
Feature selection; feature reduction; text categorization; Kullback-Leibler divergence; Jeffreys divergence; information gain; classification
DOI
10.1109/TKDE.2016.2563436
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automated feature selection is important for text categorization because it reduces the feature set size and speeds up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on information theory, which ranks features by their discriminative capacity for classification. We first revisit two information measures for binary hypothesis testing, the Kullback-Leibler divergence and the Jeffreys divergence, and analyze their asymptotic properties in relation to the type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called the Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH divergence, we develop two efficient feature selection methods for text categorization, termed maximum discrimination (MD) and MD-χ². Results of extensive experiments demonstrate the effectiveness of the proposed approaches.
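The two binary-hypothesis measures named in the abstract can be illustrated with a minimal sketch. It assumes a simplified setting that is not taken from the paper: each term is modeled as a Bernoulli variable (present or absent in a document), there are only two classes, and probabilities are estimated with add-one smoothing. The function names `kl_bernoulli`, `jeffreys_bernoulli`, and `rank_terms` are hypothetical, and the ranking shown is a plain Jeffreys-divergence score, not the paper's JMH divergence or its MD and MD-χ² methods.

```python
import math


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence KL(P || Q) between two Bernoulli
    distributions with success probabilities p and q, both in (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def jeffreys_bernoulli(p, q):
    """Jeffreys divergence: the symmetrized sum KL(P || Q) + KL(Q || P)."""
    return kl_bernoulli(p, q) + kl_bernoulli(q, p)


def rank_terms(doc_freq_a, doc_freq_b, n_a, n_b, alpha=1.0):
    """Rank terms by Jeffreys divergence between their occurrence
    distributions in two classes (illustrative assumption, two classes only).

    doc_freq_a / doc_freq_b map each term to the number of documents in
    class A / class B that contain it; n_a / n_b are the class sizes.
    alpha is an additive smoothing constant that keeps every estimated
    probability strictly between 0 and 1.
    """
    scores = {}
    for term in set(doc_freq_a) | set(doc_freq_b):
        p = (doc_freq_a.get(term, 0) + alpha) / (n_a + 2 * alpha)
        q = (doc_freq_b.get(term, 0) + alpha) / (n_b + 2 * alpha)
        scores[term] = jeffreys_bernoulli(p, q)
    # Most discriminative terms (largest divergence) come first.
    return sorted(scores, key=scores.get, reverse=True)


# Toy document frequencies, invented purely for illustration.
sports = {"goal": 90, "vote": 10, "the": 95}
politics = {"goal": 5, "vote": 80, "the": 95}
ranking = rank_terms(sports, politics, n_a=100, n_b=100)
print(ranking)
```

In this toy data a term that appears equally often in both classes receives a score of zero and falls to the bottom of the ranking, which is the behavior a discriminative feature score is meant to have.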
Pages: 2508-2521
Page count: 14