A Bayesian feature selection paradigm for text classification

Cited by: 30
Authors
Feng, Guozhong [1, 2]
Guo, Jianhua [1, 2]
Jing, Bing-Yi [3]
Hao, Lizhu [1, 2]
Affiliations
[1] NE Normal Univ, Sch Math & Stat, Changchun 130024, Jilin Province, Peoples R China
[2] NE Normal Univ, Key Lab Appl Stat MOE, Changchun 130024, Jilin Province, Peoples R China
[3] Hong Kong Univ Sci & Technol, Dept Math, Hong Kong, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Bayesian feature selection; Metropolis search; Mixture model; Text classification; VARIABLE SELECTION; MODELS; CATEGORIZATION
DOI
10.1016/j.ipm.2011.08.002
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
The automated classification of texts into predefined categories has witnessed a booming interest, due to the increased availability of documents in digital form and the ensuing need to organize them. An important problem for text classification is feature selection, whose goals are to improve classification effectiveness, computational efficiency, or both. Due to categorization unbalancedness and feature sparsity in social text collection, filter methods may work poorly. In this paper, we perform feature selection in the training process, automatically selecting the best feature subset by learning, from a set of preclassified documents, the characteristics of the categories. We propose a generative probabilistic model, describing categories by distributions, handling the feature selection problem by introducing a binary exclusion/inclusion latent vector, which is updated via an efficient Metropolis search. Real-life examples illustrate the effectiveness of the approach. (C) 2011 Elsevier Ltd. All rights reserved.
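The idea in the abstract, a binary inclusion/exclusion vector over features updated by a Metropolis search, can be illustrated with a minimal sketch. This record does not give the paper's actual model, so everything below is an assumption made for illustration: features are word presence/absence indicators, an included feature gets one Beta-Bernoulli distribution per category, an excluded feature gets a single distribution shared by all categories, and each feature is included a priori with probability w. The names log_beta_bernoulli and metropolis_search are hypothetical.

```python
import math
import random


def log_beta_bernoulli(k, n, a=1.0, b=1.0):
    """Log marginal likelihood of k presences in n documents under a Beta(a, b) prior."""
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    return log_beta(a + k, b + n - k) - log_beta(a, b)


def metropolis_search(X, y, n_iter=2000, w=0.5, seed=0):
    """Search over binary inclusion vectors; returns the best vector visited.

    X: list of documents, each a list of 0/1 feature indicators (assumed encoding).
    y: list of category labels, one per document.
    w: assumed prior probability that a feature is included.
    """
    rng = random.Random(seed)
    p = len(X[0])
    classes = sorted(set(y))

    # Log-posterior gain from including feature j instead of excluding it.
    gain = []
    for j in range(p):
        col = [row[j] for row in X]
        shared = log_beta_bernoulli(sum(col), len(col))
        per_class = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            per_class += log_beta_bernoulli(sum(vals), len(vals))
        gain.append(per_class - shared + math.log(w) - math.log(1.0 - w))

    gamma = [0] * p          # start with every feature excluded
    score = 0.0              # log-posterior relative to the all-excluded vector
    best, best_score = list(gamma), score
    for _ in range(n_iter):
        j = rng.randrange(p)                      # symmetric proposal: flip one bit
        change = gain[j] if gamma[j] == 0 else -gain[j]
        if change >= 0 or rng.random() < math.exp(change):
            gamma[j] = 1 - gamma[j]
            score += change
            if score > best_score:
                best, best_score = list(gamma), score
    return best
```

In this toy model the score factorizes across features, so the stochastic search is shown only to make the update step concrete; it is not a reproduction of the authors' method.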
Pages: 283-302
Number of pages: 20