A probabilistic model derived term weighting scheme for text classification

Cited by: 17
Authors
Feng, Guozhong [1 ,2 ,3 ]
Li, Shaoting [4 ]
Sun, Tieli [1 ]
Zhang, Bangzuo [1 ]
Affiliations
[1] Northeast Normal Univ, Sch Comp Sci & Informat Technol, Key Lab Intelligent Informat Proc Jilin Univ, Changchun 130117, Jilin, Peoples R China
[2] Northeast Normal Univ, Sch Math & Stat, Key Lab Appl Stat MOE, Changchun 130024, Jilin, Peoples R China
[3] Northeast Normal Univ, Inst Computat Biol, Changchun 130117, Jilin, Peoples R China
[4] Dongbei Univ Finance & Econ, Sch Stat, Dalian 116025, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Latent feature selection indicator; Matching score function; Naive Bayes; Supervised term weighting; Text classification; CATEGORIZATION; BAYES;
DOI
10.1016/j.patrec.2018.03.003
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Term weighting is a text representation strategy that assigns an appropriate value to each term when transforming the content of a textual document into a vector in the term space, with the goal of improving text classification performance. Supervised weighting methods, which use the class membership of the training documents, are naturally expected to outperform unsupervised ones. In this paper, a new weighting scheme is proposed via a matching score function based on a probabilistic model. We introduce a latent variable indicating whether or not a term carries classification information, specify conjugate priors, and exploit the conjugacy by integrating out the latent indicator and the parameters. Non-discriminating terms can then be assigned weights close to 0. Experimental results using kNN and SVM classifiers demonstrate the effectiveness of the proposed approach on both small and large text data sets. (C) 2018 Published by Elsevier B.V.
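To make the idea concrete, here is a minimal sketch of a matching-score-style supervised term weight. It is not the paper's exact formula (which is not given in the abstract); it assumes a Beta-Bernoulli model of per-class term occurrence, integrates the Bernoulli parameters out analytically via conjugacy, and scores each term by a log Bayes factor comparing a class-dependent model (term informative) against a pooled model (term uninformative). All function names and the example counts are hypothetical.

```python
from math import lgamma

def log_beta(a, b):
    # log of the Beta function B(a, b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(k, n, a=1.0, b=1.0):
    # Log marginal likelihood of k occurrences in n Bernoulli trials
    # under a Beta(a, b) prior, with the rate parameter integrated out.
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def matching_score(counts, totals, a=1.0, b=1.0):
    # Log Bayes factor: class-dependent occurrence rates vs. one pooled rate.
    # counts[c] = documents of class c containing the term,
    # totals[c] = documents of class c.
    informative = sum(log_marginal(k, n, a, b) for k, n in zip(counts, totals))
    pooled = log_marginal(sum(counts), sum(totals), a, b)
    return informative - pooled

def term_weight(counts, totals):
    # Clamp negative scores so non-discriminating terms get weight ~0,
    # mirroring the abstract's claim about near-zero weights.
    return max(matching_score(counts, totals), 0.0)

# Hypothetical two-class corpus with 100 documents per class.
w_disc = term_weight([90, 5], [100, 100])   # term concentrated in class 0
w_flat = term_weight([50, 50], [100, 100])  # term spread evenly
```

Under this sketch, the evenly spread term is scored below the pooled model and is clamped to weight 0, while the class-concentrated term receives a large positive weight; the resulting weights could then multiply raw term frequencies before feeding a kNN or SVM classifier.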
Pages: 23-29
Page count: 7
References
25 in total
[1] Altincay, Hakan; Erenel, Zafer. Using the absolute difference of term occurrence probabilities in binary text categorization [J]. APPLIED INTELLIGENCE, 2012, 36 (01): 148-160.
[2] Cai, Deng; He, Xiaofei. Manifold Adaptive Experimental Design for Text Categorization [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2012, 24 (04): 707-719.
[3] Chang, Chih-Chung; Lin, Chih-Jen. LIBSVM: A Library for Support Vector Machines [J]. ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2011, 2 (03).
[4] DEBOLE F, 2003, P SAC 03 18 ACM S AP, P784.
[5] Deng ZH, 2004, LECT NOTES COMPUT SC, V3007, P588.
[6] Erenel Z, 2011, J INF SCI ENG, V27, P819.
[7] Feng, Guozhong; Guo, Jianhua; Jing, Bing-Yi; Sun, Tieli. Feature subset selection using naive Bayes for text classification [J]. PATTERN RECOGNITION LETTERS, 2015, 65: 109-115.
[8] Feng, Guozhong; Guo, Jianhua; Jing, Bing-Yi; Hao, Lizhu. A Bayesian feature selection paradigm for text classification [J]. INFORMATION PROCESSING & MANAGEMENT, 2012, 48 (02): 283-302.
[9] Guyon I, 2004, ADV NEURAL INFORM PR, V17.
[10] Guyon, Isabelle; Saffari, Amir; Dror, Gideon; Cawley, Gavin. Analysis of the IJCNN 2007 agnostic learning vs. prior knowledge challenge [J]. NEURAL NETWORKS, 2008, 21 (2-3): 544-550.