A new category-based weighting scheme for automated text categorization

Cited by: 0
Authors
Jia, Longjia [1]
Sun, Tieli [1,2]
Yang, Fengqin [1]
Sun, Hongguang [1]
Zhang, Bangzuo [1]
Hung, Chih-Cheng [3]
Affiliations
[1] School of Computer Science and Information Technology, Key Laboratory of Intelligent Information Processing of Jilin Universities, Northeast Normal University, Changchun
[2] College of Humanities and Sciences, Northeast Normal University, Changchun
[3] Center for Machine Vision and Security Research, Kennesaw State University, Marietta
Keywords
Dimensionality reduction; Machine learning; Term weighting; Text categorization
DOI
10.1166/jctn.2015.4500
Abstract
Term weighting is a strategy that assigns weights to terms in order to improve the performance of text categorization. In this paper, we propose a new category-based term weighting scheme named the probability of relevance frequency (prf), which uses available labeling information to assign appropriate weights to terms. The main idea of prf is that the more concentrated a high-frequency term is in the positive category than in the negative category, the more it contributes to separating the positive samples from the negative samples. By replacing word features with category-based features, the dimensionality of the document feature space can be reduced from tens of thousands to a small number of categories. In the experiments, we investigate the effects of prf on the 20 Newsgroups, Reuters-21578, and Yahoo! Answers datasets using SVM and k-NN as classifiers. The results show that the prf scheme outperforms other term weighting schemes, such as term frequency (tf); term frequency and inverse document frequency (tf*idf); term frequency and relevance frequency (tf*rf); and term frequency and inverse question frequency and question frequency and inverse category frequency (iqf*qf*icf). Copyright © 2015 American Scientific Publishers. All rights reserved.
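As an illustration of what a supervised term weighting scheme looks like, the following is a minimal sketch of the tf*rf baseline named in the abstract, not of prf itself (the abstract does not give the prf formula). It assumes the commonly cited definition rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term. The function names `rf_weights` and `tf_rf` and the toy data are hypothetical.

```python
import math
from collections import Counter


def rf_weights(docs, labels, positive):
    """Relevance frequency per term for one positive category.

    Assumed definition: rf(t) = log2(2 + a / max(1, c)), where
    a = number of positive documents containing t and
    c = number of negative documents containing t.
    """
    pos_df = Counter()  # document frequency in the positive category
    neg_df = Counter()  # document frequency in all other categories
    for tokens, label in zip(docs, labels):
        target = pos_df if label == positive else neg_df
        target.update(set(tokens))  # count each term once per document
    vocab = set(pos_df) | set(neg_df)
    return {t: math.log2(2 + pos_df[t] / max(1, neg_df[t])) for t in vocab}


def tf_rf(tokens, rf):
    """Weight one document: raw term frequency times rf.

    Terms unseen in training get rf = log2(2) = 1.0, the value the
    formula yields when a = c = 0.
    """
    tf = Counter(tokens)
    return {t: n * rf.get(t, 1.0) for t, n in tf.items()}


# Toy corpus (hypothetical): two "sport" and two "finance" documents.
docs = [
    ["ball", "game"],
    ["ball", "team"],
    ["stock", "market"],
    ["game", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

rf = rf_weights(docs, labels, positive="sport")
weights = tf_rf(["ball", "ball", "stock"], rf)
```

In this toy corpus, "ball" occurs only in positive documents and so receives a larger rf than "stock", which occurs only in negative ones. This is the same intuition the abstract describes for prf: terms concentrated in the positive category contribute more to separating it from the rest.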
Pages: 5198-5205
Number of pages: 7