Relevance Feature Discovery for Text Mining

Cited by: 41
Authors
Li, Yuefeng [1]
Algarni, Abdulmohsen [1]
Albathan, Mubarak [1, 2]
Shen, Yan [1]
Bijaksana, Moch Arif [1]
Affiliations
[1] Queensland Univ Technol, Sch Elect Engn & Comp Sci, Brisbane, Qld 4001, Australia
[2] Al Imam Mohammad Ibn Saud Islamic Univ, Riyadh 11432, Saudi Arabia
Funding
Australian Research Council
Keywords
Text mining; text feature extraction; text classification; feature selection; regression shrinkage; patterns; ontology
DOI
10.1109/TKDE.2014.2373357
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Guaranteeing the quality of relevance features discovered in text documents to describe user preferences is a major challenge because of the large scale of terms and data patterns. Most popular text mining and classification methods adopt term-based approaches, all of which suffer from the problems of polysemy and synonymy. It has long been hypothesized that pattern-based methods should describe user preferences better than term-based ones; yet how to use large-scale patterns effectively remains a hard problem in text mining. To address this issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-21578 show that the proposed model significantly outperforms both state-of-the-art term-based methods and pattern-based methods.
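The abstract names two steps without giving formulas: deploying discovered patterns over terms, and classifying terms by specificity. The Python sketch below is only a rough illustration of those two ideas. The function names, the even split of a pattern's support across its terms, the specificity score, and the thresholds `upper` and `lower` are all assumptions made for this example, not the paper's definitions.

```python
from collections import defaultdict


def deploy_patterns(patterns):
    """Map pattern-level support down to term-level weights.

    Assumption: each pattern's support is split evenly across its terms.
    `patterns` is a list of (set_of_terms, support) pairs.
    """
    weights = defaultdict(float)
    for terms, support in patterns:
        for term in terms:
            weights[term] += support / len(terms)
    return dict(weights)


def specificity(term, pos_docs, neg_docs):
    """Hypothetical specificity score in [-1, 1] for balanced inputs.

    Counts positive documents containing the term, subtracts the count of
    negative documents containing it, and divides by the number of
    positive documents.
    """
    pos = sum(term in doc for doc in pos_docs)
    neg = sum(term in doc for doc in neg_docs)
    return (pos - neg) / len(pos_docs)


def classify_terms(terms, pos_docs, neg_docs, upper=0.5, lower=0.0):
    """Group terms into three categories using illustrative thresholds."""
    groups = {"positive_specific": set(), "general": set(), "negative_specific": set()}
    for term in terms:
        score = specificity(term, pos_docs, neg_docs)
        if score > upper:
            groups["positive_specific"].add(term)
        elif score < lower:
            groups["negative_specific"].add(term)
        else:
            groups["general"].add(term)
    return groups


# Toy data, invented for illustration only.
pos_docs = [{"mining", "pattern", "text"}, {"mining", "pattern"}]
neg_docs = [{"text", "market"}, {"market", "stock"}]
patterns = [(frozenset({"mining", "pattern"}), 2), (frozenset({"text"}), 1)]

term_weights = deploy_patterns(patterns)
term_groups = classify_terms(
    {"mining", "pattern", "text", "market"}, pos_docs, neg_docs
)
```

In this toy setup, a term that appears in both relevant and irrelevant documents ("text") lands in the general group, while terms seen only on one side are treated as specific to that side. A fuller model would then adjust `term_weights` according to the group each term falls into.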
Pages: 1656-1669
Number of pages: 14