A novel framework for termset selection and weighting in binary text classification

Cited by: 25
Authors
Badawi, Dima [1]
Altincay, Hakan [1]
Affiliations
[1] Eastern Mediterranean Univ, Dept Comp Engn, Famagusta, Northern Cyprus, Turkey
Keywords
Co-occurrence features; Termset selection; Termset weighting; Document representation; Text categorization; WORD;
DOI
10.1016/j.engappai.2014.06.012
Chinese Library Classification number
TP [Automation technology, computer technology];
Subject classification code
0812;
Abstract
This study presents a new framework for termset selection and weighting. The proposed framework is based on employing the joint occurrence statistics of pairs of terms for termset selection and weighting. More specifically, each termset is evaluated by taking into account the simultaneous or individual occurrences of the terms within the termset. Based on the idea that the occurrence of one term but not the other may also convey valuable information for discrimination, the conventionally used term selection schemes are adapted to be employed for termset selection. Similarly, the weight of a selected termset is computed as a function of the terms that occur in the document under concern where a termset is assigned a nonzero weight if either or both of the terms appear in the document. This weight estimation scheme allows evaluation of the individual occurrences of the terms and their co-occurrences separately so as to compute the document-specific weight of each termset. The proposed termset-based representation is concatenated with the bag-of-words approach to construct the document vectors. Experiments conducted on three widely used datasets have verified the effectiveness of the proposed framework. (C) 2014 Elsevier Ltd. All rights reserved.
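The weighting idea in the abstract can be illustrated with a minimal sketch. It assumes two-term termsets and three occurrence patterns (both terms, only the first, only the second). The abstract does not state the actual weighting formulas, so the numeric weights, the `termset_weight` and `document_vector` helpers, and the example vocabulary are all hypothetical placeholders, not the authors' method.

```python
from typing import Dict, List, Sequence, Set, Tuple

Termset = Tuple[str, str]
# Hypothetical per-pattern weights; keys are "both", "first_only", "second_only".
PatternWeights = Dict[str, float]


def termset_weight(doc_terms: Set[str], pair: Termset, w: PatternWeights) -> float:
    """Return a document-specific weight for a two-term termset.

    The weight is nonzero if either or both terms occur in the document,
    and each occurrence pattern is weighted separately.
    """
    first, second = pair
    has_first, has_second = first in doc_terms, second in doc_terms
    if has_first and has_second:
        return w["both"]
    if has_first:
        return w["first_only"]
    if has_second:
        return w["second_only"]
    return 0.0  # neither term occurs


def document_vector(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    termsets: Sequence[Termset],
    weights: Dict[Termset, PatternWeights],
) -> List[float]:
    """Concatenate a raw-count bag-of-words vector with termset weights."""
    doc_terms = set(tokens)
    bow = [float(list(tokens).count(term)) for term in vocabulary]
    ts = [termset_weight(doc_terms, pair, weights[pair]) for pair in termsets]
    return bow + ts


# Illustrative data only (not taken from the paper).
vocabulary = ["oil", "price", "wheat"]
termsets = [("oil", "price"), ("wheat", "price")]
weights = {
    ("oil", "price"): {"both": 1.0, "first_only": 0.6, "second_only": 0.3},
    ("wheat", "price"): {"both": 0.9, "first_only": 0.5, "second_only": 0.2},
}
vec = document_vector(["oil", "oil", "supply"], vocabulary, termsets, weights)
```

Here the document contains "oil" but not "price", so the first termset still receives a nonzero weight, while the second termset, with neither term present, gets zero. In practice the per-pattern weights would be estimated from training-set occurrence statistics rather than set by hand.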
Pages: 38-53
Page count: 16