A novel framework for termset selection and weighting in binary text classification

Cited by: 26

Authors:

Badawi, Dima [1]

Altincay, Hakan [1]

Affiliations:

[1] Eastern Mediterranean Univ, Dept Comp Engn, Famagusta, Northern Cyprus, Turkey

Source:

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE | 2014 / Vol. 35

Keywords:

Co-occurrence features; Termset selection; Termset weighting; Document representation; Text categorization; WORD;

DOI:

10.1016/j.engappai.2014.06.012

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

This study presents a new framework for termset selection and weighting. The proposed framework is based on employing the joint occurrence statistics of pairs of terms for termset selection and weighting. More specifically, each termset is evaluated by taking into account the simultaneous or individual occurrences of the terms within the termset. Based on the idea that the occurrence of one term but not the other may also convey valuable information for discrimination, the conventionally used term selection schemes are adapted to be employed for termset selection. Similarly, the weight of a selected termset is computed as a function of the terms that occur in the document under concern where a termset is assigned a nonzero weight if either or both of the terms appear in the document. This weight estimation scheme allows evaluation of the individual occurrences of the terms and their co-occurrences separately so as to compute the document-specific weight of each termset. The proposed termset-based representation is concatenated with the bag-of-words approach to construct the document vectors. Experiments conducted on three widely used datasets have verified the effectiveness of the proposed framework. (C) 2014 Elsevier Ltd. All rights reserved.
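The abstract describes a termset weight that is nonzero when either or both terms of a pair occur in a document, with the resulting features concatenated to a bag-of-words vector. The following is a minimal sketch of that idea only, not the authors' actual scheme: the function names, the constant weights `w_first`, `w_second`, and `w_both`, and the toy vocabulary are all assumptions made for illustration. In the paper the weights are derived from joint occurrence statistics, which this record does not specify.

```python
from collections import Counter


def termset_weight(doc_terms, pair, w_first=1.0, w_second=1.0, w_both=2.0):
    """Return an illustrative weight for a two-term termset in one document.

    The three weights are hypothetical placeholders, not values from the paper.
    The termset gets a nonzero weight if either term, or both, occur.
    """
    first, second = pair
    has_first = first in doc_terms
    has_second = second in doc_terms
    if has_first and has_second:
        return w_both      # co-occurrence of both terms
    if has_first:
        return w_first     # only the first term occurs
    if has_second:
        return w_second    # only the second term occurs
    return 0.0             # neither term occurs


def document_vector(tokens, vocabulary, termsets):
    """Concatenate raw term counts with the termset features."""
    counts = Counter(tokens)
    bag_of_words = [float(counts[term]) for term in vocabulary]
    present = set(tokens)
    termset_part = [termset_weight(present, pair) for pair in termsets]
    return bag_of_words + termset_part


# Toy data, invented for this example.
vocabulary = ["oil", "price", "wheat"]
termsets = [("oil", "price"), ("wheat", "price")]
tokens = "oil price rises as oil demand grows".split()

vec = document_vector(tokens, vocabulary, termsets)
print(vec)
```

Treating the "first only", "second only", and "both" cases separately is what distinguishes this representation from a plain co-occurrence feature, which would be zero unless both terms appear.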


Pages: 38-53

Number of pages: 16
