Learning to classify documents according to genre

Cited by: 62
Authors
Finn, Aidan [1]
Kushmerick, Nicholas [1]
Affiliations
[1] Univ Coll Dublin, Sch Comp Sci & Informat, Dublin, Ireland
Source
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY | 2006 / Vol. 57 / No. 11
DOI
10.1002/asi.20427
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Current document-retrieval tools succeed in locating large numbers of documents relevant to a given query. While search results may be relevant according to the topic of the documents, it is more difficult to identify which of the relevant documents are most suitable for a particular user. Automatic genre analysis (i.e., the ability to distinguish documents according to style) would be a useful tool for identifying documents that are most suitable for a particular user. We investigate the use of machine learning for automatic genre classification. We introduce the idea of domain transfer (genre classifiers should be reusable across multiple topics), which does not arise in standard text classification. We investigate different features for building genre classifiers and their ability to transfer across multiple-topic domains. We also show how different feature sets can be used in conjunction with each other to improve performance and reduce the number of documents that need to be labeled.
Pages: 1506-1518
Page count: 13