Preferential text classification: learning algorithms and evaluation measures

Cited by: 0
Authors
Fabio Aiolli
Riccardo Cardin
Fabrizio Sebastiani
Alessandro Sperduti
Affiliations
[1] Università di Padova, Dipartimento di Matematica Pura e Applicata
[2] Consiglio Nazionale delle Ricerche, Istituto di Scienza e Tecnologie dell’Informazione
Source
Information Retrieval | 2009 / Vol. 12
Keywords
Preferential learning; Supervised learning; Text categorization; Text classification; Primary and secondary categories
DOI
Not available
Abstract
In many application contexts in which textual documents are labelled with thematic categories, a distinction is made between the primary categories of a document, which represent the topics that are central to it, and its secondary categories, which represent topics that the document only touches upon. We contend that this distinction, so far neglected in text categorization research, is important and deserves to be explicitly tackled. The contribution of this paper is threefold. First, we propose an evaluation measure for this preferential text categorization task, whereby different kinds of misclassifications involving either primary or secondary categories have a different impact on effectiveness. Second, we establish several baseline results for this task on a well-known benchmark for patent classification in which the distinction between primary and secondary categories is present; these results are obtained by reformulating the preferential text categorization task in terms of well-established classification problems, such as single-label and/or multi-label multiclass classification, using state-of-the-art learning technology such as SVMs and kernel-based methods. Third, we improve on these results by using a recently proposed class of algorithms explicitly devised for learning from training data expressed in preferential form, i.e., in the form “for document d_i, category c′ is preferred to category c′′”; this allows us to distinguish between primary and secondary categories not only in the classification phase but also in the learning phase, thus differentiating their impact on the classifiers to be generated.
Pages: 559-580
Number of pages: 21