Feature selection for text categorisation using self-organising map

Cited by: 0
Authors
Manomaisupat, P [1]
Ahmad, K [1]
Affiliations
[1] Univ Surrey, Dept Comp, Guildford GU2 7XH, Surrey, England
Source
PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON NEURAL NETWORKS AND BRAIN, VOLS 1-3 | 2005
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The categorisation of documents in large, diverse collections poses a keen problem. The choice of a vector that may represent a document collection, and the categories of documents within it, is still an art form. We describe a study in which four different types of term-occurrence and document-frequency metrics were used, with varying levels of success as measured by classification accuracy statistics and average quantization error: TFIDF and its variant, term relevance, were used together with a metric based on contrastive linguistics and another that uses a finely classified terminology database. A novel method of term representation was used: each element of the vector corresponds to the absence or presence of a set of terms co-located within the element on the basis of frequency. In addition, we have defined a new baseline for comparison: a randomly selected set of terms, drawn from within the collection, for constructing a representative vector. Categorisation was performed using classic self-organising maps. We confirm that an optimum size of the input vector (c. 100-200 terms) exists for each of the term-occurrence/document-frequency metrics, and there appears to be a saturation point beyond that optimal limit.
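The term-selection step described in the abstract can be illustrated with a minimal sketch. It assumes a common TF-IDF weighting (term frequency times log of inverse document frequency, summed over the collection), which may differ from the exact formula used in the paper. The function names (`tfidf_scores`, `select_terms`, `random_terms`, `binary_vector`) and the toy documents are hypothetical. The sketch uses one vector element per term and does not reproduce the paper's grouping of several terms into one element; the self-organising map itself is also omitted.

```python
import math
import random
from collections import Counter


def tfidf_scores(docs):
    """Sum an assumed TF-IDF weight of each term over a tokenised collection."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * math.log(n_docs / doc_freq[term])
    return scores


def select_terms(docs, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = tfidf_scores(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def random_terms(docs, k, seed=0):
    """Baseline: k terms drawn at random from the collection vocabulary."""
    vocab = sorted({term for doc in docs for term in doc})
    return random.Random(seed).sample(vocab, min(k, len(vocab)))


def binary_vector(doc, terms):
    """Absence/presence vector of a document over the selected terms."""
    present = set(doc)
    return [1 if term in present else 0 for term in terms]


# Hypothetical toy collection, already tokenised.
docs = [
    "neural network training error".split(),
    "neural map topology".split(),
    "stock market price".split(),
]
terms = select_terms(docs, 3)
vectors = [binary_vector(doc, terms) for doc in docs]
baseline = random_terms(docs, 3)
```

The resulting `vectors` would be the input to a self-organising map; varying `k` and comparing against `random_terms` mirrors, in spirit only, the vector-size comparison the abstract reports.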
Pages: 1875-1880
Number of pages: 6
Related papers
10 items in total
  • [1] Ahmad K., 1995, Machine Translation and the Lexicon. Third International EAMT Workshop. Proceedings, P51
  • [2] AHMAD K, 2001, HDB TERMINOLOGY MANA, V2
  • [3] CHRISTOPHER DM, 2003, FDN STAT NATURAL LAN
  • [4] Hung C, 2003, THIRD IEEE INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, P75
  • [5] Hybrid neural document clustering using guided self-organization and wordnet
    Hung, CL
    Wermter, S
    Smith, P
    [J]. IEEE INTELLIGENT SYSTEMS, 2004, 19 (02) : 68 - 77
  • [6] Self organization of a massive document collection
    Kohonen, T
    Kaski, S
    Lagus, K
    Salojärvi, J
    Honkela, J
    Paatero, V
    Saarela, A
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS, 2000, 11 (03) : 574 - 585
  • [7] Kohonen T, 2001, SELF ORG MAPS, DOI 10.1007/978-3-642-56927-2_1
  • [8] Mladenic D, 1999, MACHINE LEARNING, PROCEEDINGS, P258
  • [9] REIJSBERGEN V, 1979, INFORM RETRIEVAL
  • [10] Salton Gerard, 1983, INTRO MODERN INFORM