Fast categorization of web documents represented by graphs

Cited by: 0
Authors
Markov, A. [1]
Last, M. [1]
Kandel, A. [2]
Affiliations
[1] Ben Gurion Univ Negev, Dept Informat Syst Engn, IL-84105 Beer Sheva, Israel
[2] Univ S Florida, Dept Comp Sci & Engn, Tampa, FL 33620 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Most text categorization methods are based on the vector-space model of information retrieval. One of the important advantages of this representation model is that it can be used by both instance-based and model-based classifiers for categorization. However, this popular method of document representation does not capture important structural information, such as the order and proximity of word occurrence or the location of a word within the document. It also makes no use of the mark-up information available in the HTML tags of web documents. A recently developed graph-based representation of web documents can preserve this structural information. The new document model was shown to outperform the traditional vector representation when used with the k-Nearest Neighbor (k-NN) classification algorithm. The problem, however, is that eager (model-based) classifiers cannot work with this representation directly. In this chapter, three new hybrid approaches to web document categorization are presented, built upon both graph and vector-space representations, thus preserving the benefits and overcoming the limitations of each. The hybrid methods presented here are compared to vector-based models using two model-based classifiers (the C4.5 decision-tree algorithm and probabilistic Naive Bayes) and several benchmark web document collections. The results demonstrate that the hybrid methods in most cases outperform existing approaches in classification accuracy and, in addition, achieve a significant increase in categorization speed.
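The abstract does not spell out how the hybrid methods combine the two representations, so the following is only a minimal sketch of one possible reading: each document becomes a graph whose edges link adjacent words (preserving word order), and edges shared by enough documents become binary attributes that a model-based classifier such as C4.5 or Naive Bayes could consume. The function names, the adjacency-only edge rule, and the `min_docs` threshold are all hypothetical and are not taken from the chapter.

```python
from collections import Counter


def document_graph(text):
    """Assumed graph model: a set of directed edges between adjacent words."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


def hybrid_vectors(docs, min_docs=2):
    """Turn document graphs into binary vectors over frequent edges.

    min_docs is a hypothetical threshold: an edge becomes a feature only
    if it occurs in at least that many document graphs.
    """
    graphs = [document_graph(d) for d in docs]
    counts = Counter(edge for g in graphs for edge in g)
    features = sorted(e for e, c in counts.items() if c >= min_docs)
    # One row per document, one 0/1 column per selected edge.
    vectors = [[1 if e in g else 0 for e in features] for g in graphs]
    return features, vectors


docs = [
    "web document categorization",
    "fast web document search",
    "graph search methods",
]
features, vectors = hybrid_vectors(docs)
print(features)
print(vectors)
```

The resulting rows are ordinary fixed-length vectors, so any vector-based learner can train on them, while each column still encodes a piece of word-order structure that a plain bag-of-words vector would discard.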
Pages: 56 / +
Page count: 3