Employing Structural and Textual Feature Extraction for Semistructured Document Classification

Cited by: 7
Authors
Khabbaz, Mohammad [1]
Kianmehr, Keivan [2]
Alhajj, Reda [3, 4]
Affiliations
[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1Z2, Canada
[2] Univ Western Ontario, Dept Elect & Comp Engn, London, ON N6A 3K7, Canada
[3] Univ Calgary, Dept Comp Sci, Calgary, AB T2N 1N4, Canada
[4] Global Univ, Dept Comp Sci, Beirut 6908, Lebanon
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS | 2012 / Vol. 42 / No. 6
Keywords
Document classification; feature reduction; soft clustering; structural information; XML documents; ALGORITHM;
DOI
10.1109/TSMCC.2012.2208102
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
This paper addresses XML document classification by considering both structural and content-based features of the documents. This approach yields a more informative set of feature vectors that represents both the structural and the textual aspects of XML documents. For this purpose, we integrate soft clustering of words and feature reduction into the process. To extract structural information, we employ an existing frequent tree-mining algorithm combined with an information gain filter to retrieve the most informative substructures from XML documents. To extract content information, we propose soft clustering of words, using each cluster as a textual feature. We have conducted extensive experiments on a benchmark dataset, namely 20NewsGroups, and on an XML document dataset given in LOGML that describes the web-server logs of user sessions. With regard to the classifier built using only our textual features, the results show that it outperforms a naive support-vector-machine (SVM)-based classifier, as well as an information retrieval classifier (IRC). We further demonstrate the effectiveness of incorporating both structural and content information into the learning process by comparing our classifier model with several XML document classifiers. In particular, by applying SVM and decision tree algorithms to our feature vector representation of the XML document dataset, we have achieved 85.79% and 87.04% classification accuracy, respectively, both higher than the accuracy achieved by XRules, a well-known structure-based XML document classifier.
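The information gain filter mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each mined substructure has already been reduced to a binary present/absent flag per document, and the names `entropy`, `information_gain`, `top_k_features`, `sub1`, and `sub2` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_present, labels):
    """Information gain of one binary feature with respect to the labels.

    feature_present[i] is True when document i contains the substructure
    (an assumed encoding, not taken from the paper).
    """
    with_f = [y for f, y in zip(feature_present, labels) if f]
    without_f = [y for f, y in zip(feature_present, labels) if not f]
    n = len(labels)
    conditional = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - conditional


def top_k_features(feature_matrix, labels, k):
    """Return the k feature names with the highest information gain.

    feature_matrix maps a feature name to its list of per-document flags.
    """
    ranked = sorted(
        feature_matrix,
        key=lambda name: information_gain(feature_matrix[name], labels),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    labels = ["a", "a", "b", "b"]
    features = {
        "sub1": [True, True, False, False],   # separates the classes
        "sub2": [True, False, True, False],   # carries no class signal
    }
    print(top_k_features(features, labels, 1))
```

In this toy input, `sub1` splits the two classes perfectly while `sub2` is independent of the label, so a filter keeping one feature retains `sub1`.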
Pages: 1566-1578
Number of pages: 13
Related papers
50 in total
  • [1] Textual document pre-processing and feature extraction in OLEX
    Curia, R
    Ettorre, M
    Gallucci, L
    Iiritano, S
    Rullo, P
    DATA MINING VI: DATA MINING, TEXT MINING AND THEIR BUSINESS APPLICATIONS, 2005, : 163 - 173
  • [2] Visual and Textual Deep Feature Fusion for Document Image Classification
    Bakkali, Souhail
    Ming, Zuheng
    Coustaty, Mickael
    Rusinol, Marcal
    2020 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW 2020), 2020, : 2394 - 2403
  • [3] Facial Feature Extraction and Textual Description Classification using SVM
    Bansode, N. K.
    Sinha, P. K.
    2014 INTERNATIONAL CONFERENCE ON COMPUTER COMMUNICATION AND INFORMATICS (ICCCI), 2014,
  • [4] Benchmarking Feature Extraction Techniques for Textual Data Stream Classification
    Thuma, Bruno Siedekum
    de Vargas, Pedro Silva
    Garcia, Cristiano
    Britto, Alceu de Souza, Jr.
    Barddal, Jean Paul
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [5] Automatic Extraction of Non-Textual Information in Web Document and Their Classification
    Zachariasova, Martina
    Hudec, Robert
    Benco, Miroslav
    Kamencay, Patrik
    2012 35TH INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS AND SIGNAL PROCESSING (TSP), 2012, : 753 - 757
  • [6] Combining structural and textual contexts for compressing semistructured databases
    Adiego, J
    de la Fuente, P
    Navarro, G
    SIXTH MEXICAN INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE, PROCEEDINGS, 2005, : 68 - 73
  • [7] Deep feature extraction with tri-channel textual feature map for text classification
    Li, Kunyan
    Kang, Chen
    PATTERN RECOGNITION LETTERS, 2024, 178 : 49 - 54
  • [8] Coral reef image classification employing Improved LDP for feature extraction
    Mary, N. Ani Brown
    Dharma, Dejey
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2017, 49 : 225 - 242
  • [9] Textual Feature Extraction Using Ant Colony Optimization for Hate Speech Classification
    Gite, Shilpa
    Patil, Shruti
    Dharrao, Deepak
    Yadav, Madhuri
    Basak, Sneha
    Rajendran, Arundarasi
    Kotecha, Ketan
    BIG DATA AND COGNITIVE COMPUTING, 2023, 7 (01)
  • [10] A document classification approach by GA feature extraction based corner classification neural network
    Zhang, WF
    Xu, BW
    Cui, ZF
    2005 INTERNATIONAL CONFERENCE ON CYBERWORLDS, PROCEEDINGS, 2005, : 499 - 504