Employing Structural and Textual Feature Extraction for Semistructured Document Classification

Cited by: 7

Authors:

Khabbaz, Mohammad [1]

Kianmehr, Keivan [2]

Alhajj, Reda [3,4]

Affiliations:

[1] Univ British Columbia, Dept Comp Sci, Vancouver, BC V6T 1Z2, Canada

[2] Univ Western Ontario, Dept Elect & Comp Engn, London, ON N6A 3K7, Canada

[3] Univ Calgary, Dept Comp Sci, Calgary, AB T2N 1N4, Canada

[4] Global Univ, Dept Comp Sci, Beirut 6908, Lebanon

Source:

IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS | 2012 / Vol. 42 / No. 06

Keywords:

Document classification; feature reduction; soft clustering; structural information; XML documents; ALGORITHM;

DOI:

10.1109/TSMCC.2012.2208102

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

This paper addresses XML document classification by considering both structural and content-based features of the documents. This approach yields a more informative set of feature vectors that represents both structural and textual aspects of XML documents. For this purpose, we integrate soft clustering of words and feature reduction into the process. To extract structural information, we employ an existing frequent tree-mining algorithm combined with an information gain filter to retrieve the most informative substructures from XML documents. To extract content information, we propose soft clustering of words, using each cluster as a textual feature. We have conducted extensive experiments on a benchmark dataset, namely 20NewsGroups, and on an XML document dataset given in LOGML that describes the web-server logs of user sessions. For the classifier built using only our textual features, the results show that it outperforms a naive support-vector-machine (SVM)-based classifier, as well as an information retrieval classifier (IRC). We further demonstrate the effectiveness of incorporating both structural and content information into the learning process by comparing our classifier model with several XML document classifiers. In particular, by applying SVM and decision tree algorithms to our feature vector representation of the XML document dataset, we have achieved 85.79% and 87.04% classification accuracy, respectively, both higher than the accuracy achieved by XRules, a well-known structure-based XML document classifier.
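
The information gain filter mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each candidate substructure has already been reduced to a binary present/absent column per document, and the names `entropy`, `information_gain`, `top_k_features`, and the toy `subtree_*` features are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(present, labels):
    """Entropy reduction obtained by splitting documents on a binary feature."""
    with_f = [y for x, y in zip(present, labels) if x]
    without_f = [y for x, y in zip(present, labels) if not x]
    total = len(labels)
    conditional = (
        (len(with_f) / total) * entropy(with_f)
        + (len(without_f) / total) * entropy(without_f)
    )
    return entropy(labels) - conditional


def top_k_features(feature_columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    scored = {
        name: information_gain(column, labels)
        for name, column in feature_columns.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Toy data (assumed): four documents, two classes, three candidate substructures.
labels = ["A", "A", "B", "B"]
features = {
    "subtree_1": [1, 1, 0, 0],  # present only in class A documents
    "subtree_2": [1, 0, 1, 0],  # present equally often in both classes
    "subtree_3": [1, 1, 1, 0],  # partially informative
}
selected = top_k_features(features, labels, 2)
```

In this toy setup the filter keeps the substructure that separates the classes perfectly and drops the one that occurs equally often in both.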

Pages: 1566-1578

Page count: 13

