Clustering web documents using hierarchical representation with multi-granularity

Cited by: 11
Authors
Huang, Faliang [1]
Zhang, Shichao [2, 5]
He, Minghua [3]
Wu, Xindong [4]
Affiliations
[1] Fujian Normal Univ, Fac Software, Fuzhou 350007, Peoples R China
[2] Guangxi Normal Univ, Coll Comp Sci & IT, Guilin 541004, Peoples R China
[3] Aston Univ, Birmingham B4 7ET, Aston Triangle, England
[4] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[5] Univ Technol Sydney, Fac Engn & Informat Technol, Broadway, NSW 2007, Australia
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2014, Vol. 17, No. 1
Funding
Australian Research Council
Keywords
web document clustering; hierarchical representation; multi-granularity; INFORMATION GRANULATION;
DOI
10.1007/s11280-012-0197-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which considers only two levels of knowledge granularity (document and term) and ignores the paragraph granularity that bridges them. This two-level granularity can produce unsatisfactory clustering results with "false correlation". To address the problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, HRMM captures structural knowledge hidden in documents more efficiently and effectively, and thus generates web document clusters of higher quality. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with the ontology strategy all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of F-Score.
Pages: 105-126
Page count: 22