Clustering web documents using hierarchical representation with multi-granularity

Cited by: 11
Authors
Huang, Faliang [1]
Zhang, Shichao [2, 5]
He, Minghua [3]
Wu, Xindong [4]
Affiliations
[1] Fujian Normal Univ, Fac Software, Fuzhou 350007, Peoples R China
[2] Guangxi Normal Univ, Coll Comp Sci & IT, Guilin 541004, Peoples R China
[3] Aston Univ, Birmingham B4 7ET, Aston Triangle, England
[4] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[5] Univ Technol Sydney, Fac Engn & Informat Technol, Broadway, NSW 2007, Australia
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2014 / Vol. 17 / Issue 01
Funding
Australian Research Council;
Keywords
web document clustering; hierarchical representation; multi-granularity; INFORMATION GRANULATION;
DOI
10.1007/s11280-012-0197-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two levels of knowledge granularity (document and term) and ignores the bridging paragraph granularity. This two-level granularity may lead to unsatisfactory clustering results with "false correlation". To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with ontology all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
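The zero-valued similarity problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the HRMM algorithm itself: the toy paragraphs, the `cosine` helper, and the use of raw term counts are all assumptions made for illustration. It shows only that two short paragraphs on a related topic get a similarity of exactly zero when they share no surface terms, which is the kind of sparsity that the ontology-based and tolerance-rough-set-based strategies are said to address.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy paragraphs, each reduced to a bag of terms.
p1 = Counter("car engine repair".split())
p2 = Counter("automobile motor maintenance".split())  # related topic, no shared terms
p3 = Counter("car engine maintenance".split())        # shares two terms with p1

sim_12 = cosine(p1, p2)  # zero: the vectors have no term in common
sim_13 = cosine(p1, p3)  # non-zero: "car" and "engine" overlap

print(sim_12, sim_13)
```

Because paragraphs are much shorter than whole documents, such term-level mismatches are more likely at paragraph granularity than at document granularity.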
Pages: 105 - 126
Page count: 22
Related papers
50 in total
  • [31] Multi-granularity principal curves extraction based on improved spectral clustering of complex distribution data
    Zhang, Hongyun
    Zhang, Ting
    Wang, Peipei
    Wei, Zhihua
    INTERNATIONAL JOURNAL OF APPROXIMATE REASONING, 2019, 105 : 356 - 367
  • [32] A multi-granularity clustering based evolutionary algorithm for large-scale sparse multi-objective optimization
    Tian, Ye
    Shao, Shuai
    Xie, Guohui
    Zhang, Xingyi
    SWARM AND EVOLUTIONARY COMPUTATION, 2024, 84
  • [33] RMHNet: A Relation-Aware Multi-granularity Hierarchical Network for Person Re-identification
    Xie, Gengsheng
    Wen, Xianbin
    NEURAL PROCESSING LETTERS, 2023, 55 (02) : 1433 - 1454
  • [34] An efficient selector for multi-granularity attribute reduction
    Liu, Keyu
    Yang, Xibei
    Fujita, Hamido
    Liu, Dun
    Yang, Xin
    Qian, Yuhua
    INFORMATION SCIENCES, 2019, 505 : 457 - 472
  • [35] An approach for multi-granularity optical path creation using merarchy label
    Gui, X
    Xu, YB
    Song, HS
    Zhang, R
    Gu, WY
    Network Architectures, Management, and Applications II, Pts 1 and 2, 2005, 5626 : 873 - 879
  • [36] A Multi-granularity Customization Relationship Model for SaaS
    Li, Hongbo
    Shi, Yuliang
    Li, Qingzhong
    WISM: 2009 INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS AND MINING, PROCEEDINGS, 2009, : 611 - 615
  • [37] STUDY ON THE MULTI-GRANULARITY VIRTUALIZATION OF MANUFACTURING RESOURCES
    Hu, Chunsheng
    Xu, Chengdong
    Cao, Xiaobo
    Zhang, Pengfei
    PROCEEDINGS OF THE ASME 8TH INTERNATIONAL MANUFACTURING SCIENCE AND ENGINEERING CONFERENCE - 2013, VOL 2, 2013,
  • [38] TEACHER-STUDENT LEARNING WITH MULTI-GRANULARITY CONSTRAINT TOWARDS COMPACT FACIAL FEATURE REPRESENTATION
    Wang, Shurun
    Wang, Shiqi
    Yang, Wenhan
    Zhang, Xinfeng
    Wang, Shanshe
    Ma, Siwei
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 8503 - 8507
  • [39] Multi-granularity Characteristics Analysis of Software Networks
    Sun, Weiqiang
    Jin, Chunlin
    Liu, Ji
    EMERGING RESEARCH IN ARTIFICIAL INTELLIGENCE AND COMPUTATIONAL INTELLIGENCE, 2012, 315 : 88 - +
  • [40] Multi-granularity Similarity Measure of Cloud Concept
    Yang, Jie
    Wang, Guoyin
    Li, Xukun
    ROUGH SETS, (IJCRS 2016), 2016, 9920 : 318 - 330