Clustering web documents using hierarchical representation with multi-granularity

Cited by: 11
Authors
Huang, Faliang [1]
Zhang, Shichao [2, 5]
He, Minghua [3]
Wu, Xindong [4]
Affiliations
[1] Fujian Normal Univ, Fac Software, Fuzhou 350007, Peoples R China
[2] Guangxi Normal Univ, Coll Comp Sci & IT, Guilin 541004, Peoples R China
[3] Aston Univ, Birmingham B4 7ET, Aston Triangle, England
[4] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
[5] Univ Technol Sydney, Fac Engn & Informat Technol, Broadway, NSW 2007, Australia
Source
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS | 2014 / Vol. 17 / Issue 01
Funding
Australian Research Council;
Keywords
web document clustering; hierarchical representation; multi-granularity; INFORMATION GRANULATION;
DOI
10.1007/s11280-012-0197-x
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Web document cluster analysis plays an important role in information retrieval by organizing large numbers of documents into a small number of meaningful clusters. Traditional web document clustering is based on the Vector Space Model (VSM), which takes into account only two levels of knowledge granularity (document and term) and ignores the bridging paragraph granularity. This two-level granularity may lead to unsatisfactory clustering results with "false correlation". To deal with this problem, a Hierarchical Representation Model with Multi-granularity (HRMM), which consists of a five-layer representation of data and a two-phase clustering process, is proposed based on granular computing and article structure theory. To deal with the zero-valued similarity problem resulting from the sparse term-paragraph matrix, an ontology-based strategy and a tolerance-rough-set-based strategy are introduced into HRMM. By using granular computing, structural knowledge hidden in documents can be captured more efficiently and effectively in HRMM, and thus web document clusters of higher quality can be generated. Extensive experiments show that HRMM, HRMM with the tolerance-rough-set strategy, and HRMM with ontology all significantly outperform VSM and a representative non-VSM-based algorithm, WFP, in terms of the F-Score.
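To illustrate the zero-valued similarity problem mentioned in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the toy paragraphs, the threshold `theta`, and all function names are assumptions made for illustration. The enrichment step follows the general tolerance rough set idea, in which terms that co-occur in at least `theta` paragraphs are treated as tolerant of each other, and a paragraph is replaced by its upper approximation.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tolerance_classes(paragraphs, theta):
    """Map each term to itself plus every term it co-occurs with
    in at least `theta` paragraphs (hypothetical threshold)."""
    cooc = {}
    for p in paragraphs:
        for s, t in combinations(sorted(p), 2):
            cooc[(s, t)] = cooc.get((s, t), 0) + 1
    classes = {t: {t} for p in paragraphs for t in p}
    for (s, t), n in cooc.items():
        if n >= theta:
            classes[s].add(t)
            classes[t].add(s)
    return classes


def upper_approximation(paragraph, classes):
    """Binary-weighted set of terms whose tolerance class overlaps the paragraph."""
    terms = set(paragraph)
    return {t: 1.0 for t, cls in classes.items() if cls & terms}


# Toy term-paragraph data (assumed): p_a and p_b share no terms.
p_a = {"car": 1.0, "road": 1.0}
p_b = {"automobile": 1.0, "driver": 1.0}
corpus = [
    p_a,
    p_b,
    {"car": 1.0, "automobile": 1.0, "engine": 1.0},
    {"car": 1.0, "automobile": 1.0, "wheel": 1.0},
]

raw_similarity = cosine(p_a, p_b)

classes = tolerance_classes(corpus, theta=2)
enriched_similarity = cosine(
    upper_approximation(p_a, classes),
    upper_approximation(p_b, classes),
)

print(raw_similarity, enriched_similarity)
```

In this toy corpus the raw similarity is zero because the two paragraphs have no term in common, while the enriched vectors overlap on "car" and "automobile", which co-occur in two other paragraphs. Binary weights are used after enrichment only to keep the sketch short; a weighting scheme for the added terms would be a separate design choice.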
Pages: 105-126
Number of pages: 22