Clustering of web search results based on the cuckoo search algorithm and Balanced Bayesian Information Criterion

Cited by: 69
Authors
Cobos, Carlos [1, 2]
Munoz-Collazos, Henry [1]
Urbano-Munoz, Richar [1]
Mendoza, Martha [1, 2]
Leon, Elizabeth [3]
Herrera-Viedma, Enrique [4, 5]
Affiliations
[1] Univ Cauca, Informat Technol Res Grp GTI, Popayan, Colombia
[2] Univ Cauca, Dept Comp Sci, Elect & Telecommun Engn Fac, Popayan, Colombia
[3] Univ Nacl Colombia, Syst & Ind Engn, Fac Engn, Popayan, Colombia
[4] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[5] King Abdulaziz Univ, Dept Elect & Comp Engn, Fac Engn, Jeddah 21589, Saudi Arabia
Keywords
Cuckoo search algorithm; Clustering of web result; Web document clustering; Balanced Bayesian Information Criterion; k-Means; K-MEANS; DESIGN
DOI
10.1016/j.ins.2014.05.047
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The clustering of web search results - or web document clustering - has become a very interesting research area among academic and scientific communities involved in information retrieval. Web search result clustering systems, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but results show room for more to be done. This paper introduces a new description-centric algorithm for the clustering of web results, called WDC-CSK, which is based on the cuckoo search meta-heuristic algorithm, k-means algorithm, Balanced Bayesian Information Criterion, split and merge methods on clusters, and frequent phrases approach for cluster labeling. The cuckoo search meta-heuristic provides a combined global and local search strategy in the solution space. Split and merge methods replace the original Levy flights operation and try to improve existing solutions (nests), so they can be considered as local search methods. WDC-CSK includes an abandon operation that provides diversity and prevents the population nests from converging too quickly. Balanced Bayesian Information Criterion is used as a fitness function and allows defining the number of clusters automatically. WDC-CSK was tested with four data sets (DMOZ-50, AMBIENT, MORESQUE and ODP-239) over 447 queries. The algorithm was also compared against other established web document clustering algorithms, including Suffix Tree Clustering (STC), Lingo, and Bisecting k-means. The results show a considerable improvement upon the other algorithms as measured by recall, F-measure, fall-out, accuracy and SSLk. (C) 2014 Elsevier Inc. All rights reserved.
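The search loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' WDC-CSK implementation: the abstract does not give the Balanced Bayesian Information Criterion formula, so `bic_like` below is a plain BIC-style stand-in, and every function name, parameter, and default (`cuckoo_cluster`, `pa`, `k_max`, and so on) is a hypothetical choice made for illustration. The sketch works on numeric points rather than documents and omits cluster labeling.

```python
import math
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    return [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
            for p in points]

def kmeans(points, centroids, iters=10):
    """Plain Lloyd iterations; an empty cluster keeps its old centroid."""
    centroids = list(centroids)
    for _ in range(iters):
        labels = assign(points, centroids)
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, assign(points, centroids)

def bic_like(points, centroids, labels):
    """Assumed stand-in fitness (lower is better), NOT the paper's Balanced BIC."""
    n, k = len(points), len(centroids)
    sse = sum(dist2(p, centroids[l]) for p, l in zip(points, labels))
    return n * math.log(sse / n + 1e-12) + k * math.log(n)

def split(points, centroids, labels, rng):
    """Replace the largest cluster's centroid with two of its member points."""
    sizes = [labels.count(c) for c in range(len(centroids))]
    big = max(range(len(centroids)), key=sizes.__getitem__)
    members = [p for p, l in zip(points, labels) if l == big]
    if len(members) < 2:
        return list(centroids)
    a, b = rng.sample(members, 2)
    return [c for i, c in enumerate(centroids) if i != big] + [a, b]

def merge(centroids):
    """Replace the two closest centroids with their midpoint."""
    k = len(centroids)
    _, i, j = min((dist2(centroids[i], centroids[j]), i, j)
                  for i in range(k) for j in range(i + 1, k))
    mid = tuple((x + y) / 2 for x, y in zip(centroids[i], centroids[j]))
    return [c for t, c in enumerate(centroids) if t not in (i, j)] + [mid]

def random_nest(points, k_max, rng):
    k = rng.randint(2, k_max)
    cents, labels = kmeans(points, rng.sample(points, k))
    return bic_like(points, cents, labels), cents, labels

def cuckoo_cluster(points, n_nests=8, generations=20, pa=0.25, k_max=6, seed=0):
    rng = random.Random(seed)
    nests = [random_nest(points, k_max, rng) for _ in range(n_nests)]
    for _ in range(generations):
        for i in range(n_nests):
            _, cents, labels = nests[i]
            # Split or merge takes the place of the Levy flight step.
            if len(cents) < 2:
                cand = split(points, cents, labels, rng)
            elif len(cents) >= k_max or rng.random() < 0.5:
                cand = merge(cents)
            else:
                cand = split(points, cents, labels, rng)
            cents2, labels2 = kmeans(points, cand)
            egg = (bic_like(points, cents2, labels2), cents2, labels2)
            j = rng.randrange(n_nests)  # compare against a randomly chosen nest
            if egg[0] < nests[j][0]:
                nests[j] = egg
        # Abandon operation: rebuild the worst fraction pa of nests at random.
        nests.sort(key=lambda nest: nest[0])
        for i in range(n_nests - int(pa * n_nests), n_nests):
            nests[i] = random_nest(points, k_max, rng)
    return min(nests, key=lambda nest: nest[0])
```

Calling `cuckoo_cluster(points)` on a list of equal-length numeric tuples returns a `(score, centroids, labels)` triple for the best nest found. Because the stand-in score is only a generic BIC-style penalty, the number of clusters it settles on should not be read as what the paper's criterion would select.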
Pages: 248 - 264
Page count: 17
Related papers
50 in total
  • [41] Topological tree clustering of web search results
    Freeman, Richard T.
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2006, PROCEEDINGS, 2006, 4224 : 789 - 797
  • [42] Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach
    Cobos, Carlos
    Duque, Andrea
    Bolanos, Jamith
    Mendoza, Martha
    Leon, Elizabeth
    ADVANCES IN SOFT COMPUTING, MICAI 2016, PT II, 2017, 10062 : 285 - 316
  • [43] Speaker Clustering Based on Bayesian Information Criterion
    Tsai, Wei-Ho
    JOURNAL OF INFORMATION SCIENCE AND ENGINEERING, 2008, 24 (06) : 1873 - 1886
  • [44] Snap-drift cuckoo search: A novel cuckoo search optimization algorithm
    Rakhshani, Hojjat
    Rahati, Amin
    APPLIED SOFT COMPUTING, 2017, 52 : 771 - 794
  • [45] Clustering Web video search results based on integration of multiple features
    Hindle, Alex
    Shao, Jie
    Lin, Dan
    Lu, Jiaheng
    Zhang, Rui
    WORLD WIDE WEB, 2011, 14 : 53 - 73
  • [46] Clustering Web video search results based on integration of multiple features
    Hindle, Alex
    Shao, Jie
    Lin, Dan
    Lu, Jiaheng
    Zhang, Rui
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2011, 14 (01) : 53 - 73
  • [47] Use link-based clustering to improve Web search results
    Wang, YT
    Kitsuregawa, M
    SECOND INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS ENGINEERING, VOL I, PROCEEDINGS, 2002 : 115 - 124
  • [48] A semantics-based method for clustering of Chinese web search results
    Zhang, Hui
    Wang, Deqing
    Wang, Li
    Bi, Zhuming
    Chen, Yong
    ENTERPRISE INFORMATION SYSTEMS, 2014, 8 (01) : 147 - 165
  • [49] Web search results clustering based on a novel suffix tree structure
    Wang, Junze
    Mo, Yijun
    Huang, Benxiong
    Wen, Jie
    He, Li
    AUTONOMIC AND TRUSTED COMPUTING, PROCEEDINGS, 2008, 5060 : 540 - 554
  • [50] Cuckoo search algorithm based on frog leaping local search and chaos theory
    Liu, Xueying
    Fu, Meiling
    APPLIED MATHEMATICS AND COMPUTATION, 2015, 266 : 1083 - 1092