Clustering of web search results based on the cuckoo search algorithm and Balanced Bayesian Information Criterion

Cited by: 69
Authors
Cobos, Carlos [1 ,2 ]
Munoz-Collazos, Henry [1 ]
Urbano-Munoz, Richar [1 ]
Mendoza, Martha [1 ,2 ]
Leon, Elizabeth [3 ]
Herrera-Viedma, Enrique [4 ,5 ]
Affiliations
[1] Univ Cauca, Informat Technol Res Grp GTI, Popayan, Colombia
[2] Univ Cauca, Dept Comp Sci, Elect & Telecommun Engn Fac, Popayan, Colombia
[3] Univ Nacl Colombia, Syst & Ind Engn, Fac Engn, Popayan, Colombia
[4] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[5] King Abdulaziz Univ, Dept Elect & Comp Engn, Fac Engn, Jeddah 21589, Saudi Arabia
Keywords
Cuckoo search algorithm; Clustering of web result; Web document clustering; Balanced Bayesian Information Criterion; k-Mean; K-MEANS; DESIGN;
DOI
10.1016/j.ins.2014.05.047
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
The clustering of web search results - or web document clustering - has become a very interesting research area among academic and scientific communities involved in information retrieval. Web search result clustering systems, also called Web Clustering Engines, seek to increase the coverage of documents presented for the user to review, while reducing the time spent reviewing them. Several algorithms for clustering web results already exist, but results show room for more to be done. This paper introduces a new description-centric algorithm for the clustering of web results, called WDC-CSK, which is based on the cuckoo search meta-heuristic algorithm, k-means algorithm, Balanced Bayesian Information Criterion, split and merge methods on clusters, and frequent phrases approach for cluster labeling. The cuckoo search meta-heuristic provides a combined global and local search strategy in the solution space. Split and merge methods replace the original Levy flights operation and try to improve existing solutions (nests), so they can be considered as local search methods. WDC-CSK includes an abandon operation that provides diversity and prevents the population nests from converging too quickly. Balanced Bayesian Information Criterion is used as a fitness function and allows defining the number of clusters automatically. WDC-CSK was tested with four data sets (DMOZ-50, AMBIENT, MORESQUE and ODP-239) over 447 queries. The algorithm was also compared against other established web document clustering algorithms, including Suffix Tree Clustering (STC), Lingo, and Bisecting k-means. The results show a considerable improvement upon the other algorithms as measured by recall, F-measure, fall-out, accuracy and SSLk. (C) 2014 Elsevier Inc. All rights reserved.
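The abstract names recall, F-measure, fall-out and accuracy as evaluation measures but does not say how they are computed. As a rough illustration only, the sketch below uses the common pair-counting convention (every pair of documents is judged as placed together or apart); this convention, the function names `pair_counts` and `clustering_metrics`, and the toy labels are assumptions, not details taken from the paper. SSLk is not covered.

```python
from itertools import combinations


def pair_counts(gold, pred):
    """Count document pairs by agreement between gold classes and predicted clusters.

    Assumption: pair-counting evaluation; the paper may compute its measures differently.
    Returns (tp, fp, fn, tn).
    """
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = pred[i] == pred[j]
        if same_pred and same_gold:
            tp += 1  # clustered together, and they belong together
        elif same_pred:
            fp += 1  # clustered together, but they belong apart
        elif same_gold:
            fn += 1  # separated, but they belong together
        else:
            tn += 1  # separated, and they belong apart
    return tp, fp, fn, tn


def clustering_metrics(tp, fp, fn, tn):
    """Standard definitions; zero denominators yield 0.0 instead of raising."""
    def ratio(num, den):
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "fall_out": ratio(fp, fp + tn),
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


if __name__ == "__main__":
    gold = ["a", "a", "b", "b"]  # hypothetical reference classes
    pred = [0, 0, 0, 1]          # hypothetical cluster assignments
    print(clustering_metrics(*pair_counts(gold, pred)))
```

Under this convention, lower fall-out is better, while higher values are better for the other measures.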
Pages: 248-264
Number of pages: 17