Enhanced clustering models with wiki-based k-nearest neighbors-based representation for web search result clustering

Cited by: 8
Authors
Abdulameer, Ali Sabah [1]
Tiun, Sabrina [1]
Sani, Nor Samsiah [1]
Ayob, Masri [1]
Taha, Adil Yaseen [1]
Affiliation
[1] Univ Kebangsaan Malaysia, Fac Informat Sci & Technol, Ctr Artificial Intelligence Technol CAIT, Bangi 43600, Selangor, Malaysia
Keywords
Clustering methods; Web search result; Word representation; Query expansion
DOI
10.1016/j.jksuci.2020.02.003
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Information retrieval is difficult because of the overabundance of information on the web. Search engines currently respond to user queries with too many results, of which only a few are relevant. Existing clustering methods fail to cluster snippets (short texts) of web documents because document terms occur with low frequency, and this failure should be investigated in depth. One approach to this problem is to expand document terms with semantically similar terms. Hence, a list of terms with their closest and most accurate semantically similar words (word representation) must be built. This study aims to design and develop a new framework to enhance the performance of web search result clustering (WSRC). The research also presents a new unsupervised distributed word representation scheme in which each word is represented by a vector of its semantically related words; such a scheme expands snippets and user queries. The proposed framework consists of several stages: (1) selection of standard datasets (Open Directory Project [ODP]-239 and MORESQUE), which are the datasets used in the most cited works for evaluating search result clustering algorithms, (2) text pre-processing, (3) document representation based on a new wiki-based k-nearest neighbors (KNN) representation method, (4) assessment of the effect of the proposed model on the performance of traditional clustering methods (k-means, k-medoids, single-linkage, and complete-linkage) for WSRC, and (5) evaluation of the proposed method. Results indicate that clustering methods enhanced with the new wiki-KNN-based representation show a significant improvement in WSRC over the baseline methods. Furthermore, the new data representation scheme enhances the overall performance of the clustering methods. © 2020 The Authors. Production and hosting by Elsevier B.V. on behalf of King Saud University.
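The term-expansion idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the function names (`nearest_neighbors`, `expand_snippet`), and the use of cosine similarity are all assumptions made for illustration, whereas the paper derives its word representations from Wikipedia.

```python
import math

# Hypothetical toy word vectors (assumption); the paper builds its
# representations from Wikipedia rather than from hand-written values.
TOY_VECTORS = {
    "jaguar": [0.9, 0.1, 0.0],
    "car": [0.8, 0.2, 0.1],
    "vehicle": [0.7, 0.3, 0.1],
    "cat": [0.1, 0.9, 0.0],
    "animal": [0.2, 0.8, 0.1],
    "python": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(word, vectors, k):
    """Return the k words most similar to `word`, or [] if it is unknown."""
    if word not in vectors:
        return []
    scored = [
        (cosine(vectors[word], vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    # Highest similarity first; ties are broken alphabetically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:k]]


def expand_snippet(tokens, vectors, k=2):
    """Append each token's k nearest neighbors to the snippet, without duplicates."""
    expanded = list(tokens)
    for token in tokens:
        for neighbor in nearest_neighbors(token, vectors, k):
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded


if __name__ == "__main__":
    print(expand_snippet(["cat"], TOY_VECTORS, k=1))
```

Expanding a short snippet in this way raises the number of terms it shares with related snippets, which is the low-term-frequency problem the abstract attributes to short texts. The expanded token lists could then be passed to any of the clustering methods the abstract names.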
Pages: 840-850
Number of pages: 11