An improved focused crawler based on Semantic Similarity Vector Space Model

Cited by: 36
Authors
Du, Yajun [1]
Liu, Wenjun [2]
Lv, Xianjing [2]
Peng, Guoli [2]
Affiliations
[1] Xihua Univ, Sch Comp & Software Engn, Chengdu 610039, Peoples R China
[2] Xihua Univ Lib, Chengdu 610039, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Focused crawler; Semantic similarity; VSM; SSRM; Formal concept analysis; Ontology;
DOI
10.1016/j.asoc.2015.07.026
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
A focused crawler is topic-specific and aims to selectively collect web pages that are relevant to a given topic from the Internet. In many studies, the Vector Space Model (VSM) and the Semantic Similarity Retrieval Model (SSRM) use cosine similarity and semantic similarity, respectively, to compute similarities between web pages and the given topic. However, if there are no common terms between a web page and the given topic, the VSM will not obtain the proper topical similarity of the web page. In addition, if all of the terms between them are synonyms, then the SSRM will also not obtain the proper topical similarity. To address these problems, this paper proposes an improved retrieval model, the Semantic Similarity Vector Space Model (SSVSM). The model integrates the TF*IDF values of the terms and the semantic similarities among the terms to construct topic and document semantic vectors that are mapped to the same double-term set, and it computes the cosine similarities between these semantic vectors as the topic-relevant similarities of documents, including the full texts and anchor texts of unvisited hyperlinks. Next, the proposed model predicts the priorities of the unvisited hyperlinks by integrating the full-text and anchor-text topic-relevant similarities. The experimental results demonstrate that this approach improves the performance of focused crawlers and outperforms other focused crawlers based on Breadth-First, VSM and SSRM. © 2015 Elsevier B.V. All rights reserved.
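The limitation of plain VSM described in the abstract, and the idea of folding term-to-term semantic similarity into a cosine score, can be illustrated with a minimal Python sketch. This is not the paper's SSVSM construction, which the abstract does not specify in enough detail to reproduce (in particular the double-term set mapping). As a stand-in it uses the soft cosine measure, which weights every pair of terms by their similarity. The synonym table, the term weights (standing in for TF*IDF values), the `link_priority` helper and its `alpha` weight are all hypothetical and chosen only for illustration.

```python
import math

def soft_cosine(a, b, sim):
    """Soft cosine between two sparse term-weight dicts under a term similarity."""
    def form(u, v):
        # Bilinear form: sum over all term pairs of weight * weight * similarity.
        return sum(wu * wv * sim(tu, tv)
                   for tu, wu in u.items()
                   for tv, wv in v.items())
    denom = math.sqrt(form(a, a) * form(b, b))
    return form(a, b) / denom if denom > 0 else 0.0

def cosine(a, b):
    """Plain VSM cosine: only identical terms are treated as similar."""
    return soft_cosine(a, b, lambda s, t: 1.0 if s == t else 0.0)

def link_priority(full_text_sim, anchor_sim, alpha=0.5):
    """Hypothetical linear blend of full-text and anchor-text similarity."""
    return alpha * full_text_sim + (1 - alpha) * anchor_sim

# Hypothetical synonym similarities (assumed values, not from the paper).
SYNONYMS = {
    frozenset(("crawler", "spider")): 0.9,
    frozenset(("web", "internet")): 0.8,
}

def toy_sim(s, t):
    if s == t:
        return 1.0
    return SYNONYMS.get(frozenset((s, t)), 0.0)

# Assumed term weights standing in for TF*IDF values.
topic = {"crawler": 0.8, "web": 0.6}
page = {"spider": 0.7, "internet": 0.5}   # shares no literal term with the topic
anchor = {"spider": 1.0}

vsm_score = cosine(topic, page)                    # no common terms
sem_page = soft_cosine(topic, page, toy_sim)       # synonyms now contribute
sem_anchor = soft_cosine(topic, anchor, toy_sim)
priority = link_priority(sem_page, sem_anchor)
```

In this toy case the plain cosine is zero because the page and the topic share no literal terms, while the similarity-aware score is positive. The blended `priority` value is the kind of number a crawler frontier could use to order unvisited hyperlinks.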
Pages: 392-407
Number of pages: 16