A focused crawler based on semantic disambiguation vector space model

Citations: 0
Authors
Liu, Wenjun [1]
He, Yu [1]
Wu, Jing [1]
Du, Yajun [1]
Liu, Xing [2]
Xi, Tiejun [1]
Gan, Zurui [1]
Jiang, Pengjun [1]
Huang, Xiaoping [1]
Affiliations
[1] Xihua University, School of Computer and Software Engineering, Chengdu 610039, People's Republic of China
[2] Xihua University, Xihua Honors College, Chengdu 610039, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
Focused crawler; Semantic disambiguation graph; Semantic vector space model; Semantic similarity; SIMILARITY; CLASSIFICATION; NETWORK; SYSTEM;
DOI
10.1007/s40747-022-00707-8
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
A focused crawler continuously fetches web pages related to a given topic, ordered by the priorities of unvisited hyperlinks. In many previous studies, focused crawlers predict these priorities with text similarity models. However, the terms used to represent a web page ignore polysemy, and the topic similarity of the text does not combine cosine similarity and semantic similarity effectively. To address these problems, this paper proposes a focused crawler based on a semantic disambiguation vector space model (SDVSM). The SDVSM method combines the semantic disambiguation graph (SDG) and the semantic vector space model (SVSM). The SDG removes ambiguous terms that are irrelevant to the given topic from the representation terms of retrieved web pages. The SVSM computes the topic similarity of the text by constructing text and topic semantic vectors from the TF × IDF weights of terms and the semantic similarities between terms. Experimental results comparing four focused crawlers on several evaluation indicators indicate that the SDVSM method improves focused crawler performance. In conclusion, the proposed method enables the focused crawler to retrieve more web pages related to the given topic from the Internet, and of higher quality.
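As a rough illustration of how TF × IDF weights and term-to-term semantic similarities can be combined into one topic similarity score, the sketch below uses a soft cosine measure. This is an assumption chosen for illustration: the abstract does not give the SVSM formula, so this is not the authors' method. The function names, the smoothed IDF variant, and the toy similarity table `TOY_SIM` are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return {term: TF x IDF weight} for one tokenized document.

    Uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, which is an assumption
    made here and not a formula taken from the paper.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    weights = {}
    for term, count in counts.items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def soft_cosine(a, b, term_sim):
    """Cosine similarity generalized with a term-to-term similarity function.

    With term_sim(t, u) = 1 if t == u else 0 this reduces to plain cosine.
    """
    def dot(x, y):
        return sum(wx * wy * term_sim(tx, ty)
                   for tx, wx in x.items()
                   for ty, wy in y.items())

    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


# Hypothetical similarity table; a real system might derive these scores
# from a lexical resource or from word embeddings.
TOY_SIM = {frozenset(("crawler", "spider")): 0.8}


def toy_term_sim(t1, t2):
    if t1 == t2:
        return 1.0
    return TOY_SIM.get(frozenset((t1, t2)), 0.0)


def exact_match(t1, t2):
    return 1.0 if t1 == t2 else 0.0


topic = ["focused", "crawler"]
page_related = ["web", "spider"]
page_unrelated = ["cooking", "recipe"]
corpus = [topic, page_related, page_unrelated]

topic_w = tf_idf(topic, corpus)
related_w = tf_idf(page_related, corpus)
unrelated_w = tf_idf(page_unrelated, corpus)

plain_score = soft_cosine(topic_w, related_w, exact_match)
semantic_score = soft_cosine(topic_w, related_w, toy_term_sim)
unrelated_score = soft_cosine(topic_w, unrelated_w, toy_term_sim)
```

In this toy setup the topic and the related page share no literal terms, so plain cosine gives them zero similarity, while the semantic variant gives a positive score because "crawler" and "spider" are marked as similar. A crawler could use such a score to rank the unvisited hyperlinks found on a page.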
Pages: 345-366
Page count: 22