Parallelizing the extraction of fresh information from online social networks

Cited by: 6
Authors
Guo, Rui [1]
Wang, Hongzhi [1]
Chen, Mengwen [1]
Li, Jianzhong [1]
Gao, Hong [1]
Affiliations
[1] Harbin Inst Technol, Harbin, Peoples R China
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2016 / Vol. 59
Keywords
Crawler; Freshness; Online social network;
DOI
10.1016/j.future.2015.11.021
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Online social networks (OSNs) are among the hottest new services in recent years. OSNs maintain records of the lives of users, thereby providing potential resources for journalists, sociologists, and business analysts. Crawling data from social networks is a basic step during the processing and analysis of social network information. However, as OSNs become larger and the information on the network updates faster than the web pages, crawling is more difficult due to limitations in terms of bandwidth, politeness or etiquette, and computational power. To extract fresh information from OSNs in an efficient and effective manner, we propose a novel method for crawling and we also discuss a parallelization architecture for OSNs. To identify the features of OSNs, we collected data from real OSNs, analyzed them, and built a model to describe the behavior of users. Based on this model, we developed methods to predict the behavior of users. According to these predictions, we can schedule our crawler in a more reasonable manner and extract more fresh information using parallelization techniques. Our experimental results demonstrate that the proposed strategies can extract information from OSNs in an efficient and effective manner. (C) 2015 Elsevier B.V. All rights reserved.
Pages: 33-46
Page count: 14