IRPDP_HT2: a scalable data pre-processing method in web usage mining using Hadoop MapReduce

Cited by: 1
Authors
Srivastava, Atul Kumar [1]
Srivastava, Mitali [2]
Affiliations
[1] Bennett Univ, Sch Comp Sci & Engn Technol, Greater Noida, India
[2] Capgemini Pvt Ltd, Pune, India
Keywords
Big data; MapReduce; Hadoop; Data pre-processing; Web mining; Web usage mining; INFORMATION; PREDICTION
DOI
10.1007/s00500-023-08019-w
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data preparation is a vital step in the web usage mining process because it provides structured data for the subsequent stages. Raw server logs must therefore be converted into user sessions to produce structured data for the pattern discovery phase. Over the past decade, server log production at popular websites has risen to many terabytes or even petabytes per day. As a result, server logs pose big data challenges in storage and processing. This study focuses on the initial phases of the web usage mining process: data cleaning, user identification, and session identification. These phases are data-intensive and are also deemed computation-intensive. In the last decade, MapReduce has emerged as one of the best parallel programming frameworks for data-intensive applications. This study proposes an efficient MapReduce-based data pre-processing algorithm, IRPDP_HT2. Previous parallel data pre-processing algorithms either cover only some of the phases or lack efficient robot detection approaches. IRPDP_HT2 uses a variety of efficient heuristics across all three phases of data pre-processing to identify both ethical and unethical robots. Experiments on a cluster of nodes show that the proposed IRPDP_HT2 approach is effective and scalable for larger datasets. The effectiveness of the proposed heuristics is also examined during the session identification phase. Three variants of IRPDP_HT2, namely PDP_HT2, IPDP_HT2, and RPDP_HT2, are also developed and tested. The impact of robots' requests and internal dummy connections' requests on the session count is 45.81% for IRPDP_HT2, which is higher than for PDP_HT2, IPDP_HT2, and RPDP_HT2. Speed-up and size-up are also analysed to demonstrate the scalability of the algorithm. For larger datasets, the algorithm's running time falls as the number of data nodes grows. The size-up of IRPDP_HT2 shows that, for a fixed number of data nodes, doubling the input data does not make the algorithm's running time grow in the same ratio.
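The three pre-processing phases named in the abstract (data cleaning, user identification, session identification) can be illustrated with a small single-process sketch written in a map/reduce style. This is not the IRPDP_HT2 algorithm: the abstract does not state its heuristics, so every rule below is an assumption chosen for illustration, including the 30-minute session timeout, the (IP, user agent) user key, the user-agent markers, and treating any client that requests `/robots.txt` as a robot. All function and constant names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed values for illustration only; the abstract does not state them.
SESSION_TIMEOUT = timedelta(minutes=30)
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
NOISE_AGENT_MARKERS = ("bot", "crawler", "spider", "internal dummy connection")


def map_clean(record):
    """Map step: drop noise, then emit (user_key, (timestamp, url))."""
    ip, agent, timestamp, url, status = record
    if status != 200 or url.lower().endswith(STATIC_SUFFIXES):
        return None  # data cleaning: failed or embedded-resource request
    if any(marker in agent.lower() for marker in NOISE_AGENT_MARKERS):
        return None  # self-declared robot or server-internal request
    return (ip, agent), (timestamp, url)  # user identification key


def reduce_sessions(requests):
    """Reduce step: split one user's requests into sessions by time gap."""
    sessions = []
    for timestamp, url in sorted(requests):
        if sessions and timestamp - sessions[-1][-1][0] <= SESSION_TIMEOUT:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return sessions


def preprocess(records):
    """Run map, shuffle (group by key), and reduce in a single process."""
    grouped = defaultdict(list)
    for record in records:
        pair = map_clean(record)
        if pair is not None:
            key, value = pair
            grouped[key].append(value)
    # Assumed heuristic: a client that requested /robots.txt is treated as a
    # robot even when its user agent looks like a browser.
    return {
        key: reduce_sessions(values)
        for key, values in grouped.items()
        if not any(url == "/robots.txt" for _, url in values)
    }
```

In a Hadoop job, the grouping done here by `defaultdict` would be handled by the framework's shuffle stage, with `map_clean` and `reduce_sessions` playing the mapper and reducer roles.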
Pages: 7907-7923
Page count: 17