Analysis of Data Extraction and Data Cleaning in Web Usage Mining

Cited by: 5
Authors
Srivastava, Mitali [1]
Garg, Rakhi [2]
Mishra, P. K. [1]
Affiliations
[1] Banaras Hindu Univ, Fac Sci, Dept Comp Sci, Varanasi, Uttar Pradesh, India
[2] Banaras Hindu Univ, Mahila Maha Vidyalaya, Comp Sci Sect, Varanasi, Uttar Pradesh, India
Source
ICARCSET'15: PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON ADVANCED RESEARCH IN COMPUTER SCIENCE ENGINEERING & TECHNOLOGY (ICARCSET - 2015) | 2015
Keywords
Web usage mining; Data preprocessing; Data extraction; Data cleaning;
DOI
10.1145/2743065.2743078
Chinese Library Classification number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Data preprocessing is considered an important phase of Web usage mining because log data is unstructured, heterogeneous and noisy. Complete and effective data preprocessing ensures the efficiency and scalability of the algorithms used in the pattern discovery phase of Web usage mining. Data preprocessing generally includes the following steps: data fusion, data cleaning, user identification, session identification, path completion, etc. Data cleaning is the initial and important preprocessing step, which produces cleaned data for further processing. When the analysis covers a specific time duration (e.g. one day, one week or one month), it is important to apply data extraction to the raw log data before data cleaning. In this paper we focus mainly on the data fusion, data extraction and data cleaning steps of preprocessing, and propose an algorithm for data extraction that extracts log data according to the time duration of the analysis. This algorithm also sorts log entries by date and time, which will be used further in predicting the browsing sequence of a user. We then apply a data cleaning algorithm to the extracted real Web server log. In data cleaning, almost all irrelevant files, irrelevant HTTP methods and wrong HTTP status codes are considered, and the experiment shows that the raw log data reduces to almost 80%, which shows the importance of the initial phases of data preprocessing.
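The extraction and cleaning steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm, whose details are not given here. It assumes the server writes Common Log Format lines, that "irrelevant files" means the static-asset extensions listed in `IRRELEVANT_EXT`, that only `GET` and `POST` are relevant methods, and that only 2xx status codes are kept. The names `parse`, `extract` and `clean` are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed input: Common Log Format,
# host ident user [timestamp] "METHOD url PROTOCOL" status size
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumptions for illustration only; the paper's actual lists may differ.
IRRELEVANT_EXT = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
KEEP_METHODS = {"GET", "POST"}


def parse(line):
    """Turn one log line into a dict, or None if it does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    return {
        "host": m["host"],
        "time": datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z"),
        "method": m["method"],
        "url": m["url"],
        "status": int(m["status"]),
    }


def extract(lines, start, end):
    """Keep entries with start <= time < end, sorted by date and time."""
    entries = [e for e in map(parse, lines) if e is not None]
    window = [e for e in entries if start <= e["time"] < end]
    return sorted(window, key=lambda e: e["time"])


def clean(entries):
    """Drop static assets, unwanted methods and non-2xx responses."""
    kept = []
    for e in entries:
        path = e["url"].split("?", 1)[0].lower()  # ignore the query string
        if path.endswith(IRRELEVANT_EXT):
            continue
        if e["method"] not in KEEP_METHODS:
            continue
        if not 200 <= e["status"] < 300:
            continue
        kept.append(e)
    return kept
```

Running `extract` before `clean` means the cleaning rules are applied only to the entries inside the chosen time window, and the sorted output preserves the request order needed for later sequence analysis. The `start` and `end` arguments must be timezone-aware `datetime` objects, since the parsed timestamps carry a UTC offset.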
Pages: 6