Machine learning based heterogeneous web advertisements detection using a diverse feature set

Cited by: 4
Authors
Nengroo, Ab Shaqoor [1]
Kuppusamy, K. S. [1]
Affiliations
[1] Pondicherry Univ, Sch Engn & Technol, Dept Comp Sci, Pondicherry 605014, India
Source
FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE | 2018 / Vol. 89
Keywords
Advertisements; Web accessibility; Content extraction; Random forest; Machine learning; EXTRACTION;
DOI
10.1016/j.future.2018.06.028
Chinese Library Classification (CLC) number
TP301 [Theory, methods];
Discipline classification code
081202;
Abstract
Advertisement identification and filtering in web pages have gained significance due to factors such as accessibility, security, privacy, and obtrusiveness. Current practice involves maintaining URL-based regular expressions called filter lists; each URL found on a web page is matched against the filter list. While effective, this procedure lacks scalability because the filter list demands regular maintenance. To counter these limitations, we devise a machine learning based advertisement detection system using a diverse feature set that can distinguish advertisement blocks from non-advertisement blocks. The method can act as a base for accessibility-related features such as smooth browsing and text summarization for persons with visual impairments, cognitive impairments, and photosensitive epilepsy. A classifier trained on the proposed feature set achieves 98.6% accuracy in identifying advertisements. © 2018 Elsevier B.V. All rights reserved.
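The filter-list baseline described in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the patterns, host names, and the function names `is_blocked` and `filter_urls` are hypothetical, chosen only to show how each URL on a page is matched against a list of regular expressions.

```python
import re

# Hypothetical filter list: illustrative patterns only, not from the paper
# or from any real ad-blocking list.
FILTER_LIST = [
    r"^https?://([^/]+\.)?ads\.example\.com/",  # a known ad-serving host
    r"/banners?/",                               # a common ad path segment
    r"[?&]ad_id=\d+",                            # an ad-tracking query parameter
]
COMPILED = [re.compile(pattern) for pattern in FILTER_LIST]


def is_blocked(url):
    """Return True if any filter-list pattern matches the URL."""
    return any(pattern.search(url) for pattern in COMPILED)


def filter_urls(urls):
    """Keep only the URLs that no filter-list pattern matches."""
    return [url for url in urls if not is_blocked(url)]


page_urls = [
    "https://ads.example.com/unit.js",
    "https://news.example.org/banners/top.png",
    "https://news.example.org/article?id=3",
    "https://promo.example.net/unit.js",  # new ad host, not yet in the list
]
kept = filter_urls(page_urls)
```

In this sketch the last URL passes through because no pattern covers its host yet, which is the maintenance burden the abstract points to: the list only catches what someone has already added to it.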
Pages: 68-77
Page count: 10