Google Penguin: Evasion in Non-English Languages and a New Classifier

Cited by: 4
Authors
Alarifi, Abdulrahman [1]
Alsaleh, Mansour [1]
Al-Salman, AbdulMalik [2]
Alswayed, AbdulMajeed [2]
Alkhaledi, Ahmad [2]
Affiliations
[1] King Abdulaziz City Sci & Technol, Comp Res Inst, Riyadh, Saudi Arabia
[2] King Saud Univ, Dept Comp Sci, Riyadh, Saudi Arabia
Source
2013 12TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS (ICMLA 2013), VOL 2 | 2013
Keywords
Web spam; Link spam; Content spam; Search engine spam; Spamdexing
DOI
10.1109/ICMLA.2013.135
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Web spam techniques aim to mislead search engines so that web spam pages get ranked higher than they deserve. This leads to misleading search results, as spam pages might appear in search results although the content of these pages might not be related to the search terms. Despite the efforts of search engines to deploy various techniques to detect web spam pages and filter them out of their search results, spammers continue to develop new tactics to evade search engines' detection mechanisms. In this paper, we study the effectiveness and accuracy of newly developed antispamming techniques in the Google search engine. Focusing on Arabic spam pages, our results show that Google's antispamming techniques are ineffective against spam pages with Arabic content. We explore various types of web spam detection features to obtain an appropriate set of features that yields reasonable detection accuracy. To build and evaluate our classifier, we collect and manually label a dataset of Arabic web pages, including both benign and spam pages. We believe this Arabic web spam corpus helps researchers conduct sound measurement studies. We also develop a browser plug-in that uses our classifier and warns the user about web spam pages before accessing them, upon clicking on a search term. The plug-in can also filter search engine results.
Pages: 274-280
Page count: 7