Content-based analysis to detect Arabic web spam

Cited by: 16
Authors
Al-Kabi, Mohammed
Wahsheh, Heider
Alsmadi, Izzat [1]
Al-Shawakfa, Emad
Wahbeh, Abdullah [2]
Al-Hmoud, Ahmed
Affiliations
[1] Yarmouk Univ, CIS Dept, Irbid 21163, Jordan
[2] Dakota State Univ, Madison, SD USA
Keywords
Arabic content features; Arabic web spam; Arabic web spam detection; content features; web spam; web spam detection; content trust model
DOI
10.1177/0165551512439173
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Search engines are important outlets for information query and retrieval. They have to deal with the continual increase of information available on the web and provide users with convenient access to such huge amounts of information. With this much information, an increasingly difficult challenge is eliminating spam in web pages. For several reasons, web spammers try to intrude in the search results and inject artificially biased results in favour of their websites or pages. Spam pages are added to the internet on a daily basis, making it difficult for search engines to keep up with the fast-growing and dynamic nature of the web, especially since spammers tend to add more keywords to their websites to deceive the search engines and increase the rank of their pages. In this research, we investigated four classification algorithms (naive Bayes, decision tree, SVM and K-NN) for detecting Arabic web spam pages based on content. The three groups of datasets used, with 1%, 15% and 50% spam content, were collected using a crawler customized for this study. Spam pages were classified manually. Tests and comparisons revealed that the decision tree was the best classifier for this purpose.
Pages: 284-296
Number of pages: 13