Web spam detection using trust and distrust-based ant colony optimization learning

Cited by: 5
Authors
Manaskasemsak, Bundit [1]
Rungsawang, Arnon [1]
Affiliations
[1] Kasetsart Univ, Fac Engn, Dept Comp Engn, Bangkok, Thailand
Keywords
Trust; Adaptive learning path; Ant colony optimization; Distrust; Spam detection; Web spam
DOI
10.1108/IJWIS-12-2014-0047
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Purpose - This paper presents a machine learning approach to Web spam detection. Building on ant colony optimization (ACO), three algorithms are proposed to construct rule-based classifiers that distinguish non-spam hosts from spam hosts. The paper also proposes an adaptive learning technique to enhance spam detection performance.
Design/methodology/approach - The Trust-ACO algorithm lets an ant start from a non-spam seed and then decide which paths to walk through in the host graph. The trails (i.e. trust paths) discovered by ants are interpreted and compiled into non-spam classification rules. Similarly, the Distrust-ACO algorithm generates spam classification rules. The third algorithm, Combine-ACO, accumulates the rules produced by the two former algorithms. In addition, an adaptive learning technique lets ants walk with longer (or shorter) steps by rewarding them when they find desirable paths and penalizing them otherwise.
Findings - Experiments are conducted on two publicly available datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007. The results show that the proposed algorithms outperform well-known rule-based classification baselines. In particular, the proposed adaptive learning technique helps improve the AUC scores to as high as 0.899 and 0.784 on the former and latter datasets, respectively.
Originality/value - To the best of our knowledge, this is the first comprehensive study that adopts the ACO learning approach to solve the problem of Web spam detection. In addition, we have improved the traditional ACO by using the adaptive learning technique.
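The walk described under Design/methodology/approach can be illustrated with a minimal sketch. It is not the authors' Trust-ACO implementation: the abstract does not give the transition rule, the pheromone update, the definition of a desirable path, or how trails are compiled into rules, so everything below is an assumption made for illustration. In particular, the function name `trust_walk`, the toy host graph, the parameters (`n_ants`, `start_len`, `evaporation`, and so on), and the choice to call a trail "desirable" when it contains no host labelled spam are all hypothetical. Rule compilation is omitted.

```python
import random

def trust_walk(graph, seeds, labels, n_ants=50, start_len=2,
               min_len=1, max_len=5, evaporation=0.1, seed=0):
    """Toy trust-path walk with an adaptive step budget (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: every edge of the host graph starts with the same pheromone.
    pheromone = {(u, v): 1.0 for u, nbrs in graph.items() for v in nbrs}
    walk_len = start_len
    for _ in range(n_ants):
        node = rng.choice(seeds)          # each ant starts at a non-spam seed
        trail = [node]
        for _ in range(walk_len):
            nbrs = graph.get(node, [])
            if not nbrs:
                break
            # Assumption: the next host is chosen in proportion to pheromone.
            weights = [pheromone[(node, v)] for v in nbrs]
            node = rng.choices(nbrs, weights=weights, k=1)[0]
            trail.append(node)
        # Assumption: a trail is desirable if it never touches a labelled spam host.
        desirable = all(labels.get(h) != "spam" for h in trail)
        for edge in pheromone:            # evaporate on every edge
            pheromone[edge] *= 1.0 - evaporation
        if desirable:
            for edge in zip(trail, trail[1:]):
                pheromone[edge] += 1.0    # reinforce the trust path
            walk_len = min(max_len, walk_len + 1)   # reward: longer walks
        else:
            walk_len = max(min_len, walk_len - 1)   # penalty: shorter walks
    return pheromone, walk_len

# Hypothetical four-host graph: "a" is the trusted seed, "s" is a known spam host.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "s"], "s": ["s"]}
labels = {"a": "nonspam", "b": "nonspam", "c": "nonspam", "s": "spam"}
pheromone, walk_len = trust_walk(graph, ["a"], labels)
```

In this sketch, edges that lead into the spam host never receive a deposit, so their pheromone only decays, while edges on spam-free trails are reinforced. A distrust variant would, under the same assumptions, start from spam seeds and reward trails that stay among spam hosts.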
Pages: 142-161
Page count: 20