WebGuard: A Web filtering engine combining textual, structural, and visual content-based analysis

Cited by: 68
Authors
Hammami, M
Chahir, Y
Chen, LM
Affiliations
[1] Ecole Cent Lyon, LIRIS, CNRS, UMR 5205, F-69134 Ecully, France
[2] Univ Caen, CNRS, URA 6072, GREYC, F-14032 Caen, France
Keywords
Web classification and categorization; data mining; Web textual and structural content; visual content analysis; skin color model; pornographic Web site filtering
DOI
10.1109/TKDE.2006.34
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Along with the ever-growing Web comes the proliferation of objectionable content, such as sex, violence, racism, etc. We need efficient tools for classifying and filtering undesirable Web content. In this paper, we investigate this problem and describe WebGuard, an automatic machine learning-based pornographic Web site classification and filtering system. Unlike most commercial filtering products, which are mainly based on textual content-based analysis such as indicative keywords detection or manually collected black list checking, WebGuard relies on several major data mining techniques associated with textual, structural content-based analysis, and skin color related visual content-based analysis as well. Experiments conducted on a testbed of 400 Web sites including 200 adult sites and 200 nonpornographic ones showed WebGuard's filtering effectiveness, reaching a 97.4 percent classification accuracy rate when textual and structural content-based analysis was combined with visual content-based analysis. Further experiments on a black list of 12,311 adult Web sites manually collected and classified by the French Ministry of Education showed that WebGuard scored a 95.62 percent classification accuracy rate. The basic framework of WebGuard can apply to other categorization problems of Web sites which combine, as most of them do today, textual and visual content.
Pages: 272-284
Page count: 13