A novel approach for effective web page classification

Cited by: 0
Authors
Mangai, J. Alamelu [1]
Kumar, V. Santhosh [1]
Balamurugan, S. Appavu [1]
Affiliations
[1] BITS, Dept Comp Sci & Engn, Pilani Dubai Campus, POB 345055, Dubai, U Arab Emirates
Keywords
feature selection; data tuning; web page classification; machine learning; WebKB
DOI
10.1504/IJDMMM.2013.055860
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the volume of the WWW increasing exponentially every day, web page classification has become tedious. Since there are no quality mining results without quality data, it is worthwhile to emphasise fine-tuning the data for classification rather than improving the classifiers themselves. This paper investigates methods for improving web page classification through feature extraction, feature selection and data tuning. It also proposes a new model for web page classification, the probabilistic web page classifier (PWPC), which is based on a probabilistic framework and an attribute-value similarity (AVS) measure. The proposed method is tested on the WebKB benchmark dataset, and PWPC on the fine-tuned web pages shows a significant accuracy improvement over traditional machine learning classifiers.
Pages: 233-245
Page count: 13