Innovating web page classification through reducing noise

Cited by: 12
Authors
Li, XL [1]
Shi, ZZ
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100080, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Singapore 117543, Singapore
Funding
National Natural Science Foundation of China
Keywords
web page classification; similarity measure; classification algorithm without noise
DOI
10.1007/BF02949820
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
This paper presents a new method that eliminates noise in Web page classification. It first describes the representation of a Web page based on HTML tags. Then, through a novel distance formula, it eliminates noise in the similarity measure. After carefully analyzing Web pages, we design an algorithm that distinguishes related hyperlinks from noisy ones. The non-noisy hyperlinks are then used to improve the performance of Web page classification (the CAWN algorithm): any page can be classified through the text and categories of its related neighbor pages. The experimental results show that our approach improves classification accuracy.
Pages: 9-17
Number of pages: 9
Related papers
15 in total
[1] AUTOMATED LEARNING OF DECISION RULES FOR TEXT CATEGORIZATION [J]. APTE, C; DAMERAU, F; WEISS, SM. ACM TRANSACTIONS ON INFORMATION SYSTEMS, 1994, 12 (03): 233-251
[2] Bharat K., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P104, DOI 10.1145/290941.290972
[3] Chakraborty S. S., 1998, Acta Polytechnica Scandinavica, Electrical Engineering Series, P1
[4] COHEN WW, 1996, P 19 ANN INT ACM SIG, P307
[5] JOACHIMS T, 1997, INT C MACHINE LEARNI
[6] Automatic text categorization and its application to text retrieval [J]. Lam, W; Ruiz, M; Srinivasan, P. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 1999, 11 (06): 865-879
[7] Lang K., 1995, Machine Learning. Proceedings of the Twelfth International Conference on Machine Learning, P331
[8] LEWIS DD, 1996, P 19 ANN INT ACM SIG, P298
[9] LI XL, 2000, P C INT INF PROC 16, P398
[10] MODHA DS, 2000, IBM RES REPORT