A novel content and style based measurement of web pages distance

Cited by: 0
Authors
Zhang, QP [1]
Liang, M [1]
Lai, LL [1]
Affiliation
[1] Fudan Univ, Dept Comp Sci & Engn, Shanghai 200433, Peoples R China
Source
PROCEEDINGS OF 2005 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-9 | 2005
Keywords
web mining; distance function; web page; cluster
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Many web-based systems now use machine learning techniques to build more intelligent mechanisms for organizing, indexing, and retrieving web content, and a growing number of studies and applications need to calculate the distance between web pages in a principled way. Commonly proposed methods are suited to extracting the differences between the HTML documents of web pages, but their results cannot tell the actual distance between the content of the pages or between their appearance as displayed in web browsers. To address this, the paper proposes a content distance, a style distance, and a hybrid distance, to make the measurement results more practical. The efficiency of the approach is shown through several classical experiments.
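The abstract does not define the three distances, so the following is only an illustrative sketch of how such measures could be combined, not the authors' formulation. It assumes that content distance is a cosine distance over visible word counts, that style distance is a cosine distance over HTML tag counts, and that the hybrid distance is a weighted sum of the two; the function names and the weight `alpha` are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _PageSplitter(HTMLParser):
    """Collects word counts (content) and tag counts (style) from HTML."""

    def __init__(self):
        super().__init__()
        self.words = Counter()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

    def handle_data(self, data):
        self.words.update(re.findall(r"[a-z0-9]+", data.lower()))


def _split(html):
    parser = _PageSplitter()
    parser.feed(html)
    parser.close()
    return parser.words, parser.tags


def _cosine_distance(a, b):
    """1 - cosine similarity of two count vectors, clamped to [0, 1]."""
    if not a and not b:
        return 0.0  # two empty vectors are treated as identical
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    if norm == 0.0:
        return 1.0  # exactly one vector is empty
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    return min(1.0, max(0.0, 1.0 - dot / norm))


def content_distance(html_a, html_b):
    # Assumption: content is approximated by the bag of visible words.
    return _cosine_distance(_split(html_a)[0], _split(html_b)[0])


def style_distance(html_a, html_b):
    # Assumption: style is approximated by the bag of tag names.
    return _cosine_distance(_split(html_a)[1], _split(html_b)[1])


def hybrid_distance(html_a, html_b, alpha=0.5):
    # Hypothetical weighting: alpha for content, (1 - alpha) for style.
    return alpha * content_distance(html_a, html_b) + (
        1.0 - alpha
    ) * style_distance(html_a, html_b)
```

Under these assumptions, two pages with the same text but different markup, such as `<h1>Hello world</h1>` and `<p>Hello world</p>`, have a content distance near zero and a style distance of one, which is the kind of separation that a plain diff of the HTML source would not expose.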
Pages: 429-435
Page count: 7