Research on Web Page Classification Method Based on Query Log

Cited by: 1
Authors
Ye F. [1]
Ma Y. [1]
Affiliation
[1] School of Computer Engineering and Science, Shanghai University, Shanghai
Keywords
query log; Web page classification; word embedding
DOI
10.1007/s12204-017-1899-0
CLC number
TP 391.1 (document code: A)
Subject classification number
Abstract
Web page classification is an important application in many fields of Internet information retrieval, such as directory classification and vertical search. Query-log-based methods are a lightweight form of Web page classification: they avoid crawling Web content and are therefore relatively efficient, but the sparsity of user click data makes it difficult to build a classifier from that data directly. To solve this problem, we explore the semantic relations among different queries through word embedding and propose three improved graph-structure classification algorithms. First, to reflect the semantic relevance between queries, we map each user query into a low-dimensional space according to its query vector. Then, we calculate the uniform resource locator (URL) vector according to the relationship between the query and the URL. Finally, we use the improved label propagation algorithm (LPA) and the bipartite graph expansion algorithm to classify the unlabeled Web pages. Experiments show that our methods improve the F1-value by about 20% over other Web page classification methods based on query logs. © 2017, Shanghai Jiaotong University and Springer-Verlag GmbH Germany, part of Springer Nature.
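The pipeline outlined in the abstract (query vectors, then URL vectors, then label propagation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the URL vector is computed or how the LPA is improved, so the sketch assumes a click-weighted average of query vectors and uses plain label propagation with clamped seed labels over a cosine-similarity graph. The function names, the toy query vectors, and the click counts are all hypothetical.

```python
import numpy as np

def url_vectors(query_vecs, clicks):
    """Assumed rule: each URL vector is the click-weighted mean of the
    vectors of the queries that led to it.
    query_vecs: (n_queries, dim); clicks: (n_queries, n_urls) click counts."""
    totals = np.maximum(clicks.sum(axis=0, keepdims=True), 1e-12)
    weights = clicks / totals            # each column sums to 1
    return weights.T @ query_vecs        # (n_urls, dim)

def cosine_graph(vecs):
    """Non-negative cosine similarity between all pairs, no self-loops."""
    norms = np.maximum(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-12)
    unit = vecs / norms
    sim = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(sim, 0.0)
    return sim

def label_propagation(sim, labels, n_classes, n_iter=50):
    """Plain label propagation; labels uses -1 for unlabeled nodes."""
    row_sums = np.maximum(sim.sum(axis=1, keepdims=True), 1e-12)
    trans = sim / row_sums               # row-normalized transition matrix
    labeled = labels >= 0
    seed = np.zeros((len(labels), n_classes))
    seed[labeled, labels[labeled]] = 1.0
    scores = seed.copy()
    for _ in range(n_iter):
        scores = trans @ scores          # spread scores to neighbors
        scores[labeled] = seed[labeled]  # clamp the known labels
    return scores.argmax(axis=1)

# Hypothetical toy data: 4 queries embedded in 2 dimensions, 4 URLs.
query_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
clicks = np.array([[3, 1, 0, 0],
                   [0, 2, 0, 0],
                   [0, 0, 4, 1],
                   [0, 0, 0, 1]], dtype=float)
labels = np.array([0, -1, 1, -1])        # URLs 1 and 3 are unlabeled

urls = url_vectors(query_vecs, clicks)
predicted = label_propagation(cosine_graph(urls), labels, n_classes=2)
print(predicted.tolist())
```

In this toy setup the two unlabeled URLs share clicks with semantically close queries, so each takes the label of its nearest labeled neighbor even though no query clicked on both a labeled and an unlabeled URL in the second cluster more than once. That is the sparsity problem the abstract refers to: the embedding supplies similarity where the raw click graph has few edges.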
Pages: 404-410
Page count: 6