A classification approach for less popular webpages based on latent semantic analysis and rough set model

Cited by: 27
Authors
Wang, Jun [1]
Peng, Jiaxu [1]
Liu, Ou [2]
Affiliations
[1] Beihang Univ, Sch Econ & Management, Beijing 100191, Peoples R China
[2] Hong Kong Polytech Univ, Sch Accounting & Finance, Kowloon, Hong Kong, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Webpage classification; Complex network analysis; Rough set; Latent semantic analysis; SYSTEM;
DOI
10.1016/j.eswa.2014.08.013
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the explosive growth of web information, webpage classification faces great challenges. Computers have difficulty understanding the semantic meaning of textual and non-textual webpages. Fortunately, Web 2.0 collaborative tagging systems bring new opportunities to solve this problem: they abstract structured tags from the unstructured content of webpages. However, a large number of webpages on the Internet are less popular. Their tagging information is sparse, which makes their topics unclear and leads to ambiguous classification. Inspired by this "ambiguous classification", we call a less popular webpage a "hesitant webpage". In this paper, we propose an approach for classifying hesitant webpages. First, hesitant webpages are divided into bridges, hubs and attached webpages according to their roles on the Internet. Second, attached webpages are classified by mining and extending their information from two perspectives. One is latent semantic analysis (LSA), which is applied to fully explore the semantic meaning of sparse tags and supports accurate recognition of webpages that are semantically close to the attached webpages. The other is the proposed density-relation-based rough set model, which measures the affiliation degree of attached webpages in different categories. Experiments on real data show that our approach effectively classifies hesitant webpages based on their semantic meaning. (C) 2014 Elsevier Ltd. All rights reserved.
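As a rough illustration of the LSA step mentioned in the abstract, the sketch below applies a truncated singular value decomposition to a small tag-by-webpage count matrix and compares a sparsely tagged page with the others by cosine similarity in the latent space. This is not the authors' implementation: the tag names, page names, counts, the choice of k = 2, and the helper names lsa_page_vectors and cosine are all hypothetical and chosen only for this example. The density-relation-based rough set model is not sketched, since the abstract does not define it.

```python
import numpy as np

# Hypothetical toy data: rows are tags, columns are webpages,
# entries are how often a tag was applied to a page.
tags = ["python", "code", "tutorial", "recipe", "baking", "dessert"]
pages = ["prog_a", "prog_b", "cook_a", "cook_b", "sparse_page"]
counts = np.array([
    [3, 2, 0, 0, 0],  # python
    [2, 3, 0, 0, 1],  # code (the only tag on sparse_page)
    [1, 2, 0, 0, 0],  # tutorial
    [0, 0, 3, 2, 0],  # recipe
    [0, 0, 2, 3, 0],  # baking
    [0, 0, 1, 2, 0],  # dessert
], dtype=float)


def lsa_page_vectors(tag_page_matrix, k):
    """Project each webpage (column) into a k-dimensional latent space."""
    u, s, vt = np.linalg.svd(tag_page_matrix, full_matrices=False)
    # Page coordinates are the columns of diag(s_k) @ vt_k; return one row per page.
    return (s[:k, None] * vt[:k, :]).T


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


vectors = lsa_page_vectors(counts, k=2)
sparse = vectors[pages.index("sparse_page")]
similarity = {
    page: cosine(sparse, vectors[i])
    for i, page in enumerate(pages)
    if page != "sparse_page"
}
closest = max(similarity, key=similarity.get)
print(closest, similarity)
```

In this toy data the sparsely tagged page carries a single tag, yet in the two-dimensional latent space it lies along the same direction as the programming pages and is nearly orthogonal to the cooking pages. This is the sense in which LSA can extend sparse tag information to semantically close pages.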
Pages: 642-648
Number of pages: 7