Scalable Web Mining with Newistic

Authors
Dan, Ovidiu [1]
Mocian, Horatiu [2]
Affiliations
[1] INHOLLAND Univ, Diemen, Netherlands
[2] Imperial Coll, London, England
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS | 2009, Vol. 5476
Keywords
Web mining; text mining; clustering; information extraction; quality threshold;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Newistic is a web mining platform that collects and analyses documents crawled from the Internet. Although it currently processes news articles, it can be easily adapted to any other form of text. The data mining functions performed by the system are categorization, clustering and named entity extraction. The main design concern of the system is scalability, which is achieved by a modular architecture that allows multiple instances of the same component to be run in parallel. This paper presents a novel algorithm for analysing web pages which tries to determine the title and text of a news item directly from the HTML code, discarding noise such as menus, ads, or copyright notices. Another contribution of this paper is the application of the Quality Threshold clustering algorithm to document clustering. Additionally, the algorithm has been optimized to increase its speed.
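For readers unfamiliar with Quality Threshold (QT) clustering, the following is a minimal sketch of the generic algorithm as commonly described (Heyer et al., 1999). It is not the optimized variant the abstract mentions, whose details are not given here. The function name `qt_cluster`, its parameters, and the toy one-dimensional data are assumptions made for illustration; for documents, `distance` would typically be something like a cosine distance between term vectors.

```python
def qt_cluster(points, max_diameter, distance):
    """Generic Quality Threshold clustering (illustrative sketch).

    points       -- list of items to cluster
    max_diameter -- largest allowed pairwise distance inside a cluster
    distance     -- function(a, b) -> non-negative float (hypothetical hook)
    Returns a list of clusters, each a sorted list of indices into points.
    """
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        best = []
        # Grow one candidate cluster from every remaining point.
        for seed in remaining:
            candidate = [seed]
            # far[j] = distance from j to the farthest current member,
            # i.e. the diameter the candidate would have after adding j.
            far = {j: distance(points[seed], points[j])
                   for j in remaining if j != seed}
            while far:
                nxt = min(far, key=far.get)
                if far[nxt] > max_diameter:
                    break  # every further addition would exceed the threshold
                candidate.append(nxt)
                del far[nxt]
                for j in far:
                    far[j] = max(far[j], distance(points[nxt], points[j]))
            if len(candidate) > len(best):
                best = candidate
        # Keep the largest candidate, remove its points, and repeat.
        clusters.append(sorted(best))
        chosen = set(best)
        remaining = [i for i in remaining if i not in chosen]
    return clusters


if __name__ == "__main__":
    values = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]  # toy 1-D data, not from the paper
    print(qt_cluster(values, 0.5, lambda a, b: abs(a - b)))
```

Unlike k-means, this approach needs no preset number of clusters, only a diameter threshold. The naive version sketched here recomputes every candidate cluster on each pass, which is presumably why a speed optimization is worthwhile.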
Pages: 556 / +
Page count: 3