Web information extraction using generalized hidden Markov model

Citations: 0
Authors
Zhong, Ping [1]
Chen, Jinlin [2]
Cook, Terry [1]
Affiliations
[1] CUNY, Grad Ctr, Dept Comp Sci, New York, NY 10021 USA
[2] CUNY, Grad Ctr, Queens Coll, Dept Comp Sci, New York, NY 10021 USA
Source
2006 1ST IEEE WORKSHOP ON HOT TOPICS IN WEB SYSTEMS AND TECHNOLOGIES | 2006
Keywords
hidden Markov model; information extraction; layout analysis; web
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The Hidden Markov Model (HMM) is an important approach to information extraction (IE). When applied to Web IE, HMM-based approaches have several problems because they do not take Web-specific features into account. In this paper we present a Generalized Hidden Markov Model (GHMM) that extends traditional HMMs by making use of Web-specific information for Web IE. In our approach we use the Web content block, instead of the term, as the basic extraction unit. In addition, instead of using the traditional sequential state transition order, we determine the state transition order of the GHMM from the layout structure of the corresponding Web page. Furthermore, we use multiple emission features instead of a single emission feature. In this way the GHMM can better accommodate Web IE. Experiments show promising results compared with traditional HMM-based Web IE.
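The abstract gives no formulas, so the following is only an illustrative sketch of two of its ideas: decoding over content blocks rather than terms, and scoring each block with several emission features. It is not the authors' implementation. The function name `viterbi_blocks`, the feature names (`font`, `text`), the state labels, all probabilities, and the assumption that features are independent given the state are hypothetical. The blocks are taken in a given order, since the abstract does not say how that order is derived from the page layout.

```python
import math

def viterbi_blocks(blocks, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for a list of content blocks.

    blocks : list of dicts, feature name -> observed value (one dict per block)
    emit_p : emit_p[state][feature][value] -> probability
    All start and transition probabilities must be non-zero (logs are taken).
    """
    def log_emit(state, block):
        # Assumption: features are independent given the state,
        # so their log-probabilities simply add up.
        total = 0.0
        for feat, val in block.items():
            # Unseen values get a small floor probability (hypothetical choice).
            total += math.log(emit_p[state][feat].get(val, 1e-6))
        return total

    # Scores for the first block.
    scores = [{s: math.log(start_p[s]) + log_emit(s, blocks[0]) for s in states}]
    back = []  # back[i][s] = best predecessor of state s at block i + 1

    for block in blocks[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            cur[s] = prev[best] + math.log(trans_p[best][s]) + log_emit(s, block)
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Hypothetical toy data: two blocks, two features per block.
states = ["title", "price"]
blocks = [{"font": "large", "text": "words"},
          {"font": "small", "text": "number"}]
start_p = {"title": 0.8, "price": 0.2}
trans_p = {"title": {"title": 0.3, "price": 0.7},
           "price": {"title": 0.4, "price": 0.6}}
emit_p = {
    "title": {"font": {"large": 0.9, "small": 0.1},
              "text": {"words": 0.8, "number": 0.2}},
    "price": {"font": {"large": 0.2, "small": 0.8},
              "text": {"words": 0.1, "number": 0.9}},
}

path = viterbi_blocks(blocks, states, start_p, trans_p, emit_p)
print(path)
```

With this toy data the first block is labeled `title` and the second `price`. Working in log space avoids numerical underflow on longer block sequences.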
Pages: 142 / +
Page count: 2