Web information extraction using generalized hidden Markov model

Cited by: 0
Authors
Zhong, Ping [1]
Chen, Jinlin [2]
Cook, Terry [1]
Affiliations
[1] CUNY, Grad Ctr, Dept Comp Sci, New York, NY 10021 USA
[2] CUNY, Grad Ctr, Queens Coll, Dept Comp Sci, New York, NY 10021 USA
Source
2006 1ST IEEE WORKSHOP ON HOT TOPICS IN WEB SYSTEMS AND TECHNOLOGIES | 2006
Keywords
hidden Markov model; information extraction; layout analysis; web;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The Hidden Markov Model (HMM) is an important approach to information extraction (IE). When applied to Web IE, HMM-based approaches have several problems because they do not take Web-specific features into account. In this paper we present a Generalized Hidden Markov Model (GHMM) that extends traditional HMMs by making use of Web-specific information for Web IE. In our approach we use the Web content block, rather than the term, as the basic extraction unit. In addition, instead of using the traditional sequential state transition order, we determine the state transition order of the GHMM from the layout structure of the corresponding Web page. Furthermore, we use multiple emission features instead of a single emission feature. In this way the GHMM can better accommodate Web IE. Experiments show promising results compared to traditional HMM-based Web IE.
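The abstract does not give the GHMM's equations, so the following is only a minimal sketch of the general idea, not the authors' method. It runs standard Viterbi decoding over a sequence of content blocks and scores each block with several emission features, combined under an assumed independence between features. The state names, the feature names (`font`, `position`), all probabilities, and the function `viterbi_blocks` are hypothetical, and the blocks are assumed to have already been put in a layout-derived order by some earlier step.

```python
import math


def viterbi_blocks(blocks, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state for each content block.

    blocks: list of dicts mapping feature name -> observed value,
            assumed to be already ordered by a layout analysis step.
    emit_p: emit_p[state][feature][value] -> probability.
    """
    def log(p):
        # Floor small or missing probabilities to avoid log(0).
        return math.log(max(p, floor))

    def emit_score(state, block):
        # Multiple emission features: sum their log-probabilities,
        # which assumes the features are independent given the state.
        return sum(log(emit_p[state][feat].get(val, 0.0))
                   for feat, val in block.items())

    best = [{s: log(start_p[s]) + emit_score(s, blocks[0]) for s in states}]
    back = [{}]
    for block in blocks[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log(trans_p[r][s]))
            scores[s] = best[-1][prev] + log(trans_p[prev][s]) + emit_score(s, block)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model: two states, two features per block.
states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {
    "title": {"title": 0.1, "author": 0.9},
    "author": {"title": 0.1, "author": 0.9},
}
emit_p = {
    "title": {"font": {"large": 0.9, "small": 0.1},
              "position": {"top": 0.8, "middle": 0.2}},
    "author": {"font": {"large": 0.2, "small": 0.8},
               "position": {"top": 0.3, "middle": 0.7}},
}
blocks = [
    {"font": "large", "position": "top"},
    {"font": "small", "position": "middle"},
    {"font": "small", "position": "middle"},
]
example_path = viterbi_blocks(blocks, states, start_p, trans_p, emit_p)
print(example_path)  # ['title', 'author', 'author']
```

Working in log space keeps the products of several per-feature probabilities from underflowing; a real system would also need to learn these probabilities from labeled pages.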
Pages: 142 / +
Page count: 2