Flexible Approach for Web Information Extraction Based on HTMLParser

Cited by: 0
Authors
Shan, Lin [1]
Qun, Zhang [1]
Affiliations
[1] Hubei Univ Technol, Sch Comp Sci, Wuhan, Peoples R China
Source
PROCEEDINGS OF 2012 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION, VOLS I-VI | 2012
Keywords
information extraction; Web crawler; HTMLParser; filter; visitor; custom tags
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
The Internet now presents users with a huge amount of information, so extracting information quickly and effectively from varied sources has become very important. Web information extraction is a key element not only of Web crawlers and search engines but also of many specialized services, such as competitive intelligence tools. This paper presents a flexible, high-performance approach to Web information extraction based on HTMLParser, a parsing library used mainly to transform or extract information from HTML Web pages. HTMLParser represents an HTML page with Node, AbstractNode, and Tag, and extracts information in two main ways: filters and visitors. With HTMLParser, hyperlinks, email addresses, titles, and similar elements can be extracted conveniently. The paper also extends HTMLParser to extract custom tags from certain Web pages, which broadens its range of application. Experimental results confirm the feasibility of the approach.
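The filter and visitor ideas mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and does not use the Java HTMLParser library the paper builds on. As an assumption made for self-containment, it uses Python's standard-library `html.parser` instead. The class name `ExtractingVisitor`, the helper `extract`, the sample page, and the custom `price` tag are all hypothetical.

```python
from html.parser import HTMLParser


class ExtractingVisitor(HTMLParser):
    """Visitor-style extractor: a callback fires for each tag or text node visited.

    Illustrative only; this is Python's html.parser, not the Java HTMLParser library.
    """

    def __init__(self, wanted_tags=("a", "title")):
        super().__init__()
        self.wanted_tags = set(wanted_tags)  # acts as a simple tag-name filter
        self.links = []
        self.emails = []
        self.title = ""
        self.custom = []  # (tag name, attribute dict) for any other wanted tag
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag not in self.wanted_tags:
            return  # filtered out
        attr_map = dict(attrs)
        if tag == "a":
            href = attr_map.get("href") or ""
            if href.startswith("mailto:"):
                self.emails.append(href[len("mailto:"):])
            elif href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True
        else:
            # Any wanted tag that is not built in is treated as a custom tag.
            self.custom.append((tag, attr_map))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical sample page containing a non-standard <price> tag.
SAMPLE = """<html><head><title>Demo page</title></head>
<body><a href="http://example.com/a">A</a>
<a href="mailto:info@example.com">Mail</a>
<price currency="USD">9.99</price></body></html>"""


def extract(html, wanted_tags=("a", "title", "price")):
    """Walk the page once and return the visitor holding the extracted items."""
    visitor = ExtractingVisitor(wanted_tags)
    visitor.feed(html)
    visitor.close()
    return visitor
```

Calling `extract(SAMPLE)` collects the hyperlink, the email address, the page title, and the attributes of the custom `price` tag in a single pass. Adding another tag name to `wanted_tags` is all it takes to pick up a further custom tag, which is the kind of flexibility the abstract describes.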
Pages: 683-686
Page count: 4
Related papers
50 in total
  • [1] FLEXIBLE WEB INFORMATION EXTRACTION WITH HTMLPARSER
    Shan, Lin
    3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND COMPUTER SCIENCE (ITCS 2011), PROCEEDINGS, 2011, : 295 - 298
  • [2] An approach of automatic web mail information extraction
    Li, Yingrun
    Shu, Hui
    2008 PROCEEDINGS OF INFORMATION TECHNOLOGY AND ENVIRONMENTAL SYSTEM SCIENCES: ITESS 2008, VOL 2, 2008, : 1113 - 1118
  • [3] Towards Flexible Mashup of Web Applications Based on Information Extraction and Transfer
    Guo, Junxia
    Han, Hao
    Tokuda, Takehiro
    WEB INFORMATION SYSTEM ENGINEERING-WISE 2010, 2010, 6488 : 602 - +
  • [4] A HTML to WML Translating Model Based on Information Extraction for Mobile Commerce
    Song, Mingqiu
    Yu, Bo
    2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 9166 - 9169
  • [5] A Method Research of Extracting Web Information Based on HTML 5 New Standard
    Liu, Qing-hua
    Feng, Li-yun
    INTERNATIONAL CONFERENCE ON ELECTRICAL, CONTROL AND AUTOMATION ENGINEERING (ECAE 2013), 2013, : 520 - 524
  • [6] A hybrid approach for web information extraction
    Xiao, Ji-Yi
    Zhu, Dao-Hui
    Zou, La-Mei
    PROCEEDINGS OF 2008 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2008, : 1560 - 1563
  • [7] Product-advisory on the web: An information extraction approach
    Schmidt, Sebastian
    Mandl, Stefan
    Ludwig, Bernd
    Stoyan, Herbert
    PROCEEDINGS OF THE IASTED INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND APPLICATIONS, 2007, : 633 - +
  • [8] Web-Based Information Extraction Technology
    孙铁利
    教巍巍
    刘淑华
    Journal of Donghua University (English Edition), 2007, (02) : 288 - 292
  • [9] An Improved Ontology-Based Web Information Extraction
    Zhang, Jing
    Ding, Wei Ze
    2015 INTERNATIONAL CONFERENCE OF EDUCATIONAL INNOVATION THROUGH TECHNOLOGY - EITT 2015, 2015, : 37 - 41
  • [10] Study of Extraction for Web Pages Information Based on XML
    Li, Suming
    PROCEEDINGS OF THE 2016 2ND WORKSHOP ON ADVANCED RESEARCH AND TECHNOLOGY IN INDUSTRY APPLICATIONS, 2016, 81 : 829 - 832