Web Entity Extraction Based on Entity Attribute Classification

Cited by: 0
Authors
Li, Chuan-Xi [1]
Chen, Peng [1]
Wang, Ru-Jing [1]
Su, Ya-Ru [1]
Affiliations
[1] Chinese Acad Sci, Inst Intelligent Machines, Hefei 230031, Peoples R China
Source
FOURTH INTERNATIONAL CONFERENCE ON MACHINE VISION (ICMV 2011): COMPUTER VISION AND IMAGE ANALYSIS: PATTERN RECOGNITION AND BASIC TECHNOLOGIES | 2012 / Vol. 8350
Keywords
VIPS; information extraction; entity extraction; text mining
DOI
10.1117/12.920237
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Large amounts of entity data are continuously published on web pages, and extracting these entities automatically for downstream applications is of considerable value. Rule-based entity extraction yields promising results, but it is labor-intensive and scales poorly. This paper proposes a web entity extraction method based on entity attribute classification, which avoids manual annotation of samples. First, web pages are segmented into blocks by the Vision-based Page Segmentation (VIPS) algorithm, and a binary classifier built with LibSVM is trained to retrieve the candidate blocks that contain entity content. Second, the candidate blocks are partitioned into candidate items, LibSVM classifiers annotate each item with an attribute, and the annotation results are aggregated into an entity. Results show that the proposed method performs well in extracting agricultural supply and demand entities from web pages.
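The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the page blocks are assumed to have already been produced by VIPS, a small bag-of-words perceptron stands in for the LibSVM classifiers, and the attribute names (product, price, contact), the label names, and all training and test strings are hypothetical.

```python
from collections import Counter, defaultdict


def features(text):
    """Bag-of-words counts over lowercase whitespace tokens."""
    return Counter(text.lower().split())


class Perceptron:
    """Multiclass perceptron; a simple stand-in for the LibSVM classifiers."""

    def __init__(self, epochs=10):
        self.epochs = epochs
        self.weights = defaultdict(Counter)  # label -> feature weights
        self.labels = []

    def score(self, feats, label):
        w = self.weights[label]
        return sum(w[f] * v for f, v in feats.items())

    def predict(self, text):
        feats = features(text)
        # Ties resolve to the alphabetically first label.
        return max(self.labels, key=lambda lab: self.score(feats, lab))

    def fit(self, samples):
        self.labels = sorted({lab for _, lab in samples})
        for _ in range(self.epochs):
            for text, gold in samples:
                guess = self.predict(text)
                if guess != gold:
                    # Standard perceptron update on a mistake.
                    for f, v in features(text).items():
                        self.weights[gold][f] += v
                        self.weights[guess][f] -= v
        return self


# Hypothetical training data: block-level labels and item-level attributes.
BLOCK_TRAIN = [
    ("supply wheat price: 2.0 yuan/kg contact: 555-0100", "entity"),
    ("demand corn price: 1.8 yuan/kg contact: 555-0101", "entity"),
    ("home about login register", "background"),
    ("copyright all rights reserved", "background"),
]
ITEM_TRAIN = [
    ("supply wheat", "product"),
    ("demand corn", "product"),
    ("price: 2.0 yuan/kg", "price"),
    ("price: 1.8 yuan/kg", "price"),
    ("contact: 555-0100", "contact"),
    ("contact: 555-0101", "contact"),
]


def extract_entities(blocks, block_clf, attr_clf):
    """Stage 1: keep candidate blocks. Stage 2: label items, aggregate."""
    entities = []
    for block in blocks:
        if block_clf.predict(block) != "entity":
            continue
        entity = {}
        for item in block.splitlines():  # one candidate item per line
            item = item.strip()
            if item:
                entity[attr_clf.predict(item)] = item
        entities.append(entity)
    return entities


block_clf = Perceptron().fit(BLOCK_TRAIN)
attr_clf = Perceptron().fit(ITEM_TRAIN)

# Blocks as a segmenter such as VIPS might emit them (assumed input).
page_blocks = [
    "home news login",
    "supply rice\nprice: 3.5 yuan/kg\ncontact: 555-0199",
]
entities = extract_entities(page_blocks, block_clf, attr_clf)
print(entities)
```

Splitting a block into items by line is a simplification chosen for the sketch; the abstract does not say how candidate blocks are partitioned.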
Pages: 6