Interactive Data Extraction from Semi-Structured Text

Cited by: 0
Authors
Broman, Per [1]
Thalheim, Bernhard [1]
Affiliation
[1] Univ Kiel, Inst Comp Sci, D-24098 Kiel, Germany
Source
INFORMATION MODELLING AND KNOWLEDGE BASES XXIII | 2012 / Vol. 237
Keywords
data extraction; semi-structured data; unstructured data; weighted finite-state automata; INFORMATION;
DOI
10.3233/978-1-60750-992-9-1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Designing a tool for data extraction from semi-structured and unstructured text, we are confronted with a problem that has largely been neglected by scholars so far: What if we need to find matches for several different patterns in a document and there are no keywords to support the search? And if so, what if the same section matches several different patterns or if matches in part overlap? How can we decide which one to pick? We suggest that this is an important problem in data extraction and propose a solution based on a token classification system and weighted finite-state automata.
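The abstract names the ingredients (token classification, weighted finite-state automata) but this record does not describe how they are combined. The following is only a minimal sketch of the general idea, not the authors' method: the token classes, the two example automata, their weights, and the rule "keep the non-overlapping set of matches with the highest total weight" are all assumptions made for illustration.

```python
import re
from typing import NamedTuple

def classify(token: str) -> str:
    """Map a token to a coarse class (hypothetical classification scheme)."""
    if token.isdigit():
        return "NUM"
    if token[0].isupper():
        return "CAP"
    if token.isalpha():
        return "WORD"
    return "SYM"  # anything else, mostly punctuation

class Match(NamedTuple):
    label: str
    start: int     # index of the first matched token
    end: int       # index one past the last matched token
    weight: float  # sum of the transition weights along the path

# Assumed example automata: trans[state][token_class] = (next_state, weight).
# State 0 is the start state; weights are invented for this sketch.
AUTOMATA = {
    "date": {
        "trans": {0: {"NUM": (1, 1.0)}, 1: {"SYM": (2, 0.5)},
                  2: {"NUM": (3, 1.0)}, 3: {"SYM": (4, 0.5)},
                  4: {"NUM": (5, 1.0)}},
        "final": {5},
    },
    "number": {"trans": {0: {"NUM": (1, 1.0)}}, "final": {1}},
}

def run(label, automaton, classes):
    """Try the automaton from every start position; record accepting paths."""
    found = []
    for start in range(len(classes)):
        state, weight = 0, 0.0
        for pos in range(start, len(classes)):
            step = automaton["trans"].get(state, {}).get(classes[pos])
            if step is None:
                break
            state, w = step
            weight += w
            if state in automaton["final"]:
                found.append(Match(label, start, pos + 1, weight))
    return found

def select(matches):
    """Pick non-overlapping matches with maximal total weight
    (weighted interval scheduling; an assumed conflict-resolution rule)."""
    ms = sorted(matches, key=lambda m: m.end)
    best = []  # best[i] = (total weight, chosen matches) using ms[:i + 1]
    for i, m in enumerate(ms):
        prev = (0.0, [])
        for j in range(i - 1, -1, -1):
            if ms[j].end <= m.start:  # latest match that ends before m starts
                prev = best[j]
                break
        take = (prev[0] + m.weight, prev[1] + [m])
        skip = best[i - 1] if i else (0.0, [])
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1] if best else []

def extract(text):
    tokens = re.findall(r"\w+|[^\w\s]", text)
    classes = [classify(t) for t in tokens]
    candidates = []
    for label, automaton in AUTOMATA.items():
        candidates.extend(run(label, automaton, classes))
    return select(candidates)

if __name__ == "__main__":
    for m in extract("Born 12.05.1970 in Kiel"):
        print(m)
```

In the sample text, the span "12.05.1970" matches the "date" automaton once and the "number" automaton three times, with no keyword to disambiguate. Under the assumed weights the single date match outweighs the three overlapping number matches, so it is the one retained.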
Pages: 1-19
Page count: 19
Related papers
21 in total
[1]   Adelberg B., 1998, SIGMOD Record, V27, P283, DOI 10.1145/276305.276330
[2]   Adelberg B, 1999, SIGMOD RECORD, VOL 28, NO 2 - JUNE 1999, P559, DOI 10.1145/304181.304576
[3]   COMBINING TEXT CLASSIFIERS AND HIDDEN MARKOV MODELS FOR INFORMATION EXTRACTION [J].
Barros, Flavia A. ;
Silva, Eduardo F. A. ;
Prudencio, Ricardo B. C. ;
Filho, Valmir M. ;
Nascimento, Andre C. A. .
INTERNATIONAL JOURNAL ON ARTIFICIAL INTELLIGENCE TOOLS, 2009, 18 (02) :311-329
[4]   Extracting information from heterogeneous information sources using ontologically specified target views [J].
Biskup, J ;
Embley, DW .
INFORMATION SYSTEMS, 2003, 28 (03) :169-212
[5]   Chang CH, 2006, IEEE T KNOWL DATA EN, V18, P1411, DOI 10.1109/TKDE.2006.152
[6]   Automated extraction of data from text using an XML parser: An earth science example using fossil descriptions [J].
Curry, Gordon B. ;
Connor, Richard C. H. .
GEOSPHERE, 2008, 4 (01) :159-169
[7]   Automating the extraction of data from HTML tables with unknown structure [J].
Embley, DW ;
Tao, C ;
Liddle, SW .
DATA & KNOWLEDGE ENGINEERING, 2005, 54 (01) :3-28
[8]   Conceptual-model-based data extraction from multiple-record Web pages [J].
Embley, DW ;
Campbell, DM ;
Jiang, YS ;
Liddle, SW ;
Lonsdale, DW ;
Ng, YK ;
Smith, RD .
DATA & KNOWLEDGE ENGINEERING, 1999, 31 (03) :227-251
[9]   Generating finite-state transducers for semi-structured data extraction from the Web [J].
Hsu, CN ;
Dung, MT .
INFORMATION SYSTEMS, 1998, 23 (08) :521-538
[10]   RAD: A scalable framework for annotator development [J].
Khaitan, Sanjeet ;
Ramakrishnan, Ganesh ;
Joshi, Sachindra ;
Chalamalla, Anup .
2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, :1624-+