Inference of Regular Expressions for Text Extraction from Examples

被引:72
作者
Bartoli, Alberto [1 ]
De Lorenzo, Andrea [1 ]
Medvet, Eric [1 ]
Tarlao, Fabiano [1 ]
机构
[1] Univ Trieste, Dept Engn & Architecture DIA, I-34127 Trieste, Italy
关键词
Genetic programming; information extraction; programming by examples; multiobjective optimization; heuristic search; INDUCTION ALGORITHMS; COMPLEXITY-MEASURES; PERFORMANCE;
D O I
10.1109/TKDE.2016.2515587
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work, we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators.
引用
收藏
页码:1217 / 1230
页数:14
相关论文
共 46 条
[1]  
[Anonymous], 2001, IJCAI
[2]  
[Anonymous], 1992, GENETIC PROGRAMMING
[3]  
Babbar R., 2010, Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, AND '10, P43, DOI DOI 10.1145/1871840.1871848
[4]   Adapting Searchy to extract data using evolved wrappers [J].
Barrero, David F. ;
R-Moreno, Maria D. ;
Camacho, David .
EXPERT SYSTEMS WITH APPLICATIONS, 2012, 39 (03) :3061-3070
[5]   A Hyper-Heuristic Evolutionary Algorithm for Automatically Designing Decision-Tree Algorithms [J].
Barros, Rodrigo C. ;
Basgalupp, Marcio P. ;
de Carvalho, Andre C. P. L. F. ;
Freitas, Alex A. .
PROCEEDINGS OF THE FOURTEENTH INTERNATIONAL CONFERENCE ON GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE, 2012, :1237-1244
[6]  
Bartoli A., 2015, INFERENCE REGULAR EX
[7]   Playing Regex Golf with Genetic Programming [J].
Bartoli, Alberto ;
De Lorenzo, Andrea ;
Medvet, Eric ;
Tarlao, Fabiano .
GECCO'14: PROCEEDINGS OF THE 2014 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE, 2014, :1063-1069
[8]   Learning Text Patterns Using Separate-and-Conquer Genetic Programming [J].
Bartoli, Alberto ;
De Lorenzo, Andrea ;
Medvet, Eric ;
Tarlao, Fabiano .
GENETIC PROGRAMMING (EUROGP 2015), 2015, 9025 :16-27
[9]   Automatic Synthesis of Regular Expressions from Examples [J].
Bartoli, Alberto ;
Davanzo, Giorgio ;
De Lorenzo, Andrea ;
Medvet, Eric ;
Sorio, Enrico .
COMPUTER, 2014, 47 (12) :72-80
[10]  
Bartoli A, 2012, PROCEEDINGS OF THE FOURTEENTH INTERNATIONAL CONFERENCE ON GENETIC AND EVOLUTIONARY COMPUTATION COMPANION (GECCO'12), P1477