Inference of Regular Expressions for Text Extraction from Examples

被引:72
作者
Bartoli, Alberto [1 ]
De Lorenzo, Andrea [1 ]
Medvet, Eric [1 ]
Tarlao, Fabiano [1 ]
机构
[1] Univ Trieste, Dept Engn & Architecture DIA, I-34127 Trieste, Italy
关键词
Genetic programming; information extraction; programming by examples; multiobjective optimization; heuristic search; INDUCTION ALGORITHMS; COMPLEXITY-MEASURES; PERFORMANCE;
D O I
10.1109/TKDE.2016.2515587
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
A large class of entity extraction tasks from text that is either semistructured or fully unstructured may be addressed by regular expressions, because in many practical cases the relevant entities follow an underlying syntactical pattern and this pattern may be described by a regular expression. In this work, we consider the long-standing problem of synthesizing such expressions automatically, based solely on examples of the desired behavior. We present the design and implementation of a system capable of addressing extraction tasks of realistic complexity. Our system is based on an evolutionary procedure carefully tailored to the specific needs of regular expression generation by examples. The procedure executes a search driven by a multiobjective optimization strategy aimed at simultaneously improving multiple performance indexes of candidate solutions while at the same time ensuring an adequate exploration of the huge solution space. We assess our proposal experimentally in great depth, on a number of challenging datasets. The accuracy of the obtained solutions seems to be adequate for practical usage and improves over earlier proposals significantly. Most importantly, our results are highly competitive even with respect to human operators.
引用
收藏
页码:1217 / 1230
页数:14
相关论文
共 46 条
[11]  
Bongard J, 2005, J MACH LEARN RES, V6, P1651
[12]  
Brauer F., 2011, CIKM, V11, P1285, DOI 10.1145/2063576.2063763
[13]  
Brazma A., 1993, Proceeding of the Sixth Annual ACM Conference on Computational Learning Theory, P236, DOI 10.1145/168304.168340
[14]   Analysis of data complexity measures for classification [J].
Cano, Jose-Ramon .
EXPERT SYSTEMS WITH APPLICATIONS, 2013, 40 (12) :4820-4831
[15]  
Cetinkaya A., 2007, Proceedings of the 9th Annual Conference Companion on Genetic and Evolutionary Computation. GECCO'07, P2643
[16]  
Cicchello O., 2003, Journal of Machine Learning Research, V4, P603
[17]  
Cochran RA, 2015, ACM SIGPLAN NOTICES, V50, P677, DOI [10.1145/2676726.2676973, 10.1145/2775051.2676973]
[18]  
Cronen-Townsend S., 2002, Proceedings of SIGIR 2002. Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, P299
[19]   Multi-objective methods for tree size control [J].
Edwin D. de Jong ;
Jordan B. Pollack .
Genetic Programming and Evolvable Machines, 2003, 4 (3) :211-233
[20]   Learning regular languages from simple positive examples [J].
Denis, F .
MACHINE LEARNING, 2001, 44 (1-2) :37-66