Inference of Concise Regular Expressions and DTDs

被引:63
作者
Bex, Geert Jan [1 ,2 ]
Neven, Frank [1 ,2 ]
Schwentick, Thomas [3 ]
Vansummeren, Stijn [4 ]
机构
[1] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium
[2] Transnatl Univ Limburg, B-3590 Diepenbeek, Belgium
[3] TU Dortmund, Fak Informat, D-44227 Dortmund, Germany
[4] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium
来源
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2010年 / 35卷 / 02期
关键词
Algorithms; Languages; Theory; Regular expressions; schema inference; XML;
D O I
10.1145/1735886.1735890
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions-the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)-that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however ( for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs ( which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
引用
收藏
页数:47
相关论文
共 62 条
[1]  
Abiteboul S., 1999, DATA WEB
[2]  
Ahonen H., 1996, Report A- 1996-4.
[3]  
ANGLUIN DANA., 1983, ACM COMPUT SURV, V15, P237
[4]  
[Anonymous], XHTML 1 0 EXT HYPERT
[5]  
[Anonymous], DAGSTUHL SEMINAR P
[6]  
Barbosa D., 2002, P 5 INT WORKSH WEB D, P49
[7]   Studying the XML Web: Gathering statistics from an XML sample (vol 8, pg 413, 2006) [J].
Barbosa, Denilson ;
Mignet, Laurent ;
Veltri, Pierangelo .
WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2006, 9 (02) :187-212
[8]   XPath satisfiability in the presence of DTDs [J].
Benedikt, Michael ;
Fan, Wenfei ;
Geerts, Floris .
JOURNAL OF THE ACM, 2008, 55 (02)
[9]   PLoS Biology -: We're open [J].
Bernstein, P ;
Cohen, B ;
MacCallum, C ;
Parthasarathy, H ;
Patterson, M ;
Siegel, V .
PLOS BIOLOGY, 2003, 1 (01) :3-3
[10]  
Bex G. J., 2007, VLDB 2007 VLDB 2007, P998