Parametric schema inference for massive JSON datasets

Cited by: 36
Authors
Baazizi, Mohamed-Amine [1]
Colazzo, Dario [2]
Ghelli, Giorgio [3]
Sartiani, Carlo [4]
Affiliations
[1] Sorbonne Univ, CNRS, Lab Informat Paris 6, F-75005 Paris, France
[2] PSL Res Univ, LAMSADE, CNRS, Univ Paris Dauphine, F-75016 Paris, France
[3] Univ Pisa, Dipartimento Informat, Pisa, Italy
[4] Univ Basilicata, DIMIE, Potenza, Italy
Keywords
JSON; Schema inference; Map-reduce; Spark; Big data collections
DOI
10.1007/s00778-018-0532-7
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this brings several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
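The idea of a parametric, parallelizable inference can be illustrated with a small sketch. It is not the authors' algorithm or type language, which the abstract does not spell out. It assumes a toy type representation (a union is a frozenset of alternatives), a map step `infer`, an associative reduce step `fuse`, and a hypothetical parameter `equiv` whose values `"kind"` and `"label"` are chosen here only for illustration: `"kind"` merges all record types into one (more concise), while `"label"` merges only records with identical label sets (more precise).

```python
from functools import reduce

def infer(value, equiv):
    """Map step: type of one JSON value, as a frozenset of alternatives."""
    if value is None:
        return frozenset({"null"})
    if isinstance(value, bool):
        return frozenset({"bool"})
    if isinstance(value, (int, float)):
        return frozenset({"num"})
    if isinstance(value, str):
        return frozenset({"str"})
    if isinstance(value, list):
        elems = (infer(v, equiv) for v in value)
        return frozenset({("arr", reduce(lambda a, b: fuse(a, b, equiv), elems, frozenset()))})
    # Object: fields are (label, type, optional) triples sorted by label.
    fields = sorted(((k, infer(v, equiv), False) for k, v in value.items()), key=lambda f: f[0])
    return frozenset({("rec", tuple(fields))})

def merge_records(r1, r2, equiv):
    """Merge two record types; a label missing on one side becomes optional."""
    f1 = {k: (t, opt) for k, t, opt in r1[1]}
    f2 = {k: (t, opt) for k, t, opt in r2[1]}
    fields = []
    for k in sorted(f1.keys() | f2.keys()):
        if k in f1 and k in f2:
            fields.append((k, fuse(f1[k][0], f2[k][0], equiv), f1[k][1] or f2[k][1]))
        else:
            t, _ = f1[k] if k in f1 else f2[k]
            fields.append((k, t, True))
    return ("rec", tuple(fields))

def fuse(t1, t2, equiv):
    """Reduce step: combine two types, merging alternatives deemed equivalent."""
    alts = t1 | t2
    out = {a for a in alts if isinstance(a, str)}  # atomic types
    arrays = [a for a in alts if not isinstance(a, str) and a[0] == "arr"]
    records = [a for a in alts if not isinstance(a, str) and a[0] == "rec"]
    if arrays:  # all array types collapse into one, with fused element types
        out.add(("arr", reduce(lambda a, b: fuse(a, b, equiv), (a[1] for a in arrays))))
    groups = {}
    for r in records:
        # Assumed parameter: "kind" groups all records, "label" groups by label set.
        key = "rec" if equiv == "kind" else frozenset(k for k, _, _ in r[1])
        groups.setdefault(key, []).append(r)
    for group in groups.values():
        out.add(reduce(lambda a, b: merge_records(a, b, equiv), group))
    return frozenset(out)

def show(t):
    """Render a type as text; '+' is union and '?' marks an optional field."""
    def one(a):
        if isinstance(a, str):
            return a
        if a[0] == "arr":
            return "[" + show(a[1]) + "]"
        return "{" + ", ".join(f"{k}{'?' if o else ''}: {show(ft)}" for k, ft, o in a[1]) + "}"
    return " + ".join(sorted(one(a) for a in t)) or "empty"

def infer_schema(docs, equiv):
    return reduce(lambda a, b: fuse(a, b, equiv), (infer(d, equiv) for d in docs))

docs = [{"a": 1, "b": "x"}, {"a": None}, {"a": 2, "b": "y"}]
print(show(infer_schema(docs, "kind")))   # {a: null + num, b?: str}
print(show(infer_schema(docs, "label")))  # {a: null} + {a: num, b: str}
```

Because `fuse` is written to be insensitive to the order in which types are combined, the same two functions could in principle be applied per partition and then across partitions in a map-reduce engine; the sketch runs sequentially for simplicity.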
Pages: 497-521
Page count: 25