Parametric schema inference for massive JSON datasets

Cited by: 1
Authors
Mohamed-Amine Baazizi
Dario Colazzo
Giorgio Ghelli
Carlo Sartiani
Affiliations
[1] Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6
[2] PSL Research University, CNRS, LAMSADE
[3] Università di Pisa, Université Paris Dauphine
[4] Università della Basilicata, Dipartimento di Informatica
Source
The VLDB Journal | 2019 / Vol. 28
Keywords
JSON; Schema inference; Map-reduce; Spark; Big data collections
DOI
Not available
Abstract
In recent years, JSON has established itself as a very popular data format for representing massive data collections. JSON data collections are usually schemaless. While this offers several advantages, the absence of schema information also has important negative consequences: data analysts and programmers cannot rely on a schema for a dependable description of the structure of the dataset, the correctness of complex queries and programs cannot be statically checked, and many schema-based optimizations are not possible. In this paper, we deal with the problem of inferring a schema from massive JSON datasets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our contributions, which are the design of a parametric and parallelizable schema inference algorithm, its theoretical study, and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Our algorithm is parametric in that the analyst can specify a parameter determining the level of precision and conciseness of the inferred schema. Finally, we report on an experimental analysis showing the effectiveness of our approach in terms of execution time, conciseness of inferred schemas, and scalability.
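The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea it describes (infer a type per value, then fuse types pairwise under a user-chosen precision parameter), not the authors' implementation. The names `infer`, `fuse`, `fuse_fields`, `infer_schema`, `show`, the type encoding, and the two parameter values `"kind"` and `"label"` are assumptions made for illustration. Because the fusion here is a plain pairwise reduction, the same two steps could plausibly be run on Spark with `RDD.map` and `RDD.reduce`; the sketch uses `functools.reduce` to stay self-contained.

```python
from functools import reduce

# A type is a frozenset of alternatives (a union). An alternative is either
# an atom name ("null", "bool", "num", "str"), ("array", element_type), or
# ("record", ((label, field_type, optional), ...)).

def infer(value, equiv="kind"):
    """Map step: infer the type of one JSON value."""
    if value is None:
        return frozenset({"null"})
    if isinstance(value, bool):          # must precede the int check
        return frozenset({"bool"})
    if isinstance(value, (int, float)):
        return frozenset({"num"})
    if isinstance(value, str):
        return frozenset({"str"})
    if isinstance(value, list):
        elems = (infer(v, equiv) for v in value)
        elem = reduce(lambda a, b: fuse(a, b, equiv), elems, frozenset())
        return frozenset({("array", elem)})
    fields = tuple((k, infer(v, equiv), False) for k, v in sorted(value.items()))
    return frozenset({("record", fields)})

def fuse_fields(f1, f2, equiv):
    """Merge two records; a label missing on one side becomes optional."""
    out = {}
    for k in f1.keys() | f2.keys():
        if k in f1 and k in f2:
            out[k] = (fuse(f1[k][0], f2[k][0], equiv), f1[k][1] or f2[k][1])
        else:
            out[k] = ((f1.get(k) or f2.get(k))[0], True)
    return out

def fuse(t1, t2, equiv="kind"):
    """Reduce step: union two types, merging 'equivalent' alternatives.
    equiv="kind"  (assumed name): all records collapse into one (concise).
    equiv="label" (assumed name): only records with equal label sets merge."""
    atoms, array, records = set(), None, {}
    for alt in t1 | t2:
        if isinstance(alt, str):
            atoms.add(alt)
        elif alt[0] == "array":
            array = alt[1] if array is None else fuse(array, alt[1], equiv)
        else:
            fields = {k: (t, o) for k, t, o in alt[1]}
            key = "any" if equiv == "kind" else frozenset(fields)
            records[key] = (fuse_fields(records[key], fields, equiv)
                            if key in records else fields)
    if array is not None:
        atoms.add(("array", array))
    for fields in records.values():
        atoms.add(("record", tuple((k, t, o) for k, (t, o) in sorted(fields.items()))))
    return frozenset(atoms)

def infer_schema(docs, equiv="kind"):
    return reduce(lambda a, b: fuse(a, b, equiv), (infer(d, equiv) for d in docs))

def show(t):
    """Render a type as text; '+' is union, '?' marks an optional field."""
    def alt(a):
        if isinstance(a, str):
            return a
        if a[0] == "array":
            return "[" + show(a[1]) + "]"
        return "{" + ", ".join(f"{k}{'?' if o else ''}: {show(ft)}" for k, ft, o in a[1]) + "}"
    return " + ".join(sorted(alt(a) for a in t))

docs = [{"a": 1}, {"a": "x"}, {"a": 2, "b": True}]
print(show(infer_schema(docs, "kind")))   # {a: num + str, b?: bool}
print(show(infer_schema(docs, "label")))  # {a: num + str} + {a: num, b: bool}
```

In this sketch the parameter trades conciseness for precision: `"kind"` yields a single record type with optional fields, while `"label"` keeps one record type per distinct set of labels.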
Pages: 497 - 521
Page count: 24
Related papers
50 in total
  • [11] Reducing Ambiguity in Json Schema Discovery
    Spoth, William
    Kennedy, Oliver
    Lu, Ying
    Hammerschmidt, Beda
    Liu, Zhen Hua
    SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021, : 1732 - 1744
  • [12] An Approach for Schema Extraction of JSON and Extended JSON Document Collections
    Frozza, Angelo Augusto
    Mello, Ronaldo dos Santos
    da Costa, Felipe de Souza
    2018 IEEE INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION (IRI), 2018, : 356 - 363
  • [13] JSONDISCOVERER: Visualizing the schema lurking behind JSON documents
    Canovas Izquierdo, Javier Luis
    Cabot, Jordi
    KNOWLEDGE-BASED SYSTEMS, 2016, 103 : 52 - 55
  • [14] Definition of REST web services with JSON schema
    Barbaglia, Guido
    Murzilli, Simone
    Cudini, Stefano
    SOFTWARE-PRACTICE & EXPERIENCE, 2017, 47 (06): : 907 - 920
  • [15] JSON document clustering based on schema embeddings
    Priya, D. Uma
    Thilagam, P. Santhi
    JOURNAL OF INFORMATION SCIENCE, 2024, 50 (05) : 1112 - 1130
  • [16] Research on the Translation from XSD to JSON Schema
    Guo, Shijiao
    Xia, Hongxia
    Xiang, Guangli
    2017 IEEE 9TH INTERNATIONAL CONFERENCE ON COMMUNICATION SOFTWARE AND NETWORKS (ICCSN), 2017, : 1393 - 1396
  • [17] Validation of Modern JSON Schema: Formalization and Complexity
    Attouche, Lyes
    Baazizi, Mohamed-Amine
    Colazzo, Dario
    Ghelli, Giorgio
    Sartiani, Carlo
    Scherzinger, Stefanie
    PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES-PACMPL, 2024, 8 (POPL):
  • [18] JsonCurer: Data Quality Management for JSON Based on an Aggregated Schema
    Xiong, Kai
    Xu, Xinyi
    Fu, Siwei
    Weng, Di
    Wang, Yongheng
    Wu, Yingcai
    IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2024, 30 (06) : 3008 - 3021
  • [20] Statistical inference in massive datasets by empirical likelihood
    Ma, Xuejun
    Wang, Shaochen
    Zhou, Wang
    COMPUTATIONAL STATISTICS, 2022, 37 (03) : 1143 - 1164