An Approach to Extracting Topic-guided Views from the Sources of a Data Lake

Cited by: 12
Authors
Diamantini, Claudia [1]
Lo Giudice, Paolo [2]
Potena, Domenico [1]
Storti, Emanuele [1]
Ursino, Domenico [1]
Affiliations
[1] Polytech Univ Marche, DII, Ancona, Italy
[2] Univ Mediterranea Reggio Calabria, DIIES, Reggio Di Calabria, Italy
Keywords
Data lakes; Unstructured data sources; Metadata management; Thematic views; Semantic similarities; DBpedia; LINKED DATA; INFORMATION; INTEGRATION; QUERIES; CONSTRUCTION; SYSTEM; DIKE
DOI
10.1007/s10796-020-10010-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In recent years, data lakes have emerged as an effective and efficient means of extracting information and knowledge from huge numbers of highly heterogeneous and rapidly changing data sources. Data lake management requires new techniques that differ substantially from those adopted for data warehouses in the past. In this scenario, one of the most challenging issues is the extraction of topic-guided (i.e., thematic) views from the very heterogeneous and often unstructured sources of a data lake. In this paper, we propose a new network-based model that uniformly represents the structured, semi-structured and unstructured sources of a data lake. We then present a new approach to "structuring" unstructured data, at least partially. Finally, we define a technique for extracting topic-guided views from the sources of a data lake, based on similarity and other semantic relationships among source metadata.
Pages: 243-262
Page count: 20