Content-based Union and Complement Metrics for Dataset Search over RDF Knowledge Graphs

Cited by: 15
Authors
Mountantonakis, Michalis [1, 2]
Tzitzikas, Yannis [1, 2]
Affiliations
[1] FORTH ICS, Inst Comp Sci, N Plastira 100, GR-70013 Iraklion, Greece
[2] Univ Crete, Comp Sci Dept, N Plastira 100, GR-70013 Iraklion, Greece
Source
ACM JOURNAL OF DATA AND INFORMATION QUALITY | 2020 / Vol. 12 / No. 02
Keywords
Dataset search; dataset quality; interlinking; enrichment; reusability; discoverability; linked data; contextual connectivity; relevancy; lattice of measurements; data integration; vocabularies; web
DOI
10.1145/3372750
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
RDF Knowledge Graphs (or Datasets) contain valuable information that can be exploited for a variety of real-world tasks. However, due to the enormous size of the available RDF datasets, it is difficult to discover the most valuable datasets for a given task. To improve dataset Discoverability, Interlinking, and Reusability, there is a trend toward Dataset Search systems. Such systems are mainly based on metadata and ignore the contents; however, tasks related to data integration and enrichment require the contents of datasets to be considered. For instance, dataset owners often want to enrich the content of their dataset by selecting datasets that provide complementary information. These tasks require content-based union and complement metrics between any subset of datasets; however, such approaches are lacking. To make the computation of such metrics feasible at very large scale, we propose an approach relying on (a) a set of pre-constructed (and periodically refreshed) semantics-aware indexes, and (b) "lattice-based" incremental algorithms that exploit the posting lists of such indexes, as well as set theory properties, to enable efficient responses at query time. Finally, we discuss the efficiency of the proposed methods by presenting comparative results, and we report measurements for 400 real RDF datasets (containing over 2 billion triples) obtained by exploiting the proposed metrics.
Pages: 31
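
The abstract describes union and complement metrics computed from the posting lists of an index. The following is a minimal sketch of that general idea, not the authors' algorithm: it assumes each dataset is a plain set of elements (for example entity URIs), builds an inverted index from element to the datasets containing it, and counts elements with a single scan. The function names `build_index`, `union_size`, and `complement_size` and the toy datasets are hypothetical, and the lattice-based incremental computation mentioned in the abstract is not implemented here.

```python
from collections import defaultdict


def build_index(datasets):
    """Map each element to the set of dataset ids that contain it (its posting list)."""
    index = defaultdict(set)
    for ds_id, elements in datasets.items():
        for element in elements:
            index[element].add(ds_id)
    return dict(index)


def union_size(index, subset):
    """Count elements that occur in at least one dataset of `subset`."""
    subset = set(subset)
    return sum(1 for posting in index.values() if posting & subset)


def complement_size(index, target, others):
    """Count elements found in some dataset of `others` but absent from `target`."""
    others = set(others)
    return sum(
        1
        for posting in index.values()
        if target not in posting and posting & others
    )


# Toy data (assumed for illustration only).
datasets = {
    "D1": {"a", "b", "c"},
    "D2": {"b", "c", "d"},
    "D3": {"d", "e"},
}
index = build_index(datasets)

print(union_size(index, ["D1", "D2"]))             # 4: a, b, c, d
print(complement_size(index, "D1", ["D2", "D3"]))  # 2: d, e
```

Scanning posting lists once per query avoids materializing the union of the datasets themselves; an approach for many subsets at once would presumably reuse partial counts across subsets, which this sketch does not attempt.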