Non-parametric Class Completeness Estimators for Collaborative Knowledge Graphs: The Case of Wikidata

Cited by: 15
Authors
Luggen, Michael [1]
Difallah, Djellel [2]
Sarasua, Cristina [3]
Demartini, Gianluca [4]
Cudre-Mauroux, Philippe [1]
Affiliations
[1] Univ Fribourg, Fribourg, Switzerland
[2] NYU, New York, NY USA
[3] Univ Zurich, Zurich, Switzerland
[4] Univ Queensland, Brisbane, Qld, Australia
Source
SEMANTIC WEB - ISWC 2019, PT I | 2019 / Vol. 11778
Funding
European Research Council; Australian Research Council;
Keywords
Knowledge Graph; Class completeness; Class cardinality; Estimators; Edit history; SPECIES RICHNESS; NUMBER;
DOI
10.1007/978-3-030-30793-6_26
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Collaborative Knowledge Graph platforms allow humans and automated scripts to collaborate in creating, updating and inter-linking entities and facts. To ensure both the completeness of the data and a uniform coverage of the different topics, it is crucial to identify underrepresented classes in the Knowledge Graph. In this paper, we tackle this problem by developing statistical techniques for class cardinality estimation in collaborative Knowledge Graph platforms. Our method is able to estimate the completeness of a class, as defined by a schema or ontology, and hence can be used to answer questions such as "Does the knowledge base have a complete list of all {Beer Brands | Volcanos | Video Game Consoles}?" As a use-case, we focus on Wikidata, which poses unique challenges in terms of the size of its ontology, the number of users actively populating its graph, and its extremely dynamic nature. Our techniques are derived from species estimation and data-management methodologies, and are applied to the case of graphs and collaborative editing. In our empirical evaluation, we observe that (i) the number and frequency of unique class instances drastically influence the performance of an estimator, (ii) bursts of inserts cause some estimators to overestimate the true size of the class if they are not properly handled, and (iii) one can effectively measure the convergence of a class towards its true size by considering the stability of an estimator against the number of available instances.
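This record does not reproduce the paper's estimators, so the following is only an illustrative sketch of the species-estimation idea the abstract refers to. It uses the classical bias-corrected Chao1 lower bound, which is not confirmed here as one of the paper's methods. The input format (a flat list of instance identifiers, one per observed mention) and the names `chao1` and `completeness` are assumptions made for this example.

```python
from collections import Counter


def chao1(observations):
    """Bias-corrected Chao1 lower bound on the number of distinct items.

    `observations` is a hypothetical list of instance identifiers,
    one entry per time an instance of the class was observed.
    """
    counts = Counter(observations)       # identifier -> times observed
    freq = Counter(counts.values())      # k -> number of identifiers seen exactly k times
    s_obs = len(counts)                  # distinct instances seen so far
    f1 = freq.get(1, 0)                  # singletons
    f2 = freq.get(2, 0)                  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def completeness(observations):
    """Ratio of observed distinct instances to the estimated class size."""
    estimate = chao1(observations)
    return len(set(observations)) / estimate if estimate else 1.0
```

Many singletons relative to doubletons push the estimate above the observed count, which is read as a sign that the class is still incomplete; when every instance has been observed at least twice, the estimate equals the observed count.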
Pages: 453 - 469
Page count: 17
Related papers
24 in total
  • [1] [Anonymous], 2018, COMP WEB C 2018 WEB
  • [2] ESTIMATING THE NUMBER OF SPECIES - A REVIEW
    BUNGE, J
    FITZPATRICK, M
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1993, 88 (421) : 364 - 373
  • [3] ROBUST ESTIMATION OF POPULATION-SIZE WHEN CAPTURE PROBABILITIES VARY AMONG ANIMALS
    BURNHAM, KP
    OVERTON, WS
    [J]. ECOLOGY, 1979, 60 (05) : 927 - 936
  • [4] ESTIMATING THE NUMBER OF CLASSES VIA SAMPLE COVERAGE
    CHAO, A
    LEE, SM
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 1992, 87 (417) : 210 - 217
  • [5] An Improved Nonparametric Lower Bound of Species Richness via a Modified Good-Turing Frequency Formula
    Chiu, Chun-Huo
    Wang, Yi-Ting
    Walther, Bruno A.
    Chao, Anne
    [J]. BIOMETRICS, 2014, 70 (03) : 671 - 682
  • [6] Completeness Management for RDF Data Sources
    Darari, Fariz
    Nutt, Werner
    Pirro, Giuseppe
    Razniewski, Simon
    [J]. ACM TRANSACTIONS ON THE WEB, 2018, 12 (03)
  • [7] Demographics and Dynamics of Mechanical Turk Workers
    Difallah, Djellel
    Filatova, Elena
    Ipeirotis, Panos
    [J]. WSDM'18: PROCEEDINGS OF THE ELEVENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2018, : 135 - 143
  • [8] Erxleben F, 2014, LECT NOTES COMPUT SC, V8796, P50, DOI 10.1007/978-3-319-11964-9_4
  • [9] Predicting Completeness in Knowledge Bases
    Galarraga, Luis
    Razniewski, Simon
    Amarilli, Antoine
    Suchanek, Fabian M.
    [J]. WSDM'17: PROCEEDINGS OF THE TENTH ACM INTERNATIONAL CONFERENCE ON WEB SEARCH AND DATA MINING, 2017, : 375 - 383
  • [10] THE POPULATION FREQUENCIES OF SPECIES AND THE ESTIMATION OF POPULATION PARAMETERS
    GOOD, IJ
    [J]. BIOMETRIKA, 1953, 40 (3-4) : 237 - 264