Enriching integrated statistical open city data by combining equational knowledge and missing value imputation

Cited by: 12
Authors
Bischof, Stefan [1 ]
Harth, Andreas [2 ]
Kaempgen, Benedikt [3 ]
Polleres, Axel [4 ,5 ]
Schneider, Patrik [1 ]
Affiliations
[1] Siemens AG Osterreich, Siemensstr 90, A-1210 Vienna, Austria
[2] Karlsruhe Inst Technol, Karlsruhe, Germany
[3] FZI Res Ctr Informat Technol, Karlsruhe, Germany
[4] Vienna Univ Econ & Business, Vienna, Austria
[5] Complex Sci Hub Vienna, Vienna, Austria
Source
JOURNAL OF WEB SEMANTICS | 2018, Vol. 48
Keywords
Open data; Linked Data; Data cleaning; Data integration; WEB; ONTOLOGIES;
DOI
10.1016/j.websem.2017.09.003
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Several institutions collect statistical data about cities, regions, and countries for various purposes. Yet, while access to recent, high-quality data of this kind is both crucial for decision makers and a means for achieving transparency to the public, all too often such collections of data remain isolated and not reusable, let alone comparable or properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and to re-publish the resulting dataset in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular and extensible, always up-to-date fashion; (ii) we use both Machine Learning techniques and reasoning over equational background knowledge to enrich the data by imputing missing values; (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data, including links to external data sources such as DBpedia, available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV. Apart from providing a contribution to the growing collection of data available as Linked Data, our enrichment process for missing values also contributes a novel methodology for combining rule-based inference about equational knowledge with inferences obtained from statistical Machine Learning approaches. While most existing works about inference in Linked Data have focused on ontological reasoning in RDFS and OWL, we believe that these complementary methods, and particularly their combination, could also be fruitfully applied in many other domains for integrating Statistical Linked Data, independent of our concrete use case of integrating city data. (C) 2017 Elsevier B.V. All rights reserved.
Pages: 22-47
Page count: 26
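Illustration
The abstract's central idea, combining rule-based inference over equational knowledge with statistical imputation of missing values, can be sketched in a few lines. This is a toy illustration and not the Open City Data Pipeline itself: the indicator names, the equation density = population / area_km2, the sample cities, and the median fallback (standing in for the Machine Learning models the abstract mentions) are all assumptions made for the example.

```python
from statistics import median

# Hypothetical toy data; None marks a missing indicator value.
cities = {
    "A": {"population": 1_800_000, "area_km2": 400.0, "density": None},
    "B": {"population": None, "area_km2": 300.0, "density": 3000.0},
    "C": {"population": 500_000, "area_km2": 250.0, "density": 2000.0},
    "D": {"population": 900_000, "area_km2": None, "density": None},
}


def apply_equation(row):
    """Solve density = population / area_km2 when exactly one term is missing."""
    p, a, d = row["population"], row["area_km2"], row["density"]
    filled = dict(row)  # copy so the input row is left untouched
    if d is None and p is not None and a:
        filled["density"] = p / a
    elif p is None and a is not None and d is not None:
        filled["population"] = a * d
    elif a is None and p is not None and d:
        filled["area_km2"] = p / d
    return filled


def impute_median(rows, indicator):
    """Statistical fallback (assumed): fill remaining gaps with the observed median."""
    observed = [r[indicator] for r in rows.values() if r[indicator] is not None]
    fallback = median(observed)
    for r in rows.values():
        if r[indicator] is None:
            r[indicator] = fallback


def enrich(rows):
    # Step 1: derive exact values from the equation wherever possible.
    result = {name: apply_equation(row) for name, row in rows.items()}
    # Step 2: estimate what the equation could not derive.
    impute_median(result, "density")
    # Step 3: re-apply the equation, since an estimate may unlock another value.
    return {name: apply_equation(row) for name, row in result.items()}


enriched = enrich(cities)
```

In this sketch, cities A and B are completed by the equation alone, while city D first receives an estimated density and only then has its area derived from that estimate. Values derived from estimates inherit their uncertainty, which is one reason to track how each value was obtained and how accurate the estimates are per indicator.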