Dataset Reuse: Toward Translating Principles to Practice

Cited by: 18
Authors
Koesten, Laura [1]
Vougiouklis, Pavlos [2]
Simperl, Elena [1]
Groth, Paul [3]
Affiliations
[1] Kings Coll London, London WC2B 4BG, England
[2] Huawei Technol, Edinburgh EH9 3BF, Midlothian, Scotland
[3] Univ Amsterdam, NL-1090 GH Amsterdam, Netherlands
Source
PATTERNS | 2020 / Vol. 1 / Issue 08
Funding
UK Engineering and Physical Sciences Research Council (EPSRC);
Keywords
PROVENANCE; KNOWLEDGE; FRAMEWORK; METADATA;
DOI
10.1016/j.patter.2020.100136
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The web provides access to millions of datasets that can have additional impact when used beyond their original context. We have little empirical insight into what makes a dataset more reusable than others and which of the existing guidelines and frameworks, if any, make a difference. In this paper, we explore potential reuse features through a literature review and present a case study on datasets on GitHub, a popular open platform for sharing code and data. We describe a corpus of more than 1.4 million data files, from over 65,000 repositories. Using GitHub's engagement metrics as proxies for dataset reuse, we relate them to reuse features from the literature and devise an initial model, using deep neural networks, to predict a dataset's reusability. This demonstrates the practical gap between principles and actionable insights that allow data publishers and tools designers to implement functionalities that provably facilitate reuse.
Pages: 21