A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects

Cited by: 3
Authors
Melchor, Fran [1]
Rodriguez-Echeverria, Roberto [1]
Conejero, Jose M. [1]
Prieto, Alvaro E. [1]
Gutierrez, Juan D. [1]
Affiliations
[1] Univ Extremadura, INTIA, Caceres, Spain
Source
ADVANCED INFORMATION SYSTEMS ENGINEERING (CAISE 2022) | 2022
Keywords
Reproducibility; Replicability; Process; Data science; Model-driven engineering; PROVENANCE; PIPELINES;
DOI
10.1007/978-3-031-07472-1_9
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In the last few years, the number of tools and approaches for defining data science pipelines has increased considerably. These tools support not only pipeline definition but also generation of the code needed to execute the project, making data science projects accessible even to non-expert users. However, some challenges remain unaddressed, e.g. executing pipelines defined with different tools or executing them in different environments (reproducibility and replicability), and validating and verifying models by identifying inconsistent operations (intentionality). To alleviate these problems, this paper presents a model-driven framework for defining data science pipelines independently of any particular execution platform and tools. The framework separates the pipeline definition into two modelling layers: a conceptual layer, where the data scientist specifies all the data and model operations to be carried out by the pipeline; and an operational layer, where the data engineer describes the details of the execution environment in which the operations defined in the conceptual layer will be implemented. Based on this abstract definition and separation of layers, the approach allows: the use of different tools, thus improving process replicability; automation of process execution, enhancing process reproducibility; and the definition of model verification rules, providing intentionality restrictions.
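The two-layer separation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' metamodel or tooling: the class names (`Operation`, `ConceptualPipeline`, `OperationalBinding`), the `requires` ordering rule, and the toy data are all assumptions made for illustration only. The conceptual layer lists tool-independent operations and checks a simple consistency rule; the operational layer binds each operation kind to a concrete implementation, so the same pipeline can run on two different "platforms".

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """Conceptual-layer step: says what to do, names no tool (hypothetical)."""
    name: str
    kind: str                 # e.g. "load", "clean", "summarize"
    requires: tuple = ()      # kinds that must appear earlier in the pipeline


@dataclass
class ConceptualPipeline:
    operations: list

    def verify(self):
        """Return rule violations; an empty list means the pipeline is consistent."""
        seen, problems = set(), []
        for op in self.operations:
            for needed in op.requires:
                if needed not in seen:
                    problems.append(f"{op.name}: requires a prior '{needed}' operation")
            seen.add(op.kind)
        return problems


@dataclass
class OperationalBinding:
    """Operational layer: maps each operation kind to a platform-specific callable."""
    platform: str
    implementations: dict

    def run(self, pipeline, data=None):
        problems = pipeline.verify()
        if problems:
            raise ValueError("; ".join(problems))
        # Execute the operations in order, feeding each result to the next step.
        for op in pipeline.operations:
            data = self.implementations[op.kind](data)
        return data


# One conceptual definition...
pipeline = ConceptualPipeline([
    Operation("read values", "load"),
    Operation("drop missing", "clean", requires=("load",)),
    Operation("average", "summarize", requires=("clean",)),
])

# ...executed through two interchangeable operational bindings.
plain = OperationalBinding("plain-python", {
    "load": lambda _: [1.0, None, 3.0],
    "clean": lambda xs: [x for x in xs if x is not None],
    "summarize": lambda xs: sum(xs) / len(xs),
})
stdlib = OperationalBinding("statistics-module", {
    "load": lambda _: [1.0, None, 3.0],
    "clean": lambda xs: list(filter(lambda x: x is not None, xs)),
    "summarize": statistics.mean,
})

# A pipeline that summarizes without cleaning first violates the ordering rule.
inconsistent = ConceptualPipeline([
    Operation("read values", "load"),
    Operation("average", "summarize", requires=("clean",)),
])
```

Running `plain.run(pipeline)` and `stdlib.run(pipeline)` yields the same result from a single conceptual definition, which is the replicability idea in miniature, while `inconsistent.verify()` reports the missing cleaning step before anything is executed.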
Pages: 147-163
Page count: 17
Related papers
30 in total
  • [1] Baker M, 2016, NATURE, V533, P452, DOI 10.1038/533452a
  • [2] Incorporating measurement uncertainty into OCL/UML primitive datatypes
    Bertoa, Manuel F.
    Burgueno, Loli
    Moreno, Nathalie
    Vallecillo, Antonio
    [J]. SOFTWARE AND SYSTEMS MODELING, 2020, 19 (05) : 1163 - 1189
  • [3] Brambilla M., 2017, Synthesis Lectures on Software Engineering, VSecond, DOI 10.2200/S00751ED2V01Y201701SWE004
  • [4] Byrne C, 2017, Development Workflows for Data Scientists
  • [5] Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science
    Chapman, Adriane
    Missier, Paolo
    Simonelli, Giulia
    Torlone, Riccardo
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 14 (04) : 507 - 520
  • [6] Domenech Antonio Molner, 2020, Journal of Physics: Conference Series, V1603, DOI 10.1088/1742-6596/1603/1/012025
  • [7] A Real-Life Machine Learning Experience for Predicting University Dropout at Different Stages Using Academic Data
    Fernandez-Garcia, Antonio Jesus
    Preciado, Juan Carlos
    Melchor, Fran
    Rodriguez-Echeverria, Roberto
    Conejero, Jose Maria
    Sanchez-Figueroa, Fernando
    [J]. IEEE ACCESS, 2021, 9 : 133076 - 133090
  • [8] Gardner J, 2018, IEEE INT CONF BIG DA, P3235, DOI 10.1109/BigData.2018.8621874
  • [9] Gundersen OE, 2018, AAAI CONF ARTIF INTE, P1644
  • [10] Transparency and reproducibility in artificial intelligence
    Haibe-Kains, Benjamin
    Adam, George Alexandru
    Hosny, Ahmed
    Khodakarami, Farnoosh
    Shraddha, Thakkar
    Kusko, Rebecca
    Sansone, Susanna-Assunta
    Tong, Weida
    Wolfinger, Russ D.
    Mason, Christopher E.
    Jones, Wendell
    Dopazo, Joaquin
    Furlanello, Cesare
    Waldron, Levi
    Wang, Bo
    McIntosh, Chris
    Goldenberg, Anna
    Kundaje, Anshul
    Greene, Casey S.
    Broderick, Tamara
    Hoffman, Michael M.
    Leek, Jeffrey T.
    Korthauer, Keegan
    Huber, Wolfgang
    Brazma, Alvis
    Pineau, Joelle
    Tibshirani, Robert
    Hastie, Trevor
    Ioannidis, John P. A.
    Quackenbush, John
    Aerts, Hugo J. W. L.
    [J]. NATURE, 2020, 586 (7829) : E14 - U7