ModelSet: a dataset for machine learning in model-driven engineering

Cited by: 0
Authors
José Antonio Hernández López
Javier Luis Cánovas Izquierdo
Jesús Sánchez Cuadrado
Affiliations
[1] Universidad de Murcia, Facultad de Informática
[2] UOC - IN3
Source
Software and Systems Modeling | 2022 / Vol. 21
Keywords
Dataset; Machine learning; Model-driven engineering;
DOI
Not available
Abstract
The application of machine learning (ML) algorithms to problems in model-driven engineering (MDE) is currently hindered by the lack of curated datasets of software models. There are several reasons for this, including the lack of large collections of good-quality models, the difficulty of labelling models due to the required domain expertise, and the relative immaturity of the application of ML to MDE. In this work, we present ModelSet, a labelled dataset of software models intended to enable the application of ML to software modelling problems. To create it, we have devised a method that facilitates the exploration and labelling of model datasets by interactively grouping similar models using off-the-shelf technologies such as a search engine. We have built an Eclipse plug-in to support the labelling process, which we have used to label 5,466 Ecore meta-models and 5,120 UML models, each with its category as the main label plus additional secondary labels of interest. We have evaluated the ability of our labelling method to create meaningful groups of models in order to speed up the process, improving on the effectiveness of classical clustering methods. We showcase the usefulness of the dataset by applying it in a real scenario: enhancing the MAR search engine. We use ModelSet to train models able to infer useful metadata for navigating search results. The dataset and the tooling are available at https://figshare.com/s/5a6c02fa8ed20782935c and a live version at http://modelset.github.io.
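The idea of grouping similar models so that they can be labelled together can be illustrated with a minimal sketch. It is not the authors' method (the abstract only states that an off-the-shelf search engine is used): it assumes that each model is reduced to the names of its elements and that similarity is the Jaccard overlap of the word tokens in those names. The function names, the threshold, and the example models are all hypothetical.

```python
import re

def tokens(identifiers):
    """Split names such as 'PetriNet' or 'place_name' into lowercase word tokens."""
    found = set()
    for name in identifiers:
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name):
            found.add(part.lower())
    return found

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_similar(models, threshold=0.3):
    """Greedily assign each model to the first group whose seed model is similar enough.

    `models` maps a model name to the list of element names it contains.
    The threshold is an arbitrary illustrative value, not taken from the paper.
    """
    groups = []  # each entry: (token set of the seed model, list of member names)
    for name, identifiers in models.items():
        toks = tokens(identifiers)
        for seed, members in groups:
            if jaccard(seed, toks) >= threshold:
                members.append(name)
                break
        else:
            groups.append((toks, [name]))
    return [members for _, members in groups]

# Hypothetical meta-models, each described only by its class names.
EXAMPLE_MODELS = {
    "petri1.ecore": ["PetriNet", "Place", "Transition", "Arc"],
    "petri2.ecore": ["Net", "Place", "Transition", "Token"],
    "fsm.ecore": ["StateMachine", "State", "Transition"],
    "library.ecore": ["Library", "Book", "Author"],
}

if __name__ == "__main__":
    for group in group_similar(EXAMPLE_MODELS):
        print(group)
```

With these inputs the two Petri net meta-models fall into one group, so a single labelling decision could cover both, while the state machine and library meta-models each stay on their own.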
Pages: 967-986
Page count: 19