ModelSet: a dataset for machine learning in model-driven engineering

Cited by: 32
Authors
Hernandez Lopez, Jose Antonio [1]
Canovas Izquierdo, Javier Luis [2]
Sanchez Cuadrado, Jesus [1]
Affiliations
[1] Univ Murcia, Fac Informat, Murcia, Spain
[2] UOC IN3, Castelldefels, Spain
Keywords
Dataset; Machine learning; Model-driven engineering
DOI
10.1007/s10270-021-00929-3
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
The application of machine learning (ML) algorithms to address problems related to model-driven engineering (MDE) is currently hindered by the lack of curated datasets of software models. There are several reasons for this, including the lack of large collections of good-quality models, the difficulty of labelling models due to the required domain expertise, and the relative immaturity of the application of ML to MDE. In this work, we present ModelSet, a labelled dataset of software models intended to enable the application of ML to address software modelling problems. To create it, we have devised a method designed to facilitate the exploration and labelling of model datasets by interactively grouping similar models using off-the-shelf technologies like a search engine. We have built an Eclipse plug-in to support the labelling process, which we have used to label 5,466 Ecore meta-models and 5,120 UML models, each with its category as the main label plus additional secondary labels of interest. We have evaluated the ability of our labelling method to create meaningful groups of models in order to speed up the process, improving on the effectiveness of classical clustering methods. We showcase the usefulness of the dataset by applying it in a real scenario: enhancing the MAR search engine. We use ModelSet to train models able to infer useful metadata to navigate search results. The dataset and the tooling are available at and a live version at http://modelset.github.io.
Pages: 967-986
Page count: 20