Multi-level and Multi-version Approach for Software Development Dataset

Cited by: 0
Authors
Zhu J.-X. [1, 2, 3]
Zhou M.-H. [1, 2]
Affiliations
[1] Institute of Software, School of Electronics Engineering and Computer Science, Peking University, Beijing
[2] Key Laboratory of High Confidence Software Technologies of Ministry of Education (Peking University), Beijing
[3] Technology Center of Software Engineering, Institute of Software, Chinese Academy of Sciences, Beijing
Source
Ruan Jian Xue Bao/Journal of Software | 2019, Vol. 30, No. 07
Funding
National Key Research and Development Program of China; National Natural Science Foundation of China
Keywords
Data analysis; Data quality; Data-driven software engineering; Dataset; Software development data
DOI
10.13328/j.cnki.jos.005489
Abstract
With the rapid development of open source software and the wide adoption of development support tools, a large amount of open software development data has become available on the Internet. To improve software development efficiency and product quality, more and more practitioners and researchers attempt to gain insights into software development from these data. To facilitate data analyses and their reproduction and comparison, building and using shared datasets has been proposed and practiced. However, existing datasets lack traceability of the dataset construction process, a defined application scope, and consideration of how the data vary over time and with environment changes, all of which threaten data quality and analysis validity. To address these problems, an approach is proposed for sharing and using software development datasets that constructs datasets with multiple levels and multiple versions. Through multiple levels, a dataset retains the raw data, intermediate data, and final data, which provides data traceability. Through multiple versions, users can compare versions and observe how the data vary, in order to verify and improve data quality and analysis validity. Based on the previously constructed Mozilla issue tracking dataset, this paper demonstrates how to build and use a multi-level and multi-version software development dataset and verifies that the proposed approach can help users use the dataset efficiently. © Copyright 2019, Institute of Software, the Chinese Academy of Sciences. All rights reserved.
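The multi-level, multi-version idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class `DatasetVersion`, the function `diff_versions`, the level names, and the sample issue records are all hypothetical, chosen only to show how keeping raw, intermediate, and final data per version makes it possible to trace a record and to compare versions.

```python
from dataclasses import dataclass, field

# Hypothetical level names, following the abstract's raw / intermediate / final split.
LEVELS = ("raw", "intermediate", "final")


@dataclass
class DatasetVersion:
    """One snapshot of the dataset that keeps every processing level."""
    version: str
    # level name -> {record id -> record fields}
    levels: dict = field(default_factory=dict)


def diff_versions(old, new, level):
    """Return ids added, removed, and changed between two versions at one level."""
    a = old.levels.get(level, {})
    b = new.levels.get(level, {})
    added = sorted(b.keys() - a.keys())
    removed = sorted(a.keys() - b.keys())
    changed = sorted(k for k in a.keys() & b.keys() if a[k] != b[k])
    return added, removed, changed


# Made-up issue records for illustration only.
v1 = DatasetVersion("v1", {"raw": {1: {"status": "NEW"}, 2: {"status": "NEW"}}})
v2 = DatasetVersion("v2", {"raw": {1: {"status": "FIXED"},
                                   2: {"status": "NEW"},
                                   3: {"status": "NEW"}}})

added, removed, changed = diff_versions(v1, v2, "raw")
print(added, removed, changed)
```

Comparing at the raw level shows which records the data source itself changed between snapshots; the same comparison at the intermediate or final level would instead expose differences introduced by processing.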
Pages: 2109-2123
Page count: 14