Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem

Cited by: 0

Authors:

Can Cheng

Bing Li

Zengyang Li

Peng Liang

Xiaofeng Han

Jiahua Zhang

Affiliations:

[1] Wuhan University, School of Computer Science

[2] Central China Normal University, School of Computer Science

Source:

Automated Software Engineering | 2022 / Vol. 29

Keywords:

Open source software project; GitHub; Public development project

DOI:

Not available


Abstract:

With the tools and datasets available in the GitHub ecosystem, researchers have the opportunity to study diverse software engineering problems on large-scale datasets. However, using large-scale datasets directly carries many potential threats, and one important threat is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). For researchers who want to study the cooperative behavior of developers or the development process of projects, research samples should contain neither private projects nor non-development projects. To solve this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms, Extended_MLMs for short), and then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and performs best among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM in accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM improves from 0.690 to 0.810 and its precision improves from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.
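The abstract reports F-measure and precision but not recall. As a reading aid, the sketch below shows how the three quantities relate, assuming that "F-measure" here means the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state this. The function names `f_measure` and `implied_recall` are hypothetical, and the recall values the script prints are derived under that assumption, not results reported by the authors.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(f1: float, precision: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given F1 and P.

    Assumes the reported F-measure is the balanced F1 score.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


# (F-measure, precision) pairs quoted in the abstract, under cross validation.
reported = {
    "Extended_RFM": (0.690, 0.559),
    "Enhanced_RFM": (0.810, 0.785),
}

for name, (f1, precision) in reported.items():
    recall = implied_recall(f1, precision)
    print(f"{name}: precision={precision:.3f}, F1={f1:.3f}, implied recall={recall:.3f}")
```

If the F1 assumption holds, the figures suggest that the gain in F-measure comes mainly from higher precision, with some recall traded away in the process.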
