Improving generality and accuracy of existing public development project selection methods: a study on GitHub ecosystem

Cited by: 0
Authors
Can Cheng
Bing Li
Zengyang Li
Peng Liang
Xiaofeng Han
Jiahua Zhang
Affiliations
[1] Wuhan University, School of Computer Science
[2] Central China Normal University, School of Computer Science
Source
Automated Software Engineering | 2022 / Vol. 29
Keywords
Open source software project; GitHub; Public development project
DOI
Not available
Abstract
With tools and datasets available in the GitHub ecosystem, researchers have the opportunity to study diverse software engineering problems on large-scale datasets. However, using such datasets directly carries many potential threats; an important one is that GitHub contains many private projects (e.g., homework) and non-development projects (e.g., blogs). For researchers who want to study the cooperative behavior of developers or the development process of projects, research samples should contain neither private projects nor non-development projects. To solve this problem, we first analyzed the weaknesses of the baseline methods (i.e., selecting top projects) and of extended ML-based methods (i.e., training models on a labeled training dataset using ML algorithms; Extended_MLMs for short). We then proposed two methods, Enhanced_RFM and Fusion_DL_RFM, to address the weaknesses of Extended_RFM (the Extended_MLM that is based on Random Forest and performs best among all the Extended_MLMs). The results show that: (1) existing project sample selection methods have a low F-measure and poor generality (i.e., they perform poorly on the testing dataset); (2) Enhanced_RFM outperforms Fusion_DL_RFM in accuracy and stability; and (3) by adopting Enhanced_RFM, the F-measure of Extended_RFM improves from 0.690 to 0.810 and its precision improves from 0.559 to 0.785 under cross validation, which indicates that the generality of Extended_RFM is significantly improved.