An Improved Method for Training Data Selection for Cross-Project Defect Prediction

Cited by: 18

Authors:

Bhat, Nayeem Ahmad [1]

Farooq, Sheikh Umar [1]

Affiliations:

[1] Univ Kashmir, Dept Comp Sci, North Campus, Jammu and Kashmir, India

Source:

ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING | 2022 / Vol. 47 / Issue 02

Keywords:

Cross-project defect prediction; Class imbalance learning; Distributional difference; Data normalization; Software quality assurance; Training data selection; STATIC CODE ATTRIBUTES; CLASSIFICATION; FAULTS;

DOI:

10.1007/s13369-021-06088-3

Chinese Library Classification (CLC):

O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];

Discipline classification codes:

07 ; 0710 ; 09 ;

Abstract:

The selection of relevant training data significantly improves the quality of the cross-project defect prediction (CPDP) process. We propose a training data selection approach and compare its performance against the Burak filter and the Peter filter on the Bug Prediction Dataset. In our approach (BurakMHD), a data transformation is first applied to the datasets. Then, each instance of the target project adds to the filtered training dataset (filtered TDS) the k instances at minimum Hamming distance from the transformed multi-source defective data and, separately, the k instances at minimum Hamming distance from the transformed multi-source non-defective data. Compared to using all the cross-project data, the false positive rate decreases by 10.6%, accompanied by a 2.6% decrease in defect detection rate. The overall performance measures nMCC, Balance, and G-measure increase by 2.9%, 5.7%, and 6.6%, respectively. Compared to the Burak filter and the Peter filter, the defect detection rate increases by 1.5% and 1.8%, respectively, and the false positive rate decreases by 6.4%. The overall performance measures nMCC, Balance, and G-measure increase by 3%, 5.3%, and 6.8% compared to the Burak filter, and by 3.2%, 5.5%, and 7.1% compared to the Peter filter. Compared to within-project predictions, nMCC, Balance, and G-measure increase by 1.1%, 3.4%, and 4%, respectively, while the defect detection rate and false positive rate decrease by 9.2% and 13.1%, respectively. In general, our approach improved performance significantly compared to the Burak filter, the Peter filter, cross-project prediction, and within-project prediction. We therefore conclude that applying a data transformation and filtering training data separately from the defective and non-defective instances of cross-project data helps select relevant data for CPDP.
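The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which data transformation is used, so the sketch assumes a per-feature median binarization purely so that the Hamming distance is well defined. The function names (`binarize`, `hamming`, `select_training_data`), the tie-breaking rule, and the toy data are all hypothetical.

```python
from statistics import median


def binarize(rows, thresholds):
    """Map each numeric metric to 0/1 by comparing it with a per-feature threshold."""
    return [tuple(int(v > t) for v, t in zip(row, thresholds)) for row in rows]


def hamming(a, b):
    """Count the positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def select_training_data(source_rows, source_labels, target_rows, k=1):
    """Return sorted indices of the source instances kept in the filtered training set.

    For every target instance, the k nearest defective and the k nearest
    non-defective source instances (by Hamming distance) are added.
    """
    # Assumed transformation (not from the abstract): binarize every feature
    # on the median of the source data.
    thresholds = [median(col) for col in zip(*source_rows)]
    src = binarize(source_rows, thresholds)
    tgt = binarize(target_rows, thresholds)

    # Split source indices by class so that each class is searched separately.
    by_class = {0: [], 1: []}
    for i, label in enumerate(source_labels):
        by_class[label].append(i)

    selected = set()
    for t in tgt:
        for indices in by_class.values():
            # sorted() is stable, so ties keep the original source order.
            nearest = sorted(indices, key=lambda i: hamming(src[i], t))[:k]
            selected.update(nearest)
    return sorted(selected)


# Toy data: three code metrics per instance; label 1 marks a defective instance.
source = [[10, 9, 1], [50, 9, 7], [12, 3, 1], [48, 2, 6], [30, 5, 3]]
labels = [0, 1, 0, 1, 0]
target = [[49, 1, 6], [11, 2, 1]]
kept = select_training_data(source, labels, target, k=1)
```

Searching the defective and non-defective pools separately means every target instance contributes neighbours from both classes, which is the property the abstract credits for selecting relevant training data.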

Pages: 1939-1954

Page count: 16

References (55 in total)

[1] Cross project defect prediction for open source software [J].

Agrawal A. ;

Malhotra R. .

International Journal of Information Technology, 2022, 14 (1) :587-601

[2]

[Anonymous], 2020, Practical Data Science with R

[3] A validation of object-oriented design metrics as quality indicators [J].

Basili, VR ;

Briand, LC ;

Melo, WL .

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 1996, 22 (10) :751-761

[4] MAHAKIL: Diversity Based Oversampling Approach to Alleviate the Class Imbalance Issue in Software Defect Prediction [J].

Bennin, Kwabena Ebo ;

Keung, Jacky ;

Phannachitta, Passakorn ;

Monden, Akito ;

Mensah, Solomon .

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2018, 44 (06) :534-550

[5] The Significant Effects of Data Sampling Approaches on Software Defect Prioritization and Classification [J].

Bennin, Kwabena Ebo ;

Keung, Jacky ;

Monden, Akito ;

Phannachitta, Passakorn ;

Mensah, Solomon .

11TH ACM/IEEE INTERNATIONAL SYMPOSIUM ON EMPIRICAL SOFTWARE ENGINEERING AND MEASUREMENT (ESEM 2017), 2017, :364-373

[6]

Bettenburg N., 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR 2012), P60, DOI 10.1109/MSR.2012.6224300

[7] Software defect prediction: do different classifiers find the same defects? [J].

Bowes, David ;

Hall, Tracy ;

Petric, Jean .

SOFTWARE QUALITY JOURNAL, 2018, 26 (02) :525-552

[8] The use of cross-company fault data for the software fault prediction problem [J].

Catal, Cagatay .

TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2016, 24 (05) :3714-3723

[9] The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation [J].

Chicco, Davide ;

Jurman, Giuseppe .

BMC GENOMICS, 2020, 21 (01)

[10] Ten quick tips for machine learning in computational biology [J].

Chicco, Davide .

BIODATA MINING, 2017, 10
