A Novel Feature Selection Method Based on Maximum Likelihood Logistic Regression for Imbalanced Learning in Software Defect Prediction

Cited by: 10
Authors
Bashir, Kamal [1 ]
Li, Tianrui [1 ]
Yahaya, Mahama [2 ]
Affiliations
[1] Southwest Jiaotong Univ, Sch Informat Sci & Technol, Chengdu, Peoples R China
[2] Southwest Jiaotong Univ, Sch Transport & Logist Engn, Chengdu, Peoples R China
Funding
US National Science Foundation;
Keywords
Software defect prediction; Machine learning; Class imbalance; Maximum-likelihood logistic regression;
DOI
10.34028/iajit/17/5/5
Chinese Library Classification (CLC)
TP18 [Theory of artificial intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The most frequently used machine learning feature ranking approaches fail to produce an optimal feature subset for accurate prediction of defective software modules in out-of-sample data. Machine learning Feature Selection (FS) algorithms such as Chi-Square (CS), Information Gain (IG), Gain Ratio (GR), ReliefF (RF) and Symmetric Uncertainty (SU) perform relatively poorly at prediction, even after balancing the class distribution in the training data. In this study, we propose a novel FS method based on Maximum Likelihood Logistic Regression (MLLR). We apply this method to six software defect datasets, in both their sampled and unsampled forms, to select useful features for classification in the context of Software Defect Prediction (SDP). The Support Vector Machine (SVM) and Random Forest (RaF) classifiers are applied to the FS subsets derived from the sampled and unsampled datasets. The performance of the models, measured by the Area Under the Receiver Operating Characteristic Curve (AUC), is compared across all FS methods considered. The Analysis Of Variance (ANOVA) F-test results validate the superiority of the proposed method over all the other FS techniques, on both sampled and unsampled data. The results confirm that MLLR can be useful in selecting an optimal feature subset for more accurate prediction of defective modules in the software development process.
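The pipeline described in the abstract — rank features with a maximum-likelihood logistic regression fit, keep the top-ranked subset, then evaluate SVM and Random Forest classifiers by AUC — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact ranking criterion is not specified in the abstract, so this sketch assumes ranking by the absolute magnitude of standardized logistic regression coefficients, and the choice of subset size (here 8) and the synthetic imbalanced dataset are purely illustrative.

```python
# Hedged sketch of an MLLR-style feature-selection pipeline (assumed ranking
# criterion: absolute standardized logistic regression coefficients).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic class-imbalanced data standing in for a software defect dataset
# (roughly 10% "defective" modules).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Standardize so coefficient magnitudes are comparable across features.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Maximum-likelihood logistic regression fit; rank features by |coefficient|.
mllr = LogisticRegression(max_iter=1000).fit(X_tr_s, y_tr)
ranking = np.argsort(-np.abs(mllr.coef_[0]))
top_k = ranking[:8]  # subset size is an illustrative choice

# Evaluate both classifiers on the selected subset using AUC.
aucs = {}
for clf in (SVC(probability=True, random_state=0),
            RandomForestClassifier(random_state=0)):
    clf.fit(X_tr_s[:, top_k], y_tr)
    scores = clf.predict_proba(X_te_s[:, top_k])[:, 1]
    aucs[type(clf).__name__] = roc_auc_score(y_te, scores)
    print(type(clf).__name__, round(aucs[type(clf).__name__], 3))
```

AUC is used rather than accuracy because, on imbalanced data, a classifier can reach high accuracy by predicting the majority (non-defective) class everywhere; AUC is insensitive to the class ratio.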
Pages: 721-730
Page count: 10
Related papers
29 in total
[11]   Attribute Selection and Imbalanced Data: Problems in Software Defect Prediction [J].
Khoshgoftaar, Taghi M. ;
Gao, Kehan ;
Seliya, Naeem .
22ND INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE (ICTAI 2010), PROCEEDINGS, VOL 1, 2010,
[12]   A comparative study of iterative and non-iterative feature selection techniques for software defect prediction [J].
Khoshgoftaar, Taghi M. ;
Gao, Kehan ;
Napolitano, Amri ;
Wald, Randall .
INFORMATION SYSTEMS FRONTIERS, 2014, 16 (05) :801-822
[13]   Wrappers for feature subset selection [J].
Kohavi, R ;
John, GH .
ARTIFICIAL INTELLIGENCE, 1997, 97 (1-2) :273-324
[14]  
Kumar V., 2014, The Smart Computing Review, V4, P211, DOI 10.6029/smartcr.2014.03.007
[15]   Logistic Regression Ensemble for Predicting Customer Defection with Very Large Sample Size [J].
Kuswanto, Heri ;
Asfihani, Ayu ;
Sarumaha, Yogi ;
Ohwada, Hayato .
THIRD INFORMATION SYSTEMS INTERNATIONAL CONFERENCE 2015, 2015, 72 :86-93
[16]   Efficient multiclass ROC approximation by decomposition via confusion matrix perturbation analysis [J].
Landgrebe, Thomas C. W. ;
Duin, Robert P. W. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2008, 30 (05) :810-822
[17]   Toward integrating feature selection algorithms for classification and clustering [J].
Liu, H ;
Yu, L .
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (04) :491-502
[18]   GENERALIZED LINEAR-MODELS [J].
MCCULLAGH, P .
EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 1984, 16 (03) :285-292
[19]  
Menzies T., Promise repository of empirical software engineering data
[20]   Data mining static code attributes to learn defect predictors [J].
Menzies, Tim ;
Greenwald, Jeremy ;
Frank, Art .
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2007, 33 (01) :2-13