Controlling overfitting in classification-tree models of software quality

Cited by: 26

Authors:

Khoshgoftaar T.M. [1]

Allen E.B. [2]

Affiliations:

[1] Florida Atlantic University, Boca Raton, FL

[2] Mississippi State University, MS

Source:

Empirical Software Engineering | 2001 / Vol. 6 / No. 1

Keywords:

Algorithms - Computer simulation - Data structures - Fault tolerant computer systems - Large scale systems - Telecommunication systems;

DOI:

10.1023/A:1009803004576

Abstract:

Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on the modules that most need improvement. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.
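
The cost-based view of overfitting described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's formula, which this record does not give: it assumes the expected cost is the average per-module cost of the two error types, and that overfitting shows up as the gap between that cost on a test set and on the training set. The names `Costs`, `expected_cost`, and `overfitting_gap`, and the cost values used, are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Costs:
    """Hypothetical unit costs for the two kinds of misclassification."""
    type_i: float   # a not-fault-prone module is flagged as fault-prone
    type_ii: float  # a fault-prone module is missed


def expected_cost(actual: Sequence[bool],
                  predicted: Sequence[bool],
                  costs: Costs) -> float:
    """Average misclassification cost per module (True = fault-prone)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = 0.0
    for is_faulty, flagged in zip(actual, predicted):
        if flagged and not is_faulty:
            total += costs.type_i    # false alarm
        elif is_faulty and not flagged:
            total += costs.type_ii   # missed fault-prone module
    return total / len(actual)


def overfitting_gap(train_cost: float, test_cost: float) -> float:
    """Assumed overfitting indicator: how much worse the model is on new data."""
    return test_cost - train_cost


# Illustrative data only: four modules, one false alarm and one miss.
costs = Costs(type_i=1.0, type_ii=10.0)
actual = [True, False, False, True]
predicted = [True, True, False, False]
cost = expected_cost(actual, predicted, costs)
```

Weighting a missed fault-prone module more heavily than a false alarm is what distinguishes this measure from a plain count of misclassifications; the 1:10 ratio here is an arbitrary choice for the example.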

Pages: 59 - 79

Page count: 20

50 items in total

[1] Balancing misclassification rates in classification-tree models of software quality
Khoshgoftaar T.M.
Yuan X.
Allen E.B.
Empirical Software Engineering, 2000, 5 (4) : 313 - 330
[2] Classification-tree models of software-quality over multiple releases
Khoshgoftaar, TM
Allen, EB
Jones, WD
Hudepohl, JP
IEEE TRANSACTIONS ON RELIABILITY, 2000, 49 (01) : 4 - 11
[3] Controlling overfitting in software quality models: Experiments with regression trees and classification
Khoshgoftaar, TM
Allen, EB
Deng, JY
SEVENTH INTERNATIONAL SOFTWARE METRICS SYMPOSIUM - METRICS 2001, PROCEEDINGS, 2000, : 190 - 198
[4] Reducing overfitting in genetic programming models for software quality classification
Liu, Y
Khoshgoftaar, T
EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON HIGH ASSURANCE SYSTEMS ENGINEERING, PROCEEDINGS, 2004, : 56 - 65
[5] A classification-tree hybrid method for studying prognostic models in intensive care
Abu-Hanna, A
de Keizer, N
ARTIFICIAL INTELLIGENCE IN MEDICINE, PROCEEDINGS, 2001, 2101 : 99 - 108
[6] Classification-tree restructuring methodologies: A new perspective
Chen, T.Y.
Poon, P.-L.
Tse, T.H.
2002, Institution of Engineering and Technology (149):
[7] An integrated classification-tree methodology for test case generation
Chen, TY
Poon, PL
Tse, TH
INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2000, 10 (06) : 647 - 679
[8] Confessions during police interrogations : A Classification-Tree Approach
Deslauriers-Varin, Nadine
CRIMINOLOGIE, 2020, 53 (02) : 219 - 254
[9] Building decision tree software quality classification models using genetic programming
Liu, Y
Khoshgoftaar, TM
GENETIC AND EVOLUTIONARY COMPUTATION - GECCO 2003, PT II, PROCEEDINGS, 2003, 2724 : 1808 - 1809
[10] Test case design based on Z and the classification-tree method
Singh, H
Conrad, M
Sadeghipour, S
FIRST IEEE INTERNATIONAL CONFERENCE ON FORMAL ENGINEERING METHODS, PROCEEDINGS, 1997, : 81 - 90
