Controlling overfitting in classification-tree models of software quality

Cited by: 26

Authors:

Khoshgoftaar T.M. [1]

Allen E.B. [2]

Affiliations:

[1] Florida Atlantic University, Boca Raton, FL

[2] Mississippi State University, MS

Source:

Empirical Software Engineering | 2001 / Vol. 6 / No. 1

Keywords:

Algorithms - Computer simulation - Data structures - Fault tolerant computer systems - Large scale systems - Telecommunication systems

DOI:

10.1023/A:1009803004576

Chinese Library Classification (CLC) number:

Subject classification code:

Abstract:

Predicting which modules are likely to have faults during operations is important to software developers, because it lets them focus enhancement efforts on the modules that most need improvement. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. A model that appears accurate on training data may, if overfitted, be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting and can be used to control it. The minimum number of modules in a leaf also influenced the degree of overfitting.
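
The cost-based view of overfitting described in the abstract can be illustrated with a minimal sketch. It is not the paper's exact measure: the function names, the per-module averaging, the 1:10 cost ratio, and the definition of the gap as test cost minus training cost are all assumptions chosen for illustration.

```python
def expected_cost(actual, predicted, cost_false_alarm=1.0, cost_missed_fault=10.0):
    """Average misclassification cost per module.

    actual, predicted: equal-length sequences of booleans, True = fault-prone.
    The default costs are hypothetical; they only encode the idea that
    missing a fault-prone module is costlier than a false alarm.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one module is required")
    total = 0.0
    for is_fault_prone, flagged in zip(actual, predicted):
        if is_fault_prone and not flagged:
            total += cost_missed_fault   # fault-prone module classified as not fault-prone
        elif flagged and not is_fault_prone:
            total += cost_false_alarm    # sound module classified as fault-prone
    return total / len(actual)


def overfitting_gap(train_actual, train_pred, test_actual, test_pred, **costs):
    """Assumed overfitting indicator: cost on new data minus cost on training data."""
    return (expected_cost(test_actual, test_pred, **costs)
            - expected_cost(train_actual, train_pred, **costs))


# Hypothetical data: the model fits the training modules perfectly
# but makes one false alarm and one missed fault on four new modules.
train_actual = [True, False, False, True]
train_pred = [True, False, False, True]
test_actual = [True, False, False, True]
test_pred = [True, True, False, False]
gap = overfitting_gap(train_actual, train_pred, test_actual, test_pred)
```

A positive gap means the model costs more on new data than on the data it was trained on. Counting raw misclassifications would weight the two errors above equally, whereas the cost-weighted figure is dominated by the missed fault-prone module.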

Pages: 59-79

Page count: 20