Controlling overfitting in classification-tree models of software quality

Cited by: 26
Authors
Khoshgoftaar T.M. [1]
Allen E.B. [2]
Affiliations
[1] Florida Atlantic University, Boca Raton, FL
[2] Mississippi State University, MS
Keywords
Algorithms - Computer simulation - Data structures - Fault tolerant computer systems - Large scale systems - Telecommunication systems
DOI
10.1023/A:1009803004576
Abstract
Predicting which modules are likely to have faults during operations is important to software developers, so that software enhancement efforts can be focused on the modules that most need improvement. Modeling software quality with classification trees is attractive because they readily model nonmonotonic relationships. In this paper, we apply the TREEDISC algorithm, a refinement of the CHAID algorithm, to build classification-tree models. CHAID-based algorithms differ from other classification-tree algorithms in their reliance on chi-squared tests when building the tree. Classification-tree models are vulnerable to overfitting, where the model reflects the structure of the training data set too closely. Even though a model appears to be accurate on training data, if overfitted, it may be much less accurate when applied to a current data set. To account for the severe consequences of misclassifying fault-prone modules, our measure of overfitting is based on expected costs of misclassification, rather than the total number of misclassifications. We conducted a case study of a very large telecommunications system. A two-way analysis of variance with repetitions found that TREEDISC's significance level was highly related to overfitting and can be used to control it. Moreover, the minimum number of modules in a leaf also influenced the degree of overfitting.
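The cost-based view of overfitting described in the abstract can be illustrated with a minimal sketch. It assumes that overfitting is gauged as the gap between the average misclassification cost on a test set and on the training set; the function names, the cost values (10 for a missed fault-prone module, 1 for a false alarm), and the toy labels are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def expected_cost(
    actual: Sequence[bool],
    predicted: Sequence[bool],
    cost_miss: float,
    cost_false_alarm: float,
) -> float:
    """Average misclassification cost per module (True = fault-prone).

    cost_miss: assumed cost of labeling a fault-prone module as not fault-prone.
    cost_false_alarm: assumed cost of labeling a sound module as fault-prone.
    """
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    total = 0.0
    for is_faulty, flagged in zip(actual, predicted):
        if is_faulty and not flagged:
            total += cost_miss
        elif flagged and not is_faulty:
            total += cost_false_alarm
    return total / len(actual)


def overfitting_gap(train_cost: float, test_cost: float) -> float:
    """Hypothetical overfitting indicator: test cost minus training cost."""
    return test_cost - train_cost


# Toy data: the model fits the training set perfectly but errs on the test set.
train_actual = [True, True] + [False] * 8
train_pred = [True, True] + [False] * 8
test_actual = [True, True] + [False] * 8
test_pred = [True, False, True] + [False] * 7  # one miss, one false alarm

train_cost = expected_cost(train_actual, train_pred, cost_miss=10.0, cost_false_alarm=1.0)
test_cost = expected_cost(test_actual, test_pred, cost_miss=10.0, cost_false_alarm=1.0)
gap = overfitting_gap(train_cost, test_cost)
```

Weighting the two error types differently means a single missed fault-prone module contributes more to the gap than a single false alarm, which a plain count of misclassifications would not capture.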
Pages: 59-79
Page count: 20
Related papers (50 total)
  • [1] Balancing misclassification rates in classification-tree models of software quality
    Khoshgoftaar T.M.
    Yuan X.
    Allen E.B.
    Empirical Software Engineering, 2000, 5 (4) : 313 - 330
  • [2] Classification-tree models of software-quality over multiple releases
    Khoshgoftaar, TM
    Allen, EB
    Jones, WD
    Hudepohl, JP
    IEEE TRANSACTIONS ON RELIABILITY, 2000, 49 (01) : 4 - 11
  • [3] Controlling overfitting in software quality models: Experiments with regression trees and classification
    Khoshgoftaar, TM
    Allen, EB
    Deng, JY
    SEVENTH INTERNATIONAL SOFTWARE METRICS SYMPOSIUM - METRICS 2001, PROCEEDINGS, 2000, : 190 - 198
  • [4] Reducing overfitting in genetic programming models for software quality classification
    Liu, Y
    Khoshgoftaar, T
    EIGHTH IEEE INTERNATIONAL SYMPOSIUM ON HIGH ASSURANCE SYSTEMS ENGINEERING, PROCEEDINGS, 2004, : 56 - 65
  • [5] A classification-tree hybrid method for studying prognostic models in intensive care
    Abu-Hanna, A
    de Keizer, N
    ARTIFICIAL INTELLIGENCE IN MEDICINE, PROCEEDINGS, 2001, 2101 : 99 - 108
  • [6] Classification-tree restructuring methodologies: A new perspective
    Chen, T.Y.
    Poon, P.-L.
    Tse, T.H.
    2002, Institution of Engineering and Technology (149):
  • [7] An integrated classification-tree methodology for test case generation
    Chen, TY
    Poon, PL
    Tse, TH
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2000, 10 (06) : 647 - 679
  • [8] Confessions during police interrogations: A Classification-Tree Approach
    Deslauriers-Varin, Nadine
    CRIMINOLOGIE, 2020, 53 (02) : 219 - 254
  • [9] Building decision tree software quality classification models using genetic programming
    Liu, Y
    Khoshgoftaar, TM
    GENETIC AND EVOLUTIONARY COMPUTATION - GECCO 2003, PT II, PROCEEDINGS, 2003, 2724 : 1808 - 1809
  • [10] Test case design based on Z and the classification-tree method
    Singh, H
    Conrad, M
    Sadeghipour, S
    FIRST IEEE INTERNATIONAL CONFERENCE ON FORMAL ENGINEERING METHODS, PROCEEDINGS, 1997, : 81 - 90