Balancing Privacy and Utility in Cross-Company Defect Prediction

Cited by: 100
Authors
Peters, Fayola [1]
Menzies, Tim [1]
Gong, Liang [2]
Zhang, Hongyu [2]
Affiliations
[1] W Virginia Univ, Lane Dept Comp Sci & Elect Engn, Morgantown, WV 26506 USA
[2] Tsinghua Univ, Sch Software, Beijing 100084, Peoples R China
Funding
National Science Foundation (USA)
Keywords
Privacy; classification; defect prediction; static code attributes; k-anonymity; model
DOI
10.1109/TSE.2013.6
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
Background: Cross-company defect prediction (CCDP) is a field of study where an organization lacking enough local data can use data from other organizations for building defect predictors. To support CCDP, data must be shared. Such shared data must be privatized, but that privatization could severely damage the utility of the data. Aim: To enable effective defect prediction from shared data while preserving privacy. Method: We explore privatization algorithms that maintain class boundaries in a dataset. CLIFF is an instance pruner that deletes irrelevant examples. MORPH is a data mutator that moves the data a random distance, taking care not to cross class boundaries. CLIFF+MORPH are tested in a CCDP study among 10 defect datasets from the PROMISE data repository. Results: We find: 1) The CLIFFed+MORPHed algorithms provide more privacy than the state-of-the-art privacy algorithms; 2) in terms of utility measured by defect prediction, we find that CLIFF+MORPH performs significantly better. Conclusions: For the OO defect data studied here, data can be privatized and shared without a significant degradation in utility. To the best of our knowledge, this is the first published result where privatization does not compromise defect prediction.
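The abstract describes MORPH only as a mutator that moves each instance a random distance without crossing class boundaries. The sketch below is one possible reading of that idea, not the authors' implementation. It assumes that each instance is shifted by a random fraction of its offset from its nearest unlike neighbor (the closest instance of a different class), and that keeping the fraction below one half is what keeps the instance on its own side of the boundary. The function names, the Euclidean distance, and the default range of 0.15 to 0.35 are all assumptions made for illustration.

```python
import math
import random


def nearest_unlike_neighbor(row, label, data, labels):
    """Return the closest instance whose class differs from `label` (or None)."""
    best, best_dist = None, math.inf
    for other, other_label in zip(data, labels):
        if other_label == label:
            continue
        dist = math.dist(row, other)  # Euclidean distance (an assumption)
        if dist < best_dist:
            best, best_dist = other, dist
    return best


def morph(data, labels, low=0.15, high=0.35, rng=None):
    """Hypothetical MORPH-style mutator.

    Each instance moves by a random fraction r in [low, high] of its offset
    from its nearest unlike neighbor, in a randomly chosen direction.
    The default range is assumed, not taken from the abstract.
    """
    rng = rng or random.Random()
    mutated = []
    for row, label in zip(data, labels):
        z = nearest_unlike_neighbor(row, label, data, labels)
        if z is None:
            # Only one class present: no boundary to respect, leave unchanged.
            mutated.append(list(row))
            continue
        r = rng.uniform(low, high)
        sign = rng.choice((-1.0, 1.0))
        mutated.append([x + sign * (x - zi) * r for x, zi in zip(row, z)])
    return mutated


if __name__ == "__main__":
    data = [[1.0, 2.0], [1.5, 1.8], [8.0, 9.0], [9.0, 8.5]]
    labels = [0, 0, 1, 1]
    for before, after in zip(data, morph(data, labels, rng=random.Random(0))):
        print(before, "->", after)
```

Under these assumptions, every mutated instance ends up displaced by between 15% and 35% of the distance to its nearest unlike neighbor, so the shared values differ from the originals while each instance stays closer to its original position than to the opposing class.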
Pages: 1054-1068
Number of pages: 15