Multivariate binary classification of imbalanced datasets: A case study based on high-dimensional multiplex autoimmune assay data

Cited by: 2
Authors
Schlieker, Laura [1, 2]
Telaar, Anna [2, 3]
Lueking, Angelika [2]
Schulz-Knappe, Peter [2]
Theek, Carmen [2, 4]
Ickstadt, Katja [5]
Affiliations
[1] ClinStat GmbH, Max Planck Str 22a, D-50858 Cologne, Germany
[2] Protagen AG, Otto Hahn Str 15, D-44227 Dortmund, Germany
[3] Berufskolleg Wassertum, D-46399 Bocholt, Germany
[4] Chiltern Int GmbH, Kronberger Hang 3, D-44227 Schwalbach, Germany
[5] Tech Univ Dortmund, Fac Stat, Dept Math Stat Applicat Biometr, Vogelpothsweg 87, D-44227 Dortmund, Germany
Keywords
Cost-sensitive learning; Imbalanced data; PPLS-DA; Random Forests; Sampling; SMOTE
DOI
10.1002/bimj.201600207
Chinese Library Classification number
Q [Biological sciences]
Discipline classification codes
07; 0710; 09
Abstract
Classifying a population by a specific trait is a major task in medicine, for example when groups of patients with specific diseases are identified in a diagnostic setting, or when patients are assigned in predictive medicine to disease severity classes that might profit from different treatments. When those subgroups are small, as in rare diseases, imbalance between the classes is the rule rather than the exception, and it makes statistical classification problematic: many observations are assigned to the majority class, so the error rate of the minority class is high while the error rate of the majority class is low. This case study investigates the effect of class imbalance on Random Forests and Powered Partial Least Squares Discriminant Analysis (PPLS-DA) and evaluates the performance of these classifiers when they are combined with methods that compensate for imbalance (sampling methods and cost-sensitive learning approaches). We evaluate all approaches with a scoring system that takes the classification results into consideration. The case study is based on one high-dimensional multiplex autoimmune assay dataset that describes the immune response to antigens and comprises two classes of patients: Rheumatoid Arthritis (RA) and Systemic Lupus Erythematosus (SLE). Datasets with varying degrees of imbalance are created by successively reducing the size of the RA class. Our results indicate a possible benefit of cost-sensitive learning approaches for Random Forests. Although further research is needed to verify our findings on other datasets or in large-scale simulation studies, we believe this work can raise practitioners' awareness of the class imbalance problem, and it stresses the importance of considering methods to compensate for class imbalance.
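The design described in the abstract (shrinking one class step by step, then compensating with class-dependent costs) can be illustrated with a minimal sketch. It uses only the Python standard library and toy labels; the function names, the class sizes, and the weighting rule n / (k * n_c) are assumptions chosen for illustration and are not taken from the study, which may have used different software and a different cost scheme.

```python
import random
from collections import Counter


def reduce_class(labels, target, keep, seed=0):
    """Return sorted indices keeping every non-target sample and only
    `keep` randomly chosen samples of the `target` class."""
    rng = random.Random(seed)
    target_idx = [i for i, y in enumerate(labels) if y == target]
    other_idx = [i for i, y in enumerate(labels) if y != target]
    return sorted(other_idx + rng.sample(target_idx, keep))


def balanced_class_weights(labels):
    """Weight each class by n / (k * n_c): rarer classes get a larger
    misclassification cost (a common heuristic, assumed here)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def per_class_error(y_true, y_pred):
    """Fraction of misclassified samples within each true class."""
    totals = Counter(y_true)
    errors = Counter(t for t, p in zip(y_true, y_pred) if t != p)
    return {c: errors[c] / totals[c] for c in totals}


if __name__ == "__main__":
    # Hypothetical class sizes, not the sizes of the study's dataset.
    labels = ["RA"] * 100 + ["SLE"] * 100
    for keep in (100, 50, 10):
        subset = [labels[i] for i in reduce_class(labels, "RA", keep)]
        # Baseline that always predicts the most frequent class.
        majority = Counter(subset).most_common(1)[0][0]
        predictions = [majority] * len(subset)
        print(keep, balanced_class_weights(subset),
              per_class_error(subset, predictions))
```

The always-majority baseline makes the problem named in the abstract visible: as the RA class shrinks, the overall error falls while the error within the RA class stays at its maximum. Weights of this kind could be passed to a classifier that accepts per-class costs.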
Pages: 948 - 966
Number of pages: 19