Validation of cluster analysis results on validation data: A systematic framework

Cited by: 68
Authors
Ullmann, Theresa [1 ]
Hennig, Christian [2 ]
Boulesteix, Anne-Laure [1 ]
Affiliations
[1] Ludwig Maximilians Univ Munchen, Inst Med Informat Proc Biometry & Epidemiol, Marchioninistr 15, D-81377 Munich, Germany
[2] Univ Bologna, Dipartimento Sci Statistiche Paolo Fortunati, Bologna, Italy
Keywords
cluster stability; cluster validation; clustering; independent data; replication; GENE-EXPRESSION PROFILES; BREAST-CANCER SUBTYPES; STABILITY; BOOTSTRAP; SELECTION; NUMBER; CONFIGURATIONS; PREDICTION; SUBGROUPS; PAIN;
DOI
10.1002/widm.1444
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Cluster analysis refers to a wide range of data analytic techniques for class discovery and is popular in many application fields. To assess the quality of a clustering result, different cluster validation procedures have been proposed in the literature. While there is extensive work on classical validation techniques, such as internal and external validation, less attention has been given to validating and replicating a clustering result using a validation dataset. Such a dataset may be part of the original dataset, which is separated before analysis begins, or it could be an independently collected dataset. We present a systematic, structured review of the existing literature about this topic. For this purpose, we outline a formal framework that covers most existing approaches for validating clustering results on validation data. In particular, we review classical validation techniques such as internal and external validation, stability analysis, and visual validation, and show how they can be interpreted in terms of our framework. We define and formalize different types of validation of clustering results on a validation dataset, and give examples of how clustering studies from the applied literature that used a validation dataset can be seen as instances of our framework. This article is categorized under: Technologies > Structure Discovery and Clustering; Algorithmic Development > Statistics; Technologies > Machine Learning.
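As a rough illustration of what validating a clustering result on a validation dataset can look like, the following minimal sketch clusters a discovery set, transfers that clustering to a validation set by nearest-centroid assignment, and compares it with a clustering computed on the validation set itself. This is not the framework from the article: the synthetic data, the plain k-means routine, the nearest-centroid transfer rule, the Rand index as agreement measure, and all function names are assumptions chosen for illustration only.

```python
import random
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, iters=50):
    """Lloyd's algorithm with a deterministic farthest-first start (illustrative choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Add the point that is farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = [assign(p, centroids) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, [assign(p, centroids) for p in points]


def rand_index(a, b):
    """Share of point pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Synthetic data (assumption): two well-separated Gaussian groups in 2D.
rng = random.Random(0)


def sample(n_per_group):
    return [(rng.gauss(m, 1.0), rng.gauss(m, 1.0))
            for m in (0.0, 10.0) for _ in range(n_per_group)]


discovery = sample(40)   # data used to find the clustering
validation = sample(40)  # held-out data used only for validation

# Step 1: cluster the discovery data.
centroids_disc, _ = kmeans(discovery, k=2)
# Step 2: transfer the discovery clustering to the validation data.
transferred = [assign(p, centroids_disc) for p in validation]
# Step 3: cluster the validation data on its own.
_, labels_val = kmeans(validation, k=2)
# Step 4: compare the two partitions of the validation data.
agreement = rand_index(transferred, labels_val)
print(f"Rand index between transferred and re-clustered labels: {agreement:.3f}")
```

A high agreement here only says that the two partitions of the validation data coincide. In practice one would more likely use a chance-corrected index and a clustering method suited to the data, and other validation strategies could be compared in the same way.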
Pages: 19