Hierarchical clustering: Visualization, feature importance and model selection

Cited by: 11
Authors
Cabezas, Luben M. C. [1]
Izbicki, Rafael [1]
Stern, Rafael B. [2]
Affiliations
[1] Univ Fed Sao Carlos, Dept Stat, BR-13565905 Sao Carlos, SP, Brazil
[2] Univ Sao Paulo, Inst Math & Stat, BR-05508090 Sao Paulo, SP, Brazil
Funding
São Paulo Research Foundation (FAPESP), Brazil;
Keywords
Hierarchical clustering; Unsupervised learning; Phylogenetic models; R PACKAGE; TRAIT EVOLUTION; REGRESSION;
DOI
10.1016/j.asoc.2023.110303
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose methods for the analysis of hierarchical clustering that fully use the multi-resolution structure provided by a dendrogram. Specifically, we propose a loss for choosing between clustering methods, a feature importance score, and a graphical tool for visualizing the segmentation of features in a dendrogram. Current approaches to these tasks lose information because they require the user to generate a single partition of the instances by cutting the dendrogram at a specified level. Our proposed methods instead use the full structure of the dendrogram. The key insight behind them is to view a dendrogram as a phylogeny. This analogy permits the assignment of a feature value to each internal node of the tree through an evolutionary model. Real and simulated datasets provide evidence that our proposed framework has desirable outcomes and gives more insights than state-of-the-art approaches. We provide an R package that implements our methods. © 2023 Elsevier B.V. All rights reserved.
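The idea of assigning a feature value to each internal node of a dendrogram can be illustrated with a minimal Python sketch. This is not the paper's evolutionary model and not the authors' R package: it is a simplified stand-in that assumes each internal node takes the average of its two children, weighted by the inverse of the branch length separating them, which is one common heuristic for ancestral-state reconstruction. The function name `internal_node_values` and the `eps` parameter are hypothetical and introduced only for this illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def internal_node_values(feature, Z, eps=1e-12):
    """Propagate one feature from the leaves to every internal node.

    Assumption (not from the paper): each internal node receives the
    inverse-branch-length weighted mean of its two children's values.
    Node ids follow SciPy's convention: leaves are 0..n-1 and the merge
    in row i of the linkage matrix Z creates node n + i.
    """
    feature = np.asarray(feature, dtype=float)
    n = Z.shape[0] + 1
    # Height of every node: leaves sit at 0, merges at their linkage distance.
    heights = np.concatenate([np.zeros(n), Z[:, 2]])
    values = np.concatenate([feature, np.zeros(n - 1)])
    for i in range(n - 1):
        a, b = int(Z[i, 0]), int(Z[i, 1])
        h = Z[i, 2]
        # Shorter branches get larger weights; eps guards against zero length.
        w_a = 1.0 / (h - heights[a] + eps)
        w_b = 1.0 / (h - heights[b] + eps)
        values[n + i] = (w_a * values[a] + w_b * values[b]) / (w_a + w_b)
    return values


if __name__ == "__main__":
    # Toy data: four one-dimensional instances forming two tight pairs.
    X = np.array([[0.0], [1.0], [10.0], [11.0]])
    Z = linkage(X, method="average")
    # Use the single coordinate itself as the feature to be propagated.
    print(internal_node_values(X[:, 0], Z))
```

Because every internal value is a convex combination of leaf values, a feature that separates two subtrees sharply shows up as a large jump between sibling nodes, without cutting the dendrogram at any particular level.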
Pages: 12