On the behaviour of permutation-based variable importance measures in random forest clustering

Cited by: 7
Authors
Nembrini, Stefano [1]
Affiliations
[1] Univ Florida, Coll Med, Emerging Pathogens Inst, Dept Pathol, Gainesville, FL 32610 USA
Keywords
random forest clustering; variable importance measures; variable selection
DOI
10.1002/cem.3135
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Unsupervised random forest (RF) is a popular clustering method that can be implemented by artificially creating a two-class problem. Variable importance measures (VIMs) can be used to determine which variables are relevant for defining the RF dissimilarity, but they have received less attention here than in the supervised case. Here, I show that the sampling schemes used to generate the artificial data, including the original one, can influence the behaviour of the permutation importance in a way that can affect conclusions on variable relevance, and I propose a solution: generating the artificial data using a Bayesian bootstrap keeps the desirable properties of the permutation VIM.
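The abstract does not spell out the sampling scheme, so the following is only a minimal sketch of one plausible reading, not the paper's implementation. It assumes that each variable of the artificial class is drawn independently from the observed values of that variable, with selection probabilities given by Bayesian bootstrap weights (Dirichlet(1, ..., 1), obtained by normalising exponential draws). The function names `bayesian_bootstrap_weights` and `make_two_class_problem` are hypothetical, and fitting the forest and computing the permutation importance are left out.

```python
import random


def bayesian_bootstrap_weights(n, rng):
    """Draw Dirichlet(1, ..., 1) weights by normalising n Exp(1) variates."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def make_two_class_problem(data, seed=0):
    """Stack observed rows (label 1) on top of artificial rows (label 0).

    Assumption: every variable gets its own set of Bayesian bootstrap
    weights and is resampled independently of the other variables, which
    breaks the dependence structure between variables in the artificial class.
    """
    rng = random.Random(seed)
    n = len(data)
    p = len(data[0])

    synthetic_columns = []
    for j in range(p):
        column = [row[j] for row in data]
        weights = bayesian_bootstrap_weights(n, rng)
        # Weighted resampling with replacement from the observed values.
        synthetic_columns.append(rng.choices(column, weights=weights, k=n))

    # Transpose the resampled columns back into rows.
    synthetic = [list(row) for row in zip(*synthetic_columns)]
    features = [list(row) for row in data] + synthetic
    labels = [1] * n + [0] * n
    return features, labels
```

Under these assumptions, the returned `features` and `labels` could be passed to any supervised random forest implementation; the permutation VIM of that classifier is then read as a measure of how relevant each variable is for the RF dissimilarity.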
Pages: 5
Related papers
15 in total
  • [1] Unsupervised random forest: a tutorial with case studies
    Afanador, Nelson Lee
    Smolinska, Agnieszka
    Tran, Thanh N.
    Blanchet, Lionel
    [J]. JOURNAL OF CHEMOMETRICS, 2016, 30 (05) : 232 - 241
  • [2] [Anonymous], ADV DATA ANAL CLASSI
  • [3] Breiman L., 2004, Random forest-manual
  • [4] Altered cytoplasmic-to-nuclear ratio of survivin is a prognostic indicator in breast cancer
    Brennan, Donal J.
    Rexhepaj, Elton
    O'Brien, Sallyann L.
    McSherry, Elaine
    O'Connor, Darran P.
    Fagan, Ailis
    Culhane, Aedin C.
    Higgins, Desmond G.
    Jirstrom, Karin
    Millikan, Robert C.
    Landberg, Goran
    Duffy, Michael J.
    Hewitt, Stephen M.
    Gallagher, William M.
    [J]. CLINICAL CANCER RESEARCH, 2008, 14 (09) : 2681 - 2689
  • [5] Dalleau Kevin, 2018, Advances in Knowledge Discovery and Data Mining. 22nd Pacific-Asia Conference, PAKDD 2018. Proceedings: LNAI 10939, P478, DOI 10.1007/978-3-319-93040-4_38
  • [6] COMPARING PARTITIONS
    HUBERT, L
    ARABIE, P
    [J]. JOURNAL OF CLASSIFICATION, 1985, 2 (2-3) : 193 - 218
  • [7] rCOSA: A Software Package for Clustering Objects on Subsets of Attributes
    Kampert, Maarten M.
    Meulman, Jacqueline J.
    Friedman, Jerome H.
    [J]. JOURNAL OF CLASSIFICATION, 2017, 34 (03) : 514 - 547
  • [8] The revival of the Gini importance?
    Nembrini, Stefano
    Koenig, Inke R.
    Wright, Marvin N.
    [J]. BIOINFORMATICS, 2018, 34 (21) : 3711 - 3718
  • [9] Plonski P, 2014, LECT NOTES ARTIF INT, V8468, P63, DOI 10.1007/978-3-319-07176-3_6
  • [10] THE BAYESIAN BOOTSTRAP
    RUBIN, DB
    [J]. ANNALS OF STATISTICS, 1981, 9 (01) : 130 - 134