Constructing bi-plots for random forest: Tutorial

Cited by: 50
Authors
Blanchet, Lionel [1,7]
Vitale, Raffaele [2,3]
van Vorstenbosch, Robert [1]
Stavropoulos, George [1]
Pender, John [4,5]
Jonkers, Daisy [6]
van Schooten, Frederik-Jan [1]
Smolinska, Agnieszka [1]
Affiliations
[1] Maastricht Univ, Sch Nutr Toxicol & Translat Res Metab NUTRIM, Dept Pharmacol & Toxicol, Med Ctr, Maastricht, Netherlands
[2] Univ Lille, Lab Spectrochim Infrarouge & Raman, LASIR, CNRS,UMR 8516, Batiment C5, F-59000 Lille, France
[3] Katholieke Univ Leuven, Dept Chem, Mol Imaging & Photon Unit, Celestijnenlaan 200F, B-3001 Leuven, Belgium
[4] Maastricht Univ, Sch Nutr & Translat Res Metab NUTRIM, Dept Med Microbiol, Med Ctr, Maastricht, Netherlands
[5] Maastricht Univ, Sch Publ Hlth & Primary Care CAPHRI, Dept Med Microbiol, Med Ctr, Maastricht, Netherlands
[6] Maastricht Univ, NUTRIM Sch Nutr & Translat Res Metab, Dept Internal Med, Div Gastroenterol Hepatol,Med Ctr, NL-6202 AZ Maastricht, Netherlands
[7] Philips, Veenpluis 4-6, Best, Netherlands
Keywords
Random forest interpretation; Pseudo samples; Bi-plots; Proximity matrix; Principal coordinates analysis; Pseudo-sample trajectories; Partial least-squares; Fault diagnosis; Kernel
DOI
10.1016/j.aca.2020.06.043
Chinese Library Classification
O65 [Analytical Chemistry]
Subject classification codes
070302; 081704
Abstract
Current technological developments have allowed for a significant increase in the volume and availability of data. Consequently, this has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms in a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field, and they can be defined as approaches in which multiple, complex, independent/uncorrelated, predictive models are subsequently combined by either averaging or voting to yield a higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little need for model optimization. RF provides both a predictive model and an estimate of the variable importance. However, the estimate of the variable importance is based on thousands of trees, and therefore it does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows for the construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated using two simulated datasets and three different types of real data, drawn from political science, food chemistry, and the human microbiome. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, of the variable importance, and of the relations among the variables. © 2020 Elsevier B.V. All rights reserved.
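The abstract and keywords name the ingredients (an RF proximity matrix, principal coordinates analysis, and pseudo-sample trajectories) but give no algorithmic detail, so the following is only a minimal sketch of how such a bi-plot could be assembled, not the authors' procedure. It rests on several assumptions: proximity is taken as the fraction of trees in which two samples share a terminal node; 1 minus proximity is treated as a squared dissimilarity; pseudo-samples vary one variable over its observed range while the others are held at their training mean; and pseudo-samples are placed in the score space with Gower's out-of-sample formula. scikit-learn and the helper names `proximity`, `pcoa`, `project`, and `pseudo_samples` are illustrative choices, and the data are synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def proximity(leaves_a, leaves_b):
    """Fraction of trees in which each pair of samples shares a terminal node."""
    return (leaves_a[:, None, :] == leaves_b[None, :, :]).mean(axis=2)


def pcoa(d2, n_components=2):
    """Classical principal coordinates analysis on squared dissimilarities."""
    n = d2.shape[0]
    centering = np.eye(n) - np.full((n, n), 1.0 / n)
    b = -0.5 * centering @ d2 @ centering          # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:n_components]    # largest eigenvalues first
    vals, vecs = vals[top], vecs[:, top]
    return vecs * np.sqrt(vals), vals, vecs


def project(d2_new, d2_train, vals, vecs):
    """Gower's out-of-sample projection of new rows into an existing PCoA space."""
    b_new = -0.5 * (
        d2_new
        - d2_new.mean(axis=1, keepdims=True)
        - d2_train.mean(axis=0, keepdims=True)
        + d2_train.mean()
    )
    return b_new @ vecs / np.sqrt(vals)


def pseudo_samples(x, var, n_points=10):
    """Assumption: vary one variable over its range, hold the rest at their mean."""
    grid = np.linspace(x[:, var].min(), x[:, var].max(), n_points)
    ps = np.tile(x.mean(axis=0), (n_points, 1))
    ps[:, var] = grid
    return ps


# Synthetic stand-in data; not one of the datasets used in the paper.
X, y = make_classification(n_samples=120, n_features=6, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

leaves = rf.apply(X)                               # (n_samples, n_trees) leaf ids
d2_train = 1.0 - proximity(leaves, leaves)
scores, vals, vecs = pcoa(d2_train)                # sample scores for the bi-plot

# One trajectory per variable: pseudo-samples projected into the score space.
trajectories = {}
for var in range(X.shape[1]):
    ps_leaves = rf.apply(pseudo_samples(X, var))
    d2_ps = 1.0 - proximity(ps_leaves, leaves)
    trajectories[var] = project(d2_ps, d2_train, vals, vecs)
```

Plotting `scores` coloured by `y` and overlaying each array in `trajectories` as a line would give a bi-plot-like display. Under these assumptions, a variable whose trajectory spans a long path between sample groups is one the forest relies on to separate them, while a trajectory that barely moves suggests little influence.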
Pages: 146-155
Page count: 10