Statistical analysis of high-dimensional biomedical data: a gentle introduction to analytical goals, common approaches and challenges

Cited by: 32
Authors
Rahnenfuehrer, Joerg [1]
De Bin, Riccardo [2]
Benner, Axel [3]
Ambrogi, Federico [4,5]
Lusa, Lara [6,7]
Boulesteix, Anne-Laure [8]
Migliavacca, Eugenia [9]
Binder, Harald [10]
Michiels, Stefan [11,12]
Sauerbrei, Willi [10]
McShane, Lisa [13]
Affiliations
[1] TU Dortmund Univ, Dept Stat, Dortmund, Germany
[2] Univ Oslo, Dept Math, Oslo, Norway
[3] German Canc Res Ctr, Div Biostat, Heidelberg, Germany
[4] Univ Milan, Dept Clin Sci & Community Hlth, Milan, Italy
[5] IRCCS Policlin San Donato, Sci Directorate, San Donato Milanese, Italy
[6] Univ Primorska, Fac Math Nat Sci & Informat Technol, Dept Math, Koper, Slovenia
[7] Univ Ljubljana, Inst Biostat & Med Informat, Ljubljana, Slovenia
[8] Ludwig Maximilian Univ Munich, Inst Med Informat Proc Biometry & Epidemiol, Munich, Germany
[9] Nestle Res, EPFL Innovat Pk, Lausanne, Switzerland
[10] Univ Freiburg, Inst Med Biometry & Stat, Fac Med, Freiburg, Germany
[11] Univ Paris Saclay, Serv Biostat & Epidemiol, Gustave Roussy, Villejuif, France
[12] Univ Paris Saclay, Oncostat U1018, Inserm, Labeled Ligue Canc, Villejuif, France
[13] NCI, Biometr Res Program, Div Canc Treatment & Diag, Bethesda, MD 20892 USA
Funding
US National Institutes of Health (NIH)
Keywords
High-dimensional data; Omics data; STRATOS initiative; Analytical goals; Initial data analysis; Exploratory data analysis; Clustering; Multiple testing; Prediction; FALSE DISCOVERY RATE; GENE-EXPRESSION DATA; MULTIVARIABLE PREDICTION MODEL; SAMPLE-SIZE; SURVIVAL PREDICTION; CROSS-VALIDATION; INFLUENTIAL OBSERVATIONS; MICROARRAY EXPERIMENTS; NORMALIZATION METHODS; INDIVIDUAL PROGNOSIS;
DOI
10.1186/s12916-023-02858-y
Chinese Library Classification (CLC) number
R5 [Internal Medicine]
Subject classification codes
1002; 100201
Abstract
Background: In high-dimensional data (HDD) settings, the number of variables associated with each observation is very large. Prominent examples of HDD in biomedical research include omics data with a large number of variables such as many measurements across the genome, proteome, or metabolome, as well as electronic health records data that have large numbers of variables recorded for each patient. The statistical analysis of such data requires knowledge and experience, sometimes of complex methods adapted to the respective research questions.
Methods: Advances in statistical methodology and machine learning methods offer new opportunities for innovative analyses of HDD, but at the same time require a deeper understanding of some fundamental statistical concepts. Topic group TG9 "High-dimensional data" of the STRATOS (STRengthening Analytical Thinking for Observational Studies) initiative provides guidance for the analysis of observational studies, addressing particular statistical challenges and opportunities for the analysis of studies involving HDD. In this overview, we discuss key aspects of HDD analysis to provide a gentle introduction for non-statisticians and for classically trained statisticians with little experience specific to HDD.
Results: The paper is organized with respect to subtopics that are most relevant for the analysis of HDD, in particular initial data analysis, exploratory data analysis, multiple testing, and prediction. For each subtopic, main analytical goals in HDD settings are outlined. For each of these goals, basic explanations for some commonly used analysis methods are provided. Situations are identified where traditional statistical methods cannot, or should not, be used in the HDD setting, or where adequate analytic tools are still lacking. Many key references are provided.
Conclusions: This review aims to provide a solid statistical foundation for researchers, including statisticians and non-statisticians, who are new to research with HDD or simply want to better evaluate and understand the results of HDD analyses.
Pages: 54