Assessing the performance of statistical validation tools for megavariate metabolomics data

Cited by: 126
Authors
Rubingh, Carina M.
Bijlsma, Sabina
Derks, Eduard P. P. A.
Bobeldijk, Ivana
Verheij, Elwin R.
Kochhar, Sunil
Smilde, Age K.
Affiliations
[1] TNO, Qual Life, Business Unit Analyt Sci, NL-3700 AJ Zeist, Netherlands
[2] Nestle Res Ctr, BioAnalyt Sci Dept, CH-1000 Lausanne 26, Switzerland
Keywords
metabolomics; megavariate data; PLS-DA; cross-validation; permutation test; predictability; jack-knife
DOI
10.1007/s11306-006-0022-6
Chinese Library Classification number
R5 [Internal Medicine]
Discipline classification code
1002; 100201
Abstract
Statistical model validation tools such as cross-validation, jack-knifing model parameters and permutation tests are meant to obtain an objective assessment of the performance and stability of a statistical model. However, little is known about the performance of these tools for megavariate data sets, having, for instance, a number of variables larger than 10 times the number of subjects. The performance is assessed for megavariate metabolomics data, but the conclusions also carry over to proteomics, transcriptomics and many other research areas. Partial least squares discriminant analysis models were built for several LC-MS lipidomic training data sets of various numbers of lean and obese subjects. The training data sets were compared on their modelling performance and their predictability using a 10-fold cross-validation, a permutation test, and test data sets. A wide range of cross-validation error rates was found (from 7.5% to 16.3% for the largest training set and from 0% to 60% for the smallest training set) and the error rate increased when the number of subjects decreased. The test error rates varied from 5% to 50%. The smaller the number of subjects compared to the number of variables, the less the outcome of validation tools such as cross-validation, jack-knifing model parameters and permutation tests can be trusted. The result depends crucially on the specific sample of subjects that is used for modelling. The validation tools cannot be used as a warning mechanism for problems due to sample size or to representativity of the sampling.
Pages: 53-61
Number of pages: 9
Related papers
34 in total
  • [1] Selection bias in gene extraction on the basis of microarray gene-expression data
    Ambroise, C
    McLachlan, GJ
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2002, 99 (10) : 6562 - 6566
  • [2] [Anonymous], 1989, MULTIVARIATE CALIBRA
  • [3] Partial least squares for discrimination
    Barker, M
    Rayens, W
    [J]. JOURNAL OF CHEMOMETRICS, 2003, 17 (03) : 166 - 173
  • [4] Bijlsma S, 2000, J CHEMOMETR, V14, P541
  • [5] Large-scale human metabolomics studies: A strategy for data (pre-) processing and validation
    Bijlsma, S
    Bobeldijk, I
    Verheij, ER
    Ramaker, R
    Kochhar, S
    Macdonald, IA
    van Ommen, B
    Smilde, AK
    [J]. ANALYTICAL CHEMISTRY, 2006, 78 (02) : 567 - 574
  • [6] Fat oxidation before and after a high fat load in the obese insulin-resistant state
    Blaak, EE
    Hul, G
    Verdich, C
    Stich, V
    Martinez, A
    Petersen, M
    Feskens, EFM
    Patel, K
    Oppert, JM
    Barbe, P
    Toubro, S
    Anderson, I
    Saris, WHM
    [J]. JOURNAL OF CLINICAL ENDOCRINOLOGY & METABOLISM, 2006, 91 (04) : 1462 - 1469
  • [7] Is cross-validation valid for small-sample microarray classification?
    Braga-Neto, UM
    Dougherty, ER
    [J]. BIOINFORMATICS, 2004, 20 (03) : 374 - 380
  • [8] Derome A.E., 1987, MODERN NMR TECHNIQUE
  • [9] Efron B., 1993, INTRO BOOTSTRAP
  • [10] Efron B., 1982, The Jack-knife, the Bootstrap and other Re-sampling plans