Improving data splitting for classification applications in spectrochemical analyses employing a random-mutation Kennard-Stone algorithm approach

Cited by: 101
Authors
Morais, Camilo L. M. [1]
Santos, Marfran C. D. [2]
Lima, Kassio M. G. [2]
Martin, Francis L. [1]
Affiliations
[1] Univ Cent Lancashire, Sch Pharm & Biomed Sci, Preston PR1 2HE, Lancs, England
[2] Univ Fed Rio Grande do Norte, Inst Chem Biol Chem & Chemometr, BR-59072970 Natal, RN, Brazil
Funding
Engineering and Physical Sciences Research Council (UK); Biotechnology and Biological Sciences Research Council (UK)
Keywords
ATR-FTIR SPECTROSCOPY; DISCRIMINANT-ANALYSIS;
DOI
10.1093/bioinformatics/btz421
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Motivation: Data splitting is a fundamental step in building classification models with spectral data, especially in biomedical applications. It is performed after pre-processing and before model construction, and consists of dividing the samples into at least a training set, used for model construction, and a test set, used for model validation. Among the most widely used data-splitting methods are the random selection (RS) and Kennard-Stone (KS) algorithms: the former splits the samples at random, whereas the latter is based on the Euclidean distances between samples. We propose the Morais-Lima-Martin (MLM) algorithm as an alternative method to improve data splitting in classification models. MLM modifies the KS algorithm by adding a random-mutation factor.

Results: The performance of RS, KS and MLM is compared on simulated data and in six real-world biospectroscopic applications using principal component analysis-linear discriminant analysis (PCA-LDA). MLM gave better predictive performance than the RS and KS algorithms, in particular regarding sensitivity and specificity, and its classification was better balanced. RS showed the poorest predictive response, followed by KS, which showed good prediction accuracy but relatively unbalanced sensitivities and specificities. These findings demonstrate the potential of the new MLM algorithm as a sample selection method for classification applications, compared with other methods commonly applied to this type of data.
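The splitting idea described in the abstract can be illustrated with a minimal Python sketch. The `kennard_stone` function follows the usual max-min Euclidean-distance selection associated with the KS algorithm. The abstract does not say how the random-mutation factor is applied, so the `mutated_split` function, its `mutation` fraction and its swap rule are assumptions made for illustration only and should not be read as the authors' MLM implementation.

```python
import numpy as np


def kennard_stone(X, n_train):
    """Return indices of n_train samples (2 <= n_train <= len(X)) chosen by KS."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all samples.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining sample to its nearest selected sample.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        # Add the remaining sample that is farthest from the selected set.
        selected.append(remaining.pop(int(np.argmax(min_dist))))
    return selected


def mutated_split(X, n_train, mutation=0.1, seed=0):
    """KS split followed by a random swap between training and test sets.

    ASSUMPTION: the swap rule and the `mutation` fraction are hypothetical;
    the abstract only states that a random-mutation factor is added to KS.
    """
    rng = np.random.default_rng(seed)
    train = kennard_stone(X, n_train)
    test = [k for k in range(len(X)) if k not in train]
    n_swap = min(int(round(mutation * len(train))), len(test))
    if n_swap == 0:
        return train, test
    # Choose positions in each set at random and exchange their samples.
    out_pos = rng.choice(len(train), size=n_swap, replace=False)
    in_pos = rng.choice(len(test), size=n_swap, replace=False)
    for a, b in zip(out_pos, in_pos):
        train[a], test[b] = test[b], train[a]
    return train, test
```

For a classification problem, one plausible usage is to call `mutated_split` separately on the rows of each class so that every class is represented in both sets; that choice is likewise an assumption here.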
Pages: 5257-5263
Number of pages: 7