Robustness of chemometrics-based feature selection methods in early cancer detection and biomarker discovery

Cited by: 9

Authors:

Lee, Hae Woo [1]

Lawton, Carl [1]

Na, Young Jeong [2]

Yoon, Seongkyu [1]

Affiliations:

[1] Univ Massachusetts, Dept Chem Engn, Lowell, MA 01854 USA

[2] Harvard Univ, Massachusetts Gen Hosp, Sch Med, Boston, MA USA

Source:

STATISTICAL APPLICATIONS IN GENETICS AND MOLECULAR BIOLOGY | 2013, Vol. 12, Issue 2

Keywords:

biomarker discovery; chemometrics; early detection; feature selection; omics; ovarian cancer; reproducibility; stability; CARLO CROSS-VALIDATION; SELDI-TOF MS; VARIABLE SELECTION; OVARIAN-CANCER; BREAST-CANCER; WAVELENGTH SELECTION; MULTIVARIATE CALIBRATION; MASS-SPECTROMETRY; SERUM BIOMARKERS; STABILITY;

DOI:

10.1515/sagmb-2012-0067

Chinese Library Classification:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Subject classification codes:

071010 ; 081704 ;

Abstract:

In omics studies aimed at the early detection and diagnosis of cancer, bioinformatics tools play a significant role in analyzing high-dimensional, complex datasets and in identifying a small set of biomarkers. However, in many cases, the robustness and consistency of the discovered biomarker sets are uncertain, since feature selection methods often lead to irreproducible results. To address this, both the stability and the classification power of several chemometrics-based feature selection algorithms were evaluated using the Monte Carlo sampling technique, with the aim of finding the most suitable feature selection methods for early cancer detection and biomarker discovery. To this end, two datasets were analyzed, comprising MALDI-TOF-MS and LC/TOF-MS spectra measured on serum samples in order to diagnose ovarian cancer. Using these datasets, the stability and the classification power of multiple feature subsets found by different feature selection methods were quantified by varying either the number of selected features or the number of samples in the training set, with special emphasis placed on stability. The results show that high consistency does not necessarily guarantee high predictive power. In addition, differences in stability, as well as agreement in feature lists between feature selection methods, depend on several factors, such as the number of available samples, feature sizes, and the quality of the information in the dataset. Among the tested methods, only the variable importance in projection (VIP)-based method shows complementary properties, providing both highly consistent and accurate subsets of features. In addition, successive projection analysis (SPA) was excellent with regard to maintaining high stability over a wide range of experimental conditions.
The stability of several feature selection methods is highly variable, stressing the importance of making the proper choice among feature selection methods. Therefore, rather than evaluating the selected features using only classification accuracy, stability measurements should be examined as well to improve the reliability of biomarker discovery.
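
The idea of measuring selection stability under Monte Carlo resampling can be illustrated with a minimal sketch. The abstract does not state which stability index or sampling scheme the authors used, so everything below is an assumption made for illustration: the mean pairwise Jaccard index as the stability measure, stratified subsampling without replacement, and a simple univariate mean-difference ranking (`select_top_k`, a hypothetical stand-in) in place of the VIP- and SPA-based selectors named above.

```python
import itertools
import numpy as np


def select_top_k(X, y, k):
    """Hypothetical stand-in selector (not VIP or SPA): rank features by the
    absolute difference in class means, scaled by the overall standard deviation."""
    diff = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    score = diff / (X.std(axis=0) + 1e-12)
    return frozenset(np.argsort(score)[-k:].tolist())


def jaccard(s, t):
    """Jaccard index of two non-empty feature-index sets."""
    return len(s & t) / len(s | t)


def monte_carlo_stability(X, y, k, n_runs=50, train_fraction=0.7, seed=0):
    """Mean pairwise Jaccard index of the feature subsets selected on
    repeated stratified random training subsamples (assumed stability measure)."""
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_runs):
        # Draw the same fraction of samples from each class, without replacement.
        parts = []
        for label in np.unique(y):
            members = np.flatnonzero(y == label)
            size = max(1, int(round(train_fraction * members.size)))
            parts.append(rng.choice(members, size=size, replace=False))
        idx = np.concatenate(parts)
        subsets.append(select_top_k(X[idx], y[idx], k))
    pairs = itertools.combinations(subsets, 2)
    return float(np.mean([jaccard(s, t) for s, t in pairs]))


# Synthetic illustration only: 60 samples, 200 features, 5 of them informative.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[y == 1, :5] += 3.0
print("stability at k=5:", monte_carlo_stability(X, y, k=5))
```

Varying `k` or `train_fraction` in this sketch mirrors the two experimental factors the abstract mentions (number of selected features and training set size). A full evaluation would also record classification accuracy on the held-out samples of each run, since the abstract reports that high stability does not by itself imply high predictive power.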


Pages: 207-223

Page count: 17
