Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes

Times cited: 11
Authors
Kallberg, David [1, 2]
Vidman, Linda [2, 3]
Ryden, Patrik [2 ]
Affiliations
[1] Umea Univ, Dept Stat, USBE, Umea, Sweden
[2] Umea Univ, Dept Math & Math Stat, Umea, Sweden
[3] Umea Univ, Dept Radiat Sci, Oncol, Umea, Sweden
Funding
Swedish Research Council;
Keywords
feature selection; gene selection; RNA-seq; cancer subtypes; high-dimensional;
DOI
10.3389/fgene.2021.632620
Chinese Library Classification
Q3 [Genetics];
Discipline classification codes
071007; 090102;
Abstract
Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and to two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (-0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
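The evaluation described in the abstract can be illustrated with a minimal sketch, under stated assumptions. It shows the commonly used baseline (ranking genes by SD and keeping the top k) and a plain-Python adjusted Rand index for comparing a clustering with a gold standard. The function names `top_k_by_sd` and `adjusted_rand_index`, the genes-as-rows layout, and the toy data are assumptions made for illustration; this is not the authors' code, and it does not implement the dip-test selection that the abstract recommends.

```python
from collections import Counter
from math import comb
from statistics import stdev


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two samples: no pairs to compare, treat as agreement.
        return 1.0
    # Count sample pairs that share a contingency cell, a true class,
    # or a predicted cluster.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (both partitions trivial): treat as agreement.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def top_k_by_sd(expression, k):
    """Return indices of the k genes (rows) with the highest sample SD."""
    sds = [stdev(row) for row in expression]
    ranked = sorted(range(len(expression)), key=lambda g: sds[g], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 3 genes (rows) measured in 4 samples (columns).
expression = [
    [1.0, 1.0, 1.0, 1.0],   # constant gene, SD = 0
    [0.0, 0.1, 9.8, 10.0],  # separates the first two samples from the last two
    [2.0, 3.0, 2.0, 3.0],   # varies, but not along the subtype split
]
selected = top_k_by_sd(expression, k=1)
gold = ["A", "A", "B", "B"]
clusters = [1, 1, 0, 0]  # cluster labels may be permuted; ARI is invariant
score = adjusted_rand_index(gold, clusters)
```

An ARI of 1 indicates that the clustering reproduces the gold standard exactly, values near 0 are what random labeling would give, and negative values indicate agreement worse than chance. In practice, the clustering step between selection and evaluation would be whatever algorithm the analysis uses; it is omitted here to keep the sketch short.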
Pages: 17