Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes

Cited by: 11

Authors:

Kallberg, David [1,2]

Vidman, Linda [2,3]

Ryden, Patrik [2]

Affiliations:

[1] Umea Univ, Dept Stat, USBE, Umea, Sweden

[2] Umea Univ, Dept Math & Math Stat, Umea, Sweden

[3] Umea Univ, Dept Radiat Sci, Oncol, Umea, Sweden

Source:

FRONTIERS IN GENETICS | 2021 / Vol. 12

Funding:

Swedish Research Council;

Keywords:

feature selection; gene selection; RNA-seq; cancer subtypes; high-dimensional;

DOI:

10.3389/fgene.2021.632620

Chinese Library Classification:

Q3 [Genetics];

Discipline classification codes:

071007 ; 090102 ;

Abstract:

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods is still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (-0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
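The two quantities the abstract relies on, the SD-based gene ranking and the adjusted Rand index, can be illustrated with a minimal sketch. This is not the authors' code: the function names `top_sd_features` and `adjusted_rand_index`, the samples-by-genes list layout, and the toy values are assumptions made for illustration, and only the Python standard library is used.

```python
from collections import Counter
from math import comb
from statistics import stdev


def top_sd_features(matrix, k):
    """Return the column indices of the k features with the highest SD.

    `matrix` is assumed to be a list of samples, each a list of
    expression values (one per gene). This is the baseline selection
    rule mentioned in the abstract, not one of the better-performing
    methods.
    """
    columns = list(zip(*matrix))
    sds = [stdev(col) for col in columns]
    ranked = sorted(range(len(sds)), key=lambda j: sds[j], reverse=True)
    return sorted(ranked[:k])


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a gold standard and a clustering."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are required")

    # Pair counts within contingency-table cells, rows, and columns.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put every sample in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy data (hypothetical): 3 samples x 3 genes.
toy = [[1, 10, 5], [1, 20, 6], [1, 30, 7]]
selected = top_sd_features(toy, k=2)

# ARI ignores label names, so a relabeled but identical partition scores 1.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ari_bad = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An ARI near 0 corresponds to agreement at chance level and 1 to perfect agreement with the gold standard, which is how the reported increase from (-0.01, 0.39) to (0.66, 0.72) should be read.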


Pages: 17
