High-dimensional feature selection for genomic datasets

Cited by: 14
Authors
Afshar, Majid [1 ]
Usefi, Hamid [2 ]
Affiliations
[1] Mem Univ Newfoundland, Dept Comp Sci, St John, NF, Canada
[2] Mem Univ Newfoundland, Dept Math & Stat, St John, NF, Canada
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
Feature selection; Dimensionality reduction; Perturbation theory; Singular value decomposition; Disease diagnoses; Classification; NONLINEAR FEATURE-SELECTION; PARKINSONS-DISEASE; DIAGNOSIS;
DOI
10.1016/j.knosys.2020.106370
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A central problem in machine learning and pattern recognition is identifying the most important features. In this paper, we provide a new feature selection method (DRPT) that first removes the irrelevant features and then detects correlations between the remaining features. Let D = [A | b] be a dataset, where b is the class label and A is a matrix whose columns are the features. We solve Ax = b using the least-squares method and the pseudo-inverse of A. Each component of x can be viewed as a weight assigned to the corresponding column (feature). We define a threshold based on the local maxima of x and remove those features whose weights are smaller than the threshold. To detect correlations in the reduced matrix, which we still call A, we consider a perturbation Ã of A. We prove that correlations are encoded in Δx = |x − x̃|, where x̃ is the least-squares solution of Ãx̃ = b. We cluster features first based on Δx and then using the entropy of features. Finally, a feature is selected from each sub-cluster based on its weight and entropy. The effectiveness of DRPT has been verified through a series of comparisons with seven state-of-the-art feature selection methods over ten genetic datasets ranging from 9,117 to 267,604 features. The results show that, overall, the performance of DRPT compares favorably with each feature selection algorithm in several respects. (C) 2020 Elsevier B.V. All rights reserved.
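As a rough illustration of the weighting and perturbation steps summarized in the abstract, a minimal NumPy sketch follows. The threshold-by-local-maxima and clustering stages are omitted, and the perturbation scale `eps` and the Gaussian perturbation itself are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def drpt_weights(A, b, eps=1e-3, seed=0):
    """Compute feature weights x and perturbation differences Delta x.

    A: (n_samples, n_features) feature matrix; b: (n_samples,) class labels.
    Returns (x, delta_x), each of shape (n_features,).
    """
    # Least-squares solution of Ax = b via the Moore-Penrose pseudo-inverse;
    # each component of x is a weight for the corresponding feature (column).
    x = np.linalg.pinv(A) @ b

    # Perturb A slightly and re-solve; the componentwise shift |x - x_tilde|
    # is the Delta x in which feature correlations are encoded.
    rng = np.random.default_rng(seed)
    A_tilde = A + eps * rng.standard_normal(A.shape)
    x_tilde = np.linalg.pinv(A_tilde) @ b
    delta_x = np.abs(x - x_tilde)
    return x, delta_x
```

Features with small weights in x would then be dropped, and the remainder clustered on delta_x (and entropy) before selecting one feature per sub-cluster.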
Pages: 11