Feature Selection and Clustering of Gene Expression Profiles Using Biological Knowledge

Cited by: 20
Authors
Mitra, Sushmita [1]
Ghosh, Sampreeti [1]
Affiliations
[1] Indian Stat Inst, Machine Intelligence Unit, Kolkata 700108, India
Source
IEEE TRANSACTIONS ON SYSTEMS MAN AND CYBERNETICS PART C-APPLICATIONS AND REVIEWS | 2012 / Vol. 42 / Issue 06
Keywords
Attribute clustering; clustering large applications based on RANdomized search (CLARANS); feature selection; gene ontology (GO); medoid; CLASSIFICATION; ALGORITHMS; DATABASE; QUALITY; TOOL
DOI
10.1109/TSMCC.2012.2209416
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, a novel feature selection algorithm, which is governed by biological knowledge, is developed. Gene expression data being high dimensional and redundant, dimensionality reduction is of prime concern. We employ the algorithm clustering large applications based on RANdomized search (CLARANS) for attribute clustering and dimensionality reduction based on gene ontology (GO) study. Feature selection with unsupervised learning is a difficult problem, with neither class labels present nor any guidance available to the search. Determination of the optimal number of clusters is another major issue, and has an impact on the resulting output. The use of GO analysis helps in the automated selection of biologically meaningful partitions. Tools such as Eisen plot and cluster profiles of these clusters help establish their coherence. Important representative features (or genes) are extracted from each correlated set of genes in such partitions. The algorithm is implemented on high-dimensional Yeast cell-cycle, Human Multiple Tissues, and Leukemia microarray data. In the second pass, clustering on the reduced gene space validates preservation of the inherent behavior of the original high-dimensional expression profiles. While the reduced gene set forms a biologically meaningful gene space, it simultaneously leads to a decrease in computational burden. External validation of the reduced subspace, using various well-known classifiers, establishes the effectiveness of the proposed methodology.
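The attribute-clustering step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it shows only a generic CLARANS-style randomized medoid search over genes, with each cluster's medoid kept as its representative gene. The GO-based choice of the number of clusters is not reproduced here, so `k` is fixed by hand. The distance `1 - |Pearson correlation|`, the toy data, and the parameter names `num_local` and `max_neighbor` are assumptions made for illustration.

```python
import numpy as np

def clarans(dist, k, num_local=2, max_neighbor=50, seed=0):
    """CLARANS-style randomized medoid search on a precomputed distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    best_cost, best_medoids = np.inf, None
    for _ in range(num_local):
        # Start each local search from a random set of k distinct medoids.
        medoids = rng.choice(n, size=k, replace=False)
        cost = dist[:, medoids].min(axis=1).sum()
        tries = 0
        while tries < max_neighbor:
            # Neighbor solution: swap one random medoid for one random non-medoid.
            candidate = medoids.copy()
            non_medoids = np.setdiff1d(np.arange(n), medoids)
            candidate[rng.integers(k)] = rng.choice(non_medoids)
            new_cost = dist[:, candidate].min(axis=1).sum()
            if new_cost < cost:
                medoids, cost, tries = candidate, new_cost, 0
            else:
                tries += 1
        if cost < best_cost:
            best_cost, best_medoids = cost, medoids
    labels = dist[:, best_medoids].argmin(axis=1)
    return best_medoids, labels, best_cost

# Toy expression matrix (assumed): 30 samples x 15 genes, built from
# 3 latent expression patterns with 5 noisy copies each.
rng = np.random.default_rng(1)
base = rng.normal(size=(30, 3))
X = np.hstack([base[:, [j]] + 0.1 * rng.normal(size=(30, 5)) for j in range(3)])

# Genes are the objects being clustered, so distances are gene-to-gene.
dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))

medoids, labels, cost = clarans(dist, k=3)
X_reduced = X[:, medoids]  # keep one representative gene per cluster
```

Here `X_reduced` plays the role of the reduced gene space; in the paper, the partition and hence the number of retained genes is selected with the help of GO analysis rather than a fixed `k`.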
Pages: 1590-1599
Page count: 10