Parallelized Classification of Cancer Sub-types from Gene Expression Profiles Using Recursive Gene Selection

Cited by: 2
Authors
Venkataramana, Lokeswari [1]
Jacob, Shomona Gracia [1]
Ramadoss, Rajavel [2]
Affiliations
[1] Sri Sivasubramaniya Nadar Coll Engn, Dept CSE, Madras 603110, Tamil Nadu, India
[2] Sri Sivasubramaniya Nadar Coll Engn, Dept ECE, Madras 603110, Tamil Nadu, India
Source
STUDIES IN INFORMATICS AND CONTROL | 2018 / Vol. 27 / Issue 02
Keywords
Recursive Feature Selection; Gene Selection; Microarray Gene Expression; Parallelized classification; Random Forest; PREDICTION; ENSEMBLE;
DOI
10.24846/v27i2y201809
Chinese Library Classification number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Cancer is a chronic disease caused mainly by irregularities in genes, so it is important to identify the oncogenes responsible. Biological data such as gene expression levels, protein sequences, RNA sequences, pathway analyses, pan-cancer analyses and structural biomarkers can aid in cancer diagnosis, classification and prognosis. This research focuses on classifying subtypes of cancer using Microarray Gene Expression (MGE) levels. MGE data are high-dimensional but contain very few samples, so dimensionality reduction is needed to select the relevant genes and remove the redundant ones. The Recursive Feature Selection (RFS) method is proposed because it repeats the gene selection process until the best gene subset is found. The best subset of genes is then used for classification with different models, which are evaluated using 10-fold cross-validation. To scale to large volumes of gene expression data, a parallelized classification model was explored on the Spark framework. The non-parallelized classification model on Weka was compared with the parallelized classification model on Spark. The results showed that the parallelized model performs better than the non-parallelized model in terms of both accuracy and execution time. The performance of RFS with the parallelized classifier was also compared with previous approaches, and the proposed combination outperformed them.
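The abstract does not spell out how RFS ranks genes or when it stops, so the following is only a minimal sketch of the general idea: repeatedly shrink the gene set and keep the subset with the best 10-fold cross-validated accuracy. Everything specific here is an assumption made for illustration, not the authors' method: the synthetic data generator, the signal-to-noise ranking score, the halving schedule, and the nearest-centroid classifier (the paper evaluates other models, on Weka and Spark).

```python
import random
from statistics import mean, pstdev


def make_synthetic_expression(n_per_class=20, n_genes=50, n_informative=3,
                              shift=4.0, seed=0):
    """Hypothetical two-class data: the first n_informative genes are shifted in class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for g in range(n_informative):
                    row[g] += shift
            X.append(row)
            y.append(label)
    return X, y


def signal_to_noise(X, y, gene):
    """Assumed ranking score: |mean difference| / (sum of class standard deviations)."""
    a = [row[gene] for row, lab in zip(X, y) if lab == 0]
    b = [row[gene] for row, lab in zip(X, y) if lab == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)


def cv_accuracy(X, y, genes, k=10, seed=0):
    """k-fold cross-validated accuracy of a nearest-centroid classifier on the given genes."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        centroids = {}
        for lab in set(y[i] for i in train):
            members = [X[i] for i in train if y[i] == lab]
            centroids[lab] = [mean(r[g] for r in members) for g in genes]
        for i in fold:
            point = [X[i][g] for g in genes]
            pred = min(
                centroids,
                key=lambda lab: sum((p - c) ** 2
                                    for p, c in zip(point, centroids[lab])),
            )
            correct += pred == y[i]
    return correct / len(X)


def recursive_gene_selection(X, y, k=10):
    """Halve the gene set by rank until one gene remains; return the best-scoring subset."""
    genes = list(range(len(X[0])))
    best_genes, best_acc = genes, cv_accuracy(X, y, genes, k)
    while len(genes) > 1:
        ranked = sorted(genes, key=lambda g: signal_to_noise(X, y, g), reverse=True)
        genes = ranked[: max(1, len(ranked) // 2)]
        acc = cv_accuracy(X, y, genes, k)
        if acc >= best_acc:  # on ties, prefer the smaller subset
            best_genes, best_acc = genes, acc
    return sorted(best_genes), best_acc


X, y = make_synthetic_expression()
best_genes, best_acc = recursive_gene_selection(X, y)
print(best_genes, best_acc)
```

For brevity this sketch ranks genes on the full data set before cross-validating, which biases the accuracy estimate upward; a more careful evaluation would repeat the selection inside each training fold.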
Pages: 213-222
Page count: 10