Distributed Monte Carlo Feature Selection: Extracting Informative Features Out of Multidimensional Problems with Linear Speedup

被引:1
作者
Krol, Lukasz [1 ]
机构
[1] Silesian Tech Univ, Data Min Grp, Fac Automat Control Elect & Comp Sci, Gliwice, Poland
来源
BEYOND DATABASES, ARCHITECTURES AND STRUCTURES, BDAS 2016 | 2016年 / 613卷
关键词
Feature selection; Dimensionality reduction; Parallel computing; Actor systems; Akka; Spark; Scala; !text type='Java']Java[!/text; DISCOVERY;
D O I
10.1007/978-3-319-34099-9_35
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Selection of informative features out of ever growing results of high throughput biological experiments requires specialized feature selection algorithms. One of such methods is the Monte Carlo Feature Selection - a straightforward, yet computationally expensive one. In this technical paper we present architecture and performance of a development version of our distributed implementation of this algorithm, designed to run in multiprocessor as well as multihost computing environments, and potentially controllable through a web browser by non-IT staff. As a simple enhancement, our method is able to produce statistically interpretable output by means of permutation testing. Tested on reference Golub et al. leukemia data, as well as on our own dataset of almost 2 million features, it has shown nearly linear speedup when executed with an increased amount of processors. Being platform independent, as well as open for extensions, this application could become a valuable tool for researchers facing the challenge of ill-defined high dimensional feature selection problems.
引用
收藏
页码:463 / 474
页数:12
相关论文
共 12 条
  • [1] An integrated map of genetic variation from 1,092 human genomes
    Altshuler, David M.
    Durbin, Richard M.
    Abecasis, Goncalo R.
    Bentley, David R.
    Chakravarti, Aravinda
    Clark, Andrew G.
    Donnelly, Peter
    Eichler, Evan E.
    Flicek, Paul
    Gabriel, Stacey B.
    Gibbs, Richard A.
    Green, Eric D.
    Hurles, Matthew E.
    Knoppers, Bartha M.
    Korbel, Jan O.
    Lander, Eric S.
    Lee, Charles
    Lehrach, Hans
    Mardis, Elaine R.
    Marth, Gabor T.
    McVean, Gil A.
    Nickerson, Deborah A.
    Schmidt, Jeanette P.
    Sherry, Stephen T.
    Wang, Jun
    Wilson, Richard K.
    Gibbs, Richard A.
    Dinh, Huyen
    Kovar, Christie
    Lee, Sandra
    Lewis, Lora
    Muzny, Donna
    Reid, Jeff
    Wang, Min
    Wang, Jun
    Fang, Xiaodong
    Guo, Xiaosen
    Jian, Min
    Jiang, Hui
    Jin, Xin
    Li, Guoqing
    Li, Jingxiang
    Li, Yingrui
    Li, Zhuo
    Liu, Xiao
    Lu, Yao
    Ma, Xuedi
    Su, Zhe
    Tai, Shuaishuai
    Tang, Meifang
    [J]. NATURE, 2012, 491 (7422) : 56 - 65
  • [2] Monte Carlo feature selection for supervised classification
    Draminski, Michal
    Rada-Iglesias, Alvaro
    Enroth, Stefan
    Wadelius, Claes
    Koronacki, Jacek
    Komorowski, Jan
    [J]. BIOINFORMATICS, 2008, 24 (01) : 110 - 117
  • [3] Draminski M, 2010, STUD COMPUT INTELL, V263, P371
  • [4] A second generation human haplotype map of over 3.1 million SNPs
    Frazer, Kelly A.
    Ballinger, Dennis G.
    Cox, David R.
    Hinds, David A.
    Stuve, Laura L.
    Gibbs, Richard A.
    Belmont, John W.
    Boudreau, Andrew
    Hardenbol, Paul
    Leal, Suzanne M.
    Pasternak, Shiran
    Wheeler, David A.
    Willis, Thomas D.
    Yu, Fuli
    Yang, Huanming
    Zeng, Changqing
    Gao, Yang
    Hu, Haoran
    Hu, Weitao
    Li, Chaohua
    Lin, Wei
    Liu, Siqi
    Pan, Hao
    Tang, Xiaoli
    Wang, Jian
    Wang, Wei
    Yu, Jun
    Zhang, Bo
    Zhang, Qingrun
    Zhao, Hongbin
    Zhao, Hui
    Zhou, Jun
    Gabriel, Stacey B.
    Barry, Rachel
    Blumenstiel, Brendan
    Camargo, Amy
    Defelice, Matthew
    Faggart, Maura
    Goyette, Mary
    Gupta, Supriya
    Moore, Jamie
    Nguyen, Huy
    Onofrio, Robert C.
    Parkin, Melissa
    Roy, Jessica
    Stahl, Erich
    Winchester, Ellen
    Ziaugra, Liuda
    Altshuler, David
    Shen, Yan
    [J]. NATURE, 2007, 449 (7164) : 851 - U3
  • [5] Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring
    Golub, TR
    Slonim, DK
    Tamayo, P
    Huard, C
    Gaasenbeek, M
    Mesirov, JP
    Coller, H
    Loh, ML
    Downing, JR
    Caligiuri, MA
    Bloomfield, CD
    Lander, ES
    [J]. SCIENCE, 1999, 286 (5439) : 531 - 537
  • [6] Hall M., 2009, SIGKDD EXPLORATIONS, V11, P10, DOI [DOI 10.1145/1656274.1656278, 10.1145/1656274.1656278]
  • [7] Application of genetic algorithms and constructive neural networks for the analysis of microarray cancer data
    Marcos Luque-Baena, Rafael
    Urda, Daniel
    Luis Subirats, Jose
    Franco, Leonardo
    Jerez, Jose M.
    [J]. THEORETICAL BIOLOGY AND MEDICAL MODELLING, 2014, 11
  • [8] On lines and planes of closest fit to systems of points in space.
    Pearson, Karl
    [J]. PHILOSOPHICAL MAGAZINE, 1901, 2 (7-12) : 559 - 572
  • [9] What's wrong with Bonferroni adjustments
    Perneger, TV
    [J]. BRITISH MEDICAL JOURNAL, 1998, 316 (7139) : 1236 - 1238
  • [10] Quinlan J.R., 2013, EFFECTIVE AKKA