Distributed Monte Carlo Feature Selection: Extracting Informative Features Out of Multidimensional Problems with Linear Speedup

Cited by: 1
Authors
Krol, Lukasz [1]
Affiliations
[1] Silesian Tech Univ, Data Min Grp, Fac Automat Control Elect & Comp Sci, Gliwice, Poland
Source
BEYOND DATABASES, ARCHITECTURES AND STRUCTURES, BDAS 2016 | 2016 / Vol. 613
Keywords
Feature selection; Dimensionality reduction; Parallel computing; Actor systems; Akka; Spark; Scala; Java; DISCOVERY;
DOI
10.1007/978-3-319-34099-9_35
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Selecting informative features from the ever-growing results of high-throughput biological experiments requires specialized feature selection algorithms. One such method is Monte Carlo Feature Selection, which is straightforward yet computationally expensive. In this technical paper we present the architecture and performance of a development version of our distributed implementation of this algorithm, designed to run in multiprocessor as well as multihost computing environments, and potentially controllable through a web browser by non-IT staff. As a simple enhancement, our method produces statistically interpretable output by means of permutation testing. Tested on the reference Golub et al. leukemia data, as well as on our own dataset of almost 2 million features, it showed nearly linear speedup as the number of processors increased. Being platform independent and open to extensions, this application could become a valuable tool for researchers facing ill-defined, high-dimensional feature selection problems.
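The permutation testing mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the scoring function `mean_diff_score` is a hypothetical stand-in for the Monte Carlo Feature Selection importance measure, and the names `permutation_p_value`, `n_perm`, and `seed` are assumptions made for illustration only. The idea shown is the general one: compare a feature's observed score against scores obtained after shuffling the class labels.

```python
import random


def mean_diff_score(values, labels):
    """Hypothetical stand-in score: absolute difference of the two class means.

    Assumes binary labels 0/1 with at least one sample in each class.
    """
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return abs(sum(class0) / len(class0) - sum(class1) / len(class1))


def permutation_p_value(values, labels, score=mean_diff_score, n_perm=999, seed=0):
    """Estimate a p-value for one feature by shuffling the class labels.

    The p-value is the fraction of permuted scores that are at least as
    large as the observed score, with the usual +1 correction so that it
    is never exactly zero.
    """
    rng = random.Random(seed)
    observed = score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # shuffling preserves the class sizes
        if score(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    labels = [0] * 5 + [1] * 5
    informative = [0.1, 0.2, 0.15, 0.05, 0.12, 5.0, 5.1, 4.9, 5.2, 5.05]
    print(permutation_p_value(informative, labels))
```

Because each feature's permutations are independent of the others, work of this kind can be split across processors, which is consistent with the near-linear speedup the abstract reports, although the abstract itself does not describe how the work is partitioned.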
Pages: 463-474
Page count: 12