GenomicsBench: A Benchmark Suite for Genomics

Cited by: 6
Authors
Subramaniyan, Arun [1]
Gu, Yufeng [1]
Dunn, Timothy [1]
Paul, Somnath [2]
Vasimuddin, Md [3]
Misra, Sanchit [3]
Blaauw, David [1]
Narayanasamy, Satish [1]
Das, Reetuparna [1]
Affiliations
[1] Univ Michigan, Ann Arbor, MI 48109 USA
[2] Intel Corp, Hillsboro, OR USA
[3] Intel Corp, Bangalore, Karnataka, India
Source
2021 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS 2021) | 2021
Keywords
Genomics; Bioinformatics; Benchmarking; Computer Architecture; ALIGNMENT; VARIANTS;
DOI
10.1109/ISPASS51385.2021.00012
Chinese Library Classification (CLC) number
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Over the last decade, advances in high-throughput sequencing and the availability of portable sequencers have enabled fast and cheap access to genetic data. For a given sample, sequencers typically output fragments of the DNA in the sample. Depending on the sequencing technology, the fragments range from 150-250 bases at high accuracy to a few tens of thousands of bases at much lower accuracy. Sequencing data is now being produced at a rate that far outpaces Moore's law and poses significant computational challenges on commodity hardware. To meet this demand, software tools have been extensively redesigned, and new algorithms and custom hardware have been developed to deal with the diversity in sequencing data. However, a standard set of benchmarks that captures the diverse behaviors of these recent algorithms and can facilitate future architectural exploration is lacking. To that end, we present the GenomicsBench benchmark suite, which contains 12 computationally intensive data-parallel kernels drawn from popular bioinformatics software tools. It covers the major steps in short- and long-read genome sequence analysis pipelines, such as basecalling, sequence mapping, de-novo assembly, variant calling and polishing. We observe that while these genomics kernels have abundant data-level parallelism, it is often hard to exploit on commodity processors because of input-dependent irregularities. We also perform a detailed microarchitectural characterization of these kernels and identify their bottlenecks. GenomicsBench includes parallel versions of the source code, with CPU and GPU implementations as applicable, along with representative input datasets of two sizes: small and large.
Pages: 1-12
Page count: 12