Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools

Cited by: 1
Authors
Ogasawara, Takeshi [1]
Cheng, Yinhe [2]
Tzeng, Tzy-Hwa Kathy [3]
Affiliations
[1] IBM Res Tokyo, Tokyo, Japan
[2] IBM Syst, Austin, TX USA
[3] IBM Syst, Poughkeepsie, NY USA
Source
PLOS ONE | 2016 / Vol. 11 / Issue 11
Keywords
FORMAT;
DOI
10.1371/journal.pone.0167100
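Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09;
Abstract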
This paper introduces sam2bam, a high-throughput software tool framework that significantly speeds up preprocessing of next-generation sequencing (NGS) data. sam2bam is especially efficient on single-node, multi-core, large-memory systems. For marking duplicate reads on a single-node system, it reduces preprocessing runtime by 156-186x compared with de facto standard tools. sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and, where present, hardware compression accelerators. Its basic feature is file format conversion between two well-known genome file formats, from SAM to BAM. Additional features for analyzing, filtering, and converting input data are provided by plug-in tools, e.g., duplicate marking, which can be attached to sam2bam at runtime. We demonstrated that sam2bam significantly reduces the runtime of NGS data preprocessing, from about two hours to about one minute, for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. On the same system, it reduces the runtime from about 20 hours to about nine minutes for a whole-genome sequencing data set using up to 711 GB of memory.
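As a rough sanity check on the end-to-end figures quoted in the abstract, the runtime ratios can be worked out directly. This is only a sketch: the inputs are the rounded values from the abstract ("about two hours", "about one minute", and so on), and the helper name `speedup` is an illustrative assumption. The results indicate the order of magnitude and are not expected to reproduce the 156-186x range, which the abstract reports for duplicate marking and which comes from the paper's own measured runtimes.

```python
def speedup(baseline_minutes: float, optimized_minutes: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if optimized_minutes <= 0:
        raise ValueError("optimized runtime must be positive")
    return baseline_minutes / optimized_minutes


# Rounded runtimes quoted in the abstract, converted to minutes.
exome = speedup(2 * 60, 1)    # about two hours -> about one minute
genome = speedup(20 * 60, 9)  # about 20 hours -> about nine minutes

print(f"whole-exome:  ~{exome:.0f}x")
print(f"whole-genome: ~{genome:.0f}x")
```

With these rounded inputs, both data sets come out at a speedup on the order of 100x or more.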
Pages: 11
Related papers
50 in total
  • [1] FastqPuri: high-performance preprocessing of RNA-seq data
    Perez-Rubio, Paula
    Lottaz, Claudio
    Engelmann, Julia C.
    BMC BIOINFORMATICS, 2019, 20 (1)
  • [2] FastqPuri: high-performance preprocessing of RNA-seq data
    Paula Pérez-Rubio
    Claudio Lottaz
    Julia C. Engelmann
    BMC Bioinformatics, 20
  • [3] Programming Tools for High-Performance Data Analysis
    Talia, Domenico
    Trunfio, Paolo
    PROCEEDINGS OF THE 33RD INTERNATIONAL SYMPOSIUM ON HIGH-PERFORMANCE PARALLEL AND DISTRIBUTED COMPUTING, HPDC 2024, 2024,
  • [4] Comparative Performance Evaluation of High-performance Data Transfer Tools
    Nadig, Deepak
    Jung, Eun-Sung
    Kettimuthu, Rajkumar
    Foster, Ian
    Rao, Nageswara S. V.
    Ramamurthy, Byrav
    2018 IEEE INTERNATIONAL CONFERENCE ON ADVANCED NETWORKS AND TELECOMMUNICATIONS SYSTEMS (ANTS), 2018,
  • [5] Expeditious Dynamic Clustering in Preprocessing for High-Performance Classification
    Pimpa, Anamika
    Eiamkanitchat, Narissara
    JOURNAL OF ELECTRICAL AND COMPUTER ENGINEERING, 2024, 2024
  • [6] High-Performance Optimization Framework for Reversible Data Hiding Predictor
    Ma, Bin
    Duan, Hongtao
    Ma, Ruihe
    Xian, Yongjin
    Li, Xiaolong
    IEEE SIGNAL PROCESSING LETTERS, 2025, 32 : 231 - 235
  • [7] Applying High-Performance Bioinformatics Tools for Outlier Detection in Log Data
    Wurzenberger, Markus
    Skopik, Florian
    Fiedler, Roman
    Kastner, Wolfgang
    2017 3RD IEEE INTERNATIONAL CONFERENCE ON CYBERNETICS (CYBCONF), 2017, : 399 - +
  • [8] EDA tools for high-performance MCM
    Maher, MA
    Khainson, A
    IEEE SYMPOSIUM ON IC/PACKAGE DESIGN INTEGRATION - PROCEEDINGS, 1998, : 70 - 73
  • [9] High-performance Monte Carlo tools
    Mascagni, M
    IEEE COMPUTATIONAL SCIENCE & ENGINEERING, 1998, 5 (02): : 97 - 98
  • [10] High-performance tools position accurately
    Zankowsky, D
    LASER FOCUS WORLD, 1997, 33 (01): : 135 - 138