TrieDedup: a fast trie-based deduplication algorithm to handle ambiguous bases in high-throughput sequencing

Cited by: 1

Authors:

Hu, Jianqiao ^{[1,2]}

Luo, Sai ^{[1,3,4,5]}

Tian, Ming ^{[1,3]}

Ye, Adam Yongxin ^{[1,3,4]}

Affiliations:

[1] Boston Childrens Hosp, Program Cellular & Mol Med, Boston, MA 02115 USA

[2] Univ Washington, Dept Biol, Seattle, WA USA

[3] Harvard Med Sch, Boston, MA 02115 USA

[4] Boston Childrens Hosp, Howard Hughes Med Inst, Boston, MA 02115 USA

[5] Tsinghua Univ, Sch Basic Med Sci, Beijing, Peoples R China

Source:

BMC BIOINFORMATICS | 2024 / Volume 25 / Issue 01

Funding:

National Institutes of Health (USA);

Keywords:

Deduplication; Ambiguous bases; Trie; Prefix tree; Next-generation sequencing; FORMAT;

DOI:

10.1186/s12859-024-05775-w

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Background: High-throughput sequencing is a powerful tool that is extensively applied in biological studies. However, sequencers may produce low-quality bases, which are reported as ambiguous bases ('N's). PCR duplicates introduced during library preparation are conventionally removed in genomics studies, and several deduplication tools have been developed for this purpose. Two identical reads may appear different because of ambiguous bases, and the existing tools cannot handle 'N's correctly or efficiently.

Results: Here we proposed and implemented TrieDedup, which uses the trie (prefix tree) data structure to compare and store sequences. TrieDedup handles ambiguous 'N' bases and deduplicates efficiently at the level of raw sequences. We also reduced its memory usage by approximately 20% by implementing restrictedDict in Python. We benchmarked the performance of the algorithm and showed that TrieDedup can deduplicate reads up to 270-fold faster than pairwise comparison, at a cost of 32-fold higher memory usage.

Conclusions: The TrieDedup algorithm may facilitate PCR deduplication, barcode or UMI assignment, and repertoire diversity analysis of large-scale high-throughput sequencing datasets, because its ultra-fast algorithm can account for ambiguous bases due to sequencing errors.
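The idea described in the abstract, storing sequences in a trie and letting 'N' match any base during lookup, can be illustrated with a minimal Python sketch. This is not the authors' TrieDedup code. The names `END`, `_matches`, `_insert`, and `dedup` are hypothetical, and two choices are assumptions: the sketch uses a plain nested `dict` rather than the restrictedDict mentioned above, and it processes reads with fewer 'N's first. It also treats two reads as duplicates only if they have the same length.

```python
from typing import Dict, Iterable, List

END = "$"  # hypothetical end-of-sequence marker stored in the trie


def _matches(node: Dict, seq: str, i: int) -> bool:
    """Return True if seq[i:] matches a stored sequence below `node`.

    'N' is treated as a wildcard on either side: an 'N' in the query
    explores every child, and a stored 'N' child matches any query base.
    """
    if i == len(seq):
        return END in node
    base = seq[i]
    if base == "N":
        candidates = [k for k in node if k != END]
    else:
        candidates = [k for k in (base, "N") if k in node]
    return any(_matches(node[k], seq, i + 1) for k in candidates)


def _insert(root: Dict, seq: str) -> None:
    """Add `seq` to the trie, creating child nodes as needed."""
    node = root
    for base in seq:
        node = node.setdefault(base, {})
    node[END] = True


def dedup(reads: Iterable[str]) -> List[str]:
    """Keep one representative per group of reads that match up to 'N's."""
    root: Dict = {}
    unique: List[str] = []
    # Assumption: visit reads with fewer 'N's first, so that ambiguous
    # reads are compared against more fully specified ones.
    for read in sorted(reads, key=lambda r: r.count("N")):
        if not _matches(root, read, 0):
            _insert(root, read)
            unique.append(read)
    return unique


if __name__ == "__main__":
    print(dedup(["ACGT", "ACNT", "ACGT", "TTTT", "NTTT", "ACGA"]))
```

Because matching through 'N' is not transitive ("ACGT" and "ACGA" both match "ACGN" but not each other), the representatives kept depend on processing order. The recursive lookup is adequate for short reads, but very long sequences would need an iterative search to stay within Python's recursion limit.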


Pages: 13
