HycDemux: a hybrid unsupervised approach for accurate barcoded sample demultiplexing in nanopore sequencing

Cited by: 1
Authors
Han, Renmin [1]
Qi, Junhai [1,2]
Xue, Yang [1]
Sun, Xiujuan [3]
Zhang, Fa [4]
Gao, Xin [5]
Li, Guojun [1]
Affiliations
[1] Shandong Univ, Res Ctr Math & Interdisciplinary Sci, Qingdao 266237, Peoples R China
[2] BioMap Res, Menlo Pk, CA USA
[3] Chinese Acad Sci, Inst Comp Technol, High Performance Comp Res Ctr, Beijing 100190, Peoples R China
[4] Beijing Inst Technol, Sch Med Technol, Beijing 100085, Peoples R China
[5] King Abdullah Univ Sci & Technol KAUST, Computat Biosci Res Ctr CBRC, Comp Elect & Math Sci & Engn CEMSE Div, Thuwal 23955, Saudi Arabia
Funding
National Natural Science Foundation of China
Keywords
Nanopore sequencing; Demultiplexing; Clustering; CD-HIT; PROTEIN;
DOI
10.1186/s13059-023-03053-1
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
DNA barcodes enable Oxford Nanopore sequencing to sequence multiple barcoded DNA samples on a single flow cell. DNA sequences with the same barcode need to be grouped together through demultiplexing. As the number of samples increases, accurate demultiplexing becomes difficult. We introduce HycDemux, which incorporates a GPU-parallelized hybrid clustering algorithm that uses nanopore signals and DNA sequences for accurate data clustering, alongside a voting-based module to finalize the demultiplexing results. Comprehensive experiments demonstrate that our approach outperforms unsupervised tools in short sequence fragment clustering and performs more robustly than current state-of-the-art demultiplexing tools for complex multi-sample sequencing data.
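The abstract does not specify how the voting-based module works, so the following is only a minimal sketch of one plausible reading: once reads have been clustered, each cluster is assigned the barcode that most of its member reads are closest to. Every name here (`edit_distance`, `nearest_barcode`, `vote_cluster_labels`), the use of plain Levenshtein distance, and the toy barcodes are assumptions made for illustration; none of this is HycDemux's actual implementation or API.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def nearest_barcode(read: str, barcodes: dict) -> str:
    """Return the name of the reference barcode closest to the read."""
    return min(barcodes, key=lambda name: edit_distance(read, barcodes[name]))


def vote_cluster_labels(clusters: dict, barcodes: dict) -> dict:
    """Assign each non-empty cluster the barcode that wins a majority vote.

    Hypothetical scheme: every read votes for its nearest barcode, and the
    cluster takes the most common vote. Empty clusters are skipped.
    """
    labels = {}
    for cluster_id, reads in clusters.items():
        if not reads:
            continue
        votes = Counter(nearest_barcode(r, barcodes) for r in reads)
        labels[cluster_id] = votes.most_common(1)[0][0]
    return labels


# Toy data, invented for this example only.
barcodes = {"BC01": "ACGTACGT", "BC02": "TTGGCCAA"}
clusters = {
    0: ["ACGTACGT", "ACGTACGA", "TTGGCCAA"],  # one stray read in the cluster
    1: ["TTGGCCAA", "TTGGCCAT"],
}
labels = vote_cluster_labels(clusters, barcodes)
```

In this toy case the stray read in cluster 0 is outvoted, which is the general appeal of voting at the cluster level: a single noisy read does not decide the label on its own.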
Pages: 29