Clover: tree structure-based efficient DNA clustering for DNA-based data storage

被引:18
作者
Qu, Guanjin [1 ]
Yan, Zihui [1 ]
Wu, Huaming [1 ]
机构
[1] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
基金
中国国家自然科学基金;
关键词
DNA storage; DNA clustering; Tree structure; Levenshtein distance; DIGITAL INFORMATION; ROBUST;
D O I
10.1093/bib/bbac336
中图分类号
Q5 [生物化学];
学科分类号
071010 ; 081704 ;
摘要
Deoxyribonucleic acid (DNA)-based data storage is a promising new storage technology which has the advantage of high storage capacity and long storage time compared with traditional storage media. However, the synthesis and sequencing process of DNA can randomly generate many types of errors, which makes it more difficult to cluster DNA sequences to recover DNA information. Currently, the available DNA clustering algorithms are targeted at DNA sequences in the biological domain, which not only cannot adapt to the characteristics of sequences in DNA storage, but also tend to be unacceptably time-consuming for billions of DNA sequences in DNA storage. In this paper, we propose an efficient DNA clustering method termed Clover for DNA storage with linear computational complexity and low memory. Clover avoids the computation of the Levenshtein distance by using a tree structure for interval-specific retrieval. We argue through theoretical proofs that Clover has standard linear computational complexity, low space complexity, etc. Experiments show that our method can cluster 10 million DNA sequences into 50 000 classes in 10 s and meet an accuracy rate of over 99%. Furthermore, we have successfully completed an unprecedented clustering of 10 billion DNA data on a single home computer and the time consumption still satisfies the linear relationship. Clover is freely available at https://github.com/Guanjinqu/Clover.
引用
收藏
页数:16
相关论文
共 35 条
[1]  
Abu Sini M, 2019, IEEE INT SYMP INFO, P290, DOI [10.1109/ISIT.2019.8849740, 10.1109/isit.2019.8849740]
[2]   Digital DNA lifecycle security and privacy: an overview [J].
Alsaffar, Muhalb M. ;
Hasan, Mohammad ;
McStay, Gavin P. ;
Sedky, Mohamed .
BRIEFINGS IN BIOINFORMATICS, 2022, 23 (02)
[3]   Low cost DNA data storage using photolithographic synthesis and advanced information reconstruction and error correction [J].
Antkowiak, Philipp L. ;
Lietard, Jory ;
Darestani, Mohammad Zalbagi ;
Somoza, Mark M. ;
Stark, Wendelin J. ;
Heckel, Reinhard ;
Grass, Robert N. .
NATURE COMMUNICATIONS, 2020, 11 (01)
[4]   SEED: efficient clustering of next-generation sequences [J].
Bao, Ergude ;
Jiang, Tao ;
Kaloshian, Isgouhi ;
Girke, Thomas .
BIOINFORMATICS, 2011, 27 (18) :2502-2509
[5]   A brief review on DNA storage, compression, and digitalization [J].
Cevallos, Yesenia ;
Nakano, Tadashi ;
Tello-Oquendo, Luis ;
Rushdi, Ahmad ;
Inca, Deysi ;
Santillan, Ivone ;
Shirazi, Amin Zadeh ;
Samaniego, Nicolay .
NANO COMMUNICATION NETWORKS, 2022, 31
[6]   Next-Generation Digital Information Storage in DNA [J].
Church, George M. ;
Gao, Yuan ;
Kosuri, Sriram .
SCIENCE, 2012, 337 (6102) :1628-1628
[7]   DNA storage: research landscape and future prospects [J].
Dong, Yiming ;
Sun, Fajia ;
Ping, Zhi ;
Ouyang, Qi ;
Qian, Long .
NATIONAL SCIENCE REVIEW, 2020, 7 (06) :1092-1107
[8]  
Ebrahimi S., 2021, IEEE T EMERG TOP COM
[9]   Local homology recognition and distance measures in linear time using compressed amino acid alphabets [J].
Edgar, RC .
NUCLEIC ACIDS RESEARCH, 2004, 32 (01) :380-385
[10]   Search and clustering orders of magnitude faster than BLAST [J].
Edgar, Robert C. .
BIOINFORMATICS, 2010, 26 (19) :2460-2461