Robust data storage in DNA by de Bruijn graph-based de novo strand assembly

Cited by: 51
Authors
Song, Lifu [1,2,3]
Geng, Feng [4]
Gong, Zi-Yi [1,2,3]
Chen, Xin [5]
Tang, Jijun [6,7]
Gong, Chunye [8]
Zhou, Libang [9]
Xia, Rui [8]
Han, Ming-Zhe [1,2,3]
Xu, Jing-Yi [1,2,3]
Li, Bing-Zhi [1,2,3]
Yuan, Ying-Jin [1,2,3]
Affiliations
[1] Tianjin Univ, Frontiers Sci Ctr Synthet Biol, Tianjin 300072, Peoples R China
[2] Tianjin Univ, Key Lab Syst Bioengn, Minist Educ, Tianjin 300072, Peoples R China
[3] Tianjin Univ, Sch Chem Engn & Technol, Tianjin 300072, Peoples R China
[4] Binzhou Med Univ, Coll Pharm, Yantai 264003, Shandong, Peoples R China
[5] Tianjin Univ, Ctr Appl Math, Tianjin 300072, Peoples R China
[6] Tianjin Univ, Coll Intelligence & Comp, Sch Comp Sci & Technol, Tianjin 300350, Peoples R China
[7] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen, Peoples R China
[8] Natl SuperComp Ctr Tianjin, Tianjin 300457, Peoples R China
[9] Nanjing Agr Univ, Coll Food Sci & Technol, Nanjing 210095, Jiangsu, Peoples R China
Keywords
MULTIPLE SEQUENCE ALIGNMENT; DIGITAL INFORMATION; SYNTHETIC DNA; ERROR RATES; RECONSTRUCTION;
DOI
10.1038/s41467-022-33046-w
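Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];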
Discipline classification codes
07; 0710; 09
Abstract
DNA data storage is a rapidly developing technology with great potential due to its high density, long-term durability, and low maintenance cost. The major technical challenges include various errors, such as strand breaks, rearrangements, and indels, that frequently arise during DNA synthesis, amplification, sequencing, and preservation. In this study, a de novo strand assembly algorithm (DBGPS) is developed using a de Bruijn graph and greedy path search to meet these challenges. DBGPS shows substantial advantages in handling DNA breaks, rearrangements, and indels. The robustness of DBGPS is demonstrated by accelerated aging, multiple independent data retrievals, deep error-prone PCR, and large-scale simulations. Remarkably, 6.8 MB of data is accurately recovered from a severely corrupted sample that has been treated at 70 °C for 70 days. With DBGPS, we are able to achieve a logical density of 1.30 bits/cycle and a physical density of 295 PB/g.
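The abstract names two ingredients, a de Bruijn graph and greedy path search, without giving details. The following is a minimal sketch of that general idea only, not the published DBGPS implementation. It assumes that each strand begins with a known anchor k-mer, that low-count k-mers are sequencing noise, and that the strand length is known. The function names, the `min_count` threshold, and the toy reads are all hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in the reads; the counts implicitly define a de Bruijn graph."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def greedy_assemble(counts, start, length, min_count=2):
    """Extend `start` one base at a time, always following the most frequent
    successor k-mer. Stops at `length` or when no successor reaches `min_count`.
    Assumption: `start` is a k-mer known in advance (for example, a strand index)."""
    k = len(start)
    strand = start
    while len(strand) < length:
        suffix = strand[-(k - 1):]
        candidates = [
            (counts[suffix + base], base)
            for base in "ACGT"
            if counts[suffix + base] >= min_count
        ]
        if not candidates:
            break  # dead end: no sufficiently supported edge leaves this node
        strand += max(candidates)[1]
    return strand


# Toy data: fragments of one 16-nt strand, plus one read with a substitution.
reads = [
    "ACGTTGCATC",        # left fragment (simulated strand break)
    "TGCATCAGGCTA",      # right fragment
    "ACGTTGCATCAGGCTA",  # intact copy
    "GCATCAGGCTA",       # another right fragment
    "ACGTTGCAACAGG",     # fragment carrying a T->A substitution
]
assembled = greedy_assemble(count_kmers(reads, 5), "ACGTT", 16)
print(assembled)
```

Because the walk uses only k-mer counts, broken fragments still contribute evidence, and the k-mers introduced by the single erroneous read fall below `min_count` and are never followed.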
Pages: 9