The Utility of Data Transformation for Alignment, De Novo Assembly and Classification of Short Read Virus Sequences

Cited by: 3

Authors:

Tapinos, Avraam ^{[1]}

Constantinides, Bede ^{[1,2]}

Phan, My V. T. ^{[3]}

Kouchaki, Samaneh ^{[1,4]}

Cotten, Matthew ^{[3,5,6]}

Robertson, David L. ^{[1,5]}

Affiliations:

[1] Univ Manchester, Sch Biol Sci, Manchester M13 9PT, Lancs, England

[2] Univ Oxford, John Radcliffe Hosp, Modernising Med Microbiol Consortium, Nuffield Dept Clin Med, Oxford OX3 9DU, England

[3] Erasmus MC, Dept Virosci, Doctor Molewaterpl 40, NL-3015 GD Rotterdam, Netherlands

[4] Univ Oxford, Inst Biomed Engn, Dept Engn Sci, Oxford OX3 7DQ, England

[5] MRC Univ Glasgow, Ctr Virus Res, Glasgow G61 1QH, Lanark, Scotland

[6] MRC UVRI & LSHTM Uganda Res Unit Entebbe, POB 49, Entebbe, Uganda

Source:

VIRUSES-BASEL | 2019 / Vol. 11 / Issue 05

Funding:

UK Biotechnology and Biological Sciences Research Council (BBSRC); EU Horizon 2020; Wellcome Trust (UK);

Keywords:

alignment; assembly; taxonomic classification; time series; data transformation; DWT; DFT; PAA; data compression; compressive genomics; TIME; ALGORITHM; DIMENSIONALITY; METAGENOMICS;

DOI:

10.3390/v11050394

Chinese Library Classification (CLC) number:

Q93 [Microbiology];

Discipline classification code:

071005 ; 100705 ;

Abstract:

Advances in DNA sequencing technology are facilitating genomic analyses of unprecedented scope and scale, widening the gap between our abilities to generate and fully exploit biological sequence data. Comparable analytical challenges are encountered in other data-intensive fields involving sequential data, such as signal processing, in which dimensionality reduction (i.e., compression) methods are routinely used to lessen the computational burden of analyses. In this work, we explored the application of dimensionality reduction methods to numerically represent high-throughput sequence data for three important biological applications of virus sequence data: reference-based mapping, short sequence classification and de novo assembly. Leveraging highly compressed sequence transformations to accelerate sequence comparison, our approach yielded comparable accuracy to existing approaches, further demonstrating its suitability for sequences originating from diverse virus populations. We assessed the application of our methodology using both synthetic and real viral pathogen sequences. Our results show that the use of highly compressed sequence approximations can provide accurate results, with analytical performance retained and even enhanced through appropriate dimensionality reduction of sequence data.
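As a rough illustration of the kind of dimensionality reduction the abstract refers to, the sketch below applies piecewise aggregate approximation (PAA, one of the listed keywords) to a numerically encoded nucleotide sequence. The `ENCODING` table and the `encode` and `paa` helper names are hypothetical choices made for this example only; the record does not state which numeric representation or implementation the authors used.

```python
from typing import List

# Hypothetical nucleotide-to-number mapping, assumed for illustration only.
ENCODING = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def encode(seq: str) -> List[float]:
    """Turn a nucleotide string into a numeric series using ENCODING."""
    return [ENCODING[base] for base in seq.upper()]


def paa(values: List[float], segments: int) -> List[float]:
    """Piecewise aggregate approximation: mean of each of `segments` chunks."""
    n = len(values)
    if not 1 <= segments <= n:
        raise ValueError("segments must be between 1 and len(values)")
    reduced = []
    for i in range(segments):
        # Integer boundaries split the series into near-equal chunks.
        start = i * n // segments
        end = (i + 1) * n // segments
        chunk = values[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced


if __name__ == "__main__":
    # An 8-base read is compressed to 4 values, one mean per pair of bases.
    print(paa(encode("AACCGGTT"), 4))
```

Comparing reads through such shortened series, rather than base by base, is the general idea behind compressed sequence comparison; how far the series can be shortened before accuracy suffers is an empirical question this sketch does not address.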

Pages: 22
