GraSSRep: Graph-Based Self-supervised Learning for Repeat Detection in Metagenomic Assembly

Cited by: 1
Authors
Azizpour, Ali [1]
Balaji, Advait [2]
Treangen, Todd J. [2]
Segarra, Santiago [1]
Affiliations
[1] Rice Univ, Dept Elect & Comp Engn, POB 1892, Houston, TX 77251 USA
[2] Rice Univ, Dept Comp Sci, Houston, TX USA
Source
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024 | 2024 / Vol. 14758
Keywords
Metagenomics; Repeat detection; Graph neural network; Self-supervised learning
DOI
10.1007/978-1-0716-3989-4_34
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, where genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph's structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudo-labels for a small proportion of the nodes. We then use those pseudo-labels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with pre-defined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic datasets. The results on the simulated data highlight GraSSRep's robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, our experiments with synthetic metagenomic datasets reveal that incorporating the graph structure and the GNN enhances our detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.
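The pseudo-label-then-propagate idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy graph, coverage values, thresholds, and all function names are hypothetical. One round of mean neighbour aggregation stands in for the learned GNN embedding, and a nearest-centroid rule stands in for the random forest classifier.

```python
from math import dist
from statistics import mean

# Toy assembly graph (hypothetical data): contig -> adjacent contigs.
GRAPH = {
    "c1": ["c2", "c3", "c4", "c5"],
    "c2": ["c1"],
    "c3": ["c1"],
    "c4": ["c1", "c6"],
    "c5": ["c1", "c6"],
    "c6": ["c4", "c5", "c7", "c8"],
    "c7": ["c6"],
    "c8": ["c6"],
}
# Per-contig read coverage (hypothetical sequencing feature).
COVERAGE = {"c1": 95.0, "c2": 10.0, "c3": 12.0, "c4": 30.0,
            "c5": 28.0, "c6": 60.0, "c7": 11.0, "c8": 9.0}


def pseudo_label(node):
    """High-precision, low-recall heuristic; thresholds are illustrative only."""
    cov, deg = COVERAGE[node], len(GRAPH[node])
    if cov >= 90 and deg >= 4:
        return "repeat"
    if cov <= 12 and deg <= 1:
        return "non-repeat"
    return None  # ambiguous nodes stay unlabeled


def node_features(node):
    """Own coverage, degree, and mean neighbour coverage.

    The neighbour mean is a single unlearned aggregation step, used here
    as a stand-in for a trained GNN embedding.
    """
    neighbour_cov = mean(COVERAGE[n] for n in GRAPH[node])
    return [COVERAGE[node], len(GRAPH[node]), neighbour_cov]


def classify_nodes():
    raw = {n: node_features(n) for n in GRAPH}
    # Scale every feature by its maximum so distances are comparable.
    maxima = [max(col) for col in zip(*raw.values())]
    feats = {n: [v / m for v, m in zip(f, maxima)] for n, f in raw.items()}

    # Step 1: pseudo-labels for the confidently labeled subset of nodes.
    seeds = {n: pseudo_label(n) for n in GRAPH}
    seeds = {n: lab for n, lab in seeds.items() if lab is not None}

    # Step 2: fit a classifier on the pseudo-labeled nodes
    # (nearest centroid here, in place of a random forest).
    centroids = {}
    for lab in set(seeds.values()):
        members = [feats[n] for n in seeds if seeds[n] == lab]
        centroids[lab] = [mean(col) for col in zip(*members)]

    # Step 3: propagate labels to the remaining nodes.
    labels = dict(seeds)
    for n in GRAPH:
        if n not in labels:
            labels[n] = min(centroids, key=lambda lab: dist(feats[n], centroids[lab]))
    return seeds, labels


seeds, labels = classify_nodes()
for name in sorted(labels):
    source = "heuristic" if name in seeds else "propagated"
    print(name, labels[name], source)
```

In this toy graph the heuristic labels only the most extreme contigs, and the classifier then assigns the ambiguous ones (`c4`, `c5`, `c6`) using both their own features and their neighbourhood.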
Pages: 372-376
Page count: 5