PandoGen: Generating complete instances of future SARS-CoV-2 sequences using Deep Learning

Cited by: 1

Authors:

Ramachandran, Anand [1]

Lumetta, Steven S. [1]

Chen, Deming [1]

Affiliations:

[1] Univ Illinois, Urbana, IL 61820 USA

Source:

PLOS COMPUTATIONAL BIOLOGY | 2024 / Vol. 20 / Issue 01

Funding:

National Science Foundation (USA);

Keywords:

LANGUAGE;

DOI:

10.1371/journal.pcbi.1011790

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification code:

071010; 081704;

Abstract:

One of the challenges in a viral pandemic is the emergence of novel variants with different phenotypical characteristics. An ability to forecast future viral individuals at the sequence level enables advance preparation by characterizing the sequences and closing vulnerabilities in current preventative and therapeutic methods. In this article, we explore, in the context of a viral pandemic, the problem of using protein language models to generate complete instances of undiscovered viral protein sequences that have a high likelihood of being discovered in the future. Current approaches to training these models fit model parameters to a known sequence set, which does not suit pandemic forecasting because future sequences differ from known sequences in some respects. To address this, we develop a novel method, called PandoGen, to train protein language models for the pandemic protein forecasting task. PandoGen combines techniques such as synthetic data generation, conditional sequence generation, and reward-based learning, enabling the model to forecast future sequences with a high propensity to spread. Applying our method to modeling the SARS-CoV-2 Spike protein sequence, we find empirically that our model forecasts twice as many novel sequences with five times the case counts compared to a model that is 30x larger. Our method forecasts unseen lineages months in advance, whereas models 4x and 30x larger forecast almost no new lineages. When trained on data available up to a month before the onset of important Variants of Concern, our method consistently forecasts sequences belonging to those variants within tight sequence budgets.

Viral protein sequences play a pivotal role in the spread of a pandemic. As the virus evolves, so do the viral proteins, increasing the potency of the virus. Knowledge of future viral protein sequences can be invaluable because it allows us to test the efficacy of preventative and treatment methods against future changes to the virus, and to tailor them to such changes early. We attempt to forecast viral proteins ahead of time. Making such predictions is challenging because the prediction target is a sequence with thousands of positions, and a single mis-predicted sequence position may invalidate the entire prediction. Also, as the virus continues to evolve, the data available to train models becomes obsolete. Addressing these challenges, we create a novel approach to train models of the SARS-CoV-2 Spike protein that are especially tailored to forecasting future sequences. Models trained using this approach outperform existing approaches in their effectiveness. In addition, our method can train models to forecast important pandemic variants ahead of time.


Pages: 31

