SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning

Cited by: 15
Authors
Balaji, Advait [1]
Kille, Bryce [1]
Kappell, Anthony D. [2]
Godbold, Gene D. [3]
Diep, Madeline [4]
Elworth, R. A. Leo [1]
Qian, Zhiqin [1]
Albin, Dreycey [1]
Nasko, Daniel J. [5]
Shah, Nidhi [5]
Pop, Mihai [5]
Segarra, Santiago [6]
Ternus, Krista L. [2]
Treangen, Todd J. [1]
Affiliations
[1] Rice Univ, Dept Comp Sci, Houston, TX 77251 USA
[2] Signat Sci LLC, 8329 North Mopac Expressway, Austin, TX 78759 USA
[3] Signat Sci LLC, 1670 Discovery Dr, Charlottesville, VA USA
[4] Fraunhofer USA Ctr Midatlant CMA, Riverdale, MD USA
[5] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[6] Rice Univ, Dept Elect & Comp Engn, POB 1892, Houston, TX 77251 USA
Funding
US National Science Foundation
Keywords
VFDB; ALIGNMENT; CLASSIFICATION; ANNOTATION; RESOURCE; DATABASE;
DOI
10.1186/s13059-022-02695-x
Chinese Library Classification (CLC) number
Q81 [Bioengineering (biotechnology)]; Q93 [Microbiology]
Discipline classification codes
071005; 0836; 090102; 100705
Abstract
The COVID-19 pandemic has emphasized the importance of accurate detection of known and emerging pathogens. However, robust characterization of pathogenic sequences remains an open challenge. To address this need we developed SeqScreen, which accurately characterizes short nucleotide sequences using taxonomic and functional labels and a customized set of curated Functions of Sequences of Concern (FunSoCs) specific to microbial pathogenesis. We show our ensemble machine learning model can label protein-coding sequences with FunSoCs with high recall and precision. SeqScreen is a step towards a novel paradigm of functionally informed synthetic DNA screening and pathogen characterization, available for download at www.gitlab.com/treangenlab/seqscreen.
Pages: 29