Partial order relation-based gene ontology embedding improves protein function prediction

Cited by: 0
Authors
Li, Wenjing [1 ]
Wang, Bin [2 ]
Dai, Jin [3 ]
Kou, Yan [4 ,5 ]
Chen, Xiaojun [1 ]
Pan, Yi [6 ,7 ]
Hu, Shuangwei [8 ]
Xu, Zhenjiang Zech [9 ]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software, 3688 Nanshan Ave, Shenzhen 518060, Guangdong, Peoples R China
[2] Nanchang Univ, Sch Math & Comp Sci, Nanchang, Peoples R China
[3] Beijing Inst Technol, Sch Phys, Beijing, Peoples R China
[4] GSK, Onyx Prod Management, New York, NY USA
[5] Xbiome & Insight Data Sci, Shenzhen, Peoples R China
[6] Chinese Acad Sci, Shenzhen Inst Adv Technol, Coll Comp Sci & Control Engn, Shenzhen, Peoples R China
[7] Georgia State Univ, Atlanta, GA USA
[8] Xbiome, Room 907,9th Floor,Sci Res Bldg,Tsinghua High Tech, Shenzhen 518000, Peoples R China
[9] Nanchang Univ, State Key Lab Food Sci & Technol, Nanchang, Peoples R China
Keywords
Gene Ontology; protein annotation; representation learning; protein function prediction; partial order constraint; LANGUAGE;
DOI
Not available
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Protein annotation has long been a challenging task in computational biology. Gene Ontology (GO) has become one of the most popular frameworks to describe protein functions and their relationships. Annotating a protein with the proper GO terms demands high-quality GO term representation learning, which aims to learn, for each functional label, a low-dimensional dense vector that carries semantic meaning, also known as an embedding. However, existing GO term embedding methods, which mainly take into account ancestral co-occurrence information, have yet to capture the full topological information in the GO directed acyclic graph (DAG). In this study, we propose a novel GO term representation learning method, PO2Vec, which uses partial order relationships to improve the GO term representations. Extensive evaluations show that PO2Vec achieves better outcomes than existing embedding methods in a variety of downstream biological tasks. Based on PO2Vec, we further developed a new protein function prediction method, PO2GO, which demonstrates superior performance in the benchmarks across multiple metrics, in annotation specificity, and in few-shot prediction capability. These results suggest that a high-quality representation of the GO structure is critical for diverse biological tasks, including computational protein annotation.
Pages: 10