DNABERT-based explainable lncRNA identification in plant genome assemblies

Citations: 5
Authors
Danilevicz, Monica F. [1]
Gill, Mitchell [1]
Fernandez, Cassandria G. Tay [1]
Petereit, Jakob [1]
Upadhyaya, Shriprabha R. [1]
Batley, Jacqueline [1]
Bennamoun, Mohammed [2]
Edwards, David [1]
Bayer, Philipp E. [1]
Affiliations
[1] Univ Western Australia, Sch Biol Sci, Perth, Australia
[2] Univ Western Australia, Sch Phys Math & Comp, Perth, Australia
Funding
Australian Research Council
Keywords
LncRNAs; Natural language processing; Deep learning; Genomic motif; Cross-species prediction; LONG NONCODING RNAS; IMMUNE-RESPONSE; CHROMATIN; SEQUENCE; CONSERVATION; TRANSCRIPTS; DATABASE; FLC;
DOI
10.1016/j.csbj.2023.11.025
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
Long non-coding ribonucleic acids (lncRNAs) have been shown to play an important role in plant gene regulation, involving both epigenetic and transcript regulation. LncRNAs are transcripts longer than 200 nucleotides that are not translated into functional proteins but can be translated into small peptides. Machine learning models have predominantly used transcriptome data with manually defined features to detect lncRNAs; however, they often underrepresent the abundance of lncRNAs and can be biased in their detection. Here we present a study using Natural Language Processing (NLP) models to identify plant lncRNAs from genomic sequences rather than transcriptomic data. The NLP models were trained to predict lncRNAs for seven model and crop species (Zea mays, Arabidopsis thaliana, Brassica napus, Brassica oleracea, Brassica rapa, Glycine max and Oryza sativa) using publicly available genomic references. We demonstrated that lncRNAs can be accurately predicted from genomic sequences, with the highest accuracy of 83.4% for Z. mays and the lowest accuracy of 57.9% for B. rapa, revealing that genome assembly quality might affect the accuracy of lncRNA identification. Furthermore, we demonstrated the potential of using NLP models for cross-species prediction, with an average of 63.1% accuracy on target species not previously seen by the model. As more species are incorporated into the training datasets, we expect the accuracy to increase, making the approach a more reliable tool for uncovering novel lncRNAs. Finally, we show that the models can be interpreted using explainable artificial intelligence to identify motifs important to lncRNA prediction and that these motifs frequently flanked the lncRNA sequence.
Pages: 5676-5685
Page count: 10