Benchmarking DNA large language models on quadruplexes

Cited by: 0
Authors
Cherednichenko, Oleksandr [1 ]
Herbert, Alan [1 ,2 ]
Poptsova, Maria [1 ]
Affiliations
[1] HSE Univ, Int Lab Bioinformat, Moscow, Russia
[2] InsideOutBio, Charlestown, MA USA
Source
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL | 2025, Vol. 27
Keywords
Foundation model; Large language model; DNABERT; HyenaDNA; MAMBA-DNA; Caduceus; Flipons; Non-B DNA; G-quadruplexes;
DOI
10.1016/j.csbj.2025.03.007
CLC classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010; 081704;
Abstract
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated using genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked three different types of LLM architectures for generating whole-genome maps of G-quadruplexes (GQ), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although GQs form by the folding of guanosine residues into tetrads, the computational task is challenging because the bases involved may lie on different strands, be separated by a large number of nucleotides, or be made of RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC scores. Analysis of whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping among the de novo generated quadruplexes, while transformer-based models clustered together. Overall, our findings suggest that different types of LLMs complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model based on the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs
Pages: 992-1000
Page count: 9
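The abstract compares models by F1 and MCC. As a minimal illustrative sketch (not the paper's evaluation code; the toy labels below are invented), here is how the two metrics are derived from a binary confusion matrix over, e.g., per-nucleotide GQ calls:

```python
import math

def f1_mcc(y_true, y_pred):
    """Compute F1 and Matthews correlation coefficient (MCC) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # F1 balances precision and recall; it ignores true negatives.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC uses all four cells, so it stays informative on the highly
    # imbalanced labels typical of genome-wide annotation tasks.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Toy example: 1 marks a position inside an annotated/predicted GQ region.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
f1, mcc = f1_mcc(y_true, y_pred)  # -> (0.75, 0.5)
```

MCC is often preferred alongside F1 for non-B DNA prediction because GQ-forming positions are a small minority of the genome, and F1 alone can look optimistic under such class imbalance.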