Benchmarking DNA large language models on quadruplexes

Cited by: 0
Authors
Cherednichenko, Oleksandr [1 ]
Herbert, Alan [1 ,2 ]
Poptsova, Maria [1 ]
Affiliations
[1] HSE University, International Laboratory of Bioinformatics, Moscow, Russia
[2] InsideOutBio, Charlestown, MA, USA
Source
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2025, Vol. 27
Keywords
Foundation model; Large language model; DNABERT; HyenaDNA; MAMBA-DNA; Caduceus; Flipons; Non-B DNA; G-quadruplexes
DOI
10.1016/j.csbj.2025.03.007
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject Classification Codes
071010; 081704
Abstract
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated on genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked the three types of LLM architecture for generating whole-genome maps of G-quadruplexes (GQs), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although GQs form from the folding of guanosine residues into tetrads, the computational task is challenging because the bases involved may lie on different strands, be separated by a large number of nucleotides, or be made from RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC. Analysis of the whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and to chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping among the de novo generated quadruplexes, while the transformer-based models clustered together. Overall, our findings suggest that the different types of LLM complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model for the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs.
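To illustrate the task and the reported evaluation metrics, the sketch below shows a classic regex baseline for the canonical intramolecular G-quadruplex motif (four runs of three or more guanines separated by loops of 1-7 nucleotides) together with the F1 and Matthews correlation coefficient (MCC) computed from a binary confusion matrix. This is a minimal illustration, not the paper's benchmarking code; the function names are assumptions.

```python
import re

# Canonical intramolecular G4 motif: G{3,} followed by three loop+G-run units.
# A regex baseline like this only finds same-strand canonical motifs, which is
# exactly the limitation that motivates sequence-model approaches.
G4_PATTERN = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")

def find_g4_motifs(seq: str):
    """Return (start, end) spans of canonical G4 motifs on the given strand."""
    return [m.span() for m in G4_PATTERN.finditer(seq.upper())]

def f1_and_mcc(y_true, y_pred):
    """F1 and Matthews correlation coefficient from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc
```

MCC is often preferred alongside F1 for genome-wide annotation because true negatives (non-GQ positions) vastly outnumber positives, and MCC accounts for all four cells of the confusion matrix.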
Pages: 992-1000
Page count: 9
Related Papers
50 items in total
  • [41] Workshop on Large Language Models' Interpretability and Trustworthiness (LLMIT)
    Saha, Tulika
    Ganguly, Debasis
    Saha, Sriparna
    Mitra, Prasenjit
    PROCEEDINGS OF THE 32ND ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT, CIKM 2023, 2023, : 5290 - 5293
  • [42] A Concise Review of Long Context in Large Language Models
    Huang, Haitao
    Liang, Zijing
    Fang, Zirui
    Wang, Zhiyuan
    Chen, Mingxiu
    Hong, Yifan
    Liu, Ke
    Shang, Penghui
    PROCEEDINGS OF INTERNATIONAL CONFERENCE ON ALGORITHMS, SOFTWARE ENGINEERING, AND NETWORK SECURITY, ASENS 2024, 2024, : 563 - 566
  • [43] Large language models in laparoscopic surgery: A transformative opportunity
    Ray, Partha Pratim
LAPAROSCOPIC ENDOSCOPIC AND ROBOTIC SURGERY, 2024, 7 (04): 174 - 180
  • [44] The use of large language models in medicine: proceeding with caution
    Deng, Jiawen
    Zubair, Areeba
    Park, Ye-Jean
    Affan, Eesha
    Zuo, Qi Kang
    CURRENT MEDICAL RESEARCH AND OPINION, 2024, 40 (02) : 151 - 153
  • [45] The First Instruct-Following Large Language Models for Hungarian
    Yang, Zijian Gyozo
    Dode, Reka
    Ferenczi, Gergo
    Hatvani, Peter
    Heja, Eniko
    Madarasz, Gabor
    Ligeti-Nagy, Noemi
    Sarossy, Bence
    Szaniszlo, Zsofia
    Varadi, Tamas
    Verebelyi, Tamas
    Proszeky, Gabor
    2024 IEEE 3RD CONFERENCE ON INFORMATION TECHNOLOGY AND DATA SCIENCE, CITDS 2024, 2024, : 247 - 252
  • [46] Are large language models qualified reviewers in originality evaluation?
    Huang, Shengzhi
    Huang, Yong
    Liu, Yinpeng
    Luo, Zhuoran
    Lu, Wei
    INFORMATION PROCESSING & MANAGEMENT, 2025, 62 (03)
  • [47] Prompting Large Language Models for Automatic Question Tagging
    Xu, Nuojia
    Xue, Dizhan
    Qian, Shengsheng
    Fang, Quan
    Hu, Jun
MACHINE INTELLIGENCE RESEARCH, 2025
  • [48] Improving Machine Translation Formality with Large Language Models
    Yang, Murun
    Li, Fuxue
CMC-COMPUTERS MATERIALS & CONTINUA, 2025, 82 (02): 2061 - 2075
  • [49] Large Language Models in Healthcare and Medical Domain: A Review
    Nazi, Zabir Al
    Peng, Wei
INFORMATICS-BASEL, 2024, 11 (03)
  • [50] An analysis of large language models: their impact and potential applications
    Bharathi Mohan, G.
    Prasanna Kumar, R.
    Vishal Krishh, P.
    Keerthinathan, A.
    Lavanya, G.
    Meghana, Meka Kavya Uma
    Sulthana, Sheba
    Doss, Srinath
    KNOWLEDGE AND INFORMATION SYSTEMS, 2024, 66 (09) : 5047 - 5070