Benchmarking DNA large language models on quadruplexes

Cited by: 0
Authors
Cherednichenko, Oleksandr [1]
Herbert, Alan [1,2]
Poptsova, Maria [1]
Affiliations
[1] HSE Univ, Int Lab Bioinformat, Moscow, Russia
[2] InsideOutBio, Charlestown, MA, USA
Source
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL | 2025, Vol. 27
Keywords
Foundation model; Large language model; DNABERT; HyenaDNA; MAMBA-DNA; Caduceus; Flipons; Non-B DNA; G-quadruplexes
DOI
10.1016/j.csbj.2025.03.007
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated on genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked these three types of LLM architectures for generating whole-genome maps of G-quadruplexes (GQ), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although a GQ forms by folding guanosine residues into tetrads, the computational task is challenging because the bases involved may lie on different strands, be separated by a large number of nucleotides, or be made of RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC. Analysis of whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and to chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping among the de novo generated quadruplexes, while the transformer-based models clustered together. Overall, our findings suggest that different types of LLMs complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model for the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs.
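As a minimal, illustrative sketch of the scoring the abstract refers to (not code from the paper's repository linked above), the Python snippet below computes F1 and the Matthews correlation coefficient (MCC) for binary G-quadruplex calls using scikit-learn; the function name and the toy 0/1 label arrays are assumptions made purely for illustration.

import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

def score_gq_predictions(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    # Per-window binary labels: 1 = window overlaps a G-quadruplex, 0 = it does not.
    return {
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
    }

if __name__ == "__main__":
    # Toy labels, purely illustrative; a real evaluation would compare
    # experimentally mapped quadruplexes against model annotations.
    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
    print(score_gq_predictions(y_true, y_pred))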
Pages: 992-1000
Number of pages: 9