Benchmarking DNA large language models on quadruplexes

Cited by: 0
Authors
Cherednichenko, Oleksandr [1 ]
Herbert, Alan [1 ,2 ]
Poptsova, Maria [1 ]
Affiliations
[1] HSE Univ, Int Lab Bioinformat, Moscow, Russia
[2] InsideOutBio, Charlestown, MA USA
Source
COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL | 2025, Vol. 27
Keywords
Foundation model; Large language model; DNABERT; HyenaDNA; MAMBA-DNA; Caduceus; Flipons; Non-B DNA; G-quadruplexes;
DOI
10.1016/j.csbj.2025.03.007
CLC classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010; 081704;
Abstract
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated using genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked three different types of LLM architectures for generating whole-genome maps of G-quadruplexes (GQ), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although GQs form by the folding of guanosine residues into tetrads, the computational task is challenging because the bases involved may lie on different strands, be separated by a large number of nucleotides, or be made of RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC scores. Analysis of whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping among the de novo generated quadruplexes, while transformer-based models clustered together. Overall, our findings suggest that different types of LLMs complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model based on the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs
Pages: 992-1000
Page count: 9
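The abstract compares models by F1 and MCC. As a minimal illustrative sketch (not the paper's evaluation code; the toy labels below are invented), here is how the two metrics are derived from a binary confusion matrix over, e.g., per-nucleotide GQ calls:

```python
import math

def f1_mcc(y_true, y_pred):
    """Compute F1 and Matthews correlation coefficient (MCC) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # F1 balances precision and recall; it ignores true negatives.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC uses all four cells, so it stays informative on the highly
    # imbalanced labels typical of genome-wide annotation tasks.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

# Toy example: 1 marks a position inside an annotated/predicted GQ region.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
f1, mcc = f1_mcc(y_true, y_pred)  # -> (0.75, 0.5)
```

MCC is often preferred alongside F1 for non-B DNA prediction because GQ-forming positions are a small minority of the genome, and F1 alone can look optimistic under such class imbalance.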