245: Benchmarking DNA foundation models

Update: 2025-12-31

Description

Episode 245: Benchmarking DNA foundation models

In this episode of PaperCast Base by Base, we explore a comprehensive, unbiased benchmark that compares five DNA foundation models across 57 datasets and multiple tasks. It finds that mean token embeddings improve classification and that model strengths vary by task and pre-training.

Study Highlights:
The study evaluated DNABERT-2, NT-v2, HyenaDNA, Caduceus-Ph, and GROVER on 57 datasets spanning sequence classification, gene expression prediction, variant effect quantification, and TAD recognition. For sequence classification, mean token embedding consistently and significantly outperformed both summary-token and max pooling. Model performance was task-dependent: Caduceus-Ph excelled at human TFBS and promoter tasks, NT-v2 led pathogenic variant identification, and HyenaDNA scaled efficiently and benefited from multi-species pre-training, while specialized models outperformed general-purpose foundation models on QTL prediction. Zero-shot embeddings gave only modest gene expression prediction, and NT-v2 attention patterns did not reveal inherent TAD recognition.
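For listeners who want to see what the three pooling strategies compared in the study actually compute, here is a minimal sketch in plain Python. It is not the authors' code. The function names, the toy 2-dimensional embeddings, the padding mask, and the assumption that the summary token sits at index 0 (as a [CLS]-style token typically does) are all illustrative choices, not details taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def _kept_tokens(tokens: Sequence[Sequence[float]], mask: Sequence[int]) -> List[Sequence[float]]:
    """Return only the token embeddings whose mask value is non-zero (non-padding)."""
    kept = [tok for tok, m in zip(tokens, mask) if m]
    if not kept:
        raise ValueError("mask excludes every token")
    return kept


def mean_pool(tokens: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Mean token embedding: average each dimension over non-padding tokens."""
    kept = _kept_tokens(tokens, mask)
    dim = len(kept[0])
    return [sum(tok[d] for tok in kept) / len(kept) for d in range(dim)]


def max_pool(tokens: Sequence[Sequence[float]], mask: Sequence[int]) -> Vector:
    """Max pooling: take the per-dimension maximum over non-padding tokens."""
    kept = _kept_tokens(tokens, mask)
    dim = len(kept[0])
    return [max(tok[d] for tok in kept) for d in range(dim)]


def summary_token_pool(tokens: Sequence[Sequence[float]], summary_index: int = 0) -> Vector:
    """Summary-token pooling: use one designated token's embedding as-is.

    Assumption: the summary token is at position `summary_index` (0 here).
    """
    return list(tokens[summary_index])


if __name__ == "__main__":
    # Toy sequence: three real tokens plus one padding token (mask value 0).
    toy_tokens = [[1.0, 0.0], [3.0, 4.0], [5.0, 2.0], [9.0, 9.0]]
    toy_mask = [1, 1, 1, 0]
    print("mean   :", mean_pool(toy_tokens, toy_mask))
    print("max    :", max_pool(toy_tokens, toy_mask))
    print("summary:", summary_token_pool(toy_tokens))
```

In a real pipeline the token embeddings would come from a model's final hidden layer, and the pooled vector would feed a downstream classifier. The sketch only shows how the three sequence-level summaries differ: mean pooling uses every real token, max pooling keeps per-dimension extremes, and summary-token pooling relies on a single position.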

Conclusion:
Mean token pooling yields more robust sequence-level representations, and model choice should align with task, input length, and pre-training data for best genomic performance.

Music:
Enjoy the music based on this article at the end of the episode.

Reference:
Feng H, Wu L, Zhao B, Huff C, Zhang J, Wu J, Lin L, Wei P & Wu C. Benchmarking DNA foundation models for genomic and genetic tasks. Nat Commun. 2025;16:10780. https://doi.org/10.1038/s41467-025-65823-8

License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/

Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00

Official website https://basebybase.com

Castos player https://basebybase.castos.com

On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.

Episode link: https://basebybase.castos.com/episodes/dna-foundation-models-benchmark

Episode Slug: dna-foundation-models-benchmark

Keywords: DNA foundation models, mean token embedding, sequence classification, variant effect, gene expression

Gustavo Barra
