245: Benchmarking DNA foundation models


Update: 2025-12-31

Description

Episode 245: Benchmarking DNA foundation models


In this episode of PaperCast Base by Base, we explore a comprehensive, unbiased benchmark that compares five DNA foundation models across 57 datasets and multiple tasks. It finds that mean token embeddings improve classification and that model strengths vary by task and pre-training.


Study Highlights:
The study evaluated DNABERT-2, NT-v2, HyenaDNA, Caduceus-Ph, and GROVER on 57 datasets spanning sequence classification, gene expression prediction, variant effect quantification, and TAD recognition. Mean token embedding consistently and significantly outperformed summary-token and max pooling for sequence classification. Model performance was task-dependent: Caduceus-Ph excelled at human TFBS and promoter tasks, NT-v2 led pathogenic variant identification, and HyenaDNA scaled efficiently and benefited from multi-species pre-training, while specialized models outperformed general-purpose foundation models on QTL prediction. Zero-shot embeddings provided only modest gene expression prediction, and NT-v2 attention patterns did not reveal inherent TAD recognition.
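
To make the three pooling strategies named above concrete, here is a minimal sketch in plain Python. It is not the code used in the paper: the function names, the toy embeddings, and the padding mask are illustrative assumptions, and the summary token is assumed to sit at position 0, which may differ between models.

```python
from typing import List

# One sequence: a list of per-token embedding vectors, shape (num_tokens, hidden_dim).
Matrix = List[List[float]]


def _real_tokens(token_embeddings: Matrix, mask: List[int]) -> Matrix:
    """Keep only the rows whose mask entry is non-zero (i.e. drop padding)."""
    kept = [row for row, m in zip(token_embeddings, mask) if m]
    if not kept:
        raise ValueError("mask excludes every token")
    return kept


def mean_pool(token_embeddings: Matrix, mask: List[int]) -> List[float]:
    """Mean token embedding: average each dimension over non-padding tokens."""
    kept = _real_tokens(token_embeddings, mask)
    dim = len(kept[0])
    return [sum(row[d] for row in kept) / len(kept) for d in range(dim)]


def max_pool(token_embeddings: Matrix, mask: List[int]) -> List[float]:
    """Max pooling: take the per-dimension maximum over non-padding tokens."""
    kept = _real_tokens(token_embeddings, mask)
    dim = len(kept[0])
    return [max(row[d] for row in kept) for d in range(dim)]


def summary_token_pool(token_embeddings: Matrix, index: int = 0) -> List[float]:
    """Summary-token pooling: use one designated token's embedding as-is.

    Assumption: the summary token is at `index` (0 here); check the model's
    tokenizer before relying on this.
    """
    return list(token_embeddings[index])


# Toy input: four tokens with two-dimensional embeddings; the last token is padding.
embeddings = [[1.0, 0.0], [3.0, 4.0], [5.0, 2.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(embeddings, mask))      # [3.0, 2.0]
print(max_pool(embeddings, mask))       # [5.0, 4.0]
print(summary_token_pool(embeddings))   # [1.0, 0.0]
```

Each function turns per-token vectors into a single sequence-level vector that a downstream classifier could consume. The mask matters for mean and max pooling because padding tokens would otherwise distort the result.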


Conclusion:
Mean token pooling yields more robust sequence-level representations, and model choice should align with task, input length, and pre-training data for the best genomic performance.


Music:
Enjoy the music based on this article at the end of the episode.


Reference:
Feng H, Wu L, Zhao B, Huff C, Zhang J, Wu J, Lin L, Wei P & Wu C. Benchmarking DNA foundation models for genomic and genetic tasks. Nat Commun. 2025;16:10780. https://doi.org/10.1038/s41467-025-65823-8


License:
This episode is based on an open-access article published under the Creative Commons Attribution 4.0 International License (CC BY 4.0) – https://creativecommons.org/licenses/by/4.0/


Support:
Base by Base – Stripe donations: https://donate.stripe.com/7sY4gz71B2sN3RWac5gEg00


Official website https://basebybase.com


Castos player https://basebybase.castos.com


On PaperCast Base by Base you’ll discover the latest in genomics, functional genomics, structural genomics, and proteomics.


Episode link: https://basebybase.castos.com/episodes/dna-foundation-models-benchmark


Episode Slug: dna-foundation-models-benchmark


Keywords: DNA foundation models, mean token embedding, sequence classification, variant effect, gene expression


Gustavo Barra