METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Update: 2025-01-08

Description

🤗 Upvotes: 13 | q-bio.GN, cs.AI, cs.CL, cs.LG

Authors:

Ollie Liu, Sami Jaghouar, Johannes Hagemann, Shangshang Wang, Jason Wiemels, Jeff Kaufman, Willie Neiswanger

Title:

METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

arXiv:

http://arxiv.org/abs/2501.02045v1

Abstract:

We pretrain METAGENE-1, a 7-billion-parameter autoregressive transformer model, which we refer to as a metagenomic foundation model, on a novel corpus of diverse metagenomic DNA and RNA sequences comprising over 1.5 trillion base pairs. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic (next-generation) sequencing methods. Unlike genomic models that focus on individual genomes or curated sets of specific species, the aim of METAGENE-1 is to capture the full distribution of genomic information present within this wastewater, to aid in tasks relevant to pandemic monitoring and pathogen detection. We carry out byte-pair encoding (BPE) tokenization on our dataset, tailored for metagenomic sequences, and then pretrain our model. In this paper, we first detail the pretraining dataset, tokenization strategy, and model architecture, highlighting the considerations and design choices that enable the effective modeling of metagenomic data. We then show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining. Finally, we demonstrate the performance of METAGENE-1, which achieves state-of-the-art results on a set of genomic benchmarks and new evaluations focused on human-pathogen detection and genomic sequence embedding, showcasing its potential for public health applications in pandemic monitoring, biosurveillance, and early detection of emerging health threats.
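As a rough illustration of the byte-pair encoding step mentioned in the abstract, the sketch below learns BPE merges over nucleotide strings and then encodes a read with them. It is a toy version under stated assumptions, not the authors' tokenizer: the abstract does not give the vocabulary size, training settings, or handling of ambiguous bases, and the names `train_bpe`, `merge_pair`, and `encode`, as well as the example reads, are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges, starting from single-nucleotide tokens."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # every read has collapsed to a single token
        # Pick the most frequent adjacent pair; break ties lexicographically
        # so the result is deterministic (an illustrative choice).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges


def encode(seq, merges):
    """Tokenize a read by applying the learned merges in training order."""
    tokens = list(seq)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


# Hypothetical toy reads; real metagenomic reads are far longer and noisier.
toy_reads = ["ACGTACGT", "ACGGACGT", "TTACGTTA"]
toy_merges = train_bpe(toy_reads, num_merges=4)
print(toy_merges)
print(encode("GGACGTAC", toy_merges))
```

Because merges only concatenate adjacent tokens, joining the output of `encode` always reproduces the input read, so the tokenization is lossless regardless of which merges were learned.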



METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring

Jingwen Liang, Gengyu Wang
