Daily Paper Cast

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning

Update: 2025-08-28

Description

🤗 Upvotes: 23 | cs.LG



Authors:

Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, Siyuan Qiao



Title:

UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning



Arxiv:

http://arxiv.org/abs/2508.18756v1



Abstract:

While Mixture of Experts (MoE) models achieve remarkable efficiency by activating only subsets of parameters, they suffer from high memory access costs during inference. Memory-layer architectures offer an appealing alternative with very few memory accesses, but previous attempts like UltraMem have only matched the performance of 2-expert MoE models, falling significantly short of state-of-the-art 8-expert configurations. We present UltraMemV2, a redesigned memory-layer architecture that closes this performance gap. Our approach introduces five key improvements: integrating memory layers into every transformer block, simplifying value expansion with single linear projections, adopting FFN-based value processing from PEER, implementing principled parameter initialization, and rebalancing memory-to-FFN computation ratios. Through extensive evaluation, we demonstrate that UltraMemV2 achieves performance parity with 8-expert MoE models under the same computation and parameter budgets, but with significantly lower memory access. Notably, UltraMemV2 shows superior performance on memory-intensive tasks, with improvements of +1.6 points on long-context memorization, +6.2 points on multi-round memorization, and +7.9 points on in-context learning. We validate our approach at scale with models of up to 2.5B activated parameters out of 120B total parameters, and establish that activation density has a greater impact on performance than total sparse parameter count. Our work brings memory-layer architectures to performance parity with state-of-the-art MoE models, presenting a compelling alternative for efficient sparse computation.
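
To make the architectural ideas in the abstract concrete, below is a minimal PyTorch sketch (not the authors' code) of a transformer block that pairs its dense FFN with a product-key memory layer: retrieved values are expanded back to the model dimension by a single linear projection and first passed through a small FFN, in the spirit of PEER. All class names, dimensions, the scoring scheme, and the memory-to-FFN balance are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of the ideas described in the abstract (assumed details, not UltraMemV2's code):
# a sparse memory layer with product-key retrieval, FFN-based value processing,
# and a single linear projection for value expansion, placed in every transformer block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProductKeyMemory(nn.Module):
    """Sparse memory layer: product-key retrieval + FFN value processing + linear expansion."""

    def __init__(self, d_model: int, n_keys: int = 128, topk: int = 8, d_value: int = 64):
        super().__init__()
        self.topk, self.n_keys = topk, n_keys
        # Two half-dimension key tables; the full key set is their Cartesian product,
        # so n_keys ** 2 memory slots are reachable with only 2 * n_keys comparisons.
        self.keys1 = nn.Parameter(torch.randn(n_keys, d_model // 2) / (d_model // 2) ** 0.5)
        self.keys2 = nn.Parameter(torch.randn(n_keys, d_model // 2) / (d_model // 2) ** 0.5)
        self.values = nn.Embedding(n_keys * n_keys, d_value)
        # FFN-based value processing (PEER-style), applied to each retrieved value.
        self.value_ffn = nn.Sequential(
            nn.Linear(d_value, 4 * d_value), nn.GELU(), nn.Linear(4 * d_value, d_value)
        )
        # "Simplified value expansion": a single linear projection back to d_model.
        self.expand = nn.Linear(d_value, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q1, q2 = x.chunk(2, dim=-1)
        v1, i1 = (q1 @ self.keys1.T).topk(self.topk, dim=-1)
        v2, i2 = (q2 @ self.keys2.T).topk(self.topk, dim=-1)
        # Combine the two top-k lists into topk * topk candidate slots, keep the best topk.
        scores = v1.unsqueeze(-1) + v2.unsqueeze(-2)             # (B, T, k, k)
        idx = i1.unsqueeze(-1) * self.n_keys + i2.unsqueeze(-2)  # flat slot ids
        scores, flat = scores.flatten(-2).topk(self.topk, dim=-1)
        idx = idx.flatten(-2).gather(-1, flat)
        w = F.softmax(scores, dim=-1).unsqueeze(-1)              # (B, T, k, 1)
        vals = self.value_ffn(self.values(idx))                  # (B, T, k, d_value)
        return self.expand((w * vals).sum(dim=-2))               # (B, T, d_model)


class Block(nn.Module):
    """Transformer block with a memory layer in every block, alongside the dense FFN."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.memory = ProductKeyMemory(d_model)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.ln2(x)
        # The 1:1 memory/FFN split here is arbitrary; the paper rebalances this ratio.
        return x + self.ffn(h) + self.memory(h)


if __name__ == "__main__":
    x = torch.randn(2, 16, 256)
    print(Block()(x).shape)  # torch.Size([2, 16, 256])
```

The point of the sketch is the access pattern: per token, only topk of the n_keys² value slots are read, which is why memory-layer designs need far fewer memory accesses at inference than routing tokens through full expert FFNs in an MoE.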

Hosts: Jingwen Liang, Gengyu Wang