ModernVBERT: Towards Smaller Visual Document Retrievers

Update: 2025-10-04

Description

🤗 Upvotes: 24 | cs.IR

Authors:

Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse

Title:

ModernVBERT: Towards Smaller Visual Document Retrievers

Arxiv:

http://arxiv.org/abs/2510.01149v1

Abstract:

Multimodal embedding models are gaining prevalence, notably for document retrieval as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, all of which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.
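To make "late interaction centered contrastive objectives" concrete, here is a minimal sketch of the general technique, not the authors' implementation. It assumes the common MaxSim form of late interaction (each query token is matched to its most similar document token, and the per-token maxima are summed) combined with an in-batch InfoNCE-style loss. The function names, the toy embeddings, and the temperature value of 0.02 are all illustrative assumptions, not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction (MaxSim) score between two multi-vector embeddings.

    For every query token, take the highest cosine similarity against
    any document token, then sum those maxima over the query tokens.
    """
    q = [l2_normalize(v) for v in query_tokens]
    d = [l2_normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )


def info_nce_loss(queries, docs, temperature=0.02):
    """In-batch contrastive loss where docs[i] is the positive for queries[i].

    The temperature is an assumed illustrative value, not the paper's setting.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [maxsim_score(query, doc) / temperature for doc in docs]
        # Numerically stable log-sum-exp over all documents in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional token embeddings (hypothetical data).
q0 = [[1.0, 0.0], [0.0, 1.0]]
q1 = [[-1.0, 0.0]]
d0 = [[1.0, 0.0], [0.0, 1.0]]
d1 = [[-1.0, 0.0]]

aligned_loss = info_nce_loss([q0, q1], [d0, d1])
shuffled_loss = info_nce_loss([q0, q1], [d1, d0])
```

With correctly paired queries and documents the loss is close to zero, while swapping the documents makes it large. A single-vector retriever would instead pool each side into one embedding before comparing; late interaction keeps token-level (or image-patch-level) vectors and defers the comparison to scoring time.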

Jingwen Liang, Gengyu Wang

#box-pro-ellipsis-175959392722483{-webkit-line-clamp:2;}ModernVBERT: Towards Smaller Visual Document Retrievers

ModernVBERT: Towards Smaller Visual Document Retrievers

Jingwen Liang, Gengyu Wang

ModernVBERT: Towards Smaller Visual Document Retrievers