T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Update: 2025-12-26

Description

🤗 Upvotes: 23 | cs.CV

Authors:

Zhe Cao, Tao Wang, Jiaming Wang, Yanghai Wang, Yuanxing Zhang, Jialu Chen, Miao Deng, Jiahao Wang, Yubin Guo, Chenxi Liao, Yize Zhang, Zhaoxiang Zhang, Jiaheng Liu

Title:

T2AV-Compass: Towards Unified Evaluation for Text-to-Audio-Video Generation

Arxiv:

http://arxiv.org/abs/2512.21094v1

Abstract:

Text-to-Audio-Video (T2AV) generation aims to synthesize temporally coherent video and semantically synchronized audio from natural language, yet its evaluation remains fragmented, often relying on unimodal metrics or narrowly scoped benchmarks that fail to capture cross-modal alignment, instruction following, and perceptual realism under complex prompts. To address this limitation, we present T2AV-Compass, a unified benchmark for comprehensive evaluation of T2AV systems, consisting of 500 diverse and complex prompts constructed via a taxonomy-driven pipeline to ensure semantic richness and physical plausibility. T2AV-Compass also introduces a dual-level evaluation framework that integrates objective signal-level metrics for video quality, audio quality, and cross-modal alignment with a subjective MLLM-as-a-Judge protocol for instruction following and realism assessment. Extensive evaluation of 11 representative T2AV systems reveals that even the strongest models fall substantially short of human-level realism and cross-modal consistency, with persistent failures in audio realism, fine-grained synchronization, and instruction following, among others. These results indicate substantial room for improvement in future models and highlight the value of T2AV-Compass as a challenging and diagnostic testbed for advancing text-to-audio-video generation.


Jingwen Liang, Gengyu Wang
