Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Update: 2024-12-31

Description

🤗 Upvotes: 11 | cs.CV

Authors:

Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang

Title:

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Arxiv:

http://arxiv.org/abs/2412.19326v1

Abstract:

Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, even though they provide comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either rely on tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO
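The abstract does not spell out how the task tokens connect the heads to the MLLM, so the following is only a toy sketch of one possible reading, not the authors' implementation. It assumes that each task has a dedicated learnable token, that the hidden state at that token's position is passed to the matching task head, and that the head's loss is added to the language-modeling loss. Every name here (`TASK_TOKENS`, `toy_backbone`, `route`, `total_loss`, the `<track>` and `<ground>` tokens, the L1 loss, the head output sizes) is hypothetical, and the backbone is a running-mean stand-in rather than a real MLLM. No gradients are computed; the sketch only shows the routing and the combined objective.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size (assumption)

# Hypothetical mapping from task token to task name.
TASK_TOKENS = {"<track>": "tracking", "<ground>": "temporal_grounding"}

def rand_vec(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]

# One embedding per task token; in training these would be learnable.
task_embeddings = {tok: rand_vec(HIDDEN) for tok in TASK_TOKENS}
text_embeddings = {}  # lazily created embeddings for ordinary tokens

# Toy linear heads: tracking predicts 4 box values, grounding predicts 2 timestamps.
heads = {
    "tracking": [rand_vec(HIDDEN) for _ in range(4)],
    "temporal_grounding": [rand_vec(HIDDEN) for _ in range(2)],
}

def embed(token):
    if token in task_embeddings:
        return task_embeddings[token]
    return text_embeddings.setdefault(token, rand_vec(HIDDEN))

def toy_backbone(embeddings):
    """Stand-in for the MLLM: causal running mean over the sequence."""
    states, running = [], [0.0] * HIDDEN
    for i, emb in enumerate(embeddings, start=1):
        running = [r + x for r, x in zip(running, emb)]
        states.append([r / i for r in running])
    return states

def linear(weights, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def route(tokens, hidden_states):
    """Send the hidden state at each task-token position to its head."""
    outputs = {}
    for pos, tok in enumerate(tokens):
        if tok in TASK_TOKENS:
            task = TASK_TOKENS[tok]
            outputs[task] = linear(heads[task], hidden_states[pos])
    return outputs

def l1_loss(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(lm_loss, task_outputs, task_targets, weight=1.0):
    """Language-modeling loss plus weighted task-head losses."""
    task_term = sum(
        l1_loss(task_outputs[t], task_targets[t])
        for t in task_outputs if t in task_targets
    )
    return lm_loss + weight * task_term

tokens = ["where", "is", "the", "cat", "<track>"]
hidden = toy_backbone([embed(t) for t in tokens])
outputs = route(tokens, hidden)
loss = total_loss(0.5, outputs, {"tracking": [1.0, 1.0, 1.0, 1.0]})
```

In this sketch only the `tracking` head fires, because `<track>` is the only task token in the sequence. Co-training several tasks would correspond to batches that mix different task tokens, so that all heads contribute to the same shared backbone.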




Jingwen Liang, Gengyu Wang
