Daily Paper Cast

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Update: 2024-12-31

Description

🤗 Upvotes: 11 | cs.CV



Authors:

Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang



Title:

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment



Arxiv:

http://arxiv.org/abs/2412.19326v1



Abstract:

Current multimodal large language models (MLLMs) struggle with fine-grained or precise understanding of visuals, even though they offer comprehensive perception and reasoning across a spectrum of vision applications. Recent studies either develop tool use or unify specific visual tasks into the autoregressive framework, often at the expense of overall multimodal performance. To address this issue and enhance MLLMs with visual tasks in a scalable fashion, we propose Task Preference Optimization (TPO), a novel method that utilizes differentiable task preferences derived from typical fine-grained visual tasks. TPO introduces learnable task tokens that establish connections between multiple task-specific heads and the MLLM. By leveraging rich visual labels during training, TPO significantly enhances the MLLM's multimodal capabilities and task-specific performance. Through multi-task co-training within TPO, we observe synergistic benefits that elevate individual task performance beyond what is achievable through single-task training. Our instantiation of this approach with VideoChat and LLaVA demonstrates an overall 14.6% improvement in multimodal performance compared to baseline models. Additionally, MLLM-TPO demonstrates robust zero-shot capabilities across various tasks, performing comparably to state-of-the-art supervised models. The code will be released at https://github.com/OpenGVLab/TPO
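
To make the task-token mechanism concrete, below is a minimal PyTorch-style sketch of the idea the abstract describes: learnable task tokens are appended to the multimodal sequence, and their final hidden states are decoded by lightweight task-specific heads. The class and parameter names (TaskPreferenceAdapter, task_heads, the linear-head design) are illustrative assumptions, not the authors' released implementation; see the GitHub repository above for the actual code.

import torch
import torch.nn as nn

class TaskPreferenceAdapter(nn.Module):
    """Hypothetical sketch: learnable task tokens bridging an MLLM and task heads."""

    def __init__(self, hidden_dim: int, task_names: list[str]):
        super().__init__()
        # One learnable token per fine-grained vision task (e.g. segmentation,
        # grounding, tracking), trained against rich visual labels.
        self.task_tokens = nn.ParameterDict({
            name: nn.Parameter(torch.randn(1, 1, hidden_dim) * 0.02)
            for name in task_names
        })
        # One lightweight head per task; real heads would be task-appropriate
        # decoders rather than a single linear layer.
        self.task_heads = nn.ModuleDict({
            name: nn.Linear(hidden_dim, hidden_dim)
            for name in task_names
        })

    def forward(self, llm: nn.Module, inputs_embeds: torch.Tensor, task: str) -> torch.Tensor:
        # Append the chosen task token to the multimodal input embeddings.
        batch_size = inputs_embeds.size(0)
        token = self.task_tokens[task].expand(batch_size, -1, -1)
        embeds = torch.cat([inputs_embeds, token], dim=1)
        # Run the MLLM backbone (assumed to accept precomputed embeddings and
        # expose hidden states, as Hugging Face transformer models do).
        hidden = llm(inputs_embeds=embeds).last_hidden_state
        # Decode the task token's final hidden state with the matching head.
        return self.task_heads[task](hidden[:, -1])

Under this reading, multi-task co-training would route batches from different tasks through their own token and head while sharing the MLLM backbone, which is where the synergistic cross-task gains reported in the abstract would arise.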


Hosts: Jingwen Liang, Gengyu Wang