VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

Update: 2025-01-03

Description

🤗 Upvotes: 2 | cs.CV



Authors:

Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He



Title:

VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control



Arxiv:

http://arxiv.org/abs/2412.20800v1



Abstract:

While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
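The value-mixed cross-attention described in the abstract can be pictured as a standard cross-attention whose attention map is computed from the content text embedding only (preserving image-text alignment), while the values additionally receive an aesthetic branch that passes through a zero-initialized linear layer, so the adapter starts out as a no-op on the pretrained model. The sketch below is a minimal PyTorch illustration under those assumptions; the class and argument names (ValueMixedCrossAttention, aes_emb) and the exact mixing rule are hypothetical, not taken from the authors' code, and it assumes the aesthetic embedding has the same token length as the content embedding.

import torch
import torch.nn as nn

class ValueMixedCrossAttention(nn.Module):
    # Hypothetical sketch of value-mixed cross-attention, not the official VMix code.
    def __init__(self, dim, ctx_dim, heads=8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)
        # Extra value branch for the aesthetic embedding; the zero-initialized
        # projection makes the new branch contribute nothing at initialization,
        # so the pretrained model's behavior is preserved at the start of training.
        self.to_v_aes = nn.Linear(ctx_dim, dim, bias=False)
        self.zero_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, content_emb, aes_emb):
        # x: (B, N, dim) latent tokens; content_emb, aes_emb: (B, L, ctx_dim)
        b, n, _ = x.shape
        h = self.heads
        d = x.shape[-1] // h

        def split_heads(t):
            # (B, L, dim) -> (B*h, L, d)
            return t.reshape(b, -1, h, d).transpose(1, 2).reshape(b * h, -1, d)

        q = split_heads(self.to_q(x))
        k = split_heads(self.to_k(content_emb))
        # Attention map uses the content embedding only, keeping image-text alignment.
        attn = torch.softmax(q @ k.transpose(-1, -2) * d ** -0.5, dim=-1)

        v = split_heads(self.to_v(content_emb))
        v_aes = split_heads(self.zero_proj(self.to_v_aes(aes_emb)))
        # Values mix content and aesthetic information.
        out = attn @ (v + v_aes)

        out = out.reshape(b, h, n, d).transpose(1, 2).reshape(b, n, h * d)
        return self.to_out(out)

A short usage check of the sketch (shapes chosen to resemble an SD-style UNet block; at initialization the output equals vanilla cross-attention because the aesthetic branch is zeroed):

block = ValueMixedCrossAttention(dim=320, ctx_dim=768)
x = torch.randn(2, 64, 320)          # latent tokens
content = torch.randn(2, 77, 768)    # content (CLIP) text embedding
aes = torch.randn(2, 77, 768)        # aesthetic embedding, same length assumed
y = block(x, content, aes)           # (2, 64, 320)

In an adapter-style setup such as the one the abstract describes, a block like this would typically wrap the frozen UNet's cross-attention layers so that only the aesthetic branch is trained, which is what allows it to be dropped into community models without retraining them.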
