Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Update: 2025-12-18

Description

🤗 Upvotes: 39 | cs.CV, cs.AI

Authors:

Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

Title:

Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Arxiv:

http://arxiv.org/abs/2512.12675v1

Abstract:

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.
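The abstract names attention-based masking as one ingredient of the second training stage but gives no details. As a rough illustration of the general idea only, and not Scone's actual implementation, the toy sketch below shows how a mask can stop attention from reading distractor tokens in a reference image. The function name `masked_attention`, the `relevance` scores, and the `threshold` value are all hypothetical and do not come from the paper or its repository.

```python
import numpy as np

def masked_attention(query, keys, values, relevance, threshold=0.5):
    """Toy single-query attention that ignores low-relevance reference tokens.

    `relevance` is a hypothetical per-token score in [0, 1] saying how strongly
    each reference token belongs to the requested subject (an assumption for
    illustration; the abstract does not say how such scores are obtained).
    """
    keep = relevance >= threshold              # boolean mask over reference tokens
    if not keep.any():
        raise ValueError("mask removes every token")
    # Scaled dot-product scores between the query and every key.
    scores = keys @ query / np.sqrt(query.shape[0])
    # Masked tokens get -inf so they receive exactly zero weight after softmax.
    scores = np.where(keep, scores, -np.inf)
    weights = np.exp(scores - scores[keep].max())
    weights = weights / weights.sum()
    return weights @ values, weights

# Four reference tokens: the first two belong to the target subject,
# the last two to a distractor candidate in the same image.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
query = rng.normal(size=8)
relevance = np.array([0.9, 0.8, 0.1, 0.2])

output, weights = masked_attention(query, keys, values, relevance)
```

In this sketch the distractor tokens contribute nothing to `output`, which is the intuition behind "minimizing interference"; how Scone derives and applies its masks is described in the paper and repository linked above.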

Jingwen Liang, Gengyu Wang