How to Synthesize Text Data without Model Collapse?

Update: 2024-12-21

Description

🤗 Upvotes: 19 | cs.CL, cs.AI, cs.LG

Authors:

Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou

Title:

How to Synthesize Text Data without Model Collapse?

Arxiv:

http://arxiv.org/abs/2412.14689v1

Abstract:

Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.
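
The abstract names token-level editing of human-produced text but does not spell out how tokens are chosen or replaced. The sketch below is one possible reading, not the authors' implementation: it assumes a prior language model has already assigned each token a probability, and it replaces only the tokens whose probability meets a threshold, leaving the rest of the human text untouched. The function name edit_tokens, the threshold value, the probabilities, and the resample callback are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, List, Sequence


def edit_tokens(
    tokens: Sequence[str],
    token_probs: Sequence[float],
    resample: Callable[[int], str],
    threshold: float = 0.99,
) -> List[str]:
    """Return a semi-synthetic copy of `tokens`.

    Assumption (not stated in the abstract): a token is edited when its
    prior-model probability is >= threshold. Edited positions are filled
    by resample(position); every other token is kept unchanged.
    """
    if len(tokens) != len(token_probs):
        raise ValueError("tokens and token_probs must have the same length")
    edited = []
    for i, (tok, p) in enumerate(zip(tokens, token_probs)):
        edited.append(resample(i) if p >= threshold else tok)
    return edited


# Toy usage with made-up probabilities and a made-up resampler.
human = ["the", "cat", "sat", "on", "the", "mat"]
probs = [0.50, 0.20, 0.30, 0.995, 0.999, 0.40]
alternatives = {3: "upon", 4: "a"}  # hypothetical replacement tokens
semi_synthetic = edit_tokens(human, probs, lambda i: alternatives[i])
print(semi_synthetic)
```

Because most positions fall below the threshold, the output stays anchored to the human-written sequence, which is the intuition behind calling the result semi-synthetic rather than fully synthetic.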


Jingwen Liang, Gengyu Wang
