Dynamic Scaling of Unit Tests for Code Reward Modeling

Update: 2025-01-04

Description

🤗 Upvotes: 13 | cs.CL, cs.SE

Authors:

Zeyao Ma, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang

Title:

Dynamic Scaling of Unit Tests for Code Reward Modeling

Arxiv:

http://arxiv.org/abs/2501.01054v1

Abstract:

Current large language models (LLMs) often struggle to produce accurate responses on the first attempt for complex reasoning tasks like code generation. Prior research tackles this challenge by generating multiple candidate solutions and validating them with LLM-generated unit tests. The execution results of the unit tests serve as reward signals to identify correct solutions. Because LLMs often make mistakes confidently, these unit tests are unreliable, which diminishes the quality of the reward signals. Motivated by the observation that scaling the number of solutions improves LLM performance, we explore the impact of scaling unit tests to enhance reward signal quality. Our preliminary experiment reveals a positive correlation between the number of unit tests and reward signal quality, with greater benefits observed on more challenging problems. Based on these insights, we propose CodeRM-8B, a lightweight yet effective unit test generator that enables efficient and high-quality unit test scaling. Additionally, we implement a dynamic scaling mechanism that adapts the number of unit tests based on problem difficulty, further improving efficiency. Experimental results show that our approach significantly improves performance across various models on three benchmarks (e.g., with gains of 18.43% for Llama3-8B and 3.42% for GPT-4o-mini on HumanEval Plus).
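
The selection step described in the abstract, where unit test execution results act as a reward signal over several candidate solutions, can be illustrated with a small sketch. This is not the paper's implementation: the names `passes`, `rewards`, and `select_best`, the toy "square a number" task, and the representation of a unit test as an (input, expected output) pair are all assumptions made for illustration. One of the generated tests is deliberately wrong, mimicking an unreliable LLM-written test.

```python
from typing import Callable, List, Tuple

# Hypothetical representations, chosen only for this sketch.
Solution = Callable[[int], int]
UnitTest = Tuple[int, int]  # (input, expected output)


def passes(solution: Solution, test: UnitTest) -> bool:
    """Run one unit test; a crash counts as a failure."""
    arg, expected = test
    try:
        return solution(arg) == expected
    except Exception:
        return False


def rewards(solutions: List[Solution], tests: List[UnitTest]) -> List[int]:
    """Reward of each candidate = number of unit tests it passes."""
    return [sum(passes(s, t) for t in tests) for s in solutions]


def select_best(solutions: List[Solution], tests: List[UnitTest]) -> int:
    """Index of the highest-reward candidate (first one wins ties)."""
    scores = rewards(solutions, tests)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy task: return the square of x.
candidates: List[Solution] = [
    lambda x: x * 2,                    # wrong, but agrees with x*x at x = 2
    lambda x: x * x,                    # correct
    lambda x: x * x if x >= 0 else 0,   # wrong for negative inputs
]

# Generated tests; the last one is itself incorrect (2 squared is not 5).
tests: List[UnitTest] = [(2, 4), (3, 9), (-3, 9), (2, 5)]

print(rewards(candidates, tests))          # reward per candidate, all tests
print(select_best(candidates, tests))      # choice using all four tests
print(select_best(candidates, tests[:1]))  # choice using only the first test
```

With a single test every candidate passes and the tie falls to the first (incorrect) candidate; with all four tests the correct candidate stands out even though one test is wrong. This is the intuition behind scaling the number of unit tests, shown here on a toy case only.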

Jingwen Liang, Gengyu Wang
