MetaStone-S1: Reflective Generative AI for Test-Time Scaling

Update: 2025-09-02
Description

This document introduces MetaStone-S1, a reflective generative model designed for Test-Time Scaling (TTS) in large language models (LLMs). Its core innovation is the Reflective Generative Form, which unifies the policy model and a Self-supervised Process Reward Model (SPRM) within a single network. This integration lets MetaStone-S1 generate and select high-quality reasoning trajectories efficiently without relying on expensive, human-annotated process-level data; the SPRM learns from outcome rewards instead. The paper reports that MetaStone-S1, with only 32 billion parameters, achieves performance comparable to OpenAI's o3-mini series on mathematics, coding, and Chinese reasoning benchmarks. It also explores the scaling law of these models and identifies an "aha moment" during training, the point at which the SPRM begins to effectively distinguish correct from incorrect reasoning.
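
The "generate and select" idea in the description can be pictured with a small sketch. This is only an illustration under stated assumptions, not the model's actual implementation: the names `trajectory_score` and `select_best` are hypothetical, the step scorer is a stand-in for whatever the SPRM outputs, and averaging step scores with a geometric mean is an assumed aggregation rule chosen for the example.

```python
import math
from typing import Callable, List, Sequence

# A reasoning trajectory is modeled here as a list of step strings (assumption).
Trajectory = List[str]


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in (0, 1] with a geometric mean.

    The aggregation rule is an assumption made for this sketch.
    """
    if not step_scores:
        return 0.0
    eps = 1e-12  # guard against log(0)
    log_sum = sum(math.log(max(s, eps)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(
    candidates: Sequence[Trajectory],
    score_step: Callable[[str], float],
) -> Trajectory:
    """Return the candidate trajectory with the highest aggregated score.

    `score_step` is a hypothetical stand-in for a process reward model.
    """
    return max(
        candidates,
        key=lambda traj: trajectory_score([score_step(step) for step in traj]),
    )


if __name__ == "__main__":
    # Toy scorer: pretend steps containing "bad" are judged unlikely to be correct.
    def toy_scorer(step: str) -> float:
        return 0.1 if "bad" in step else 0.9

    sampled = [
        ["set up equation", "bad algebra", "answer 7"],
        ["set up equation", "solve for x", "answer 5"],
    ]
    print(select_best(sampled, toy_scorer))
```

Sampling more candidate trajectories and keeping the best-scored one is what lets extra compute at inference time translate into better answers, which is the sense of "Test-Time Scaling" used above.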

Neural Intelligence Network