Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Update: 2025-12-25

Description

🤗 Upvotes: 49 | cs.LG, cs.AI, cs.CL

Authors:

Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu

Title:

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Arxiv:

http://arxiv.org/abs/2512.19673v1

Abstract:

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to contributions from individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of the internal policies, we find that: (a) Early layers keep high entropy for exploration, while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series. (b) Llama's prediction space rapidly converges in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policy during early training. By aligning the training objective at a lower layer, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
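
To make the decomposition described in the abstract concrete, here is a minimal NumPy sketch of reading an internal layer policy off the residual stream and measuring its entropy. It is an illustration under stated assumptions, not the authors' implementation: the toy dimensions, the random weights, and the helper names (`internal_layer_policies`, `entropy`) are all hypothetical, and the final normalization layer that real models typically apply before unembedding is omitted for brevity.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def entropy(probs, eps=1e-12):
    # Shannon entropy (in nats) of each distribution along the last axis.
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def internal_layer_policies(hidden_states, unembed):
    # hidden_states: (num_layers, d_model), the residual stream at one token
    # position after each layer. unembed: (d_model, vocab).
    # Assumption: no final layer norm is applied before unembedding.
    return softmax(hidden_states @ unembed)

# Toy setup with random values (hypothetical sizes, not from the paper).
rng = np.random.default_rng(0)
num_layers, d_model, vocab = 4, 8, 16
unembed = rng.normal(size=(d_model, vocab))

# Each layer adds its contribution to the residual stream, so the hidden
# state after layer l is the running sum of contributions up to l.
contributions = rng.normal(size=(num_layers, d_model))
hidden_states = np.cumsum(contributions, axis=0)

policies = internal_layer_policies(hidden_states, unembed)
entropies = entropy(policies)

# Because unembedding is linear, the final logits equal the sum of the
# per-layer contributions' logits. This is the additive split that makes
# per-layer (and per-module) policies well defined.
final_logits = hidden_states[-1] @ unembed
summed_logits = (contributions @ unembed).sum(axis=0)

for layer, h in enumerate(entropies):
    print(f"layer {layer}: entropy = {h:.3f} nats")
```

With a real model, `hidden_states` would come from the per-layer hidden states at a chosen token position and `unembed` from the output projection; plotting the entropies by layer is one way to look for the high-to-low pattern the abstract reports.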

In Channel

SemanticGen: Video Generation in Semantic Space

2025-12-25 | 22:13

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

2025-12-25 | 27:28

LongVideoAgent: Multi-Agent Reasoning with Long Videos

2025-12-25 | 22:12

SpatialTree: How Spatial Abilities Branch Out in MLLMs

2025-12-25 | 22:10

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

2025-12-24 | 24:05

The Prism Hypothesis: Harmonizing Semantic and Pixel Representations via Unified Autoencoding

2025-12-24 | 26:13

Region-Constraint In-Context Generation for Instructional Video Editing

2025-12-24 | 21:07

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

2025-12-24 | 24:14

Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

2025-12-24 | 24:41

Can LLMs Estimate Student Struggles? Human-AI Difficulty Alignment with Proficiency Simulation for Item Difficulty Prediction

2025-12-24 | 25:57

Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

2025-12-23 | 24:01

PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

2025-12-23 | 25:33

When Reasoning Meets Its Laws

2025-12-23 | 21:45

Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience

2025-12-23 | 25:34

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

2025-12-23 | 26:30

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

2025-12-23 | 23:40

Are We on the Right Way to Assessing LLM-as-a-Judge?

2025-12-23 | 23:16

Kling-Omni Technical Report

2025-12-20 | 24:17

Adaptation of Agentic AI

2025-12-20 | 26:20

LLaDA2.0: Scaling Up Diffusion Language Models to 100B

2025-12-20 | 26:32

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

Jingwen Liang, Gengyu Wang
