Arxiv paper - Self-Improving Robust Preference Optimization

Update: 2025-04-23

Description

In this episode, we discuss Self-Improving Robust Preference Optimization by Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, and Mohammad Gheshlaghi Azar. The paper introduces Self-Improving Robust Preference Optimization (SRPO), an offline RLHF framework that enables models to self-improve and generalize across tasks by jointly optimizing a self-improvement policy and a generative policy through a min-max objective. SRPO reformulates this objective into a non-adversarial offline loss that can be optimized efficiently with standard supervised learning. Experiments show that SRPO significantly outperforms existing methods such as DPO and IPO on benchmarks such as XSum and Arena-Hard, achieving higher win rates against both human and AI baselines.
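
For listeners unfamiliar with the baselines named above, the following is a minimal sketch of the standard DPO and IPO losses for a single preference pair. It is not the SRPO loss from the paper; it only illustrates what an offline preference loss trained with ordinary supervised learning can look like. The function names, argument names, and default values of `beta` and `tau` are assumptions chosen for illustration.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Policy-vs-reference log-ratio of the preferred completion (w)
    minus that of the rejected completion (l)."""
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: -log(sigmoid(beta * margin)), written as a stable softplus."""
    z = -beta * log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """IPO: squared distance between the margin and the target 1 / (2 * tau)."""
    margin = log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l)
    return (margin - 1.0 / (2.0 * tau)) ** 2


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities for one preference pair.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
    print(ipo_loss(-10.0, -12.0, -11.0, -11.0))
```

Both losses need only log-probabilities of completions that are already in a fixed preference dataset, so no sampling from the policy and no separate reward model are required during training. That is the sense in which such objectives are called offline.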

In Channel

Towards Robust Mathematical Reasoning

2025-11-06 07:47

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

2025-11-04 06:49

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

2025-10-28 07:09

ImpossibleBench: Measuring LLMs’ Propensity of Exploiting Test Cases

2025-10-27 07:39

Scaling Instruction-Based Video Editing with a High-Quality Synthetic Dataset

2025-10-27 06:59

Reasoning with Sampling: Your Base Model is Smarter Than You Think

2025-10-23 07:58

DeepSeek-OCR: Contexts Optical Compression

2025-10-21 08:05

The Markovian Thinker

2025-10-16 07:48

DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

2025-10-08 08:03

Towards a Physics Foundation Model

2025-10-03 07:04

Scalable Option Learning in High-Throughput Environments

2025-09-30 08:18

Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

2025-09-24 08:10

Reverse-Engineered Reasoning for Open-Ended Generation

2025-09-19 08:39

Scaling Performance of Large Language Model Pretraining

2025-09-16 06:58

General Social Agents

2025-09-15 08:30

We need a new ethics for a world of AI agents

2025-09-12 07:26

Hierarchical Reasoning Model

2025-09-11 09:03

ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts

2025-09-10 08:23

Small Language Models are the Future of Agentic AI

2025-09-09 07:54

Learning When to Plan: Efficiently Allocating Test-Time Compute for LLM Agents

2025-09-08 07:01

agibreakdown