Ep.143. Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Description
"Direct Preference Optimization: Your Language Model is Secretly a Reward Model" by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
Summary
This paper introduces Direct Preference Optimization (DPO), a method for fine-tuning large language models on human preference data. Traditional Reinforcement Learning from Human Feedback (RLHF) first fits a reward model to preference data and then optimizes the language model against it with reinforcement learning, a pipeline that is complex and often unstable. DPO instead optimizes the language model policy directly: it exploits a closed-form mapping between reward functions and optimal policies to recast preference learning as a simple classification problem on the policy itself. This removes the need for a separate reward model and for reinforcement learning, yielding a stable, performant, and computationally lightweight approach that matches or surpasses RLHF at aligning language models with human preferences.
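Concretely, the DPO objective treats each preference pair as a binary classification example over policy-versus-reference log-ratios. The sketch below is a minimal illustration rather than the paper's released code: the function name, argument names, and the assumption that per-response log-probabilities have already been summed over tokens are ours.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Per-pair loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each *_logps tensor holds the summed token log-probabilities of one
    response under the trainable policy or the frozen reference model
    (shape: [batch]). Names and shapes are illustrative assumptions.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    # The implicit reward of a response is beta times its log-ratio; the loss
    # is binary cross-entropy on which response should be preferred.
    losses = -F.logsigmoid(beta * (chosen_logratios - rejected_logratios))
    return losses.mean()
```

In practice the reference log-probabilities are computed once with gradients disabled, and beta controls how far the policy is allowed to drift from the reference model.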
