
An update on DPO vs PPO for LLM alignment

1,069 views

Nathan Lambert

A day ago

A casual chat about our experiments to figure out which method works best.
Paper referenced: arxiv.org/abs/...
Abstract: Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning from preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains.
Slides: docs.google.co...
Synthetic data piece: www.interconne...
Slides taken from recent Stanford Lecture: docs.google.co...
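For context on the comparison in the title, the DPO objective can be sketched in a few lines. This is a minimal illustration of the standard per-pair DPO loss, not code from the paper or the video; the function and argument names are mine, and inputs are assumed to be summed token log-probabilities of each response under the policy and the frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r)))."""
    # Implicit reward of each response, measured relative to the reference model
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log1p(math.exp(-logits))
```

When the policy matches the reference model, both margins are zero and the loss is log 2; as the policy assigns relatively more probability to the chosen response, the loss decreases. Unlike PPO, no separate reward model or on-policy sampling is needed, which is what makes the two approaches worth comparing empirically.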

Comments: 7