Direct Preference Optimization (DPO): Your Language Model is Secretly a Reward Model Explained

16,479 views

Gabriel Mongaras

A day ago

Comments: 11
@rw-kb9qv (1 year ago)
Dude, you're a hero for making these videos! Definitely earned a subscription from me.
@FaultyTwo (1 year ago)
I'm really glad a channel like yours exists.
@DED_Search (10 months ago)
I was reading Zephyr. It led to this DPO paper, which landed me on your channel. I am soooo happy. Keep it up!
@Anonymous-bu9ch (7 months ago)
Absolute Goldmine!!
@Poqets (1 year ago)
Finally, a video on this!
@prof_shixo (1 year ago)
Nice one, thanks for sharing. The replacement of the reward model by the MLE term looks appealing when we have ground truth (a generated reply and a reference reply). Still, the advantage of reward models is mainly in their potential to be used on new samples without ground truth present (i.e., no reference replies in self-play training on new datasets), so how would the MLE loss work in such scenarios?
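For reference, the objective being discussed here is the DPO loss from the paper, together with the implicit reward it defines (β controls the strength of the KL penalty toward the reference model π_ref, and σ is the logistic sigmoid):

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]

\[
\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \quad \text{(up to a term } \beta \log Z(x) \text{ that does not depend on } y\text{)}
\]

The training loss itself needs preference pairs (y_w, y_l), but the implicit reward \(\hat{r}_\theta\) can in principle be evaluated on any (x, y) from the two models' log-probabilities.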
@unclecode (1 year ago)
Thank you so much! I truly enjoyed your video and the way you explain things. There are moments, however, where I find myself a bit lost when you don't delve deeper, and I wish you could expand on those points. I would appreciate it if you could recommend a video or article to help me become more familiar with the basic concepts behind papers in this field. Understanding these basics would make it much easier for me to grasp the material. I feel there are gaps in my knowledge that, if I worked on them, would make these papers and their mathematical notation much easier to understand. I'm focusing on papers in this area of transformers and training models. Any suggestions would be greatly appreciated. Btw, you definitely earned my subscription as well.
@gabrielmongaras (1 year ago)
Thanks for the feedback! I usually try to hit the sweet spot between assumed knowledge and what I put in the videos so they don't get too long. I'll keep that in mind for future videos! I usually assume knowledge of MLPs (feed-forward networks), convolutional neural networks, sometimes attention (Attention Is All You Need), and the normal training process along with the loss functions that go with it, such as NLL and MSE. Most of these concepts are covered in intro ML classes such as Andrew Ng's, or in a textbook. As for textbooks, I don't know of one that stands out among the rest; most are probably similar to each other. As for mathematical notation, I'm not a mathematician, but reading papers in general has helped me get a better understanding of the notation, though I still lack a lot of knowledge of basic mathematical concepts. Hope this helps!
@YashVerma-ii8lx (9 months ago)
Hey @Gabriel, can you please clear up a doubt for me: why, at 12:58, can't we just directly backpropagate the loss like we do in simple fine-tuning? I'm not understanding it. Please share any relevant resources.
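For what it's worth, here is a minimal sketch of the DPO loss in PyTorch (the function and argument names are illustrative, and the per-sequence log-probabilities are assumed to be precomputed). Unlike the RLHF objective, which scores sampled completions with a separate reward model, this loss is an ordinary differentiable function of the model's log-probabilities, so it can be backpropagated directly:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Each argument: tensor of shape (batch,) holding log pi(y|x) summed over
    # the tokens of the chosen / rejected completion, under the policy being
    # trained and the frozen reference model respectively.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)        # implicit reward of y_w
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)  # implicit reward of y_l
    # Logistic (Bradley-Terry style) loss on the reward margin.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example with dummy log-probabilities; only the policy's log-probs carry gradients.
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected, torch.randn(4), torch.randn(4))
loss.backward()  # gradients flow straight into the policy log-probs, no sampling step in the loss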
@MacProUser99876 (9 months ago)
How DPO works under the hood: kzbin.info/www/bejne/gKaQoXmAg8uCnLs
@artiommyaskouvskey2126 (1 year ago)
Thanks!
RLHF & DPO Explained (In Simple Terms!)
19:39
Entry Point AI
2.8K views
Aligning LLMs with Direct Preference Optimization
58:07
DeepLearningAI
27K views
LoRA: Low-Rank Adaptation of LLMs Explained
27:19
Gabriel Mongaras
10K views
DPO Debate: Is RL needed for RLHF?
26:55
Nathan Lambert
8K views
Reinforcement Learning from Human Feedback: From Zero to chatGPT
1:00:38