DPO vs. RLHF: Model Fine-Tuning

2,835 views

Alice in AI-land

A day ago

Comments: 7
@neurite001 · 10 months ago
I'm beyond excited. Just like Andrew Ng, I almost jumped out of my seat in a coffee shop. Finally a Chinese-language creator explaining DPO!
@AliceInAILand · 10 months ago
😄 Me too, seeing such a beautiful proof fills me with joy. Today I also saw that Meta has already used this self-rewarding method to fine-tune Llama2-70B, and they say it beats GPT-4 on some benchmarks: arxiv.org/abs/2401.10020
@theodoruszhou2692 · 8 months ago
Thank you very much for the video, the explanations were very clear, and I learned a lot. Looking forward to your next work~
@AliceInAILand · 8 months ago
Glad it was helpful :)
@iwisher666 · 8 months ago
Keep it up, keep it up!
@ZhousiChen-h8p · 6 months ago
Could you explain this for someone who isn't great at math (me)? What RLHF and DPO have in common is the preference pairs, but DPO doesn't depend on a reward model or RL. Does that mean DPO needs a lot less training data? Since the preferences are judged by humans anyway, with no other model used as an approximation, I feel like the reward model also plays a data-augmentation role, or something like bootstrapping.

I'd also really like to know how adjusting the probability that the model outputs a particular sentence gets turned into gradients. I recently saw a paper called KTO that says it doesn't rely on preference pairs: a single example plus a binary judgment of whether humans like it or not is enough. I don't understand why pairs matter so much.

If possible, could you use more plain language to explain and compare how these methodologies are alike and how they differ? I also hope the episodes could be a bit shorter... Thank you! 🤗
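On the question of how nudging the probability of a particular output turns into gradients: in DPO the model is trained directly on the pairwise log-probability margin, so ordinary backpropagation through that margin is the whole mechanism. Below is a minimal PyTorch sketch of the DPO loss on a single preference pair; the function name, beta value, and toy numbers are illustrative, not taken from the video.

```python
# Minimal sketch of the DPO loss for one preference pair. Assumes we already
# have the summed token log-probabilities of the chosen (preferred) and
# rejected responses under the trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: how far the policy has moved each response's
    # log-probability away from the reference, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry-style objective: maximize the chosen-minus-rejected margin.
    # logsigmoid keeps the computation numerically stable.
    return -F.logsigmoid(chosen_reward - rejected_reward)

# Toy numbers (illustrative only): the gradient on policy_chosen comes out
# negative (raise its probability) and on policy_rejected positive (lower it).
policy_chosen = torch.tensor(-12.0, requires_grad=True)
policy_rejected = torch.tensor(-11.0, requires_grad=True)
loss = dpo_loss(policy_chosen, policy_rejected,
                ref_chosen_logp=torch.tensor(-12.5),
                ref_rejected_logp=torch.tensor(-11.5))
loss.backward()
print(loss.item(), policy_chosen.grad.item(), policy_rejected.grad.item())
```

Only the difference between the two responses enters the loss, which is why the pairing matters: the rejected response acts as the baseline that a separate reward model would otherwise have to provide.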
@fungpangfan8825 · 5 months ago
❤🎉