ORPO: Monolithic Preference Optimization without Reference Model (Paper Explained)

23,032 views

Yannic Kilcher

A day ago

Comments: 69
@r9999t
@r9999t 7 months ago
Glad you're back to technical content this time. Any AI YouTuber can give us the latest AI news, but you're just about the only one who can give technical insight into the stories.
@lone0017
@lone0017 7 months ago
6 videos in 7 days; I'm on holiday and this is such a perfectly timed treat.
@EternalKernel
@EternalKernel 7 months ago
Thank you for being awesome, Yannic. I send people from the classes I TA to your channel because your analysis is reliably strong.
@tensorturtle1566
@tensorturtle1566 7 months ago
Great to see research from my homeland of South Korea represented!
@Dogo.R
@Dogo.R 7 months ago
Woo, allegiance to tribes!
@jawadmansoor6064
@jawadmansoor6064 7 months ago
Do you know Seoul?
@cvabds
@cvabds 7 months ago
There is only one Korea
@borisbondarenko314
@borisbondarenko314 7 months ago
I really like the more technical content from you. I usually read tech news on Telegram, and your ML News videos are great but fairly plain and simple. Paper explanations like this have a real impact on the DS community: such videos seed new ideas and deepen understanding of the field for those who try to dive deeper. Of course the format is less popular because the material is harder for the audience, but it is much more interesting. So thank you for this format.
@Mordenor
@Mordenor 7 months ago
Thank you Mr Kilcher for delving into the paper "ORPO: Monolithic Preference Optimization without Reference Model".
@peach412
@peach412 7 months ago
26:30 that 'really?' and the following struggle with basic math is WAAAAY too relatable
@clray123
@clray123 5 months ago
I'm always amazed that mathematicians must name every variable and function with one letter (OK, nowadays it's up to three) and funny unpronounceable symbols positioned in sub- and superscript (you can also add extra fonts if you run out of letters). It's like some sort of mystical masochism that has survived unquestioned through history, serving only to exclude anyone unable to process the inhuman notation. They also like other tricks, such as treating infinity as a noun and mumbling about infinite sets (while in reality they mean repeatable iterative set-element-generation algorithms).
@peach412
@peach412 5 months ago
Dunno, the notation seems pretty clear here. My comment was more about the context switch from high-level concepts back to high-school math.
@ZhousiChen-h8p
@ZhousiChen-h8p 6 months ago
Really appreciate your explanation, very helpful. Now I see the alignment process as widening the upper part of the Y shape: x branching into y_w and y_l. Thanks!
@justheuristic
@justheuristic 7 months ago
The main loss function (7) looks like it can be meaningfully simplified with school-level math.

Lor = -log(sigm(log(odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a)).

Let's assume that both odds(y_w|x) and odds(y_l|x) are positive (because softmax). By plugging in the sigmoid, we get

Lor = -log( exp(log(odds(y_w|x) / odds(y_l|x))) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )

Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). We use this to simplify:

Lor = -log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )

Finally, multiply both numerator and denominator by odds(y_l|x) to get

Lor = -log( odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )

Intuitively, this is the negative log-probability of (the odds of the good response) / (odds of the good response + odds of the bad response). If you minimize the average loss over multiple texts, it's the same as maximizing the odds that the model chooses the winning response in every pair (of winning + losing responses).
@peterszilvasi752
@peterszilvasi752 7 months ago
Good job! I suppose you mean `odds(y_l|x)` instead of `odds(y_l)` in the final equation.
@justheuristic
@justheuristic 7 months ago
@@peterszilvasi752 thanks! good catch :) /* fixed the previous comment */
@lucidraisin
@lucidraisin 7 months ago
very cool! thank you for this
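A quick numerical sanity check of the simplification in the thread above (a minimal sketch in plain Python; the odds values are arbitrary placeholders, not from the paper):

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

# arbitrary placeholder odds for the winning and losing responses
odds_w, odds_l = 2.5, 0.4

# original form: -log(sigmoid(log odds ratio))
loss_original = -math.log(sigm(math.log(odds_w / odds_l)))

# simplified form: -log(odds_w / (odds_w + odds_l))
loss_simplified = -math.log(odds_w / (odds_w + odds_l))

print(loss_original, loss_simplified)  # both ~0.1484, so the two forms agree
```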
@simaogoncalves1957
@simaogoncalves1957 6 months ago
16:12 Not sure I follow the intuition behind supervised fine-tuning not being able to penalize the "wrong" token that is opposite to what we want the model to mimic. I'm confused because, in my view, the wrong but highly probable token contributes more to the loss, so it will be penalized more heavily than the more meaningless, random output tokens. Can someone clarify this for me?
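One way to see the point made at 16:12 (an illustrative sketch with made-up numbers, not taken from the paper): plain cross-entropy only looks at the probability assigned to the chosen token, so it has no term that specifically pushes down a particular undesired token.

```python
import math

# token 0 is the "chosen" token; token 1 plays the role of an undesired ("rejected") token
p_concentrated = [0.6, 0.39, 0.005, 0.005]  # leftover mass piled onto the rejected token
p_spread       = [0.6, 0.13, 0.13, 0.14]    # leftover mass spread over random tokens

nll = lambda p: -math.log(p[0])  # SFT cross-entropy only sees the chosen token's probability
print(nll(p_concentrated), nll(p_spread))   # identical (~0.511): SFT alone cannot tell these apart
```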
@blender6426
@blender6426 7 months ago
Nice, I was waiting for this after you mentioned ORPO in ML News :))
@MyCiaoatutti
@MyCiaoatutti 7 months ago
"Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low." I think that (1 - p(y|x)) has two different meanings here: it is the result of differentiation, and by coincidence it is also the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So when the paper says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.
@I-0-0-I
@I-0-0-I 7 months ago
Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.
@Zed_Oud
@Zed_Oud 6 months ago
27:57 "the corresponding side": maybe they mistakenly switched the w and l givens in the denominators?
@herp_derpingson
@herp_derpingson 5 months ago
18:47 I wish they had shown some training loss curves in the paper, unless I missed it. Whenever you divide things like that in the loss function, the loss curve goes crazy. It still trains, but it can go wild because for some samples the denominator might be close to zero.

19:33 There is no ablation in the paper without SFT, since the loss is L_sft + lambda * L_orpo. I think we'll soon see a follow-up paper, "ORPO is all you need", which just drops the SFT. I think it will work great.

31:30 One of my colleagues tried the probability-ratio thing before. I don't remember what came out of it. Haven't checked in with him for a while.
@Htyagi1998
@Htyagi1998 2 months ago
27:57 explanation: they write that if the likelihood p(y|x) is low, then 1 - p(y|x) will amplify the gradients. If I'm not wrong, it can be read as: if, say, the likelihood of the losing side is higher, then the gradient accelerates toward the other side.
@max0x7ba
@max0x7ba 6 months ago
That log of probability is also a power transform often used to narrow or widen a distribution.
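Possibly what the comment above is pointing at (a standard identity, not from the paper): scaling log-probabilities by a factor β and renormalizing is the same as raising the probabilities to the power β, which narrows the distribution for β > 1 and widens it for β < 1:

$$\operatorname{softmax}(\beta \log p)_i = \frac{\exp(\beta \log p_i)}{\sum_j \exp(\beta \log p_j)} = \frac{p_i^{\beta}}{\sum_j p_j^{\beta}}.$$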
@syeshwanth6790
@syeshwanth6790 7 months ago
Where do y_w and y_l come from? Are they taken from the training dataset, or does the LLM being trained generate them and have them labelled by humans or reward models as winning and losing?
@clray123
@clray123 5 months ago
I suppose the chosen and rejected tokens are similarly reinforced by SFT, based on the similarity of their embeddings. So it's kind of a chicken-and-egg problem: you want the training data to produce different embeddings, but at the same time you would need different embeddings to begin with for SFT not to shape them into becoming similar based on context. It reminds me strongly of the "confused pronoun" (he~she, I~you) or "confused actor name" ("Mark, you are my best friend," said Mark) problem which everyone has experienced in their interactions with the smaller/dumber models. Another way of putting it is that they "learn" the category but not the distinction within the category, perhaps because the examples in SFT are not contrastive enough to promote learning of the differences. Of course, supplying chosen/rejected inputs is one way to enforce that contrast. But perhaps just supplying enough specially prepared SFT examples would do the same trick.
@mantasorantas5289
@mantasorantas5289 7 months ago
Would be interesting to see how it compares to KTO. I would guess that KTO outperforms it and is easier to implement, as you don't need pairs of inputs.
@SLAM2977
@SLAM2977 7 months ago
There seems to be a conceptual problem: where are the preferences coming from, given that they are expressed over multiple responses to the same prompt? Suppose we wish to fine-tune a foundation model for chat. We would not have the preferences before having done SFT and gathered some responses to prompts in the chat template format, which would force us to do SFT first and then SFT + odds-ratio loss. Doable, but surely not a single-pass approach.
@clray123
@clray123 5 months ago
Yes, it seems that to get an acceptable result with ORPO you have to incorporate the SFT data into your chosen/rejected dataset in order to make it work at all. It is not a technique like RLHF/DPO which can be applied on top just to tweak a pretrained model without destroying it. I don't see anything like the "KL divergence" term in the loss which would make it stick to the reference model (as there is no reference model), so naturally it will cause some brutal overfitting to the "chosen" examples. Unsurprisingly, it does destroy a pretrained model in my experiments if the alignment dataset just focuses on fixing up short rejected completions like a DPO dataset would. (DPO also has that effect when overdone, but not as much.) It appears that the "chosen" examples have to be completions from the original SFT dataset while the "rejected" ones should be the undesired completions generated by the pretrained model. What is confusing in that context is how to choose the length of the "prompt" vs completion. I suspect it's less efficient than SFT because the prompts are masked out and are not trained along (or you would have to blow up the dataset to have varying length versions of the same prompt).
@lenant
@lenant 5 months ago
Thanks for the explanation! But what do they consider as y_l? What are these tokens whose probability should be lower, and how do they select them?
@lenant
@lenant 5 months ago
I see in the paper they use the datasets argilla/ultrafeedback-binarized-preferences-cleaned and Anthropic/hh-rlhf, but I don't quite understand how teacher forcing works here with two different sequences.
@lenant
@lenant 5 months ago
Reading further into the paper, I think I got it: they don't add L_or per token; rather, it is added to the whole SFT loss (gathered over the generated tokens), and L_or is calculated from probabilities over the whole chosen and rejected sequences.
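A minimal PyTorch-style sketch of that sequence-level view (my own illustration, not the authors' code; the tensor layout, the lambda value, and the use of length-averaged log-probs are assumptions):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    # Average log-probability of `labels` under `logits`, counting only positions
    # where mask == 1 (completion tokens; prompt and padding are masked out).
    logp = torch.log_softmax(logits, dim=-1)                        # (B, T, V)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (token_logp * mask).sum(-1) / mask.sum(-1)               # (B,), mask is a 0/1 float tensor

def orpo_loss(chosen_logits, chosen_labels, chosen_mask,
              rejected_logits, rejected_labels, rejected_mask, lam=0.1):
    # per-sequence (length-averaged) log-probs of the chosen and rejected completions
    logp_w = sequence_logprob(chosen_logits, chosen_labels, chosen_mask)
    logp_l = sequence_logprob(rejected_logits, rejected_labels, rejected_mask)

    # log odds(y|x) = log p - log(1 - p), computed in log space via log1p(-exp(log p))
    log_odds_w = logp_w - torch.log1p(-torch.exp(logp_w))
    log_odds_l = logp_l - torch.log1p(-torch.exp(logp_l))

    # odds-ratio term: -log sigmoid(log odds ratio), matching the derivation in the thread above
    l_or = -F.logsigmoid(log_odds_w - log_odds_l).mean()

    # SFT term: plain NLL on the chosen completion
    l_sft = -logp_w.mean()

    return l_sft + lam * l_or
```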
@xxlvulkann6743
@xxlvulkann6743 7 months ago
great! now apply ORPO to a reward model and round we go!
@wwkk4964
@wwkk4964 7 months ago
What's going on, is it Yannic bonanza time of the year? Loving these addictive videos
@ArijitBiswasGooglePlus
@ArijitBiswasGooglePlus 5 months ago
At the beginning, you referred to a paper from Meta. Which paper is it?
@jondo7680
@jondo7680 6 months ago
You should make a video just focusing on the log and explaining its role in neural networks.
@fearnworks
@fearnworks 7 months ago
You are on fire!
@thunder89
@thunder89 7 months ago
The comparison at the end between OR and PR should also discuss the influence of the log-sigmoid, no? And, more importantly, how the gradients for the winning and losing outputs would actually look with these simulated pairs... It feels a bit hand-wavy why the log-sigmoid of the OR should be the target...
@kaikapioka9711
@kaikapioka9711 7 months ago
Thx again yan! 🎉
@Htyagi1998
@Htyagi1998 2 months ago
The way you explained ❤
@yannickpezeu3419
@yannickpezeu3419 7 months ago
I liked the self-deprecation at 32:00, haha
@drdca8263
@drdca8263 7 months ago
0:52: I wish we had a different term for this other than "alignment"
@TheRyulord
@TheRyulord 7 months ago
"Preference tuning" is used to describe it pretty often
@drdca8263
@drdca8263 7 months ago
@@TheRyulord thanks!
@Jason-lm2yq
@Jason-lm2yq 6 months ago
Can you do one on the Kolmogorov-Arnold Network paper from MIT?
@amber9040
@amber9040 7 months ago
I feel like AI models have gotten more stale and same-y ever since RLHF became the norm. Playing around with GPT-3 was wild times. Hopefully alignment moves in a direction with more diverse ranges of responses in the future, and less censorship in domains where it's not needed.
@dinoscheidt
@dinoscheidt 7 months ago
LLMs are what machine learning has always been: input, output. Quality data makes the cake… no matter how many fancy mixers you bring to the table.
@gauranshsoni4011
@gauranshsoni4011 7 months ago
Keep them comin
@pritioli8429
@pritioli8429 5 months ago
great explanation!
@davidhauser7537
@davidhauser7537 6 months ago
Yannic, can you do the xLSTM paper?
@john_blues
@john_blues 7 months ago
I don't even know what the title of this video means 😵‍💫. But I'm going to watch anyway.
@rectomgris
@rectomgris 7 months ago
makes me think of PPO
@chrise8153
@chrise8153 7 months ago
Wow, good timing to go on YouTube
@jellyfishnexus3132
@jellyfishnexus3132 7 months ago
Nice!
@iworeushankaonce
@iworeushankaonce 6 months ago
*posts videos almost every day* *KAN paper dropped, disappears for 2 weeks* I hope you're alright, man 🫂🤗