Glad you're back to technical content this time. Any AI YouTuber can give us the latest AI news, but you're just about the only one who can give technical insight into the stories.
@lone0017 · 7 months ago
6 videos in 7 days, I'm having a holiday and this is such a perfect-timing treat.
@EternalKernel · 7 months ago
Thank you for being awesome Yannic, I send people from the classes that I "TA" for to you because you're reliably strong with your analysis.
@tensorturtle1566 · 7 months ago
Great to see research from my homeland of South Korea represented!
@Dogo.R · 7 months ago
Woo, allegiance to tribes!!... .. ..
@jawadmansoor6064 · 7 months ago
do you know Seoul?
@cvabds · 7 months ago
There is only one Korea
@borisbondarenko314 · 7 months ago
I really like the more technical content from you. I usually read tech news on Telegram, and your ML News videos are great but fairly plain and simple. Paper explanations like this one have a real impact on the DS community; such videos seed new ideas and increase understanding of the field for those who try to dive deep. Of course it's less popular because the material is more complex for the audience, but it's much more interesting. So thank you for this format.
@Mordenor · 7 months ago
Thank you Mr Kilcher for delving into the paper "ORPO: Monolithic Preference Optimization without Reference Model".
@peach412 · 7 months ago
26:30 that 'really?' and the following struggle with basic math is WAAAAY too relatable
@clray123 · 5 months ago
I'm always amazed that mathematicians must name every variable and function with one letter (ok, nowadays it's up to three) and funny unpronounceable symbols positioned in sub- and superscript (you can also add extra fonts if you run out of letters). It's like some sort of mystical masochism which has survived unquestioned across the entire history of the field, serving only to exclude anyone unable to process their inhuman notation. But they also like other tricks, such as treating infinity as a noun and mumbling about infinite sets (while in reality they mean repeatable iterative set-element generation algorithms).
@peach412 · 5 months ago
Dunno, the notation seems pretty clear here. My comment was more about context switching from high level concepts back to high school math.
@ZhousiChen-h8p · 6 months ago
Really appreciate your explanation, very helpful. Now I see the alignment process as widening the upper part of the Y shape: starting from x, pushing y_w apart from y_l. Thanks!
@justheuristic · 7 months ago
The main loss function (7) looks like it can be meaningfully simplified with school-level math.

L_OR = -log(sigm(log(odds(y_w|x) / odds(y_l|x)))), where sigm(a) = 1/(1 + exp(-a)) = exp(a) / (1 + exp(a)).

Assume that both odds(y_w|x) and odds(y_l|x) are positive (which they are, because of the softmax).

Plugging in the sigmoid, we get:
L_OR = -log( exp(log(odds(y_w|x) / odds(y_l|x))) / (1 + exp(log(odds(y_w|x) / odds(y_l|x)))) )

Note that exp(log(odds(y_w|x) / odds(y_l|x))) = odds(y_w|x) / odds(y_l|x). We use this to simplify:
L_OR = -log( [odds(y_w|x) / odds(y_l|x)] / (1 + odds(y_w|x) / odds(y_l|x)) )

Finally, multiply both numerator and denominator by odds(y_l|x) to get:
L_OR = -log( odds(y_w|x) / (odds(y_w|x) + odds(y_l|x)) )

Intuitively, this is the negative log of (odds of the good response) / (odds of the good response + odds of the bad response). If you minimize the average loss over multiple texts, it's the same as maximizing the odds that the model chooses the winning response in every pair (of winning + losing responses).
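To make that algebra easy to verify, here is a minimal Python sketch with made-up likelihoods p_w and p_l (hypothetical values, not from the paper); both forms of the loss evaluate to the same number:

```python
import math

def odds(p):
    # odds of a probability p
    return p / (1.0 - p)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

p_w, p_l = 0.4, 0.1  # hypothetical sequence likelihoods p(y_w|x), p(y_l|x)

# Original form: L_OR = -log sigmoid(log(odds_w / odds_l))
loss_original = -math.log(sigmoid(math.log(odds(p_w) / odds(p_l))))

# Simplified form: L_OR = -log(odds_w / (odds_w + odds_l))
loss_simplified = -math.log(odds(p_w) / (odds(p_w) + odds(p_l)))

print(loss_original, loss_simplified)  # both ≈ 0.1542
```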
@peterszilvasi752 · 7 months ago
Good job! I suppose you mean `odds(y_l|x)` instead of `odds(y_l)` in the final equation.
@justheuristic · 7 months ago
@peterszilvasi752 thanks! good catch :) /* fixed the previous comment */
@lucidraisin · 7 months ago
very cool! thank you for this
@simaogoncalves1957 · 6 months ago
16:12 Not sure I follow the intuition behind supervised fine-tuning not being able to penalize the "wrong" token that is opposite to what we want the model to mimic. I'm confused because, in my view, the wrong but highly probable token contributes more to the loss, so it will be penalized more heavily than the more meaningless, random output tokens. Can someone clarify this for me?
@blender6426 · 7 months ago
Nice I was waiting for this after you mentioned ORPO in ML News :))
@MyCiaoatutti · 7 months ago
"Specifically, 1 - p(y|x) in the denominators amplifies the gradients when the corresponding side of the likelihood p(y|x) is low". I think that (1 - p(y|x)) have two different meanings here: it can be the result of differentiation by coincidence and also the "corresponding side" of the likelihood, i.e., 1 - p(y|x). So, when it says the "corresponding side" of p(y|x) is low, it means that 1 - p(y|x) is low.
@I-0-0-I · 7 months ago
Thanks for explaining basic terms along with the more complex stuff, for dilettantes like myself. Cheers.
@Zed_Oud · 6 months ago
27:57 “the corresponding side”: maybe they mistakenly switched the w and l conditions in the denominators?
@herp_derpingson · 5 months ago
18:47 I wish they had shown some loss curves from training in the paper, unless I missed it. Whenever you divide things like that in the loss function, the loss curve goes crazy. It still trains, but it can go crazy because for some samples the denominator might be close to zero.
19:33 There is no ablation in the paper without SFT, since the loss is L_sft + lambda * L_orpo. I think we'll soon see a follow-up paper, "ORPO is all you need", which just drops the SFT. I think it will work great.
31:30 One of my colleagues tried the probability ratio thing before. I don't remember what came out of it. Haven't checked in on him for a while.
@Htyagi1998 · 2 months ago
27:57 explanation: here they have written that if the likelihood p(y|x) is low, then 1 - p(y|x) will accelerate the gradients. If I'm not wrong, it can be read as: if the likelihood of the losing side is higher, then the gradient accelerates toward the other side.
@max0x7ba · 6 months ago
That log of probability is also a power transform often used to narrow or widen a distribution.
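A quick sketch of that point: raising the probabilities to a power T (equivalently, multiplying the log-probabilities by T) and renormalising narrows the distribution for T > 1 and widens it for T < 1 (illustrative numbers only):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])

def power_transform(p, t):
    q = p ** t          # same as exp(t * log p)
    return q / q.sum()  # renormalise

print(power_transform(p, 2.0))  # sharper: ~[0.78, 0.20, 0.02]
print(power_transform(p, 0.5))  # flatter: ~[0.47, 0.33, 0.19]
```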
@syeshwanth6790 · 7 months ago
Where do y_w and y_l come from? Are they from the training dataset, or does the LLM being trained generate them, with humans or reward models then labelling them as winning (w) and losing (l)?
@clray123 · 5 months ago
I suppose the chosen and rejected tokens are similarly reinforced by SFT based on the similarity of their embeddings. So it's kind of a hen-and-egg problem: you want the training data to produce different embeddings; but at the same time you would need different embeddings to begin with for the SFT to not shape them into becoming similar based on context. It reminds me strongly of the "confused pronoun" (he~she, I~you) or "confused actor name" ("Mark, you are my best friend" said Mark) problem which everyone has experienced in their interactions with the smaller/dumber models. Another way of putting it is that they "learn" the category, but not the distinction within the category - perhaps because the examples in SFT are not contrastive enough to promote learning of the differences. Of course, supplying chosen/rejected inputs is one way to enforce that contrast. But perhaps just supplying enough specially prepared SFT examples would do the same trick.
@mantasorantas5289 · 7 months ago
Would be interesting to see how it compares to KTO. I would guess that KTO outperforms and is easier to implement, as you don't need pairs of inputs.
@SLAM2977 · 7 months ago
There seems to be a conceptual problem: where are the preferences coming from, given that they are expressed over multiple responses to the same prompt? Suppose we wish to fine-tune a foundation model for chat: we would not have the preferences before having done SFT and gathered some responses on chat-template-formatted prompts. That would force us to do SFT first and then SFT + odds-ratio loss. Doable, but surely not a single-pass approach.
@clray123 · 5 months ago
Yes, it seems that to get an acceptable result with ORPO you have to incorporate the SFT data into your chosen/rejected dataset in order to make it work at all. It is not a technique like RLHF/DPO which can be applied on top just to tweak a pretrained model without destroying it. I don't see anything like the "KL divergence" term in the loss which would make it stick to the reference model (as there is no reference model), so naturally it will cause some brutal overfitting to the "chosen" examples. Unsurprisingly, it does destroy a pretrained model in my experiments if the alignment dataset just focuses on fixing up short rejected completions like a DPO dataset would. (DPO also has that effect when overdone, but not as much.) It appears that the "chosen" examples have to be completions from the original SFT dataset while the "rejected" ones should be the undesired completions generated by the pretrained model. What is confusing in that context is how to choose the length of the "prompt" vs completion. I suspect it's less efficient than SFT because the prompts are masked out and are not trained along (or you would have to blow up the dataset to have varying length versions of the same prompt).
@lenant · 5 months ago
Thanks for the explanation! But what do they consider as y_l? What are these tokens whose probability should be lowered, and how do they select them?
@lenant · 5 months ago
I see in the paper they use the datasets argilla/ultrafeedback-binarized-preferences-cleaned and Anthropic/hh-rlhf, but I don't quite understand how teacher forcing works here with 2 different sequences.
@lenant · 5 months ago
Reading more into the paper, I think I got it: they don't add L_or per token, but rather to the whole SFT loss (gathered over the generated tokens), and L_or is calculated from the probability of the whole chosen and rejected sequences.
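That matches my reading. A rough PyTorch-style sketch of that structure (names and shapes are illustrative, it assumes p(y|x) is the length-normalised average token likelihood as described above, and lam stands in for the paper's lambda):

```python
import torch
import torch.nn.functional as F

def orpo_loss(logits_w, labels_w, logits_l, labels_l, lam=0.1):
    """Sequence-level sketch: one (chosen, rejected) pair per call.

    logits_*: (seq_len, vocab) model outputs; labels_*: (seq_len,) target
    token ids with prompt positions masked out as -100.
    """
    # SFT term: ordinary token-level NLL on the chosen completion only.
    sft = F.cross_entropy(logits_w, labels_w, ignore_index=-100)

    def avg_logp(logits, labels):
        # Average per-token log-likelihood of the whole completion.
        mask = labels != -100
        logp = torch.log_softmax(logits, dim=-1)
        tok = logp[mask].gather(1, labels[mask].unsqueeze(1)).squeeze(1)
        return tok.mean()

    def log_odds(avg_lp):
        # log odds = log p - log(1 - p), with p = exp(average log-likelihood)
        return avg_lp - torch.log1p(-torch.exp(avg_lp))

    log_odds_ratio = log_odds(avg_logp(logits_w, labels_w)) - log_odds(avg_logp(logits_l, labels_l))
    l_or = -F.logsigmoid(log_odds_ratio)  # one scalar per pair, not per token
    return sft + lam * l_or
```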
@xxlvulkann6743 · 7 months ago
great! now apply ORPO to a reward model and round we go!
@wwkk4964 · 7 months ago
What's going on, is it Yannic bonanza time of the year?! Loving these addictive videos.
@ArijitBiswasGooglePlus · 5 months ago
At the beginning, you referred to a paper from Meta. Which paper is it?
@jondo7680 · 6 months ago
You should make a video just focusing on the log and explaining its role in neural networks.
@fearnworks · 7 months ago
You are on fire!
@thunder89 · 7 months ago
The comparison at the end between OR and PR should also discuss the influence of the log sigmoid, no? And, more importantly, what the gradients for the winning and losing outputs would actually look like with these simulated pairs... It feels a bit hand-wavy why the log sigmoid of the OR should be the target...
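For anyone who wants to poke at that directly, a quick sketch that tabulates the two log-sigmoid losses on made-up (p_w, p_l) pairs (not the paper's simulation); wrapping the same expressions in an autograd framework would give the per-output gradients the comment asks about:

```python
import math

def logsigmoid(a):
    return -math.log(1.0 + math.exp(-a))

def odds(p):
    return p / (1.0 - p)

for p_w, p_l in [(0.6, 0.5), (0.9, 0.5), (0.9, 0.1), (0.99, 0.01)]:
    loss_or = -logsigmoid(math.log(odds(p_w) / odds(p_l)))  # odds ratio
    loss_pr = -logsigmoid(math.log(p_w / p_l))              # probability ratio
    print(f"p_w={p_w:4.2f} p_l={p_l:4.2f}  OR loss={loss_or:6.4f}  PR loss={loss_pr:6.4f}")
```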
@kaikapioka9711 · 7 months ago
Thx again yan! 🎉
@Htyagi1998 · 2 months ago
The way you explained ❤
@yannickpezeu3419 · 7 months ago
I liked the self-deprecation at 32:00 haha
@drdca8263 · 7 months ago
0:52 : I wish we had a different term for this other than “alignment”
@TheRyulord · 7 months ago
"Preference tuning" is used to describe it pretty often
@drdca8263 · 7 months ago
@@TheRyulord thanks!
@Jason-lm2yq · 6 months ago
Can you do one on the Kolmogorov-Arnold Network (KAN) paper from MIT?
@amber9040 · 7 months ago
I feel like AI models have gotten more stale and same-y ever since RLHF became the norm. Playing around with GPT-3 was wild times. Hopefully alignment moves in a direction with more diverse ranges of responses in the future, and less censorship in domains where it's not needed.
@dinoscheidt · 7 months ago
LLMs are what Machine Learning has always been: input output. Quality data makes the cake…. no matter how many fancy mixers you bring to the table.
@gauranshsoni4011 · 7 months ago
Keep them comin
@pritioli8429 · 5 months ago
great explanation!
@davidhauser7537 · 6 months ago
Yannic, can you do the xLSTM paper?
@john_blues · 7 months ago
I don't even know what the title of this video means 😵💫. But I'm going to watch anyway.
@rectomgris · 7 months ago
makes me think of PPO
@chrise8153 · 7 months ago
Wow good timing to go on youtube
@jellyfishnexus3132 · 7 months ago
Nice!
@iworeushankaonce · 6 months ago
*posts videos almost every day* *KAN paper dropped, disappears for 2 weeks* I hope you're alright man 🫂🤗