Proximal Policy Optimization (PPO) - How to train Large Language Models

  38,701 views

Serrano.Academy

1 day ago

Comments: 89
@texwiller7577 10 months ago
Probably the best explanation of PPO ever.
@wirotep.1210 9 months ago
Agreed!
@PeterbroMC 5 months ago
Probably? THIS IS THE BEST (personally, at least).
@RunchuTian 10 months ago
Thank you! Your explanation of PPO is SO explicit.
@dasistdiewahrheit9585 1 year ago
I love your clear examples and how you reduce them to the essentials.
@patriceboulanger928 7 days ago
Clearest video on the subject so far, well done
@Punch_Card 1 month ago
The title of this video is misleading. At first I thought it was only applicable to LLMs, so I didn't click. After watching every single video on this topic and understanding nothing, I decided to check out this one. Turns out it's the best explanation.
@Closer_36 5 months ago
Your effort to explain a complicated concept in an easy and very clear way from scratch, with visual examples, is just beautiful! Thanks for sharing your knowledge.
@endoumamoru3835 8 months ago
This is the best explanation for PPO I have seen; it's very intuitive.
@itsSandraKublik 1 year ago
Loved it ❤ Need to rewatch it a few more times now, but it's getting much, much clearer thanks to you!
@DarkKnight7_1 6 months ago
The most "spectacular" explanation I have ever seen of PPO. Really, really liked it!
@sarthak.AiMLDL 1 year ago
Sorry bae, can't talk right now. Luis dropped another masterpiece and I had to watch it first... :)
@SerranoAcademy 1 year ago
🤣 LOL! that’s a good one
@ShivangiTomar-p7j 1 year ago
You're the best!!! Absolutely love all your ML vids!
@Wenbobobo 1 year ago
I love your clear teaching, both easy to understand and in-depth. I'll recommend it to friends, and I'm hoping for the next RLHF video!
@deeplearningexplained 5 days ago
This is an awesome tutorial, will recommend it for sure. Nice work!
@vlknmrt 3 months ago
Thank you very much for this simple, understandable and at the same time elegantly narrated video... Great work!
@Fabricio-rm4hj 13 days ago
Your videos are great, they make it easier to understand these complex subjects. Thank you for sharing this knowledge.
@sachinsarathe1143 1 year ago
You are a genius, man... The way you explain things in an easy-to-understand way is mind-blowing. Love you a lot :)
@alifwicaksanaramadhan6358 6 months ago
Love your explanation. It's the best PPO explanation I've found so far.
@TripDerve 8 months ago
Thank you so much! I'm currently writing my master's dissertation proposal and have just been exposed to PPO. After reading the OpenAI white paper, the images and explanation in this video bridged the gap with Actor-Critic, and now I completely understand how it's all related! It's so satisfying to be able to understand this, thank you so much!
@samuelnzekwe7696 8 months ago
Would you be kind enough to offer some insights into what direction you are taking in your dissertation? I'm also in the same boat and at a loss as to where, how, or what to specifically focus on.
@wb7779 9 months ago
Reading the book is really hard. It's hard to control the studying environment, and it's hard to sustain enough focus with the necessary momentum to learn a concept. There are so many interruptions that can interfere with the process, but this concise and powerful video helped save a lot of time in learning. Thanks. ❤
@KumR 1 year ago
Looking forward to this.
@jimijames3251 4 months ago
Best and simplest explanation of RL and PPO 👏👏👏
@hopelesssuprem1867 1 day ago
You didn't explain how to apply this algorithm in the context of LLM training, but that is the most important part.
@jff711 1 year ago
Thank you for your time and effort in preparing and explaining this useful video.
@PeterbroMC 5 months ago
Sincerely impressed, your explanation was amazing.
@whilewecan 3 months ago
Thank you very, very much. I was able to understand it. I owe you a lot, not just for this video but for all of them. Wonderful.
@ahmedshmels8866 1 year ago
Crystal clear!
@ChrisZuo 4 months ago
Really good explanation! I immediately understood PPO after watching.
@JackWang-dx3rm 5 months ago
The explanation of the surrogate function was so vivid.
@moammerelzwail5188 6 months ago
Hi Luis, thanks for this great video. However, at 33:10 I believe the objective is to maximize the surrogate objective function, NOT to "make it as small as possible" as you said. When the advantage (At) is positive, we need to increase the probability of the current action by maximizing the surrogate objective function, and when the advantage (At) is negative, we need to decrease the probability of the current action, but also through maximizing the surrogate objective function.
@emanuelgoldman4118 13 days ago
You are correct.
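For reference in this thread, the clipped surrogate objective under discussion is the standard form from the PPO paper, written with the probability ratio $r_t(\theta)$ and the advantage estimate $\hat{A}_t$, and it is indeed maximized:

```latex
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\,\hat{A}_t,\;
      \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\Big)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

In implementations, the quantity that gets minimized is the negative of this objective, so "maximize the objective" and "minimize the loss" describe the same update.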
@ብርቱሰው 7 months ago
I would like to say thank you for the wonderful video. I want to learn reinforcement learning for my future study in the field of robotics. I see that you only have 4 videos about RL, and I am hungry for more. I find your videos easier to understand because you explain things well. Please add more RL videos. Thank you 🙏
@SerranoAcademy 7 months ago
Thank you for the suggestion! Definitely! Any ideas on what topics in RL to cover?
@ብርቱሰው 7 months ago
@@SerranoAcademy More videos in the field of robotics, please. Thank you. You might also guide me on how to approach the study of reinforcement learning.
@limal8012 10 months ago
Thank you for your video, which provided a great explanation of PPO.❤
@sethjchandler 10 months ago
Extraordinarily lucid. Thanks!
@zugzwangelist 15 days ago
Everything is great in this video, except there should be a minus sign at the beginning of the policy loss function. It could also be worth mentioning that usually the value loss and the policy loss are added, and the gradients from that total loss function are used to update both the value network and the policy network.
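To illustrate the point above about the minus sign and the combined loss, here is a minimal PyTorch-style sketch. This is not the video's code; the variable names and the coefficients c1 and c2 (value-loss weight and entropy bonus) are illustrative assumptions.

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, c1=0.5, c2=0.01):
    """Combined PPO loss: negative clipped surrogate + value loss - entropy bonus."""
    ratio = torch.exp(new_log_probs - old_log_probs)           # pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()        # the leading minus turns the
                                                               # maximized objective into a loss
    value_loss = (returns - values).pow(2).mean()              # MSE for the value network
    return policy_loss + c1 * value_loss - c2 * entropy.mean()
```

Minimizing this total loss with gradient descent updates both networks at once, which is the scheme described in the comment above.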
@YTD_AI 4 days ago
Wow! That was an amazing explanation. Please do the GRPO explainer too 🙏🙏
@matanavitan4838 5 days ago
Great work buddy!
@deter3 2 months ago
Great courses!!! Working on o1-style training, this helps a lot.
@faisaldj 3 months ago
Serrano brother... God bless you...
@learnenglishwithmovie8485 10 months ago
Since I am familiar with RL concepts, it was boring at the beginning, but it finished awesome. Thanks!
@sergioa.serrano7993 1 year ago
Excellent explanation, professor!!
@gemini_537 9 months ago
You are a genius!!
@mavichovizana5460 10 months ago
Great example and clear explanation!
@gemini_537 9 months ago
Gemini: This video is about Proximal Policy Optimization (PPO) and its applications in training large language models. The speaker, Luis Serrano, starts the video by explaining what PPO is and why it is important in reinforcement learning. Then he dives into the details of PPO with a grid world example. Here are the key points of the video:
* Proximal Policy Optimization (PPO) is a method commonly used in reinforcement learning. [1]
* It is especially important for training large language models. [1]
* In reinforcement learning, an agent learns through trial and error in an environment. The agent receives rewards for good actions and penalties for bad actions. [1]
* The goal is to train the agent to take actions that maximize the total reward it receives. [1]
* PPO uses two neural networks: a value network and a policy network. [2]
* The value network estimates the long-term value of being in a particular state. [2]
* The policy network determines the action the agent should take in a given state. [2]
* PPO trains the value network and policy network simultaneously. [2]
* The speaker uses a grid world example to illustrate the concepts of states, actions, values, and policy. [2,3,4,5]
* In the grid world example, the agent is a small orange ball that moves around a grid. [2]
* The goal of the agent is to get as many points as possible. [2]
* The agent receives points by landing on squares with money and avoids squares with dragons. [2]
* The speaker explains how to calculate the value of each state in the grid world. [4]
* The value of a state is the maximum expected reward the agent can get from that state. [4]
* The speaker also explains how to determine the best policy (i.e., the best action to take) for each state in the grid world. [5]
* Once the value and policy are determined for all states, the agent can start acting in the environment. [5]
* PPO uses a clipped surrogate objective function to train the policy network. [8,9]
* This function helps to ensure that the policy updates are stable and do not diverge too much. [8,9]
Overall, this video provides a clear and concise explanation of Proximal Policy Optimization (PPO) with a focus on its application in training large language models.
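To connect this summary to code, here is a rough sketch of the two networks PPO trains. It is a simplified illustration rather than the video's exact setup; the layer sizes, names, and the 2-feature state / 4-action grid-world shapes are assumptions.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Estimates the long-term value of being in a given state."""
    def __init__(self, n_state_features, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_state_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

class PolicyNet(nn.Module):
    """Outputs a probability distribution over actions for a given state."""
    def __init__(self, n_state_features, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_state_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

# Both networks are trained together: the value network regresses toward observed
# returns, and the policy network is updated with the clipped surrogate objective.
value_net, policy_net = ValueNet(2), PolicyNet(2, 4)   # e.g., (row, col) state, 4 moves
```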
@fangyi3358 1 month ago
Thank you for the great video.
@RussAbbott1 3 days ago
I understand what you said, but I don't understand how that trains a neural net. What am I missing? Thanks.
@itsSandraKublik 1 year ago
Let's goooo!
@SerranoAcademy 1 year ago
Yayyy!!! :)
@HenriqueSousa-ub5en 10 months ago
I think that in the case of the policy loss you want to maximize it instead of minimizing it. A positive gain means you need to increase the weights that contribute to increasing the probability of the considered action, and therefore you should do gradient ascent on the weights.
@unclecode 7 months ago
Appreciate the great explanation. I have a question regarding the clipping formula at 36:42. You used the "min" function. For example, if the ratio is 0.4 and epsilon is 0.3, that suggests we should get 0.7 in this scenario. However, the formula you introduced here returns 0.4. Shouldn't the formula be clipped_f(x) = max(1 - epsilon, min(f(x), 1 + epsilon))? Am I missing anything?
@aedi_nav 4 months ago
I was thinking the same thing before. But when I looked up the more detailed explanation in the paper, it says that constraining the value to the range only happens inside the clip function. At the end, we again take the minimum of the unclipped value (on the left side) and the clipped result. It says: "Finally, we take the minimum of the clipped and unclipped objective, so the final objective is a lower bound (i.e., a pessimistic bound) on the unclipped objective. With this scheme, we only ignore the change in probability ratio when it would make the objective improve, and we include it when it makes the objective worse."
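A quick numeric check of the exchange above (plain Python; the ratio and epsilon come from the question, while the positive advantage value is an arbitrary assumption):

```python
eps = 0.3
ratio = 0.4                 # pi_new / pi_old for the sampled action
advantage = 2.0             # assumed positive advantage

clipped_ratio = max(1 - eps, min(ratio, 1 + eps))        # clip(0.4, 0.7, 1.3) = 0.7
objective = min(ratio * advantage, clipped_ratio * advantage)
print(clipped_ratio, objective)                          # 0.7 0.8
```

So the clip term alone would give 0.7, but the outer min keeps the smaller, unclipped term 0.4 × 2.0 = 0.8, which is exactly the pessimistic lower bound the paper quote describes.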
@senx8758 2 months ago
One question: at 24:40, there is an actual value in the training data and a value NN output. But in ChatGPT, the annotators' label data is for the value NN, and the policy NN is trained with the value provided by the value NN, so the value NN output and the actual value are the same. How is the policy trained in that scenario?
@KumR 1 year ago
Excellent Session Luis. Can we have a similar one for DPO as well ?
@SerranoAcademy 1 year ago
Thank you! Yes, after this is RLHF and then DPO
@KumR 1 year ago
Can't wait...
@e555t66 3 days ago
I did MIT's MicroMasters course and it introduced RL with an MDP. Can you tell me what its shortfalls were? I am going to ask GPT, but I will leave the question here.
@kplim9873 8 days ago
Wow, PPO explained without much math!
@kshitijdesai2402 5 months ago
Loved it! :)
@wirotep.1210 9 months ago
THE BEST on PPO.
@deema_c 8 months ago
THE BEST !!!!!!!!!
@ciciy-wm5ik 4 months ago
Can I ask, when training the value NN, why not optimize the prediction for each state separately, instead of optimizing the prediction over each path?
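For background on this question (not the video's wording): the value network's regression targets are usually built from the rewards accumulated along each sampled path, for example the discounted return from each step onward, which is why the optimization is written over paths rather than over isolated states. A minimal sketch, with the discount factor and example rewards chosen arbitrarily:

```python
def returns_to_go(rewards, gamma=0.9):
    """Discounted return from each step to the end of the path.

    These per-step returns are a typical regression target for the value network.
    """
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example path: two small step penalties, then a terminal reward.
print(returns_to_go([-1.0, -1.0, 10.0]))   # approximately [6.2, 8.0, 10.0]
```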
@guzh 1 year ago
L_policy^CLIP seems to be incorrect; what is ρ? The min of clip() is always the lower bound. Can you give a reference?
@cromi4194 11 months ago
Am I correct in pointing out that the loss is the negative of this expectation? A loss is always something we want to decrease, so this is the gain without the minus sign?
@andreanegreanu8750 8 months ago
Alex, thank you for your incredible work making complex things accessible. But why the hell does the value function share the same parameters theta as the policy function?! Can you confirm that? And if this is the case, why?
@Q793148210 7 months ago
It was just so clear. 😃
@johnzhu5735 7 months ago
This was very helpful
@ognjenbaucal2654 6 months ago
When I'm training the policy network, how do I know what the value for the scenario is from a previous iteration? My network has weights and biases that calculate the probability of a given action in a state. How do I get the probability for the same action in the same state from a previous iteration? I would have to use the weights and biases from before I updated them.
@ognjenbaucal2654 6 months ago
Also, how would this work for iteration 1? I initialize random weights and biases; since it's the first iteration, what is the previous iteration's result?
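On both questions in this thread: a common implementation detail (a general PPO practice, not something shown in the video) is to store the action probabilities computed at data-collection time; those stored numbers then play the role of the "previous iteration" during the update, and on the very first gradient step the ratio is simply 1. A rough sketch with made-up toy tensors:

```python
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(),
                           nn.Linear(16, 3), nn.Softmax(dim=-1))
states = torch.randn(5, 4)              # 5 visited states, 4 features each (toy data)
actions = torch.randint(0, 3, (5,))     # actions that were actually taken

# 1) While collecting the path, snapshot the probabilities under the CURRENT weights.
with torch.no_grad():
    old_log_probs = torch.log(policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1))

# 2) During the update (possibly several gradient steps later), recompute with the
#    latest weights and compare against the stored snapshot.
new_log_probs = torch.log(policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1))
ratio = torch.exp(new_log_probs - old_log_probs)
print(ratio)                            # all ones here, since no weight update has happened yet
```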
@chenqu773 1 year ago
Thank you Luis! One thing, though, that I don't catch: why decrease the policy probability when the value needs to go down, and increase it when the value goes up? I can't see the coupling between the trend of the value and that of the policy.
@SerranoAcademy 1 year ago
Great question! Yeah I also found that part a bit mysterious. My guess is that as we’re training both the value and policy NNs at the same time, that they kind of capture similar information. So if the value NN underestimated the value of a state, then it’s likely that the policy NN also underestimates the probabilities to get to that state. So as we increase the value estimate, then we should also increase the probability estimate. But if you have any other thoughts lemme know, I’m still trying to wrap my head around it…
@JaniMikaelOllenberg 11 months ago
Wow, this video is so awesome! Your book link doesn't seem to work in the description :)
@SerranoAcademy 11 months ago
Thank you, and thanks so much for pointing it out! Just fixed it.
@synchro-dentally1965 1 year ago
Might be a novel approach for robotics: represent paths as Gaussian splats and use spherical harmonics as the "recommended" directions within those splats to reach a goal/endpoint.
@ciciy-wm5ik 4 months ago
I don't get it: you increase the probability just because you underestimated the gain by 10? Then what if in other directions you underestimate by 100, 1,000, or 10,000? Should you decrease the probability?
@fgh680 1 year ago
Is clipping done to avoid vanishing/exploding gradients?
@SerranoAcademy 1 year ago
Great question, yes absolutely! If the gradient is too big or too small, then that messes up the training, and that's why we clip it to something in the middle.
@TerryE-mo2ky 11 months ago
@@SerranoAcademy It seems to me that the lower bound of the probability ratio is not determined by the clipping function, since the min function takes the minimum of the probability ratio and the result of the clipping function. So if epsilon is 0.3 and the probability ratio is 0.2, the lower bound of the clipping function will be 1 - 0.3 = 0.7, and min(0.2, 0.7) = 0.2.
@jimshtepa5423 1 year ago
I am wondering what level of expertise and knowledge one must have to be able to notice that the impulse must be taken into account when the probability is adjusted, omg :0 Even if I lived 2 lives spanning 200 years, I would never realize that the impulse must be taken into account.
@jimshtepa5423 1 year ago
When explaining the formula with mathematical notation, where exactly is the notation for summing the values for each step, just as you summed them a little earlier, before explaining the math formula?
@SerranoAcademy 1 year ago
Great question! You mean in the surrogate objective function? Yes, I skimmed over that part, but at the end when you see the expected value sign, that means we're looking at the average of functions for different actions (those taken along a path).
@neelkamal3357 2 months ago
wow
@jimshtepa5423 1 year ago
The musical inserts between concepts are too loud and too long in this ever-decreasing attention span world. The presentation and material are amazing. Thank you.
@illbeback12 10 months ago
Weird comment
@LatentSpaceD 1 month ago
Those are pancakes, not money.