Proximal Policy Optimization | ChatGPT uses this

22,047 views

CodeEmporium


Comments: 51
@CodeEmporium 1 year ago
Thanks for watching! If you think I deserve it, please consider hitting that like button, as it helps spread this channel. More breakdowns to come!
@Punch_Card 4 days ago
What are the quiz answers?
@арсланвалеев-д9у 10 months ago
Hi! Great video! Could you answer my question about training the policy? This happens at 10:00: why are the action probabilities obtained here different from the probs collected while gathering data? I think we haven't changed the policy network before this step. So, if we haven't changed the network yet, at 10:08 we would get ratio == 1 on every step(
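For reference, a minimal sketch of how the PPO probability ratio is typically computed, in PyTorch-style code with illustrative names (not the video's code): the log-probs stored while gathering data stay fixed, while the current policy's log-probs are recomputed at every update epoch, so the ratio is roughly 1 on the first epoch and drifts away from 1 on later epochs once the policy has been updated.

```python
import torch

# Minimal sketch of the PPO probability ratio (illustrative names, not the video's code).
def ppo_ratio(policy, states, actions, old_log_probs):
    # old_log_probs were recorded while gathering data and are kept fixed (detached).
    dist = torch.distributions.Categorical(logits=policy(states))
    new_log_probs = dist.log_prob(actions)  # log pi_new(a|s) under the current policy
    # ratio = pi_new(a|s) / pi_old(a|s): about 1 on the first update epoch,
    # different from 1 on later epochs, after the policy has been updated.
    return torch.exp(new_log_probs - old_log_probs)
```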
@jyotsnachoudhary8999 1 month ago
Finally, a great video that explains the entire training process so clearly and effectively! Thank you so much for this. It would be great if you could create a video on Direct Preference Optimization (DPO) as well. :)
@PatrickConnor-i2q 1 year ago
I like the clarity that your video provides. Thanks for this primer. A couple of things, though, were a bit unclear, and perhaps you could elaborate on them here in the comments.
- It wasn't obvious to me how/why you would submit all of the states at once (to either network) and update with an average loss, as opposed to training on each state independently. I get that we have an episode of related/dependent states here -- maybe that's why we use the average instead of the directly associated discounted future reward?
- Secondly, in your initial data sampling stage you collected outputs from the policy network. During the training phase it looks like you're sampling again, but your values are different. How is this possible unless your network has changed somehow? Maybe you're using drop-out or something like that?
Forgive the questions -- I'm just learning about this methodology for the first time.
@ТимИсаков-ц7щ 6 months ago
I'm also interested in the answer to the second question.
@cauamp 2 months ago
@@ТимИсаков-ц7щ Any answers?
@martinleykauf6857 4 months ago
Hi! I'm currently writing my thesis and using PPO in my project. Your video was a great help in getting a more intuitive understanding of the algorithm! Keep it up man, very very helpful.
@vastabyss6496 1 year ago
What's the purpose of having a separate policy network and value network? Wouldn't the value network already give you the best move in a given state, since we can simply select the action the value network predicts will have the highest future reward?
@yeeehees2973 9 months ago
More to do with balancing exploration/exploitation, as simply picking the maximum Q-value from the value network yields suboptimal results due to limited exploration. Alternatively, using only a policy network would yield overly noisy updates, resulting in unstable training.
@sudiptasarkar4438 9 months ago
@@yeeehees2973 I feel that this video is misleading at 02:06. Previously I thought the value function's objective is to estimate the max reward value of the current state, but this guy is saying otherwise.
@yeeehees2973 9 months ago
@@sudiptasarkar4438 The Q-values inherently try to maximize the future rewards, so the Q-value of being in a certain state can be interpreted as the maximum future reward given this state.
@patrickmann4122 8 months ago
It helps with something called “baselining” which is a variance reduction technique to improve policy gradients
@好了-t4d 6 months ago
That's because this kind of algorithm can deal with continuous actions, unlike DQN. That's the key point of bringing policy gradients into Q-learning, and the value network is the Q-learning part.
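For readers following this thread, here is a minimal sketch, assuming a PyTorch-style setup with illustrative names, of why the two networks have different jobs: the policy (actor) outputs a distribution over actions to sample from, while the value network (critic) outputs a single scalar estimate of the state's worth, which serves as a baseline when computing advantages. Many implementations share one backbone with two heads instead of two separate networks; both variants are common.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Actor: maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # one logit per action
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

class ValueNet(nn.Module):
    """Critic: maps a state to a single scalar value estimate V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),  # scalar output, independent of the action space size
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)
```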
@srivatsa1193 1 year ago
I've really enjoyed this series so far. Great work! The world needs more passionate teachers like yourself. Cheers!
@CodeEmporium 1 year ago
Thanks so much for the kind words! I really appreciate it :)
@vlknmrt 2 months ago
Thanks, it is really a very explanatory video!
@swagatochakraborty2583 9 months ago
Great presentation. One question: why is the policy network a separate network from the value network? It seems like the action probabilities should be based on estimating the expected reward values. I think in my Coursera course on reinforcement learning they were using the same network and simply copying over the weights from one to the other, so they were essentially time-shifted versions of the same network, trained just once.
@ashishbhong5901 1 year ago
Good presentation and breakdown of concepts. Liked your video.
@burnytech 5 months ago
Great stuff mate
@ZhechengLi-wk8gy 1 year ago
Like your channel very much, looking forward to the coding part of RL.😀
@vivian.deeplearning 1 month ago
Didn't explain why the clipping, the min, or anything about the loss.
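For what it's worth, a minimal sketch of the clipped surrogate objective from the PPO paper, in PyTorch-style code with illustrative names: clipping the ratio to [1 - eps, 1 + eps] and taking the min with the unclipped term gives a pessimistic bound, so the policy gains nothing by moving too far from the policy that collected the data.

```python
import torch

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Unclipped surrogate objective: ratio * advantage.
    unclipped = ratio * advantage
    # Clipped surrogate: the ratio is not allowed to leave [1 - eps, 1 + eps].
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The min makes this a lower bound on the unclipped objective, removing any
    # incentive to push the ratio outside the clip range; negate it to minimize.
    return -torch.min(unclipped, clipped).mean()
```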
@iliyafarahani-y3c 1 month ago
Hi, thanks for your great video, but one question: do you mean that to calculate the loss for the policy network we need to get the action probabilities two times and then calculate it?
@borneoland-hk2il 3 months ago
Is the PPO you explained PPO-Penalty or PPO-Clip, and what is the difference?
@2_Tou 7 months ago
I think the calculation shown at 5:45 is not the advantage. The advantage of an action is calculated by taking the average value of all actions in that state and finding the difference between that average and the value of the action you are interested in. That calculation looks more like an MC target to me. Please point out if I made a mistake, because I always do...
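For context, a minimal sketch, with illustrative names, of the Monte-Carlo-style estimate being discussed: the discounted return R_t serves as an estimate of the action's value, and subtracting the value network's prediction for the state (which plays the role of the "average over actions" mentioned above) gives the advantage estimate A_t = R_t - V(s_t).

```python
def discounted_returns(rewards, gamma=0.99):
    # R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def mc_advantages(rewards, values, gamma=0.99):
    # Advantage estimate: Monte Carlo return minus the value baseline, A_t = R_t - V(s_t).
    return [R - v for R, v in zip(discounted_returns(rewards, gamma), values)]
```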
@ericgonzales5057 10 months ago
WHERE DID YOU LEARN THIS?!??! PLEASE ANSWER
@victoruzondu6625 9 months ago
What are VF updates, and how do we get the value for our clipped ratio? You didn't seem to explain them. I could only tell the last quiz answer is a B because the other options complement the policy network, not the value network.
@ns-eb7dw 2 months ago
You define the value function as essentially being the Q function (i.e. a function that takes both a state and an action as arguments) and you say A(s,a) = R_t - Q(s,a), where R_t is the total discounted reward from step t onwards. Many other sources define the value function as unary, i.e. it only takes a state argument, and say that A(s,a) = R_t - V(s). Can you comment on this difference?
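For what it's worth, the two conventions are consistent; a sketch in standard notation, where R_t is the discounted return from step t (a Monte Carlo estimate of the action value):

```latex
A(s_t, a_t) = Q(s_t, a_t) - V(s_t), \qquad
V(s_t) = \mathbb{E}_{a \sim \pi(\cdot \mid s_t)}\big[ Q(s_t, a) \big], \qquad
\hat{A}_t = R_t - V(s_t).
```

Since R_t estimates Q(s_t, a_t), subtracting V(s_t) recovers an estimate of the advantage; a network with one output per action is estimating Q, while a network with a single scalar output is estimating V.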
@inderjeetsingh2367 1 year ago
Thanks for sharing 🙏
@CodeEmporium 1 year ago
My pleasure! Thank you for watching
@OPASNIY_KIRPI4 1 year ago
Please explain how you can apply backpropagation through the network using just a single loss number. As far as I understand, an input vector and a target vector are needed to train a neural network. I would be very grateful for an explanation.
@CodeEmporium 1 year ago
The single loss is "backpropagated" through the network to compute the gradient of the loss with respect to each parameter of the network. This gradient is later used by an optimizer algorithm (like gradient descent) to update the neural network's parameters, effectively "learning". I have a video coming out on this tomorrow explaining backpropagation in my new playlist "Deep Learning 101", so do keep an eye out for it.
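To make that concrete, here is a toy sketch, with illustrative names rather than the video's code, of the loop described above: any single scalar loss is enough, because backpropagation turns it into one gradient per parameter, and the optimizer then updates every parameter from those gradients.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)                                   # a tiny stand-in network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 4)                 # a batch of input vectors
loss = model(x).pow(2).mean()         # any single scalar loss value

optimizer.zero_grad()                 # clear gradients from the previous step
loss.backward()                       # backprop: d(loss)/d(theta) for every parameter theta
optimizer.step()                      # gradient-based update of all parameters
```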
@OPASNIY_KIRPI4 1 year ago
Thanks for the answer! I'm waiting for a video on this topic.
@obieda_ananbeh 1 year ago
Thank you!
@footube3 11 months ago
Could you please explain what up, down, left and right signify? In which data structure are we going up, down, left or right?
@CodeEmporium 11 months ago
Up, down, left and right are individual actions that an agent can possibly take. You could store these in an "enum" and sample a random action from it.
@borneoland-hk2il 4 months ago
Please make a Soft Actor-Critic video. Is it policy-based, value-function-based, both, or neither?
@kuteron307 4 days ago
I don't think you've explained the value network correctly. The output of this network should be a one-dimensional scalar value (e.g. a score of 0.92 for a certain state), rather than being as big as the action space. By having the output be a scalar, you get different results for this network than for the policy network, and it can then be trained on the reward function.
@pushkinarora5800 5 months ago
Q1: B, Q2: B, Q3: B
@diegosabajo2182 18 days ago
Quiz 1: B
@paull923 1 year ago
Great video! The quizzes especially are a good idea. B, B, B, I'd say.
@CodeEmporium 1 year ago
Thanks so much! It’s fun making them too. I thought it would be a good way to engage. And yep the 3 Bs sound right to me too 😊
@zakariaabderrahmanesadelao3048 1 year ago
The answer is B.
@CodeEmporium 1 year ago
Ding ding ding for the Quiz 1!
@sujalsuri1109 1 month ago
Q1. B
@BboyDschafar 1 year ago
FEEDBACK. Either from experts/teachers, or from the environment.
@sujalsuri1109 1 month ago
2. B
@0xabaki 11 months ago
Haha, finally no one has done quiz time yet! I propose the following answers: 0) seeing that the opportunity cost of an action is low; 1) A; 2) B; 3) D
@sashayakubov6924 7 months ago
I didn't understand anything... apparently I'll need to ask ChatGPT for clarification.
@id104442304 11 months ago
bbb