Reinforcement Learning in Continuous Action Spaces | DDPG Tutorial (PyTorch)

  32,039 views

Machine Learning with Phil

Days ago

Comments: 86
@MachineLearningwithPhil 5 years ago
This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.
@hussainalaaedi 4 years ago
Hi Mr. Phil, how can I get the source code for this tutorial?
@bobingstern4448 3 years ago
My tip is to try to follow along with a different environment or goal so you can learn to implement it in your own way. Great tutorial Phil as always!
@MachineLearningwithPhil 3 years ago
Great tips!
@RedShipsofSpainAgain 5 years ago
1:15 "Deep Deterministic Policy Gradients work by combining the magic of actor-critic methods with the magic of Deep Q learning, which of course has a replay buffer." I clearly need to read up on the background for this.
@yehiaheshamsaid5213 4 years ago
This is so helpful. I like that you explained every theoretical detail alongside the implementation, and why we do things the way we do. I was also hoping to see the theory of mu, the continuous-action policy, explained here. I will check your paper video next to see if you cover that.
@MachineLearningwithPhil 5 years ago
You can see the code for this tutorial here, including the trained model parameters. github.com/philtabor/Youtube-Code-Repository/tree/master/ReinforcementLearning/PolicyGradient/DDPG/lunar-lander/pytorch
@MP-nn6wo 2 years ago
Unfortunately, the page is no longer available. Did you perhaps just move the files?
@marcuslee7575 4 years ago
Hi Phil, why did you add the state_value and action_value together instead of concatenating them?
@gfbio 2 years ago
Hi Phil, does gym take care of the action and state normalization/denormalization? Or should I denormalize the action before calling env.step?
@geo2073 5 years ago
Great! Looking forward to the paper review!
@albertyang3128 3 days ago
Hi Phil, how long did it take your model to complete training? I'm currently training mine and it's taking awfully long, almost 20 seconds an episode.
@ahmadalghooneh2105 5 years ago
Love your videos and you Phillll, !!!
@talipturkmen9041 4 years ago
Thank you very much for such an amazing video. Could you please implement PPO on the Reacher environment?
@rajghugare6161 4 years ago
The DDPG paper uses batch normalisation, but you used layer normalisation instead. Doesn't that make a difference?
@FHidber 5 years ago
Wow, I'm early. Great video, I've been looking to get into DDPG. This saves me the trouble of going through dozens of git repos and scraping together how to do it. Thanks!
@TheKineshma 5 years ago
Amazing video!! Keep up the good work!!
@ahmadalghooneh2105 5 years ago
Phil, a question: in the CriticNetwork, why did you add the state and action values together? Shouldn't you define something like Q(s, a), so that the network has input size = state_shape[0] + n_actions?
@eshaanbarkataki9198 5 years ago
I assume you are referring to CriticNetwork.forward(). The reason Phil did T.add(state_value, action_value) is that the state has a large shape; it could even be a whole image, while the action is a small vector that only represents the chosen action. Since the state is high-dimensional, it needs a more complex network (for example a 4-layer network with layers of 100-300 nodes), whereas the action can be handled by a much simpler one (2-4 layers with 2-10 nodes each). T.add(state_value, action_value) then combines the small network (for the action) and the complex network (for the state) into one network over both inputs. If you made the critic take the action alongside the state at the input, the action would be pushed through the same complex network even though that isn't needed. I tried to explain as best I could; if you have any questions about my response, please let me know. I am also new to reinforcement learning. :)
@ahmadalghooneh2105 5 years ago
Guys, I found the reason: Phil first passes the state through two layers, then passes the action through a separate layer, and by adding action_value and state_value he merges the two streams, so we get Q(s, a); then he passes the sum through one more layer. Done!
@MachineLearningwithPhil 5 years ago
You are correct good sir!
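For readers following along, here is a minimal self-contained sketch of the merge being discussed (simplified, not Phil's exact code; layer sizes are assumptions): the state goes through two linear layers, the action through one, and the two feature vectors are added element-wise before the final Q layer.

    import torch as T
    import torch.nn as nn
    import torch.nn.functional as F

    class CriticSketch(nn.Module):
        def __init__(self, input_dims, n_actions, fc1_dims=400, fc2_dims=300):
            super().__init__()
            self.fc1 = nn.Linear(input_dims, fc1_dims)            # state path, layer 1
            self.fc2 = nn.Linear(fc1_dims, fc2_dims)              # state path, layer 2
            self.action_value = nn.Linear(n_actions, fc2_dims)    # action enters here
            self.q = nn.Linear(fc2_dims, 1)                       # scalar Q(s, a)

        def forward(self, state, action):
            state_value = F.relu(self.fc1(state))
            state_value = self.fc2(state_value)                   # shape (batch, fc2_dims)
            action_value = F.relu(self.action_value(action))      # shape (batch, fc2_dims)
            state_action_value = F.relu(T.add(state_value, action_value))  # element-wise add merges the two streams
            return self.q(state_action_value)

Concatenating state and action at the input (input size state_shape[0] + n_actions) also works; the add-based merge simply lets the state pass through more layers before the action is injected.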
@abdullahmosibah560 5 years ago
Looking forward to the paper implementation video.
@zapkif 2 years ago
Thanks for this and all the other videos, Phil! Quick question: the way you defined the update in update_network_parameters(self): critic_state_dict[name] = tau * critic_state_dict[name].clone() + (1 - tau) * target_critic_dict[name].clone() — I think that for the critic to move slowly, we want tau to be close to 1 (like 0.999), but you passed 0.001 as tau. I'm definitely getting better convergence when I set tau=0.999 with the code as it is.
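For reference, the soft update in the DDPG paper is θ' ← τθ + (1 − τ)θ' with τ = 0.001, where θ are the online parameters and θ' the target parameters; in Phil's code the blended state dict is loaded into the target network, so a small τ matches the paper's convention. Whether a larger τ helps in practice is an empirical question. A minimal sketch of the paper's convention:

    def soft_update(online_net, target_net, tau=0.001):
        # target <- tau * online + (1 - tau) * target, applied parameter by parameter
        for online_param, target_param in zip(online_net.parameters(), target_net.parameters()):
            target_param.data.copy_(tau * online_param.data + (1.0 - tau) * target_param.data)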
@vahidgholamzade1964 2 years ago
Dear Phil, could you please explain how to deal with invalid actions? For example, if the action returned by choose_action is 2.5 but that action is impossible to execute, what should we do besides giving a large negative reward?
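One common option (a sketch, not from the video) is to clip the raw action to the environment's bounds before stepping, instead of relying only on a large negative reward; agent, env, and observation here are the names used in the video's training loop:

    import numpy as np

    action = agent.choose_action(observation)                       # may fall outside the valid range
    action = np.clip(action, env.action_space.low, env.action_space.high)
    observation_, reward, done, info = env.step(action)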
@harshraj22_ 2 years ago
The critic in actor-critic had a final output of dim=1 (denoting the value function for the given state). Why does the critic in DDPG have a final output of dim=env.action_space.n, and what does that represent?
@MachineLearningwithPhil 2 years ago
I think you mean actor. Here the n refers to the components of the continuous action space. We are outputting an action, rather than a probability distribution.
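A small self-contained sketch of the contrast (layer names and sizes are hypothetical): the DDPG actor ends in a tanh that outputs one value per continuous action component, while a discrete actor-critic ends in a softmax over the discrete actions.

    import torch as T
    import torch.nn as nn
    import torch.nn.functional as F

    x = T.randn(1, 300)                        # features from the actor's last hidden layer
    mu_layer = nn.Linear(300, 2)               # DDPG: one output per continuous action component
    pi_layer = nn.Linear(300, 4)               # discrete actor-critic: one logit per discrete action

    action = T.tanh(mu_layer(x))               # deterministic, bounded action vector in [-1, 1]^2
    probs = F.softmax(pi_layer(x), dim=1)      # probability distribution over the 4 discrete actions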
@janeskitchen8294 4 years ago
Hello, this is a great lesson. I just want to create an obs value myself: what type should it be (ndarray or tensor) when I specify obs inside the while not done: loop?
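For what it's worth, gym environments return observations as NumPy arrays, and the agent's choose_action converts them to a tensor internally, so a plain ndarray of the right shape and dtype is what you would pass in. A sketch (the 8-element shape matches LunarLanderContinuous; adjust for your environment):

    import numpy as np
    import torch as T

    obs = np.array([0.1, -0.2, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0], dtype=np.float32)  # ndarray, like env.reset() returns
    state = T.tensor(obs, dtype=T.float).unsqueeze(0)   # roughly what choose_action builds internally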
@bryanbocao4906 4 years ago
Thanks for the video!
@robosergTV 5 years ago
thanks! You are doing good work
@MachineLearningwithPhil 5 years ago
Thanks Roboserg!
@Perryman1138 5 years ago
In the Agent learn function, should line 218 read 1 - done? It looks like it would zero out future rewards until it's done, unless I misunderstand the code.
@MachineLearningwithPhil 5 years ago
Great question. We take care of the 1 - done up in the ReplayBuffer class when we store the transitions.
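For context, a simplified sketch of that part of the buffer (not Phil's exact code): the terminal flag is stored as 1 - done, so multiplying the bootstrapped value by it zeroes out the future term at terminal transitions.

    import numpy as np

    class ReplayBufferSketch:
        def __init__(self, max_size, input_shape, n_actions):
            self.mem_size = max_size
            self.mem_cntr = 0
            self.state_memory = np.zeros((max_size, *input_shape), dtype=np.float32)
            self.new_state_memory = np.zeros((max_size, *input_shape), dtype=np.float32)
            self.action_memory = np.zeros((max_size, n_actions), dtype=np.float32)
            self.reward_memory = np.zeros(max_size, dtype=np.float32)
            self.terminal_memory = np.zeros(max_size, dtype=np.float32)

        def store_transition(self, state, action, reward, state_, done):
            index = self.mem_cntr % self.mem_size
            self.state_memory[index] = state
            self.new_state_memory[index] = state_
            self.action_memory[index] = action
            self.reward_memory[index] = reward
            self.terminal_memory[index] = 1 - done   # 0 at terminal states, 1 otherwise
            self.mem_cntr += 1

In learn(), the TD target then looks like reward + gamma * target_q * terminal, which is why no extra 1 - done is needed there.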
@marcosflaks5214 4 years ago
Hi Phil, your video is great!!! I just have one question. When you do the forward propagation in eval() mode, the BatchNorm layer doesn't update the running_var and running_mean used to normalize its input. So all your inputs will always be normalized with mean=0 and var=1 (the defaults). Shouldn't we update running_var and running_mean? As far as I know, that only happens if we run a forward propagation in train() mode.
@MachineLearningwithPhil 4 years ago
You are correct. It's been 8 months since I did the video, so I'm trying to remember why I did it that particular way. I should have commented the code better, in hindsight :p I'm going back over the code now for my new course on actor-critic methods. Once I have it clear in my mind I'll update the GitHub repo for the YT channel, either with an explanatory comment or with a correction. The git for the code is here: github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/PolicyGradient/DDPG/lunar-lander/pytorch/ddpg_torch.py
@marcosflaks5214 4 years ago
Your videos are amazing!!! I learned a lot from them. I spent more than a week just trying to understand eval and train mode in PyTorch. Here is what I learned:
During training, the layer computes the mean and variance of the input x over the whole batch (population variance) and normalizes with them: output(train) = (x - mean) / sqrt(var + epsilon) * weight + bias.
Only in training mode does it update the running_mean and running_var used in eval mode. Default initial values: running_mean=0, running_var=1, weight=1, bias=0, momentum=0.1, epsilon=1e-5.
running_mean = momentum * batch_mean + (1 - momentum) * previous running_mean
running_var = momentum * batch_var + (1 - momentum) * previous running_var (here batch_var is the sample variance, dividing by n-1)
In eval mode: output(eval) = (x - running_mean) / sqrt(running_var + epsilon) * weight + bias.
Maybe at the end of the learn function we should set the actor and critic to train mode and do a forward propagation; that would update running_var and running_mean for the next iteration's eval-mode pass. Also, maybe we should update critic_target and actor_target with the running_var and running_mean using the soft update too. Please let me know if this makes sense. Looking forward to seeing more videos from you! You are an excellent teacher! Best
@mrmatthewleigh 4 years ago
@@marcosflaks5214 It's kind of important to note that Phil isn't actually using a BatchNorm layer in this network. Rather he is using LayerNorm; not sure if that is a mistake or not, but they are really similar. pytorch.org/docs/master/generated/torch.nn.LayerNorm.html The big difference is that LayerNorm does not apply a different transformation in training versus evaluation mode. So nothing is actually changing when he calls eval() or train()...
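A two-line sketch of the practical difference between the two layers:

    import torch.nn as nn

    bn = nn.BatchNorm1d(400)   # normalizes each feature across the batch; keeps running_mean/running_var; train()/eval() differ
    ln = nn.LayerNorm(400)     # normalizes across the 400 features of each sample; no running statistics; train()/eval() identical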
@clapdrix72 7 months ago
@MachineLearningwithPhil, does your Udemy course curriculum cover custom environments for actor-critic networks in PyTorch or TF? I'm applying SAC to a custom problem, so I don't have any need for pre-baked environments.
@MachineLearningwithPhil 7 months ago
It does not, unfortunately. Shoot me an email please.
@clapdrix72 7 months ago
Will do
@KodandocomFaria 4 years ago
How can I create my own gym environment based on the real world? For instance, I would like to train my model on a webpage that offers some information about the environment and gives me some actions I could take. Based on this information, I would like to create a gym environment that can be used with this model for training and later in production. Is that possible?
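It is possible: you wrap your data source in the gym interface and the agent never knows the difference. A hypothetical sketch using the classic gym API (all names, shapes, and bounds here are placeholders):

    import gym
    from gym import spaces
    import numpy as np

    class WebPageEnv(gym.Env):
        def __init__(self):
            super().__init__()
            self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(8,), dtype=np.float32)
            self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

        def reset(self):
            return self._get_obs()

        def step(self, action):
            # apply the action to your real system here, then observe the result
            obs = self._get_obs()
            reward = 0.0          # compute from your own metric
            done = False
            return obs, reward, done, {}

        def _get_obs(self):
            return np.zeros(8, dtype=np.float32)   # replace with the features scraped from the page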
@thomashirtz 3 years ago
Don't we need to initialize the weights of "action_value" like the other layers (16:06)?
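There is no reply in the thread, but if you want to initialize action_value the same way as the other hidden layers, here is a sketch of the paper's fan-in rule (sizes are hypothetical; whether it matters much in practice is an empirical question):

    import numpy as np
    import torch as T
    import torch.nn as nn

    action_value = nn.Linear(2, 300)                      # e.g. n_actions=2, fc2_dims=300
    f = 1.0 / np.sqrt(action_value.in_features)           # fan-in rule from the DDPG paper's hidden layers
    T.nn.init.uniform_(action_value.weight.data, -f, f)
    T.nn.init.uniform_(action_value.bias.data, -f, f)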
@bruceli7999 2 years ago
The noise added to the action is for training only, right? For evaluation, do we still need the action noise? Thanks :))
@MachineLearningwithPhil 2 years ago
Training only, correct
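One way to make that explicit is an evaluate flag on action selection (a sketch on top of the video's code; the flag is an assumption, not something Phil's version has):

    import torch as T

    def choose_action(agent, observation, evaluate=False):
        agent.actor.eval()
        state = T.tensor([observation], dtype=T.float).to(agent.actor.device)
        mu = agent.actor.forward(state)
        if not evaluate:                                  # add exploration noise only while training
            mu = mu + T.tensor(agent.noise(), dtype=T.float).to(agent.actor.device)
        agent.actor.train()
        return mu.cpu().detach().numpy()[0]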
@eshaanbarkataki9198 5 years ago
Great video! I just have one question. In the line -self.critic.forward(state, mu), can you please explain why there is a negative sign? I read the DDPG paper and don't see a minus sign near the policy gradient. Also, do you have to learn calculus, linear algebra, or other math courses in order to understand all these machine learning papers? I only know algebra one and two...
@berkealgul2503 4 years ago
The reason is probably that we are trying to maximize the Q function, but gradient descent minimizes our function. Weights and biases are updated (simply) as follows:
w += -lr * gradient, where gradient = d(loss)/dw
b += -lr * gradient, where gradient = d(loss)/db
The gradients are computed during loss.backward(). Hence we need to flip the sign of the derivatives in order to maximize the Q function:
w += -lr * g, where g = d(-loss)/dw
b += -lr * g, where g = d(-loss)/db
To flip the sign of the gradient we simply negate the loss, and since we use a mean loss, T.mean(-x) = -T.mean(x), so negating inside the mean negates the loss. Strictly speaking this is gradient ascent rather than descent. If anything is wrong, correct me, Phil :)
In addition, derivatives and matrices are quite essential to understanding the math behind deep learning. I recommend you take basic calculus and linear algebra courses.
@eshaanbarkataki9198 4 years ago
I will take your recommendation of learning calc and linear algebra. Even without the experience of calc and linear algebra, I understood your answer. Everything is making sense now, thanks!
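For concreteness, the actor update amounts to minimizing the negative of the critic's estimate, roughly like this (a sketch of the relevant lines in learn(); names follow the video's code):

    self.actor.optimizer.zero_grad()
    mu = self.actor.forward(state)
    actor_loss = T.mean(-self.critic.forward(state, mu))  # negate Q so that minimizing the loss maximizes Q (gradient ascent)
    actor_loss.backward()
    self.actor.optimizer.step()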
@andrestorres7343 A year ago
What learning rate did you use?
@ICREsniper 4 years ago
Hi Phil. Great video! I wanted to ask, why do you use nn.LayerNorm instead of nn.BatchNorm1d for batch normalization?
@MachineLearningwithPhil 4 years ago
Good question. It boils down to issues with model checkpointing with BatchNorm (it doesn't save all the statistics, if I recall correctly). The layer norm does basically the same thing, and evidently works, so I ran with it.
@DanNovischi 3 years ago
@@MachineLearningwithPhil Even if checkpointing may be the issue, I fail to see how BN is the same thing as LN, i.e. BN normalizes across samples for each channel, whereas LN normalizes across channels for each sample. In the 1D case LN is more like instance normalization (IN). Any thoughts as to why LN has at least the same effect in this instance?
@pmz558 5 years ago
Why was I unsubscribed from your channel? Love your videos.
@MachineLearningwithPhil 5 years ago
because YT sucks. Welcome back
@lakshayarora9697 A year ago
How do I plot the training loss values?
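The video's code does not track the losses, so one approach (a sketch, assuming you add actor_losses and critic_losses lists to the Agent and append loss.item() to them inside learn()) is:

    import matplotlib.pyplot as plt

    plt.plot(agent.actor_losses, label='actor loss')
    plt.plot(agent.critic_losses, label='critic loss')
    plt.xlabel('learning step')
    plt.legend()
    plt.savefig('ddpg_losses.png')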
@yueleng6898 4 years ago
Thank you so much for this fantastic video. I am wondering how to implement a Monte Carlo method rather than a TD method. Could you give some references? Thanks again.
@yueleng6898 4 years ago
Another question I have: in my understanding, you implemented layer normalization instead of batch normalization. Is that correct?
@MachineLearningwithPhil 4 years ago
I am indeed doing layer norm instead of batch norm. I ran into issues with model checkpointing with batch norm, and the layer norm does the same thing and appears to work well enough, so I ran with it. Check out the REINFORCE algorithm and REINFORCE with baseline. Here's a good reference for all the actor-critic algorithms: lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
@DiirtyDan1 4 years ago
Great video! What editor are you using to write this code?
@MachineLearningwithPhil 4 years ago
Atom, I believe.
@X4r1l 4 years ago
Hi Phil! I'm trying to use this in an environment where I choose a direction in degrees, i.e. a value between -180 and 180. Should I still initialize the weights of the network with such small numbers? In the first few iterations, it outputs numbers close to 0.
@MachineLearningwithPhil 4 years ago
You can try scaling the actions to between -1 and +1 and then multiply by 180. Perhaps that action range is too much for the network.
@X4r1l 4 years ago
@@MachineLearningwithPhil Thanks for the reply. Do you mean after the forward pass? So then I would take the output from the network, scale it, and then multiply it by 180?
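Yes, after the forward pass; a sketch of the post-processing (assumes the actor ends in tanh, so each component lies in [-1, 1], and env here is the custom environment in question):

    raw_action = agent.choose_action(observation)
    heading_degrees = float(raw_action[0]) * 180.0                  # map [-1, 1] onto [-180, 180]
    observation_, reward, done, info = env.step(heading_degrees)    # or however your environment consumes the angle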
@manuelnovella39 5 years ago
Hey, Phil! Thanks for the video. One question: why choose PyTorch over Keras? Isn't the latter easier?
@MachineLearningwithPhil 5 years ago
They're about equal in terms of difficulty. I'm working on a course for deep learning beginners using PyTorch, so I figured I could use the practice.
@manuelnovella39 5 years ago
@@MachineLearningwithPhil thanks for your answer!
@sunduskhurram4672 4 years ago
I am facing this error: PermissionError: [Errno 13] Permission denied: 'tmp/ddpg\\actor_ddpg'. Need help.
@MachineLearningwithPhil 4 years ago
What command are you using? Is that for mkdir? Try sudo
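A common cause on Windows is that the tmp/ddpg directory does not exist or the path separators are mixed; a sketch of a fix to run before saving checkpoints (the directory name follows the error message above):

    import os

    chkpt_dir = os.path.join('tmp', 'ddpg')   # os.path.join picks the right separator for the OS
    os.makedirs(chkpt_dir, exist_ok=True)     # create the directory tree before torch.save() writes the checkpoint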
@EliorBY 5 years ago
Great video as always! A question: how do you add the L2 regularization on the Q layer of the critic? They mentioned it in the paper, but I didn't see it in the video...
@MachineLearningwithPhil 5 years ago
I didn't add it for this implementation, your eyes do not deceive you.
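If you do want the paper's L2 penalty (weight decay of 10^-2 on the critic), PyTorch can apply it through the optimizer; a sketch of the one-line change inside the CriticNetwork's __init__ (beta is the critic learning rate from the video):

    import torch.optim as optim

    self.optimizer = optim.Adam(self.parameters(), lr=beta, weight_decay=0.01)  # 0.01 is the L2 coefficient from the paper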
@natnaelhabtamu9006 5 years ago
Thank you, Phil!! I have one question: why am I getting this error, and how can I solve it? No registered env with id: LunarLanderContinous-v2
@MachineLearningwithPhil 5 years ago
Make sure you have box-2d installed
@kae4881 4 years ago
It's LunarLanderContinuous-v2; the id in your code is misspelled.
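In other words, the id has to match exactly and the environment needs Box2D; a minimal sketch:

    # pip install box2d-py     (or: pip install gym[box2d])
    import gym

    env = gym.make('LunarLanderContinuous-v2')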
@DaHrakl 4 years ago
What about using a GRU or LSTM at the input layer?
@MachineLearningwithPhil 4 years ago
It's certainly possible. I would encourage you to do it and see if you can improve on my model's performance. You could fork my repo and make it your own.
@myriads0796 3 years ago
Hello, thanks for your video. By the way, do you know why I get a message like "ImportError: cannot import name 'plotLearning' from 'utils' (C:\miniconda\envs\Term-Project\lib\site-packages\utils\__init__.py)"?
@MachineLearningwithPhil 3 years ago
You will need to clone it from my GitHub repo
@myriads0796 3 years ago
@@MachineLearningwithPhil Thanks for your reply
@myriads0796 3 years ago
@@MachineLearningwithPhil I have another question. If I want to watch the video while training, rather than record it, what function should I use?
@myriads0796 3 years ago
I solved it after installing gym[all] (pip install gym[all]) and using env.render(). Thanks.
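For anyone else looking, a minimal sketch of watching an episode with the classic gym API (random actions stand in for agent.choose_action here):

    import gym

    env = gym.make('LunarLanderContinuous-v2')
    observation = env.reset()
    done = False
    while not done:
        env.render()                               # draw the current frame in a window each step
        action = env.action_space.sample()         # stand-in for agent.choose_action(observation)
        observation, reward, done, info = env.step(action)
    env.close()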
@xxXXCarbon6XXxx 5 years ago
I personally welcome our robot overlords