This content is sponsored by my Udemy courses. Level up your skills by learning to turn papers into code. See the links in the description.
@hussainalaaedi 4 years ago
Hi Mr. Phil, how can I get the source code for this tutorial?
@bobingstern4448 3 years ago
My tip is to try to follow along with a different environment or goal so you can learn to implement it in your own way. Great tutorial Phil as always!
@MachineLearningwithPhil 3 years ago
Great tips!
@RedShipsofSpainAgain 5 years ago
1:15 "Deep Deterministic Policy Gradients work by combining the magic of actor-critic methods with the magic of Deep Q learning, which of course has a replay buffer." I clearly need to read up on some background for this.
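For context, the replay buffer referenced here is the same uniform experience buffer DQN uses: store transitions, then sample random minibatches to break correlations. A minimal numpy sketch of the idea (not Phil's exact class, names are illustrative):

import numpy as np

class ReplayBuffer:
    def __init__(self, max_size, input_dims, n_actions):
        self.mem_size = max_size
        self.mem_cntr = 0
        self.state_memory = np.zeros((max_size, input_dims), dtype=np.float32)
        self.new_state_memory = np.zeros((max_size, input_dims), dtype=np.float32)
        self.action_memory = np.zeros((max_size, n_actions), dtype=np.float32)
        self.reward_memory = np.zeros(max_size, dtype=np.float32)
        self.terminal_memory = np.zeros(max_size, dtype=np.float32)

    def store_transition(self, state, action, reward, state_, done):
        index = self.mem_cntr % self.mem_size        # overwrite the oldest memories first
        self.state_memory[index] = state
        self.new_state_memory[index] = state_
        self.action_memory[index] = action
        self.reward_memory[index] = reward
        self.terminal_memory[index] = 1 - int(done)  # mask that zeroes the bootstrap at episode end
        self.mem_cntr += 1

    def sample_buffer(self, batch_size):
        max_mem = min(self.mem_cntr, self.mem_size)
        batch = np.random.choice(max_mem, batch_size)  # uniform random minibatch
        return (self.state_memory[batch], self.action_memory[batch],
                self.reward_memory[batch], self.new_state_memory[batch],
                self.terminal_memory[batch])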
@yehiaheshamsaid5213 4 years ago
This is so helpful. I like that you explained every detail of the theory alongside the implementation, and why we do each step. What I was also hoping to see was the theory behind mu for the continuous action space explained here. I will check your paper video next to see whether you cover that.
@MachineLearningwithPhil 5 years ago
You can see the code for this tutorial here, including the trained model parameters. github.com/philtabor/Youtube-Code-Repository/tree/master/ReinforcementLearning/PolicyGradient/DDPG/lunar-lander/pytorch
@MP-nn6wo 2 years ago
Unfortunately, the page is no longer available. Did you perhaps just move the files?
@marcuslee7575 4 years ago
Hi Phil, why did you add the state_value and action_value together instead of concatenating them?
@gfbio 2 years ago
Hi Phil, does gym take care of the action and state normalization/denormalization? Or should I denormalize the action before calling env.step?
@geo2073 5 years ago
Great! Looking forward to the paper review!
@albertyang3128 3 days ago
Hi Phil, how long did it take your model to complete training? I'm currently training mine and it's taking awfully long, almost 20 seconds per episode.
@ahmadalghooneh2105 5 years ago
Love your videos and you Phillll, !!!
@talipturkmen9041 4 years ago
Thank you very much for such an amazing video. Could you please implement PPO on the Reacher environment?
@rajghugare6161 4 years ago
The DDPG paper uses batch normalisation, but you used layer normalisation instead. Doesn't that make a difference?
@FHidber 5 years ago
Wow, I'm early. Great video, I've been looking to get into DDPG. This saves me the trouble of going through dozens of git repos and scraping together how to do it. Thanks!
@TheKineshma 5 years ago
Amazing video!! Keep up the good work!!
@ahmadalghooneh2105 5 years ago
Phil, a question: in the CriticNetwork, why did you add the two values of state and action together? Shouldn't you define something like Q(s, a), so the network has input size = state_shape[0] + n_actions?
@eshaanbarkataki9198 5 years ago
I hope you are referring to CriticNetwork.forward(). The reason Phil does T.add(state_value, action_value) is that the state and the action are processed by separate branches before being merged. The state can be high-dimensional (it could even be a whole image), so it goes through the deeper part of the network (for example, several layers with a few hundred units each), while the action is a small vector and only needs a much simpler branch (one or two small layers). T.add(state_value, action_value) merges the small action branch and the larger state branch into one network that scores the state-action pair. If you instead fed the action in alongside the raw state at the input, you would be pushing the action through the whole deep network when that isn't needed. I tried to explain it the best I could. If you have any questions about my response, please let me know. I am also new to reinforcement learning. :)
@ahmadalghooneh2105 5 years ago
Guys, I found the reason: Phil first passes the state through two layers, then passes the action through a separate layer, and then by adding action_value and state_value he merges the two branches, so we have Q(s, a). The sum is then passed through a final layer, done!
@MachineLearningwithPhil 5 years ago
You are correct, good sir!
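For anyone reading along, a minimal sketch of the critic structure being described above (layer sizes and names are illustrative, not necessarily Phil's exact ones):

import torch as T
import torch.nn as nn
import torch.nn.functional as F

class CriticSketch(nn.Module):
    def __init__(self, input_dims, n_actions, fc1_dims=400, fc2_dims=300):
        super().__init__()
        self.fc1 = nn.Linear(input_dims, fc1_dims)           # state branch, layer 1
        self.ln1 = nn.LayerNorm(fc1_dims)
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)             # state branch, layer 2
        self.ln2 = nn.LayerNorm(fc2_dims)
        self.action_value = nn.Linear(n_actions, fc2_dims)   # single layer for the action
        self.q = nn.Linear(fc2_dims, 1)                      # scalar Q(s, a)

    def forward(self, state, action):
        state_value = F.relu(self.ln1(self.fc1(state)))
        state_value = self.ln2(self.fc2(state_value))
        action_value = F.relu(self.action_value(action))
        # merge the two branches: both are fc2_dims wide, so an element-wise add works
        state_action_value = F.relu(T.add(state_value, action_value))
        return self.q(state_action_value)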
@abdullahmosibah560 5 years ago
Looking forward to the paper implementation video.
@zapkif 2 years ago
Thanks for this and all the other videos, Phil! Quick question: the way you defined the update in update_network_parameters(self):
critic_state_dict[name] = tau * critic_state_dict[name].clone() + (1 - tau) * target_critic_dict[name].clone()
I think that for the critic to move slowly we want tau to be close to 1 (like 0.999), but you passed 0.001 as tau. I'm definitely getting better convergence when I set tau=0.999 with the code as it is.
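For reference, the convention in the DDPG paper is target <- tau * online + (1 - tau) * target with a small tau (0.001), so it is the target network that moves slowly while the online network is trained normally. A minimal sketch of that update (parameter-level, not the exact state-dict version used in the video):

def soft_update(online_net, target_net, tau=0.001):
    for online_p, target_p in zip(online_net.parameters(), target_net.parameters()):
        # small tau => the target only moves a tiny fraction toward the online weights each call
        target_p.data.copy_(tau * online_p.data + (1.0 - tau) * target_p.data)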
@vahidgholamzade1964 2 years ago
Dear Phil, could you please explain how to deal with invalid actions? For example, if the action chosen by choose_action is 2.5 but it is impossible to perform that action, what should we do other than give a large negative reward?
@harshraj22_ 2 years ago
The critic in actor-critic had a final output of dim=1 (denoting the value function for the given state). Why does the critic in DDPG have a final output of dim=env.action_space.n, and what does that represent?
@MachineLearningwithPhil 2 years ago
I think you mean actor. Here the n refers to the components of the continuous action space. We are outputting an action, rather than a probability distribution.
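A minimal sketch of what that output head looks like (assuming a final layer named mu with n_actions units and a tanh squashing, as in the paper; layer sizes are illustrative):

import torch as T
import torch.nn as nn
import torch.nn.functional as F

class ActorSketch(nn.Module):
    def __init__(self, input_dims, n_actions, fc1_dims=400, fc2_dims=300):
        super().__init__()
        self.fc1 = nn.Linear(input_dims, fc1_dims)
        self.fc2 = nn.Linear(fc1_dims, fc2_dims)
        self.mu = nn.Linear(fc2_dims, n_actions)   # n_actions = number of action components

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        # deterministic action: one value per component, squashed to [-1, 1]
        return T.tanh(self.mu(x))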
@janeskitchen8294 4 years ago
Hello. This is a great lesson. I just want to create an obs value arbitrarily when specifying obs inside the 'while not done:' loop in the code. What type of value should I put in it (ndarray? tensor?)
@bryanbocao4906 4 years ago
Thanks for the video!
@robosergTV 5 years ago
thanks! You are doing good work
@MachineLearningwithPhil 5 years ago
Thanks Roboserg!
@Perryman1138 5 years ago
In the Agent learn function, should line 218 read 1-done? It looks like it would zero out future rewards until it’s done unless I misunderstand this code.
@MachineLearningwithPhil 5 years ago
Great question. We take care of the 1 - done up in the ReplayBuffer class when we store the transitions.
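In other words, because the buffer stores 1 - done, the Bellman target in learn() can multiply the bootstrapped value directly by that mask. A hedged sketch of the idea (vectorized here; variable names may differ from the actual code):

# done_mask is the stored (1 - done) flag: 0 at terminal states, 1 otherwise
critic_value_ = self.target_critic.forward(new_states, self.target_actor.forward(new_states))
target = rewards + self.gamma * critic_value_.view(-1) * done_mask
target = target.view(self.batch_size, 1)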
@marcosflaks5214 4 years ago
Hi Phil, your video is great!!! I just have one question. When you do the forward propagation in eval() mode, the BatchNorm layer doesn't update the running_var and running_mean used to normalize the input of that layer. So all your inputs will always be normalized with mean=0 and var=1 (the defaults). Shouldn't we update the running_var and running_mean? As far as I know, that only happens if we run a forward propagation in train() mode.
@MachineLearningwithPhil 4 years ago
You are correct. It's been 8 months since I've done the video so I'm trying to remember why I did it that particular way. I should have commented the code better, in hindsight :p I'm going back over the code now for my new course on Actor Critic methods. Once I have it clear in my mind I'll update the github repo for the YT channel. Either with an explanatory comment or with a correction. The git for the code is here: github.com/philtabor/Youtube-Code-Repository/blob/master/ReinforcementLearning/PolicyGradient/DDPG/lunar-lander/pytorch/ddpg_torch.py
@marcosflaks5214 4 years ago
Your videos are amazing!!! I learned a lot from them. I spent more than a week just trying to understand eval and train mode in PyTorch. Here is what I learned:
- During training, BatchNorm computes the mean and population variance (var.P) of the input x and uses them to normalize:
  output (train) = (x_input - mean) / (var.P + epsilon)^(1/2) * weight + bias
- Only in training mode does it update the running_mean and running_var that eval mode uses. Default initial values: running_mean=0, running_var=1, weight=1, bias=0, momentum=0.1, epsilon=0.00001.
  running_mean = momentum * mean_input_x + (1 - momentum) * prev_running_mean
  running_var = momentum * var.S_input_x + (1 - momentum) * prev_running_var
  (var.S = sample variance, i.e. divide by n - 1)
- output (eval) = (x_input - running_mean) / (running_var + epsilon)^(1/2) * weight + bias
Maybe at the end of the learn function we should set the Actor and Critic to train mode and do a forward pass; that would update the running_var and running_mean that we use on the next iteration in eval mode. Also, maybe we should copy the running_var and running_mean into critic_target and actor_target with the soft update too. Please let me know if that makes sense. Looking forward to more videos from you! You are an excellent teacher! Best
@mrmatthewleigh 4 years ago
@@marcosflaks5214 It's kind of important to note that Phil isn't actually using a BatchNorm layer in this network. Rather, he is using LayerNorm; not sure if that is a mistake or not, they are really similar. pytorch.org/docs/master/generated/torch.nn.LayerNorm.html The big difference is that LayerNorm does not apply a different transformation in training versus evaluation mode. So nothing is actually changing when he calls eval() or train()...
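A quick way to convince yourself of that (a small standalone check, not from the video):

import torch as T
import torch.nn as nn

ln = nn.LayerNorm(4)
x = T.randn(8, 4)
ln.train()
y_train = ln(x)
ln.eval()
y_eval = ln(x)
# LayerNorm keeps no running statistics, so train() and eval() give identical outputs
print(T.allclose(y_train, y_eval))  # True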
@clapdrix72 7 months ago
@MachineLearningwithPhil, does your Udemy course curriculum cover custom environments for actor-critic networks in PyTorch or TF? I'm applying SAC to a custom problem, so I don't have any need for pre-baked environments.
@MachineLearningwithPhil 7 months ago
It does not, unfortunately. Shoot me an email please.
@clapdrix72 7 months ago
Will do
@KodandocomFaria 4 years ago
How can I create my own gym environment based on the real world? For instance, I would like to train my model on a webpage that offers some information about the environment and gives me some actions I could take. Based on this information, I would like to create a gym environment that can be used with this model for training and later in production. Is that possible?
@thomashirtz 3 years ago
Don't we need to initialize the weights of "action_value" like the other layers (16:06)?
@bruceli7999 2 years ago
The noise added to the action is for training only, right? For evaluation, do we still need the action noise? Thanks :))
@MachineLearningwithPhil 2 years ago
Training only, correct
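One way to express that in code (a sketch only: it assumes a noise process self.noise, e.g. the Ornstein-Uhlenbeck noise from the video, and adds an evaluate flag that is not in the original method signature):

def choose_action(self, observation, evaluate=False):
    self.actor.eval()
    state = T.tensor(observation, dtype=T.float).to(self.actor.device)
    mu = self.actor.forward(state)
    if not evaluate:
        # exploration noise is only added while training
        mu = mu + T.tensor(self.noise(), dtype=T.float).to(self.actor.device)
    self.actor.train()
    return mu.cpu().detach().numpy()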
@eshaanbarkataki9198 5 years ago
Great video! I just have one question. In the line -self.critic.forward(state, mu), can you please explain to me why there is a negative sign? I read the DDPG paper and don't see a minus sign near the policy gradient. Also, do you have to learn calculus, linear algebra, or other math courses in order to understand all these machine learning papers? I only know algebra one and two...
@berkealgul2503 4 years ago
The reason is probably that we are trying to maximize the Q function, while gradient descent minimizes the loss. Weights and biases are (simply) updated as follows:
w += -lr * gradient, where gradient = d(loss)/dw
b += -lr * gradient, where gradient = d(loss)/db
The gradients are calculated during loss.backward(). Hence we need to flip the sign of the derivatives in order to maximize the Q function:
w += -lr * d(-loss)/dw
b += -lr * d(-loss)/db
To flip the sign of the gradient we just negate the loss, and since we use a mean loss, T.mean(-x) = -T.mean(x), so negating the values inside the mean negates the whole loss. This is really gradient ascent rather than descent. If anything is wrong, correct me, Phil :)
In addition, derivatives and matrices are quite essential for understanding the math behind deep learning concepts. I recommend you take basic calculus and linear algebra courses.
@eshaanbarkataki9198 4 years ago
I will take your recommendation of learning calc and linear algebra. Even without the experience of calc and linear algebra, I understood your answer. Everything is making sense now, thanks!
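Putting that in code terms, the actor update boils down to something like this (a sketch of the idea; names may differ slightly from the video):

mu = self.actor.forward(states)
# maximizing Q(s, mu(s)) is the same as minimizing -Q(s, mu(s))
actor_loss = T.mean(-self.critic.forward(states, mu))
self.actor.optimizer.zero_grad()
actor_loss.backward()
self.actor.optimizer.step()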
@andrestorres7343 1 year ago
What learning rate did you use?
@ICREsniper 4 years ago
Hi Phil. Great video! I wanted to ask, why do you use nn.LayerNorm instead of nn.BatchNorm1d for batch normalization?
@MachineLearningwithPhil 4 years ago
Good question. It boils down to issues with model check pointing with the BatchNorm (it doesn't save all the statistics, if I recall correctly). The layer norm does basically the same thing, and evidently works so I ran with it.
@DanNovischi 3 years ago
@@MachineLearningwithPhil Even if checkpointing may be the issue, I fail to see how BN is the same thing as LN; i.e., BN normalizes across samples for each channel, whereas LN normalizes across channels for each sample. In the 1D case LN is more like instance normalization (IN). Any thoughts as to why LN has at least the same effect in this instance?
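A small standalone check of that difference in normalization axes (not from the video):

import torch as T
import torch.nn as nn

x = T.randn(32, 64)            # (batch, features)
bn = nn.BatchNorm1d(64)        # normalizes each feature across the 32 samples
ln = nn.LayerNorm(64)          # normalizes each sample across its 64 features
print(bn(x).mean(dim=0)[:3])   # approximately zero per feature column
print(ln(x).mean(dim=1)[:3])   # approximately zero per sample row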
@pmz558 5 years ago
Why was I unsubscribed from your channel? Love your videos
@MachineLearningwithPhil 5 years ago
because YT sucks. Welcome back
@lakshayarora9697 1 year ago
How do I plot the training loss values?
@yueleng6898 4 years ago
Thank you so much for this fantastic video. I am wondering how to implement a Monte Carlo method rather than a TD method. Could you give some references? Thanks again.
@yueleng6898 4 years ago
Another question I have: as far as I understand, you implemented layer normalization instead of batch normalization. Is that correct?
@MachineLearningwithPhil 4 years ago
I am indeed doing layernorm instead of batch norm. I ran into issues with model check pointing with batch norm, and the layer norm does the same thing and appears to work well enough so I ran with it. Check out the REINFORCE algorithm and REINFORCE with baseline. Here's a good reference for all the actor critic algorithms: lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html
@DiirtyDan1 4 years ago
Great video! What is the editor you are using to write this code?
@MachineLearningwithPhil 4 years ago
Atom, I believe.
@X4r1l 4 years ago
Hi Phil! I'm trying to use this in an environment where I choose a direction in degrees, i.e. a value between -180 and 180. Should I still initialize the weights of the network with such small numbers? In the first few iterations, it outputs numbers close to 0.
@MachineLearningwithPhil 4 years ago
You can try scaling the actions to between -1 and +1 and then multiply by 180. Perhaps that action range is too much for the network.
@X4r1l 4 years ago
@@MachineLearningwithPhil Thanks for the reply. Do you mean after the forward pass? So I would take the output from the network, scale it, and then multiply it by 180?
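In code, that scaling step would look roughly like this (a sketch only; the 180 factor is specific to this degrees example, and agent/env are the objects from the surrounding discussion):

# actor with a tanh output returns values in [-1, 1]
action = agent.choose_action(observation)
# scale to the environment's range before stepping; here the range is [-180, 180] degrees
env_action = action * 180.0
observation_, reward, done, info = env.step(env_action)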
@manuelnovella39 5 years ago
Hey, Phil! Thanks for the video. One question: why choose PyTorch over Keras? Isn't the latter easier?
@MachineLearningwithPhil 5 years ago
They're about equal in terms of difficulty. I'm working on a course for deep learning beginners using PyTorch, so I figured I could use the practice.
@manuelnovella39 5 years ago
@@MachineLearningwithPhil thanks for your answer!
@sunduskhurram4672 4 years ago
I am facing this error: PermissionError: [Errno 13] Permission denied: 'tmp/ddpg\\actor_ddpg'. Need help.
@MachineLearningwithPhil 4 years ago
What command are you using? Is that for mkdir? Try sudo
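If the underlying problem is that the tmp/ddpg folder doesn't exist, or that path separators get mixed on Windows, a hedged sketch of a fix (paths and names here are illustrative) is:

import os

chkpt_dir = os.path.join('tmp', 'ddpg')
os.makedirs(chkpt_dir, exist_ok=True)                    # create the directory before saving
checkpoint_file = os.path.join(chkpt_dir, 'actor_ddpg')  # avoids hard-coded '/' vs '\\' issues
# then save as usual, e.g. T.save(self.state_dict(), checkpoint_file)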
@EliorBY 5 years ago
Great video as always! A question: how do you add the L2 regularization on the q layer of the critic? They mention it in the paper, but I didn't see it in the video...
@MachineLearningwithPhil 5 years ago
I didn't add it for this implementation, your eyes do not deceive you.
@natnaelhabtamu9006 5 years ago
Thank you Phil!! I have one question: why am I getting this error, and how can I solve it? No registered env with id: LunarLanderContinous-v2
@MachineLearningwithPhil 5 years ago
Make sure you have Box2D installed.
@kae4881 4 years ago
It's LunarLanderContinuous-v2; you've misspelled the id.
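For reference, a minimal check that the environment is available (it requires the Box2D dependency, e.g. pip install gym[box2d], in the classic gym API used in this video):

import gym

env = gym.make('LunarLanderContinuous-v2')  # note the exact spelling of the id
print(env.observation_space.shape, env.action_space.shape)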
@DaHrakl 4 years ago
What about using a GRU or LSTM at the input layer?
@MachineLearningwithPhil 4 years ago
It's certainly possible. I would encourage you to do it and see if you can improve on my model's performance. You could fork my repo and make it your own.
@myriads0796 3 years ago
Hello, thanks for your video. By the way, do you know why I get a message like "ImportError: cannot import name 'plotLearning' from 'utils' (C:\miniconda\envs\Term-Project\lib\site-packages\utils\__init__.py)"?
@MachineLearningwithPhil 3 years ago
You will need to clone it from my GitHub repo
@myriads0796 3 years ago
@@MachineLearningwithPhil Thanks for your reply
@myriads0796 3 years ago
@@MachineLearningwithPhil I have another question. If I want to watch the environment while training, rather than record it, what function should I use?
@myriads0796 3 years ago
I solved it after installing the full package (pip install gym[all]) and using env.render(). Thanks.