Training a Deep Q-Network - Reinforcement Learning

71,753 views

deeplizard

A day ago

Comments: 129
@jacekkozieja6598 5 years ago
This material is a gem and you are a great teacher. I was digging through all the materials I could find about deep Q-learning for 2 days, and only you explained that the Q_target value is calculated by feeding s' to the network (the same way as Q_prediction). You saved my master's thesis in some way.
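For anyone landing here from search, here is a minimal sketch of the point this comment makes, in PyTorch (the toy network, shapes, and reward value below are illustrative assumptions, not the course's code). The prediction and the target both come from forward passes of the same network; the target pass just uses s' instead of s.

import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))  # toy Q-network
gamma = 0.99

def td_target(reward, next_state, done):
    # Feed s' through the network (the "second pass") to get max_a' Q(s', a'),
    # then plug it into the Bellman target r + gamma * max_a' Q(s', a').
    with torch.no_grad():
        max_next_q = policy_net(next_state).max(dim=0).values
    return reward + gamma * max_next_q * (1.0 - done)

state      = torch.rand(4)             # s
next_state = torch.rand(4)             # s'
q_pred   = policy_net(state)[1]        # Q(s, a) for the action taken (here a = 1)
q_target = td_target(reward=1.0, next_state=next_state, done=0.0)
loss = nn.functional.mse_loss(q_pred, q_target)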
@ani96bob 5 years ago
seconded!
@AndrewGarrisonTheHuman 4 years ago
I agree, this is an awesome series! The visualizations help so much.
@grantmayberry7358 4 years ago
I've been wanting someone to explain the nitty-gritty details of this for so long. Thank you so much!
@ChibatZ 4 years ago
Awesome, best high and low level explanation combination I found so far. Really helped me! Keep up the good work :)
@Johncowk 2 years ago
This is the best explanation I've seen on the subject so far. Thanks!
@emenikeanigbogu9368 4 years ago
Finally a solid response. Thank you so much. I've spent so much time trying to understand target data.
@TheCodeVertigo 4 years ago
This comment made me watch the video. :)
@emenikeanigbogu9368 4 years ago
@@TheCodeVertigo Glad to help sir!!!!
@MotoGReviews 5 years ago
You are amazinggggg!!! I love your videos. One suggestion: you should make a series on PPO and policy gradients, since you can explain things way better than other YouTubers!!!!
@deeplizard 5 years ago
Thank youuu! I intend to cover the more sophisticated reinforcement learning topics, like the ones you mentioned, in this series as well :D
@MotoGReviews 5 years ago
@@deeplizard Great! I am so happy you'll keep up this awesome work :) it will really help the entire community. Good job ;)
@pouyashaeri 2 years ago
These videos are awesome, thanks for the great intuition you're giving.
@yuktikaura A year ago
Great Work!
@mosfetkhan7334 3 years ago
Thank you so much for such a great explanation. I am not a very smart guy, so it is a bit difficult for me to understand algorithms, but your explanations help me easily understand the most complicated machine learning algorithms. Amazing work.
@youssefmaghrebi6963 6 months ago
A huge thank you for this tutorial and the pseudo code.
@kushis7242 5 years ago
Appreciate the time and effort you have put into making this excellent video series. Thank you very much. The learning curve is smooth when you have a good instructor. :) Could you please clarify the following points: 1) What is an emulator here? That is, are we using two different neural networks, (a) one for replay memory and (b) the other to train on the batch sampled from the replay memory? 2) Why are we preprocessing the batched sample and not the raw state before we pass the input to the emulator?
@deeplizard 5 years ago
Hey kushi - The emulator here would just be the game. So, we'd execute an action in the game environment and then observe the result of that action. In regards to using two different networks, we're not doing that in this video, but move forward in the series and you will see that this is the next concept introduced. You should gain clarity there for why.
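A rough illustration of "the emulator here would just be the game": below, OpenAI Gym's CartPole stands in for the emulator (an assumption for illustration only, and the older pre-0.26 Gym API is assumed). The loop is simply: pick an action, execute it in the environment, observe the result.

import gym  # classic Gym API (pre-0.26) assumed here

env = gym.make("CartPole-v1")
state = env.reset()
for _ in range(10):
    action = env.action_space.sample()                   # the agent picks an action
    next_state, reward, done, info = env.step(action)    # the "emulator" returns the result
    state = env.reset() if done else next_state
env.close()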
@kushis7242 5 years ago
@@deeplizard Thanks for the clarification on question 1. Could you please shed some light on question 2, i.e., why we are not preprocessing the input before running it in the emulator, but only when running it against the neural network?
@deeplizard 5 years ago
If you're playing a game, for example, then you will not need to do any processing on the state of the environment before the agent can take an action in the game environment. Everything is self-contained in the game. Just imagine playing a video game yourself without any Q-learning in the mix. You're not having to do anything to the game to have it process the state of the current environment before you can continue. The processing is only to get the state ready for the network.
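One common way to "get the state ready for the network" is sketched below as an assumption (grayscale, downscale, and frame-stacking with torchvision); this is not necessarily the exact preprocessing used in this series, just an example of why the processing applies only to the network's input and not to the emulator.

import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([
    T.Grayscale(),            # drop color channels
    T.Resize((84, 84)),       # shrink the raw frame
    T.ToTensor(),             # -> float tensor in [0, 1], shape (1, 84, 84)
])

def to_network_input(raw_frames):
    """raw_frames: list of HxWx3 uint8 arrays straight from the emulator."""
    frames = [preprocess(Image.fromarray(f)) for f in raw_frames]
    return torch.cat(frames, dim=0).unsqueeze(0)   # shape (1, num_frames, 84, 84)

batch = to_network_input([np.zeros((210, 160, 3), dtype=np.uint8) for _ in range(4)])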
@waltercanedoriedel1413 3 years ago
Thanks for the volume up!
@Nissearne12 10 months ago
❤ Oh, this was so helpful for me. You just made me understand the connection between Q-learning and the deep Q-learning (DQN) algorithm, why it's called DQN, the replay mechanism, and the difference between DQN and vanilla policy reinforcement learning. Thank you so much 🎉❤ No other tutorial showed me this with replay, how it relates to Q-learning, how to build a target value, and so on in such a gentle and easy way. Thanks again.
@deeplizard 10 months ago
Wonderful to hear! Thanks for taking the time to share 😊
@Nissearne12 10 months ago
@@deeplizard I am so excited now to write my first DQN program (in C/C++; I am an old-fashioned programmer from the '80s, you know, I was programming on a C-64 as a boy and I am 49 years old, so PyTorch and Python 🐍 are a bit too modern for me, though of course nice, with high-level GPU libraries). Anyway, I will look through your whole DQN series with joy; it's great that you gently go through all the details. I have written lots of other programs (CNNs, autoencoders, sparse autoencoders, etc.) from scratch in C/C++ without any ML library, just so I understand all the details of the algorithm and can adapt it to the FPGA circuits (VHDL code) I have also been working with for decades. But how DQN works had been a bit of a blur for me, and you cleared up the fog. Thanks 🙏
@LuisRamirez-gc5ds 4 years ago
Guys, very excellent content. I've been studying deep learning but had never visited deep reinforcement learning, and I love all this stuff; it's so cool.
@tingnews7273 5 years ago
What I learned: 1. How to combine everything: DQN and replay memory (actually not so clear, but once we get to the code it will all be clear). 2. Policy network: a network that approximates the optimal policy (just think of the network as the optimal Q-function). 3. The loss calculation part made me a little confused; how does it happen? I expect to get more info once I see the code.
@nitinchoudhary505 2 years ago
ok
@ashwynhorton2841 3 years ago
Thank you so much...
@zipcomps1741 3 years ago
Luv it.. easily one of your best (among many good ones).. Lucid, broken down well.. Became a Patreon. Thanks!
@deeplizard 3 years ago
Thank you!
@sam-gu6rv 4 years ago
I love her voice, so cool and easy to understand.
@zachhua7704 3 years ago
Thank you Mandy, you are really good at explaining all the deep learning stuff. How did you guys learn all of this by yourselves? You are really smart, I would say :)
@deeplizard 3 years ago
Thank you Zach! We spend lots of time learning and improving!
@eddieseabrook8614 4 years ago
Incredible, this was super helpful. Thanks!
@sanjanavijayshankar5508 3 years ago
This is amazing! Thanks a lot!
@HiddenAway-oq6wo 5 years ago
Waiting
@FaizanKhan-gc2qp 5 years ago
Very helpful video in understanding the training of DQN. Appreciate it. (y)
@alexanderwu4930 5 years ago
{
  "question": "During training, we make two passes through the network. What is the purpose of the second pass?",
  "choices": [
    "To get the Q-value for the action",
    "To calculate the target Q-value for the action",
    "To calculate the target reward for the action",
    "To calculate the reward for the action"
  ],
  "answer": "To calculate the target Q-value for the action",
  "creator": "AW",
  "creationDate": "2019-09-25T07:08:11.309Z"
}
@deeplizard 5 years ago
Thank you for the quiz question, Andrew! First one for this video! I modified the answer a bit. You are correct that the second pass is indeed done to help us calculate the target Q-value for the current state-action pair. More particularly though, the second pass is specifically done to calculate this term: maxQ*(s’,a’). This will tell us the maximum Q-value across all possible actions for the next state, which we indeed need to be able to calculate the target Q-value for the current state-action. Let me know if this makes sense. I've just posted it, so it's now live below :D Thank you again! deeplizard.com/learn/video/0bt0SjbS3xc
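A batched sketch of the two passes described in this reply (the network size, batch shapes, and variable names are illustrative assumptions): pass 1 gives Q(s, a) for the actions actually taken, and pass 2 feeds s' through the network to get the max Q*(s', a') term that goes into the target.

import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
gamma = 0.99

states      = torch.rand(8, 4)                 # batch of s sampled from replay memory
actions     = torch.randint(0, 2, (8, 1))      # actions actually taken
rewards     = torch.rand(8)                    # rewards received
next_states = torch.rand(8, 4)                 # batch of s'

# Pass 1: Q(s, a) for the actions that were actually taken.
q_pred = policy_net(states).gather(1, actions).squeeze(1)

# Pass 2: feed s' through the network to get max_a' Q(s', a'),
# the term the reply above is talking about.
with torch.no_grad():
    max_next_q = policy_net(next_states).max(dim=1).values

q_target = rewards + gamma * max_next_q        # Bellman target for each sample
loss = nn.functional.mse_loss(q_pred, q_target)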
@origamigirl11RK 5 years ago
Very informative. I've been learning so much from your videos.
@mariushoppe8880 5 years ago
Since your next video on the topic of reinforcement learning is (probably) going to be on fixed Q-targets, could you maybe go into full detail on how it works? I've asked a lot of PhD students at my university, and everyone always tells me that they don't know why, but it somehow stabilizes the DQN, and without it the training can easily oscillate. Could you maybe give an example of a case where that would happen? That would be incredibly helpful. Thanks anyway, love your videos!
@deeplizard 5 years ago
Thanks, Marius! I'll keep this in mind when I explain it. You're right-- it is the topic of the next video :D
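For readers who want a preview before the next episode: a hedged sketch of the fixed Q-targets idea being discussed, keeping a frozen copy of the network for the target term and syncing it only every N steps (the sync interval and toy network here are arbitrary assumptions, not the course's exact code).

import copy
import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
target_net = copy.deepcopy(policy_net)   # frozen copy used only for targets
target_net.eval()

def compute_target(rewards, next_states, gamma=0.99):
    with torch.no_grad():
        # Targets come from the frozen network, so they don't shift after every
        # single optimization step; this is what helps stop the oscillation.
        return rewards + gamma * target_net(next_states).max(dim=1).values

TARGET_UPDATE = 10  # assumed interval
def maybe_sync(step):
    # Every N optimization steps, copy the online weights into the target network.
    if step % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())

targets = compute_target(torch.rand(8), torch.rand(8, 4))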
@neogarciagarcia443 4 years ago
Very good. Awesome!!! Congrats!!!
@benjamindeporte3806 3 years ago
Superb :-)
@tinyentropy 4 years ago
This Bellman equation consideration only provides q*(s4, a4c); what will the target values for a4a, a4b, and a4d be, respectively? [assuming a4 == a4c]
@rocklee5231 4 years ago
I'm on episode 13 of 18 in your Reinforcement Learning - Goal Oriented Intelligence playlist, and while I think you have done a great job on the videos individually, there seem to be multiple episodes missing. By that I mean there have been several times where a video jumps into a subject while mentioning that it relies on some other topic that was "already covered," but the already-covered subject was not in the preceding videos in the playlist. A specific example in this video is forward propagation.
@datascience_with_yetty 5 years ago
Patiently waiting :)
@sounakbanerjeenfet 5 years ago
Very well explained.
@RobertOSullivan 4 years ago
Dislikes were from John Connor and the resistance.
@hazzaldo 5 years ago
TY so much for this brilliant video. I have a question on this; if you ever get the time, I would really appreciate some clarification. 1. You mentioned in the list of high-level steps for the process that, after taking the first action from the initial state, we sample a random batch from replay memory; I didn't understand why. Shouldn't we continue from the state where the agent ended up after taking the first action, as that is the natural progression in the game? Say it's a Super Mario game where the first action is to move forward. This will bring Mario closer to the first enemy mushroom (Goomba). Shouldn't we continue from the new state by jumping over the enemy, and then continue from there on? It didn't make sense to me to suddenly sample some completely random batch, which would make the agent jump to some completely random state from the replay memory, when the agent should continue progressing through the game as a human player would. I hope my question makes sense.
@andonglin8900 5 years ago
s and s' share the same net and have only one step between them, which makes the optimization like chasing its own tail, and unstable.
@deeplizard 5 years ago
That's right, Andong! Thanks for answering!
@deeplizard 5 years ago
By the way, I gave you a shout out in the latest video for your answer :) kzbin.info/www/bejne/rofOgZtvep56nKc
@tomw4688 4 years ago
Thank you for the video.
@tinyentropy 5 years ago
Brilliant!
@sarvagyagupta1744 5 years ago
I have some questions: 1) At 4:23, how do we calculate the term R_(t+k+1)? 2) In the final steps, in step 5, are we shuffling the replay memory or are we just randomly selecting a memory? Because we need the next state for step 8.
@travisbarton932 2 years ago
I came here with the same question as #1.
@emrek1 2 years ago
1) I guess the emulator gives us that. I am totally lost on the other side.
@user-or7ji5hv8y 5 years ago
awesome video!
@alivecoding4995 13 hours ago
Sounds like we update all q values, but we only know the successive state s‘ for one of the actions. How can we calculate the optimal q values for the other actions at that moment?
@kameelamareen A year ago
How do you even find the expectation when it's a model-free method? We don't have the probabilities.
@timof6364 2 months ago
What exactly is the difference between the Q-network and the policy network? And doesn't solving the Bellman equations require us to solve a huge system of equations, which would take way too long?
@salonirathi1158 5 years ago
Thank you for the videos, I really love them. Please, can you make videos on the actor-critic algorithm?
@deeplizard 5 years ago
Actor critic is on our list of potential topics for the future! Thanks, Saloni!
@deniztekalp8059 5 years ago
@@deeplizard We're looking forward to you covering the actor-critic algorithm.
@chuanjiang6931 3 months ago
At 7:34, the loss calculation also happens at each time step; is that what happens in practice? Can the loss calculation happen just once every few time steps?
@ani96bob 5 years ago
Out of curiosity, when we calculate the target Q using the second feed and take the max of the action values predicted by that network: initially, since the weights are random for all states, I was wondering how we can trust that target Q-value to be our basis for the loss?
@deeplizard 5 years ago
Hey Ani - Yes, this is actually a bit of a problem. We address it in the following episode :D deeplizard.com/learn/video/xVkPh9E9GfE
@ketalciencia3163 5 years ago
Yup
@MrJonhSmithTheFirst 3 years ago
I am wondering if there are any other ways to calculate Qmax (optimal) for our loss function? Or is looking 1 step ahead the only way to estimate it? Can we look, say, a few steps ahead? Or set Qmax based on the system's known max rewards? (I am a newbie, so if you have a minute to explain how those thoughts may be flawed, please let me know.) And thank you, hands down the best series out there! For many years to come!
@_jiwi2674 4 years ago
Hi, thank you so much for sharing and for the great work! Just one point though: wouldn't the reward in each experience tuple belong to the same timestep as the state and action? For instance, instead of (s4, a4, r5, s5) I thought it would be (s4, a4, r4, s5). I believe this was the argument made by the Stanford CS231 RL course lecturer.
@deeplizard 4 years ago
Hey Jiwi - The idea is that the agent won't receive the reward from the state-action pair that occurred at time t until the agent is transitioned to the new state at time t+1. At this time, the reward from the previous state-action pair will be granted. The literature for the reward notation is mixed on this. Some authors will show the reward for time t being granted at time t, and others will show the reward for time t being granted at time t+1. I'm following Richard Sutton's ("the father of reinforcement learning") notation for this. I elaborate a bit more on the MDP notation in this video/blog: deeplizard.com/learn/video/my207WNoeyA Let me know if this helps clarify!
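A tiny illustration of the Sutton-style indexing described in this reply, where the experience stored at time t pairs the reward granted at t+1 with the next state (the namedtuple itself is just an illustrative assumption):

from collections import namedtuple

Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

# The reward for the state-action pair at time t=4 is granted at t=5,
# which is why the tuple reads (s4, a4, r5, s5) in this notation.
e = Experience(state="s4", action="a4", reward="r5", next_state="s5")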
@sahand5277 5 years ago
Thank you! But it was a long gap between this and your last video!
@deeplizard 5 years ago
It was! Lots of lizard life stuff going on in the background that's temporarily slowing the release schedule ;)
@stefano8936 5 years ago
After this lesson I should be able to re-adapt the q-table-based algorithm into a DQN-based one, but I'm not able to. I think this lesson could be better explained. All the rest was very good material, thanks.
@andreistativa8549 A year ago
I have a question: is the max q*(s', a') a single value, or is it a vector corresponding to the probabilities of going right, up, left, and down?
@deeplizard A year ago
single value
@andreistativa8549 A year ago
@@deeplizard In the case of using a q-table it's a single value, but when using a NN, is that max-q term also a single value? Thanks!
@deeplizard A year ago
yes, single value
@nicholasges972 5 years ago
I love the videos but I have one question about calculating the loss. If I understand correctly, state1 is input into the network, and it results in an output layer of several Q-values, each of which is associated with an action. From this output layer, the network would then choose action1 based on the action with the highest Q-value. (Here is where I am confused.) After deciding the "best" action, wouldn't it now enter a different state (state2 rather than state1)? Being in a new state, I don't see how an optimal Q-value from this different state-action would be relevant to the optimal Q-value of the previous state-action pair (sa1). I'm sorry for continuing to ramble, but is this the answer: the optimal Q-value of sa pair 2 matters because it provides insight into the "success" of the earlier judgement? For example, the optimal Q-value of sa1 may be 0.9, but in the following state (sa2) there is only an optimal Q-value of 0.2. Therefore, it will update the output for sa1 so that it is lower and less desirable.
@zainkhalid5393 A year ago
Thanks! I was losing my mind as it did not make any sense to me before.
@jakobliebig5568 4 years ago
So do we subtract the network output from the calculated Q-value, or do we subtract the calculated Q-value from the network output to get the loss?
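For what it's worth: with a squared-error loss the subtraction order does not change the value, since (a - b)^2 equals (b - a)^2; what matters in practice is that the target side carries no gradient. A quick check (illustrative only, not the course's code):

import torch

q_pred   = torch.tensor(0.7, requires_grad=True)  # network output Q(s, a)
q_target = torch.tensor(0.9)                      # calculated target, no gradient

loss_a = (q_pred - q_target) ** 2
loss_b = (q_target - q_pred) ** 2
assert torch.allclose(loss_a, loss_b)             # same loss either way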
@thomastomy555 3 years ago
Depending on the action from layer 1, there will be different output states s5. So, for each output from network 1, there will be another network to calculate the q*, which means the number of second-stage networks equals the number of outputs from layer 1. Am I right?
@MrJonhSmithTheFirst 3 years ago
Seems like it, when you say it out loud =) Unless it computes only the maximum-weighted action scenario 🤔 I'd like to know too; I didn't get it clearly.
@bobthebuilderhecanbuildit 3 months ago
The only confusing thing about this for me is that you store one experience in replay memory and then sample a random batch from replay memory. Does that mean for the first time step it always takes a batch of size 1 representing the experience just observed? And then in the future it takes a random sequence of actions? And couldn't it then take sequences that make no sense back to back? That part is confusing.
@ArmanAli-ww7ml 2 years ago
For each time step you are storing values and then randomly choosing from those stored values. This confuses me; please, someone explain steps 4 and 5 in time step 1 of episode 1.
@haneulkim4902 3 years ago
Thanks for the great video! One question: what exactly is the difference between the agent and the emulator?
@Trubripes A month ago
So, is there no difference between DQN and policy gradient?
@Arc2urus2357 4 years ago
Nice tutorials. I have a question regarding target Q-values: I don't understand how to get them. Are they taken from the experience tuple? I have watched the past 3 videos ten times, and everything apart from the target Q-values I understood; maybe I am missing something. Could anyone explain what the target Q-values are, or give a simple example? Am I correct in saying that we need a second pass of data through the network just to get the optimal Q-values?
@annankldun4040 4 years ago
How can you sample a batch if nothing is in the replay memory yet? The pseudocode at 7:50 is confusing. You store each single experience and then right after you take a batch to sample from it.
@deeplizard 4 years ago
In the early stages of the game, we are unable to sample a batch from memory if memory is not yet large enough to return a sample. You will see exactly how this is implemented in code in the following episodes. Specifically, it is explained in the video and blog below in the Replay Memory section > function can_provide_sample(). deeplizard.com/learn/video/PyQNfsGUnQA
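A sketch of the logic this reply describes; the can_provide_sample name follows the series, while the rest of the class body here is a simplified assumption:

import random

class ReplayMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.push_count = 0

    def push(self, experience):
        if len(self.memory) < self.capacity:
            self.memory.append(experience)
        else:
            # Once memory is full, overwrite the oldest experience.
            self.memory[self.push_count % self.capacity] = experience
        self.push_count += 1

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def can_provide_sample(self, batch_size):
        # Early in the game, learning is simply skipped
        # until enough experiences have been stored.
        return len(self.memory) >= batch_size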
@houyahiya5729 2 years ago
Thank you for this wonderful explanation. I just have a question. You said, "After storing an experience in replay memory, we then pick a random batch of experiences from replay memory," but at that point we only have one experience, and a batch contains a set of experiences; this point confused me. The second point is whether we first perform actions in an emulator over many episodes and store these experiences in replay memory, and only then start the training by choosing random experiences, or how it works. Please answer me.
@deeplizard 2 years ago
We sample a batch from replay memory only if there are enough samples in RM to make a batch. Watch the code implementation of DQN training that comes later in this series for more clarity.
@houyahiya5729 2 years ago
@@deeplizard Thank you for responding to me. To be clear about my question: you said, "All of the agent's experiences at each time step over all episodes played by the agent are stored in the replay memory," so according to what I understood, each time step over one episode contains several experiences. Am I right?
@deeplizard 2 years ago
There is one experience per timestep. There are many timesteps per episode.
@pranesh3050 4 years ago
Hi, can anyone help me run OpenAI Gym code in Spyder? Everything runs well, but when I run the code to view it, the window gets stuck. I have tried changing the IPython console in preferences, but it doesn't seem to work. It would be great if someone could solve this issue for me. Thanks.
@hanserj169 4 years ago
Is there any difference between sampling experiences / applying gradient descent at each time step and doing it between episodes?
@srikrishnaveturi6135 3 years ago
Hi! I was wondering: when we want to find the target of q(s1, a1), we have to pass the state to the network another time to actually get the maximum Q-value. But won't it just become an infinite loop, since to get the value of s2 we would have to pass it to the model, and the same for the next state? Do we just stop after some time, since gamma would decrease the total effect of the future states?
@jonathanford9354 2 years ago
You use the neural network to approximate the q value at the state-action pair (s1,a1)
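To expand on this reply with a sketch (the toy network and values are assumptions): there is no infinite loop because the future is never unrolled recursively; the network's current estimate for s' is read once, treated as a constant, and plugged into the target.

import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
gamma = 0.99

s_prime = torch.rand(4)
with torch.no_grad():                       # one forward pass, no recursion,
    bootstrap = policy_net(s_prime).max()   # and no gradient through the target
target = 1.0 + gamma * bootstrap            # r + gamma * max_a' Q(s', a')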
@ukaszbednarski6098 4 years ago
Do we assume that the target Q-values are given before we start the algorithm?
@MdYusuf0 5 years ago
One question: at first we store only one experience, so how do we get a batch from experience replay? Or is there something I have not understood?
@Nissearne12 10 months ago
I wonder: how do I modify the target 🎯 value when I select a transition to a state where I win or lose the game, or a state where the agent gets a partial reward or penalty? I guess that in the case where I select any special win/lose state or a partial reward/penalty state, I should set a reward/penalty value for the R term in the Bellman equation, and if I do NOT hit any special reward/penalty state, the R term is set to zero. Is that right? Or do I misunderstand something? Best regards ❤️🙏🐇. When I looked through this video together with the Dueling DQN one kzbin.info/www/bejne/jpvWimtjhZmKfq8si=mpF3b17S_xMmPhOY I realized that I should ONLY add the reward R term to the specific action output node. The other action nodes should have a zero R term added onto the s' "target" output network values. Is that right?
@alexmarty6138 4 years ago
Should I only backprop the loss of Q(s4,a4) and its Q*(s4,a4), or should I backpropagate all the Q-values with respect to Q*(s4,a4)? Or should I calculate for every Q(sn,an) its Q*(sn,an) and each time backprop the loss of these two? If you can help me out, I owe you a beer ;)
@christianliz-fonts3524 4 years ago
What should I be studying to understand these algorithms?
@samarthagarwal7219 3 years ago
Make videos on graph neural nets.
@gurinderghotra1756 5 years ago
This is awesome... however, it could benefit from a sequence number.
@deeplizard 5 years ago
Thank you, gurinder! Check out the series on deeplizard.com. There, each episode has a corresponding number: deeplizard.com/learn/video/nyjbcRQ-uQ8
@pepe6666 5 years ago
Man, I really need to go back to understanding the Bellman equation stuff. I am totally lost with that.
@deeplizard 5 years ago
If you spend some time on the corresponding blogs for the Bellman Equation content, it should help!
@pepe6666 5 years ago
@@deeplizard Bwoah, I didn't expect a response from the lizard herself. I am honored. You raise a valid point. I will sleep on it and absorb knowledge by induction; that's how my species learns.
@christianliz-fonts3524 4 years ago
Can you actually show the code for these algorithms? It would help a lot of others who feel left out, such as me, catch on quicker. Either way, I need to research some computer science algorithms and will need to catch on at some point in time. But from what I know, these algorithms can be shown in not many lines of code.
@deeplizard 4 years ago
Yes, the code comes in later episodes of the course. First we learn the theory. Then we apply the theory to develop code. Then we visually see the game in action from the code we develop.
@hhellohhello 5 years ago
I don't understand: where is a4 actually used?
@itzmesujit 5 years ago
I also had exactly the same confusion for a moment. Q(s4,a4) - (r4 + gamma*Q(s5,a5)) gives us the loss to backpropagate. So, we must keep track of which action we took at s4 last time, in our experience buffer.
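A sketch of the "keep track of which action we took" point (shapes and names are illustrative assumptions): only the Q-value of the stored action enters the loss, so the other action outputs receive no direct gradient.

import torch
import torch.nn as nn

policy_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))

states  = torch.rand(8, 4)
actions = torch.randint(0, 2, (8, 1))       # the a4's saved in the experience buffer
targets = torch.rand(8)                     # stand-in for r + gamma * max_a' Q(s', a')

q_taken = policy_net(states).gather(1, actions).squeeze(1)  # only Q(s, a_taken)
loss = nn.functional.mse_loss(q_taken, targets)
loss.backward()                             # gradients flow only through q_taken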
@gzbin365 5 years ago
≧∇≦
@owlfight7789 5 years ago
So do you make the AI learn while playing, like in the same episode? Also, my first Conv2D layer doesn't like it when I give it an input shape of input_shape=(4,210,160,1) (4 frames, 210 pixels wide, 160 pixels high, 1 color channel). When I give it that, it says it can't subtract a negative dimension :( so I'm forced to work with 1 frame for the input. If you had the same problem, I would love to hear how you solved it.
@tinyentropy 5 years ago
I kind of fell in love with your voice and charm of intelligence :) Still wondering where you might be from. From the US?
@deeplizard 5 years ago
The U.S. indeed!
@tinyentropy 5 years ago
deeplizard Cool :) Greetings from Germany. Are you doing your PhD in the field of machine learning and deep learning, or have you already finished it?
@deeplizard 5 years ago
👋 *Waves in direction of Germany* 👋 No, I'm not in a PhD program :)
@tinyentropy 5 years ago
deeplizard You should maybe ;)
@paulgarcia2887 5 years ago
When you fall I will catch you 6:01
@deeplizard 5 years ago
😂😂😂 Hahaha this took me a minute to get after playing 6:01. But yes, I'll be waiting...
@paulgarcia2887 5 years ago
@@deeplizard 😁
@deeplizard 5 years ago
Blog post: deeplizard.com/learn/video/0bt0SjbS3xc
@hossein_haeri 4 years ago
Please number your videos!
@deeplizard 4 years ago
They are numbered on the playlist page on KZbin as well as in the series on deeplizard.com. kzbin.info/aero/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv deeplizard.com/learn/playlist/PLZbbT5o_s2xoWNVdDudn51XM8lOuZ_Njv
@avananana 5 years ago
This is confusing.