Q-Learning Explained - A Reinforcement Learning Technique

Рет қаралды 233,375

Күн бұрын

Пікірлер

@justinheehaw 4 жыл бұрын

I gave up when I see 1:29 for the first time (because I'm not so good at math and English) But when I came back again today and watched the entire video, I found this video the most well explained one. Especially the Q table section.

@absimaldata 3 жыл бұрын

Why you are so so clear in explaining, I mean why others fail to deliver the tutorials with such clarity like you do?? I dont know whats wrong with everyone. Omg you are impressive.

@richardkessler2171 5 жыл бұрын

One of the best series I've viewed on RL. Really great job teaching the content without boring the audience. Also...really enjoy the closing snippets that keep me excited to see the end. Excellent!

@deeplizard 5 жыл бұрын

Thank you, Richard! Really happy to hear that!

@obensustam3574 11 ай бұрын

Very good content, I watched the videos in this playlist to prepera for my exam. Thank you 😊

@pawelczar 6 жыл бұрын

This whole series is great! I love the way you explain all math concepts and questions and more that happy that you did stop on that but you introduced practical example. Cant wait for next episodes :D

@deeplizard 6 жыл бұрын

Thanks, pawelczar! Glad you're liking it!

@abcqer555 6 жыл бұрын

Hi Lizard People, I feel so fortunate to have come across your channel. Your lessons/videos are very clear, concise, well produced, entertaining, and I am excited for all the videos that will be coming out. Keep up the fantastic work!

@deeplizard 6 жыл бұрын

Hey Paul - Thank you! We're glad you're here!

@tingnews7273 6 жыл бұрын

What I learned: 1.Q-learning:Learning the optimal policy in MDP. 2.Q-learning work:learning the Q-values for each state-action pair 3.Value iteration:Q-learning iteravely updates the Q-value(Will be more clear in future I thought) 4.Q-table:store the Q-value for all state-action pair 5.Exploration:exploring the environment to find out infomation about it. 6.Exploitation:exploiting the information that is already known about(tip:epison greed)

@mohammadmohi8561 3 жыл бұрын

u r an AI, so nicely explained all these hard concepts so easily. thank u so much

@Asmutiwari 4 жыл бұрын

These series are so so informative !! I wish you could make videos on dynamic navigation techniques using DRL

@arkadipbasu828 2 жыл бұрын

Super explanation. Thanks from India

@adamhendry945 4 жыл бұрын

PHENOMENAL! Your videos are THE BEST! Can you PLEASE PLEASE PLEASE do a series on Actor-Critic methods!!

@guineteherve9751 Жыл бұрын

Your work is simply incredible. Thank you!

@asdfasdfuhf 4 жыл бұрын

This was an exciting video, finally, we are getting to the good stuff.

@Ayushsingh-zw3yk 9 ай бұрын

nice explanation deeplizard

@davidkhassias4876 4 жыл бұрын

Can't wait for coming episodes, because this series is amazing! And they/you helped me a lot. Thank you so much!

@DreadFox_official Жыл бұрын

Hey, I loved your video. Thank you so much

@yelircaasi 3 жыл бұрын

Really nice video, thanks for the clear explanations!

@cedrichung6820 3 жыл бұрын

How are you so good at explaining😍😍😍😍

@hazzaldo 5 жыл бұрын

Brilliant video. One of the best RL teaching series/materials I've come across anywhere on the internet (if not the best). Look forward to watching the rest of the series. On this video, I have 3 questions: 1- Just to clarify is there a difference between the Q-function and Optimal Q-function? If so, is the difference that when a Q-function perform Q-value iteration, and eventually converge on the optimal Q-values, then this is called the Optimal Q-function? 2- What does the capital `E` signify in the Bellman Optimality equation? 3- So far I have only learnt the definition of a "policy". Putting it in practice, giving the scenario in this video (about the lizard navigating an environment), where does the policy come into play here? Re-phrasing the question, what part of this scenario is the policy? Many thanks

@deeplizard 5 жыл бұрын

Thanks, hazzaldo! 1. Your assumption is correct. 2. E is the notation for "expected value." 3. Recall that a policy is a function that maps a given state to the corresponding probabilities of selecting each possible action from that state. The goal is for the lizard to navigate the environment in such a way that will yield the most return. Once it learns this "optimal navigation," it will have learned the optimal policy.

@hazzaldo 5 жыл бұрын

@@deeplizard TY very much. I do have another question, that I left in the Exploration vs Exploitation video part of this series. If you ever get the time, would really appreciate any clarification on it. Many thanks again for the answer to my question and this great series.

@NoNTr1v1aL 3 жыл бұрын

Amazing video!

@arnabjana2620 3 жыл бұрын

{ "question": "What is optimal Q-value for a policy?", "choices": [ "Expected return for the reward at time (t+1) and maximum discounted reward thereafter for a state-action pair.", "It gives the optimal policy for the optimal expected return for an agent for each state-action pair.", "It is the reward for the action 'a' taken in state 's' at time 't'.", "Maximum accumulated reward by following the policy from time (t+1)." ], "answer": "Expected return for the reward at time (t+1) and maximum discounted reward thereafter.", "creator": "Arnab", "creationDate": "2021-08-03T08:12:26.884Z" }

@ashabrar2435 4 жыл бұрын

{ "question": "Q table is defined as _______________ and _______________", "choices": [ "action and state", "action and agent", "state and environment", "environment and action" ], "answer": "action and state", "creator": "Hivemind", "creationDate": "2021-01-03T20:01:41.577Z" }

@deeplizard 4 жыл бұрын

Thanks, ash! Just added your question to deeplizard.com/learn/video/qhRNvCVVJaA :)

@shashankdhananjaya9923 4 жыл бұрын

Awesome explanation. I like this

@shoaibalyaan 4 жыл бұрын

AMAZING SERIES! Absolutely loved it!

@xiaojiang2610 4 жыл бұрын

Better than my engineering teacher.

@yashas9974 3 жыл бұрын

Link to the talk that appeared at the end of the video?

@rosameliacarioni1022 3 жыл бұрын

Thanks so muuuuch !

@tallwaters9708 2 жыл бұрын

I'll tell you what I really don't get, it seems the equation only updates the q table based on the current and next state. But the Bellman equation seems to imply that all future states are considered, is there some recursion thing going on?

@deeplizard 6 жыл бұрын

Check out the corresponding blog and other resources for this video at: deeplizard.com/learn/video/qhRNvCVVJaA

@MohsinKhan-ve1hn 5 жыл бұрын

Your voice is great

@SugamMaheshwari 4 жыл бұрын

Your voice is just amazing 😍😍😍😍😍

@davidli9872 Жыл бұрын

Are you here after Reuter's article on OpenAI's q*?

@ayrmendina8314 Жыл бұрын

Yeap 😂 hi 👋

@tomdexter5029 3 ай бұрын

No, what's that OpenAI shit?

@sontapaa11jokulainen94 4 жыл бұрын

Is the exploration vs exploitation part only a part of the training or does it happen also when actually using the learned q table and also can the policy be that "Take the action which has the largest q value and sometimes explore" (eq can that be an example of a policy in this case)? So the policy is just the probability of taking some action in a state so can the policy just be written as: "Take the action which has the largest q value" (this is just an example for exploitation)?

@michaelscott8572 4 жыл бұрын

Thanks for the good explanation and all your work. A little hint if I may: Don't explain the words using the same words: Exploitation and Exploration

@mateusbalotin7247 3 жыл бұрын

Thank you!

@patite3103 3 жыл бұрын

your videos are awsome! Please correct the corresponding quiz since the answer is uncorrect to me. Could you do a video explaining the first three steps and how the q-tables updates. This would really help to understand how the update works. thank you!

@neogarciagarcia443 5 жыл бұрын

exploration of reinforcement learning is going fine !

@madhesh18 4 жыл бұрын

Really good work

@arefeshghi 4 жыл бұрын

Good balance of exploration and exploitation will bring good results in life too! We are all lizards! :)

@namitaa 4 жыл бұрын

you saved my life bro

@rursus8354 3 жыл бұрын

Won't a square become empty when the cricket(s) is(are) eaten?

@TheOfficialJeppezon 4 жыл бұрын

You say that Q-learning tires to find the best policy. However, I thought q-learning is an off-policy algorithm. I also have troubles understanding the on/off policy concept.

@iAndrewMontanai 5 жыл бұрын

What should I do in case of continuous tasks? Like in Flappy Bird (if its continuos, but anyway), I guess Q Table would be infinite here or just would have big fixed value to save memory. Can you give some recommendations or explain please? Wanna start implementation, but dont know how Q Table should look like in this case and how to interact with it correctly ( and i hope there will be no other surprises lol )

@deeplizard 5 жыл бұрын

Yes, a Q-table would not be feasible for this task. Keep going in the series, and you will see how you can use Deep Q-Learning for these tasks. Essentially, you substitute a Q-table for a neural network.

@megasage 3 жыл бұрын

The dound at 00:30 I hear in every video is quite disturbing 😅

@mauriziovassallo5499 5 жыл бұрын

Very clear :)

@MrGenbu 5 жыл бұрын

hi i wanted to ask a Q , when the agent is trained on this lizard example in this board 3*3 , if we placed it in a board of 6*6 can it still perform or this is another kind of reinforcement learning

@deeplizard 5 жыл бұрын

This technique would still work with a 6x6 board.

@MrGenbu 5 жыл бұрын

@@deeplizard but we need to train it first to generate the Q table rigth ? i mean it can not be trained and run on different board even using a 3*3 board with different reward places will not work? like regression u make a line then use it as u like but here u can not becuase the states should be the same ? is not it ?

@deeplizard 5 жыл бұрын

Yes, I thought you were asking in general if the Q-learning with value iteration technique would work on a 6x6 board. If you changed the board, then you would need to change and initialize the Q-table as well before training starts.

@MrGenbu 5 жыл бұрын

@@deeplizard so this kind of agent is environment specific did u watch openAI hide and seek agent it seems it adapts to a new environemt without training i see this kind of agents is limited in its use as it can not be used is the real world as it needs to be trained on every new environment i am a newpie so i am rly just asking u to get more clear answer ? if u had seen the openAI video i would like to know which type of reinforecemnt learning can adapt to new environments

@krajkumar6 3 жыл бұрын

Hey @deeplizard, Many thanks for this video. I'm reading 'Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Bartow' and I'd like to know whether Q- learning technique described here is the same as dynamic programming explained in the book?

@krajkumar6 3 жыл бұрын

it is Temporal difference learning technique

@louerleseigneur4532 4 жыл бұрын

merci merci hats off

@XxGabberlordxX 5 жыл бұрын

Hello, can someone explain me pls why there are 6 empty states?

@deeplizard 5 жыл бұрын

The six empty states are arbitrary. Think about a video game where some actions will cause you to gain points, some actions will cause you to lose points or lose the game, and some actions will have no immediate effect on your score. With the lizard game example, we have a similar set up where moving to a tile with crickets will gain points, moving to a tile with a bird will lose points/lose the game, and moving to an empty tile has no immediate effect on our score.

@XxGabberlordxX 5 жыл бұрын

@@deeplizard Hey ty for the answer! When I take a look at the picture I still don't get why there are 6 empty tiles. I don't count 6 empty tiles 🤔

@deeplizard 5 жыл бұрын

In the photo, the lizard is on one of the empty tiles. The lizard is the agent, and she is free to move to any tile.

@XxGabberlordxX 5 жыл бұрын

@@deeplizard Wow that was so obvious. Ty for the help :) Now i get it. Nice video and have a nice day :)

@shreyasrajanna7361 6 жыл бұрын

Where is the next video

@deeplizard 6 жыл бұрын

It is being developed! Aiming to add a new video to this series every 3-4 days.

@shreyasrajanna7361 6 жыл бұрын

@@deeplizard Your videos are really good.

@shreyasrajanna7361 6 жыл бұрын

Can you make videos on Kaggle projects or make a project which may make learning even more interesting.

@deeplizard 6 жыл бұрын

We may do some Kaggle videos in the future. We do have the following two series that show practical deep learning projects in both Keras and TensorFlow.js. deeplizard.com/learn/playlist/PLZbbT5o_s2xr83l8w44N_g3pygvajLrJ- deeplizard.com/learn/playlist/PLZbbT5o_s2xrwRnXk_yCPtnqqo4_u2YGL

@saumyachaturvedi9065 11 ай бұрын

I guess crickets make sound, so lizard can take that as input as well to take the path

@deeplizard 11 ай бұрын

🦗🎶🦎

@nodstradamus 5 жыл бұрын

Thanks for the video, it was useful. But for me it would have been even more useful if you'd explained the Gamma Value (i.e. Discount Factor) in the formula as well..

@deeplizard 5 жыл бұрын

Hey Aleistar - You're welcome! We first introduce the discount rate (gamma) a couple of episodes back where we learned about expected return. Check out the video/blog where it is introduced and defined here: deeplizard.com/learn/video/a-SnJtmBtyA Let me know if this helps!