Train Q-learning Agent with Python - Reinforcement Learning Code Project

  79,498 views

deeplizard

💡Enroll to gain access to the full course:
deeplizard.com/course/rlcpailzrd
Welcome back to this series on reinforcement learning! As promised, in this video, we're going to write the code to implement our first reinforcement learning algorithm. Specifically, we'll use Python to implement the Q-learning algorithm to train an agent to play OpenAI Gym's Frozen Lake game that we introduced in the previous video. Let's get to it!
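For readers following along from the description, here is a minimal sketch of the kind of training loop the video builds. It is an illustration only: the variable names, hyperparameters, and the classic gym FrozenLake-v0 API shown here are assumptions, not the exact course code.

import random
import gym
import numpy as np

env = gym.make("FrozenLake-v0")
q_table = np.zeros((env.observation_space.n, env.action_space.n))

num_episodes = 10000
learning_rate = 0.1            # alpha
discount_rate = 0.99           # gamma
exploration_rate = 1.0         # epsilon
max_exploration_rate = 1.0
min_exploration_rate = 0.01
exploration_decay_rate = 0.001

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy: exploit the Q-table or explore with a random action
        if random.uniform(0, 1) > exploration_rate:
            action = np.argmax(q_table[state, :])
        else:
            action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        # Q-learning update: weighted average of the old value and the learned value
        q_table[state, action] = (1 - learning_rate) * q_table[state, action] + \
            learning_rate * (reward + discount_rate * np.max(q_table[new_state, :]))

        state = new_state

    # Exponentially decay the exploration rate after each episode
    exploration_rate = min_exploration_rate + \
        (max_exploration_rate - min_exploration_rate) * np.exp(-exploration_decay_rate * episode)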
Sources:
Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto
incompleteideas.net/book/RLboo...
Playing Atari with Deep Reinforcement Learning by DeepMind Technologies
www.cs.toronto.edu/~vmnih/doc...
Thomas Simonini's Frozen Lake Q-learning implementation
github.com/simoninithomas/Dee...
OpenAI Gym:
gym.openai.com/docs/
TED Talk: • The Rise of Artificial...
🕒🦎 VIDEO SECTIONS 🦎🕒
00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
00:30 Help deeplizard add video timestamps - See example in the description
08:29 Collective Intelligence and the DEEPLIZARD HIVEMIND
💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
👋 Hey, we're Chris and Mandy, the creators of deeplizard!
👉 Check out the website for more learning material:
🔗 deeplizard.com
💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
🔗 deeplizard.com/resources
🧠 Support collective intelligence, join the deeplizard hivemind:
🔗 deeplizard.com/hivemind
🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
🔗 neurohacker.com/shop?rfsn=648...
👀 CHECK OUT OUR VLOG:
🔗 / deeplizardvlog
❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
Tammy
Mano Prime
Ling Li
🚀 Boost collective intelligence by sharing this video on social media!
👀 Follow deeplizard:
Our vlog: / deeplizardvlog
Facebook: / deeplizard
Instagram: / deeplizard
Twitter: / deeplizard
Patreon: / deeplizard
KZbin: / deeplizard
🎓 Deep Learning with deeplizard:
Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
Learn TensorFlow - deeplizard.com/course/tfcpailzrd
Learn PyTorch - deeplizard.com/course/ptcpailzrd
Natural Language Processing - deeplizard.com/course/txtcpai...
Reinforcement Learning - deeplizard.com/course/rlcpailzrd
Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
🎓 Other Courses:
DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
Deep Learning Deployment - deeplizard.com/learn/video/SI...
Data Science - deeplizard.com/learn/video/d1...
Trading - deeplizard.com/learn/video/Zp...
🛒 Check out products deeplizard recommends on Amazon:
🔗 amazon.com/shop/deeplizard
🎵 deeplizard uses music by Kevin MacLeod
🔗 / @incompetech_kmac
❤️ Please use the knowledge gained from deeplizard content for good, not evil.

Comments: 224
@deeplizard
@deeplizard 5 жыл бұрын
Check out the corresponding blog and other resources for this video at: deeplizard.com/learn/video/HGeI30uATws
@fordatageeks
@fordatageeks 5 жыл бұрын
I must really say that I have never found any other tutorial that explains things both theoretically and code-wise like you do. You're a GEM. You inspire me; keep up the great work.
@deeplizard
@deeplizard 5 жыл бұрын
Thank you so much, Sanni!
@tanismar2979
@tanismar2979 3 жыл бұрын
I needed a course that goes into enough detail to understand what's going on beyond an introductory overview of RL, but not so much that it would be a playlist of 15 videos at an hour and a half each. This course strikes the perfect balance between speed and depth, with great explanations and very useful resources. I think it deserves many more views, and I expect it will get them soon!
@gvcallen
@gvcallen Жыл бұрын
exactly the same. great videos
@tryhardnoob1140
@tryhardnoob1140 4 жыл бұрын
This is exactly what I needed to see. I feel like too many tutorials either fail to give enough explanation, or spend an hour explaining basic programming concepts.
@noahturner5212
@noahturner5212 5 жыл бұрын
Definitely the most clear and by far the best tutorial on q learning out there. Thank you!!
@christianjt7018
@christianjt7018 5 жыл бұрын
This tutorial is incredibly clear and is the best tutorial that I have found about RL on the internet, I have learned a lot, thanks a lot for the effort in creating this and sharing the knowledge.
@aditjain3897
@aditjain3897 3 жыл бұрын
Just amazing, the way you explained the concepts is just brilliant. Thank you so much.
@omkarjadhav13
@omkarjadhav13 3 жыл бұрын
Amazing work!! Can't believe you are explaining RL in such an easy way. Thank you so much!
@abdoulayely8284
@abdoulayely8284 5 жыл бұрын
Your way of explaining makes these "obscure" concepts relatively easy to grasp. I went through various resources, but your videos are just the best I could find about reinforcement learning for starters. I am looking forward to seeing new content uploaded. Thanks a million.
@deeplizard
@deeplizard 5 жыл бұрын
Thank you, Abdoulaye! Glad you're here!
@DanielWeikert
@DanielWeikert 5 жыл бұрын
It's amazing how you guys are able to explain complex topics step by step so well. I am really grateful for that. Your videos are interesting and fun and I really learn a lot
@deeplizard
@deeplizard 5 жыл бұрын
Thank you, Daniel!
@Alchemist10241
@Alchemist10241 2 жыл бұрын
This video summarizes all of the previous lessons. Thanks for the practicality 😎
@dmitriys4279
@dmitriys4279 5 жыл бұрын
Thank you for awesome explanation!!! It's the greatest tutorial I have ever seen about Q-learning
@iAndrewMontanai
@iAndrewMontanai 4 жыл бұрын
Giving likes for any non music video very rarely, but here i gotta make exception. These videos deserve thousands, you're awesome ^^
@psychodrums8138
@psychodrums8138 Жыл бұрын
What a beautiful explanation of the code!!!! GREAT!
@chamangupta4624
@chamangupta4624 3 жыл бұрын
I haven't found a better 8-minute video on RL than this, and I searched for a month. Many thanks.
@TheMyrkiriad
@TheMyrkiriad 3 жыл бұрын
After a couple of changes, I managed 75.6%, also converging much faster, after only 2000 episodes. Since there is randomness in this game, figures can change quite a bit from one run to the other, so I gave the best score that came up after several runs.
@sreeharshaparuchuri756
@sreeharshaparuchuri756 Жыл бұрын
Hey, that's cool to hear. Would you mind me asking what you changed? Did you expect the agent to converge that much faster or faster in general wrt the tutorial?
@CesarAugusto-vu2ev
@CesarAugusto-vu2ev 5 жыл бұрын
You are the best!!! EXCELLENT Explanation, EXCELLENT video! Thank you very much, I became a fan of your work!
@paedrufernando2351
@paedrufernando2351 4 жыл бұрын
Why r u such a gem of a person...keep it up...Really hats off
@Marcaunon
@Marcaunon 5 жыл бұрын
Excellent video series!
@antoinelunaire9462
@antoinelunaire9462 4 жыл бұрын
Thanks a lot for the great effort. A piece of advice for anyone running the code: visit the blog and copy/paste the explained code, paying attention to the spacing, to avoid mistakes. It gave me somewhat different numbers, but with the same philosophy.
@timonstwedder3201
@timonstwedder3201 5 жыл бұрын
Great explanation! Thank you
@varunbansal590
@varunbansal590 3 жыл бұрын
you made it so simple Thankkksssss one of the best resources for rl
@tong9977
@tong9977 5 жыл бұрын
This is very good. Excellent explanation. Thank you.
@zsa208
@zsa208 3 ай бұрын
very detailed and easy to understand tutorial
@egrinant2
@egrinant2 5 жыл бұрын
Hi, I commented on the poll before and it doesn't let me post another comment to reply to you. Languages we use: mostly C# and JavaScript, SQL, PHP, rarely Python. However, I am very interested in ML. I love your videos. I just started following you about 2 weeks ago and I am amazed at the quality of the content. I've read more than a dozen books about ML and tried to follow other people's videos/courses, and I can say that your channel explains the most important concepts in a much clearer way than anyone else. Keep up the good work!
@deeplizard
@deeplizard 5 жыл бұрын
Hey Edgar - Thanks for sharing your experiences! Really glad to hear you're finding such value in the content, and we're happy that you're here!
@absimaldata
@absimaldata 3 жыл бұрын
Best tutorial till now
@housseynenadour2233
@housseynenadour2233 3 жыл бұрын
Thank you for your lesson, so helpful. For this game, even if we take exploration_rate = 0, we can reach the globally optimal policy very quickly (with an average reward per 1000 episodes of 0.7) without getting stuck in a locally optimal policy.
@mateusbalotin7247
@mateusbalotin7247 2 жыл бұрын
Thank you for the video, great!
@ericchu3226
@ericchu3226 2 жыл бұрын
Very nice series...
@absimaldata
@absimaldata 3 жыл бұрын
I don't know what's wrong with the rest of the world. Everyone else makes rote, nonsensical tutorials. But this one is by far the clearest and best explanation so far.
@tariqshah2767
@tariqshah2767 4 жыл бұрын
Wow really impressed
@maryannsalva3462
@maryannsalva3462 4 жыл бұрын
That was really amusing! Great videos and presentation. 😍😘
@sametozabaci4633
@sametozabaci4633 5 жыл бұрын
Thank you very much for your great explanation. You really know how to teach; I respect that deeply. This is a very good example for beginners, but it's a prebuilt environment. I think one of the key points is writing a proper reward function, and in this example there is no reward function; everything is done by the environment object. Will you cover writing proper reward functions or building environments for RL in the next videos? Again, thanks for your excellent work.
@deeplizard
@deeplizard 5 жыл бұрын
Hey samet - You're welcome! I've thought about this as well. I haven't yet explored building and developing an environment from scratch, but I may consider doing this in a future video. Thanks for the suggestion!
@sametozabaci4633
@sametozabaci4633 5 жыл бұрын
@@deeplizard Thanks. Waiting for upcoming videos.
@korbinianblanz3734
@korbinianblanz3734 4 жыл бұрын
Exactly my thought. This series is absolutely amazing and helped me a lot, but I miss this point. Is there a @deeplizard video out there now on this topic? If not, can you suggest any valuable resources on implementing your own reward function? Thanks!
@trishantroy6256
@trishantroy6256 4 жыл бұрын
@@deeplizard Thanks a lot for this video series! Regarding the reward functions query, we can write the reward functions through code ourselves since we know the state of the agent, right? Is this correct?
@rashikakhandelwal702
@rashikakhandelwal702 2 жыл бұрын
Thank you for doing this great Work . I love the videos :)
@marcin.sobocinski
@marcin.sobocinski 2 жыл бұрын
I know the course is quite dated (2018), but just wanted to say that the sample code is great. I am just doing another heavily charged course online and the code is so cryptic and unnecessarily complicated that I have to search for other sources of knowledge. Happy to find a good one here :D
@deeplizard
@deeplizard 2 жыл бұрын
Glad to hear it Marcin!
@niveyoga3242
@niveyoga3242 4 жыл бұрын
Thank you!
@DEEPAKSV99
@DEEPAKSV99 4 жыл бұрын
It's true that it has been almost 2 years since this video was uploaded. But since I am viewing it just now let me try to answer your question on "why a higher value of exploration_decay_rate (0.01) gives poorer results (compared to 0.001) ?": When I tried running the code with 0.01 in my PC, all I got at the end were a q_table with all the elements as zero & all the rewards were zero as well. My intuition: I think this could have happened since a high exploration_decay_rate can cause the decay to happen instantly and the model would almost every time prefer exploitation over exploration, and when exploitation happens over a q-table with only zeros, the agent would be looping around the same path and might never reach the goal causing the q-table & rewards to never update. . .
@sebastianjost
@sebastianjost 3 жыл бұрын
That sounds very much correct, and this is a common problem: the agent never gets enough reward to learn where to go. That's why more complex reward functions can be useful. In this case, adding a penalty for falling into a hole should help a lot. But since the environment is prebuilt, I don't know if that's possible.
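To see the effect discussed in this thread, here is a quick standalone sketch of the exponential decay schedule. The formula and the decay-rate values are taken from this discussion; the min/max rate values are assumptions for illustration.

import numpy as np

max_rate, min_rate = 1.0, 0.01
for decay_rate in (0.01, 0.001):
    for episode in (0, 100, 500, 1000, 5000, 10000):
        epsilon = min_rate + (max_rate - min_rate) * np.exp(-decay_rate * episode)
        print(f"decay={decay_rate}, episode={episode}: epsilon={epsilon:.3f}")
# With a decay rate of 0.01, epsilon is essentially at its minimum after a few hundred
# episodes, so the agent stops exploring before the Q-table contains anything useful;
# with 0.001 it keeps exploring for thousands of episodes.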
@RandomShowerThoughts
@RandomShowerThoughts 5 жыл бұрын
After this video I'm going to become a patreon! This was an absolutely amazing series up to this point. Can't wait to finish
@deeplizard
@deeplizard 5 жыл бұрын
Thank you so much, Farhanking! Really happy that you're enjoying the series!
@RandomShowerThoughts
@RandomShowerThoughts 5 жыл бұрын
deeplizard I really am. I doubt you remember, but I requested this series back in September or something, so it's awesome that you guys actually did it. I was away from machine learning for a while; I came back and saw you guys were also gone for a couple months. Great to see you back though.
@deeplizard
@deeplizard 5 жыл бұрын
I do remember :D When I saw your comment, your user name and profile photo refreshed my memory that you were previously going through the Keras series and had asked about plans for RL. New episodes are still being added to the RL series, and the next one should be out within a couple of days!
@RandomShowerThoughts
@RandomShowerThoughts 5 жыл бұрын
deeplizard wow that’s insane! Ha it’s really been a while. Looking forward to it
@hello3141
@hello3141 5 жыл бұрын
Once again, very nice. As you noted at 0:45, the initial exploration decay rate was off by an order of magnitude. As you suggest, a good exercise is to assess why this is a problem. My 2-cents: if the exploration decay rate is too large, the 2nd term in the “exploration-rate update” is ≈ 0 (because the exponential term is ≈ 0). The impact is that subsequent epsilon-greedy searches get stuck in an “exploitation” mode since the exploration rate converges to "min_exploration_rate" (little or no exploration occurs). This is the same point you made later in the video, so again, nice work.
@deeplizard
@deeplizard 5 жыл бұрын
Thanks for answering the challenge, Bruce! 🎉Yes, the lower the exploration decay rate, the longer the agent will be able to explore. With 0.01 as the decay rate, the agent was only able to explore for a relatively short amount of time until it went into full exploitation mode without having a chance to fully explore and learn about the environment. Decreasing the decay rate to 0.001 allowed the agent to explore for longer and learn more.
@richarddeananderson82
@richarddeananderson82 5 жыл бұрын
@@deeplizard It is true that a larger exploration decay rate (0.01 instead of 0.001) makes the learning algorithm move to "exploitation" faster. That being said, the issue here lies mainly in the "exploration-exploitation trade-off" part, together with the fact that the Q-table will be updated with values different from zero starting from the "higher" states and working backward to the start state (because there is only a reward in the last state, where the frisbee is). And choosing the max Q-value from a state where all the Q-values are 0 (action = np.argmax(q_table[state, :])) will ALWAYS return action=0, or "Left", so at the start (in the case of a greedy policy) our buddy here will get stuck on the first row walking left, unless he "luckily" slips and goes down. So you may end up with a Q-table full of zeros every now and then. To avoid this, I suggest this minor change to the code for the "exploration-exploitation trade-off" part:

if exploration_rate_threshold > exploration_rate and not np.all(q_table[state, :] == 0):
    # Greedy policy, only if the Q-values for the state are NOT all 0
    action = np.argmax(q_table[state, :])
else:
    # Explore
    action = env.action_space.sample()

In that case, if the greedy policy is chosen but all of the Q-values of the state are zero, it will choose a random action instead of always going for action 0, which is not fair. With that you can leave exploration_decay_rate = 0.01 and get similar results. One last thing: this behavior would be much clearer if the game were deterministic (no slipping on the ice), since the slipperiness adds randomness that helps hide the phenomenon. Best regards, and keep it up!
@abdurrakib5324
@abdurrakib5324 4 жыл бұрын
Great tutorial, would be really helpful if you can make a tutorial on implementing the same game in the probabilistic setting based on MDP.
@yash-vh9tk
@yash-vh9tk 4 жыл бұрын
The more we explore the better results I am seeing.
@portiseremacunix
@portiseremacunix 3 жыл бұрын
I can finally understand how easy it is to implement Q-learning!
@chrisfreiling5089
@chrisfreiling5089 5 жыл бұрын
Thanks for these videos! Since this is such a small problem, I would bet (after you changed exploration_decay_rate =0.001) you have already found the optimum policy with about 10,000 episodes. Assuming this hunch is correct, the only way to improve the score further is to decrease the final exploration_rate. So I tried just setting exploration_rate=0 for the last 1000 steps. I think it does a little better. There is a lot of luck, but I would guess about 73% avg.
@deeplizard
@deeplizard 5 жыл бұрын
Ah, nice! Thanks for sharing your approach and results!
@Gleidson5608
@Gleidson5608 4 жыл бұрын
Thanks for this series. I have learned so much with it, and I'm learning a lot of things with you.
@farhanfaiyazkhan8916
@farhanfaiyazkhan8916 19 күн бұрын
A larger exploration_decay_rate pushes epsilon toward its minimum faster, so we get more exploitation cases before we have explored the environment; hence the inconsistency.
@iTube4U
@iTube4U Жыл бұрын
If you tried to follow along with the code but the line q_table[state, action] doesn't work: first change state = env.reset() to state = env.reset()[0], and don't forget to change the step call to new_state, reward, done, truncated, info = env.step(action) if it complains about one more value in the returned tuple.
@hanserj169
@hanserj169 4 жыл бұрын
Amazing explanation. I was wondering if you have a code implementation for a continuous task as well.
@interweb3401
@interweb3401 Жыл бұрын
BEST INSTRUCTOR EVER, BEST INSTRUCTOR EVERRRRRR
@maximilianwittmann2400
@maximilianwittmann2400 4 жыл бұрын
Dear deeplizard-team, first of all I want to congratulate you for having put together an amazing Reinforcement Learning playlist. Your theoretical explanations are very precise, you walk us very smoothly through the source code, and your videos are both entertaining and rich on great content! By watching your videos, you can tell that you are very passionate about AI and on sharing your knowledge with fellow tech enthusiasts. Thank you for your hard work and dedication. I have one question though: I am unsure where exactly we specify the positive or negative rewards in the Python code? I followed your explanation and understood that the agent is basing its decisions for each state-action-pair on the q-table-values depending on the exploration-exploitation-tradeoff. But where exactly in the source code do we actually tell the agent that by stepping onto the fields with letter H, it will receive minus 1 points and for landing on F-letter fields it is "safe"? Is this information specified in the env.step-function and thus already imported from OpenAI gym's environment? I look forward to your reply. Thanks!
@deeplizard
@deeplizard 4 жыл бұрын
Hey Maximilian - You're so welcome! Thank you for the kind words; we're very happy you're finding so much value in the content. In regards to your question, your assumption is correct that the reward function is defined by the OpenAI Gym environment. The link below is Gym's wiki page for the Frozen Lake environment, which gives an overview of the environment, including how the rewards are defined. github.com/openai/gym/wiki/FrozenLake-v0
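A quick way to confirm this yourself is to step the environment and print the reward it hands back. This is an editorial sketch assuming the classic FrozenLake-v0 API, not code from the video.

import gym

env = gym.make("FrozenLake-v0")
state = env.reset()
done = False
while not done:
    new_state, reward, done, info = env.step(env.action_space.sample())
    # reward stays 0.0 for frozen tiles and holes; it is 1.0 only when the
    # goal state (the frisbee) is reached
    print(new_state, reward, done)
    state = new_state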
@pepe6666
@pepe6666 5 жыл бұрын
hoooray exponential decay. the champion's decay
@saurabhkhodake
@saurabhkhodake 5 жыл бұрын
Average reward of 0.75 with the settings below:
num_episodes = 20000
max_steps_per_episode = 10000
learning_rate = 0.01
discount_rate = 0.99
exploration_rate = 1
max_exploration_rate = 1
min_exploration_rate = 0.001
exploration_decay_rate = 0.001
@deeplizard
@deeplizard 5 жыл бұрын
Thanks for sharing your approach and results, Saurabh! Nice!
@kushagrak4903
@kushagrak4903 3 жыл бұрын
My rewards were like this (they aren't strictly ascending):
1000 : 0.03000000000000002
2000 : 0.22400000000000017
3000 : 0.4110000000000003
4000 : 0.6240000000000004
5000 : 0.6540000000000005
6000 : 0.7050000000000005
7000 : 0.7100000000000005
8000 : 0.7070000000000005
9000 : 0.7330000000000005
10000 : 0.6920000000000005
@mohammadelghandour1614
@mohammadelghandour1614 4 жыл бұрын
Thanks for the amazing work. I have a question: for a state like "S", which is the initial state for the agent, there are only 2 action choices, moving right or moving down. The same also applies to states found on the edges (3 action choices) and corners (2 action choices). Should the Q-values of those state-action pairs be zeros? I just cannot locate them in the final Q-table. I know there are 23 zeros in the final Q-table. Are those for state-action pairs that lead to "H" and for the cases mentioned above, when only 2 or 3 action choices are available?
@linvincent840
@linvincent840 4 жыл бұрын
I think we should remove the term discount_rate * np.max(q_table[new_state, :]) from the Bellman update when our agent reaches the goal, because in that case we don't have a new state anymore.
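A common way to express that idea is to zero out the bootstrap term whenever the step ends the episode, using the done flag returned by env.step(). The sketch below is an editorial illustration with assumed names, not the course's code.

import numpy as np

def q_update(q_table, state, action, reward, new_state, done,
             learning_rate=0.1, discount_rate=0.99):
    # Bootstrap from the next state only while the episode is still running;
    # for terminal transitions (goal or hole) the target is the raw reward alone.
    target = reward if done else reward + discount_rate * np.max(q_table[new_state, :])
    q_table[state, action] = (1 - learning_rate) * q_table[state, action] + \
        learning_rate * target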
@sergiveramartinez2685
@sergiveramartinez2685 3 жыл бұрын
Hi everyone. I'm applying RL to get the link weights of different network topologies. Since I need a value at the output and not a classification (probability), the model will be of the regression type. The problem is: how can I get N different values, where N is the number of links? Thank you so much.
@ninatenpas633
@ninatenpas633 4 жыл бұрын
love you
@arturasdruteika2628
@arturasdruteika2628 4 жыл бұрын
Hi, why do we need to set done = False when later we get done value from env.step(action)?
@TheRealAfroRick
@TheRealAfroRick 4 жыл бұрын
Love the tutorials, but the equation for the q_table update appears different from that of the Bellman equation:
q_table[state, action] = q_table[state, action] + learning_rate * (reward + gamma * np.max(q_table[new_state, :]) - q_table[state, action])
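The two forms are actually algebraically identical: expanding q + lr*(target - q) gives (1 - lr)*q + lr*target. A tiny numeric check with made-up example values (hypothetical, just for illustration):

import numpy as np

q_old, target, lr = 0.4, 1.0, 0.1
incremental  = q_old + lr * (target - q_old)     # Sutton & Barto style increment
weighted_avg = (1 - lr) * q_old + lr * target    # form used in the video's code
print(np.isclose(incremental, weighted_avg))     # True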
@henoknigatu7121
@henoknigatu7121 Жыл бұрын
I have tried the algorithm on FrozenLake-v1 in the same way you implemented it and printed out the reward for each step. All rewards for every state (including the holes) and action are 0.0, except for reaching the goal state. How would the agent learn from these rewards?
@harshadevapriyankarabandar5456
@harshadevapriyankarabandar5456 5 жыл бұрын
That explanation..Wow
@deeplizard
@deeplizard 5 жыл бұрын
I hope that's a good wow :D
@harshadevapriyankarabandar5456
@harshadevapriyankarabandar5456 5 жыл бұрын
@@deeplizard definitely that's a good wow for your good explanation..keep it going..it's very helpful.
@faizanriasat2632
@faizanriasat2632 Жыл бұрын
What if selecting a random action from the sample space gives an action that is prohibited in that state?
@masudcseku
@masudcseku 3 жыл бұрын
Great explanation. I tried the same code; however, the rewards are not increasing properly. After 5000 episodes, it started to decrease suddenly.
@moritzpainz1839
@moritzpainz1839 4 жыл бұрын
why dont you publish your courses on udemy or coursera? i would be happy to support you there :D KEEP IT UP
@fabricendui7902
@fabricendui7902 3 жыл бұрын
These are the best parameters I got (reached 0.76):
count = 1000
num_episodes = 10000
max_steps_per_episode = 100
learning_rate = 0.01
discount_rate = 0.999
exploration_rate = .95
max_exploration_rate = 0.95
min_exploration_rate = 0.01
exploration_decay_rate = 0.001
@ka1zoku-o
@ka1zoku-o Жыл бұрын
DOUBT: You discussed something about loss functions earlier, but there's no mention of a loss function in the code. I understand how the Bellman equations work in reaching optimal solutions over time (which is also what happens in the Bellman-Ford algorithm for single-source shortest paths in graphs). But is it still fine to leave out the loss function?
@mohanadahmed2819
@mohanadahmed2819 4 жыл бұрын
Great video series. What changes from one episode to the next? The only thing that I can see is exploration_rate_threshold, which is a random number. If this is correct, then should we re-seed the random number generator at every episode to guarantee non-identical episode results?
@happyduck70
@happyduck70 Жыл бұрын
A question about the reward that the environment returns after taking env.step(action): is this the reward for ending up in the next state, or is it linked to taking the action in the current state? If it followed the rules for Q-learning, I would expect the reward to be linked to ending up in a state.
@mmanuel98
@mmanuel98 3 жыл бұрын
I'm currently implementing reinforcement learning for another situation. How will the output from this (the q-table) be used for the policy? So that if the situation changes, the policy gained from the training can be used (that is the same "game" but different environment).
@TheHunnycool
@TheHunnycool 5 жыл бұрын
Why are there large Q-values for the left and up actions for the starting state in the first row of the q_table, when we know it can only move right or down? Also, why the large value for left in the 5th row for [F]HFH? By the way, you are the best teacher EVER!!!! Thank you for these lectures.
@deeplizard
@deeplizard 5 жыл бұрын
You're welcome, Himanshu. Glad you're enjoying the content! The answer to this question is very similar to what we previously discussed in your prior comments. Although the agent can choose to move left from the starting state, since the ice is slippery, the ice may make the agent slip into a state other than the one it chose. If the agent chose to move left but instead slipped right, for example, then the Q-value associated with moving left from the starting state would actually be positive, since the agent instead slipped right and gained a positive reward. This explains why the Q-value associated with such an action could be non-zero.
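You can inspect the slippery dynamics directly, since the toy-text environments expose their transition table. The sketch below is an editorial illustration assuming the classic FrozenLake-v0 environment and its 0=left, 1=down, 2=right, 3=up action encoding.

import gym

env = gym.make("FrozenLake-v0")
LEFT = 0
# env.unwrapped.P[state][action] lists (probability, next_state, reward, done) tuples
for prob, next_state, reward, done in env.unwrapped.P[0][LEFT]:
    print(prob, next_state, reward, done)
# Typically prints three outcomes with probability 1/3 each, so choosing "left"
# from the start state can still land the agent somewhere it did not choose.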
@VJZ-YT
@VJZ-YT 2 жыл бұрын
I noticed that my agent keeps facing left (0) because env.action_space.sample() keeps guessing and never reaches the reward, so it always returns zero and the Q-table stays all zeros. Even as I lower the exploration rate decay, it doesn't matter: not once over 50,000 episodes did my agent reach the reward. Keep in mind, I have set exploration to 100%, as there is nothing to optimize in the Q-table. I am at a loss.
@Mirandorl
@Mirandorl 5 жыл бұрын
How does the agent know what actions are available to it? It seems they are abstracted away in this "action space". Where is up / down / left / right actually defined as an action and made available to the agent please?
@ariwanabdollah6758
@ariwanabdollah6758 6 ай бұрын
For those trying this in 2024: below "state = env.reset()", add "state = state[0]", and change "new_state, reward, done, info = env.step(action)" to "new_state, reward, done, truncated, info = env.step(action)".
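Putting those adjustments together, a rough sketch of the adapted loop, assuming the Gymnasium / gym >= 0.26 reset and step API (not the course's original code):

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

state, info = env.reset()          # reset() now returns (observation, info)
done = False
while not done:
    action = env.action_space.sample()
    # step() now returns five values: obs, reward, terminated, truncated, info
    new_state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
    state = new_state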
@ka1zoku-o
@ka1zoku-o Жыл бұрын
DOUBT: Usually while training neural networks, we initialize the weights to random real numbers. Here we're initializing q-values to zero. Is there any specific reason for the difference in treatment of both situations while initializing.
@user-uk4zk1st6f
@user-uk4zk1st6f 4 жыл бұрын
Can anyone tell me why if I reduce epsilon *linearly* the agent doesn't learn? Using the formula from the video works just fine, but i would like to be able to use linear decay, because it helps to calculate how many steps I want to explore...
@Zelmann1
@Zelmann1 Жыл бұрын
Very good tutorial. I noticed even with 50,000 episodes the rewards the agent gets per 1000 episodes plateaus at about 69-70%....implying there is a maximum amount that the agent can learn. Wondering what can be done to get it to 90-something percent.
@ilyeshouhou9998
@ilyeshouhou9998 3 жыл бұрын
I tried using this method on a maze-solving agent with gym_maze, but when the action to be taken is an exploitation action I get this error: "TypeError: Invalid type for action, only 'int' and 'str' are allowed", and when I print the action it is indeed an integer!! Any help would be great, thanks.
@Alchemist10241
@Alchemist10241 2 жыл бұрын
My little tiny Frozen Lake AI doesn't learn anything; Average reward isn't increasing - the code is the same. Anyway thanks for your practical examples
@sankalpramesh5478
@sankalpramesh5478 3 жыл бұрын
Does the map of frozen lake remain the same or does it get randomly changed?
@garrett6064
@garrett6064 3 жыл бұрын
So Game of Thrones is a show I've heard about and will watch if I ever get the appropriate streaming app. ("White Walkers" was searchable.) The FrozenLake game is interesting. I could never get it to quite make it to a 78% success rate. That had better be a star-spangled adamantium frisbee if I'm going to take a 22% chance of dying in its recovery!! I didn't really do this with too much focus, more like just poking around and trying stuff, so there is still room for tweaking. Honestly, I'm not too happy with 78%. "Don't worry, there's a 78% chance our robots won't turn on us and kill us." The exploit-vs-explore number gets down to practically 0.0, so it does exploit enough; the problem is in the Q-table data. I was looking at collecting a bunch of data for Excel and analyzing that, and maybe figuring out what's in that data packet returned from env.step(), and then it occurred to me that you may just cover this in the next or an upcoming video. Oh, and with exploration_decay_rate = 0.01 it would occasionally return a table and results of all 0.0. I didn't figure out why that was either. So my best numbers were:
num_episodes = 500,000 (takes 8+ minutes)
max_steps = 200
min_exploration_rate = 0.00001
And THANK YOU for your amazing tutorial on this!!
@deeplizard
@deeplizard 3 жыл бұрын
For the exploration_decay_rate, the lower the rate, the longer the agent will be able to explore. With 0.01 as the decay rate, the agent was only able to explore for a relatively short amount of time before it went into full exploitation mode, without having had a chance to fully explore and learn about the environment. Decreasing the decay rate to 0.001 allowed the agent to explore for longer and learn more, which allows for better exploitation of the environment later based on what it's learned. For the 78% success rate, keep in mind the game is non-deterministic: the agent can choose the "correct" move to take but, due to the slippery ice, can still be caused to slip in the wrong direction. Thanks for posting your results!
@vvviren
@vvviren 5 жыл бұрын
NOBODY: LITERALLY NOBODY: deeplizard: Don't be shy. I wanna hear from you.
@deeplizard
@deeplizard 5 жыл бұрын
Lol errbody being shy 🤭
@Then00tz
@Then00tz 4 жыл бұрын
@@deeplizard I'm more of a listener. Good job though, you guys rock!
@junhanouyang6593
@junhanouyang6593 3 жыл бұрын
I copied all your code and ran it, but the result barely passes 6%. If I remove the randomness, my result works fine. Does anyone know what the problem may be?
@jeffshen7058
@jeffshen7058 5 жыл бұрын
I tried using a linear decay rate but that didn't work too well. I also tried only exploring for the initial 7000 episodes, then only exploiting for the last 3000 episodes, and that seemed to do quite well. With a learning rate of 0.3 and a discount rate of 0.98, I got to 76% using this strategy. Would it be possible to implement grid search to automate (to some extent) the hyperparameter tuning? Thanks for the effort you put into these excellent videos.
@deeplizard
@deeplizard 5 жыл бұрын
Thanks for sharing your experiments and results! I haven't tried grid search with this, so I'm not sure of the results it would yield.
@picumtg5631
@picumtg5631 2 жыл бұрын
I tried this Q-learning with another simple game, but somehow it just can't get the reward right. Are there common mistakes beginners make?
@TheMekkyfly02
@TheMekkyfly02 3 жыл бұрын
Please, I need a little help!!! I am getting a "tuple not in range" error when I run the code to print the rewards_per_thousand episode
@fishgoesbah7881
@fishgoesbah7881 3 жыл бұрын
Anyone else keep getting KeyError "Cannot call env.step() before calling reset()" :((, I don't know how to fix this, I did call reset() via state = env.reset()
@mariaioannatzortzi
@mariaioannatzortzi 3 жыл бұрын
{
  "question": "What is the maximum and minimum reward value that I can obtain in each episode?",
  "choices": [
    "One for the max, zero for the min.",
    "The sum of all the rewards for each action is the max and min as well.",
    "One for the max and the sum of the rewards obtained until failure for the min.",
    "One for the max, minus one for the min."
  ],
  "answer": "One for the max, zero for the min.",
  "creator": "marianna tzortzi",
  "creationDate": "2020-11-19T20:57:53.328Z"
}
@deeplizard
@deeplizard 3 жыл бұрын
Thanks, Marianna! Just added your question to deeplizard.com/learn/video/HGeI30uATws :) Note I changed a couple of the possible answers for this one, since technically the sum of all the rewards for each action in an episode is indeed the max reward value. Each action in an episode results in a 0 reward until the agent reaches the frisbee or falls through a hole, so the reward for each episode is the sum of all the intermediate 0 rewards plus the reward of falling through a hole (0) or reaching the frisbee (1). Thanks for your quiz question contributions in this course!
@mariaioannatzortzi
@mariaioannatzortzi 3 жыл бұрын
@@deeplizard Thank you for the best YouTube channel ever! Everything you say is so to the point, and mostly hands-on! Regarding the question, my purpose was to test how much attention someone pays to the reward value of each action, because the game with the lizard in the early videos had a reward of -1 for an empty tile, and I thought someone could confuse that with the rewards of Frozen Lake, given this vast amount of information. Also, thanks for the chance you give the community to participate.
@deeplizard
@deeplizard 3 жыл бұрын
Thank you, Marianna! So happy to hear you're gaining from the content and appreciate you taking the time to let us know :)
@deeplizard
@deeplizard 3 жыл бұрын
Regarding the purpose of your quiz question, got it. The correct answer you created of 0 for min and 1 for max is the correct choice for the question on the site, so your point is well taken there :D
@crykrafter
@crykrafter 5 жыл бұрын
Thank you for your great explanation. I really understood everything you explained, but I'm still interested in how you'd apply this method to a dynamic game like Pong, where you can't give the agent all the possible states because it's too complex.
@deeplizard
@deeplizard 5 жыл бұрын
You're welcome, CRYKrafter! I'm glad to hear you're understanding everything. In regards to your question/interest, this would be where _deep_ reinforcement learning will come into play. Deep RL integrates neural networks into reinforcement learning. We'll learn about deep Q-learning in an upcoming video, which can be used for larger more complex state spaces.
@crykrafter
@crykrafter 5 жыл бұрын
Thanks Im looking forward to this series
@SamerSallam92
@SamerSallam92 5 жыл бұрын
Thank you very much. Great course and blog! I guess there is a missing round bracket at the end of this line of code in your blog post:
# Update Q-table for Q(s,a)
q_table[state, action] = q_table[state, action] * (1 - learning_rate) + \
    learning_rate * (reward + discount_rate * np.max(q_table[new_state, :])
@deeplizard
@deeplizard 5 жыл бұрын
Yes, good eye! Thank you!
@pedrofrasao4102
@pedrofrasao4102 2 ай бұрын
I'm having difficulty averaging the rewards per episode, as the averages are always at 0. Can anyone help me with this?
@l.o.2963
@l.o.2963 3 жыл бұрын
Why is the alpha parameter needed? Due to Bellman equations, the q-value is guaranteed to converge because the mapping is a contraction. Wouldn't the alpha parameter make the convergence slower?
@l.o.2963
@l.o.2963 3 жыл бұрын
And you rock, amazing videos. I am very happy I found this
@thehungpham4898
@thehungpham4898 5 жыл бұрын
Hey I get an error, when I update the q_table. "IndexError: index 4 is out of bounds for axis 0 with size 4" Can somebody help me?
@GtaRockt
@GtaRockt 5 жыл бұрын
Size 4 means there are indices 0, 1, 2, 3; you always start counting at zero :)
@SuperOnlyP
@SuperOnlyP 4 жыл бұрын
It should be q_table[state, action], not q_table[action, state], in the formula. That will fix the error.
@izyaanimthiaz2082
@izyaanimthiaz2082 Жыл бұрын
I'm getting this error - only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices - in the code q_table[state, action] = q_table[state, action] * (1 - learning_rate)..., directly copied from the website. Can someone help me solve this error?
@deeplizard
@deeplizard Жыл бұрын
This is due to changes released in later versions of Gym (now migrated to Gymnasium). Code is now updated in the corresponding lecture notes on deeplizard.com.
@dulminarenuka8819
@dulminarenuka8819 5 жыл бұрын
Thank you very much. Can't wait for the next one. Can you make it quick? :-)
@deeplizard
@deeplizard 5 жыл бұрын
You're welcome, Dulmina! Hope to have the next one done within a few days 😆
@raghavgoel581
@raghavgoel581 5 жыл бұрын
I am getting a decrease in the reward after running the code for 10k episodes, any ideas why ?
@antoinelunaire9462
@antoinelunaire9462 4 жыл бұрын
Yes, me too. I believe the code shown is not the final code; for example, in the Update Q-table step there is a parenthesis missing at the end.
@antoinelunaire9462
@antoinelunaire9462 4 жыл бұрын
Visit the blog and try to copy the code and fix the spacing accordingly, it somewhat worked with me but gives different numbers
@kushagrak4903
@kushagrak4903 3 жыл бұрын
I changed minimum exploration rate from 0.1 to 0.001. Btw if you solved it different please share bro
@bharathvarma9729
@bharathvarma9729 3 жыл бұрын
where did she declare q_table
@amirhosseinesteghamat7621
@amirhosseinesteghamat7621 3 жыл бұрын
After making exploration_decay_rate 0.01, I got all zeros! I think it is because the exploration rate becomes very low quickly, so the agent stops exploring the environment.
@sphynxusa
@sphynxusa 5 жыл бұрын
I have more of a general question. You mentioned in one of your videos that there are 3 types of machine learning: 1) Supervised, 2) Unsupervised and 3) Semi-Supervised. Would Reinforcement Learning be considered a 4th type?
@deeplizard
@deeplizard 5 жыл бұрын
Yes, it would!
@pepe6666
@pepe6666 5 жыл бұрын
@@deeplizard why? it seems unsupervised because theres no humans telling it anything
@NityaStriker
@NityaStriker 4 жыл бұрын
The environment provides a reward to the agent. That could be a reason it isn't unsupervised. 🤔
@rubenvicente4677
@rubenvicente4677 5 жыл бұрын
Great video, but you forgot a parenthesis in the main formula (Update Q-table).
@deeplizard
@deeplizard 5 жыл бұрын
Thanks for looking out, Rubén! I've updated the blog with this correction: deeplizard.com/learn/video/HGeI30uATws
@jorgeguzmanjmga
@jorgeguzmanjmga Жыл бұрын
Is the code working? (I checked the page and am still getting errors, like I can't use 'state = env.reset()[0]'.)
@kevinlizarazu-ampuero493
@kevinlizarazu-ampuero493 5 ай бұрын
Per the new gym library, you end up unpacking it every time you reset: state, _ = env.reset()
@latifadawod6317
@latifadawod6317 4 жыл бұрын
I changed the min_exploration_rate to zero and I got 74%
@sjkgaming2174
@sjkgaming2174 2 жыл бұрын
Why, when the exploration decay rate is 0.01, is the whole Q-table 0?