Exploration vs. Exploitation - Learning the Optimal Reinforcement Learning Policy

111,897 views

deeplizard


💡Enroll to gain access to the full course:
deeplizard.com/course/rlcpailzrd
Welcome back to this series on reinforcement learning! Last time, we left our discussion of Q-learning with the question of how an agent chooses either to explore the environment or to exploit it in order to select its actions. In this video, we'll answer this question by introducing a strategy called the epsilon greedy strategy.
We'll also explore how, using this strategy, the agent makes decisions about the actions it takes, and we'll see exactly how a Q-value is calculated and updated in the Q-table mathematically, using an example from the lizard game we introduced last time.
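For readers who want to see the idea in code before the math, here is a minimal sketch of epsilon-greedy action selection and the Q-value update described above. The grid size, hyperparameter values, and function names are illustrative assumptions, not the course's exact code.

```python
import numpy as np

num_states, num_actions = 9, 4          # e.g., a 3x3 lizard-game grid with 4 moves (assumed sizes)
q_table = np.zeros((num_states, num_actions))
alpha, gamma, epsilon = 0.7, 0.99, 1.0  # learning rate, discount factor, exploration rate (illustrative)

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table.
    if np.random.uniform(0, 1) < epsilon:
        return np.random.randint(num_actions)   # explore: pick a random action
    return int(np.argmax(q_table[state]))       # exploit: pick the action with the highest Q-value

def update_q(state, action, reward, next_state):
    # New Q-value is a weighted sum of the old value and the learned value.
    learned_value = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] = (1 - alpha) * q_table[state, action] + alpha * learned_value
```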
Sources:
Reinforcement Learning: An Introduction, Second Edition by Richard S. Sutton and Andrew G. Barto
incompleteideas.net/book/RLboo...
Playing Atari with Deep Reinforcement Learning by DeepMind Technologies
www.cs.toronto.edu/~vmnih/doc...
TED Talk:
• Artificial intelligenc...
🕒🦎 VIDEO SECTIONS 🦎🕒
00:00 Welcome to DEEPLIZARD - Go to deeplizard.com for learning resources
00:30 Help deeplizard add video timestamps - See example in the description
09:37 Collective Intelligence and the DEEPLIZARD HIVEMIND
💥🦎 DEEPLIZARD COMMUNITY RESOURCES 🦎💥
👋 Hey, we're Chris and Mandy, the creators of deeplizard!
👉 Check out the website for more learning material:
🔗 deeplizard.com
💻 ENROLL TO GET DOWNLOAD ACCESS TO CODE FILES
🔗 deeplizard.com/resources
🧠 Support collective intelligence, join the deeplizard hivemind:
🔗 deeplizard.com/hivemind
🧠 Use code DEEPLIZARD at checkout to receive 15% off your first Neurohacker order
👉 Use your receipt from Neurohacker to get a discount on deeplizard courses
🔗 neurohacker.com/shop?rfsn=648...
👀 CHECK OUT OUR VLOG:
🔗 / deeplizardvlog
❤️🦎 Special thanks to the following polymaths of the deeplizard hivemind:
Tammy
Mano Prime
Ling Li
🚀 Boost collective intelligence by sharing this video on social media!
👀 Follow deeplizard:
Our vlog: / deeplizardvlog
Facebook: / deeplizard
Instagram: / deeplizard
Twitter: / deeplizard
Patreon: / deeplizard
YouTube: / deeplizard
🎓 Deep Learning with deeplizard:
Deep Learning Dictionary - deeplizard.com/course/ddcpailzrd
Deep Learning Fundamentals - deeplizard.com/course/dlcpailzrd
Learn TensorFlow - deeplizard.com/course/tfcpailzrd
Learn PyTorch - deeplizard.com/course/ptcpailzrd
Natural Language Processing - deeplizard.com/course/txtcpai...
Reinforcement Learning - deeplizard.com/course/rlcpailzrd
Generative Adversarial Networks - deeplizard.com/course/gacpailzrd
🎓 Other Courses:
DL Fundamentals Classic - deeplizard.com/learn/video/gZ...
Deep Learning Deployment - deeplizard.com/learn/video/SI...
Data Science - deeplizard.com/learn/video/d1...
Trading - deeplizard.com/learn/video/Zp...
🛒 Check out products deeplizard recommends on Amazon:
🔗 amazon.com/shop/deeplizard
🎵 deeplizard uses music by Kevin MacLeod
🔗 / @incompetech_kmac
❤️ Please use the knowledge gained from deeplizard content for good, not evil.

Comments: 105
@deeplizard
@deeplizard 5 жыл бұрын
Check out the corresponding blog and other resources for this video at: deeplizard.com/learn/video/mo96Nqlo1L8
@yohannistelila8879
@yohannistelila8879 2 жыл бұрын
CS/AI student here; this playlist is a gem. Sharing it with my classmates. Thanks a lot!
@tanmaykulkarni6046
@tanmaykulkarni6046 18 күн бұрын
Perfect introductory playlist for RL
@arianvc8239
@arianvc8239 5 жыл бұрын
This series has been the best introductory course I've seen so far.
@BDEvans
@BDEvans 4 жыл бұрын
I agree
@mohamed_khoudjatelli9349
@mohamed_khoudjatelli9349 2 жыл бұрын
I also agree
@luigifaticoso2212
@luigifaticoso2212 5 жыл бұрын
The explanation is really good for me (a university student in artificial intelligence)! Superb video and audio quality, really understandable graphics!! This channel deserves more! Thank you!
@22kmhigh
@22kmhigh 3 жыл бұрын
Really good introductory material, well-explained, sufficiently rigorous, and fun to view. Thanks!
@delllaptop5971
@delllaptop5971 3 жыл бұрын
Hey, just wanted to say this playlist is sooo helpful, thank you so much!! I just did a couple of courses in RL and then started getting my basics all mixed up, but luckily we've got you!! So THANKS!! Keep up the good work. Would love to see more projects, thanks!!
@siddhant997
@siddhant997 3 жыл бұрын
Amazing channel! Thank you. And the way you end with a different video is unique.
@darshshah7155
@darshshah7155 Жыл бұрын
Really love the amount of effort you put into making these videos. I especially loved that you included a related TED talk to get students who watch your videos more interested in the AI field.
@hcandra
@hcandra 3 жыл бұрын
Excellent course series on reinforcement learning. Really enjoying it. Thx so much! God bless :)
@lanxizhang2816
@lanxizhang2816 4 жыл бұрын
You saved my life. I am currently taking a machine learning subject. I can easily understand you even though English is my second language. Thank you!
@Mahdi-noori-ai
@Mahdi-noori-ai 3 ай бұрын
Insane Insane and absolutely insane ❤❤❤ Thanks for this wonderful Playlist🎉
@deeplizard
@deeplizard 3 ай бұрын
You're welcome!
@psychodrums8138
@psychodrums8138 Жыл бұрын
WOOOOW!! This is by far the best explanation of Q-Learning I've seen! Congratulations!! Of course I have subscribed!
@t.pranav2834
@t.pranav2834 2 жыл бұрын
A Really Great Playlist. Thanks a lot!!!
@abdullahmoiz8151
@abdullahmoiz8151 4 жыл бұрын
thanks this was quite helpful
@interweb3401
@interweb3401 Жыл бұрын
Best course EVER, best teacher EVVVEERRRR
@mrfrozen97-despicable
@mrfrozen97-despicable 3 жыл бұрын
To the point videos. Thanks
@benjamindeporte3806
@benjamindeporte3806 3 жыл бұрын
Outstanding video.
@nernaykumar8334
@nernaykumar8334 4 жыл бұрын
The explanation is very good, even though they are providing the content for free 👍👍👍👍👍👍
@faisalamir1656
@faisalamir1656 2 жыл бұрын
thank you, this is very useful
@adamhendry945
@adamhendry945 3 жыл бұрын
Please give credit to "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto, copyright 2014, 2015. You allow viewers to pay you through Join and this book material is copyrighted, but you do not reference them anywhere on your website. The equations and material are pulled directly from the text and it presents an ethical issue. Though the book is open-sourced, it is copyrighted, and you are using this material for financial gain. This text book has been used in several university courses on reinforcement learning in the past. I love these videos, but proper credit and securing approval from the authors must be obtained!
@adamhendry945
@adamhendry945 3 жыл бұрын
Finally! 3 months passed without a reply and now the references have magically appeared. I would have appreciated a response from the website authors themselves, but still grateful they finally gave the textbook authors credit. We don't want another Raj incident! This is great content!
@adamhendry945
@adamhendry945 3 жыл бұрын
@Radu Cojocaru I'm not an asshole.
@dfaiezdfaiez1699
@dfaiezdfaiez1699 5 жыл бұрын
Thank you thank you thank you!! this is super helpful (:
@daliasobhy7590
@daliasobhy7590 2 жыл бұрын
It's really good. I did not get the part about making the Q-value function converge to the right-hand side of the Bellman equation.
@mateusbalotin7247
@mateusbalotin7247 2 жыл бұрын
Thank you!
@Alchemist10241
@Alchemist10241 2 жыл бұрын
Thanks for not leaving the lizard game! 😊
@gamgam635
@gamgam635 Жыл бұрын
thank you!
@ameynaik2743
@ameynaik2743 5 жыл бұрын
Very good explanation. It would be great if you had worked out the example completely (preferably step by step). Thanks!
@profie24
@profie24 4 жыл бұрын
Very nice explanation
@mariaioannatzortzi
@mariaioannatzortzi 3 жыл бұрын
{ "question": "When the actions in the Q-table have a Q-value of zero, the agent will:", "choices": [ "Explore the environment by having ε value equal to 1.", "Exploit the information of the environment by having ε value equal to 1.", "Explore and exploit by having ε value equal to 0.5.", "Explore the environment by having ε value equal to 0." ], "answer": "Explore the environment by having ε value equal to 1.", "creator": "marianna tzortzi", "creationDate": "2020-11-18T22:29:01.966Z" }
@deeplizard
@deeplizard 3 жыл бұрын
Thanks, Marianna! Just added your question to deeplizard.com/learn/video/mo96Nqlo1L8 :)
@tingnews7273
@tingnews7273 5 жыл бұрын
What I learned:
1. Exploration and exploitation: the key is whether or not to choose the highest Q-value for a given state.
2. Balance: the epsilon greedy strategy (EGS).
3. EGS: we set an epsilon to decide, which is the probability of exploring. Setting it to 1 at first means 100% exploration (not choosing the best action, but a random one).
4. Greedy: the agent becomes greedy once it has learned the environment (no more exploration, just exploitation).
5. Before, I thought updating the Q-table was easy: every time you just update with the value you learned, -1 for example. Now I get it. First you must set a learning rate so you don't forget the old value. And the value is not only the reward for this step; it is the return including this step and the steps after (Bellman equation).
6. Max steps: we set the condition for stopping.
Questions:
1. Once the lizard knows the bird will kill it, does it need to explore that state again?
2. I don't get the idea of the chapter on updating the Q-value. In my view, the article would make sense without it.
3. I don't think the example is good enough, since max q(s',a') is zero.
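To address question 3 above: even when max q(s',a') is zero early in training, the update still moves the Q-value toward the immediate reward. A tiny worked example with assumed values (alpha = 0.7, gamma = 0.99, reward = -1, Q-table initialized to zero), not taken from the video:

```python
alpha, gamma = 0.7, 0.99                    # assumed hyperparameters, for illustration only
old_q, reward, max_next_q = 0.0, -1.0, 0.0  # early on, every entry in the Q-table is zero

# New Q-value = weighted sum of the old value and the learned value
new_q = (1 - alpha) * old_q + alpha * (reward + gamma * max_next_q)
print(new_q)  # -0.7: the estimate moves toward the reward even though max q(s',a') = 0
```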
@okihnjolossantos3047
@okihnjolossantos3047 3 жыл бұрын
How can we calculate the loss function with the optimal Q-value if we don't have it yet? I mean, that's why we are doing this whole thing, so we can get q*.
@hazzaldo
@hazzaldo 5 жыл бұрын
Outstanding video series. Many thanks for going through all this effort to teach us this intriguing concept. I have one question on this video, which I would deeply appreciate if someone could clarify: I didn't quite understand the logic behind the Q-value update formula. Specifically, I didn't understand how using a fixed learning rate, in such a way that it's always favouring one Q-value over the other (i.e. the old Q-value vs. the learned Q-value), helps. I can't see the logic behind how this will optimise and converge to the optimal Q-value, because it's always going to favour one Q-value over the other (i.e. give it a higher weight), even if the other Q-value might be yielding a better value over a number of iterations. The formula just seems to be acting as a bias rather than a learning optimisation formula. I hope this question makes sense, and many thanks in advance for any answers.
@garrett6064
@garrett6064 3 жыл бұрын
Why did we pick 0.99 as gamma? Why might I choose another number? If I'm getting ahead, just let me know that. Thank you so much! I really appreciate these videos.
@deeplizard
@deeplizard 3 жыл бұрын
The closer gamma is to 1, the more we want the agent to value long-term rewards, therefore influencing its current action. The closer gamma is to 0, the more we want the agent to value short-term rewards, again therefore influencing its current action but with a different priority. We have free rein to set gamma to any value in [0,1]; it really just depends on the environment and oftentimes will need to be experimented with to yield the best results. There's lots of good extra insight on this rate in this thread: stats.stackexchange.com/questions/221402/understanding-the-role-of-the-discount-factor-in-reinforcement-learning
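As a quick numerical illustration of this point (not from the video, just an assumed reward sequence), the same rewards are valued very differently under two settings of gamma:

```python
rewards = [1, 1, 1, 1, 1]  # hypothetical rewards at successive time steps

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return(rewards, 0.99))  # ~4.90: future rewards count almost as much as the first
print(discounted_return(rewards, 0.10))  # ~1.11: essentially only the immediate reward matters
```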
@garrett6064
@garrett6064 3 жыл бұрын
@@deeplizard thank you very much!!
@sarvagyagupta1744
@sarvagyagupta1744 4 жыл бұрын
The explanation is really AMAZING!! Top-notch!!! Please, can you make some videos on continuous systems, where we use policy gradients, rather than only discrete systems?
@deeplizard
@deeplizard 4 жыл бұрын
Thanks, sarvagya! May cover these topics in this series in the future.
@sarvagyagupta1744
@sarvagyagupta1744 4 жыл бұрын
@@deeplizard Thanks. I do have a question. You are using random samples from the replay memory to train the neural network, but I read some posts from Andrej Karpathy and Siraj Raval where they use sequential batches from the replay memory to train. Can you tell me why the difference?
@joaoramalho4107
@joaoramalho4107 3 жыл бұрын
First of all, I just want to congratulate you for this amazing series! It made everything so much easier to understand :) I just have a doubt. After watching this video over and over again, I still could not fully get the idea of minimizing the loss between the optimal Q-function and the 'current' Q-function. Since the calculation of this loss is a step that we do iteratively, doesn't that require already knowing the optimal Q-value? Because without it we can't compute the loss. The calculation of this loss is what confuses me.
@felicemorgigi1764
@felicemorgigi1764 Жыл бұрын
Yeah, an explanation for that would be great, please.
@iAndrewMontanai
@iAndrewMontanai 4 жыл бұрын
Finally someone explains this clearly. Thank you! But there is something I can't understand: is the max(q(s,a)) part just related to all the Q-values for our new state? We moved to the next cell (empty5), so in the table we're looking for the maximum value in that empty5 state's row?
@deeplizard
@deeplizard 4 жыл бұрын
Yes :)
@shahulrahman2516
@shahulrahman2516 4 жыл бұрын
Hi, wonderful video. Can you clarify my doubt regarding where we actually calculate the loss? During the process of calculating the new Q-value, I do not see anything about a loss, so I would like to know where we actually calculate it.
@aseemjha5481
@aseemjha5481 5 жыл бұрын
Hi, in your blog post, when you put the values into the equation, where did you get the value of γ = 0.99? I am a bit dumb.. it would be great if you could explain. Thanks for the videos.. I have finished your deep learning fundamentals course; it was absolutely great. Can you please add an RNN/LSTM video? It would also be great to see some KDD99, word2vec, and autoencoder content. Thank you for your effort.
@adimib
@adimib 4 жыл бұрын
Hello, I have a question: isn't the learning rate equation basically the same as SARSA, essentially making the algorithm on-policy in that it dynamically updates the Q-values? Or is the update made after the episode ends (which would be off-policy, as far as I am aware)? Great series by the way, it really helped me understand the basics of reinforcement learning!
@EricsShitposting
@EricsShitposting 4 жыл бұрын
great
@chrisfreiling5089
@chrisfreiling5089 5 жыл бұрын
Great videos! Thanks! I'm learning a lot (I think--but maybe not). Here's something I'm puzzled about. In the definition of "loss" there is an expression "q*(s,a)" and an expression "q(s,a)" with no subscript. But I'm not sure what policies are being used for each. There are three policies in the back of my mind. There is the true optimum policy which is what we are trying to find, but we don't know it yet. There is the current policy that takes into account the exploration rate, "epsilon". And there is the current approximation to the optimum policy, by which I mean the current policy with epsilon = 0. Could you please make it clear to me which policy goes with each of these q's? Thanks!
@deeplizard
@deeplizard 5 жыл бұрын
Hey Chris - q* is the optimal Q-function for the optimal policy. The * notation here represents the optimal policy. q with no subscript, as used in the definition of the loss, is the Q-function for the current policy (taking the exploration rate into account).
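To make the notation in this thread concrete, here is a sketch of the two quantities in standard Q-learning notation (a restatement, not a quote from the video):

```latex
q_*(s,a) = \max_{\pi} q_{\pi}(s,a) \quad \text{(Q-function under the optimal policy)}

\text{loss} = q_*(s,a) - q(s,a) \quad \text{(gap between the optimal value and the current estimate under the current policy)}
```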
@santoshvasaresearch419
@santoshvasaresearch419 2 жыл бұрын
At 6:39 you mentioned that you take the optimal Q-value of s prime and a prime. I got confused about how you will know the optimal Q-value.. I think you take whatever values are in the Q-table, according to the code from the next videos. Please let me know if I am right! I love your videos.. can't express it in words.
@hangchen
@hangchen 5 жыл бұрын
Edit: I got it from your blog. The Bellman equation is actually used as the second part of the Q-learning update function. Original question: The explanation is awesome! But I have a doubt here. The Bellman equation was introduced in the previous episode, but it seems that when we implement Q-learning, we just need the Q-value update function to update the Q-values in the Q-table? So where is the Bellman equation used in Q-learning, or is it just a conceptual idea? Thanks!
@sarthaksaxena3364
@sarthaksaxena3364 4 жыл бұрын
Hey Hang! If you know anything about DP, basically we are using the optimal substructure property to inform our decisions. The Bellman equation is only really important as a proof of why our update method converges to a maximum! Lemme know if you need more details :)
@RodgertheProgrammer
@RodgertheProgrammer 4 жыл бұрын
@@sarthaksaxena3364 I don't think you can call the Bellman equation a proof. It is built into our update rule. You could, however, prove the Bellman equations with contraction mappings.
@rodrigourquizo6235
@rodrigourquizo6235 3 жыл бұрын
{ "question": "What is the disadvantage of selecting the highest learning rate(1)?", "choices": [ "The agent won't learn about previous states", "The updating process will be very slow", "The agent will always get low rewards", "The new Q-value will always be the same" ], "answer": "The agent won't learn about previous states", "creator": "Rodrigo", "creationDate": "2020-08-18T18:47:07.528Z" }
@deeplizard
@deeplizard 3 жыл бұрын
Thanks, Rodrigo! I changed the wording just a bit, but your question has now been added to deeplizard.com/learn/video/mo96Nqlo1L8 :)
@sidddddddddddddd
@sidddddddddddddd 2 жыл бұрын
Great video, but throughout the video you were filling in the wrong state-action pair in the Q-table.
@MoreFoodNowPlease
@MoreFoodNowPlease 4 жыл бұрын
{ "question": "When iterating, the new Q-value is equal to a weighted sum of the old value and the...", "choices": [ "learned value", "starting value", "maximum value", "final value" ], "answer": "learned value", "creator": "glenn", "creationDate": "2020-04-08T09:57:57.971Z" }
@deeplizard
@deeplizard 4 жыл бұрын
Thanks, glenn! Just added your question to deeplizard.com/learn/video/mo96Nqlo1L8 :)
@louerleseigneur4532
@louerleseigneur4532 4 жыл бұрын
Thank you, thank you
@tamerzah
@tamerzah 3 жыл бұрын
{ "question": "What is the range of learning rate?", "choices": [ "0
@milindsawant8004
@milindsawant8004 Жыл бұрын
Amazing video series. I am trying to solve the tic-tac-toe game using Q-learning. Since it is a 2-player game, I treat the opponent as part of the environment. Two questions:
1. Once my Q-learning agent takes action a in state s, the opposing player (fixed-rule and non-learning) can take various actions. This creates multiple potential next states s' for my agent. When using the Bellman equation, should I be using the max values over all possible next states s' to update my Q-table for the previous state-action pair (s,a)? I tried this and was hoping that after sufficient learning, my Q-learning agent either wins or ties. It should never lose (based on my expertise in tic-tac-toe, which means nothing :)). Am I doing something wrong?
2. Should I still use the Bellman equation for the terminal state (win, loss, or tie), or simply assign the value of the win/loss/tie to the previous Q(s,a) which led to the terminal state? The reason I ask is that after running 100K games, Q(s,a) remains the same even after a win/loss. I guess it is approaching the optimal Q.
@cyrilgarcia2485
@cyrilgarcia2485 4 жыл бұрын
I wish you would visually show us how the game is played without piling on the theory. I want to see it visually.
@deeplizard
@deeplizard 4 жыл бұрын
Continue on with the course, and you will see it visually. First we learn the theory. Then we apply the theory to develop code. Then we visually see the game in action from the code we develop.
@ArmanAli-ww7ml
@ArmanAli-ww7ml 2 жыл бұрын
What is the relationship between the loss equation and the new-Q equation with the learning rate? Confused.
@Ayanwesha
@Ayanwesha 3 жыл бұрын
Value iteration is good for a static world. Now, what if the world changes with time? Can I implement the same approach there too?
@chrisfreiling5089
@chrisfreiling5089 5 жыл бұрын
The learning process seems to rely on both epsilon and alpha. The role of epsilon is clear, but alpha is more confusing. If I were making this stuff up, I would have picked alpha to depend on how many times that specific action was taken at the current state. But I'm not the inventor, so I am going to assume that we should choose alpha and epsilon in order to get our policy to converge to the optimum policy as quickly as possible. Is that right? Of course, this begs the question: How do we know that we will converge to the optimum policy?
@deeplizard
@deeplizard 5 жыл бұрын
That's an interesting thought in regards to alpha. Since alpha is a static parameter, though, it will not change from action to action... at least in the traditional way we're using it. In practice, there may be some implementation of a dynamic learning rate that could be applied in the way that you thought of. If not, then you will indeed be crowned the grand inventor of dynamic learning rates in Q-learning! 😋 In regards to choosing alpha and epsilon-- Remember, we set epsilon to 1 initially, and it decreases according to the exploration decay rate. So, the decay rate is actually the parameter that you'll be choosing/testing/tuning, rather than epsilon. Alpha is a hyperparameter as well that we must test and tune with the goal of converging to the optimal policy. We can measure "how close" we are to converging using the Bellman equation.
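A minimal sketch of the kind of epsilon decay schedule described here; the decay rate and bounds are illustrative values, not the course's exact settings:

```python
import numpy as np

max_epsilon, min_epsilon, decay_rate = 1.0, 0.01, 0.001  # assumed values for illustration

def epsilon_at(episode):
    # Exponential decay: start fully exploratory, then approach min_epsilon over time.
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)

print(epsilon_at(0))     # 1.0   -> pure exploration at the start
print(epsilon_at(1000))  # ~0.37 -> mostly exploiting, still some exploration
print(epsilon_at(5000))  # ~0.02 -> almost pure exploitation
```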
@ThePaintingpeter
@ThePaintingpeter 5 жыл бұрын
The chunking is fairly good so far. I like the progression. Great video and audio quality. The graphics are very good also. I'm hoping that the issue of calculating max value of future state/action pairs is well covered. I'm wondering what happens when you don't know what the actual future values will be. (obviously not like the 4x4 square problem)
@deeplizard
@deeplizard 5 жыл бұрын
Thanks, Peter! I'm glad you're liking the style. Have you checked out the blog yet? There, I walk through the concrete calculation of the max value of the next state-action pair you mentioned. deeplizard.com/learn/video/mo96Nqlo1L8 In regards to your thought about not knowing the future values-- The immediate rewards for each state-action pair will always be defined up front. The expected future return for any state-action pair (AKA the Q-value) is what we don't know up front. This is what the agent learns. For example, at the start, the future return for all state-action pairs is just zero, but over time, the agent gets better and better at estimating the future return as it learns and experiences more in the environment. Let me know if this helps clarify.
@NoahSpurrier
@NoahSpurrier 2 жыл бұрын
Would be nice to see a demo on a more complex example. The trivial example here is good for explaining the method, but it doesn’t give any sense that this method has power to deal with more complex problems.
@deeplizard
@deeplizard 2 жыл бұрын
Exploration/exploitation is also included in the code projects that come later in the course.
@srikrishnanarasimhan9350
@srikrishnanarasimhan9350 4 жыл бұрын
Why do we have to generate a random number and compare it with epsilon? Can't we instead take 0.5 as the threshold for epsilon and proceed?
@shahdalq7001
@shahdalq7001 3 жыл бұрын
If we say that epsilon means the agent has to explore, then based on what can we choose a random value for epsilon?? I mean, at the initial stage epsilon should be 1, as the environment is unknown, so what is the benefit of randomly choosing a value for epsilon??
@shahdalq7001
@shahdalq7001 3 жыл бұрын
What if the random value r equals epsilon exactly??
@rockapedra1130
@rockapedra1130 3 жыл бұрын
Yeeeeessssss!
@ruthvikanchuri2106
@ruthvikanchuri2106 4 жыл бұрын
When the agent knows which path is optimal, why does it not follow that path, and why should we calculate the loss? When we know the optimal path, we can just tell the agent directly to follow it, right? And if we don't know the optimal path initially, while training the agent, then how are we able to calculate the loss?
@LightningSpeedtop
@LightningSpeedtop 4 жыл бұрын
The agent doesn't know the optimal path when it begins. It basically explores its environment and updates the Q-values accordingly; from then on, it'll know its optimal path.
@ruthvikanchuri2106
@ruthvikanchuri2106 4 жыл бұрын
@@LightningSpeedtop thanks!
@LightningSpeedtop
@LightningSpeedtop 4 жыл бұрын
ruthvik anchuri anytime
@nabilbaalbaki7816
@nabilbaalbaki7816 5 жыл бұрын
I was a bit confused about how we got the Q-value update equation, specifically where we introduced the learning rate and the (1 - alpha) term. Never mind, amazing things happen when you write down an equation and simplify it :) Thanks for the video.
@deeplizard
@deeplizard 5 жыл бұрын
Excellent! Nothing is more satisfying than guiding yourself to an understanding that you weren't able to see before.
@jastremblay3299
@jastremblay3299 4 жыл бұрын
Do you need a table like that for every reinforcement learning problem?
@deeplizard
@deeplizard 4 жыл бұрын
Yes, you'll need a Q-table when using value iteration. This table will be generated by code. In later episodes, you'll see how this is done.
@peterhayman
@peterhayman 28 күн бұрын
🎉🎉🎉
@moritzpainz1839
@moritzpainz1839 4 жыл бұрын
But how do you know the max q(s',a'), since this is the next step?
@RodgertheProgrammer
@RodgertheProgrammer 4 жыл бұрын
You use the estimate of q(s', a') for your current time step. Initially these values are zero (or set to avoid some bias), but as you perform new actions and receive rewards, you will update each visited q(s,a). Max q(s',a') becomes more meaningful after your value estimates have improved.
@Arjun147gtk
@Arjun147gtk 4 жыл бұрын
❤️
@ArjunKalidas
@ArjunKalidas 4 жыл бұрын
Have you encountered the challenge of playing Rummy using reinforcement learning? Say, a 3-card rummy game, because actual rummy might require an enormous Q-table, and processing could be a pain!
@deeplizard
@deeplizard 4 жыл бұрын
I've not encountered this particular challenge, but continue with the series, and you will see that the use of *deep* Q-learning is much better for tasks that would otherwise require an enormous Q-table if we stuck with value iteration.
@ultragranata15
@ultragranata15 4 жыл бұрын
I have a dumb question: how can we measure the loss between the target Q-value and the Q-value for the state-action pair if we don't know the target Q-value? And, conversely, if we do know the target Q-value for the state-action pair, why don't we just put that value in the Q-table? That would trivially give us a loss of 0. I think I'm missing something; do we know the target Q-value in advance or not?
@LightningSpeedtop
@LightningSpeedtop 4 жыл бұрын
The target Q-value is calculated using the formula R + γ · max_a' Q(s', a').
@LightningSpeedtop
@LightningSpeedtop 4 жыл бұрын
You don't completely erase your previous Q-value because, suppose your environment is random and you accidentally land in the wrong state, you'll then erase your previous correct Q-values with wrong values, since you're not in the state you're meant to be in, defeating the entire purpose. So you have a learning rate that kind of determines the impact of the new value relative to the old one.
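Written out, the blending described in this reply is the standard Q-learning update rule (the same weighted sum covered in the video):

```latex
q^{\text{new}}(s,a) = (1-\alpha)\, q(s,a) + \alpha \left( R_{t+1} + \gamma \max_{a'} q(s',a') \right)
```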
@unamattina6023
@unamattina6023 Жыл бұрын
But we have 5 empty boxes, so why did you write 6?
@daliasobhy7590
@daliasobhy7590 2 жыл бұрын
It's really good, but I have a question:
@mohammedehsanurrahman1047
@mohammedehsanurrahman1047 4 жыл бұрын
But why a lizard?
@deeplizard
@deeplizard 4 жыл бұрын
_I could a tale unfold whose lightest word_ _Would harrow up thy soul._ 👻🦎
@escapefelicity2913
@escapefelicity2913 3 жыл бұрын
Get rid of the background noise
@RatedRudy
@RatedRudy 5 жыл бұрын
Your other video series are a lot more intuitive.... this whole RL series is a lot less intuitive.... you just cover the math behind it. For someone who doesn't know the topic well, the best thing to do is to start by going through the lizard game step by step, and then cover the math..... I was very much a fan of your videos, but honestly this whole series is not very well explained.
@subratswain6775
@subratswain6775 5 жыл бұрын
This video is a bit less intuitive than the earlier ones.
@deeplizard
@deeplizard 5 жыл бұрын
Yeah, it was a little more technical than last time. Once you simmer on the concepts and study a bit, it becomes more intuitive.