RL Course by David Silver - Lecture 4: Model-Free Prediction

340,225 views

Google DeepMind

Comments: 225
@yasseraziz1287
@yasseraziz1287 4 жыл бұрын
0:53 Outline
2:10 Introduction
5:06 Monte Carlo Learning
9:20 First Visit MC Policy Evaluation
14:55 Every Visit MC Policy Evaluation
16:23 Blackjack Example
26:30 Incremental Mean
29:00 Incremental MC Updates
34:00 Temporal Difference (TD) Learning
35:45 MC vs TD
39:50 Driving Home Example
44:56 Advantages and Disadvantages of MC vs. TD
53:35 Random Walk Example
58:04 Batch MC and TD
58:45 AB Example
1:01:33 Certainty Equivalence
1:03:32 Markov Property - Advantages and Disadvantages of MC vs. TD
1:04:50 Monte Carlo Backup
1:07:45 Temporal Difference Backup
1:08:14 Dynamic Programming Backup
1:09:10 Bootstrapping and Sampling
1:10:50 Unified View of Reinforcement Learning
1:15:50 TD(lambda) and n-Step Prediction
1:17:29 n-Step Return
1:20:22 Large Random Walk Example
1:22:53 Averaging n-Step Returns
1:23:55 lambda-return
1:28:52 Forward-view TD(lambda)
1:30:30 Backward-view TD(lambda) and Eligibility Traces
1:33:40 TD(lambda) and TD(0)
1:34:40 TD(lambda) and MC
@김준호-n1w9d
@김준호-n1w9d Жыл бұрын
thanks
@appendix77
@appendix77 7 жыл бұрын
Questions from students are of very high quality and they are one of the many reasons that make this lecture series particularly great.
@muratcan__22
@muratcan__22 5 жыл бұрын
yeah, especially the guy sitting close to the camera
@NganVu
@NganVu 4 жыл бұрын
2:03 Introduction 5:04 Monte-Carlo Learning 33:56 Temporal-Difference Learning 1:23:53 TD(lambda)
@scottoctave1353
@scottoctave1353 4 жыл бұрын
He is the Andrew Ng of RL.
@nirajabcd
@nirajabcd 4 жыл бұрын
Could not agree more. Andrew Ng got me into ML and David Silver into RL. And that book by Sutton is gold!
@ruslanponomariov3598
@ruslanponomariov3598 3 жыл бұрын
No, Andrew Ng is David Silver of ML.
@samjoshua192
@samjoshua192 3 жыл бұрын
@@ruslanponomariov3598 Wouldn't it be Richard Sutton, given he is considered the father of RL?
@diln5
@diln5 3 жыл бұрын
Andrew Ng has done lots of RL work himself!
@CONGKAMAEL
@CONGKAMAEL 3 жыл бұрын
@@nirajabcd very true; but isn't RL a subset of ML? I thought ML = {RL, Supervised, Unsupervised}
@ikechukwuokerenwogba8646
@ikechukwuokerenwogba8646 Жыл бұрын
The lecture 💯. The questions the students were asking 💯. My enjoyment of the whole thing 💯.
@azerotrlz
@azerotrlz 6 жыл бұрын
I love how he relates the form of the incremental mean to the meaning of RL updates.
@beepmcjeep5527
@beepmcjeep5527 4 жыл бұрын
This actually helped me a ton. I've been knee-deep in deep RL implementations that have just been handed to me, and I've skirted doing it by hand or understanding the formulas. As a result, reading papers has been confusing. Him mentioning that the incremental mean will be a common pattern made about 10 things click all at once for me.
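For anyone who wants that pattern written down, here is a minimal sketch (my own illustration, not code from the lecture; the function names are made up) of the incremental mean and the same "step size times error" form used for value updates:

```python
def incremental_mean(samples):
    """Running mean: mu_k = mu_{k-1} + (1/k) * (x_k - mu_{k-1})."""
    mu = 0.0
    for k, x in enumerate(samples, start=1):
        mu += (1.0 / k) * (x - mu)   # move the estimate toward the new sample
    return mu

def mc_update(value, sampled_return, alpha=0.1):
    """Same pattern with a constant step size, as in incremental MC:
    V(S_t) <- V(S_t) + alpha * (G_t - V(S_t))."""
    return value + alpha * (sampled_return - value)

print(incremental_mean([1.0, 2.0, 3.0, 4.0]))    # 2.5
print(mc_update(value=0.0, sampled_return=1.0))  # 0.1
```

Much of incremental MC and TD is a variation on that one line: move the old estimate a fraction of the way toward a new target.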
@jaidevshah2286
@jaidevshah2286 4 жыл бұрын
@@beepmcjeep5527 wake up
@beepmcjeep5527
@beepmcjeep5527 4 жыл бұрын
@@jaidevshah2286 😴
@scienceofart9121
@scienceofart9121 4 жыл бұрын
When you realize that every lecture corresponds to a chapter in Sutton and Barto's "Reinforcement Learning: An Introduction"
@phuongdh
@phuongdh 4 жыл бұрын
Sutton & Barto are the OGs, Silver is the GOAT.
@aidosmaulsharif9570
@aidosmaulsharif9570 4 жыл бұрын
"Introduction to Reinforcement Learning" is a Holy Bible of RL, everything you need to know is in this book. And David Silver helped us to interpret this book.
@_snwflk_
@_snwflk_ 4 жыл бұрын
My Prof actually uses both as teaching material xD
@BooBar2521
@BooBar2521 7 күн бұрын
​@@_snwflk_my prof too 😂
@yuxinzhang9403
@yuxinzhang9403 3 жыл бұрын
I think the reason looking one step into the future is better than using past predictions is that you can treat the next step as if it were the last step: it would then be the terminating state and the game would be over. We already know the current state didn't end the episode, so only a future state can, and that's why we always look toward the future for the terminating state.
@saminchowdhury7995
@saminchowdhury7995 5 жыл бұрын
38:29 what a great example to explain how TD is different from MC
@alexanderyau6347
@alexanderyau6347 7 жыл бұрын
I have watched the lecture 4 four times, and this is the clearest one. For non-English speakers, language is really an obstacle to understanding this lecture. Oh my poor English, I only got 6.5 at IELTS Listening.
@achyuthvishwamithra
@achyuthvishwamithra 3 жыл бұрын
The backup diagrams have made everything much clearer.
@nightfall4857
@nightfall4857 8 жыл бұрын
Thanks for the good lecture; it really helped me a lot. I have a suggestion for improving it: English subtitles. They would make the lecture more accessible to the hearing-impaired and to non-English speakers.
@minhuang8848
@minhuang8848 8 жыл бұрын
Problem being that subtitles cost. Professional captioning can easily run you 400 bucks for a lecture like this, so...
@isainstars
@isainstars 8 жыл бұрын
I'm sure that a lot of people wouldn't mind volunteering to provide English subtitles for these lectures (at least I wouldn't). It would be a nice alternative for that :)
@ThePaypay88
@ThePaypay88 5 жыл бұрын
@@minhuang8848 lmao wtf, the algorithm already provides free (albeit error-ridden) subtitles; that's more than enough. Just turn them on in the video settings.
@testxy5555
@testxy5555 4 жыл бұрын
love the example to demonstrate the difference between TD and MC!!
@daehankim2437
@daehankim2437 3 жыл бұрын
These lectures are gold!
@tacobellas
@tacobellas 8 жыл бұрын
These lectures are sooo helpful! Thank you very much for posting. They are really good :).
@joshash5944
@joshash5944 7 жыл бұрын
haha.. "You don't need to wait til you die to update your value function.. "
@Hestehviskeren
@Hestehviskeren 4 жыл бұрын
Thanks, TD learning!
@billykotsos4642
@billykotsos4642 3 жыл бұрын
Another meaty lecture ! This is pure treasure
@ErwinDSouza
@ErwinDSouza 6 жыл бұрын
At 1:27:47, David explains why we use the geometric λ weighting by saying it is memoryless, so "you can do TD(λ) for the same cost as TD(0)"... but I don't see how! TD(0) merely looks one step ahead, whereas TD(λ) has to look at all future steps (or, in the backward view, TD(0) merely updates the current state, while TD(λ) updates all previous states).
@totolamenace
@totolamenace 2 жыл бұрын
TD(λ), in the backward view, does update all previous steps, but it is about as efficient as TD(0), since each update is roughly equivalent to an update done with TD(0). The forward view is theoretical and is mentioned because it gives us some intuition about TD(λ), but it would be very inefficient to implement TD(λ) using the forward view: way too much computation, plus it requires finite episodes. That's why, in practice, we only use the backward view for implementations.
@totolamenace
@totolamenace 2 жыл бұрын
If you look at lecture 5, on Sarsa(λ), there's a comparison of the way Sarsa(λ) and one-step Sarsa update the states at around 1:06:00. You can see that while Sarsa(λ) does more computation per step, it also builds a much more precise evaluation of the state-action values. I think looking at that example might give you some intuition about why TD(λ) is about as efficient as TD(0) computation-wise.
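To make the cost discussion concrete, here is a rough tabular sketch of the backward view (my own illustration, with made-up state names, not code from the lecture). Each time step does one TD error plus a decay/update of the eligibility traces, so it touches every traced state per step but never has to wait for, or store, the rest of the episode:

```python
import collections

def td_lambda_episode(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    """episode: list of (state, reward, next_state) transitions; V: dict state -> value."""
    E = collections.defaultdict(float)            # eligibility traces
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]      # one-step TD error
        E[s] += 1.0                               # accumulate trace for the visited state
        for state in list(E):                     # push delta back to all traced states
            V[state] += alpha * delta * E[state]
            E[state] *= gamma * lam               # geometric decay of the traces
    return V

V = collections.defaultdict(float)
episode = [("A", 0.0, "B"), ("B", 1.0, "terminal")]
print(dict(td_lambda_episode(episode, V)))
```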
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
​​​​@@totolamenace What you say makes sense, but computational cost means cost per iteration and the backward view of TD(lambda) updates eligibility traces and many state values per iteration, so it is obviously more expensive that one-step TD. A different statement would be to say that "to obtain state value estimates of similar quality, the number of required operations in both methods is roughly the same", which is not the computational order anyway.
@danielc4267
@danielc4267 6 жыл бұрын
43:35 A student asked about the goal and actions of the driving-home example. I have read the book this example comes from, and here is my take on the question: the actions come from a policy determined by the person. In this case, the policy is getting home by driving a car along particular roads. The person could use other policies to get home, such as walking or driving along other roads. The goal of Monte Carlo or temporal-difference learning is to estimate how good this policy is. Remember, his policy involves driving along particular roads. The example shows just one sample update of the algorithms. To actually see how good his policy is, he needs to take the same route every day, obtain more data, and keep updating the algorithms.
@ze-speeches
@ze-speeches 6 жыл бұрын
Great explanation, Daniel! Thanks! Which book is the example from and would you recommend to read it parallel to working through this lectures?
@danielc4267
@danielc4267 6 жыл бұрын
Thank you. The book is Reinforcement Learning: An Introduction by Sutton and Barto. I think you can download the draft with some googling.
@danielc4267
@danielc4267 6 жыл бұрын
Yeah, I would recommend reading it along with the lectures. It's a good book. I find it helpful to learn the same concepts from different sources, especially difficult ones.
@ze-speeches
@ze-speeches 6 жыл бұрын
You are welcome! Thanks for the very fast reply! I already have the book by Sutton and Barto, but decided to start with the review article "Deep Reinforcement Learning - An Overview" by Yuxi Li. Also very recommendable!
@ze-speeches
@ze-speeches 6 жыл бұрын
Thanks for the recommendation! I also definitely love to learn from different sources and in addition to lectures and readings listen to podcasts. I love this one, but it is about machine learning fundamentals, not RL: ocdevel.com/podcasts/machine-learning Maybe you or another reader can recommend another podcast, being more specialized on RL or other more advanced topics? "TWiML & AI" seems to be quite promising... also want to try out "Machine Learning 101". Any experiences or recommendations on that?
@joaumgomes
@joaumgomes 8 жыл бұрын
1:34:04 TCHA TCHEW
@LucGendrot
@LucGendrot 6 жыл бұрын
Anticipation for this moment made this excellent lecture that much better.
@johnsmith-mp4pr
@johnsmith-mp4pr 5 жыл бұрын
What an excellent teacher.
@43SunSon
@43SunSon 4 жыл бұрын
I have to say, David Silver is slightly smarter than me.
@suryansudash0502
@suryansudash0502 4 жыл бұрын
Just slightly, yea, just like how they write "only" in the cheques 😂
@qingfangliu4687
@qingfangliu4687 4 жыл бұрын
24:52 I think the professor's explanation is a bit misleading on this question. The Sutton & Barto book, where the figure comes from, clearly states that the dealer follows a fixed strategy: stick on any sum of 17 or greater, and hit otherwise.
@larryjuang5643
@larryjuang5643 4 жыл бұрын
I think what David showed in this lecture is the V under the policy of sticking only above 20. This does not say anything about the quality of that policy, just the value function for any state under the said policy.
@Ashrak55
@Ashrak55 4 жыл бұрын
@@larryjuang5643 Liu was talking about the strategy of the dealer not the policy of the agent. I think the point Dr. Silver is trying to make in this case, is that it doesn't really matter what the strategy of the dealer is. In the book by Sutton and Barto the strategy is fixed, so that we can simulate it when writing the code for the environment.
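For concreteness, here is a sketch of how that fixed dealer rule is usually baked into the simulator when people reproduce the example (my own illustration; the card handling and helper names are assumptions, not code from the lecture or the book):

```python
import random

def draw_card():
    card = random.randint(1, 13)          # ace = 1, face cards count as 10
    return min(card, 10)

def hand_value(cards):
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:   # count an ace as 11 if it doesn't bust
        return total + 10
    return total

def dealer_plays(showing_card):
    cards = [showing_card, draw_card()]
    while hand_value(cards) < 17:         # fixed dealer policy: hit below 17, then stick
        cards.append(draw_card())
    return hand_value(cards)

print(dealer_plays(showing_card=10))
```

With the dealer behaviour fixed like this, the agent never needs the transition probabilities explicitly; it just samples episodes from the simulator, which is exactly the model-free setting.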
@mind-set
@mind-set 3 жыл бұрын
Thank you for these lectures. They are fantastic.
@somcho
@somcho 5 жыл бұрын
On the lambda-return at 1:23:56: if you think of N as being geometrically distributed with parameter lambda (as the Bernoulli probability of success) and temporarily think of G^(n)_t as a (deterministic) function of n, then G^lambda_t is simply the expected value of G^(N)_t. i.e. let N ~ Geo(lambda) and let G^(n)_t be fixed as a function of n; then G^lambda_t = E[ G^(N)_t ] ### Correction: N ~ Geo(lambda) above should be N ~ Geo(1-lambda) instead
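Written out with the corrected parameter, the observation is just that the λ-return weights form a geometric distribution (treating each G_t^{(n)} as fixed):

```latex
G_t^{\lambda}
  = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{\,n-1}\, G_t^{(n)}
  = \sum_{n=1}^{\infty} \Pr[N=n]\; G_t^{(n)}
  = \mathbb{E}\!\left[G_t^{(N)}\right],
  \qquad N \sim \operatorname{Geometric}(1-\lambda),\;\; \Pr[N=n] = (1-\lambda)\lambda^{\,n-1}.
```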
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
It is 1:23:56, your timestamp is incorrect! But the observation is correct up to the fact that N~Geo(1-lambda) instead of Geo(lambda). Nice observation!!
@somcho
@somcho Жыл бұрын
@@edmonddantes4705 wow 4years later i am looking at what I wrote and literally have no idea what the heck i was talking about.. haha for a second I thought I had been hacked and chatgpt has been posting youtube comments on my behalf! 😂 time flies ¯\_(ツ)_/¯
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
I guess it is good to know that somebody read it carefully even after four years 😂
@johnhefele5432
@johnhefele5432 6 жыл бұрын
You get your real award; having a day at work :D
@홍준표-p4q
@홍준표-p4q 6 жыл бұрын
I cannot clearly understand why TD(\lambda) is better than n-step TD. There seems to be no big difference between the RMS error figures of n-step TD (p. 37) and of TD(\lambda) (p. 42).
@pi3ni0
@pi3ni0 6 жыл бұрын
I have this intuition, but it may be wrong. TD(n) and TD(lambda) both have one hyperparameter to tune. I think the advantage of TD(\lambda) is that when you average over TD(0), TD(1), TD(n), you apply bagging technique, so it's easier to control bias/variance with lambda and avoid high variance.
@SaifUlIslam-di5xv
@SaifUlIslam-di5xv 4 жыл бұрын
"You don't need to wait until you die to update your value function" killed me. xD
@dawenwu9896
@dawenwu9896 6 жыл бұрын
I can't fully understand the slide at 1:21:45. What is offline TD learning? I would have thought offline TD should give the same result as online TD.
@akshayshrivastava97
@akshayshrivastava97 Жыл бұрын
It's been ~8 years since these lectures were uploaded and I'm yet to find an online resource that can rival Dr. Silver's delivery of this material.
@SphereofTime
@SphereofTime 3 ай бұрын
1:36:12 td lambda and montecarlo 1:36:22
@chandrasekharChebiyyam
@chandrasekharChebiyyam 5 жыл бұрын
Does anyone see a connection between a Non-Markov Decision Process (aka partially observable MDP) and a Hidden Markov model?
@herrizaax
@herrizaax 3 жыл бұрын
The answer to the question at 46:50 confuses me. I think the questioner was wondering whether MC would learn to drive safely, not whether TD would learn it. David gave the answer that TD would learn it. My answer would be: "MC would learn it after enough iterations, because sometimes you would crash". But hey, maybe I'm the one that misinterprets the question :)
@BGasperov
@BGasperov 4 жыл бұрын
An audience member at 46:14 asked: "Does it learn the same behavior? So if you have a car, and the car almost crashes, but it doesn't, ... When you're learning step by step you gonna learning the behavior that you're scared of crashing, but if you're learning from the very end you would never learn sth to encourage you to avoid your own manic behaviour." Although prof. Silver's explanation is correct, I think that's not what the answer student was looking for. It is not true that you would never learn sth to encourage you to avoid manic behavior (as the student is implying) - given enough episodes you would, as you would crash in many/most. MC does converge to the true value function (due to the law of large numbers).
@paulmarkus4917
@paulmarkus4917 4 жыл бұрын
Thanks for the clarification. So you are saying MC would also experience episodes terminating in crashes and hence learn to avoid crashes, just like in the case of TD.
@BGasperov
@BGasperov 4 жыл бұрын
@@paulmarkus4917 Yep.
@nikebless
@nikebless 4 жыл бұрын
@@BGasperov Does it mean TD would be able to learn to not crash without crashing, while Monte Carlo has to crash at least once to understand that it's bad?
@BGasperov
@BGasperov 4 жыл бұрын
@@nikebless No, TD would still need to crash (experience an extremely negative reward) in order to learn not to do it. Without crashing (experiencing a negative reward) how could it know that the action that leads to crashing is bad? Perhaps it could in model-based RL but that's another story.
@abeaumont10
@abeaumont10 5 жыл бұрын
Great series of videos!
@vaddadisairahul2956
@vaddadisairahul2956 3 жыл бұрын
I didn't understand the part where he explains how MC learning doesn't depend on how large the state space is. I need to understand it through an example
@eyeofhorus1301
@eyeofhorus1301 5 жыл бұрын
I'm falling in love with David Silver 😍😍
@kiuhnmmnhuik2627
@kiuhnmmnhuik2627 7 жыл бұрын
@1:07:00 He should've started from the graph rather than from the math.
@DhavalSalwala
@DhavalSalwala 5 жыл бұрын
Exactly
@mathavraj9378
@mathavraj9378 4 жыл бұрын
Can someone clarify this for me? Monte Carlo learning is basically just sampling a few of all the possible paths from a state and averaging their returns to estimate that state's value function.
@kuity
@kuity 6 жыл бұрын
I don't think the example at 43:00 was very clear... or at least, it wasn't explained what the policy and the return are here. I mean, I get the example and I get the theory, but not how the two of them relate.
@muxintian8110
@muxintian8110 Жыл бұрын
Really appreciate it!! Helped me understand model-free prediction.
@avinashanand2163
@avinashanand2163 7 жыл бұрын
What kind of policy do we have in model-free prediction? For example, if we have a policy of walking in a straight line from a point towards a goal, then the same set of states will be repeated in every iteration and hence the rewards should be the same in every episode. How would averaging help then?
@DerRumbringer
@DerRumbringer 7 жыл бұрын
I don't think you know where the goal is (if you do, you already have the optimal policy). Other than that, your environment will usually behave stochastically, and your sampling of starting states should be stochastic too.
@DerRumbringer
@DerRumbringer 7 жыл бұрын
I have to correct myself: as will be explained in lecture 5, acting greedily according to some policy will not lead to optimal exploration of the state space. This is because your trajectory is sampled from a probability distribution with much lower variance (deterministic actions limit you to one dimension of the probability matrix, the state-transition dimension, as opposed to both the state-transition dimension and the action dimension).
@ahmadalghooneh2105
@ahmadalghooneh2105 5 жыл бұрын
14:33 was very good :))
@sharinfoster6582
@sharinfoster6582 4 жыл бұрын
IDK who this person in the background is, but they cannot stop coughing and burping constantly. It's so irritating.
@cheungtyrone3615
@cheungtyrone3615 4 жыл бұрын
I have one question about "model free". If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change? Just like a human looking at a counter repeadtedly counting up from 0 to 9 and then counting down back to 0. A reward of 1 is given if the person predicts the next digit correctly. The person starts out thinking the state is just the number displayed and when he/she watches the counter behave longer, he/she come to realize that the sate is more complicated than that. Consider an extreme case, when the counter repeatedly shows all permutations of some sequence longer than two in some fixed order and none of Monte Carlo, TD(\lambda) and the version that incopoartes eligibility trace would ever generate a policy better than ramdom guess even if the entire period is shown as many times as needed as with the digit as the state representation, the best markov reward process it can fit is with ramdom transition between each state.
@BGasperov
@BGasperov 4 жыл бұрын
"If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change?" Knowledge about the environment == knowledge about state-to-state probability transitions and rewards. The agent perceives states nevertheless (unless we have a POMDP in which case it's a bit more complicated).
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
The answer is pretty simple. You are choosing a terrible state space that doesn't represent your problem/what is happening well in your example. Choosing a suitable state space is a big part of solving reinforcement learning tasks. Imagine that I am trying to code a bot that sells and buys different cryptos and the state space is whether it rained yesterday. Then the state space is terrible as a predictor. As a human, if you are not making progress in your task of optimising rewards, it is clear that you are not understanding your state space well, so you should include more and more stuff into it until you see how your returns get better. You have to seek to understand what a suitable state space for your problem is.
@12kenbutsuri
@12kenbutsuri 6 жыл бұрын
Amazing lecture, but the eligibility-trace part seemed too sudden and random to me.
@alvinphantomhive3794
@alvinphantomhive3794 4 жыл бұрын
hahaha yeah, looks like he was just running out of time.
@lex6709
@lex6709 2 жыл бұрын
The other algorithms from the previous lectures (MC, TD, DP, ...) = forward view; eligibility traces = backward view. It's a method to update values based on the past rather than on future (estimated or not) values, unlike the other algorithms we've learned so far.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
It's a super cool idea to make something online that seems hopelessly offline. I find it pretty smart, plus the motivation is very psychological.
@MattSiegel
@MattSiegel 9 жыл бұрын
16:35 lol: *actually* worries about breaking the casino! :D
@divyanshushekhar5118
@divyanshushekhar5118 5 жыл бұрын
seen movie "21"?
@vedgupta1686
@vedgupta1686 4 жыл бұрын
In the equation for TD(lambda), the weights have a (1 - lambda) factor. But wouldn't that make TD(1) have weights of 0?
@hieuphan8335
@hieuphan8335 4 жыл бұрын
When lambda = 1, the weights on the n-step returns are 0 for all finite n; all the weight goes onto the final return, obtained when the agent reaches the terminal state. That final return is the actual return accumulated until the end of the episode, which is the Monte Carlo estimate. Therefore TD(1) uses the same target as the Monte Carlo method.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Be wiser than just plugging in lambda=1 in the coefficients of an infinite series. Compute the limit when lambda-->1 instead by using that the episode terminates and hence from some point onwards, the discounted return will be fixed. Use summability of geometric series.
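Spelling that limit out for an episode that terminates at time T (so that G_t^{(n)} = G_t for every n ≥ T − t), following the suggestion above:

```latex
G_t^{\lambda}
  = (1-\lambda)\sum_{n=1}^{T-t-1} \lambda^{\,n-1}\, G_t^{(n)}
    \;+\; \lambda^{\,T-t-1}\, G_t
  \;\longrightarrow\; G_t
  \quad\text{as } \lambda \to 1 .
```

The finite sum is multiplied by (1 − λ), which tends to 0, while the weight on the tail term, λ^{T−t−1}, tends to 1, so the λ-return collapses to the Monte Carlo return G_t.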
@commelephenix4702
@commelephenix4702 3 жыл бұрын
1:07:00 mc vs td vs known dynamics
@p.z.8355
@p.z.8355 3 жыл бұрын
Can you learn the dynamics of the system during MC sampling and transform model-free in to model-based problem ?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Lecture 8
@Chitrang62
@Chitrang62 4 жыл бұрын
Maths is pure beauty
@brekk6118
@brekk6118 7 жыл бұрын
Thank you for the good lecture. One question on this video: on the Large Random Walk Example slide, I don't get the difference between online n-step TD and offline n-step TD. In short, what is the difference between online and offline?
@brekk6118
@brekk6118 7 жыл бұрын
I expected TD should be always Online. How can TD be offline in the picture?
@kiuhnmmnhuik2627
@kiuhnmmnhuik2627 7 жыл бұрын
In this context, online means "in place". In each episode (assuming there *are* episodes), you update all V(s) by using the V(s) themselves (that's "bootstrapping"). If during the episode you update, say, V(s1) and then you reuse V(s1) for updating V(s2), then you updated V(s2) not by using the old V(s1) but the update you just did to V(s1). If you first save all the V(s) in, say, OldV(s) and then update V(s) by only using OldV(s), then you're doing it Offline. If you, instead, use the same V vector and don't worry about the update interferences/overlappings, then you're doing it Online. Online is easier and faster and works well, AFAIK.
@jeffery1236
@jeffery1236 4 жыл бұрын
Online updates means the value function is updated at every step in the episode; Offline updates occur all at once at the end of the episode
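A tiny sketch of that distinction for TD(0) (my own illustration; the state names and step size are made up). The two functions differ only in when the increments are applied; on trajectories that revisit already-updated states, or bootstrap from them, they give different answers:

```python
def td0_online(episode, V, alpha, gamma=1.0):
    # Apply each TD(0) update immediately: later steps see the new values.
    for s, r, s_next in episode:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V

def td0_offline(episode, V, alpha, gamma=1.0):
    # Compute every increment against the old values, apply them at episode end.
    increments = {}
    for s, r, s_next in episode:
        increments[s] = increments.get(s, 0.0) + alpha * (r + gamma * V[s_next] - V[s])
    for s, inc in increments.items():
        V[s] += inc
    return V

# The episode revisits A, so online and offline produce different values.
episode = [("A", 1.0, "B"), ("B", 0.0, "A"), ("A", 0.0, "T")]
print(td0_online(episode, {"A": 0.0, "B": 0.0, "T": 0.0}, alpha=0.5))   # A: 0.25, B: 0.25
print(td0_offline(episode, {"A": 0.0, "B": 0.0, "T": 0.0}, alpha=0.5))  # A: 0.5,  B: 0.0
```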
@Elocess
@Elocess 5 жыл бұрын
Shouldn't TD(0) correspond to n=1 instead of n=0?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
It does. But TD(0) denotes TD(lambda) for lambda = 0. This corresponds with 1-step TD or to lambda = 0 in the eligibility trace paradigm. In summary, the 0 corresponds with lambda, not with n.
@jiapengzeng6651
@jiapengzeng6651 2 жыл бұрын
For the blackjack example, how do we know which state we get to if we choose to twist without knowing the transition probability?
@fabtjar
@fabtjar Жыл бұрын
I was thinking you'd have to write your own simulator. Did you figure out the answer a year later?
@yunlongsong7618
@yunlongsong7618 5 жыл бұрын
x1.0 speed is good for taking the lecture in for the first time; x1.5 speed is perfect for quickly reviewing this great lecture.
@arturbegyan5207
@arturbegyan5207 5 жыл бұрын
Yunlong Song x2 mate
@a11a12
@a11a12 3 жыл бұрын
"You don't need to wait until you die to update your value function" :)
@43SunSon
@43SunSon 4 жыл бұрын
@37:30, a question please. I don't understand the difference between the real return (G_t) and the estimated return (bootstrap, TD target). In lecture 2, the Bellman equation shows they are exactly the same, doesn't it: G_t = R_{t+1} + γ V(S_{t+1})? If G_t on the left is the REAL return, why is the right side of the equation an ESTIMATE now??? Can someone teach me? Thank you!
@leopoldmuller4486
@leopoldmuller4486 4 жыл бұрын
Estimated return: you already got the real R_{t+1} but estimate the rest (γ V(S_{t+1})); together that makes the estimated return. In Monte Carlo you are at the end of the episode; that's why you can use the real return: you have already received every real reward.
@43SunSon
@43SunSon 4 жыл бұрын
@@leopoldmuller4486 Leo, thanks for your help. Please allow me to ask a bit more. 1. I agree with you that the real return in MC is G_t = R_{t+1} + γR_{t+2} + ...; here you do have all the rewards, since you walk through to the end of the episode. Am I right? 2. I understand what you're saying: in TD you don't go to the end of the episode, so you have to estimate. But I don't understand the equation from lecture 2, the Bellman equation G_t = R_{t+1} + γ V(S_{t+1}). My understanding is that this V(S_{t+1}) is the real value, not an estimate; the derivation seems very clear. Did I miss anything? Is the notation exactly the same for the estimate and the real value? Could you check lecture 2? It was very early in that video. Thank you!
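As a side note, restating the lecture's definitions (nothing new here): the recursion holds for the random return, the value function equals it only in expectation, and the TD target then substitutes the current estimate V for the unknown v_π:

```latex
\begin{aligned}
G_t &= R_{t+1} + \gamma\,G_{t+1} && \text{(identity for the random return)}\\
v_\pi(s) &= \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
          = \mathbb{E}_\pi\!\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right] && \text{(Bellman expectation equation)}\\
\text{TD target} &= R_{t+1} + \gamma\, V(S_{t+1}) && \text{(one sample, with the estimate } V \approx v_\pi\text{)}
\end{aligned}
```

So G_t = R_{t+1} + γG_{t+1} always holds, but R_{t+1} + γV(S_{t+1}) is only a (biased) estimate of G_t, because V(S_{t+1}) stands in for the rewards that have not been observed yet.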
@vegahimsa3057
@vegahimsa3057 4 жыл бұрын
Does TD work in the game of Go? I understand it's very difficult to evaluate a board state without playing it out further. Perhaps TD (breadth search) but MC to evaluate the leaves.
@driyagon
@driyagon 4 жыл бұрын
nice way to spend coronavacation
@420_gunna
@420_gunna 5 ай бұрын
11:00 good question, David being pissy as usual about it Great lecture though
@akarshrastogi3682
@akarshrastogi3682 5 жыл бұрын
45:46 somebody gagged himself lol
@b00i00d
@b00i00d 4 жыл бұрын
LOL
@divyanshushekhar5118
@divyanshushekhar5118 5 жыл бұрын
1:10:13 How does DP bootstrap? Doesn't DP use the already-calculated real returns?
@alvinphantomhive3794
@alvinphantomhive3794 4 жыл бұрын
DP uses the Bellman equation for synchronous backups, which is a one-step lookahead over actions and successor states from the current state, immediately updating V_{k+1} using the value-function equation. Behind the equation, it uses the current value estimates of the successor states (not real returns) to keep updating V_{k+1} onwards.
@MinhVu-fo6hd
@MinhVu-fo6hd 7 жыл бұрын
For TD learning, how can an agent know what the estimated (look-ahead) value for state S_{t+1} should be? I know TD is biased because it has to guess what the next value is and use that to update the current value, but how does it guess?
@sandeepinuganti8791
@sandeepinuganti8791 7 жыл бұрын
I think we randomly initialize V(s) at first and then use the iterative update; that's why he said TD is sensitive to initial conditions.
@Erain-aus-78
@Erain-aus-78 7 жыл бұрын
Here the value function is stationary, without the time index attached
@Isaac668
@Isaac668 7 жыл бұрын
It depends on the specific environment. In a case like a game of chess, where the environment is largely predictable, then information like number of pieces taken can be used to predict your future value function. Otherwise, you would apply a random value function and then use different update functions to adjust that value for the probability distributions p(s'|a,s) and V(s). Essentially, if you don't want a random initialization, then a prior needs to be applied
@ThePrefect693
@ThePrefect693 7 жыл бұрын
At 1:03:00, I don't understand how TD(0) gives a value of 0.75. If we used the equations, then won't the value be different?
@baicenxiao7107
@baicenxiao7107 7 жыл бұрын
As far as I understand, the TD(0) will consider multiple episodes. In other words, the TD guess v[s(t+1)] based on multiple episodes, so we can get value 0.75. But I am not quite sure about that.
@neilslater8223
@neilslater8223 7 жыл бұрын
I think you are right, it would update V(A) to 0 based on what it had seen if the sequences were taken in that order in a single pass through using TD algorithm. I think you can still understand the 0.75 value as a TD-like analysis after the fact. Or you would get 0.75 if you repeated the same 8 sequences again and again.
@daviderrington2241
@daviderrington2241 6 жыл бұрын
This also puzzled me. I think the key is to realise that the equations shown at 1:03:00 are for the transition matrix and the reward structure rather than the value function. To find the value function, we should use a Bellman equation. Given the equation at 1:03:00 for transition matrix we see that P_{AB}=1 and the equation for reward structure shows R_{AB}=0. The Bellman equation will then give V(A) = R_{AB} + gamma * V(B). For no discounting we should set gamma = 1 and we also know that V(B) = 6/8 (I believe that's why he says V(B) is the easier one earlier in the video). Substituting then gives V(A) = 6/8. Incidentally, a similar argument can be used to get V(B). For example, we see that P_{BT}=1 where T is terminal state. We also establish that R_{BT}=6/8 using the equation at 1:03:00, and clearly V(T)=0. Again, we get V(B) = R_{BT} + gamma * V(T) and with gamma = 1, we get the desired result. Hopefully that makes sense!
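Condensing that calculation: batch TD converges to the solution of the maximum-likelihood (certainty-equivalence) MDP fitted to the 8 episodes, which here (with no discounting, γ = 1) is simply:

```latex
\hat{P}_{AB}=1,\quad \hat{R}_{AB}=0,\quad V(B)=\tfrac{6}{8},\qquad
V(A)=\hat{R}_{AB}+\gamma\,V(B)=0+1\cdot\tfrac{6}{8}=0.75 .
```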
@tejasduseja
@tejasduseja 2 жыл бұрын
Can someone explain how TD(1) corresponds to Monte Carlo? According to the formula, (1 - lambda) is zero for lambda = 1, and hence the estimated return should be zero for every state.
@vaishnav4035
@vaishnav4035 2 жыл бұрын
Since we have a terminal state T (i.e. the episode terminates after some k steps), the weight for that G_t^{(k)} is the sum of the geometric series from that point to n = infinity, which gives lambda^{k-1}. Hence we get MC for lambda = 1. This is what the prof says on the slide around 1:28:00 ("we assign the rest of the probability to the tail part (shaded portion) of the decaying distribution").
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
In the forward view, it is a limiting case, i.e. lim TD(lambda) when lambda-->1, motivated by the argument Vaishnav described. Calculate the geometric sum of the coefficients from k onwards and take limits when lambda-->1. You will see that, in the limit, all the mass concentrates on the tail of the series. In the backward view, it can be shown by direct manipulation and the proof can be found in the extra material in the slides.
@lahaale5840
@lahaale5840 6 жыл бұрын
I do not understand: how can MC and TD find the optimal policy without comparing all the possible policies, like DP does? Do they sample the majority of the policies?
@chandrasekharChebiyyam
@chandrasekharChebiyyam 6 жыл бұрын
As far as I understood, MC and TD converge to one and the same value estimate in an MDP; they are just different ways to get there. No one in their right mind should ideally use MC for estimating values in an MDP, though; I guess MC is mainly useful for non-Markov cases. Now, coming to the policy (which I believe is explained in a later lecture): if you remember the explanation from a previous lecture, where we alternate between value and policy updates, the greedy policy based on the current value estimate is always at least as good as the policy we started with. Hence we are bound to have a better policy once we have a better value-function estimate, and over many such alternating updates of value and policy we end up with the optimal policy and value.
@SuperOnlyP
@SuperOnlyP 5 жыл бұрын
What I understand is: the more episodes you sample, the more accurate the estimate becomes. You can see the blackjack example at 19:50; 10,000 episodes vs 500,000 episodes give different results. DP only works for small cases, as mentioned in lecture 2 or 3, I think. So for big problems MC and TD save you time by sampling.
@amornchotsingh3977
@amornchotsingh3977 6 жыл бұрын
Hey everybody, Is this course an advanced course? I'm looking at various courses on the internet and choosing the best one. I've come up with Udacity's course(the hefty price tag is holding me back) but I heard a lot of good things about this course. How would you recommend studying RL? Take udacity course with a lot of coding implementations or study UCL's course thoroughly and implement code from pseudocodes from the slides?
@Shottedsheriff
@Shottedsheriff 6 жыл бұрын
Sorry for the one-month delay. You've probably either picked something else or watched this one, but if not, I would recommend following the Practical_RL assignments on GitHub (from the Higher School of Economics, if I recall correctly). There you have suggested videos for each practical week plus tasks for it; it really helps to grasp the ideas in this course, and afterwards you can start Berkeley's great CS294-112.
@amornchotsingh3977
@amornchotsingh3977 6 жыл бұрын
@@Shottedsheriff Thank you very much. I've just bought a course from Udacity, I'll do CS294 and Udacity and the one you recommended me :)
@pariveshplayson
@pariveshplayson 3 жыл бұрын
What about states that are visited very infrequently, so that it takes a long time before they are visited for the first time in a given episode? Over many episodes that time will add up, right?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
In practice, there are many states that may never be visited when the state space is very large. For example, it is argued that TD-Gammon probably has not seen most of the configurations in backgammon and can still produce amazing results. Lectures 6, 7, 8 and 10 will partially answer your questions. When there are too many states, approaches that generalise to unseen states are preferred (parametrising the value function or the policy by a neural network, for example).
@ildaalushaj8627
@ildaalushaj8627 3 жыл бұрын
Are the first-visit MC and every-visit MC return estimates biased? And how can we prove that formally?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Both are unbiased estimators. Assume a policy pi is fixed. The experimental discounted returns you compute each episode for every state s in MC are samples of the random variable R = R_2 + gamma R_3 + gamma^2 R_4 + ... conditioned on S_1 = s. Note that the value of a state s is defined by v(s) = E[R | S_1 = s]. By the strong law of large numbers, for every state s the average experimental return converges to the true expectation almost surely, since the samples are independent and the rewards are uniformly bounded. That's the proof.
@윤섭-p9l
@윤섭-p9l 2 жыл бұрын
Can someone explain: when doing TD(0), how do we get the part of the TD target that is the estimated value of the next state, V(S_{t+1})?
@3DAYSGG
@3DAYSGG 3 жыл бұрын
Wouldn't TD be considered a model-based algorithm since we need an immediate reward and value of state s'?
@shauryamohan718
@shauryamohan718 3 жыл бұрын
for MDP you need to have Action-space, State-space, Reward function (or distribution), Probability of moving to state s' from s given action 'a', over all s,s' in State-space and all 'a' over Action-space. Right now we only have reward function and State-space, (maybe in the next lecture action space) but not the probabilities. I think that is the reason why we don't have a model and this is not a model-based algorithm.
@girigupta
@girigupta 4 жыл бұрын
For analysis purposes, can we treat every-visit MC as first-visit MC by considering each occurrence of the state as a separate episode?
@girigupta
@girigupta 4 жыл бұрын
The only thing that will differ is the averaging factor for the chain with multiple visits to the same state.
@sng5192
@sng5192 8 жыл бұрын
Thanks for the nice lecture.
@princehan920453024
@princehan920453024 3 жыл бұрын
Is alpha computed as 1/N(s) for each state while traversing the episodes, or is it just chosen manually as a constant between 0 and 1?
@duckdudette
@duckdudette 3 жыл бұрын
I don't know what example you're looking at but it generally varies. Both options you gave are viable in different cases.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
In the first case, your estimator is unbiased, in the second one, your estimator is biased. In both cases, there is convergence in the limit.
@divyanshushekhar5118
@divyanshushekhar5118 5 жыл бұрын
1:22:12 Is the offline method referring to Monte Carlo, because it needs to wait until episode termination to update the value function, while the temporal-difference method is online because it updates the value function every time it sees a state?
@alvinphantomhive3794
@alvinphantomhive3794 4 жыл бұрын
Exactly, that's the definition. You might want to look at slide 46; the lecture slides are linked in the description.
@lahaale5840
@lahaale5840 6 жыл бұрын
Why is the ace 10 on the dealer's side? It says an ace can be used as either 1 or 11, right?
@davypaul123
@davypaul123 4 жыл бұрын
What's the difference between Backup and Bootstrapping?
@suryapalaniswamy3726
@suryapalaniswamy3726 4 жыл бұрын
In my opinion I think backup is really for policy evaluation by taking the returns and finding out value functions and optimizing them. Bootstrapping is something to do with sampling the entire distribution and not using it as a whole
@hschoi3681
@hschoi3681 7 жыл бұрын
Why does considering (the immediate reward from next step + Value function of next step : TD(0)) means considering the MDP structure?..
@Isaac668
@Isaac668 7 жыл бұрын
Each step Rt you take increases noise and decreases the bias. If we take a probabilistic standpoint, the more data we have (i.e. the more iterations/steps of the model we take), the less the bias will have effect and the more the likelihood will take effect (on the proverbial posterior distribution). This effectively eliminates the bias, and since at TD(0), we've taken 0 steps, there's 0 noise and close to 0 bias, so that should converge to the true value
@justalaugher
@justalaugher 6 жыл бұрын
because in TD, you make instant response to what just happened after you took the action. But only in Markov environment can you assume that what happened just now was only related to your action and previous state
@santiagorestrepo5458
@santiagorestrepo5458 6 жыл бұрын
if you listen carefully you can hear a subtle fart at 0:36:58
@johnhefele5432
@johnhefele5432 6 жыл бұрын
heard dat
@shashankbhat9646
@shashankbhat9646 5 жыл бұрын
maybe he got a reward from monte carlo
@woncoh1
@woncoh1 5 жыл бұрын
50:24 1:35:57
@scienceofart9121
@scienceofart9121 4 жыл бұрын
What is the difference between online and offline update in 1:35:50
@DrN007
@DrN007 2 жыл бұрын
Last 10 minutes (backward view) need more elaboration.
@anmolagarwal999
@anmolagarwal999 Жыл бұрын
1:05:00 = Monte Carlo backups
@saminder987
@saminder987 4 жыл бұрын
41:45
@aneeshbhat7328
@aneeshbhat7328 5 жыл бұрын
Difference b/w expected return and empirical mean return? 8:28 I'm still unable to grasp the concept of expectation.
@manishandu
@manishandu 5 жыл бұрын
Empirical mean refers to the sample mean when you have the samples available; the expected value is a property of the distribution, not of a sample. In some cases you can infer the distribution from the samples, and then the empirical mean and the expected value coincide.
@sajanpoudel5707
@sajanpoudel5707 4 жыл бұрын
Who is that guy in the background suffering from COVID-19 five years ahead of time... his coughing frequency is so irritating...
@benvirusofthemind
@benvirusofthemind 3 жыл бұрын
you know 2020 has passed when everyone is thinking the same when they hear someone coughing.
@sandeepinuganti8791
@sandeepinuganti8791 7 жыл бұрын
He skipped the last 10 slides. I saw them on the course website and I think they cover something really important, but I'm having trouble understanding them. Can anybody tell me whether those slides are important or not?
@kiuhnmmnhuik2627
@kiuhnmmnhuik2627 7 жыл бұрын
The last 10 slides show that eligibility traces are truly equivalent to the slow counterpart of calculating updates as we do in MC and naive TD(lambda). If you understand eligibility traces intuitively, you should also "sort of" understand the proofs (I think the proofs are not very precise, though).
@ChumpyZhang
@ChumpyZhang 5 жыл бұрын
I have the same questions. I totally lost it when it came to backward view of TD(lambda).
@vman049
@vman049 5 жыл бұрын
Could someone please clarify why lambda = 1 corresponds to MC? If you plug in lambda = 1, the return just goes to 0, so how can it be equal to MC?
@paulmarkus4917
@paulmarkus4917 4 жыл бұрын
The formula for the return G_t^lambda should contain a term at the end for the final update (end of episode) including the terminal state: you need to add lambda^(T-t-1) G_t to the end of the return. In the case lambda = 1, that term becomes G_t, which is the same as MC.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Don't plug it in in the forward view, plug it in in the backward view and you can check it is MC (proof in the notes). In the forward case, it is true as a limit, i.e. MC is lim TD(lambda) when lambda-->1. Indeed, calculate the geometric sum of the coefficients from the end of the episode onwards and take limits when lambda-->1. You will see that all the mass concentrates on the tail of the series in the limit.
@II_superluminal_II
@II_superluminal_II 3 жыл бұрын
COOL
@astaragmohapatra9
@astaragmohapatra9 4 жыл бұрын
The class is very diverse, guessing from their voices
@divyanshushekhar5118
@divyanshushekhar5118 5 жыл бұрын
I don't understand the question asked at 11:34 and its answer
@manishandu
@manishandu 5 жыл бұрын
David explained 'first-visit Monte Carlo', in which the visit count for a state is incremented only the first time you visit it within an episode, and the return is computed from that first visit onward. Even if you visit the state again later in the same episode, the count is not updated again until the next episode. Thus you get N(s) = 10 for, say, 10 episodes, because you only count the first visit, and you estimate the value by averaging the returns from the first visit to state s. However, it has been shown that the first-visit average converges to the same value as the every-visit version, where a state visited multiple times in one episode contributes each time and the count is updated every time (unlike first visit). So if you study only first-visit MC, you are not missing anything.
@manishandu
@manishandu 5 жыл бұрын
In other words, the value-function estimate ends up the same for first-visit and every-visit MC.
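A bare-bones sketch of first-visit MC prediction as described above (my own illustration; the episode format and names are assumptions, not the lecture's code):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """episodes: list of trajectories, each a list of (state, reward) pairs."""
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)
    for episode in episodes:
        G = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):   # accumulate returns backwards
            G = reward + gamma * G
            first_visit_return[state] = G         # overwritten until only the earliest visit remains
        for state, G in first_visit_return.items():
            visit_count[state] += 1               # counted at most once per episode
            returns_sum[state] += G
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}

episodes = [[("A", 0.0), ("B", 1.0)], [("B", 0.0)]]
print(first_visit_mc(episodes))   # {'A': 1.0, 'B': 0.5}
```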
@lekhaemerald
@lekhaemerald 7 жыл бұрын
I am not really sure how he got the value of V(A) and V(B) at 59:54. Can someone please explain?
@neilslater8223
@neilslater8223 7 жыл бұрын
He explains it just after. It was to compare how MC and TD use their samples. You have 8 samples that include B; 6 of them get a final return of 1, and 2 of them get a return of 0. With MC sampling, the estimate for V(B) is 6/8 = 0.75. For TD it is the same, because state B is always the last one in the sequence, so we are bootstrapping from the terminal state, with V(terminal) = 0 by definition. The interesting one is V(A). With MC sampling, you look at the average return of any sequence that includes state A. There is only one such sequence and its total return is 0 + 0 = 0, making the MC estimate V(A) = 0. But the TD estimate can include the existing estimate for B when the sequence transitions from A to B... I think it is a little wrong in that this only works with the TD algorithm as shown if the A:0, B:0 sequence occurs as the last sample. But I think the idea in the lecture is to ignore the algorithms and compare the two intuitive answers, V(A) = 0 (for MC) vs V(A) = 0.75 (for TD), to get a sense of how MC and TD differ.
@hschoi3681
@hschoi3681 7 жыл бұрын
This is what I was looking for! Thanks! I also thought the sequences should be in the opposite order; otherwise, in the TD case, it would also give V(A) = 0.
@vikramwaradpande5059
@vikramwaradpande5059 5 жыл бұрын
This is exactly what I was looking for! I think the same, but seeing that no one else is pointing this out makes me wonder whether I'm missing something obvious.
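For anyone who wants to check the numbers, a quick sketch (my own, not from the lecture) of batch MC versus repeatedly sweeping TD(0) over the same 8 episodes. The repeated sweeps converge towards the certainty-equivalence answer V(A) ≈ 0.75 regardless of which episode comes last, which addresses the ordering concern above:

```python
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: average the (undiscounted) return following each state.
totals, counts = {}, {}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        G = sum(rewards[i:])
        totals[s] = totals.get(s, 0.0) + G
        counts[s] = counts.get(s, 0) + 1
mc = {s: totals[s] / counts[s] for s in totals}

# Batch TD(0): sweep the same episodes many times with a small step size.
V = {"A": 0.0, "B": 0.0, "T": 0.0}
for _ in range(5000):
    for ep in episodes:
        states = [s for s, _ in ep] + ["T"]
        for (s, r), s_next in zip(ep, states[1:]):
            V[s] += 0.01 * (r + V[s_next] - V[s])

print("MC:", mc)                                         # V(A) = 0.0, V(B) = 0.75
print("TD:", {s: round(v, 2) for s, v in V.items()})     # V(A) ~ 0.75, V(B) ~ 0.75
```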
@rahulramachandran3492
@rahulramachandran3492 2 жыл бұрын
Just realised from the example from 1:00:00 that I'm an MC kind of guy
@claudiocimarelli
@claudiocimarelli 7 жыл бұрын
Does anyone have more documentation on the question about the geometric decay for lambda at 1:28:00? Thanks
@kiuhnmmnhuik2627
@kiuhnmmnhuik2627 7 жыл бұрын
Let's say we are at time t and we have already visited state s three times. If the eligibility trace decays geometrically, then the decay factor is the same at each time step, so we don't need to maintain a history of the individual visits. Let's say E = E1 + E2 + E3, where Ei is the "partial eligibility" from the i-th visit of state s. If the decay factor is constant, we have new(E) = new(E1) + new(E2) + new(E3) = lambda E1 + lambda E2 + lambda E3 = lambda E, so we don't really need to remember E1, E2 and E3 separately. If the decay is, say, hyperbolic (i.e. 1/n after n steps), then the update is more difficult: new(E) = new(E1) + new(E2) + new(E3) = ? E1 + ? E2 + ? E3 = ??? We can't proceed without knowing when the visits took place. Say the right decay factors are l1, l2, l3 respectively; then new(E) = l1 E1 + l2 E2 + l3 E3, and unfortunately we need to remember the entire history of visits; we can't update E directly as before. Just out of curiosity, are you Italian?
@divyanshushekhar5118
@divyanshushekhar5118 5 жыл бұрын
23:56 What question is being asked?
@alvinphantomhive3794
@alvinphantomhive3794 4 жыл бұрын
I guess he asked why, on the "Dealer Showing" axis, the value function at 10 drops compared to 9. But again, I don't think this example is really necessary for understanding; he also knew that not everyone knows the gameplay, which makes the intuition hard to convey, and there are more examples with simpler concepts on the next slides.
@zeweichu550
@zeweichu550 7 жыл бұрын
Can someone give an example of the student's question at 1:15:39?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
I think pretty much any example you try is going to be a counterexample, because basically what the student proposes does not make any sense. Try with a state space with two or three states and some uniform policy with two different actions and I think it will work :). Let me explain the intuition, which is the fundamental thing here. We are assuming that our environment is an MDP, and MDPs are FORWARD processes. Now, any MDP has an associated Bellman expectation equation, which expresses the value of a state s in terms of the values of FUTURE STATES. In other words, you look at the future states where you might end up depending on which actions you choose and you express the value of a state as a weighted sum of values at those states. The Bellman expectation equation expresses PAST state values in terms of FUTURE state values. This is key. In particular, if we are at state S_t and sample an action with policy pi, R_{t+1} + gamma V(S_{t+1}) is an UNBIASED SAMPLE of V(S_t). This is the heart of TD learning, whose update rule uses the estimate R_{t+1} + \gamma V(S_{t+1}) for V(S_t) , where V(S_{t+1}) is now an approximation of the true value at state S_{t+1}, so the sample becomes biased. However, as you know, TD converges to the true value in the limit because the CLT. The estimates become more and more unbiased and accurate. Now, imagine that instead of updating like V(S_t)
@porkypig7170
@porkypig7170 7 ай бұрын
Lost me at eligibility traces, I'm too stupid to understand
@nikhilweee
@nikhilweee 2 жыл бұрын
1:34:05 cha-chow
@jakoblindrup
@jakoblindrup 2 жыл бұрын
So the chances of beating the house in blackjack look quite low :)
@burada8993
@burada8993 4 жыл бұрын
Isn't the formula wrong? The last exponent should be T-t-1 instead of T-1. It reads G_t = R_{t+1} + gamma * R_{t+2} + ... + gamma^{T-1} * R_T, but it should be G_t = R_{t+1} + gamma * R_{t+2} + ... + gamma^{T-t-1} * R_T.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Yeah, you are totally correct.