0:53 Outline
2:10 Introduction
5:06 Monte Carlo Learning
9:20 First Visit MC Policy Evaluation
14:55 Every Visit MC Policy Evaluation
16:23 Blackjack example
26:30 Incremental Mean
29:00 Incremental MC Updates
34:00 Temporal Difference (TD) Learning
35:45 MC vs TD
39:50 Driving Home Example
44:56 Advantages and Disadvantages of MC vs. TD
53:35 Random Walk Example
58:04 Batch MC and TD
58:45 AB Example
1:01:33 Certainty Equivalence
1:03:32 Markov Property - Advantages and Disadvantages of MC vs. TD
1:04:50 Monte Carlo Backup
1:07:45 Temporal Difference Backup
1:08:14 Dynamic Programming Backup
1:09:10 Bootstrapping and Sampling
1:10:50 Unified View of Reinforcement Learning
1:15:50 TD(lambda) and n-Step Prediction
1:17:29 n-Step Return
1:20:22 Large Random Walk Example
1:22:53 Averaging n-Step Return
1:23:55 lambda-return
1:28:52 Forward-view TD(lambda)
1:30:30 Backward view TD(lambda) and Eligibility Trace
1:33:40 TD(lambda) and TD(0)
1:34:40 TD(lambda) and MC
@김준호-n1w9d Жыл бұрын
thanks
@appendix777 жыл бұрын
Questions from students are of very high quality and they are one of the many reasons that make this lecture series particularly great.
@muratcan__225 жыл бұрын
yeah, especially the guy sitting close to the camera
Could not agree more. Andrew Ng got me into ML and David Silver into RL. And that book by Sutton is gold!
@ruslanponomariov35983 жыл бұрын
No, Andrew Ng is the David Silver of ML.
@samjoshua1923 жыл бұрын
@@ruslanponomariov3598 Wouldn't it be Richard Sutton, given he is considered the father of RL?
@diln53 жыл бұрын
Andrew Ng has done lots of RL work himself!
@CONGKAMAEL3 жыл бұрын
@@nirajabcd very true; but isn't RL a subset of ML? I thought ML = {RL, Supervised, Unsupervised}
@ikechukwuokerenwogba8646 Жыл бұрын
The lecture 💯. The questions the students were asking 💯. My enjoyment of the whole thing 💯.
@azerotrlz6 жыл бұрын
I love how he relates the form of the incremental mean to the meaning of RL updates.
@beepmcjeep55274 жыл бұрын
This actually helped me a ton. I've been knee-deep in Deep RL implementations that have just been handed to me, and I've skirted doing it by hand or understanding the formulas. As a result, reading papers has been confusing. Him mentioning that the incremental mean will be a common pattern made about 10 things click all at once for me.
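For anyone who wants to see the pattern in code, here is a minimal sketch (plain Python; the function name and variable names are mine, not from the lecture) of the incremental mean and the constant-step-size variant it motivates:

```python
# Incremental mean: new estimate = old estimate + step size * (sample - old estimate).
def incremental_mean(samples):
    mean = 0.0
    for k, x in enumerate(samples, start=1):
        mean += (x - mean) / k          # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
    return mean

# Replacing 1/k with a constant alpha gives the non-stationary form used in
# incremental MC and TD updates: V(s) <- V(s) + alpha * (target - V(s)).
```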
@jaidevshah22864 жыл бұрын
@@beepmcjeep5527 wake up
@beepmcjeep55274 жыл бұрын
@@jaidevshah2286 😴
@scienceofart91214 жыл бұрын
When you realize every lecture corresponds to a chapter in Sutton and Barto's "Reinforcement Learning: An Introduction"
@phuongdh4 жыл бұрын
Sutton & Barto are the OGs, Silver is the GOAT.
@aidosmaulsharif95704 жыл бұрын
"Introduction to Reinforcement Learning" is a Holy Bible of RL, everything you need to know is in this book. And David Silver helped us to interpret this book.
@_snwflk_4 жыл бұрын
My Prof actually uses both as teaching material xD
@BooBar25217 күн бұрын
@@_snwflk_my prof too 😂
@yuxinzhang94033 жыл бұрын
I think the reason looking one step into the future is better than using past predictions is that you can treat the next step as if it were the last step, i.e. as the terminating goal with the game over. We already know the current state didn't end the episode, so only a future state can, and that's why we always look into the future for the terminating state.
@saminchowdhury79955 жыл бұрын
38:29 what a great example to explain how TD is different from MC
@alexanderyau63477 жыл бұрын
I have watched lecture 4 four times, and this is the clearest one. For non-English speakers, language is a real obstacle to understanding this lecture. Oh, my poor English; I only got 6.5 in IELTS Listening.
@achyuthvishwamithra3 жыл бұрын
The backup diagrams have made everything much clearer.
@nightfall48578 жыл бұрын
Thanks for the good lecture. It really helped me a lot. I have a suggestion for improving it: English subtitles. They would make the lecture more accessible to the hearing-impaired and to non-English speakers.
@minhuang88488 жыл бұрын
Problem being that subtitles cost. Professional captioning can easily run you 400 bucks for a lecture like this, so...
@isainstars8 жыл бұрын
I'm sure that a lot of people wouldn't mind volunteering to provide English subtitles for these lectures (at least I wouldn't). It would be a nice alternative for that :)
@ThePaypay885 жыл бұрын
@@minhuang8848 lmao wtf, the algorithm already provides free (albeit error-ridden) auto-subtitles; that's more than enough. Just enable them in the video settings.
@testxy55554 жыл бұрын
love the example to demonstrate the difference between TD and MC!!
@daehankim24373 жыл бұрын
These lectures are gold!
@tacobellas8 жыл бұрын
These lectures are sooo helpful! Thank you very much for posting. They are really good :).
@joshash59447 жыл бұрын
haha.. "You don't need to wait til you die to update your value function.. "
@Hestehviskeren4 жыл бұрын
Thanks, TD learning!
@billykotsos46423 жыл бұрын
Another meaty lecture ! This is pure treasure
@ErwinDSouza6 жыл бұрын
At 1:27:47, David explains why we use a geometric series in λ by saying it is memoryless, so "you can do TD(λ) for the same cost as TD(0)"... but I don't see how! TD(0) merely looks one step ahead, whereas TD(λ) has to look at all future steps (or, in the backward view, TD(0) merely updates the current state, while TD(λ) updates all previous states).
@totolamenace2 жыл бұрын
TD(λ), in the backward view, does update all previous steps, but that's as effective as TD(0), as each update is roughly equivalent to the updates done with TD(0). The forward view is theoretical and mentioned because it gives us some intuition about TD(λ), but it would be very inefficient to implement TD(λ) using the forward view: way too much computation, plus it requires finite episodes. That's why, in practice, we only use the backward view for implementations.
@totolamenace2 жыл бұрын
If you look at lecture 5, on Sarsa(λ), there's a comparison of the way Sarsa(λ) and one-step Sarsa update the states at around 1:06:00. You can see that while Sarsa(λ) does more computation, it also builds a much more precise evaluation of the state-action values. I think looking at that example might give you some intuition for why TD(λ) is as efficient as TD(0) computation-wise.
@edmonddantes4705 Жыл бұрын
@@totolamenace What you say makes sense, but computational cost means cost per iteration, and the backward view of TD(lambda) updates eligibility traces and many state values per iteration, so it is obviously more expensive than one-step TD. A different statement would be to say that "to obtain state value estimates of similar quality, the number of required operations in both methods is roughly the same", which is not a statement about computational order anyway.
@danielc42676 жыл бұрын
43:35 A student asked about the goal and actions of the driving-home example. I have read the book this example comes from, and here is my take on the question: the actions come from a policy determined by the person. In this case, the policy is getting home by driving a car along particular roads. The person could use other policies to get home, such as walking or driving along other roads. The goal of Monte Carlo or Temporal Difference here is to estimate how good this policy is. Remember, the policy involves driving along particular roads. The example shows just one sample update of the algorithms. To actually see how good the policy is, he needs to take the same route every day, obtain more data, and keep updating the algorithms.
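To make the contrast concrete, here is a minimal sketch of the two prediction updates on a single trip; the state names, travel times, and initial predictions below are made up for illustration and are not the book's table.

```python
# Predicting "minutes remaining until home". Each transition is (state, minutes
# elapsed until the next state). Illustrative numbers only.
alpha, gamma = 0.1, 1.0
init = {"leave_office": 30.0, "reach_highway": 20.0, "exit_highway": 10.0, "home": 0.0}
trip = [("leave_office", 5), ("reach_highway", 18), ("exit_highway", 12)]

# TD(0): after each leg, move the prediction toward (elapsed time + next prediction).
V_td = dict(init)
successors = [s for s, _ in trip][1:] + ["home"]
for (s, t), s_next in zip(trip, successors):
    V_td[s] += alpha * (t + gamma * V_td[s_next] - V_td[s])

# MC: only after arriving home, move each visited state's prediction toward the
# actual remaining time that was observed from that state.
V_mc = dict(init)
remaining = 0
for s, t in reversed(trip):
    remaining += t
    V_mc[s] += alpha * (remaining - V_mc[s])
```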
@ze-speeches6 жыл бұрын
Great explanation, Daniel! Thanks! Which book is the example from, and would you recommend reading it in parallel with working through these lectures?
@danielc42676 жыл бұрын
Thank you. The book is Reinforcement Learning: An Introduction by Sutton and Barto. I think you can download the draft with some googling.
@danielc42676 жыл бұрын
Yeah, I would recommend reading it along with the lectures. It's a good book. I find it helpful to learn the same concepts from different sources, especially difficult ones.
@ze-speeches6 жыл бұрын
You are welcome! Thanks for the very fast reply! I already have the book by Sutton and Barto, but decided to start with the review article "Deep Reinforcement Learning - An Overview" by Yuxi Li. Also highly recommended!
@ze-speeches6 жыл бұрын
Thanks for the recommendation! I also love learning from different sources, and in addition to lectures and readings I listen to podcasts. I love this one, but it is about machine learning fundamentals, not RL: ocdevel.com/podcasts/machine-learning Maybe you or another reader can recommend another podcast that is more specialized in RL or other more advanced topics? "TWiML & AI" seems quite promising... I also want to try out "Machine Learning 101". Any experiences or recommendations on those?
@joaumgomes8 жыл бұрын
1:34:04 TCHA TCHEW
@LucGendrot6 жыл бұрын
Anticipation for this moment made this excellent lecture that much better.
@johnsmith-mp4pr5 жыл бұрын
What an excellent teacher.
@43SunSon4 жыл бұрын
I have to say, David Silver is slightly smarter than me.
@suryansudash05024 жыл бұрын
Just slightly, yeah, just like how they write "only" on cheques 😂
@qingfangliu46874 жыл бұрын
24:52 I think the professor's explanation of this question is a bit misleading. The Sutton & Barto book, where the figure comes from, clearly states that the dealer has a fixed strategy: stick on any sum of 17 or greater, and hit otherwise.
@larryjuang56434 жыл бұрын
I think what David showed in this lecture is the V under the policy of sticking only on 20 or above. That does not say anything about the quality of that policy, just the value function for any state under that policy.
@Ashrak554 жыл бұрын
@@larryjuang5643 Liu was talking about the strategy of the dealer, not the policy of the agent. I think the point Dr. Silver is trying to make here is that it doesn't really matter what the dealer's strategy is. In the book by Sutton and Barto the strategy is fixed so that we can simulate it when writing the code for the environment.
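For reference, the two fixed rules being discussed can be written in a couple of lines; this is only an illustrative sketch (the state encoding is my own), with the dealer rule from Sutton & Barto and the player policy evaluated in the lecture:

```python
def dealer_policy(dealer_sum: int) -> str:
    # Sutton & Barto's fixed dealer rule: stick on 17 or more, otherwise hit.
    return "stick" if dealer_sum >= 17 else "hit"

def player_policy(player_sum: int) -> str:
    # Policy evaluated in the lecture: stick only on 20 or 21, otherwise twist.
    return "stick" if player_sum >= 20 else "twist"
```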
@mind-set3 жыл бұрын
Thank you for these lectures. They are fantastic.
@somcho5 жыл бұрын
On the lambda-return at 1:23:56: if you think of N as geometrically distributed with parameter lambda (the Bernoulli probability of success) and temporarily think of G^(n)_t as a (deterministic) function of n, then G^lambda_t is simply the expected value of G^(N)_t. i.e. let N ~ Geo(lambda) and let G^(n)_t be fixed as a function of n; then G^lambda_t = E[ G^(N)_t ]. ### Correction: N ~ Geo(lambda) above should be N ~ Geo(1-lambda) instead.
@edmonddantes4705 Жыл бұрын
It is 1:23:56, your timestamp is incorrect! But the observation is correct up to the fact that N~Geo(1-lambda) instead of Geo(lambda). Nice observation!!
@somcho Жыл бұрын
@@edmonddantes4705 wow, 4 years later I'm looking at what I wrote and literally have no idea what the heck I was talking about... haha, for a second I thought I had been hacked and ChatGPT had been posting YouTube comments on my behalf! 😂 time flies ¯\_(ツ)_/¯
@edmonddantes4705 Жыл бұрын
I guess it is good to know that somebody read it carefully even after four years 😂
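Writing the observation out with the corrected parameterisation (a sketch, treating the n-step returns as fixed given the trajectory, as in the original comment):

```latex
% N ~ Geo(1-\lambda), i.e. P(N = n) = (1-\lambda)\,\lambda^{\,n-1} for n = 1, 2, \dots
\mathbb{E}\big[G_t^{(N)}\big]
  = \sum_{n=1}^{\infty} P(N = n)\, G_t^{(n)}
  = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)}
  = G_t^{\lambda}
```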
@johnhefele54326 жыл бұрын
"You get your real reward: having a day at work" :D
@홍준표-p4q6 жыл бұрын
I cannot clearly understand why TD(\lambda) is better than n-step TD. There seems to be no big difference between the RMS error figures of n-step TD (p. 37) and of TD(\lambda) (p. 42).
@pi3ni06 жыл бұрын
I have this intuition, but it may be wrong. n-step TD and TD(lambda) both have one hyperparameter to tune. I think the advantage of TD(lambda) is that when you average over the 1-step, 2-step, ..., n-step returns you are applying a bagging-like technique, so it's easier to control bias/variance with lambda and avoid high variance.
@SaifUlIslam-di5xv4 жыл бұрын
"You don't need to wait until you die to update your value function" killed me. xD
@dawenwu98966 жыл бұрын
I can't fully understand the slide at 1:21:45. What is offline TD learning? I would have thought offline TD should give the same result as online TD.
@akshayshrivastava97 Жыл бұрын
It's been ~8 years since these lectures were uploaded and I have yet to find an online resource that rivals Dr. Silver's delivery of this material.
@SphereofTime3 ай бұрын
1:36:12 TD(lambda) and Monte Carlo 1:36:22
@chandrasekharChebiyyam5 жыл бұрын
Does anyone see a connection between a Non-Markov Decision Process (aka partially observable MDP) and a Hidden Markov model?
@herrizaax3 жыл бұрын
The answer to the question at 46:50 confuses me. I think the questioner was wondering whether MC would learn to drive safely, not whether TD would learn it. David gave the answer that TD would learn it. My answer would be: "MC would learn it after enough iterations, because sometimes you would crash". But hey, maybe I'm the one that misinterprets the question :)
@BGasperov4 жыл бұрын
An audience member at 46:14 asked: "Does it learn the same behavior? So if you have a car, and the car almost crashes, but it doesn't... When you're learning step by step you're going to learn the behavior that you're scared of crashing, but if you're learning from the very end you would never learn something to encourage you to avoid your own manic behaviour." Although prof. Silver's explanation is correct, I think that's not the answer the student was looking for. It is not true that you would never learn something to encourage you to avoid manic behavior (as the student implies): given enough episodes you would, as you would crash in many/most of them. MC does converge to the true value function (due to the law of large numbers).
@paulmarkus49174 жыл бұрын
Thanks for the clarification. So you are saying MC would also experience episodes terminating in crashes and hence learn to avoid crashes, just like in the case of TD.
@BGasperov4 жыл бұрын
@@paulmarkus4917 Yep.
@nikebless4 жыл бұрын
@@BGasperov Does it mean TD would be able to learn to not crash without crashing, while Monte Carlo has to crash at least once to understand that it's bad?
@BGasperov4 жыл бұрын
@@nikebless No, TD would still need to crash (experience an extremely negative reward) in order to learn not to do it. Without crashing (experiencing a negative reward) how could it know that the action that leads to crashing is bad? Perhaps it could in model-based RL but that's another story.
@abeaumont105 жыл бұрын
Great series of videos!
@vaddadisairahul29563 жыл бұрын
I didn't understand the part where he explains how MC learning doesn't depend on how large the state space is. I need to understand it through an example
@eyeofhorus13015 жыл бұрын
I'm falling in love with David Silver 😍😍
@kiuhnmmnhuik26277 жыл бұрын
@1:07:00 He should've started from the graph rather than from the math.
@DhavalSalwala5 жыл бұрын
Exactly
@mathavraj93784 жыл бұрын
Can someone clarify this for me? Monte Carlo learning is basically just sampling a few of all the possible paths from a state and averaging their returns to estimate the value function of that state, right?
@kuity6 жыл бұрын
I don't think the example at 43:00 was very clear... or at least, it wasn't explained what the policy and the return are here. I mean, I get the example and I get the theory, but not how the two of them relate.
@muxintian8110 Жыл бұрын
Really appreciate it!! Helped me understand model-free prediction.
@avinashanand21637 жыл бұрын
What kind of policy do we have in model-free prediction? For example, if we have a policy of walking in a straight line from a point towards a goal, then the same set of states will be repeated in every iteration and hence the rewards should be the same in every episode. Then how would averaging help?
@DerRumbringer7 жыл бұрын
I don't think you know where the goal is (if you do, you already have the optimal policy). Other than that, your environment will usually behave stochastically and your sampling of starting states should be stochastic...
@DerRumbringer7 жыл бұрын
I have to correct myself: as will be explained in lecture 5, acting greedily according to some policy will not lead to optimal exploration of the state space. This is because your trajectory is sampled from a probability distribution with much lower variance (deterministic actions limit you to one dimension of the probability matrix, the state-transition dimension, as opposed to both the state-transition dimension and the action dimension).
@ahmadalghooneh21055 жыл бұрын
14:33 was very good :))
@sharinfoster65824 жыл бұрын
IDK who this person in the background is, but they cannot stop coughing and burping. It's so irritating.
@cheungtyrone36154 жыл бұрын
I have one question about "model-free". If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change? Just like a human looking at a counter repeatedly counting up from 0 to 9 and then counting down back to 0, where a reward of 1 is given if the person predicts the next digit correctly. The person starts out thinking the state is just the number displayed, and after watching the counter behave for longer, he/she comes to realize that the state is more complicated than that. Consider an extreme case, where the counter repeatedly shows all permutations of some sequence longer than two in some fixed order. None of Monte Carlo, TD(lambda), or the version that incorporates eligibility traces would ever generate a policy better than random guessing, even if the entire period is shown as many times as needed, because with the digit as the state representation, the best Markov reward process it can fit has random transitions between states.
@BGasperov4 жыл бұрын
"If the agent starts out without any knowledge about the environment, how is it even able to know what constitutes a state change?" Knowledge about the environment == knowledge about state-to-state probability transitions and rewards. The agent perceives states nevertheless (unless we have a POMDP in which case it's a bit more complicated).
@edmonddantes4705 Жыл бұрын
The answer is pretty simple: you are choosing a terrible state space that doesn't represent your problem (what is actually happening) well in your example. Choosing a suitable state space is a big part of solving reinforcement learning tasks. Imagine I am trying to code a bot that buys and sells different cryptos and the state space is whether it rained yesterday; then the state space is a terrible predictor. As a human, if you are not making progress in your task of optimising rewards, it is clear that you are not understanding your state space well, so you should include more and more information in it until you see your returns improve. You have to seek to understand what a suitable state space for your problem is.
@12kenbutsuri6 жыл бұрын
Amazing lecture, but the eligibility trace part seemed too sudden and random to me...
@alvinphantomhive37944 жыл бұрын
hahaha yeah, looks like he was just running out of time.
@lex67092 жыл бұрын
The other algorithms from the previous lectures (MC, TD, DP, ...) take the forward view; eligibility traces give the backward view. It's a method to update the values of previously visited states from what happens in the present, instead of looking forward to future (estimated or actual) values like the other algorithms we've learned so far.
@edmonddantes4705 Жыл бұрын
It's a super cool idea to make something online that seems hopelessly offline. I find it pretty smart, plus the motivation is very psychological.
@MattSiegel9 жыл бұрын
16:35 lol: *actually* worries about breaking the casino! :D
@divyanshushekhar51185 жыл бұрын
Have you seen the movie "21"?
@vedgupta16864 жыл бұрын
In the equation for TD(lambda), the weights have a (1-lambda) factor. But wouldn't that make TD(1) have weights of 0?
@hieuphan83354 жыл бұрын
When lambda = 1, the weights on the n-step returns are 0 for all finite n, and all the weight goes to the return obtained when the agent reaches the terminal state. That final return is the actual return accumulated until the end of the episode, which is the Monte Carlo target. Therefore TD(1) uses the same target as the Monte Carlo method.
@edmonddantes4705 Жыл бұрын
Be wiser than just plugging lambda = 1 into the coefficients of an infinite series. Instead, compute the limit as lambda → 1, using the fact that the episode terminates and hence, from some point onwards, every n-step return equals the full return. Then use the summability of the geometric series.
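A short sketch of that limit for an episode terminating at time T (this is the decomposition from the course slides / Sutton & Barto, written out):

```latex
% All n-step returns with n >= T - t equal the full return G_t, so
G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{\,n-1} G_t^{(n)}
                \;+\; \lambda^{\,T-t-1} G_t .
% As \lambda \to 1, the first term vanishes and the tail weight
% \lambda^{\,T-t-1} \to 1, so the target tends to the Monte Carlo return G_t.
```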
@commelephenix47023 жыл бұрын
1:07:00 MC vs TD vs known dynamics
@p.z.83553 жыл бұрын
Can you learn the dynamics of the system during MC sampling and so turn the model-free problem into a model-based one?
@edmonddantes4705 Жыл бұрын
Lecture 8
@Chitrang624 жыл бұрын
Maths is pure beauty
@brekk61187 жыл бұрын
Thank you for the good lecture. One question about this video: on the Large Random Walk Example slide, I don't get the difference between online n-step TD and offline n-step TD. In short, what is the difference between online and offline?
@brekk61187 жыл бұрын
I expected TD to always be online. How can TD be offline in the picture?
@kiuhnmmnhuik26277 жыл бұрын
In this context, online means "in place". In each episode (assuming there *are* episodes), you update all V(s) by using the V(s) themselves (that's "bootstrapping"). If during the episode you update, say, V(s1) and then reuse V(s1) when updating V(s2), then you updated V(s2) using not the old V(s1) but the update you just made to V(s1). If you first save all the V(s) in, say, OldV(s) and then update V(s) only from OldV(s), you're doing it offline. If instead you use the same V vector and don't worry about the updates interfering/overlapping, you're doing it online. Online is easier and faster and works well, AFAIK.
@jeffery12364 жыл бұрын
Online updates mean the value function is updated at every step within the episode; offline updates are accumulated and applied all at once at the end of the episode.
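A minimal sketch of the distinction for TD(0) (Python, assuming an episode is a list of (s, r, s_next) transitions and V is a dict of state values; the names are mine):

```python
def td0_online(V, episode, alpha, gamma):
    # Each update is applied immediately, so later updates within the same
    # episode already see the new values.
    for s, r, s_next in episode:
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

def td0_offline(V, episode, alpha, gamma):
    # Targets are computed with the values frozen at the start of the episode;
    # all increments are applied in one batch at the end.
    increments = {}
    for s, r, s_next in episode:
        increments[s] = increments.get(s, 0.0) + alpha * (r + gamma * V[s_next] - V[s])
    for s, delta in increments.items():
        V[s] += delta
```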
@Elocess5 жыл бұрын
Shouldn't TD(0) correspond to n=1 instead of n=0?
@edmonddantes4705 Жыл бұрын
It does. But TD(0) denotes TD(lambda) for lambda = 0, which corresponds to 1-step TD, i.e. to lambda = 0 in the eligibility-trace paradigm. In summary, the 0 refers to lambda, not to n.
@jiapengzeng66512 жыл бұрын
For the blackjack example, how do we know which state we get to if we choose to twist without knowing the transition probability?
@fabtjar Жыл бұрын
I was thinking you'd have to write your own simulator. Did you figure out the answer a year later?
@yunlongsong76185 жыл бұрын
x1.0 speed is good for taking the lecture in for the first time; x1.5 speed is perfect for quickly reviewing this great lecture.
@arturbegyan52075 жыл бұрын
Yunlong Song x2 mate
@a11a123 жыл бұрын
"You don't need to wait until you die to update your value function" :)
@43SunSon4 жыл бұрын
@37:30, a question please. I don't understand the difference between the real return (G_t) and the estimated return (bootstrap, TD target). In lecture 2, the Bellman equation seems to say they are exactly the same, doesn't it: G_t = R_{t+1} + gamma * V(S_{t+1})? If the left-hand G_t is the REAL return, why is the right side an ESTIMATE now? Can someone teach me? Thank you!
@leopoldmuller44864 жыл бұрын
Estimated return: you already have the real R_{t+1} but you estimate the rest (gamma * V(S_{t+1})); together that gives the estimated return. In Monte Carlo you are at the end of the episode, which is why you can use the real return: you have already received every real reward.
@43SunSon4 жыл бұрын
@@leopoldmuller4486 Leo, thanks for your help. Please allow me to ask more about this. 1. I agree that the real return in MC is G_t = R_{t+1} + gamma * R_{t+2} + ...; here you have all the rewards, since you do walk through to the end. Am I right? 2. I see what you're saying: in TD you don't go to the end of the episode, so you need to estimate. But I don't understand the Bellman equation from lecture 2, G_t = R_{t+1} + gamma * V(S_{t+1}). My understanding is that this V(S_{t+1}) is the real value, not an estimate; the derivation seems very clear. Did I miss anything? Is the notation exactly the same for the estimate and the real value? Could you check lecture 2? It's very early in that video. Thank you!
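For anyone following this thread, here is the distinction written out (my summary of the lecture-2 relations, not a quote from the slides):

```latex
% The return is a random variable, fully observed only at the end of the episode:
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
% The Bellman expectation equation is an identity between expectations, not returns:
v_\pi(s) = \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
% The TD target is a one-sample estimate of that right-hand side, with the current
% estimate V in place of the true v_\pi (hence the bias):
R_{t+1} + \gamma\, V(S_{t+1})
```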
@vegahimsa30574 жыл бұрын
Does TD work in the game of Go? I understand it's very difficult to evaluate a board state without playing it out further. Perhaps TD (breadth search) but MC to evaluate the leaves.
@driyagon4 жыл бұрын
nice way to spend coronavacation
@420_gunna5 ай бұрын
11:00 good question, David being pissy about it as usual. Great lecture though.
@akarshrastogi36825 жыл бұрын
45:46 somebody gagged himself lol
@b00i00d4 жыл бұрын
LOL
@divyanshushekhar51185 жыл бұрын
1:10:13 How does DP bootstrap? Doesn't DP use the already calculated real returns?
@alvinphantomhive37944 жыл бұрын
DP uses the Bellman equation for synchronous backups, which is a full one-step look-ahead over actions and successor states from the current state, immediately updating the value v_{k+1} via the value-function equation. Behind the equation, it uses the current value estimates of the successor states (not real returns) to produce v_{k+1}, which is why it bootstraps.
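Concretely, the iterative policy-evaluation backup from lecture 3 looks like this; the v_k(s') term on the right is the current estimate, which is the bootstrapping:

```latex
v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
             \left( \mathcal{R}_s^{a} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a}\, v_k(s') \right)
```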
@MinhVu-fo6hd7 жыл бұрын
For TD learning, how can an agent know what the estimated (look-ahead) value for state S_{t+1} should be? I know TD is biased because it has to guess what the next value is and use it to update the current value, but how would it guess?
@sandeepinuganti87917 жыл бұрын
I think we randomly initialize the V(St) at first, then we use the iterative update, that's why he said TD is sensitive to initial conditions.
@Erain-aus-787 жыл бұрын
Here the value function is stationary, without the time index attached
@Isaac6687 жыл бұрын
It depends on the specific environment. In a case like chess, where the environment is largely predictable, information like the number of pieces taken can be used to shape your initial value function. Otherwise, you would start from a random value function and then use the update rules to adjust it toward consistency with p(s'|a,s) and V(s). Essentially, if you don't want a random initialization, then a prior needs to be applied.
@ThePrefect6937 жыл бұрын
At 1:03:00, I don't understand how TD(0) gives a value of 0.75. If we used the equations, then won't the value be different?
@baicenxiao71077 жыл бұрын
As far as I understand, TD(0) would consider multiple episodes. In other words, TD guesses V(S_{t+1}) based on multiple episodes, so we can get the value 0.75. But I am not quite sure about that.
@neilslater82237 жыл бұрын
I think you are right: it would update V(A) to 0 based on what it had seen, if the sequences were taken in that order in a single pass of the TD algorithm. I think you can still understand the 0.75 value as a TD-like analysis after the fact. Or you would get 0.75 if you repeated the same 8 sequences again and again.
@daviderrington22416 жыл бұрын
This also puzzled me. I think the key is to realise that the equations shown at 1:03:00 are for the transition matrix and the reward structure rather than for the value function. To find the value function, we should use a Bellman equation. The equation at 1:03:00 for the transition matrix gives P_{AB} = 1, and the equation for the reward structure gives R_{AB} = 0. The Bellman equation then gives V(A) = R_{AB} + gamma * V(B). With no discounting we set gamma = 1, and we also know that V(B) = 6/8 (I believe that's why he says V(B) is the easier one earlier in the video). Substituting then gives V(A) = 6/8. Incidentally, a similar argument can be used to get V(B): we see that P_{BT} = 1, where T is the terminal state, R_{BT} = 6/8 using the equation at 1:03:00, and clearly V(T) = 0. Again, V(B) = R_{BT} + gamma * V(T), and with gamma = 1 we get the desired result. Hopefully that makes sense!
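For anyone who wants to check the two answers numerically, a small sketch of the batch comparison (the eight episodes are the ones on the slide; the encoding and code structure are my own):

```python
from collections import defaultdict

# AB example: the 8 observed episodes, each a list of (state, reward) pairs.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

# Batch Monte Carlo: average the observed (undiscounted) return from each state.
returns = defaultdict(list)
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))
V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(V_mc)  # {'A': 0.0, 'B': 0.75}

# Batch TD / certainty equivalence: first fit the MDP the data implies
# (A -> B with reward 0 every time; B terminates with mean reward 6/8),
# then solve it: V(B) = 6/8 and V(A) = 0 + V(B) = 0.75.
```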
@tejasduseja2 жыл бұрын
Can someone explain how TD(1) corresponds to Monte Carlo? According to the formula, (1-lambda) will be zero for lambda = 1, and hence the estimated return should be zero for every state.
@vaishnav40352 жыл бұрын
Since we have a terminal state T (i.e. the episode terminates after some k), the weight on that G_t^{(k)} will be the sum of the geometric series from that point to n = infinity, which gives lambda^{k-1}. Hence we get MC for lambda = 1. This is what the prof says on the slide at 1:28:00 ("we assign the rest of the probability to the tail part (shaded portion) of the decaying distribution").
@edmonddantes4705 Жыл бұрын
In the forward view, it is a limiting case, i.e. lim TD(lambda) when lambda-->1, motivated by the argument Vaishnav described. Calculate the geometric sum of the coefficients from k onwards and take limits when lambda-->1. You will see that, in the limit, all the mass concentrates on the tail of the series. In the backward view, it can be shown by direct manipulation and the proof can be found in the extra material in the slides.
@lahaale58406 жыл бұрын
I do not understand how MC and TD can find the optimal policy without comparing all the possible policies like DP does. Do they sample the majority of the policies?
@chandrasekharChebiyyam6 жыл бұрын
As far as I understood, MC and TD converge to one and the same value estimate in an MDP. They are just different ways to get there. No one in their right mind should ideally be using MC for estimating value in an MDP, though; I guess MC is only useful for non-MDP cases. Now, coming to the policy (which I believe is explained in a later lecture): if you remember the explanation from one of the previous lectures, where we alternate between value and policy updates, the greedy policy derived from the current best value estimate is always better than the policy we started with. Hence, we are bound to have a better policy than before whenever we have a better value function estimate. And over multiple such alternating updates of value and policy, we end up with the optimal policy and value.
@SuperOnlyP5 жыл бұрын
What I understand is that the more episodes you sample, the more accurate the estimate is. You can see it in the blackjack example at 19:50: 10,000 episodes vs 500,000 episodes give different results. DP only works for small cases, as mentioned in lecture 2 or 3, I think. So for big cases MC and TD save you time by sampling.
@amornchotsingh39776 жыл бұрын
Hey everybody, is this an advanced course? I'm looking at various courses on the internet and choosing the best one. I've come across Udacity's course (the hefty price tag is holding me back), but I've heard a lot of good things about this one. How would you recommend studying RL? Take the Udacity course with a lot of coding implementations, or study UCL's course thoroughly and implement code from the pseudocode in the slides?
@Shottedsheriff6 жыл бұрын
Sorry for the one-month delay. You've probably either picked something else or watched this one, but if not, I would recommend following the Practical_RL assignments on GitHub from the Higher School of Economics (if I recall correctly). There you have suggested videos for each "practical week" and tasks for it; it really helps to grasp the ideas in this course and later start Berkeley's great CS294-112.
@amornchotsingh39776 жыл бұрын
@@Shottedsheriff Thank you very much. I've just bought a course from Udacity; I'll do CS294, Udacity, and the one you recommended to me :)
@pariveshplayson3 жыл бұрын
What about states that are visited very infrequently, so it takes a long time before they are first visited in a given episode? Over many episodes, that time will add up, right?
@edmonddantes4705 Жыл бұрын
In practice, there are many states that will never be visited when the state space is very large. For example, they argue that TD-Gammon probably has not seen most of the configurations in Backgammon and can still produce amazing results. Lectures 6, 7, 8, and 10 will partially answer your question. When there are too many states, approaches that generalise to unseen states are preferred (parametrising the value function or the policy by a neural network, for example).
@ildaalushaj86273 жыл бұрын
Are the first-visit MC and every-visit MC return estimates biased? And how can we prove that formally?
@edmonddantes4705 Жыл бұрын
First-visit MC gives an unbiased estimator; every-visit MC is biased for a finite number of episodes, though the bias vanishes asymptotically and both are consistent. Here is the argument for first-visit MC. Assume a policy pi is fixed. The experimental discounted returns you compute, one per episode, for a state s are independent samples of the random variable R = R_2 + gamma R_3 + gamma^2 R_4 + ... conditioned on S_1 = s. Note that the value of a state s is defined by v(s) = E[R], where E denotes the expectation, so each sample has expectation v(s) and the average is unbiased. By the strong law of large numbers, for every state s the average experimental return converges to the real expectation almost surely, since the samples are independent and the rewards are uniformly bounded. That's the proof. For every-visit MC, the returns from multiple visits within the same episode are correlated, which is where the finite-sample bias comes from.
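The first-visit argument in one line (a sketch, using the notation above):

```latex
% G^{(1)}, \dots, G^{(N)}: first-visit returns collected for state s over N episodes
\frac{1}{N}\sum_{i=1}^{N} G^{(i)}
  \;\xrightarrow[N \to \infty]{\text{a.s.}}\;
  \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] = v_\pi(s)
```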
@윤섭-p9l2 жыл бұрын
Can someone explain how, when doing TD(0), we get the part of the TD target that is the estimated value of the next state, V(S_{t+1})?
@3DAYSGG3 жыл бұрын
Wouldn't TD be considered a model-based algorithm, since we need the immediate reward and the value of state s'?
@shauryamohan7183 жыл бұрын
For an MDP model you need the action space, the state space, the reward function (or distribution), and the probability of moving to state s' from s given action a, for all s, s' in the state space and all a in the action space. Right now we only observe rewards and states (actions come in the next lecture), but we never learn the transition probabilities. I think that is the reason we don't have a model, and why this is not a model-based algorithm.
@girigupta4 жыл бұрын
For analysis purposes, can we treat every-visit MC as first-visit MC by considering each occurrence of the state as the start of its own episode?
@girigupta4 жыл бұрын
The only thing that will differ is the averaging factor for the chain with multiple visits to the same state.
@sng51928 жыл бұрын
Thanks for the nice lecture.
@princehan9204530243 жыл бұрын
Is alpha computed as 1/N(s) for each state as we traverse all the episodes, or is it just chosen manually as a constant between 0 and 1?
@duckdudette3 жыл бұрын
I don't know what example you're looking at but it generally varies. Both options you gave are viable in different cases.
@edmonddantes4705 Жыл бұрын
In the first case, your estimator is unbiased, in the second one, your estimator is biased. In both cases, there is convergence in the limit.
@divyanshushekhar51185 жыл бұрын
1:22:12 Is the offline method referring to Monte Carlo, because it needs to wait until episode termination to update the value function, while the Temporal Difference method is online because it updates the value function every time it sees a state?
@alvinphantomhive37944 жыл бұрын
Exactly, that's the definition. You might want to look up the slide on page 46; the lecture slides are in the description.
@lahaale58406 жыл бұрын
Why is the ace 10 on the dealer's side? It says an ace can be used as either 1 or 11, right?
@davypaul1234 жыл бұрын
What's the difference between Backup and Bootstrapping?
@suryapalaniswamy37264 жыл бұрын
In my opinion I think backup is really for policy evaluation by taking the returns and finding out value functions and optimizing them. Bootstrapping is something to do with sampling the entire distribution and not using it as a whole
@hschoi36817 жыл бұрын
Why does considering (the immediate reward from the next step + the value function of the next state, i.e. TD(0)) mean exploiting the MDP structure?
@Isaac6687 жыл бұрын
Each step Rt you take increases noise and decreases the bias. If we take a probabilistic standpoint, the more data we have (i.e. the more iterations/steps of the model we take), the less the bias will have effect and the more the likelihood will take effect (on the proverbial posterior distribution). This effectively eliminates the bias, and since at TD(0), we've taken 0 steps, there's 0 noise and close to 0 bias, so that should converge to the true value
@justalaugher6 жыл бұрын
Because in TD you respond immediately to what just happened after you took the action. But only in a Markov environment can you assume that what happened just now depends only on your action and the previous state.
@santiagorestrepo54586 жыл бұрын
if you listen carefully you can hear a subtle fart at 0:36:58
@johnhefele54326 жыл бұрын
heard dat
@shashankbhat96465 жыл бұрын
maybe he got a reward from monte carlo
@woncoh15 жыл бұрын
50:24 1:35:57
@scienceofart91214 жыл бұрын
What is the difference between the online and offline updates at 1:35:50?
@DrN0072 жыл бұрын
Last 10 minutes (backward view) need more elaboration.
@anmolagarwal999 Жыл бұрын
1:05:00 = Monte Carlo backups
@saminder9874 жыл бұрын
41:45
@aneeshbhat73285 жыл бұрын
What's the difference between the expected return and the empirical mean return? (8:28) I'm still unable to grasp the concept of expectation.
@manishandu5 жыл бұрын
The empirical mean refers to the sample mean when you have samples available; the expected value is a property of the distribution, not of a sample. In some cases you can infer the distribution from the samples, and then the empirical mean and the expected value coincide in the limit.
@sajanpoudel57074 жыл бұрын
Who is that guy in the background suffering from COVID-19 five years early? His coughing frequency is so irritating...
@benvirusofthemind3 жыл бұрын
You know 2020 has passed when everyone thinks the same thing when they hear someone coughing.
@sandeepinuganti87917 жыл бұрын
He skipped the last 10 slides. I saw them on the course website and I think they cover something really important, but I'm having trouble understanding them. Can anybody tell me whether those slides are important or not?
@kiuhnmmnhuik26277 жыл бұрын
The last 10 slides show that eligibility traces are truly equivalent to the slower counterpart of computing the updates as we do in MC and in forward-view TD(lambda). If you understand eligibility traces intuitively, you should also "sort of" understand the proofs (I think the proofs are not very precise, though).
@ChumpyZhang5 жыл бұрын
I have the same question. I was totally lost when it came to the backward view of TD(lambda).
@vman0495 жыл бұрын
Could someone please clarify why lambda = 1 corresponds to MC? If you plug in lambda = 1, the return just goes to 0, so how can it be equal to MC?
@paulmarkus49174 жыл бұрын
The formula for the return G_t^lambda should contain a term at the end for the final (end-of-episode) update involving the terminal state: you need to add lambda^(T-t-1) * G_t to the end of the return. When lambda = 1, only this term survives and it becomes G_t, which is the same as MC.
@edmonddantes4705 Жыл бұрын
Don't plug it in in the forward view, plug it in in the backward view and you can check it is MC (proof in the notes). In the forward case, it is true as a limit, i.e. MC is lim TD(lambda) when lambda-->1. Indeed, calculate the geometric sum of the coefficients from the end of the episode onwards and take limits when lambda-->1. You will see that all the mass concentrates on the tail of the series in the limit.
@II_superluminal_II3 жыл бұрын
COOL
@astaragmohapatra94 жыл бұрын
The class is very diverse, guessing from their voices
@divyanshushekhar51185 жыл бұрын
I don't understand the question asked at 11:34 and its answer
@manishandu5 жыл бұрын
David explained "first-visit Monte Carlo", in which the visit count of a state is incremented only the first time you visit it in an episode, and the return is accumulated from that first visit. Even if you visit the state again later in the same episode, nothing is updated again until the next episode. Thus you get N(s) = 10 for, say, 10 episodes, because you only count first visits, and you estimate the value by averaging the returns from the first visit to state s. However, it has been proved that the first-visit average return converges to the same value as the every-visit average (where the state may be counted multiple times per episode, with the count updated every time, unlike first-visit). So if you study only first-visit MC, you are not missing anything.
@manishandu5 жыл бұрын
In other words, first-visit and every-visit MC converge to the same value function.
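A minimal first-visit MC policy-evaluation sketch, assuming each episode is given as a list of (state, reward) pairs where the reward is the one received on leaving that state (the encoding and names are my own, not the lecture's code):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    N = defaultdict(int)      # first-visit counts per state
    S = defaultdict(float)    # sum of first-visit returns per state
    for episode in episodes:
        # Discounted return from every time step, computed backwards.
        G, returns = 0.0, []
        for _, r in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        seen = set()
        for (s, _), G_t in zip(episode, returns):
            if s not in seen:             # only the first visit in this episode counts
                seen.add(s)
                N[s] += 1
                S[s] += G_t
    return {s: S[s] / N[s] for s in N}    # V(s) = mean first-visit return
```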
@lekhaemerald7 жыл бұрын
I am not really sure how he got the value of V(A) and V(B) at 59:54. Can someone please explain?
@neilslater82237 жыл бұрын
He explains it just after. It was to compare how MC and TD use their samples. So you have 8 samples that include B; 6 of them get a final return of 1, and 2 of them get a return of 0. With MC sampling, the estimate for V(B) is 6/8 = 0.75. For TD it is the same, because state B is always the last one in a sequence, so we are bootstrapping from the terminal state, with V(terminal) = 0 by definition. The interesting one is V(A). With MC sampling, you look at the average return over the sequences that include state A. There is only one such sequence and its total return is 0 + 0 = 0, making the MC estimate V(A) = 0. But the TD estimate can include the existing estimate for B when the sequence transitions from A to B... I think it is a little off in that this only works with the TD algorithm as shown if the A:0,B:0 sequence occurs as the last sample. But I think the idea in the lecture is to ignore the algorithms and compare the two intuitive answers, V(A) = 0 (for MC) and V(A) = 0.75 (for TD), to get a sense of how MC and TD differ.
@hschoi36817 жыл бұрын
This is what I was looking for, thanks! I also thought the sequences should be in the opposite order; otherwise, in the TD case, it will also give V(A) = 0.
@vikramwaradpande50595 жыл бұрын
This is exactly what I was looking for! I think the same, but seeing that no one else is pointing this out makes me wonder whether I'm missing something obvious.
@rahulramachandran34922 жыл бұрын
Just realised from the example at 1:00:00 that I'm an MC kind of guy
@claudiocimarelli7 жыл бұрын
Does anyone have more documentation on the question about the geometric decay of lambda at 1:28:00? Thanks
@kiuhnmmnhuik26277 жыл бұрын
Let's say we are at time t and we have already visited state s three times. If the eligibility trace decays geometrically, then the decay factor is constant at each time step, so we don't need to maintain a history of the individual visits. Let's say E = E1 + E2 + E3, where Ei is the "partial eligibility" contributed by the i-th visit to state s. If the decay is constant, we have new(E) = new(E1) + new(E2) + new(E3) = lambda*E1 + lambda*E2 + lambda*E3 = lambda*E, so we don't really need to remember E1, E2 and E3 separately. If the decay were, say, hyperbolic (i.e. 1/n after n steps), then the update would be harder: new(E) = new(E1) + new(E2) + new(E3) = ?*E1 + ?*E2 + ?*E3 = ??? We can't proceed without knowing when the visits took place. If the right decays are l1, l2, l3 respectively, then new(E) = l1*E1 + l2*E2 + l3*E3. Unfortunately, we would need to remember the entire history of visits; we couldn't update E directly as before. Just out of curiosity, are you Italian?
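That property is exactly what makes the backward view cheap to implement: one scalar trace per state, decayed by the same constant gamma*lambda every step. A minimal sketch (the transition encoding and names are my own assumptions):

```python
from collections import defaultdict

def td_lambda_backward(episode, V, alpha=0.1, gamma=1.0, lam=0.9):
    """One episode of backward-view TD(lambda).
    `episode` is a list of (s, r, s_next) transitions; V maps states to values
    (e.g. a defaultdict(float))."""
    E = defaultdict(float)                    # one scalar eligibility trace per state
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]  # TD error for this step
        E[s] += 1.0                           # accumulate trace on visit
        for x in list(E):
            V[x] += alpha * delta * E[x]      # update every eligible state
            E[x] *= gamma * lam               # constant geometric decay, no history needed
    return V
```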
@divyanshushekhar51185 жыл бұрын
23:56 What question is being asked?
@alvinphantomhive37944 жыл бұрын
I guess he asked why, on the "Dealer Showing" axis, the value function at "10" decreases compared to the value at "9". But again, I think this example is not really necessary to understand; he also knew not everyone understands the gameplay, which makes the intuition hard to convey. And there are many more examples with simpler concepts on the next slides.
@zeweichu5507 жыл бұрын
Can someone give an example of the student's question at 1:15:39?
@edmonddantes4705 Жыл бұрын
I think pretty much any example you try is going to be a counterexample, because what the student proposes does not really make sense. Try a state space with two or three states and a uniform policy with two different actions and I think it will work :). Let me explain the intuition, which is the fundamental thing here. We are assuming that our environment is an MDP, and MDPs are FORWARD processes. Any MDP has an associated Bellman expectation equation, which expresses the value of a state s in terms of the values of FUTURE states. In other words, you look at the future states where you might end up depending on which actions you choose, and you express the value of a state as a weighted sum of the values at those states. The Bellman expectation equation expresses PAST state values in terms of FUTURE state values; this is key. In particular, if we are at state S_t and sample an action with policy pi, then R_{t+1} + gamma*v(S_{t+1}) is an UNBIASED sample of v(S_t). This is the heart of TD learning, whose update rule uses the estimate R_{t+1} + gamma*V(S_{t+1}) for V(S_t), where V(S_{t+1}) is now an approximation of the true value at state S_{t+1}, so the sample becomes biased. However, as you know, TD still converges to the true values in the limit; the estimates become more and more accurate. Now, imagine that instead of updating like V(S_t)
@porkypig71707 ай бұрын
Lost me at eligibility traces; I'm too stupid to understand.
@nikhilweee2 жыл бұрын
1:34:05 cha-chow
@jakoblindrup2 жыл бұрын
So the chances of beating the house in blackjack look quite low :)
@burada89934 жыл бұрын
Isn't the formula wrong? The exponent on the last term should be T-t-1 instead of T-1. The slide has G_t = R_{t+1} + gamma*R_{t+2} + ... + gamma^(T-1) * R_T, but it should be G_t = R_{t+1} + gamma*R_{t+2} + ... + gamma^(T-t-1) * R_T.