Bellman Equations, Dynamic Programming, Generalized Policy Iteration | Reinforcement Learning Part 2

  50,259 views

Mutual Information

A day ago

The machine learning consultancy: truetheta.io
Want to work together? See here: truetheta.io/about/#want-to-w...
Part two of a six-part series on Reinforcement Learning. We discuss the Bellman Equations, Dynamic Programming, and Generalized Policy Iteration.
SOCIAL MEDIA
LinkedIn : / dj-rich-90b91753
Twitter : / duanejrich
Github: github.com/Duane321
Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
SOURCES
[1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
[2] H. van Hasselt, et al. RL Lecture Series, DeepMind and UCL, 2021, • DeepMind x UCL | Deep ...
SOURCE NOTES
The video covers the topics of Chapter 3 and 4 from [1]. The whole series teaches from [1]. [2] was a useful secondary resource.
TIMESTAMPS
0:00 What We'll Learn
1:09 Review of Previous Topics
2:46 Definition of Dynamic Programming
3:05 Discovering the Bellman Equation
7:13 Bellman Optimality
8:41 A Grid View of the Bellman Equations
11:24 Policy Evaluation
13:58 Policy Improvement
15:55 Generalized Policy Iteration
17:55 A Beautiful View of GPI
18:14 The Gambler's Problem
20:42 Watch the Next Video!

Comments: 152
@mbeloch97
@mbeloch97 Жыл бұрын
Great video! Can you explain that "sneaky" equation around 6:00 in more detail? Why is G_t+1 = v(S_t+1) inside the expectation?
@Mutual_Information
@Mutual_Information Жыл бұрын
Ah, something I intentionally skipped over out of laziness, so I'll pin this comment for others. We want to show E[G_t+1 | s^0, -> ] = E[v(S_t+1) | s^0, -> ]. So:
* E[v(S_t+1) | s^0, -> ] = E[ E[G_t+1 | S_t+1] | s^0, -> ] (by definition of v)
* = sum over s' [ E[G_t+1 | S_t+1 = s'] p(s' | s^0, -> ) ] (by definition of the outer expectation)
* = E[G_t+1 | s^0, -> ] (by the law of total probability)
where p(s' | s^0, -> ) = sum over r [ p(s', r | s^0, -> ) ]
@mbeloch97
@mbeloch97 Жыл бұрын
@@Mutual_Information thanks!
@xiaoweilin8184
@xiaoweilin8184 Жыл бұрын
@@Mutual_Information May I ask how the law of total probability is used to get the last line from the previous one? Thanks!
@Mutual_Information
@Mutual_Information Жыл бұрын
@@xiaoweilin8184 Hm, it's just precisely what the law of total probability tells you:
sum over b [ p(a|b) p(b) ] = p(a)
The only difference is my expression has some extra conditioning on s^0, ->, but that doesn't change anything. Hope that helps
@xiaoweilin8184
@xiaoweilin8184 Жыл бұрын
@@Mutual_Information But in your expression, the quantity being summed is E[G_t+1 | S_t+1 = s']. So do we need to write out this expectation first:
sum over g_t+1 [ g_t+1 * p(G_t+1 = g_t+1 | S_t+1 = s') ]
so that the whole expression becomes a double sum:
sum over s', sum over g_t+1 [ g_t+1 * p(G_t+1 = g_t+1 | S_t+1 = s') * p(S_t+1 = s' | s^0, ->) ]
then exchange the sums:
sum over g_t+1 { g_t+1 * sum over s' [ p(G_t+1 = g_t+1 | S_t+1 = s') * p(S_t+1 = s' | s^0, ->) ] }   (1)
Only at this step can we apply the total probability formula to the inner sum:
sum over s' [ p(G_t+1 = g_t+1 | S_t+1 = s') * p(S_t+1 = s' | s^0, ->) ] = p(G_t+1 = g_t+1 | s^0, ->)
Putting it back into (1):
sum over g_t+1 [ g_t+1 * p(G_t+1 = g_t+1 | s^0, ->) ] = E[G_t+1 | s^0, ->]
Is this the correct way to use the law of total probability to derive the last step from the previous one in your derivation? It seems to me there are a few more steps happening under the hood in your expressions. Sorry there is no LaTeX in KZbin comments; it would be nicer if this were in LaTeX...
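For readability, here is the derivation from the thread above typeset in LaTeX. The only step not written explicitly in the comments is that conditioning on S_t+1 = s' screens off s^0 and the action, which is the Markov property:

```latex
\begin{aligned}
\mathbb{E}\big[v_\pi(S_{t+1}) \mid s^{(0)}, \rightarrow\big]
  &= \mathbb{E}\big[\, \mathbb{E}[G_{t+1} \mid S_{t+1}] \,\big|\, s^{(0)}, \rightarrow \big]
      && \text{(definition of } v_\pi\text{)} \\
  &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s'] \; p(s' \mid s^{(0)}, \rightarrow)
      && \text{(write out the outer expectation)} \\
  &= \sum_{g} g \sum_{s'} p(g \mid s') \; p(s' \mid s^{(0)}, \rightarrow)
      && \text{(write out the inner expectation, swap sums)} \\
  &= \sum_{g} g \; p(g \mid s^{(0)}, \rightarrow)
      && \text{(law of total probability + Markov property)} \\
  &= \mathbb{E}[G_{t+1} \mid s^{(0)}, \rightarrow],
\end{aligned}
\qquad \text{where } p(s' \mid s^{(0)}, \rightarrow) = \sum_{r} p(s', r \mid s^{(0)}, \rightarrow).
```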
@mCoding
@mCoding Жыл бұрын
Let's read from the textbook. *He opens the book, then stares at the camera and confidently recites from memory*.
@Mutual_Information
@Mutual_Information Жыл бұрын
Lol I wish it was from memory! Fortunately teleprompters aren't that expensive :)
@NoNameAtAll2
@NoNameAtAll2 Жыл бұрын
@@Mutual_Information tsss don't ruin the good impression of you
@BenjaminLiraLuttges
@BenjaminLiraLuttges Жыл бұрын
That part of the video made me laugh out loud!!
@pandie4555
@pandie4555 10 ай бұрын
i was looking for this comment lmao
@bean217
@bean217 4 ай бұрын
This part of the video made me lose my focus entirely 😂
@rajatjaiswal100
@rajatjaiswal100 Ай бұрын
You saved lot of my time by simple, concise and easy to follow video compared to other I have seen so far.
@TheRealExecuter22
@TheRealExecuter22 8 ай бұрын
I can't express how good these videos are, thank you so much for all the time you put into making them! this is a truly special channel
@Mutual_Information
@Mutual_Information 8 ай бұрын
Thank you, it's tailored for a particular audience. It doesn't hit for most, but for some it nails it!
@katchdawgs914
@katchdawgs914 Жыл бұрын
This series of videos is really nice. I would love to see you go more into the theory/proofs of why policy iteration works... as another series. Once again, really good work.
@timothytyree5211
@timothytyree5211 Жыл бұрын
Kudos, good sir. Your pedagogical skill is both impressive and efficient. Please continue to grace the world with it for the good of all of mankind.
@Mutual_Information
@Mutual_Information Жыл бұрын
That's very kind of you Timothy - I have no plans of stopping :)
@valterszakrevskis
@valterszakrevskis Жыл бұрын
Imagine if such great educational videos existed for all foundational topics in artificial intelligence, engineering, math, and physics. We are slowly getting there :). 3b1b's Python module manim has made it quite accessible to create high-quality, time-efficient (for learning) educational content. It's amazing what people create. Thank you for the great videos!
@Mutual_Information
@Mutual_Information Жыл бұрын
I hope that there's a section of KZbin that's one day more like Wikipedia. It's a bit of a pipedream, but I'm at least nudging this continent in that direction. FYI, I don't use manim
@vesk4000
@vesk4000 Ай бұрын
This is so well done! Explaining stuff well can be very difficult. Thanks a lot! I'm studying RL at a university course, but this was way more helpful!
@avinashsharma8913
@avinashsharma8913 11 күн бұрын
Can you share your email? I'm studying RL in a university course.
@bonettimauricio
@bonettimauricio 10 ай бұрын
Amazing explanation of the concepts! Really nice!
@Mutual_Information
@Mutual_Information 10 ай бұрын
Thank you, I appreciate it when the harder topics land :)
@manudasmd
@manudasmd 7 ай бұрын
This is the best reinforcement learning resource available on the internet, period.
@arturprzybysz6614
@arturprzybysz6614 Жыл бұрын
Good to see your content back!
@Mutual_Information
@Mutual_Information Жыл бұрын
Oh I'm BACK!
@usonian11
@usonian11 Жыл бұрын
Excellent video. Even though I have been studying RL for a while, the video clarified some previously learned concepts and gave me a better understanding of the topic.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks, exactly what I was going for
@hypershadow9226
@hypershadow9226 4 ай бұрын
At 15:46 you said "if that policy is greedy with respect to that value function", but I don't quite understand what you meant by that. Other than that, the video is crystal clear. Thank you for these videos.
@Mutual_Information
@Mutual_Information 4 ай бұрын
A value function gives you the numeric value of every action in every state. A policy that's greedy 'with respect to that value function' is one which, in whatever state, picks the highest value action, according to the value function. Make sense?
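In code, "greedy with respect to q" is just an argmax over actions. A minimal sketch, where the nested dict layout q[state][action] -> value is a hypothetical representation, not something from the video:

```python
# Pick the action with the highest estimated value in the given state.
# q is assumed to be a nested dict: q[state][action] -> estimated action value.
def greedy_action(q, state):
    return max(q[state], key=q[state].get)

# Example: q = {"s0": {"left": 1.0, "right": 2.0}}; greedy_action(q, "s0") -> "right"
```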
@jeandescamps4962
@jeandescamps4962 Жыл бұрын
Incredible content, thanks a lot for your work !
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you Jean, and thanks for watching!
@XandraDave
@XandraDave 6 ай бұрын
It turns out that in fact, algebra *is* fun, cool, and exciting
@pedrocastilho6789
@pedrocastilho6789 Жыл бұрын
Yes! We missed you
@Mutual_Information
@Mutual_Information Жыл бұрын
Missed you too!
@marcin.sobocinski
@marcin.sobocinski Жыл бұрын
Your videos are like espresso: condensed, tasty, full-bodied, but you should not try to rush when watching them. There are no spare words, so when you miss one, you're lost 😀 Great video, I love that logical structure, rock solid!
@Mutual_Information
@Mutual_Information Жыл бұрын
lol you get what I'm going for! It's awesome - love the appreciation
@dmitriigalkin3445
@dmitriigalkin3445 Ай бұрын
Amazing video! Thank you!
@AlisonStuff
@AlisonStuff Жыл бұрын
I love Bellman! And I love equations!!
@oj0024
@oj0024 Жыл бұрын
I didn't expect the next video so quickly, amazing stuff. Have we been spoiled, or will this tight upload schedule continue?
@Mutual_Information
@Mutual_Information Жыл бұрын
It will continue.. for a short amount of time ;}
@aakashswami8143
@aakashswami8143 28 күн бұрын
Amazing video!
@ChocolateMilkCultLeader
@ChocolateMilkCultLeader Жыл бұрын
Excellent work my friend.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you! Cheers!
@hassaniftikhar5564
@hassaniftikhar5564 4 ай бұрын
best video lectures of rl on the internet
@yuktikaura
@yuktikaura Жыл бұрын
Keep it up.. amazing take at this subject
@Mutual_Information
@Mutual_Information Жыл бұрын
Glad you like it! And my current plans are to certainly keep it up :)
@marcin.sobocinski
@marcin.sobocinski Жыл бұрын
Dziękujemy.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you!! A rare form of appreciation - thanks a ton :)
@rafas4307
@rafas4307 Жыл бұрын
super good
@40NoNameFound-100-years-ago
@40NoNameFound-100-years-ago Жыл бұрын
This is the easiest treatment of this subject I have seen. You did a pretty great job there 😂😃😃👍👍
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks, Sir No Name - I'm trying quite hard lol
@raminessalat9803
@raminessalat9803 10 ай бұрын
Wow, I'm not sure if I ever understood RL when I took the course in college or I just forgot it, but these videos made the aha moment happen for me for sure!
@Mutual_Information
@Mutual_Information 10 ай бұрын
That's what I'm going for!!
@milostean8615
@milostean8615 Жыл бұрын
brilliant
@sathvikkalikivaya10
@sathvikkalikivaya10 Жыл бұрын
This series is just amazing. Is there any deep learning series like this?
@Mutual_Information
@Mutual_Information Жыл бұрын
From me? No (but maybe one day). In the meantime, 3Blue1Brown has an excellent explainer. And there are others..
@envynoir
@envynoir 3 ай бұрын
THANKS!!! GOOD CONTENT
@lfccardona
@lfccardona 2 ай бұрын
You are awesome!
@yli6050
@yli6050 Жыл бұрын
Bravo🎉
@karthage3637
@karthage3637 3 ай бұрын
Love the content so far. I would just prefer that you leave some time to breathe: when you ask a question like "can you find S0?", don't answer straight away; let us think for a few seconds. I'll keep digging through the playlist, thank you for all this work!
@Mutual_Information
@Mutual_Information 3 ай бұрын
Good feedback. I’ll keep it in mind. Idk why I’m in such a rush lol
@curryeater259
@curryeater259 Жыл бұрын
Very cool. How did you make the animations?
@Mutual_Information
@Mutual_Information Жыл бұрын
I use the Python plotting library called Altair. It creates beautiful static images. Then I have a personal library I use to stitch them into videos. That's also used to make the latex animation.
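For anyone curious what one of those static frames might look like, here is a minimal Altair sketch with illustrative data; the video-stitching library mentioned above isn't public, so this only shows the static-chart half of the workflow:

```python
import altair as alt
import pandas as pd

# One static frame: a simple line chart of a made-up value estimate over iterations.
df = pd.DataFrame({"iteration": range(8), "value": [0, 2.1, 3.5, 4.4, 5.0, 5.3, 5.5, 5.6]})
chart = alt.Chart(df).mark_line(point=True).encode(x="iteration:Q", y="value:Q")
chart.save("frame_000.html")  # .png/.svg export also works if the vl-convert extra is installed
```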
@EnesDeumic
@EnesDeumic Жыл бұрын
Hello. Thank you very much for providing such clear lectures, I hope you will continue. I have one suggestion: when you finish explaining a set of equations, you make them disappear so fast that one doesn't have time to pause. Just take one full-note pause and that should do it. Don't overdo it; the beauty of your lectures is that they are dense, clear and concise.
@Mutual_Information
@Mutual_Information Жыл бұрын
Noted.. let the equations bake. Ok good point - thank you!
@himm2003
@himm2003 Ай бұрын
You mentioned something about the optimal policy around 7:54. What is that, and how did you relate it to the optimal state-action value?
@sayounara94
@sayounara94 7 ай бұрын
Is it the case that the optimal policy found by optimizing the state rewards will always be the same as the one found by optimizing the action rewards?
@ReconFX
@ReconFX 2 ай бұрын
Hi, so far I think this is a great video, however, I wanna point out that at 3:45 your illustration makes it seem like the states are a sort of one-dimensional grid, and one can go from s0 to s1, s1 to s2 etc. When you show the probabilities it becomes "obvious" that this is not the case, but this part had me confused a bit with your explanation/equation at 6:28, which I'm pretty sure should also have an s0 instead of an s. Like I said, otherwise a very good video!
@earthykibbles
@earthykibbles Ай бұрын
My brain is cooked. This is my fifth time rewatching this😢
@NicolasChanCSY
@NicolasChanCSY Жыл бұрын
🤯 My brain keeps saying "I understand" and then "But do I really?" every few seconds. I'll have to rewatch for the algebra with my tiny brain, but the overall idea is very well presented!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you. And fortunately the comment section is small enough that I can answer questions - feel free to ask and I'll do my best!
@samudralatejaswini
@samudralatejaswini Ай бұрын
Can you explain the calculation in the Bellman equations with an example showing which values to substitute?
@sidnath7336
@sidnath7336 Жыл бұрын
@6:48, how did you calculate the expectation via the sum i.e. get 17.4?
@denizdursun3353
@denizdursun3353 4 ай бұрын
It's been 11 months, but either way:
First assume s' = s^(1):
r = 0: 0.12 * [0 + 0.95 * 18.1]
r = 1: 0.22 * [1 + 0.95 * 18.1]
r = 2: 0.20 * [2 + 0.95 * 18.1]
Sum all of these up.
Then assume s' = s^(2):
r = 0: 0.09 * [0 + 0.95 * 16.2]
r = 1: 0.32 * [1 + 0.95 * 16.2]
r = 2: 0.05 * [2 + 0.95 * 16.2]
Sum these up as well, then add the two totals together. It's the same procedure for the left action :)
Edit: for the state value of s = s^(0) you simply take the weighted sum over actions: 0.4 * 17.8 + 0.6 * 17.4 = 17.56, or 17.57 if you don't round your intermediate results. Do the calculations in Excel and you will get the same results :)
@Mutual_Information
@Mutual_Information 4 ай бұрын
Nailed it!
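The same backup in a few lines of Python, treating the transition probabilities and state values read off the video as approximate:

```python
# q(s0, right) = sum over (s', r) of p(s', r | s0, right) * (r + gamma * v(s')),
# with gamma = 0.95 and the (approximate) numbers from the comment above.
gamma = 0.95
v = {"s1": 18.1, "s2": 16.2}
p = {  # p(s', r | s0, right)
    ("s1", 0): 0.12, ("s1", 1): 0.22, ("s1", 2): 0.20,
    ("s2", 0): 0.09, ("s2", 1): 0.32, ("s2", 2): 0.05,
}
q_right = sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p.items())
print(round(q_right, 1))  # -> 17.4
```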
@JoeyFknD
@JoeyFknD 11 ай бұрын
That awkward moment when a KZbin series teaches you more practical knowledge than a $50,000 4-year degree in math
@kiffeeify
@kiffeeify 11 ай бұрын
@ 13:50 for computing the value function. Is this somehow related to Gibbs sampling? Somehow it reminds me of it :)
@attilasarkany6123
@attilasarkany6123 11 ай бұрын
Sorry, I may be miscalculating or being stupid. I got a different result at 6:48 (17.4). Can someone write it out?
@supercobra1746
@supercobra1746 Жыл бұрын
Im tough and ambitious!
@Mutual_Information
@Mutual_Information Жыл бұрын
lol, hell yea!
@mehmeterenbulut6076
@mehmeterenbulut6076 5 ай бұрын
Nice explanation. May I ask how you gained your explanation skills - did you take a course or something? Because you just hit the right buttons with your method, so to speak, to make us understand what you are talking about. I'd love to explain things like you do, man, appreciated!
@Mutual_Information
@Mutual_Information 5 ай бұрын
Funny you say that, I feel like I have so much more to learn. I explain things poorly all the time! What I'd say is.. start *writing* ASAP. Write about whatever interests you and write somewhere where people will give you useful feedback. It takes a while to learn what works and what doesn't. One thing that helped me is to realize I'll be writing/educating forever. So there's no rush. This makes it more enjoyable, which means it's easier to maintain a long-term habit. The long term is where quality edu emerges.
@ManuThaisseril
@ManuThaisseril Жыл бұрын
I have a question about the 6:06 substitution, could you explain it a bit? Because v_pi does not need to be conditioned on state and action; it only depends on the state.
@Mutual_Information
@Mutual_Information 11 ай бұрын
If you look at the pinned comment on this video, I do a breakdown of the expression. If that doesn't answer your Q, let me know
@kafaayari
@kafaayari Жыл бұрын
Great video, but I have a question. At 3:56 a probability distribution table appears for the right action. However, s0 is not in the table. After all, the agent can go from s1 to s0. Am I wrong?
@Mutual_Information
@Mutual_Information Жыл бұрын
This example focuses on only a small piece of the MDP. The MDP, in its entirety, describes the probability of all state, reward pairs for each state, action pair. In this example, I'm only showing the state-action pairs from s0 and, in this example, we can only transition to s1 or s2 (when choosing right). In other words, this example is more restricted than the general case. Implicit in this example is that the probability of transitioning from s0 to s0 is zero... or that the probability of transitioning from s0 to s-1 is zero when choosing right. Make sense?
@kafaayari
@kafaayari Жыл бұрын
Ah ok MI, now it's crystal clear. BTW, we're lucky to be able to ask you questions and get replies. We may not have this opportunity in the future when the channel explodes. :)
@Mutual_Information
@Mutual_Information Жыл бұрын
@@kafaayari Aw thanks :) we’ll see what happens. I enjoy answering the Qs and I’m gonna try to keep it up for as long as I can. So far the volume is quite manageable lol
@nathanzorndorf8214
@nathanzorndorf8214 8 ай бұрын
Do you provide slides as a pdf anywhere ? That would be really helpful! Great video!!!
@Mutual_Information
@Mutual_Information 8 ай бұрын
gah unfortunately not.. I'd like to come back and create a written version of this series. I have that for a small set of other videos, but that'll take some time for this one - don't hold your breath, sorry!
@nathanzorndorf8214
@nathanzorndorf8214 8 ай бұрын
@@Mutual_Information no problem! Thanks for your videos. They are a big help!
@luken476
@luken476 Жыл бұрын
Does anyone have a currently recommended library or software for learning the applied side of reinforcement learning?
@Mutual_Information
@Mutual_Information Жыл бұрын
These might help? * github.com/TianhongDai/reinforcement-learning-algorithms * github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
@dwi4773
@dwi4773 5 ай бұрын
Watched it at 0.75x speed, great video!
@MarkoTintor
@MarkoTintor Жыл бұрын
Can you comment on why the Gambler's Problem solution differs from the Kelly criterion from one of your previous videos? Having a goal of reaching 100 vs. maximizing growth.
@Mutual_Information
@Mutual_Information Жыл бұрын
Sure - they aren't optimizing the same thing. In the Gambler's Problem, the only thing that matters is the probability of getting to 100. In the Betting Game for KC, it's the expected growth rate. Also, KC can sometimes tell you not to bet. In the Gambler's Problem, you are forced to bet every time. I guess that's enough to make for the different strategies.
@IgorAherne
@IgorAherne Жыл бұрын
Duane, at 16:20, I'm not sure why we would need to update the policy: I would think that we could just rely on updating the values of the states, again and again, until they stop changing. Following my logic, we wouldn't have to iteratively change the policy - at the very end we'd just make it "follow the highest action". ....But I realize that these state values were updated with the random-action policy (the 4 neighbor states' values are weighted by 0.25). Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others? So it's no longer 0.25 but some other probability. My confusion is because I am used to Q-learning, where the policy is epsilon-greedy. Thank you
@Mutual_Information
@Mutual_Information Жыл бұрын
Hey Igor. If you are following that random-action policy example, then it just so happens you only need to apply policy improvement *once* and you're at the optimal policy. But that's not true in general. Here it is spelled out more:
* Start with a random policy.
* Determine its value function.
* Make a slightly better policy using the pick-best-action rule. This is only slightly better than the random policy. In the general case, it is not likely to be near the optimal policy.
* Determine the value function of this new, slightly improved policy.
* Repeat.
If you were to do your approach, you would only be doing 1 iteration. You wouldn't end with an optimal policy.
And regarding "Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others?" Yes, I think that's fair. We are changing the probabilities in states where the highest-value action is NOT selected. Though I'm not sure what you mean by "slowly".
Hope that helps!
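A compact sketch of that loop for the 4x4 gridworld from the video (terminal corners, reward -1 per step, undiscounted). The grid size, sweep tolerance, and tie-breaking are assumptions, not the video's exact code:

```python
N = 4
STATES = [(r, c) for r in range(N) for c in range(N)]
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s  # off-grid moves leave you in place

def evaluate(policy, theta=1e-6):
    """Iterative policy evaluation: sweep until the value function stops changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            v_new = sum(prob * (-1.0 + V[step(s, a)]) for a, prob in policy[s].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return {s: {max(ACTIONS, key=lambda a: V[step(s, a)]): 1.0} for s in STATES}

policy = {s: {a: 0.25 for a in ACTIONS} for s in STATES}  # start from the random policy
while True:                                               # generalized policy iteration
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:
        break
    policy = new_policy
```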
@IgorAherne
@IgorAherne Жыл бұрын
@@Mutual_Information thank you
@tmorid3
@tmorid3 9 ай бұрын
13:53 - how come the -20 cells and the -22 cells don't have the same value? They are equally far from the end point, no? Thanks!
@Mutual_Information
@Mutual_Information 9 ай бұрын
If we were valuing the optimal policy, you'd be right. But we're valuing the do-something-randomly policy, which can't be valued by looking at the optimal path. You have to think about a random walk, and then the corners, in that sense, are further away.
@tmorid3
@tmorid3 9 ай бұрын
@@Mutual_Information Thank you very much for the quick reply!
@aryamohan7533
@aryamohan7533 4 ай бұрын
Could you explain the policy improvement? I understand that choosing the action that maximizes the action-value function will lead to a better policy, but I don't understand why we could not do that in the first iteration after performing policy evaluation. Wouldn't that then be the optimal policy? What other improvements can we iteratively make to the policy? PS: Thank you for this video series, it has helped me understand a lot!
@Mutual_Information
@Mutual_Information 4 ай бұрын
Let's walk through it. To start, the action-value function is completely flat; all actions in all states have the same value. To do policy improvement from this moment, you must pick actions randomly (or some other arbitrary way to break ties). Now you have a random policy. Then in policy evaluation, we determine the action-values of this random policy. Next, in policy improvement, we can slightly beat it by always picking the max-action value. Ok, why isn't this the optimal strategy immediately? Well b/c it's a policy improvement step (applying the rule of picking the max-value-action) on action values of a crappy policy, the one where we just picked actions randomly! Make sense? It takes time for the action-values to model a good policy, because we start with a bad policy.
@aryamohan7533
@aryamohan7533 4 ай бұрын
@@Mutual_Information This makes a lot more sense now. Thank you so much for taking the time to respond! To everyone else watching, while I have only watched the first 3 parts (will watch the rest soon), I can already tell you that this video series has piqued my interest in RL and I am excited to dive deeper into these topics and look into how I can incorporate this into my research.
@Mutual_Information
@Mutual_Information 4 ай бұрын
Excellent! I hope it helps you
@ManuThaisseril
@ManuThaisseril Жыл бұрын
v_pi is not a random variable, so why do we take the expected value of it at 6:06?
@Mutual_Information
@Mutual_Information 11 ай бұрын
Indeed it is not random.. but if you give it a random variable as input, it becomes a random variable. As a silly example, if f(x) = x^2.. and U is a uniform random variable over [0, 1], then f(U) is a random variable. It is produced by randomly sampling a value uniformly from 0, 1 and then squaring it.
@piero8284
@piero8284 8 ай бұрын
Ok, the notation E_pi[.] does not necessarily imply you are averaging over a distribution pi(A | s), right? Like at 6:43, where you took the average over the r.v.'s joint p(s', r | s^0, ->), so what's the point of using pi?
@piero8284
@piero8284 8 ай бұрын
I mean, when I write E_pi[.] it just means that the function inside must be calculated assuming that the agent follows the policy pi, but in practice I have to use a particular distribution (dependent on pi) for the r.v.'s probabilities. Am I correct?
@Mutual_Information
@Mutual_Information 8 ай бұрын
At 6:43, you can see what E_pi[] means when you go from the second to the third line. It's an expectation operation, so we are expanding random variables within the expression into a weighted sum of values, where the values are all values the random variables can take, and the weights are their respective probabilities. This is what happens between the second and third lines.
@piero8284
@piero8284 8 ай бұрын
​@@Mutual_Information I agree with you, my only source of doubt was from the pi subscript, as it does not make explicit the distribution of the random variables, I'm used to think of the subscript of the expectation meaning the distribution of the random variable itself, for example E_{p_X(x)}[X] = sum_x p(x)*x, but in this context it's not the case.
@Mutual_Information
@Mutual_Information 8 ай бұрын
@@piero8284 Oh I see what you're saying. yea pi is just suggestive. It's like saying "the expectation is with respect to the policy pi and you have to know what that means".
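For reference, the expansion being discussed, in the book's notation (the pi subscript just signals that actions are drawn from pi):

```latex
v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
        \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]
```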
@kjs333333
@kjs333333 8 ай бұрын
EPIC
@watcher8582
@watcher8582 8 ай бұрын
Gotta upload a version that blends out the pause-less hand motions
@Mutual_Information
@Mutual_Information 8 ай бұрын
Yea I don't like my old style at all. I'm re-shooting one but it's a lotta work. :/ Feedback is welcome.
@watcher8582
@watcher8582 8 ай бұрын
@@Mutual_Information It was just a mean comment for the sake of being mean. I'd probably not put in the work to re-upload content you already got covered, but dunno
@Mutual_Information
@Mutual_Information 8 ай бұрын
@@watcher8582 haha yea I looked at the video after the fact and decided it wasn't so bad - I didn't need to redo this one. But some of my older ones, the hand motions are awful and I'll be changing some of that lol
@watcher8582
@watcher8582 8 ай бұрын
@@Mutual_Information No I mean if you already got a video covered (or if it's well covered elsewhere), then I'd probably invest the energy into making a video with a topic not covered online yet. Don't misunderstand me, the hand motion is terrible and I put a post-it over the screen to watch the video.
@sayounara94
@sayounara94 7 ай бұрын
I was so focused on the board I didn't notice any weird cutovers! I like how you go through each variable; it's very useful to have these quick reminders of what the notation represents as we're going through new concepts, so we don't have to make a conscious effort to decipher it and can focus on the new concept.
@yassinesafraoui
@yassinesafraoui Жыл бұрын
Hmm, let me guess: wouldn't applying a discount be interesting if the state space is very big? Furthermore, wouldn't it be more interesting if, instead of discounting at a constant rate, we used a Gaussian distribution as our discount?
@Mutual_Information
@Mutual_Information Жыл бұрын
Interesting perspective, but I don't think a strong discount removes the difficulty of a large state space. I see your perspective though - it makes it seem as though you only need to care about the most immediate states. But that's not necessarily true. It's because our policy is optimized for G_t at all t. If you only cared about G_0 and gamma = 0, then yes, the immediate state/action pair is all that matters and you don't care about a lot of the state space. BUT, since we also care about G_1, we have to have a policy that does well at time t=1, which means we care about states beyond those in the neighborhood of the states at t=0. Eventually, we could end up caring about the whole state space. If, on the other hand, some states aren't reachable from the starting state - then that would be one way in which a lot of the state space doesn't matter.
@yassinesafraoui
@yassinesafraoui Жыл бұрын
Yeahh that's it 👌👌, thanks for the fast reply!
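For reference, the return being discussed (from the book), which is what the policy has to do well on at every t:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```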
@nathanzorndorf8214
@nathanzorndorf8214 6 ай бұрын
Do you have code for the gamblers problem online anywhere?
@nathanzorndorf8214
@nathanzorndorf8214 6 ай бұрын
I attempted to code the policy iteration algorithm for the gambler's problem, but don't get the policy you show in this video. Instead I get a triangle with a max at 50. This does seem like a reasonable policy though, so I'm not sure if this is one of the "family of optimal policies" that barto and sutton reference in the text.
@Mutual_Information
@Mutual_Information 6 ай бұрын
Oh yes I think it might be. I haven't made the code public, but I think I remember the problem: change how you are dealing with ties! The action you pick given a tie in value makes a difference.
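A minimal value-iteration sketch for the Gambler's Problem, assuming p_h = 0.4 as in the book's Example 4.3. The rounding used before the argmax is exactly the tie-breaking knob mentioned above: without it you tend to get the smooth triangle policy, with it you get the spiky one from the book's figure:

```python
import numpy as np

GOAL, P_H, THETA = 100, 0.4, 1e-10
V = np.zeros(GOAL + 1)
V[GOAL] = 1.0  # value 1 at the goal; reward is 0 everywhere else

# Value iteration over capital levels 1..99.
while True:
    delta = 0.0
    for s in range(1, GOAL):
        stakes = range(1, min(s, GOAL - s) + 1)
        best = max(P_H * V[s + a] + (1 - P_H) * V[s - a] for a in stakes)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Greedy policy: among (near-)tied stakes, pick the smallest by rounding before argmax.
policy = {}
for s in range(1, GOAL):
    stakes = list(range(1, min(s, GOAL - s) + 1))
    values = [P_H * V[s + a] + (1 - P_H) * V[s - a] for a in stakes]
    policy[s] = stakes[int(np.argmax(np.round(values, 5)))]
```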
@sharmakartikeya
@sharmakartikeya 5 ай бұрын
If anyone is having a hard time deriving the Bellman equation, especially the part where E[G_t+1 | S_t+1] = v(S_t+1), then I have covered that in my playlist from absolute ground zero (because I am dumb). kzbin.info/www/bejne/ap-wn5ZugpyIa80si=9QR8zNuLEd1QAijE
@Mutual_Information
@Mutual_Information 5 ай бұрын
I don't see anything dumb in that video. You're getting right into the meat of probability theory and that's no easy thing!
@LeviFinkelstein
@LeviFinkelstein Жыл бұрын
I don't know very much about video stuff, but it looks like there's something off with your recording of yourself, it's pretty pixelated. Maybe it's just that your camera isn't that good, or something else, like the lighting, your rendering settings, or bit rate in OBS. Just wanted to let you know in case you didn't already. Thanks for the good videos.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks for looking out. This was my first time uploading in 4K (despite it being recorded in 1920 x 1080) - apparently that's recommended practice. From my end, the video doesn't look bad enough to warrant a re-upload, but I'll give the settings another look on the next videos. I believe I see the pixelation you're referring to.
@pi5549
@pi5549 Жыл бұрын
How about creating a Discord? If you think of the mind-type you're filtering with your videos, it could make for a strong community.
@Mutual_Information
@Mutual_Information Жыл бұрын
I should. I'm just not on Discord myself, so I don't have familiarity with it as a platform. But I have gotten the request a few times and it seems like a wise move..
@pi5549
@pi5549 Жыл бұрын
@@Mutual_Information Yannic Kilcher created a Discord to support his KZbin, and it is buzzing. Also the level of expertise is high. Yannic's used his following to accrete engineering power into LAION. I got into ML in 2015 and there were almost no online communities back then. I came back over christmas (thanks to the ChatGPT buzz) and was delighted to find that it has taken off bigtime over Discord. Also Karpathy has an active Discord.
@piero8284
@piero8284 8 ай бұрын
(11:10) v_*(s) = max_a q_*(s,a) - the lack of a proof for this equation in the book made me very disappointed.
@gigantopithecus8254
@gigantopithecus8254 3 ай бұрын
I heard it's similar to the calculus of variations.
@sorn6813
@sorn6813 11 ай бұрын
It's very hard to follow when you call everything "this" and just highlight "this". Would be easier to follow if you replaced "this" with the name of what you're describing
@sorn6813
@sorn6813 11 ай бұрын
E.g. "R at time step t" or "the bottom-right XYZ"
@Mutual_Information
@Mutual_Information 11 ай бұрын
Appreciate the feedback. It's a work in progress.. going forward there's even less on screen so focusing attention might be alleviated a bit. Would you mind providing a timestamp for a case where this stood out? That'll help me identify similar cases in the future
@anondsml
@anondsml 2 ай бұрын
you move too fast. almost as if you're nervous
@vishalpaudel
@vishalpaudel 3 ай бұрын
The thumbnail was Manim-like. Disappointed. Did not watch the whole video.
@juliomastrodomenico5188
@juliomastrodomenico5188 Жыл бұрын
Hi! Nice explanation! But I tried to implement the calculation from 12:46, i.e. doing the iteration (sweep), and I couldn't reach the same result. For the first part of the equation I kept the 0.25 (pi[a|s]), set p(s_prime, r | s, a) to 1 for a deterministic world, and summed (-1 + world[row, col]), where world[row, col] is the value of the next state.
My result was:
[[ 0. -5.05417921 -5.47163695 -3.10743228]
 [-5.05417921 -6.78698773 -6.51979625 -3.95809215]
 [-5.47163695 -6.51979625 -5.86246816 -3.20514008]
 [-3.10743228 -3.95809215 -3.20514008 0. ]]
The iteration was something like:
for s in the world grid: (except the goals)
    for act in actions: (if it falls off the grid, do nothing (try/except))
        V(s) += 0.25 * (-1 + V[act])
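For comparison, a sketch of the corrected sweep (assumed layout: a 4x4 NumPy grid with terminal corners). Two changes from the pseudocode above: V[s] is recomputed from zero on each sweep instead of accumulated with +=, and the backup looks up the value of the next state rather than indexing by the action:

```python
import numpy as np

N = 4
V = np.zeros((N, N))
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
terminals = {(0, 0), (N - 1, N - 1)}

for _ in range(1000):  # plenty of sweeps for convergence
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue
            new_v = 0.0
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < N and 0 <= nc < N):
                    nr, nc = r, c  # bumping into the wall keeps you in place
                new_v += 0.25 * (-1.0 + V[nr, nc])
            V[r, c] = new_v  # assign each sweep; don't accumulate across sweeps

print(np.round(V, 1))  # converges to the 0 / -14 / -20 / -22 pattern shown in the video
```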
@user-co6pu8zv3v
@user-co6pu8zv3v 7 ай бұрын
Great video! Thank you!