Bellman Equations, Dynamic Programming, Generalized Policy Iteration | Reinforcement Learning Part 2

  50,259 views

Mutual Information

A day ago

The machine learning consultancy: truetheta.io
Want to work together? See here: truetheta.io/about/#want-to-w...
Part two of a six-part series on Reinforcement Learning. We discuss the Bellman Equations, Dynamic Programming, and Generalized Policy Iteration.
SOCIAL MEDIA
LinkedIn : / dj-rich-90b91753
Twitter : / duanejrich
Github: github.com/Duane321
Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
SOURCES
[1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
[2] H. van Hasselt, et al. RL Lecture Series, DeepMind and UCL, 2021, • DeepMind x UCL | Deep ...
SOURCE NOTES
The video covers the topics of Chapter 3 and 4 from [1]. The whole series teaches from [1]. [2] was a useful secondary resource.
TIMESTAMPS
0:00 What We'll Learn
1:09 Review of Previous Topics
2:46 Definition of Dynamic Programming
3:05 Discovering the Bellman Equation
7:13 Bellman Optimality
8:41 A Grid View of the Bellman Equations
11:24 Policy Evaluation
13:58 Policy Improvement
15:55 Generalized Policy Iteration
17:55 A Beautiful View of GPI
18:14 The Gambler's Problem
20:42 Watch the Next Video!

Comments: 152
@mbeloch97
@mbeloch97 Жыл бұрын
Great video! Can you explain that "sneaky" equation around 6:00 in more detail? Why is G_t+1 = v(S_t+1) inside the expectation?
@Mutual_Information
@Mutual_Information Жыл бұрын
Ah, something I intentionally skipped over out of laziness, so I'll pin this comment for others. We want to show E[G_t+1 | s^0, -> ] = E[v(S_t+1) | s^0, -> ]. So:
* E[v(S_t+1) | s^0, -> ] = E[ E[G_t+1 | S_t+1] | s^0, -> ] (by definition of v)
* = sum over s' [ E[G_t+1 | S_t+1 = s'] p(s' | s^0, -> ) ] (by definition of the outer expectation)
* = E[G_t+1 | s^0, -> ] (by the law of total probability)
where p(s' | s^0, -> ) = sum over r [ p(s', r | s^0, -> ) ]
@mbeloch97
@mbeloch97 Жыл бұрын
@@Mutual_Information thanks!
@xiaoweilin8184
@xiaoweilin8184 Жыл бұрын
@@Mutual_Information May I ask how the law of total probability is used to get the last line from the previous one? Thanks!
@Mutual_Information
@Mutual_Information Жыл бұрын
@@xiaoweilin8184 Hm, it's just precisely what the law of total probability tells you:
sum over b [ p(a|b) p(b) ] = p(a)
The only difference is my expression has some extra conditioning on s^0, ->, but that doesn't change anything. Hope that helps
@xiaoweilin8184
@xiaoweilin8184 Жыл бұрын
@@Mutual_Information But in your expression, the quantity being summed is E[G_t+1 | S_t+1 = s']. So do we need to write out this expectation first:
sum over g_t+1 [ g_t+1 * p(G_t+1 = g_t+1 | S_t+1 = s') ]
so that the whole expression becomes a double sum:
sum over s', sum over g_t+1 [ g_t+1 * p(G_t+1 = g_t+1 | S_t+1 = s') * p(S_t+1 = s' | s^0, ->) ]
then exchange the sums:
sum over g_t+1 { g_t+1 * sum over s' [ p(G_t+1 = g_t+1 | S_t+1 = s') * p(S_t+1 = s' | s^0, ->) ] }   (1)
Only at this step can we apply the total probability formula to the inner sum:
sum over s' [ p(G_t+1 = g_t+1 | S_t+1 = s') * p(S_t+1 = s' | s^0, ->) ] = p(G_t+1 = g_t+1 | s^0, ->)
Putting it back into (1):
sum over g_t+1 [ g_t+1 * p(G_t+1 = g_t+1 | s^0, ->) ] = E[G_t+1 | s^0, ->]
Is this the correct way to use the law of total probability to derive the last step from the previous one in your derivation? It seems to me there are a few more steps happening under the hood in your expressions. Sorry there is no LaTeX in KZbin comments; it would be nicer if this were in LaTeX...
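For readability, here is the derivation from the thread above typeset in LaTeX. The only step not written explicitly in the comments is that conditioning on S_t+1 = s' screens off s^0 and the action, which is the Markov property:

```latex
\begin{aligned}
\mathbb{E}\big[v_\pi(S_{t+1}) \mid s^{(0)}, \rightarrow\big]
  &= \mathbb{E}\big[\, \mathbb{E}[G_{t+1} \mid S_{t+1}] \,\big|\, s^{(0)}, \rightarrow \big]
      && \text{(definition of } v_\pi\text{)} \\
  &= \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s'] \; p(s' \mid s^{(0)}, \rightarrow)
      && \text{(write out the outer expectation)} \\
  &= \sum_{g} g \sum_{s'} p(g \mid s') \; p(s' \mid s^{(0)}, \rightarrow)
      && \text{(write out the inner expectation, swap sums)} \\
  &= \sum_{g} g \; p(g \mid s^{(0)}, \rightarrow)
      && \text{(law of total probability + Markov property)} \\
  &= \mathbb{E}[G_{t+1} \mid s^{(0)}, \rightarrow],
\end{aligned}
\qquad \text{where } p(s' \mid s^{(0)}, \rightarrow) = \sum_{r} p(s', r \mid s^{(0)}, \rightarrow).
```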
@mCoding
@mCoding Жыл бұрын
Let's read from the textbook. *He opens the book, then stares at the camera and confidently recites from memory*.
@Mutual_Information
@Mutual_Information Жыл бұрын
Lol I wish it was from memory! Fortunately teleprompters aren't that expensive :)
@NoNameAtAll2
@NoNameAtAll2 Жыл бұрын
@@Mutual_Information tsss don't ruin the good impression of you
@BenjaminLiraLuttges
@BenjaminLiraLuttges Жыл бұрын
That part of the video made me laugh out loud!!
@pandie4555
@pandie4555 10 ай бұрын
i was looking for this comment lmao
@bean217
@bean217 4 ай бұрын
This part of the video made me lose my focus entirely 😂
@rajatjaiswal100
@rajatjaiswal100 Ай бұрын
You saved lot of my time by simple, concise and easy to follow video compared to other I have seen so far.
@TheRealExecuter22
@TheRealExecuter22 8 ай бұрын
I can't express how good these videos are, thank you so much for all the time you put into making them! this is a truly special channel
@Mutual_Information
@Mutual_Information 8 ай бұрын
Thank you, it's tailored for a particular audience. It doesn't hit for most, but for some it nails it!
@katchdawgs914
@katchdawgs914 Жыл бұрын
This series of videos is really nice. I would love to see you go more into the theory/proofs of why policy iteration works... as another series. Once again, really good work.
@timothytyree5211
@timothytyree5211 Жыл бұрын
Kudos, good sir. Your pedagogical skill is both impressive and efficient. Please continue to grace the world with it for the good of all of mankind.
@Mutual_Information
@Mutual_Information Жыл бұрын
That's very kind of you Timothy - I have no plans of stopping :)
@valterszakrevskis
@valterszakrevskis Жыл бұрын
Imagine if such great educational videos existed for all foundational topics in artificial intelligence, engineering, math, and physics. We are slowly getting there :). 3b1b's Python module manim has made it quite accessible to create high-quality, time-efficient (for learning) educational content. It's amazing what people create. Thank you for the great videos!
@Mutual_Information
@Mutual_Information Жыл бұрын
I hope that there's a section of KZbin that's one day more like Wikipedia. It's a bit of a pipedream, but I'm at least nudging this continent in that direction. FYI, I don't use manim
@vesk4000
@vesk4000 Ай бұрын
This is so well done! Explaining stuff well can be very difficult. Thanks a lot! I'm studying RL at a university course, but this was way more helpful!
@avinashsharma8913
@avinashsharma8913 11 күн бұрын
Can you share your email? I'm studying RL in a university course.
@bonettimauricio
@bonettimauricio 10 ай бұрын
Amazing explanation of the concepts! Really nice!
@Mutual_Information
@Mutual_Information 10 ай бұрын
Thank you, I appreciate it when the harder topics land :)
@manudasmd
@manudasmd 7 ай бұрын
This is the best reinforcement learning resource available on the internet, period.
@arturprzybysz6614
@arturprzybysz6614 Жыл бұрын
Good to see your content back!
@Mutual_Information
@Mutual_Information Жыл бұрын
Oh I'm BACK!
@usonian11
@usonian11 Жыл бұрын
Excellent video. Even though I have been studying RL for a while, the video clarified some previously learned concepts and gave me a better understanding of the topic.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks, exactly what I was going for
@hypershadow9226
@hypershadow9226 4 ай бұрын
At 15:46 you said "if that policy is greedy with respect to that value function", but I don't quite understand what you meant by that. Other than that, the video is crystal clear. Thank you for these videos.
@Mutual_Information
@Mutual_Information 4 ай бұрын
A value function gives you the numeric value of every action in every state. A policy that's greedy 'with respect to that value function' is one which, in whatever state, picks the highest value action, according to the value function. Make sense?
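In code, "greedy with respect to q" is just an argmax over actions. A minimal sketch, where the nested dict layout q[state][action] -> value is a hypothetical representation, not something from the video:

```python
# Pick the action with the highest estimated value in the given state.
# q is assumed to be a nested dict: q[state][action] -> estimated action value.
def greedy_action(q, state):
    return max(q[state], key=q[state].get)

# Example: q = {"s0": {"left": 1.0, "right": 2.0}}; greedy_action(q, "s0") -> "right"
```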
@jeandescamps4962
@jeandescamps4962 Жыл бұрын
Incredible content, thanks a lot for your work !
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you Jean, and thanks for watching!
@XandraDave
@XandraDave 6 ай бұрын
It turns out that in fact, algebra *is* fun, cool, and exciting
@pedrocastilho6789
@pedrocastilho6789 Жыл бұрын
Yes! We missed you
@Mutual_Information
@Mutual_Information Жыл бұрын
Missed you too!
@marcin.sobocinski
@marcin.sobocinski Жыл бұрын
Your videos are like espresso: condensed, tasty, full-bodied, but you should not try to rush when watching them. There are no spare words, so when you miss one, you're lost 😀 Great video, I love that logical structure, rock solid!
@Mutual_Information
@Mutual_Information Жыл бұрын
lol you get what I'm going for! It's awesome - love the appreciation
@dmitriigalkin3445
@dmitriigalkin3445 Ай бұрын
Amazing video! Thank you!
@AlisonStuff
@AlisonStuff Жыл бұрын
I love Bellman! And I love equations!!
@oj0024
@oj0024 Жыл бұрын
I didn't expect the next video so quickly, amazing stuff. Have we been spoiled, or will this tight upload schedule continue?
@Mutual_Information
@Mutual_Information Жыл бұрын
It will continue.. for a short amount of time ;}
@aakashswami8143
@aakashswami8143 28 күн бұрын
Amazing video!
@ChocolateMilkCultLeader
@ChocolateMilkCultLeader Жыл бұрын
Excellent work my friend.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you! Cheers!
@hassaniftikhar5564
@hassaniftikhar5564 4 ай бұрын
best video lectures of rl on the internet
@yuktikaura
@yuktikaura Жыл бұрын
Keep it up.. amazing take at this subject
@Mutual_Information
@Mutual_Information Жыл бұрын
Glad you like it! And my current plans are to certainly keep it up :)
@marcin.sobocinski
@marcin.sobocinski Жыл бұрын
Dziękujemy.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you!! A rare form of appreciation - thanks a ton :)
@rafas4307
@rafas4307 Жыл бұрын
super good
@40NoNameFound-100-years-ago
@40NoNameFound-100-years-ago Жыл бұрын
This is the easiest treatment of this subject I have seen. You did a pretty great job there 😂😃😃👍👍
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks, Sir No Name - I'm trying quite hard lol
@raminessalat9803
@raminessalat9803 10 ай бұрын
Wow, I'm not sure if I ever understood RL when I took the course in college or I just forgot it, but these videos made the aha moment happen for me for sure!
@Mutual_Information
@Mutual_Information 10 ай бұрын
That's what I'm going for!!
@milostean8615
@milostean8615 Жыл бұрын
brilliant
@sathvikkalikivaya10
@sathvikkalikivaya10 Жыл бұрын
This series is just amazing. Is there any deep learning series like this?
@Mutual_Information
@Mutual_Information Жыл бұрын
From me? No (but maybe one day). In the meantime, 3Blue1Brown has an excellent explainer. And there are others..
@envynoir
@envynoir 3 ай бұрын
THANKS!!! GOOD CONTENT
@lfccardona
@lfccardona 2 ай бұрын
You are awesome!
@yli6050
@yli6050 Жыл бұрын
Bravo🎉
@karthage3637
@karthage3637 3 ай бұрын
Love the content so far. I would just prefer that you leave some time to breathe: when you ask a question like "can you find S0?", don't answer straight away; let us think for a few seconds. I'll keep digging through the playlist, thank you for all this work!
@Mutual_Information
@Mutual_Information 3 ай бұрын
Good feedback. I’ll keep it in mind. Idk why I’m in such a rush lol
@curryeater259
@curryeater259 Жыл бұрын
Very cool. How did you make the animations?
@Mutual_Information
@Mutual_Information Жыл бұрын
I use the Python plotting library called Altair. It creates beautiful static images. Then I have a personal library I use to stitch them into videos. That's also used to make the latex animation.
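For anyone curious what one of those static frames might look like, here is a minimal Altair sketch with illustrative data; the video-stitching library mentioned above isn't public, so this only shows the static-chart half of the workflow:

```python
import altair as alt
import pandas as pd

# One static frame: a simple line chart of a made-up value estimate over iterations.
df = pd.DataFrame({"iteration": range(8), "value": [0, 2.1, 3.5, 4.4, 5.0, 5.3, 5.5, 5.6]})
chart = alt.Chart(df).mark_line(point=True).encode(x="iteration:Q", y="value:Q")
chart.save("frame_000.html")  # .png/.svg export also works if the vl-convert extra is installed
```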
@EnesDeumic
@EnesDeumic Жыл бұрын
Hello. Thank you very much for providing such clear lectures, I hope you will continue. I have one suggestion: when you finish explaining a set of equations, you make them disappear so fast that one doesn't have time to pause. Just take one full-note pause and that should do it. Don't overdo it; the beauty of your lectures is that they are dense, clear and concise.
@Mutual_Information
@Mutual_Information Жыл бұрын
Noted.. let the equations bake. Ok good point - thank you!
@himm2003
@himm2003 Ай бұрын
You mentioned something about the optimal policy around 7:54. What is that, and how did you relate it to the optimal state-action value?
@sayounara94
@sayounara94 7 ай бұрын
Is it the case that the optimal policy found by optimizing the state rewards will always be the same as the one found by optimizing the action rewards?
@ReconFX
@ReconFX 2 ай бұрын
Hi, so far I think this is a great video, however, I wanna point out that at 3:45 your illustration makes it seem like the states are a sort of one-dimensional grid, and one can go from s0 to s1, s1 to s2 etc. When you show the probabilities it becomes "obvious" that this is not the case, but this part had me confused a bit with your explanation/equation at 6:28, which I'm pretty sure should also have an s0 instead of an s. Like I said, otherwise a very good video!
@earthykibbles
@earthykibbles Ай бұрын
My brain is cooked. This is my fifth time rewatching this😢
@NicolasChanCSY
@NicolasChanCSY Жыл бұрын
🤯 My brain keeps saying "I understand" and then "But do I really?" every few seconds. I'll have to rewatch for the algebra with my tiny brain, but the overall idea is very well presented!
@Mutual_Information
@Mutual_Information Жыл бұрын
Thank you. And fortunately the comment section is small enough that I can answer questions - feel free to ask and I'll do my best!
@samudralatejaswini
@samudralatejaswini Ай бұрын
Can you explain the calculation in the Bellman equations with an example showing which values to substitute?
@sidnath7336
@sidnath7336 Жыл бұрын
@6:48, how did you calculate the expectation via the sum i.e. get 17.4?
@denizdursun3353
@denizdursun3353 4 ай бұрын
It's been 11 months, but either way:
First assume s' = s^(1):
r = 0: 0.12 * [0 + 0.95 * 18.1]
r = 1: 0.22 * [1 + 0.95 * 18.1]
r = 2: 0.20 * [2 + 0.95 * 18.1]
Sum all of these up.
Then assume s' = s^(2):
r = 0: 0.09 * [0 + 0.95 * 16.2]
r = 1: 0.32 * [1 + 0.95 * 16.2]
r = 2: 0.05 * [2 + 0.95 * 16.2]
Sum these up as well, then add the two totals together. It's the same procedure for the left action :)
Edit: for the state value of s = s^(0) you simply take the weighted sum over actions: 0.4 * 17.8 + 0.6 * 17.4 = 17.56, or 17.57 if you don't round your intermediate results. Do the calculations in Excel and you will get the same results :)
@Mutual_Information
@Mutual_Information 4 ай бұрын
Nailed it!
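The same backup in a few lines of Python, treating the transition probabilities and state values read off the video as approximate:

```python
# q(s0, right) = sum over (s', r) of p(s', r | s0, right) * (r + gamma * v(s')),
# with gamma = 0.95 and the (approximate) numbers from the comment above.
gamma = 0.95
v = {"s1": 18.1, "s2": 16.2}
p = {  # p(s', r | s0, right)
    ("s1", 0): 0.12, ("s1", 1): 0.22, ("s1", 2): 0.20,
    ("s2", 0): 0.09, ("s2", 1): 0.32, ("s2", 2): 0.05,
}
q_right = sum(prob * (r + gamma * v[s_next]) for (s_next, r), prob in p.items())
print(round(q_right, 1))  # -> 17.4
```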
@JoeyFknD
@JoeyFknD 11 ай бұрын
That awkward moment when a KZbin series teaches you more practical knowledge than a $50,000 4-year degree in math
@kiffeeify
@kiffeeify 11 ай бұрын
@ 13:50 for computing the value function. Is this somehow related to Gibbs sampling? Somehow it reminds me of it :)
@attilasarkany6123
@attilasarkany6123 11 ай бұрын
Sorry, I may be miscalculating or being stupid. I got a different result at 6:48 (17.4). Can someone write it out?
@supercobra1746
@supercobra1746 Жыл бұрын
Im tough and ambitious!
@Mutual_Information
@Mutual_Information Жыл бұрын
lol, hell yea!
@mehmeterenbulut6076
@mehmeterenbulut6076 5 ай бұрын
Nice explanation. May I ask how you gained your explanation skills - did you take a course or something? Because you just hit the right buttons with your method, so to speak, to make us understand what you are talking about. I'd love to explain things like you do, man, appreciated!
@Mutual_Information
@Mutual_Information 5 ай бұрын
Funny you say that, I feel like I have so much more to learn. I explain things poorly all the time! What I'd say is.. start *writing* ASAP. Write about whatever interests you and write somewhere where people will give you useful feedback. It takes a while to learn what works and what doesn't. One thing that helped me is to realize I'll be writing/educating forever. So there's no rush. This makes it more enjoyable, which means it's easier to maintain a long-term habit. The long term is where quality edu emerges.
@ManuThaisseril
@ManuThaisseril Жыл бұрын
I have a question about the 6:06 substitution, could you explain it a bit? Because v_pi does not need to be conditioned on state and action; it only depends on the state.
@Mutual_Information
@Mutual_Information 11 ай бұрын
If you look at the pinned comment on this video, I do a breakdown of the expression. If that doesn't answer your Q, let me know
@kafaayari
@kafaayari Жыл бұрын
Great video, but I have a question. At 3:56 a probability distribution table appears for the right action. However, s0 is not in the table. After all, the agent can go from s1 to s0. Am I wrong?
@Mutual_Information
@Mutual_Information Жыл бұрын
This example focuses on only a small piece of the MDP. The MDP, in its entirety, describes the probability of all state, reward pairs for each state, action pair. In this example, I'm only showing the state-action pairs from s0 and, in this example, we can only transition to s1 or s2 (when choosing right). In other words, this example is more restricted than the general case. Implicit in this example is that the probability of transitioning from s0 to s0 is zero... or that the probability of transitioning from s0 to s-1 is zero when choosing right. Make sense?
@kafaayari
@kafaayari Жыл бұрын
Ah ok MI, now it's crystal clear. BTW, we're lucky to be able to ask you questions and get replies. We may not have this opportunity in the future when the channel explodes. :)
@Mutual_Information
@Mutual_Information Жыл бұрын
@@kafaayari Aw thanks :) we’ll see what happens. I enjoy answering the Qs and I’m gonna try to keep it up for as long as I can. So far the volume is quite manageable lol
@nathanzorndorf8214
@nathanzorndorf8214 8 ай бұрын
Do you provide slides as a pdf anywhere ? That would be really helpful! Great video!!!
@Mutual_Information
@Mutual_Information 8 ай бұрын
gah unfortunately not.. I'd like to come back and create a written version of this series. I have that for a small set of other videos, but that'll take some time for this one - don't hold your breath, sorry!
@nathanzorndorf8214
@nathanzorndorf8214 8 ай бұрын
@@Mutual_Information no problem! Thanks for your videos. They are a big help!
@luken476
@luken476 Жыл бұрын
Does anyone have a currently recommended library or software for learning the applied side of reinforcement learning?
@Mutual_Information
@Mutual_Information Жыл бұрын
These might help? * github.com/TianhongDai/reinforcement-learning-algorithms * github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
@dwi4773
@dwi4773 5 ай бұрын
Watched it at 0.75x speed, great video!
@MarkoTintor
@MarkoTintor Жыл бұрын
Can you comment on why the Gambler's Problem solution differs from the Kelly criterion from one of your previous videos? Having a goal of reaching 100 vs. maximizing growth.
@Mutual_Information
@Mutual_Information Жыл бұрын
Sure - they aren't optimizing the same thing. In the Gambler's Problem, the only thing that matters is the probability of getting to 100. In the Betting Game for KC, it's the expected growth rate. Also, KC can sometimes tell you not to bet. In the Gambler's Problem, you are forced to bet every time. I guess that's enough to make for the different strategies.
@IgorAherne
@IgorAherne Жыл бұрын
Duane, at 16:20, I'm not sure why we would need to update the policy: I would think that we could just rely on updating the values of the states, again and again, until they stop changing. Following my logic, we wouldn't have to iteratively change the policy - at the very end we'd just make it "follow the highest action". ....But I realize that these state values were updated with the random-action policy (the 4 neighbor states' values are weighted by 0.25). Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others? So it's no longer 0.25 but some other probability. My confusion is because I am used to Q-learning, where the policy is epsilon-greedy. Thank you
@Mutual_Information
@Mutual_Information Жыл бұрын
Hey Igor. If you are following that random-action policy example, then it just so happens you only need to apply policy improvement *once* and you're at the optimal policy. But that's not true in general. Here it is spelled out more:
* Start with a random policy.
* Determine its value function.
* Make a slightly better policy using the pick-best-action rule. This is only slightly better than the random policy. In the general case, it is not likely to be near the optimal policy.
* Determine the value function of this new, slightly improved policy.
* Repeat.
If you were to do your approach, you would only be doing 1 iteration. You wouldn't end with an optimal policy.
And regarding "Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others?" Yes, I think that's fair. We are changing the probabilities in states where the highest-value action is NOT selected. Though I'm not sure what you mean by "slowly".
Hope that helps!
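A compact sketch of that loop for the 4x4 gridworld from the video (terminal corners, reward -1 per step, undiscounted). The grid size, sweep tolerance, and tie-breaking are assumptions, not the video's exact code:

```python
N = 4
STATES = [(r, c) for r in range(N) for c in range(N)]
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s  # off-grid moves leave you in place

def evaluate(policy, theta=1e-6):
    """Iterative policy evaluation: sweep until the value function stops changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINAL:
                continue
            v_new = sum(prob * (-1.0 + V[step(s, a)]) for a, prob in policy[s].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(V):
    """Policy improvement: act greedily with respect to the current value function."""
    return {s: {max(ACTIONS, key=lambda a: V[step(s, a)]): 1.0} for s in STATES}

policy = {s: {a: 0.25 for a in ACTIONS} for s in STATES}  # start from the random policy
while True:                                               # generalized policy iteration
    V = evaluate(policy)
    new_policy = improve(V)
    if new_policy == policy:
        break
    policy = new_policy
```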
@IgorAherne
@IgorAherne Жыл бұрын
@@Mutual_Information thank you
@tmorid3
@tmorid3 9 ай бұрын
13:53 - how come the -20 cells and the -22 cells don't have the same value? They are equally far from the end point, no? Thanks!
@Mutual_Information
@Mutual_Information 9 ай бұрын
If we were valuing the optimal policy, you'd be right. But we're valuing the do-something-randomly policy, which can't be valued by looking at the optimal path. You have to think about a random walk, and then the corners, in that sense, are further away.
@tmorid3
@tmorid3 9 ай бұрын
@@Mutual_Information Thank you very much for the quick reply!
@aryamohan7533
@aryamohan7533 4 ай бұрын
Could you explain the policy improvement? I understand that choosing the action that maximizes the action-value function will lead to a better policy, but I don't understand why we could not do that in the first iteration after performing policy evaluation. Wouldn't that then be the optimal policy? What other improvements can we iteratively make to the policy? PS: Thank you for this video series, it has helped me understand a lot!
@Mutual_Information
@Mutual_Information 4 ай бұрын
Let's walk through it. To start, the action-value function is completely flat; all actions in all states have the same value. To do policy improvement from this moment, you must pick actions randomly (or some other arbitrary way to break ties). Now you have a random policy. Then in policy evaluation, we determine the action-values of this random policy. Next, in policy improvement, we can slightly beat it by always picking the max-action value. Ok, why isn't this the optimal strategy immediately? Well b/c it's a policy improvement step (applying the rule of picking the max-value-action) on action values of a crappy policy, the one where we just picked actions randomly! Make sense? It takes time for the action-values to model a good policy, because we start with a bad policy.
@aryamohan7533
@aryamohan7533 4 ай бұрын
@@Mutual_Information This makes a lot more sense now. Thank you so much for taking the time to respond! To everyone else watching, while I have only watched the first 3 parts (will watch the rest soon), I can already tell you that this video series has piqued my interest in RL and I am excited to dive deeper into these topics and look into how I can incorporate this into my research.
@Mutual_Information
@Mutual_Information 4 ай бұрын
Excellent! I hope it helps you
@ManuThaisseril
@ManuThaisseril Жыл бұрын
v_pi is not a random variable, so why do we take the expected value of it at 6:06?
@Mutual_Information
@Mutual_Information 11 ай бұрын
Indeed it is not random.. but if you give it a random variable as input, it becomes a random variable. As a silly example, if f(x) = x^2.. and U is a uniform random variable over [0, 1], then f(U) is a random variable. It is produced by randomly sampling a value uniformly from 0, 1 and then squaring it.
@piero8284
@piero8284 8 ай бұрын
Ok, the notation E_pi[.] does not necessarily imply you are averaging over a distribution pi(A | s), right? Like at 6:43, where you took the average over the r.v.'s joint p(s', r | s^0, ->), so what's the point of using pi?
@piero8284
@piero8284 8 ай бұрын
I mean, when I write E_pi[.] it just means that the function inside must be calculated assuming that the agent follows the policy pi, but in practice I have to use a particular distribution (dependent on pi) for the r.v.'s probabilities. Am I correct?
@Mutual_Information
@Mutual_Information 8 ай бұрын
At 6:43, you can see what E_pi[] means when you go from the second to the third line. It's an expectation operation, so we are expanding random variables within the expression into a weighted sum of values, where the values are all values the random variables can take, and the weights are their respective probabilities. This is what happens between the second and third lines.
@piero8284
@piero8284 8 ай бұрын
​@@Mutual_Information I agree with you, my only source of doubt was from the pi subscript, as it does not make explicit the distribution of the random variables, I'm used to think of the subscript of the expectation meaning the distribution of the random variable itself, for example E_{p_X(x)}[X] = sum_x p(x)*x, but in this context it's not the case.
@Mutual_Information
@Mutual_Information 8 ай бұрын
@@piero8284 Oh I see what you're saying. yea pi is just suggestive. It's like saying "the expectation is with respect to the policy pi and you have to know what that means".
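For reference, the expansion being discussed, in the book's notation (the pi subscript just signals that actions are drawn from pi):

```latex
v_\pi(s) \;=\; \mathbb{E}_\pi\!\left[G_t \mid S_t = s\right]
        \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[r + \gamma\, v_\pi(s')\big]
```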
@kjs333333
@kjs333333 8 ай бұрын
EPIC
@watcher8582
@watcher8582 8 ай бұрын
Gotta upload a version that blends out the pause-less hand motions
@Mutual_Information
@Mutual_Information 8 ай бұрын
Yea I don't like my old style at all. I'm re-shooting one but it's a lotta work. :/ Feedback is welcome.
@watcher8582
@watcher8582 8 ай бұрын
@@Mutual_Information It was just a mean comment for the sake of being mean. I'd probably not put in the work to re-upload content you already got covered, but dunno
@Mutual_Information
@Mutual_Information 8 ай бұрын
@@watcher8582 haha yea I looked at the video after the fact and decided it wasn't so bad - I didn't need to redo this one. But some of my older ones, the hand motions are awful and I'll be changing some of that lol
@watcher8582
@watcher8582 8 ай бұрын
@@Mutual_Information No I mean if you already got a video covered (or if it's well covered elsewhere), then I'd probably invest the energy into making a video with a topic not covered online yet. Don't misunderstand me, the hand motion is terrible and I put a post-it over the screen to watch the video.
@sayounara94
@sayounara94 7 ай бұрын
I was so focused on the board I didn't notice any weird cutovers! I like how you go through each variable; it's very useful to have these quick reminders of what the notation represents as we're going through new concepts, so we don't have to make a conscious effort to decipher it and can focus on the new concept.
@yassinesafraoui
@yassinesafraoui Жыл бұрын
Hmm, let me guess: wouldn't applying a discount be interesting if the state space is very big? Furthermore, wouldn't it be more interesting if, instead of discounting at a constant rate, we used a Gaussian distribution as our discount?
@Mutual_Information
@Mutual_Information Жыл бұрын
Interesting perspective, but I don't think a strong discount removes the difficulty of a large state space. I see your perspective though - it makes it seem as though you only need to care about the most immediate states. But that's not necessarily true. It's because our policy is optimized for G_t at all t. If you only cared about G_0 and gamma = 0, then yes, the immediate state/action pair is all that matters and you don't care about a lot of the state space. BUT, since we also care about G_1, we have to have a policy that does well at time t=1, which means we care about states beyond those in the neighborhood of the states at t=0. Eventually, we could end up caring about the whole state space. If, on the other hand, some states aren't reachable from the starting state - then that would be one way in which a lot of the state space doesn't matter.
@yassinesafraoui
@yassinesafraoui Жыл бұрын
Yeahh that's it 👌👌, thanks for the fast reply!
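For reference, the return being discussed (from the book), which is what the policy has to do well on at every t:

```latex
G_t \;=\; R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```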
@nathanzorndorf8214
@nathanzorndorf8214 6 ай бұрын
Do you have code for the gamblers problem online anywhere?
@nathanzorndorf8214
@nathanzorndorf8214 6 ай бұрын
I attempted to code the policy iteration algorithm for the gambler's problem, but don't get the policy you show in this video. Instead I get a triangle with a max at 50. This does seem like a reasonable policy though, so I'm not sure if this is one of the "family of optimal policies" that barto and sutton reference in the text.
@Mutual_Information
@Mutual_Information 6 ай бұрын
Oh yes I think it might be. I haven't made the code public, but I think I remember the problem: change how you are dealing with ties! The action you pick given a tie in value makes a difference.
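A minimal value-iteration sketch for the Gambler's Problem, assuming p_h = 0.4 as in the book's Example 4.3. The rounding used before the argmax is exactly the tie-breaking knob mentioned above: without it you tend to get the smooth triangle policy, with it you get the spiky one from the book's figure:

```python
import numpy as np

GOAL, P_H, THETA = 100, 0.4, 1e-10
V = np.zeros(GOAL + 1)
V[GOAL] = 1.0  # value 1 at the goal; reward is 0 everywhere else

# Value iteration over capital levels 1..99.
while True:
    delta = 0.0
    for s in range(1, GOAL):
        stakes = range(1, min(s, GOAL - s) + 1)
        best = max(P_H * V[s + a] + (1 - P_H) * V[s - a] for a in stakes)
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < THETA:
        break

# Greedy policy: among (near-)tied stakes, pick the smallest by rounding before argmax.
policy = {}
for s in range(1, GOAL):
    stakes = list(range(1, min(s, GOAL - s) + 1))
    values = [P_H * V[s + a] + (1 - P_H) * V[s - a] for a in stakes]
    policy[s] = stakes[int(np.argmax(np.round(values, 5)))]
```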
@sharmakartikeya
@sharmakartikeya 5 ай бұрын
If anyone is having a hard time deriving the Bellman equation, especially the part where E[G_t+1 | S_t+1] = v(S_t+1), then I have covered that in my playlist from absolute ground zero (because I am dumb). kzbin.info/www/bejne/ap-wn5ZugpyIa80si=9QR8zNuLEd1QAijE
@Mutual_Information
@Mutual_Information 5 ай бұрын
I don't see anything dumb in that video. You're getting right into the meat of probability theory and that's no easy thing!
@LeviFinkelstein
@LeviFinkelstein Жыл бұрын
I don't know very much about video stuff, but it looks like there's something off with your recording of yourself, it's pretty pixelated. Maybe it's just that your camera isn't that good, or something else, like the lighting, your rendering settings, or bit rate in OBS. Just wanted to let you know in case you didn't already. Thanks for the good videos.
@Mutual_Information
@Mutual_Information Жыл бұрын
Thanks for looking out. This was my first time uploading in 4K (despite it being recorded in 1920 x 1080) - apparently that's recommended practice. From my end, the video doesn't look bad enough to warrant a re-upload, but I'll give the settings another look on the next videos. I believe I see the pixelation you're referring to.
@pi5549
@pi5549 Жыл бұрын
How about creating a Discord? If you think of the mind-type you're filtering with your videos, it could make for a strong community.
@Mutual_Information
@Mutual_Information Жыл бұрын
I should. I'm just not on Discord myself, so I don't have familiarity with it as a platform. But I have gotten the request a few times and it seems like a wise move..
@pi5549
@pi5549 Жыл бұрын
@@Mutual_Information Yannic Kilcher created a Discord to support his KZbin, and it is buzzing. Also the level of expertise is high. Yannic's used his following to accrete engineering power into LAION. I got into ML in 2015 and there were almost no online communities back then. I came back over christmas (thanks to the ChatGPT buzz) and was delighted to find that it has taken off bigtime over Discord. Also Karpathy has an active Discord.
@piero8284
@piero8284 8 ай бұрын
(11:10) v_*(s) = max_a q_*(s,a) - the lack of a proof for this equation in the book made me very disappointed.
@gigantopithecus8254
@gigantopithecus8254 3 ай бұрын
I heard it's similar to the calculus of variations.
@sorn6813
@sorn6813 11 ай бұрын
It's very hard to follow when you call everything "this" and just highlight "this". Would be easier to follow if you replaced "this" with the name of what you're describing
@sorn6813
@sorn6813 11 ай бұрын
E.g. "R at time step t" or "the bottom-right XYZ"
@Mutual_Information
@Mutual_Information 11 ай бұрын
Appreciate the feedback. It's a work in progress.. going forward there's even less on screen so focusing attention might be alleviated a bit. Would you mind providing a timestamp for a case where this stood out? That'll help me identify similar cases in the future
@anondsml
@anondsml 2 ай бұрын
you move too fast. almost as if you're nervous
@vishalpaudel
@vishalpaudel 3 ай бұрын
The thumbnail was Manim-like. Disappointed. Did not watch the whole video.
@juliomastrodomenico5188
@juliomastrodomenico5188 Жыл бұрын
Hi! Nice explanation! But I tried to implement the calculation from 12:46, i.e. doing the iteration (sweep), and I couldn't reach the same result. For the first part of the equation I kept the 0.25 (pi[a|s]), set p(s_prime, r | s, a) to 1 for a deterministic world, and summed (-1 + world[row, col]), where world[row, col] is the value of the next state.
My result was:
[[ 0. -5.05417921 -5.47163695 -3.10743228]
 [-5.05417921 -6.78698773 -6.51979625 -3.95809215]
 [-5.47163695 -6.51979625 -5.86246816 -3.20514008]
 [-3.10743228 -3.95809215 -3.20514008 0. ]]
The iteration was something like:
for s in the world grid: (except the goals)
    for act in actions: (if it falls off the grid, do nothing (try/except))
        V(s) += 0.25 * (-1 + V[act])
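For comparison, a sketch of the corrected sweep (assumed layout: a 4x4 NumPy grid with terminal corners). Two changes from the pseudocode above: V[s] is recomputed from zero on each sweep instead of accumulated with +=, and the backup looks up the value of the next state rather than indexing by the action:

```python
import numpy as np

N = 4
V = np.zeros((N, N))
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
terminals = {(0, 0), (N - 1, N - 1)}

for _ in range(1000):  # plenty of sweeps for convergence
    for r in range(N):
        for c in range(N):
            if (r, c) in terminals:
                continue
            new_v = 0.0
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < N and 0 <= nc < N):
                    nr, nc = r, c  # bumping into the wall keeps you in place
                new_v += 0.25 * (-1.0 + V[nr, nc])
            V[r, c] = new_v  # assign each sweep; don't accumulate across sweeps

print(np.round(V, 1))  # converges to the 0 / -14 / -20 / -22 pattern shown in the video
```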
@user-co6pu8zv3v
@user-co6pu8zv3v 7 ай бұрын
Great video! Thank you!