Great video! Can you explain more, that "sneaky" equation in aroun 6:00? Why is G_t+1 = v(S_t+1) in the expectation?
@Mutual_Information2 жыл бұрын
Ah, something I intentionally skipped over out of laziness, so I'll pin this comment for others. We want to show E[G_t+1 | s^0, -> ] = E[v(S_t+1) | s^0, -> ]. So.. * E[v(S_t+1) | s^0, -> ] = E[E[G_t+1|S_t+1] | s^0, -> ] (by def of v ) * = sum over s' [E[G_t+1|S_t+1 = s']p(s' | s^0, -> ) ] (by def of an expectation - the outer one) * = E[G_t+1| s^0, -> ] (by law of total probability) where p(s' | s^0, -> ) = sum over r [p(s' , r| s^0, -> )]
@mbeloch972 жыл бұрын
@@Mutual_Information thanks!
@xiaoweilin81842 жыл бұрын
@@Mutual_Information May I ask how law of total probability is used to get the last line from the previous one? Thanks!
@Mutual_Information2 жыл бұрын
@@xiaoweilin8184 Hm, it's just precisely what the law of total probability tells you. sum over b [p(a|b)p(b)] = p(a) The only difference is my expression has some extra conditioning on s^0, ->.. but that doesn't change anything. Hope that helps
@xiaoweilin81842 жыл бұрын
@@Mutual_Information But in your expression, the quantity to be summed is E[G_t+1|S_t+1 = s']. So do we need to write out this expectation to: sum over g_t+1 [g_t+1*p(G_t+1 = g_t+1|S_t+1 = s')] first? and the whole expression becomes a double sum: sum over s' , sum over g_t+1 [g_t+1 * p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)] exchange the sum: sum over g_t+1 {g_t+1 * sum over s' [p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)]} (1) Until this step can we use the total probability formula to the second sum: sum over s' [p(G_t+1 = g_t+1|S_t+1 = s') * p(S_t+1 = s'| s^0, ->)] = p(G_t+1 = g_t+1 | s^0, ->) Put it back into (1): sum over g_t+1 [g_t+1 * p(G_t+1 = g_t+1 | s^0, ->)] = E[G_t+1 | s^0, ->] Is it the correct way to use the law of total probability to derive the last step from the previous one in your derivation? It seems to me these are a few more steps that are derived under the hood in your expressions. Sorry there is no Latex in KZbin comment, it would be nicer if they are in Latex...
@TheRealExecuter22 Жыл бұрын
I can't express how good these videos are, thank you so much for all the time you put into making them! this is a truly special channel
@Mutual_Information Жыл бұрын
Thank you, it's tailored for a particular audience. Doesn't hit for most, but some it nails it!
@light26664 ай бұрын
@@Mutual_Information I wish this was much more popular , this series feels tailor made for me.
@mCoding2 жыл бұрын
Let's read from the textbook. *He opens the book, then stares at the camera and confidently recites from memory*.
@Mutual_Information2 жыл бұрын
Lol I wish it was from memory! Fortunately teleprompters aren't that expensive :)
@NoNameAtAll22 жыл бұрын
@@Mutual_Information tsss don't ruin the good impression of you
@BenjaminLiraLuttges2 жыл бұрын
That part of the video made me laugh out loud!!
@pandie4555 Жыл бұрын
i was looking for this comment lmao
@bean21711 ай бұрын
This part of the video made me lose my focus entirely 😂
@abhinavanand15295 ай бұрын
So far the best and optimized playlist for reinforcement learning.
@Mutual_Information5 ай бұрын
I agree :)
@fzetАй бұрын
THESE ARE THE BEST VIDEOS ON THIS TOPIC EVER, AND YOUR WAY OF EXPLAINING AND MAKING THINGS SOUND SO SIMPLE IS INCREDIBLE, THANK YOU A MILLION TIMES !!!!!!
@hypershadow922611 ай бұрын
In 15:46 you said "if that policy is greedy in respect to thatvalue function" but i don't quite understand what you ment by that. Other than that the video is crystal clear. thank you for these videos.
@Mutual_Information11 ай бұрын
A value function gives you the numeric value of every action in every state. A policy that's greedy 'with respect to that value function' is one which, in whatever state, picks the highest value action, according to the value function. Make sense?
@rajatjaiswal1008 ай бұрын
You saved lot of my time by simple, concise and easy to follow video compared to other I have seen so far.
@timothytyree52112 жыл бұрын
Kudos, good sir. Your pedagogical skill is both impressive and efficient. Please continue to grace the world with it for the good of all of mankind.
@Mutual_Information2 жыл бұрын
That's very kind of you Timothy - I have no plans of stopping :)
@Yahia.N_Ahmed5 ай бұрын
One of the best series if not he best in describing DRL. Good Job !!!!
@Mutual_Information5 ай бұрын
Thank you!
@manudasmd Жыл бұрын
This is the best reinforcements learning resource available in internet, Period
@taumagАй бұрын
*opens book* *stares into soul* LOL! I watched David Silver's excellent series before finding the MI channel. Your animations add a new level to understanding the topic adding more AHA moments for grasping RL. Thank you for making them! (Subbed & liked)
@valterszakrevskis2 жыл бұрын
Imagine if such great educational videos existed for all foundational topics in artificial intelligence, engineering, math, and physics. We are slowly getting there :). 3b1b py module manim has made it quite accessible to create high-quality, time efficient (for learning) educational content. It's amazing what people create. Thank you for the great videos!
@Mutual_Information2 жыл бұрын
I hope that there's a section of KZbin that's one day more like Wikipedia. It's a bit of a pipedream, but I'm at least nudging this continent in that direction. FYI, I don't use manim
@vesk40008 ай бұрын
This is so well done! Explaining stuff well can be very difficult. Thanks a lot! I'm studying RL at a university course, but this was way more helpful!
@avinashsharma89137 ай бұрын
can you share me your email ? studying RL at a university course.
@marcin.sobocinski2 жыл бұрын
Dziękujemy.
@Mutual_Information2 жыл бұрын
Thank you!! A rare form of appreciation - thanks a ton :)
@hassaniftikhar556411 ай бұрын
best video lectures of rl on the internet
@usonian112 жыл бұрын
Excellent video. Even though I have been studying RL for a while, the video clarified some previously learned concepts and gave me a better understanding of the topic.
@Mutual_Information2 жыл бұрын
Thanks, exactly what I was going for
@katchdawgs914 Жыл бұрын
These series of videos are really nice. I would love to see you go more into the theory/proofs of why policy iteration works... as another series. Once again, really good work.
@karthage363710 ай бұрын
Love the content so far. I would just prefere that you leave some times to breath like when you ask question "can you find S0 ?" don't answer straight away, let us think for few seconds. Will keep diging the playlist thank you for all this work !
@Mutual_Information10 ай бұрын
Good feedback. I’ll keep it in mind. Idk why I’m in such a rush lol
@arturprzybysz66142 жыл бұрын
Good to see your content back!
@Mutual_Information2 жыл бұрын
Oh I'm BACK!
@nicolaiholtkamp32236 ай бұрын
Damn, it really only took you 20 minutes to explain something that my professor needed two full lectures for. Thank you so much! This was so helpful
@marcin.sobocinski2 жыл бұрын
Your videos are like espresso, condensed, tasty, full bodied but you should not try to rush when watching them. There are no spare words so when you miss one, you're lost 😀Great video, I love that logical structure, rock solid!
@Mutual_Information2 жыл бұрын
lol you get what I'm going for! It's awesome - love the appreciation
@omerlevy69396 ай бұрын
goated videos literally all you need to understand the topic
@raminessalat9803 Жыл бұрын
Wow im not sure if I understood RL when I took the course in college or i just forgot it, but these videos made the aha moment for me for sure!
@Mutual_Information Жыл бұрын
That's what I'm going for!!
@EnesDeumic2 жыл бұрын
Hello. Thank you very much for providiing such clear lectures I hope you will continue. I have one suggestion, when you're finishing explaining the set of equations, you make them disapear so fast that one doesn't have time to pause. Just take one full note pause and that should do it. Don't overdo it, the beauty of your lectures is that they are dense, clear and consise.
@Mutual_Information2 жыл бұрын
Noted.. let the equations baked. Ok good point - thank you!
@bonettimauricio Жыл бұрын
Amazing explanation of the concepts! Really nice!
@Mutual_Information Жыл бұрын
Thank you, I appreciate it when the harder topics land :)
@vishwajeetohal91372 ай бұрын
17:05 * open textbook *, * proceeds to not even look at it * Man's a CHAD
@40NoNameFound-100-years-ago Жыл бұрын
This is the easiest way I have seen regarding this subject. You did a pretty great job there 😂😃😃👍👍
@Mutual_Information Жыл бұрын
Thank Sir No Name - I'm trying quite hard lol
@oj00242 жыл бұрын
I didn't expect the next video so quickly, amazing stuff. Have we been spoiled, or will this tight upload schedule continue?
@Mutual_Information2 жыл бұрын
It will continue.. for a short amount of time ;}
@IgorAherne2 жыл бұрын
Duane, at 16:20, I'm not sure why we would need to update the policy: I would think that we could just rely on updating the values of the states, again and again, until they stop changing. Following my logic, we wouldn't have to iteratively change the policy - at the very end we'd just make it "follow the highest action". ....But, I realize that these state-values were updated with the random-action policy (the 4 neighbor-states value are weighed by 0.25). Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others? So it no longer becomes 0.25 but some other probability. My confusion is because I am used to Q-learning where the policy is epsilon greedy. Thank you
@Mutual_Information2 жыл бұрын
Hey Igor. If you are following that random-action policy example, then it just so happens you only need to apply policy improvement *once* and you're at the optimal policy. But that's not true in general. Here's it more spelled out: * start with a random policy. * determine it's value function. * make a slightly better policy using the pick-best-action rule. This only slightly better than the random policy. In the general case, it is not likely to be near the best optimal policy. * determine the value function of this new, slightly improved policy. * repeat. If you were to do your approach, you would only be doing 1 iteration. You wouldn't end with an optimal policy. And regarding "Does this mean that when we update the policy, with each iteration we slowly shift the probabilities of actions in some states, but not in others?" Yes I think that's fair. We are changing the probabilities in states where the highest-action vcalue is NOT selected. Though I'm not sure what you mean by "slowly" Hope that helps!
@IgorAherne2 жыл бұрын
@@Mutual_Information thank you
@XandraDave Жыл бұрын
It turns out that in fact, algebra *is* fun, cool, and exciting
@sidnath7336 Жыл бұрын
@6:48, how did you calculate the expectation via the sum i.e. get 17.4?
@denizdursun335311 ай бұрын
its been 11 months, but either way: first assume s=s^(1): r = 0: 0.12 * [0 + 0.95 * 18.1] r = 1: 0.22 * [1 + 0.95 * 18.1] r = 2: 0.20 * [2 + 0.95 * 18.1] sum all of these up then we assume s=s^(2) r = 0: 0.09 * [0 + 0.95 * 16.2] r = 1: 0.32 * [1 + 0.95 * 16.2] r = 2: 0.05 * [2 + 0.95 * 16.2] sum all of these up as well then add the grand total of both of those. its the same way for taking the left action :) Edit: for the state value of s=s^(0) you would simply have to create the weighted sum of the actions such that: 0.4 * 17.8 + 0.6 * 17.4 = 17.56 or 17.57 if you dont ceil/floor your intermediate results do the calculations in excel you will get the same results :)
@Mutual_Information11 ай бұрын
Nailed it!
@ksilkov6 ай бұрын
@@denizdursun3353 Thank you kind man. After i saw the digits and compared them with formulas in the video it became pretty obvious. Again, video is brilliant, but the digits explanation dramatically boost clarity.
You mentioned something about optimal policy around 7:54. What is that and how did you relate it to state-action optimal value?
@pedrocastilho67892 жыл бұрын
Yes! We missed you
@Mutual_Information2 жыл бұрын
Missed you too!
@bonob01233 ай бұрын
I'm wondering what causes the discontinuities at s= 25, 50, 75 in the gambler's problem solution. The top panel value function triangular appearance makes sense but I can't get an intuitive grasp of why the lower panel optimal policy would end up that shape.
@ManuThaisseril Жыл бұрын
I have questions about the 6:06 substitution, could you explain this a bit ? because v_pi does not need to be conditional on state and action , it only depends on the state
@Mutual_Information Жыл бұрын
If you look at the pinned comment on this video, I do a breakdown of the expression. If that doesn't answer your Q, let me know
@tmorid3 Жыл бұрын
13:53 - how come the -20 cells and the -22 cells don't have the same value? they are identically far from the ending point. no? Thanks!
@Mutual_Information Жыл бұрын
If we were valuing the optimal policy, you'd be right. But we're valuing the do-something-randomly policy, which can't be value by looking at the optimal path. You have to think about a random walk, and then the corners, in that sense, are further away.
@tmorid3 Жыл бұрын
@@Mutual_Information Thank you very much for the quick reply!
@sathvikkalikivaya10 Жыл бұрын
This series is just amazing. is there any deep learning series like this?
@Mutual_Information Жыл бұрын
From me? No (but maybe one day). In the meantime, 3Blue1Brown has an excellent explainer. And there are others..
@ManuThaisseril Жыл бұрын
v_pi is not a random variable why do we take the expected value of that at 6:06?
@Mutual_Information Жыл бұрын
Indeed it is not random.. but if you give it a random variable as input, it becomes a random variable. As a silly example, if f(x) = x^2.. and U is a uniform random variable over [0, 1], then f(U) is a random variable. It is produced by randomly sampling a value uniformly from 0, 1 and then squaring it.
@jeandescamps4962 Жыл бұрын
Incredible content, thanks a lot for your work !
@Mutual_Information Жыл бұрын
Thank you Jean, and thanks for watching!
@ReconFX9 ай бұрын
Hi, so far I think this is a great video, however, I wanna point out that at 3:45 your illustration makes it seem like the states are a sort of one-dimensional grid, and one can go from s0 to s1, s1 to s2 etc. When you show the probabilities it becomes "obvious" that this is not the case, but this part had me confused a bit with your explanation/equation at 6:28, which I'm pretty sure should also have an s0 instead of an s. Like I said, otherwise a very good video!
@yuktikaura Жыл бұрын
Keep it up.. amazing take at this subject
@Mutual_Information Жыл бұрын
Glad you like it! And my current plans are to certainly keep it up :)
@aakashswami81437 ай бұрын
Amazing video!
@samudralatejaswini8 ай бұрын
Can you explain the calculation in bellman equations with the example which values to substitute
@advcare-r9oАй бұрын
Why there is no power (exponent) to the decay factor for S2 state in state action function RHS side??
@AlisonStuff2 жыл бұрын
I love Bellman! And I love equations!!
@kafaayari2 жыл бұрын
Great video but I have a question. At 3:56 probability distribution table appears regarding right selection. However s0 is not in table. After all, agent can go from s1 to s0. Am I wrong?
@Mutual_Information2 жыл бұрын
This example is focused only on a small piece of the MDP. The MDP, in entirety, describe the probability of all state, reward pairs for each state, action pair. In this example, I'm only showing the state-action pairs from s0 and, in this example, we can only transition to s1 or s2 (when choosing right). In other words, this example is more restricted than the general case. Implicit in this example, is that the probability of transition from s0 to s0 is zero... or the probability to transition from s0 to s-1 is zero when choosing right. Make sense?
@kafaayari2 жыл бұрын
Ah ok MI. now it's crystal clear. BTW we're lucky to ask you questions and get replies. We may not find this possibility in the future when the channel explodes. :)
@Mutual_Information2 жыл бұрын
@@kafaayari Aw thanks :) we’ll see what happens. I enjoy answering the Qs and I’m gonna try to keep it up for as long as I can. So far the volume is quite manageable lol
@aryamohan753311 ай бұрын
Could you explain the policy improvement? I understand that choosing the action that maximizes the action value function will lead to a better policy but I don't understand why we could not do that in the first iteration after performing policy evaluation? Wouldn't that then be the most optimal policy? What other improvements can we iteratively make to the policy? PS: Thank you for this video series, it has helped me understand a lot!
@Mutual_Information11 ай бұрын
Let's walk through it. To start, the action-value function is completely flat; all actions in all states have the same value. To do policy improvement from this moment, you must pick actions randomly (or some other arbitrary way to break ties). Now you have a random policy. Then in policy evaluation, we determine the action-values of this random policy. Next, in policy improvement, we can slightly beat it by always picking the max-action value. Ok, why isn't this the optimal strategy immediately? Well b/c it's a policy improvement step (applying the rule of picking the max-value-action) on action values of a crappy policy, the one where we just picked actions randomly! Make sense? It takes time for the action-values to model a good policy, because we start with a bad policy.
@aryamohan753311 ай бұрын
@@Mutual_Information This makes a lot more sense now. Thank you so much for taking the time to respond! To everyone else watching, while I have only watched the first 3 parts (will watch the rest soon), I can already tell you that this video series has peaked my interest in RL and I am excited to dive deeper into these topics and look into how I can incorporate this into my research.
@Mutual_Information11 ай бұрын
Excellent! I hope it helps you
@kiffeeify Жыл бұрын
@ 13:50 for computing the value function. Is this somehow related to Gibbs sampling? Somehow it reminds me of it :)
@mehmeterenbulut6076 Жыл бұрын
Nice explaination. May I ask how did you gain your explanation skills, did you took a course or something? Because you just hit the right buttons with your method, so to speak, to make us understand what you are talking about. I'd love to explain things like you do man, appreciated!
@Mutual_Information Жыл бұрын
Funny you say that, I feel like I have so much more to learn. I explain things poorly all the time! What I'd say is.. starting *writing* ASAP. Write about whatever interests you and write somewhere where people will give you useful feedback. It takes awhile to learn what works and what doesn't. One thing that helped me is to realize I'll be writing/educating forever. So there's no rush. This makes it more enjoyable, which means it's easier to maintain a long term habit. The long term is where quality edu emerges.
@ILoveMattBellamy6 ай бұрын
Great material thank you so much for posting. May I ask why the action generated on 20:09 for state = 60 is 1 under the uniform V = 0.4? I understand for 80 for example we get a = 20 because 0.4*1 + 0.6*0.4 = 0.64 (which is the updated V at the next step for 80).
@ILoveMattBellamy6 ай бұрын
At state 60: For a = 1: 0.4*0.4+0.6*0.4 = 0.4 The counterfactual for a = 40: 0.4*1 + 0.6*0= 0.4. I guess we grab the minimum action in case of tie?
@randyrandom760Ай бұрын
I have a question, does policy (pi) describes a probability distribution or the action itself? Because, I have a hard time understanding the optimal policy and the equation for `a = pi(s)` when the policy is deterministic. Also, how do we perform the policy improvement when the policy is stochastic? do we just reassign the state action value using the bellman equation after a sweep of policy evaluation? Because in the beginning of the grid world problem, you said that we use pi(a|s) = 0.25 and then you change the setting become deterministic by taking the argmax
@Mutual_InformationАй бұрын
If the policy is deterministic, it is a function. You give it a state and it gives you an action. That's all. If it's stochastic, it's a distribution over actions that depends on the state. If you give it a state, it gives back an entire distribution, specifying the probability for each action. Policy improvements for stochastic policies depends on the policy. If it's epsilon greedy, that's like a deterministic policy 'mixed' with a stochastic policy that is always a uniform distribution over the action. In this case, policy improvement doesn't touch the stochastic part - only the deterministic part. To perform policy improvement, you set the deterministic policy to select the highest value action in every state. You can do this, because presumably you just performed policy evaluation. This is the argmax operation: "select the highest value action". Make sense?
@ChocolateMilkCultLeader2 жыл бұрын
Excellent work my friend.
@Mutual_Information2 жыл бұрын
Thank you! Cheers!
@attilasarkany6123 Жыл бұрын
sorry, I may miscalculate and being stupid. I got different result at 6.48 (17.4).someone write it down?
@dwi4773 Жыл бұрын
watched it on 0.75, great video!
@dareadegun246018 күн бұрын
Hi @mutualInformation, i haved tried to solve for the same answer you got in the V_pi(s) for s^0 = 17.57, but i am not getting the same. even used chatgpt. please can you break it down?
@НиколайНовичков-е1э Жыл бұрын
Great video! Thank you!
@piero8284 Жыл бұрын
Ok, the notation of E_pi [.] does not necessarily imply you are averaging over a distribution pi(A | s), right? Like in 6:43, where you took the average over the r.v's joint p(s',r | s^0,->), so what's the point of using pi ?
@piero8284 Жыл бұрын
I mean, when I write E_pi[.] it just means that the function inside must be calculated considering that the agent followed the policy pi, but in practice I have to consider a special function (dependent on pi) for the r.v's probs. Am I correct?
@Mutual_Information Жыл бұрын
At 6:43, you can see E_pi[] means when you go from the second to the third line. It's an expectation operation, so we are expanding random variables within the expression into a weighted sum of values.. where the values are all values the random variables can take, and the weights are their respective probabilities. This is what happens between the second and third line.
@piero8284 Жыл бұрын
@@Mutual_Information I agree with you, my only source of doubt was from the pi subscript, as it does not make explicit the distribution of the random variables, I'm used to think of the subscript of the expectation meaning the distribution of the random variable itself, for example E_{p_X(x)}[X] = sum_x p(x)*x, but in this context it's not the case.
@Mutual_Information Жыл бұрын
@@piero8284 Oh I see what you're saying. yea pi is just suggestive. It's like saying "the expectation is with respect to the policy pi and you have to know what that means".
@nathanzorndorf8214 Жыл бұрын
Do you provide slides as a pdf anywhere ? That would be really helpful! Great video!!!
@Mutual_Information Жыл бұрын
gah unfortunately not.. I'd like to come back and create a written version of this series. I have that for a small set of other videos, but that'll take some time for this one - don't hold your breathe, sorry!
@nathanzorndorf8214 Жыл бұрын
@@Mutual_Information no problem! Thanks for your videos. They are a big help!
@luken4762 жыл бұрын
Anyone have a currently most recommended library or software to learn the application part of reinforcement learning?
@Mutual_Information2 жыл бұрын
These might help? * github.com/TianhongDai/reinforcement-learning-algorithms * github.com/p-christ/Deep-Reinforcement-Learning-Algorithms-with-PyTorch
@earthykibbles8 ай бұрын
My brain is cooked. This is my fifth time rewatching this😢
@sayounara94 Жыл бұрын
Is it the case that the optimal policy found by optimizing the state rewards will always be the same as the one found by optimizing the action rewards?
@watcher8582 Жыл бұрын
Gotta upload a version that blends out the pause-less hand motions
@Mutual_Information Жыл бұрын
Yea I don't like my old style at all. I'm re-shooting one but it's a lotta work. :/ Feedback is welcome.
@watcher8582 Жыл бұрын
@@Mutual_Information It was just a mean comment for the sake of being mean. I'd probably not put in the work to upload content you already got got covered, but donno
@Mutual_Information Жыл бұрын
@@watcher8582 haha yea I looked at the video after the fact and decided it wasn't so bad I didn't need to do this one. But some of my older ones, hand motions are awful and I'll be changing some of that lol
@watcher8582 Жыл бұрын
@@Mutual_Information No I mean if you already got a video covered (or if it's well covered elsewhere), then I'd probably invest the energy into making a video with a topic not covered online yet. Don't misunderstand me, the hand motion is terrible and I put a post-it over the screen to watch the video.
@sayounara94 Жыл бұрын
I was super focused on the board I didn't notice any weird cutovers! I like how you go through each variable it's very useful to have these quick reminders of what these notations represent as we're going through new concepts so that we don't have to make more conscious effort to decipher them and can focus on the new concept
@SphereofTime2 ай бұрын
11:24 Bellman eq for 4cases
@dmitriigalkin34458 ай бұрын
Amazing video! Thank you!
@MarkoTintor2 жыл бұрын
Can you comment on why does Gambler's problem solution differ from Kelly's criterion from one of your previous videos? Having a goal to reach 100 vs maximum growth.
@Mutual_Information2 жыл бұрын
Sure - they aren't optimizing the same thing. In the gamblers problem, the only thing that matter is probability of getting to 100. In the Betting Game for KC, it's the expected growth rate. Also, KC can sometimes tell you not to bet. In the gamblers problem, you are forced to bet every time. I guess that's enough to make for the different strategies
@curryeater2592 жыл бұрын
Very cool. How did you make the animations?
@Mutual_Information2 жыл бұрын
I use the Python plotting library called Altair. It creates beautiful static images. Then I have a personal library I use to stitch them into videos. That's also used to make the latex animation.
@yassine-sa2 жыл бұрын
hmm, let me guess, wouldn't applying a discount be interesting if the state space is so big? furthermore, wouldn't it be more interesting if we instead of discounting by a constant rate use a Gaussian distribution as our discount ?
@Mutual_Information2 жыл бұрын
Interesting perspective, but I don't think a strong discount removes the difficulty of a large state space. I see your perspective though - it makes it seem as though you only need to care about the most immediate states. But that's not necessarily true. It's because our policy is optimized for all t for G_t.. If you only cared about G_0 and gamma = 0, then yes, the immediate state/action pair is all the matters and you don't care about a lot of the state space. BUT, since we also care about G_1.. we have to have a policy that does well at time t=1.. which means we care about states beyond those in the neighbor of states at t=0. Eventually, we could end up caring about the whole state space. If, on the other hand, some states aren’t reachable from the starting state - then that would be one way in which a lot of the state space doesn't matter.
@yassine-sa2 жыл бұрын
Yeahh that's it 👌👌, thanks for the fast reply!
@lfccardona9 ай бұрын
You are awesome!
@rafas4307 Жыл бұрын
super good
@supercobra17462 жыл бұрын
Im tough and ambitious!
@Mutual_Information2 жыл бұрын
lol, hell yea!
@sharmakartikeya Жыл бұрын
If anyone is having a hard time deriving the bellman equation, especially the part where E [ G_t+1 | S_t ] = v(S_t+1), then I have covered that in my playlist from the absolute ground zero (because I am dumb). kzbin.info/www/bejne/ap-wn5ZugpyIa80si=9QR8zNuLEd1QAijE
@Mutual_Information Жыл бұрын
I don't see anything dumb in that video. You're getting right into the meat of probability theory and that's no easy thing!
@NicolasChanCSY2 жыл бұрын
🤯 My brain keeps saying "I understand" and then "But do I really?" every few seconds Have to rewatch for those algebra for my tiny brain but the overall idea is very well presented!
@Mutual_Information2 жыл бұрын
Thank you. And fortunately the comment section is small enough that I can answer questions - feel free to ask and I'll do my best!
@nathanzorndorf8214 Жыл бұрын
Do you have code for the gamblers problem online anywhere?
@nathanzorndorf8214 Жыл бұрын
I attempted to code the policy iteration algorithm for the gambler's problem, but don't get the policy you show in this video. Instead I get a triangle with a max at 50. This does seem like a reasonable policy though, so I'm not sure if this is one of the "family of optimal policies" that barto and sutton reference in the text.
@Mutual_Information Жыл бұрын
Oh yes I think it might be. I haven't made the code public but I think I remember the problem. Changing how you are dealing with ties! The action you pick given a tie in value makes a difference
@envynoir10 ай бұрын
THANKS!!! GOOD CONTENT
@milostean86152 жыл бұрын
brilliant
@corantinovirus821Ай бұрын
Love that❤
@yli6050 Жыл бұрын
Bravo🎉
@pi5549 Жыл бұрын
How about creating a Discord? If you think of the mind-type you're filtering with your videos, it could make for a strong community.
@Mutual_Information Жыл бұрын
I should. I'm just not on Discord myself, so I don't have familiarity with it as a platform. But I have gotten the request a few times and it seems like a wise move..
@pi5549 Жыл бұрын
@@Mutual_Information Yannic Kilcher created a Discord to support his KZbin, and it is buzzing. Also the level of expertise is high. Yannic's used his following to accrete engineering power into LAION. I got into ML in 2015 and there were almost no online communities back then. I came back over christmas (thanks to the ChatGPT buzz) and was delighted to find that it has taken off bigtime over Discord. Also Karpathy has an active Discord.
@piero8284 Жыл бұрын
(11:10) v_*(s) = max_a {q_*(s,a)}, no proof for this equation in the book made me very disappointed.
@JoeyFknD Жыл бұрын
That awkward moment when a KZbin series teaches you more practical knowledge than a $50,000 4-year degree in math
@braineaterzombie39813 ай бұрын
was checking whather my speed was set to 2x cause thats a lot of things to pack
@LeviFinkelstein2 жыл бұрын
I don't know very much about video stuff, but it looks like there's something off with your recording of yourself, it's pretty pixelated. Maybe it's just that your camera isn't that good, or something else, like the lighting, your rendering settings, or bit rate in OBS. Just wanted to let you know in case you didn't already. Thanks for the good videos.
@Mutual_Information2 жыл бұрын
Thanks for looking out. This was my first time uploading in 4K (despite it being recorded in 1920 x 1080) - apparently that's recommended practice. From my end, the video doesn't look bad enough to warrant a re-upload, but I'll give the settings another look on the next videos. I believe I see the pixelation you're referring to.
@gigantopithecus825410 ай бұрын
i heard its simular to calculus of variations
@pg_a-d3e Жыл бұрын
EPIC
@JohnSchroter-nm9nb2 ай бұрын
17:05 had me dying lmao
@sorn6813 Жыл бұрын
It's very hard to follow when you call everything "this" and just highlight "this". Would be easier to follow if you replaced "this" with the name of what you're describing
@sorn6813 Жыл бұрын
E.g. "R at time step t" or "the bottom-right XYZ"
@Mutual_Information Жыл бұрын
Appreciate the feedback. It's a work in progress.. going forward there's even less on screen so focusing attention might be alleviated a bit. Would you mind providing a timestamp for a case where this stood out? That'll help me identify similar cases in the future
@stallard57722 жыл бұрын
𝔭𝔯𝔬𝔪𝔬𝔰𝔪
@paudelvishal10 ай бұрын
The thumbnail was Manim-like. Dissapointed. Did not watch the whole video.
@juliomastrodomenico51882 жыл бұрын
Hi! Nice explanation ! but I tried to implement the calculation from 12:46, like doing the iteration (sweep) and I couldnt reach the same result, for the first part of the equation I kept the 0.25 ( pi[a|s] ), p(s_prime, r | s, a) as 1, for a deterministic world and sum ( -1 + world[row, col]), where world[row, col] is the V_next_state. My result was: [[ 0. -5.05417921 -5.47163695 -3.10743228] [-5.05417921 -6.78698773 -6.51979625 -3.95809215] [-5.47163695 -6.51979625 -5.86246816 -3.20514008] [-3.10743228 -3.95809215 -3.20514008 0. ]] The iteration was something like: for s in the world grid: (except the goals) for act in actions: (if falls off grid does nothing (try catch)) V(s) += 0.25 * (-1 + V[act])