Monte Carlo And Off-Policy Methods | Reinforcement Learning Part 3

  35,439 views

Mutual Information

1 day ago

The machine learning consultancy: truetheta.io
Want to work together? See here: truetheta.io/about/#want-to-w...
Part three of a six-part series on Reinforcement Learning. It covers the Monte Carlo approach: solving a Markov Decision Process with mere samples. At the end, we touch on off-policy methods, which enable RL when the data was generated with a different agent.
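As a rough illustration of that Monte Carlo idea (my own toy code, not from the video or the linked notebook): estimate a state's value by averaging the returns observed in sampled episodes, with no model of the environment's dynamics. The generate_episode helper and the every-visit convention are assumptions for the sketch.

from collections import defaultdict

def mc_evaluate(generate_episode, policy, n_episodes, gamma=1.0):
    """generate_episode(policy) -> [(state, action, reward), ...] (assumed helper)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        episode = generate_episode(policy)
        g = 0.0
        for t in range(len(episode) - 1, -1, -1):    # sweep backward so returns accumulate cheaply
            s, _, r = episode[t]
            g = r + gamma * g
            totals[s] += g                           # every-visit MC: average observed returns per state
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}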
SOCIAL MEDIA
LinkedIn : / dj-rich-90b91753
Twitter : / duanejrich
Github: github.com/Duane321
Enjoy learning this way? Want me to make more videos? Consider supporting me on Patreon: / mutualinformation
SOURCES
[1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
[2] H. van Hasselt, et al. RL Lecture Series, DeepMind and UCL, 2021, • DeepMind x UCL | Deep ...
SOURCE NOTES
The video covers topics from chapters 5 and 7 of [1]. The whole series teaches from [1]. [2] has been a useful secondary resource.
TIMESTAMPS
0:00 What We'll Learn
0:33 Review of Previous Topics
2:50 Monte Carlo Methods
3:35 Model-Free vs Model-Based Methods
4:59 Monte Carlo Evaluation
9:30 MC Evaluation Example
11:48 MC Control
13:01 The Exploration-Exploitation Trade-Off
15:01 The Rules of Blackjack and its MDP
16:55 Constant-alpha MC Applied to Blackjack
21:55 Off-Policy Methods
24:32 Off-Policy Blackjack
26:43 Watch the next video!
NOTES
Link to Constant-alpha MC applied to Blackjack: github.com/Duane321/mutual_in...
The off-policy method you see at 25:00 is different from the rule you'll see in the textbook at eq. 7.9 (which becomes MC as n goes to infinity). That's because the book is showing re-weighted (weighted) importance sampling and I'm showing plain (high-variance) importance sampling.
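For anyone who wants to see that difference concretely, here is a minimal sketch of my own (not the notebook linked above) of the two estimators for a start-state value, assuming pi(a, s) and b(a, s) return action probabilities and the episodes were collected under b:

import numpy as np

def is_estimates(episodes, pi, b, gamma=1.0):
    """episodes: list of [(state, action, reward), ...] trajectories generated by b."""
    returns, rhos = [], []
    for ep in episodes:
        g, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in ep:
            rho *= pi(a, s) / b(a, s)        # importance-sampling ratio over the whole episode
            g += discount * r
            discount *= gamma
        returns.append(g)
        rhos.append(rho)
    returns, rhos = np.array(returns), np.array(rhos)
    ordinary = np.mean(rhos * returns)                   # plain IS: unbiased but high variance
    weighted = np.sum(rhos * returns) / np.sum(rhos)     # weighted IS (eq. 7.9 flavor): biased, lower variance
    return ordinary, weighted                            # weighted assumes at least one nonzero rho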

Comments: 65
@shahadalrawi6744 · 1 year ago
This is beyond great. I can't thank you enough for the effort and clarity in this series. This is gold.
@Mutual_Information · 1 year ago
You thanked me plenty! Glad you enjoy it
@PromptStreamer · 10 months ago
These videos genuinely help me learn. A lot of the time studying math that’s above your head doesn’t have any tiny cumulative value, you’re just out of your league. But in these videos I often feel like I get the general idea of what he’s saying, even if I can’t work out all the details on my own yet. It’s something you can actually watch relaxed, like hearing a podcast, but walk away having learned something. I’m watching this in a hospital waiting room and it’s gripping. After watching his softmax video I was able to read through a paper I saw linked on twitter and sure enough, they mentioned the softmax, and my eyes lit up for a second. These are really high quality videos.
@andrewkovachik2062 · 1 year ago
Your video on importance sampling was so useful and well made that I'm sticking around for this whole series, even though I don't expect I'll need it any time soon.
@Mutual_Information · 1 year ago
The whole series!? You're a champ dude - thank you!
@architasrivastava218 · 1 year ago
I have been doing a specialization in AI for the last 2 years in my college. I wish my teachers had explained it to me in such a clear way.
@aadi.p4159 · 1 year ago
Keep 'em coming man. This is one of the most well-produced videos I've seen on this topic!
@minefacex · 1 year ago
I love how your videos are so understandable, but mathematically concise and clear at the same time! You also have amazing animations and figures. Good job and thank you!
@Mutual_Information · 1 year ago
Thank you Balazs!
@moranreznik · 1 year ago
I wish every math book in the world was written by you.
@Mutual_Information · 1 year ago
lol that's very nice of you, but that sounds like an awful lot of work :)
@jacksonstenger · 1 year ago
This is really great information, thanks for taking the time to make these videos
@timothytyree5211 · 1 year ago
Fantastic video series! I am looking forward to your next video, good sir.
@catcoder12 · 8 months ago
26:03 We got a better estimate because the behavioural policy chooses hit/stick with equal probability, so we "explore" more of the suboptimal states compared to an on-policy method where we greedily always choose the most optimal action? Am I right?
@Mutual_Information · 8 months ago
It could be something like that.. I can't confidently say. It could also be the noise of the simulation I did. I'd have to re-run it a lot to know it's a real effect. I don't suspect it is.. in general, off policy makes for strictly worse learning.
@melihozcan8676 · 6 months ago
Don’t expect to understand these videos by only watching. They are like concentrated juices (without sugar/chemicals added hehe); you can't just drink them, it’ll overload your body… Water must be added, which is time and effort, in this context. Everybody has some vague idea about reinforcement learning already: give rewards / punishment & repeat. Nevertheless, this high-level understanding is only adequate for people from different areas, like Justin Trudeau knowing the basics of quantum computers (which is impressive actually). I would like to thank Mutual Information for this series! The connections between topics and the amount of detail (math) are very well established. Such quality content is really rare. If you also make similar series on ML or similar topics, count me in!
@Mutual_Information · 6 months ago
Wow, that's very kind of you! Thank you for noticing what I was aiming for here... and I'm going to use that line "concentrated juices" - that's a good analogy!
@melihozcan8676 · 6 months ago
Thank you as well, @Mutual_Information - apparently good lectures lead to good analogies! I am honored!
@tomsojer7524 · 4 months ago
I am so grateful for this series man, it helped me pass my exam. Thank you so much man. I'm waiting for more of your videos
@Mutual_Information · 3 months ago
Awesome - exactly what I was going for. And I'm working on the next video now..
@DKFBJ · 3 months ago
This is excellent - Highly appreciated. Thank you very much. Have a great week, Kind regards
@imanmossavat9383 · 1 year ago
Excellent series.
@codesmell786 · 7 months ago
Best video. I have seen many, but this one is the best... great work
@Mutual_Information · 7 months ago
Means a lot, I appreciate hearing it
@hamade7997 · 1 year ago
This is really quite excellent, thank you.
@marcin.sobocinski · 1 year ago
Thank you.
@bonettimauricio · 9 months ago
Thanks for sharing this content, really amazing!
@jiaqint961 · 11 months ago
Thank you for your videos they are very comprehensive and well explained.
@Mutual_Information · 11 months ago
Glad they helped!
@user-co6pu8zv3v · 6 months ago
This is great! excellent. Thank you!
@qiguosun129 · 1 year ago
With all due respect, your lecture is more vivid than what the DeepMind teachers explained.
@Mutual_Information · 1 year ago
Thank you ! Their lecture series is great. I just put more of an emphasis on visualizing the mechanics and compressing the subject
@qiguosun129 · 1 year ago
@Mutual_Information Yes, that helps a lot for understanding the underlying mechanism.
@DARWINANDRESBENAVIDESMIRANDA · 1 year ago
Such a great explanation, notation, and video production!!
@Mutual_Information · 1 year ago
Thank you Darwin!
@dimitrispapapdimitriou6364 · 5 months ago
This is very well made. Thank you!
@IMAdiministrator · 6 months ago
I have a question about the Blackjack example. Why don't the stick graphs for having and not having a usable ace show similar results? You stick with whatever you have anyway, so it seems a bit odd to have different state-values between the graphs.
@rickycarollo6410 · 3 months ago
amazing stuff thanks !
@glebmokeev6312 · 7 months ago
12:14 Why is the policy deterministic if we have probability > 0 of taking either of two actions?
@user-ed7ze8sx9c · 6 months ago
Super awesome video series and I have thoroughly enjoyed it so far! I do want to ask what tool(s) you used to create the visualizations and add animations for the plots in the video. If you can provide me the answer, it would be a great help for some documentation I am currently working on! Again, super awesome video, and I am glad people like you put so much effort into communicating and simplifying these complicated topics in a really fun and very descriptive manner.
@user-kz6jr6gw7t · 3 months ago
Thank you so much! But at 25:13, since the target policy is derived after the data are sampled by the behavior policy, is there an iterative process to update rho, then get a new target policy, and so on?
@Mutual_Information · 3 months ago
Yea, you're thinking about it right. The target policy is the thing getting updated. The behavior policy is a fixed, given function. So rho changes as the target policy changes. Intuitively, rho is adjusting for the fact that the target and behavior policies 'hang out' in different regions of the state space. So, as the target policy changes where it hangs out, rho needs to change how it adjusts.
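To make that intuition concrete with a toy example of my own (not from the video): with an equiprobable behavior policy and a deterministic greedy target policy, rho is the product of probability ratios along the episode, so it is 0 whenever the episode contains an action pi would never take, and 2^T when pi agrees on all T steps - exactly the "hanging out in different regions" effect.

def rho_for_episode(episode, pi_greedy, b_prob=0.5):
    """episode: [(state, action), ...] collected under an equiprobable behavior policy."""
    rho = 1.0
    for s, a in episode:
        pi_prob = 1.0 if a == pi_greedy(s) else 0.0   # deterministic (greedy) target policy
        rho *= pi_prob / b_prob                       # rho = prod_t pi(a_t|s_t) / b(a_t|s_t)
    return rho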
@user-kz6jr6gw7t · 3 months ago
Thanks a lot for the further clarification. That really helps! @Mutual_Information
@fallognet · 4 months ago
Hey! Thank you so much for your videos, they are great and very useful! I still have a question though: when you are showing the off-policy version of the constant-alpha MC algorithm (25:10), why is the behaviour policy b never updated to generate the new trajectories (we would like the new trajectories to take into account our improvements to the policy and the decision making, right?) Thank you again Sir!
@Mutual_Information · 4 months ago
Good question! It's because it's off-policy. That's defined as the case where the behavior policy is fixed and given to you (imagine someone just handed you data and said it was generated by XYZ code or agent). Then we're using that data / fixed behavior policy to update the Q-table, which gives us another policy, pi. Think of it as the 'policy recommended according to the data collected by the given behavior policy.'
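A rough sketch of that structure, under my own conventions (env.actions, generate_episode, and b(a, s) are assumed helpers, and this is not the video's exact code): b only generates the data, while pi is simply read off the Q-table greedily.

def off_policy_constant_alpha_mc(env, b, generate_episode, n_episodes, alpha=0.02, gamma=1.0):
    Q = {}                                        # Q[(state, action)] -> value estimate

    def pi(s):                                    # target policy: greedy w.r.t. the current Q-table
        return max(env.actions(s), key=lambda a: Q.get((s, a), 0.0))

    for _ in range(n_episodes):
        episode = generate_episode(env, b)        # trajectories always come from the fixed b
        g, rho = 0.0, 1.0
        for s, a, r in reversed(episode):         # walk backward so returns are easy to accumulate
            g = r + gamma * g
            q = Q.get((s, a), 0.0)
            Q[(s, a)] = q + alpha * (rho * g - q)             # constant-alpha move toward rho * G
            rho *= (1.0 if a == pi(s) else 0.0) / b(a, s)     # plain IS ratio used for earlier timesteps
    return Q, pi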
@florianvogt1638 · 6 months ago
I am curious: in most pseudocode algorithms for off-policy MC control, the order in which we go over the states after generating an episode is reversed, that is, we start from T-1 and go to t=0. However, you start at t=0 and go until T-1. I wonder if both approaches are really equivalent?
@IMAdiministrator · 6 months ago
Judging from the RL book, IMHO he altered the off-policy MC control of section 5.7 into this method: initially multiply all the ratios from the start to the terminal state, then gradually strip each ratio off as t progresses toward the terminal state, hence he can step forward in time. Alpha is supposed to be the ratio between the weight of the current state and the cumulative sum, whose value is between 0 and 1 according to that method, but the cumulative sum needs to be calculated backward. In order to calculate alpha going forward, you first need to get all the cumulative sums in the episode; the cumulative sums can be gathered from the importance-sampling ratios across all timesteps of the episode, and then the weight of the current state is gradually stripped from the cumulative sums to calculate alpha forward.
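On the forward-vs-backward question, a tiny sketch under my own assumptions: with per-step ratios ratios[t] = pi(a_t|s_t)/b(a_t|s_t), the weight attached to the return G_t is the product of the ratios from t (or t+1, for action values) to the end of the episode. You can build those products backward as you go, or precompute them once and then sweep the episode forward - the per-timestep weights are identical either way. (Stripping them off the full-episode product by division, as described above, only works when no ratio is zero.)

def suffix_products(ratios):
    """suffix_products(r)[t] = r[t] * r[t+1] * ... * r[-1]."""
    out, acc = [0.0] * len(ratios), 1.0
    for t in range(len(ratios) - 1, -1, -1):   # build the tail products in one backward pass
        acc *= ratios[t]
        out[t] = acc
    return out

ratios = [2.0, 0.0, 2.0]            # e.g. equiprobable b vs. a deterministic pi that disagrees once
print(suffix_products(ratios))      # [0.0, 0.0, 2.0]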
@049_revanfauzialgifari6 · 3 months ago
How do you evaluate reinforcement learning results? I know precision, recall, mAP, etc., but I don't think they can be used in this scenario, CMIIW.
@EdupugantiAadityaaeb · 7 months ago
what an explanation
@Mutual_Information · 6 months ago
I like it too
@ice-skully2651 · 1 year ago
Great quality sir! The material is well presented. Do you have a social media account I could follow you on?
@Mutual_Information · 1 year ago
Yea, Twitter: @DuaneJRich
@faysoufox · 1 year ago
I understood the first two videos well, but in this one you spend time talking about fine points of the model without spending enough time explaining the model itself to actually understand it. Still, thank you for your videos, which seem to be good introductions.
@Mutual_Information · 1 year ago
Ah sorry it's not landing :/ But maybe I can help. Is there something specific you don't understand? Maybe I can clarify it here.
@faysoufox · 1 year ago
@Mutual_Information Thank you for the offer. I was actually watching your videos more for fun; it's not like I need to be able to do RL things tomorrow. If I want to understand it in detail, I'll read the book you based your videos on.
@ChocolateMilkCultLeader · 1 year ago
I love your videos. Would love to connect with you further
@user-uh4ng2ot1m · 8 months ago
I loved it. Can you make coding videos regarding this?
@Mutual_Information · 8 months ago
I included some notebooks in the description. That's probably as far as I'll go. Just got other topics I'd like to get to.
@PromptStreamer · 10 months ago
Can you please start a discord server? Would be wonderful to discuss the video content somewhere. Thx
@alexchen879 · 1 year ago
Could you please publish your source code?
@Mutual_Information · 1 year ago
There's a link to a notebook in the description. It covers some of the code, but not everything. If there's a specific question you have, I can try to answer it here. Maybe that'll fill the gap.
@user-wn9jq3zn6u · 1 month ago
Anyone's brain explode like mine?
@catcoder12 · 8 months ago
You teach great but I feel you speak a little too fast.
@Mutual_Information · 8 months ago
Good to know, I'm still getting calibrated. I've spoken *way* too fast before and sometimes too slow. Finding that sweet spot..