REINFORCE: Reinforcement Learning's Most Fundamental Algorithm

8,411 views

Andriy Drozdyuk

1 day ago

If you would like to see more videos like this, please consider supporting me on Patreon: / andriydrozdyuk
Reinforcement Learning: An Introduction, 2nd Ed, Sutton & Barto
For the REINFORCE algorithm, see Section "13.3 REINFORCE: Monte Carlo Policy Gradient":
incompleteideas.net/book/the-b...
Complete code used in the video can be found here:
github.com/drozzy/reinforce
0:00 - Introduction
0:15 - Intro to RL
0:38 - Problem with Environment
1:02 - Why is this a problem for RL?
1:41 - Puppy treats (low level of abstraction)
2:14 - Good actions (middle level of abstraction)
3:22 - Reward as a signal (high level of abstraction)
4:04 - REINFORCE Algorithm Overview
5:11 - Collected Trajectory
6:01 - Product of G and Policy Gradient
6:34 - Two key concepts: sample and evaluate
6:48 - Sampling an action
7:22 - Sampling in REINFORCE
7:38 - Evaluating an action
8:24 - Sampling vs. Evaluating
8:41 - Sampling using torch.distributions.Categorical
9:12 - Evaluating using torch.distributions.Categorical
9:50 - Env/NN/Optim
10:07 - Collect One Episode of Experience
10:53 - Compute Discounted Returns
11:44 - Update the Policy
12:41 - Executing Trained Policy
13:04 - Demo Cart Pole Balancing
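
For readers who want the gist before watching, here is a minimal sketch of the loop the chapters above describe: sample an action, collect one episode, compute discounted returns, update the policy. It is an approximation, not the repository code linked above; the CartPole-v1 environment, network sizes, learning rate, and the older gym reset/step signature are assumptions made for the example.

    import gym
    import torch
    import torch.nn as nn

    env = gym.make("CartPole-v1")
    policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
    gamma = 0.99

    for episode in range(500):
        # Collect one episode of experience with the current policy.
        log_probs, rewards = [], []
        state, done = env.reset(), False   # older gym API: reset() returns only the observation
        while not done:
            dist = torch.distributions.Categorical(
                logits=policy(torch.as_tensor(state, dtype=torch.float32)))
            action = dist.sample()                    # sample an action from the policy
            log_probs.append(dist.log_prob(action))   # evaluate: log pi(a | s)
            state, reward, done, _ = env.step(action.item())
            rewards.append(reward)

        # Compute the discounted return G_t for every time step (backwards pass).
        returns, G = [], 0.0
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)

        # Update the policy: gradient ascent on sum_t G_t * log pi(a_t | s_t).
        loss = -sum(G * lp for G, lp in zip(returns, log_probs))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

See github.com/drozzy/reinforce for the actual implementation used in the video.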

Comments: 34
@ollipasanen3570 · 2 years ago
I am not sure why this doesn't have more views. Thank you for the really clear and concise explanation of REINFORCE!
@yodarocco · 1 year ago
"Maybe you've seen it.. maybe you've read it.. but have you implemented it?" That is when you gained my like and subscribe!
@LatelierdArmand · 4 months ago
Thanks, I liked the minimal PyTorch implementation!
@akwstr · 2 years ago
That was a great explanation / walkthrough. Thank you!
@samdonald741 · 2 years ago
Such a good explanation! Loved the examples + clear visuals
@antoineajsv6976 · 2 years ago
super clear, many thanks for this brilliant explanation
@jeonghwankim8973 · 5 months ago
This is amazing. Thanks for the clear explanation and the illustrative code samples!
@TwoSocksFoxtrot · 2 years ago
this is unbelievably good. well done, sir
@supersonic956 · 1 year ago
Things are always so much simpler when written in code (at least for me!). Thanks!
@kalixml1523 · 5 months ago
Great video! Your clear explanations and pleasant voice make learning enjoyable. Looking forward to more content like this! Subscribed 👍
@beltusnkwawir2908 · 2 years ago
Your explanations are so clear and your voice is pleasant to the ear, making learning more enjoyable. Thank you
@pitiwatlueang6899 · 2 months ago
This is pure gold! You should produce more videos!
@anirudhthatipelli8765 · 1 year ago
Thanks, this was a great explanation!
@debarchanbasu768 · 1 year ago
This is the OG explanation of REINFORCE on YouTube. I hope you can come up with an entire RL series! Subscribed!
@jonathanwogerbauer2703 · 23 days ago
Big respect for this video! Thank you!
@AKUKamil · 3 months ago
intriguing start
@marcin.sobocinski · 2 years ago
Great showman Andriy is not, for sure... but the explanation of the algo is spot on and crystal clear. Thank you!
@openaidalle · 1 year ago
Loved this.. Please make more videos
@jubaaissaoui5678 · 1 year ago
Great video
@abdelrahmanwaelhelaly1871 · 2 years ago
Thank you so much.
@swamikannan943 · 2 years ago
This video answered the single biggest doubt in my mind: how do you backprop through env.step()? Brilliant explanation! Thanks a lot!
@ahmedaj2000 · 8 months ago
thanks!
@abdelrahmanwaelhelaly1871 · 2 years ago
This is super good, do you have more courses/videos?
@Ishaheennabi · 3 months ago
great
@AndrewGarrisonTheHuman · 2 years ago
Great video, thank you. Do you have any advice or links to resources on how to apply a policy gradient to a continuous action space or an environment where multiple actions must be taken for each state?
@AndriyDrozdyuk · 2 years ago
Yes. For the theory, take a look at section "13.7 Policy Parameterization for Continuous Actions" in the Sutton & Barto RL book, 2nd ed.: incompleteideas.net/book/RLbook2020.pdf. For an implementation, Spinning Up by OpenAI has "vanilla" policy gradients; here is an example: github.com/openai/spinningup/blob/038665d62d569055401d91856abb287263096178/spinup/algos/pytorch/vpg/core.py#L80
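In case a code-level illustration helps: below is a rough sketch of a Gaussian (Normal) policy head in the spirit of section 13.7 and the Spinning Up code linked above, not copied from either. The class name, layer sizes, and dimensions are placeholders. The network outputs the mean, the log standard deviation is a separate learned parameter, and multiple simultaneous actions are treated as one multi-dimensional action whose per-dimension log-probabilities are summed. The REINFORCE update itself (minimize -G * log_prob) stays the same as in the discrete case.

    import torch
    import torch.nn as nn

    class GaussianPolicy(nn.Module):
        def __init__(self, obs_dim, act_dim):
            super().__init__()
            self.mean_net = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
            self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned, state-independent std

        def dist(self, obs):
            return torch.distributions.Normal(self.mean_net(obs), self.log_std.exp())

    policy = GaussianPolicy(obs_dim=3, act_dim=2)   # e.g. two continuous actions per state
    obs = torch.randn(3)
    d = policy.dist(obs)
    action = d.sample()                    # a 2-dimensional continuous action
    log_prob = d.log_prob(action).sum(-1)  # sum over action dimensions -> one scalar for REINFORCE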
@AndrewGarrisonTheHuman · 2 years ago
@AndriyDrozdyuk Great! Thanks!
@hoddy2001 · 1 year ago
What dimension is log_prob? And why do we pass only one argument to the loss function? Does it require both the NN output and a target value? Thanks, great video.
@AndriyDrozdyuk · 1 year ago
log_prob is just a scalar here. You can read about why the loss looks like that here (see the "Score" function in both cases): mpatacchiola.github.io/blog/2021/02/08/intro-variational-inference-2.html or here: pytorch.org/docs/stable/distributions.html. Basically, it's a way to get around our inability to backpropagate through random samples (the picking of the action here). Instead we create a surrogate function that *can* be backpropagated. I am considering making a video on policy gradients to explain the loss function in more detail.
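A tiny, self-contained illustration of that point (toy numbers, not taken from the video): the random draw itself is never differentiated; the gradient flows into the network's output through log_prob, and the return just scales it.

    import torch

    logits = torch.tensor([1.0, 2.0, 0.5], requires_grad=True)  # stand-in for the policy net output
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()              # non-differentiable random draw
    log_prob = dist.log_prob(action)    # a scalar: log pi(action | state)
    G = 5.0                             # discounted return for this step
    loss = -G * log_prob                # surrogate: minimizing this maximizes G * log pi
    loss.backward()                     # gradient reaches `logits` via log_prob, not via the sample
    print(log_prob.shape, logits.grad)  # torch.Size([]) and a 3-element gradient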
@timanb2491 · 2 years ago
Can you explain one thing, please: we compute the G's with a fixed policy, and then we use this array of G's to find a direction for optimization. But with every G from the array we update our policy (for example, a NN), so we end up using rewards collected by the initial policy to update a current policy that already differs from the initial one. Is it right to update the current policy with information (G) obtained from the initial policy? In gradient descent, for example, we update the current NN using information from its current state (from forward propagation).
@AndriyDrozdyuk · 2 years ago
Yes, very good, you are correct. I was waiting for someone to point that out! Technically we should do the update all at once; I just thought it would be a bit confusing for the viewer (since it would differ from the book's pseudocode). You can see my old code version here, where I create an EligibilityVector (as it is called in the RL book, top of page 328): github.com/drozzy/reinforce/blob/99a56061102a82bb4a835e852307ffa9d693ac98/reinforce.py#L40 Additionally, here is a PyTorch implementation taking a single step: github.com/pytorch/examples/blob/41b035f2f8faede544174cfd82960b7b407723eb/reinforcement_learning/reinforce.py#L62 On a side note, I do think the book's pseudocode shows it rather strangely, since it makes it seem as if multiple updates are occurring, and that is what I show here. But you are right that after we update the policy the first time, it becomes kind of an off-policy setup, where we are updating a target policy that is no longer the same as the behavior policy.
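For concreteness, here is a sketch of that "all at once" variant (placeholder names and data, not the repo's EligibilityVector code): every per-step term is computed under the same behavior policy, and the parameters change only once, after the whole episode. The gamma**t factor is the one that appears in the book's update rule.

    import torch
    import torch.nn as nn

    policy = nn.Linear(4, 2)                          # stand-in policy network
    optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)
    gamma = 0.99

    # Pretend we already collected one 3-step episode:
    states  = [torch.randn(4) for _ in range(3)]
    actions = [torch.tensor(0), torch.tensor(1), torch.tensor(0)]
    returns = [2.9, 2.0, 1.0]                         # discounted returns G_t

    loss = 0.0
    for t, (s, a, G) in enumerate(zip(states, actions, returns)):
        dist = torch.distributions.Categorical(logits=policy(s))
        loss = loss - (gamma ** t) * G * dist.log_prob(a)  # every term uses the same policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # single update for the whole episode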
@MaartinLPDA · 2 years ago
Why do you need to take the log of the probability?
@AndriyDrozdyuk · 2 years ago
You don't; it's just shorthand for \frac{\nabla \pi}{\pi}. You can look up the derivation and the answer in the Reinforcement Learning book, 2nd ed., section "13.3 REINFORCE: Monte Carlo Policy Gradient", at the very bottom of page 327: incompleteideas.net/book/RLbook2020.pdf
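Written out, the identity behind that shorthand (the "log trick" used in the derivation on page 327) is:

    \nabla_\theta \log \pi(a \mid s, \theta) = \frac{\nabla_\theta \pi(a \mid s, \theta)}{\pi(a \mid s, \theta)}

So the update term G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) is exactly the same quantity as G_t \, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta)}{\pi(A_t \mid S_t, \theta)}; the log form is simply the convenient one to compute with log_prob.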
@blatogh1277 · 2 years ago
Oh man....I passed my assignment
Policy Gradient Methods | Reinforcement Learning Part 6
29:05
Mutual Information
24K views
Deep RL Bootcamp  Lecture 4A: Policy Gradients
53:56
AI Prism
59K views
The Most Important Algorithm in Machine Learning
40:08
Artem Kirsanov
270K views
Bellman Equation Basics for Reinforcement Learning
13:50
Skowster the Geek
141K views
Rust Functions Are Weird (But Be Glad)
19:52
Logan Smith
127K views
Neural Ordinary Differential Equations
35:33
Andriy Drozdyuk
22K views
Vectoring Words (Word Embeddings) - Computerphile
16:56
Computerphile
280K views
Stable Diffusion in Code (AI Image Generation) - Computerphile
16:56
Computerphile
285K views
Reinforcement Learning, by the Book
18:19
Mutual Information
77K views
A* (A Star) Search Algorithm - Computerphile
14:04
Computerphile
1.1M views
RL Course by David Silver - Lecture 7: Policy Gradient Methods
1:33:58
Google DeepMind
267K views