L1 MDPs, Exact Solution Methods, Max-ent RL (Foundations of Deep RL Series)

60,912 views

Pieter Abbeel

Comments: 49
@prerakmathur1431 2 years ago
This guy is seriously the god of reinforcement learning. He and Andrew Ng have single-handedly transformed ML. Kudos to you, Pieter.
@florentinrieger5306 1 year ago
Don't forget David Silver!
@danielmoreiradesousa185 9 months ago
This is some of the best content I've seen in a long time. Congratulations, and thank you so much!
@zerenzhong 1 month ago
I would say this series is a great shortcut for beginners to enter the realm of RL from scratch. Even the most minor details and associated definitions are explained in plain words and broken down step by step. The notes are easy to understand, and the worked computations are beautiful and intuitive. Thank you so much, prof.
@unionsafetymatch 3 years ago
I don't believe what I've stumbled upon. This is amazing!
@theodorocardoso 3 months ago
Thanks for that! It's so much better than any graduate-level RL intro I've watched.
@Prokage 3 years ago
Thank you for everything you've done for the field over the years, Dr. Abbeel.
@henriquepett2124 1 year ago
Nice explanation of RL, Pieter! I'll be watching your updates more closely now.
@Shah_Khan 3 years ago
Thank you, Pieter, for bringing out the latest lecture series on Deep RL. I was looking for exactly that.
@OK-bt6lu 2 years ago
This was the best video lecture intro to deep RL that I have ever watched. Thanks a lot for sharing, Prof. Abbeel! Please post more :)
@hongkyulee9724 9 months ago
This lecture is my first and best RL lecture. ❤❤
@小鹏-w9s 3 years ago
Hi Pieter, thank you very much for this great lecture! I found a mistake on p. 54 of the attached slides: in the policy evaluation expression, the term in the last bracket should be "s'" instead of "s".
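For reference, a sketch of the standard policy evaluation update in generic notation (not necessarily the slide's exact symbols), with the next state s' appearing in both the reward and the value term:

V^{\pi}_{k+1}(s) = \sum_{s'} P\big(s' \mid s, \pi(s)\big)\,\big[ R\big(s, \pi(s), s'\big) + \gamma\, V^{\pi}_{k}(s') \big]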
@wanliyu4243 1 year ago
Regarding the example at 26:30: when calculating the value of V*(3,3), we need to know the values of V*(4,3), V*(3,3) and V*(3,2). The problem is, apart from V*(4,3), how can we get to know the values of V*(3,3) and V*(3,2)?
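In value iteration this circularity is resolved by iterating: the backup at sweep k+1 only uses the values from sweep k, which are already known, starting from V_0 ≡ 0 (a sketch in generic notation, not the slide's exact symbols):

V_{k+1}(3,3) = \max_{a} \sum_{s'} P\big(s' \mid (3,3), a\big)\,\big[ R\big((3,3), a, s'\big) + \gamma\, V_{k}(s') \big]

so the right-hand side uses V_k(4,3), V_k(3,3) and V_k(3,2) from the previous sweep, not the V* values being solved for.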
@stevens68dev75 3 years ago
Really an excellent lecture series! Thanks a lot!! Just one question regarding the example at 22:30: shouldn't the values for the terminal states V(4,3) and V(4,2) both be 0, because in terminal states the expected future reward is 0 (there is no future)? The value iteration algorithm at 30:20 also implies it.
@saihemanthiitkgp 2 years ago
I think it has to do with the environment setup. In the gridworld example, the agent can get rewarded just for being in a specific state, and the action of collecting the gem doesn't require any timestep. Value iteration is formalized more for practical scenarios where the agent is rewarded for the decisions (transitions) it makes. That's why V^*_0(s) = 0 for all s instead of R(s, phi, s), meaning no reward for just being in a state.
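Along those lines, a minimal tabular value iteration sketch (a toy two-state MDP of my own invention, not the lecture's gridworld; rewards live on transitions and the terminal state keeps value 0):

gamma = 0.9

# P[s][a] = list of (probability, next_state, reward) outcomes; toy numbers, purely illustrative
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go": [(0.8, "terminal", 1.0), (0.2, "s0", 0.0)],
    },
    "terminal": {},  # no actions: value stays 0, matching V^*_0(s) = 0
}

V = {s: 0.0 for s in P}  # V_0(s) = 0 for all s
for k in range(100):  # repeated synchronous Bellman backups
    V = {
        s: max(
            (sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
             for outcomes in actions.values()),
            default=0.0,  # terminal states keep value 0
        )
        for s, actions in P.items()
    }

print(V)  # converges toward the optimal values of this toy MDP; 'terminal' stays 0.0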
@wireghost897 1 year ago
Yeah, exactly. This was quite confusing because in every other book I have seen, the terminal state has value 0.
@许翔宇 5 months ago
I agree with you. Did the lecturer make a mistake here?
@VaibhavSharma-u8g 1 year ago
At 32:30 he said we will explicitly avoid the fire pit. How will we do that, given that the only actions we have are up, right and left? It would seem optimal to take up, but as the environment is stochastic we'll end up in the fire pit 20% of the time, and the value function should then also pick up a term of 0.2 * 0.9 * (-1). Am I right?
@ai-from-scratch 1 year ago
The possible actions are up, right, left and down. The optimal action to take is left, which makes the agent stay in the same place with probability 0.8 (the cell to the left is blocked), go up with 0.1, go down with 0.1, and go into the fire pit with 0.0.
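Spelled out with those transition probabilities (a sketch; s denotes the square just left of the fire pit, s_up and s_down its vertical neighbours):

Q_k(s, \text{left}) = 0.8\,\big[0 + \gamma V_{k-1}(s)\big] + 0.1\,\big[0 + \gamma V_{k-1}(s_{\text{up}})\big] + 0.1\,\big[0 + \gamma V_{k-1}(s_{\text{down}})\big]

There is no fire-pit term at all, because its transition probability under the left action is 0, so no negative reward ever enters this backup.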
@eonr 1 year ago
I believe there's a mistake at 51:01. The last term in the last two equations should be the value function of s' instead of s.
@TheAdityagopalan 3 years ago
Nice series, thanks! Question: in the MaxEnt part, starting at around 1:07:00, shouldn't the Lagrangian dual of a max problem be a min-max instead of a max-min?
@MLSCLUB-t2y 6 months ago
Don't forget he is maximizing the function; what you might be used to is minimizing the function, so that's why the logic is flipped.
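For reference, the generic convention at issue (a sketch with the simplex constraint, not necessarily the slide's exact formulation): for a maximization primal, the dual is a minimization over the multiplier of an inner maximization over the primal variable:

\max_{\pi}\ f(\pi) \ \ \text{s.t.}\ \ \sum_a \pi(a) = 1 \quad \Longrightarrow \quad \min_{\lambda}\, \max_{\pi}\ \Big[ f(\pi) + \lambda\Big(1 - \sum_a \pi(a)\Big) \Big]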
@BruinChang 2 years ago
Many thanks, no other words needed.
@KeithMarzSo 28 days ago
I don't understand 32:15. If the dynamics are stochastic, then the term P("firepit" | "left-of-firepit", a) would be non-zero, and I would expect V_2("left-of-firepit") to be negative. What am I missing here?
@junghwanro4829 1 year ago
Thank you for the great lecture. It was super helpful even after taking an RL course.
@itepsilon 3 years ago
Thanks so much for sharing! Awesome!
@offthepathworks9171 11 months ago
Especially liked the 'intuition' part! What would be the best way to go more in-depth on some of the "prerequisites" for RL?
@KrischerBoy 2 years ago
At 53:33 (exercise 2), about the correct Option 2: shouldn't the sums be flipped, i.e. sigma(a) [ sigma(s') [ ... ] ]? If I iterate over "a" on the inside, the term expresses only the reward of the particular s' we wanted to reach through action a, with its respective probability, but not the other states we could end up in through noise. With sigma(a) [ sigma(s') [ ... ] ], the inner term would express the sum over all the possible outcomes action "a" could cause, and only then would we iterate over another action.
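For comparison, the standard Bellman backup in generic notation (a sketch, not the exercise's exact wording): for each fixed action a, the inner sum runs over all possible next states s', which is what accounts for the noise, and only then does the outer operation range over actions:

V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\big[ R(s, a, s') + \gamma\, V_k(s') \big]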
@user-or7ji5hv8y 3 years ago
Awesome!
@awesama 3 years ago
You describe Reinforcement Learning as trial-and-error learning. Can we not use prior knowledge of a task, or prior data, to help learn a task or to speed up learning it? For example, could AlphaGo learn from existing Go games between players?
@julianequihuabenitez5344 3 years ago
You can, it's called offline reinforcement learning.
@MrJugodoy 3 years ago
I believe AlphaGo does this, unlike AlphaZero, which starts learning from "scratch".
@JumpDiffusion 2 years ago
Yes, that's what a replay buffer is for.
@Himanshu-xe7ek 2 years ago
At 1:11:00, how did pi(a) = exp[(r(a) - beta + lambda)/beta] become pi(a) = exp(r(a)/beta)/Z? Where is the (lambda - 1) term?
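One way to see where the multiplier goes, starting from the expression in the comment (a sketch): the factor that does not depend on a is pinned down by normalization and becomes 1/Z:

\pi(a) = \exp\!\Big(\frac{r(a) - \beta + \lambda}{\beta}\Big) = \exp\!\Big(\frac{r(a)}{\beta}\Big)\,\exp\!\Big(\frac{\lambda - \beta}{\beta}\Big), \qquad \sum_a \pi(a) = 1 \ \Rightarrow\ \exp\!\Big(\frac{\lambda - \beta}{\beta}\Big) = \frac{1}{\sum_{a'} \exp\big(r(a')/\beta\big)} = \frac{1}{Z}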
@karthiksuryadevara2546 3 months ago
Good explanation, very clear. I did not understand the entropy part; could someone suggest a good resource to understand it?
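For what it's worth, the entropy piece is just the expected negative log-probability of the policy's own actions, which the max-ent objective adds to the expected reward with a temperature β (a sketch in generic notation):

H(\pi) = -\sum_a \pi(a)\,\log \pi(a), \qquad \max_{\pi}\ \mathbb{E}_{a \sim \pi}\,[r(a)] + \beta\, H(\pi)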
@blackdeutrium746 3 years ago
Hi professor, about the walking robot you made and showed: if I want to make a similar type of robot, what do I have to learn? I'm quite interested in deep reinforcement learning.
@guoshenli4193 3 years ago
Great lecture, thanks so much!!!!
@datascience6104 3 years ago
Thanks for sharing 👍
@goldfishjy95 3 years ago
omg thank you so much!!!!
@sakethsaketh750 6 months ago
Is this lecture series recommended for me as a roboticist who doesn't have the basics of deep learning or machine learning? Can I start directly with this series?
@김동규-b6l 3 years ago
Thank you!
@zainmehdi8886 1 year ago
How do we know/calculate the action success probability? By collecting statistical data?
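One common approach when the model isn't given is a maximum-likelihood count estimate from logged transitions (a minimal sketch; the data format here is an assumption of mine):

from collections import defaultdict

# logged (state, action, next_state) transitions; toy illustrative data
transitions = [("s0", "go", "s1"), ("s0", "go", "s1"), ("s0", "go", "s0"), ("s0", "stay", "s0")]

counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = times observed
for s, a, s2 in transitions:
    counts[(s, a)][s2] += 1

# P_hat[(s, a)][s'] = empirical transition probability
P_hat = {
    sa: {s2: n / sum(outcomes.values()) for s2, n in outcomes.items()}
    for sa, outcomes in counts.items()
}

print(P_hat[("s0", "go")])  # {'s1': 0.666..., 's0': 0.333...}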
@TheThunderSpirit 2 years ago
Is it necessary that the state set be finite?
@LucSoret 1 month ago
The explanations of the update rule are a bit quick. This video gives more insight: kzbin.info/www/bejne/ommao5qCnJ5jfqs. Refer to the comments there for more explanation, especially for n=3 and why V(3,3) = 0.78: "You need to consider the noise. 80% of the time, the agent goes in the desired direction. But 10% of the time it goes left relative to the desired direction, and 10% of the time it goes right. If the agent hits a wall, it stays in place. So, at iteration 3, 10% of the time the agent goes left and hits a wall, so it ends up staying at (3,3). 10% of the time it goes right and ends up in (2,3). 80% of the time it goes where it wants and ends up in (3,4). The value at iteration 2 of (3,3) = 0.72, (2,3) = 0, and (3,4) = 1. So, using the Bellman equation, you get 0.8[0 + (0.9)(1)] + 0.1[0 + (0.9)(0.72)] + 0.1[0 + (0.9)(0)] = 0.78"
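A quick numerical check of that backup (same numbers as the quoted comment):

v = 0.8 * (0 + 0.9 * 1.0) + 0.1 * (0 + 0.9 * 0.72) + 0.1 * (0 + 0.9 * 0.0)
print(v)  # 0.7848, which rounds to the 0.78 quoted above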
@pipe_runner_lab 1 year ago
I would recommend kzbin.info/www/bejne/b5iWY6ltl7BmedE before starting this video. The lecturer goes into greater detail on value interaction and how it works.
@wireghost897 1 year ago
*value iteration
@bigjeffystyle7011 1 year ago
Thanks for the suggestion!
@ThePatelprateek 1 month ago
May I suggest that a basic course should at least introduce all the terminology? It would be good to specify what pi is upfront.