This guy is seriously the god of reinforcement learning. He and Andrew Ng have single-handedly transformed ML. Kudos to you, Pieter.
@florentinrieger5306 · a year ago
Don't forget David Silver!
@danielmoreiradesousa185 · 9 months ago
This is some of the best content I've seen in a long time. Congratulations, and thank you so much!
@zerenzhong · a month ago
I would say this series is a great shortcut for beginners entering the realm of RL from scratch. Even the most minor details and associated definitions are broken down and explained in plain words. The notes are easy to understand, and the illustrations of the computations are beautiful and intuitive. Thank you so much, Prof.
@unionsafetymatch · 3 years ago
I don't believe what I've stumbled upon. This is amazing!
@theodorocardoso · 3 months ago
Thanks for that! It's so much better than any RL intro I've watched at the graduate level.
@Prokage · 3 years ago
Thank you for everything you've done for the field over the years, Dr. Abbeel.
@henriquepett2124 · a year ago
Nice explanation of RL, Pieter! I'll be watching your updates more closely now.
@Shah_Khan · 3 years ago
Thank you, Pieter, for bringing us this latest lecture series on deep RL. I was looking for exactly that.
@OK-bt6lu · 2 years ago
This was the best video lecture intro to deep RL that I have ever watched. Thanks a lot for sharing, Prof. Abbeel! Please post more :)
@hongkyulee9724 · 9 months ago
This lecture is my first and best RL lecture. ❤❤
@小鹏-w9s · 3 years ago
Hi Pieter, thank you very much for this great lecture! I found a mistake on p. 54 of the attached slides. In the policy evaluation expression, the term in the last bracket should be "s'" instead of "s".
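For reference, the standard policy evaluation backup (written here in the usual MDP notation, which may differ slightly from the slide's) bootstraps from the value of the next state s':

V^{\pi}(s) \leftarrow \sum_{s'} P(s' \mid s, \pi(s)) \left[ R(s, \pi(s), s') + \gamma V^{\pi}(s') \right]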
@wanliyu4243 · a year ago
Regarding the example at 26:30: when calculating the value of V*(3,3), we need to know the values of V*(4,3), V*(3,3), and V*(3,2). The problem is, apart from V*(4,3), how can we know the values of V*(3,3) and V*(3,2)?
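For reference, value iteration sidesteps this chicken-and-egg problem: each sweep computes the new estimate V_{k+1} from the previous iteration's estimates V_k, not from the unknown optimal values. In the standard notation (which may differ slightly from the slide's):

V_{k+1}(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_{k}(s') \right]

So V_{k+1}(3,3) only needs V_k(3,3) and V_k(3,2), which were computed in the previous sweep, starting from V_0 = 0 everywhere.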
@stevens68dev75 · 3 years ago
Really an excellent lecture series! Thanks a lot!! Just one question regarding the example at 22:30: shouldn't the values of the terminal states V(4,3) and V(4,2) both be 0, because in terminal states the expected future reward is 0 (there is no future)? The value iteration algorithm at 30:20 also implies this.
@saihemanthiitkgp · 2 years ago
I think it has to do with how the environment is set up. In the gridworld example, the agent can be rewarded just for being in a specific state, and the action of collecting the gem doesn't take a timestep. Value iteration is formalized more for the practical scenario where the agent is rewarded for the decisions (transitions) it makes. That's why V^*_0(s) = 0 for all s instead of R(s, phi, s), meaning no reward for merely being in a state.
@wireghost897 · a year ago
Yeah, exactly. This was quite confusing, because in every other book I have seen, the terminal state has value 0.
@许翔宇 · 5 months ago
I agree with you. Did the lecturer make a mistake here?
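To make the initialization concrete, here is a minimal tabular value iteration sketch (hypothetical code, not the lecture's implementation; the transition model P and reward function R are assumed to be given by the MDP). It starts from V_0(s) = 0 for every state and assigns terminal states no future value, matching the convention discussed above:

def value_iteration(states, actions, P, R, gamma=0.9, terminals=(), n_iters=100):
    # P[s][a] is a list of (prob, next_state) pairs; R(s, a, s2) is the transition reward.
    V = {s: 0.0 for s in states}              # V_0(s) = 0 for all s
    for _ in range(n_iters):
        V_new = {}
        for s in states:
            if s in terminals:
                V_new[s] = 0.0                # no future reward from a terminal state
                continue
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for p, s2 in P[s][a])
                for a in actions
            )
        V = V_new
    return V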
@VaibhavSharma-u8g · a year ago
At 32:30 he said we will explicitly avoid the fire pit. How will we do that, since the only actions we have are up, right, and left? It would be optimal to take up, but as the environment is stochastic we'll end up in the fire pit 20% of the time, and the value function should then update by 0.2*0.9*-1. Am I right?
@ai-from-scratch · a year ago
The possible actions are up, right, left, and down. The optimal one to take is left, which makes the agent stay in the same place with probability 0.8, go up with probability 0.1, go down with probability 0.1, and go into the fire pit with probability 0.0.
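In other words (assuming, as described above, that "left" runs into a wall so the agent stays put with probability 0.8, and assuming a living reward of 0), the backup for choosing "left" contains no fire pit term at all:

Q_{k+1}(s, \text{left}) = 0.8\,\gamma V_k(s) + 0.1\,\gamma V_k(s_{\text{up}}) + 0.1\,\gamma V_k(s_{\text{down}})

Here s_up and s_down denote the cells reached by the two perpendicular slips; the fire pit appears with probability 0, so no negative value enters.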
@eonr · a year ago
I believe there's a mistake at 51:01. The last term in the last two equations should be the value function of s' instead of s.
@TheAdityagopalan · 3 years ago
Nice series, thanks! Question: in the MaxEnt part, starting at around 1:07:00, shouldn't the Lagrangian dual of a max problem be a min-max instead of a max-min?
@MLSCLUB-t2y · 6 months ago
Don't forget that he is maximizing the function. What you might be used to is minimizing the function, so that's why the logic is flipped.
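As a general note on the duality question (a statement about convex duality, not about the slide's exact derivation): the MaxEnt objective is concave and the constraints are linear, so strong duality holds and the two orderings give the same value,

\max_{\pi} \min_{\lambda} \mathcal{L}(\pi, \lambda) = \min_{\lambda} \max_{\pi} \mathcal{L}(\pi, \lambda)

so either form can be used here.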
@BruinChang · 2 years ago
Many thanks; no other words needed.
@KeithMarzSo · 28 days ago
I don't understand 32:15. If the dynamics are stochastic, then the term P("firepit" | "left-of-firepit", a) would be non-zero, and I would expect V_2("left-of-firepit") to be negative. What am I missing here?
@junghwanro4829 · a year ago
Thank you for the great lecture. It was super helpful even after taking an RL course.
@itepsilon · 3 years ago
Thanks so much for sharing! Awesome!
@offthepathworks9171 · 11 months ago
I especially liked the "intuition" part! What would be the best way to go more in-depth on some of the "prerequisites" for RL?
@KrischerBoy · 2 years ago
At 53:33 (exercise 2), about the correct Option 2: shouldn't the sums be flipped, i.e. Σ_a [ Σ_{s'} [ ... ] ]? As written, if I iterate over a, the term seems to express only the value of the s' we wanted to reach through action a, with its respective probability, but not the other states reachable through noise. With Σ_a [ Σ_{s'} [ ... ] ], the term would express the sum over all the possible outcomes action a could cause, and only then iterate over another action.
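Whichever form the slide uses, one general point may help: both sums run over finite sets, so they can be exchanged without changing the result, as long as every term keeps its weights. Assuming exercise 2 is about evaluating a stochastic policy (an assumption on my part), either ordering is valid:

\sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]
= \sum_{s'} \sum_{a} \pi(a \mid s)\, P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]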
@user-or7ji5hv8y · 3 years ago
Awesome!
@awesama · 3 years ago
You describe reinforcement learning as trial-and-error learning. Can we not use prior knowledge of a task, or prior data, to help learn the task or speed up the learning? For example, can AlphaGo learn from existing Go games between players?
@julianequihuabenitez5344 · 3 years ago
You can, it's called offline reinforcement learning.
@MrJugodoy · 3 years ago
I believe AlphaGo does this, unlike AlphaZero, which starts learning from "scratch".
@JumpDiffusion · 2 years ago
yes, that's what a replay buffer is for.
@Himanshu-xe7ek · 2 years ago
At 1:11:00, how did pi(a) = exp[(r(a) - beta + lambda)/beta] become pi(a) = exp(r(a)/beta)/Z? Where did the lambda - 1 term go?
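Taking the two expressions in the question at face value (so this may differ from the slide's exact notation), the lambda term does not vanish; it is absorbed into the normalizer Z:

\pi(a) = \exp\!\left(\tfrac{r(a) - \beta + \lambda}{\beta}\right)
       = \exp\!\left(\tfrac{r(a)}{\beta}\right) \exp\!\left(\tfrac{\lambda - \beta}{\beta}\right)

The second factor does not depend on a, and since \sum_a \pi(a) = 1 it must equal 1/Z with Z = \sum_a \exp(r(a)/\beta), which gives \pi(a) = \exp(r(a)/\beta)/Z.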
@karthiksuryadevara2546 · 3 months ago
Good explanation, very clear. I did not understand the entropy part; could someone suggest a good resource for understanding it?
@blackdeutrium746 · 3 years ago
Hi Professor, about the walking robot you just made and showed: if I want to make a similar type of robot, what do I have to learn? I'm quite interested in deep reinforcement learning.
@guoshenli4193 · 3 years ago
Great lecture, thanks so much!!!!
@datascience6104 · 3 years ago
Thanks for sharing 👍
@goldfishjy95 · 3 years ago
omg thank you so much!!!!
@sakethsaketh750 · 6 months ago
Is this lecture series recommended for me as a roboticist who doesn't have the basics of deep learning or machine learning? Can I start directly with this series?
@김동규-b6l · 3 years ago
Thank you!
@zainmehdi8886 · a year ago
How do we know/calculate the action success probability? By collecting statistical data?
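One common answer (a general note, not necessarily what the lecture proposes): if the dynamics are unknown, the transition probabilities can be estimated from counts of observed transitions,

\hat{P}(s' \mid s, a) = \frac{N(s, a, s')}{N(s, a)}

where N(s, a, s') is how often taking action a in state s led to s', and N(s, a) is how often a was taken in s. Model-free methods avoid estimating these probabilities explicitly and learn values or policies directly from samples.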
@TheThunderSpirit · 2 years ago
Is it necessary that the state set be finite?
@LucSoret · a month ago
The explanations of the policy update are a bit quick. This video gives more insight: kzbin.info/www/bejne/ommao5qCnJ5jfqs Refer to the comments there for more explanation, especially for n=3 and why V(3,3) = 0.78: "You need to consider the noise. 80% of the time, the agent goes in the desired direction. But 10% of the time it goes left relative to the desired direction, and 10% of the time it goes right. If the agent hits a wall, it stays in place. So, at iteration 3, 10% of the time the agent goes left, hits a wall, and ends up staying at (3,3). 10% of the time it goes right and ends up in (2,3). 80% of the time it goes where it wants and ends up in (3,4). Value at iteration 2 of (3,3) = 0.72, (2,3) = 0, and (3,4) = 1. So, using the Bellman equation, you get 0.8[0+(0.9)(1)]+0.1[0+(0.9)(0.72)]+0.1[0+(0.9)(0)] = 0.78"
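A quick way to check the quoted arithmetic (a throwaway Python snippet; values taken from the comment above):

gamma = 0.9
v_prev = {"(3,4)": 1.0, "(3,3)": 0.72, "(2,3)": 0.0}   # iteration-2 values quoted above
v = (0.8 * (0 + gamma * v_prev["(3,4)"])    # intended move succeeds
     + 0.1 * (0 + gamma * v_prev["(3,3)"])  # slips into the wall, stays put
     + 0.1 * (0 + gamma * v_prev["(2,3)"])) # slips the other way
print(round(v, 2))                          # 0.78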
@pipe_runner_lab · a year ago
I would recommend kzbin.info/www/bejne/b5iWY6ltl7BmedE before starting this video. The lecturer goes into greater detail on value interaction and how it works.
@wireghost897 · a year ago
*value iteration
@bigjeffystyle7011 · a year ago
Thanks for the suggestion!
@ThePatelprateek · a month ago
May I suggest that a basic course should at least introduce all the terminology? It would be good to specify what pi is up front.