People who feel like quitting at this stage: relax, take a break, watch this video over and over again, and read Sutton and Barto. Do everything but don't quit. You are amongst the 10% who came this far.
@matejnovosad91523 жыл бұрын
Me in highschool trying to make a rocket league bot: :0
@TY-il7tf3 жыл бұрын
I finally came to this stage of learning after watching his videos over and over. He did very well explaining everything, but RL knowledge differs from other ML and it takes time to learn and get used to.
@juliansuse13 жыл бұрын
Bro thanks for the encouragement
@SuperBiggestking2 жыл бұрын
Brother Ashsat! This was the most timely comment ever on YouTube. I was watching these lectures and I felt brain-dead since they are pretty long. This is a great encouragement. May God bless you for this!
@jamesnesfield72882 жыл бұрын
My best advice is to apply each algorithm in the Sutton and Barto textbook to problems in OpenAI Gym to help understand all this..... you can do it, you got this....
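To make that concrete, here is a minimal sketch of such a loop, assuming the Gymnasium package and the CartPole-v1 environment (the random action is just a placeholder for whatever algorithm from the book you are plugging in):

import gymnasium as gym

env = gym.make("CartPole-v1")
for episode in range(5):
    state, info = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # replace with your algorithm's action selection
        state, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"episode {episode}: return {total_reward}")
env.close()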
@xicocaio4 жыл бұрын
This is what I call commitment: David Silver explored not showing his face policy, received less reward, and then switched back to the past lectures' optimal policy. Nothing like learning from this "one stream of data called life."
@Sanderex4 жыл бұрын
What a treasure of a comment
@camozot3 жыл бұрын
And we experienced a different state, realized we got much less reward from it, and updated our value function! Then David adjusts his policy to our value function like actor critic? Or is that a lil stretch, meh I think true there's some link between his and our value function here, he wants us to do well, because he's a legend!
@stevecarson70313 жыл бұрын
Genius!
@js116600 Жыл бұрын
Can't discount face value!
@alexanderyau63477 жыл бұрын
Oh, I can't concentrate without seeing David.
@terrykarekarem91805 жыл бұрын
Exactly the same here.
@musicsirme36044 жыл бұрын
Me too. This is much better than the 2018 one.
@mathavraj93784 жыл бұрын
How will I understand expected return without him walking to a spot and summing up rewards all along the path after it?
@DavidKristoffersson4 жыл бұрын
Turn on subtitles. Helps a lot.
@JonnyHuman7 жыл бұрын
For those confused:
- Whenever he speaks of the u vector he's talking about the theta vector (slides don't match).
- At 1:02:00 he's talking about slide 4.
- At 1:16:35 he says V-hat but the slides show Vv.
- He refers to Q in Natural Policy Gradient, which is actually G_theta in the slides.
- At 1:30:30 the slide should be 41 (the last slide), not the Natural Actor-Critic slide.
@illuminatic76 жыл бұрын
Also, when he starts talking about DPG at 1:26:10 it helps a lot to have a look at his original paper (proceedings.mlr.press/v32/silver14.pdf) pages 4 and 5 in particular. I think the DPG slides he is actually referring to are not available online.
@MiottoGuilherme5 жыл бұрын
@50:00 he mentions G_t, but the slides show v_t, right?
@gregh65865 жыл бұрын
@@MiottoGuilherme yes. G_t as in Sutton/Barto's book, i.e. future, discounted reward.
@mhtb324 жыл бұрын
@@illuminatic7 Unfortunately that link doesn't work anymore, here is an alternative: hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf
@finarwa36029 ай бұрын
I have to listen repeatedly because I could not concentrate without seeing him. I have to imagine what he was trying to show through his gestures. This is a gold-standard lecture for RL. Thank you, Professor David Silver.
@saltcheese8 жыл бұрын
Ahhh... where did you go, David? I loved your moderated gesturing.
@lucasli92258 жыл бұрын
I have the same question too. His gestures are really helpful in learning the course!
@Chr0nalis8 жыл бұрын
He probably forgot to turn on the cam capture :)
@DoingEasy7 жыл бұрын
Lecture 7 is optional
@danielc42676 жыл бұрын
I think Lecture 10 is optional. Lecture 7 seems rather important.
1:30 Outline
3:25 Policy-Based Reinforcement Learning
7:40 Value-Based and Policy-Based RL
10:15 Advantages of Policy Based RL
14:10 Example: Rock-Paper-Scissors
16:00 Example: Aliased Gridworld
20:45 Policy Objective Function
23:55 Policy Optimization
26:40 Policy Gradient
28:30 Computing Gradients by Finite Differences
30:30 Training AIBO to Walk by Finite Difference Policy Gradient
33:40 Score Function
36:45 Softmax Policy
39:28 Gaussian Policy
41:30 One-Step MDPs
46:35 Policy Gradient Theorem
48:30 Monte-Carlo Policy Gradient (REINFORCE)
51:05 Puck World Example
53:00 Reducing Variance Using a Critic
56:00 Estimating the Action-Value Function
57:10 Action-Value Actor-Critic
1:05:04 Bias in Actor-Critic Algorithms
1:05:30 Compatible Function Approximation
1:06:00 Proof of Compatible Function Approximation Theorem
1:06:33 Reducing Variance using a Baseline
1:12:05 Estimating the Advantage Function
1:17:00 Critics at Different Time-Scales
1:18:30 Actors at Different Time-Scales
1:21:38 Policy Gradient with Eligibility Traces
1:23:50 Alternative Policy Gradient Directions
1:26:08 Natural Policy Gradient
1:30:05 Natural Actor-Critic
@krishnanjanareddy20673 жыл бұрын
And it turns out that this is still the best course to learn RL, even after 6 years.
@snared_ Жыл бұрын
really? What were you able to do with this information?
@michaellin94075 жыл бұрын
This course should be called: "But wait, there's an even better algorithm!"
@mathavraj93784 жыл бұрын
lol entire machine learning is like that
@akshatgarg66353 жыл бұрын
That, my friend, is the core principle of any field of engineering. That's how computers got from a room-sized contraption to a handheld device. Because somebody said, wait, there is an even better way of doing this.
@Wuu4D8 жыл бұрын
Damn. It was a lot easier understanding it with gestures.
@fktudiablo95795 жыл бұрын
He could describe his gestures in the subtitles.
@chrisanderson15137 жыл бұрын
Starts at 1:25. Actor critic at 52:55.
@ranhao88827 жыл бұрын
thx
@lorenzovannini823 ай бұрын
These lectures are a gift. Thanks.
@T4l0nITA3 жыл бұрын
By far the best video about policy gradient methods on youtube
@georgegvishiani7366 жыл бұрын
It would have been great if it was possible to recreate David in this lecture based on his voice using some combination of RL frameworks.
@Vasha884 жыл бұрын
First time in my life I had to DECREASE the speed of the video and not increase it... man, he talks REALLY fast, while at the same time showing new slides filled with equations.
@helinw6 жыл бұрын
Just to make sure: at 36:22, the purpose of the likelihood ratio trick is to convert the gradient of the objective function back into an expectation? Just as David said at 44:33, "... that's the whole point of using the likelihood ratio trick".
@AM-kx4ue5 жыл бұрын
I'm not sure about it either.
@edmonddantes4705 Жыл бұрын
That's exactly right. Once you convert it into an expectation, you can approximate it by sampling, so that trick is very practical.
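Written out, the step being discussed is roughly the following (a sketch in the lecture's one-step notation, where d(s) is the state distribution and \mathcal{R}_{s,a} the expected reward):

\nabla_\theta J(\theta)
  = \sum_{s} d(s) \sum_{a} \nabla_\theta \pi_\theta(s,a)\, \mathcal{R}_{s,a}
  = \sum_{s} d(s) \sum_{a} \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a}
  = \mathbb{E}_{\pi_\theta}\big[ \nabla_\theta \log \pi_\theta(s,a)\, r \big]

using \nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta. The last line is an expectation over the states and actions actually visited under \pi_\theta, which is exactly what sampled trajectories estimate.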
@OmPrakash-vt5vr4 ай бұрын
“No matter how ridiculous the odds may seem, within us resides the power to overcome these challenges and achieve something beautiful. That one day we look back at where we started, and be amazed by how far we’ve come.” -Technoblade I started this series a month ago in summer break, I even did the Easy21 assignment and now I finally learned what I wanted, when I started this series i.e. Actor Critic Method. Time to do some gymnasium env.
@florentinrieger5306 Жыл бұрын
It is unfortunate that exactly this episode is without David on the screen. It is again a quite complex topic, and David jumping and running around and pointing out the relevant parts makes it much easier to digest.
@jurgenstrydom7 жыл бұрын
I wanted to see the AIBO training :(
@felixt12506 жыл бұрын
Me too. If you look at the paper by Nate Kohl and Peter Stone where they describe it, they reference a web page for the videos. And surprisingly it is still online. You can find it at www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
@tchz4 жыл бұрын
@@felixt1250 not anymore :'(
@gunsodo3 жыл бұрын
@@tchz I think it is still there but you have to copy and paste the link.
@MrCmon1134 жыл бұрын
Unfortunately the slides do not fit what is said. It's a pity they don't seem to put much effort into these videos. David is surely one of the best people to learn RL from.
@akarshrastogi51455 жыл бұрын
This lecture was immensely difficult to get owing to David's absence and the mismatch of the slides.
@liamroche14737 жыл бұрын
I am not sure exactly how this video was created, but the right slide is often not displayed (especially near the end, but elsewhere as well). It is probably better to download the slides for the lecture and find your own way through them while listening to the audio.
@ErwinDSouza5 жыл бұрын
At 45:36 I think the notation he is describing is different from that shown in these slides. I think his "capital R" is the small "r" for us. And the "curly R" is the "Rs,a" for us.
@MrCmon1134 жыл бұрын
Also u is theta.
@AdamCajf4 жыл бұрын
Yes, fully agree. I believe this is important, so to reiterate the small correction: the lowercase r is the random reward, the actual reward that the agent/we experience, while the curly uppercase R is the reward from the MDP (Markov Decision Process).
@omeryilmaz66536 жыл бұрын
You are fantastic David. Thanks for the tutorial.
@JakobFoerster9 жыл бұрын
Thank you for creating the video John, this is really great!
@LucGendrot6 жыл бұрын
Is there any particular reason that, in the basic TD(0) QAC pseudocode (1:00:00), we don't update the Q weights first before doing the theta gradient update?
@alvinphantomhive37945 жыл бұрын
I think you can start with an arbitrary value for the weights, since the weights will be adjusted in proportion to the TD error and get better as the iterations progress.
@edmonddantes4705 Жыл бұрын
Super good question. I am guessing the reason is computational right? You want to reuse the computation you did for Q_w(s,a) when computing delta instead of computing it again with new weights when doing the gradient ascent update of the policy parameters (theta). However, what you propose seems more solid, just more costly.
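For what it's worth, here is a rough sketch of one QAC step in the order shown on the slide, with a linear critic and a softmax actor (the feature choice, step sizes, and the fake transition below are illustrative assumptions, not taken from the lecture). Note that the critic weights w are only updated after the actor step, so the actor reuses the Q_w(s,a) that went into the TD error:

import numpy as np

n_states, n_actions = 5, 2
dim = n_states * n_actions
rng = np.random.default_rng(0)

def phi(s, a):                       # one-hot state-action features (toy choice)
    x = np.zeros(dim)
    x[s * n_actions + a] = 1.0
    return x

def q(s, a, w):                      # linear critic: Q_w(s,a) = w . phi(s,a)
    return w @ phi(s, a)

def policy_probs(s, theta):          # softmax actor over linear scores
    scores = np.array([theta @ phi(s, b) for b in range(n_actions)])
    p = np.exp(scores - scores.max())
    return p / p.sum()

def grad_log_pi(s, a, theta):        # softmax score: phi(s,a) - E_pi[phi(s,.)]
    p = policy_probs(s, theta)
    return phi(s, a) - sum(p[b] * phi(s, b) for b in range(n_actions))

theta, w = np.zeros(dim), np.zeros(dim)
alpha, beta, gamma = 0.01, 0.1, 0.99

# One transition (s, a, r, s'), pretending it came from the environment.
s, a, r, s_next = 2, 1, 1.0, 3
a_next = rng.choice(n_actions, p=policy_probs(s_next, theta))

delta = r + gamma * q(s_next, a_next, w) - q(s, a, w)    # TD error with the current w
theta += alpha * grad_log_pi(s, a, theta) * q(s, a, w)   # actor step reuses Q_w(s,a)
w += beta * delta * phi(s, a)                            # critic step last, as on the slide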
@jorgelarangeira70136 жыл бұрын
It took me a while to realize that policy function pi(s, a) is alternately used as the probability of taking a certain action in state s, and the action proper (a notation overload that comes from the Sutton book). I think specific notation for each instance would avoid a lot of confusion.
@d41386 жыл бұрын
36:20, could anyone please explain what kind of expectation we are computing (I only see the gradients), and why the expectation of the right-hand side is easier to compute than that of the left-hand side?
@edmonddantes4705 Жыл бұрын
You want to maximise J at 43:25, which is the expected immediate reward. Note that thanks to the computation at 36:20, the gradient of J at 43:25 becomes an expectation. The expectation is computed over the full state-action space of the MDP with policy pi_\theta. Note that without the term pi_\theta(s,a) in the sum, that expression would not be an expectation anymore, so you COULD NOT APPROXIMATE IT BY SAMPLING.
@SSS.83205 жыл бұрын
We miss you David
@BramGrooten4 жыл бұрын
Is there perhaps a link to the videos of AIBOs running? (supposed to be shown at 31:55)
@BramGrooten4 жыл бұрын
@@oDaRkDeMoNo Thank you!
@AliRafieiAliRafiei9 жыл бұрын
Thank you many times, dear Karolina. Cheers
@Darshanhegde9 жыл бұрын
Thanks for updating lectures :) I sort of got stuck on this lecture because the video wasn't available :P Now I have no excuse for not finishing the course !
@Darshanhegde9 жыл бұрын
Thought this was a real video! I was wrong! David keeps referring to equations on the slides, but the audio and slides are not synced! It's confusing sometimes! But still better than just audio!
@jingchuliu86358 жыл бұрын
The slides are perfectly synced with the audio most of the time, but the slides on "compatible function approximation" are not in the right order and the slides on "deterministic policy gradient" are missing.
@fktudiablo95795 жыл бұрын
1:00:54, this man got a -1 reward and restarted a new episode.
@alvinphantomhive37945 жыл бұрын
1:00:55 " Ugnkhhh.... "
@MotherFriendSon4 жыл бұрын
Sometimes David's words and the slides don't correspond to each other, and I don't know what to do: listen to David or read the slides. For example at 1:29:55, when he speaks about the deterministic policy gradient theorem.
@binjianxin78304 жыл бұрын
David has a paper about DPG which he mentioned was published “last year” in 2014, later a DDPG one. Just check them out.
@nirajabcd4 жыл бұрын
The lectures were going great until someone decided not to show David's gestures. God I was learning so much just from his gestures.
@mohammadfarzanullah55494 жыл бұрын
He teaches much better than Hado van Hasselt. Makes it much easier.
@nirmalnarasimha9181 Жыл бұрын
Made me cry after a very long time :( given the professor's absence and the slide mismatch
@d41386 жыл бұрын
Could anyone please explain the slide at 45:51? In particular, I don't understand how the big $R_{s,a}$ becomes just $r$ when we compress the gradient into the expectation E. What is the difference between the big R and the small one?
@edmonddantes4705 Жыл бұрын
r is the immediate reward understood as a RANDOM VARIABLE. This is useful because we want to compute the expectation of r along the state space generated by the MDP given a fixed policy pi. This is a measure of how good our policy is. R_{s,a} is the expectation of r given that you are at state s and carry out action a, i.e. R_{s,a} = E[r | s,a].
@samlaf925 жыл бұрын
I don't understand why he says that Value-based methods can't work with stochastic policies? By definition epsilon-greedy is stochastic. If we find two actions with the same value, we could simply have a stochastic policy with 1/2 probability to both. And thus, value-function based methods could also solve the aliased-gridworld example around 20:00.
@edmonddantes4705 Жыл бұрын
The convergence theorems in CONTROL require epsilon --> 0. If you read papers, you will often see assumptions of GLIE type (greedy in the limit with infinite exploration), which go towards a deterministic policy. David also mentions this (lecture 5 I think).
@sengonzi20102 жыл бұрын
Fantastic lectures
@20a3c5f93 жыл бұрын
51:58 - "you get this very nice smooth learning curve... but learning is slow because rewards are high variance" Any idea why the learning curve is smooth despite the high variance of the returns? We use the returns directly in the gradient formula, so intuitively I'd guess they'd affect the behavior of the learning curve as well.
@edmonddantes4705 Жыл бұрын
I mean, look at the scale, it is massive. I bet if you zoom in, it is not going to be very smooth. Let's say we have an absorbing MDP with pretty long trajectories and we calculate the mean returns by applying MC. By the central limit theorem, the mean experimental returns converge to the real returns, but it will take many iterations due to the high variance of those returns. The smoothness you would see when zooming out (when looking at how the mean returns converge) would be due to the central limit theorem. Note that I am simply making a parallel. In the case of MC policy gradient, that smoothness is due to its convergence properties, which rely on the fact that the MC returns are unbiased samples of the real returns, but that thing is very bumpy when you zoom in, precisely due to the variance.
@kunkumamithunbalajivenkate88932 жыл бұрын
32:38 - AIBO Training Video Links: www.cs.utexas.edu/~AustinVilla/?p=research/learned_walk
@hermonqua12 күн бұрын
@ 1:08:00 What happened to the log in the first equation?
@gregh65865 жыл бұрын
Hado van Hasselt holds basically the same lecture here: kzbin.info/www/bejne/mIPJhquHqJurf68. I still like David's lecture much more, but perhaps this other lecture can fill some of the gaps that appeared with David's disappearance.
@randalllionelkharkrang40472 жыл бұрын
Around 1:00:00, in the action-value actor-critic algorithm, to update w he used \beta * \delta * feature. Why is he taking the feature here? In model-free evaluation he used the eligibility trace, but why the feature here?
@edmonddantes4705 Жыл бұрын
He is using linear function approximation for Q. It is a choice. Not sure why you are bothered that much by that.
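For reference, with a linear critic the feature vector is simply the gradient of Q_w, so the update in the pseudocode is just a stochastic gradient step on the critic (a sketch):

Q_w(s,a) = \phi(s,a)^\top w
  \Rightarrow \nabla_w Q_w(s,a) = \phi(s,a)
\delta = r + \gamma\, Q_w(s',a') - Q_w(s,a)
\Delta w = \beta\, \delta\, \nabla_w Q_w(s,a) = \beta\, \delta\, \phi(s,a)

With eligibility traces you would instead accumulate e \leftarrow \gamma\lambda\, e + \phi(s,a) and use e in place of \phi(s,a), which recovers the form from the model-free evaluation lectures.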
@SuperBiggestking2 жыл бұрын
Following this lecture is like learning math by listening to a podcast.
@GeneralKenobi694202 жыл бұрын
Not sure if that's supposed to be good or not lol
@fndTenorio6 жыл бұрын
28:52 So J(theta) is the average reward your agent gets following the policy with parameters theta, and pi(theta, a) is the probability of taking action a under that policy?
@AM-kx4ue5 жыл бұрын
J(theta) is the objective function; check the 22:30 slide.
@rylanschaeffer32488 жыл бұрын
At 1:03:49, shouldn't the action be sampled from \pi_\theta(s', a)?
@MinhVu-fo6hd7 жыл бұрын
In the line "Sample a ~ pi_theta" of the actor-critic algorithm around 58:00: from what I understand, pi_theta(s, a) = P[a | s, theta], but I don't clearly understand how we can pick an action a given s and theta. Do we have to calculate phi(s, a) * theta for all possible actions a at state s, and then choose an action according to their probabilities? If yes, how can we take an action in continuous action domains? If no, how can we pick an action then?
@chukybaby6 жыл бұрын
Something like this: a = np.random.choice(action_space, 1, p=action_probability_distribution). See docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html
@edmonddantes4705 Жыл бұрын
In continuous action domains, pi_theta(s,a) could be a Gaussian for fixed s (just an example). In discrete action spaces, for every state s, there is a probability of every action given by pi_theta(s,a). They sum to one of course.
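A small sketch of both cases (the feature shapes, dimensions, and fixed sigma below are illustrative assumptions): for discrete actions, compute the softmax probabilities and sample one action; for continuous actions, sample directly from, e.g., a Gaussian whose mean is a linear function of the state features, so no enumeration over actions is needed:

import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 8
theta = rng.normal(size=dim)

def phi_sa(s, a):                          # toy state-action features
    x = np.zeros(dim)
    x[(s * n_actions + a) % dim] = 1.0
    return x

# Discrete actions: softmax over phi(s,a).theta, then sample with those probabilities
s = 3
scores = np.array([phi_sa(s, a) @ theta for a in range(n_actions)])
probs = np.exp(scores - scores.max())
probs /= probs.sum()
a = rng.choice(n_actions, p=probs)

# Continuous actions: Gaussian policy, mean = phi(s).theta, fixed sigma
phi_s = rng.normal(size=dim)               # toy state features
mu, sigma = phi_s @ theta, 0.5
a_continuous = rng.normal(mu, sigma)       # sample a real-valued action directly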
@emilfilipov1696 жыл бұрын
I love it how there is always someone moaning or chewing food near the camera/microphone.
@serendipity03063 жыл бұрын
Wish to see David in person.
@ffff43032 жыл бұрын
While I don't know how generalizable the solution to this specific problem in an adversarial game would be, I can't help but wonder how these Policy Gradient Methods could solve it. The problem I am considering is one where the agent is out-matched, out-classed, or an "underdog" with more limited range, damage, or resources than its opponent in an adversarial game where it is known that the opponent's vulnerability increases with time or proximity. Think of Rocky Balboa vs Apollo Creed in Rocky 2 (where Rocky draws punches for many rounds to tire Apollo and then throws a train of left punches to secure the knockout), being pursued by a larger vessel in water or space (where the opponent has longer range artillery or railguns but less maneuverability due to its greater size), eliminating a gunman in a foxhole with a manually thrown grenade, or besieging a castle. If we assume that the agent can only win these games by concentrating all the actions that actually give measurable or estimable reward in the last few sequences of actions in the small fraction of possible episodes that reach the goal, how would any of these Policy Gradient Methods be able to find a winning solution? Given that all actions for many steps from the initial state would require receiving consistent negative rewards (either through glancing blows with punches for many rounds, evasive actions like maneuvering the agent's ship to dodge or incur nonvital damage from the longer-range attacks, or simply losing large portions of an army to get from the field to the castle walls and ascend them), I imagine the solution would have to be some bidirectional search with some nonlinear step between minimizing negative rewards from the initial state and maximizing positive reward from the goal. But can any of these Policy Gradient Methods ever capture such strategies if they are model-free (what if they have to be online or in partially observable environments as well)? It seems that TD lambda with both forward and backward views might be able to, but would the critical states of transitioning between min-negative and max-positive reward be lost in a "smoothing out" over all action sequence steps, or never found given the nonlinearity between the negative and positive rewards? What if the requisite transitions were also the most dangerous for the underdog agent (i.e. t_100 rewards: -100, +0; t_101 rewards: -1000, +5)? If the environment is partially observable, and there really is no real benefit in strictly following the min-negative reward, given that the only true reward that matters is surviving and eliminating the opponent, some stochasticity would be required in action selection on the forward view to explore states that are locally nonoptimal for the min-negative reward but required for ever experiencing the global terminal reward state; but this stochasticity may not be affordable on the backward view, where the concentration of limited resource use cannot be wasted. I guess the only workable method is if the network captured, in the feature vector, a function of the opponent's vulnerability in terms of time, resources exhausted, and/or proximity, but what still remains is this concern of increased danger for the agent as it gets closer to the goal. I realize that one could bound the negative reward minimization from zero damage to "anything short of death", but normalizing that with the positive rewards at the final steps of the game or episode would be interesting to understand.
In this strategy it seems odd for an algorithm at certain states to effectively be "saying" things like: "Yes! You just got punched in the face 27 times in a row! (+2700 reward)"; "Congratulations! 2/3s of your ship has lost cabin pressure! (+6600 reward)"; "You have one functional leg, one functional arm, and suffering acute exsanguination! (+10,000 reward)" "Infantry death rate increases 200x! (+200,000 reward)". Any thoughts?
@snared_ Жыл бұрын
did you figure it out yet? It's been a year, hopefullly you've had time to sit down and make actual progress towards creating this?
@MrHailstorm006 жыл бұрын
The slides are outdated; judging by David's speech, he apparently changed notation and added a few slides in the last 30 minutes or so.
@ProfessionalTycoons6 жыл бұрын
Man, without the gestures it's not the same, the lecture is not the same...
@emrahe4686 жыл бұрын
This has good sound quality, but is missing the nice body language.
@ck53000459 жыл бұрын
This really helps. Thanks
@VladislavProkhorov-sr2mf7 жыл бұрын
How does he get the score function at 37:41?
@blairfraser80057 жыл бұрын
I've seen this question a few places around the net so I answered it here: math.stackexchange.com/questions/2013050/log-of-softmax-function-derivative/2340848#2340848
@alexanderyau63477 жыл бұрын
Thank you, very elaborate answer!
@MinhVu-fo6hd6 жыл бұрын
So, how do you get a score function for a deep NN?
@AM-kx4ue5 жыл бұрын
@@blairfraser8005 could you do it for dummies? I don't understand why you put the terms inside logs.
@blairfraser80055 жыл бұрын
Our goal is to get a score function by taking the gradient of softmax. It looks like a difficult problem so I need to break it down into a simpler form. The first way to break it down is to separate the numerator and denominator using the log identity: log(x/y) = log(x) - log(y). Now I can apply the gradient to the left and right side independently. I also know that anytime I see something in the form e^x there is a good chance I can simplify and get at the guts of the exponent by taking the log of it. That helps simplify the left side. Next, the right side also takes advantage of a log property - namely that the gradient of the log of f(x) can be written in the form of gradient of f(x) / f(x). This is just the chain rule from calculus. Now the gradients of both the left and right sides are easier.
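Condensed into the lecture's notation, the result of that derivation is (a sketch):

\log \pi_\theta(s,a) = \phi(s,a)^\top \theta - \log \sum_{b} \exp\big(\phi(s,b)^\top \theta\big)

\nabla_\theta \log \pi_\theta(s,a)
  = \phi(s,a) - \sum_{b} \pi_\theta(s,b)\, \phi(s,b)
  = \phi(s,a) - \mathbb{E}_{\pi_\theta}\big[\phi(s,\cdot)\big]

which is exactly the "feature minus average feature" form on the Softmax Policy slide (36:45).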
@jk91657 жыл бұрын
Thank you for the lecture. I was wondering if you are constrained to use the same state-action feature vectors for actor and critics? The weights are, of course, different, but does \phi(s,a) need to be the same? (57:18)
@narendiranchembu58937 жыл бұрын
As far as my understanding goes, the feature vectors of the actor and the critic are completely different. The feature vector of the critic is more like the state space and action space representation, as you have seen in the Value Function Approximation lecture. But for the actor, the feature vectors are probabilities of taking an action in a given state (mostly).
@edmonddantes4705 Жыл бұрын
Of course they don't have to be the same. The state-action value features are stuff that approximate the state-action value function well, and the policy features are stuff that approximate a general useful policy well. For example, look at what compatible function approximation imposes in order to flow along the real gradient 1:05:52. How are you going to achieve that condition with the same features?
@charles1019936 жыл бұрын
What is the log policy exactly? Is it just the log of the output of the gradient with respect to some state action pair?
@xingyuanzhang59895 жыл бұрын
I need David! It's hard to understand some pronouns without seeing him.
@brandomiranda67035 жыл бұрын
Does he talk about REINFORCE in this talk/lecture? If yes when?
@AlbaraRamli5 жыл бұрын
Here: 48:30
@divyanshushekhar51185 жыл бұрын
1:07:28 What does Silver mean when he says : "We can reduce the variance without changing the expectation"
@alvinphantomhive37945 жыл бұрын
There are several ways to reduce the variance, but reducing it by using a "critic" as at 53:02 may keep changing and updating the expectation as you go. This slide shows a way to reduce the variance without changing the expectation. The idea is that subtracting a baseline function B(s) from the policy gradient does the job. The expectation equation above shows, after a few algebra steps, that you end up with B(s) multiplied by the gradient of the policy probabilities, which sum to 1, and the gradient of a constant (1) is zero. So that whole term, the expectation involving the baseline B(s), is equal to zero. That means you can use the baseline function B(s) as a trick to control the variance without changing the expectation. The baseline does not affect the expectation, since its contribution to the expectation is exactly zero.
@alvinphantomhive37945 жыл бұрын
Sorry if the explanation is not straightforward and a bit complicated lol
@edmonddantes4705 Жыл бұрын
\nabla log pi(s,a) A(s,a) and \nabla log pi(s,a) Q(s,a) have the same expectation in the MDP space. However, which one has the larger variance? V[X] = E[X^2] - E[X]^2. Obviously E[X]^2 is the same for both. However, which expectation is larger, that of |\nabla log pi(s,a)|^2 |A(s,a)|^2 or that of |\nabla log pi(s,a)|^2 |Q(s,a)|^2? Note that A just centers Q, so typically its square is smaller.
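Spelled out, the identity behind this thread is (a sketch, with d^{\pi_\theta} the state distribution):

\mathbb{E}_{\pi_\theta}\big[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\big]
  = \sum_{s} d^{\pi_\theta}(s)\, B(s) \sum_{a} \nabla_\theta \pi_\theta(s,a)
  = \sum_{s} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \Big(\sum_{a} \pi_\theta(s,a)\Big)
  = \sum_{s} d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta 1 = 0

So subtracting B(s) = V^{\pi_\theta}(s) leaves the expected gradient unchanged, while replacing Q(s,a) by A(s,a) = Q(s,a) - V(s) shrinks the magnitude of the samples and hence the variance.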
@shaz71637 жыл бұрын
Can someone explain how he got the score function from the maximum likelihood expression at 22:39?
@sravanchittupalli23334 жыл бұрын
I am 2 years late but this might help someone else 😅😅 It is simple differentiation: the grad of log(a) w.r.t. a = grad(a)/a. He just did this backwards.
@hyunjaecho14154 жыл бұрын
What does phi(s) mean at 1:18:05 ?
@sarahjamal866 жыл бұрын
Where did you go David :-(
@andreariba77922 жыл бұрын
It's a pity to see only the slides compared to the previous lectures; the change of format makes it very hard to follow.
@TillMiltzow2 жыл бұрын
When adding the baseline, there is an error. The gradient is zero when multiplying with the baseline because the function B(s) does not depend on theta. Then he uses B(s) = V^{\pi_\Theta} (s), which depends on theta. :( So this is at most a motivation rather than a mathematical proof.
@edmonddantes4705 Жыл бұрын
No error. That gradient is not hitting the baseline B, so it does not matter that B depends on theta. The gradient inside the sum is zero because the policy coefficients sum to one for fixed s. This is a well-known classical thing anyway. It was originally proven in Sutton's paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation".
@helloworld94784 жыл бұрын
31:40 "...until your graduate students collapse it..." LoL
@kyanas17506 жыл бұрын
Why is there not a single implementation in MATLAB?
@thomasmezzanine54707 жыл бұрын
Thanks for updating the lectures. I have some trouble understanding the state-action feature vector \phi(s, a). I know the environment features mentioned in the last lecture could be some kind of observation of the environment, but how should I understand this state-action feature vector?
@edmonddantes4705 Жыл бұрын
The state-action features in the last lecture and this lecture are different, since in the last lecture they were used to approximate the VALUE Q of a particular state-action pair, and in this lecture they are used to approximate a POLICY PI. State-action features filter important information about the state and action used to approximate the state-value function or maybe the policy, depending on the context.
@edmonddantes4705 Жыл бұрын
Say we are in a 2D grid world. The possible actions are up, down, left and right. Every time I move up, I get +1 reward, every time I move down, left or right, I get 0 reward. Define two features as (1,0) if I choose to go up, and (0,1) otherwise. Note that now I can compute the value function EXACTLY as a linear combination of my features, since they contain all the relevant information. My optimal policy is also a linear combination of those features only. PS: you are asking about the linear case, but for me the most interesting case is the nonlinear one.
@guptamber Жыл бұрын
@@edmonddantes4705 Wow that is response to 6 year old question. Thanks for taking time.
@edmonddantes4705 Жыл бұрын
@@guptamberhaha I do it to practise!
@alenmanoocherian6318 жыл бұрын
Hello Karolina, Is there any real video for this class?
@zhaoxuanzhu43645 жыл бұрын
I am guessing the slides shown in the video are slightly different from the ones they used in the lecture.
@japneetsingh50155 жыл бұрын
Will these policy gradient methods work better than the previous methods based on generalized policy iteration (MC, TD, and SARSA)?
@alexanderyau63477 жыл бұрын
Hi, guys, how can I get the v_t in 50:53?
@narendiranchembu58937 жыл бұрын
Since we have the rewards of the entire episode, we can calculate the returns the Monte Carlo way. Here, v_t is more like G_t: v_t = R_(t+1) + gamma*R_(t+2) + ... + gamma^(T-t-1)*R_T
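A tiny sketch of that backward computation for one recorded episode (the reward list is made up for illustration):

import numpy as np

gamma = 0.99
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]     # R_{t+1}, ..., R_T for one recorded episode

returns = np.zeros(len(rewards))
g = 0.0
for t in reversed(range(len(rewards))):
    g = rewards[t] + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
    returns[t] = g
# returns[t] is the v_t (i.e. G_t) plugged into the REINFORCE update at 50:53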
@alexanderyau63477 жыл бұрын
Thank you!
@TillMiltzow3 жыл бұрын
I feel like for the last 15 minutes the slides and what he says are not in sync anymore. :(
@robert7806126 жыл бұрын
David disappeared, but CC subtitles are coming!!
@ks35623 жыл бұрын
I lost it after he started talking about bias and reducing variance in actor-critic algorithms, after 1:05:03
@Sickkkkiddddd9 ай бұрын
Isn't the state value function 'useless' to an agent considering he 'chooses' actions but can't 'choose' his state?
@hyunghochrischoi65613 жыл бұрын
First time making it this far. But is it just me, or did a lot of the notation change?
@hyunghochrischoi65613 жыл бұрын
Also, he seems to be speaking in one notation while the screen is showing something else.
@alxyok6 жыл бұрын
rock paper scissors problem: would it not be a better strategy to try to fool the opponent into thinking we are following a policy other than the random play so that we can exploit the consequences of his decisions?
@BGasperov5 жыл бұрын
Whatever strategy you come up with can not beat the uniform random strategy - that's why it is considered optimal.
@edmonddantes4705 Жыл бұрын
In real life it could be good, but theoretically of course not, since it is not a Nash Equilibrium. It can be exploited. Watch lecture 10.
@guptamber Жыл бұрын
I find Prof. Silver brilliant, but the concepts in this lecture are by and large not explained concretely, just as illustrations of the book. Moreover, the earlier lectures showed where on the slide the professor is pointing, and that is missing too.
@karthik-ex4dm6 жыл бұрын
Came with high hopes from the last video... Without the video, unable to tell what he is pointing to.
@claudiocimarelli7 жыл бұрын
Is the slide about deterministic policy gradient at the end the one with compatible function approximation (like in the middle of the presentation)? :S Luckily Silver's paper is online :) Very good videos. This one would have been better with the video, but thanks anyway.
@illuminatic76 жыл бұрын
Not really, the slide you are referring to does not have the gradient of the Q-Function in the equations, which is the main point of what he is talking about. It helps a lot to have a look at the original paper (pages 4 and 5 in particular) to understand his explanation of DPG which can be found here: proceedings.mlr.press/v32/silver14.pdf
@shivajidutta84728 жыл бұрын
Does anyone know about the reading material David mentions in the previous class?
@hantinglu80507 жыл бұрын
I guess it's this one: "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
@chongsun78728 ай бұрын
A little mismatch between the voice and the slides...
@tomwon54516 жыл бұрын
v: critic param, u: actor param
@Kalernor3 жыл бұрын
Why do all lecture videos on Policy Gradient Methods use the exact same set of slides lol
@p.z.83553 жыл бұрын
So when can we determine that there is state aliasing?
@edmonddantes4705 Жыл бұрын
Basically when you feel like your features are not representing the MDP very well. The solution is changing the features or improving them.
@jiansenxmu7 жыл бұрын
I'm looking for the robot.. 32:49
@alexanderyau63477 жыл бұрын
What is state aliasing in reinforcement learning?
@sam416197 жыл бұрын
It's like when two different states are represented with the same features, or when two different states are encoded/represented using the same encoding. Though they are different (and have different rewards), due to the aliasing they appear the same, so it gets difficult for the algorithm or approximator to differentiate between them.
@hardikmadhu5846 жыл бұрын
Someone forgot to hit Record!!
@MinhVu-fo6hd7 жыл бұрын
How about the non-MDP? Does anyone have experience with that?
@Erain-aus-787 жыл бұрын
Minh Vu, a non-MDP can sometimes be artificially converted so that MDP methods can be used to solve it, like a quasi-Markov chain.