RL Course by David Silver - Lecture 7: Policy Gradient Methods

278,569 views

Google DeepMind


Comments: 192
@akshatgarg6635
@akshatgarg6635 3 жыл бұрын
People who feel like quitting at this stage: relax, take a break, watch this video over and over again, and read Sutton and Barto. Do everything but don't quit. You are amongst the 10% who came this far.
@matejnovosad9152
@matejnovosad9152 3 жыл бұрын
Me in high school trying to make a Rocket League bot: :0
@TY-il7tf
@TY-il7tf 3 жыл бұрын
I finally came to this stage of learning after watching his videos over and over. He did very well explaining everything, but RL knowledge differs from other ML and it takes time to learn and get used to.
@juliansuse1
@juliansuse1 3 жыл бұрын
Bro thanks for the encouragement
@SuperBiggestking
@SuperBiggestking 2 жыл бұрын
Brother Akshat! This was the most timely comment ever on YouTube. I was watching these lectures and I felt brain dead since they are pretty long. This is a great encouragement. May God bless you for this!
@jamesnesfield7288
@jamesnesfield7288 2 жыл бұрын
My best advice is to apply each algorithm in the Sutton and Barto textbook to problems in AI Gym to help understand all this... you can do it, you got this...
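For anyone taking this advice, here is a minimal sketch of the episode loop those algorithms plug into; it assumes the Gymnasium package and uses CartPole-v1 purely as an example environment:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # stand-in for a ~ pi_theta(s, .)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(total_reward)  # swap the random action for a REINFORCE or actor-critic agent
```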
@xicocaio
@xicocaio 4 жыл бұрын
This is what I call commitment: David Silver explored a not-showing-his-face policy, received less reward, and then switched back to the previous lectures' optimal policy. Nothing like learning from this "one stream of data called life."
@Sanderex
@Sanderex 4 жыл бұрын
What a treasure of a comment
@camozot
@camozot 3 жыл бұрын
And we experienced a different state, realized we got much less reward from it, and updated our value function! Then David adjusts his policy to our value function, like an actor-critic? Or is that a bit of a stretch? Meh, I think there's some link between his value function and ours here: he wants us to do well, because he's a legend!
@stevecarson7031
@stevecarson7031 3 жыл бұрын
Genius!
@js116600
@js116600 Жыл бұрын
Can't discount face value!
@alexanderyau6347
@alexanderyau6347 7 жыл бұрын
Oh, I can't concentrate without seeing David.
@terrykarekarem9180
@terrykarekarem9180 5 жыл бұрын
Exactly the same here.
@musicsirme3604
@musicsirme3604 4 жыл бұрын
Me too. This is much better than the 2018 one.
@mathavraj9378
@mathavraj9378 4 жыл бұрын
How will I understand expected return without him walking to a spot and summing up rewards all along the path after it?
@DavidKristoffersson
@DavidKristoffersson 4 жыл бұрын
Turn on subtitles. Helps a lot.
@JonnyHuman
@JonnyHuman 7 жыл бұрын
For those confused:
- Whenever he speaks of the u vector, he's talking about the theta vector (slides don't match).
- At 1:02:00 he's talking about slide 4.
- At 1:16:35 he says Vhat but the slides show Vv.
- He refers to Q in Natural Policy Gradient, which is actually Gtheta in the slides.
- At 1:30:30 the slide should be 41 (the last slide), not the Natural Actor-Critic slide.
@illuminatic7
@illuminatic7 6 жыл бұрын
Also, when he starts talking about DPG at 1:26:10 it helps a lot to have a look at his original paper (proceedings.mlr.press/v32/silver14.pdf) pages 4 and 5 in particular. I think the DPG slides he is actually referring to are not available online.
@MiottoGuilherme
@MiottoGuilherme 5 жыл бұрын
@50:00 he mentions G_t, but the slides show v_t, right?
@gregh6586
@gregh6586 5 жыл бұрын
@@MiottoGuilherme yes. G_t as in Sutton/Barto's book, i.e. future, discounted reward.
@mhtb32
@mhtb32 4 жыл бұрын
@@illuminatic7 Unfortunately that link doesn't work anymore, here is an alternative: hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf
@finarwa3602
@finarwa3602 9 ай бұрын
I have to listen repeatedly because I could not concentrate without seeing him. I have to imagine what he was trying to show through his gestures. This is a gold-standard lecture for RL. Thank you, Professor David Silver.
@saltcheese
@saltcheese 8 жыл бұрын
Ahhh... where did you go, David? I loved your moderated gesturing.
@lucasli9225
@lucasli9225 8 жыл бұрын
I have the same question too. His gestures are really helpful in learning the course!
@Chr0nalis
@Chr0nalis 8 жыл бұрын
He probably forgot to turn on the cam capture :)
@DoingEasy
@DoingEasy 7 жыл бұрын
Lecture 7 is optional
@danielc4267
@danielc4267 6 жыл бұрын
I think Lecture 10 is optional. Lecture 7 seems rather important.
@NganVu
@NganVu 4 жыл бұрын
3:24 Introduction
26:39 Finite Difference Policy Gradient
33:38 Monte-Carlo Policy Gradient
52:55 Actor-Critic Policy Gradient
@yasseraziz1287
@yasseraziz1287 4 жыл бұрын
1:30 Outline
3:25 Policy-Based Reinforcement Learning
7:40 Value-Based and Policy-Based RL
10:15 Advantages of Policy Based RL
14:10 Example: Rock-Paper-Scissors
16:00 Example: Aliased Gridworld
20:45 Policy Objective Function
23:55 Policy Optimization
26:40 Policy Gradient
28:30 Computing Gradients by Finite Differences
30:30 Training AIBO to Walk by Finite Difference Policy Gradient
33:40 Score Function
36:45 Softmax Policy
39:28 Gaussian Policy
41:30 One-Step MDPs
46:35 Policy Gradient Theorem
48:30 Monte-Carlo Policy Gradient (REINFORCE)
51:05 Puck World Example
53:00 Reducing Variance Using a Critic
56:00 Estimating the Action-Value Function
57:10 Action-Value Actor-Critic
1:05:04 Bias in Actor-Critic Algorithms
1:05:30 Compatible Function Approximation
1:06:00 Proof of Compatible Function Approximation Theorem
1:06:33 Reducing Variance using a Baseline
1:12:05 Estimating the Advantage Function
1:17:00 Critics at Different Time-Scales
1:18:30 Actors at Different Time-Scales
1:21:38 Policy Gradient with Eligibility Traces
1:23:50 Alternative Policy Gradient Directions
1:26:08 Natural Policy Gradient
1:30:05 Natural Actor-Critic
@krishnanjanareddy2067
@krishnanjanareddy2067 3 жыл бұрын
And it turns out that this is the best course to learn RL from, even after 6 years.
@snared_
@snared_ Жыл бұрын
really? What were you able to do with this information?
@michaellin9407
@michaellin9407 5 жыл бұрын
This course should be called: "But wait, there's an even better algorithm!"
@mathavraj9378
@mathavraj9378 4 жыл бұрын
lol entire machine learning is like that
@akshatgarg6635
@akshatgarg6635 3 жыл бұрын
That, my friend, is the core principle of any field of engineering. That's how computers got from a room-sized contraption to a handheld device: because somebody said, wait, there is an even better way of doing this.
@Wuu4D
@Wuu4D 8 жыл бұрын
Damn. It was a lot easier understanding it with gestures.
@fktudiablo9579
@fktudiablo9579 5 жыл бұрын
He could describe his gestures in the subtitles.
@chrisanderson1513
@chrisanderson1513 7 жыл бұрын
Starts at 1:25. Actor critic at 52:55.
@ranhao8882
@ranhao8882 7 жыл бұрын
thx
@lorenzovannini82
@lorenzovannini82 3 ай бұрын
These lectures are a gift. Thanks.
@T4l0nITA
@T4l0nITA 3 жыл бұрын
By far the best video about policy gradient methods on youtube
@georgegvishiani736
@georgegvishiani736 6 жыл бұрын
It would have been great if it was possible to recreate David in this lecture based on his voice using some combination of RL frameworks.
@Vasha88
@Vasha88 4 жыл бұрын
First time in my life I had to DECREASE the speed of the video and not increase it... man, he talks REALLY fast, while at the same time showing new slides filled with equations.
@helinw
@helinw 6 жыл бұрын
Just to make sure: at 36:22, the purpose of the likelihood ratio trick is that the gradient of the objective function gets converted into an expectation again? Just as David said at 44:33, "... that's the whole point of using the likelihood ratio trick".
@AM-kx4ue
@AM-kx4ue 5 жыл бұрын
I'm not sure about it either
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
That's exactly right. Once you convert it into an expectation, you can approximate it by sampling, so that trick is very practical.
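For reference, here is the step being discussed, written for the one-step case used in the lecture (d(s) is the state distribution and R_{s,a} the expected immediate reward); the middle equality is the likelihood ratio identity ∇π = π ∇log π:

```latex
\nabla_\theta J(\theta)
  = \sum_s d(s) \sum_a \nabla_\theta \pi_\theta(s,a)\, \mathcal{R}_{s,a}
  = \sum_s d(s) \sum_a \pi_\theta(s,a)\, \nabla_\theta \log \pi_\theta(s,a)\, \mathcal{R}_{s,a}
  = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, r\right]
```

The last line is an expectation over the states and actions actually visited under pi_theta, which is exactly the form a sample average (and hence REINFORCE) can estimate.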
@OmPrakash-vt5vr
@OmPrakash-vt5vr 4 ай бұрын
“No matter how ridiculous the odds may seem, within us resides the power to overcome these challenges and achieve something beautiful. That one day we look back at where we started, and be amazed by how far we’ve come.” -Technoblade I started this series a month ago in summer break, I even did the Easy21 assignment and now I finally learned what I wanted, when I started this series i.e. Actor Critic Method. Time to do some gymnasium env.
@florentinrieger5306
@florentinrieger5306 Жыл бұрын
It is unfortunate that exactly this episode is without David on the screen. It is again quite a complex topic, and David jumping and running around and pointing out the relevant parts makes it much easier to digest.
@jurgenstrydom
@jurgenstrydom 7 жыл бұрын
I wanted to see the AIBO training :(
@felixt1250
@felixt1250 6 жыл бұрын
Me too. If you look at the Paper by Nate Kohl and Peter Stone where they describe it, they reference a web page for the videos. And surprisingly it is still online. You find it at www.cs.utexas.edu/users/AustinVilla/?p=research/learned_walk
@tchz
@tchz 4 жыл бұрын
@@felixt1250 not anymore :'(
@gunsodo
@gunsodo 3 жыл бұрын
@@tchz I think it is still there but you have to copy and paste the link.
@MrCmon113
@MrCmon113 4 жыл бұрын
Unfortunately the slides do not fit what is said. It's a pity they don't seem to put much effort into these videos. David is surely one of the best people to learn RL from.
@akarshrastogi5145
@akarshrastogi5145 5 жыл бұрын
This lecture was immensely difficult to follow owing to David's absence and the mismatch of the slides.
@liamroche1473
@liamroche1473 7 жыл бұрын
I am not sure exactly how this video was created, but the right slide is often not displayed (especially near the end, but elsewhere as well). It is probably better to download the slides for the lecture and find your own way through them while listening to the audio.
@ErwinDSouza
@ErwinDSouza 5 жыл бұрын
At 45:36 I think the notation he is describing is different from that shown in these slides. I think his "capital R" is the small "r" for us. And the "curly R" is the "Rs,a" for us.
@MrCmon113
@MrCmon113 4 жыл бұрын
Also u is theta.
@AdamCajf
@AdamCajf 4 жыл бұрын
Yes, fully agree. I believe this is important so to reiterate the small correction: the lowercase r is a random reward, the actual reward that agent/we experience, while the curly uppercase R is the reward from the MDP (Markov Decision Process).
@omeryilmaz6653
@omeryilmaz6653 6 жыл бұрын
You are fantastic David. Thanks for the tutorial.
@JakobFoerster
@JakobFoerster 9 жыл бұрын
Thank you for creating the video John, this is really great!
@LucGendrot
@LucGendrot 6 жыл бұрын
Is there any particular reason that, in the basic TD(0) QAC pseudocode (1:00:00), we don't update the Q weights first before doing the theta gradient update?
@alvinphantomhive3794
@alvinphantomhive3794 5 жыл бұрын
I think you can start with arbitrary values for the weights, since the weights will also be adjusted in proportion to the TD error and get better as the iterations go on.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Super good question. I am guessing the reason is computational right? You want to reuse the computation you did for Q_w(s,a) when computing delta instead of computing it again with new weights when doing the gradient ascent update of the policy parameters (theta). However, what you propose seems more solid, just more costly.
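A minimal sketch of one update in the ordering shown on the slide, assuming a linear critic Q_w(s,a) = phi(s,a)·w; the helper functions phi and grad_log_pi and all names are illustrative, not taken from the lecture:

```python
def qac_step(s, a, r, s_next, a_next, theta, w, phi, grad_log_pi,
             alpha=0.01, beta=0.01, gamma=0.99):
    """One TD(0) actor-critic update. theta, w, and the outputs of
    phi(s, a) and grad_log_pi(s, a, theta) are numpy arrays.
    The TD error and the actor update both reuse Q_w evaluated with the
    *current* critic weights; w is updated last."""
    q_sa = phi(s, a) @ w                                       # Q_w(s, a)
    delta = r + gamma * (phi(s_next, a_next) @ w) - q_sa       # TD error
    theta = theta + alpha * grad_log_pi(s, a, theta) * q_sa    # actor step
    w = w + beta * delta * phi(s, a)                           # critic step
    return theta, w
```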
@jorgelarangeira7013
@jorgelarangeira7013 6 жыл бұрын
It took me a while to realize that policy function pi(s, a) is alternately used as the probability of taking a certain action in state s, and the action proper (a notation overload that comes from the Sutton book). I think specific notation for each instance would avoid a lot of confusion.
@d4138
@d4138 6 жыл бұрын
36:20, could anyone please explain what kind of expectation we are computing (I only see the gradients), and why the expectation of the right-hand side is easier to compute than that of the left-hand side?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
You want to maximise J at 43:25, which is the expected immediate reward. Note that thanks to the computation at 36:20, the gradient of J at 43:25 becomes an expectation. The expectation is computed over the full state-action space of the MDP with policy pi_theta. Note that without the term pi_theta(s,a) in the sum, that thing would not be an expectation anymore, so you COULD NOT APPROXIMATE IT BY SAMPLING.
@SSS.8320
@SSS.8320 5 жыл бұрын
We miss you David
@BramGrooten
@BramGrooten 4 жыл бұрын
Is there perhaps a link to the videos of AIBOs running? (supposed to be shown at 31:55)
@BramGrooten
@BramGrooten 4 жыл бұрын
@@oDaRkDeMoNo Thank you!
@AliRafieiAliRafiei
@AliRafieiAliRafiei 9 жыл бұрын
thank you many times Dear Karolina. cheers
@Darshanhegde
@Darshanhegde 9 жыл бұрын
Thanks for updating lectures :) I sort of got stuck on this lecture because the video wasn't available :P Now I have no excuse for not finishing the course !
@Darshanhegde
@Darshanhegde 9 жыл бұрын
Thought this was a real video! I was wrong! David keeps referring to equations on the slides, but the audio and slides are not synced! It's confusing sometimes! But still better than just audio!
@jingchuliu8635
@jingchuliu8635 8 жыл бұрын
The slides are perfectly synced with the audio most of the time, but the slides on "compatible function approximation" are not in the right order and the slides on "deterministic policy gradient" are missing.
@fktudiablo9579
@fktudiablo9579 5 жыл бұрын
1:00:54, this man got a -1 reward and restarted a new episode.
@alvinphantomhive3794
@alvinphantomhive3794 5 жыл бұрын
1:00:55 " Ugnkhhh.... "
@MotherFriendSon
@MotherFriendSon 4 жыл бұрын
Sometimes David's words and the slides don't correspond to each other, and I don't know what to do: listen to David or read the slides. For example at 1:29:55, when he speaks about the deterministic policy gradient theorem.
@binjianxin7830
@binjianxin7830 4 жыл бұрын
David has a paper about DPG which he mentioned was published “last year” in 2014, later a DDPG one. Just check them out.
@nirajabcd
@nirajabcd 4 жыл бұрын
The lectures were going great until someone decided not to show David's gestures. God I was learning so much just from his gestures.
@mohammadfarzanullah5549
@mohammadfarzanullah5549 4 жыл бұрын
He teaches much better than Hado van Hasselt. Makes it much easier.
@nirmalnarasimha9181
@nirmalnarasimha9181 Жыл бұрын
Made me cry after very long :( given the professor's absence and the slide mismatch
@d4138
@d4138 6 жыл бұрын
Could anyone please explain the slide at 45:51? In particular, I don't understand how the big $R_{s,a}$ becomes just $r$ when we compress the gradient into the expectation E. What is the difference between the big R and the small one?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
r is the immediate reward understood as a RANDOM VARIABLE. This is useful because we want to compute the expectation of r along the state space generated by the MDP given a fixed policy pi. This is a measure of how good our policy is. R_{s,a} is the expectation of r given that you are at state s and carry out action a, i.e. R_{s,a} = E[r | s,a].
@samlaf92
@samlaf92 5 жыл бұрын
I don't understand why he says that Value-based methods can't work with stochastic policies? By definition epsilon-greedy is stochastic. If we find two actions with the same value, we could simply have a stochastic policy with 1/2 probability to both. And thus, value-function based methods could also solve the aliased-gridworld example around 20:00.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
The convergence theorems in CONTROL require epsilon --> 0. If you read papers, you will often see assumptions of GLIE type (greedy in the limit with infinite exploration), which go towards a deterministic policy. David also mentions this (lecture 5 I think).
@sengonzi2010
@sengonzi2010 2 жыл бұрын
Fantastic lectures
@20a3c5f9
@20a3c5f9 3 жыл бұрын
51:58 - "you get this very nice smooth learning curve... but learning is slow because rewards are high variance" Any idea why the learning curve is smooth despite high variance of returns? We use returns directly in gradient formula, so intuitively I'd guess they'd affect behavior of the learning curve as well.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
I mean, look at the scale, it is massive. I bet if you zoom in, it is not going to be very smooth. Let's say we have an absorbing MDP with pretty long trajectories and we calculate the mean returns by applying MC. By the central limit theorem, the mean empirical returns converge to the real returns, but it will take many iterations due to the high variance of those returns. The smoothness you would see when zooming out (when looking at how the mean returns converge) would be due to the central limit theorem. Note that I am simply drawing a parallel. In the case of MC policy gradient, that smoothness is due to its convergence properties, which rely on the fact that the MC returns are unbiased samples of the real returns, but the curve is very bumpy when you zoom in, precisely due to the variance.
@kunkumamithunbalajivenkate8893
@kunkumamithunbalajivenkate8893 2 жыл бұрын
32:38 - AIBO Training Video Links: www.cs.utexas.edu/~AustinVilla/?p=research/learned_walk
@hermonqua
@hermonqua 12 күн бұрын
@ 1:08:00 What happened to the log in the first equation?
@gregh6586
@gregh6586 5 жыл бұрын
Hado van Hasselt gives basically the same lecture here: kzbin.info/www/bejne/mIPJhquHqJurf68. I still like David's lecture much more, but perhaps this other lecture can fill some of the gaps that appeared with David's disappearance.
@randalllionelkharkrang4047
@randalllionelkharkrang4047 2 жыл бұрын
Around 1:00:00, in the action-value actor-critic algorithm, to update w he used \beta * \delta * feature. Why is he taking the feature here? In model-free evaluation he used the eligibility trace, but why the feature here?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
He is using linear function approximation for Q. It is a choice. Not sure why you are bothered that much by that.
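In symbols, for the linear critic assumed here, the feature vector appears because it is the gradient of Q_w with respect to w, so the update is just stochastic gradient TD(0) on the critic:

```latex
Q_w(s,a) = \phi(s,a)^{\top} w
\;\;\Rightarrow\;\;
\nabla_w Q_w(s,a) = \phi(s,a),
\qquad
\Delta w = \beta\,\delta\,\nabla_w Q_w(s,a) = \beta\,\delta\,\phi(s,a)
```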
@SuperBiggestking
@SuperBiggestking 2 жыл бұрын
Following this lecture is like learning math by listening to a podcast.
@GeneralKenobi69420
@GeneralKenobi69420 2 жыл бұрын
Not sure if that's supposed to be good or not lol
@fndTenorio
@fndTenorio 6 жыл бұрын
28:52 So J(theta) is the average reward your agent gets following the policy parameterised by theta, and pi(theta, a) is the probability of taking action a given policy parameters theta?
@AM-kx4ue
@AM-kx4ue 5 жыл бұрын
J(theta) is the objective function (the expected reward, which we maximise); check the 22:30 slide.
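For reference, the three policy objective functions from that slide (d^{pi_theta} is the stationary distribution under pi_theta), all of which are maximised:

```latex
J_1(\theta) = V^{\pi_\theta}(s_1), \qquad
J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s), \qquad
J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(s,a)\, \mathcal{R}_{s,a}
```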
@rylanschaeffer3248
@rylanschaeffer3248 8 жыл бұрын
At 1:03:49, shouldn't the action be sampled from \pi_\theta(s', a)?
@MinhVu-fo6hd
@MinhVu-fo6hd 7 жыл бұрын
In line "Sample a ~ pi_theta" of the actor-critic algorithm around 58:00. From what I understand that pi_theta(s, a) = P[ a | s, theta], I don't clearly understand how can we pick an action a given s and theta. Do we have to calculate phi(s , a) * theta for all possible action a at state s, and then choose an action accordingly to their probabilities? If yes, how can we take an action in continuous action domains? If no, how can we pick an action then?
@chukybaby
@chukybaby 6 жыл бұрын
Something like this: a = np.random.choice(action_space, 1, p=action_probability_distribution). See docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.choice.html
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
In continuous action domains, pi_theta(s,a) could be a Gaussian for fixed s (just an example). In discrete action spaces, for every state s, there is a probability of every action given by pi_theta(s,a). They sum to one of course.
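A minimal sketch of the continuous case, assuming the Gaussian policy from the lecture with a mean that is linear in the state features and a fixed standard deviation (the sigma value and function name are just illustrative):

```python
import numpy as np

def sample_gaussian_action(phi_s, theta, sigma=0.5):
    """Sample a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta, and return
    the score grad_theta log pi_theta(s, a) needed for the policy update."""
    mu = phi_s @ theta
    a = np.random.normal(mu, sigma)
    score = (a - mu) * phi_s / sigma**2
    return a, score
```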
@emilfilipov169
@emilfilipov169 6 жыл бұрын
I love it how there is always someone moaning or chewing food near the camera/microphone.
@serendipity0306
@serendipity0306 3 жыл бұрын
Wish to see David in person.
@ffff4303
@ffff4303 2 жыл бұрын
While I don't know how generalizable the solution to this specific problem in an adversarial game would be, I can't help but wonder how these policy gradient methods could solve it. The problem I am considering is one where the agent is out-matched, out-classed, or an "underdog" with more limited range, damage, or resources than its opponent, in an adversarial game where it is known that the opponent's vulnerability increases with time or proximity. Think of Rocky Balboa vs Apollo Creed in Rocky 2 (where Rocky draws punches for many rounds to tire Apollo and then throws a train of left punches to secure the knockout), being pursued by a larger vessel in water or space (where the opponent has longer-range artillery or railguns but less maneuverability due to its greater size), eliminating a gunman in a foxhole with a manually thrown grenade, or besieging a castle.

If we assume that the agent can only win these games by concentrating all the actions that actually give measurable or estimable reward in the last few sequences of actions, in the small fraction of possible episodes that reach the goal, how would any of these policy gradient methods be able to find a winning solution? Given that all actions for many steps from the initial state would incur consistent negative rewards (glancing blows from punches for many rounds, evasive maneuvers that dodge or take nonvital damage from the longer-range attacks, or simply losing large portions of an army to get from the field to the castle walls and ascend them), I imagine the solution would have to be some bidirectional search with a nonlinear step between minimizing negative reward from the initial state and maximizing positive reward near the goal.

But can any of these policy gradient methods ever capture such strategies if they are model-free (and what if they also have to be online, or operate in partially observable environments)? It seems that TD(lambda) with both forward and backward views might be able to, but would the critical states of transitioning between min-negative and max-positive reward be lost in a "smoothing out" over all steps of the action sequence, or never found given the nonlinearity between the negative and positive rewards? What if the requisite transitions were also the most dangerous for the underdog agent (e.g. t_100 rewards: -100, +0; t_101 rewards: -1000, +5)?

If the environment is partially observable, and there really is no benefit in strictly following the minimum-negative-reward path, given that the only reward that truly matters is surviving and eliminating the opponent, some stochasticity would be required in action selection on the forward view to explore states that are locally non-optimal for the negative reward yet required for ever experiencing the terminal reward state; but that stochasticity may not be affordable on the backward view, where the concentration of limited resource use cannot be wasted. I guess the only workable approach is if the network captured, in the feature vector, a function of the opponent's vulnerability in terms of time, resources exhausted, and/or proximity, but the concern remains that the danger to the agent increases as it gets closer to the goal. I realize that one could bound the negative-reward minimization from "zero damage" to "anything short of death", but normalizing that against the positive rewards at the final steps of the game or episode would be interesting to understand.

In this strategy it seems odd for an algorithm at certain states to effectively be "saying" things like: "Yes! You just got punched in the face 27 times in a row! (+2700 reward)"; "Congratulations! 2/3 of your ship has lost cabin pressure! (+6600 reward)"; "You have one functional leg, one functional arm, and are suffering acute exsanguination! (+10,000 reward)"; "Infantry death rate increases 200x! (+200,000 reward)". Any thoughts?
@snared_
@snared_ Жыл бұрын
Did you figure it out yet? It's been a year, hopefully you've had time to sit down and make actual progress towards creating this?
@MrHailstorm00
@MrHailstorm00 6 жыл бұрын
The slides are outdated, judging by David's speech, he apparently changed notations and added a few slides in the last 30 mins or so.
@ProfessionalTycoons
@ProfessionalTycoons 6 жыл бұрын
Man, without the gestures it's not the same, the lecture is not the same...
@emrahe468
@emrahe468 6 жыл бұрын
This has good sound quality, but is missing the nice body language...
@ck5300045
@ck5300045 9 жыл бұрын
This really helps. Thanks
@VladislavProkhorov-sr2mf
@VladislavProkhorov-sr2mf 7 жыл бұрын
How does he get the score function at 37:41?
@blairfraser8005
@blairfraser8005 7 жыл бұрын
I've seen this question a few places around the net so I answered it here: math.stackexchange.com/questions/2013050/log-of-softmax-function-derivative/2340848#2340848
@alexanderyau6347
@alexanderyau6347 7 жыл бұрын
Thank you, very elaborate answer!
@MinhVu-fo6hd
@MinhVu-fo6hd 6 жыл бұрын
So, how do you get a score function for a deep NN?
@AM-kx4ue
@AM-kx4ue 5 жыл бұрын
@@blairfraser8005 could you do it for dummies? I don't understand why you put the terms inside logs.
@blairfraser8005
@blairfraser8005 5 жыл бұрын
Our goal is to get a score function by taking the gradient of softmax. It looks like a difficult problem so I need to break it down into a simpler form. The first way to break it down is to separate the numerator and denominator using the log identity: log(x/y) = log(x) - log(y). Now I can apply the gradient to the left and right side independently. I also know that anytime I see something in the form e^x there is a good chance I can simplify and get at the guts of the exponent by taking the log of it. That helps simplify the left side. Next, the right side also takes advantage of a log property - namely that the gradient of the log of f(x) can be written in the form of gradient of f(x) / f(x). This is just the chain rule from calculus. Now the gradients of both the left and right sides are easier.
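Putting the result in one line, for the softmax policy with linear features used in the lecture:

```latex
\pi_\theta(s,a) = \frac{e^{\phi(s,a)^{\top}\theta}}{\sum_b e^{\phi(s,b)^{\top}\theta}}
\;\;\Longrightarrow\;\;
\nabla_\theta \log \pi_\theta(s,a)
  = \phi(s,a) - \sum_b \pi_\theta(s,b)\,\phi(s,b)
  = \phi(s,a) - \mathbb{E}_{\pi_\theta}\!\left[\phi(s,\cdot)\right]
```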
@jk9165
@jk9165 7 жыл бұрын
Thank you for the lecture. I was wondering if you are constrained to use the same state-action feature vectors for actor and critics? The weights are, of course, different, but does \phi(s,a) need to be the same? (57:18)
@narendiranchembu5893
@narendiranchembu5893 7 жыл бұрын
As far as my understanding goes, the feature vectors of the actor and the critic are completely different. The feature vector of the critic is more like the state-space and action-space representation, as you have seen in the Value Function Approximation lecture. For the actor, the feature vectors parameterise the probabilities of taking an action in a given state (mostly).
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Of course they don't have to be the same. The state-action value features are stuff that approximate the state-action value function well, and the policy features are stuff that approximate a general useful policy well. For example, look at what compatible function approximation imposes in order to flow along the real gradient 1:05:52. How are you going to achieve that condition with the same features?
@charles101993
@charles101993 6 жыл бұрын
What is the log policy exactly? Is it just the log of the output of the gradient with respect to some state action pair?
@xingyuanzhang5989
@xingyuanzhang5989 5 жыл бұрын
I need David! It's hard to understand some pronouns without seeing him.
@brandomiranda6703
@brandomiranda6703 5 жыл бұрын
Does he talk about REINFORCE in this talk/lecture? If yes when?
@AlbaraRamli
@AlbaraRamli 5 жыл бұрын
Here: 48:30
@divyanshushekhar5118
@divyanshushekhar5118 5 жыл бұрын
1:07:28 What does Silver mean when he says : "We can reduce the variance without changing the expectation"
@alvinphantomhive3794
@alvinphantomhive3794 5 жыл бұрын
There are several ways to reduce the variance, but reducing it with a "critic" as at 53:02 may keep changing and updating the expectation as you go. This slide shows a way to reduce the variance without changing the expectation. The idea is that subtracting a "baseline function B(s)" from the policy gradient does the job. The expectation equation above shows that, after a few algebra steps, you end up with the baseline B(s) multiplied by the gradient of the policy probabilities, which sum to 1, and the gradient of a constant (1) equals zero. So the whole baseline term of that equation is exactly zero in expectation. That means you can use the baseline function B(s) as a trick to control the variance without changing the expectation: the baseline is not going to affect the expectation, since its contribution to the expectation is zero.
@alvinphantomhive3794
@alvinphantomhive3794 5 жыл бұрын
Sorry if the explanation is not straightforward and a bit complicated lol
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
∇ log pi(s,a) A(s,a) and ∇ log pi(s,a) Q(s,a) have the same expectation in the MDP space. However, which one has the larger variance? V[X] = E[X^2] - E[X]^2. Obviously E[X]^2 is the same for both. However, which expectation is larger, that of |∇ log pi(s,a)|^2 |A(s,a)|^2 or that of |∇ log pi(s,a)|^2 |Q(s,a)|^2? Note that A just centers Q, so typically its square is smaller.
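And the "without changing the expectation" part in one line: for any baseline B(s) that depends only on the state,

```latex
\mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(s,a)\, B(s)\right]
  = \sum_s d^{\pi_\theta}(s)\, B(s) \sum_a \nabla_\theta \pi_\theta(s,a)
  = \sum_s d^{\pi_\theta}(s)\, B(s)\, \nabla_\theta \sum_a \pi_\theta(s,a)
  = 0
```

since the policy probabilities sum to 1 for every state s.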
@shaz7163
@shaz7163 7 жыл бұрын
Can someone explain how he got the score function from the maximum likelihood expression at 22:39?
@sravanchittupalli2333
@sravanchittupalli2333 4 жыл бұрын
I am 2 years late but this might help someone else 😅😅 It is simple differentiation: the grad of log(a) wrt a = grad(a)/a. He just used this identity backwards.
@hyunjaecho1415
@hyunjaecho1415 4 жыл бұрын
What does phi(s) mean at 1:18:05 ?
@sarahjamal86
@sarahjamal86 6 жыл бұрын
Where did you go David :-(
@andreariba7792
@andreariba7792 2 жыл бұрын
it's a pity to see only the slides compared to the previous lectures, the change of format makes it very hard to follow
@TillMiltzow
@TillMiltzow 2 жыл бұрын
When adding the baseline, there is an error. The gradient is zero when multiplying with the baseline because the function B(s) does not depend on theta. Then he uses B(s) = V^{\pi_\Theta} (s), which depends on theta. :( So this is at most a motivation rather than a mathematical proof.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
No error. That gradient is not hitting the baseline B, so it does not matter that B depends on theta. The gradient inside the sum is zero because the policy coefficients sum to one for fixed s. This is a well-known classical thing anyway. It was originally proven in Sutton's paper "Policy Gradient Methods for Reinforcement Learning with Function Approximation".
@helloworld9478
@helloworld9478 4 жыл бұрын
31:40 "...until your graduate students collapse it..." LoL
@kyanas1750
@kyanas1750 6 жыл бұрын
Why is there not a single implementation in MATLAB?
@thomasmezzanine5470
@thomasmezzanine5470 7 жыл бұрын
Thanks for updating the lectures. I have some trouble understanding the state-action feature vector \phi(s, a). I understand the environment features mentioned in the last lecture; they could be some kind of observation of the environment. But how should I understand this state-action feature vector?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
The state-action features in the last lecture and this lecture are different, since in the last lecture they were used to approximate the VALUE Q of a particular state-action pair, and in this lecture they are used to approximate a POLICY PI. State-action features filter important information about the state and action used to approximate the state-value function or maybe the policy, depending on the context.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Say we are in a 2D grid world. The possible actions are up, down, left and right. Every time I move up, I get +1 reward, every time I move down, left or right, I get 0 reward. Define two features as (1,0) if I choose to go up, and (0,1) otherwise. Note that now I can compute the value function EXACTLY as a linear combination of my features, since they contain all the relevant information. My optimal policy is also a linear combination of those features only. PS: you are asking about the linear case, but for me the most interesting case is the nonlinear one.
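A toy version of that example in code; the feature definition and reward values are made up purely for illustration:

```python
import numpy as np

ACTIONS = ["up", "down", "left", "right"]

def phi(state, action):
    # 2-d state-action feature: the first component fires for "up",
    # the second for every other action (state is ignored in this toy case)
    return np.array([1.0, 0.0]) if action == "up" else np.array([0.0, 1.0])

# With these features a linear critic Q_w(s, a) = phi(s, a) @ w is exact:
w = np.array([1.0, 0.0])   # +1 expected reward for "up", 0 otherwise
```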
@guptamber
@guptamber Жыл бұрын
@@edmonddantes4705 Wow that is response to 6 year old question. Thanks for taking time.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
@@guptamberhaha I do it to practise!
@alenmanoocherian631
@alenmanoocherian631 8 жыл бұрын
Hello Karolina, Is there any real video for this class?
@zhaoxuanzhu4364
@zhaoxuanzhu4364 5 жыл бұрын
I am guessing the slides shown in the video are slightly different from the ones they used in the lecture.
@japneetsingh5015
@japneetsingh5015 5 жыл бұрын
Will these policy gradient methods work better than the previous methods based on generalised policy iteration (MC, TD, and SARSA)?
@alexanderyau6347
@alexanderyau6347 7 жыл бұрын
Hi, guys, how can I get the v_t in 50:53?
@narendiranchembu5893
@narendiranchembu5893 7 жыл бұрын
Since we have the rewards of the entire episode, we can calculate the returns the Monte Carlo way. Here, v_t is more like G_t: v_t = R_(t+1) + gamma*R_(t+2) + ... + gamma^(T-t-1)*R_T
@alexanderyau6347
@alexanderyau6347 7 жыл бұрын
Thank you!
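A small helper for computing those returns from one episode's list of rewards, the way a REINFORCE implementation typically does it (names are illustrative):

```python
def discounted_returns(rewards, gamma=0.99):
    """Backward pass over one episode's rewards, giving
    G_t = R_{t+1} + gamma * R_{t+2} + ... for every time step t."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

# e.g. discounted_returns([0, 0, 1], gamma=0.9) -> [0.81, 0.9, 1.0]
```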
@TillMiltzow
@TillMiltzow 3 жыл бұрын
I feel like for the last 15 minutes the slides and what he says are not in sync anymore. :(
@robert780612
@robert780612 6 жыл бұрын
David disappeared, but CC subtitles are coming!!
@ks3562
@ks3562 3 жыл бұрын
I lost it after he started talking about bias and reducing variance in actor-critic algorithms, after 1:05:03
@Sickkkkiddddd
@Sickkkkiddddd 9 ай бұрын
Isn't the state value function 'useless' to an agent considering he 'chooses' actions but can't 'choose' his state?
@hyunghochrischoi6561
@hyunghochrischoi6561 3 жыл бұрын
First time making it this far. But is it just me, or did a lot of the notation change?
@hyunghochrischoi6561
@hyunghochrischoi6561 3 жыл бұрын
Also, he seems to be speaking in one notation while the screen is showing something else.
@alxyok
@alxyok 6 жыл бұрын
rock paper scissors problem: would it not be a better strategy to try to fool the opponent into thinking we are following a policy other than the random play so that we can exploit the consequences of his decisions?
@BGasperov
@BGasperov 5 жыл бұрын
Whatever strategy you come up with can not beat the uniform random strategy - that's why it is considered optimal.
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
In real life it could be good, but theoretically of course not, since it is not a Nash Equilibrium. It can be exploited. Watch lecture 10.
@guptamber
@guptamber Жыл бұрын
I find Prof. Silver brilliant, but the concepts in this lecture are by and large not explained concretely, just illustrated from the book. Moreover, the earlier lectures showed where on the slide the professor was pointing, and that is missing here too.
@karthik-ex4dm
@karthik-ex4dm 6 жыл бұрын
Came with high hopes from the last video... Without the video, I'm unable to tell what he is pointing at.
@claudiocimarelli
@claudiocimarelli 7 жыл бұрын
Is the slide about deterministic policy gradient at the end the one with compatible function approximation (like in the middle of the presentation)? :S Luckily Silver's paper is online :) Very good videos. This one would have been better with the video, but thanks anyway.
@illuminatic7
@illuminatic7 6 жыл бұрын
Not really, the slide you are referring to does not have the gradient of the Q-Function in the equations, which is the main point of what he is talking about. It helps a lot to have a look at the original paper (pages 4 and 5 in particular) to understand his explanation of DPG which can be found here: proceedings.mlr.press/v32/silver14.pdf
@shivajidutta8472
@shivajidutta8472 8 жыл бұрын
Does anyone know about the reading material David mentions in the previous class?
@hantinglu8050
@hantinglu8050 7 жыл бұрын
I guess it is this one: "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto
@chongsun7872
@chongsun7872 8 ай бұрын
A slight mismatch between the voice and the slides...
@tomwon5451
@tomwon5451 6 жыл бұрын
v: critic param, u: actor param
@Kalernor
@Kalernor 3 жыл бұрын
Why do all lecture videos on Policy Gradient Methods use the exact same set of slides lol
@p.z.8355
@p.z.8355 3 жыл бұрын
so when can we determine that there is state aliasing ?
@edmonddantes4705
@edmonddantes4705 Жыл бұрын
Basically when you feel like your features are not representing the MDP very well. The solution is changing the features or improving them.
@jiansenxmu
@jiansenxmu 7 жыл бұрын
I'm looking for the robot.. 32:49
@alexanderyau6347
@alexanderyau6347 7 жыл бұрын
What is state aliasing in reinforcement learning?
@sam41619
@sam41619 7 жыл бұрын
It's like when two different states are represented with the same features, OR when two different states are encoded/represented using the same encoding. Though they are different (and have different rewards), due to the aliasing they appear the same, so it gets difficult for the algorithm or approximator to differentiate between them.
@hardikmadhu584
@hardikmadhu584 6 жыл бұрын
Someone forgot to hit Record!!
@MinhVu-fo6hd
@MinhVu-fo6hd 7 жыл бұрын
How about the non-MDP? Does anyone have experience with that?
@Erain-aus-78
@Erain-aus-78 7 жыл бұрын
@Minh Vu a non-MDP can be artificially converted so that it can be solved with MDP methods, like a quasi-Markov chain
@ProfessionalTycoons
@ProfessionalTycoons 6 жыл бұрын
slide: www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
@phuongdh
@phuongdh 4 жыл бұрын
Someone should superimpose their gestures on top of this video
@mohamedakrout9742
@mohamedakrout9742 6 жыл бұрын
24 dislikes ? I cannot believe you can watch David and hit dislike in the end. Some people are really strange
@muratcan__22
@muratcan__22 5 жыл бұрын
you can't watch actually
@riccardoandreetta9520
@riccardoandreetta9520 8 жыл бұрын
There should be more real examples... the Udacity course is even heavier.
@lex6709
@lex6709 2 жыл бұрын
whoa that was fast
@JH-bb8in
@JH-bb8in 4 жыл бұрын
Lecture quality went downhill quick
@hippiedonut1
@hippiedonut1 3 жыл бұрын
this lecture was difficult