A full lesson on reinforcement learning from a DeepMind researcher. For free! What a time to be alive.
@mawkuri54963 жыл бұрын
lol.. 2 minute papers
@321conquer3 жыл бұрын
You might be dead and it's your AI clone that is typing this... Hope this helps ®
@alejandrorodriguezdomingue1743 жыл бұрын
Yes, apart from sharing knowledge (don't get me wrong), they are also targeting a marketplace and teaching people so they'll use their future products.
@masternobody18963 жыл бұрын
yes
@jakelionlight39363 жыл бұрын
@@mawkuri5496 lol
@JamesNorthrup2 жыл бұрын
TOC, in anger:
0:00 class details
5:00 intro-ish
6:50 Turing
8:20 goals of the class
9:00 what is RL?
12:14 interaction loop
17:20 reward
25:49 Atari game example
28:18 formalization
29:40 reward
30:10 the return
34:00 policy; action values denoted with Q
35:00 pointers to lectures 3, 4, 5, 6
43:00 Markov is maybe not the most important property
44:00 partial observability
46:10 the update function
53:48 policy: a mapping from agent state to action
54:20 stochastic policy
56:00 discount factor emphasizes nearby rewards (or not)
59:00 pi here means a probability distribution, not 3.14; the Bellman equation gets its name here
1:02:00 optional: model
1:04:00 the model predicts the next state and reward (or possibly any state and any reward)
1:07:00 agent categories
1:10:00 common terminology
@cennywenner516 Жыл бұрын
For those who may view this lecture, note that I think it is a bit non-standard in reinforcement learning to denote the state S_t as the "agent's state". It usually refers to the environment's state. This is important for other literature. The closest thing for the agent's state is perhaps "the belief state" b_t. Both are relevant depending on what is being done, and some of the formalization might not work when the two are mixed. Notably, most of the environments that are dealt with are Markovian in the (possibly hidden) environment state but not in the observations or even what the agent may derive about the state, which also means most of the time it is insufficient to condition on only "S_t=s" the way it is defined here, rather than the full history H_t. Considering how a lot of the RL formalism is with regard to the state that is often not fully observable to the agent, maybe this approach is useful.
@chevalier56912 жыл бұрын
Thanks for the amazing lecture! Honestly I prefer this online format to an actual in-person lecture, not only because the audio and presentation are clearer, but also because the concepts are explained more thoroughly, without any interruptions from students.
@luisleal4169 Жыл бұрын
And also you can go back to sections you missed or didn't fully understand.
@prantikdeb39373 жыл бұрын
Thank you for releasing those awesome tutorials 🙏
@abanoubyounan9331 Жыл бұрын
Thank you, DeepMind, for sharing these resources publicly.
@loelie013 жыл бұрын
Great course, thank you for sharing Hado! Particularly enjoyed the clear explanation of Markov Decision Processes and how they relate to Reinforcement Learning.
@321conquer3 жыл бұрын
👍🏿
@Adri2090013 жыл бұрын
Thanks so much for this. We love you from Africa.
@DilekCelik Жыл бұрын
Some diamond lectures from top researchers are public. Amazing. Take advantage, guys. You will not get lectures of this quality from most universities.
@alisheheryar17703 жыл бұрын
The type of learning in which your AI agent learns/tunes itself by interacting with its environment is called reinforcement learning. It offers more generalization power than a purely supervised neural network and is better able to handle unforeseen situations that were not considered when the system was designed.
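For readers who want to see that interaction loop as code, here is a minimal sketch. The `env` and `agent` objects are hypothetical placeholders (any environment/agent pair with this interface would do), not the API of a specific library:

```python
# Minimal sketch of the agent-environment interaction loop from the lecture.
# `env` and `agent` are hypothetical objects; only the shape of the loop matters here.

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                      # initial observation O_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)            # agent picks A_t based on its internal state
        observation, reward, done = env.step(action)  # environment returns O_{t+1}, R_{t+1}
        agent.update(action, observation, reward)  # agent updates its internal (agent) state
        total_reward += reward
        if done:
            break
    return total_reward
```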
@zoranjovicicprofessor47653 жыл бұрын
Bravo!
@tamimyousefi3 жыл бұрын
15:45 Goal: Prosper in all societies. Env.: A world comprised of two societies, killers and pacifists. These two groups despise the actions of the other. You will find reward from one and penalty from the other for any given action.
@gkirgizov_ai2 жыл бұрын
just kill all the pacifists and the goal becomes trivial
@MikeLambert12 жыл бұрын
I think you're still maximizing a reward in your scenario, but it's just the reward is not static, and is instead a function of your state (ie, which society you are physically in).
@charliestinson8088 Жыл бұрын
At 59:06, does the Bellman Equation only apply to MDPs? If it depends on earlier states I don't see how we can write it in terms of only v_pi(S_{t+1})
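For reference, the Bellman expectation equation discussed around that point is usually written as below; the recursion only closes like this under the Markov assumption, i.e. when the distribution of R_{t+1} and S_{t+1} depends on the past only through S_t (otherwise you would indeed have to condition on the full history H_t):

```latex
v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]
```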
@gokublack48323 жыл бұрын
At 49:40 how about just storing the number of steps the agent has taken? Would that make it Markov?
@thedingodile56993 жыл бұрын
No, you could still be standing in either of the two places he highlighted with the same number of steps taken, so you can't tell the difference, whereas if you knew the entire history of states you visited, you would be able to tell them apart.
@gokublack48323 жыл бұрын
@@thedingodile5699 Maze games like this usually have an initial state (i.e., a position in the grid) where the game starts, so I'm not sure why, if you stored the number of steps taken, you wouldn't be able to tell the difference. You'd just look at the steps taken and notice that although the two observations are the same, the positions are far away from each other, so they are likely different. I'd agree if the game could start anywhere on the grid, but that's usually not the case.
@thedingodile56993 жыл бұрын
@@gokublack4832 Even if you start in the same place, you can most likely reach the two squares at the same time step (unless there is a constraint like only being able to reach one of them in an even number of steps, or something like that).
@gokublack48323 жыл бұрын
@@thedingodile5699 True, I guess it's theoretically possible to construct a maze that starts in a single place but later comes to a fork in the road where the two branches are identical, except that only one contains the reward. In that case, counting steps wouldn't help you distinguish between observations on either side... 🤔 tricky problem
@hadovanhasselt73573 жыл бұрын
@@gokublack4832 It's a great question. In some cases adding something simple as counting steps could make the state Markovian, in other cases it wouldn't. But even if this does help disentangle things (and make the resulting inputs Markov), adding such information to the state would also result in there being more states, which could make it harder to learn accurate predictions or good behaviour. In general, this is a tricky problem: we want the state to be informative, but also for it to be easy to generalise from past situations to new ones. If each situation is represented completely separately, the latter can be hard. In later lectures we go more into these kind of questions, including how to use deep learning and neural networks to learn good representations, that hopefully can result in a good mixture between expressiveness on the one hand, and ease of generalisation on the other.
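To make the trade-off above concrete, here is a toy sketch of the step-counting idea (the maze observation and the helper below are hypothetical, purely for illustration):

```python
# Toy sketch: augmenting an aliased maze observation with a step counter.
# `raw_obs` is whatever the (hypothetical) maze shows the agent at time t.

def make_state(raw_obs, step_count):
    # Pairing the observation with the step count can disambiguate some aliased
    # positions (making the state "more Markov"), but it also multiplies the
    # number of distinct states, which can make learning and generalisation harder.
    return (raw_obs, step_count)
```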
@kiet-onlook3 жыл бұрын
Does anyone know how this course compares to the 2015 or 2018 courses offered by Deepmind and UCL? I’m looking to start with one but not sure which one to take.
@sumanthnandamuri21683 жыл бұрын
@DeepMind Can you share assignments?
@sh4ny13 ай бұрын
45:20 If we have a differentiable physics simulator as an environment, from which we can feed gradient information to the agent, would that be considered a fully observable environment?
@JuanMoreno-tj9xh3 жыл бұрын
"Any goal can be formalized as the outcome of maximizing a cumulative reward." What about the goal being to know if a program will halt?
@judewells13 жыл бұрын
My goal is to find a counter example that disproves the reward hypothesis.
@hadovanhasselt73573 жыл бұрын
Great question, Juan! I would say you can still represent this goal with a reward. E.g., give +1 reward when you know the program will halt. So in this case the problem perhaps isn't so much to formulate the goal. Rather, the problem is that we cannot find a policy that optimises it. This is, obviously, a very important question, but it's a different one. One could argue that the halting problem gives an example that some problems can have well-formalised goals, but still do not allow us to find feasible solutions (in at least some cases, or in finite time). This itself doesn't invalidate the reward hypothesis. In fact, this example remains pertinent if you try to formalise this goal in any other way, right? Of course, there is an interesting question which kind of goals we can or cannot hope to achieve in practice, with concrete algorithms. We go into that a bit in subsequent lectures, for instance talking about when optimal policies can guaranteed to be found, and discussing concrete algorithms that can find these, and discussing the required conditions for these algorithms to succeed.
@nocomments_s3 жыл бұрын
@@hadovanhasselt7357 thank you very much for such an elaborate answer!
@JuanMoreno-tj9xh3 жыл бұрын
@@hadovanhasselt7357 True, I didn't think about it that way. I just thought that if you couldn't find a framework, i.e. a reward to give to your agent, such that you could solve your problem by finding the right policy, then you could say the reward hypothesis was false, since there would be no way to get around it. But you are right, it's a different question. And so it remains a hypothesis. Thanks for your time. :)
@JuanMoreno-tj9xh3 жыл бұрын
@@judewells1 Nice one!
@chadmcintire41283 жыл бұрын
Why the downvote on free education? Thanks, I am comparing this to CS 285 from Berkeley; so far it has been good, with a different focus.
@rudigerhoffman35412 жыл бұрын
After around 30:00 it was said that we can't hope to always optimize the return itself, and therefore we need to optimize the expected return. Why? Is this because we don't know the return yet and can only calculate an expected return based on inference from known returns? Or is it only because of the need for discounted returns in possibly infinite Markov decision processes? If so, why wouldn't it work in finite MDPs?
@MikeLambert12 жыл бұрын
My attempt at an answer, on the same journey as you: If you have built/trained/learned a model, it is merely an approximation of the actual environment behavior (based on how we've seen the world evolve thus far). If there's any unknowns (ie, you don't know what other players will do, you don't know what you will see when you look behind the door, etc) then you need to optimize E[R], based on our model's best understanding of what our action will do. Optimizing E[R] will still push us to open the doors because we believe there _might_ be gold behind them. But if we open a particular door without any gold, it doesn't help R (in fact, I believe it lowers R, because any gold we find is now "one additional step" out in the future), even though it maximized E[R].
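For anyone following along with the formulas, the two quantities being discussed are, in the lecture's notation:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots,
\qquad
v_\pi(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]
```

The return G_t is a random variable (the environment, and possibly the policy, is stochastic), so different runs from the same state yield different returns; the expected return v_pi(s) is what we can actually optimise, in finite MDPs just as in infinite ones.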
@0Tsutsumi0 Жыл бұрын
"Any goal can be formalized as the outcome of maximizing a cumulative reward." A broader question would be "Can all possible goals be transformed into a Math formula?", it starts getting trickier whenever you deal with subjective human concepts such as love.
@TexasBUSHMAN3 жыл бұрын
Great video! Thank you! 💪🏾
@randalllionelkharkrang40472 жыл бұрын
Please, can you link the assignments for this course for non-UCL students?
@lqpchen3 жыл бұрын
Thank you! Are there any assignment PDF files?
@yulinchao78372 жыл бұрын
15:45 Let's say my goal is to live forever, and I can take one pill per day that guarantees my survival to the next day. If I don't take it, I die. How do I formalize this goal with cumulative rewards? My goal would be to get infinite reward. However, the outcomes of taking the pill or not on some day in the future are both infinite. In other words, I can't tell whether I live forever just from maximizing the cumulative reward. Does this count as successfully breaking the hypothesis?
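One way the standard formalism handles this, assuming a reward of +1 for each day you are alive and a discount factor gamma < 1 as introduced around 56:00:

```latex
\text{take the pill every day:}\quad G_0 = \sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma},
\qquad
\text{stop after day } k:\quad G_0 = \sum_{t=0}^{k-1} \gamma^{t} = \frac{1-\gamma^{k}}{1-\gamma} < \frac{1}{1-\gamma}
```

With discounting, both returns are finite and "live forever" strictly dominates, so the two behaviours are distinguishable; the undiscounted case needs other criteria (e.g. average reward), which goes beyond this lecture.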
@ianhailey3 жыл бұрын
Are the code and simulation environments for these examples available somewhere?
@patrickliu71792 жыл бұрын
16:22 As an example of a task that fails when trying to maximize a cumulative reward, would casino games with turns of independent probability, such as roulette, break the model? This hinges on the reward accumulation period being treated as extending beyond one turn, resulting in a misapplication of the model. While that is more a human error than a machine error, it's a common human misconception about the game and thus liable to be programmed in that way. Another example may be games with black swan events, where the reward accumulation period is too short to have witnessed a black swan event.
@billykotsos46423 жыл бұрын
The man, the myth, the legend. OMG! I'm in!
@matthewfeeley62263 жыл бұрын
Thank you very much for this lesson and for taking the time to deliver the content.
@AtrejuTauschinsky3 жыл бұрын
I'm a bit confused by models... In particular, value functions map states to rewards, but so do (some) models -- what's the difference? You seem to have the same equation (S -> R) for both on the slide visible at 1:16:30
@hadovanhasselt73573 жыл бұрын
Some models indeed use explicit reward models, that try to learn the expected *immediate reward* following a state or action. Typically, a separate transition model is also learnt, that predicts the next *state*. So a reward model maps a state to a number, but the semantics of that number is not the same as the semantics of what we call a *value*. Values, in reinforcement learning, are defined as the expected sum of future rewards, rather than just the immediate subsequent reward. So while a reward model and a value function have the same functional form (they both map a state to a number), the meaning of that number is different. Hope that helps!
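Written out, the distinction is (standard notation, same spirit as the lecture):

```latex
\text{reward model:}\quad \hat{r}(s) \approx \mathbb{E}\left[ R_{t+1} \mid S_t = s \right],
\qquad
\text{value function:}\quad v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]
```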
@bobaktadjalli2 жыл бұрын
Hi, at 59:50 I couldn't understand the meaning of argument "a" under "max". It would be appreciated if anyone could explain this to me.
@SmartMihir2 жыл бұрын
I think the regular value function gives the value of a state when we pick actions by following pi. The optimal value function, however, picks the action such that the value is maximal (over all subsequent time steps as well).
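If it helps, the equation at that timestamp is (I believe) the Bellman optimality equation, where the max is taken over the available actions a:

```latex
v_*(s) = \max_{a} \, \mathbb{E}\left[ R_{t+1} + \gamma \, v_*(S_{t+1}) \mid S_t = s, \, A_t = a \right]
```

So the `a` under the max is just the action being optimised over: instead of averaging over actions drawn from a fixed policy pi, we take the action with the highest expected value.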
@bhoomeendra Жыл бұрын
37:28 What is meant by prediction? Is it different from the actions?
@mohammadhaadiakhter28699 ай бұрын
At 1:05:49, how did we approximate the policy?
@cuylerbrehaut9813 Жыл бұрын
Suppose the reward hypothesis is true. Then the goal “keep this goal’s reward function beneath its maximum” has a corresponding reward function (rendering the goal itself meaningful) whose maximization is equivalent to the achievement of the goal. If the reward function were maximized, the goal would be achieved, but then the function must not be maximized. This is a contradiction. Therefore the reward function cannot be maximized. Therefore the goal is always achieved, and therefore the reward function is always maximized. This is a contradiction. Therefore the reward hypothesis is false.
@cuylerbrehaut9813 Жыл бұрын
This assumes that the goal described exists. But any counter-example would require such an assumption. To decide if this argument disproves the reward hypothesis, we would need some formal way of figuring out which goals exist and which don’t.
@swazza99993 жыл бұрын
Thanks Hado, this has been very well explained. I've been through similar lectures/ intro papers before but here I learned more of the finer points / subtleties of the RL formalism - things that a teacher might take for granted and not mention explicitly. Question: 1:03:23 anyone know why the second expression is an expectation value and the first is a probability distribution? Typo or a clue to something much more meaningful?
@TheArrowShooter3 жыл бұрын
Given that for a pair (s, a) there is one "true" reward signal in the model to be learnt, the expected value should suffice. I.e., if you modelled this with a distribution, it would in the limit be a Dirac delta at the value r. In the alternative case where there are two (or more) possible reward values for a state-action pair, a probability distribution that you sample from could make more sense. You can ask yourself whether it even makes sense to have multiple possible rewards for an (s, a) pair. I think it could be useful to model your reward function as a distribution when your observed state is only a subset of the environment, for example. E.g. assume you can't sense whether it is raining or not, and the weather determines whether the reward of your (s, a) pair is 5 or 10. Modelling the reward as an expected value (7.5, given that it rains 50 percent of the time) would ignore some subtleties of your model, I suppose. I'm no RL specialist so don't take my word for it!
@swazza99993 жыл бұрын
@@TheArrowShooter Hmm, is it really right that there is one "true" reward signal for a given pair (s, a)? If a robot takes a step in some direction, it may or may not slip on a rock, so even though the action and state are determined beforehand, the consequences can vary. I was thinking about this more, and I realised the first expression is asking about a state, which is a member of a set of states, so it makes sense to ask for the probability that the next state is s'. But in the second expression we are dealing with a scalar variable, so it makes more sense to ask for an expectation value. But don't take my word for it :)
@TheArrowShooter3 жыл бұрын
@@swazza9999 I agree that there are multiple possible reward signals for a given state action pair. I tend to work with deterministic environments (no slipping, ending up in different states, ..), hence our misunderstanding :)! My main point was that you could model it as a probability distribution as well. The resulting learnt model would be more faithful to the underlying "true" model as it could return rewards by sampling (i.e. 5 or 10 in my example).
@willrazen3 жыл бұрын
It's a design choice, you could choose whatever formulation that is suitable for your problem. For example, if you have a small and finite set of possible states, you can build/learn a table with all state transition probabilities, i.e. the transition matrix. As mentioned in the same slide, you could also use a generative model, instead of working with probabilities directly. In Sutton&Barto 2018 they say: "In the first edition we used special notations, P_{ss'}^a and R_{ss'}^a, for the transition probabilities and expected rewards. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations, which is sufficient for dynamic programming but not for reinforcement learning. Another weakness is the excess of subscripts and superscripts. In this edition we use the explicit notation of p(s',r | s,a) for the joint probability for the next state and reward given the current state and action."
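For completeness, the per-transition probability and expected-reward quantities on the slide relate to the joint form p(s', r | s, a) from Sutton & Barto as follows (standard notation, not specific to this lecture):

```latex
p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a),
\qquad
r(s, a) = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right] = \sum_{s', r} r \, p(s', r \mid s, a)
```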
@Cinephile..2 жыл бұрын
Hi, I want to learn data science, machine learning, and AI. I am unable to figure out the right approach and study material; there are numerous courses, but I'm still struggling to find the right one.
@Fordance1003 жыл бұрын
Amazing introduction to RL.
@neerajdokania30027 күн бұрын
Hi, can anyone please explain why the optimal value v_star does not depend on the policy and only depends on the states and actions?
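One way to see it (standard definitions, hopefully matching the lecture's notation): the optimal value function is defined as the best achievable value over all policies, so the policy dependence is "maxed out" and only the state (and the action, for q_*) remains as an argument:

```latex
v_*(s) = \max_{\pi} v_\pi(s),
\qquad
q_*(s, a) = \max_{\pi} q_\pi(s, a)
```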
@abdul69743 жыл бұрын
Is there any practical course in Python for RL, to apply the theory?
@boriskabak2 жыл бұрын
Where can we see a coding introduction on how to implement reinforcement learning models?
@malcolm7436 Жыл бұрын
If your goal is to win the lottery, you incur a weekly cost for each attempt, the odds stay the same each time, and there is no guarantee of ever achieving the goal. If the reward is your profit over time, then the cumulative reward could even be negative and decrease with each attempt.
@rakshithv50733 жыл бұрын
Why do we need to maximize the expectation of the return? What would happen if I maximized the return alone, without the expectation?
@ckhalifa_3 жыл бұрын
The expectation is of the return (not just the reward), and the return already includes the relevant discount factor for each future reward; we need the expectation because the return itself is a random variable.
@theminesweeper1 Жыл бұрын
Is the reward hypothesis generally regarded as true among computer scientists and other smart people?
@Saurabhsingh-cl7px3 жыл бұрын
So do I have to watch DeepMind's previous years' videos on RL to understand this?
@los47763 жыл бұрын
No, it would not be a requirement.
@kejianshi2993 жыл бұрын
Thanks so much for this lesson!
@robinkhlng87283 жыл бұрын
Could you further explain what v(S_{t+1}) formally is? Because v(s) is defined with a lowercase s as input. From what you said, I would say it is \sum_{s'} p(s' | S_t = s) v(s'), i.e. the expected value of v over all possible states s' for S_{t+1}.
@a4anandr3 жыл бұрын
That seems right to me. Probably, it is conditioned on the policy \pi as well.
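Spelled out under the MDP assumptions, the term inside the Bellman equation expands to an expectation over the next state (and over actions, once we condition on the policy pi):

```latex
\mathbb{E}_{\pi}\left[ v_\pi(S_{t+1}) \mid S_t = s \right]
= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \, v_\pi(s')
```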
@robertocordovacastillo30353 жыл бұрын
That is awesome! Thank you from Ecuador.
@anshitbansal2153 жыл бұрын
"If we are observing the full environment then we do not need to worry about keeping the history of previous actions". Why would this be the case, because then from what the agent will learn?
@matiassandacz91452 жыл бұрын
Does anyone know where I can find the assignments for this course? Thank you in advance!
@may80493 жыл бұрын
When will we be able to download AlphaGo and play against it?
@mabbasiazad3 жыл бұрын
Can we have access to the assignments?
@mattsmith65093 жыл бұрын
Can it tell us why people bought toilet paper in the pandemic?
@spectator51442 жыл бұрын
😂😂😂😂😂😂
@chanpreetsingh0072 жыл бұрын
Could you please share the assignments?
@anhtientran31583 жыл бұрын
Thank you for your informative lecture
@adwaitnaik40033 жыл бұрын
Thanks for this course.
@goutamgarai66323 жыл бұрын
thanks DeepMind
@madhurivuddaraju31232 жыл бұрын
Pro tip: always switch off the vacuum cleaner when recording lectures.
@spectator51442 жыл бұрын
he is most probably not using an Apple M1 computer
@KayzeeFPS3 жыл бұрын
I miss David Silver.
@roboticscontrolandmachinel63243 жыл бұрын
David Silver is a hard act to follow
@umarsaboor6881 Жыл бұрын
amazing
@AyushSingh-vj6he2 жыл бұрын
Thanks, I am marking 49:21
@AineOfficial3 жыл бұрын
Day 1 of asking him when AlphaZero will be back to chess again.
@comradestinger3 жыл бұрын
ow right in the inbox
@hadjseddikyousfi00Ай бұрын
Thank you!
@garrymaemahinay30463 жыл бұрын
I have a solution, but I need a team.
@WhishingRaven3 жыл бұрын
This is coming up again.
@philippededeken48812 жыл бұрын
Lovely
@extendedclips3 жыл бұрын
✨👏🏽
@jonathansum90843 жыл бұрын
Many great people have said RL has been replaced by DL. If so, I think we should focus more on newer topics like perceive.IO. I think those are much more important and practical than this older material. I hope you do not mind what I said.
@felipemaldonado80282 жыл бұрын
Do you mind providing evidence about those "many great people"?