A full lesson on reinforcement learning from a DeepMind researcher. For free! What a time to be alive.
@mawkuri54963 жыл бұрын
lol.. 2 minute papers
@321conquer3 жыл бұрын
You might be dead and it's your AI clone that is typing this... Hope this helps ®
@alejandrorodriguezdomingue1743 жыл бұрын
Yes, apart from sharing knowledge (don't get me wrong), they are also targeting a marketplace and teaching people so they'll use their future products.
@masternobody18963 жыл бұрын
yes
@jakelionlight39363 жыл бұрын
@@mawkuri5496 lol
@JamesNorthrup2 жыл бұрын
TOC, in anger:
0:00 class details
5:00 intro-ish
6:50 Turing
8:20 goals of the class
9:00 what is RL?
12:14 interaction loop
17:20 reward
25:49 Atari game example
28:18 formalization
29:40 reward
30:10 the return
34:00 policy; action values denoted with Q
35:00 pointers to lectures 3, 4, 5, 6
43:00 Markov is maybe not the most important property
44:00 partial observability
46:10 the update function
53:48 policy: a mapping from agent state to action
54:20 stochastic policy
56:00 discount factor emphasizes nearby rewards (or not)
59:00 pi here means a probability distribution, not 3.14; the Bellman equation gets its name here
1:02:00 optional: model
1:04:00 the model predicts the next state and reward (or possibly any state and any reward)
1:07:00 agent categories
1:10:00 common terminology
@cennywenner516 Жыл бұрын
For those who may view this lecture, note that I think it is a bit non-standard in reinforcement learning to denote the state S_t as the "agent's state". It usually refers to the environment's state. This is important for other literature. The closest thing for the agent's state is perhaps "the belief state" b_t. Both are relevant depending on what is being done, and some of the formalization might not work when the two are mixed. Notably, most of the environments that are dealt with are Markovian in the (possibly hidden) environment state but not in the observations or even what the agent may derive about the state, which also means most of the time it is insufficient to condition on only "S_t=s" the way it is defined here, rather than the full history H_t. Considering how a lot of the RL formalism is with regard to the state that is often not fully observable to the agent, maybe this approach is useful.
@chevalier56912 жыл бұрын
Thanks for the amazing lecture! Honestly I prefer this online format to an actual in-person lecture, not only because the audio and presentation are clearer, but also because the concepts are explained more thoroughly, without any interruptions from students.
@luisleal4169 Жыл бұрын
And also you can go back to sections you missed or didn't fully understand.
@prantikdeb39373 жыл бұрын
Thank you for releasing those awesome tutorials 🙏
@abanoubyounan9331 Жыл бұрын
Thank you, DeepMind, for sharing these resources publicly.
@loelie013 жыл бұрын
Great course, thank you for sharing Hado! Particularly enjoyed the clear explanation of Markov Decision Processes and how they relate to Reinforcement Learning.
@321conquer3 жыл бұрын
👍🏿
@Adri2090013 жыл бұрын
Thanks so much for this. We love you from Africa.
@DilekCelik Жыл бұрын
Some diamond lectures from top researchers are public. Amazing. Take advantage, guys. You will not get lectures of this quality from most universities.
@alisheheryar17703 жыл бұрын
The type of learning in which your AI agent learns/tunes itself by interacting with its environment is called reinforcement learning. It offers more generalization power than a purely supervised neural network and is better able to handle unforeseen situations that were not considered when the system was designed.
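For readers who want to see that interaction loop as code, here is a minimal sketch. The `env` and `agent` objects are hypothetical placeholders (any environment/agent pair with this interface would do), not the API of a specific library:

```python
# Minimal sketch of the agent-environment interaction loop from the lecture.
# `env` and `agent` are hypothetical objects; only the shape of the loop matters here.

def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                      # initial observation O_0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)            # agent picks A_t based on its internal state
        observation, reward, done = env.step(action)  # environment returns O_{t+1}, R_{t+1}
        agent.update(action, observation, reward)  # agent updates its internal (agent) state
        total_reward += reward
        if done:
            break
    return total_reward
```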
@zoranjovicicprofessor47653 жыл бұрын
Bravo!
@tamimyousefi3 жыл бұрын
15:45 Goal: Prosper in all societies. Env.: A world comprised of two societies, killers and pacifists. These two groups despise the actions of the other. You will find reward from one and penalty from the other for any given action.
@gkirgizov_ai2 жыл бұрын
just kill all the pacifists and the goal becomes trivial
@MikeLambert12 жыл бұрын
I think you're still maximizing a reward in your scenario, but it's just the reward is not static, and is instead a function of your state (ie, which society you are physically in).
@charliestinson8088 Жыл бұрын
At 59:06, does the Bellman Equation only apply to MDPs? If it depends on earlier states I don't see how we can write it in terms of only v_pi(S_{t+1})
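For reference, the Bellman expectation equation discussed around that point is usually written as below; the recursion only closes like this under the Markov assumption, i.e. when the distribution of R_{t+1} and S_{t+1} depends on the past only through S_t (otherwise you would indeed have to condition on the full history H_t):

```latex
v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma \, v_\pi(S_{t+1}) \mid S_t = s \right]
```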
@gokublack48323 жыл бұрын
At 49:40 how about just storing the number of steps the agent has taken? Would that make it Markov?
@thedingodile56993 жыл бұрын
No, you could still be standing in either of the two places he highlighted with the same number of steps taken, so you can't tell the difference, whereas if you knew the entire history of states you visited, you would be able to tell them apart.
@gokublack48323 жыл бұрын
@@thedingodile5699 Maze games like this usually have an initial state (i.e., a position in the grid) where the game starts, so I'm not sure why, if you stored the number of steps taken, you wouldn't be able to tell the difference. You'd just look at the steps taken and notice that although the two observations are the same, the positions are far away from each other, so they are likely different. I'd agree if the game could start anywhere on the grid, but that's usually not the case.
@thedingodile56993 жыл бұрын
@@gokublack4832 Even if you start in the same place, you can most likely reach the two squares at the same time step (unless there is a constraint like only being able to reach one of them in an even number of steps, or something like that).
@gokublack48323 жыл бұрын
@@thedingodile5699 True, I guess it's theoretically possible to construct a maze that starts in a single place but later comes to a fork in the road where the two branches are identical, except that only one contains the reward. In that case, counting steps wouldn't help you distinguish between observations on either side... 🤔 tricky problem
@hadovanhasselt73573 жыл бұрын
@@gokublack4832 It's a great question. In some cases adding something simple as counting steps could make the state Markovian, in other cases it wouldn't. But even if this does help disentangle things (and make the resulting inputs Markov), adding such information to the state would also result in there being more states, which could make it harder to learn accurate predictions or good behaviour. In general, this is a tricky problem: we want the state to be informative, but also for it to be easy to generalise from past situations to new ones. If each situation is represented completely separately, the latter can be hard. In later lectures we go more into these kind of questions, including how to use deep learning and neural networks to learn good representations, that hopefully can result in a good mixture between expressiveness on the one hand, and ease of generalisation on the other.
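To make the trade-off above concrete, here is a toy sketch of the step-counting idea (the maze observation and the helper below are hypothetical, purely for illustration):

```python
# Toy sketch: augmenting an aliased maze observation with a step counter.
# `raw_obs` is whatever the (hypothetical) maze shows the agent at time t.

def make_state(raw_obs, step_count):
    # Pairing the observation with the step count can disambiguate some aliased
    # positions (making the state "more Markov"), but it also multiplies the
    # number of distinct states, which can make learning and generalisation harder.
    return (raw_obs, step_count)
```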
@kiet-onlook3 жыл бұрын
Does anyone know how this course compares to the 2015 or 2018 courses offered by Deepmind and UCL? I’m looking to start with one but not sure which one to take.
@sumanthnandamuri21683 жыл бұрын
@DeepMind Can you share assignments?
@sh4ny13 ай бұрын
45:20 If we have a differentiable physics simulator as an environment, from which we can feed gradient information to the agent, would that be considered a fully observable environment?
@JuanMoreno-tj9xh3 жыл бұрын
"Any goal can be formalized as the outcome of maximizing a cumulative reward." What about the goal being to know if a program will halt?
@judewells13 жыл бұрын
My goal is to find a counter example that disproves the reward hypothesis.
@hadovanhasselt73573 жыл бұрын
Great question, Juan! I would say you can still represent this goal with a reward. E.g., give +1 reward when you know the program will halt. So in this case the problem perhaps isn't so much to formulate the goal. Rather, the problem is that we cannot find a policy that optimises it. This is, obviously, a very important question, but it's a different one. One could argue that the halting problem gives an example that some problems can have well-formalised goals, but still do not allow us to find feasible solutions (in at least some cases, or in finite time). This itself doesn't invalidate the reward hypothesis. In fact, this example remains pertinent if you try to formalise this goal in any other way, right? Of course, there is an interesting question which kind of goals we can or cannot hope to achieve in practice, with concrete algorithms. We go into that a bit in subsequent lectures, for instance talking about when optimal policies can guaranteed to be found, and discussing concrete algorithms that can find these, and discussing the required conditions for these algorithms to succeed.
@nocomments_s3 жыл бұрын
@@hadovanhasselt7357 thank you very much for such an elaborate answer!
@JuanMoreno-tj9xh3 жыл бұрын
@@hadovanhasselt7357 True, I didn't think about it that way. I just thought that if you couldn't find a framework, i.e. a reward to give to your agent, such that you could solve your problem by finding the right policy, then you could say the reward hypothesis was false, since there would be no way to get around it. But you are right, it's a different question. And so it remains a hypothesis. Thanks for your time. :)
@JuanMoreno-tj9xh3 жыл бұрын
@@judewells1 Nice one!
@chadmcintire41283 жыл бұрын
Why the downvote on free education? Thanks, I am comparing this to CS 285 from Berkeley; so far it has been good, with a different focus.
@rudigerhoffman35412 жыл бұрын
After around 30:00 it was said that we can't hope to always optimize the return itself, and therefore we need to optimize the expected return. Why? Is this because we don't know the return yet and can only calculate an expected return based on inference from known returns? Or is it only because of the need for discounted returns in possibly infinite Markov decision processes? If so, why wouldn't it work in finite MDPs?
@MikeLambert12 жыл бұрын
My attempt at an answer, on the same journey as you: If you have built/trained/learned a model, it is merely an approximation of the actual environment behavior (based on how we've seen the world evolve thus far). If there's any unknowns (ie, you don't know what other players will do, you don't know what you will see when you look behind the door, etc) then you need to optimize E[R], based on our model's best understanding of what our action will do. Optimizing E[R] will still push us to open the doors because we believe there _might_ be gold behind them. But if we open a particular door without any gold, it doesn't help R (in fact, I believe it lowers R, because any gold we find is now "one additional step" out in the future), even though it maximized E[R].
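For anyone following along with the formulas, the two quantities being discussed are, in the lecture's notation:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots,
\qquad
v_\pi(s) = \mathbb{E}\left[ G_t \mid S_t = s \right]
```

The return G_t is a random variable (the environment, and possibly the policy, is stochastic), so different runs from the same state yield different returns; the expected return v_pi(s) is what we can actually optimise, in finite MDPs just as in infinite ones.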
@0Tsutsumi0 Жыл бұрын
"Any goal can be formalized as the outcome of maximizing a cumulative reward." A broader question would be "Can all possible goals be transformed into a Math formula?", it starts getting trickier whenever you deal with subjective human concepts such as love.
@TexasBUSHMAN3 жыл бұрын
Great video! Thank you! 💪🏾
@randalllionelkharkrang40472 жыл бұрын
Please, can you link the assignments for this course for non-UCL students?
@lqpchen3 жыл бұрын
Thank you! Are there any assignment PDF files?
@yulinchao78372 жыл бұрын
15:45 Let's say my goal is to live forever, and I can take one pill per day that guarantees my survival to the next day. If I don't take it, I die. How do I formalize this goal with cumulative rewards? My goal would be to get infinite reward. However, the outcomes of taking the pill or not on some day in the future are both infinite. In other words, I can't tell whether I live forever just from maximizing the cumulative reward. Does this count as successfully breaking the hypothesis?
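One way the standard formalism handles this, assuming a reward of +1 for each day you are alive and a discount factor gamma < 1 as introduced around 56:00:

```latex
\text{take the pill every day:}\quad G_0 = \sum_{t=0}^{\infty} \gamma^{t} = \frac{1}{1-\gamma},
\qquad
\text{stop after day } k:\quad G_0 = \sum_{t=0}^{k-1} \gamma^{t} = \frac{1-\gamma^{k}}{1-\gamma} < \frac{1}{1-\gamma}
```

With discounting, both returns are finite and "live forever" strictly dominates, so the two behaviours are distinguishable; the undiscounted case needs other criteria (e.g. average reward), which goes beyond this lecture.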
@ianhailey3 жыл бұрын
Are the code and simulation environments for these examples available somewhere?
@patrickliu71792 жыл бұрын
16:22 As an example of a task that fails when trying to maximize a cumulative reward, would casino games with turns of independent probability, such as roulette, break the model? This hinges on the reward accumulation period being treated as extending beyond one turn, resulting in a misapplication of the model. While that is more a human error than a machine error, it's a common human misconception about the game and thus liable to be programmed in that way. Another example may be games with black swan events, where the reward accumulation period is too short to have witnessed a black swan event.
@billykotsos46423 жыл бұрын
The man, the myth, the legend. OMG! I'm in!
@matthewfeeley62263 жыл бұрын
Thank you very much for this lesson and for taking the time to deliver the content.
@AtrejuTauschinsky3 жыл бұрын
I'm a bit confused by models... In particular, value functions map states to rewards, but so do (some) models -- what's the difference? You seem to have the same equation (S -> R) for both on the slide visible at 1:16:30
@hadovanhasselt73573 жыл бұрын
Some models indeed use explicit reward models, that try to learn the expected *immediate reward* following a state or action. Typically, a separate transition model is also learnt, that predicts the next *state*. So a reward model maps a state to a number, but the semantics of that number is not the same as the semantics of what we call a *value*. Values, in reinforcement learning, are defined as the expected sum of future rewards, rather than just the immediate subsequent reward. So while a reward model and a value function have the same functional form (they both map a state to a number), the meaning of that number is different. Hope that helps!
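Written out, the distinction is (standard notation, same spirit as the lecture):

```latex
\text{reward model:}\quad \hat{r}(s) \approx \mathbb{E}\left[ R_{t+1} \mid S_t = s \right],
\qquad
\text{value function:}\quad v_\pi(s) = \mathbb{E}\left[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s \right]
```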
@bobaktadjalli2 жыл бұрын
Hi, at 59:50 I couldn't understand the meaning of argument "a" under "max". It would be appreciated if anyone could explain this to me.
@SmartMihir2 жыл бұрын
I think the regular value function gives the value of a state when we pick actions by following pi. The optimal value function, however, picks the action such that the value is maximal (over all subsequent time steps as well).
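If it helps, the equation at that timestamp is (I believe) the Bellman optimality equation, where the max is taken over the available actions a:

```latex
v_*(s) = \max_{a} \, \mathbb{E}\left[ R_{t+1} + \gamma \, v_*(S_{t+1}) \mid S_t = s, \, A_t = a \right]
```

So the `a` under the max is just the action being optimised over: instead of averaging over actions drawn from a fixed policy pi, we take the action with the highest expected value.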
@bhoomeendra Жыл бұрын
37:28 What is meant by prediction? Is it different from the actions?
@mohammadhaadiakhter28699 ай бұрын
At 1:05:49, how did we approximate the policy?
@cuylerbrehaut9813 Жыл бұрын
Suppose the reward hypothesis is true. Then the goal “keep this goal’s reward function beneath its maximum” has a corresponding reward function (rendering the goal itself meaningful) whose maximization is equivalent to the achievement of the goal. If the reward function were maximized, the goal would be achieved, but then the function must not be maximized. This is a contradiction. Therefore the reward function cannot be maximized. Therefore the goal is always achieved, and therefore the reward function is always maximized. This is a contradiction. Therefore the reward hypothesis is false.
@cuylerbrehaut9813 Жыл бұрын
This assumes that the goal described exists. But any counter-example would require such an assumption. To decide if this argument disproves the reward hypothesis, we would need some formal way of figuring out which goals exist and which don’t.
@swazza99993 жыл бұрын
Thanks Hado, this has been very well explained. I've been through similar lectures/ intro papers before but here I learned more of the finer points / subtleties of the RL formalism - things that a teacher might take for granted and not mention explicitly. Question: 1:03:23 anyone know why the second expression is an expectation value and the first is a probability distribution? Typo or a clue to something much more meaningful?
@TheArrowShooter3 жыл бұрын
Given that for a pair (s, a) there is one "true" reward signal in the model to be learnt, the expected value should suffice. I.e., if you modelled this with a distribution, it would in the limit be a Dirac delta at the value r. In the alternative case where there are two (or more) possible reward values for a state-action pair, a probability distribution that you sample from could make more sense. You can ask yourself whether it even makes sense to have multiple possible rewards for an (s, a) pair. I think it could be useful to model your reward function as a distribution when your observed state is only a subset of the environment, for example. E.g. assume you can't sense whether it is raining or not, and the weather determines whether the reward of your (s, a) pair is 5 or 10. Modelling the reward as an expected value (7.5, given that it rains 50 percent of the time) would ignore some subtleties of your model, I suppose. I'm no RL specialist so don't take my word for it!
@swazza99993 жыл бұрын
@@TheArrowShooter Hmm, is it really right that there is one "true" reward signal for a given pair (s, a)? If a robot takes a step in some direction, it may or may not slip on a rock, so even though the action and state are determined beforehand, the consequences can vary. I was thinking about this more, and I realised the first expression is asking about a state, which is a member of a set of states, so it makes sense to ask for the probability that the next state is s'. But in the second expression we are dealing with a scalar variable, so it makes more sense to ask for an expectation value. But don't take my word for it :)
@TheArrowShooter3 жыл бұрын
@@swazza9999 I agree that there are multiple possible reward signals for a given state action pair. I tend to work with deterministic environments (no slipping, ending up in different states, ..), hence our misunderstanding :)! My main point was that you could model it as a probability distribution as well. The resulting learnt model would be more faithful to the underlying "true" model as it could return rewards by sampling (i.e. 5 or 10 in my example).
@willrazen3 жыл бұрын
It's a design choice, you could choose whatever formulation that is suitable for your problem. For example, if you have a small and finite set of possible states, you can build/learn a table with all state transition probabilities, i.e. the transition matrix. As mentioned in the same slide, you could also use a generative model, instead of working with probabilities directly. In Sutton&Barto 2018 they say: "In the first edition we used special notations, P_{ss'}^a and R_{ss'}^a, for the transition probabilities and expected rewards. One weakness of that notation is that it still did not fully characterize the dynamics of the rewards, giving only their expectations, which is sufficient for dynamic programming but not for reinforcement learning. Another weakness is the excess of subscripts and superscripts. In this edition we use the explicit notation of p(s',r | s,a) for the joint probability for the next state and reward given the current state and action."
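For completeness, the per-transition probability and expected-reward quantities on the slide relate to the joint form p(s', r | s, a) from Sutton & Barto as follows (standard notation, not specific to this lecture):

```latex
p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a),
\qquad
r(s, a) = \mathbb{E}\left[ R_{t+1} \mid S_t = s, A_t = a \right] = \sum_{s', r} r \, p(s', r \mid s, a)
```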
@Cinephile..2 жыл бұрын
Hi, I want to learn data science, machine learning, and AI. I am unable to figure out the right approach and study material; there are numerous courses, but I'm still struggling to find the right one.
@Fordance1003 жыл бұрын
Amazing introduction to RL.
@neerajdokania30027 күн бұрын
Hi, can anyone please explain why the optimal value v_star does not depend on the policy and only depends on the states and actions?
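One way to see it (standard definitions, hopefully matching the lecture's notation): the optimal value function is defined as the best achievable value over all policies, so the policy dependence is "maxed out" and only the state (and the action, for q_*) remains as an argument:

```latex
v_*(s) = \max_{\pi} v_\pi(s),
\qquad
q_*(s, a) = \max_{\pi} q_\pi(s, a)
```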
@abdul69743 жыл бұрын
Is there any practical course in Python for RL, to apply the theory?
@boriskabak2 жыл бұрын
Where can we see a coding introduction on how to implement reinforcement learning models?
@malcolm7436 Жыл бұрын
If your goal is to win the lottery, you incur a weekly cost for each attempt, the odds stay the same each time, and there is no guarantee of ever achieving the goal. If the reward is your profit over time, then the cumulative reward could even be negative and decrease with each attempt.
@rakshithv50733 жыл бұрын
Why do we need to maximize the expectation of the return? What would happen if I maximized the return alone, without the expectation?
@ckhalifa_3 жыл бұрын
The expectation is of the return (not just the reward), and the return already includes the relevant discount factor for each future reward; we need the expectation because the return itself is a random variable.
@theminesweeper1 Жыл бұрын
Is the reward hypothesis generally regarded as true among computer scientists and other smart people?
@Saurabhsingh-cl7px3 жыл бұрын
So do I have to watch DeepMind's previous years' videos on RL to understand this?
@los47763 жыл бұрын
No, it would not be a requirement.
@kejianshi2993 жыл бұрын
Thanks so much for this lesson!
@robinkhlng87283 жыл бұрын
Could you further explain what v(S_{t+1}) formally is? Because v(s) is defined with a lowercase s as input. From what you said, I would say it is \sum_{s'} p(s' | S_t = s) v(s'), i.e. the expected value of v over all possible states s' for S_{t+1}.
@a4anandr3 жыл бұрын
That seems right to me. Probably, it is conditioned on the policy \pi as well.
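Spelled out under the MDP assumptions, the term inside the Bellman equation expands to an expectation over the next state (and over actions, once we condition on the policy pi):

```latex
\mathbb{E}_{\pi}\left[ v_\pi(S_{t+1}) \mid S_t = s \right]
= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \, v_\pi(s')
```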
@robertocordovacastillo30353 жыл бұрын
That is awesome! Thank you from Ecuador.
@anshitbansal2153 жыл бұрын
"If we are observing the full environment then we do not need to worry about keeping the history of previous actions". Why would this be the case, because then from what the agent will learn?
@matiassandacz91452 жыл бұрын
Does anyone know where I can find the assignments for this course? Thank you in advance!
@may80493 жыл бұрын
When will we be able to download AlphaGo and play against it?
@mabbasiazad3 жыл бұрын
Can we have access to the assignments?
@mattsmith65093 жыл бұрын
Can it tell us why people bought toilet paper in the pandemic?
@spectator51442 жыл бұрын
😂😂😂😂😂😂
@chanpreetsingh0072 жыл бұрын
Could you please share the assignments?
@anhtientran31583 жыл бұрын
Thank you for your informative lecture
@adwaitnaik40033 жыл бұрын
Thanks for this course.
@goutamgarai66323 жыл бұрын
thanks DeepMind
@madhurivuddaraju31232 жыл бұрын
Pro tip: always switch off the vacuum cleaner when recording lectures.
@spectator51442 жыл бұрын
he is most probably not using an Apple M1 computer
@KayzeeFPS3 жыл бұрын
I miss David Silver.
@roboticscontrolandmachinel63243 жыл бұрын
David Silver is a hard act to follow
@umarsaboor6881 Жыл бұрын
amazing
@AyushSingh-vj6he2 жыл бұрын
Thanks, I am marking 49:21
@AineOfficial3 жыл бұрын
Day 1 of asking him when AlphaZero will be back to chess again.
@comradestinger3 жыл бұрын
ow right in the inbox
@hadjseddikyousfi00Ай бұрын
Thank you!
@garrymaemahinay30463 жыл бұрын
I have a solution, but I need a team.
@WhishingRaven3 жыл бұрын
This is coming up again.
@philippededeken48812 жыл бұрын
Lovely
@extendedclips3 жыл бұрын
✨👏🏽
@jonathansum90843 жыл бұрын
Many great people have said RL has been replaced by DL. If so, I think we should focus more on newer topics like perceive.IO. I think those are much more important and practical than this older material. I hope you do not mind what I said.
@felipemaldonado80282 жыл бұрын
Do you mind providing evidence about those "many great people"?