EfficientZero: Mastering Atari Games with Limited Data (Machine Learning Research Paper Explained)

  24,187 views

Yannic Kilcher

A day ago

Comments: 62
@YannicKilcher
@YannicKilcher 3 years ago
OUTLINE:
0:00 - Intro & Outline
2:30 - MuZero Recap
10:50 - EfficientZero improvements
14:15 - Self-supervised consistency loss
17:50 - End-to-end prediction of the value prefix
20:40 - Model-based off-policy correction
25:45 - Experimental Results & Conclusion
Paper: arxiv.org/abs/2111.00210
Code: github.com/YeWR/EfficientZero
Note: the code is not there yet as of the release of this video
@mastercheater08
@mastercheater08 3 years ago
Hi Yannic, the 100k training steps are actually 2 hours of playtime per game, not 2 days. More importantly though, you referred to the code being available. If you actually follow the given link, you will see that it is in fact not.
@joedalton77
@joedalton77 3 years ago
Sharp as always
@dylanloader3869
@dylanloader3869 3 years ago
The classic "Thank you for your attention, we will open-source the codebase later. Please leave your email address here. We will send you the email once we open-source it." - posted 5 years ago. Papers should be retracted if the codebase isn't published within a reasonable amount of time post-publication.
@YannicKilcher
@YannicKilcher 3 years ago
Thanks for correcting and noticing
@Frankthegravelrider
@Frankthegravelrider 3 years ago
@@dylanloader3869 Agreed. Honestly, I wonder how many deep learning papers have massaged results.
@NoNameAtAll2
@NoNameAtAll2 3 years ago
@@Frankthegravelrider I don't think anyone massages results anymore; they just can't find the results' shoulders to put massage cream onto :)
@kev9220
@kev9220 3 years ago
Trajectory prediction would also be an amazing topic to cover! Thanks for this awesome video.
@howuhh8960
@howuhh8960 3 years ago
He already did a video about the Decision Transformer (which is very similar to the Trajectory Transformer).
@ivanmochalov3102
@ivanmochalov3102 3 years ago
Usually I don't watch such videos, but this one is superb (due to the clear explanation).
@Bvic3
@Bvic3 3 years ago
What I'd like to know is how reliable it is compared to DQN. DQN is monstrously hard to train; it depends on how many frames we stack together and how many frames lie between time steps. Also, to get to real-world problems, we probably need skipping latent states. When we remember the world, we remember the memorable landmarks and how to transition between them. Just like when we do math: we remember the important steps, and between steps we use business-as-usual flow. This is a multi-speed/multi-scale way of thinking, and it's how we manage to navigate problems with long-term goals. If I understand your video correctly, it's still trying to predict time step by time step.
@eelcohoogendoorn8044
@eelcohoogendoorn8044 3 years ago
Did they repeat those ablations multiple times to check for repeatability, or was that already stretching their compute budget? It wasn't quite clear to me whether the cases where the improvements were not improvements were simply due to noise, and perhaps you'd see a positive expected benefit if you repeated that ablation experiment many times.
@ssssssstssssssss
@ssssssstssssssss 2 years ago
Engineering-oriented papers like this are a good thing, even though machine learning purists don't like them. Not emphasizing the engineering aspect in papers is unfortunate, though. Real-world practitioners need to know what works for different problems.
@MikkoRantalainen
@MikkoRantalainen 2 years ago
I think machine learning should be more engineering-oriented anyway. All those algorithms require an obscene amount of computation, and an algorithm that better matches the actual hardware can be 100x faster with nothing else changed. This is because, e.g., data in the L1 cache can be fetched 50-100x faster than data in RAM. If an algorithm does an indirect memory reference using two data values in L1 vs. two data values in RAM, the speed difference will be at least 100x for that access alone. And big-O analysis will usually still claim both algorithms are identical.
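A toy sketch of that point (my own example, not from the video or the paper): both loops below do the same O(n²) work over the same NumPy array, but one reads memory contiguously while the other strides across rows, so cache behaviour rather than big-O sets the runtime. The measured gap is usually a few-fold rather than 100x because of hardware prefetching, but asymptotic analysis cannot see it at all.
```python
import time
import numpy as np

n = 4000
a = np.random.rand(n, n)  # C-ordered (row-major) array, ~128 MB

t0 = time.perf_counter()
row_total = sum(float(a[i, :].sum()) for i in range(n))  # contiguous row reads
t1 = time.perf_counter()
col_total = sum(float(a[:, j].sum()) for j in range(n))  # strided column reads
t2 = time.perf_counter()

# Same O(n^2) additions, same result, noticeably different wall-clock time.
print(f"rows: {t1 - t0:.3f}s  columns: {t2 - t1:.3f}s  equal: {np.isclose(row_total, col_total)}")
```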
@serta5727
@serta5727 2 years ago
I can't wait to implement EfficientZero and try it for software testing.
3 years ago
Reminds me a bit of a paper called curiosity-driven exploration, except there it was used only for exploration (the part where you compare the hidden state for the expected next step to the actual next step).
@MikkoRantalainen
@MikkoRantalainen 2 years ago
Great explanation of EfficientZero! I also really liked that you explicitly compared the differences to AlphaZero.
@MikkoRantalainen
@MikkoRantalainen 2 years ago
I would have expected that they would have tried to create some kind of estimated world rules (basically a conversion from observation to latent space) and then optimize that conversion with the reward. And, at least at the start of training, focus on testing actions that have the highest probability of a large positive or negative reward, because that would train the world rules faster. I think this would better match the way humans work. When you encountered an old computer game as a child, you tried every possible action first (what if I run towards the wall, will the movement wrap to the other side, or what happens?) before trying to find smaller actions. Without a known set of rules for the world, you could just assume any random set of rules and then try actions that rule out the maximum amount of wrong states from your model at every timestep. I agree with the MuZero design that the reward is the true target you should optimize, but to avoid a local maximum you have to survey a huge variance of possible actions and world states to figure out the edge cases and loopholes in the rules. Especially in older games, max performance required exploiting all kinds of loopholes. Look at any modern speedrun of an older game and it's obvious that the players exploit design mistakes in the game. I think EfficientZero should have the objective to first find the exploitable rules and then maximize the reward using those exploits. As a side effect, that would probably generate a very effective software exploitation algorithm too. As a tool for the programmer, it would highlight all the security issues in your code. As a tool for an attacker (and not publicly available), it would be a superior tool to attack any system.
@herp_derpingson
@herp_derpingson 3 years ago
17:30 These kinds of losses tend to be unstable. The neural network might simply learn to output a fixed vector, or vectors in a very small cluster, for all states to minimize this loss. So maybe this won't work for very complex games like Go.
@tresuvesdobles
@tresuvesdobles 3 years ago
As he mentions, Go does not require encoding a model of the world in any way, because you already know it perfectly! I don't know about the stability, though; usually they have to resort to things like contrastive learning, and I am not sure if they are using any of that here. However, since there are other losses pushing the optimization, maybe this is not a problem in this particular case.
@YannicKilcher
@YannicKilcher 3 years ago
True, I guess that's why MuZero left it out, because I would totally put that in if I were the MuZero author. But the loss does need to be balanced with the other losses, so maybe the consequence is just a very slippery hyperparameter to tune :)
@StefanausBC
@StefanausBC 3 years ago
Got it! The whole thing is yet another invention by Jürgen Schmidhuber, since it uses an LSTM o_O
@marcobiemann8770
@marcobiemann8770 3 years ago
Actually, the idea of learning a world model goes back to a paper by Ha and Schmidhuber :)
@evanwalters2117
@evanwalters2117 3 years ago
17:00 Thank you for explaining why the MuZero authors purposely left out supervising the hidden state. There are definitely trade-offs here, and it is clear their ideas are geared toward learning Atari fast. I can't see the supervised hidden state, and maybe the value prefix, helping with board games, and I also wonder if rerunning MCTS in reanalyze would be too slow without their C++ batched MCTS implementation, so these are things to keep in mind. These ideas are great additions as options to try, though, depending on your needs!
@jchen5803
@jchen5803 3 years ago
1. Reusing samples is the key, and supervision of the hidden states is done through the SSL module. 2. It depends on the implementation. First, DeepMind did not release their implementation, and second, the current open-source efforts by the community (muzero-pytorch) are clearly not efficient enough.
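For readers who haven't seen the paper, here is a minimal sketch of the "value prefix" idea mentioned above (my own simplified reading, not the authors' code; discounting and their categorical output are omitted for brevity): instead of predicting each step's reward independently, an LSTM reads the unrolled latent states and predicts the cumulative reward up to each step, which makes small temporal misalignments of rewards less damaging.
```python
import torch
import torch.nn as nn

class ValuePrefixHead(nn.Module):
    """LSTM head that predicts the cumulative reward ("value prefix") at each unroll step."""
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, unrolled_states: torch.Tensor) -> torch.Tensor:
        # unrolled_states: (batch, k, state_dim) latents produced by the dynamics model
        h, _ = self.lstm(unrolled_states)
        return self.out(h).squeeze(-1)  # (batch, k) predicted value prefixes

# Targets are cumulative sums of the true rewards along the unroll.
rewards = torch.randn(8, 5)                 # placeholder rewards: batch of 8, unroll of 5
targets = rewards.cumsum(dim=1)
head = ValuePrefixHead(state_dim=32)
pred = head(torch.randn(8, 5, 32))          # placeholder latents
loss = nn.functional.mse_loss(pred, targets)
```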
@NoNTr1v1aL
@NoNTr1v1aL 3 years ago
Great video! May I please know where to get the code for the experimental results, or whether there is code for the model-based off-policy correction?
@Rhannmah
@Rhannmah 3 years ago
28:10 Well, is this per-game optimization tailored by a second AI? Because if it is, I don't see any problem with this approach. Also, did they test on Montezuma's Revenge? Not all Atari games are created equal; some are waaaay more complex than others.
@MikkoRantalainen
@MikkoRantalainen 2 years ago
What would the input for such a second AI be? The name of the game followed by the best known latent state? Of course, you could always run two full AI systems in parallel and use a third system to select the next action based on the short-term success of either subsystem. Doing that would require 2-3x the computing power, and the interesting question is whether either subsystem could give a more accurate answer with that extra 2-3x computing power applied to it alone. Given infinite processing power, solving these issues would be much easier. The computations needed are so expensive that this is more of an engineering optimization problem than a purely theoretical computational one.
@G12GilbertProduction
@G12GilbertProduction 3 years ago
Was it epistemic uncertainty, according to the structural textual semantic models in this language paradigm? Implementing that into neural network language models like CiT is strange and burns up all the resources and time.
@nobody_8_1
@nobody_8_1 1 year ago
EfficientZero is a beast.
@DistortedV12
@DistortedV12 3 years ago
Can you do a video on Nando de Freitas' model delusions in sequence transformers? Interesting blend of causality and sequence models.
@lilhabibi3783
@lilhabibi3783 3 years ago
Section 4.1 describes their "Self-Supervised Consistency Loss", which is close to the WorldModel architecture [1], but where the V and M components are trained end-to-end with the addition of a projector. I find it weird to formulate this in a SimSiam framework given the existing WorldModel architecture. Also, can anyone explain the point of the shared projector? It does not seem to be described well in the paper. [1] Ha and Schmidhuber, "Recurrent World Models Facilitate Policy Evolution", 2018
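For what it's worth, here is a minimal sketch of how I read the SimSiam-style setup (my own assumption-laden reconstruction, not the authors' code): the shared projector maps both the predicted latent and the latent of the encoded real next observation into a common comparison space, a small predictor head is applied only to the predicted branch, and the target branch is stop-gradiented, which is also what prevents the collapse to a constant vector that an earlier comment worried about.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsistencyLoss(nn.Module):
    """SimSiam-style consistency loss between predicted and observed latents (sketch)."""
    def __init__(self, latent_dim: int = 256, proj_dim: int = 128):
        super().__init__()
        # Shared projector: applied to both branches.
        self.projector = nn.Sequential(
            nn.Linear(latent_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))
        # Predictor: applied only to the branch coming from the dynamics model.
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(), nn.Linear(proj_dim, proj_dim))

    def forward(self, predicted_latent: torch.Tensor, target_latent: torch.Tensor) -> torch.Tensor:
        p = self.predictor(self.projector(predicted_latent))  # online branch
        with torch.no_grad():                                  # stop-gradient on target branch
            z = self.projector(target_latent)
        return -F.cosine_similarity(p, z, dim=-1).mean()

loss_fn = ConsistencyLoss()
s_hat = torch.randn(32, 256, requires_grad=True)  # latent unrolled by the dynamics model
s_next = torch.randn(32, 256)                     # latent of the encoded real next observation
loss = loss_fn(s_hat, s_next)
loss.backward()
```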
@JTMoustache
@JTMoustache 3 years ago
Really confused by having the environment drawn on the left - that must be how the English do it.
@billykotsos4642
@billykotsos4642 3 years ago
27:50 lol Freeway doesn't really care
@吴吉人
@吴吉人 2 years ago
100k steps of data is only about 2 hours of gameplay, not 2 days.
@Keirp1
@Keirp1 3 years ago
This paper should be withdrawn from NeurIPS for lying about the number of seeds they ran (it turns out they just ran one). There is literally a paper on how high the variance is for Atari 100k. Also, those reconstructions seem very broken.
@billykotsos4642
@billykotsos4642 3 years ago
Are you sure about this? It has Abbeel's name on it
@Keirp1
@Keirp1 3 years ago
Look at the tweet by agarwl_
@billykotsos4642
@billykotsos4642 3 years ago
@@Keirp1 liiiink ???? please???
@Keirp1
@Keirp1 3 years ago
@@billykotsos4642 I think YouTube won't let me post a link to Twitter.
@billykotsos4642
@billykotsos4642 3 years ago
@@Keirp1 Any other sources that confirm this? Anything else you can point to?
@freemind.d2714
@freemind.d2714 3 years ago
Doesn't Dreamer already do most of this???
@YannicKilcher
@YannicKilcher 3 years ago
Not sure I recall correctly, but I don't think Dreamer has MCTS, which is one of the main components here. But yes, there are a lot of similarities.
@marcobiemann8770
@marcobiemann8770 3 years ago
Another difference is that this paper uses a contrastive loss, whereas Dreamer minimises the KL divergence between the distributions
@freemind.d2714
@freemind.d2714 3 years ago
But man, I really wish we had ways to do this research more analytically, like in the computer graphics or physics simulation fields... Must these things really feel like artistic design instead of scientific research? Sometimes we can only guess whether it's good or not, even after we test it.
@TheHerbert4321
@TheHerbert4321 2 years ago
@@YannicKilcher Isn't Dreamer even more data-efficient than this architecture?
@sanagnos
@sanagnos 3 years ago
🙏 🙏
@guidoansem
@guidoansem 2 years ago
algo
@billykotsos4642
@billykotsos4642 3 years ago
It's 2021. Atari games are old news. Researchers should up their game!
@doppelrutsch9540
@doppelrutsch9540 3 years ago
They're still a useful benchmark. Being able to hit superhuman performance in a human-comparable timeframe is kind of big news...
@Rhannmah
@Rhannmah 3 years ago
It's more about Atari games being in a grey legal area where they're kind of unofficially public domain. No one is going to go ballistic over using the games to benchmark AIs without their consent.
@binjianxin7830
@binjianxin7830 3 years ago
Vapnik still believes he can find the most important ML theory by experimenting with MNIST datasets!
@go00o87
@go00o87 3 years ago
kzbin.info/www/bejne/hHumfYiwoNNgqaM got me somewhat angry. The whole point of 100k should be sample efficiency, so that one moves closer to real-world application. However, to me it seems like the benchmark is poorly constructed, as one can trade off sample efficiency vs. hyperparameter optimisation to some degree. In the business world, the total cost (total number of training steps) is what matters, together with performance, robustness, explainability, etc. But let's "just" plug in yet another loss (with a hyperparameter), another network (with hyperparameters), etc., and it will all be good. Anyway, thanks for the video, I appreciate your explanations a lot ;)