OUTLINE:
0:00 - Intro & Outline
2:30 - MuZero Recap
10:50 - EfficientZero improvements
14:15 - Self-Supervised consistency loss
17:50 - End-to-end prediction of the value prefix
20:40 - Model-based off-policy correction
25:45 - Experimental Results & Conclusion

Paper: arxiv.org/abs/2111.00210
Code: github.com/YeWR/EfficientZero
Note: the code was not yet available as of the release of this video
@mastercheater08 3 years ago
Hi Yannic, the 100k training steps are actually 2h of playtime per game, not 2 days. More importantly though, you referred to the code as being available. If you actually follow the given link, you will see that it is in fact not.
@joedalton77 3 years ago
Sharp as always
@dylanloader3869 3 years ago
The classic "Thank you for your attention, we will open-source the codebase later. Please leave your email address here. We will send you the email once we open-source it." - Posted 5 years ago Papers should be retracted if the codebase isn't published within a reasonable amount of time post publication.
@YannicKilcher 3 years ago
Thanks for correcting and noticing
@Frankthegravelrider 3 years ago
@@dylanloader3869 agree, honestly I wonder how many deep learning papers have massaged results.
@NoNameAtAll2 3 years ago
@@Frankthegravelrider I don't think anyone massages results anymore they just can't find results' shoulders to put massage cream onto :)
@kev9220 3 years ago
Trajectory Prediction would be also an amazing topic to cover! Thanks for this awesome video.
@howuhh8960 3 years ago
He already did a video about the Decision Transformer (which is very similar to the Trajectory Transformer)
@ivanmochalov3102 3 years ago
Usually I don't watch such videos, but this one is superb thanks to the clear explanation
@Bvic3 3 years ago
What I'd like to know is how reliable it is compared to DQN. DQN is monstrously hard to train; it depends on how many frames we stack together and how many frames pass between time steps. Also, to get to real-world problems, we probably need skipping latent states. When we remember the world, we remember the memorable landmarks and how to transition between them. Just like when we do math, we remember the important steps and between steps we use business-as-usual flow. This is a multi-speed/multi-scale way of thinking; it's how we manage to navigate problems with long-term goals. If I understand your video correctly, it's still trying to predict time step by time step.
@eelcohoogendoorn8044 3 years ago
Did they repeat those ablations multiple times to check for repeatability, or was that already stretching their compute budget? It was not quite clear to me whether the cases where the improvements were not improvements were simply due to noise; perhaps you'd see a positive expected benefit if you repeated that ablation experiment many times.
@ssssssstssssssss 2 years ago
Engineering-oriented papers like this are a good thing, even though machine learning purists don't like them. Not emphasizing the engineering aspect in papers is unfortunate: real-world practitioners need to know what works for different problems.
@MikkoRantalainen 2 years ago
I think machine learning should be more engineering-oriented anyway. All these algorithms require obscene amounts of computation, and an algorithm that better matches the actual hardware can be 100x faster with nothing else changed. This is because, e.g., data in L1 cache can be fetched 50-100x faster than data in RAM. If one algorithm does an indirect memory reference using two values in L1 and another uses two values in RAM, the speed difference will be at least 100x from that alone. And big-O analysis will usually still claim both algorithms are identical.
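The locality point above can be illustrated with a toy Python sketch (the effect in interpreted Python is far smaller than the 50-100x figure quoted for tuned native code, and the exact ratio is hardware-dependent): summing the same array sequentially and in shuffled order does identical big-O work, but the shuffled order defeats the cache and prefetcher.

```python
import random
import time

N = 2_000_000
data = list(range(N))

seq_idx = list(range(N))       # cache-friendly: visit elements in order
rnd_idx = seq_idx[:]
random.shuffle(rnd_idx)        # same indices, cache-hostile order

def sum_by(indices):
    """Sum data at the given indices and report wall-clock time."""
    t0 = time.perf_counter()
    total = 0
    for i in indices:
        total += data[i]
    return total, time.perf_counter() - t0

s1, t_seq = sum_by(seq_idx)
s2, t_rnd = sum_by(rnd_idx)

# Identical result, identical big-O cost; only the access pattern differs.
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rnd:.3f}s  equal sums: {s1 == s2}")
```

On typical hardware the shuffled traversal comes out measurably slower, even though any complexity analysis treats the two loops as identical.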
@serta5727 2 years ago
I can’t wait to implement Efficient Zero and try it for software testing
3 years ago
Reminds me a bit of a paper called "Curiosity-Driven Exploration", except there it was used only for exploration (the part where you compare the hidden state for the expected next step to the actual next step)
@MikkoRantalainen 2 years ago
Great explanation of the EfficientZero! I also really liked that you explicitly compared the differences to AlphaZero.
@MikkoRantalainen 2 years ago
I would have expected that they would have tried to create some kind of estimated world rules (basically a conversion from observation to latent space) and then optimize that conversion with the reward. And at least at the start of training, focus on testing the actions that have the highest probability of a large positive or negative reward, because that would train the world rules faster. I think this would better match the way humans work. When you encountered an old computer game as a child, you tried every possible action first (what if I run towards the wall, will the movement wrap to the other side, or what happens?) before trying to find smaller actions. Without a known set of rules for the world, you could just assume a random set of rules and then try the actions that rule out the maximum number of wrong states in your model at every timestep. I agree with the MuZero design that the reward is the true target you should optimize, but to avoid local maxima, you have to survey a huge variety of possible actions and world states to figure out the edge cases and loopholes in the rules. Especially in older games, max performance required exploiting all kinds of loopholes. Look at any modern speedrun of an older game and it's obvious that the players exploit design mistakes. I think EfficientZero should have the objective of first finding the exploitable rules and then maximizing the reward using those exploits. As a side effect, that would probably produce a very effective software-exploitation algorithm, too. As a tool for the programmer, it would highlight all the security issues in your code. As a tool for an attacker (and not publicly available), it would be a superior tool for attacking any system.
@herp_derpingson 3 years ago
17:30 These kinds of losses tend to be unstable. The neural network might simply learn to output a fixed vector, or vectors in a very small cluster, for all states to minimize this loss. So maybe this won't work for very complex games like Go.
@tresuvesdobles 3 years ago
As he mentions, Go does not require encoding a model of the world in any way, because you already know it perfectly! I don't know about the stability though; usually people have to resort to things like contrastive learning, and I am not sure if they are using any of that here. However, since there are other losses pushing the optimization, maybe this is not a problem in this particular case
@YannicKilcher 3 years ago
True, I guess that's why the MuZero authors left it out, because I would totally have put that in if I were them. But the loss does need to be balanced with the other losses, so maybe the consequence is just a very slippery hyperparameter to tune :)
@StefanausBC 3 years ago
Got it! Whole thing is yet another invention by Jürgen Schmidhuber as it uses LSTM o_O
@marcobiemann8770 3 years ago
Actually, the idea of learning a world model goes back to a paper of Ha and Schmidhuber :)
@evanwalters2117 3 years ago
17:00 Thank you for explaining why the MuZero authors purposely left out supervising the hidden state. There are definitely trade-offs here, and it is clear their ideas are geared toward learning Atari fast. I can't see the supervised hidden state, and maybe the value prefix, helping with board games, and I also wonder if rerunning MCTS in reanalyze would be too slow without their C++ batched MCTS implementation, so these are things to keep in mind. These ideas are great additions to try, though, depending on your needs!
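For readers unfamiliar with the term: the value prefix mentioned above is just the discounted reward sum from the root up to each unrolled step, which EfficientZero predicts end-to-end (with an LSTM head) instead of predicting one reward per step. A tiny sketch of the target computation; the discount and example rewards here are illustrative, not the paper's values:

```python
def value_prefix_targets(rewards, gamma=0.997):
    """Cumulative discounted reward from the root up to each unroll step k.
    EfficientZero trains a recurrent head to predict these sums directly,
    rather than one per-step reward, which is more robust to small
    misalignments in when exactly a reward lands."""
    prefix, total = [], 0.0
    for k, r in enumerate(rewards):
        total += (gamma ** k) * r
        prefix.append(total)
    return prefix

print(value_prefix_targets([1.0, 0.0, 2.0], gamma=0.5))
# [1.0, 1.0, 1.5]
```

Note that a one-step timing error in the rewards barely changes these cumulative targets, whereas it fully corrupts per-step reward targets.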
@jchen5803 3 years ago
1. Reusing samples is the key, and supervision of the hidden states is done through the SSL module. 2. It depends on the implementation: first, DeepMind does not release theirs, and second, the current open-source efforts by the community (e.g. muzero-pytorch) are clearly not efficient enough.
@NoNTr1v1aL 3 years ago
Great video! May I please know where to get the code for the experimental results, or whether there is code for the model-based off-policy correction?
@Rhannmah 3 years ago
28:10 Well, is this per-game optimization tailored by a second AI? Because if it is, I don't see any problem with this approach. Also, did they test on Montezuma's Revenge? Not all Atari games are created equal; some are waaaay more complex than others.
@MikkoRantalainen 2 years ago
What would the input for such a second AI be? The name of the game followed by the best known latent state? Of course, you could always run two full AI systems in parallel and use a third system to select the next action based on the short-term success of either subsystem. Doing that requires 2-3x the computing power, and the interesting question is whether either subsystem could give more accurate answers with that extra 2-3x computing power applied to it alone. Given infinite processing power, solving these issues would be much easier. The computations needed are so expensive that this is more of an engineering optimization problem than a purely theoretical computational one.
@G12GilbertProduction 3 years ago
Was it epistemic uncertainty, according to the structural textual semantic models in this language paradigm? Implementing that in neural language models like CiT is strange and burns up all the resources and time.
@nobody_8_1 1 year ago
EfficientZero is a beast.
@DistortedV12 3 years ago
Can you do a video on Nando de Freitas' model delusions in sequence transformers? Interesting blend of causality and sequence models
@lilhabibi3783 3 years ago
Section 4.1 describes their "Self-Supervised Consistency Loss", which is close to the WorldModel architecture [1], but where the V and M components are trained end-to-end with the addition of a projector. I find it weird to formulate this in a SimSiam framework given the existing WorldModel architecture. Also, can anyone explain the point of the shared projector? It does not seem to be well described in the paper. [1] Ha and Schmidhuber, "Recurrent World Models Facilitate Policy Evolution", 2018
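For what it's worth, the shape of that loss can be sketched in a few lines of numpy. The dimensions and random linear weights here are made up for illustration; in the paper the projector and predictor are small MLPs, and the target branch sits behind a stop-gradient, which is the SimSiam mechanism that counters collapse to a constant vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 64-d latent state, 32-d projection.
D_LATENT, D_PROJ = 64, 32
W_proj = rng.normal(size=(D_LATENT, D_PROJ)) / np.sqrt(D_LATENT)  # shared projector
W_pred = rng.normal(size=(D_PROJ, D_PROJ)) / np.sqrt(D_PROJ)      # predictor head

def normalize(x):
    return x / np.linalg.norm(x)

def consistency_loss(s_pred, s_target):
    """Negative cosine similarity between the online branch
    (projector + predictor) and the target branch (projector only).
    In the real model the target branch is wrapped in stop-gradient,
    and the predictor asymmetry plus stop-gradient is what prevents
    the collapse-to-constant solution."""
    p = normalize((s_pred @ W_proj) @ W_pred)   # online branch
    z = normalize(s_target @ W_proj)            # target branch (treated as constant)
    return -float(p @ z)

s_hat = rng.normal(size=D_LATENT)   # dynamics-predicted next latent state
s_obs = rng.normal(size=D_LATENT)   # encoder output of the actual next observation
loss = consistency_loss(s_hat, s_obs)
print(loss)  # a scalar in [-1, 1]; training pushes it toward -1
```

The shared projector, as far as one can tell, maps both branches into a common space so the dynamics output is only required to match the encoder up to that projection rather than exactly.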
@JTMoustache 3 years ago
Really confused by having the environment drawn on the left - that must be how the English do it
@billykotsos4642 3 years ago
27:50 lol Freeway doesn't really care
@吴吉人 2 years ago
100k steps is only 2 hours of playtime, not 2 days.
@Keirp1 3 years ago
This paper should be withdrawn from NeurIPS for lying about the number of seeds they ran (it turns out they just ran one). There is literally a paper on how high the variance is for Atari 100k. Also, those reconstructions seem very broken.
@billykotsos4642 3 years ago
Are you sure about this? It has Abbeel's name on it
@Keirp1 3 years ago
Look at the tweet by agarwl_
@billykotsos4642 3 years ago
@@Keirp1 liiiink ???? please???
@Keirp1 3 years ago
@@billykotsos4642 I think YouTube won't let me post a link to Twitter.
@billykotsos4642 3 years ago
@@Keirp1 any other sources that confirm this ? Anything else you can point to ?
@freemind.d2714 3 years ago
Doesn't Dreamer already do most of this???
@YannicKilcher 3 years ago
Not sure I recall correctly, but I don't think Dreamer has MCTS, which is one of the main components here. But yes, there are a lot of similarities
@marcobiemann8770 3 years ago
Another difference is that this paper uses a contrastive loss, whereas Dreamer minimises the KL divergence between the distributions
@freemind.d2714 3 years ago
But man, I really wish we had ways to do this research more analytically, like in the computer graphics or physics simulation fields... This stuff really feels like artistic design instead of scientific research; sometimes we can only guess whether something is good even after we test it
@TheHerbert4321 2 years ago
@@YannicKilcher Isn't Dreamer even more data efficient than this architecture?
@sanagnos 3 years ago
🙏 🙏
@guidoansem 2 years ago
algo
@billykotsos4642 3 years ago
It's 2021. Atari games are old news. Researchers should up their game!
@doppelrutsch9540 3 years ago
They're still a useful benchmark. Being able to hit superhuman performance in a human-comparable timeframe is kind of big news...
@Rhannmah 3 years ago
It's more about Atari games being in a legal grey area where they're kind of unofficially public domain. No one is going to go ballistic over using the games to benchmark AIs without their consent.
@binjianxin7830 3 years ago
Vapnik still believes he can find the most important ML theory by experimenting with MNIST datasets!
@go00o87 3 years ago
kzbin.info/www/bejne/hHumfYiwoNNgqaM got me somewhat angry. The whole point of 100k should be sample efficiency, so that we move closer to real-world applications. However, it seems to me that the benchmark is poorly constructed, as one can trade off sample efficiency against hyperparameter optimisation to some degree. In the business world, the total cost (total number of training steps) is what matters, together with performance, robustness, explainability, ... But let's "just" plug in yet another loss (with hyperparameters), another network (with hyperparameters), etc., and it will all be good. Anyway, thanks for the video, I appreciate your explanations a lot ;)