One of the channels where I like the video before watching the content. Keep up the good work :)
@norik1616 4 years ago
Exactly. I just hope it doesn't trigger negative feedback in the recommendation engine. Open the video in the evening: like, click download, and watch in the morning.
@KamilCzerski 4 years ago
I remember reading "Learning to Learn by Gradient Descent by Gradient Descent" some time ago. The authors here cite it as [9]. Nice to see a paper like this come out!
@eelcohoogendoorn8044 4 years ago
A learned optimizer that trains itself... I was about to get hyped about the singularity, but then I watched the video and came to my senses. Not quite sure about their architecture; I only watched the vid and didn't read the paper, but back in the day when I was designing various descent algorithms, kids were using things like dot products between step vectors and whatnot. I would hardly expect a per-parameter FFN to be able to approximate that. Feeding more primitives like that into the system might make it go much further, and actually let it gain some meaningful insight into what the local loss landscape is like.
@Neural_Causality 4 years ago
I'm tempted to cite you as a guideline for writing that section (broader impact). Thanks for these videos!!
@sanderbos4243 4 years ago
It's very cool that they're able to make a computer learn a previously hard-coded function! It makes me wonder how often this concept could be applied to previous work, on any kind of topic.
@MikeAirforce111 4 years ago
:D :D
@MikeAirforce111 4 years ago
Really enjoyed listening to your discussion of this paper, and I agree with the criticism in the pseudocode section; there do seem to be a couple of "magic numbers" in there. In that connection I have a (perhaps stupid) question for you: it seems that in AutoML-related research the goal is to reduce the hyperparameter space as much as possible. However, to me it seems like there is "no free lunch" in this sense: looking at various approaches to HPO, there are always hyperparameters on top of whatever one tries to automate away, and tradeoffs involved. Can you point me towards any relevant theory on this?
@TimeofMinecraftMods 4 years ago
> reduce the hyperparameter space as much as possible
Not necessarily: it can make sense to add more hyperparameters if they increase the relative number of "acceptable" settings. Think about the Adam optimizer: you add hyperparameters to control the running mean/std, but you widen the range of learning rates that still get the network to converge. In practice you can achieve significantly faster training with Adam even without tuning the additional parameters, so you get the wider range of acceptable learning rates for free. So instead of having a few very sensitive parameters, you have a couple more that are significantly more robust.
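(For concreteness, a minimal sketch of the Adam update being described, in plain NumPy with the standard default hyperparameters; the toy quadratic loss at the end is just an illustration, not from the paper or the video.)

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: beta1, beta2 and eps are the 'extra' hyperparameters
    controlling the running estimates of the gradient mean and magnitude."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(w) = ||w||^2 without tuning anything.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # near zero; a wide band of lr values behaves similarly
```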
@MikeAirforce111 4 years ago
@@TimeofMinecraftMods That's a good point.
@herp_derpingson 4 years ago
@@TimeofMinecraftMods I don't think it should be called free. Rather, much "knowledge" of neural network tuning is already baked into Adam itself. This can be thought of as simply fixing a lot of hyperparameters to good values that work for many of the DL tasks we do in the research community. If you try to use Adam on some esoteric tasks, you find that vanilla SGD actually performs better because it makes fewer assumptions. Yannic talks a little about this in the video itself.
@herp_derpingson 4 years ago
@Mike Riess Read "The Law of Leaky Abstractions" by Joel Spolsky.
@YannicKilcher 4 years ago
I can only guess (on top of the other replies, which are very nice), but maybe something analogous to the central limit theorem is going on here. That is, the parameters of a neural network are super different for each problem. The hyperparameters, though, seem to be much more similar: learning rate, regularization constants, etc. only vary a bit from problem to problem, and hyper-hyperparameters seem to vary even less, maybe to the point where one good default value can cover most problems. Just a guess.
@techscience5649 4 years ago
Great explanation, thank you. I have a question, please: you said (@18:30) that this method outputs numerical weights while there are symbolic methods. Could you give example papers for the symbolic ones?
@YannicKilcher 3 years ago
Maybe in here: arxiv.org/pdf/2004.05439.pdf
@Meppify 4 years ago
I think you have it backwards at 43:00. The algorithm is optimized to produce good validation loss, and that's what it's doing. If it overfits, it has failed. I guess we are just used to our algorithms overfitting, but that's *not actually what we want*. What we want is good generalization. And if it yields (at least) the same validation loss as Adam, then it has succeeded.
@EternalKernel 4 years ago
That is exactly what I was thinking when he said that. I noticed that it may have generalized better and was excited; then he said he didn't care, which confused me. Now I'm thinking it probably depends on the problem your network is being used on. For instance, if you can be sure that your validation data is close to the rest of all possible data, then yes, it's really important that the validation set gets a good score. But if, for instance, you are sure that your validation set is just a subset of the training data, and the domain of the remaining possible data is completely unknown, then it may be more important to see that the model at least learned something very well, validation be damned. I don't know, what do you think?
@YannicKilcher 3 years ago
You raise a valid point for sure, but that's what I criticize in the video. I see that the learned optimizer is indeed "better" in terms of validation loss, but I'm not willing to give up the property of fitting any dataset (in this case the training dataset) as well as I want it to; otherwise it's not really a good optimizer.
@simleek 4 years ago
Hmm... what about using a transformer to learn which features to look at from the full gradient tensor, including previously looked-at features? It seems like it could take a very, very long time to train, but it may work well on models of arbitrary size and shape.
@herp_derpingson 4 years ago
This paper is basically, "What if hyperparameter tuning, but for the optimizer itself?" Feeding the gradients obtained by finite differences into Adam is interesting; I wonder if we could find some non-deep-learning use cases for that. 36:15 This can be interpreted as the hyperparameters already being baked into the neural network for this specific set of problems. It is not that you don't have to do grid search; it is that it has already been done. All in all, I think it is a recurring problem in DL research that scientists just say, "Whatever, let gradient descent figure it out," and don't try to understand why something works.
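(A minimal sketch of the finite-difference idea mentioned above: estimate a gradient from loss evaluations alone via antithetic sampling, then feed it to any first-order optimizer. All names and the toy loss are illustrative assumptions, not taken from the paper.)

```python
import numpy as np

def es_gradient(loss_fn, theta, sigma=0.1, n_pairs=32, rng=np.random):
    """Antithetic evolution-strategies estimate of the gradient of
    loss_fn at theta, using only function evaluations (no backprop)."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # central finite difference along a random direction
        delta = loss_fn(theta + sigma * eps) - loss_fn(theta - sigma * eps)
        grad += delta / (2 * sigma) * eps
    return grad / n_pairs

# Toy usage with plain SGD; the same estimate could be fed into Adam.
loss = lambda th: np.sum(th ** 2)
theta = np.ones(5)
for _ in range(200):
    theta -= 0.05 * es_gradient(loss, theta)
print(loss(theta))  # close to zero
```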
@YannicKilcher 3 years ago
I think the field gave up on understanding a long time ago :D
@111dimka111 4 years ago
Great review! One comment, though, about the "scaling up" part: in many cases, when the training loss goes to zero, we inevitably start to overfit. That is, with losses like cross-entropy it is impossible to drive the training loss to zero without harming the test loss.
@DieHobbylosenPros 3 years ago
Actually, I think they only use ES for fine-tuning and use Adam before that. They say, "We start outer-training by sampling unroll steps uniformly from 240-360 steps. When performance saturates, we continue training with PES". But I agree that everything in the paper is really vague; the lack of mathematical notation doesn't help.
@littlebigphil 4 years ago
You were brutal this vid. Love it.
@quebono100 4 years ago
Wow the field of machine learning is moving fast
@herp_derpingson 4 years ago
It was much faster back in 2016. Things have slowed down a bit.
@thj7084 4 years ago
@Dmitry Akimov underrated comment
@patrickphillips7009 4 years ago
"What is this thing? Phi? Eh, let's do psi, I know psi."
@drdca8263 4 years ago
I think the 4th paragraph of the "broader impacts" section is actually specific to the thing they are working on, and not just a cookie-cutter "ML bias" paragraph. I'd like to hear your thoughts on that paragraph.
@YannicKilcher 4 years ago
It sounds to me like they're just mentioning AGI as a buzzword, but I could be wrong.
@drdca8263 4 years ago
@@YannicKilcher Maybe? But they are pointing out that people working on AI risk have specifically discussed this as a potential source of alignment drift. I know that people on the Alignment Forum have talked a substantial amount about "mesa-optimizers": the possibility of something that is trained to achieve some task learning to do it by running another optimization process itself, and the risk of this inner optimization process not being aligned. The topic of this paper may be slightly different in that it concerns a case where something is intentionally trained to be an optimizer, while the concern about mesa-optimizers may be more focused on cases where "being or including an optimizer" just happens to be the natural result of training for some other task. But I think "mesa-optimization" still includes cases like this one, and many of the same risks would apply to both, with the dangers arising for essentially the same reason. I think it seems relevant enough.
@astroganov 4 years ago
TgTbTb - this part of your video is amazing
@pensiveintrovert4318 3 years ago
I would say that instead of learning the mother of all optimizers, we should go in the opposite direction and learn better optimizers tailored to a specific domain/task.
@יובלנירהוד 4 years ago
Thank you so much! Great video as always, very interesting and clear.
@chainingten3819 4 years ago
Good video as always 😁
@norik1616 4 years ago
Based on the CIFAR RN behavior you may say it's crap, or you may say this is the best result achievable without overfitting. I believe this is what the learned optimizer was trained to do: lower the val_loss, right?
@EternalKernel 4 years ago
I think that depends on the expected domain of all the remaining possible data vs. the domain of the validation data and the training data?
@egidioln 4 years ago
Great video and analysis! I just slightly disagree that one would want to go as low as possible in training loss. After some point, you are just fitting noise.
@greggaustin4764 3 years ago
I think your claim is not mutually exclusive with the point Yannic was trying to make. The training loss should continue to decrease with a well-tuned optimizer, even at the cost of worse validation loss. A human should have the choice to add regularization or implement early stopping if they suspect overfitting during training. The model presented in the paper makes that choice for us, which I suppose would be advantageous if the learned optimizer were very good. I think Yannic was saying that he doesn't believe the learned optimizer is there yet.
@erickmarin6147 2 years ago
I now look at this video and wonder whether this could be implemented with a fine-tuned GPT, token by token.
@Hexanitrobenzene 2 years ago
43:05 This is some nerdy comedy: "You can make two claims: you can say it's, dunno, "implicitly regularized", or you can just say it's crap..." :)
@NicheAsQuiche 4 years ago
I'VE WANTED TO SEE THIS DONE FOR SO LONG, BUT MY IMPLEMENTATION WAS CRAP. I'M SO HYPED.
@herp_derpingson 4 years ago
Too bad, their implementation is crap too :P
@MrJaggy123 4 years ago
You start out by saying you have mixed feelings about the paper, but the further into the video we get, the more you unreservedly hate it 😆
@LouisChiaki 4 years ago
Are all AI papers required to have a broader impact section?
@YannicKilcher 3 years ago
At some conferences, yes.
@glennkroegel1342 4 years ago
Mentioning AGI in the broader impact statement is just a no for me, unless you are doing a discussion paper on the topic or something.
@xd-os7jl 3 years ago
What's your tool for reading PDFs?
@patrickphillips7009 4 years ago
I wonder how many kilowatts this research used....
@drdca8263 4 years ago
Pedantic, but: do you mean kilowatt hours?
@herp_derpingson 4 years ago
bout tree-fiddy
@NicheAsQuiche 4 years ago
I believe Google has been running on 100% renewable energy for a few years.
@DieHobbylosenPros 3 years ago
They state it in Appendix H.2: "These models take on the order of 200 megawatt hours of power to outer-train".
@stivstivsti 4 years ago
The formula at 16:47 is wrong; it's missing the gradient of w.
@DieHobbylosenPros 3 years ago
I don't think so; a and b are the outputs of the learned optimizer, which takes the gradient (and other features) as input.
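(In other words, the gradient enters as an input feature to the learned network rather than appearing explicitly in the update formula. A hypothetical sketch of an update of that shape, using the names a and b from this thread; the exponential-magnitude form and the small constant scales are illustrative assumptions, not taken verbatim from the paper.)

```python
import numpy as np

lam1, lam2 = 1e-3, 1e-3  # fixed small scales ("magic numbers"), illustrative

def learned_update(w, a, b):
    """Hypothetical per-parameter update: the learned MLP emits a
    (direction) and b (log-magnitude) per weight; grad(w) is consumed
    upstream as one of the MLP's input features, so it does not appear here."""
    return w - lam1 * a * np.exp(lam2 * b)
```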
@dingleberriesify 4 years ago
Evolutionary strategies; AKA semi-random flailing in hyperspace
@CosmiaNebula 4 years ago
But can you do better?
@ЗакировМарат-в5щ 3 years ago
Way too much filler; the narration gets tiring.
@elnazsn 4 years ago
Thanks for the awesome content! Do you take requests at all? Can you do Microsoft's LayoutLM model? Google is also working on Document AI and on integrating BERT-based architectures with OCR/image processing to understand document structures. Paper here: arxiv.org/abs/1912.13318
@pensiveintrovert4318 3 years ago
A non-deep question. Why are we all working so hard to make sure we have no jobs in the near future?