One of the channels where I like the video before watching the content. Keep up the good work :)
@norik1616 4 years ago
Exactly. I just hope it doesn't trigger negative feedback in the recommendation engine. Open the video in the evening: like, click download, and watch in the morning.
@KamilCzerski 4 years ago
I remember reading "Learning to Learn by Gradient Descent by Gradient Descent" some time ago. The authors here cite it as [9]. Nice to see a paper like this come out!
@eelcohoogendoorn8044 4 years ago
A learned optimizer that trains itself... I was about to get hyped about the singularity, but then I watched the video and came to my senses. Not quite sure about their architecture; I only watched the vid and didn't read the paper, but back in the day when I was designing various descent algorithms, kids were using things like dot products between step vectors and whatnot. I would hardly expect a per-parameter FFN to be able to approximate that. Feeding more primitives like that into the system might make it go much further, and actually let it gain some meaningful insight into what the local loss landscape is like.
@Neural_Causality 4 years ago
I'm tempted to cite you as a guideline for writing that section (broader impact). Thanks for these videos!!
@sanderbos4243 4 years ago
It's very cool that they're able to make a computer learn a previously hard-coded function! It makes me wonder how often this concept could be applied to previous work, on any kind of topic.
@MikeAirforce111 4 years ago
:D :D
@MikeAirforce111 4 years ago
Really enjoyed listening to your discussion of this paper, and I agree with the criticism in the pseudocode section; there do seem to be a couple of "magic numbers" in there. In that connection I have a (perhaps stupid) question for you: it seems that in AutoML-related research the goal is to reduce the hyperparameter space as much as possible. However, to me it seems like there is "no free lunch" in this sense: looking at various approaches to HPO, there are always hyperparameters on top of whatever one tries to automate away, and tradeoffs involved. Can you point me towards any relevant theory on this?
@TimeofMinecraftMods 4 years ago
> reduce the hyperparameter space as much as possible
Not necessarily: it can make sense to add more hyperparameters if they increase the relative number of "acceptable" settings. Think about the Adam optimizer: you add hyperparameters to control the running mean/std, but you widen the range of learning rates that still get the network to converge. In practice you can achieve significantly faster training with Adam even without tuning the additional parameters, so you get the wider range of acceptable learning rates for free. So instead of having a few very sensitive parameters, you have a couple more that are significantly more robust.
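(For concreteness, a minimal sketch of the Adam update being described, in plain NumPy with the standard default hyperparameters; the toy quadratic loss at the end is just an illustration, not from the paper or the video.)

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: beta1, beta2 and eps are the 'extra' hyperparameters
    controlling the running estimates of the gradient mean and magnitude."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimize f(w) = ||w||^2 without tuning anything.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t)
print(w)  # near zero; a wide band of lr values behaves similarly
```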
@MikeAirforce111 4 years ago
@@TimeofMinecraftMods That's a good point.
@herp_derpingson 4 years ago
@@TimeofMinecraftMods I don't think it should be called free. Rather, much "knowledge" of neural network tuning is already baked into Adam itself. This can be thought of as simply fixing a lot of hyperparameters to good values that work for many of the DL tasks we do in the research community. If you try to use Adam on some esoteric tasks, you find that vanilla SGD actually performs better because it makes fewer assumptions. Yannic talks a little about this in the video itself.
@herp_derpingson 4 years ago
@Mike Riess Read "The Law of Leaky Abstractions" by Joel Spolsky.
@YannicKilcher 4 years ago
I can only guess (on top of the other replies, which are very nice), but maybe something analogous to the central limit theorem is going on here. That is, the parameters of a neural network are super different for each problem. The hyperparameters, though, seem to be much more similar: learning rate, regularization constants, etc. only vary a bit from problem to problem, and hyper-hyperparameters seem to vary even less, maybe to the point where one good default value can cover most problems. Just a guess.
@techscience5649 4 years ago
Great explanation, thank you. I have a question, please: you said (@18:30) that this method outputs numerical weights while there are symbolic methods. Could you give example papers for the symbolic ones?
@YannicKilcher 3 years ago
Maybe in here: arxiv.org/pdf/2004.05439.pdf
@Meppify 4 years ago
I think you have it backwards at 43:00. The algorithm is optimized to produce good validation loss, and that's what it's doing. If it overfits, it has failed. I guess we are just used to our algorithms overfitting, but that's *not actually what we want*. What we want is good generalization. And if it yields (at least) the same validation loss as Adam, then it has succeeded.
@EternalKernel 4 years ago
That is exactly what I was thinking when he said that. I noticed that it may have generalized better and was excited; then he said he didn't care, which confused me. Now I'm thinking it probably depends on the problem your network is being used on. For instance, if you can be sure that your validation data is close to the rest of all possible data, then yes, it's really important that the validation set gets a good score. But if, for instance, you are sure that your validation set is just a subset of the training data, and the domain of the remaining possible data is completely unknown, then it may be more important to see that the model at least learned something very well, validation be damned. I don't know, what do you think?
@YannicKilcher 3 years ago
You raise a valid point for sure, but that's what I criticize in the video. I see that the learned optimizer is indeed "better" in terms of validation loss, but I'm not willing to give up the property of fitting any dataset (in this case the training dataset) as well as I want it to; otherwise it's not really a good optimizer.
@simleek 4 years ago
Hmm... what about using a transformer to learn which features to look at from the full gradient tensor, including previously looked-at features? It seems like it could take a very, very long time to train, but it may work well on models of arbitrary size and shape.
@herp_derpingson 4 years ago
This paper is basically, "What if hyperparameter tuning, but for the optimizer itself?" Feeding the gradients obtained by finite differences into Adam is interesting; I wonder if we could find some non-deep-learning use cases for that. 36:15 This can be interpreted as the hyperparameters already being baked into the neural network for this specific set of problems. It is not that you don't have to do grid search; it is that it has already been done. All in all, I think it is a recurring problem in DL research that scientists just say, "Whatever, let gradient descent figure it out," and don't try to understand why something works.
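(A minimal sketch of the finite-difference idea mentioned above: estimate a gradient from loss evaluations alone via antithetic sampling, then feed it to any first-order optimizer. All names and the toy loss are illustrative assumptions, not taken from the paper.)

```python
import numpy as np

def es_gradient(loss_fn, theta, sigma=0.1, n_pairs=32, rng=np.random):
    """Antithetic evolution-strategies estimate of the gradient of
    loss_fn at theta, using only function evaluations (no backprop)."""
    grad = np.zeros_like(theta)
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        # central finite difference along a random direction
        delta = loss_fn(theta + sigma * eps) - loss_fn(theta - sigma * eps)
        grad += delta / (2 * sigma) * eps
    return grad / n_pairs

# Toy usage with plain SGD; the same estimate could be fed into Adam.
loss = lambda th: np.sum(th ** 2)
theta = np.ones(5)
for _ in range(200):
    theta -= 0.05 * es_gradient(loss, theta)
print(loss(theta))  # close to zero
```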
@YannicKilcher 3 years ago
I think the field gave up on understanding a long time ago :D
@111dimka111 4 years ago
Great review! One comment, though, about the "scaling up" part: in many cases, when the training loss goes to zero, we inevitably start to overfit. That is, with losses like cross-entropy it is impossible to drive the training loss to zero without harming the test loss.
@DieHobbylosenPros 3 years ago
Actually, I think they only use ES for fine-tuning and use Adam before that. They say, "We start outer-training by sampling unroll steps uniformly from 240-360 steps. When performance saturates, we continue training with PES". But I agree that everything in the paper is really vague; the lack of mathematical notation doesn't help.
@littlebigphil 4 years ago
You were brutal this vid. Love it.
@quebono100 4 years ago
Wow the field of machine learning is moving fast
@herp_derpingson 4 years ago
It was much faster back in 2016. Things have slowed down a bit.
@thj7084 4 years ago
@Dmitry Akimov underrated comment
@patrickphillips7009 4 years ago
"What is this thing? Phi? Eh, let's do psi, I know psi."
@drdca8263 4 years ago
I think the 4th paragraph of the "broader impacts" section is actually specific to the thing they are working on, and not just a cookie-cutter "ML bias" paragraph. I'd like to hear your thoughts on that paragraph.
@YannicKilcher 4 years ago
It sounds to me like they're just mentioning AGI as a buzzword, but I could be wrong.
@drdca8263 4 years ago
@@YannicKilcher Maybe? But they are pointing out that people working on AI risk have specifically discussed this as a potential source of alignment drift. I know that people on the Alignment Forum have talked a substantial amount about "mesa-optimizers": the possibility of something that is trained to achieve some task learning to do it by running another optimization process itself, and the risk of this inner optimization process not being aligned. The topic of this paper may be slightly different in that it concerns a case where something is intentionally trained to be an optimizer, while the concern about mesa-optimizers may be more focused on cases where "being or including an optimizer" just happens to be the natural result of training for some other task. But I think "mesa-optimization" still includes cases like this one, and many of the same risks would apply to both, with the dangers arising for essentially the same reason. I think it seems relevant enough.
@astroganov 4 years ago
TgTbTb - this part of your video is amazing
@pensiveintrovert4318 3 years ago
I would say that instead of learning the mother of all optimizers, we should go in the opposite direction and learn better optimizers tailored to a specific domain/task.
@יובלנירהוד 4 years ago
Thank you so much! Great video as always, very interesting and clear.
@chainingten3819 4 years ago
Good video as always 😁
@norik1616 4 years ago
Based on the CIFAR RN behavior you may say it's crap, or you may say this is the best result achievable without overfitting. I believe this is what the learned optimizer was trained to do: lower the val_loss, right?
@EternalKernel 4 years ago
I think that depends on the expected domain of all the remaining possible data vs. the domain of the validation data and the training data?
@egidioln 4 years ago
Great video and analysis! I just slightly disagree that one would want to go as low as possible in training loss. After some point, you are just fitting noise.
@greggaustin4764 3 years ago
I think your claim is not mutually exclusive with the point Yannic was trying to make. The training loss should continue to decrease with a well-tuned optimizer, even at the cost of worse validation loss. A human should have the choice to add regularization or implement early stopping if they suspect overfitting during training. The model presented in the paper makes that choice for us, which I suppose would be advantageous if the learned optimizer were very good. I think Yannic was saying that he doesn't believe the learned optimizer is there yet.
@erickmarin6147 2 years ago
I now look at this video and wonder whether this could be implemented with a fine-tuned GPT, token by token.
@Hexanitrobenzene 2 years ago
43:05 This is some nerdy comedy: "You can make two claims: you can say it's, dunno, "implicitly regularized", or you can just say it's crap..." :)
@NicheAsQuiche 4 years ago
I'VE WANTED TO SEE THIS DONE FOR SO LONG, BUT MY IMPLEMENTATION WAS CRAP. I'M SO HYPED.
@herp_derpingson 4 years ago
Too bad, their implementation is crap too :P
@MrJaggy123 4 years ago
You start out by saying you have mixed feelings about the paper, but the further into the video we get, the more you unreservedly hate it 😆
@LouisChiaki 4 years ago
Are all AI papers required to have a broader impact section?
@YannicKilcher 3 years ago
At some conferences, yes.
@glennkroegel1342 4 years ago
Mentioning AGI in the broader impact statement is just a no for me, unless you are doing a discussion paper on the topic or something.
@xd-os7jl 3 years ago
What's your tool for reading PDFs?
@patrickphillips7009 4 years ago
I wonder how many kilowatts this research used....
@drdca8263 4 years ago
Pedantic, but: do you mean kilowatt hours?
@herp_derpingson 4 years ago
bout tree-fiddy
@NicheAsQuiche 4 years ago
I believe Google has been running on 100% renewable energy for a few years.
@DieHobbylosenPros 3 years ago
They state it in Appendix H.2: "These models take on the order of 200 megawatt hours of power to outer-train".
@stivstivsti 4 years ago
The formula at 16:47 is wrong; it's missing the gradient of w.
@DieHobbylosenPros 3 years ago
I don't think so; a and b are the outputs of the learned optimizer, which takes the gradient (and other features) as input.
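(In other words, the gradient enters as an input feature to the learned network rather than appearing explicitly in the update formula. A hypothetical sketch of an update of that shape, using the names a and b from this thread; the exponential-magnitude form and the small constant scales are illustrative assumptions, not taken verbatim from the paper.)

```python
import numpy as np

lam1, lam2 = 1e-3, 1e-3  # fixed small scales ("magic numbers"), illustrative

def learned_update(w, a, b):
    """Hypothetical per-parameter update: the learned MLP emits a
    (direction) and b (log-magnitude) per weight; grad(w) is consumed
    upstream as one of the MLP's input features, so it does not appear here."""
    return w - lam1 * a * np.exp(lam2 * b)
```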
@dingleberriesify 4 years ago
Evolutionary strategies; AKA semi-random flailing in hyperspace
@CosmiaNebula 4 years ago
But can you do better?
@ЗакировМарат-в5щ 3 years ago
Way too much filler; the narration gets tiring.
@elnazsn 4 years ago
Thanks for the awesome content! Do you take requests at all? Can you do Microsoft's LayoutLM model? Google is also working on Document AI and on integrating BERT-based architectures with OCR/image processing to understand document structures. Paper here: arxiv.org/abs/1912.13318
@pensiveintrovert4318 3 years ago
A non-deep question. Why are we all working so hard to make sure we have no jobs in the near future?