Grokking: Generalization beyond Overfitting on small algorithmic datasets (Paper Explained)

69,192 views

Yannic Kilcher

#grokking #openai #deeplearning
Grokking is a phenomenon in which a neural network, after a long period of apparent overfitting, suddenly learns the pattern underlying a dataset and jumps from chance-level to perfect generalization. This paper demonstrates grokking on small algorithmic datasets where a network has to fill in binary operation tables. Interestingly, the learned latent spaces show an emergence of the structure of the binary operations that the data were created with.
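For concreteness, here is a minimal sketch (Python, not the paper's actual code) of the kind of dataset involved: the full table of a binary operation such as addition modulo 97, with a fraction of the equations held out for validation. The paper then trains a small transformer on the training equations and tracks validation accuracy over very long runs.

```python
import random

# Minimal sketch (not the paper's code) of the kind of dataset studied:
# every equation of a binary operation table, here addition modulo 97,
# split into training and validation equations.
p = 97
equations = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

random.seed(0)
random.shuffle(equations)
train_fraction = 0.5                          # the paper varies this fraction
split = int(train_fraction * len(equations))
train_set, val_set = equations[:split], equations[split:]
```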
OUTLINE:
0:00 - Intro & Overview
1:40 - The Grokking Phenomenon
3:50 - Related: Double Descent
7:50 - Binary Operations Datasets
11:45 - What quantities influence grokking?
15:40 - Learned Emerging Structure
17:35 - The role of smoothness
21:30 - Simple explanations win
24:30 - Why does weight decay encourage simplicity?
26:40 - Appendix
28:55 - Conclusion & Comments
Paper: mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf
Abstract:
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of “grokking” a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.
Authors: Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin & Vedant Misra
Links:
TabNine Code Completion (Referral): bit.ly/tabnine-yannick
YouTube: / yannickilcher
Twitter: / ykilcher
Discord: / discord
BitChute: www.bitchute.com/channel/yann...
Minds: www.minds.com/ykilcher
Parler: parler.com/profile/YannicKilcher
LinkedIn: / ykilcher
BiliBili: space.bilibili.com/1824646584
If you want to support me, the best thing to do is to share out the content :)
If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: www.subscribestar.com/yannick...
Patreon: / yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

Comments: 222
@YannicKilcher
@YannicKilcher 2 жыл бұрын
OUTLINE: 0:00 - Intro & Overview 1:40 - The Grokking Phenomenon 3:50 - Related: Double Descent 7:50 - Binary Operations Datasets 11:45 - What quantities influence grokking? 15:40 - Learned Emerging Structure 17:35 - The role of smoothness 21:30 - Simple explanations win 24:30 - Why does weight decay encourage simplicity? 26:40 - Appendix 28:55 - Conclusion & Comments Paper: mathai-iclr.github.io/papers/papers/MATHAI_29_paper.pdf
@ruemeese
@ruemeese 2 жыл бұрын
The verb 'to grok' has been in niche use in computer science since the 1960s, meaning to fully understand something (as opposed to just knowing how to apply the rules, for example). It was coined in Robert A. Heinlein's classic sci-fi novel "Stranger in a Strange Land".
@michaelnurse9089
@michaelnurse9089 2 жыл бұрын
Thanks for the history. Today it is a somewhat well known word in my anecdotal experience.
@ruemeese
@ruemeese 2 жыл бұрын
@@michaelnurse9089 Always going to be new to somebody.
@cactusmilenario
@cactusmilenario 2 жыл бұрын
@@ruemeese thanks
@user-wd8wx5md5z
@user-wd8wx5md5z 2 жыл бұрын
Thanks, that's why it was not in the Webster dictionary ;)
@RobertWF42
@RobertWF42 2 жыл бұрын
In my opinion "grokking" shouldn't be used here. There's no grokking happening; instead, it's an ML algorithm being run for a long time until it fits the test data. But it's still a really cool idea. :-)
@martinsmolik2449
@martinsmolik2449 2 жыл бұрын
WAIT, WHAT? In my Masters thesis, I did something similar, and I got the same result. A quick overfit and subsequent snap into place. I never thought that this would be interesting, as I was not an ML engineer, but a mathematician looking for new ways to model structures. That was in 2019. Could have been famous! Ah well...
@andrewcutler4599
@andrewcutler4599 2 жыл бұрын
What do you think of their line "we speculate that such visualizations could one day be a useful way to gain intuitions about novel mathematical objects"?
@thomasdeniffel2122
@thomasdeniffel2122 2 жыл бұрын
Schmidhuber, is it you?
@visuality2541
@visuality2541 2 жыл бұрын
May I ask you the thesis title?
@arzigogolato1
@arzigogolato1 2 жыл бұрын
As with many discoveries, you need both the luck to find something totally new and the knowledge to recognize what's happening... I'm sure there are a lot of missed opportunities out there. Still, the advisor or some of the other professors could have noticed that this was indeed something "off and interesting"... it might actually be interesting to see how and when this happens :)
@autolykos9822
@autolykos9822 2 жыл бұрын
Yeah, a lot of science comes from having the leisure to investigate strange accidents, instead of being forced to just shrug and start over to get done with what you're actually trying to do. Overworking scientists results in poor science - and yet there are many academics proud of working 60-80h weeks.
@mgostIH
@mgostIH 2 жыл бұрын
Thanks for accepting my suggestion to review this! I think this paper can fall anywhere between "Wow, what strange behaviour" and "Is more compute all we need?" If the findings of this paper are reproduced in a lot of other practical scenarios and we find ways to speed up grokking, it might change the entire way we think about neural networks: we just need a model that's big enough to express some solution and then keep training it to find simpler ones, leading to an algorithmic Occam's Razor!
@TimScarfe
@TimScarfe 2 жыл бұрын
I was nagging Yannic too!! 😂
@aryasenadewanusa8920
@aryasenadewanusa8920 2 жыл бұрын
But again, though it's wonderful, in some cases the cost of training such large networks for long enough to reach that AHA moment can be too crazy to make sense...
@SteveStavropoulos
@SteveStavropoulos 2 жыл бұрын
This seems like random search where the network stumbles upon the solution and then stays there. The "stay there" part is well explained, but the interesting part is the search. And nothing is presented here to suggest search is not entirely random. It would be interesting to compare the neural network to a simple random search algorithm which just randomly tries every combination. Would the neural network find the solution significantly faster than the random search?
@RyanMartinRAM
@RyanMartinRAM 2 жыл бұрын
@@SteveStavropoulos or just start with different random noise every so often?
@brunosantidrian2301
@brunosantidrian2301 2 жыл бұрын
@@SteveStavropoulos Well, you are actually restricting the search to the space of networks that perform well in the training set. So I would say yes.
@Phenix66
@Phenix66 2 жыл бұрын
Loved the context you brought into this about double descent! And, as always, really great explanations in general!
@Bellenchia
@Bellenchia 2 жыл бұрын
So our low-dimensional intuitions don't work for extremely high-dimensional spaces. For example, look at the sphere-packing puzzle: if you pack four 2D circles into a square, the circle you can draw in the gap between them is smaller, and that seems like common sense. But in 10D space, the sphere you place in the middle of the hyperspheres actually reaches outside the surrounding box. I imagine an analogous thing is happening here: if you were at a mountaintop and could only feel around with your feet, you of course wouldn't get to mountaintops further away by just stepping in the direction of higher incline, but maybe in higher dimensions these mountaintops are actually closer together than we think. Or, in other words, some local minimum of the loss function which we associate with overfitting is actually not so far away from a good generalization, or global minimum.
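(A quick check of the standard version of this puzzle: 2^n unit spheres centred at the corners (±1, ..., ±1) of the box [-2, 2]^n, with an inner sphere at the origin touching all of them.)

```python
import math

# The inner sphere's radius is the distance from the origin to a corner
# centre, sqrt(n), minus the unit radius of the corner spheres.
for n in (2, 3, 9, 10):
    inner_radius = math.sqrt(n) - 1
    status = "pokes outside the box" if inner_radius > 2 else "stays inside the box"
    print(f"n={n}: inner radius {inner_radius:.3f} ({status})")
```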
@MrjbushM
@MrjbushM 2 жыл бұрын
Interesting thought…
@andrewjong2498
@andrewjong2498 2 жыл бұрын
Good hypothesis, I hope someone can prove it one day
@franklydoodle350
@franklydoodle350 2 жыл бұрын
Wow seems you're speaking in the 10th dimension. Your intelligence is way beyond mine.
@franklydoodle350
@franklydoodle350 2 жыл бұрын
Now I wonder if we could push the model to make more rash decisions as the loss function nears 1. In theory I don't see why not.
@drhilm
@drhilm 2 жыл бұрын
I saw a similar idea in the paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" by Elad Hoffer, Itay Hubara, and Daniel Soudry, from 2017. Since then, I always try to run the network longer than the overfitting point, and in many cases it works.
@ChocolateMilkCultLeader
@ChocolateMilkCultLeader 2 жыл бұрын
Thanks for sharing
@CristianGarcia
@CristianGarcia 2 жыл бұрын
Thanks Yannic! Started reading this paper but forgot about it, very handy to have it here.
@AdobadoFantastico
@AdobadoFantastico 2 жыл бұрын
This is a fascinating phenomenon, thanks for sharing and walking through the paper.
@FromBeaverton
@FromBeaverton 2 жыл бұрын
Thank you! Very cool! I have observed the grokking phenomenon and was really puzzled why I did not see anybody talking about it, other than "well, you can try training your neural network longer and see what happens".
@gocandra9847
@gocandra9847 2 жыл бұрын
Last time I was this early, Schmidhuber had invented transformers.
@NLPprompter
@NLPprompter 4 күн бұрын
what???
@bingochipspass08
@bingochipspass08 Жыл бұрын
Great walkthrough of the paper Yannic! You covered some details which weren't clear to me initially!
@piratepartyftw
@piratepartyftw 2 жыл бұрын
very cool phenomenon. thanks for highlighting this paper.
@AliMoeeny
@AliMoeeny 2 жыл бұрын
Just wanted to say thank you, I learned a lot.
@SeagleLiu
@SeagleLiu 2 жыл бұрын
Yannic, thanks for bringing this to us. I agree with your heuristics on the local minimum vs. global minimum. To me, the phenomenon in this paper is related to the "oracle property" of the Lasso in the traditional statistical learning literature: for a penalized algorithm (Lasso or the equivalent weight decay), the algorithm is almost sure to find the correct non-zero parameters (i.e., the global minimum), as well as their correct coefficients. It would be interesting for someone to study this further.
@guruprasadsomasundaram9273
@guruprasadsomasundaram9273 2 жыл бұрын
Thank you for the wonderful review of this interesting paper.
@Yaketycast
@Yaketycast 2 жыл бұрын
OMG, I'm not crazy! This has happened to me when I forgot to shut down a model for a month.
@freemind.d2714
@freemind.d2714 2 жыл бұрын
Same here
@TheRyulord
@TheRyulord 2 жыл бұрын
Here I was watching the video and wondering "How do people figure something like this out?" but I guess you've answered my question.
@solennsara6763
@solennsara6763 2 жыл бұрын
Not paying attention is an essential skill for achieving progress.
@Laszer271
@Laszer271 2 жыл бұрын
@@solennsara6763 That's essentially also how antibiotics were discovered.
@anshul5243
@anshul5243 2 жыл бұрын
@@solennsara6763 so are you saying no attention is all you need?
@nauy
@nauy 2 жыл бұрын
I actually expected this kind of behavior based on the principle of regularization. With regularization (weight decay is L2 regularization), there is always a tension between the prediction-error term and the regularization term in the loss. At some point in the long training session, the regularization term starts to dominate and the network, given enough time and noise, finds a different solution that has much lower regularization loss. This is also true when increasing the number of parameters, since doing that also increases the regularization loss. At some point, adding more parameters results in the regularization loss dominating the error loss, and the network latches onto another solution with lower regularization loss and overall loss. In other words, regularization provides the pressure that, given enough time and noise in the system, comes up with a solution that uses fewer effective parameters. I am assuming fewer parameters leads to better generalization.
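(For concreteness, a minimal sketch of the two-term loss this comment describes; the model and lambda_wd below are placeholders, not the paper's setup.)

```python
import torch

model = torch.nn.Linear(10, 5)   # placeholder model
lambda_wd = 1e-2                 # placeholder weight decay strength

def total_loss(x, y):
    error = torch.nn.functional.cross_entropy(model(x), y)   # prediction error term
    reg = sum((p ** 2).sum() for p in model.parameters())    # L2 regularization term
    return error + lambda_wd * reg

# In practice the same pressure usually comes from the optimizer's weight_decay
# argument, e.g. torch.optim.AdamW(model.parameters(), weight_decay=lambda_wd).
```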
@Mefaso09
@Mefaso09 2 жыл бұрын
While this is the general motivation for regularization, the really surprising thing here is that it does still find this better solution by further training. Personally I would have expected it to get stuck in a local overfit optimum and just never leave it. The fact that it does "suddenly" do so after 1e5 times as many training steps is quite surprising. Also, if the solution really is to have fewer parameters for better generalization, traditionally L1 regularization is more likely to achieve that than L2 regularization (weight decay). Would be interesting to try that.
@Mefaso09
@Mefaso09 2 жыл бұрын
Also, I suppose it would be interesting to look at if the number of weights with significantly nonzero magnitudes does actually decrease over time, to test your hypothesis. Now if only they had shared some code...
@vladislavstankov1796
@vladislavstankov1796 2 жыл бұрын
@@Mefaso09 Agreed, it is confusing why training for much longer helps. Remember the example of an overfitting polynomial that matches every data point but is very "unsmooth"; the unsmoothness comes from "large" coefficients. When you give the polynomial way more freedom but severely constrain the weights, it can end up looking like a parabola (even though in reality it is a polynomial of degree 20), yet fit the data much better than any polynomial of degree 2.
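(A rough illustration of that polynomial example, assuming scikit-learn; alpha plays the role of the weight constraint.)

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = x**2 + 0.05 * rng.standard_normal(x.size)   # noisy parabola

# Same degree-20 model: nearly unpenalized, it wiggles through the noise;
# with a strong L2 penalty on the coefficients, it looks like a parabola.
wiggly = make_pipeline(PolynomialFeatures(20), Ridge(alpha=1e-12)).fit(x[:, None], y)
smooth = make_pipeline(PolynomialFeatures(20), Ridge(alpha=1.0)).fit(x[:, None], y)
```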
@markmorgan5568
@markmorgan5568 2 жыл бұрын
I think this is helpful. My initial reaction was “how does a network learn when the loss and gradients are zero?” But with weight decay or dropout the weights do still change. Now back to watching the video. :)
@visuality2541
@visuality2541 2 жыл бұрын
L2 regularization (weight decay) is more about small-valued parameters than about a smaller number of parameters (sparsity); at least that is the classical view, as you probably already know. I personally believe the finding is related to the Lipschitzness of the overall net and also the angle component of the representation vector, but I am not sure what loss function they really used. It'd be really surprising if future work shows the superiority of L2 regularization over other regularizers for generalization, of course with a solid theory.
@Chocapic_13
@Chocapic_13 2 жыл бұрын
It would have been interesting if they had shown the evolution of the sum of the squared weights during training. Maybe the sum of the squared weights gets progressively smaller and this helps the model find the "optimal" solution to the problem starting from the memorized solution. It could also be that the L2 penalty progressively "destroys" the learned representation and adds noise until the model finds this optimal solution. Very interesting paper and video anyway!
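(A rough sketch of how one could log that quantity during training; the model and data below are placeholders, not the paper's transformer.)

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 97))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

def squared_weight_norm(m):
    return sum((p.detach() ** 2).sum().item() for p in m.parameters())

norm_log = []
for step in range(1000):
    x = torch.randn(32, 64)                      # placeholder batch
    y = torch.randint(0, 97, (32,))              # placeholder labels
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    norm_log.append(squared_weight_norm(model))  # plot this next to the accuracy curves
```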
@billykotsos4642
@billykotsos4642 2 жыл бұрын
Should keep this on our radar !
@petertemp4785
@petertemp4785 2 жыл бұрын
Really well explained!
@TimScarfe
@TimScarfe 2 жыл бұрын
Been waiting for this
@TheMrVallex
@TheMrVallex 2 жыл бұрын
Dude, thanks so much for the video! You just gave me an absolute 5-head idea for a paper :D Too bad I work full time as a data scientist, so I have little time for doing research. I don't want to spoil the idea, but I have a strong gut feeling that these guys stumbled onto an absolute goldmine of a discovery! Keep up the good work! Love your channel.
@letsburn00
@letsburn00 2 жыл бұрын
I suspect it has something to do with the fact that there is limited entropy change from the randomly initialized values to the "fully trained" state. The parameters have so much more capacity to learn in them, but the dataset can often be learned in more detail.
@Zhi-zv1in
@Zhi-zv1in 2 ай бұрын
Please share the idea, I want to know!
@jasdeepsinghgrover2470
@jasdeepsinghgrover2470 2 жыл бұрын
Thanks for the amazing video... I have seen this while working with piecewise linear functions. Somehow leaky ReLU works much better than ReLU or other activation functions that can deactivate... Wouldn't have known this paper exists if you hadn't posted.
@chocol8milkman750
@chocol8milkman750 Жыл бұрын
Really well explained.
@UncoveredTruths
@UncoveredTruths 2 жыл бұрын
S5? ah yes, all these years of abstract algebra: my time to shine
@tristanwegner
@tristanwegner 2 жыл бұрын
In a few years a neural network might watch your videos as training to output better neural networks. I mean, Google owns YouTube and DeepMind, and video is the next step after image and text generation, so maybe the training is already happening. But with such a huge dataset I hope they weight the videos not by views only, but also by the predicted education of the audience. I mean, they sometimes ask me if I liked a video because it was relaxing, or informative, etc. So thanks for being part of the pre-singularity :)
@billykotsos4642
@billykotsos4642 2 жыл бұрын
Interesting indeed !
@Champignon1000
@Champignon1000 2 жыл бұрын
This is super cool, thanks. I'm trying something where I use the intermediate layers of a discriminator as an additional loss for the generator/autoencoder, to avoid collapse, and adding regularizers really helps a lot. (Specifically, instead of an MSE loss between the real image and the fake image, I use a very deep perceptual loss from the discriminator, so basically only letting the autoencoder know how different the image was conceptually.)
@Idiomatick
@Idiomatick 2 жыл бұрын
Your suggestion seems sufficient for both phenomena. Decay can potentially cause worse loss with a 'wild' memorization, so a smoother solution will eventually be found after many, many, many attempts. "Worst way to generalize solutions" would have been a fun title.
@CristianGarcia
@CristianGarcia 2 жыл бұрын
The "Deep Double Descent" paper did look into the double descent phenomena as a function of computation time/steps.
@jabowery
@jabowery 2 жыл бұрын
An intuitive way of thinking about the double dip: imagine you're trying to write an algorithm that goes from gzip to bzip2. You must first decompress. Or think about trying to compress video. If you convert from pixels to voxels, it seems like a very costly operation, but it allows you to go from a voxel representation to a geometric abstraction in 3D where there is much less change over time. So don't be confused by the parameter count. You are still trying to come up with the smallest number of parameters in the final model.
@nikre
@nikre 2 жыл бұрын
Why would it be specifically t-SNE that reflects the "underlying mechanism" and not a different method? I would like to see how the t-SNE structure evolves over time, and other visualization methods too. This might as well be a cherry-picked result.
@Mefaso09
@Mefaso09 2 жыл бұрын
Definitely, the evolution over time would be very interesting
@vslaykovsky
@vslaykovsky 2 жыл бұрын
This looks like an Aha moment discovered by the network at some point.
@djfl58mdlwqlf
@djfl58mdlwqlf 2 жыл бұрын
The Eureka Moment
@herp_derpingson
@herp_derpingson 2 жыл бұрын
This is mind-blowing.

I wonder if a "grokked" image model is more robust against adversarial attacks. Somebody needs to do a paper on that.

We can also draw parallels to the dimpled manifold hypothesis. Weight decay implies that the dimples would have to have a shallow angle while still being correct. A dimple with a shallow angle will have a lot more volume to "catch particles" within it, leading to better generalization.

I would also be interested in seeing how fast we should decay the weights. My intuition says it has to be significantly less than the learning rate, otherwise the model won't keep up with the training, while not being so small that the weight decay is nullified by the gradient updates.

I remember in one of Andrej Karpathy's blog posts he said to make sure your model is able to overfit a very small dataset before moving on to the next steps. I wonder if it would now become standard practice to "overfit first".

I wonder if something similar happens in humans. I have noticed that I develop a better understanding of some topics after I let them sit in my brain for a few weeks and forget a lot of the details.
@tsunamidestructor
@tsunamidestructor 2 жыл бұрын
As soon as I read the title, my mind immediately went to the "reconciling modern machine learning and the bias variance trade-off" paper.
@gianpierocea
@gianpierocea 2 жыл бұрын
Very cool. For some reason what I would like to see is to combine this sort of thing with a multitask approach, i.e., make the model solve the binary operation task and some other synthetic task that is different enough. I really doubt this would ever work with current computational resources, but I do feel that if the model could grok that... well, that would be really impressive, closer to AGI-like.
@IIIIIawesIIIII
@IIIIIawesIIIII 2 жыл бұрын
This may be due to the fact that there are countless combinations of factors that lead to overfitting, but only one "middle point" between those combinations. A quicker (non-exponential) way to arrive there would be to overfit a few times with different random seeds and then add an additional layer that checks for patterns that these overfitted representations have in common.
@tristanwegner
@tristanwegner 2 жыл бұрын
21:33 - Your explanation of why weight decay helps with generalization after a lot of training makes intuitive sense. Weight decay is like being forgetful. You try to remember varying details of each sample, but the details are different for each sample; if you forget a bit, you don't recognize the sample anymore, and when learning about it again, you pick a new detail to remember. But if you find (stumble upon?) the underlying pattern, each sample reinforces your rule, counteracting the constant forgetting rate, so it will still be there when the training sample comes around again, and thus it stays. Neuroscience has assumed that forgetting is essential for learning, and some savants with amazing memory often have a hard time in real-world situations.
@jeanf6295
@jeanf6295 2 жыл бұрын
When you are limited to compositions of elementary operations that don't quite have the right structure, it kind of makes sense that you need a lot of parameters to emulate the proper rule, possibly way more than you have data points. It's a bit like using NOT and OR gates to represent an operation: sure, it will work, but you will need quite a few layers of computation and you will need to store quite a few intermediate results at each step.
@rayyuan6349
@rayyuan6349 2 жыл бұрын
I don't know if the GAN community (perhaps the overparameterization community too) and the general ML community have widely accepted "grokking", but this is absolutely mind-blowing for the field, as it implies that we can just brainlessly add more parameters to get better results.
@TheEbbemonster
@TheEbbemonster 2 жыл бұрын
That is grokked up!
@affrokilla
@affrokilla 2 жыл бұрын
This is actually really interesting. I wonder if the same concepts apply to harder problems (object detection, classification, etc.). But the compute necessary to run 100-1000x more training steps than usual will be insane. Really hope someone tries it though.
@andrewjong2498
@andrewjong2498 2 жыл бұрын
Anecdotally, my colleague also independently observed grokking on their dataset with height maps a year ago, using a UNet. So it does seem possible for harder problems.
@thunder89
@thunder89 2 жыл бұрын
I cannot find in the paper what the dataset size for Figure 6 was -- what percentage of the training data are outliers? It seems interesting that there is a jump between 1k and 2k outliers...
@yasserothman4023
@yasserothman4023 2 жыл бұрын
Thank you for the explanation. Any references or recommended videos on weight decay in the context of deep learning optimization? Thanks.
@hacker2ish
@hacker2ish 5 ай бұрын
I think the reason weight decay has to do with the simplicity of the trained model is that it may, in some cases, force many parameters to be almost zeroed out (because weight decay means adding the squared 2-norm of the parameters to the objective function, scaled by a constant).
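(For reference, the usual way to write that objective, with λ as the weight decay coefficient:)

```latex
\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big) + \lambda\,\lVert\theta\rVert_2^2
```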
@timkellermann7669
@timkellermann7669 2 жыл бұрын
Love it :)
@justfoundit
@justfoundit 2 жыл бұрын
I'm wondering if it works with XGBoost too: increasing the size of the model to 100k and letting the model figure out what went wrong when it starts to overfit at around 2k steps?
@oreganorx7
@oreganorx7 2 жыл бұрын
I would have named the paper/research "Weight decay drives generalization" : P
@oreganorx7
@oreganorx7 2 жыл бұрын
@@aletheapower1977 Hi Alethea, I was mostly just joking around. In my head I took a few leaps and came up with an understanding of why, with weight decay, this behaviour happens the fastest. Great work on the paper and research!
@herewegoagain2
@herewegoagain2 2 жыл бұрын
Like you mentioned, I think even a minor amount of noise is enough to influence the 'fit' in higher dimensions. I think they're considering an ideal scenario. If so, it won't 'generalise' to real-life use cases. Very interesting paper nonetheless.
@felix-ht
@felix-ht 2 жыл бұрын
Feels a lot like the process when a human is learning something new. In the beginning one just tries to soak up the knowledge, and then suddenly you just get how it works and convert the knowledge into understanding. To me this moment always felt very relieving, as much of the knowledge can simply be replaced with the understanding. Arguably the driving factor in humans for this might actually be something similar to weight decay: many neurons storing knowledge is certainly more costly than a few just remembering the rule. Extrapolated even further, this might very well be what drives sentience itself; the simplest solution to being efficient in a reality is to understand its rules and one's role within it. So sentience would be the human brain grokking a new, more elegant solution to its training data. One example of this would be human babies suddenly starting to recognize themselves in a mirror.
@abubakrbabiker9553
@abubakrbabiker9553 2 жыл бұрын
My guess would be that this is more like a guided search for optimal weights, in that the neural network explores many local minima (with SGD) w.r.t. the training set until it finds some weights that also fit the test dataset. I suspect that if they extended the training period, the test accuracy would drop again and might rise later as well.
@THEMithrandir09
@THEMithrandir09 2 жыл бұрын
25:50 - looking forward to the Occam's Razor regularizer.
@drewduncan5774
@drewduncan5774 2 жыл бұрын
en.wikipedia.org/wiki/Minimum_description_length
@THEMithrandir09
@THEMithrandir09 2 жыл бұрын
@@drewduncan5774 great read, thx!
@ericadar
@ericadar 2 жыл бұрын
I'm curious whether floating-point errors have anything to do with this phenomenon. Is it possible that FP errors accumulate and cause de-facto regularization after orders of magnitude more forward and backward passes? Is the order-of-magnitude delay between the training and validation accuracy jumps shorter for FP16 compared to FP32?
@BitJam
@BitJam 2 жыл бұрын
This is such a surprising result! How can training be so effective at generalizing when the loss function is already so close to zero? Is the loss function really pointing the way to the generalization when we're already at a very flat bottom? Is it going down when the NN generalizes? How many bits of precision do we need for this to work? I wonder what happens if noise is added to the loss function during generalization. If generalization always slows down when we add noise then we know the loss function is driving it. If it speeds up or doesn't slow down then that points to self-organization.
@beaconofwierd1883
@beaconofwierd1883 2 жыл бұрын
I don’t see what needs explaining here. The loss function is something like L= MSE(w,x,y) + d*w^2 The optimization tries to get the error as small as possible and also get the weights as small as possible. A simple rule requires a lot less representation than memorizing every datapoint, thus more weights can be 0. The rule based solution is exactly what we are opimizing for, it just takes a long time to find it because the dominant term in the loss is the MSE, hence why it first moves toward minimizing the error by memorizing the training data then slowly moves towards minimizing the weights. The only way to minimize both is to learn the rule. How is this different from simple regularization? As for why the generalization happens ”suddenly” it is just a result of how gradient descent (with the Adam optimizer) works. Once it has found a gradient direction toward the global minimim it will pick up speed and momentum moving towards that global minimum, but finding a path towards that global minimum could take a long time, possibly forever if it got stuck in a local minima without any way out. Training 1000x longer to achieve generalization feels like a bad/inefficient approach to me.
@beaconofwierd1883
@beaconofwierd1883 2 жыл бұрын
@@aletheapower1977 Not really though. Look at the graphs at 14:48: without regularization you need about 80% of the data to be training data for generalization to occur, at which point you're basically training on all your data and the model is more likely to find the rule before it manages to memorize all of the training data. Not really any surprise there either.
@beaconofwierd1883
@beaconofwierd1883 2 жыл бұрын
@@aletheapower1977 Is there a relationship between the complexity of the underlying rule and the data fraction needed to grok? A good metric to use here could be the entropy of the training dataset. One would expect grokking to happen faster and at smaller training data fractions when the entropy of the dataset is low, and vice versa. Not sure how much updating you will be doing, but this could be something worth investigating :) If there's no correlation between entropy and when grokking occurs, that would be very interesting.
@beaconofwierd1883
@beaconofwierd1883 2 жыл бұрын
@@aletheapower1977 I hope you find the time :) If not you can always add it as possible future work :)
@ssssssstssssssss
@ssssssstssssssss 2 жыл бұрын
Grokking is a pretty awesome name. Maybe I'll find a way to work it into my daily lingo.
@victorrielly4588
@victorrielly4588 2 жыл бұрын
Yes! A transformlet paper.
@derekxiaoEvanescentBliss
@derekxiaoEvanescentBliss 2 жыл бұрын
Curious though if this has to do with the relationship between the underlying rule of the task, the implicit regularization of gradient descent, and explicit regularization. Is this a property of neural network learning, or is it a property of the underlying manifolds of datasets drawn from the real world and math? Edit: ah, looks like that's where they went in the rest of the paper.
@petretrusca2
@petretrusca2 2 жыл бұрын
It also happened to me on a problem, but the phenomenon occurred much faster. I observed that it seemed like the model had converged, but then, after some time spent apparently learning nothing, it snapped into much better performance. I also observed that if the model is larger it snaps faster, and if the model is smaller it snaps slower or not at all.
@petretrusca2
@petretrusca2 2 жыл бұрын
And it is not a double descent phenomenon, because the validation loss does not increase, it stays constant. Also, my dataset is a real one, not an artificial one.
@TheRohr
@TheRohr 2 жыл бұрын
Does the number of necessary steps correlate with the number necessary for differential evolution algorithms?
@imranq9241
@imranq9241 2 жыл бұрын
How can we prove double descent mathematically? I wonder if there could be a triple descent as well.
@XecutionStyle
@XecutionStyle 2 жыл бұрын
What does this mean for reinforcement learning and robotics? Does an over-parameterized network with tons of synthetic data generalize better than a simple one with a fraction of the data, but real data?
@f3arbhy
@f3arbhy 2 жыл бұрын
No more earlyStopping. Got it!!
@kadirgunel5926
@kadirgunel5926 2 жыл бұрын
This paper reminded me of the halting problem, doesn't it? It can stop, but it may take far too much time; it is non-deterministic. I felt like I was living in the 1950s.
@PhilipTeare
@PhilipTeare 2 жыл бұрын
As well as double descent, I think this relates strongly to the work by Naftali Tishby and his group on the 'forgetting phase' beyond overfitting, where unnecessary information carried forward through the layers gets forgotten until all that is left is what is needed to predict the label: kzbin.info/www/bejne/iprWgJWafbxrjdE . Again, he was running for ~10K epochs on very simple problems and networks.
@andreasschneider1966
@andreasschneider1966 2 жыл бұрын
Nice paper, also well done on your side! The generalization is relatively slow though I would say, since you are on a log scale here
@erwile
@erwile 2 жыл бұрын
It could be helpful with tiny datasets !
@altus1226
@altus1226 2 жыл бұрын
This jibes with my understanding of machine learning. We task the computer with looking through a "latent space" of possible code combinations in a common yet esoteric language (maths), and ask it to craft a program that takes our given inputs and returns our given outputs. In this example, the number of layers and parameters we give it directly represents the number of separate functions and variables, respectively, that it is allowed to create a program with. As we increase the size of the notepad file we are giving our artificial programmer, its ability to look at different possibilities and write complex functions increases, with non-linear increases in efficacy. If it is only a little too large, it may find it can create a few approximations at a time, but can't get enough together to compare and contrast and quickly see a workable pattern, so it has to slog through things a few at a time while forgetting all the progress it had otherwise made. If the notepad is too small, it will never make a perfect program regardless of time spent. Comparing the training of a perfectly sized notepad to an oversized one, this seems to represent a trade-off of memory size for training speed, as the larger notepad allows the learning algorithm to investigate more of the latent space at a time.
@G12GilbertProduction
@G12GilbertProduction 2 жыл бұрын
Everything is right. But where does a compiler play its own role?
@altus1226
@altus1226 2 жыл бұрын
@@G12GilbertProduction A compiler generally only removes a layer of abstraction, which helps reduce complexity greatly for humans. However, in our case I feel like our programmer is writing directly in the computed language, so there is no need for a compilation step. That said, I see no reason we could not come up with a reverse compiler to translate such a NN into human-readable code.
@pascalzoleko694
@pascalzoleko694 2 жыл бұрын
No matter how good a video is, I guess someone has to put a thumbs down. Very interesting stuff. I wonder what it costs to run training scripts for soooo long :D
@pisoiorfan
@pisoiorfan 2 жыл бұрын
I wonder if a model, after grokking, suddenly becomes more compressible through distillation than any of its previous incarnations. That would be expected if it discovers a simple rule that fits all inputs, and would be the equivalent of the machine having an "Aha!" moment.
@raunaquepatra3966
@raunaquepatra3966 2 жыл бұрын
Is there learning rate decay? Then it would be even crazier, because jumping from a local minimum to the global one would be counterintuitive.
@agentds1624
@agentds1624 2 жыл бұрын
It would also have been very interesting to see if the validation loss snaps down. Maybe it's just a coincidence that there is no validation loss curve ;). I currently have a kind of similar case where my loss curves scream overfitting, yet the accuracy sometimes shoots up...
@Coolguydudeness1234
@Coolguydudeness1234 2 жыл бұрын
The figures and the whole subject of the paper refer to the validation loss.
@Lawrencelot89
@Lawrencelot89 2 жыл бұрын
How is this related to weight regularization? If the weight decay is what causes this, making the weights already near zero should cause this even earlier right?
@XOPOIIIO
@XOPOIIIO 2 жыл бұрын
My networks are showing the same phenomenon when they get overfitted on the validation set too.
@gunter7518
@gunter7518 2 жыл бұрын
How can I contact you? I want to see these networks.
@hannesstark5024
@hannesstark5024 2 жыл бұрын
I guess the most awesome papers are at workshops!? It would be really nice to have a plot of the magnitude of the weights over the training time to see if it decreases a lot when grokking occurs.
@ssssssstssssssss
@ssssssstssssssss 2 жыл бұрын
Well. Workshops are a lot more fun than big conferences. And the barrier to entry is typically lower so conservative (aka exploitation) bias has less of an impact.
@hannesstark5024
@hannesstark5024 2 жыл бұрын
@@aletheapower1977 ❤️
@martinkaras775
@martinkaras775 2 жыл бұрын
In natural datasets, where the generating function is unknown (or maybe doesn't exist at all), isn't there a bigger chance of overfitting to the validation set than of finding the actual generating function?
@ralvarezb78
@ralvarezb78 2 жыл бұрын
This means that at some point there is a very sharp optimum in the surface of the loss function.
@LouisChiaki
@LouisChiaki 2 жыл бұрын
Could be the beginning of GAI.
@parker1981xxx
@parker1981xxx 2 жыл бұрын
Doesn't the double descent phenomenon occur with epochs (not params)?
@employeeofthemonth1
@employeeofthemonth1 2 жыл бұрын
So is there a penalty on larger weights in the training algorithm? If yes, I wouldn't find this surprising intuitively.
@st33lbird
@st33lbird 2 жыл бұрын
Code or it didn't happen
@tinyentropy
@tinyentropy 2 жыл бұрын
What's the name of the other paper mentioned?
@jjpp1993
@jjpp1993 2 жыл бұрын
It's like when you find out that it's easier to lick the sugar cookie than to scrape the figure out with a needle ;)
@ayushkanodia240
@ayushkanodia240 2 жыл бұрын
In appendix A.1: "For each training run, we chose a fraction of all available equations at random and declared them to be the training set, with the rest of the equations being the validation set." Can someone tell me how they choose their training and test sets? Figure 5 seems to indicate that one model runs on only one equation (same for the RHS in Figure 1), but this line from the paper suggests otherwise.
@ayushkanodia240
@ayushkanodia240 2 жыл бұрын
@@aletheapower1977 Thank you. So the statement in the appendix is about equations within an operation. That clarifies it!
@shm6273
@shm6273 2 жыл бұрын
Wouldn't the snapping be a symptom of the loss landscape having a very sharp global minimum and a ton of flat points? Let the network learn forever (and also push it around) and it should eventually find the global minimum. I'm assuming that in a real dataset, the 10^K iterations needed for a snap would grow larger than the number of atoms in the universe. EDIT: I commented too soon, seems you had a similar idea in the video.
@shm6273
@shm6273 2 жыл бұрын
@@aletheapower1977 Cool. So you believe the minimum is actually not sharp but pretty flat, and that makes the generalization better? I remember reading a paper about this, I'll link to it a bit later.
@ZLvids
@ZLvids 2 жыл бұрын
Actually, I have encountered several times an "opposite grokking" phenomenon where the loss suddenly *increases*. Does this phenomenon have a name?
@MCRuCr
@MCRuCr 2 жыл бұрын
This sounds just like your average "I forgot to stop the experiment over the weekend" random scientific discovery.
@larryjuang5643
@larryjuang5643 2 жыл бұрын
Wonder if grokking could happen in a reinforcement learning setup...
@CharlesVanNoland
@CharlesVanNoland 2 жыл бұрын
Reminds me of a maturing brain, when a baby starts being aware of objects and tracking them, understanding permanence and tracking them, etc.
@sq9340
@sq9340 2 жыл бұрын
Maybe... just "maybe", it's not really overfitting, but learning quite slowly. If you look at the spikes of training accuracy at around 100%, those could be weak learning signals that fix the model. And the synthetic data "could" be different, which means the validation accuracy will only jump up once you have a nearly perfect model.
@74357175
@74357175 2 жыл бұрын
What tasks are likely to be amenable to this? Only symbolic ones? How does the grokking phenomenon appear in non-synthetic tasks?
@drdca8263
@drdca8263 2 жыл бұрын
This comment leads me to ask: "What about a synthetic task that isn't symbolic?" E.g., addition mod 23.5, with floating-point numbers. Or maybe they tried that? Not sure.
@JihyunYu
@JihyunYu Жыл бұрын
What matters is an unbreakable spirit.
@moajjem04
@moajjem04 2 жыл бұрын
Is this kind of like how we humans sometimes get that gotcha moment where we suddenly understand stuff? Like the teacher explaining a concept and we understand it after the nth time it has been explained?
@azai.mp4
@azai.mp4 2 жыл бұрын
How is this different from epoch-wise double descent?
@Mordenor
@Mordenor 2 жыл бұрын
I think the group is C97, the only abelian group of order 97. The identity element is a.
@birkensocks
@birkensocks 2 жыл бұрын
I don't think Z_97 is a group under most of the binary operations they used. But (x, y) -> x + y mod 97 is one of the operations, so it could be the group. It does seem like a is the identity element, and the table is symmetric about the diagonal.
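(A quick check of that guess, assuming the table is addition mod 97 and the symbol 'a' stands for the residue 0.)

```python
n = 97
table = [[(x + y) % n for y in range(n)] for x in range(n)]

# The row and column of the identity element 0 just reproduce the indices...
assert all(table[0][y] == y for y in range(n))
assert all(table[x][0] == x for x in range(n))
# ...and the table is symmetric about the diagonal because addition commutes.
assert all(table[x][y] == table[y][x] for x in range(n) for y in range(n))
```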
@vochinin
@vochinin Ай бұрын
OK, isn't it just that the weight decay loss is not included in the validation loss used as a control? Controlling for accuracy on its own might be a bad idea. Finding solutions beyond the overfitting point might seem strange, but if we include the weight decay loss (which is not done in DL libraries, since it's part of the optimizer), we might see a different story. Isn't it?
@AxelAhmer
@AxelAhmer 2 жыл бұрын
madness
@dk-zd5rg
@dk-zd5rg 2 жыл бұрын
whoopsie doopsie generalised, yes
@MarkNowotarski
@MarkNowotarski 2 жыл бұрын
22:58 "that's a song, no?" That's a song, Yes! kzbin.info/www/bejne/jpndoaugqtyNr5I
@henrilemoine3953
@henrilemoine3953 2 жыл бұрын
This is so cool?!