The verb 'to grok' has been in niche use in computer science since the 1960s, meaning to fully understand something (as opposed to, for example, just knowing how to apply the rules). It was coined in Robert A. Heinlein's classic sci-fi novel "Stranger in a Strange Land".
@michaelnurse90893 жыл бұрын
Thanks for the history. Today it is a somewhat well known word in my anecdotal experience.
@ruemeese3 жыл бұрын
@@michaelnurse9089 Always going to be new to somebody.
@cactusmilenario3 жыл бұрын
@@ruemeese thanks
@和平和平-c4i3 жыл бұрын
Thanks, that's why it was not in the Webster dictionary ;)
@RobertWF423 жыл бұрын
In my opinion "grokking" shouldn't be used here. There's no grokking happening, instead it's running an ML algorithm for a long time until it fits the test data. But it's still a really cool idea. :-)
@mgostIH3 жыл бұрын
Thanks for accepting my suggestion in reviewing this! I think this paper can fall anywhere between "Wow, what a strange behaviour" and "Is more compute all we need?" If the findings of this paper are reproduced in a lot of other practical scenarios and we find ways to speed up grokking, it might change the entire way we think about neural networks: we would just need a model that's big enough to express some solution and then keep training it to find simpler ones, leading to an algorithmic Occam's Razor!
@TimScarfe3 жыл бұрын
I was nagging Yannic too!! 😂
@aryasenadewanusa89203 жыл бұрын
but again, though it's wonderful, in some cases the cost to train such large networks for such a long time to achieve that AHA moment by the network can be too crazy to make sense..
@SteveStavropoulos3 жыл бұрын
This seems like random search where the network stumbles upon the solution and then stays there. The "stay there" part is well explained, but the interesting part is the search. And nothing is presented here to suggest search is not entirely random. It would be interesting to compare the neural network to a simple random search algorithm which just randomly tries every combination. Would the neural network find the solution significantly faster than the random search?
@RyanMartinRAM3 жыл бұрын
@@SteveStavropoulos or just start with different random noise every so often?
@brunosantidrian23013 жыл бұрын
@@SteveStavropoulos Well, you are actually restricting the search to the space of networks that perform well in the training set. So I would say yes.
@martinsmolik24493 жыл бұрын
WAIT, WHAT? In my Master's thesis, I did something similar, and I got the same result. A quick overfit and a subsequent snap into place. I never thought that this would be interesting, as I was not an ML engineer, but a mathematician looking for new ways to model structures. That was in 2019. Could have been famous! Ah well...
@andrewcutler45993 жыл бұрын
What do you think of their line "we speculate that such visualizations could one day be a useful way to gain intuitions about novel mathematical objects"?
@thomasdeniffel21223 жыл бұрын
Schmidhuber, is it you?
@visuality25413 жыл бұрын
may i ask u the thesis title?
@arzigogolato13 жыл бұрын
As with many discoveries you need both the luck to find something totally new and the knowledge to recognize what's happening... I'm sure there are a lot of missed opportunities out there... still, the advisor or some of the other professors could have noticed that this was indeed something "off and interesting"... it might actually be interesting to see how and when this happens :)
@autolykos98223 жыл бұрын
Yeah, a lot of science comes from having the leisure to investigate strange accidents, instead of being forced to just shrug and start over to get done with what you're actually trying to do. Overworking scientists results in poor science - and yet there are many academics proud of working 60-80h weeks.
@Phenix663 жыл бұрын
Loved the context you brought into this about double descent! And, as always, really great explanations in general!
@Bellenchia3 жыл бұрын
So our low dimensional intuitions don't work for extremely high dimensional space. For example, look at the sphere bounding problem. If you join four 2D spheres, the sphere that you can draw inside the bounds of those four spheres is smaller--and that seems like common sense. But in 10D space, the sphere you place in the middle of the hyperspheres actually reaches outside the bounds of the enclosing hyperspheres. I imagine an analogous thing is happening here, if you were at a mountaintop and you could only feel around with your feet, you of course wouldn't get to mountain tops further away by just stepping in the direction of higher incline, but maybe in higher dimensions these mountain tops are actually closer together than we think. Or in other words, some local minima of the loss function which we associate with overfitting is actually not so far away from a good generalization, or global minimum.
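(A minimal check of the numbers behind that claim, assuming the standard setup of 2^d unit spheres at the corners (±1, ..., ±1) of a [-2, 2]^d box; strictly speaking it's the enclosing box, not the corner spheres, that the central sphere pokes out of from d = 10 on.)
```python
import math

# 2^d unit spheres centered at (+/-1, ..., +/-1) inside the box [-2, 2]^d.
# The central sphere tangent to all of them has radius sqrt(d) - 1.
for d in (2, 3, 4, 9, 10, 100):
    r = math.sqrt(d) - 1
    status = "pokes outside the box" if r > 2 else "fits inside the box"
    print(f"d={d:3d}  central radius={r:6.3f}  ({status})")
```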
@MrjbushM3 жыл бұрын
Interesting thought…
@andrewjong24983 жыл бұрын
Good hypothesis, I hope someone can prove it one day
@franklydoodle3502 жыл бұрын
Wow seems you're speaking in the 10th dimension. Your intelligence is way beyond mine.
@franklydoodle3502 жыл бұрын
Now I wonder if we could push the model to make more rash decisions as the loss function nears 1. In theory I don't see why not.
@FromBeaverton3 жыл бұрын
Thank you! Very cool! I have observed the grokking phenomenon and was really puzzled why I did not see anybody talking about it, other than "well, you can try training your neural network longer and see what happens"
@TheMrVallex3 жыл бұрын
Dude, thanks so much for the video! You just gave me an absolute 5-head idea for a paper :D Too bad I work full time as a data scientist, so I have little time for doing research. I don't want to spoil the idea, but I've got a strong gut feeling that these guys stumbled onto an absolute goldmine discovery! Keep up the good work! Love your channel
@letsburn003 жыл бұрын
I suspect that it has something to do with the fact that there is limited entropy change from the randomised initial values to the "fully trained" state. The parameters have so much more capacity to learn than is needed, but the dataset can often be learned in more detail.
@Zhi-zv1in9 ай бұрын
pleassse share the idea i wanna know
@drhilm3 жыл бұрын
I saw a similar idea in the paper "Train longer, generalize better: closing the generalization gap in large batch training of neural networks" by Elad Hoffer, Itay Hubara, and Daniel Soudry, from 2017. Since then, I always try to run the network longer than the overfitting point, and in many cases it works.
@ChocolateMilkCultLeader3 жыл бұрын
Thanks for sharing
@bingochipspass082 жыл бұрын
Great walkthrough of the paper Yannic! You covered some details which weren't clear to me initially!
@CristianGarcia3 жыл бұрын
Thanks Yannic! Started reading this paper but forgot about it, very handy to have it here.
@PunmasterSTP26 күн бұрын
The concept of grokking is just so fascinating to me! I remember getting intrigued watching the documentary on AlphaGo and reading up about AlphaZero and AlphaStar after that, and imagining what might happen if training was left on indefinitely. Would there be some random discovery made after days, weeks or years that radically changed the model's intuition, or even just gave it a slight edge that made it much easier to beat earlier versions of itself? I also love how neural nets seem like they have a deterministic substrate, but yet are so full of mystery! Also, and I know this is kinda unrelated, but anyone else stoked for Grok 3 to come out?
@Chocapic_133 жыл бұрын
It would have been interesting if they had shown the evolution of the sum of the squared weights during training. Maybe the sum of squared weights gets progressively smaller and this helps the model find the "optimal" solution to the problem starting from the memorized solution. It could also be that the L2 penalty progressively "destroys" the learned representation and adds noise until it finds this optimal solution. Very interesting paper and video anyway!
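(For what it's worth, logging that quantity is cheap; a minimal PyTorch sketch, where the toy model, data and hyperparameters are stand-ins rather than the paper's transformer setup:)
```python
import torch

def squared_weight_norm(model: torch.nn.Module) -> float:
    # The quantity weight decay pushes down: sum of w^2 over all parameters.
    return sum((p.detach() ** 2).sum().item() for p in model.parameters())

model = torch.nn.Sequential(torch.nn.Linear(2, 256), torch.nn.ReLU(), torch.nn.Linear(256, 97))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # decoupled weight decay

for step in range(1000):  # the paper's runs would be more like 1e5-1e6 steps
    x = torch.randn(32, 2)                     # placeholder inputs
    y = torch.randint(0, 97, (32,))            # placeholder labels
    loss = torch.nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 100 == 0:
        print(step, round(loss.item(), 3), round(squared_weight_norm(model), 1))
```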
@nauy3 жыл бұрын
I actually expected this kind of behavior based on the principle of regularization. With regularization (weight decay is L2 regularization), there is always a tension between the prediction error term and regularization term in the loss. At some point in the long training session, regularization term starts to dominate and the network, given enough time and noise, finds a different solution that has much lower regularization loss. This is also true with increasing the number of parameters, since doing that also increases the regularization loss. At some point, adding more parameters results in the regularization loss dominating over the error loss and the network latches on another solution that would result in lower regularization loss and overall loss. In other words, regularization provides the pressure that, when given enough time and noise in the system, comes up with a solution that uses fewer effective parameters. I am assuming fewer parameters leads to better generalization.
@Mefaso093 жыл бұрын
While this is the general motivation for regularization, the really surprising thing here is that it does still find this better solution by further training. Personally I would have expected it to get stuck in a local overfit optimum and just never leave it. The fact that it does "suddenly" do so after 1e5 times as many training steps is quite surprising. Also, if the solution really is to have fewer parameters for better generalization, traditionally L1 regularization is more likely to achieve that than L2 regularization (weight decay). Would be interesting to try that.
@Mefaso093 жыл бұрын
Also, I suppose it would be interesting to look at if the number of weights with significantly nonzero magnitudes does actually decrease over time, to test your hypothesis. Now if only they had shared some code...
@vladislavstankov17963 жыл бұрын
@@Mefaso09 Agreed, it is confusing why training for a much longer time helps. If you remember the example of overfitting polynomials that match every data point but are very "unsmooth", the unsmoothness comes from "large" coefficients. When you let the polynomial have way more freedom but severely constrain the weights, then it can look like a parabola (even though in reality it will be a polynomial of degree 20), yet fit the data much better than any polynomial of degree 2.
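(A minimal sketch of that polynomial picture, using sklearn's Ridge as a stand-in for weight decay; the degree, noise level and alpha are arbitrary choices for illustration:)
```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 15).reshape(-1, 1)
y = x.ravel() ** 2 + 0.05 * rng.standard_normal(15)  # noisy parabola

degree = 20
unregularized = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x, y)
regularized = make_pipeline(PolynomialFeatures(degree), Ridge(alpha=1.0)).fit(x, y)

x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
# The unregularized degree-20 fit oscillates between the data points;
# the ridge-penalized one hugs the underlying parabola despite its 21 coefficients.
print("unregularized prediction range:", np.ptp(unregularized.predict(x_test)))
print("regularized prediction range:  ", np.ptp(regularized.predict(x_test)))
```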
@markmorgan55683 жыл бұрын
I think this is helpful. My initial reaction was “how does a network learn when the loss and gradients are zero?” But with weight decay or dropout the weights do still change. Now back to watching the video. :)
@visuality25413 жыл бұрын
L2 reg (weight decay) is more about small-valued parameters rather than about a fewer number of params (sparsity); at least the classical view is so, as you probably already know. I personally believe that the finding is related to the Lipschitzness of the overall net and also the angle component of the representation vector, but I am not sure what loss function they really used. It'd be really surprising if future work shows the superiority of L2 reg compared to other regs for generalization, of course with a solid theory.
@beaconofwierd18833 жыл бұрын
I don't see what needs explaining here. The loss function is something like L = MSE(w, x, y) + d * ||w||^2. The optimization tries to get the error as small as possible and also get the weights as small as possible. A simple rule requires a lot less representation than memorizing every datapoint, thus more weights can be 0. The rule-based solution is exactly what we are optimizing for; it just takes a long time to find it because the dominant term in the loss is the MSE, hence why it first moves toward minimizing the error by memorizing the training data and then slowly moves towards minimizing the weights. The only way to minimize both is to learn the rule. How is this different from simple regularization? As for why the generalization happens "suddenly", it is just a result of how gradient descent (with the Adam optimizer) works. Once it has found a gradient direction toward the global minimum it will pick up speed and momentum moving towards that global minimum, but finding a path towards that global minimum could take a long time, possibly forever if it got stuck in a local minimum without any way out. Training 1000x longer to achieve generalization feels like a bad/inefficient approach to me.
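(A toy numeric illustration of that loss; the names and constants here are made up. Once the MSE gradient is near zero on memorized data, the decay term is the only thing still moving the weights, toward smaller-norm solutions that keep fitting:)
```python
import numpy as np

def sgd_step(w, grad_mse, lr=1e-3, decay=1.0):
    # Gradient of L = MSE + decay * ||w||^2 is grad_mse + 2 * decay * w.
    # When grad_mse ~ 0 (training set memorized), the 2 * decay * w term
    # keeps shrinking w, slowly drifting toward lower-norm solutions.
    return w - lr * (grad_mse + 2 * decay * w)

w = np.array([3.0, -2.0, 0.5])
for _ in range(1000):
    w = sgd_step(w, grad_mse=np.zeros_like(w))  # pretend the fit error is already zero
print(w)  # weights decay toward zero unless the MSE term pushes back
```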
@beaconofwierd18833 жыл бұрын
@@aletheapower1977 Not really though, look at the graphs at 14:48: without regularization you need about 80% of the data to be training data for generalization to occur, and then you're basically just training on all your data and the model is more likely to find the rule before it manages to memorize all of the training data. Not really any surprise there either.
@beaconofwierd18833 жыл бұрын
@@aletheapower1977 Is there a relationship between the complexity of the underlying rule and the data fraction needed to grok? A good metric to use here could be the entropy of the training dataset. One would expect groking to happen faster and at smaller training data fractions when the entropy of the dataset is low, and vice versa. Not sure how much updating you will be doing, but this could be something worth investigating :) If there’s no correlation between entropy and when grokking occurs that would be very interesting.
@beaconofwierd18833 жыл бұрын
@@aletheapower1977 I hope you find the time :) If not you can always add it as possible future work :)
@AdobadoFantastico3 жыл бұрын
This is a fascinating phenomenon, thanks for sharing and walking through the paper.
@AliMoeeny3 жыл бұрын
Just wanted to say thank you, I learned a lot.
@tristanwegner2 жыл бұрын
In a few years a neural network might watch your videos as training to output better neural networks. I mean, Google owns YouTube and DeepMind, and videos are the next step from picture and text creation, so maybe the training is already happening. But with such a huge dataset I hope they weigh the videos not by views only, but also by the predicted education of the audience. I mean, they sometimes ask me if I liked a video because it was relaxing, or informative, etc. So thanks for being part of the pre-singularity :)
@Yaketycast3 жыл бұрын
OMG, I'm not crazy! This has happened to me when I forgot to take down a model for a month
@freemind.d27143 жыл бұрын
Same here
@TheRyulord3 жыл бұрын
Here I was watching the video and wondering "How do people figure something like this out?" but I guess you've answered my question.
@solennsara67633 жыл бұрын
not paying attention is a very essential skill for achieving progress.
@Laszer2713 жыл бұрын
@@solennsara6763 That's essentially also how antibiotics were invented.
@anshul52433 жыл бұрын
@@solennsara6763 so are you saying no attention is all you need?
@tristanwegner2 жыл бұрын
21:33. Your explanation about why weight decay helps with generalization after a lot of training makes intuitive sense. Weight decay is like being forgetful. You try to remember varying details of each sample, but the details are different for each sample, so if you forget a bit, you do not recognize it anymore, and when learning about it again, you pick a new detail to remember. But if you find (stumble upon?) the underlying pattern, each sample reinforces your rule, counteracting the constant forgetting rate, so it will still be there when the training samples come around again, and thus it stays. Neuroscience has assumed that forgetting is essential for learning, and some savants with amazing memory often have a hard time in real-world situations.
@pladselsker83405 ай бұрын
How come this has only gained traction and popularity now, 2 years after the paper's release?
@CristianGarcia3 жыл бұрын
The "Deep Double Descent" paper did look into the double descent phenomena as a function of computation time/steps.
@Idiomatick3 жыл бұрын
Your suggestion seems sufficient for both phenomena. Decay can potentially cause worse loss with a 'wild' memorization, so a smoother solution will eventually be found after many, many, many attempts. "Worst way to generalize solutions" would have been a fun title.
@IIIIIawesIIIII3 жыл бұрын
This may be due to the fact that there are countless combinations of factors that lead to overfitting, but only one "middle point" between those combinations. A quicker (non-exponential) way to arrive there would be to overfit a few times with a different random seed and then add an additional layer that checks for patterns that these overfitted representations have in common.
@SeagleLiu3 жыл бұрын
Yannic, thanks for bringing this to us. I agree with your heuristics on the local minimum vs global minimum. To me, the phenomenon in this paper is related to the "oracle property" of the Lasso in the traditional statistical learning literature, namely that a penalized algorithm (Lasso or the equivalent weight decay) is almost sure to find the correct non-zero parameters (i.e., the global minimum) as well as their correct coefficients. It would be interesting for someone to study this further.
@jabowery3 жыл бұрын
An intuitive way of thinking about the double dip is to imagine you're trying to write an algorithm that goes from gzip to bzip2. You must first decompress. Or think about trying to compress video. If you convert from pixels to voxels it seems like a very costly operation, but it allows you to go from a voxel representation to a geometric abstraction in 3D where there is much less change over time. So don't be confused by the parameter count. You are still trying to come up with the smallest number of parameters in the final model.
@gocandra98473 жыл бұрын
last time i was this early schmidhuber had invented transformers
@NLPprompter7 ай бұрын
what???
@GeorgeSut7 ай бұрын
Schmidthuber also invented GANs, supposedly 😆
@UncoveredTruths3 жыл бұрын
S5? ah yes, all these years of abstract algebra: my time to shine
@herp_derpingson3 жыл бұрын
This is mind blowing. I wonder if a "grokked" image model is more robust against adversarial attacks. Somebody needs to do a paper on that. We can also draw parallels to the dimpled manifold hypothesis: weight decay implies that the dimples would have to have a shallow angle while still being correct, and a dimple with a shallow angle will just have a lot more volume to "catch particles" within it, leading to better generalization. I would also be interested in seeing how fast we should decay the weights. My intuition says that it has to be significantly less than the learning rate, otherwise the model won't catch up with the training, while not being so small that the weight loss is nullified by gradient addition. I remember in one of his blog posts Andrej Karpathy said to make sure that your model is able to overfit a very small dataset before moving on to next steps. I wonder if it will now become standard practice to "overfit first" a model. I wonder if something similar happens in humans. I have noticed that I develop a better understanding of some topics after I let them sit in my brain for a few weeks and I forget a lot of the details.
@billykotsos46423 жыл бұрын
Should keep this on our radar !
@TimScarfe3 жыл бұрын
Been waiting for this
@hacker2ish Жыл бұрын
I think weight decay's connection to the simplicity of the trained model comes from the fact that it can, in some cases, push many parameters to be almost zeroed out (because weight decay amounts to adding the squared 2-norm of the parameters, scaled by a constant, to the objective function)
@nikre3 жыл бұрын
Why would it be specifically t-SNE that reflects the "underlying mechanism" and not a different method? I would like to see how the t-SNE structure evolves over time, and other visualization methods. This might as well be a cherry-picked result.
@Mefaso093 жыл бұрын
Definitely, the evolution over time would be very interesting
@tsunamidestructor3 жыл бұрын
As soon as I read the title, my mind immediately went to the "reconciling modern machine learning and the bias variance trade-off" paper.
@jeanf62953 жыл бұрын
When you are limited to compositions of elementary operations that don't have quite the right structure, it kinda makes sense that you need a lot of parameters to emulate the proper rule, possibly way more than you have data points. It's a bit like using NOT and OR gates to represent an operation: sure, it will work, but you will need quite a few layers of computation and you will need to store quite a few intermediary results at each step.
@rayyuan63493 жыл бұрын
I don't know if the GAN community (perhaps the overparameterization community too) and the general ML community have widely accepted "grokking", but this is absolutely mind-blowing to the field, as it implies that we can just brainlessly add more parameters to get better results.
@piratepartyftw3 жыл бұрын
very cool phenomenon. thanks for highlighting this paper.
@Savage2Flinch7 ай бұрын
Stupid question: if your loss on the training set goes to zero, what happens during back-prop if you continue? Shouldn't the weights stop changing after overfitting?
@pokerandphilosophy83285 ай бұрын
My guess is that the model keeps learning about the deep structure of the problem, albeit not yet in a way that is deep enough to enable generalization outside of the training set. Imagine you are memorizing a set of poems by a poet. At some point you know all of them by rote, but you don't yet understand the style of the poet deeply enough to be able to predict (with above-chance accuracy) how some of the unseen poems proceed. As you keep rehearsing the initial poems (and being penalized for the occasional error) you latch onto an increasing number of heuristics or features of the poet's style. This doesn't significantly improve your ability to recite those poems, since you already know them by rote. Nor does it significantly improve your ability to make predictions outside of the training set, since your understanding of the style isn't holistic enough. But at some point you have latched onto sufficiently many stylistic features (and learned to tacitly represent aspects of the poet's world view) to be able to understand why (emphasis on "why") the poet chooses this or that word in this or that context. You've grokked, and are now able to generalize outside of the training data set.
@affrokilla3 жыл бұрын
This is actually really interesting, I wonder if the same concepts apply to harder problems (object detection, classification, etc). But the compute necessary to go 100-1000x more training steps than usual will be insane. Really hope someone tries it though.
@andrewjong24983 жыл бұрын
Anecdotally, my colleague also independently observed grokking on their dataset with height maps a year ago, using a UNet. So it does seem possible for harder problems.
@guruprasadsomasundaram92733 жыл бұрын
Thank you for the wonderful review of this interesting paper.
@BjornHeijligers7 ай бұрын
Can you explain what "training on the validation or test data" means? Are you using the actual "y" data of the test or validation set to backpropagate the prediction error? How is that different from adding more data to the training set? Thanks, trying to understand.
@gianpierocea3 жыл бұрын
Very cool. For some reason what I would like to see is to combine this sort of thing with a multitask approach, i.e., make the model solve the binary operation task and some other synthetic task that is different enough. I really doubt that with the current computational resources this would ever work, but I do feel like if the model could grok that... well, that would be really impressive, closer to AGI-like.
@yasserothman40233 жыл бұрын
Thank you for the explanation. Any references or recommended videos on weight decay in the context of deep learning optimization? Thanks
@petretrusca23 жыл бұрын
It happened to me as well on a problem, but the phenomenon occurred much faster. I observed that it seemed like the model had converged, but then, after some time spent apparently learning nothing, it snapped into much better performance. I also observed that if the model is larger it snaps faster, and if the model is smaller it snaps slower or not at all
@petretrusca23 жыл бұрын
And it is not a double descent phenomenon, because the validation loss does not increase, it stays constant. Also, my dataset is a real one, not an artificial one
@BitJam3 жыл бұрын
This is such a surprising result! How can training be so effective at generalizing when the loss function is already so close to zero? Is the loss function really pointing the way to the generalization when we're already at a very flat bottom? Is it going down when the NN generalizes? How many bits of precision do we need for this to work? I wonder what happens if noise is added to the loss function during generalization. If generalization always slows down when we add noise then we know the loss function is driving it. If it speeds up or doesn't slow down then that points to self-organization.
@imranq92413 жыл бұрын
how can we prove double descent mathematically? I wonder if there could be a triple descent as well
@THEMithrandir093 жыл бұрын
25:50 looking forward to the Occams Razor Regularizer.
@drewduncan57743 жыл бұрын
en.wikipedia.org/wiki/Minimum_description_length
@THEMithrandir093 жыл бұрын
@@drewduncan5774 great read, thx!
@felix-ht3 жыл бұрын
Feels a lot like the process when a human is learning something new. In the beginning one just tries to soak up the knowledge, and then suddenly you just get how it works and convert the knowledge to understanding. To me this moment always felt very relieving, as much of the knowledge can simply be replaced with the understanding. Arguably the driving factor in humans for this might actually be something similar to weight decay: many neurons storing knowledge is certainly more costly than a few just remembering the rule. Extrapolated even further, this might very well be what drives sentience itself; the simplest solution to being efficient in a reality is to understand its rules and one's role within it. So sentience would be the human brain grokking a new, more elegant solution to its training data. One example of this would be human babies suddenly starting to recognize themselves in a mirror.
@oreganorx73 жыл бұрын
I would have named the paper/research "Weight decay drives generalization" : P
@oreganorx73 жыл бұрын
@@aletheapower1977 Hi Alethea, I was mostly just joking around. In my head I took a few leaps and came up with an understanding of how I think it is that with weight decay this behaviour happens the fastest. Great work on the paper and research!
@jasdeepsinghgrover24703 жыл бұрын
Thanks for the amazing video... I have seen this while working with piecewise linear functions. Somehow leaky ReLU works much better than ReLU or other activation functions... Wouldn't have known this paper exists if you hadn't posted it..
@vslaykovsky3 жыл бұрын
This looks like an Aha moment discovered by the network at some point.
@djfl58mdlwqlf3 жыл бұрын
The Eureka Moment
@herewegoagain23 жыл бұрын
Like you mentioned, I think even a minor amount of noise is enough to influence the 'fit' in higher dimensions. I think they're considering an ideal scenario. If so, it won't 'generalise' to real-life use cases. Very interesting paper nonetheless.
@petertemp47853 жыл бұрын
Really well explained!
@chocol8milkman7502 жыл бұрын
Really well explained.
@thunder893 жыл бұрын
I cannot find in the paper what the dataset size for Figure 6 was -- what percentage of the training data are outliers? It seems interesting that there is a jump between 1k and 2k outliers...
@GeorgeSut7 ай бұрын
So this is another interpretation of double descent, right?
@GeorgeSut7 ай бұрын
Also, why didn't the authors train the model for more epochs/optimization steps? I honestly doubt the accuracy stays at that level, noticing that the peak on the right of the plot is cut-off by the plot's boundaries. What if there is some weird periodicity here?
@tinyentropy2 жыл бұрын
What's the name of the other paper mentioned?
@abubakrbabiker95533 жыл бұрын
My guess would be that this is more like a guided search for optimal weights, as in the neural network explores many local minima (with SGD) w.r.t. the training set until it finds some weights that fit the test dataset. I suspect that if they extended the training period the test accuracy would drop again, and might rise later as well
@Champignon10003 жыл бұрын
This is super cool, thanks. I'm trying something where I use the intermediate layers of a discriminator as an additional loss for the generator/autoencoder, to avoid collapse, and adding regularizers really helps a lot. (Specifically, instead of an MSE loss between the real image and the fake image, I use a super deep perceptual loss from the discriminator, so basically only letting the autoencoder know how different the image was conceptually.)
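(A minimal sketch of that kind of discriminator-feature "perceptual" loss; the hooks or forward pass that collect the intermediate activations are assumed to exist elsewhere:)
```python
import torch
import torch.nn.functional as F

def feature_matching_loss(features_real, features_fake):
    # Compare intermediate discriminator activations instead of raw pixels,
    # so the generator/autoencoder is penalized for producing images that
    # *look* different to the discriminator, not just pixel-wise different.
    return sum(F.mse_loss(f_fake, f_real.detach())
               for f_real, f_fake in zip(features_real, features_fake))

# Toy usage with random tensors standing in for the hooked activations:
feats_real = [torch.randn(4, 64, 16, 16), torch.randn(4, 128, 8, 8)]
feats_fake = [torch.randn(4, 64, 16, 16), torch.randn(4, 128, 8, 8)]
print(feature_matching_loss(feats_real, feats_fake))
```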
@justfoundit3 жыл бұрын
I'm wondering if it works with XGBoost too: increasing the size of the model to 100k and let the model figure out what went wrong when it starts to overfit at around 2k steps?
@goodtothinkwith7 ай бұрын
So if a phenomenon is relatively isolated in the dataset, it’s more likely to grok it? If that’s the case, it seems like experimental physics datasets would be good candidates to grok new physics…
@altus12263 жыл бұрын
This jives with my understanding of machine learning. We task the computer with looking through a "latent space" of possible code combinations in a common yet esoteric language (maths), and ask it to craft a program that takes our given inputs and returns our given outputs. In this example, the number of layers and parameters we give it directly represents the number of separate functions and variables, respectively, that it is allowed to create a program with. As we increase the size of the notepad file we are giving our artificial programmer, its ability to look at different possibilities and write complex functions increases, with non-linear increases in efficacy. If it is only a little too large, it may find it can create a few approximations at a time, but can't get enough of them together to compare and contrast and quickly see a workable pattern, so it has to slog through things a few at a time while forgetting all the progress it had otherwise made. If the notepad is too small, it will never make a perfect program regardless of time spent. If you were to compare the training of a perfectly sized notepad to an over-sized one, this seems to represent a trade-off of memory size for training speed, as the larger one allows the learning algorithm to investigate more of the latent space at a time.
@G12GilbertProduction3 жыл бұрын
Everything is right. But where does a compiler play its own role?
@altus12263 жыл бұрын
@@G12GilbertProduction A compiler generally only removes a layer of abstraction, which helps reduce complexity greatly for humans; however, in our case, I feel like our programmer is writing directly in the computed language, so there is no need for a compilation step. That said, I see no reason we could not come up with a reverse compiler to translate such a NN into human-readable code.
@ericadar3 жыл бұрын
I'm curious if floating point errors have anything to do with this phenomenon. Is it possible that FP errors accumulate and cause de-facto regularization after orders of magnitude more forward and backward passes? Is the order-of-magnitude delay between the training and validation accuracy jumps shorter for FP16 when compared to FP32?
@hannesstark50243 жыл бұрын
I guess the most awesome papers are at workshops!? It would be really nice to have a plot of the magnitude of the weights over the training time to see if it decreases a lot when grokking occurs.
@ssssssstssssssss3 жыл бұрын
Well. Workshops are a lot more fun than big conferences. And the barrier to entry is typically lower so conservative (aka exploitation) bias has less of an impact.
@hannesstark50243 жыл бұрын
@@aletheapower1977 ❤️
@TheRohr3 жыл бұрын
Does the number of necessary steps correlate with the number necessary for differential evolution algorithms?
@PhilipTeare3 жыл бұрын
As well as double descent, I think this relates strongly to the work by Naftali Tishby and his group around the 'forgetting phase' beyond the overfitting point, where unnecessary information carried forward through the layers gets forgotten until all that is left is what is needed to predict the label: kzbin.info/www/bejne/iprWgJWafbxrjdE Again, he was running for ~10K epochs on very simple problems and networks.
@ssssssstssssssss3 жыл бұрын
Grokking is a pretty awesome name. Maybe I'll find a way to work it into my daily lingo.
@Lawrencelot893 жыл бұрын
How is this related to weight regularization? If the weight decay is what causes this, making the weights already near zero should cause this even earlier right?
@kadirgunel59263 жыл бұрын
This paper reminded me of the halting problem, doesn't it? It can stop, but it will take too much time; it is nondeterministic. I felt like I was living in the 1950s.
@TheEbbemonster3 жыл бұрын
That is grokked up!
@pisoiorfan3 жыл бұрын
I wonder if a model after grokking suddenly becomes more compressible through distillation than any of its previous incarnations. That would be expected if it discovers a simple rule that fits any input, and it would be the equivalent of the machine having an "Aha!"
@743571753 жыл бұрын
What tasks are likely to be amenable to this? Only symbolic? How does the grokking phenomenon appear in non synthetic tasks?
@drdca82633 жыл бұрын
This comment leads me to ask, "what about a synthetic task that isn't symbolic?" E.g. addition mod 23.5, with floating point numbers. Or maybe they tried that? Not sure.
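(A hypothetical continuous analogue of the paper's tables; nothing like this appears in the paper, and the modulus and split fraction here are made up for illustration:)
```python
import numpy as np

rng = np.random.default_rng(0)
MODULUS = 23.5  # arbitrary non-integer modulus, per the comment above

x = rng.uniform(0, MODULUS, size=(10_000, 2)).astype(np.float32)
y = np.remainder(x[:, 0] + x[:, 1], MODULUS).astype(np.float32)  # regression target

split = int(0.3 * len(x))  # e.g. a 30% training fraction, mirroring the paper's data-fraction sweeps
x_train, y_train = x[:split], y[:split]
x_val, y_val = x[split:], y[split:]
print(x_train.shape, x_val.shape)
```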
@XOPOIIIO3 жыл бұрын
My networks are showing the same phenomenon when they get overfitted on the validation set too.
@gunter75182 жыл бұрын
How can i contact u ? I want to see these networks
@martinkaras7753 жыл бұрын
In natural datasets where the generating function is unknown (or maybe nonexistent), isn't there a bigger chance of overfitting to the validation set than of finding the actual generating function?
@themiguel-martinho2 жыл бұрын
Why do many of the binary operations tested include a modulus operator?
@themiguel-martinho2 жыл бұрын
Also, since both this work and double descent come from OpenAI, did the researchers discuss it with each other?
@andrewminhnguyen94463 жыл бұрын
Makes me think of "Knowledge distillation: A good teacher is patient and consistent," where the authors obtain better student networks by training for an unintuitively long time. Also, curious how this might play with "Deep Learning on a Data Diet." Andrej Karpathy talks about how he accidentally left a model training over winter break and found it had SotA'ed.
@raunaquepatra39663 жыл бұрын
Is there learning rate decay? Then it would be even crazier, because then jumping from a local minimum to a global one would be counterintuitive
@employeeofthemonth13 жыл бұрын
So is there a penalty on larger weights in the training algorithm? If yes, I wouldn't find this surprising intuitively.
@ralvarezb783 жыл бұрын
This means that at some point there is a very sharp optimum in the surface of the loss function
@derekxiaoEvanescentBliss3 жыл бұрын
Curious though if this has to do with the relationship between the underlying rule of the task, the implicit regularization of gradient descent, and explicit regularization. Is this a property of neural network learning, or is it a property of the underlying manifolds of datasets drawn from the real world and math? Edit: ah, looks like that's where they went in the rest of the paper
@XecutionStyle3 жыл бұрын
What does this mean for Reinforcement Learning and Robotics? Does an over-parameterized network with tons of synthetic data generalize better than a simple one with a fraction, but real data?
@andreasschneider19663 жыл бұрын
Nice paper, also well done on your side! The generalization is relatively slow though I would say, since you are on a log scale here
@erwile3 жыл бұрын
It could be helpful with tiny datasets !
@larryjuang56433 жыл бұрын
Wonder if Grokking could happen for a reinforcement learning set up......
@azai.mp43 жыл бұрын
How is this different from epoch-wise double descent?
@shm62733 жыл бұрын
Wouldn't the snapping be a symptom of the loss landscape having a very sharp global minimum and a ton of flat points? Let the network learn forever (and also push it around) and it should eventually find the global minimum. I'm assuming that in a real dataset, the 10^K iterations needed for a snap will grow larger than there are atoms in the universe. EDIT: I commented too soon, seems you had a similar idea in the video.
@shm62733 жыл бұрын
@@aletheapower1977 Cool. So you believe the minimum is actually not sharp, but pretty flat, and that makes the generalization better? I remember reading a paper about this, I'll link to it a bit later
@agentds16243 жыл бұрын
It would also have been very interesting to see if the validation loss also snaps down. Maybe it's just a coincidence that there is no val loss curve ;). I currently also have a kind of similar case where my loss curves scream overfitting, however the accuracy sometimes shoots up...
@Coolguydudeness12343 жыл бұрын
The figures and the whole subject of the paper are referring to the validation loss
@patrickjdarrow3 жыл бұрын
Some intern probably added a few too many 0's to their n_epochs variable, went on vacation and came back to these results
@oscezrcd3 жыл бұрын
Do interns have vacations?
@patrickjdarrow3 жыл бұрын
@@oscezrcd depends on their arrangement
@patrickjdarrow3 жыл бұрын
@@aletheapower1977 LOL. Love that
@ayushkanodia2403 жыл бұрын
In appendix A.1: "For each training run, we chose a fraction of all available equations at random and declared them to be the training set, with the rest of the equations being the validation set." Can someone tell me how they choose their training and test sets? Figure 5 seems to indicate that one model runs on only one equation (same for the RHS in Figure 1), but this line from the paper suggests otherwise.
@ayushkanodia2403 жыл бұрын
@@aletheapower1977 thank you. So the statement in the appendix is for equations within an operation. Clarifies!
@billykotsos46423 жыл бұрын
Interesting indeed !
@CharlesVanNoland3 жыл бұрын
Reminds me of a maturing brain, when a baby starts being aware of objects and tracking them, understanding permanence and tracking them, etc.
@jjpp33013 жыл бұрын
It's like when you find out that it's easier to lick the sugar cookie than to scrape out the figure with a needle ;)
@MCRuCr3 жыл бұрын
This sounds just like your average "I forgot to stop the experiment over the weekend" random scientific discovery.
@ZLvids3 жыл бұрын
Actually, I have encountered several times an "opposite grokking" phenomenon where the loss suddenly *increases*. Does this phenomenon have a name?
@vochinin9 ай бұрын
OK, isn't it just that the weight decay loss is not included in the validation loss used for comparison? Controlling for accuracy by itself might be a bad idea. Finding solutions beyond the overfitting point might seem strange, but if we include the weight decay loss (which is not done in DL libraries, as it's part of the optimizers), we can see a different story. Isn't it?
@victorrielly45883 жыл бұрын
Yes! A transformlet paper.
@sq93403 жыл бұрын
Maybe... just "maybe", it's not really overfitting, but learning quite slowly. If you look at the spikes in training accuracy at around 100%, those could be weak learning signals that fix the model. And the synthetic data "could" be different, meaning the val accuracy will only jump up when you have a near-perfect model.
@pascalzoleko6943 жыл бұрын
No matter how good a video is, I guess someone has to put the thumb down. Very interesting stuff. I wonder what it costs to run training scripts for soooo long :D
@xiao26343 ай бұрын
It would be cool if the training data were the Enigma encoded messages, and suddenly the model figured it out after millions of epochs.
@DasGrosseFressen3 жыл бұрын
Learn the rule? So there was extrapolation?
@Mordenor3 жыл бұрын
I think the group is C97, the only group of order 97 (97 is prime), which is abelian. The identity element is a.
@birkensocks3 жыл бұрын
I don't think Z_97 is a group under most of the binary operations they used. But (x, y) -> x + y mod 97 is one of the operations, so it could be the group. It does seem like a is the identity element, and the table is symmetric about the diagonal
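(A quick check of both observations, assuming the operation in that table really is x + y mod 97; the symbol "a" would then just be whichever token encodes 0:)
```python
import numpy as np

P = 97
table = (np.arange(P)[:, None] + np.arange(P)[None, :]) % P  # Cayley table of x + y mod 97

assert (table == table.T).all()  # abelian: symmetric about the diagonal
identity = [e for e in range(P) if (table[e] == np.arange(P)).all()]
print("identity element:", identity)  # [0]
```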