So AI gets BETRAYED and TRAPPED in a HYPERBOLIC TIME CHAMBER for a 1000 years until it starts "GROKKING".
@nizanklinghoffer4620Сағат бұрын
That StableMax loss completely ruined my model's generalization 😅
@gjadams742 сағат бұрын
Ripples in the waves of the feedback machine
@corgirun78926 сағат бұрын
The value of some papers lies in persuading the community: we need to change our approach.
@dandushi98728 сағат бұрын
What capabilities will an LCM have over an LLM? I understand that it can understand whole sentences but what are the benefits?
@tikendraw11 сағат бұрын
Have you seen this: Titans: Learning to Memorize at Test Time (arXiv:2501.00663v1 [cs.LG], 31 Dec 2024)?
@Pure_Science_and_Technology12 сағат бұрын
Titans oh my
@kevinpham665813 сағат бұрын
I see the OrthoGrad optimizer in the author's repo. It takes in a `base_optimizer_cls` param, which wraps something like `torch.optim.SGD` or `torch.optim.AdamW`. So can we just use this optimizer wherever an optimizer is passed into any transformers Trainer?
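In principle, yes: transformers' Trainer accepts a pre-built optimizer through its `optimizers=(optimizer, lr_scheduler)` argument, so passing an instance of the repo's OrthoGrad there should work. For anyone curious what such a wrapper does, here is a minimal sketch of the idea only (strip from each gradient the component parallel to its weight tensor, then defer to a base optimizer); the class name and signature below are illustrative, not the repo's actual API:

```python
import torch

class OrthoGradSketch:
    """Illustrative sketch: project each gradient onto the component
    orthogonal to its weight tensor, then let a base optimizer step."""

    def __init__(self, params, base_optimizer_cls=torch.optim.AdamW, **base_kwargs):
        self._params = [p for p in params if p.requires_grad]
        self.base = base_optimizer_cls(self._params, **base_kwargs)
        self.param_groups = self.base.param_groups   # so LR schedulers still see the groups
        self.state = self.base.state

    @torch.no_grad()
    def step(self, closure=None):
        for p in self._params:
            if p.grad is None:
                continue
            w, g = p.flatten(), p.grad.flatten()
            # g_perp = g - (<w, g> / <w, w>) * w  (remove the component parallel to w)
            coeff = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)
            p.grad.copy_((g - coeff * w).view_as(p.grad))
        return self.base.step(closure)

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)
```

Whether a duck-typed wrapper like this satisfies every code path in the Trainer is another question (some paths may expect a true `torch.optim.Optimizer` subclass), so check it against the author's implementation before relying on it.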
@letsRegulateSociopaths13 сағат бұрын
Musk does grokking
@cristiantatu642713 сағат бұрын
Does this only apply to classification? What about regression?
@bernardoramos940913 сағат бұрын
Maybe this is why dropout helps. By zeroing some weights, the logits will be different and then it will have gradients again
@ricosrealmСағат бұрын
Dropout is a form of regularization, and yes, regularization helps.
@bernardoramos940914 сағат бұрын
Other solutions would be: 1. increase the amount of training data, or 2. decrease the size of the model. If the model is small compared to the training data, it cannot memorize it, and so it is forced to generalize. Using growth (like in TokenFormer, Masked Structural Growth or LLaMA Pro) could probably be an option as well.
@bernardoramos940914 сағат бұрын
This may explain why dropout is useful sometimes: it would modify the output, and then there would be a gradient again.
@irbsurfer158514 сағат бұрын
Wow, this is the explanation I've been waiting for! The "aha!" moment was real. I'm not wasting any time - I'm already implementing StableMax and the orthogonal gradient optimizer for my transformer architectures. I'm really excited to see if this can accelerate training and improve generalization. My convolutional and recurrent networks are next on the list! Improved performance and numerical stability are the goals. Thank you so much for making this accessible - fantastic contribution! Superb explanations!
@Pure_Science_and_Technology12 сағат бұрын
Huh? 😊
@irbsurfer158511 сағат бұрын
@@Pure_Science_and_Technology LMFAO, I think I am finally starting to get better at AI now that I can speak the jargon. lol I still feel like an amateur though. lol
@bernardoramos940910 сағат бұрын
Post some results if you can
@pepediedrich56096 сағат бұрын
@@irbsurfer1585 How do you measure generalization?
@pixelsort14 сағат бұрын
2nd part already! This will be StableMax. 🍿
@maxwelldylla333414 сағат бұрын
It still doesn't explain why learning continues past the point where 100% training accuracy is achieved, just the mechanism for the delay. Why do continued updates to the weights, using training data the model has already learned to classify correctly, lead to generalization on the unseen test data?
@tescOne14 сағат бұрын
I am assuming I'm dumb and missed a step, but I have the same doubt. It's clear how to avoid the delay and why it happens, but I'm still not getting why it suddenly does better on the evaluation after a while. EDIT: after giving it another shot, I think we are misunderstanding this: we see NLM (naive loss minimization) as necessarily leading to SC (softmax collapse). But that's not true: it CAN lead to either a grokking delay OR SC (depending on the behavior of the parameters, I guess). So generalization only stops when NLM leads to SC. This makes grokking not a magic / singular event; rather something that should naturally happen, but doesn't because of some "artificial" limitations.
@pensiveintrovert431813 сағат бұрын
Not 100% on the development/validation dataset.
@oliverdaniels496113 сағат бұрын
I assume one of two things is happening: 1. it gets stuck in a local minimum while the "grokked" state is a more global minimum, or 2. weight decay becomes so strong that learning can't continue along the existing paths, effectively reducing the size of the neural net, which increases generalization (and building two or three different paths that overfit to one dataset will also be more general).
@irbsurfer158511 сағат бұрын
I could very well be mistaken, but I think of the analogy of a student cramming for an exam by memorizing answers. After the exam, they reflect on what they learned and realize there are deeper patterns or connections between the topics. The act of reviewing helps them generalize their knowledge, not just recall specific answers. In this case, the model is "reviewing" and reorganizing its internal understanding after reaching perfect accuracy. That's just my guess though, and I have nothing to back it up yet.
@dennisestenson78204 сағат бұрын
The paper is describing ways grokking is generally prevented, then offering a couple ways to avoid that. It's not explaining how or why grokking happens in the wild.
@IxCIHAoX15 сағат бұрын
Is it correct to assume the new optimizer is meant to be used only after overfitting?
@irbsurfer158514 сағат бұрын
Great question. I am probably not qualified to answer, so take this with a grain of salt, but based on the information presented (I have not read the paper in full yet), it is more accurate to say the new optimizer is designed to prevent the problematic dynamics that occur after overfitting, rather than being meant exclusively for use after overfitting has fully set in. While the motivation stems from problems observed after overfitting, its intended use is likely throughout the training process, particularly as the model begins to saturate on the training data, to keep the optimization from getting stuck in the naive scaling regime. Experimentation would be key to see what works best in practice for your specific model and task. I'm certainly going to give it a try and see if I can figure it out too. Good luck brother!
@oliverdaniels496112 сағат бұрын
It seems like it should work from the start, but I would expect a slower initial convergence to the "fully fitting training data" point
@samankittani885616 сағат бұрын
You mention LLMs, but could this apply to different models? I assume it applies to any model that has softmax activations at each layer. Or is it just the final layer?
@AbelShields16 сағат бұрын
Most models only use softmax as the final layer; I've never heard of it being used in intermediate layers.
@novantha115 сағат бұрын
Most modern models use softmax (in the context of this paper) in the language head to scale the output confidence in the tokens into a probability distribution, i.e. it rescales confidence into a 0-1 range where it's really easy to produce random numbers and select one of the tokens. Generally speaking, other uses of softmax are on the downtrend: softmax as an activation in an MLP can lead to competition between channels, so you get really inefficient utilization of parameters; softmax in MoE selection does the same thing (because MoE is just an approximation of an MLP); and even for the attention mechanism it's not immediately clear anymore that softmax is great, because it's known that softmax can saturate with too many entries, which places a hard limit on the scale and expressivity of attention heads as you scale them. People still hang onto it in the attention mechanism, though, because softmax is so well researched as the long-time standard and the alternatives aren't as well understood, so that one might stick around for a while yet.
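A tiny toy sketch of that language-head path (sizes and values made up purely for illustration):

```python
import torch

vocab_size, hidden_dim = 8, 16
lm_head = torch.nn.Linear(hidden_dim, vocab_size, bias=False)   # toy language head

hidden = torch.randn(1, hidden_dim)                   # final hidden state for one position
logits = lm_head(hidden)                              # unnormalized confidences over the vocabulary
probs = torch.softmax(logits, dim=-1)                 # rescaled into a 0-1 distribution that sums to 1
next_token = torch.multinomial(probs, num_samples=1)  # draw a random number and select a token
print(probs.squeeze().tolist(), next_token.item())
```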
@irbsurfer158514 сағат бұрын
Do you mean besides just transformer NNs? Yes, I plan on testing it with recurrent and convolutional NNs soon; the focus is on transformer NNs for now.
@pabloescobar273816 сағат бұрын
😮😊 Thank you
@timealchemist750817 сағат бұрын
Just want to say I really appreciate this channel for the deep dives. Best in class. Keep it going! I am learning a lot.
@awesomedata897317 сағат бұрын
I'm not a math guy (at least not in the traditional sense), but I love your channel's exploration of the bleeding edge of this technology and of the people progressing (and trying to understand) it - but mostly how, despite the complexity of the subject matter, you make it as accessible as you can to the average person (like me), who doesn't delve too deeply into mathematical rigor. More physical (and less abstract) explanations of phenomena that are easy to visualize (i.e. in story form?), wherever possible, generally help me picture this stuff, but you at least take the time to go slowly over the material and explain everything that's necessary from a beginner's level without leaving anything out. I can generally keep up because it's clear you're passionate about this stuff, and it makes me want to learn even more than I already do (allowing me to pay closer attention). Your channel doesn't have to be flashy to get it right. Great job! - Keep up the amazing work!
@N_Patil8018 сағат бұрын
Bro, firstly I love your content, but I wish you had a better accent or would do something with an AI voice generator. I can only understand some of what you are saying.
@freed.d18 сағат бұрын
Too fast for RAG is classic... /sob
@tiagotiagot19 сағат бұрын
Would the solution be to always introduce noise that is just above the scale of the rounding error?
@fontenbleau20 сағат бұрын
Why are they using such a dorky word? What about calling it "acceleration" or "self-tuning"?
@trucid223 сағат бұрын
How do models overcome the softmax collapse and start learning representations that generalize?
@asfaust312Күн бұрын
Link your papers in the description, please.
@dzhukovКүн бұрын
Instant subscribe!
@code4AIКүн бұрын
With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@d3fau1thmph21 сағат бұрын
I would prefer synthetic English. With good pronunciation.
@LudicrousTachyonКүн бұрын
This is pretty reasonable. If we always give the system all the data, it doesn't need to generalize, because it has all the data. If you take away some of the data, the system now needs to start assuming the portions that are missing, or guessing what the possibly missing data is, and thus needs to train to generalize from the data it does have to what is possible. This is what we do with children: we throw tons of input at them, then throw the kid into a situation which may or may not have all the same inputs. Finally, we correct or encourage their behavior. Thus the child learns to deal with variable situations.
@yakmage8085Күн бұрын
Love it
@geldokuКүн бұрын
This video must be important, but I don't understand a word of it.
@drdca8263Күн бұрын
22:30 : hm, is it even grokking at that point? This sounds like maybe grokking is just, “the NN after being stuck in a rut, eventually gets lucky and stops being stuck”? Ah. Cliffhanger.
@oliverdaniels4961Күн бұрын
This could be as big a jump as transformers, or overcoming the gradient explosion in DNNs
@Anonymous-lw1zyКүн бұрын
Why not just run softmax at far higher precision (double: 64-bit float, 53-bit significand? quadruple: 128-bit float, 113-bit significand?) so you don't get precisely 1, 0, 0, ... from the softmax? Or, rather than compute softmax with exp(), use an adjustable steepness function, adjusting it over time much like learning rates are adjusted over time, with a control loop to keep it away from getting pinned to 1, 0, 0, ... (OK, I read the paper; these are suggested solutions.)
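For illustration, a tiny sketch of the saturation effect and those two ideas (higher precision, and an adjustable steepness via a temperature); the logit values are arbitrary:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = (logits - logits.max()) / temperature   # subtract the max for stability; temperature = steepness knob
    e = np.exp(z)
    return e / e.sum()

# A huge logit gap, as appears once the model keeps scaling up its weights.
logits = np.array([200.0, 0.0, -5.0])

print(softmax(logits.astype(np.float32)))                     # [1. 0. 0.]              exp underflows: exactly one-hot
print(softmax(logits.astype(np.float64)))                     # [1.0  1.4e-87  9.3e-90]  the tail survives at higher precision
print(softmax(logits.astype(np.float32), temperature=10.0))   # [1.0  2.1e-09  1.3e-09]  nonzero tail even in float32
```

As the reply below notes, higher precision helps but doesn't remove the intrinsic problem.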
@tbirdal18 сағат бұрын
Good point and we did experiment with these in the paper. While it helps, you cannot get rid of the intrinsic problem by simply using higher precision.
@robtaylor144414 сағат бұрын
@@tbirdal Have you looked at posits?
@tikendrawКүн бұрын
Let's say a company hires you as a data science researcher. With your current knowledge, and the same limitations as the tech giants, how far could you take an LLM designed and trained from scratch?
@Sirus20x6Күн бұрын
So train a low-rank model until you run out of space to fit more learning, and slowly up the quantization?
@be1tubeКүн бұрын
This did not explain grokking to me because my question is not "why does it take so long to happen" but why does it happen at all after memorization? Why is there still a training signal after the network has memorized the training examples?
@fackarov941223 сағат бұрын
The model memorizes the training data using "primitive" structures. With more training the model continues to evolve, and if better structures emerge to represent the data, it exploits them; if it finds the "exact" data structure, grokking kicks in. Once a better structure is found, it opens up an exponential space of "good" candidate structures, so it keeps getting better and better.
@mircorichter137521 сағат бұрын
People use the term "grokking" differently. Here, and in most research, 'grokking' is not the generalization AFTER the delay but the delay before the expected generalization. So a model that generalizes on par with or soon after memorizing the training data does not grok, while a model groks if there is a delay...
@mircorichter137521 сағат бұрын
That is arguably counterintuitive, given the meaning of the word 'grokking'.
@daPawlak21 сағат бұрын
@@mircorichter1375 So is it just a problem of getting a result that should be achievable without that delay, or is there additional capability provided by that additional compute time?
@be1tube20 сағат бұрын
@@fackarov9412 What was confusing me is the 100% training accuracy. However, I now have a hypothesis: the graph shown in these situations plots training accuracy, not training loss. So there is a signal (the difference between an output of (.2, .2, .6) and (0, 0, 1)); it's just not shown in the graph.
@be1tubeКүн бұрын
There was a paper many months ago showing that boosting certain frequency components of the gradients (I think it was a low-pass filter, but it could have been the opposite) skipped most of the delay before grokking.
@MatthewKelley-mq4ceКүн бұрын
Yep. I remember that one.
@TheDarkhawk243Күн бұрын
Why do you never link papers?
@mohl-bodell2948Күн бұрын
What a cliffhanger...
@polyscopesКүн бұрын
For real haha, LLM training getting 2x+ cheaper overnight
@hdot2613Күн бұрын
Such a tease 😂
@breakablecКүн бұрын
Sounds like: 1. once the error becomes low, you want to stop the maximisation of the already-optimal parameters so that there is still a gradient, and 2. you want to use number formats with enough fractional precision that the softmax doesn't get rounded to exactly 1 and 0.
@KilgoreTroutAsfКүн бұрын
This is the vanishing gradient problem all over again.
@yeetdeetsКүн бұрын
Exploding in this case though, right?
@ChaseFreedomMusicianКүн бұрын
I'm really looking forward to part 2!!
@rubncarmonaКүн бұрын
This makes me think of nGPT from Nvidia a bunch
@andikunar7183Күн бұрын
Thanks, AMAZING content, WOW! Does this mean, in non-mathematical/layman's terms, that good, smaller-context knowledge samples (decreased dimensionality) during training help with grokking?
@polyscopesКүн бұрын
I think he was saying that the decreased dimensionality helped prevent memorization, forcing the model to generalize sooner instead of memorizing the training data first and then learning to generalize.
@mloewen248Күн бұрын
Wouldn't simply adding noise to the input data then solve this issue after repeated epochs?
@HoldMyDataКүн бұрын
So now what, am I just going to waste my time waiting on training runs? @24:35 😅 Great explanation, thanks. I remember Nous or someone, back when all of this first started, saying "OK, so we just keep going and going and eventually it works." This makes it understandable.