With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@maxwelldylla3334 12 hours ago
It still doesn't explain why learning continues past the point where 100% training accuracy is achieved, just the mechanism for the delay. Why do continued updates to the weights, using training data the model has already learned to classify correctly, lead to generalization on the unseen test data?
@tescOne 11 hours ago
I'm assuming I'm dumb and missed a step, but I have the same doubt. It's clear how to avoid the delay and why it happens, but I'm still not getting why performance on the evaluation set suddenly improves after a while. EDIT: after giving it another shot, I think we are misunderstanding this: we see NLM (naive loss minimization) as necessarily leading to SC (softmax collapse). But that's not true: it CAN lead to either a grokking delay OR SC (depending on the behavior of the parameters, I guess). So generalization only stops when NLM leads to SC. This makes grokking not a magical, singular event; rather something that should happen naturally, but doesn't because of some "artificial" limitations.
@pensiveintrovert4318 10 hours ago
Not 100% on the development/validation dataset.
@oliverdaniels4961 10 hours ago
I assume one of two things is happening: 1. it gets stuck in a local minimum, while the "grokked" state is a more global minimum; 2. weight decay becomes so strong that learning can't continue along the existing paths, effectively reducing the size of the neural net, which improves generalization (and building two or three different paths that overfit to one dataset will also be more general).
@irbsurfer1585 8 hours ago
I could very well be mistaken, but I think of the analogy of a student cramming for an exam by memorizing answers. After the exam, they reflect on what they learned and realize there are deeper patterns or connections between the topics. The act of reviewing helps them generalize their knowledge, not just recall specific answers. In this case, the model is "reviewing" and reorganizing its internal understanding after reaching perfect accuracy. That's just my guess, though, and I have nothing to back it up yet.
@dennisestenson7820 2 hours ago
I think the paper is describing the ways grokking is generally prevented, then offering a couple of ways to avoid them. It's not explaining how or why grokking happens in the wild.
@irbsurfer1585 11 hours ago
Wow, this is the explanation I've been waiting for! The "aha!" moment was real. I'm not wasting any time - I'm already implementing StableMax and the orthogonal-gradient optimizer (OrthoGrad) for my transformer architectures. I'm really excited to see if this can accelerate training and improve generalization. My convolutional and recurrent networks are next on the list! Improved performance and numerical stability are the goals. Thank you so much for making this accessible - fantastic contribution! Superb explanations!
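For anyone else implementing along: as I understand the paper, StableMax just swaps the exp(x) inside softmax for a function s(x) that grows linearly for non-negative inputs and decays like 1/(1-x) for negative ones, so it is much harder to saturate float precision. A minimal sketch of that idea (double-check it against the authors' code before trusting it):

```python
import torch

def stablemax(x, dim=-1):
    # s(x) grows only linearly for x >= 0 and decays like 1/(1-x) for x < 0,
    # so it saturates floating-point precision far more slowly than exp(x).
    s = torch.where(x >= 0, x + 1.0, 1.0 / (1.0 - x))
    return s / s.sum(dim=dim, keepdim=True)

# Sanity check against softmax on moderate logits: both give valid distributions.
logits = torch.tensor([[2.0, 0.5, -1.0]])
print(stablemax(logits), stablemax(logits).sum())
print(torch.softmax(logits, dim=-1))
```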
@Pure_Science_and_Technology 9 hours ago
Huh? 😊
@irbsurfer1585 8 hours ago
@Pure_Science_and_Technology LMFAO, I think I'm finally starting to get better at AI now that I can speak the jargon, lol. I still feel like an amateur though, lol.
@bernardoramos9409 7 hours ago
Post some results if you can
@pepediedrich5609 3 hours ago
@irbsurfer1585 how do you measure the generalization?
@bernardoramos9409 11 hours ago
Maybe this is why dropout helps. By zeroing some weights, the logits will be different and then it will have gradients again
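A quick way to see the underlying issue this would work around (a rough sketch, assuming float32 and PyTorch's standard cross_entropy): once the correct-class logit is far enough ahead, the loss and its gradient underflow to exactly zero, so anything that perturbs the logits, dropout included, is what brings gradients back.

```python
import torch
import torch.nn.functional as F

def ce_loss_and_grad(margin):
    # One sample, 3 classes; class 0 is "correct" and leads by `margin` logits.
    logits = torch.tensor([[margin, 0.0, 0.0]], requires_grad=True)
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()
    return loss.item(), logits.grad.abs().max().item()

print(ce_loss_and_grad(10.0))   # modest margin: tiny but nonzero loss and gradient
print(ce_loss_and_grad(120.0))  # huge margin: loss and gradient underflow to exactly 0.0
```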
@pabloescobar2738 13 hours ago
😮😊 thanks
@bernardoramos9409 11 hours ago
Other solutions would be: 1. increase the amount of training data, 2. decrease the size of the model. If the model is small compared to the training data, it cannot memorize the training data and so is forced to generalize. Using growth (as in TokenFormer, Masked Structural Growth, or LLaMA Pro) could probably be an option as well.
@Pure_Science_and_Technology 9 hours ago
Titans oh my
@samankittani8856 13 hours ago
You mention LLMs, but could this apply to other models? I assume it applies to any model that has softmax activations at each layer. Or is it just the final layer?
@AbelShields 13 hours ago
Most models only use softmax as the final layer; I've never heard of it being used in intermediate layers.
@novantha1 12 hours ago
Most modern models use softmax (in the context of this paper) in the language head to rescale the output confidence over tokens into a probability distribution, i.e. it rescales confidence into a 0-1 range where it's easy to draw a random number and select a token. Generally speaking, other uses of softmax are on the downtrend: softmax as an activation in an MLP can lead to competition between channels, so you get really inefficient utilization of parameters, and softmax in MoE expert selection does the same thing (because MoE is just an approximation of an MLP). Even for the attention mechanism it's not immediately clear anymore that softmax is great, because it's known that softmax can saturate with too many entries, which places a hard limit on the scale and expressivity of attention heads as you scale them. People still hang onto it in the attention mechanism, though, because softmax is so well researched as the standard and the alternatives aren't as well understood, so that one might stick around for a while yet.
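Concretely, the language-head usage is just this (a toy PyTorch sketch, nothing model-specific):

```python
import torch

# Toy "language head" output: unnormalized scores (logits) over a 5-token vocabulary.
logits = torch.tensor([2.0, 0.5, -1.0, 3.0, 0.0])

probs = torch.softmax(logits, dim=-1)                 # rescale into a 0-1 probability distribution
next_token = torch.multinomial(probs, num_samples=1)  # draw the next token from that distribution

print(probs, probs.sum())  # the probabilities sum to 1.0
print(next_token.item())
```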
@irbsurfer1585 11 hours ago
Do you mean besides just transformer NNs? Yes, I plan on testing it with recurrent and convolutional NNs soon; the focus is on transformer NNs for now.
@IxCIHAoX 12 hours ago
Is it correct to assume the new optimizer is meant to be used only after overfitting?
@irbsurfer1585 11 hours ago
Great question, and I'm probably not qualified to answer it, so take this with a grain of salt, but based on the information presented (I haven't read the paper in full yet), it's more accurate to say the new optimizer is designed to prevent the problematic dynamics that occur after overfitting, rather than being meant exclusively for use after overfitting has fully set in. While the motivation for this optimizer stems from problems observed after overfitting, its intended use is likely throughout the training process, particularly as the model begins to saturate on the training data, to keep the optimization from getting stuck in the naive scaling regime. Experimentation would be key to see what works best in practice for your specific model and task. I'm certainly going to give it a try and see if I can figure it out too. Good luck brother!
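If it helps, the core idea as I understand it (not the authors' exact implementation, which may also rescale the result) is just to project away the part of each gradient that points along the current weights, so an update can't simply scale the weights up:

```python
import torch

def orthogonalize_gradients(params, eps=1e-12):
    """Strip from each gradient its component along the current weights, so a step
    can no longer just scale the weights up (the 'naive loss minimization' direction).
    Sketch of the core idea only."""
    for p in params:
        if p.grad is None:
            continue
        w = p.detach().flatten()
        g = p.grad.flatten()
        proj = (w @ g) / (w @ w + eps)          # scalar projection of the gradient onto w
        p.grad.copy_((g - proj * w).view_as(p.grad))

# Usage: call between backward() and step(), e.g.
#   loss.backward()
#   orthogonalize_gradients(model.parameters())
#   optimizer.step()
```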
@oliverdaniels4961 9 hours ago
It seems like it should work from the start, but I would expect a slower initial convergence to the "fully fitting training data" point
@corgirun7892 3 hours ago
The value of some papers lies in persuading the community: we need to change our approach.
@tikendraw 8 hours ago
Have you seen this: Titans: Learning to Memorize at Test Time (arXiv:2501.00663v1 [cs.LG], 31 Dec 2024)?
@cristiantatu6427 11 hours ago
Does this only apply to classification? What about regression?
@kevinpham6658 10 hours ago
I see the OrthoGrad optimizer in the author's repo. It takes a `base_optimizer_cls` param, which wraps something like `torch.optim.SGD` or `torch.optim.AdamW`. So can we just use this optimizer wherever an optimizer is passed into any transformers Trainer?
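Something like this is what I had in mind (untested sketch; the OrthoGrad constructor is only inferred from the repo description above, and `model` / `train_dataset` are whatever you already have):

```python
import torch
from transformers import Trainer, TrainingArguments
# from <authors_repo> import OrthoGrad   # hypothetical import; take the class from the authors' repo

# Assuming the constructor works the way the repo description suggests
# (base_optimizer_cls plus kwargs forwarded to that base optimizer):
optimizer = OrthoGrad(
    model.parameters(),                   # `model` is whatever model you're already training
    base_optimizer_cls=torch.optim.AdamW,
    lr=1e-4,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=train_dataset,          # your existing dataset
    optimizers=(optimizer, None),         # None -> Trainer builds its default LR scheduler
)
trainer.train()
```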