With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@maxwelldylla3334 12 hours ago
It still doesn't explain why learning continues past the point where 100% training accuracy is achieved, just the mechanism for the delay. Why do continued updates to the weights, using training data the model has already learned to classify correctly, lead to generalization on the unseen test data?
@tescOne 11 hours ago
I'm assuming I'm dumb and missed a step, but I have the same doubt. It's clear how to avoid the delay and why it happens, but I'm still not getting why performance on the evaluation set suddenly improves after a while. EDIT: after giving it another shot, I think we are misunderstanding this: we see NLM (naive loss minimization) as necessarily leading to SC (softmax collapse). But that's not true: it CAN lead to either a grokking delay OR SC (depending on the behavior of the parameters, I guess). So generalization only stops when NLM leads to SC. This makes grokking not a magical, singular event; rather something that should happen naturally, but doesn't because of some "artificial" limitations.
@pensiveintrovert4318 10 hours ago
Not 100% on the development/validation dataset.
@oliverdaniels4961 10 hours ago
I assume one of two things is happening: 1. it gets stuck in a local minimum, while the "grokked" state is a more global minimum; 2. weight decay becomes so strong that learning can't continue along the existing paths, effectively reducing the size of the neural net, which improves generalization (and building two or three different paths that overfit to one dataset will also be more general).
@irbsurfer1585 8 hours ago
I could very well be mistaken, but I think of the analogy of a student cramming for an exam by memorizing answers. After the exam, they reflect on what they learned and realize there are deeper patterns or connections between the topics. The act of reviewing helps them generalize their knowledge, not just recall specific answers. In this case, the model is "reviewing" and reorganizing its internal understanding after reaching perfect accuracy. That's just my guess, though, and I have nothing to back it up yet.
@dennisestenson7820 2 hours ago
I think the paper is describing the ways grokking is generally prevented, then offering a couple of ways to avoid them. It's not explaining how or why grokking happens in the wild.
@irbsurfer1585 11 hours ago
Wow, this is the explanation I've been waiting for! The "aha!" moment was real. I'm not wasting any time - I'm already implementing StableMax and the orthogonal-gradient optimizer (OrthoGrad) for my transformer architectures. I'm really excited to see if this can accelerate training and improve generalization. My convolutional and recurrent networks are next on the list! Improved performance and numerical stability are the goals. Thank you so much for making this accessible - fantastic contribution! Superb explanations!
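For anyone else implementing along: as I understand the paper, StableMax just swaps the exp(x) inside softmax for a function s(x) that grows linearly for non-negative inputs and decays like 1/(1-x) for negative ones, so it is much harder to saturate float precision. A minimal sketch of that idea (double-check it against the authors' code before trusting it):

```python
import torch

def stablemax(x, dim=-1):
    # s(x) grows only linearly for x >= 0 and decays like 1/(1-x) for x < 0,
    # so it saturates floating-point precision far more slowly than exp(x).
    s = torch.where(x >= 0, x + 1.0, 1.0 / (1.0 - x))
    return s / s.sum(dim=dim, keepdim=True)

# Sanity check against softmax on moderate logits: both give valid distributions.
logits = torch.tensor([[2.0, 0.5, -1.0]])
print(stablemax(logits), stablemax(logits).sum())
print(torch.softmax(logits, dim=-1))
```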
@Pure_Science_and_Technology 9 hours ago
Huh? 😊
@irbsurfer1585 8 hours ago
@Pure_Science_and_Technology LMFAO, I think I'm finally starting to get better at AI now that I can speak the jargon, lol. I still feel like an amateur though, lol.
@bernardoramos9409 7 hours ago
Post some results if you can
@pepediedrich5609 3 hours ago
@irbsurfer1585 how do you measure the generalization?
@bernardoramos9409 11 hours ago
Maybe this is why dropout helps. By zeroing some weights, the logits will be different and then it will have gradients again
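A quick way to see the underlying issue this would work around (a rough sketch, assuming float32 and PyTorch's standard cross_entropy): once the correct-class logit is far enough ahead, the loss and its gradient underflow to exactly zero, so anything that perturbs the logits, dropout included, is what brings gradients back.

```python
import torch
import torch.nn.functional as F

def ce_loss_and_grad(margin):
    # One sample, 3 classes; class 0 is "correct" and leads by `margin` logits.
    logits = torch.tensor([[margin, 0.0, 0.0]], requires_grad=True)
    loss = F.cross_entropy(logits, torch.tensor([0]))
    loss.backward()
    return loss.item(), logits.grad.abs().max().item()

print(ce_loss_and_grad(10.0))   # modest margin: tiny but nonzero loss and gradient
print(ce_loss_and_grad(120.0))  # huge margin: loss and gradient underflow to exactly 0.0
```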
@pabloescobar2738 13 hours ago
😮😊 thanks
@bernardoramos9409 11 hours ago
Other solutions would be: 1. increase the amount of training data, 2. decrease the size of the model. If the model is small compared to the training data, it cannot memorize the training data and so is forced to generalize. Using growth (as in TokenFormer, Masked Structural Growth, or LLaMA Pro) could probably be an option as well.
@Pure_Science_and_Technology 9 hours ago
Titans oh my
@samankittani8856 13 hours ago
You mention LLMs, but could this apply to other models? I assume it applies to any model that has softmax activations at each layer. Or is it just the final layer?
@AbelShields 13 hours ago
Most models only use softmax as the final layer; I've never heard of it being used in intermediate layers.
@novantha1 12 hours ago
Most modern models use softmax (in the context of this paper) in the language head to rescale the output confidence over tokens into a probability distribution, i.e. it rescales confidence into a 0-1 range where it's easy to draw a random number and select a token. Generally speaking, other uses of softmax are on the downtrend: softmax as an activation in an MLP can lead to competition between channels, so you get really inefficient utilization of parameters, and softmax in MoE expert selection does the same thing (because MoE is just an approximation of an MLP). Even for the attention mechanism it's not immediately clear anymore that softmax is great, because it's known that softmax can saturate with too many entries, which places a hard limit on the scale and expressivity of attention heads as you scale them. People still hang onto it in the attention mechanism, though, because softmax is so well researched as the standard and the alternatives aren't as well understood, so that one might stick around for a while yet.
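Concretely, the language-head usage is just this (a toy PyTorch sketch, nothing model-specific):

```python
import torch

# Toy "language head" output: unnormalized scores (logits) over a 5-token vocabulary.
logits = torch.tensor([2.0, 0.5, -1.0, 3.0, 0.0])

probs = torch.softmax(logits, dim=-1)                 # rescale into a 0-1 probability distribution
next_token = torch.multinomial(probs, num_samples=1)  # draw the next token from that distribution

print(probs, probs.sum())  # the probabilities sum to 1.0
print(next_token.item())
```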
@irbsurfer1585 11 hours ago
Do you mean besides just transformer NNs? Yes, I plan on testing it with recurrent and convolutional NNs soon; the focus is on transformer NNs for now.
@IxCIHAoX 12 hours ago
Is it correct to assume the new optimizer is meant to be used only after overfitting?
@irbsurfer1585 11 hours ago
Great question, and I'm probably not qualified to answer it, so take this with a grain of salt, but based on the information presented (I haven't read the paper in full yet), it's more accurate to say the new optimizer is designed to prevent the problematic dynamics that occur after overfitting, rather than being meant exclusively for use after overfitting has fully set in. While the motivation for this optimizer stems from problems observed after overfitting, its intended use is likely throughout the training process, particularly as the model begins to saturate on the training data, to keep the optimization from getting stuck in the naive scaling regime. Experimentation would be key to see what works best in practice for your specific model and task. I'm certainly going to give it a try and see if I can figure it out too. Good luck brother!
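If it helps, the core idea as I understand it (not the authors' exact implementation, which may also rescale the result) is just to project away the part of each gradient that points along the current weights, so an update can't simply scale the weights up:

```python
import torch

def orthogonalize_gradients(params, eps=1e-12):
    """Strip from each gradient its component along the current weights, so a step
    can no longer just scale the weights up (the 'naive loss minimization' direction).
    Sketch of the core idea only."""
    for p in params:
        if p.grad is None:
            continue
        w = p.detach().flatten()
        g = p.grad.flatten()
        proj = (w @ g) / (w @ w + eps)          # scalar projection of the gradient onto w
        p.grad.copy_((g - proj * w).view_as(p.grad))

# Usage: call between backward() and step(), e.g.
#   loss.backward()
#   orthogonalize_gradients(model.parameters())
#   optimizer.step()
```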
@oliverdaniels4961 9 hours ago
It seems like it should work from the start, but I would expect a slower initial convergence to the "fully fitting training data" point
@corgirun7892 3 hours ago
The value of some papers lies in persuading the community: we need to change our approach.
@tikendraw 8 hours ago
Have you seen this: Titans: Learning to Memorize at Test Time (arXiv:2501.00663v1 [cs.LG], 31 Dec 2024)?
@cristiantatu6427 11 hours ago
Does this only apply to classification? What about regression?
@kevinpham6658 10 hours ago
I see the OrthoGrad optimizer in the author's repo. It takes a `base_optimizer_cls` param, which wraps something like `torch.optim.SGD` or `torch.optim.AdamW`. So can we just use this optimizer wherever an optimizer is passed into any transformers Trainer?
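Something like this is what I had in mind (untested sketch; the OrthoGrad constructor is only inferred from the repo description above, and `model` / `train_dataset` are whatever you already have):

```python
import torch
from transformers import Trainer, TrainingArguments
# from <authors_repo> import OrthoGrad   # hypothetical import; take the class from the authors' repo

# Assuming the constructor works the way the repo description suggests
# (base_optimizer_cls plus kwargs forwarded to that base optimizer):
optimizer = OrthoGrad(
    model.parameters(),                   # `model` is whatever model you're already training
    base_optimizer_cls=torch.optim.AdamW,
    lr=1e-4,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=8),
    train_dataset=train_dataset,          # your existing dataset
    optimizers=(optimizer, None),         # None -> Trainer builds its default LR scheduler
)
trainer.train()
```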