Comments
@narrativeless404 • 48 minutes ago
So AI gets BETRAYED and TRAPPED in a HYPERBOLIC TIME CHAMBER for a 1000 years until it starts "GROKKING".
@nizanklinghoffer4620 • 1 hour ago
That stablemax loss completely ruined my model's generalization 😅
@gjadams74 • 2 hours ago
Ripples in the waves of the feedback machine
@corgirun7892 • 6 hours ago
The value of some papers lies in persuading the community: we need to change our approach.
@dandushi9872 • 8 hours ago
What capabilities will an LCM have over an LLM? I understand that it can understand whole sentences but what are the benefits?
@tikendraw • 11 hours ago
have you seen this: Titans: Learning to Memorize at Test Time (arXiv:2501.00663v1 [cs.LG] 31 Dec 2024)
@Pure_Science_and_Technology • 12 hours ago
Titans oh my
@kevinpham6658 • 13 hours ago
I see the OrthoGrad optimizer in the author's repo. It takes in a `base_optimizer_cls` param, which wraps something like `torch.optim.SGD` or `torch.optim.AdamW`. So we can just use this optimizer wherever an optimizer is passed into any transformers Trainer?
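For readers wondering what that wrapping looks like in practice, here is a minimal sketch of the orthogonal-gradient idea, written from the description in the video rather than copied from the authors' repo (their `OrthoGrad` class and the exact `base_optimizer_cls` signature may differ): before the base optimizer steps, each gradient is replaced by its component orthogonal to the corresponding weight vector.

```python
# Sketch only: a wrapper that projects out the gradient component parallel to the
# weights, then defers to a standard base optimizer such as AdamW.
import torch

class OrthoGradWrapper:
    def __init__(self, params, base_optimizer_cls=torch.optim.AdamW, **base_kwargs):
        self.params = [p for p in params if p.requires_grad]
        self.base_optimizer = base_optimizer_cls(self.params, **base_kwargs)
        # expose what LR schedulers and most training loops reach for
        self.param_groups = self.base_optimizer.param_groups
        self.state = self.base_optimizer.state

    @torch.no_grad()
    def step(self, closure=None):
        for p in self.params:
            if p.grad is None:
                continue
            # g_orth = g - (<w, g> / <w, w>) * w
            wg = torch.sum(p * p.grad)
            ww = torch.sum(p * p) + 1e-30  # guard against division by zero
            p.grad.sub_((wg / ww) * p)
        return self.base_optimizer.step(closure)

    def zero_grad(self, set_to_none=True):
        self.base_optimizer.zero_grad(set_to_none=set_to_none)
```

As for the Trainer question: `transformers.Trainer` does accept a pre-built optimizer via its `optimizers=(optimizer, lr_scheduler)` argument, so something like `Trainer(..., optimizers=(OrthoGradWrapper(model.parameters(), torch.optim.AdamW, lr=5e-5), None))` is the natural way to try it. Note, though, that Trainer/Accelerate internals generally expect a `torch.optim.Optimizer` subclass, so the authors' real implementation (or a proper subclass) is safer than this bare sketch.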
@letsRegulateSociopaths • 13 hours ago
Musk does grokkking
@cristiantatu6427 • 13 hours ago
Does this only apply to classification? What about regression?
@bernardoramos9409 • 13 hours ago
Maybe this is why dropout helps. By zeroing some weights, the logits will be different and then it will have gradients again
@ricosrealm • 1 hour ago
dropout is a form of regularization, and yes regularization helps.
@bernardoramos9409 • 14 hours ago
Other solutions would be: 1. increase the amount of training data, or 2. decrease the size of the model. If the model is small compared to the training data, it cannot memorize the training data, and so it is forced to generalize. Using growth (as in TokenFormer, Masked Structural Growth, or LLaMA Pro) could probably be an option as well.
@bernardoramos9409 • 14 hours ago
this may explain why dropout is useful sometimes. it would modify the output, and then there would be a gradient again
@irbsurfer1585 • 14 hours ago
Wow, this is the explanation I've been waiting for! The "aha!" moment was real. I'm not wasting any time - I'm already implementing StableMax and the orthogonal gradient optimizer for my transformer architectures. I'm really excited to see if this can accelerate training and improve generalization. My convolutional and recurrent networks are next on the list! Improved performance and numerical stability are the goals. Thank you so much for making this accessible - fantastic contribution! Superb explanations!
@Pure_Science_and_Technology • 12 hours ago
Huh? 😊
@irbsurfer1585 • 11 hours ago
@@Pure_Science_and_Technology LMFAO, I think I am finally starting to get better at AI now that I can speak the jargon. lol I still feel like an amateur though. lol
@bernardoramos9409 • 10 hours ago
Post some results if you can
@pepediedrich5609 • 6 hours ago
@@irbsurfer1585 How do you measure the generalization?
@pixelsort • 14 hours ago
2nd part already! This will be StableMax. 🍿
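Since several comments mention implementing StableMax: below is a small sketch of a StableMax-style cross-entropy, based on my reading of the paper's idea of swapping exp(z) in softmax for a slower-growing s(z) (roughly z + 1 for z >= 0 and 1/(1 - z) for z < 0). The function names are mine, and the exact form should be checked against the official code before relying on it.

```python
# Sketch of a StableMax-style cross-entropy: replace exp(z) in softmax with s(z),
# which grows only linearly for positive logits, so probabilities saturate to
# exact 0/1 far more slowly than with ordinary softmax.
import torch

def s(z: torch.Tensor) -> torch.Tensor:
    # clamp keeps the unused branch of torch.where finite (avoids NaN gradients)
    neg = 1.0 / (1.0 - torch.clamp(z, max=0.0))
    return torch.where(z >= 0, z + 1.0, neg)

def stablemax_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # logits: (batch, num_classes); targets: (batch,) class indices
    sz = s(logits)
    probs = sz / sz.sum(dim=-1, keepdim=True)
    return -torch.log(probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)).mean()

# drop-in replacement for F.cross_entropy in a training loop (illustrative):
# loss = stablemax_cross_entropy(model(x), y)
```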
@maxwelldylla3334 • 14 hours ago
It still doesn't explain why learning continues past the point where 100% training accuracy is achieved, just the mechanism for the delay. Why do continued updates to the weights, using training data that the model has already learned to correctly classify, lead to generalization on the unseen test data?
@tescOne • 14 hours ago
I'm assuming I'm dumb and missed a step, but I have the same doubt. It's clear how to avoid the delay and why it happens, but I'm still not getting why it suddenly does better on the evaluation after a while. EDIT: after giving it another shot, I think we are misunderstanding this: we see NLM (naive loss minimization) as necessarily leading to SC (softmax collapse). But that's not true: it CAN lead to either a grokking delay OR SC (depending on the behavior of the parameters, I guess). So generalization only stops when NLM leads to SC. This makes grokking not a magic / singular event, but rather something that should naturally happen, except that it doesn't because of some "artificial" limitations.
@pensiveintrovert4318 • 13 hours ago
Not 100% on the development/validation dataset.
@oliverdaniels4961 • 13 hours ago
I assume one of two things is happening: 1. It gets stuck in a local minimum while the "grokked" state is a more global minimum. 2. Weight decay becomes so strong that learning can't continue along the existing paths, effectively reducing the size of the neural net, which increases generalization (and building two or three different paths that overfit to one dataset will also be more general).
@irbsurfer1585 • 11 hours ago
I could very well be mistaken, but I think of the analogy of a student cramming for an exam by memorizing answers. After the exam, they reflect on what they learned and realize there are deeper patterns or connections between the topics. The act of reviewing helps them generalize their knowledge, not just recall specific answers. In this case, the model is "reviewing" and reorganizing its internal understanding after reaching perfect accuracy. That's just my guess though, and I have nothing to back it up yet.
@dennisestenson7820 • 4 hours ago
The paper is describing ways grokking is generally prevented, then offering a couple ways to avoid that. It's not explaining how or why grokking happens in the wild.
@IxCIHAoX • 15 hours ago
Is it correct to assume the new optimizer is meant to be used only after overfitting?
@irbsurfer1585 • 14 hours ago
Great question, and I'm probably not qualified to answer it, so take this with a grain of salt, but based on the information presented (I have not read the paper in full yet), it is more accurate to say the new optimizer is designed to prevent the problematic dynamics that occur after overfitting, rather than being meant exclusively for use after overfitting has fully set in. While the motivation for the optimizer stems from problems observed after overfitting, its intended use is likely throughout the training process, particularly as the model begins to saturate on the training data, to keep the optimization from getting stuck in the naive scaling regime. Experimentation would be key to see what works best in practice for your specific model and task. I'm certainly going to give it a try and see if I can figure it out too. Good luck brother!
@oliverdaniels4961 • 12 hours ago
It seems like it should work from the start, but I would expect a slower initial convergence to the "fully fitting training data" point
@samankittani8856 • 16 hours ago
You mention LLM models, but could this apply to different models? I assume it applies to any model that has softmax activations at each layer. Or is it just the final layer?
@AbelShields • 16 hours ago
Most models only use softmax as final layer, I've never heard of it being used in intermediate layers
@novantha1 • 15 hours ago
Most modern models use softmax (in the context of this paper) in the language head, to scale the output confidence in the tokens to a probability distribution. I.e., it rescales confidence into a 0-1 range where it's really easy to produce random numbers and select one of the tokens. Generally speaking, other uses of softmax are on the downtrend. Softmax as an activation in an MLP can lead to competition between channels, so you get really inefficient utilization of parameters; softmax in MoE selection does the same thing (because MoE is just an approximation of an MLP); and even for the attention mechanism it's not immediately clear anymore that softmax is great, because softmax is known to saturate with too many entries, which places a hard limit on the scale and expressivity of attention heads as you scale them. People still hang onto it in the attention mechanism, though, because softmax is so well researched as the standard and the alternatives aren't as well understood, so that one might stick around for a while yet.
@irbsurfer1585 • 14 hours ago
Do you mean besides just transformer NNs? Yes, I plan on testing it with recurrent and convolutional NNs soon; the focus is on the transformer NN for now.
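To make novantha1's point about the language head concrete, this is all the final softmax does in a decoder-only LM: turn the head's raw logits into a probability distribution from which a token id is sampled (toy numbers, not tied to any particular model).

```python
# Toy illustration: the LM head outputs raw logits; softmax rescales them into
# probabilities in [0, 1] that sum to 1, and sampling picks the next token id.
import torch

vocab_size = 8
logits = torch.randn(vocab_size)           # raw scores from the language head
probs = torch.softmax(logits, dim=-1)      # probability distribution over tokens
next_token = torch.multinomial(probs, 1)   # draw one token id from that distribution
print(probs.sum().item(), next_token.item())
```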
@pabloescobar2738 • 16 hours ago
😮😊 thank
@code4AI • 16 hours ago
With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@timealchemist7508 • 17 hours ago
Just want to say I really appreciate this channel for the deep dives. Best in class. Keep it going! I am learning a lot.
@awesomedata8973 • 17 hours ago
I'm not a math guy (at least not in the traditional sense), but I love your channel's exploration of the bleeding edge from those progressing (and trying to understand) this technology - and mostly how, despite the complexity of the subject matter, you make it as accessible as you can to the average person (like me) who doesn't delve too deeply into mathematical rigor. More physical (and less abstract) explanations of phenomena that are easy to visualize (i.e. in story form?), wherever possible, generally help me picture this stuff, but you at least take the time to go slowly over the material and explain everything that's necessary from a beginner's level without leaving anything out. I can generally still keep up because it's clear you're passionate about this, and it makes me want to learn even more than I already do (and helps me pay closer attention). Your channel doesn't have to be flashy to get it right. Great job! - Keep up the amazing work!
@N_Patil80 • 18 hours ago
Bro, firstly I love your content, but I wish you had a better accent or would do something with an AI voice generator. I can only understand some of what you are saying.
@freed.d • 18 hours ago
too fast for rag is classic ... /sob
@tiagotiagot • 19 hours ago
Would the solution be to always introduce noise that is just above the scale of the rounding error?
@fontenbleau • 20 hours ago
Why are they using such a dorky word? What about calling it "acceleration" or self-tuning?
@trucid2 • 23 hours ago
How do models overcome the softmax collapse and start learning representations that generalize?
@asfaust312 • 1 day ago
link your papers in the description please.
@dzhukov • 1 day ago
Instant subscribe!
@code4AI • 1 day ago
With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@d3fau1thmph • 21 hours ago
I would prefer synthetic English. With good pronunciation.
@LudicrousTachyon • 1 day ago
This is pretty reasonable. If we always give it all the data, the system doesn't need to generalize, because it has all the data. If you take away some of the data, the system now needs to start assuming which portions are missing, or guessing what the possibly missing data is, and thus needs to train to generalize from the data it does have to what is possible. This is what we do with children: we throw tons of input at them, then throw the kid into a situation which may or may not have all the same inputs, and finally we correct or reinforce their behavior. Thus the child learns to deal with variable situations.
@yakmage8085 • 1 day ago
Love it
@geldoku • 1 day ago
this video must be important but I don't understand a word of it.
@drdca8263 • 1 day ago
22:30 : hm, is it even grokking at that point? This sounds like maybe grokking is just, “the NN after being stuck in a rut, eventually gets lucky and stops being stuck”? Ah. Cliffhanger.
@oliverdaniels4961 • 1 day ago
This could be as big a jump as transformers, or overcoming the gradient explosion in DNNs
@Anonymous-lw1zy • 1 day ago
Why not just run softmax at far higher precision (double: 64-bit float, 53-bit significand; or quadruple: 128-bit float, 113-bit significand) so you don't get precisely 1, 0, 0, ... from the softmax? Or, rather than computing softmax with exp(), use an adjustable-steepness function and adjust the steepness over time, much like learning rates are adjusted over time. Put in a control loop to keep it away from getting pinned to 1, 0, 0, ... --- OK, I read the paper. These are suggested solutions.
@tbirdal • 18 hours ago
Good point and we did experiment with these in the paper. While it helps, you cannot get rid of the intrinsic problem by simply using higher precision.
@robtaylor1444 • 14 hours ago
@@tbirdal have you looked at posits?
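A quick toy check of what's being discussed in this thread (my own numbers, not from the paper): with a modest logit margin, float32 softmax already rounds the winning probability to exactly 1.0, so that example's -log p contribution is exactly zero, while float64 still returns a tiny but nonzero loss. Push the margin further and float64 saturates too, which matches the reply above.

```python
# Softmax saturation vs. floating-point precision: float32 pins the top
# probability to exactly 1.0 much earlier than float64, but both eventually do.
import torch

logits = torch.tensor([25.0, 0.0, -5.0])
for dtype in (torch.float32, torch.float64):
    p = torch.softmax(logits.to(dtype), dim=-1)
    print(dtype, bool(p[0] == 1.0), (-torch.log(p[0])).item())
```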
@tikendraw • 1 day ago
Let's say a company hires you as a data science researcher, with your current knowledge and the same limitations as the tech giants: how far could you take an LLM if it were designed and trained from scratch?
@Sirus20x6 • 1 day ago
so train a low rank until you run out of space to fit more learning, and slowly up the quantization?
@be1tube • 1 day ago
This did not explain grokking to me because my question is not "why does it take so long to happen" but why does it happen at all after memorization? Why is there still a training signal after the network has memorized the training examples?
@fackarov9412 • 23 hours ago
The model memorizes the training data using "primitive" structures. With training, the model continues to evolve, and if better structures emerge to represent the data, it exploits them; if it finds the "exact" structure of the data, that activates grokking. Once a better structure is found, it opens an exponential space of "good" candidate structures to keep getting better and better.
@mircorichter1375 • 21 hours ago
People use the term "grokking" differently. Here and in most research 'grokking' is not the generalization AFTER the delay but the delay before the expected generalization. So a model that generalizes on par or soon after training memorization does not grok, while a model groks if there is a delay...
@mircorichter1375 • 21 hours ago
That is arguably counterintuitive due to the meaning of the word 'grokking'
@daPawlak • 21 hours ago
@@mircorichter1375 So is it just a problem of getting a result that could be achievable without that delay, or is there additional capability provided by that additional compute time?
@be1tube • 20 hours ago
@@fackarov9412 What was confusing me is the 100% on training. However, I have a hypothesis now that the graph shown in these situations plots training accuracy, not training loss. So there is a signal (the difference between an output of (.2,.2,.6) and (0,0,1)), it's just not shown in the graph.
@be1tube • 1 day ago
There was a paper many months ago showing that increasing certain frequency components of the gradients (I think it was a low-pass filter but it could have been the opposite) skipped most of the delay for grokking.
@MatthewKelley-mq4ce • 1 day ago
Yep. I remember that one.
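That sounds like the gradient low-pass filtering line of work (e.g. Grokfast). As a rough sketch of the idea as I remember it, and not that paper's actual code or hyperparameters: keep an exponential moving average of each gradient (its low-frequency component) and add a scaled copy of it back before the optimizer step.

```python
# Sketch of gradient low-pass filtering: boost the slow-moving (EMA) component
# of each gradient before optimizer.step(). alpha and lam are illustrative.
import torch

def lowpass_boost(model: torch.nn.Module, ema_state: dict, alpha: float = 0.98, lam: float = 2.0):
    for p in model.parameters():
        if p.grad is None:
            continue
        if p not in ema_state:
            ema_state[p] = p.grad.detach().clone()
        else:
            ema_state[p].mul_(alpha).add_(p.grad, alpha=1 - alpha)
        p.grad.add_(ema_state[p], alpha=lam)  # amplify the low-frequency part

# usage inside a training loop (illustrative):
#   loss.backward()
#   lowpass_boost(model, ema_state)   # ema_state = {} created once before training
#   optimizer.step(); optimizer.zero_grad()
```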
@TheDarkhawk243 • 1 day ago
Why do you never link papers?
@mohl-bodell2948 • 1 day ago
What a cliff hanger...
@polyscopes • 1 day ago
For real haha llm training getting 2x + cheaper overnight
@hdot2613 • 1 day ago
Such a tease 😂
@breakablec • 1 day ago
Sounds like: 1. once the error becomes low, you want to stop the maximisation of already-optimal parameters so that a gradient remains, and 2. you want to use number formats with a large integer part and a small fractional part to increase precision.
@KilgoreTroutAsf • 1 day ago
This is the vanishing gradient problem all over again.
@yeetdeets • 1 day ago
Exploding in this case though, right?
@ChaseFreedomMusician • 1 day ago
I'm really looking forward to part 2!!
@rubncarmona • 1 day ago
This makes me think of nGPT from Nvidia a bunch
@andikunar7183 • 1 day ago
Thanks, AMAZING content, WOW! Does this mean, in non-mathematical/layman's terms, that good, smaller-context knowledge samples (decreased dimensionality) during training help with grokking?
@polyscopes • 1 day ago
I think he was saying that the decreased dimensionality helps prevent memorization, forcing the model to generalize sooner instead of memorizing the training data first and then learning to generalize.
@mloewen248 • 1 day ago
Wouldn't simply adding noise to the input data then solve this issue after repeated epochs?
@HoldMyData • 1 day ago
So now what am I going to waste my time on while waiting for training runs? @24:35 😅 Great explanation, thanks. I remember Nous or someone, when all of this first started, saying "OK, so we just keep going and going and eventually it works". This makes it understandable.