So AI gets BETRAYED and TRAPPED in a HYPERBOLIC TIME CHAMBER for a 1000 years until it starts "GROKKING".
@nizanklinghoffer4620Сағат бұрын
That StableMax loss completely ruined my model's generalization 😅
@gjadams742 сағат бұрын
Ripples in the waves of the feedback machine
@corgirun78926 сағат бұрын
The value of some papers lies in persuading the community: we need to change our approach.
@dandushi98728 сағат бұрын
What capabilities will an LCM have over an LLM? I understand that it can understand whole sentences but what are the benefits?
@tikendraw11 сағат бұрын
Have you seen this: Titans: Learning to Memorize at Test Time (arXiv:2501.00663v1 [cs.LG], 31 Dec 2024)?
@Pure_Science_and_Technology12 сағат бұрын
Titans oh my
@kevinpham665813 сағат бұрын
I see the OrthoGrad optimizer in the author's repo. It takes in a `base_optimizer_cls` param, which wraps something like `torch.optim.SGD` or `torch.optim.AdamW`. So can we just use this optimizer wherever an optimizer is passed into any transformers Trainer?
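In principle, yes: transformers' Trainer accepts a pre-built optimizer through its `optimizers=(optimizer, lr_scheduler)` argument, so passing an instance of the repo's OrthoGrad there should work. For anyone curious what such a wrapper does, here is a minimal sketch of the idea only (strip from each gradient the component parallel to its weight tensor, then defer to a base optimizer); the class name and signature below are illustrative, not the repo's actual API:

```python
import torch

class OrthoGradSketch:
    """Illustrative sketch: project each gradient onto the component
    orthogonal to its weight tensor, then let a base optimizer step."""

    def __init__(self, params, base_optimizer_cls=torch.optim.AdamW, **base_kwargs):
        self._params = [p for p in params if p.requires_grad]
        self.base = base_optimizer_cls(self._params, **base_kwargs)
        self.param_groups = self.base.param_groups   # so LR schedulers still see the groups
        self.state = self.base.state

    @torch.no_grad()
    def step(self, closure=None):
        for p in self._params:
            if p.grad is None:
                continue
            w, g = p.flatten(), p.grad.flatten()
            # g_perp = g - (<w, g> / <w, w>) * w  (remove the component parallel to w)
            coeff = torch.dot(w, g) / (torch.dot(w, w) + 1e-30)
            p.grad.copy_((g - coeff * w).view_as(p.grad))
        return self.base.step(closure)

    def zero_grad(self, set_to_none=True):
        self.base.zero_grad(set_to_none=set_to_none)
```

Whether a duck-typed wrapper like this satisfies every code path in the Trainer is another question (some paths may expect a true `torch.optim.Optimizer` subclass), so check it against the author's implementation before relying on it.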
@letsRegulateSociopaths13 сағат бұрын
Musk does grokking
@cristiantatu642713 сағат бұрын
Does this only apply to classification? What about regression?
@bernardoramos940913 сағат бұрын
Maybe this is why dropout helps. By zeroing some weights, the logits will be different and then it will have gradients again
@ricosrealmСағат бұрын
Dropout is a form of regularization, and yes, regularization helps.
@bernardoramos940914 сағат бұрын
Other solutions would be: 1. increase the amount of training data, or 2. decrease the size of the model. If the model is small compared to the training data, it cannot memorize it, and so it is forced to generalize. Using growth (like in TokenFormer, Masked Structural Growth or LLaMA Pro) could probably be an option as well.
@bernardoramos940914 сағат бұрын
This may explain why dropout is useful sometimes: it would modify the output, and then there would be a gradient again.
@irbsurfer158514 сағат бұрын
Wow, this is the explanation I've been waiting for! The "aha!" moment was real. I'm not wasting any time - I'm already implementing StableMax and the orthogonal gradient optimizer for my transformer architectures. I'm really excited to see if this can accelerate training and improve generalization. My convolutional and recurrent networks are next on the list! Improved performance and numerical stability are the goals. Thank you so much for making this accessible - fantastic contribution! Superb explanations!
@Pure_Science_and_Technology12 сағат бұрын
Huh? 😊
@irbsurfer158511 сағат бұрын
@@Pure_Science_and_Technology LMFAO, I think I am finally starting to get better at AI now that I can speak the jargon. lol I still feel like an amateur though. lol
@bernardoramos940910 сағат бұрын
Post some results if you can
@pepediedrich56096 сағат бұрын
@@irbsurfer1585 How do you measure generalization?
@pixelsort14 сағат бұрын
2nd part already! This will be StableMax. 🍿
@maxwelldylla333414 сағат бұрын
It still doesn't explain why learning continues past the point where 100% training accuracy is achieved, just the mechanism for the delay. Why do continued updates to the weights, using training data the model has already learned to classify correctly, lead to generalization on the unseen test data?
@tescOne14 сағат бұрын
I am assuming I'm dumb and missed a step, but I have the same doubt. It's clear how to avoid the delay and why it happens, but I'm still not getting why it suddenly does better on the evaluation after a while. EDIT: after giving it another shot, I think we are misunderstanding this: we see NLM (naive loss minimization) as necessarily leading to SC (softmax collapse). But that's not true: it CAN lead to either a grokking delay OR SC (depending on the behavior of the parameters, I guess). So generalization only stops when NLM leads to SC. This makes grokking not a magic / singular event; rather something that should naturally happen, but doesn't because of some "artificial" limitations.
@pensiveintrovert431813 сағат бұрын
Not 100% on the development/validation dataset.
@oliverdaniels496113 сағат бұрын
I assume one of two things is happening: 1. it gets stuck in a local minimum while the "grokked" state is a more global minimum, or 2. weight decay becomes so strong that learning can't continue along the existing paths, effectively reducing the size of the neural net, which increases generalization (and building two or three different paths that overfit to one dataset will also be more general).
@irbsurfer158511 сағат бұрын
I could very well be mistaken, but I think of the analogy of a student cramming for an exam by memorizing answers. After the exam, they reflect on what they learned and realize there are deeper patterns or connections between the topics. The act of reviewing helps them generalize their knowledge, not just recall specific answers. In this case, the model is "reviewing" and reorganizing its internal understanding after reaching perfect accuracy. That's just my guess though, and I have nothing to back it up yet.
@dennisestenson78204 сағат бұрын
The paper is describing ways grokking is generally prevented, then offering a couple ways to avoid that. It's not explaining how or why grokking happens in the wild.
@IxCIHAoX15 сағат бұрын
Is it correct to assume the new optimizer is meant to be used only after overfitting?
@irbsurfer158514 сағат бұрын
Great question. I am probably not qualified to answer, so take this with a grain of salt, but based on the information presented (I have not read the paper in full yet), it is more accurate to say the new optimizer is designed to prevent the problematic dynamics that occur after overfitting, rather than being meant exclusively for use after overfitting has fully set in. While the motivation stems from problems observed after overfitting, its intended use is likely throughout the training process, particularly as the model begins to saturate on the training data, to keep the optimization from getting stuck in the naive scaling regime. Experimentation would be key to see what works best in practice for your specific model and task. I'm certainly going to give it a try and see if I can figure it out too. Good luck brother!
@oliverdaniels496112 сағат бұрын
It seems like it should work from the start, but I would expect a slower initial convergence to the "fully fitting training data" point
@samankittani885616 сағат бұрын
You mention LLMs, but could this apply to different models? I assume it applies to any model that has softmax activations at each layer. Or is it just the final layer?
@AbelShields16 сағат бұрын
Most models only use softmax as the final layer; I've never heard of it being used in intermediate layers.
@novantha115 сағат бұрын
Most modern models use softmax (in the context of this paper) in the language head to scale the output confidence in the tokens into a probability distribution, i.e. it rescales confidence into a 0-1 range where it's really easy to produce random numbers and select one of the tokens. Generally speaking, other uses of softmax are on the downtrend: softmax as an activation in an MLP can lead to competition between channels, so you get really inefficient utilization of parameters; softmax in MoE selection does the same thing (because MoE is just an approximation of an MLP); and even for the attention mechanism it's not immediately clear anymore that softmax is great, because it's known that softmax can saturate with too many entries, which places a hard limit on the scale and expressivity of attention heads as you scale them. People still hang onto it in the attention mechanism, though, because softmax is so well researched as the long-time standard and the alternatives aren't as well understood, so that one might stick around for a while yet.
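A tiny toy sketch of that language-head path (sizes and values made up purely for illustration):

```python
import torch

vocab_size, hidden_dim = 8, 16
lm_head = torch.nn.Linear(hidden_dim, vocab_size, bias=False)   # toy language head

hidden = torch.randn(1, hidden_dim)                   # final hidden state for one position
logits = lm_head(hidden)                              # unnormalized confidences over the vocabulary
probs = torch.softmax(logits, dim=-1)                 # rescaled into a 0-1 distribution that sums to 1
next_token = torch.multinomial(probs, num_samples=1)  # draw a random number and select a token
print(probs.squeeze().tolist(), next_token.item())
```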
@irbsurfer158514 сағат бұрын
Do you mean besides just transformer NNs? Yes, I plan on testing it with recurrent and convolutional NNs soon; the focus is on transformer NNs for now.
@pabloescobar273816 сағат бұрын
😮😊 Thank you
@timealchemist750817 сағат бұрын
Just want to say I really appreciate this channel for the deep dives. Best in class. Keep it going! I am learning a lot.
@awesomedata897317 сағат бұрын
I'm not a math guy (at least not in the traditional sense), but I love your channel's exploration of the bleeding edge of this technology and of the people progressing (and trying to understand) it - but mostly how, despite the complexity of the subject matter, you make it as accessible as you can to the average person (like me), who doesn't delve too deeply into mathematical rigor. More physical (and less abstract) explanations of phenomena that are easy to visualize (i.e. in story form?), wherever possible, generally help me picture this stuff, but you at least take the time to go slowly over the material and explain everything that's necessary from a beginner's level without leaving anything out. I can generally keep up because it's clear you're passionate about this stuff, and it makes me want to learn even more than I already do (allowing me to pay closer attention). Your channel doesn't have to be flashy to get it right. Great job! - Keep up the amazing work!
@N_Patil8018 сағат бұрын
Bro, firstly I love your content, but I wish you had a better accent or would do something with an AI voice generator. I can only understand some of what you are saying.
@freed.d18 сағат бұрын
Too fast for RAG is classic... /sob
@tiagotiagot19 сағат бұрын
Would the solution be to always introduce noise that is just above the scale of the rounding error?
@fontenbleau20 сағат бұрын
Why are they using such a dorky word? What about calling it "acceleration" or "self-tuning"?
@trucid223 сағат бұрын
How do models overcome the softmax collapse and start learning representations that generalize?
@asfaust312Күн бұрын
Link your papers in the description, please.
@dzhukovКүн бұрын
Instant subscribe!
@code4AIКүн бұрын
With the automatic audio dubbing from YouTube/Google you hear a synthetic voice in your regional language. To hear my original voice in English, switch to "Default" or "English" in the settings. Thank you.
@d3fau1thmph21 сағат бұрын
I would prefer synthetic English. With good pronunciation.
@LudicrousTachyonКүн бұрын
This is pretty reasonable. If we always give the system all the data, it doesn't need to generalize, because it has all the data. If you take away some of the data, the system now needs to start assuming the portions that are missing, or guessing what the possibly missing data is, and thus needs to train to generalize from the data it does have to what is possible. This is what we do with children: we throw tons of input at them, then throw the kid into a situation which may or may not have all the same inputs. Finally, we correct or encourage their behavior. Thus the child learns to deal with variable situations.
@yakmage8085Күн бұрын
Love it
@geldokuКүн бұрын
This video must be important, but I don't understand a word of it.
@drdca8263Күн бұрын
22:30 : hm, is it even grokking at that point? This sounds like maybe grokking is just, “the NN after being stuck in a rut, eventually gets lucky and stops being stuck”? Ah. Cliffhanger.
@oliverdaniels4961Күн бұрын
This could be as big a jump as transformers, or overcoming the gradient explosion in DNNs
@Anonymous-lw1zyКүн бұрын
Why not just run softmax at far higher precision (double: 64-bit float, 53-bit significand? quadruple: 128-bit float, 113-bit significand?) so you don't get precisely 1, 0, 0, ... from the softmax? Or, rather than compute softmax with exp(), use an adjustable steepness function, adjusting it over time much like learning rates are adjusted over time, with a control loop to keep it away from getting pinned to 1, 0, 0, ... (OK, I read the paper; these are suggested solutions.)
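For illustration, a tiny sketch of the saturation effect and those two ideas (higher precision, and an adjustable steepness via a temperature); the logit values are arbitrary:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = (logits - logits.max()) / temperature   # subtract the max for stability; temperature = steepness knob
    e = np.exp(z)
    return e / e.sum()

# A huge logit gap, as appears once the model keeps scaling up its weights.
logits = np.array([200.0, 0.0, -5.0])

print(softmax(logits.astype(np.float32)))                     # [1. 0. 0.]              exp underflows: exactly one-hot
print(softmax(logits.astype(np.float64)))                     # [1.0  1.4e-87  9.3e-90]  the tail survives at higher precision
print(softmax(logits.astype(np.float32), temperature=10.0))   # [1.0  2.1e-09  1.3e-09]  nonzero tail even in float32
```

As the reply below notes, higher precision helps but doesn't remove the intrinsic problem.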
@tbirdal18 сағат бұрын
Good point and we did experiment with these in the paper. While it helps, you cannot get rid of the intrinsic problem by simply using higher precision.
@robtaylor144414 сағат бұрын
@@tbirdal Have you looked at posits?
@tikendrawКүн бұрын
Let's say a company hires you as a data science researcher. With your current knowledge, and the same limitations as the tech giants, how far could you take an LLM designed and trained from scratch?
@Sirus20x6Күн бұрын
So train a low-rank model until you run out of space to fit more learning, and slowly up the quantization?
@be1tubeКүн бұрын
This did not explain grokking to me because my question is not "why does it take so long to happen" but why does it happen at all after memorization? Why is there still a training signal after the network has memorized the training examples?
@fackarov941223 сағат бұрын
The model memorizes the training data using "primitive" structures. With more training the model continues to evolve, and if better structures emerge to represent the data, it exploits them; if it finds the "exact" data structure, grokking kicks in. Once a better structure is found, it opens up an exponential space of "good" candidate structures, so it keeps getting better and better.
@mircorichter137521 сағат бұрын
People use the term "grokking" differently. Here, and in most research, 'grokking' is not the generalization AFTER the delay but the delay before the expected generalization. So a model that generalizes on par with or soon after memorizing the training data does not grok, while a model groks if there is a delay...
@mircorichter137521 сағат бұрын
That is arguably counterintuitive, given the meaning of the word 'grokking'.
@daPawlak21 сағат бұрын
@@mircorichter1375 So is it just a problem of getting a result that should be achievable without that delay, or is there additional capability provided by that additional compute time?
@be1tube20 сағат бұрын
@@fackarov9412 What was confusing me is the 100% training accuracy. However, I now have a hypothesis: the graph shown in these situations plots training accuracy, not training loss. So there is a signal (the difference between an output of (.2, .2, .6) and (0, 0, 1)); it's just not shown in the graph.
@be1tubeКүн бұрын
There was a paper many months ago showing that boosting certain frequency components of the gradients (I think it was a low-pass filter, but it could have been the opposite) skipped most of the delay before grokking.
@MatthewKelley-mq4ceКүн бұрын
Yep. I remember that one.
@TheDarkhawk243Күн бұрын
Why do you never link papers?
@mohl-bodell2948Күн бұрын
What a cliffhanger...
@polyscopesКүн бұрын
For real haha, LLM training getting 2x+ cheaper overnight
@hdot2613Күн бұрын
Such a tease 😂
@breakablecКүн бұрын
Sounds like: 1. once the error becomes low, you want to stop the maximisation of the already-optimal parameters so that there is still a gradient, and 2. you want to use number formats with enough fractional precision that the softmax doesn't get rounded to exactly 1 and 0.
@KilgoreTroutAsfКүн бұрын
This is the vanishing gradient problem all over again.
@yeetdeetsКүн бұрын
Exploding in this case though, right?
@ChaseFreedomMusicianКүн бұрын
I'm really looking forward to part 2!!
@rubncarmonaКүн бұрын
This makes me think of nGPT from Nvidia a bunch
@andikunar7183Күн бұрын
Thanks, AMAZING content, WOW! Does this mean, in non-mathematical/layman's terms, that good, smaller-context knowledge samples (decreased dimensionality) during training help with grokking?
@polyscopesКүн бұрын
I think he was saying that the decreased dimensionality helped prevent memorization, forcing the model to generalize sooner instead of memorizing the training data first and then learning to generalize.
@mloewen248Күн бұрын
Wouldn't simply adding noise to the input data then solve this issue after repeated epochs?
@HoldMyDataКүн бұрын
So now what, am I just going to waste my time waiting on training runs? @24:35 😅 Great explanation, thanks. I remember Nous or someone, back when all of this first started, saying "OK, so we just keep going and going and eventually it works." This makes it understandable.