Interesting, but it's not entirely clear how R1 works...
@sameerreddy1085 · 5 days ago
Awesome paper and video! I didn't realize what a cool innovation the NTK approach was and the fact that it first originated on reddit r/LocalLlama! Crazy (awesome) times.
@EkShunya · 6 days ago
great one again :)
@michaeltse321 · 6 days ago
Basically, OpenAI used DeepSeek research to develop o1 and o3
@legallyregarded · 5 days ago
Asian counterpart of "We wuz kingz"
@amerramday · 6 days ago
It's Chinese engineers against Chinese engineers....😂
@RajajaTube · 6 days ago
You were doing so well until you "inadvertently" used dismissive language. Hindsight is always 20/20. Why didn't I think of this?!
@legallyregarded · 6 days ago
You're a troglodyte
@niazhimselfangels · 5 days ago
Hope you took more away from this video than the comment you left :) This was a super informative video for me.
@layer4down · 3 days ago
It was clearly meant as a compliment to the DeepSeek team.
@johnk7025 · 6 days ago
Liked, for the details in this review.
@mattiasfagerlund · 6 days ago
Excellent!
@autohmae · 6 days ago
What is most important (as some people claim, BIG IF: IF this is true) is that the RL part was largely trained on synthetic data. Because if that is true, the more time and compute you throw at it, the more performance you can probably get. Which means the sky is the limit. Which is maybe what o3 is...?
@AurobindoTripathy · 6 days ago
Skip the prelude and go straight to 3:42
Lot of preamble, skip to 14:50
Base Model V3 Innovations: MoE Top-k vs Top-1 Routing: 18:50
Base Model V3 Innovations: Parallelization: 23:23
Base Model V3 Innovations: Use of FP8 training: 28:02
DeepSeek R1: 35:39
Quotable 40:39: "infuriatingly simple...write down a checker (verifiability)...run a model against that checker...it'll learn to get the right answer and do it in a way using its CoT to think about the answer"
@RamaRamakrishnan · 6 days ago
Fantastic review!!
@johnmorgun9961 · 6 days ago
Western-led technology communities??? For whom?? DeepSeek sounds like a game-changer for the LLM story. This is a very expensive story.
@abdeslamkabiri9392 · 6 days ago
Thanks Sasha 🎉😊
@edzq9155 · 6 days ago
Thank you so much Sasha!!!
@adrienforbu5165 · 6 days ago
amazing review
@Dpraj049 · 6 days ago
I am not able to understand the compress function. Can someone please explain?
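For readers stuck on the same point: a minimal numpy sketch of what the puzzle's compress is (as I understand it) doing, namely gathering the mask-selected values to the front of a zero-padded vector. `compress_ref` and the sample inputs are my own illustration, not the video's code.

```python
import numpy as np

def compress_ref(g, v, i):
    """Move the values of v selected by boolean mask g to the front
    of a length-i vector, padding the remainder with zeros."""
    out = np.zeros(i, dtype=v.dtype)
    kept = v[g]                 # selected values, original order kept
    out[:len(kept)] = kept
    return out

g = np.array([False, True, False, True, True])
v = np.array([10, 20, 30, 40, 50])
print(compress_ref(g, v, 5))    # [20 40 50  0  0]
```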
@jfokjfoodkdnnkfkfnnksolskn · 10 days ago
followup?
@srush_nlp · 9 days ago
Coming soon!
@PyJam · 15 days ago
13 (pad_to) could also be: `where(arange(j) < i, a[arange(j) % i], 0)`
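A numpy transliteration of that one-liner, assuming puzzle 13's pad_to pads (or truncates) a length-i vector a to length j: the `% i` keeps the gather in-bounds and the `where` zeros everything past the original length. The wrapper name `pad_to` is my own.

```python
import numpy as np

def pad_to(a, j):
    i = a.shape[0]
    idx = np.arange(j)
    # a[idx % i] cycles through a so indexing is always in-bounds;
    # where(idx < i, ..., 0) zeros out the padded tail
    return np.where(idx < i, a[idx % i], 0)

print(pad_to(np.array([1, 2, 3]), 5))   # [1 2 3 0 0]
print(pad_to(np.array([1, 2, 3]), 2))   # [1 2]
```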
@supervaka9584 · 22 days ago
```python
def bucketize(v: TT["i"], boundaries: TT["j"]) -> TT["i"]:
    i = v.shape[0]
    j = boundaries.shape[0]
    return (1 * (v[:, None] >= boundaries)) @ ones(j)
```
Is this a matmul of shapes (i, j) and (j,) resulting in (i,)? I'm not used to this because when it comes to matmuls I think of row vectors having a shape of (1, j). Yet when I try `ones(i) @ (1 * (v[:, None] >= boundaries))` I get a (j,) tensor back. So in this case it is interpreted as a matmul of shapes (1, i) and (i, j)? What is going on here?
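What's going on is numpy's (and PyTorch's) rule for 1-D operands of `@`: a 1-D array on the right is treated as a column vector and the trailing dimension is dropped from the result; on the left it is treated as a row vector and the leading dimension is dropped. A small sketch with concrete values (my own example, standing in for the puzzle's tensors):

```python
import numpy as np

v = np.array([0.5, 2.5, 7.0])           # shape (i,) = (3,)
boundaries = np.array([1.0, 3.0, 5.0])  # shape (j,) = (3,)

M = 1 * (v[:, None] >= boundaries)      # shape (i, j)

# right operand 1-D: (i, j) @ (j,) -> (i,); sums each row,
# i.e. counts the boundaries each v crosses
print(M @ np.ones(3))           # [0. 1. 3.]

# left operand 1-D: (i,) @ (i, j) -> (j,); sums each column
print((np.ones(3) @ M).shape)   # (3,)
```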
@sakshamguptasakshamroyal · 25 days ago
Great video. One question though: isn't the computation of attention 2TD instead of TD? I ask because when you do QK^T you compute TD, and then when you multiply with the V values you again multiply TD, making it 2TD for each time step.
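Counting multiply-adds the way this comment does (a back-of-envelope sketch, not necessarily the video's accounting; the T and D values below are arbitrary):

```python
# For one query token attending over context length T with head
# dimension D, the two matmuls the comment names each cost T*D
# multiply-adds (projections and softmax ignored).
T, D = 1024, 128

qk = T * D   # q of shape (1, D) against K^T of shape (D, T)
av = T * D   # scores of shape (1, T) against V of shape (T, D)

print(qk + av == 2 * T * D)   # True
```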
@ZenBen_the_Elder · 27 days ago
'The innovations driving rapid AI research' [9:41-12:34]
@burnytech · 29 days ago
Shouldn't the equation at 18:07 be E_{y~p(·|z_{1:t}, x)}[Ver(y)]? I.e., adding z_{1:t} into the expectation's subscript.
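In LaTeX, the suggested correction (my transcription of the commenter's notation) reads:

```latex
\mathbb{E}_{y \sim p(\,\cdot \mid z_{1:t},\, x)}\left[\mathrm{Ver}(y)\right]
```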
@skanderbegvictor6487 · a month ago
Nice
@andriymulyar8013 · a month ago
🔥
@AndrewRafas · a month ago
Why do you think that spamming subscribers' "Subscription" lists with a dozen videos a day is something they want?
@srush_nlp · a month ago
Sorry, didn't realize the Shorts went to the standard subscription list. YouTube docs are confusing.
@Tunadorable · a month ago
lmao scroll your finger an extra inch on your mouse to find the rest of your subscriptions. no reason to complain about an above-normal amount of content
@Tunadorable · a month ago
Hype playlist 😤 How're you finding Shorts as opposed to horizontal video? Tried it for a while and it didn't seem worth the extra effort, but I'm considering revisiting the idea
@srush_nlp · a month ago
Probably should have just made a regular video :) got a bunch of complaints about the shorts
@Tunadorable · a month ago
@@srush_nlp yeah, it definitely seems like people are only looking for mindless, broadly-appealing stuff in vertical video given the short duration. I'd bet for this series you could stitch them all back to back into one single video, switch all the diagrams/bullet-points to a horizontal layout, and it'd perform better, if not in views then at least in RPM
@ivyzhang450 · 13 days ago
Would love if you compiled it into a video anyways!
@vk0508-x7r · a month ago
That's what I am looking for. Thanks
@gergerger53 · a month ago
"When paralyzing a neural network...." 🤨 -- it's a common thing to hear in ML videos and podcasts due to relaxed speech patterns (though I believe some people think it's genuinely the correct way to say "parallelizing"), but it always catches me off guard and sounds so incorrect. Random thought of the day. Looking forward to the rest of the videos in this series. Thanks for uploading!
@srush_nlp · a month ago
I'm hoping that by next year the speech AI corrector will just correct it for me
@fingerstyleguitarjustingao729 · a month ago
🎉
@jennyvega3864 · a month ago
Thanks for this video! It's really useful. Something I don't fully understand is how for Fine-tuning (bottom) at 10:07 you get 2x5BxD for the activations. Could you please help me to understand it?
@sakshamguptasakshamroyal · 26 days ago
I believe it should just be 5BxD, right? I was confused myself. Activations need to be stored for all layers, and that's just 5BxD
@altayyk · a month ago
Thank you, this is super useful and very insightful! As a new industrial phd student with limited support from my professor and not a lot of expertise in my specific field (3D generative models) in the team, I really appreciate this video.
@Veptis · a month ago
Welcome to the beautiful world of shader coding... has your brain started to see the world around you in little code snippets of procedural functions? I wrote my bachelor thesis on how awful language models are at shader code completion; it has also been accepted at a conference workshop now and will get published eventually. I didn't look at instruction models or closed models, so it's promising to see Claude being somewhat useful even with WGSL. There is a leaderboard for all the open models I run on HF; look for a space called Shadermatch. While I didn't start the python implementation, I've worked on it for the past year and a half and still do. The repo might look a bit abandoned, but there is a large PR in the works that adds more functionality that is present on the website: multipass shaders (called Buffers on the Shadertoy website). Feedback greatly welcome
@Charles-Darwin · a month ago
This is probably the 3rd time I've watched this since you posted it. I still can't shake it; you capture the 5Ws+H perfectly imo. It seems to have been overlooked in general, as I keep seeing people conflating things across models and architectures, which I find annoying. It could be a byproduct of the development cadence or the many hype trains littering the internet, but I always feel this is so significant that it's worth trying to portray it in such discussions. I have an itching question that I'm curious what your thoughts are on. Thinking beyond what you've covered: in the same manner that LLMs derive these language patterns and are then leveraged, do you think that o1, in its TTC graphs and results, along with supplying the output, is deriving a more general heuristic pattern from the process? If so, wouldn't these general heuristics effectively apply to different scopes and across domains, at least eventually?
@420_gunna · a month ago
Coming back a month later, this is still the goat. (We've gotten more of a consensus since then about what's happening, but who's to say that some of these strategies aren't (or couldn't) be used at train-time [though, in usual OAI fashion, it's probably just the simple thing scaled up: RL with ORMs].)
@소금-v8z · a month ago
Hey, great video! I've been trying to wrap my head around O1 for a while, and this really helped me put things into perspective. I'm surprised I haven't seen more discussion about using special tokens for reasoning. It seems like trying to generate these really long, abstract sequences for reasoning can be difficult and hard to evaluate. I have a strong feeling that we could make LLMs more stable by using special tokens as anchors to keep them from going down the wrong path during reasoning.
@JustSayin24 · a month ago
Since you strike me as the type of person who responds well to thoughtful feedback: It would've been nice to have a 90-second overview of the paper and its core ideas before starting. I know this isn't a video about GPT-Q, but I'd have found it useful context when going through the sections and studying the authors' approach to writing. Regardless, another amazing video!
@Louis-f8d · 2 months ago
Hi Sasha, thank you for Modules 1 & 2. That's really interesting! Can't wait to enjoy Modules 3 & 4!
@jaewooklee5844 · 2 months ago
Thank you so much for your detailed information. 🙏
@JustSayin24 · 2 months ago
I love the fact that not only does this research exist, but someone went through the effort to distil it in such an intelligible way. Thank you!
@abitintostep · 2 months ago
Did you guys just recently read the original BERT paper, added a random masking, a few repeats and done?
@srush_nlp · 2 months ago
Yeah! Just read it over the summer. Good paper.
@TonyBell-ye4hw · 2 months ago
Sasha is the tops.
@mindhoc · 2 months ago
🎉❤terrific video, thank you
@varunsai9736 · 2 months ago
Amazing talk, professor. I recently attended your talk at Penn State.
@bobsoup2319 · 2 months ago
I’d really like to see this combined with nGPT as that model seemed to innately have an ability to generalize to out of distribution sequence lengths
@ASarkar-ML · 2 months ago
@srush_nlp Great explanation! How do you think discrete diffusion models should be modified to enable long context sequence generation comparable to LLMs?
@MultiBussen · a month ago
See 4.2 in the paper! It talks about how to use MDLM for autoregressive modeling, which results in text of arbitrary length
@wiktorm9858 · 2 months ago
Cool lecture, thanks!
@SLAM2977 · 2 months ago
The o1 test-time compute plot's x-axis is on a log scale; that means you need exponential compute for a linear improvement, so it will grind to a halt
@francisco444 · 2 months ago
Hence the 7 Tril bet
@diophantine1598 · 2 months ago
They apparently only just started scaling this. For example, there’s no reason that this couldn’t be applied to writing other than the fact that it is difficult to craft a reward signal for it. Saying that they’ll quickly hit a wall now would be like saying the same when we were at GPT-2. Sure, it’ll eventually happen, but we’re a ways off from it happening.
@Sams-li8tj · 2 months ago
Great video! Looking forward to more academia advice content.
@boussouarsari4482 · 2 months ago
My way of doing compress!
```python
def compress(g: TT["i", bool], v: TT["i"], i: int) -> TT["i"]:
    small = v[g]
    i = g.shape[0]
    j = small.shape[0]
    return small @ eye(i)[:j, :]
```
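Run on a tiny numpy example (numpy's `eye` standing in for the puzzle's `eye`, and the sample g and v my own), the final matmul scatters the kept values into the front of a zero-padded vector:

```python
import numpy as np

g = np.array([False, True, True, False])
v = np.array([1.0, 2.0, 3.0, 4.0])

small = v[g]                        # [2., 3.]
i, j = g.shape[0], small.shape[0]

# small (j,) @ eye(i)[:j, :] (j, i): each kept value is multiplied
# against one of the first j identity rows, landing in slot 0..j-1
out = small @ np.eye(i)[:j, :]
print(out)                          # [2. 3. 0. 0.]
```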