Long-Context LLM Extension
25:45
4 months ago
Hands on Human-AI Coding
11:35
4 months ago
Street Fighting Transformers
25:13
Simple Diffusion Language Models
15:08
Do we need Attention? A Mamba Primer
33:50
Comments
@barisdenizsaglam
@barisdenizsaglam 1 day ago
Thank you so much. Your channel is a gem.
@user---------
@user--------- 4 days ago
Interesting, but it's not entirely clear how R works...
@sameerreddy1085
@sameerreddy1085 5 days ago
Awesome paper and video! I didn't realize what a cool innovation the NTK approach was, or that it first originated on Reddit's r/LocalLlama! Crazy (awesome) times.
@EkShunya
@EkShunya 6 days ago
Great one again :)
@michaeltse321
@michaeltse321 6 days ago
Basically OpenAI used DeepSeek research to develop o1 and o3
@legallyregarded
@legallyregarded 5 days ago
Asian counterpart of "We wuz kingz"
@amerramday
@amerramday 6 days ago
It's Chinese engineers against Chinese engineers....😂
@RajajaTube
@RajajaTube 6 days ago
You were doing so well until you "inadvertently" used dismissive language. Hindsight is always 20/20. Why didn't I think of this?!
@legallyregarded
@legallyregarded 6 days ago
You're a troglodyte
@niazhimselfangels
@niazhimselfangels 5 days ago
Hope you took more away from this video than the comment you left :) This was a super informative video for me.
@layer4down
@layer4down 3 days ago
It was clearly meant as a compliment to the Deepseek team.
@johnk7025
@johnk7025 6 days ago
Liked, for the details in this review.
@mattiasfagerlund
@mattiasfagerlund 6 days ago
Excellent!
@autohmae
@autohmae 6 days ago
What is most important (as some people claim, and it's a BIG IF whether this is true): the RL part was largely trained on synthetic data. Because if that is true, the more time and compute you throw at it, the more performance you can probably get. Which means the sky is the limit. Which is maybe what o3 is...?
@AurobindoTripathy
@AurobindoTripathy 6 days ago
Skip the prelude and go straight to 3:42
Lots of preamble, skip to 14:50
Base Model V3 Innovations: MoE Top-k vs Top-1 Routing: 18:50
Base Model V3 Innovations: Parallelization: 23:23
Base Model V3 Innovations: Use of FP8 training: 28:02
DeepSeek R1: 35:39
Quotable, 40:39: "infuriatingly simple... write down a checker (verifiability)... run a model against that checker... it'll learn to get the right answer and do it in a way using its CoT to think about the answer"
@RamaRamakrishnan
@RamaRamakrishnan 6 days ago
Fantastic review!!
@johnmorgun9961
@johnmorgun9961 6 days ago
Western-led technology communities??? For whom?? DeepSeek sounds like a game for the LLM story. This is a very expensive story.
@abdeslamkabiri9392
@abdeslamkabiri9392 6 days ago
Thanks Sasha 🎉😊
@edzq9155
@edzq9155 6 days ago
Thank you so much Sasha!!!
@adrienforbu5165
@adrienforbu5165 6 days ago
amazing review
@Dpraj049
@Dpraj049 6 days ago
I am not able to understand the compress function. Can someone please explain?
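A minimal plain-PyTorch sketch of what the compress puzzle seems to ask for: keep the entries of v where g is True, packed to the front, with zeros elsewhere. The function name is illustrative, and this ignores the puzzle's restriction to broadcasting-style ops:

```python
import torch

def compress_reference(g: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Keep v's entries where g is True, moved to the front; pad the rest with zeros.
    out = torch.zeros_like(v)
    kept = v[g]                      # selected values, in their original order
    out[: kept.shape[0]] = kept
    return out

g = torch.tensor([False, True, False, True])
v = torch.tensor([10, 20, 30, 40])
print(compress_reference(g, v))      # tensor([20, 40,  0,  0])
```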
@jfokjfoodkdnnkfkfnnksolskn
@jfokjfoodkdnnkfkfnnksolskn 10 days ago
followup?
@srush_nlp
@srush_nlp 9 days ago
Coming soon!
@PyJam
@PyJam 15 days ago
13 (pad_to) could also be: where(arange(j) < i, a[arange(j) % i], 0)
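A quick plain-PyTorch sketch of that suggestion, with illustrative names; it assumes the puzzle's where/arange behave like the torch versions below. The modulo only keeps the gather in bounds, and the mask zeroes everything at or past position i:

```python
import torch

def pad_to(a: torch.Tensor, i: int, j: int) -> torch.Tensor:
    idx = torch.arange(j)
    # a[idx % i] keeps indices in range; the mask zeroes positions >= i.
    return torch.where(idx < i, a[idx % i], torch.tensor(0))

print(pad_to(torch.tensor([1, 2, 3]), 3, 5))  # tensor([1, 2, 3, 0, 0])
print(pad_to(torch.tensor([1, 2, 3]), 3, 2))  # tensor([1, 2])
```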
@supervaka9584
@supervaka9584 22 days ago
```python
def bucketize(v: TT["i"], boundaries: TT["j"]) -> TT["i"]:
    i = v.shape[0]
    j = boundaries.shape[0]
    return (1 * (v[:, None] >= boundaries)) @ ones(j)
```
Is this a matmul of shapes (i, j) and (j,) resulting in (i,)? I'm not used to this, because when it comes to matmuls I think of row vectors as having shape (1, j). Yet when I try `ones(i) @ (1 * (v[:, None] >= boundaries))` I get a (j,) tensor back. So in this case it is interpreted as a matmul of shapes (1, i) and (i, j)? What is going on here?
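For what it's worth, this matches PyTorch/NumPy matmul semantics for 1-D operands: a 1-D tensor on the right of @ is treated as a column vector of shape (j, 1), a 1-D tensor on the left as a row vector of shape (1, i), and the added dimension is dropped from the result. A small standalone illustration (plain torch, illustrative shapes):

```python
import torch

A = torch.arange(6.).reshape(2, 3)   # shape (i, j) = (2, 3)
v_right = torch.ones(3)              # shape (j,)
v_left = torch.ones(2)               # shape (i,)

print((A @ v_right).shape)  # torch.Size([2])  -> (i, j) @ (j,)  behaves like (i, j) @ (j, 1)
print((v_left @ A).shape)   # torch.Size([3])  -> (i,) @ (i, j)  behaves like (1, i) @ (i, j)
```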
@sakshamguptasakshamroyal
@sakshamguptasakshamroyal 25 days ago
Great video. One question though: isn't the computation of attention 2TD instead of TD? I say that because when you do QK_transpose you compute in TD, and then when you multiply with the V values you again multiply with TD, making it 2TD for each time step?
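A back-of-envelope sketch under the commenter's framing (the numbers below are made up, and whether the video's TD counts one matmul or both is an assumption here); either way the 2x is only a constant factor, so the cost is still O(TD) per step:

```python
# Per-step attention cost for one new query over T cached keys/values of dimension D,
# counting multiply-accumulates (illustrative values).
T, D = 4096, 128
qk_cost = T * D              # q_t @ K^T: T dot-products of length D
av_cost = T * D              # softmax(q_t K^T) @ V: weighted sum of T length-D rows
print(qk_cost + av_cost)     # 2*T*D, i.e. still O(T*D) up to a constant factor
```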
@ZenBen_the_Elder
@ZenBen_the_Elder 27 days ago
'The innovations driving rapid AI research' [9:41-12:34]
@burnytech
@burnytech 29 days ago
Shouldn't the equation at 18:07 be E_{y ~ p(· | z_{1:t}, x)}[Ver(y)]? That is, adding z_{1:t} into the expectation's subscript.
@skanderbegvictor6487
@skanderbegvictor6487 1 month ago
Nice
@andriymulyar8013
@andriymulyar8013 1 month ago
🔥
@AndrewRafas
@AndrewRafas 1 month ago
Why do you think that spamming subscribers' "Subscription" lists with a dozen videos a day is something they want?
@srush_nlp
@srush_nlp 1 month ago
Sorry, didn't realize the Shorts went to the standard subscription list. YouTube docs are confusing.
@Tunadorable
@Tunadorable 1 month ago
lmao scroll your finger an extra inch on your mouse to find the rest of your subscriptions. no reason to complain about an above-normal amount of content
@Tunadorable
@Tunadorable 1 month ago
Hype playlist😤 How're u finding shorts as opposed to horizontal video? Tried it for a while and didn't seem worth the extra effort but I'm considering revisiting the idea
@srush_nlp
@srush_nlp 1 month ago
Probably should have just made a regular video :) got a bunch of complaints about the shorts
@Tunadorable
@Tunadorable 1 month ago
@@srush_nlp yeah it definitely seems like people are only looking for mindless, broadly-appealing stuff in vertical video given the short duration. I'd bet for this series you could stitch them all back to back into one single video, switch all the diagrams/bullet-points to a horizontal layout, and it'd perform better, if not in views then at least in RPM
@ivyzhang450
@ivyzhang450 13 days ago
Would love if you compiled it into a video anyways!
@vk0508-x7r
@vk0508-x7r 1 month ago
That's what I am looking for. Thanks
@gergerger53
@gergerger53 1 month ago
"When paralyzing a neural network...." 🤨 -- it's a common thing to hear in ML videos and podcasts due to relaxed speech patterns (though I believe some people think it's genuinely the correct way to say "parallelizing"), but it always catches me off guard and sounds so incorrect. Random thought of the day. Looking forward to the rest of the videos in this series. Thanks for uploading!
@srush_nlp
@srush_nlp 1 month ago
I'm hoping that by next year the speech AI corrector will just correct it for me
@fingerstyleguitarjustingao729
@fingerstyleguitarjustingao729 1 month ago
🎉
@jennyvega3864
@jennyvega3864 1 month ago
Thanks for this video! It's really useful. Something I don't fully understand is how for Fine-tuning (bottom) at 10:07 you get 2x5BxD for the activations. Could you please help me to understand it?
@sakshamguptasakshamroyal
@sakshamguptasakshamroyal 26 days ago
I believe it should just be 5BxD, right? I was confused myself. Activations need to be stored for all layers and that's just 5BxD
@altayyk
@altayyk 1 month ago
Thank you, this is super useful and very insightful! As a new industrial phd student with limited support from my professor and not a lot of expertise in my specific field (3D generative models) in the team, I really appreciate this video.
@Veptis
@Veptis 1 month ago
Welcome to the beautiful world of shader coding... has your brain started to see the world around you in little code snippets of procedural functions? I wrote my bachelor thesis on how awful language models are at shader code completion, which is also accepted at a conference workshop now and will get published eventually. I didn't look at instruction models or closed models, so it's promising to see Claude being somewhat useful even with WGSL. There is a leaderboard for all the open models I run on HF; look for a space called Shadermatch. While I didn't start the Python implementation, I have worked on it for the past year and a half and still do. The repo might look a bit abandoned, but there is a large PR in the works that adds more functionality that is present on the website: multipass shaders (called Buffers on the Shadertoy website). Feedback greatly welcome
@Charles-Darwin
@Charles-Darwin 1 month ago
This is probably the 3rd time I've watched this since you posted it. I still can't shake it; you capture the 5Ws+H perfectly imo. It seems to have been overlooked in general, as I keep seeing people conflating things across models and architectures - which I find annoying. It could be a byproduct of the development cadence or the many hype trains littering the internet, but I always feel this is so significant that it's worth trying to portray it in such discussions. I have this itching question that I'm curious what your thoughts are on. Thinking beyond what you've covered: in the same manner that LLMs derive these language patterns and then are leveraged, do you think that with o1, in its TTC graphs and results, along with supplying the output, there is a more general heuristic pattern being derived from the process? If so, wouldn't these general heuristics effectively apply to different scopes and across domains, at least inevitably?
@420_gunna
@420_gunna 1 month ago
Coming back a month later, this is still the goat (we've gotten more of a consensus since then about what's happening, but who's to say that some of these strategies aren't (or couldn't be) used at train time [though, in usual OAI fashion, it's probably just the simple thing scaled up -- RL with ORMs]).
@소금-v8z
@소금-v8z 1 month ago
Hey, great video! I've been trying to wrap my head around O1 for a while, and this really helped me put things into perspective. I'm surprised I haven't seen more discussion about using special tokens for reasoning. It seems like trying to generate these really long, abstract sequences for reasoning can be difficult and hard to evaluate. I have a strong feeling that we could make LLMs more stable by using special tokens as anchors to keep them from going down the wrong path during reasoning.
@JustSayin24
@JustSayin24 1 month ago
Since you strike me as the type of person who responds well to thoughtful feedback: It would've been nice to have a 90-second overview of the paper and its core ideas before starting. I know this isn't a video about GPT-Q, but I'd have found it useful context when going through the sections and studying the authors' approach to writing. Regardless, another amazing video!
@Louis-f8d
@Louis-f8d 2 months ago
Hi Sasha, Thank you for your Module 1 & 2. That's really interesting! And can't wait to enjoy Module 3 & 4!
@jaewooklee5844
@jaewooklee5844 2 months ago
Thank you so much for your detailed information. 🙏
@JustSayin24
@JustSayin24 2 months ago
I love the fact that not only does this research exist, but someone went through the effort to distil it in such an intelligible way. Thank you!
@abitintostep
@abitintostep 2 months ago
Did you guys just recently read the original BERT paper, added a random masking, a few repeats and done?
@srush_nlp
@srush_nlp 2 months ago
Yeah! just read it over the summer. good paper.
@TonyBell-ye4hw
@TonyBell-ye4hw 2 months ago
Sasha is the tops.
@mindhoc
@mindhoc 2 months ago
🎉❤terrific video, thank you
@varunsai9736
@varunsai9736 2 months ago
Amazing talk, professor. Recently attended your talk at Penn State.
@bobsoup2319
@bobsoup2319 2 months ago
I’d really like to see this combined with nGPT as that model seemed to innately have an ability to generalize to out of distribution sequence lengths
@ASarkar-ML
@ASarkar-ML 2 months ago
@srush_nlp Great explanation! How do you think discrete diffusion models should be modified to enable long context sequence generation comparable to LLMs?
@MultiBussen
@MultiBussen 1 month ago
See 4.2 in the paper! It talks about how to use MDLM for autoregressive modeling, which results in text of arbitrary length
@wiktorm9858
@wiktorm9858 2 months ago
Cool lecture, thanks!
@SLAM2977
@SLAM2977 2 months ago
The o1 test-time compute plot's x-axis is on a log scale; that means you will need exponential compute to make a linear improvement, so it will be grinding to a halt
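A tiny worked example of that reading of the plot (the slope is hypothetical, purely to illustrate the commenter's point): if accuracy grows roughly linearly in log10(compute), then each fixed gain in accuracy costs a constant multiplicative factor of compute.

```python
# Illustrative only: acc ≈ a * log10(C) + b  =>  +delta accuracy needs 10**(delta/a) x compute.
a = 5.0            # hypothetical slope: +5 points per 10x compute
delta = 5.0        # desired linear improvement in points
print(10 ** (delta / a), "x more compute needed")  # 10.0 x
```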
@francisco444
@francisco444 2 months ago
Hence the 7 Tril bet
@diophantine1598
@diophantine1598 2 months ago
They apparently only just started scaling this. For example, there’s no reason that this couldn’t be applied to writing other than the fact that it is difficult to craft a reward signal for it. Saying that they’ll quickly hit a wall now would be like saying the same when we were at GPT-2. Sure, it’ll eventually happen, but we’re a ways off from it happening.
@Sams-li8tj
@Sams-li8tj 2 months ago
Great video! Looking forward to more academia advice content.
@boussouarsari4482
@boussouarsari4482 2 months ago
My way of doing compress!
def compress(g: TT["i", bool], v: TT["i"], i: int) -> TT["i"]:
    small = v[g]
    i = g.shape[0]
    j = small.shape[0]
    return small @ eye(i)[:j, :]