Interesting, but it's not entirely clear how R1 works...
@sameerreddy1085 · 5 days ago
Awesome paper and video! I didn't realize what a cool innovation the NTK approach was and the fact that it first originated on reddit r/LocalLlama! Crazy (awesome) times.
@EkShunya · 6 days ago
great one again :)
@michaeltse321 · 6 days ago
Basically, OpenAI used DeepSeek research to develop o1 and o3
@legallyregarded · 5 days ago
Asian counterpart of "We wuz kingz"
@amerramday · 6 days ago
It's Chinese engineers against Chinese engineers....😂
@RajajaTube · 6 days ago
You were doing so well until you "inadvertently" used dismissive language. Hindsight is always 20/20. Why didn't I think of this?!
@legallyregarded · 6 days ago
You're a troglodyte
@niazhimselfangels · 5 days ago
Hope you took more away from this video than the comment you left :) This was a super informative video for me.
@layer4down · 3 days ago
It was clearly meant as a compliment to the DeepSeek team.
@johnk7025 · 6 days ago
Liked, for the details in this review.
@mattiasfagerlund · 6 days ago
Excellent!
@autohmae · 6 days ago
What is most important (as some people claim, BIG IF: IF this is true) is that the RL part was largely trained on synthetic data. Because if that is true, the more time and compute you throw at it, the more performance you can probably get. Which means the sky is the limit. Which is maybe what o3 is...?
@AurobindoTripathy · 6 days ago
Skip the prelude and go straight to 3:42
Lot of preamble, skip to 14:50
Base Model V3 Innovations: MoE Top-k vs Top-1 Routing: 18:50
Base Model V3 Innovations: Parallelization: 23:23
Base Model V3 Innovations: Use of FP8 training: 28:02
DeepSeek R1: 35:39
Quotable 40:39: "infuriatingly simple...write down a checker (verifiability)...run a model against that checker...it'll learn to get the right answer and do it in a way using its CoT to think about the answer"
@RamaRamakrishnan · 6 days ago
Fantastic review!!
@johnmorgun9961 · 6 days ago
Western-led technology communities??? For whom?? DeepSeek sounds like a game-changer for the LLM story. This is a very expensive story.
@abdeslamkabiri9392 · 6 days ago
Thanks Sasha 🎉😊
@edzq9155 · 6 days ago
Thank you so much Sasha!!!
@adrienforbu5165 · 6 days ago
amazing review
@Dpraj049 · 6 days ago
I am not able to understand the compress function. Can someone please explain?
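For readers stuck on the same point: a minimal numpy sketch of what the puzzle's compress is (as I understand it) doing, namely gathering the mask-selected values to the front of a zero-padded vector. `compress_ref` and the sample inputs are my own illustration, not the video's code.

```python
import numpy as np

def compress_ref(g, v, i):
    """Move the values of v selected by boolean mask g to the front
    of a length-i vector, padding the remainder with zeros."""
    out = np.zeros(i, dtype=v.dtype)
    kept = v[g]                 # selected values, original order kept
    out[:len(kept)] = kept
    return out

g = np.array([False, True, False, True, True])
v = np.array([10, 20, 30, 40, 50])
print(compress_ref(g, v, 5))    # [20 40 50  0  0]
```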
@jfokjfoodkdnnkfkfnnksolskn · 10 days ago
followup?
@srush_nlp · 9 days ago
Coming soon!
@PyJam · 15 days ago
13 (pad_to) could also be: `where(arange(j) < i, a[arange(j) % i], 0)`
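A numpy transliteration of that one-liner, assuming puzzle 13's pad_to pads (or truncates) a length-i vector a to length j: the `% i` keeps the gather in-bounds and the `where` zeros everything past the original length. The wrapper name `pad_to` is my own.

```python
import numpy as np

def pad_to(a, j):
    i = a.shape[0]
    idx = np.arange(j)
    # a[idx % i] cycles through a so indexing is always in-bounds;
    # where(idx < i, ..., 0) zeros out the padded tail
    return np.where(idx < i, a[idx % i], 0)

print(pad_to(np.array([1, 2, 3]), 5))   # [1 2 3 0 0]
print(pad_to(np.array([1, 2, 3]), 2))   # [1 2]
```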
@supervaka9584 · 22 days ago
```python
def bucketize(v: TT["i"], boundaries: TT["j"]) -> TT["i"]:
    i = v.shape[0]
    j = boundaries.shape[0]
    return (1 * (v[:, None] >= boundaries)) @ ones(j)
```
Is this a matmul of shapes (i, j) and (j,) resulting in (i,)? I'm not used to this because when it comes to matmuls I think of row vectors having a shape of (1, j). Yet when I try `ones(i) @ (1 * (v[:, None] >= boundaries))` I get a (j,) tensor back. So in this case it is interpreted as a matmul of shapes (1, i) and (i, j)? What is going on here?
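What's going on is numpy's (and PyTorch's) rule for 1-D operands of `@`: a 1-D array on the right is treated as a column vector and the trailing dimension is dropped from the result; on the left it is treated as a row vector and the leading dimension is dropped. A small sketch with concrete values (my own example, standing in for the puzzle's tensors):

```python
import numpy as np

v = np.array([0.5, 2.5, 7.0])           # shape (i,) = (3,)
boundaries = np.array([1.0, 3.0, 5.0])  # shape (j,) = (3,)

M = 1 * (v[:, None] >= boundaries)      # shape (i, j)

# right operand 1-D: (i, j) @ (j,) -> (i,); sums each row,
# i.e. counts the boundaries each v crosses
print(M @ np.ones(3))           # [0. 1. 3.]

# left operand 1-D: (i,) @ (i, j) -> (j,); sums each column
print((np.ones(3) @ M).shape)   # (3,)
```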
@sakshamguptasakshamroyal · 25 days ago
Great video. One question though: isn't the computation of attention 2TD instead of TD? I ask because when you do QK^T you compute TD, and then when you multiply with the V values you again multiply TD, making it 2TD for each time step.
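Counting multiply-adds the way this comment does (a back-of-envelope sketch, not necessarily the video's accounting; the T and D values below are arbitrary):

```python
# For one query token attending over context length T with head
# dimension D, the two matmuls the comment names each cost T*D
# multiply-adds (projections and softmax ignored).
T, D = 1024, 128

qk = T * D   # q of shape (1, D) against K^T of shape (D, T)
av = T * D   # scores of shape (1, T) against V of shape (T, D)

print(qk + av == 2 * T * D)   # True
```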
@ZenBen_the_Elder · 27 days ago
'The innovations driving rapid AI research' [9:41-12:34]
@burnytech · 29 days ago
Shouldn't the equation at 18:07 be E_{y~p(·|z_{1:t}, x)}[Ver(y)]? I.e., adding z_{1:t} into the expectation's subscript.
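In LaTeX, the suggested correction (my transcription of the commenter's notation) reads:

```latex
\mathbb{E}_{y \sim p(\,\cdot \mid z_{1:t},\, x)}\left[\mathrm{Ver}(y)\right]
```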
@skanderbegvictor6487 · a month ago
Nice
@andriymulyar8013 · a month ago
🔥
@AndrewRafas · a month ago
Why do you think that spamming subscribers' "Subscription" lists with a dozen videos a day is something they want?
@srush_nlp · a month ago
Sorry, didn't realize the Shorts went to the standard subscription list. YouTube docs are confusing.
@Tunadorable · a month ago
lmao scroll your finger an extra inch on your mouse to find the rest of your subscriptions. no reason to complain about an above-normal amount of content
@Tunadorable · a month ago
Hype playlist 😤 How're you finding Shorts as opposed to horizontal video? Tried it for a while and it didn't seem worth the extra effort, but I'm considering revisiting the idea
@srush_nlp · a month ago
Probably should have just made a regular video :) got a bunch of complaints about the shorts
@Tunadorable · a month ago
@@srush_nlp yeah, it definitely seems like people are only looking for mindless, broadly-appealing stuff in vertical video given the short duration. I'd bet for this series you could stitch them all back to back into one single video, switch all the diagrams/bullet-points to a horizontal layout, and it'd perform better, if not in views then at least in RPM
@ivyzhang450 · 13 days ago
Would love if you compiled it into a video anyways!
@vk0508-x7r · a month ago
That's what I am looking for. Thanks
@gergerger53 · a month ago
"When paralyzing a neural network...." 🤨 -- it's a common thing to hear in ML videos and podcasts due to relaxed speech patterns (though I believe some people think it's genuinely the correct way to say "parallelizing"), but it always catches me off guard and sounds so incorrect. Random thought of the day. Looking forward to the rest of the videos in this series. Thanks for uploading!
@srush_nlp · a month ago
I'm hoping that by next year the speech AI corrector will just correct it for me
@fingerstyleguitarjustingao729 · a month ago
🎉
@jennyvega3864 · a month ago
Thanks for this video! It's really useful. Something I don't fully understand is how for Fine-tuning (bottom) at 10:07 you get 2x5BxD for the activations. Could you please help me to understand it?
@sakshamguptasakshamroyal · 26 days ago
I believe it should just be 5BxD, right? I was confused myself. Activations need to be stored for all layers, and that's just 5BxD
@altayyk · a month ago
Thank you, this is super useful and very insightful! As a new industrial phd student with limited support from my professor and not a lot of expertise in my specific field (3D generative models) in the team, I really appreciate this video.
@Veptis · a month ago
Welcome to the beautiful world of shader coding... has your brain started to see the world around you in little code snippets of procedural functions? I wrote my bachelor thesis on how awful language models are at shader code completion; it has also been accepted at a conference workshop now and will get published eventually. I didn't look at instruction models or closed models, so it's promising to see Claude being somewhat useful even with WGSL. There is a leaderboard for all the open models I run on HF; look for a space called Shadermatch. While I didn't start the python implementation, I've worked on it for the past year and a half and still do. The repo might look a bit abandoned, but there is a large PR in the works that adds more functionality that is present on the website: multipass shaders (called Buffers on the Shadertoy website). Feedback greatly welcome
@Charles-Darwin · a month ago
This is probably the 3rd time I've watched this since you posted it. I still can't shake it; you capture the 5Ws+H perfectly imo. It seems to have been overlooked in general, as I keep seeing people conflating things across models and architectures, which I find annoying. It could be a byproduct of the development cadence or the many hype trains littering the internet, but I always feel this is so significant that it's worth trying to portray it in such discussions. I have an itching question that I'm curious what your thoughts are on. Thinking beyond what you've covered: in the same manner that LLMs derive these language patterns and are then leveraged, do you think that o1, in its TTC graphs and results, along with supplying the output, is deriving a more general heuristic pattern from the process? If so, wouldn't these general heuristics effectively apply to different scopes and across domains, at least eventually?
@420_gunna · a month ago
Coming back a month later, this is still the goat. (We've gotten more of a consensus since then about what's happening, but who's to say that some of these strategies aren't (or couldn't) be used at train-time [though, in usual OAI fashion, it's probably just the simple thing scaled up: RL with ORMs].)
@소금-v8z · a month ago
Hey, great video! I've been trying to wrap my head around O1 for a while, and this really helped me put things into perspective. I'm surprised I haven't seen more discussion about using special tokens for reasoning. It seems like trying to generate these really long, abstract sequences for reasoning can be difficult and hard to evaluate. I have a strong feeling that we could make LLMs more stable by using special tokens as anchors to keep them from going down the wrong path during reasoning.
@JustSayin24 · a month ago
Since you strike me as the type of person who responds well to thoughtful feedback: It would've been nice to have a 90-second overview of the paper and its core ideas before starting. I know this isn't a video about GPT-Q, but I'd have found it useful context when going through the sections and studying the authors' approach to writing. Regardless, another amazing video!
@Louis-f8d · 2 months ago
Hi Sasha, thank you for Modules 1 & 2. That's really interesting! Can't wait to enjoy Modules 3 & 4!
@jaewooklee5844 · 2 months ago
Thank you so much for your detailed information. 🙏
@JustSayin24 · 2 months ago
I love the fact that not only does this research exist, but someone went through the effort to distil it in such an intelligible way. Thank you!
@abitintostep · 2 months ago
Did you guys just recently read the original BERT paper, added a random masking, a few repeats and done?
@srush_nlp · 2 months ago
Yeah! Just read it over the summer. Good paper.
@TonyBell-ye4hw · 2 months ago
Sasha is the tops.
@mindhoc · 2 months ago
🎉❤terrific video, thank you
@varunsai9736 · 2 months ago
Amazing talk, professor. I recently attended your talk at Penn State.
@bobsoup2319 · 2 months ago
I’d really like to see this combined with nGPT as that model seemed to innately have an ability to generalize to out of distribution sequence lengths
@ASarkar-ML · 2 months ago
@srush_nlp Great explanation! How do you think discrete diffusion models should be modified to enable long context sequence generation comparable to LLMs?
@MultiBussen · a month ago
See 4.2 in the paper! It talks about how to use MDLM for autoregressive modeling, which results in text of arbitrary length
@wiktorm9858 · 2 months ago
Cool lecture, thanks!
@SLAM2977 · 2 months ago
The o1 test-time compute plot's x-axis is on a log scale; that means you need exponential compute for a linear improvement, so it will grind to a halt
@francisco444 · 2 months ago
Hence the 7 Tril bet
@diophantine1598 · 2 months ago
They apparently only just started scaling this. For example, there’s no reason that this couldn’t be applied to writing other than the fact that it is difficult to craft a reward signal for it. Saying that they’ll quickly hit a wall now would be like saying the same when we were at GPT-2. Sure, it’ll eventually happen, but we’re a ways off from it happening.
@Sams-li8tj · 2 months ago
Great video! Looking forward to more academia advice content.
@boussouarsari4482 · 2 months ago
My way of doing compress!
```python
def compress(g: TT["i", bool], v: TT["i"], i: int) -> TT["i"]:
    small = v[g]
    i = g.shape[0]
    j = small.shape[0]
    return small @ eye(i)[:j, :]
```
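Run on a tiny numpy example (numpy's `eye` standing in for the puzzle's `eye`, and the sample g and v my own), the final matmul scatters the kept values into the front of a zero-padded vector:

```python
import numpy as np

g = np.array([False, True, True, False])
v = np.array([1.0, 2.0, 3.0, 4.0])

small = v[g]                        # [2., 3.]
i, j = g.shape[0], small.shape[0]

# small (j,) @ eye(i)[:j, :] (j, i): each kept value is multiplied
# against one of the first j identity rows, landing in slot 0..j-1
out = small @ np.eye(i)[:j, :]
print(out)                          # [2. 3. 0. 0.]
```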