Ultimate Guide To Scaling ML Models - Megatron-LM | ZeRO | DeepSpeed | Mixed Precision

26,937 views

Aleksa Gordić - The AI Epiphany


Comments: 50
@TheAIEpiphany 2 years ago
🚀 Sign up for AssemblyAI's free API token using my link 🚀 www.assemblyai.com/? I cover the fundamental ideas behind all the recent big ML models you must have heard of, like Meta's OPT-175B, BigScience's BLOOM-176B, EleutherAI's GPT-NeoX-20B, GPT-J, OpenAI's GPT-3, Google's PaLM, DeepMind's Chinchilla/Gopher models, etc. Do let me know your thoughts on this one! I'm so excited about the knowledge I've accumulated over the past weeks. You can expect exciting videos going forward - that's all I'll say! :))
@na7515 1 year ago
I'll give one criticism: this video is mostly for people who already have a basic or good understanding of the different techniques out there, so the best way to watch it is to follow along and do your own research as the instructor goes through the topics. Having said that, this is a fantastic overview of the techniques being used in industry for large-model training. It's really awesome how much you were able to cover in an 80-minute video, so huge props.
@theresabarton858 2 years ago
Very excited to see DeepSpeed covered on YouTube
@TheAIEpiphany 2 years ago
🥳🦄
@unsaturated8482 3 months ago
FYI, at 27:00 we are still adding them, just at the same time rather than at separate times. Great diagram regardless, thanks.
@btnt5209 1 year ago
18:18 The reason the skip connections are replicated is that, at the time, PyTorch did not support tensor stashing/popping. That meant the skip connections, which aren't sequential, would have required copying certain layers' outputs across multiple GPUs or nodes, which is more time-consuming than simply keeping the skip connection on each node. Now that torch.distributed supports tensor stashes, the skip connections need not be duplicated.
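To make the trade-off in the comment above concrete, here is a minimal NumPy sketch (a made-up two-stage pipeline with tiny weights, not the actual PyTorch API) of a residual add whose input originates on an earlier pipeline stage; the `stashed` tensor is what must either be forwarded along the pipeline or be recomputed/replicated on the later stage:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(4,))
W0 = rng.normal(size=(4, 4))
W1 = rng.normal(size=(4, 4))

# Stage 0 ("GPU 0"): runs its block and forwards both its output and the raw
# input needed later by the residual add (i.e. it "stashes" the skip tensor).
h = x @ W0
stashed = x  # without stash support, stage 1 would instead replicate the layers producing x

# Stage 1 ("GPU 1"): runs its block and applies the residual add.
out = stashed + h @ W1

assert np.allclose(out, x + x @ W0 @ W1)
```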
@brianpulfer4159 2 years ago
Went through the whole video. Absolutely amazing stuff Aleksa! Learning with you is very enjoyable! Never stop making videos :)
@TheAIEpiphany 2 years ago
Thanks a lot Brian!
@iNTERnazionaleNotizia589 10 months ago
Bro, I think you should start a new playlist, "Paper Walkthrough", because you can explain most deep learning papers better than my professor!
@oneman7094 2 years ago
Ah, how many times have I searched for ZeRO - finally something 🔥
@SatoBois 1 year ago
Thank you for making something that seemed so daunting so much more approachable! King behaviour 😤👌
@sacramentofwilderness6656 2 years ago
Thanks, Aleksa, for the great job! A very thorough, self-contained, and understandable explanation.
@TheAIEpiphany 2 years ago
Thank you!
@vaishnavisonawane8559 10 months ago
This is very helpful. Thanks for sharing, Aleksa!
@fouriertransformationsucks438 11 months ago
Amazing video, love it!🤗
@everythinganime867 2 years ago
Was always wondering about this. Thank you!
@TheAIEpiphany 2 years ago
You're welcome!
@beegbrain 1 year ago
Incredible knowledge in this video, thank you very much for your clear explanations!
@DistortedV12 2 years ago
You guys are smart. I don't know if I'd have the patience or a career to learn new topics on a weekly basis
@TheAIEpiphany 2 years ago
After a while you start asking yourself the opposite: would I be able to do something that doesn't involve continual learning 😅
@armish4197 2 years ago
@@TheAIEpiphany That would be a dream job
@ayushjain3391 9 months ago
Literally loved the video :)
@erkinsagroglu8519 11 months ago
Hello, this is one of the most amazing materials I've seen in years. The thing I didn't get at 28:20: why did we do a row-wise split rather than a column-wise split? What changed from the first part of the feed-forward block, where we did a vertical/column-wise split?
@saidtaghadouini6225 10 months ago
Because in the first part we had the same input (X), which was duplicated, while in the second part we already had separate inputs (Y1 and Y2), so we need to split the B matrix row-wise, otherwise we cannot compute the product. The result is Y1B1 (device 1) + Y2B2 (device 2), so we need an all-reduce to get the result on the same device.
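To make the column-then-row split concrete, here is a minimal NumPy sketch (sizes made up, `tanh` standing in for GeLU) of the Megatron-style two-layer MLP partitioning described above, where the final `+` plays the role of the all-reduce:

```python
import numpy as np

rng = np.random.default_rng(0)
b, d, h = 4, 8, 16  # batch size, model dim, hidden dim (made up)

X = rng.normal(size=(b, d))   # input, replicated on both "devices"
A = rng.normal(size=(d, h))   # first MLP weight
B = rng.normal(size=(h, d))   # second MLP weight

# Column-wise split of A: each device produces complete hidden columns,
# so the elementwise nonlinearity can be applied locally, no communication.
A1, A2 = A[:, : h // 2], A[:, h // 2 :]
Y1, Y2 = np.tanh(X @ A1), np.tanh(X @ A2)  # tanh stands in for GeLU

# Row-wise split of B: device i already holds Y_i, so it multiplies by the
# matching rows of B; the partial products are then summed (the all-reduce).
B1, B2 = B[: h // 2, :], B[h // 2 :, :]
Z = Y1 @ B1 + Y2 @ B2  # "+" plays the role of the all-reduce

Z_ref = np.tanh(X @ A) @ B  # unpartitioned reference
assert np.allclose(Z, Z_ref)
```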
@anishbhanushali 2 years ago
I was just wondering how to gather info on large-scale distributed DL training frameworks... and you, sir, just read my mind!!!
@jakekalstad2494 2 years ago
Great stuff as always
@TheAIEpiphany 2 years ago
Thanks Jake!
@rachadlakis1 5 days ago
Can you add a tutorial on distributed training (FSDP) on AWS? It would be great if you added it :)
@MengLin-l8b 10 months ago
For the DP method, is averaging usually the default? It seems uncommon, because for samples within a batch the gradients are usually summed instead. Very grateful if you can answer my question.
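For what it's worth, with equal-sized micro-batches, averaging the per-replica mean gradients gives exactly the gradient of the mean loss over the whole batch, so the two conventions coincide up to the batch-size factor. A toy check (made-up linear model and data, two simulated replicas):

```python
import numpy as np

# Toy linear model y = w * x with squared-error loss, two "replicas" (made up).
w = 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def grad(w, xb, yb):
    # d/dw of mean((w*x - y)^2) over a local micro-batch
    return np.mean(2.0 * (w * xb - yb) * xb)

g_single = grad(w, x, y)  # gradient of the mean loss on one device
g_dp = 0.5 * (grad(w, x[:2], y[:2]) + grad(w, x[2:], y[2:]))  # all-reduce mean
assert np.isclose(g_single, g_dp)
```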
@nicom9853 2 years ago
Hi Aleksa, great video! I have a question I've been putting off for a while: what software do you use for opening, annotating, drawing on, and grouping multiple PDF documents? The one at the start of your video looks cool, but maybe you have another suggestion? I'm preparing for PhD studies and looking for organizational software to handle hundreds of documents haha. Thanks in advance!
@TheAIEpiphany 2 years ago
Hi, thanks! I use OneNote for taking notes, and I simply group PDFs into directories. Take a look at Notion too, it might help you.
@nicom9853 2 years ago
@@TheAIEpiphany Great, thank you!
@DED_Search 1 year ago
34:55 Is it a sum or a concatenation? I think it should be a concatenation.
@MariuszWoloszyn 2 years ago
You've accidentally linked to the old (v2) version of the ZeRO paper in the description. The one shown in the video is here: arxiv.org/pdf/1910.02054v3.pdf
@TheAIEpiphany 2 years ago
Oops, thanks! Will update it.
@hongtaoyang3759 1 year ago
Thanks for the great video! Can you explain more about ZeRO-3 model parallelism vs. Megatron tensor parallelism? It sounds to me like ZeRO-3 includes Megatron tensor parallelism, or are they different techniques that can be applied together?
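For intuition on the question above, a minimal NumPy sketch (made-up sizes, two simulated ranks) of what ZeRO-3 does differently from tensor parallelism: ZeRO-3 only shards parameter storage and all-gathers each layer's weights just in time, after which every rank still runs the full, unpartitioned layer computation; Megatron instead shards the computation itself, which is why the two techniques are orthogonal and can be combined:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(6, 6))      # one layer's full weight matrix (made up)
shards = np.split(W, 2, axis=0)  # rank 0 and rank 1 each store one shard

def forward(x):
    # All-gather the full weight just in time for this layer's forward pass;
    # every rank then runs the SAME unpartitioned computation.
    W_full = np.concatenate(shards, axis=0)
    out = x @ W_full
    # After the layer runs, each rank frees its gathered copy again.
    return out

x = rng.normal(size=(3, 6))
assert np.allclose(forward(x), x @ W)
```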
@ChiragAhuja1 1 year ago
Do you also share the annotated papers?
@bodasadala3516 2 years ago
Great work, thanks for the effort!
@TheAIEpiphany 2 years ago
Thanks!
@bingbingsun6304 2 years ago
3D U-net, with input 1024 by 1024 by 1024, any suggestions?
@ahmadhamdan44 2 years ago
TOP!!!!!
@TheAIEpiphany 2 years ago
Gotta stop uploading on Sunday lol 😂
@stephennfernandes 2 years ago
I really wanted to learn in depth how Mellanox InfiniBand switches work - the networking, routing, and configuration - and how to set up your own GPU cluster from scratch. But after searching the internet for months I couldn't find anything. Does anyone have any good resources on this?
@mraihanafiandi 1 year ago
Up, I have the same concern as you.
@mraihanafiandi 1 year ago
@TheAIEpiphany
@stephennfernandes 2 years ago
🎉🎉🎉✨✨
@ShishilKumar 8 months ago
The video doesn't demonstrate the actual steps to deploy large models using DeepSpeed, which is much more important than understanding all the theory.
@eugeneku3239 4 months ago
Nah, theory reigns supreme.
@juliusvalentinas 2 months ago
An A100 GPU is 30k USD, so is all this offloading theoretical nonsense? Where are the apps that let you run an actual Llama 3.1 on one or two 3090s, offloading unused weights to an NVMe SSD?