Lecture 1 How to profile CUDA kernels in PyTorch

10,692 views

CUDA MODE

4 months ago

Slides: docs.google.com/presentation/...
Code: github.com/msaroufim/cudamode...

Comments: 19
@TheAIEpiphany 10 days ago
Nice walk-through, Mark! So in practice, at a high level, one would profile the code, identify the perf bottlenecks, and then replace some of the functions associated with that bottleneck with a direct CUDA/Triton implementation?
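For reference, a minimal sketch of that workflow with `torch.profiler` (the model, shapes, and warm-up loop below are placeholder assumptions, not taken from the lecture): profile a forward pass, then read the op-level table sorted by GPU time to find the candidates worth rewriting as custom CUDA/Triton kernels.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input, just to have something to profile.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
).cuda()
x = torch.randn(64, 1024, device="cuda")

# Warm up so one-time costs (CUDA context, autotuning) don't look like the bottleneck.
for _ in range(3):
    model(x)
torch.cuda.synchronize()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model(x)

# Sort by GPU time; the top rows are the ops to consider replacing with a custom kernel.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```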
@Graverman 4 months ago
thanks for providing this for free!
@mlock1000 3 months ago
"I believe things I see." I'm in the right place. Thanks!!
@DiogoSanti 25 days ago
Awesome, this channel is a gem.
@zerotwo7319 4 months ago
Oh no, now I have no excuse to be a productive member of my village. Oh, I accidentally subscribed, the terror.
@JasonKuanCapillaryJ 2 days ago
Nice talk
@loabrasumente2283 4 months ago
At 30:40, where you change the BLOCK_SIZE to 1024: how is it possible to reach 8000 GB/s when the max memory bandwidth of an A10G is only 600 GB/s? I think setting BLOCK_SIZE = 1024 makes Triton compute only the first 1024 columns of the matrix while ignoring the rest, so when you compute the GB/s, the "seconds" part is fixed while the "GB" grows linearly (128 * i); that's why you're seeing the perf grow linearly. Also, the reason the little `torch.allclose` test didn't complain is that you are only testing a small matrix (1823, 781) here, whose n_cols is smaller than 1024.
@CUDAMODE 4 months ago
Indeed, and after rerunning this, torch.allclose did indeed complain, so that slide is just plain wrong. Will revisit what went wrong.
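To illustrate the accounting being discussed (a hedged sketch, not the lecture's exact benchmark code): reported bandwidth is typically bytes moved divided by measured time, so if the kernel silently skips columns beyond BLOCK_SIZE, the "bytes moved" term keeps growing with n_cols while the time stays flat and the GB/s figure can sail past the hardware limit. A correctness check on a matrix wider than BLOCK_SIZE catches this; `my_triton_softmax` below is a hypothetical kernel under test.

```python
import torch

def reported_gbps(x: torch.Tensor, ms: float) -> float:
    # Softmax reads and writes each element once: 2 * numel * bytes per element.
    bytes_moved = 2 * x.numel() * x.element_size()
    return bytes_moved * 1e-9 / (ms * 1e-3)

# Use n_cols larger than the assumed BLOCK_SIZE of 1024, so a kernel that only
# touches the first 1024 columns of each row fails the comparison.
x = torch.randn(1823, 4096, device="cuda")
ref = torch.softmax(x, dim=-1)
# out = my_triton_softmax(x)  # hypothetical kernel under test
# assert torch.allclose(out, ref, atol=1e-6), "kernel ignored columns past BLOCK_SIZE"
```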
@elliot6285 4 months ago
Do you have any suggestions for comprehensive resources or study materials that can help a beginner learn about CPUs and GPUs, particularly focusing on their roles and functions in Machine Learning and Deep Learning? I'm looking for in-depth yet accessible information to build a strong foundation in this area, which will enable me to understand the technical aspects discussed in certain videos related to ML/DL, especially this one :).
@mikhailkilianovski8024 4 months ago
This course could be helpful; I am going through it with pleasure: kzbin.info/www/bejne/rYXPZqqIebljebc
@CUDAMODE 2 months ago
This is a good start github.com/cuda-mode/resource-stream
@kobefourthirty1058 4 months ago
Second
@sujantkumarkv5498 4 months ago
First
@Flynuxs 4 months ago
Sixth
@RazhanHameed 4 months ago
Fifth
@edzehoo 4 months ago
Fourth
@forheuristiclifeksh7836 2 months ago
12:25
@CS_n00b 4 months ago
Third
@franciscovicencar 4 months ago
7