Nice walk-through, Mark! So in practice, at a high level, one would profile the code, identify the performance bottlenecks, and then replace some of the functions associated with those bottlenecks with a direct CUDA/Triton implementation?
@Graverman 4 months ago
thanks for providing this for free!
@mlock1000 3 months ago
"I believe the things I see." I'm in the right place. Thanks!!
@DiogoSanti 25 days ago
Awesome, this channel is a gem.
@zerotwo7319 4 months ago
Oh no, now I have no excuse to be a productive member of my village. Oh, I accidentally subscribed, the terror.
@JasonKuanCapillaryJ 2 days ago
Nice talk
@loabrasumente2283 4 months ago
At 30:40, where you change BLOCK_SIZE to 1024: how is it possible to reach 8000 GB/s when the max memory bandwidth of an A10G is only 600 GB/s? I think setting BLOCK_SIZE = 1024 makes Triton compute only the first 1024 columns of the matrix while ignoring the rest, so when you compute the GB/s, the "seconds" part is fixed while the "GB" part grows linearly (128 * i); that's why you're seeing the perf grow linearly. Also, the reason the little `torch.allclose` test didn't complain is that you are only testing a small matrix (1823, 781) here, whose n_cols is smaller than 1024, so no columns are actually dropped.
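The benchmark artifact described above can be sketched with purely illustrative numbers (the row count, runtime, and byte accounting below are made up, not taken from the lecture): if the kernel's runtime is capped by BLOCK_SIZE but the "bytes moved" figure is computed from the full matrix, the reported bandwidth grows without bound as n_cols grows.

```python
# Sketch of the benchmark artifact (all numbers hypothetical):
# the kernel only ever touches BLOCK_SIZE columns, so its runtime stays
# roughly flat, while the "GB moved" figure assumes the full matrix.
n_rows = 4096
bytes_per_element = 4  # float32
runtime_s = 1e-4       # roughly constant: work is capped at BLOCK_SIZE cols

for n_cols in (1024, 8192, 65536):
    # read + write of the full matrix, as the (flawed) benchmark assumes
    gb = 2 * n_rows * n_cols * bytes_per_element / 1e9
    print(f"n_cols={n_cols}: reported {gb / runtime_s:.0f} GB/s")
```

With fixed runtime, the reported GB/s scales linearly with n_cols and quickly blows past any real hardware ceiling, which matches the linear "speedup" seen on the slide.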
@CUDAMODE 4 months ago
Indeed, and after rerunning this, torch.allclose did in fact complain, so that slide is just plain wrong; will revisit what went wrong.
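The correctness failure mode discussed in this thread can be reproduced without a GPU. The sketch below uses a hypothetical NumPy stand-in for a Triton-style kernel that only loads the first BLOCK_SIZE columns of each row (the function names are illustrative, not from the lecture code): allclose passes when n_cols fits in one block and fails once it doesn't.

```python
import numpy as np

BLOCK_SIZE = 1024

def truncated_softmax(x, block_size=BLOCK_SIZE):
    # Stand-in for a kernel that only processes the first `block_size`
    # columns of each row and silently ignores the rest.
    out = np.zeros_like(x)
    cols = min(block_size, x.shape[1])
    sub = x[:, :cols]
    e = np.exp(sub - sub.max(axis=1, keepdims=True))
    out[:, :cols] = e / e.sum(axis=1, keepdims=True)
    return out

def reference_softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)

# Small test matrix: n_cols = 781 <= 1024, so no columns are dropped
small = rng.standard_normal((1823, 781))
print(np.allclose(truncated_softmax(small), reference_softmax(small)))  # True

# Wider matrix: columns beyond 1024 are silently zeroed out
wide = rng.standard_normal((4, 4096))
print(np.allclose(truncated_softmax(wide), reference_softmax(wide)))  # False
```

In the real Triton kernel, the fix is to either launch enough programs to cover all columns or loop over the row in BLOCK_SIZE-sized chunks with a mask on `tl.load`, so a test on a matrix wider than BLOCK_SIZE is what actually exercises the bug.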
@elliot6285 4 months ago
Do you have any suggestions for comprehensive resources or study materials that can help a beginner learn about CPUs and GPUs, particularly focusing on their roles and functions in Machine Learning and Deep Learning? I'm looking for in-depth yet accessible information to build a strong foundation in this area, which will enable me to understand the technical aspects discussed in certain videos related to ML/DL, especially this one :).
@mikhailkilianovski8024 4 months ago
This course could be helpful; I am going through it with pleasure: kzbin.info/www/bejne/rYXPZqqIebljebc
@CUDAMODE 2 months ago
This is a good start: github.com/cuda-mode/resource-stream