Thank you so much for making these puzzles! It was a great introduction to GPU code for a beginner like myself! I am looking for more beginner-friendly GPU implementations like these. Will try to look up some material on Triton and TVM!
@wilbyyang4554 · 17 days ago
Thank you so much for the amazing education! Do you know about something like SIMD puzzles?
@SurflFilms · 7 months ago
Thank you for making this, you are awesome.
@sourabmangrulkar9105 · 11 months ago
Thank you! Very insightful and easy to follow.
@ms-wk2wg · 1 year ago
Thanks for following up on my earlier comment and re-uploading so quickly!
@jonathanlee7969 · 11 months ago
Thanks for the great content. I have a question, though: is it necessary to insert cuda.syncthreads() after the inner loop in Puzzle 14? The shared memory in each thread block gets modified repeatedly.
@srush_nlp · 11 months ago
It's possible that I missed a syncthreads. You should always add one after writing to shared memory and before reading from it.
@jonathanlee7969 · 11 months ago
@srush_nlp Thanks for your reply. What about after reading and before writing to the same memory? I think a syncthreads is necessary there as well.
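A minimal sketch of the double-barrier pattern being discussed, assuming a tiled matmul like Puzzle 14's; `TPB`, the kernel name, and the variable names are illustrative rather than the puzzle's reference solution:

```python
from numba import cuda
import numba

TPB = 4  # tile size / threads per block along each axis (illustrative)

@cuda.jit
def matmul_tiled(out, a, b, size):
    a_shared = cuda.shared.array((TPB, TPB), numba.float32)
    b_shared = cuda.shared.array((TPB, TPB), numba.float32)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    local_i = cuda.threadIdx.x
    local_j = cuda.threadIdx.y

    acc = numba.float32(0.0)
    for k0 in range(0, size, TPB):
        # Write phase: each thread loads one element of the current tile.
        if i < size and k0 + local_j < size:
            a_shared[local_i, local_j] = a[i, k0 + local_j]
        if j < size and k0 + local_i < size:
            b_shared[local_i, local_j] = b[k0 + local_i, j]
        # Barrier 1: after writing, before any thread reads the tile.
        cuda.syncthreads()
        for k in range(min(TPB, size - k0)):
            acc += a_shared[local_i, k] * b_shared[k, local_j]
        # Barrier 2: after reading, before the next iteration overwrites
        # the tile -- this is the one that's easy to forget.
        cuda.syncthreads()
    if i < size and j < size:
        out[i, j] = acc
```

The second barrier is what the follow-up question is about: without it, a fast thread can start overwriting the shared tile for the next iteration while a slower thread is still reading the previous one.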
@rohanreddy1083 · 1 year ago
Thanks for the re-upload Sasha, the sound and video are in sync now! I just finished following along; thanks so much for the tutorial. Resources for building an intuitive understanding of CUDA as a beginner are sparse. I believe you commented on the prior upload about what next steps to take after completing these puzzles, but I can't access it anymore since that video is down. For someone interested in working with low-level optimizations and CUDA-style frameworks professionally, what would be some good projects to try and build? For example, implementing some recent ML papers in Numba or Triton? Or trying to write a basic version of such a compiler myself?
@srush_nlp · 1 year ago
Yeah, so I think both Triton and TVM are worth learning. If you are looking for harder projects, flash-attention and GPTQ are both interesting things to try to implement. Both have raw CUDA and Triton kernels online that you can find and study.
@rohanreddy1083 · 1 year ago
Thank you!!
@grosucelmic · 8 months ago
The `'int' has no attribute` error in Puzzle 12 was due to initialising the cache with `0` instead of `0.0` (apparently Numba doesn't want to coerce the int literal to a `numba.float32` on its own).
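A minimal sketch of the fix, assuming a `numba.float32` shared cache as in the puzzle (the kernel name and `TPB` are illustrative):

```python
from numba import cuda
import numba

TPB = 8  # illustrative threads per block

@cuda.jit
def zero_cache(out):
    # Shared cache declared as float32.
    cache = cuda.shared.array(TPB, numba.float32)
    local_i = cuda.threadIdx.x
    # Per the comment above, writing the Python int literal 0 here can
    # trip Numba's type inference; the float literal 0.0 coerces cleanly.
    cache[local_i] = 0.0
    cuda.syncthreads()
    out[local_i] = cache[local_i]
```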
@sotasearcher · 10 months ago
I passed all the puzzles up to 10 before hitting a wall and deciding to check my solutions so far against this. I read "adds 10 to each position of a and stores it in out" as "adds 10 to each index of a and stores it in out", and since every value equals its index in the test array, I was just adding local_i and everything was passing 😅
@sotasearcher · 10 months ago
Also, I was first adding `out[local_i]` (which is 0) to deal with type issues.
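For context, the wrong-but-passing kernel reconstructed from these two comments would look something like this, assuming the puzzle's test uses `a = arange(size)` so that `a[local_i] == local_i`:

```python
from numba import cuda

@cuda.jit
def add_ten_wrong(out, a):
    local_i = cuda.threadIdx.x
    # Intended solution: out[local_i] = a[local_i] + 10
    # This version only passes because the test array is a range,
    # so a[local_i] happens to equal local_i (and out starts at 0).
    out[local_i] = out[local_i] + local_i + 10
```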
@yousefalnaser1751 · 6 months ago
You have one bug in the Prefix Sum code: if you don't run the "else" branch that sets cache[local_i] = 0, you get incorrect results. You have to have that else branch! You only got correct results because the cache already contained zeros from an earlier run and the GPU was not reset in between. Just FYI :) Otherwise it's really good practice.
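A minimal sketch of the fixed kernel, assuming a block-level sum reduction like Puzzle 12's (`TPB`, the names, and the tree-reduction loop are illustrative, not the puzzle's reference solution):

```python
from numba import cuda
import numba

TPB = 8  # threads per block; a power of two for this reduction

@cuda.jit
def prefix_sum_block(out, a, size):
    cache = cuda.shared.array(TPB, numba.float32)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    local_i = cuda.threadIdx.x
    if i < size:
        cache[local_i] = a[i]
    else:
        # The else branch from the comment above: without it, out-of-range
        # slots keep whatever was left in shared memory from a previous
        # launch, which only *looks* correct if that memory happens to be 0.
        cache[local_i] = 0.0  # float literal, per the Puzzle 12 comment
    cuda.syncthreads()
    # Tree reduction within the block.
    stride = 1
    while stride < TPB:
        if local_i % (2 * stride) == 0:
            cache[local_i] += cache[local_i + stride]
        cuda.syncthreads()
        stride *= 2
    if local_i == 0:
        out[cuda.blockIdx.x] = cache[0]
```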