Thank you so much for making these puzzles! It was a great introduction to GPU code for a beginner like myself! I am looking for more beginner-friendly GPU implementations like these. Will try to look up some material on Triton and TVM!
@wilbyyang4554 · 17 days ago
Thank you so much for the amazing education! Do you know about something like SIMD puzzles?
@SurflFilms · 7 months ago
Thank you for making this, you are awesome.
@sourabmangrulkar9105 · 11 months ago
Thank you! Very insightful and easy to follow.
@ms-wk2wg · 1 year ago
Thanks for following up on my earlier comment and re-uploading so quickly!
@jonathanlee7969 · 11 months ago
Thanks for the great content. I have a question, though: is it necessary to insert cuda.syncthreads() after the inner loop in Puzzle 14? The shared memory in each thread block gets modified repeatedly.
@srush_nlp · 11 months ago
It's possible that I missed a syncthreads. You should always add one after writing to shared memory and before reading from it.
@jonathanlee7969 · 11 months ago
@srush_nlp Thanks for your reply. What about after reading and before writing to the same memory? I think a syncthreads is necessary there as well.
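A minimal sketch of the double-barrier pattern being discussed, assuming a tiled matmul like Puzzle 14's; `TPB`, the kernel name, and the variable names are illustrative rather than the puzzle's reference solution:

```python
from numba import cuda
import numba

TPB = 4  # tile size / threads per block along each axis (illustrative)

@cuda.jit
def matmul_tiled(out, a, b, size):
    a_shared = cuda.shared.array((TPB, TPB), numba.float32)
    b_shared = cuda.shared.array((TPB, TPB), numba.float32)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
    local_i = cuda.threadIdx.x
    local_j = cuda.threadIdx.y

    acc = numba.float32(0.0)
    for k0 in range(0, size, TPB):
        # Write phase: each thread loads one element of the current tile.
        if i < size and k0 + local_j < size:
            a_shared[local_i, local_j] = a[i, k0 + local_j]
        if j < size and k0 + local_i < size:
            b_shared[local_i, local_j] = b[k0 + local_i, j]
        # Barrier 1: after writing, before any thread reads the tile.
        cuda.syncthreads()
        for k in range(min(TPB, size - k0)):
            acc += a_shared[local_i, k] * b_shared[k, local_j]
        # Barrier 2: after reading, before the next iteration overwrites
        # the tile -- this is the one that's easy to forget.
        cuda.syncthreads()
    if i < size and j < size:
        out[i, j] = acc
```

The second barrier is what the follow-up question is about: without it, a fast thread can start overwriting the shared tile for the next iteration while a slower thread is still reading the previous one.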
@rohanreddy1083 · 1 year ago
Thanks for the re-upload Sasha, the sound and video are in sync now! I just finished following along; thanks so much for the tutorial. Resources for building an intuitive understanding of CUDA as a beginner are sparse. I believe you commented on the prior upload about what next steps to take after completing these puzzles, but I can't access it anymore since that video is down. For someone interested in working with low-level optimizations and CUDA-style frameworks professionally, what would be some good projects to try and build? For example, implementing some recent ML papers in Numba or Triton? Or trying to write a basic version of such a compiler myself?
@srush_nlp · 1 year ago
Yeah, so I think both Triton and TVM are worth learning. If you are looking for harder projects, flash-attention and GPTQ are both interesting things to try to implement. Both have raw CUDA and Triton kernels online that you can find and study.
@rohanreddy1083 · 1 year ago
Thank you!!
@grosucelmic · 8 months ago
The `'int' has no attribute` error in Puzzle 12 was due to initialising the cache with `0` instead of `0.0` (apparently Numba doesn't want to coerce the int literal to a `numba.float32` on its own).
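A minimal sketch of the fix, assuming a `numba.float32` shared cache as in the puzzle (the kernel name and `TPB` are illustrative):

```python
from numba import cuda
import numba

TPB = 8  # illustrative threads per block

@cuda.jit
def zero_cache(out):
    # Shared cache declared as float32.
    cache = cuda.shared.array(TPB, numba.float32)
    local_i = cuda.threadIdx.x
    # Per the comment above, writing the Python int literal 0 here can
    # trip Numba's type inference; the float literal 0.0 coerces cleanly.
    cache[local_i] = 0.0
    cuda.syncthreads()
    out[local_i] = cache[local_i]
```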
@sotasearcher · 10 months ago
I passed all the puzzles up to 10 before hitting a wall and deciding to check my solutions so far against this. I read "adds 10 to each position of a and stores it in out" as "adds 10 to each index of a and stores it in out", and since every value equals its index in the test array, I was just adding local_i and everything was passing 😅
@sotasearcher · 10 months ago
Also, I was first adding `out[local_i]` (which is 0) to deal with type issues.
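For context, the wrong-but-passing kernel reconstructed from these two comments would look something like this, assuming the puzzle's test uses `a = arange(size)` so that `a[local_i] == local_i`:

```python
from numba import cuda

@cuda.jit
def add_ten_wrong(out, a):
    local_i = cuda.threadIdx.x
    # Intended solution: out[local_i] = a[local_i] + 10
    # This version only passes because the test array is a range,
    # so a[local_i] happens to equal local_i (and out starts at 0).
    out[local_i] = out[local_i] + local_i + 10
```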
@yousefalnaser1751 · 6 months ago
You have one bug in the Prefix Sum code: if you don't run the "else" branch that sets cache[local_i] = 0, you get incorrect results. You have to have that else branch! You only got correct results because the cache already contained zeros from an earlier run and the GPU was not reset in between. Just FYI :) Otherwise it's really good practice.
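A minimal sketch of the fixed kernel, assuming a block-level sum reduction like Puzzle 12's (`TPB`, the names, and the tree-reduction loop are illustrative, not the puzzle's reference solution):

```python
from numba import cuda
import numba

TPB = 8  # threads per block; a power of two for this reduction

@cuda.jit
def prefix_sum_block(out, a, size):
    cache = cuda.shared.array(TPB, numba.float32)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    local_i = cuda.threadIdx.x
    if i < size:
        cache[local_i] = a[i]
    else:
        # The else branch from the comment above: without it, out-of-range
        # slots keep whatever was left in shared memory from a previous
        # launch, which only *looks* correct if that memory happens to be 0.
        cache[local_i] = 0.0  # float literal, per the Puzzle 12 comment
    cuda.syncthreads()
    # Tree reduction within the block.
    stride = 1
    while stride < TPB:
        if local_i % (2 * stride) == 0:
            cache[local_i] += cache[local_i + stride]
        cuda.syncthreads()
        stride *= 2
    if local_i == 0:
        out[cuda.blockIdx.x] = cache[0]
```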