CUDA Crash Course: Cache Tiled Matrix Multiplication

29,561 views

Nick

Comments: 61
@yuhangliu45 4 years ago
You are a great teacher! Your videos help me learn CUDA much faster than reading the official guide.
@NotesByNick 4 years ago
Glad to hear that you found it useful!
@user-kj8cb6ch4w 4 years ago
I just started studying this course in one of the finest universities in India, and let me tell you that I've learned more from you than my so-called highly qualified professor!
@NotesByNick 4 years ago
Glad you've found them useful!
@user-kj8cb6ch4w 4 years ago
@@NotesByNick hey, which book do you suggest to learn more about CUDA?
@NotesByNick 4 years ago
@@user-kj8cb6ch4w Programming Massively Parallel Processors: A Hands-on Approach is the book most people use.
@bluewhale37 8 months ago
Yeah.
@jereziah 5 years ago
Thanks for the tutorials mate, please keep 'em coming!
@NotesByNick 5 years ago
Will do! Thanks for the feedback fella!
@chadmcintire4128 3 years ago
Hey, your content is great. I haven't taken any of the NVIDIA courses, but this is pretty great, and free. Thank you! Also, this looks like a convolution.
@yassermorsy8481 1 year ago
I feel like I am not grasping this. Doesn't this mean every thread running in parallel has to go through the process of loading the row and column tiles into shared memory? Why is that not more expensive?
@muneshchauhan 4 years ago
Thanks for the wonderful video. Just one query on the shared memory size: since you stated that it will be a 16 x 16 tile, we need to remove the extra multiplication by 4 in SHMEM_SIZE.
@NotesByNick 4 years ago
Yes, that's correct! It's a mistake in the original video. The code on GitHub does not contain that error.
@arsalananwari4377 10 months ago
I think he got confused between size in elements and size in bytes, which is why he multiplies by 4, the way you would for cudaMalloc.
@drewb2053 3 years ago
I think there's a mistake in the SHMEM_SIZE calculation. You multiply 16*16 by 4 to account for the integer size, but wouldn't that be included when the compiler lays out an integer array for A and B? Curious whether I was misunderstanding something about shared memory declarations.
@NotesByNick 3 years ago
Yes, it's a bug in the video - consult the code on GitHub for the most up-to-date version.
@drewb2053 3 years ago
@@NotesByNick thank you so much for your expedient reply. Your video was incredibly helpful
@olivercopleston 5 years ago
Thanks for the video, really helped with one of my Computer Science modules.
@NotesByNick 5 years ago
Glad you found it useful!
@Chiarettoo 3 years ago
Thanks for the video!! I'm new to CUDA programming and I don't know if I'm missing a concept. I was trying to change the size of the matrices to see performance in larger multiplications, but when I tried to multiply 4096 x 4096 (n
@pranavshankar1972 2 years ago
Great video. If I am trying to implement the Kronecker product, the only change I need to make is to the accumulation logic, from tmp += (result of c) to tmp = (result of c), right?
@_jasneetsingh_ 1 year ago
In the code on GitHub, you are defining the shared memory size to be 1
@金安迪-g5q 8 months ago
Great video, I finally understood this mechanism!
@_lilkm 5 years ago
Hi CoffeeBeforeArch, I'm wondering why we don't free the memory allocated in the error-check function, pointed to by the verify_c variable.
@NotesByNick 5 years ago
It should have been freed (it's a memory leak, although the program ends immediately afterward, so the memory is reclaimed automatically). I've updated the GitHub repo with a verification function that doesn't allocate any new memory, as it should have been from the start.
@_lilkm 5 years ago
Thank you @@NotesByNick
@EduardoValdez-f4j 1 year ago
Question: when loading the columns into shared memory, couldn't we transpose these values so that they are accessed sequentially during execution?
@unknownentity5354 9 months ago
I was thinking the same thing. I would suppose it's fine, considering that's what we do when speeding up CPU code.
@aneesmd7837 3 years ago
Any tutorial on how to use extern shared memory? I'm getting an assertion error when I use extern shared memory. Can anyone please give me some leads?
@TYGR2115 4 years ago
Ohhh, how I love you for this video, absolute life saver!
@NotesByNick 4 years ago
Glad you found it helpful!
@joseluismatabuenasedano6881 1 year ago
Thanks, this is really good material :) One quick question: where is the temp sum allocated?
@sinaasadiyan 4 years ago
What about non-square tiles? For example, instead of 16*16, we use 8*16 or 16*8 (in the first matrix, with the matching shape in the second)?
@NotesByNick 4 years ago
I'm not quite sure what your question is. If you are asking if it's possible to use non-square tiles, then yes, it is absolutely possible. It just requires a small change to the indexing to address the fact that the X and Y dimensions are no longer equal.
@krisdobrev3546 4 years ago
Thanks for the video! By the way, I have noticed that when working with smaller matrices (16x16, 32x32, 64x64), the normal (global memory) matrix multiplication seems to perform slightly better than the tiled one. However, once the matrices become bigger than 128x128, tiled matrix multiplication performs significantly better. Do you have an idea why this could be happening?
@NotesByNick 4 years ago
Smaller matrices like that aren't a great target for the GPU (unless many of them are batched together). Just copying the data back and forth is likely slower than just running on the CPU. A likely reason why the non-shared-memory version is faster is that the extra instructions for managing the shared memory (extra loads and stores into shared memory) outweigh the tiny amount of re-use you are getting. Your data is likely also small enough to just fit in your caches without any manual tiling using shared memory. Hope this helps! --Nick
@krisdobrev3546 4 years ago
@@NotesByNick Oh wow! I was not expecting a reply so quickly. It did actually help a lot. Thank you.
@NotesByNick 4 years ago
Christian Dobrev no problem! Always happy to help!
@yuanchen9602 3 years ago
Can anyone share tips for debugging CUDA code in Visual Studio 2017/2019?
@shuaiwang3901 4 years ago
Thanks for the video. It helped a lot!
@NotesByNick 4 years ago
No problem! Glad you found it useful!
@bobhut8613 4 years ago
Thanks, this really helped!
@NotesByNick 4 years ago
No problem!
@tunaalatan35 3 years ago
Great video!
@williamgomez6226 3 years ago
Amazing channel. Thank you so much!
@khizaranjum9296 4 years ago
Hi, so I think the SHMEM_SIZE definition should be just #define SHMEM_SIZE 16*16, because you are using it to declare a shared memory array, and it needs the number of elements, not the number of bytes.
@NotesByNick 4 years ago
That’s correct! If you look at the code on GitHub, that issue (and others in the series) has been fixed.
@khizaranjum9296 4 years ago
@@NotesByNick Thank you for the timely reply. This series is great! There is just one more thing: I do not understand why you sometimes do int GRID_SIZE = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; since you are converting to an int, doesn't the -1 in there give you the same answer as n / BLOCK_SIZE, due to the floor when converting from float to int?
@NotesByNick 4 years ago
@@khizaranjum9296 So in all the updated code on GitHub, floating point numbers are never used (or converted from) in the block and grid dimension calculations. That formula is just a simple way of rounding n/BLOCK_SIZE up to the nearest whole number using only integers.
@grantgasser136 5 years ago
What if the matrix is not square?
@NotesByNick 5 years ago
Great question! It just requires a small change to the indexing. Here's a quick modification I just wrote to show the indexing: github.com/CoffeeBeforeArch/cuda_programming/blob/master/matrixMul/matrixMul_tiled_rectangular.cu . Handling a matrix dim that is not a multiple of the block size is another interesting problem that you can approach in a number of ways. The easiest way is to just pad the matrix with zeros until it becomes a multiple; the extra computation does not impact the final results. Cheers! -- Nick
@rafinryan4218 4 years ago
@@NotesByNick Hey Nick, great content as always. I am struggling a bit when it comes to multiplying non-square matrices. The above link does not work. Can you please give an update on that?
@shawnli1256 1 year ago
Thanks for the series, it's really helpful. But I find it not perfectly suited for complete beginners in CUDA; it's better to watch these after reading the book "CUDA by Example".
@SuperChrzaszcz 5 years ago
The `int GRID_SIZE = (int)ceil(n / BLOCK_SIZE);` survived here.
@NotesByNick 5 years ago
All the code on Github should be updated as of this morning.
@armagaan009 2 years ago
👏🏼👏🏼👏🏼