You are a great teacher! Your videos help me learn CUDA much faster than reading the official guide.
@NotesByNick 4 years ago
Glad to hear that you found it useful!
@user-kj8cb6ch4w 4 years ago
I just started studying this course at one of the finest universities in India, and let me tell you that I've learned more from you than my so-called highly qualified professor!
@NotesByNick 4 years ago
Glad you've found them useful!
@user-kj8cb6ch4w 4 years ago
@@NotesByNick Hey, which book do you suggest for learning more about CUDA?
@NotesByNick 4 years ago
@@user-kj8cb6ch4w Programming Massively Parallel Processors: A Hands-on Approach is the book most people use.
@bluewhale37 8 months ago
Yeah.
@jereziah 5 years ago
Thanks for the tutorials mate, please keep 'em coming!
@NotesByNick 5 years ago
Will do! Thanks for the feedback, fella!
@chadmcintire4128 3 years ago
Hey, your content is great. I haven't taken any of the NVIDIA courses, but this is pretty great, and free. Thank you! Also, this looks like a convolution.
@yassermorsy8481 A year ago
I feel like I am not grasping this. Doesn't this mean every thread running in parallel has to go through the process of loading the row and column tiles into shared memory? Why is that not more expensive?
@muneshchauhan 4 years ago
Thanks for the wonderful video. Just one query on the shared memory size: since you stated that it will be a 16 x 16 tile, don't we need to remove the extra multiplication by 4 in SHMEM_SIZE?
@NotesByNick 4 years ago
Yes, that's correct! It's a mistake in the original video. The code on GitHub does not contain that error.
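For anyone following along, a minimal sketch of the corrected setup, assuming the 16 x 16 integer tiles used in the video (the kernel and variable names here are illustrative, not the repo's actual code):

```cuda
#include <cuda_runtime.h>

// Tile width matching the 16 x 16 thread blocks in the video
#define TILE_DIM 16

// __shared__ arrays are declared with an ELEMENT count, so no extra
// multiplication by sizeof(int) here: 16 * 16 ints, not 16 * 16 * 4
#define SHMEM_SIZE (TILE_DIM * TILE_DIM)

__global__ void tile_copy(const int *in, int *out) {
  __shared__ int tile[SHMEM_SIZE];                 // 256 ints = 1 KiB of shared memory
  int idx = threadIdx.y * TILE_DIM + threadIdx.x;  // flat index within the tile
  tile[idx] = in[idx];                             // cooperative load into shared memory
  __syncthreads();
  out[idx] = tile[idx];                            // write back (illustration only)
}

int main() {
  int *d_in, *d_out;
  // cudaMalloc, by contrast, takes a size in BYTES, hence the sizeof(int)
  cudaMalloc(&d_in, SHMEM_SIZE * sizeof(int));
  cudaMalloc(&d_out, SHMEM_SIZE * sizeof(int));
  tile_copy<<<1, dim3(TILE_DIM, TILE_DIM)>>>(d_in, d_out);
  cudaDeviceSynchronize();
  cudaFree(d_in);
  cudaFree(d_out);
  return 0;
}
```

The key point is that `__shared__` arrays are sized in elements, while `cudaMalloc` is sized in bytes.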
@arsalananwari4377 10 months ago
I think he got confused between the number of elements and the number of bytes, which is why he multiplies by 4, like you do in cudaMalloc.
@drewb2053 3 years ago
I think there's a mistake in the SHMEM_SIZE calculation. You multiply 16*16*4, with the 4 there to account for integer size, but wouldn't that already be accounted for when the compiler lays out an integer array for A and B? Curious whether I was misunderstanding something about the shared memory declaration.
@NotesByNick 3 years ago
Yes, it's a bug in the video - consult the code on GitHub for the most up-to-date version.
@drewb2053 3 years ago
@@NotesByNick Thank you so much for your expedient reply. Your video was incredibly helpful.
@olivercopleston 5 years ago
Thanks for the video, really helped with one of my Computer Science modules.
@NotesByNick 5 years ago
Glad you found it useful!
@Chiarettoo 3 years ago
Thanks for the video!! I'm new to CUDA programming and I don't know if I'm missing a concept. I was trying to change the size of the matrices to see performance in larger multiplications, but when I tried to multiply 4096 x 4096 (n
@pranavshankar1972 2 years ago
Great video. If I am trying to implement the Kronecker product, the only change I need to make is to the logic, from tmp += result of c to tmp = result of c, right?
@_jasneetsingh_ A year ago
In the code on GitHub, you are defining the shared memory size to be 1
@金安迪-g5q 8 months ago
Great video, finally understood this mechanism!
@_lilkm 5 years ago
Hi CoffeeBeforeArch, I'm wondering why we don't free up the memory allocated in the error-check function, pointed to by the (verify_c) variable.
@NotesByNick 5 years ago
It should have been freed (it's a memory leak, although the program immediately ends afterward, so it is reclaimed automatically). I've updated the GitHub repo with a verification function that doesn't allocate any new memory, as it should have been from the start.
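A rough sketch of what an allocation-free check can look like (this is an illustration, not necessarily the exact function now in the repo):

```cpp
#include <cassert>
#include <vector>

// Verify the GPU result against a CPU reference without allocating a
// separate result matrix: recompute each element on the fly and compare.
void verify_result(const std::vector<int> &a, const std::vector<int> &b,
                   const std::vector<int> &c, int n) {
  for (int row = 0; row < n; row++) {
    for (int col = 0; col < n; col++) {
      int tmp = 0;                       // reference value for c[row][col]
      for (int k = 0; k < n; k++) {
        tmp += a[row * n + k] * b[k * n + col];
      }
      assert(tmp == c[row * n + col]);   // any mismatch aborts the program
    }
  }
}
```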
@_lilkm 5 years ago
Thank you @@NotesByNick
@EduardoValdez-f4j A year ago
Question: when loading the columns into shared memory, couldn't we transpose these values to have them be sequentially accessed during execution?
@unknownentity5354 9 months ago
I was thinking the same thing. I would suppose it's fine, considering that's what we do when speeding things up on the CPU.
@aneesmd7837 3 years ago
Any tutorial on how to use extern shared memory? I am getting an assertion error when I use extern shared memory. Can anyone please give me some leads?
@TYGR2115 4 years ago
Ohhh, how I love you for this video - absolute life saver!
@NotesByNick 4 years ago
Glad you found it helpful!
@joseluismatabuenasedano6881 A year ago
Thanks, this is really good material :) One quick question: where is the temp sum allocated?
@sinaasadiyan 4 years ago
What about non-square tiles? For example, instead of 16*16, we use 8*16 or 16*8 (in the first matrix, with the appropriate shape for the second).
@NotesByNick 4 years ago
I'm not quite sure what your question is. If you are asking if it's possible to use non-square tiles, then yes, it is absolutely possible. It just requires a small change to the indexing to address the fact that the X and Y dimensions are no longer equal.
@krisdobrev3546 4 years ago
Thanks for the video! By the way, I have noticed that when working with smaller matrices (16x16, 32x32, 64x64), the normal (global memory) matrix multiplication seems to perform slightly better than the tiled one. However, once the matrices become bigger than 128x128, tiled matrix multiplication performs significantly better. Do you have an idea why this could be happening?
@NotesByNick 4 years ago
Smaller matrices like that aren't a great target for the GPU (unless many of them are batched together). Just copying the data back and forth is likely slower than just running on the CPU. A likely reason why the non-shared-memory version is faster is that the extra instructions for managing the shared memory (extra loads and stores into shared memory) outweigh the tiny amount of re-use you are getting. Your data is likely also small enough to just fit in your caches without any manual tiling using shared memory. Hope this helps! --Nick
@krisdobrev3546 4 years ago
@@NotesByNick Oh wow! I was not expecting a reply so quickly. It did actually help a lot. Thank you.
@NotesByNick 4 years ago
Christian Dobrev no problem! Always happy to help!
@yuanchen9602 3 years ago
Can anyone share tips for debugging CUDA code in Visual Studio 2017/2019?
@shuaiwang3901 4 years ago
Thanks for the video. It helped a lot!
@NotesByNick 4 years ago
No problem! Glad you found it useful!
@bobhut8613 4 years ago
Thanks this really helped!
@NotesByNick 4 years ago
No problem!
@tunaalatan35 3 years ago
Great video!
@williamgomez6226 3 years ago
Amazing channel. Thank you sooo much!
@khizaranjum9296 4 years ago
Hi, so I think the SHMEM_SIZE definition should be just #define SHMEM_SIZE 16*16, because you are using it to declare a shared memory array, and it needs the number of elements, not the number of bytes.
@NotesByNick 4 years ago
That’s correct! If you look at the code on GitHub, that issue (and others in the series) has been fixed.
@khizaranjum9296 4 years ago
@@NotesByNick Thank you for the timely reply. This series is great! There is just one more thing: I do not understand why you sometimes do int GRID_SIZE = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; because you are converting to an int, the -1 in there gives you the same answer as n / BLOCK_SIZE due to the floor when converting from float to int.
@NotesByNick 4 years ago
@@khizaranjum9296 So in all the updated code on GitHub, floating-point numbers are never used (or converted from) in the block and grid dimension calculations. That formula is just a simple way of rounding n / BLOCK_SIZE up to the nearest whole number using only integers.
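A tiny host-side example of that round-up formula (the values here are chosen just for illustration):

```cpp
#include <cstdio>

int main() {
  const int BLOCK_SIZE = 16;
  int n = 1000;                                        // not a multiple of BLOCK_SIZE
  int plain = n / BLOCK_SIZE;                          // integer division truncates down to 62
  int GRID_SIZE = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;   // rounds up to 63, still all integers
  printf("n=%d  n/BLOCK_SIZE=%d  GRID_SIZE=%d\n", n, plain, GRID_SIZE);
  return 0;
}
```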
@grantgasser136 5 years ago
What if the matrix is not square?
@NotesByNick 5 years ago
Great question! It just requires a small change to the indexing. Here's a quick modification I just wrote to show off that indexing - github.com/CoffeeBeforeArch/cuda_programming/blob/master/matrixMul/matrixMul_tiled_rectangular.cu . Handling a matrix dim that is not a multiple of the block size is another interesting problem that you can approach in a number of ways. The easiest way is to just pad the matrix with zeros until it becomes a multiple, and the extra computation does not impact the final results. Cheers! -- Nick
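A rough sketch of that zero-padding approach on the host side (the helper name and details here are made up for illustration and are not taken from the repo):

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE_DIM = 16;

// Copy an n x n host matrix into a padded_n x padded_n device matrix,
// where padded_n is the next multiple of TILE_DIM. The extra rows and
// columns are zero, so they don't change the valid region of the product.
int *alloc_padded(const std::vector<int> &h_m, int n, int *padded_n_out) {
  int padded_n = (n + TILE_DIM - 1) / TILE_DIM * TILE_DIM;  // round up to a multiple
  int *d_m;
  cudaMalloc(&d_m, padded_n * padded_n * sizeof(int));
  cudaMemset(d_m, 0, padded_n * padded_n * sizeof(int));    // zero out the padding
  // Copy the n x n payload row by row into the wider padded device rows
  cudaMemcpy2D(d_m, padded_n * sizeof(int),                 // dst, dst pitch (bytes)
               h_m.data(), n * sizeof(int),                 // src, src pitch (bytes)
               n * sizeof(int), n,                          // row width (bytes), row count
               cudaMemcpyHostToDevice);
  *padded_n_out = padded_n;
  return d_m;
}
```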
@rafinryan4218 4 years ago
@@NotesByNick Hey Nick, great content as always. I am struggling a bit when it comes to multiplication of non-square matrices. The above link does not work. Can you please give an update on that?
@shawnli1256 A year ago
Thanks for the series, they are really helpful. But I find them not ideal for complete beginners to CUDA; it's better to watch these after reading the book "CUDA by Example".
@SuperChrzaszcz 5 years ago
The `int GRID_SIZE = (int)ceil(n / BLOCK_SIZE);` survived here.
@NotesByNick 5 years ago
All the code on GitHub should be updated as of this morning.