Your examples and illustrations are helpful! Thanks!
@NotesByNick 5 years ago
Happy to help fella!
@eaemmm333 3 years ago
Thank you for the informative courses. One piece of feedback: your picture hides the lower-right side of the screen, which might contain some information. If you could kindly make it smaller, that would be better.
About the inside of `verify_result`: are you relying on malloc initializing `verify_c` to zero?
@NotesByNick 5 years ago
I got lucky and happened to get a chunk of zeroed-out memory when I recorded the video. The code on GitHub was fixed post-release of this video on Feb 22, and does not rely on undefined behavior.
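(For anyone reading along later, here is a minimal sketch of what a CPU-side check that never depends on what malloc returned can look like. The function signature and row-major layout are assumptions for illustration, not necessarily the exact code on GitHub.)

```c
#include <assert.h>

// Accumulate each element into a local variable that is explicitly zeroed,
// so correctness never depends on the contents of freshly malloc'd memory.
void verify_result(const int *a, const int *b, const int *c, int n) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      int tmp = 0;  // explicit initialization instead of relying on malloc
      for (int k = 0; k < n; k++) {
        tmp += a[i * n + k] * b[k * n + j];
      }
      assert(tmp == c[i * n + j]);  // compare against the GPU result
    }
  }
}
```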
@hanwang5940 5 years ago
Is the intermediate variable temp_sum necessary? My train of thought is that since array c is in global memory, it will take longer to accumulate the multiplication for each index using array c. Instead, we use temp_sum, which is in local memory, to accumulate, and then assign the final value to global memory, thus reducing the time spent on memory transfers. Is my reasoning correct?
@NotesByNick 5 years ago
The compiler should put temp in a register, so it should be faster to access than global memory. You don't have to do it, but it likely will improve performance (significantly in this case). I just tested that code with and without the temp_sum variable, and got about 5.6 and 14.1 seconds respectively (matrix size of 2^14 x 2^14) on a TITAN V GPU.
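(For reference, a sketch of the pattern being discussed; the variable names follow the video, but treat this as an illustration rather than the exact repo code.)

```cuda
// Each thread computes one element of C. temp_sum lives in a register, so the
// inner loop avoids a read-modify-write to global memory on every iteration;
// only the final value is written out once.
__global__ void matrixMul(const int *a, const int *b, int *c, int n) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < n && col < n) {
    int temp_sum = 0;
    for (int k = 0; k < n; k++) {
      temp_sum += a[row * n + k] * b[k * n + col];
    }
    c[row * n + col] = temp_sum;  // single global write per thread
  }
}
```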
@hanwang5940 5 years ago
@@NotesByNick I see, thanks for the explanation. I tried with my GPU (GTX 1050) and a matrix size of 2^13 x 2^13. With temp_sum and without temp_sum it was 13.5 s and 14.5 s respectively (the average of several test runs). It's a slight improvement, but not as substantial as yours. I'm guessing it's because the TITAN V is much more powerful than the GTX 1050.
@NotesByNick 5 years ago
@@hanwang5940 Performance optimizations rarely translate 1:1 between GPUs of different architectures (sometimes they don't translate at all). I did the same test on my local machine and found it improved a 2^12 matrix from ~2 seconds to ~1.6 or so. Another influencing factor will be whether or not you are using unified memory (if you want to really isolate the performance changes, you don't want paging to influence your results). You also want to make sure that the same GPU isn't being used to drive a display.
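(For anyone who wants to reproduce these comparisons: one common way to time just the kernel, so that allocation and any unified-memory page migration don't skew the numbers, is with CUDA events. A sketch, assuming a matrixMul kernel, device pointers d_a/d_b/d_c, and the blocks/threads launch configuration already exist in main().)

```cuda
// ...inside main(), after setting up d_a/d_b/d_c and blocks/threads...
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```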
@seanhiggins2085 2 years ago
I followed the code in this video, and for some reason it runs just fine in Release, but when I try to run it in Debug it actually fails the assert(). Any clue why this is?
@VincentZalzal 1 year ago
The C version of the matrix multiplication is not initializing the destination memory to 0 before looping over the += operations. In Debug, depending on your standard library, OS, etc., memory might be default-initialized to 0 after a call to malloc, but that step is usually skipped in Release mode.
@ryanmckenna2047 1 year ago
Doesn't that give us 2^12 threads in total, since it's a 16x16 grid with 4 thread blocks along each grid axis, giving a 64 x 64 grid, which is 2^6 x 2^6 = 2^12 threads in total? Why wouldn't we want one thread per element if there are 1024 x 1024 elements in the matrix in total? In the vector addition example we had one thread per element; in this case we have a factor of 2^8 = 256 fewer threads than that. Please explain.
@zijiali8349 3 months ago
I had the same confusion in the beginning. I believe he "purposely" allocated more threads and space than necessary, only to demonstrate that you don't need a perfect match. This is handled by the if statement in the kernel.
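(As a concrete illustration of the "round up and guard" idea; the numbers are just an example, and the kernel/pointer names reuse the sketch above rather than the exact video code.)

```cuda
// ...inside main(), with d_a/d_b/d_c already allocated on the device...
int n = 1 << 10;                                    // 1024 x 1024 matrix
int BLOCK_SIZE = 16;                                // 16 x 16 = 256 threads per block
int GRID_SIZE = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  // rounds up; 64 here

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 blocks(GRID_SIZE, GRID_SIZE);

// 64 x 64 blocks * 256 threads/block = 2^20 threads: one per element at this
// size. If n were not a multiple of BLOCK_SIZE, the round-up would launch a
// few extra threads, and the bounds check (row < n && col < n) in the kernel
// simply makes them do nothing.
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);
```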
@Omgcoconuts 4 years ago
What about matrix-vector multiplication? How does the indexing change?
@NotesByNick 4 years ago
Good question! The indexing doesn't really change much. Matrix-vector multiplication is just a special case of matrix-matrix multiplication, where one of the dimensions of one of the matrices is 1. The only major change you would make to the indexing is to do row-major accesses instead of column-major accesses for the vector.
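(A rough sketch of that special case, treating the vector as an n x 1 matrix so each thread computes one output element; the kernel name and parameters here are made up for illustration.)

```cuda
// One thread per output element: row i of the matrix dotted with the vector.
__global__ void matrixVecMul(const int *mat, const int *vec, int *out, int n) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < n) {
    int temp_sum = 0;
    for (int k = 0; k < n; k++) {
      // The vector is contiguous, so the column term collapses to one index.
      temp_sum += mat[row * n + k] * vec[k];
    }
    out[row] = temp_sum;
  }
}
```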
@KishanKumar-mz3xr 4 years ago
Is it necessary to store the data in a linear array? Can't we initialize a and b as 2-D arrays of size n*n?
@NotesByNick 4 years ago
You can, but that would be less efficient. Instead of having 1 pointer to n x n elements, you would need n pointers that each point to 1 x n elements. So now, not only do you have to store n x n total elements, you'd also need n pointers (instead of just 1). There's also something to be said for not having fragmented memory. Why break up a giant piece of memory you know you need into small chunks if you don't have to? This can lead to an even larger amount of memory being used because of padding for each row you allocate individually (instead of just a small amount of padding for a single large allocation).
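(In code, the difference boils down to a single allocation plus a computed offset versus n + 1 separate allocations; a sketch, with the wrapper function name made up for illustration.)

```c
#include <stdlib.h>

void layout_example(int n, int row, int col) {
  // Single contiguous allocation: one pointer, one malloc, index computed.
  int *a = (int *)malloc(n * n * sizeof(int));
  a[row * n + col] = 42;  // element (row, col), row-major

  // Array-of-pointers alternative: n + 1 allocations, n extra pointers to
  // store, and the rows may end up scattered (fragmented) across the heap.
  int **b = (int **)malloc(n * sizeof(int *));
  for (int i = 0; i < n; i++) {
    b[i] = (int *)malloc(n * sizeof(int));
  }
  b[row][col] = 42;
}
```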
@BlackbodyEconomics 3 years ago
You totally saved me. Thanks :)
@turnipsheep 3 years ago
Excellent. Really helpful. Thanks
@eduardojreis 3 years ago
Thank you so much for this incredible Crash Course. I think I understood the need for the `thread block` to be 2D, but I am not sure about the `grid size`. Why does it need to be 2D as well? Also, I am a bit confused by the `dim3` having only 2 components. Shouldn't it be `dim2`?
@eduardojreis 3 years ago
Would it be the case that `thread blocks` have a size limit and `grids` don't? I might have missed that.
@Aditya-ec1ts 11 months ago
Yeah @@eduardojreis, I was thinking the same. We can still have a 1D block with 2D threads in it and make it work. I don't think it would affect the compute that much either.
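(On the dim3 question specifically: dim3 always has three components, and any you leave out default to 1, so a "2D" launch is really a dim3 with z = 1. Blocks are also capped at 1024 threads total on current GPUs, while grids can be far larger, so a big matrix always needs many blocks; making the grid 2D is mainly a convenience for mapping blocks to (row, col). A sketch, reusing the kernel and device pointer names from the earlier sketches.)

```cuda
// ...inside main(), with d_a/d_b/d_c already allocated on the device...
dim3 threads(16, 16);  // threads.z defaults to 1 -> a 16 x 16 x 1 block (256 threads)
dim3 blocks(64, 64);   // blocks.z defaults to 1  -> a 64 x 64 x 1 grid

// A 2D layout isn't strictly required: a 1D grid of 1D blocks with manually
// computed row/col indices also works, but mapping blockIdx/threadIdx
// directly to (row, col) keeps the kernel indexing simple.
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);
```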
@michaelscheinfeild9768 1 year ago
Nick, great CUDA course, thank you!
@farinazf3040 3 years ago
Thank you very much! You are a lifesaver!
@eduardojreis 3 years ago
3:50 - If thread blocks > warps > threads, shouldn't this then be a "tiny 2D warp" instead of a "tiny 2D thread block"?
@jellyjiggler5311 4 years ago
Aren't there four threads per block?
@NotesByNick 4 years ago
Only in the slides for the example. The code in the video uses 256 threads per block, and the updated version uses 1024.
@elucasvargas1874 4 years ago
There is a mistake: int GRID_SIZE = (int)ceil(n / BLOCK_SIZE); should be int GRID_SIZE = (int)ceil(n / BLOCK_SIZE + 1);
@NotesByNick 4 years ago
@ELucas Vargas thanks for the comment. This was already fixed in the code on GitHub, and in version two of the CUDA Crash Course series. Cheers, -Nick
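(For readers hitting the same thing: the underlying pitfall is that n / BLOCK_SIZE is integer division, which truncates before ceil() ever runs. Two common ways to round up are sketched below, assuming n and BLOCK_SIZE are in scope; this is not necessarily the exact fix used in the repo.)

```c
#include <math.h>

// n / BLOCK_SIZE truncates first, so ceil() has nothing left to round.
int GRID_SIZE     = (int)ceil((float)n / BLOCK_SIZE);   // force float division
int GRID_SIZE_ALT = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  // all-integer round-up
```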