Your examples and illustrations are helpful! Thanks!
@NotesByNick 5 years ago
Happy to help fella!
@eaemmm333 3 years ago
Thank you for the informative courses. One piece of feedback: your picture hides the lower-right side of the screen, which might contain some information. If you could kindly make it smaller, that would be better.
About the inside of `verify_result`: are you relying on malloc initializing `verify_c` to zero?
@NotesByNick 5 years ago
I got lucky and happened to get a chunk of zeroed-out memory when I recorded the video. The code on GitHub was fixed post-release of this video on Feb 22, and does not rely on undefined behavior.
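(For anyone reading along later, here is a minimal sketch of what a CPU-side check that never depends on what malloc returned can look like. The function signature and row-major layout are assumptions for illustration, not necessarily the exact code on GitHub.)

```c
#include <assert.h>

// Accumulate each element into a local variable that is explicitly zeroed,
// so correctness never depends on the contents of freshly malloc'd memory.
void verify_result(const int *a, const int *b, const int *c, int n) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      int tmp = 0;  // explicit initialization instead of relying on malloc
      for (int k = 0; k < n; k++) {
        tmp += a[i * n + k] * b[k * n + j];
      }
      assert(tmp == c[i * n + j]);  // compare against the GPU result
    }
  }
}
```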
@hanwang5940 5 years ago
Is the intermediate variable temp_sum necessary? My train of thought is that since array c is in global memory, it will take longer to accumulate the multiplication for each index using array c. Instead, we use temp_sum, which is in local memory, to accumulate, and then assign the final value to global memory, thus reducing the time spent on memory transfers. Is my reasoning correct?
@NotesByNick 5 years ago
The compiler should put temp in a register, so it should be faster to access than global memory. You don't have to do it, but it likely will improve performance (significantly in this case). I just tested that code with and without the temp_sum variable, and got about 5.6 and 14.1 seconds respectively (matrix size of 2^14 x 2^14) on a TITAN V GPU.
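(For reference, a sketch of the pattern being discussed; the variable names follow the video, but treat this as an illustration rather than the exact repo code.)

```cuda
// Each thread computes one element of C. temp_sum lives in a register, so the
// inner loop avoids a read-modify-write to global memory on every iteration;
// only the final value is written out once.
__global__ void matrixMul(const int *a, const int *b, int *c, int n) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < n && col < n) {
    int temp_sum = 0;
    for (int k = 0; k < n; k++) {
      temp_sum += a[row * n + k] * b[k * n + col];
    }
    c[row * n + col] = temp_sum;  // single global write per thread
  }
}
```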
@hanwang5940 5 years ago
@@NotesByNick I see, thanks for the explanation. I tried with my GPU (GTX 1050) and a matrix size of 2^13 x 2^13. With temp_sum and without temp_sum it was 13.5 s and 14.5 s respectively (the average of several test runs). It's a slight improvement, but not as substantial as yours. I'm guessing it's because the TITAN V is much more powerful than the GTX 1050.
@NotesByNick 5 years ago
@@hanwang5940 Performance optimizations rarely translate 1:1 between GPUs of different architectures (sometimes they don't translate at all). I did the same test on my local machine and found it improved a 2^12 matrix from ~2 seconds to ~1.6 or so. Another influencing factor will be whether or not you are using unified memory (if you want to really isolate the performance changes, you don't want paging to influence your results). You also want to make sure that the same GPU isn't being used to drive a display.
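(For anyone who wants to reproduce these comparisons: one common way to time just the kernel, so that allocation and any unified-memory page migration don't skew the numbers, is with CUDA events. A sketch, assuming a matrixMul kernel, device pointers d_a/d_b/d_c, and the blocks/threads launch configuration already exist in main().)

```cuda
// ...inside main(), after setting up d_a/d_b/d_c and blocks/threads...
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);
cudaEventRecord(stop);
cudaEventSynchronize(stop);  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
printf("Kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```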
@seanhiggins2085 2 years ago
I followed the code in this video, and for some reason it runs just fine in Release, but when I try to run it in Debug it actually fails the assert(). Any clue why this is?
@VincentZalzal 1 year ago
The C version of the matrix multiplication is not initializing the destination memory to 0 before looping over the += operations. In Debug, depending on your standard library, OS, etc., memory might be default-initialized to 0 after a call to malloc, but that step is usually skipped in Release mode.
@ryanmckenna2047 1 year ago
Doesn't that give us 2^12 threads in total, since it's a 16x16 grid with 4 thread blocks along each grid axis, giving a 64 x 64 grid, which is 2^6 x 2^6 = 2^12 threads in total? Why wouldn't we want one thread per element if there are 1024 x 1024 elements in the matrix in total? In the vector addition example we had one thread per element; in this case we have a factor of 2^8 = 256 fewer threads than that. Please explain.
@zijiali8349 3 months ago
I had the same confusion in the beginning. I believe he "purposely" allocated more threads and space than necessary, only to demonstrate that you don't need a perfect match. This is handled by the if statement in the kernel.
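(As a concrete illustration of the "round up and guard" idea; the numbers are just an example, and the kernel/pointer names reuse the sketch above rather than the exact video code.)

```cuda
// ...inside main(), with d_a/d_b/d_c already allocated on the device...
int n = 1 << 10;                                    // 1024 x 1024 matrix
int BLOCK_SIZE = 16;                                // 16 x 16 = 256 threads per block
int GRID_SIZE = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  // rounds up; 64 here

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 blocks(GRID_SIZE, GRID_SIZE);

// 64 x 64 blocks * 256 threads/block = 2^20 threads: one per element at this
// size. If n were not a multiple of BLOCK_SIZE, the round-up would launch a
// few extra threads, and the bounds check (row < n && col < n) in the kernel
// simply makes them do nothing.
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);
```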
@Omgcoconuts 4 years ago
What about matrix-vector multiplication? How does the indexing change?
@NotesByNick 4 years ago
Good question! The indexing doesn't really change much. Matrix-vector multiplication is just a special case of matrix-matrix multiplication, where one of the dimensions of one of the matrices is 1. The only major change you would make to the indexing is to do row-major accesses instead of column-major accesses for the vector.
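(A rough sketch of that special case, treating the vector as an n x 1 matrix so each thread computes one output element; the kernel name and parameters here are made up for illustration.)

```cuda
// One thread per output element: row i of the matrix dotted with the vector.
__global__ void matrixVecMul(const int *mat, const int *vec, int *out, int n) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;

  if (row < n) {
    int temp_sum = 0;
    for (int k = 0; k < n; k++) {
      // The vector is contiguous, so the column term collapses to one index.
      temp_sum += mat[row * n + k] * vec[k];
    }
    out[row] = temp_sum;
  }
}
```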
@KishanKumar-mz3xr 4 years ago
Is it necessary to store the data in a linear array? Can't we initialize a and b as 2-D arrays of size n*n?
@NotesByNick 4 years ago
You can, but that would be less efficient. Instead of having 1 pointer to n x n elements, you would need n pointers that each point to 1 x n elements. So now, not only do you have to store n x n total elements, you'd also need n pointers (instead of just 1). There's also something to be said for not having fragmented memory. Why break up a giant piece of memory you know you need into small chunks if you don't have to? This can lead to an even larger amount of memory being used because of padding for each row you allocate individually (instead of just a small amount of padding for a single large allocation).
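(In code, the difference boils down to a single allocation plus a computed offset versus n + 1 separate allocations; a sketch, with the wrapper function name made up for illustration.)

```c
#include <stdlib.h>

void layout_example(int n, int row, int col) {
  // Single contiguous allocation: one pointer, one malloc, index computed.
  int *a = (int *)malloc(n * n * sizeof(int));
  a[row * n + col] = 42;  // element (row, col), row-major

  // Array-of-pointers alternative: n + 1 allocations, n extra pointers to
  // store, and the rows may end up scattered (fragmented) across the heap.
  int **b = (int **)malloc(n * sizeof(int *));
  for (int i = 0; i < n; i++) {
    b[i] = (int *)malloc(n * sizeof(int));
  }
  b[row][col] = 42;
}
```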
@BlackbodyEconomics 3 years ago
You totally saved me. Thanks :)
@turnipsheep 3 years ago
Excellent. Really helpful. Thanks
@eduardojreis 3 years ago
Thank you so much for this incredible Crash Course. I think I understood the need for the `thread block` to be 2D, but I am not sure about the `grid size`. Why does it need to be 2D as well? Also, I am a bit confused by the `dim3` having only 2 components. Shouldn't it be `dim2`?
@eduardojreis 3 years ago
Would it be the case that `thread blocks` have a size limit and `grids` don't? I might have missed that.
@Aditya-ec1ts 11 months ago
Yeah @@eduardojreis, I was thinking the same. We can still have a 1D block with 2D threads in it and make it work. I don't think it would affect the compute that much either.
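(On the dim3 question specifically: dim3 always has three components, and any you leave out default to 1, so a "2D" launch is really a dim3 with z = 1. Blocks are also capped at 1024 threads total on current GPUs, while grids can be far larger, so a big matrix always needs many blocks; making the grid 2D is mainly a convenience for mapping blocks to (row, col). A sketch, reusing the kernel and device pointer names from the earlier sketches.)

```cuda
// ...inside main(), with d_a/d_b/d_c already allocated on the device...
dim3 threads(16, 16);  // threads.z defaults to 1 -> a 16 x 16 x 1 block (256 threads)
dim3 blocks(64, 64);   // blocks.z defaults to 1  -> a 64 x 64 x 1 grid

// A 2D layout isn't strictly required: a 1D grid of 1D blocks with manually
// computed row/col indices also works, but mapping blockIdx/threadIdx
// directly to (row, col) keeps the kernel indexing simple.
matrixMul<<<blocks, threads>>>(d_a, d_b, d_c, n);
```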
@michaelscheinfeild9768 1 year ago
Nick, great CUDA course, thank you!
@farinazf3040 3 years ago
Thank you very much! You are a lifesaver!
@eduardojreis 3 years ago
3:50 - If thread blocks > warps > threads, shouldn't this then be a "tiny 2D warp" instead of a "tiny 2D thread block"?
@jellyjiggler5311 4 years ago
Aren't there four threads per block?
@NotesByNick 4 years ago
Only in the slides for the example. The code in the video uses 256 threads per block, and the updated version uses 1024.
@elucasvargas1874 4 years ago
There is a mistake: int GRID_SIZE = (int)ceil(n / BLOCK_SIZE); should be int GRID_SIZE = (int)ceil(n / BLOCK_SIZE + 1);
@NotesByNick 4 years ago
@ELucas Vargas thanks for the comment. This was already fixed in the code on GitHub, and in version two of the CUDA Crash Course series. Cheers, -Nick
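(For readers hitting the same thing: the underlying pitfall is that n / BLOCK_SIZE is integer division, which truncates before ceil() ever runs. Two common ways to round up are sketched below, assuming n and BLOCK_SIZE are in scope; this is not necessarily the exact fix used in the repo.)

```c
#include <math.h>

// n / BLOCK_SIZE truncates first, so ceil() has nothing left to round.
int GRID_SIZE     = (int)ceil((float)n / BLOCK_SIZE);   // force float division
int GRID_SIZE_ALT = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;  // all-integer round-up
```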