2678x Faster with CUDA C: Simple Matrix Multiplication on a GPU | Episode 1: Introduction to GPGPU

12,719 views

0Mean1Sigma

A day ago

Comments: 43
@divyamxdeep A month ago
When I clicked on this video, never in a million years could I have imagined that you would explain all of this in such a simple and comprehensive manner. Great work.
@0mean1sigma A month ago
Glad you liked it. I would appreciate your feedback on my blog posts as well (link in the description). I started writing early this month and I'm keen on improving there too. Thanks a lot again 😃
@divyamxdeep A month ago
@0mean1sigma I'll definitely take a look.
@gs1987100 4 days ago
So clearly explained, the most valuable video ever made on this topic... wow...
@0mean1sigma 4 days ago
Thanks a lot. Glad you liked it 😃
@ProjectPhysX A month ago
The real magic starts with cache tiling and shared memory optimization. Hope to see this in Episode 2!
@0mean1sigma A month ago
Yup, that's ep 2 and 3. 😃
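For readers curious what that looks like in practice, below is a minimal sketch of a shared-memory tiled matrix multiplication kernel. It is only illustrative (not the code from the video or the upcoming episodes), and it assumes square N x N row-major matrices with N a multiple of TILE, launched with a TILE x TILE thread block.

#define TILE 16

// Minimal tiled matmul sketch: each block computes a TILE x TILE patch of C,
// staging the needed tiles of A and B in shared memory to cut global-memory traffic.
// Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of the current A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                      // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // don't overwrite tiles still being read
    }
    C[row * N + col] = acc;
}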
@bilal_ali A month ago
How did you make these animations, like 3blue1brown? By the way, your name 0mean1sigma is quite standardized.
@0mean1sigma A month ago
Manim is open source. All my work is open source as well and I've provided a link to the animation code for my videos. Thanks a lot for watching... 😃
@deepak_nigwal A month ago
For a moment, I literally thought it was a 3blue1brown video 😅
@illustrationvaz A month ago
Thank you for this video!! Great content and nice animations
@0mean1sigma A month ago
Thanks a lot 😀
@finmat95 A month ago
Simple and clear. Awesome.
@jakeaustria5445 A month ago
Hi, Standard Normal, thanks for the great vid!😊
@0mean1sigma A month ago
Glad you liked it 😃
@plutoz1152 A month ago
Crisp and clean explanation! I was wondering, could you do a video on warps, thread tiling, and the different types of kernel reduction and fusion, using a simple application-based example?
@0mean1sigma A month ago
The next video is on warps and details of GPU memory (shared memory, registers, etc.)! After that video, I'll make another one on tiled matrix multiplication. If you're interested, please sign up on my website where I post the video notes. That way you can access the detailed content and post your thoughts in the discussion section. Thanks a lot for the comment 😃
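Since the question above mentions kernel reduction, here is a minimal sketch of one common pattern: a warp-level sum reduction using shuffle intrinsics. This is an illustrative example, not code from this video or the next one, and it assumes the block size is a multiple of the 32-thread warp and that *out is zeroed before launch.

// Sum all 32 lanes of a warp; after the loop, lane 0 holds the warp's total.
__inline__ __device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Each thread accumulates a partial sum with a grid-stride loop, reduces it
// within its warp, and lane 0 of every warp adds the result to *out atomically.
__global__ void sum_kernel(const float *in, float *out, int n)
{
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        v += in[i];

    v = warp_reduce_sum(v);
    if ((threadIdx.x & 31) == 0)   // first lane of each warp
        atomicAdd(out, v);
}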
@dan_pal A month ago
This was an amazing explanation, thanks for sharing.
@0mean1sigma A month ago
Thanks a lot. Glad you liked it 😃
@dtamien A month ago
I loved this video. I wish it had kept going.
@0mean1sigma A month ago
Thanks a lot 😃 I have a few more videos on GPU programming coming up...
@dipi71 2 days ago
What about ARM, AMD, Gallium, IBM, Intel, Texas Instruments or POCL? Your CUDA example only runs on Nvidia hardware. Use OpenCL.
@finmat95 A month ago
Just two questions: 1. What if you want to use the GPU's power and efficiency without relying on CUDA, writing generic code that runs on any GPU (for AMD users, for example)? What code do you have to write? 2. Would the performance be the same?
@0mean1sigma A month ago
AMD has ROCm (that's their CUDA). However, I'm not sure how well it works. I tried ROCm some 4-5 years ago and it was not a pleasant experience back then. I moved to Nvidia after that because with Nvidia, things just work. There's also OpenCL, but it too has installation and performance issues (in some cases). But the concepts of parallel programming that I focus on are the same everywhere; only the syntax changes. Hope this answers your questions. 😃
@MrHaggyy 16 days ago
In Python you have modules like CuPy to use the GPU. TensorFlow and PyTorch will also use it under the hood. The performance will be slightly worse because of the abstraction overhead, but that's a point of diminishing returns in almost every case.
@finmat95 16 days ago
@MrHaggyy That covers Python, but what about C?
@MrHaggyy 15 days ago
@finmat95 I'm not aware of C/C++ modules that hide the hardware abstraction at the level Python does. But those Python modules use a lot of C/C++ themselves, so you could look into their code. Or use a framework like Chlorine or Kompute. Kompute is nice because it has the same examples in both C++ and Python.
@sehbanomer8151 A month ago
Great introduction. One thing to add: each thread can also compute a small block of output elements rather than a single one.
@0mean1sigma A month ago
You're right! But that's only required if the matrix is very large; otherwise you'll be looping over the elements sequentially, defeating the purpose of parallelization. There are several optimizations yet to be done, and I'm working on a video right now where I'll explain how we can use the GPU hardware smartly (especially the different memory components) to speed up the computations even more. If you're interested in an early discussion on that topic, please sign up for the blog posts on my website; I would love to have technical discussions in the comment section there (I generally post early there and you'll get notified by mail). By the way, thanks a lot for watching the video 😃
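As a rough illustration of the idea discussed in this thread (each thread producing a small block of outputs, sometimes called thread coarsening), here is a minimal sketch. The kernel name, the COARSE factor, and the assumption that N is a multiple of COARSE * blockDim.x (and of blockDim.y) are illustrative choices, not the video's code.

#define COARSE 4

// Each thread computes COARSE consecutive output columns in one row of C,
// so the value loaded from A is reused COARSE times from a register.
__global__ void matmul_coarsened(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int colStart = (blockIdx.x * blockDim.x + threadIdx.x) * COARSE;

    float acc[COARSE] = {0.0f};
    for (int k = 0; k < N; k++) {
        float a = A[row * N + k];              // one load, reused COARSE times
        for (int c = 0; c < COARSE; c++)
            acc[c] += a * B[k * N + colStart + c];
    }
    for (int c = 0; c < COARSE; c++)
        C[row * N + colStart + c] = acc[c];
}

The trade-off mentioned in the reply above still applies: for a fixed matrix size, a larger COARSE means fewer threads, so coarsening only helps when there is still enough parallel work left to keep the GPU busy.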
@vigneshs.666 A month ago
Amazing video!
@0mean1sigma A month ago
Thanks a lot 😃
@empatikokumalar8202 A month ago
In the matrix multiplications shown at 2:00, are the numbers of rows and columns in the matrices variable or fixed? If variable, over what range; if fixed, at what value? Also, how many operations do these matrix multiplications involve?
@0mean1sigma A month ago
In the real world, matrix size is set by the data, so for different problems the number of rows/columns will be different (it's variable in that sense). However, once execution begins, the matrix size does not change. The range of values can be anything (there's no limit on that), but for very large and very small matrices the parallelization techniques will change. I'll cover a few of those in my next video (like using shared memory, memory coalescing, and thread coarsening), so please keep an eye out for that (you can also sign up on my website, where I publish detailed blog posts, and you'll get notified when I publish something). As far as the operations are concerned, I have done a detailed analysis of the computational cost in the video notes (link in the description); in short, matrix multiplication has O(N^3) complexity. I appreciate you watching the video, and if you have more questions after reading the notes, I'm happy to answer them in the discussion section of my website. 😃
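As a quick worked example of that O(N^3) cost: each of the N^2 output elements needs N multiplications and N-1 additions, so multiplying two N x N matrices takes roughly 2*N^3 floating-point operations. For N = 1024 that is already about 2.1 billion operations, which is why spreading the work across GPU threads pays off.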
@empatikokumalar8202 A month ago
@0mean1sigma I would be very happy if you could make a video about H100s sometime.
@empatikokumalar8202 A month ago
@0mean1sigma I have developed a different processor method. It is not transistor-based, so it is faster and consumes less energy than you can imagine. But I can't find a way to use it and make money from it. I am no longer sure that companies and governments are really looking into this.
@0mean1sigma A month ago
I'm not sure what I can say about H100s. My focus is on writing fast (enough), easy-to-understand code by understanding the general hardware components (not those specific to a particular GPU/CPU model), so that it scales well across hardware generations. In any case, I'm not influential enough to get access to H100s, so I don't think I could write code with H100 specs in mind (at least at this point in time).
@cariyaputta A month ago
Nice channel.
@0mean1sigma A month ago
Thanks. Glad you liked the content 😃
@user-nb6bo6hl6d A month ago
🙌👏
@warpdrive9229 A month ago
Namaste Tushar bhai! How are you?
@0mean1sigma A month ago
All good! Hope you're doing well 😃
@warpdrive9229 A month ago
@0mean1sigma I too have enrolled in a PhD program in Machine Learning, like you. Pretty anxious. Hope things turn out well XD
@0mean1sigma A month ago
All the best 😃