2678x Faster with CUDA C: Simple Matrix Multiplication on a GPU | Episode 1: Introduction to GPGPU

12,719 views

0Mean1Sigma

A day ago

Comments: 43
@divyamxdeep A month ago
When I clicked on this video, never in a million years could I have imagined that you would explain all of this in such a simple and comprehensive manner. Great work.
@0mean1sigma A month ago
Glad you liked it. I would appreciate your feedback on my blog posts as well (link in the description). I started writing early this month and I'm keen on improving there too. Thanks a lot again 😃
@divyamxdeep A month ago
@0mean1sigma I'll definitely take a look.
@gs1987100 4 days ago
So clearly explained, the most valuable video ever made on this topic... wow...
@0mean1sigma 4 days ago
Thanks a lot. Glad you liked it 😃
@ProjectPhysX A month ago
The real magic starts with cache tiling and shared memory optimization. Hope to see this in Episode 2!
@0mean1sigma A month ago
Yup, that's ep 2 and 3. 😃
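For readers curious what that looks like in practice, below is a minimal sketch of a shared-memory tiled matrix multiplication kernel. It is only illustrative (not the code from the video or the upcoming episodes), and it assumes square N x N row-major matrices with N a multiple of TILE, launched with a TILE x TILE thread block.

#define TILE 16

// Minimal tiled matmul sketch: each block computes a TILE x TILE patch of C,
// staging the needed tiles of A and B in shared memory to cut global-memory traffic.
// Launch with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of the current A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                      // wait until the whole tile is loaded

        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // don't overwrite tiles still being read
    }
    C[row * N + col] = acc;
}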
@bilal_ali A month ago
How did you make these animations, like 3blue1brown? By the way, your name 0mean1sigma is quite standardized.
@0mean1sigma A month ago
Manim is open source. All my work is open source as well and I've provided a link to the animation code for my videos. Thanks a lot for watching... 😃
@deepak_nigwal A month ago
For a moment, I literally thought it was a 3blue1brown video 😅
@illustrationvaz A month ago
Thank you for this video!! Great content and nice animations
@0mean1sigma A month ago
Thanks a lot 😀
@finmat95 A month ago
Simple and clear. Awesome.
@jakeaustria5445 A month ago
Hi, Standard Normal, thanks for the great vid!😊
@0mean1sigma A month ago
Glad you liked it 😃
@plutoz1152 A month ago
Crisp and clean explanation! I was wondering, could you do a video on warps, thread tiling, and the different types of kernel reduction and fusion, using a simple application-based example?
@0mean1sigma A month ago
The next video is on warps and details of GPU memory (shared memory, registers, etc.)! After that video, I'll make another one on tiled matrix multiplication. If you're interested, please sign up on my website where I post the video notes. That way you can access the detailed content and post your thoughts in the discussion section. Thanks a lot for the comment 😃
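Since the question above mentions kernel reduction, here is a minimal sketch of one common pattern: a warp-level sum reduction using shuffle intrinsics. This is an illustrative example, not code from this video or the next one, and it assumes the block size is a multiple of the 32-thread warp and that *out is zeroed before launch.

// Sum all 32 lanes of a warp; after the loop, lane 0 holds the warp's total.
__inline__ __device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Each thread accumulates a partial sum with a grid-stride loop, reduces it
// within its warp, and lane 0 of every warp adds the result to *out atomically.
__global__ void sum_kernel(const float *in, float *out, int n)
{
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        v += in[i];

    v = warp_reduce_sum(v);
    if ((threadIdx.x & 31) == 0)   // first lane of each warp
        atomicAdd(out, v);
}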
@dan_pal A month ago
This was an amazing explanation, thanks for sharing.
@0mean1sigma A month ago
Thanks a lot. Glad you liked it 😃
@dtamien A month ago
I loved this video. I wish it had kept going.
@0mean1sigma A month ago
Thanks a lot 😃 I have a few more videos on GPU programming coming up...
@dipi71 2 days ago
What about ARM, AMD, Gallium, IBM, Intel, Texas Instruments or POCL? Your CUDA example only runs on Nvidia hardware. Use OpenCL.
@finmat95 A month ago
Just two questions: 1. What if you want to use the GPU's power and efficiency without relying on CUDA, writing generic code that runs on any GPU (for AMD users, for example)? What code do you have to write? 2. Would the performance be the same?
@0mean1sigma A month ago
AMD has ROCm (that's their CUDA). However, I'm not sure how well it works. I tried ROCm some 4-5 years ago and it was not a pleasant experience back then. I moved to Nvidia after that because with Nvidia, things just work. There's also OpenCL, but it too has installation and performance issues (in some cases). But the concepts of parallel programming that I focus on are the same everywhere; only the syntax changes. Hope this answers your questions. 😃
@MrHaggyy 16 days ago
In Python you have modules like CuPy to use the GPU. TensorFlow and PyTorch will also use it under the hood. The performance will be slightly worse because of the abstraction overhead, but that's a point of diminishing returns in almost every case.
@finmat95 16 days ago
@MrHaggyy That covers Python, but what about C?
@MrHaggyy 15 days ago
@finmat95 I'm not aware of C/C++ modules that hide the hardware abstraction at the level Python does. But those Python modules use a lot of C/C++ themselves, so you could look into their code. Or use a framework like Chlorine or Kompute. Kompute is nice because it has the same examples in both C++ and Python.
@sehbanomer8151 A month ago
Great introduction. One thing to add: each thread can also compute a small block of output elements rather than a single one.
@0mean1sigma A month ago
You're right! But that's only required if the matrix is very large; otherwise you'll be looping over the elements sequentially, defeating the purpose of parallelization. There are several optimizations yet to be done, and I'm working on a video right now where I'll explain how we can use the GPU hardware smartly (especially the different memory components) to speed up the computations even more. If you're interested in an early discussion on that topic, please sign up for the blog posts on my website; I would love to have technical discussions in the comment section there (I generally post early there and you'll get notified by mail). By the way, thanks a lot for watching the video 😃
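As a rough illustration of the idea discussed in this thread (each thread producing a small block of outputs, sometimes called thread coarsening), here is a minimal sketch. The kernel name, the COARSE factor, and the assumption that N is a multiple of COARSE * blockDim.x (and of blockDim.y) are illustrative choices, not the video's code.

#define COARSE 4

// Each thread computes COARSE consecutive output columns in one row of C,
// so the value loaded from A is reused COARSE times from a register.
__global__ void matmul_coarsened(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int colStart = (blockIdx.x * blockDim.x + threadIdx.x) * COARSE;

    float acc[COARSE] = {0.0f};
    for (int k = 0; k < N; k++) {
        float a = A[row * N + k];              // one load, reused COARSE times
        for (int c = 0; c < COARSE; c++)
            acc[c] += a * B[k * N + colStart + c];
    }
    for (int c = 0; c < COARSE; c++)
        C[row * N + colStart + c] = acc[c];
}

The trade-off mentioned in the reply above still applies: for a fixed matrix size, a larger COARSE means fewer threads, so coarsening only helps when there is still enough parallel work left to keep the GPU busy.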
@vigneshs.666 A month ago
Amazing video!
@0mean1sigma A month ago
Thanks a lot 😃
@empatikokumalar8202 A month ago
In the matrix multiplications shown at 2:00, are the numbers of rows and columns in the matrices variable or fixed? If variable, over what range; if fixed, at what value? Also, how many operations do these matrix multiplications involve?
@0mean1sigma A month ago
In the real world, matrix size is set by the data, so for different problems the number of rows/columns will be different (it's variable in that sense). However, once execution begins, the matrix size does not change. The range of values can be anything (there's no limit on that), but for very large and very small matrices the parallelization techniques will change. I'll cover a few of those in my next video (like using shared memory, memory coalescing, and thread coarsening), so please keep an eye out for that (you can also sign up on my website, where I publish detailed blog posts, and you'll get notified when I publish something). As far as the operations are concerned, I have done a detailed analysis of the computational cost in the video notes (link in the description); in short, matrix multiplication has O(N^3) complexity. I appreciate you watching the video, and if you have more questions after reading the notes, I'm happy to answer them in the discussion section of my website. 😃
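As a quick worked example of that O(N^3) cost: each of the N^2 output elements needs N multiplications and N-1 additions, so multiplying two N x N matrices takes roughly 2*N^3 floating-point operations. For N = 1024 that is already about 2.1 billion operations, which is why spreading the work across GPU threads pays off.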
@empatikokumalar8202 A month ago
@0mean1sigma I would be very happy if you could make a video about H100s sometime.
@empatikokumalar8202 A month ago
@0mean1sigma I have developed a different processor method. It is not transistor-based, so it is faster and consumes less energy than you can imagine. But I can't find a way to use it and make money from it. I am no longer sure that companies and governments are really looking into this.
@0mean1sigma A month ago
I'm not sure what I can say about H100s. My focus is on writing fast (enough), easy-to-understand code by understanding the general hardware components (not those specific to a particular GPU/CPU model), so that it scales well across hardware generations. In any case, I'm not influential enough to get access to H100s, so I don't think I could write code with H100 specs in mind (at least at this point in time).
@cariyaputta A month ago
Nice channel.
@0mean1sigma A month ago
Thanks. Glad you liked the content 😃
@user-nb6bo6hl6d A month ago
🙌👏
@warpdrive9229 A month ago
Namaste Tushar bhai! How are you?
@0mean1sigma A month ago
All good! Hope you're doing well 😃
@warpdrive9229 A month ago
@0mean1sigma I too have enrolled in a PhD program in Machine Learning, like you. Pretty anxious. Hope things turn out well XD
@0mean1sigma A month ago
All the best 😃