Lecture 41: FlashInfer
1:08:51
A day ago
Lecture 39: Torchtitan
1:43:28
A month ago
Lecture 38: Low Bit ARM kernels
1:03:41
Lecture 35: SGLang
45:19
2 months ago
Lecture 34: Low Bit Triton Kernels
1:45:31
Lecture 33: Bitblas
1:01:48
3 months ago
Lecture 32: Unsloth
1:24:54
3 months ago
Lecture 31: Beginners Guide to Metal
1:32:08
Lecture 30: Quantized Training
1:16:40
GPU MODE IRL 2024 Keynotes
1:48:19
4 months ago
Lecture 29: Triton Internals
1:04:49
Lecture 26: SYCL Mode (Intel GPU)
1:18:35
Lecture 23: Tensor Cores
1:47:50
7 months ago
Lecture 21: Scan Algorithm Part 2
1:04:41
Lecture 20: Scan Algorithm
1:02:20
8 months ago
Lecture 18: Fusing Kernels
1:23:22
8 months ago
Lecture 17: NCCL
59:43
8 months ago
Lecture 16: On Hands Profiling
55:41
Bonus Lecture: CUDA C++ llm.cpp
1:16:05
Comments
@hsubyron2277 2 days ago
🐐
@ufilh 3 days ago
Hi Mark, IMHO the arithmetic intensity of ReLU should not depend on whether the input is larger than zero, since either way you need to write to the output tensor (it is not an in-place update). My understanding is that the Nvidia article gives 0.25 for ReLU simply because it assumes FP16 instead of the FP32 assumed in the video.
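A quick back-of-the-envelope check of that claim (a sketch of the reasoning, not code from the video or the article): an out-of-place ReLU does roughly one FLOP per element and moves one input plus one output element, so the intensity depends only on the element size.

```python
# Rough arithmetic-intensity estimate for elementwise, out-of-place ReLU:
# 1 FLOP per element, one element read + one element written.
def relu_arithmetic_intensity(bytes_per_element: int) -> float:
    flops_per_element = 1                  # max(x, 0)
    bytes_moved = 2 * bytes_per_element    # load input, store output
    return flops_per_element / bytes_moved

print(relu_arithmetic_intensity(4))  # FP32 -> 0.125 FLOP/byte
print(relu_arithmetic_intensity(2))  # FP16 -> 0.25 FLOP/byte, the figure in the NVIDIA article
```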
@吴俊-n8q 4 days ago
It is very useful for me.
@mohitarora3506 8 days ago
Please share the presentation.
@Applestaffman 9 days ago
img.flatten should equal c x h x w, not 33750; it should be 33750 x 3.
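A minimal sketch of the shape check being made here; the 3-channel 150 x 225 image size is an assumption chosen only so that h * w = 33750.

```python
# Hypothetical image dimensions chosen so h * w = 33750; for illustration only.
import torch

c, h, w = 3, 150, 225
img = torch.rand(c, h, w)
print(img.flatten().numel())  # 101250 = c * h * w = 3 * 33750, not 33750
```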
@realisticlevel2553 11 days ago
Thanks for the re-upload so that we can catch it later! This is a banger talk
@esaliya 12 days ago
thanks for the comprehensive tutorial!
@tomasruiz94 16 days ago
This guy is 100% cracked! No slides & no fluff. Super high-density content ♥
@TheAIEpiphany 19 days ago
Cade is cracked! :) really fun lecture!
@SpiderLLL-i1l 22 days ago
Thank you for your excellent work!
@MrMalish13 22 days ago
There are no sync instructions for global loads and stores; does that mean those instructions are synchronous?
@MrMalish13 22 days ago
In the PTX code, the instruction on line 139, @%p11 st.global.b32 [ %rd14 + 0 ], { %r22 };, uses %rd14, which is assigned on line 124: add.s64 %rd14, %rd9, 2560;. It looks like those might be fused: the "+0" in the global store could seemingly be replaced with +2560. Were you compiling without the -O3 option?
@karanjakhar 23 days ago
Great talk
@konstantinwilleke6292 23 days ago
Excellent lecture. Thank you <3
@erfanmiahi9462 25 days ago
Thank you. I was trying to find the collectives GitHub repository but it looks like it's not public anymore. Do you plan to share it sometime?
@iamsiddhantsahu 26 days ago
Great video -- really good talk by Charles!
@panosdimi4713 A month ago
Happy new year my friend and happy jobs 🎈
@wolpumba4099 A month ago
CUDA Docs for Humans: A Comprehensive Guide to GPU Programming

* 0:00 Introduction: This presentation discusses "CUDA Docs for Humans," a comprehensive, interconnected resource for understanding GPU programming, particularly the CUDA stack.
* 1:02 Live Demo: The resource is demonstrated, showcasing its interconnected nature, linking terms like "compute capability" to related concepts like "streaming multiprocessor architecture" with diagrams.
* 2:35 Design: The documentation features a visually engaging, "CUDA mode" design for an enjoyable learning experience.
* 3:03 Origin and Motivation: The project originated from the presenter's experiences in machine learning research and deployment, highlighting the need for a unified understanding of the CUDA stack.
* 4:52 Debugging Performance Issues: Tracing tools like the PyTorch profiler are crucial for understanding performance bottlenecks and the asynchronous nature of CUDA operations.
* 6:32 Deployment and Scaling: The presenter's current work focuses on real-world GPU deployment and scaling, leading to extensive debugging and the creation of the document "I am done not understanding the CUDA stack."
* 7:28 Motivation for Public Documentation: The need for a comprehensive, publicly available resource became apparent after discussions with other professionals who had created similar internal documents, recognizing the limitations of existing, scattered documentation.
* 9:49 High-Level Takeaways: The presenter shares key insights gained from compiling the documentation, emphasizing the multifaceted nature of CUDA and the importance of the PTX ISA.
* 10:22 Multiple Meanings of "CUDA": CUDA refers to different layers of the stack: the software platform, the abstract programming model, and the hardware architecture.
* 11:38 CUDA Software Platform: This layer includes the CUDA runtime and driver APIs, facilitating interaction between application code and the GPU.
* 13:28 CUDA Programming Model: This abstract model defines how programs are written at the thread level, emphasizing shared memory and synchronization within thread blocks.
* 16:47 Independence of Abstraction: The programming model is independent of specific hardware or language implementations, focusing on what can and cannot be assumed about parallel execution.
* 17:48 Compute Unified Device Architecture (CUDA): This approach to hardware design emphasizes a homogeneous array of streaming multiprocessors (SMs) for scalability, contrasting with earlier heterogeneous GPU designs.
* 20:38 Historical Context: The 2008 whitepaper by Lindholm et al. provides a comprehensive overview of the CUDA vision, from hardware to ecosystem.
* 21:31 Recommendation: The presenter strongly recommends reading the Lindholm et al. whitepaper for a deeper understanding of CUDA's foundational principles.
* 21:52 Parallel Thread Execution (PTX) ISA: PTX is highlighted as the most crucial part of the stack, acting as an intermediate representation that enables forward compatibility and transparent scaling.
* 26:25 PTX and Forward Compatibility: PTX allows programs to run on different GPUs and benefit from new hardware features without recompilation.
* 27:45 PTX Virtual Machine: PTX defines a virtual machine with multiple processors and a memory hierarchy, ensuring predictable program behavior.
* 28:41 Constraints and Scalability: The constraints in the CUDA programming model, such as limitations on synchronization, enable transparent scaling across different hardware configurations.
* 30:15 Future of the Project: The presenter discusses plans to enhance the resource, including interactive elements, expanded content, and potential collaborations.
* 31:56 Community Resource: The goal is to make the documentation a valuable community resource, potentially open-sourcing it in the future.
* 32:09 Short-Term Goals: Plans include making the documentation compatible with language models (e.g., "CUDA Docs for Chatbots"), adding interactive code snippets, improving diagrams, and expanding content on synchronization and thread block clusters.
* 38:21 External Feedback: The presenter emphasizes the importance of community feedback and contributions to improve the resource.
* 38:45 Medium-Term Goals: Future plans involve covering performance debugging, GPU fleet management, multi-GPU execution, and potentially partnering with universities for educational content.
* 40:53 Call for Collaboration: The presenter invites collaboration on these medium-term goals, particularly in areas like performance debugging and multi-GPU programming.
* 44:06 Hiring at Modal: The presenter's company, Modal, is hiring GPU experts and offering opportunities for open-source contributions.
* 45:48 Closing Remarks: The presenter thanks the audience and encourages further engagement on Discord and Twitter.
* 46:55 Document Availability: The internal document "I am done not understanding the CUDA stack" has been incorporated into the public GPU Glossary.
* 47:01 Discussion on Documentation Issues: The presenter acknowledges challenges with interlinking information across different documentation sources and suggests that community involvement can help address these issues.
* 49:16 Challenges of Documentation: The presenter notes that documenting the absolute frontiers of performance is inherently difficult due to the breakdown of abstractions and the need for real-world experience to refine understanding.

I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.04 Input tokens: 25218 Output tokens: 1137
@mastershredder2002 A month ago
bruh is a netrunner and nvidia pooed on his head.
@ArpitAgarwal1 A month ago
very helpful talk! love this
@skanderbegvictor6487 A month ago
Always liked listening to Charles.
@charles_irl A month ago
thanks!
@shisanliu6314 A month ago
Why don't we need a mask during naive_matmul_k?
@yolo4eva A month ago
This is an awesome lecture. Thank you so much!
@rehanbhatti5843 A month ago
thank you
@zaursamedov8906 A month ago
Thank you❤
@mytech6779 2 months ago
One minute in and my "um" counter already threw an overflow exception.
@anastasiiafilippova5212 2 months ago
Thank you for this awesome content, super helpful! Just a small piece of advice for those who do not have CUDA (for instance, macOS users): I am using a Colab Tesla T4 GPU.
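For anyone following the same route, a minimal sketch of checking the Colab GPU before running the lecture code (the runtime-menu path in the comment is an assumption about Colab's UI; the torch calls themselves are standard):

```python
# On Colab: Runtime -> Change runtime type -> GPU (e.g. a free T4), then:
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    print("Running on", torch.cuda.get_device_name(0))   # e.g. "Tesla T4"
else:
    print("No CUDA device found; falling back to CPU")

x = torch.rand(1024, 1024, device=device)
print((x @ x).sum().item())   # a tiny matmul to confirm the device works
```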
@xiyanwang-n6m 2 months ago
Could you please tell me where to download the dataset you demonstrated in your video?
@sucim 2 months ago
You misspelled his name; it is "Wizard", not Aroun. This was amazing. He is so well spoken and knowledgeable! How can he even make all of this up on the fly?! And there is still a good structure to all of it. I could listen to him speak about this for another couple of hours! This also highly motivates me to convert my own project to CUDA. He conveys this "you can just do things" feeling. I have done a good amount of CUDA coding already, but it always felt like poking a black box. I think the tools and his workflows around them might be even more valuable than the other details.
@mobu-o1g 2 months ago
👑
@Pleexed 2 months ago
Could you add a link to Jay's blog? Also, what's the blog by Simon referred to in the beginning? Thanks!
@kevbuh 2 months ago
good lecture
@literailly 2 months ago
Fantastic, thank you!
@ProgrammingWIthRiley 2 months ago
I’m here. I made it
@IsaacLeong-y4k 2 months ago
11:18 I think a previous lecture mentioned that floating-point addition is commutative but not associative, which is actually what causes the problem in the parallel reduction algorithm.
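A small sketch of that non-associativity: reassociating the same float32 sum (as a tree-style parallel reduction does) changes the result.

```python
import numpy as np

vals = np.float32([1e8, 1.0, -1e8, 1.0])
sequential = ((vals[0] + vals[1]) + vals[2]) + vals[3]  # 1.0: the first 1.0 is absorbed by 1e8
pairwise = (vals[0] + vals[2]) + (vals[1] + vals[3])    # 2.0: tree-style reduction order
print(sequential, pairwise)
```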
@madankd 2 months ago
Can we get the code file?
@diakorudd7268 2 months ago
This is pure gold!
@aviralgoel5709 2 months ago
Thank you
@aviralgoel5709 2 months ago
Thank you
@mitchellcheng8688 2 months ago
Awesome!
@ahmedtremo 2 months ago
Great explanation, thanks
@ahmedtremo 2 months ago
Great video really!
@mobu-o1g 3 months ago
@oguzhanercan4701 3 months ago
worse but faster
@deependu__ 3 months ago
Thanks for the video. Triton's documentation tutorial starts directly with coding, and that was really difficult for me. Thanks for explaining the programming model first.
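For readers in the same position, a minimal sketch of the Triton programming model (a generic vector-add, not code from the lecture): each program instance owns one BLOCK-sized tile and uses a mask to guard the ragged last block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # which tile this program instance handles
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element indices covered by this tile
    mask = offs < n                           # guard out-of-bounds lanes in the last tile
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.rand(10_000, device="cuda")
y = torch.rand_like(x)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)        # one program instance per 1024-element tile
add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)
print(torch.allclose(out, x + y))
```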
@wolpumba4099 3 months ago
Lecture 33: Bitblas - Enabling Efficient Low Precision Computing with Hardware Aware Transformations

* 0:00 Introduction: James Melvin introduces Lei Wang, a research intern at Microsoft Research, who presents Bitblas, a kernel library and end-to-end compiler for high-performance mixed-precision computing. He also introduces a Triton-like programming language called TI Language.
* 1:58 Mixed Precision Computing Background: Lei Wang explains the shift towards lower bit formats in AI models for memory efficiency. He outlines three challenges: lack of custom precision format support in hardware/software, limited mixed-precision instructions, and vast computation combinations requiring extensive optimization.
* 6:20 Insights and Abstractions: Two core insights drive Bitblas: flexible data type representation in memory allows reinterpretation in software, and custom data types can be converted to standard types to leverage existing hardware instructions.
* 7:18 Tensor-Centric Abstractions: Bitblas introduces abstractions like TI Type (custom data types), TI Tile (tensors), Index Map (data layout), and scheduling templates to manipulate tensors. This enables defining computations with explicit data types and layouts.
* 13:30 Finding the Right Instructions: Bitblas includes a "Bit Machine Instruction" framework to select the most efficient hardware instructions based on data type and FLOPs. An iterator classification method maps computations to target instructions (e.g., Tensor Cores).
* 17:34 Optimizing Data Layouts: Bitblas infers memory layouts aligned with hardware instructions to minimize memory access issues. The TI approach further optimizes by fusing operators and propagating layouts through the tensor graph.
* 20:40 Layout Propagation: Challenges in layout propagation include misalignment between problem scale and instructions, computations outside core instructions, and layout transformations affecting correctness. Bitblas categorizes layouts and implements specific propagation methods.
* 26:14 Deciding When to Dequantize: Bitblas uses a latency-oriented policy to determine the optimal stage for dequantization (registers, shared memory, or global memory), trading off compute overhead and memory savings.
* 29:00 Bitblas Systems: Later and Bitblas: Later is an end-to-end compiler that optimizes operator fusion and generates efficient CUDA kernels. Bitblas is a kernel library with a simplified API abstracting tensor transformations.
* 32:58 Optimization Tricks: Bitblas implements fast dequantization techniques using vectorization and specialized instructions for improved performance, especially for low bit widths.
* 40:58 Kernel Code Generation for Dynamic Shapes: Bitblas addresses the challenge of dynamic shapes in LLMs by generating code for segments of the dynamic dimension and storing optimal configurations for dispatch.
* 46:42 Performance Results: Bitblas demonstrates significant speedups over existing systems and hand-written kernels across various hardware and models, including AMD GPUs. Scaling experiments with Llama models show memory and compute benefits with lower precision.
* 51:06 Challenges and Future Work: Kernel compilation time, complexity of Bitblas scheduling, and the limitations of schedule-based implementations are highlighted as areas for future work.
* 51:49 Bitblas Code Overview and TI Language: Lei Wang provides a brief overview of the Bitblas code structure and highlights TI Language, a new programming language designed for ease of kernel development with support for custom data types, layouts, and hardware instructions.

I used gemini-1.5-pro-002 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.03 Input tokens: 24672 Output tokens: 716
@saladpalad 3 months ago
what a cutie patootie
@kunalsuri8316 3 months ago
Super useful! Thank you!!!
@sludgekicker 3 months ago
Thanks! Very helpful.