Lecture 24: Scan at the Speed of Light

2,026 views

GPU MODE


1 day ago

Comments: 2
@wolpumba4099 4 months ago
*Lecture 24: Scan at the Speed of Light - Optimizing Parallel Prefix Sum on GPUs*

* *0:00 Introduction:* Scan (parallel prefix sum) is a powerful algorithm for parallel computing, enabling parallelization of seemingly sequential operations.
* *0:30 Importance in Machine Learning:* Scan underpins many modern architectures like Mamba, making its optimization crucial for performance.
* *0:48 CUDA Core Compute Libraries:* The talk focuses on optimizing scan with CCCL (the CUDA Core Compute Libraries), specifically the CUB library's device scan and its decoupled-lookback abstraction.
* *2:19 Speed of Light Analysis:* This performance-analysis technique identifies the theoretical peak ("speed of light") and reports an algorithm's performance as a percentage of that peak.
* *8:58 Scan's Speed of Light:* Scan is memory-bound, with a minimum of 2N memory operations (N reads, N writes), so memory copy is its speed-of-light benchmark.
* *14:02 Analyzing Scan Implementations:*
  * Reduce-then-scan performs 3N memory operations, capping it at 66% of speed of light.
  * Hierarchical scan performs 4N memory operations, capping it at 50% of speed of light.
  * Stream scan performs only 2N memory operations, but thread-block serialization makes it memory-latency-bound, so it achieves only a fraction of speed of light.
* *23:21 Stream Scan Optimization:* Fixing a bug in the original stream scan implementation by using atomics with proper memory ordering (e.g., `cuda::std::atomic`) yields a 3x speedup.
* *26:43 Block Scan Optimization:* Replacing the manual block prefix sum with CUB's `BlockScan` collective brings further performance gains.
* *29:39 Increasing Tile Size:* Processing more items per thread in stream scan improves throughput by reducing idle time during message-passing latency.
* *31:13 Coalesced Loads:* Using CUB's coalesced-load functionality significantly improves performance by reducing redundant memory accesses.
* *32:39 Decoupled Lookback:* This algorithm overcomes stream scan's latency limitation by letting thread blocks proceed with partial results, achieving higher parallelism.
* *36:07 Lookback Window and Backoff:* Decoupled lookback loads multiple tile states at once through a window, which introduces memory contention; CUB mitigates this with backoff mechanisms (fixed delay, exponential backoff).
* *38:38 CUB Device Scan Performance:* CUB's device scan, leveraging decoupled lookback and extensive tuning, reaches 86% of peak bandwidth, significantly outperforming stream scan.
* *39:04 Importance of Tuning:* Its edge over a basic decoupled-lookback implementation highlights the impact of parameter tuning.
* *40:55 Conclusion:* Implementing scan algorithms is valuable for learning, but production code should use optimized libraries like CUB, whose speed-of-light performance comes from sophisticated algorithms and extensive tuning.
* *41:54 Feedback:* Users are encouraged to give feedback or report issues on GitHub or Discord if CUB's abstractions don't meet their needs.
* *52:38 GPU Diversity and Benchmarking:* To handle performance testing across diverse GPUs, the speakers suggest:
  * Leveraging CUB, since it is tested on a wide range of GPUs.
  * Outsourcing tuning to the community via CUB's open-source tuning infrastructure.
  * Using tools like Nsight Compute for detailed kernel performance analysis.
  * Building simple CLI-based utilities for quick performance checks.
  * Considering data-center-specific features when optimizing for those GPUs.
* *56:05 User-Side Performance Feedback:* Simple tools that give users easy-to-understand performance feedback (e.g., estimated MFU) are valuable for identifying bottlenecks.
* *1:00:20 GPUDirect Storage (GDS) on Laptops:* GDS is typically unavailable on entry-level and laptop GPUs. Feature requests are the recommended way to advocate for its inclusion.
* *1:00:20 Optimizing CPU Offloading:* Techniques like custom memory allocators can improve CPU-offloading performance when VRAM is limited.

I used gemini-1.5-pro-exp-0827 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.04. Input tokens: 28148. Output tokens: 970.
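As a rough illustration of the memory-traffic argument in the summary above (a CPU-side Python sketch, not the lecture's CUDA code; the function names are my own), a scan must read N inputs and write N outputs, so 2N transfers is its speed of light, and an algorithm moving kN elements can reach at most 2/k of peak:

```python
# Minimal model of the "speed of light" analysis for memory-bound scan.

def inclusive_scan(xs):
    """Reference inclusive prefix sum: out[i] = xs[0] + ... + xs[i]."""
    out, running = [], 0
    for x in xs:
        running += x
        out.append(running)
    return out

def speed_of_light_fraction(ops_per_element):
    """Best attainable fraction of peak bandwidth for an algorithm that
    moves ops_per_element * N elements, versus scan's 2N minimum."""
    return 2 / ops_per_element

print(inclusive_scan([1, 2, 3, 4]))   # [1, 3, 6, 10]
print(speed_of_light_fraction(3))     # reduce-then-scan, 3N traffic: ~0.66
print(speed_of_light_fraction(4))     # hierarchical scan, 4N traffic: 0.5
print(speed_of_light_fraction(2))     # stream scan, 2N traffic: 1.0 in theory,
                                      # but latency-bound in practice
```

This model only counts traffic; as the summary notes, stream scan hits 2N transfers yet still falls short of peak because serialized block-to-block message passing makes it latency-bound rather than bandwidth-bound.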
@alfinal5787 7 months ago
It would’ve been nice if they explained the algorithms properly in the beginning