Lecture 33: Bitblas

1,079 views

GPU MODE


1 day ago

Comments: 2
@wolpumba4099 3 months ago
*Lecture 33: Bitblas - Enabling Efficient Low Precision Computing with Hardware Aware Transformations*

* *0:00 Introduction:* James Melvin introduces Lei Wang, a research intern at Microsoft Research, who presents Bitblas, a kernel library and end-to-end compiler for high-performance mixed-precision computing. He also introduces a Triton-like programming language called TI Language.
* *1:58 Mixed-Precision Computing Background:* Lei Wang explains the shift toward lower-bit formats in AI models for memory efficiency. He outlines three challenges: lack of hardware/software support for custom precision formats, limited mixed-precision instructions, and a vast space of computation combinations requiring extensive optimization.
* *6:20 Insights and Abstractions:* Two core insights drive Bitblas: data types are represented flexibly in memory, so software can reinterpret them, and custom data types can be converted to standard types to leverage existing hardware instructions.
* *7:18 Tensor-Centric Abstractions:* Bitblas introduces abstractions such as TI Type (custom data types), TI Tile (tensors), Index Map (data layout), and scheduling templates to manipulate tensors. These make it possible to define computations with explicit data types and layouts.
* *13:30 Finding the Right Instructions:* Bitblas includes a "Bit Machine Instruction" framework that selects the most efficient hardware instructions based on data type and FLOPs. An iterator-classification method maps computations to target instructions (e.g., Tensor Cores).
* *17:34 Optimizing Data Layouts:* Bitblas infers memory layouts aligned with hardware instructions to minimize memory-access overhead, and further optimizes by fusing operators and propagating layouts through the tensor graph.
* *20:40 Layout Propagation:* Challenges in layout propagation include misalignment between the problem scale and the instruction shape, computations that fall outside the core instructions, and layout transformations that affect correctness. Bitblas categorizes layouts and implements a specific propagation method for each category.
* *26:14 Deciding When to Dequantize:* Bitblas uses a latency-oriented policy to choose the optimal stage for dequantization (registers, shared memory, or global memory), trading compute overhead against memory savings.
* *29:00 Bitblas Systems: Ladder and Bitblas:* Ladder is an end-to-end compiler that optimizes operator fusion and generates efficient CUDA kernels; Bitblas is a kernel library with a simplified API that abstracts the tensor transformations.
* *32:58 Optimization Tricks:* Bitblas implements fast dequantization using vectorization and specialized instructions, which improves performance especially at low bit widths.
* *40:58 Kernel Code Generation for Dynamic Shapes:* Bitblas addresses dynamic shapes in LLMs by generating code for segments of the dynamic dimension and storing the optimal configuration for each segment, dispatching among them at runtime.
* *46:42 Performance Results:* Bitblas demonstrates significant speedups over existing systems and hand-written kernels across various hardware and models, including AMD GPUs. Scaling experiments with Llama models show memory and compute benefits at lower precision.
* *51:06 Challenges and Future Work:* Kernel compilation time, the complexity of Bitblas scheduling, and the limitations of schedule-based implementations are highlighted as areas for future work.
* *51:49 Bitblas Code Overview and TI Language:* Lei Wang gives a brief overview of the Bitblas code structure and highlights TI Language, a new programming language designed for easier kernel development, with support for custom data types, layouts, and hardware instructions.

I used gemini-1.5-pro-002 on rocketrecap dot com to summarize the transcript. Cost (if I didn't use the free tier): $0.03. Input tokens: 24672. Output tokens: 716.
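The two ideas the summary leans on — software reinterpretation of packed custom data types (6:20) and dequantizing back to a standard type at the cheapest stage (26:14) — can be illustrated in plain Python. This is a hedged sketch of the concept only, not Bitblas code: it packs signed int4 weights two per byte, then sign-extends and scales them back to floats on load, the way a GPU kernel would before feeding standard-precision instructions.

```python
def pack_int4(values):
    """Pack signed 4-bit integers (range -8..7) two per byte, low nibble first."""
    packed = bytearray()
    for i in range(0, len(values), 2):
        lo = values[i] & 0xF
        hi = (values[i + 1] & 0xF) if i + 1 < len(values) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed)

def unpack_int4(packed, count):
    """Reinterpret packed bytes as signed int4 values (sign-extend each nibble)."""
    out = []
    for b in packed:
        for nib in (b & 0xF, b >> 4):
            out.append(nib - 16 if nib >= 8 else nib)
    return out[:count]

def dequantize(packed, count, scale):
    """Convert the custom int4 type back to a standard float type on load."""
    return [v * scale for v in unpack_int4(packed, count)]

weights = [-8, -1, 0, 3, 7, 5]
packed = pack_int4(weights)
assert len(packed) == 3                      # half the bytes of int8 storage
assert unpack_int4(packed, len(weights)) == weights  # lossless round trip
```

The memory saving is what the summary calls the benefit of lower-bit formats; the `dequantize` step stands in for the register/shared/global-memory placement decision that Bitblas makes with its latency-oriented policy.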
@kunalsuri8316 3 months ago
Super useful! Thank you!!!