Quantization is an excellent technique to compress Large Language Models (LLMs) and accelerate their inference.
In this video, we discuss model quantization, first introducing what it is and building an intuition for rescaling and the problems it creates. Then we introduce the different types of quantization: dynamic post-training quantization, static post-training quantization, and quantization-aware training. Finally, we start looking at and comparing actual quantization techniques: PyTorch's built-in quantization, ZeroQuant, and bitsandbytes.
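To make the rescaling idea concrete, here is a minimal, self-contained sketch of the affine (zero-point) mapping to 8-bit integers covered in the video. The function names and the [-1, 1] input range are illustrative, not taken from any particular library.

```python
# Minimal sketch of affine 8-bit quantization: real values in [rmin, rmax]
# are rescaled to unsigned integers in [0, 255]. Illustrative only.

def quantize(values, rmin, rmax, bits=8):
    """Map floats in [rmin, rmax] to integers; return (q, scale, zero_point)."""
    qmax = 2 ** bits - 1
    scale = (rmax - rmin) / qmax          # real-valued size of one integer step
    zero_point = round(-rmin / scale)     # integer that represents real 0.0
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Approximately recover the original floats (rounding error remains)."""
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.2, 0.0, 0.5, 1.0]
q, scale, zp = quantize(weights, rmin=-1.0, rmax=1.0)
recovered = dequantize(q, scale, zp)
```

Note that real 0.0 maps exactly to the zero point, and every recovered value is within one quantization step of the original — the rounding error that the video's discussion of rescaling refers to.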
In part 2 (Deep Dive: Quantizing ...), we look at and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, HQQ, and the Hugging Face Optimum Intel library based on Intel Neural Compressor and Intel OpenVINO.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
02:05 What is quantization?
06:50 Rescaling weights and activations
08:17 The mapping function
12:38 Picking the input range
16:15 Getting rid of outliers
19:50 When can we apply quantization?
26:00 Dynamic post-training quantization with PyTorch
28:42 ZeroQuant
34:50 bitsandbytes
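As a companion to the chapters on picking the input range and getting rid of outliers, here is a small sketch, in pure Python, of why outliers matter: a single extreme activation stretches the min/max range and wastes most integer levels, so static quantization often clips the calibration range to a percentile instead. The function name and percentile choices are illustrative.

```python
# Sketch of percentile-based range selection for quantization calibration.
# One extreme outlier makes the naive min/max range mostly empty space.

def percentile_range(samples, lo_pct=1.0, hi_pct=99.0):
    """Return (rmin, rmax) covering the central lo_pct..hi_pct of samples,
    discarding extreme outliers instead of using the true min/max."""
    s = sorted(samples)
    lo_i = int(len(s) * lo_pct / 100)
    hi_i = min(len(s) - 1, int(len(s) * hi_pct / 100))
    return s[lo_i], s[hi_i]

# 100 activations spread over [-1, 1), plus one extreme outlier.
acts = [-1.0 + 0.02 * i for i in range(100)] + [50.0]

full = (min(acts), max(acts))      # outlier-dominated range
clipped = percentile_range(acts)   # outlier discarded
```

With the full range, an 8-bit grid must span 51 units of real-valued space and almost all of its 256 levels fall where no activation ever lands; clipping to a percentile keeps the grid dense where the values actually are, at the cost of saturating the rare outlier.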