Quantization is an excellent technique for compressing Large Language Models (LLMs) and accelerating their inference.
Following up on part 1, "Deep Dive: Quantizing ...", we examine and compare more advanced quantization techniques: SmoothQuant, GPTQ, AWQ, and HQQ, as well as the Hugging Face Optimum Intel library, which builds on Intel Neural Compressor and Intel OpenVINO.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
00:55 SmoothQuant
07:00 Generative Pre-trained Transformer Quantization (GPTQ)
12:35 Activation-aware Weight Quantization (AWQ)
18:10 Half-Quadratic Quantization (HQQ)
23:15 Optimum Intel
25:45 Accelerating Stable Diffusion with Intel OpenVINO
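All of the techniques covered in the video build on the same basic idea: mapping floating-point weights to low-bit integers with a scaling factor. As a minimal illustration (not the implementation used by any of these libraries), here is a sketch of symmetric absmax int8 quantization in NumPy; the function names are my own:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric absmax quantization: map floats to int8 in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.0], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# The per-element reconstruction error is bounded by half the
# quantization step (scale / 2); the advanced methods in the video
# (SmoothQuant, GPTQ, AWQ, HQQ) differ mainly in how they pick scales
# and compensate for this error.
```

Storing `q` (int8) plus one float scale instead of float32 weights is what yields the roughly 4x memory reduction that makes quantized LLM inference practical.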