Fake quantization means quantize and then dequantize, but how does that benefit anything? We are just converting float to int and then int back to float. Can you please elaborate?
@OscarSavolainen · 7 days ago
So it depends on quite a lot. For some use cases, fake-quant is only useful for determining good quantization parameters, and then one has to actually convert the model to a "true" quantized model to run on integer arithmetic. However, in certain cases, fake-quant is itself the goal. For example, certain hardware (e.g. Intel's) often does the quantized operations in fake-quant space. There are other examples too that are a bit of a mix. During LLM inference, we typically use quantization to compress the weight tensors, packing multiple weight elements into a single int32 value, but the actual matrix multiplication happens in floating point. The kernel dequantizes the compressed weight tensors on the fly and does the matrix multiplication in floating-point space; quantization is just used to compress the weight tensors and reduce the amount of data being loaded from GPU global memory.
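Here's a minimal sketch of that last pattern (weight-only quantization with on-the-fly dequantization), just to make it concrete. The int4 format, the helper names, and the packing layout are my own illustrative assumptions, not anything from the video or a real kernel:

```python
# Hypothetical sketch: symmetric int4 weight-only quantization, packing
# 8 int4 values into each int32, then dequantizing on the fly before a
# floating-point matmul. Helper names are made up for illustration.
import torch

def quantize_int4(w: torch.Tensor):
    # Symmetric per-tensor int4: values land in [-8, 7].
    scale = w.abs().max() / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int32)
    return q, scale

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    # Pack 8 signed int4 values into each int32 (assumes numel % 8 == 0).
    q = (q & 0xF).view(-1, 8)  # keep the two's-complement low nibble
    packed = torch.zeros(q.shape[0], dtype=torch.int32)
    for i in range(8):
        packed |= q[:, i] << (4 * i)
    return packed

def unpack_dequant(packed: torch.Tensor, scale, shape):
    # Unpack on the fly and return fp32 weights for the matmul.
    nibbles = [(packed >> (4 * i)) & 0xF for i in range(8)]
    q = torch.stack(nibbles, dim=1).view(shape)
    q = torch.where(q >= 8, q - 16, q)  # restore the sign of each nibble
    return q.float() * scale

w = torch.randn(128, 64)
q, scale = quantize_int4(w)
packed = pack_int4(q)                    # 8x smaller than fp32 in memory
w_deq = unpack_dequant(packed, scale, w.shape)
y = torch.randn(4, 128) @ w_deq          # matmul still runs in fp32
```

The win is bandwidth: the packed tensor is 8x smaller than the fp32 original, so far less data moves from global memory, while the arithmetic itself never leaves floating point.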
@none-hr6zh · 5 months ago
What is fake quantization? Why is it called "fake"?
@OscarSavolainen · 7 days ago
Because it doesn't actually convert the tensor to a different dtype; it stays fp32. However, it simulates the effect of quantization via the quantization modules attached to weights and activations.
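A minimal sketch of the quantize-dequantize round trip, assuming a simple symmetric int8 scheme (this is the generic idea, not the specific modules from the video):

```python
# Hypothetical sketch: "fake" quantization keeps the tensor in fp32 but
# snaps its values onto the int8 grid, so the rounding/clipping error of
# real quantization is simulated while everything stays floating point.
import torch

def fake_quantize(x: torch.Tensor, scale: float, zero_point: int = 0):
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)  # quantize
    return (q - zero_point) * scale                                  # dequantize

x = torch.randn(4, 4)
scale = x.abs().max().item() / 127.0
x_fq = fake_quantize(x, scale)
print(x_fq.dtype)              # torch.float32 -- the dtype never changed
print((x - x_fq).abs().max())  # the simulated quantization error
```

If I recall correctly, PyTorch ships a built-in op for this, torch.fake_quantize_per_tensor_affine, which does essentially the same round trip; the sketch above just spells out what it computes.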