Efficient ML is the new direction. Thanks for this content, and I hope more content on quantisation topics will come. Cheers, Oscar
@SaikSaketh · 16 days ago
This is awesome, can you link me your doctoral thesis :), thanks
@OscarSavolainen · 8 days ago
Sure! I do warn you, it's on Neurotechnology, which is a bit removed from ML haha. It's on Spiral, you may have to log in: spiral.imperial.ac.uk/handle/10044/1/105363 Title: Hardware-efficient data compression in wireless intracortical brain-machine interfaces
@SaikSaketh · 6 days ago
@ thanks
@pavanpandya9080 · 1 month ago
This tutorial is amazing! I'm still trying to absorb all the information.
@flavio_corse · 1 month ago
Hi, why don't you convert it with quantize_fx.convert_fx(fx_model) after preparing the model? Isn't that the operation that actually quantizes the model?
@OscarSavolainen · 1 month ago
Yep, it’s correct that fake quantization isn’t “true” quantization! To have true quantization, one has to do the “conversion” to actual int8. But I tend not to focus on that API too much, since once you have quantization parameters there are many ways to convert the model, e.g. via ONNX, via some custom method that one builds, etc. For example, many on-GPU quantization kernels rely on custom conversion methods where one does bitpacking. So yeah, I tend to leave that part out, but will make future videos on the conversion options (I'm working on a project to show all of the different conversion options)!
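For illustration, here is a minimal sketch of what that conversion step looks like with the stock FX Graph Mode API. This is not the method from the video, just the standard PyTorch path, assuming a CPU backend such as fbgemm; the calibration data is random only so the sketch runs end to end:

```python
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = torchvision.models.resnet18(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Insert observers / fake-quant modules.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# Calibrate with representative data (random here, just so the sketch runs).
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(1, 3, 224, 224))

# Only this step produces a "true" int8 model (this particular path targets CPU backends).
quantized = convert_fx(prepared)
print(quantized(example_inputs[0]).shape)
```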
@TheCrmagic · 1 month ago
These are very helpful. Thanks a lot.
@realhousewivesofnlp · 2 months ago
your video is a blessing, oscar! thanks for the great quantization content!!
@이제민-h2q · 2 months ago
I’ve been waiting for this for so long! GPTQ is truly an essential algorithm that is beneficial for everyone.
@Alexandra-l6v9o · 2 months ago
Thanks for the video! I have 2 questions: 1) When you renamed the layers from .relu to .relu1 and .relu_out in the model definition, wouldn't that affect correctly loading the weights from the pretrained checkpoint? 2) What if the model has LeakyReLU instead of ReLU, or GroupNorm instead of BatchNorm? Does that mean we can't fuse them with the conv?
@OscarSavolainen · 8 days ago
For 1), fortunately ReLUs are stateless, so there aren't any parameters! As a result, it'll load fine. For 2), fusing of layers in eager mode is currently limited to ReLU, but if you use the more modern FX Graph Mode quantization (I have a 3-part series on that if you want, and the docs are great: pytorch.org/docs/stable/fx.html) that will fuse leaky ReLUs / PReLUs as well! If you really want to do it in Eager mode you can do it manually by writing some complicated wrapper, but I wouldn't recommend it (speaking as someone who did it once). A lot of hardware, e.g. Intel via OpenVINO, does support fusing of "advanced" activations and layers, so it's mainly just an Eager mode PyTorch limitation!
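For reference, a minimal sketch of the Eager mode fusion being described here, using the standard fuse_modules helper on a toy Conv-BN-ReLU module (the module itself is just a placeholder):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

m = SmallNet().eval()
# Eager-mode fusion only knows a fixed set of patterns (Conv-BN, Conv-BN-ReLU,
# Linear-ReLU, ...); LeakyReLU / GroupNorm are not in that list.
fused = fuse_modules(m, [["conv", "bn", "relu"]])
print(fused)
```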
@YCL89 · 2 months ago
If I want to quantize a layer, what kind of function can I use instead of the PerChannelMinMax one?
@OscarSavolainen · 8 days ago
There are really a handful of observers available, the main one being PerChannelMinMax (best for weight tensors, since it gives per-channel resolution). A good one for activations is HistogramObserver: it calculates the error between the floating-point and fake-quant activations and uses that to assign qparams, but it is slow. There are observers that allow one to fix qparams at given values (e.g. discuss.pytorch.org/t/fixed-scale-and-zero-point-with-fixedqparamsobserver/200306), but you can also do that via the learnable ones. Reading the observer docs is a good place to start! pytorch.org/docs/stable/quantization-support.html
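As a small illustrative example (the observer choices below are assumptions for the sketch, not a recommendation from the video), a custom QConfig combining the two observers mentioned above might look like:

```python
import torch
from torch.ao.quantization import (
    QConfig,
    QConfigMapping,
    HistogramObserver,
    PerChannelMinMaxObserver,
)

# Weights: per-channel min/max, symmetric int8 (one scale per output channel).
weight_observer = PerChannelMinMaxObserver.with_args(
    dtype=torch.qint8, qscheme=torch.per_channel_symmetric
)
# Activations: histogram-based observer that searches for low-error qparams
# (usually better than plain MinMax for activations, but slower to calibrate).
activation_observer = HistogramObserver.with_args(
    dtype=torch.quint8, qscheme=torch.per_tensor_affine
)

my_qconfig = QConfig(activation=activation_observer, weight=weight_observer)
qconfig_mapping = QConfigMapping().set_global(my_qconfig)  # then pass to prepare_fx
```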
@김태호-m3d6l · 2 months ago
Hi! Thank you for the FX quantization tutorial, I have one question. In your code, you do not convert the fake-quantized model to a real quantized model, which means there is no inference speed-up and no model size reduction. Why didn't you convert to a real quantized model?
@OscarSavolainen · 8 days ago
So the main reason I sometimes don't do it in my tutorials is that how one converts to a true quantized model depends a lot on what hardware one is trying to target. For example, PyTorch's original quantization work was all for edge devices, and doesn't run on GPUs. Their new generation of quantization work is mainly on GPUs in the torchao library, and focuses a lot on LLMs. tl;dr: the best way to convert is very hardware specific. I have very recently open sourced a project that helps one figure out how to optimize one's model for any hardware, and we are building a ton of quantization options into it! github.com/saifhaq/alma
@vallabhasharma6207 · 3 months ago
Hello! Thank you for the amazing tutorial. I am working on performing Quantization Aware Training on the EfficientDet model. Can you help me with that?
@NamVu-x2o · 4 months ago
I've got a question: would any int8 model quantized on a PC (e.g. from yolov8n.pt) work on an embedded board like a Jetson Nano?
@none-hr6zh · 5 months ago
Fake quantization means quantize and dequantize, but how does that benefit us? We are converting float to int and then int back to float. Can you please elaborate?
@OscarSavolainen · 8 days ago
So it depends on quite a lot. For some use cases, fake-quant is only useful for determining good quantization parameters, and then one has to actually convert the model to a "true" quantized model to run on integer arithmetic. However, in certain cases, fake-quant in itself is the goal. For example, certain hardware, e.g. Intel, often does the quantized operations in fake-quant space. There are other examples too that are a bit of a mix. During LLM inference, we typically use quantization to compress the weight tensors and pack multiple weight elements into a single int32 value, but the actual matrix multiplication happens in floating point. So the kernel dequantizes the compressed weight tensors on the fly and does the matrix multiplication in floating-point space; quantization is just used to compress the weight tensors and reduce the amount of data being loaded from GPU global memory.
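A toy sketch of that weight-only idea (this is not a real GPU kernel, just an illustration of storing int8 weights plus scales and dequantizing on the fly before a floating-point matmul):

```python
import torch

# Toy weight-only quantization: store the weight as int8 plus per-row scales,
# and dequantize on the fly right before a floating-point matmul.
w = torch.randn(64, 128)
scales = w.abs().amax(dim=1, keepdim=True) / 127.0            # per-output-channel scale
w_int8 = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)

def forward(x):
    w_deq = w_int8.float() * scales    # dequantize on the fly
    return x @ w_deq.t()               # the matmul itself stays in fp32

print(forward(torch.randn(4, 128)).shape)  # torch.Size([4, 64])
```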
@none-hr6zh · 5 months ago
What is fake quantization? Why is it called fake?
@OscarSavolainen · 8 days ago
Because it doesn't actually convert the tensor to a different dtype; it stays fp32. However, it simulates the effect of quantization via the quantization modules attached to weights and activations.
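A tiny numeric sketch of that simulation (a hypothetical fake_quantize helper, not PyTorch's internal implementation):

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Quantize: snap to the int8 grid...
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    # ...then immediately dequantize: the output is still an fp32 tensor,
    # but it can only take values an int8 tensor could represent.
    return (q - zero_point) * scale

x = torch.tensor([0.031, -1.20, 0.513, 2.75])
print(fake_quantize(x, scale=0.02, zero_point=0))
# ≈ tensor([ 0.0400, -1.2000,  0.5200,  2.5400])  (2.75 clamps at 127 * 0.02)
```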
@Chendadon · 5 months ago
Thanks for that! Please make a tutorial on quantizing for TensorRT and deploying to NVIDIA hardware.
@OscarSavolainen · 8 days ago
Sorry for the extremely slow reply, I ended up creating a whole open-source project that lets one deploy one's model to basically any hardware, including TensorRT! github.com/saifhaq/alma Short answer is, torch.compile with the TensorRT backend is a good way to do it!
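A minimal sketch of that route, assuming torch-tensorrt is installed on a CUDA machine. Importing it registers the backend for torch.compile; in the releases I've seen it is registered as "torch_tensorrt" (with "tensorrt" also accepted as an alias):

```python
import torch
import torch_tensorrt  # noqa: F401  -- importing registers the TensorRT backend
import torchvision

model = torchvision.models.resnet18(weights=None).eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

trt_model = torch.compile(model, backend="torch_tensorrt")
with torch.no_grad():
    out = trt_model(example)  # first call triggers TensorRT engine compilation
```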
@TrooperJet · 6 months ago
Hello! Torch FX graph mode seems to have an experimental _equalization_config optional parameter on prepare_fx / prepare_qat_fx. But I can't find in the sources whether not providing such a config mapping still does some kind of equalization by default, or what such a mapping should look like (how to mark pairs in the config). Do you perhaps know where to look for such answers? I feel the documentation is incomplete, or I don't know how to read it. Also, in your graphs for the weights, don't the plots without CLE and with CLE look fairly similar? Wouldn't that indicate that a default equalization is actually applied, and that what you do is add another equalization prior to the FX one?
@OscarSavolainen · 8 days ago
I haven't dug through the docs for that specifically, but I can say that by default FX won't make those kinds of changes to one's model. There is a difference between the weight tensors, but it's not super obvious from the plots, because the plots (for weights) work by taking each channel, quantizing it individually, projecting the channels onto their quantized integer scales, and plotting all of them on top of each other. Something like CLE is hard to spot in that circumstance just from overlain weights. Sorry for the late answer!
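For what it's worth, a rough sketch of that kind of per-channel projection plot. The helper below is hypothetical and just illustrates the idea of quantizing each output channel with its own scale and overlaying the histograms on one int8 axis:

```python
import torch
import matplotlib.pyplot as plt

def overlay_channels_on_int_grid(weight):
    # weight: (out_channels, ...) fp32 tensor. Quantize each output channel
    # with its own scale, then histogram every channel on the shared int8 axis.
    for ch in weight.reshape(weight.shape[0], -1):
        scale = ch.abs().max() / 127.0
        q = torch.clamp(torch.round(ch / scale), -127, 127)
        plt.hist(q.numpy(), bins=255, range=(-127, 127), alpha=0.1)
    plt.xlabel("int8 value")
    plt.ylabel("count")
    plt.show()

# e.g. overlay_channels_on_int_grid(model.conv1.weight.detach())
```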
@ShAqSaif · 6 months ago
🦙🚀🔥🕳️
@archieloong1118 · 6 months ago
nice tutorial👍
@OscarSavolainen · 6 months ago
Thanks 🙂
@mayaambalapat3671 · 6 months ago
Amazing tutorials! Could you consider doing a video on exporting the fake-quantized model to a true integer model and running an inference pass on that model? It would also be helpful to observe the INT weights and activation values of a simple layer.
@OscarSavolainen · 6 months ago
Sure, I can do that! I’ll set it into the video schedule!
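In the meantime, a minimal sketch of inspecting the integer weights after conversion, assuming the stock fbgemm CPU path (the Tiny module is just a placeholder):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return self.conv(x)

m = Tiny().eval()
example = (torch.randn(1, 3, 32, 32),)

prepared = prepare_fx(m, get_default_qconfig_mapping("fbgemm"), example)
prepared(*example)                       # single calibration pass
converted = convert_fx(prepared)

w_q = converted.conv.weight()            # quantized weight tensor
print(w_q.int_repr()[0, 0])              # raw int8 values of one filter slice
print(w_q.q_per_channel_scales()[:4])    # per-output-channel scales
```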
@mayaambalapat3671 · 6 months ago
Great video! And thanks for taking us through the thought process of coding each line.
@OscarSavolainen · 6 months ago
No problem, I’m glad it was useful!
@kapwing2740 · 6 months ago
Hi, great work, I have a question: with graph mode quantization, is it possible to convert the quantized model to ONNX and then to TensorRT? I converted it to ONNX, but going from ONNX to TensorRT I got this error: [E] [TRT] ModelImporter.cpp:726: While parsing node number 71 [DequantizeLinear -> "/rbr_reparam/DequantizeLinear_output_0"]: [E] [TRT] ModelImporter.cpp:727: --- Begin node --- [E] [TRT] ModelImporter.cpp:728: input: "/rbr_reparam/Cast_output_0
@OscarSavolainen · 6 months ago
Sorry for the late reply! I’m not familiar with that error message, but it looks like a fake-quantised layer being ported rather than a true quantised layer. The discussion in this thread may be helpful, but if not please reach out! github.com/OscarSavolainenDR/Quantization-Tutorials/issues/12#issuecomment-2123471131
@이제민-h2q · 7 months ago
Thank you! This video is really helpful.
@shashankshekhar7052 · 8 months ago
Amazing piece of work!
@victorarinopellejero8410 · 8 months ago
Suggestion: You have an issue with video and audio synchronization. How to fix it: When editing the video, separate the audio and video into two parts and then slightly delay the audio.
@andrewferguson6901 · 8 months ago
You can clap every now and then to make visible anchors in the track
@OscarSavolainen · 8 months ago
I appreciate it! Truth is my hardware is a bit old, I was gonna update that and see if that fixes the OBS sync issues. I’m still getting my setup sorted, but ideally I’ll find a solution that fixes the root cause. But if nothing else works, I’ll have to spend time syncing… Thanks for the suggestion!
@cv462-l4x · 8 months ago
@OscarSavolainen Training neural networks on an old machine? How is that possible? :) Cloud? Seriously, the problem is not just video sync; there is also noise from your keyboard and mouse. It can be solved by using a mic stand that isn't attached to your table. I wish you success, your idea seems interesting.
@haianhle9984 · 8 months ago
awesome
@haianhle9984 · 9 months ago
Can you introduce a benchmark between resnet and quant_resnet? Thank you so much
@OscarSavolainen · 9 months ago
Sure, I can add it in the future videos and onto the Github! Generally, ResNets are measured on top-1 accuracy (e.g. does it correctly classify any image to the correct class). So far I've only been dealing with one image, so top 1 accuracy isn't a great metric. I'll get some more validation data for future videos!
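For reference, a generic top-1 evaluation loop of the kind described above (val_loader and the models are assumed to exist; they are not part of the original repo):

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total

# Compare fp32 vs (fake-)quantized ResNet on the same validation set:
# print(top1_accuracy(resnet, val_loader), top1_accuracy(quant_resnet, val_loader))
```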
@haianhle9984 · 9 months ago
awesome project
@boveykou2051 · 9 months ago
Could you show how to deploy the FX QAT model to TensorRT?
@OscarSavolainen · 9 months ago
Sure! I’ll make that video a priority, let’s say I’ll have it in 2-3 weeks!
@asdfasdfasdfasfdsaasf · 10 months ago
Are you Finnish? Greetings from Aalto University!
@OscarSavolainen · 10 months ago
Joo! The name really gives it away haha. I grew up abroad though. Nice one! My friend Yuxin did her PhD at Aalto, it’s a great uni!
@boveykou2051 · 10 months ago
I wait for this video for soooooooooooooooo looooooooooooooooooong