Efficient ML is the new direction. Thanks for this content, and I hope more content on quantisation topics will come. Cheers, Oscar
@SaikSaketh · 16 days ago
This is awesome, can you link me your doctoral thesis :), thanks
@OscarSavolainen · 8 days ago
Sure! I do warn you, it's on Neurotechnology, which is a bit removed from ML haha. It's on Spiral, you may have to log in: spiral.imperial.ac.uk/handle/10044/1/105363 Title: Hardware-efficient data compression in wireless intracortical brain-machine interfaces
@SaikSaketh · 6 days ago
@ thanks
@pavanpandya9080 · 1 month ago
This tutorial is amazing! I'm still trying to absorb all the information.
@flavio_corse · 1 month ago
Hi, why don't you convert it with quantize_fx.convert_fx(fx_model) after preparing the model? Isn't that the operation that actually quantizes the model?
@OscarSavolainen · 1 month ago
Yep, it’s correct that fake quantization isn’t “true” quantization! To have true quantization, one has to do the “conversion” to actual int8. But I tend not to focus on that API too much, since once you have quantization parameters there are many ways to convert the model, e.g. via ONNX, via some custom method that one builds, etc. For example, many on-GPU quantization kernels rely on custom conversion methods where one does bitpacking. So yeah, I tend to leave that part out, but will make future videos on the conversion options (I'm working on a project to show all of the different conversion options)!
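For illustration, here is a minimal sketch of what that conversion step looks like with the stock FX Graph Mode API. This is not the method from the video, just the standard PyTorch path, assuming a CPU backend such as fbgemm; the calibration data is random only so the sketch runs end to end:

```python
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = torchvision.models.resnet18(weights=None).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Insert observers / fake-quant modules.
qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(model, qconfig_mapping, example_inputs)

# Calibrate with representative data (random here, just so the sketch runs).
with torch.no_grad():
    for _ in range(8):
        prepared(torch.randn(1, 3, 224, 224))

# Only this step produces a "true" int8 model (this particular path targets CPU backends).
quantized = convert_fx(prepared)
print(quantized(example_inputs[0]).shape)
```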
@TheCrmagic · 1 month ago
These are very helpful. Thanks a lot.
@realhousewivesofnlp · 2 months ago
your video is a blessing, oscar! thanks for the great quantization content!!
@이제민-h2q · 2 months ago
I’ve been waiting for this for so long! GPTQ is truly an essential algorithm that is beneficial for everyone.
@Alexandra-l6v9o · 2 months ago
Thanks for the video! I have 2 questions: 1) When you renamed the layers from .relu to .relu1 and .relu_out in the model definition, wouldn't that affect correctly loading the weights from the pretrained checkpoint? 2) What if the model has LeakyReLU instead of ReLU, or GroupNorm instead of BatchNorm? Does that mean we can't fuse them with the conv?
@OscarSavolainen · 8 days ago
For 1), fortunately ReLUs are stateless, so there aren't any parameters! As a result, it'll load fine. For 2), fusing of layers in eager mode is currently limited to ReLU, but if you use the more modern FX Graph Mode quantization (I have a 3-part series on that if you want, and the docs are great: pytorch.org/docs/stable/fx.html) that will fuse leaky ReLUs / PReLUs as well! If you really want to do it in Eager mode you can do it manually by writing some complicated wrapper, but I wouldn't recommend it (speaking as someone who did it once). A lot of hardware, e.g. Intel via OpenVINO, does support fusing of "advanced" activations and layers, so it's mainly just an Eager mode PyTorch limitation!
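For reference, a minimal sketch of the Eager mode fusion being described here, using the standard fuse_modules helper on a toy Conv-BN-ReLU module (the module itself is just a placeholder):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

m = SmallNet().eval()
# Eager-mode fusion only knows a fixed set of patterns (Conv-BN, Conv-BN-ReLU,
# Linear-ReLU, ...); LeakyReLU / GroupNorm are not in that list.
fused = fuse_modules(m, [["conv", "bn", "relu"]])
print(fused)
```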
@YCL89 · 2 months ago
If I want to quantize a layer, what kind of function can I use instead of the PerChannelMinMax one?
@OscarSavolainen · 8 days ago
There are really a handful of observers available, the main one being PerChannelMinMax (best for weight tensors, since it gives per-channel resolution). A good one for activations is HistogramObserver: it calculates the error between the floating-point and fake-quant activations and uses that to assign qparams, but it is slow. There are observers that allow one to fix qparams at given values (e.g. discuss.pytorch.org/t/fixed-scale-and-zero-point-with-fixedqparamsobserver/200306), but you can also do that via the learnable ones. Reading the observer docs is a good place to start! pytorch.org/docs/stable/quantization-support.html
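As a small illustrative example (the observer choices below are assumptions for the sketch, not a recommendation from the video), a custom QConfig combining the two observers mentioned above might look like:

```python
import torch
from torch.ao.quantization import (
    QConfig,
    QConfigMapping,
    HistogramObserver,
    PerChannelMinMaxObserver,
)

# Weights: per-channel min/max, symmetric int8 (one scale per output channel).
weight_observer = PerChannelMinMaxObserver.with_args(
    dtype=torch.qint8, qscheme=torch.per_channel_symmetric
)
# Activations: histogram-based observer that searches for low-error qparams
# (usually better than plain MinMax for activations, but slower to calibrate).
activation_observer = HistogramObserver.with_args(
    dtype=torch.quint8, qscheme=torch.per_tensor_affine
)

my_qconfig = QConfig(activation=activation_observer, weight=weight_observer)
qconfig_mapping = QConfigMapping().set_global(my_qconfig)  # then pass to prepare_fx
```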
@김태호-m3d6l · 2 months ago
Hi! Thank you for the FX quantization tutorial, I have one question. In your code, you do not convert the fake-quantized model to a real quantized model, which means there is no inference speed-up and no model size reduction. Why didn't you convert to a real quantized model?
@OscarSavolainen · 8 days ago
So the main reason I sometimes don't do it in my tutorials is that how one converts to a true quantized model depends a lot on what hardware one is trying to target. For example, PyTorch's original quantization work was all for edge devices, and doesn't run on GPUs. Their new generation of quantization work is mainly on GPUs in the torchao library, and focuses a lot on LLMs. tl;dr: the best way to convert is very hardware specific. I have very recently open sourced a project that helps one figure out how to optimize one's model for any hardware, and we are building a ton of quantization options into it! github.com/saifhaq/alma
@vallabhasharma6207 · 3 months ago
Hello! Thank you for the amazing tutorial. I am working on performing Quantization Aware Training on the EfficientDet model. Can you help me with that?
@NamVu-x2o · 4 months ago
I've got a question: would any int8 model quantized on a PC (e.g. from yolov8n.pt) work on an embedded board like a Jetson Nano?
@none-hr6zh · 5 months ago
Fake quantization means quantize and dequantize, but how does that benefit us? We are converting float to int and then int back to float. Can you please elaborate?
@OscarSavolainen · 8 days ago
So it depends on quite a lot. For some use cases, fake-quant is only useful for determining good quantization parameters, and then one has to actually convert the model to a "true" quantized model to run on integer arithmetic. However, in certain cases, fake-quant in itself is the goal. For example, certain hardware, e.g. Intel, often does the quantized operations in fake-quant space. There are other examples too that are a bit of a mix. During LLM inference, we typically use quantization to compress the weight tensors and pack multiple weight elements into a single int32 value, but the actual matrix multiplication happens in floating point. So the kernel dequantizes the compressed weight tensors on the fly and does the matrix multiplication in floating-point space; quantization is just used to compress the weight tensors and reduce the amount of data being loaded from GPU global memory.
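A toy sketch of that weight-only idea (this is not a real GPU kernel, just an illustration of storing int8 weights plus scales and dequantizing on the fly before a floating-point matmul):

```python
import torch

# Toy weight-only quantization: store the weight as int8 plus per-row scales,
# and dequantize on the fly right before a floating-point matmul.
w = torch.randn(64, 128)
scales = w.abs().amax(dim=1, keepdim=True) / 127.0            # per-output-channel scale
w_int8 = torch.clamp(torch.round(w / scales), -127, 127).to(torch.int8)

def forward(x):
    w_deq = w_int8.float() * scales    # dequantize on the fly
    return x @ w_deq.t()               # the matmul itself stays in fp32

print(forward(torch.randn(4, 128)).shape)  # torch.Size([4, 64])
```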
@none-hr6zh · 5 months ago
What is fake quantization? Why is it called fake?
@OscarSavolainen · 8 days ago
Because it doesn't actually convert the tensor to a different dtype; it stays fp32. However, it simulates the effect of quantization via the quantization modules attached to weights and activations.
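A tiny numeric sketch of that simulation (a hypothetical fake_quantize helper, not PyTorch's internal implementation):

```python
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Quantize: snap to the int8 grid...
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    # ...then immediately dequantize: the output is still an fp32 tensor,
    # but it can only take values an int8 tensor could represent.
    return (q - zero_point) * scale

x = torch.tensor([0.031, -1.20, 0.513, 2.75])
print(fake_quantize(x, scale=0.02, zero_point=0))
# ≈ tensor([ 0.0400, -1.2000,  0.5200,  2.5400])  (2.75 clamps at 127 * 0.02)
```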
@Chendadon · 5 months ago
Thanks for that! Please make a tutorial on quantizing for TensorRT and deploying to NVIDIA hardware.
@OscarSavolainen · 8 days ago
Sorry for the extremely slow reply, I ended up creating a whole open-source project that lets one deploy one's model to basically any hardware, including TensorRT! github.com/saifhaq/alma Short answer is, torch.compile with the TensorRT backend is a good way to do it!
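A minimal sketch of that route, assuming torch-tensorrt is installed on a CUDA machine. Importing it registers the backend for torch.compile; in the releases I've seen it is registered as "torch_tensorrt" (with "tensorrt" also accepted as an alias):

```python
import torch
import torch_tensorrt  # noqa: F401  -- importing registers the TensorRT backend
import torchvision

model = torchvision.models.resnet18(weights=None).eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

trt_model = torch.compile(model, backend="torch_tensorrt")
with torch.no_grad():
    out = trt_model(example)  # first call triggers TensorRT engine compilation
```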
@TrooperJet · 6 months ago
Hello! Torch FX graph mode seems to have an experimental _equalization_config optional parameter on prepare_fx / prepare_qat_fx. But I can't find in the sources whether not providing such a config mapping still does some kind of equalization by default, or what such a mapping should look like (how to mark pairs in the config). Do you perhaps know where to look for such answers? I feel the documentation is incomplete, or I don't know how to read it. Also, in your graphs for the weights, don't the plots without CLE and with CLE look fairly similar? Wouldn't that indicate that a default equalization is actually applied, and that what you do is add another equalization prior to the FX one?
@OscarSavolainen · 8 days ago
I haven't dug through the docs for that specifically, but I can say that by default FX won't make those kinds of changes to one's model. There is a difference between the weight tensors, but it's not super obvious from the plots, because the plots (for weights) work by taking each channel, quantizing it individually, projecting the channels onto their quantized integer scales, and plotting all of them on top of each other. Something like CLE is hard to spot in that circumstance just from overlain weights. Sorry for the late answer!
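For what it's worth, a rough sketch of that kind of per-channel projection plot. The helper below is hypothetical and just illustrates the idea of quantizing each output channel with its own scale and overlaying the histograms on one int8 axis:

```python
import torch
import matplotlib.pyplot as plt

def overlay_channels_on_int_grid(weight):
    # weight: (out_channels, ...) fp32 tensor. Quantize each output channel
    # with its own scale, then histogram every channel on the shared int8 axis.
    for ch in weight.reshape(weight.shape[0], -1):
        scale = ch.abs().max() / 127.0
        q = torch.clamp(torch.round(ch / scale), -127, 127)
        plt.hist(q.numpy(), bins=255, range=(-127, 127), alpha=0.1)
    plt.xlabel("int8 value")
    plt.ylabel("count")
    plt.show()

# e.g. overlay_channels_on_int_grid(model.conv1.weight.detach())
```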
@ShAqSaif · 6 months ago
🦙🚀🔥🕳️
@archieloong1118 · 6 months ago
nice tutorial👍
@OscarSavolainen · 6 months ago
Thanks 🙂
@mayaambalapat3671 · 6 months ago
Amazing tutorials! Could you consider doing a video on exporting the fake-quantized model to a true integer model and running an inference pass on that model? It would also be helpful to observe the INT weights and activation values of a simple layer.
@OscarSavolainen · 6 months ago
Sure, I can do that! I’ll set it into the video schedule!
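In the meantime, a minimal sketch of inspecting the integer weights after conversion, assuming the stock fbgemm CPU path (the Tiny module is just a placeholder):

```python
import torch
import torch.nn as nn
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3)

    def forward(self, x):
        return self.conv(x)

m = Tiny().eval()
example = (torch.randn(1, 3, 32, 32),)

prepared = prepare_fx(m, get_default_qconfig_mapping("fbgemm"), example)
prepared(*example)                       # single calibration pass
converted = convert_fx(prepared)

w_q = converted.conv.weight()            # quantized weight tensor
print(w_q.int_repr()[0, 0])              # raw int8 values of one filter slice
print(w_q.q_per_channel_scales()[:4])    # per-output-channel scales
```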
@mayaambalapat3671 · 6 months ago
Great video! And thanks for taking us through the thought process of coding each line.
@OscarSavolainen · 6 months ago
No problem, I’m glad it was useful!
@kapwing2740 · 6 months ago
Hi, great work, I have a question: with graph mode quantization, is it possible to convert the quantized model to ONNX and then to TensorRT? I converted it to ONNX, but going from ONNX to TensorRT I got this error: [E] [TRT] ModelImporter.cpp:726: While parsing node number 71 [DequantizeLinear -> "/rbr_reparam/DequantizeLinear_output_0"]: [E] [TRT] ModelImporter.cpp:727: --- Begin node --- [E] [TRT] ModelImporter.cpp:728: input: "/rbr_reparam/Cast_output_0
@OscarSavolainen · 6 months ago
Sorry for the late reply! I’m not familiar with that error message, but it looks like a fake-quantised layer being ported rather than a true quantised layer. The discussion in this thread may be helpful, but if not please reach out! github.com/OscarSavolainenDR/Quantization-Tutorials/issues/12#issuecomment-2123471131
@이제민-h2q · 7 months ago
Thank you! This video is really helpful.
@shashankshekhar7052 · 8 months ago
Amazing piece of work!
@victorarinopellejero8410 · 8 months ago
Suggestion: You have an issue with video and audio synchronization. How to fix it: When editing the video, separate the audio and video into two parts and then slightly delay the audio.
@andrewferguson6901 · 8 months ago
You can clap every now and then to make visible anchors in the track
@OscarSavolainen · 8 months ago
I appreciate it! Truth is my hardware is a bit old, I was gonna update that and see if that fixes the OBS sync issues. I’m still getting my setup sorted, but ideally I’ll find a solution that fixes the root cause. But if nothing else works, I’ll have to spend time syncing… Thanks for the suggestion!
@cv462-l4x · 8 months ago
@OscarSavolainen Training neural networks on an old machine? How is that possible? :) Cloud? Seriously, the problem is not just video sync; there is also noise from your keyboard and mouse. It can be solved by using a mic stand that isn't attached to your table. I wish you success, your idea seems interesting.
@haianhle9984 · 8 months ago
awesome
@haianhle9984 · 9 months ago
Can you introduce a benchmark between resnet and quant_resnet? Thank you so much
@OscarSavolainen · 9 months ago
Sure, I can add it in the future videos and onto the Github! Generally, ResNets are measured on top-1 accuracy (e.g. does it correctly classify any image to the correct class). So far I've only been dealing with one image, so top 1 accuracy isn't a great metric. I'll get some more validation data for future videos!
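For reference, a generic top-1 evaluation loop of the kind described above (val_loader and the models are assumed to exist; they are not part of the original repo):

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        preds = model(images.to(device)).argmax(dim=1)
        correct += (preds == labels.to(device)).sum().item()
        total += labels.numel()
    return correct / total

# Compare fp32 vs (fake-)quantized ResNet on the same validation set:
# print(top1_accuracy(resnet, val_loader), top1_accuracy(quant_resnet, val_loader))
```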
@haianhle9984 · 9 months ago
awesome project
@boveykou2051 · 9 months ago
Could you show how to deploy the FX QAT model to TensorRT?
@OscarSavolainen · 9 months ago
Sure! I’ll make that video a priority, let’s say I’ll have it in 2-3 weeks!
@asdfasdfasdfasfdsaasf · 10 months ago
Are you Finnish? Greetings from Aalto University!
@OscarSavolainen · 10 months ago
Joo! The name really gives it away haha. I grew up abroad though. Nice one! My friend Yuxin did her PhD at Aalto, it’s a great uni!
@boveykou2051 · 10 months ago
I wait for this video for soooooooooooooooo looooooooooooooooooong