If you are not aware, let me tell you: you are helping a generation of ML practitioners learn all this for free. Huge respect to you, Umar. Thank you for all your hard work ❤
@savvysuraj • 7 months ago
The content made by Umar is helping me a lot. Kudos to Umar.
@nithinma8697 • 3 days ago
To-the-point, crystal-clear concept explanation.
@ankush4617 • 9 months ago
I keep hearing about quantization so much; this is the first time I have seen someone go so deep into this topic and come up with such clear explanations! Keep up all your great work, you are a gem to the AI community!! I'm hoping that you will have a video on Mixtral MoE soon 😊
@umarjamilai • 9 months ago
You read my mind about Mistral. Stay tuned! 😺
@ankush4617 • 9 months ago
@umarjamilai ❤
@BRH_SoC • 29 days ago
Very useful video! Thanks for this valuable (and very well delivered) knowledge!
@abdullatifhalabi8201 • 22 days ago
This is one of the best videos so far! 👋
@vik2189 • 5 months ago
Fantastic video! Probably the best 50 minutes spent on AI-related concepts in the past year or so.
@dariovicenzo8139 • 5 months ago
Great job, in particular the examples on the conversion to/from integers, not only with formulas but with actual numbers too!
@alexandermedina4950 • 13 days ago
This lecture is very useful, thank you.
@krystofjakubek9376 • 9 months ago
Great video! Just a clarification: on modern processors, floating-point operations are NOT slower than integer operations. It very much depends on the exact processor, and even then the difference is usually extremely small compared to the other overheads of executing the code. HOWEVER, the reduction in size from 32-bit float to 8-bit integer does itself make the operations a lot faster. The cause is twofold:
1) Modern CPUs and GPUs are typically memory-bound, so, simply put, if we reduce the amount of data the processor needs to load by 4x, we expect the time the processor spends waiting for the next batch of data to shrink by 4x as well.
2) Pretty much all machine learning code is vectorized. This means that instead of executing each instruction on a single number, the processor grabs N numbers and executes the instruction on all of them at once (SIMD instructions). However, most processors don't fix N; instead they fix the total number of bits all N numbers occupy (for example, AVX2 can operate on 256 bits at a time), so if we go from 32 bits to 8 bits we can do 4x more operations at once! This is likely what you mean by operations being faster. Note that CPUs and GPUs are very similar in this regard, only GPUs have many more SIMD lanes (many more bits).
@umarjamilai • 9 months ago
Thanks for the clarification! I was even going to talk about the internal hardware of adders (carry-lookahead adder) to show how a simple operation like addition works and compare it with the many steps required for floating-point numbers (which also involve normalization). Your explanation nailed it! Thanks again!
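The memory-bandwidth half of the comment above is easy to demonstrate in a few lines of NumPy. This is a small sketch of my own, not from the video; the 4x figure matches the 32-bit-to-8-bit reduction being discussed:

```python
import numpy as np

# One million weights stored in float32 vs. quantized to int8:
# the int8 copy moves 4x less data through the memory hierarchy,
# which is where most of the speedup comes from on memory-bound hardware.
w_fp32 = np.random.randn(1_000_000).astype(np.float32)
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

print(w_fp32.nbytes, w_int8.nbytes)  # 4000000 1000000
```

The same 4x factor shows up in SIMD width: a 256-bit AVX2 register holds 8 float32 lanes but 32 int8 lanes per instruction.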
@jaymn5318 • 7 months ago
Great lecture. Clean explanation of the field that gives an excellent perspective on these technical topics. Love your lectures. Thanks!
@asra1kumar • 6 months ago
This channel features exceptional lectures, and the quality of explanation is truly outstanding. 👌
@TechPlanet-b9b • 6 months ago
I was searching for quantization basics and could not find relevant videos... this is a life-saver!! Thanks, and please keep up the amazing work!
@AbdennacerAyeb • 9 months ago
Keep going. This is perfect. Thank you for the effort you are making.
@AmishaHSomaiya • 4 months ago
Thank you for the great content, especially the explanation of how QAT aims for a wider minimum of the loss function and how that makes the model robust to quantization errors. Thank you.
@КириллКлимушин • 6 months ago
This is one of the few channels I subscribed to after watching a single video. Your content is very easy to follow, and you cover topics holistically with additional clarifications. What a man :)
@王之灿 • 8 months ago
Thanks a lot for the fantastic tutorial. Looking forward to more of the series on LLM quantization! 👏
@jiahaosu • 8 months ago
The best video about quantization, thank you very much!!!! It really helps!
@Aaron-hs4gj • 5 months ago
Excellent explanation, very intuitive. Thanks so much! ❤
@myaseena • 9 months ago
Really high quality exposition. Also thanks for providing the slides.
@ojay666 • 6 months ago
Fantastic tutorial!!! 👍👍👍 I'm hoping that you will post a tutorial on model pruning soon 🤩
@andrewchen7710 • 7 months ago
Umar, I've watched your videos on Llama, Mistral, and now quantization. They're absolutely brilliant, and I've shared your channel with my colleagues. If you're in Shanghai, allow me to buy you a meal haha! I'm curious about your research process. During the preparation of your next video, I think it would be neat if you documented the timeline of your research/learning and shared it with us in a separate video!
@umarjamilai • 7 months ago
Hi Andrew! Connect with me on LinkedIn and we can share our WeChat. Have a nice day!
@Patrick-wn6uj • 6 months ago
Glad to see fellow Shanghai people here haha
@Youngzeez1 • 8 months ago
Wow, what an eye-opener! I read lots of research papers, but they're mostly confusing; your explanation just opened my eyes! Thank you. Could you please do a video on the quantization of vision transformers for object detection?
@TheEldadcohen • 8 months ago
Umar, I've seen many of your videos and you are a great teacher! Thank you for your effort in explaining all of these complicated topics in plain (Italian-accented) English. Regarding the content of the video: you showed quantization-aware training and were surprised by its worse result compared to post-training quantization in the concrete example you made. I think it is because you calibrated the post-training quantization on the same data that you tested it on, so the learned parameters (alpha, beta) are overfitted to the test data; that's why the accuracy was better. I think that if you had tested on true held-out data, you would probably have seen the result you anticipated.
@shivamkaushik6637 • 2 months ago
Thank you for this lecture.
@aireddy • 2 months ago
@umarjamilai Great job breaking down complex concepts into actionable insights; you explained the concepts in a very simple, easy-to-understand fashion with practical examples.
@mandarinboy • 8 months ago
Great introductory video! Looking forward to GPTQ and AWQ
@HeyFaheem • 9 months ago
You are a hidden gem, my brother.
@ngmson • 9 months ago
Thank you for sharing.
@MaksymSutkovenko • 7 months ago
We need more videos about advanced quantization!
@aminamoudjar4561 • 9 months ago
Very helpful, thank you so much!
@sebastientetaud7485 • 7 months ago
Excellent video! Grazie!
@bluecup25 • 9 months ago
Thank you, super clear
@harshithreddy8865 • a month ago
This guy is seriously talented; not just in this video, he clears up every doubt.
@NJCLM • 8 months ago
Great video! Thank you!!
@RaviPrakash-dz9fm • 4 months ago
Legendary content!!
@koushikkumardey882 • 9 months ago
Becoming a big fan of your work!!
@manishsharma2211 • 9 months ago
Beautiful again, thanks for sharing these.
@汪茶水 • 9 months ago
Very good.
@tetnojj2483 • 8 months ago
Nice video :) A video on the .gguf file format for models would be very interesting :)
@ziyadmuhammad3734 • 3 months ago
Thanks!
@swiftmindai • 9 months ago
I noticed a small correction that needs to be made at timestamp 28:53 [slide: Low-precision matrix multiplication]. The first line describes the dot products between each row of X and each column of Y; instead of Y, it should be W, the weight matrix.
@umarjamilai • 9 months ago
You're right, thanks! Thankfully the diagram of the multiply block is correct. I'll fix the slides.
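For readers following along, the low-precision multiplication from that slide (Y = XW with integer inputs) can be sketched in NumPy. This is my own illustration, not the slide's code: symmetric per-tensor int8 quantization, with the sums accumulated in int32 so they cannot overflow, then dequantized back to floating point:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8)).astype(np.float32)   # activations
W = rng.standard_normal((8, 3)).astype(np.float32)   # weights

# symmetric int8 quantization: one scale per tensor
s_x = np.abs(X).max() / 127.0
s_w = np.abs(W).max() / 127.0
X_q = np.round(X / s_x).astype(np.int8)
W_q = np.round(W / s_w).astype(np.int8)

# integer matrix multiply; accumulate in int32 so the dot products don't overflow
Y_acc = X_q.astype(np.int32) @ W_q.astype(np.int32)

# dequantize the accumulator back to floating point
Y_approx = (s_x * s_w) * Y_acc

max_err = np.abs(Y_approx - X @ W).max()
```

The small `max_err` is the quantization error that the video's accuracy comparisons measure at the model level.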
@RudraPratapDhara • 9 months ago
One request: could you explain mixture of experts? I bet you can break down the explanation well.
@asra1kumar • 6 months ago
Thanks
@bamless95 • 6 months ago
Be careful: CPython does not do JIT compilation; it is a pretty straightforward stack-based bytecode interpreter.
@umarjamilai • 6 months ago
Bytecode has to be converted into machine code somehow. That's also how .NET works: first C# gets compiled into MSIL (an intermediate representation), and then it just-in-time compiles the MSIL into the machine code for the underlying architecture.
@bamless95 • 6 months ago
Not necessarily; bytecode can just be interpreted in place. In a loose sense it is being "converted" to machine code, meaning that we are executing different snippets of machine code through branching, but JIT compilation has a very different meaning in the compiler and interpreter field. What Python is really doing is executing a loop with a switch branching on every possible opcode. By looking at the interpreter implementation in the CPython GitHub repo, in `Python/ceval.c` and `Python/generated_cases.c.h` (alas, YouTube is not letting me post links), you can clearly see there is no JIT compilation involved.
@bamless95 • 6 months ago
What you are saying about C# (and, for that matter, Java and some other languages like LuaJIT or V8 JavaScript) is indeed true: they typically JIT the code either before or during interpretation. But CPython is a much simpler (and thus slower) implementation of a bytecode interpreter, which implements neither JIT compilation nor any serious code optimization (aside from a fairly rudimentary peephole optimization step).
@bamless95 • 6 months ago
Don't get me wrong, I think the video is phenomenal. I just wanted to correct a little imperfection that, as a programming-language nerd, I feel is important to get right. Also, greetings from Italy! It is good for once to see a fellow Italian making content that is worth watching on YT 😄
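The distinction drawn in this thread is easy to see from Python itself with the standard `dis` module (a small illustration of my own, not from the video):

```python
import dis

def mul(x, y):
    return x * y

# CPython compiles `mul` to bytecode once; at run time a C-level loop
# (the switch in Python/ceval.c mentioned above) dispatches on each
# opcode. No machine code is ever generated for this function.
ops = [ins.opname for ins in dis.get_instructions(mul)]
print(ops)
```

The exact opcode names vary between Python versions, but you will always see loads of the local variables followed by a multiply and a return; each of those names is a case in the interpreter's dispatch switch, not a compiled instruction.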
@amitshukla1495 • 9 months ago
wohooo ❤
@lukeskywalker7029 • 6 months ago
@umarjamilai you said most embedded devices don't support floating-point operations at all? Is that right? What would be an example, and what is that chip architecture called? Does a Raspberry Pi or an Arduino operate only on integer operations internally?
@tubercn • 9 months ago
Thanks, great video 🐱‍🏍 But I have a question: since we'll dequantize the output of the last layer using the calibration parameters, why do we need another "torch.quantization.DeQuantStub()" layer in the model to dequantize the output? It seems we have two consecutive dequantizations.
@pravingaikwad1337 • 5 months ago
For one layer Y = XW + b, if X, W, and b are quantized so that we get Y in quantized form, what is the need to dequantize this Y before feeding it to the next layer?
@Erosis • 9 months ago
You're making all of my lecture materials pointless! (But keep up the great work!)
@AleksandarCvetkovic-db7lm • 5 months ago
Could the difference in accuracy between static/dynamic quantization and quantization-aware training be because the model was trained for 5 epochs for static/dynamic quantization and only one epoch for quantization-aware training? I tend to think that 4 more epochs make more of a difference than the quantization method.
@theguyinthevideo4183 • 7 months ago
This may be a stupid question, but what's stopping us from just setting the weights and biases to be in integer form? Is it due to the nature of backprop?
@umarjamilai • 7 months ago
Forcing the weights and biases to be integers means adding more constraints to the gradient-descent algorithm, which is not easy and is computationally expensive. It's like asking you to solve the equation x^2 - 5x + 4 = 0 but only for integer x: you can't just use the formula you learned in high school for quadratic equations, because that formula returns real numbers. Hope it helps!
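Quantization-aware training sidesteps the integer-constraint problem described above by keeping the weights in floating point and only simulating quantization in the forward pass. A minimal NumPy sketch of that "fake quantization" idea (my own illustration; the function name is made up, and the straight-through estimator is only described in a comment):

```python
import numpy as np

def fake_quantize(w, scale):
    # quantize then immediately dequantize: the forward pass "sees"
    # the rounding error, but w itself stays in floating point
    return scale * np.clip(np.round(w / scale), -127, 127)

# float "master" weights: gradient descent keeps updating these, while
# the loss is computed on the fake-quantized copy. On the backward pass
# the straight-through estimator treats round() as identity, so
# gradients can still flow despite the non-differentiable rounding.
w = np.array([0.31, -0.72, 0.05], dtype=np.float32)
scale = np.abs(w).max() / 127.0
w_seen = fake_quantize(w, scale)
```

Because the network is trained while seeing its own quantization error, it settles into regions of the loss surface that are robust to rounding.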
@venkateshr6127 • 9 months ago
Could you please make a video on how to build tokenizers for languages other than English?
@DiegoSilva-dv9uf • 9 months ago
Thanks!
@dzvsow2643 • 9 months ago
Assalamu alaykum, brother. Thanks for your videos! I have been working on game development using pygame for a while, and I now want to start deep learning in Python. Could you make a roadmap video? Thank you again!
@umarjamilai • 9 months ago
Hi! I will do my best! Stay tuned!
@elieelezra2734 • 9 months ago
Umar, thanks for all your content; I've improved a lot thanks to your work! But there is something I don't get about quantization. Let's say you quantize all the weights of your large model: the prediction is not the same anymore! Does it mean you need to dequantize the prediction? If yes, you don't talk about it, right? Can I have your email to get more details, please?
@umarjamilai • 9 months ago
Hi! Since the output of the last layer (the matrix Y) will be dequantized, the prediction will be "the same" as (very similar to) that of the original model. The Y matrix of each layer is always dequantized, so the output of each layer is more or less equal to that of the original model.
@alainrieger6905 • 9 months ago
Hi, thanks for your answer @umarjamilai. Does it mean, for post-training quantization, that the more layers a model has, the greater the difference between the quantized and the original model, since the error accumulates at each new layer? Thanks in advance.
@umarjamilai • 9 months ago
@alainrieger6905 That's not necessarily true, because the error in one layer may be "positive" and in another "negative", and they may compensate for each other. For sure the number of bits used for quantization is a good indicator of the quality of quantization: if you use fewer bits, you will have more error. It's like having an image that is originally 10 MB and trying to compress it to 1 MB or to 1 KB: of course in the latter case you'd lose much more quality than in the first.
@alainrieger6905 • 9 months ago
@umarjamilai Thank you, sir! Last question: when you talk about dequantizing a layer's activations, does it mean the values go back to 32-bit format?
@umarjamilai • 9 months ago
@alainrieger6905 Yes, it means going back to floating-point format.
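The round trip discussed in this thread (floats to 8-bit integers and back) can be sketched with the asymmetric scheme from the video. This is a NumPy illustration of my own, reconstructing the formulas as best I can; the names alpha, beta, scale, and zero_point follow the slides:

```python
import numpy as np

def asymmetric_quantize(x, bits=8):
    """Map floats in [beta, alpha] onto unsigned integers in [0, 2^bits - 1]."""
    alpha, beta = float(x.max()), float(x.min())
    scale = (alpha - beta) / (2**bits - 1)
    zero_point = int(round(-beta / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 2**bits - 1)
    return q.astype(np.uint8), scale, zero_point

def dequantize(q, scale, zero_point):
    """Back to floating point; each value is off by at most about scale/2."""
    return (scale * (q.astype(np.float32) - zero_point)).astype(np.float32)

x = np.array([-1.5, -0.2, 0.0, 0.7, 2.1], dtype=np.float32)
q, s, z = asymmetric_quantize(x)
x_hat = dequantize(q, s, z)
```

Note that `dequantize` returns values close to, but not exactly equal to, the originals; this per-value rounding error is exactly the quantization error the video's accuracy experiments measure.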