Comparing Quantizations of the Same Model - Ollama Course

4,798 views

Matt Williams

1 day ago

Comments: 40
@romulopontual6254 · 1 month ago
On the subject of quantization, you could have included how to choose among q4_0, q4_K, and q4_K_M. If one is better at the same quantization level, for example if q4_K is better than q4_0, why are we creating both? Thank you for these videos; the course's model of focusing on a single topic per session is very nice!
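To explore that question hands-on, here is a minimal sketch of pulling a few quant variants of the same model and asking each the same question. The exact tags shown are assumptions; check the model's page on the Ollama library for the tags that actually exist.

```
# A sketch: pull a few quant variants and compare their answers side by side.
# The llama3.1 tags below are illustrative; verify them on the library page.
for tag in 8b-instruct-q4_0 8b-instruct-q4_K_M 8b-instruct-q8_0; do
  echo "### llama3.1:$tag ###"
  ollama pull "llama3.1:$tag"
  ollama run "llama3.1:$tag" "Explain what an event horizon is in two sentences."
done
```

As a rule of thumb the llama.cpp community generally reports, the K-quants (q4_K, q4_K_M) give better quality than the older q4_0 at a similar size; q4_0 mostly remains for compatibility and speed on some backends, which is part of why both still exist.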
@tudor-octavian4520 · 1 month ago
I'm really in awe of how well you explain everything. I wish I had professors with your patience and teaching ability when I was at university. Anyway, thank you for the lesson. I already love Ollama, but your content is making me see LLMs and, in this case, quantizations, with different eyes. To be honest, I used to think that anything lower than q8 for, let's say, a 7 or 8 billion parameter model would be pretty much useless, but after experimenting with Llama 3.1, Mistral, and a few others, I think q4 is definitely the sweet spot for my needs. Llama 3.1 at q4 retains a decent amount of reasoning capability, and I can increase the context length to have it work better with whatever information I want to feed it on the spot. Thanks again for the content. It's awesome!
@tylerlindsay9263 · 1 month ago
I agree with all the points made in the video and would just like to add my own experience for viewers looking for more details in the comments. One important factor to consider when choosing quantization levels is the impact of hardware constraints. For example, I've been running the LLaMA 3.1 70B model, which fits in 48GB of VRAM at Q4 but with a limited context window. I found that running the 70B model at Q2 (which frees up memory for extending the context window) gave me better results than the 8B model with an extended context window. This balance between model size, quantization, and context window size can be crucial depending on your specific use case and hardware capabilities.
@technovangelist · 1 month ago
I guess that is partially implied, since I can't run 70B FP16 on my Mac.
@tylerlindsay9263 · 1 month ago
@technovangelist You're absolutely right; I overlooked the fact that FP16 didn't run on your Mac. I like how well you emphasized that there's no one-size-fits-all solution when it comes to running these models. It really comes down to experimenting with different configurations to see what works best with your specific hardware and needs.
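A back-of-envelope way to reason about the VRAM trade-off described in this exchange. The bits-per-weight figures are rough assumptions (real quants carry extra metadata, and the KV cache grows with context length), so treat the numbers as ballpark only.

```
# Rough weights-only VRAM estimate; ignores KV cache (grows with num_ctx)
# and runtime overhead. Approximate bits per weight:
# fp16 ~16, q8_0 ~8.5, q4_K_M ~4.8, q2_K ~2.6
estimate() {
  awk -v p="$1" -v b="$2" \
    'BEGIN { printf "%sB params @ %s bpw -> ~%.0f GB for weights\n", p, b, p * b / 8 }'
}
estimate 70 4.8   # 70B at roughly Q4 -> ~42 GB
estimate 70 2.6   # 70B at roughly Q2 -> ~23 GB
estimate 8  4.8   # 8B at roughly Q4  -> ~5 GB
```

The gap between the weight size and your total VRAM is what is left for the context window, which is why dropping to Q2 can buy a much longer context on the same card.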
@fabriai · 1 month ago
Excellent video, Matt. Your take on the usefulness of benchmarks is smart.
@LuisYax · 1 month ago
Great tutorial as usual. Btw, here is a Windows PowerShell command to update all the models if you have more than one already installed:

ollama ls | ForEach-Object {"{0}" -f ($_ -split '\s+')} | Where-Object { $_ -notmatch 'failed' -and $_ -notmatch 'NAME' } | ForEach-Object { $model = $_; "Updating model $model"; ollama pull $model }

This one works on both Mac and Linux:

ollama ls | awk '{print $1}' | grep -v NAME | while read model; do echo "### updating $model ###"; ollama pull $model; done

I wrote those myself, but you can ask your favorite GPT for an explanation.
@ReidKimball · 1 month ago
You changed my thinking on which quant to use! I'll experiment more with running the lowest quant that still gives acceptable answers. Thanks!
@ManjaroBlack · 1 month ago
This is great. I'd love to see the difference between the quant levels as the length of the prompt increases. I find that the lower quants don't handle longer inputs very well. I'm not sure why that is.
@muchainganga9563 · 1 month ago
Great stuff!
@ISK_VAGR · 1 month ago
Thanks Matt. Very well explained and informative. I use tree-of-thoughts, obviously with multi-shot prompting, for what are normally very complex tasks where the model needs to pick up the hints in a clinical case description to help diagnose or optimize the treatment of the patient. I see that the bigger-parameter models perform better, because they pick up more details and correlate the facts better. I noticed that Claude and Gemini are the kings there. What about quantization in this case? Any recommendations?
@soylentpink7845 · 1 month ago
Interesting video! Thanks!
@solyarisoftware · 1 month ago
@technovangelist Thanks, Matt. The experiment was surprising. All in all, it seems that higher quantization achieves better results (at least for function calling), which is counterintuitive to me. However, if this is statistically true, it's good news for local applications driven by a small LLM that calls external (but still local) services. In other words, it's promising for real-time on-prem automation! As someone suggested in a comment, the balance between model size, quantization, and context window size seems to be crucial. I’d suggest dedicating a session to the context window size and its usage. I’m personally confused by the default length value in Ollama and how to set the desired window size. Thanks for this course. Giorgio
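On the context-window question raised above: a minimal sketch of overriding it per request through Ollama's REST API, assuming the default localhost:11434 endpoint. The model tag and prompt are only illustrative; num_ctx and the other sampler options go under "options".

```
# A sketch of setting the context window per request via the REST API
# (assumes Ollama is listening on the default localhost:11434).
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize the notes above in three bullet points.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```

In an interactive ollama run session the equivalent is /set parameter num_ctx 8192, and a Modelfile can bake it in with PARAMETER num_ctx 8192; if nothing sets it, the model's default (often 2048) applies.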
@yamitanomura · 1 month ago
Great video
@vulcan4d · 1 month ago
Would love to see how to get it running in Proxmox with multiple GPUs. Lots of old articles out there.
@AndyAinsworth · 1 month ago
Another great video, thanks. Would you mind adding a link to the related YouTube video in each README file in your videoprojects repo? It would make it a lot easier to find the video when browsing the repo. Ta, keep up the good work.
@technovangelist · 1 month ago
Great idea. Thanks
@VinCarbone · 1 month ago
Rule of thumb: take the K-quants that fit into your GPU memory. Usually down to q3 the loss is really negligible. If the model plus context fits completely into memory, use I-quants.
@technovangelist · 1 month ago
I don't think there is a rule of thumb for everyone. K-quants tend to be a touch slower. It depends on your needs. You should test for yourself.
@VinCarbone · 1 month ago
@technovangelist Yes, but I-quants are better and faster than K-quants only if they fit completely into VRAM. Ollama's automatic VRAM split is quite conservative, so it's usually better to keep an eye on this. Anyway, I really appreciate your testing! Just one suggestion: in this kind of testing it's better to always use a fixed seed, to allow comparison and maintain repeatability!
@technovangelist · 1 month ago
There is no best for everyone. That was the point. Test the way you would normally use the model. If a temp has been set, use that recommended temp. Otherwise your testing is irrelevant.
@michaelmistaken2863 · 1 month ago
For the black hole prompt there are some faults in the physics wording of the Q2 response that had me rank it second, with the Q4 highest. The poor English style of the FP16 response (two uses of "pull" in the opening sentence) had me rate it lowest when, in retrospect, it probably contains the least problematic physics wording of the three. (Background: failed '90s astrophysics major.) I guess prompting is important. ;)
@escain · 1 month ago
I usually benchmark models with a simple programming question:

```
You are a software engineer experienced in C++. Write a trivial C++ program that follows this code style:
- Use modern C++20.
- Use the auto func(...) -> ret syntax, even for auto main() -> int.
- Always open curly braces on a new line: DON'T write auto main() -> int { (no new line before '{'); DO write auto main() -> int with '{' on the next line.
- Comment your code.
- No explanation, no introduction, keep verbosity to the minimum, only code.
```

Even this dummy question fails most of the time on any Q4 model I have tried on Ollama. I hope to get better results with better quantization, but I need to upgrade my computer for that.
@technovangelist · 1 month ago
This is not a great prompt to use. You don't tell it what you want the code to do. You probably want the output as JSON so you can more easily extract just the code, and some of the instructions are inconsistent. You will probably get better results just by working on a better prompt.
@escain · 1 month ago
@technovangelist Correct, I am not asking for any specific features or output. I have tried dozens of prompts, and this is one of the most successful ones I got. Most models seem unable to follow more than 4-5 code-style rules (especially if they go against the usual code style for that language); they fail to add a new line before the opening curly brace, and "no explanation/introduction" also seems problematic. What exactly is inconsistent?
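Picking up the JSON suggestion from this exchange: a sketch using the API's "format": "json" flag so the code lands in a single field instead of being scraped out of prose. The model tag and the prompt wording are illustrative only, not the original benchmark.

```
# A sketch of requesting structured output: "format": "json" constrains the
# model to emit valid JSON. Model tag and prompt are illustrative.
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "format": "json",
  "stream": false,
  "prompt": "Write a C++20 FizzBuzz. Use trailing return types (auto f() -> T) and put every opening brace on its own line. Respond as JSON with a single key \"code\" whose value is the full program."
}'
```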
@sammcj2000 · 1 month ago
What about with complex coding tasks such as refactoring a codebase?
@technovangelist · 1 month ago
Exactly. Ask the questions. Test what you need.
@UnwalledGarden · 1 month ago
Would you (in general) prefer a higher-parameter model at a smaller quant over a lower-parameter model at a larger quant?
@technovangelist · 1 month ago
I'd prefer the smallest, in both respects, that gets me the answers I like. 8B q4 is usually ideal.
@JNET_Reloaded · 1 month ago
Is there a way I can split a big model like llama3 (or whatever) into 2 models of 50% of the size, so I can load it on an RPi5 8GB without hitting a not-enough-memory error? Also, is there any way to find out what the model consists of, so I can strip out unwanted parts and keep only the parts I find useful?
@technovangelist · 1 month ago
Nope. Not yet.
@TheDiverJim · 1 month ago
If you're ever visiting Space Camp, let me know. I'll make you a mojito and we can talk AI.
@technovangelist · 1 month ago
Nice. It will be a few years before Stella is ready, but my wife had such a good time there when she was that age.
@supercurioTube · 1 month ago
I found the evaluation interesting, and the conclusion wise: try with your own prompts and see what happens. I would suggest extending the evaluation to many more conversation turns, because some models get lost later on despite doing well on the first reply. Your evaluation made me curious to try different quants at temperature 0 and the same seed, to see if some of the quants end up with identical output!
@technovangelist · 1 month ago
That's great. Glad you found it helpful.
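A sketch of the repeatability idea raised in this thread: temperature 0 plus a fixed seed, so outputs from different quants can be diffed directly. The tags are illustrative, jq is assumed to be installed, and even with a fixed seed some GPU backends are not perfectly bit-exact.

```
# A sketch: run the same prompt against two quants with temperature 0 and a
# fixed seed, then diff the responses. Tags are illustrative; jq is assumed.
prompt="Explain how a black hole forms, in one paragraph."
for tag in 8b-instruct-q2_K 8b-instruct-q4_K_M; do
  curl -s http://localhost:11434/api/generate -d "{
    \"model\": \"llama3.1:$tag\",
    \"prompt\": \"$prompt\",
    \"stream\": false,
    \"options\": { \"temperature\": 0, \"seed\": 42 }
  }" | jq -r .response > "out-$tag.txt"
done
diff out-8b-instruct-q2_K.txt out-8b-instruct-q4_K_M.txt && echo "identical output"
```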
@SonGoku-pc7jl · 1 day ago
With this Mac code, I understand that ollama pull model downloads it again if it has been updated, and doesn't download it again if it hasn't? I don't see any fetch in the code, and ChatGPT made me think otherwise, so I'm not sure.
@technovangelist · 1 day ago
ollama pull will get the latest version of any model if there is a newer version.
@SonGoku-pc7jl · 1 day ago
Today llama3.2 3b can use tools :) Just saying... ;)
@technovangelist · 1 day ago
Llama 3 and 2 all work with tools. Every model works with tools and functions if you use them right.
@SonGoku-pc7jl · 1 day ago
Thanks! :)