Qwen 2.5 72B Benchmarked - World's Best Open Source AI Model?

5,246 views

Digital Spaceport

1 day ago

Comments: 55
@opensourcedev22 2 months ago
I use this 32B model primarily for coding now. It's done so well that I wonder if they trained it against Claude 3.5 coding output, because it is very good. I wish one of these companies would make a hyper-focused coding-corpus model so that it can fit into 48GB of VRAM at very high precision.
@justtiredthings 2 months ago
I believe they're planning on releasing a 32B coder variant soon.
@alx8439 1 month ago
It went off the rails because you keep reusing the same Open WebUI chat and overflowed Ollama's default context size, which is 2k for any model. Use different chats for different topics; that saves you from pushing the entire chat history to the model even when the previous messages are no longer relevant to what you're asking. And increase the context size to something like 8k.
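For reference, a minimal sketch of what raising the context window per request can look like with the ollama Python client; the model tag and the 8k value are illustrative assumptions, not the video's setup (the same parameter can also be set in Open WebUI's advanced options or a Modelfile):

```python
# Sketch: override Ollama's small default context window for one request.
# Assumes the `ollama` Python package and a locally pulled model.
import ollama

response = ollama.chat(
    model="qwen2.5:72b",  # assumed tag; use whatever model you have pulled
    messages=[{"role": "user", "content": "Summarize our discussion so far."}],
    options={"num_ctx": 8192},  # raise the context window from the 2k default
)
print(response["message"]["content"])
```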
@KarameloKoala 24 days ago
Love your videos! You should consider creating a list of the models you've tested along with a score you assign :)
@Jp-ue8xz 2 months ago
What do you mean it "failed" the first 2 tests? Does the game actually work if you put a PNG image in the corresponding folder? Btw, there's a solid argument to be made that if the scenario you proposed was the "best" plan all of humankind was able to put together, the correct thing to do might be not to save us.
@DigitalSpaceport 2 months ago
No, it didn't, and it failed because Llama 3.1 70B was able to make a poor, but functioning, one in a single shot that did run.
@Lorv0 2 months ago
Awesome video! What is the name of the tool used for the web interface for local inference?
@DigitalSpaceport 2 months ago
Yes, there's a full software install video on that here: kzbin.info/www/bejne/ip6xhHehn6mHhdU. This is Open WebUI and Ollama together.
@alx8439 1 month ago
Add some tool-use tests, like web searching and text summarization. Open WebUI comes with web search and community tools you can equip. Ask it to find you some good AMD Ryzen laptops made in 2024 and compose a comparison table with all the specifications and prices.
@DigitalSpaceport 1 month ago
I'm just getting to the use-case videos and setting up tools and vision, so these things will be included in future evals.
@alx8439 1 month ago
@DigitalSpaceport Awesome, thanks! That will be super helpful for the community.
@DigitalSpaceport 1 month ago
@alx8439 Also, I did read your other comments, and thanks for taking the time to write them out. I am actively incorporating your feedback.
@Mike-pi3xu 1 month ago
It would be helpful to see ollama ps output to check how much is actually on the GPUs and how much is run by the CPU. I noticed that the four 4090s only ran at 1/4 compute utilization, and seeing the execution context might shine some light on the discrepancy. Please consider including this. It is especially important with GPUs that have less VRAM.
@DigitalSpaceport 1 month ago
I am mindful of this in all tests, and all models reviewed here fit fully in VRAM; I do check. Yes, the workload is split into 4 and each GPU runs at 1/4 speed on 1/4 of the workload. This is how llama.cpp, which is the model runner behind Ollama, currently does parallelism. vLLM enables an alternate way to do parallelism that may significantly improve on that, and I will test it here.
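As a point of comparison, a minimal sketch of tensor parallelism in vLLM, which shards each layer across the GPUs instead of splitting whole layers between them; the model id and sampling settings are assumptions for illustration, not the exact setup used in the video:

```python
# Sketch: serve Qwen 2.5 72B across 4 GPUs with vLLM tensor parallelism.
# Assumes vLLM is installed and the GPUs have enough combined VRAM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",  # assumed Hugging Face repo id
    tensor_parallel_size=4,             # shard every layer across 4 GPUs
)
params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```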
@tohando 2 months ago
Thanks for the video. Shouldn't you clear the chat after each question, so the context isn't filled with previous stuff?
@DigitalSpaceport 2 months ago
If I were benchmarking just for high numbers, probably. We have plenty of those benchmarks and synthetics, though. I'm interested in how usage goes for normies like me. I often don't start a clean new chat for each topic, and the ones I do start eventually meander off the originally conceived topic. It isn't scientific testing, but rather mimics normie usage patterns. That choice is purposeful.
@Evanrodge 2 months ago
@tohando Yes, his methodology is borked.
@NilsEchterling 1 month ago
@DigitalSpaceport Good intent on your part, but I think we should promote using LLMs well. Making new chats for new topics is simply something everyone has to learn, and YouTube videos like yours should educate on this.
@dna100 1 month ago
I've found the Qwen 2.5 7B model to be the best of the current crop of 7B models. I've tried Llama 3.1 7B, InternLM 2.5 7B, and Mistral 7B. My second choice is the InternLM model. Great video, by the way. Nice to hear an honest opinion about the benchmarks: they are completely 'gamed' and pretty much meaningless. The only way is to gauge them yourself, as you have done here. Good work.
@DigitalSpaceport 1 month ago
Qwen 2.5 is very good, I agree. I use it almost all the time now myself. The 32B variant also allows me to have several models running at once.
@ManjaroBlack 1 month ago
Exactly my experience at the 7B size. My use case builds quite large prompts, and they all struggle at this size, but InternLM was my go-to. I find that Qwen 2.5 and InternLM are about the same, but I prefer Qwen's output and formatting.
@ManjaroBlack 1 month ago
Size vs. quantization: when I can fit a larger model at q2 quantization, I can instead fit a smaller model at q8. Comparing the two, the larger q2 model is much more likely to give me gibberish. The only advantage I find with a larger model at lower quantization is that it handles a large system prompt better.
@DigitalSpaceport 1 month ago
Yeah, my floor is q4 from what I have seen returned. Have you compared q8 vs fp16 on many models? The gains at fp16 in Llama 3.1/3.2 don't seem to make a big difference.
@ManjaroBlack 1 month ago
@DigitalSpaceport I don't find any difference in quality between q8 and fp16. Even q6 tends to be about the same for my use case. Below q6 I can tell a difference with long inputs. One thing I do see a difference in is the size of the output: q8 and fp16 seem to output about the same amount, but q6 will often output less, which can be a problem for me if I have a large output structure in the system prompt.
@DigitalSpaceport 1 month ago
Good to get your observation on that. q8 does seem like the sweet spot, and Qwen 2.5 at q8 is mind-blowingly good when I need to have 4 models loaded in for RAG.
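A rough sketch of the kind of side-by-side check discussed in this thread: the same prompt against two quantizations, comparing output length and latency. The model tags are assumptions; substitute whichever quants you actually have pulled locally:

```python
# Sketch: compare two quantization levels on one prompt via the ollama client.
# Assumes the `ollama` Python package and that both tags have been pulled.
import time
import ollama

PROMPT = "Write a five-step plan for evaluating local LLM quantization levels."

for tag in ("qwen2.5:32b-instruct-q8_0", "qwen2.5:72b-instruct-q2_K"):  # assumed tags
    start = time.time()
    resp = ollama.chat(model=tag, messages=[{"role": "user", "content": PROMPT}])
    text = resp["message"]["content"]
    print(f"{tag}: {len(text)} chars in {time.time() - start:.1f}s")
    print(text[:200], "...\n")
```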
@stattine 2 months ago
Can you clarify why you are going so deep on the P2000 vs the P5000? The extra VRAM in the P5000 seems like a clear choice.
@DigitalSpaceport 2 months ago
Oh, it's because I tested the P2000 since I have one on hand: kzbin.info/www/bejne/eXuVf3uFeLZlr7s. I'm pretty much testing everything I've got around. I don't have a P5000, but yeah, extra VRAM and more CUDA cores FTW!
@Merializer 2 months ago
Do you think this 72B model can be run with 64GB of RAM and an RTX 3060? Btw, I wouldn't use a red background colour for PASSED; use green instead. Red seems more suited for a word like FAILED.
@DigitalSpaceport 2 months ago
Good point re the colors; updated for the next video. Yes, it can split layers between VRAM and system RAM, but the performance will be painfully slow.
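For anyone who wants to try it anyway, a minimal sketch of a partial GPU offload with the ollama Python client, spilling the layers that don't fit in 12GB of VRAM over to the CPU and system RAM; the tag and layer count are assumptions, and throughput will be very low:

```python
# Sketch: run an oversized model by keeping only some layers on the GPU.
# Assumes the `ollama` Python package; the remaining layers execute on the CPU.
import ollama

resp = ollama.chat(
    model="qwen2.5:72b-instruct-q4_K_M",  # assumed tag, far larger than a 3060's VRAM
    messages=[{"role": "user", "content": "Say hello."}],
    options={"num_gpu": 15},  # layers offloaded to the GPU; the rest run on CPU
)
print(resp["message"]["content"])
```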
@Merializer 2 months ago
@DigitalSpaceport Not gonna try it then, I think. My internet is really slow for downloading models, so that saves me the trouble.
@justtiredthings 2 months ago
The ethical question is very difficult, maybe unsolvable, given that we want these things to be agents. The problem is, if you can give it just the right ethical scenario and the AI will do as you say, then any bad actor could simply lie to their AI and have it go on a killing spree. That's not too desirable either. But then how do we ensure that these things are making reasonable decisions when we give them any level of autonomy? I'm not sure how we resolve that contradiction.
@DigitalSpaceport 2 months ago
In my opinion, having a single primary initiator's decision-making scaffolding be the entire chain is an issue I can see leading to real problems. A decent solution that can work better is a set of unassociated, independent evaluative recommendation systems, not running on homogeneous base models, feeding into an arbiter that weights them. We already have machines making life-and-death decisions autonomously at scales we likely don't see day to day.
@justtiredthings 2 months ago
@DigitalSpaceport Tbh, I'm not parsing you very well.
@elecronic 1 month ago
Why are all the questions in the same chat? You should start a new chat for each question.
@DigitalSpaceport 1 month ago
I have started doing this in the new videos and will keep doing it going forward. Thanks.
@CheesecakeJohnson-g7q 2 months ago
Hi, I tried several versions of Qwen+Calme 70B, 72B, and 78B on LM Studio with all sorts of quants; Q5 and Q6 seem to perform best, but I didn't find any with sufficient conversational speed. The 3090 seems to work. While I have read the definitions of K, K_M, K_S and so on, I haven't really absorbed the concept yet, and from one model to the next the "best performing model for my hardware" isn't always the same. The cozy spot is around 16GB even though the device has 24GB. What am I missing? What settings should I tweak?
@83erMagnum 2 months ago
I'd be interested in this too. There is so little specific content for 24GB VRAM machines. The demand should be there, since that's the only affordable solution for most.
@justtiredthings 2 months ago
I've got a single 3090. The 32B quant is pretty slow (~1.5 tokens/s), but the 14B model is surprisingly decent for its size and reasonably fast (7-10 tokens/s).
@CheesecakeJohnson-g7q 1 month ago
@justtiredthings I run Codestral 22B smoothly here at Q5_K_M, and Q6/Q7/Q8 at increasingly unsatisfying speeds, but it runs.
@kkendall99 2 months ago
That model is probably hard-coded to cause no harm no matter what the scenario.
@mrorigo 2 months ago
You should clear the context before trying the next challenge, no?
@mrorigo 2 months ago
As I said, clear the context before you try a new challenge, or your responses will be confused.
@DigitalSpaceport 2 months ago
Purposefully done this way at this time. I've explained why a few times already.
@justtiredthings 2 months ago
Please, please, please test it on an M1 or M2 Ultra. I'm dying for someone to demonstrate the speeds on Apple's efficient chip.
@DigitalSpaceport 2 months ago
I can send you an address and Amazon can deliver a gift? 😀
@justtiredthings 2 months ago
Haha, fair enough, but I don't know what you've got, given that supercomputer setup. M-Ultra chips actually seem like the economical option for mid-sized LLM inference, but I haven't been able to find enough testing results to confirm that; they're weirdly difficult to find.
@xyzxyz324 1 month ago
The AI-model madness is getting too complicated. As an end user I have to evaluate a huge number of things: parameters, hardware needs, fine-tunes, the role to dedicate to the model, the main domain it's trained on, top-p, top-k, penalty, temperature, etc. I need an AI model to help me find the most easygoing one for my needs! And by the way, they are big in size and the hardware requirements are going crazy. Someone please collect all the AI-model knowledge in one place and create an easy interface with few parameters. Using and hosting an AI model is getting more expensive and complicated than owning a real brain.
@DigitalSpaceport 1 month ago
I found a pretty decent interface with far fewer "knobs" that I will be reviewing: AnythingLLM. I think it might fit your needs.
@mrorigo 2 months ago
Sentence, not Sentance; fr, you have LLMs to correct your spelling, no?
@DigitalSpaceport 2 months ago
It's easy to toss hay from the sidelines, which is why I urge everyone to get into YouTube themselves. It is a very humbling journey as a solo producer. Especially hard are the viewers who catch some detail you must have missed, but you have no idea what they are talking about because they don't give any context.