Funny thing: I tried the same Tetris example locally with the q8 and fp16 versions of Qwen 2.5 Coder 32B and it generated buggy code in both cases. When I tried with the default quantization (q4_k_m if I'm not mistaken) it got it perfect on the first try (properly bounded, and you could lose the game too). I guess there's a luck factor involved.
@volkovolko · a month ago
Yeah, it might be the luck factor. Or maybe the Qwen architecture is optimised for high quantization levels 🤷♂️ Or maybe your q8 version wasn't properly quantized; I think they updated their weights at one point
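To make the q4/q8 discussion concrete, here is a minimal, hypothetical sketch of symmetric round-to-nearest weight quantization. This is not the actual q4_k_m scheme (real GGUF K-quants use block-wise scales and offsets), just a toy illustration of why fewer bits means a coarser weight grid:

```python
def quantize_dequantize(weights, bits):
    """Quantize weights to `bits` bits with a single symmetric scale,
    then map them back to floats.

    Real GGUF formats like q4_k_m are block-wise and more sophisticated;
    this toy version only shows the bit-width / precision trade-off.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    return [q * scale for q in quantized]

weights = [0.8, -0.31, 0.05, -0.77]
w4 = quantize_dequantize(weights, bits=4)   # coarser grid, larger error
w8 = quantize_dequantize(weights, bits=8)   # finer grid, smaller error
err4 = max(abs(a - b) for a, b in zip(weights, w4))
err8 = max(abs(a - b) for a, b in zip(weights, w8))
print(err4, err8)
```

The 8-bit round trip lands much closer to the original weights, which is why a badly prepared q8 generating worse code than q4 is surprising and points at luck or a broken upload rather than the bit width itself.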
@66_meme_99 · a month ago
Luck is called temperature nowadays :D
@volkovolko · a month ago
Yeah, I know. Top_k too, right? @@66_meme_99
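The two knobs being joked about here can be sketched in a few lines. This is a generic, illustrative implementation of temperature and top_k sampling over raw logits (the logit values are made up), not any specific inference engine's code:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick a token index from raw logits using temperature and top-k sampling.

    temperature < 1 sharpens the distribution (less 'luck'),
    temperature > 1 flattens it; top_k keeps only the k most likely tokens.
    """
    # Temperature scaling: divide logits before the softmax.
    scaled = [l / temperature for l in logits]

    # Top-k filtering: mask everything outside the k highest logits.
    if top_k is not None:
        kth_best = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= kth_best else float("-inf") for l in scaled]

    # Numerically stable softmax, then a weighted random draw.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# As temperature -> 0, sampling collapses to greedy argmax,
# which is why reruns at low temperature are far more repeatable.
logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, temperature=0.01))
```

This is also why "it got it perfect the first time" is hard to reproduce: at typical temperatures the same prompt legitimately yields different code on each run.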
@DOCTOR-FLEX · a month ago
Thank you for this demonstration. In the future, please work on more complex apps. I’m happy you tried Tetris instead of only the snake game.
@volkovolko · a month ago
The issue is that we need to balance the complexity of the tasks. If it's too easy, all models get it right, so we can't compare them. If it's too difficult, all models fail, so we can't compare them. Tetris and Pac-Man currently seem like a good fit for SOTA models and aren't that heavily tested, which is why I use them
@Kusmoti · a month ago
nice vid! what's your 3090 setup my guy
@volkovolko · a month ago
Asus ROG STRIX 3090, 32 GB DDR4 3200 MHz, i9-11900KF
@Rolandfart · a month ago
You should ask for physics demos like soft bodies, particles, fluid particles, cloth. Anything math-heavy, pretty much.
@volkovolko · a month ago
Okay, I will try in the next video
@IgnatShining · a month ago
Sweet. I remember, when ChatGPT first appeared, feeling very pessimistic that this tech would be locked in big companies' datacenters. Glad I was wrong
@volkovolko · a month ago
Yes, it's so awesome that this technology is moving toward open source 👍
@bigbrotherr · 29 days ago
Not a great test to me, because these models have been trained on these games before and the code is in there. Let's try something custom and see how it can reason, create, and solve problems. That would make it a good model. Also, Claude 3.5 Sonnet is the best coder and very rarely makes mistakes when coding.
@volkovolko · 23 days ago
I would be happy to test with any prompt you give to me ^^
@owonobrandon8747 · a month ago
The error produced by GPT was minimal; a "hallucination"
@electroheadfx · a month ago
amazing, thanks for the test
@volkovolko · a month ago
Glad you liked it!
@oguzhan.yilmaz · a month ago
Nice video, but I think Claude is still better. When I compare these models, I always tell myself: if the models are reasonably close to each other in terms of technical specifications, it's okay to compare them, but if not, what's the point? I understand comparing open source models like Qwen and Llama, or closed source models like GPT-4o and Claude 3.5 Sonnet
@volkovolko · a month ago
Yes, the results of the tests in this video seem to show that: GPT-4o < Qwen2.5-Coder-32B < Claude 3.5 Sonnet (new)
@sthobvious · a month ago
The point is to compare quality... simple as that. Once you know quality, you can consider other factors like speed, price, availability, and of course confidentiality. The fact that Qwen2.5-Coder-32B is even close to Claude while being a _small_ open-weight model is amazing. Of course other factors can matter more than just quality. Speed and price are just as important. But limiting it to "Only compare quality when technical specs are comparable" makes no sense.
@oguzhan.yilmaz · a month ago
@@sthobvious It actually makes sense, because if you think about comparing GPT-3.5 with GPT-o1 or GPT-4o, do you really think that's fair? GPT-3.5: 😭 GPT-4o & GPT-o1: 🗿🗿
@nashh600 · a month ago
Thanks for the comparison, but this was painful to watch. Please cut the parts that aren't relevant to the subject, or at least add timestamps
@volkovolko · a month ago
I'm trying to do my best. When I made this video, I didn't have any speakers, so I couldn't test the audio or make great cuts
@SpaceReii · a month ago
This is pretty cool to see! It's nice to see how the models compare between each other. For me, even the 3B model was amazing at making a Python snake game. Thanks for the comparison, it really does show the difference.
@volkovolko · a month ago
Yeah, I totally agree. The Qwen series (especially the coder models, for me) are just so amazing. I don't know why they aren't as well known as the Llama ones.
@volkovolko · a month ago
Do you want me to make a video comparing the 3B to the 32B?
@SpaceReii · a month ago
@@volkovolko Yeah, that would be really cool to see! I'd love to see how the models perform.
@volkovolko · a month ago
Okay, I will try to do it tomorrow
@mathiasmamsch3648 · a month ago
Why do people do these stupid tests where the code can be found 1000 times on the internet?
@volkovolko · a month ago
As explained in the video, I'm looking for more original tests. If you have one you want me to try, feel free to leave it in a comment and I'll try it in a following video
@mathiasmamsch3648 · a month ago
@@volkovolko If you are testing how to write a snake game, you are basically testing knowledge retrieval, because that code exists in 1000 variants on the internet. It gets interesting if you demand variations, like 'but the snake grows in both directions' or 'random obstacles appear and disappear after some time, not too close to the snake'. Think of whatever you want, but whether a model can do Tetris or Snake is hardly a test for LLMs these days.
@mathiasmamsch3648 · a month ago
@5m5tj5wg The 'better' model is not the one that can retrieve known solutions better, but the one that can piece together a solution to an unseen but related problem better. If you can find both the question and the answer on the net, then comparing a 32B-parameter model to a multi-hundred-billion-parameter model like GPT-4o or Sonnet makes even less sense, because of course they can store more knowledge. You need to ask for solutions to problems whose answers you cannot find on the internet to evaluate how good a model will be in practical use.
@volkovolko · a month ago
Yes, there is some truth to that. However, I think we can all agree that you don't want a 50+ minute video. Also, most of the code you will ask a model to write in the real world is also knowledge retrieval. As developers, we very often have to remake what has already been made. And the Snake game isn't that easy for LLMs. The Tetris game is very difficult, and I've never seen a fully working first try
@volkovolko · a month ago
And it is interesting to see that the Qwen model did better on these "retrieval" questions than GPT and Anthropic despite being way smaller in terms of parameters. It indicates that knowledge can be compressed a lot more than we thought
@cerilza_kiyowo · a month ago
I think you should ask Qwen 2.5 Coder 32B again to make the Tetris game better, so it will be fair. In my opinion, Qwen literally won the Tetris round. Even Claude generated better code after the error, but of course it failed at first
@volkovolko · a month ago
Yeah, for me the win went to Qwen. But okay, in the following videos I will always allow a second chance for all models. I will soon make a video comparing each size of Qwen2.5 Coder (so 0.5B vs 1.5B vs 3B vs 7B vs 14B vs 32B), so subscribe if you want to be notified ^^ I also started quantizing each model to GGUF and EXL2 on HuggingFace for those who are interested: huggingface.co/Volko76
@renerens · a month ago
Seems very interesting; I will try it tomorrow. For me, Nemotron 70B was the best, but even on my 4090 I can't run it locally.
@volkovolko · a month ago
I made the video comparing sizes : kzbin.info/www/bejne/jYHdmnaoltmVpsUsi=o3eKo-3pGY78wmMr
@volkovolko · a month ago
Yes, 70B is still a bit too much for consumer-grade GPUs
@kobi2187 · a month ago
If you do a real software project, you'll find that Claude Sonnet (new) is the best, and GPT-4 is very good at organizing.
@volkovolko · a month ago
I do real software projects, as I'm a developer. While Claude and GPT-4o are still better for big projects, Qwen is a good alternative for quick prompts, to avoid going to Stack Overflow for simple questions