Qwen2.5 Coder 32B vs GPT4o vs Claude 3.5 Sonnet (new)

5,402 views

Volko Volko

1 day ago

Comments: 46
@tpadilha84 1 month ago
Funny thing: I tried the same Tetris example locally with the q8 and fp16 versions of Qwen2.5 Coder 32B, and it generated buggy code in both cases. When I tried the default quantization (q4_k_m, if I'm not mistaken) it got it perfect the first time (properly bounded, and you could lose the game too). I guess there's a luck factor involved.
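For anyone who wants to reproduce this quant-vs-quant check locally, here is a minimal sketch using the Ollama Python client. The exact model tags are assumptions and may differ from what the registry actually hosts:

```python
# Minimal sketch: run the same prompt against several quantizations of
# Qwen2.5 Coder 32B via the Ollama Python client (pip install ollama).
# The tags below are assumptions -- check `ollama list` / the registry.
import ollama

PROMPT = "Write a complete, playable Tetris game in Python using pygame."

# Hypothetical tags for the fp16, q8_0 and q4_K_M builds of the model.
TAGS = [
    "qwen2.5-coder:32b-instruct-fp16",
    "qwen2.5-coder:32b-instruct-q8_0",
    "qwen2.5-coder:32b-instruct-q4_K_M",
]

for tag in TAGS:
    response = ollama.chat(
        model=tag,
        messages=[{"role": "user", "content": PROMPT}],
    )
    code = response["message"]["content"]
    # Save each attempt so the generated games can be compared by hand.
    with open(f"tetris_{tag.replace(':', '_')}.md", "w") as f:
        f.write(code)
    print(f"{tag}: {len(code)} chars generated")
```

Diffing the three outputs (or just running them) makes it easier to tell whether a given quant really produces buggier code or the differences are just sampling noise.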
@volkovolko 1 month ago
Yeah, it might be because of the luck factor. Or maybe the Qwen architecture is optimised for high quantization levels 🤷‍♂️ Or maybe your q8 version wasn't properly quantized; I think they updated their weights at one point.
@66_meme_99 1 month ago
Luck is called temperature nowadays :D
@volkovolko 1 month ago
Yeah, I know. Top_k too, right? @@66_meme_99
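For context, both knobs are exposed as sampling options in most local runners. A minimal sketch with the Ollama Python client (temperature, top_k and seed are documented Ollama options; the model tag is an assumption):

```python
# Minimal sketch: the "luck factor" is mostly the sampling settings.
# temperature sharpens/flattens the token distribution; top_k limits how
# many candidate tokens are considered at each step.
import ollama

response = ollama.chat(
    model="qwen2.5-coder:32b",  # assumed tag; adjust to what you pulled
    messages=[{"role": "user", "content": "Write a snake game in Python."}],
    options={
        "temperature": 0.0,  # 0 = (near-)greedy decoding, most reproducible
        "top_k": 1,          # only ever pick the single most likely token
        "seed": 42,          # fixed seed for repeatable runs
    },
)
print(response["message"]["content"])
```

With temperature 0 and top_k 1 the model should produce (nearly) the same program on every run, which makes quant-vs-quant comparisons much less noisy.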
@DOCTOR-FLEX 1 month ago
Thank you for this demonstration. In the future, please work on more complex apps. I’m happy you tried Tetris instead of only the snake game.
@volkovolko 1 month ago
The issue is that we need to balance the complexity of the tasks. If it's too easy, all models get it right, so we cannot compare them. If it's too difficult, all models fail, so we cannot compare them. Tetris and Pac-Man currently seem like a good fit for SOTA models and aren't tested as often, which is why I use them.
@Kusmoti 1 month ago
Nice vid! What's your 3090 setup, my guy?
@volkovolko 1 month ago
Asus ROG STRIX 3090, 32 GB DDR4 3200 MHz, i9-11900KF
@Rolandfart 1 month ago
You should ask for physics demos like softbody, particles, fluid particles, cloth. Pretty much anything math-heavy.
@volkovolko 1 month ago
Okay, I will try that in the next video
@IgnatShining 1 month ago
Sweet. I remember, when ChatGPT first appeared, feeling very pessimistic that this tech would be locked away in big companies' datacenters. Glad I was wrong
@volkovolko 1 month ago
Yes, it's so awesome that this technology is moving toward open source 👍
@bigbrotherr 29 days ago
Not a great test to me, because these models have been trained on these games before and the code is in there. Let's try something custom and see how it can reason, create, and solve problems. That would make it a good model. Also, Claude 3.5 Sonnet is the best coder and very rarely makes mistakes when coding.
@volkovolko 23 days ago
I would be happy to test any prompt you give me ^^
@owonobrandon8747 1 month ago
The error produced by GPT was minimal; a "hallucination"
@electroheadfx 1 month ago
Amazing, thanks for the test
@volkovolko 1 month ago
Glad you liked it!
@oguzhan.yilmaz 1 month ago
Nice video, but I think Claude is still better. When I compare models, I always ask myself: if the models are reasonably close to each other in terms of technical specifications, it's okay to compare them, but if they're not, what's the point? I understand comparing open-source models like Qwen and Llama, or closed-source models like GPT-4o and Claude 3.5 Sonnet.
@volkovolko 1 month ago
Yes, the results of the tests I ran in this video seem to show that: GPT-4o < Qwen2.5 Coder 32B < Claude 3.5 Sonnet (new)
@sthobvious 1 month ago
The point is to compare quality... simple as that. Once you know quality, you can consider other factors like speed, price, availability, and of course confidentiality. The fact that Qwen2.5-Coder-32B is even close to Claude while being a _small_ open-weight model is amazing. Of course other factors can matter more than quality alone, but limiting it to "only compare quality when technical specs are comparable" makes no sense.
@oguzhan.yilmaz 1 month ago
@@sthobvious It actually makes sense, because if you compare GPT-3.5 with o1 or GPT-4o, do you really think that's fair? GPT-3.5: 😭 GPT-4o & o1: 🗿🗿
@nashh600 1 month ago
Thanks for the comparison, but this was painful to watch. Please cut the parts that are not relevant to the subject, or at least add timestamps.
@volkovolko 1 month ago
I'm trying to do my best. When I made this video, I didn't have any speakers, so I couldn't test the audio or make great cuts.
@SpaceReii 1 month ago
This is pretty cool to see! It's nice to see how the models compare between each other. For me, even the 3B model was amazing at making a Python snake game. Thanks for the comparison, it really does show the difference.
@volkovolko 1 month ago
Yeah, I totally agree. The Qwen series (especially the coder models, for me) is just so amazing. I don't know why they aren't as well known as the Llama ones.
@volkovolko 1 month ago
Do you want me to make a video comparing the 3B to the 32B?
@SpaceReii 1 month ago
​@@volkovolko Yeah, that would be really cool to see! I'd love to see how the models perform.
@volkovolko 1 month ago
Okay, I will try to do it tomorrow
@mathiasmamsch3648 1 month ago
Why do people do these stupid tests where the code can be found 1000 times on the internet?
@volkovolko 1 month ago
As explained in the video, I'm looking for more original tests. If you have one you want me to try, feel free to leave it in a comment so I can try it in a following video.
@mathiasmamsch3648 1 month ago
@@volkovolko If you are testing whether it can write a snake game, then you are basically testing knowledge retrieval, because that code exists in 1000 variants on the Internet. It gets interesting if you demand variations, like "but the snake grows in both directions" or "random obstacles appear and disappear after some time, not too close to the snake". Think of whatever you want, but whether a model can do Tetris or Snake is hardly a test for LLMs these days.
@mathiasmamsch3648 1 month ago
@5m5tj5wg The "better" model is not the one that can retrieve known solutions better, but the one that can piece together a solution to an unseen but related problem. If you can find both the question and the answer on the net, then comparing a 32B-parameter model to a multi-hundred-billion-parameter model like GPT-4o or Sonnet makes even less sense, because of course they can store more knowledge. You need to ask for solutions to problems whose answers cannot be found on the Internet to evaluate how good a model will be in practical use.
@volkovolko 1 month ago
Yes, there is some truth to that. However, I think you can all agree that you don't want a 50+ minute video. Also, most of the code you ask a model to write in the real world is knowledge retrieval too; as developers, we very often have to remake what has already been made. And the Snake game isn't that easy for LLMs. The Tetris game is very difficult, and I have never seen a fully working first try.
@volkovolko 1 month ago
And it is interesting to see that the Qwen model did better on these "retrieval" questions than GPT and Anthropic despite being way smaller in terms of parameters. It indicates that knowledge can still be compressed a lot more than we thought.
@cerilza_kiyowo 1 month ago
I think you should ask Qwen2.5 Coder 32B again to make the Tetris game better, so it will be fair. In my opinion, Qwen literally won the Tetris round. Even Claude generated better code after the error, but of course it failed at first.
@volkovolko 1 month ago
Yeah, for me the win went to Qwen. But okay, in the following videos I will always give every model a second chance. I will soon make a video comparing each size of Qwen2.5 Coder (0.5B vs 1.5B vs 3B vs 7B vs 14B vs 32B), so subscribe if you want to be notified ^^ I have also started quantizing each model to GGUF and EXL2 on HuggingFace for those who are interested: huggingface.co/Volko76
@renerens 1 month ago
Seems very interesting; I will try it tomorrow. For me, Nemotron 70B was the best, but I can't run it locally, even on my 4090.
@volkovolko 1 month ago
I made the video comparing sizes: kzbin.info/www/bejne/jYHdmnaoltmVpsUsi=o3eKo-3pGY78wmMr
@volkovolko 1 month ago
Yes, 70B is still a bit too much for consumer-grade GPUs
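A rough back-of-envelope estimate shows why: weights alone need roughly params × bits-per-weight ÷ 8 bytes of VRAM. A small sketch (the bits-per-weight figures are approximate averages for common GGUF quants, and the KV cache plus runtime overhead are ignored):

```python
# Back-of-envelope VRAM estimate: params * bits_per_weight / 8 bytes.
# Approximate averages; real usage adds KV cache and runtime overhead.
PARAMS = 70e9  # 70B-parameter model

for name, bits in [("fp16", 16), ("q8_0", 8.5), ("q4_K_M", 4.85)]:
    gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if gb <= 24 else "does NOT fit"
    print(f"{name:7s} ~{gb:6.1f} GB -> {fits} in a 24 GB RTX 4090")
```

Even q4_K_M of a 70B model needs around 42 GB for the weights alone, while a 32B model at the same quant lands near 19-20 GB, which is why it fits on a single 3090/4090.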
@kobi2187 1 month ago
If you do a real software project, you'll find that Claude 3.5 Sonnet (new) is the best, and GPT-4 is very good at organizing.
@volkovolko 1 month ago
I do real software projects, as I'm a developer. While Claude and GPT-4o are still better for big projects, Qwen is a good alternative for small prompts, so you can avoid going to Stack Overflow for quick and simple questions.
@mnageh-bo1mm 1 month ago
Try a Next.js app.
@volkovolko 1 month ago
Okay, I will try that in the next video