Funny thing: I tried the same Tetris example locally with the q8 and fp16 versions of Qwen 2.5 Coder 32B and it generated buggy code in both cases. When I tried with the default quantization (q4_k_m if I'm not mistaken) it got it perfect on the first try (properly bounded, and you could lose the game too). I guess there's a luck factor involved.
@volkovolko · a month ago
Yeah, it might be the luck factor. Or maybe the Qwen architecture is optimised for high quantization levels 🤷♂️ Or maybe your q8 version wasn't properly quantized; I think they updated their weights at one point
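To make the q4/q8 discussion concrete, here is a minimal, hypothetical sketch of symmetric round-to-nearest weight quantization. This is not the actual q4_k_m scheme (real GGUF K-quants use block-wise scales and offsets), just a toy illustration of why fewer bits means a coarser weight grid:

```python
def quantize_dequantize(weights, bits):
    """Quantize weights to `bits` bits with a single symmetric scale,
    then map them back to floats.

    Real GGUF formats like q4_k_m are block-wise and more sophisticated;
    this toy version only shows the bit-width / precision trade-off.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax
    quantized = [round(w / scale) for w in weights]
    return [q * scale for q in quantized]

weights = [0.8, -0.31, 0.05, -0.77]
w4 = quantize_dequantize(weights, bits=4)   # coarser grid, larger error
w8 = quantize_dequantize(weights, bits=8)   # finer grid, smaller error
err4 = max(abs(a - b) for a, b in zip(weights, w4))
err8 = max(abs(a - b) for a, b in zip(weights, w8))
print(err4, err8)
```

The 8-bit round trip lands much closer to the original weights, which is why a badly prepared q8 generating worse code than q4 is surprising and points at luck or a broken upload rather than the bit width itself.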
@66_meme_99 · a month ago
Luck is called temperature nowadays :D
@volkovolko · a month ago
Yeah, I know. Top_k too, right? @@66_meme_99
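The two knobs being joked about here can be sketched in a few lines. This is a generic, illustrative implementation of temperature and top_k sampling over raw logits (the logit values are made up), not any specific inference engine's code:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick a token index from raw logits using temperature and top-k sampling.

    temperature < 1 sharpens the distribution (less 'luck'),
    temperature > 1 flattens it; top_k keeps only the k most likely tokens.
    """
    # Temperature scaling: divide logits before the softmax.
    scaled = [l / temperature for l in logits]

    # Top-k filtering: mask everything outside the k highest logits.
    if top_k is not None:
        kth_best = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= kth_best else float("-inf") for l in scaled]

    # Numerically stable softmax, then a weighted random draw.
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# As temperature -> 0, sampling collapses to greedy argmax,
# which is why reruns at low temperature are far more repeatable.
logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next_token(logits, temperature=0.01))
```

This is also why "it got it perfect the first time" is hard to reproduce: at typical temperatures the same prompt legitimately yields different code on each run.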
@DOCTOR-FLEX · a month ago
Thank you for this demonstration. In the future, please work on more complex apps. I’m happy you tried Tetris instead of only the snake game.
@volkovolko · a month ago
The issue is that we need to balance the complexity of the tasks. If it's too easy, all models get it right, so we can't compare them. If it's too difficult, all models fail, so we can't compare them. Tetris and Pac-Man currently seem like a good fit for SOTA models and aren't that heavily tested, which is why I use them
@Kusmoti · a month ago
nice vid! what's your 3090 setup my guy
@volkovolko · a month ago
Asus ROG STRIX 3090, 32 GB DDR4 3200 MHz, i9-11900KF
@Rolandfart · a month ago
You should ask for physics demos like soft bodies, particles, fluid particles, cloth. Anything math-heavy, pretty much.
@volkovolko · a month ago
Okay, I will try in the next video
@IgnatShining · a month ago
Sweet. I remember, when ChatGPT first appeared, feeling very pessimistic that this tech would be locked in big companies' datacenters. Glad I was wrong
@volkovolko · a month ago
Yes, it's so awesome that this technology is moving toward open source 👍
@bigbrotherr · 29 days ago
Not a great test to me, because these models have been trained on these games before and the code is in there. Let's try something custom and see how it can reason, create, and solve problems. That would make it a good model. Also, Claude 3.5 Sonnet is the best coder and very rarely makes mistakes when coding.
@volkovolko · 23 days ago
I would be happy to test with any prompt you give to me ^^
@owonobrandon8747 · a month ago
The error produced by GPT was minimal; a "hallucination"
@electroheadfx · a month ago
amazing, thanks for the test
@volkovolko · a month ago
Glad you liked it!
@oguzhan.yilmaz · a month ago
Nice video, but I think Claude is still better. When I compare these models, I always tell myself: if the models are reasonably close to each other in terms of technical specifications, it's okay to compare them, but if not, what's the point? I understand comparing open source models like Qwen and Llama, or closed source models like GPT-4o and Claude 3.5 Sonnet
@volkovolko · a month ago
Yes, the results of the tests in this video seem to show that: GPT-4o < Qwen2.5-Coder-32B < Claude 3.5 Sonnet (new)
@sthobvious · a month ago
The point is to compare quality... simple as that. Once you know quality, you can consider other factors like speed, price, availability, and of course confidentiality. The fact that Qwen2.5-Coder-32B is even close to Claude while being a _small_ open-weight model is amazing. Of course other factors can matter more than just quality. Speed and price are just as important. But limiting it to "Only compare quality when technical specs are comparable" makes no sense.
@oguzhan.yilmaz · a month ago
@@sthobvious It actually makes sense, because if you think about comparing GPT-3.5 with GPT-o1 or GPT-4o, do you really think that's fair? GPT-3.5: 😭 GPT-4o & GPT-o1: 🗿🗿
@nashh600 · a month ago
Thanks for the comparison, but this was painful to watch. Please cut the parts that aren't relevant to the subject, or at least add timestamps
@volkovolko · a month ago
I'm trying to do my best. When I made this video, I didn't have any speakers, so I couldn't test the audio or make great cuts
@SpaceReii · a month ago
This is pretty cool to see! It's nice to see how the models compare between each other. For me, even the 3B model was amazing at making a Python snake game. Thanks for the comparison, it really does show the difference.
@volkovolko · a month ago
Yeah, I totally agree. The Qwen series (especially the coder models, for me) are just so amazing. I don't know why they aren't as well known as the Llama ones.
@volkovolko · a month ago
Do you want me to make a video comparing the 3B to the 32B?
@SpaceReii · a month ago
@@volkovolko Yeah, that would be really cool to see! I'd love to see how the models perform.
@volkovolko · a month ago
Okay, I will try to do it tomorrow
@mathiasmamsch3648 · a month ago
Why do people do these stupid tests where the code can be found 1000 times on the internet?
@volkovolko · a month ago
As explained in the video, I'm looking for more original tests. If you have one you want me to try, feel free to leave it in a comment and I'll try it in a following video
@mathiasmamsch3648 · a month ago
@@volkovolko If you are testing how to write a snake game, you are basically testing knowledge retrieval, because that code exists in 1000 variants on the internet. It gets interesting if you demand variations, like 'but the snake grows in both directions' or 'random obstacles appear and disappear after some time, not too close to the snake'. Think of whatever you want, but whether a model can do Tetris or Snake is hardly a test for LLMs these days.
@mathiasmamsch3648 · a month ago
@5m5tj5wg The 'better' model is not the one that can retrieve known solutions better, but the one that can piece together a solution to an unseen but related problem better. If you can find both the question and the answer on the net, then comparing a 32B-parameter model to a multi-hundred-billion-parameter model like GPT-4o or Sonnet makes even less sense, because of course they can store more knowledge. You need to ask for solutions to problems whose answers you cannot find on the internet to evaluate how good a model will be in practical use.
@volkovolko · a month ago
Yes, there is some truth to that. However, I think we can all agree that you don't want a 50+ minute video. Also, most of the code you will ask a model to write in the real world is also knowledge retrieval. As developers, we very often have to remake what has already been made. And the Snake game isn't that easy for LLMs. The Tetris game is very difficult, and I've never seen a fully working first try
@volkovolko · a month ago
And it is interesting to see that the Qwen model did better on these "retrieval" questions than GPT and Anthropic despite being way smaller in terms of parameters. It indicates that knowledge can be compressed a lot more than we thought
@cerilza_kiyowo · a month ago
I think you should ask Qwen 2.5 Coder 32B again to make the Tetris game better, so it will be fair. In my opinion, Qwen literally won the Tetris round. Even Claude generated better code after the error, but of course it failed at first
@volkovolko · a month ago
Yeah, for me the win went to Qwen. But okay, in the following videos I will always allow a second chance for all models. I will soon make a video comparing each size of Qwen2.5 Coder (so 0.5B vs 1.5B vs 3B vs 7B vs 14B vs 32B), so subscribe if you want to be notified ^^ I also started quantizing each model to GGUF and EXL2 on HuggingFace for those who are interested: huggingface.co/Volko76
@renerens · a month ago
Seems very interesting; I will try it tomorrow. For me, Nemotron 70B was the best, but even on my 4090 I can't run it locally.
@volkovolko · a month ago
I made the video comparing sizes : kzbin.info/www/bejne/jYHdmnaoltmVpsUsi=o3eKo-3pGY78wmMr
@volkovolko · a month ago
Yes, 70B is still a bit too much for consumer-grade GPUs
@kobi2187 · a month ago
If you do a real software project, you'll find that Claude Sonnet (new) is the best, and GPT-4 is very good at organizing.
@volkovolko · a month ago
I do real software projects, as I'm a developer. While Claude and GPT-4o are still better for big projects, Qwen is a good alternative for quick prompts, to avoid going to Stack Overflow for simple questions