If this video is a true reflection of its capabilities, benchmarks aren't just bad, they are broken.
@ctwolf14 days ago
this 100%
@TheSuperColonel15 days ago
I like your channel; it's straight to the point. There is likely a lot of competition in AI, with tons of hype. We will see how many of them will survive the next 5 years.
@maddoxthorne229715 days ago
Others: It answers the benchmark questions well, so no need to run it. AICodeKing: Hold my beer.👑
@ctwolf14 days ago
AICodeKing is actually a deity
@bamit197915 days ago
Thank you for saving our time! :)
@TaughtByTech15 days ago
I know, right. Really the AI king
@ashgtd15 days ago
yup saved me a big fat download today
@notme213612 days ago
yup, saved me a chunk of my time this week.
@Quitcool11 days ago
Wrong, that's a great model according to other YouTubers and the open-source community.
@ashgtd11 days ago
@@Quitcool Are they just saying that for clicks, though? If I see a video with this model not sucking ass, then I'll try it
@CPM9415 days ago
Those dancing Pokémon clearly stole the spotlight of the video
@Andres-m2u11 days ago
The maximum achievable with Qwen2.5-Coder-32B (131k context window) was around 100k tokens. Then it slowed down to a timeout. But impressive...
@RaffaelloTamagnini10 days ago
True, just tested, and with 24GB GPU offload too, on a machine with 192GB of RAM. The 131k context wants too much memory
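For anyone wondering why the full 131k window eats so much memory, here's a rough back-of-envelope sketch of the KV-cache size alone. The figures assume Qwen2.5-32B's published config (64 layers, 8 grouped-query KV heads, head dim 128) and an fp16 cache - treat it as an estimate, not a measurement.

```python
# Rough KV-cache estimate for Qwen2.5-Coder-32B at its full context length.
# Assumed config: 64 layers, 8 KV heads (GQA), head dim 128, fp16 (2-byte) cache.
layers, kv_heads, head_dim, bytes_per_value = 64, 8, 128, 2
context_tokens = 131_072

# K and V each store kv_heads * head_dim values per layer, per token.
kv_cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
print(f"KV cache at {context_tokens:,} tokens: {kv_cache_bytes / 2**30:.0f} GiB")
# -> roughly 32 GiB, on top of the weights themselves
```

That extra ~32 GiB on top of the weights lines up with the "wants too much memory" observation above; shrinking the context window or quantizing the cache reduces it proportionally.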
@developersdiary199515 days ago
Thanks for sharing this with us, your content is gold! I tried Qwen 2.5 Coder yesterday on my Intel Core i7, 16GB DDR4 RAM, RTX 3050 (4GB VRAM), and it struggled with Bolt. So I guess I should only use open-source local AI models for generating text, for now...
@aleksanderspiridonov725115 days ago
YOU NEED AT LEAST MAC WITH 32GB RAM M3-M4 I THINK BUT BETTER 2-3 3090 MINIMUM FOR +- GOOD WORK BUT ALSO OPENROUTER CHEAP
@johnnyarcade15 days ago
@@aleksanderspiridonov7251 WOULD THE NEW MACBOOK PRO WITH 40 GPU CORES AND 48GB RAM WORK WELL ENOUGH OR SHOULD I OPT FOR MORE RAM?
@handfuloflight14 days ago
@@aleksanderspiridonov7251 y u screamin son
@alexjensen99015 days ago
BTW, you had me laughing so hard at the whole "why the hell am I using it then!" comment. Truly priceless.
@sammcj200015 days ago
Looking at your output, it almost seems as if you - or the model provider you're using - are using the wrong chat templates + inference parameters that aren't configured for coding tasks. What about the temperature - it should be set to 0 for coding, and you should use a top_p of no higher than about 0.85. Did you set the context size to something reasonable? I've found the 32B model to be really impressive, certainly the best open-weight model out there by far. In my experience, Cline especially is not very good with any models other than Claude, which it was originally written for.
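For anyone who wants to replicate those settings, here's a minimal sketch against an OpenAI-compatible endpoint. The base URL and model id are placeholders for a local Ollama server - swap in whatever provider you actually use.

```python
# Minimal sketch: coding-oriented sampling settings via an OpenAI-compatible API.
# base_url / model are placeholders for a local Ollama server; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen2.5-coder:32b",   # placeholder model id
    temperature=0,               # deterministic output for coding tasks
    top_p=0.85,                  # cap nucleus sampling, as suggested above
    messages=[{"role": "user", "content": "Write a function that parses a CSV line."}],
)
print(response.choices[0].message.content)
```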
@AICodeKing15 days ago
I did try it with Fireworks and got the same results. It might be that Cline isn't okay with the model. But even if you consider the Aider results, it's too buggy and not good at all if you're working on a bigger application with multiple files in context.
@sammcj200015 days ago
@@AICodeKing thanks for the extra info. I might try a couple of your common prompts running the model directly, without Aider or Cline in the mix, to see if it's a templating issue. It could be something like them using the default ChatML template and not the proper updated Qwen 2.5 tool-calling template - or something along those lines.
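If anyone wants to check the templating theory, a quick sketch with Hugging Face transformers prints the chat template the weights actually ship with (the repo id below is the standard Hugging Face name for the 32B instruct variant; adjust if you use a different one):

```python
# Print the prompt the model's own chat template produces, so it can be compared
# against whatever your inference stack (Ollama, Aider, Cline, ...) is sending.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

messages = [{"role": "user", "content": "Write a hello-world HTTP server."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # look for Qwen's special tokens / tool-calling markup here
```

If the model is served through Ollama, running ollama show qwen2.5-coder:32b --template should print the template applied on that side, so the two can be compared.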
@MM-2415 days ago
@@sammcj2000 would love to see what analysis you come up with - thanks for double-checking. Super helpful
@bodyguardik14 days ago
It seems he didn't even download it. This video is about some crap online service
@gmag1115 days ago
I love your style. Go on like this. AI coding is a great use case for LLMs. I'm learning a lot from your videos
@fezkhanna690015 days ago
Hahahah, "man, if I have to implement it myself, why the hell am I using this". This made me laugh (9:50)
@peacekeepermoe12 days ago
Same 😂😂 Man, I'd be mad too if the AI is asking me to do something I asked it to do for me in the first place. It's like, who is the master and who is the slave here, goddamnit?
@darkreader0115 days ago
I also did a test before seeing your video and my conclusion was "trash", at least for my use case. After seeing your video, I see that I am not the only one! It's not worth the hype.
@AaronBlox-h2t5 days ago
Interesting....thanks for the video.
@raj462412 days ago
Thanks for this Hyperbolic website.. it helped me
@davidcarey3715 days ago
Thank you very much. Well explained and informative as always, and in this case it has definitely "separated the wheat from the chaff" … Qwen 2.5 Coder seems very disappointing.
@tecnopadre15 days ago
Truth testing = reality. Great job, as usual. Congratulations 🎉
@jaynucca14 days ago
Thank you for being honest! I wanted to love Qwen 2.5 Coder as well, but it just can't actually do anything useful beyond VERY simple applications.
@Dyson_Lu11 days ago
Strange, Cole Medin got great results and so did Simon Willison. Both were extremely impressed.
@phoenyfeifei14 days ago
I find Cline just doesn't work very well with Ollama local models. Their developer appears to blame it on these Ollama models being heavily quantized, which I do agree with, but I run Q8 and FP16 models and still get the same shitty result
@cgimoonai15 days ago
Thank you man!
@PhuPhillipTrinh13 days ago
Lmao, good testing, king! Will you change the Pokémon one day?
@maertscisum14 days ago
I am guessing that the benchmarking uses carefully engineered prompting to beat other models. I have always questioned the validity of each model's benchmark claims. There should be a formal body with standard test sets to run the benchmarks.
@jamesbuesnel505415 days ago
Ahahah, your hate for Cursor is hilarious 😂
@yoannthomann15 days ago
The point is made: we need better benchmarks 😢😅
@onmetrics14 days ago
here for the low frequency roasts
@chadpogs797315 days ago
Wow!! This is it!!
@tomwawer571415 days ago
I run the 32B on 6GB VRAM; it's slow, about a token/s, but it works.
@justtiredthings15 days ago
what a bummer. I had high hopes for this model
@isheriff8215 days ago
So true bro, I hate it when people do that! Also, Aider and Cline are way better at everything!
@xelerator239815 days ago
Thank you!
@diplobla15 days ago
thanks for this 👍
@konstantinoskonstantinos852415 days ago
Is Hyperbolic using the Instruct model or the Base one?
@AICodeKing15 days ago
Instruct and unquantized as well.
@JeffreyWang-hh4ss14 days ago
I like your objectivity; these small-model hypes + marketing are pretty annoying.
@DAZEOFFICIAL15 days ago
Strange, though I think this is a milestone for a local model, to even be able to create something using Aider. From my testing, Aider worked when used properly, but Cline did not. I have a 3090 and it did run at workable speeds.
@AICodeKing15 days ago
Yes, but claiming unbelievable things is never good
@HikaruAkitsuki15 days ago
Dude. Can you review Blackbox AI? It has Gemini Pro, GPT-4o, Claude 3.5 Sonnet, and its own Blackbox model. It's mostly a chat-app AI like anything else, but there are also VS Code and JetBrains extensions.
@MacS7n15 days ago
You made me hate Cursor 😅 and to be honest you're right about Cline being better 😅
@ctwolf14 days ago
me @3:50 hell yeah, dancing Pokémon
@fmatake10 days ago
Benchmarks always come out 'pretty,' but in real life, I've found that it's far behind even claude-3-5-haiku and gpt-4o-mini.
@ghosert14 days ago
What is the smaller local LLM that you think is better than Qwen 2.5 Coder 32B? Thanks - you didn't mention which video I should take a look at.
Benchmarks with smaller models are usually complete BS. They probably distill the bigger models into them, making them memorize benchmark-like questions without actually making them smarter.
@mz875515 days ago
It's such a small model, and the hype of trying to compare it with Sonnet is where all this starts to fail. It should do what a small model should do in some specialized cases, not run a general coding agent. It is also specialized for code generation, while powering Aider is much more demanding on versatile intelligence
@christerjohanzzon15 days ago
Great video! Real tests in real apps. I would like to see a full workflow test, from Figma design to tested product. Done with NextJS, TS, TailwindCSS, and AI-assisted coding all the way from setup to testing, reviewing, and deployment.
@wolverin013 days ago
Could you make a guide to using Cline with the local Qwen?
@lcarv2015 days ago
Hi there, in your first prompt Qwen was trying to generate the build files and node_modules; maybe if you had the project set up, it wouldn't try to generate that much code? Can you try?
@lcarv2015 days ago
Ok after seeing the whole video I understand that it wouldn’t matter.
@AICodeKing15 days ago
I had created the NextJS app beforehand.
@brandon190211 days ago
To make matters worse, outside of coding Qwen2.5 is far worse than Qwen2. Most notably, it hallucinates far more across all domains of knowledge. I really do think you're right that Qwen is optimizing their LLMs for tests at the expense of overall performance. Qwen2 72b used to be almost as good as Llama 3.1 70b, but now Qwen2.5 72b is far worse despite climbing higher on benchmarks.
@Piotr_Sikora10 days ago
Does the model on Hyperbolic use a 128k context window?
@antoniofuller233115 days ago
16 million tokens uploaded just to generate 3 files??!!! 6:30
@AICodeKing15 days ago
I think it's a bug in Cline, and that's why it displays that.
@antoniofuller233115 days ago
@AICodeKing hmm
@toCatchAnAI15 days ago
Is this available on Open Bolt?
@2005sty14 days ago
Alibaba has the Qwen Max model (not open source), which is far better than the open-source version. But.. strangely, they don't show it off. I suspect ...
@alainmona26815 days ago
Hey, is it possible you can add Qwen2.5 32B to OpenHands? I tried a million different ways with the help of Claude and Copilot and ChatGPT but couldn't get it running
@A-Jaradat-Z8 days ago
OpenRouter?
@mrpocock15 days ago
So why does it score well in benchmarks if it can't function in these IDE or agentic contexts?
@AICodeKing15 days ago
You can basically just train models on specific benchmark questions and make them score well in benchmarks but in real life this approach fails.
@mrpocock15 days ago
@@AICodeKing That really sucks. Benchmark chasing should be an immediate disqualification. I wonder if there are ways to structure benchmarks so that they produce a randomised but equivalent task. Or alternatively, flood the market with so many benchmarks that it is not practical to over-fit to them all.
@AICodeKing15 days ago
There are actually many benchmarks, but you just need to select 5 or 10 and compare the results across those..
@mrpocock15 days ago
@@AICodeKing It would be better if model publishers were expected to submit their models to 3rd party benchmarking rather than doing it in-house. We used to have this problem with protein 3d reconstructions. People would publish papers on cooked benchmarks. That's why the CASP protein structure prediction competition was set up.
@ThrivingMotivation2814 days ago
Every LLM model except Sonnet disappoints
@ZzzKekeke15 days ago
can you make the dragons twerk?
@jjdorig971215 days ago
I have it running on a single 3090; how do I check how big its context window is?
I tried Qwen 2.5 for math because I am taking part in the AIMO Kaggle competition. I can't say this with certainty, but I feel they train their models on the benchmarks. In one weird case it did a function call but also provided me with the result (without actually performing the function call).
@jose-lael14 days ago
That's common for LLMs; try using a wider variety of them and you'll develop an intuition for how LLMs behave.
@wasimdorboz15 days ago
Please answer: how do you get the base URL? Hyperbolic?
@AICodeKing15 days ago
You can see it by going to the Hyperbolic API script thing
@wasimdorboz15 days ago
@@AICodeKing Alright, thanks bro, you're a good developer
@wasimdorboz15 days ago
@@AICodeKing bro, there is Qwen 2.5 72B, and I looked all over AI and Google and didn't get the base URL or how to use it exactly. Qwen/...instruct - wow, Instruct, and boom, it works. You're a good developer
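To spell out what "boom, it works" looks like: a rough sketch of pointing an OpenAI-compatible client at Hyperbolic. The base URL and model id here are assumptions - copy the exact values from the API example in your own Hyperbolic dashboard.

```python
# Sketch: OpenAI-compatible client pointed at Hyperbolic's hosted Qwen model.
# base_url and model id are assumptions -- verify both in your Hyperbolic dashboard.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.hyperbolic.xyz/v1",   # assumed endpoint; check the dashboard
    api_key="YOUR_HYPERBOLIC_API_KEY",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-32B-Instruct",    # full HF-style model id
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```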
@wasimdorboz15 days ago
Bro, please give us a tutorial on Electron or Tauri or any open-source one
@emilianosteinvorth701715 days ago
I have had some issues with Cline using models that are not Claude/GPT, since I think Cline requires a model with proper agentic features. It could be a reason why the performance was so poor with it. I think testing Qwen using a chat interface could change the results.
@a1_Cr3at0r15 days ago
Dude, make a video about the g4f (gpt4free) API + Cline
@hipotures15 days ago
The same with Python - garbage produced without end. Maybe it's a problem with Ollama?
@BeastModeDR61415 days ago
It doesn't follow instructions
@QorQar15 days ago
Is Hyperbolic free or not?
@AICodeKing15 days ago
Free $10 credits
@enloder14 days ago
So I should not use Qwen2.5 Coder 7B anymore?
@AICodeKing14 days ago
Depends on your choice.. I see no use for that model for me as of now.. I just use SmolLM2, which is better and can actually be used locally at great speeds on my machine. There's no one-size-fits-all or anything like that.
@mnageh-bo1mm15 days ago
Test it with Cursor
@alexjensen99015 days ago
I'm pretty sure that Qwen is Chinese, right? That may explain the questionable benchmarking.
@다루루15 days ago
Very powerful model!!!
@TawnyE15 days ago
EE
@HansKonrad-ln1cg15 days ago
Very bad at instruction following. It has something in common with my wife there.
@justtiredthings15 days ago
🙄
@meassess15 days ago
I did everything right but I get this error:
# VSCode Visible Files
(No visible files)
# VSCode Open Tabs
(No open tabs)
# Current Working Directory (d:/Mert - Workspace/test-ai-project) Files
No files found.