u too techfren, you guys both rock! big fan of both channels!
@shockwavemasta · 1 month ago
Thanks for continuing this series - it's been super helpful
@johnkintree763 · 2 months ago
Thanks for including generation of SQL queries among the tested tasks. The ability of models to interface with databases is crucial.
@zkiyyeller3525 · 2 months ago
THANK YOU! I really appreciate your honest testing and taking us along with you on this journey!
@billydoughty7243 · 2 months ago
@IndyDevDan - you da man, dan. experienced engineers can appreciate your methodology and the value of your content and the tools you create. inexperienced engineers can learn the value of a methodical, structured approach to software development, which includes analyzing, comparing, and building tools to maximize your productivity. great videos. keep 'em coming.
@ariramkilowan8051 · 2 months ago
Would be cool to test image understanding. Basic OCR to start with, then counting objects and doing reasoning over the images. LLM providers often tell us what their models can't do, or can't do well. Using that info as a signal of improvement would be very useful IMHO. Best of all, you can use code to check exactly how correct each model is. This is harder with text, where you need a human judge or an LLM as a judge (which then needs to be aligned with a human anyway). Also thanks for the video, I check in every Monday. Keep on keeping on. 👍
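The code-checkable scoring this comment describes fits in a few lines. A minimal sketch of an exact-match scorer for counting-style tasks; the predictions and answers below are made-up examples, not results from the video:

```python
def exact_match_score(predictions, answers):
    """Fraction of model answers that exactly match ground truth."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical object-counting results: "seven" fails the exact match
preds = ["3", "seven", "12"]
truth = ["3", "7", "12"]
print(exact_match_score(preds, truth))  # 2 of 3 match
```

Exact match is deliberately strict; for counting tasks you would normalize number words first, but strictness is what makes the check objective.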
@saaaashaaaaa · 1 month ago
such beautiful testing!
@peciHilux · 2 months ago
Wow, nice. What I'm missing are technical metrics for comparison, like response time and memory used to run the model...
@Jason-ju7df · 2 months ago
I wish you'd put the model parameter sizes in the video description. It makes it easier to give weight to your comparisons when you're comparing a 1B model to a 7B model.
@pubfixture · 1 month ago
A 4-way gold medal among 7 contestants means you need harder questions at the top end to separate them out.
@zakkyang6476 · 2 months ago
Interesting project. Since I am a lazy person, I will use another LLM to score the output each time rather than doing it manually.
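The LLM-as-judge approach this comment suggests can be sketched roughly as below. The `judge` callable is a stand-in for a real model call (e.g. via an Ollama client), and the stub reply is hypothetical:

```python
def judge_score(question, answer, judge):
    """Ask a judge model to grade an answer 0-10; parse the first run of digits."""
    prompt = (
        "Rate this answer from 0 to 10. Reply with a number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    reply = judge(prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(int(digits), 10) if digits else 0

# Stub judge standing in for a real LLM call
stub = lambda prompt: "Score: 8"
print(judge_score("What is 2+2?", "4", stub))  # 8
```

As the earlier comment notes, a judge model still needs to be sanity-checked against human scores before you trust its numbers.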
@albertwang5974 · 2 months ago
What an inspiring video!
@billybob9247 · 2 months ago
What quantization sizes were you using for the models?? Love your channel! Keep it coming!!!
@enthusiast1 · 1 month ago
Great video, thank you! Creative, using a custom notebook for benchmarking/comparisons. 💯✨️
@aerotheory · 2 months ago
Lots of subs to be had in the SLM area, so many edge cases. Try 70b_q4 compared to 8b models.
@amitkot · 2 months ago
Great comparison, thanks for making this! I'm off to compare qwen2.5:latest with qwen2.5-coder:latest.
@indydevdan · 1 month ago
Thank you! Qwen2.5 was the real shocker here. When qwen 3 hits - it's prime time for on device models.
@DanielBowne · 1 month ago
Hands down the best local model I have seen for function/tool calling.
@CheekoVids · 1 month ago
I know you don't do much model training on this channel. But have you considered training some of the local models on your good test results, then seeing how the refined models perform?
@samsaraAI2025 · 2 months ago
Thanks for the video! Could you make a tutorial in which a local installation of Llama learns from the chats you have with the AI? I mean, you just talk and somehow it stores this information internally, not losing it when you shut down the computer.
@davidpower3102 · 1 month ago
I found it hard to understand how you benched the models. Was this mostly down to personal opinion? Maybe you could explain your tests before discussing the results. Your test tooling looks really nice!
@indydevdan · 1 month ago
100% personal opinion and vibes. I use promptfoo for more hands-on, assertion-based testing. This notebook is more about understanding what the models can do at a high level.
@matthewjfoster1 · 2 months ago
Good video, thanks!
@Pure_Science_and_Technology · 2 months ago
Llama 3.2 hallucinates really badly.
@A_Me_Amy · 9 days ago
Alright, I'll share a dark humor joke with you: Why did the psychopath join the marching band? To get closer to the drummer.
@Snowman8526 · 25 days ago
Well, it's a bit strange to compare neural nets of different sizes, 1B Llama vs 7B Qwen, plus different quantizations.
@NLPprompter · 2 months ago
I'm curious: you're using the 5k context with the Ollama default models, right?
@ibrahims5636 · 28 days ago
What about testing with legal context? I found different models give different responses, and sometimes they absolutely hallucinate.
@DARKSXIDE · 2 months ago
Maybe see how they perform with Anthropic's new contextual RAG. Then we can download dev docs and make even the SLMs smarter for coding.
@husanaaulia4717 · 1 month ago
Doesn't qwen2.5 have a 3B parameter model?
@wedding_photography · 1 month ago
12:55 you completely missed that llama3.2:1b failed at SQL. It's missing authed=TRUE.
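A miss like this is easy to catch with an automated assertion-style check. A minimal sketch, assuming the expected query should filter on an `authed` column as the comment says; both sample queries below are hypothetical:

```python
import re

def sql_has_auth_filter(sql: str) -> bool:
    """Check that a generated query filters on authed = TRUE (case-insensitive)."""
    return re.search(r"authed\s*=\s*TRUE", sql, re.IGNORECASE) is not None

good = "SELECT id FROM users WHERE authed = TRUE AND active = TRUE;"
bad  = "SELECT id FROM users WHERE active = TRUE;"  # the kind of miss flagged above
print(sql_has_auth_filter(good), sql_has_auth_filter(bad))  # True False
```

A substring check catches a dropped predicate but not logically equivalent rewrites; for those you'd need to parse the SQL or run both queries against a fixture database.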
@indydevdan · 1 month ago
nice catch
@acllhes · 2 months ago
What happened to your AI personal assistant?
@indydevdan · 1 month ago
We've been waiting for the realtime_api 🚀
@wedding_photography · 1 month ago
"ping" is the dumbest test I have seen. Go tell random people "ping" and see what they respond with.
@dr-zieg · 20 days ago
1B parameters is not an SLM. Thumbs down.
@prozacsf84 · 2 months ago
Bro, it's useless to compare without o1-preview. It is many times better.
@indydevdan · 1 month ago
This was a local model focused test. o1-preview would score 100% on these tests, nothing to learn there.
@prozacsf84 · 1 month ago
@indydevdan gpt-4o is local?
@stephaneduhamel7706 · 1 month ago
Poor Phi, losing all its points because it tried to be extra helpful.