u too techfren, you guys both rock! big fan of both channels!
@shockwavemasta · 1 month ago
Thanks for continuing this series - it's been super helpful
@johnkintree763 · 2 months ago
Thanks for including generation of SQL queries among the tested tasks. The ability of models to interface with databases is crucial.
@zkiyyeller3525 · 2 months ago
THANK YOU! I really appreciate your honest testing and taking us along with you on this journey!
@billydoughty7243 · 2 months ago
@IndyDevDan - you da man, dan. experienced engineers can appreciate your methodology and the value of your content and the tools you create. inexperienced engineers can learn the value of a methodical, structured approach to software development, which includes analyzing, comparing, and building tools to maximize your productivity. great videos. keep 'em coming.
@ariramkilowan8051 · 2 months ago
Would be cool to test image understanding. Basic OCR to start with, then counting objects and doing reasoning over the images. LLM providers often tell us what their models can't do, or can't do well. Using that info as a signal of improvement would be very useful IMHO. Best of all, you can use code to check exactly how correct each model is. This is harder with text, where you need a human judge or an LLM as a judge (which then needs to be aligned with a human anyway). Also thanks for the video, I check in every Monday. Keep on keeping on. 👍
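The code-checkable scoring this comment describes fits in a few lines. A minimal sketch of an exact-match scorer for counting-style tasks; the predictions and answers below are made-up examples, not results from the video:

```python
def exact_match_score(predictions, answers):
    """Fraction of model answers that exactly match ground truth."""
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical object-counting results: "seven" fails the exact match
preds = ["3", "seven", "12"]
truth = ["3", "7", "12"]
print(exact_match_score(preds, truth))  # 2 of 3 match
```

Exact match is deliberately strict; for counting tasks you would normalize number words first, but strictness is what makes the check objective.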
@saaaashaaaaa · 1 month ago
such beautiful testing!
@peciHilux · 2 months ago
Wow, nice. What I'm missing are technical metrics for comparison, like response time and memory used to run the model...
@Jason-ju7df · 2 months ago
I wish you'd put the model parameter sizes in the video description. It makes it easier to give weight to your comparisons when you're comparing a 1B model to a 7B model.
@pubfixture · 1 month ago
A 4-way gold medal among 7 contestants means you need harder questions at the top end to separate them out.
@zakkyang6476 · 2 months ago
Interesting project. Since I am a lazy person, I will use another LLM to score the output each time rather than doing it manually.
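The LLM-as-judge approach this comment suggests can be sketched roughly as below. The `judge` callable is a stand-in for a real model call (e.g. via an Ollama client), and the stub reply is hypothetical:

```python
def judge_score(question, answer, judge):
    """Ask a judge model to grade an answer 0-10; parse the first run of digits."""
    prompt = (
        "Rate this answer from 0 to 10. Reply with a number only.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    reply = judge(prompt)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(int(digits), 10) if digits else 0

# Stub judge standing in for a real LLM call
stub = lambda prompt: "Score: 8"
print(judge_score("What is 2+2?", "4", stub))  # 8
```

As the earlier comment notes, a judge model still needs to be sanity-checked against human scores before you trust its numbers.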
@albertwang5974 · 2 months ago
What an inspiring video!
@billybob9247 · 2 months ago
What quantization sizes were you using for the models?? Love your channel! Keep it coming!!!
@enthusiast1 · 1 month ago
Great video, thank you! Creative, using a custom notebook for benchmarking/comparisons. 💯✨️
@aerotheory · 2 months ago
Lots of subs to be had in the SLM area, so many edge cases. Try 70b_q4 compared to 8b models.
@amitkot · 2 months ago
Great comparison, thanks for making this! I'm off to compare qwen2.5:latest with qwen2.5-coder:latest.
@indydevdan · 1 month ago
Thank you! Qwen2.5 was the real shocker here. When qwen 3 hits - it's prime time for on device models.
@DanielBowne · 1 month ago
Hands down the best local model I have seen for function/tool calling.
@CheekoVids · 1 month ago
I know you don't do much model training on this channel. But have you considered training some of the local models on your good test results, then seeing how the refined models perform?
@samsaraAI2025 · 2 months ago
Thanks for the video! Could you make a tutorial in which a local installation of Llama learns from the chats you have with the AI? I mean, you just talk and somehow it stores this information internally, not losing it when you shut down the computer.
@davidpower3102 · 1 month ago
I found it hard to understand how you benched the models. Was this mostly down to personal opinion? Maybe you could explain your tests before discussing the results. Your test tooling looks really nice!
@indydevdan · 1 month ago
100% personal opinion and vibes. I use promptfoo for more hands-on, assertion-based testing. This notebook is more about understanding what the models can do at a high level.
@matthewjfoster1 · 2 months ago
Good video, thanks!
@Pure_Science_and_Technology · 2 months ago
Llama 3.2 hallucinates really badly.
@A_Me_Amy · 9 days ago
Alright, I'll share a dark humor joke with you: Why did the psychopath join the marching band? To get closer to the drummer.
@Snowman8526 · 25 days ago
Well, it's a bit strange to compare neural nets of different sizes, 1B Llama vs 7B Qwen, plus different quantizations.
@NLPprompter · 2 months ago
I'm curious: you're using the 5k context with the Ollama default models, right?
@ibrahims5636 · 28 days ago
What about testing with legal context? I found different models give different responses, and sometimes they absolutely hallucinate.
@DARKSXIDE · 2 months ago
Maybe see how they perform with Anthropic's new contextual RAG. Then we can download dev docs and make even the SLMs smarter for coding.
@husanaaulia4717 · 1 month ago
Doesn't qwen2.5 have a 3B parameter model?
@wedding_photography · 1 month ago
12:55 you completely missed that llama3.2:1b failed at SQL. It's missing authed=TRUE.
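A miss like this is easy to catch with an automated assertion-style check. A minimal sketch, assuming the expected query should filter on an `authed` column as the comment says; both sample queries below are hypothetical:

```python
import re

def sql_has_auth_filter(sql: str) -> bool:
    """Check that a generated query filters on authed = TRUE (case-insensitive)."""
    return re.search(r"authed\s*=\s*TRUE", sql, re.IGNORECASE) is not None

good = "SELECT id FROM users WHERE authed = TRUE AND active = TRUE;"
bad  = "SELECT id FROM users WHERE active = TRUE;"  # the kind of miss flagged above
print(sql_has_auth_filter(good), sql_has_auth_filter(bad))  # True False
```

A substring check catches a dropped predicate but not logically equivalent rewrites; for those you'd need to parse the SQL or run both queries against a fixture database.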
@indydevdan · 1 month ago
nice catch
@acllhes · 2 months ago
What happened to your AI personal assistant?
@indydevdan · 1 month ago
We've been waiting for the realtime_api 🚀
@wedding_photography · 1 month ago
"ping" is the dumbest test I have seen. Go tell random people "ping" and see what they respond with.
@dr-zieg · 20 days ago
1B parameters is not an SLM. Thumbs down.
@prozacsf84 · 2 months ago
Bro, it's useless to compare without o1-preview. It is many times better.
@indydevdan · 1 month ago
This was a local model focused test. o1-preview would score 100% on these tests, nothing to learn there.
@prozacsf84 · 1 month ago
@indydevdan gpt-4o is local?
@stephaneduhamel7706 · 1 month ago
Poor Phi, losing all its points because it tried to be extra helpful.