Llama.cpp for FULL LOCAL Semantic Router

12,060 views

James Briggs

A day ago

Using a fully local Semantic Router for agentic AI with a llama.cpp LLM and HuggingFace embedding models.
There are many reasons we might decide to use local LLMs rather than use a third-party service like OpenAI. It could be cost, privacy, compliance, or fear of the OpenAI apocalypse. To help you out, we made Semantic Router fully local with local LLMs available via llama.cpp like Mistral 7B.
Using llama.cpp also enables the use of quantized GGUF models, reducing the memory footprint of deployed models and allowing even 13-billion-parameter models to run with hardware acceleration on an Apple M1 Pro chip. We also use LLM grammars to get highly reliable output even from the smallest models.
In this video, we use HuggingFace's MiniLM encoder together with a GGUF-quantized Mistral-7B-Instruct running on llama.cpp.
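As a rough sketch, the fully local setup looks something like this (based on the semantic-router API around the time of this video; class names, defaults, and the model path are assumptions and may differ in your version):

```python
from llama_cpp import Llama
from semantic_router import Route, RouteLayer
from semantic_router.encoders import HuggingFaceEncoder
from semantic_router.llms.llamacpp import LlamaCppLLM

# Local embedding model (defaults to a MiniLM sentence-transformer).
encoder = HuggingFaceEncoder()

chitchat = Route(
    name="chitchat",
    utterances=["how's the weather today?", "how are things going?"],
)

# Quantized Mistral-7B-Instruct running locally via llama.cpp.
_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload layers to Metal on Apple silicon, if available
    n_ctx=2048,
)
llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=200)

rl = RouteLayer(encoder=encoder, routes=[chitchat], llm=llm)
print(rl("how are you today?").name)  # -> "chitchat"
```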
⭐ GitHub Repo:
github.com/aur...
📌 Code:
github.com/aur...
🔥 Semantic Router Course:
www.aurelio.ai...
👋🏼 AI Consulting:
aurelio.ai
👾 Discord:
/ discord
Twitter: / jamescalam
LinkedIn: / jamescalam

Comments: 29
@thomasheggelund8352 · 7 months ago
Just tested the Semantic Router on my M1 MacBook and I'm blown away by the speed! It was so fast that I thought I was still using OpenAI. Had to turn off my network to believe it wasn't! Excellent work
@jamesbriggs · 7 months ago
Haha yeah it’s pretty awesome, I’m very proud of the team in how quickly they’ve built something that works so well - we are using it across a lot of projects
@felipepadua2229 · 7 months ago
Love this series. I have not followed any YouTuber's channel closely before; James Briggs' was the first. I love the videos and following the latest on AI and LLMs.
@maxlgemeinderat9202 · 7 months ago
Please create a video on using a local LLM with llama.cpp and streaming the response with FastAPI. Would be so nice!
@KarlJuhl · 7 months ago
Great release James 🎉
@user-us6xc7fm7p · 6 months ago
I am working on reducing OpenAI costs - so glad to have found this. Will share it in a blog post. Quick question: is it possible to pass a system message to RouteLayer? I would like the LLM to know today's date so the function responds with the correct parameter. If not, is it possible to do so in _llm? Trying to avoid fine-tuning or writing additional messy code around my function. Thanks again!
@wgpubs · 7 months ago
Cool vid! Do you have any resources on the use of LLM grammars? That's a new one for me and wondering how they work and where they should be considered. Thanks.
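For context, llama.cpp grammars (GBNF) constrain token sampling so the model can only emit strings the grammar accepts. A minimal sketch using llama-cpp-python; the grammar and model path here are illustrative:

```python
from llama_cpp import Llama, LlamaGrammar

# Tiny GBNF grammar: the model may only emit a JSON object with a single
# "time_zone" string field, and nothing else.
GRAMMAR = r'''
root   ::= "{" ws "\"time_zone\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z/_]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf")  # placeholder
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "What time zone is Provo, Utah in? Respond in JSON.",
    grammar=grammar,
    max_tokens=50,
)
print(out["choices"][0]["text"])  # e.g. {"time_zone": "America/Denver"}
```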
@danielschoenbohm · 7 months ago
Awesome! Can't wait to test this out myself soon. Curious how this would scale with 100+ functions
@jamesbriggs · 7 months ago
I will try this soon - just need to figure out where to get 100s of interesting functions
@danielschoenbohm · 7 months ago
To test it effectively, it probably makes sense to strike a balance between very distinctive and very similar functions. Also, for testing purposes, the functions don't really need to be fully operational, right? You could perhaps focus on one topic, such as analytics, and have GPT-4 generate 100 analytic function ideas. Then, for each idea, you could generate example prompts to embed. Following that, you can create a test dataset for your use case on which to run tests to benchmark performance. Does that make sense?
@jamesbriggs · 7 months ago
@@danielschoenbohm 100% it's a great plan - coming soon for sure
@animaker7175 · 6 months ago
Hi, is it possible to carry out several simultaneous (or at least sequential) function calls at a time from one request with the semantic router? For instance: User: “turn the lights on and tell me the weather forecast” Assistant: “The lights are on”, “The weather is….”
@jamesbriggs · 6 months ago
Not yet, but we're adding multi-routes, and I feel like a natural progression from there could be something like what you describe here.
@animaker7175 · 6 months ago
@@jamesbriggs Do you mean setting a threshold value for multiple intents (or making it adjustable for each specific one), so that the ones surpassing the threshold would be called, instead of just picking the intent with the maximum predicted probability?
@jamesbriggs · 6 months ago
Yeah that’s probably how it will work
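In sketch form, the thresholding idea described above might look like this (illustrative only, not semantic-router's actual implementation; route utterance embeddings are assumed to be precomputed):

```python
import numpy as np

def multi_route(query_emb: np.ndarray,
                route_embs: dict[str, np.ndarray],
                threshold: float = 0.75) -> list[str]:
    """Return every route whose best utterance similarity clears the
    threshold, instead of only the single top-scoring route."""
    chosen = []
    for name, embs in route_embs.items():
        # Cosine similarity between the query and each example utterance.
        sims = embs @ query_emb / (
            np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb)
        )
        if sims.max() >= threshold:  # per-route thresholds also possible
            chosen.append(name)
    return chosen
```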
@user-sz1iw4zi4y · 7 months ago
Just tested it. When I asked "what time is it in Provo, Utah", it correctly pulled up the time zone 'America/Denver', because all of Utah is in the Denver, Colorado MST time zone. However, when I asked "What time is it in American Fork, Utah", a smaller town in the same county, it didn't know which time zone to pull and gave me 'America/New_York'. As I think about how to scale this idea, at what point should we begin using fine-tuned LLMs that know a lot more about my data or function use cases specifically?
@benoitmialet9842 · 7 months ago
Hi, thank you for this discovery. What is the purpose of the encoder here? We are not doing classic RAG over a collection of text chunks embedded in a vector store, so why does the user query text need an encoder? Isn't it directly interpreted by the LLM? Many thanks.
@jamesbriggs · 7 months ago
No, the route choice is made by the encoder; we do that for speed and scalability. Rather than having an LLM decide the route (which requires many tokens, resulting in higher latency and cost), we perform classification logic with the encoder, which is super fast. This means static routes can be chosen in milliseconds (with an LLM it would be seconds), dynamic routes can be chosen maybe 0.5-1 second faster (depending on the LLM), and we can also scale to 100s or 1000s of routes.
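A minimal sketch of that encoder-based classification, using sentence-transformers directly (illustrative, not the library's internal code; the route names and utterances are made up):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

routes = {
    "get_time": ["what time is it in New York?", "tell me the time in Tokyo"],
    "chitchat": ["how are you doing?", "lovely weather today"],
}
# Embed the example utterances once, up front.
route_embs = {name: model.encode(utts) for name, utts in routes.items()}

def classify(query: str) -> str:
    q = model.encode(query)
    scores = {
        name: float(np.max(
            embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
        ))
        for name, embs in route_embs.items()
    }
    return max(scores, key=scores.get)  # milliseconds, no LLM call

print(classify("what's the time in Berlin right now?"))  # -> "get_time"
```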
@benoitmialet9842 · 7 months ago
@@jamesbriggs Oh of course, that's much clearer now. Indeed a very interesting and simple way to accelerate total inference time. Thank you very much. Looking forward to your next video.
@marktucker8537 · 6 months ago
@@jamesbriggs With some classic NLU models you can do both intent classification and slot filling/entity extraction. I think I understand how the encoder can be used for intent classification. Would the slot filling need to be done by the LLM?
@taredje4664 · 7 months ago
thanks bro
@dr.mikeybee · 7 months ago
Can you set this up using Ollama?
@alx8439 · 7 months ago
Ollama doesn't provide an OpenAI-compliant API. It's not there yet, though I might have seen one being prepared. For now, you'll need a wrapper for that, such as LiteLLM.
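Presumably LiteLLM's proxy is the wrapper meant here. A hedged sketch of pointing the OpenAI client at such a proxy; the port, API key, and model name below are assumptions:

```python
# Hypothetical setup: a LiteLLM proxy running locally (default port 4000)
# exposing an Ollama model behind an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="anything")
resp = client.chat.completions.create(
    model="ollama/mistral",  # LiteLLM's provider/model naming convention
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```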
@GeandersonLenz · 7 months ago
Is there support for memory and RAG?
@tidymonkey81 · 7 months ago
Could I use this to route queries to a specific data store, e.g. "what is the best dog breed?" > query > dog-data vector store?
@jamesbriggs · 7 months ago
yeah 100%, I want to demo this soon
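A hypothetical sketch of that pattern: the route name selects which vector store to query (the stores here are stubs standing in for real clients like Pinecone or Qdrant):

```python
from semantic_router import Route, RouteLayer
from semantic_router.encoders import HuggingFaceEncoder

dogs = Route(name="dogs", utterances=[
    "what is the best dog breed?",
    "how do I train a puppy?",
])
finance = Route(name="finance", utterances=[
    "what is a good index fund?",
    "how do dividends work?",
])
rl = RouteLayer(encoder=HuggingFaceEncoder(), routes=[dogs, finance])

class StubStore:
    """Stand-in for a real vector store client."""
    def __init__(self, label: str):
        self.label = label
    def query(self, q: str) -> str:
        return f"[{self.label}] results for: {q}"

stores = {"dogs": StubStore("dog-data"), "finance": StubStore("finance-data")}

def retrieve(query: str):
    choice = rl(query)  # static route: encoder-only, no LLM call
    store = stores.get(choice.name)  # choice.name may be None below threshold
    return store.query(query) if store else None

print(retrieve("what is the best dog breed?"))
```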
@dr.mikeybee · 7 months ago
I think you can import logging and set it to ERROR.
@jamesbriggs · 7 months ago
Thanks will try :)
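A minimal sketch of that suggestion; note that llama-cpp-python's verbose flag is a separate knob for llama.cpp's own C-level output:

```python
import logging
from llama_cpp import Llama

# Silence Python-side INFO/WARNING chatter; show errors only.
logging.getLogger().setLevel(logging.ERROR)

# llama.cpp's startup/inference output is controlled separately:
llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", verbose=False)
```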