Llama.cpp for FULL LOCAL Semantic Router

12,060 views

James Briggs

A day ago

Using a fully local Semantic Router for agentic AI with a llama.cpp LLM and HuggingFace embedding models.
There are many reasons we might decide to use local LLMs rather than use a third-party service like OpenAI. It could be cost, privacy, compliance, or fear of the OpenAI apocalypse. To help you out, we made Semantic Router fully local with local LLMs available via llama.cpp like Mistral 7B.
Using llama.cpp also enables the use of quantized GGUF models, reducing the memory footprint of deployed models and allowing even 13-billion-parameter models to run with hardware acceleration on an Apple M1 Pro chip. We also use LLM grammars to get highly reliable output even from the smallest models.
In this video, we use HuggingFace's MiniLM encoder together with a GGUF-quantized Mistral-7B-Instruct running on llama.cpp.
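As a rough sketch, the fully local setup looks something like this (based on the semantic-router API around the time of this video; class names, defaults, and the model path are assumptions and may differ in your version):

```python
from llama_cpp import Llama
from semantic_router import Route, RouteLayer
from semantic_router.encoders import HuggingFaceEncoder
from semantic_router.llms.llamacpp import LlamaCppLLM

# Local embedding model (defaults to a MiniLM sentence-transformer).
encoder = HuggingFaceEncoder()

chitchat = Route(
    name="chitchat",
    utterances=["how's the weather today?", "how are things going?"],
)

# Quantized Mistral-7B-Instruct running locally via llama.cpp.
_llm = Llama(
    model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload layers to Metal on Apple silicon, if available
    n_ctx=2048,
)
llm = LlamaCppLLM(name="Mistral-7B-v0.2-Instruct", llm=_llm, max_tokens=200)

rl = RouteLayer(encoder=encoder, routes=[chitchat], llm=llm)
print(rl("how are you today?").name)  # -> "chitchat"
```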
⭐ GitHub Repo:
github.com/aur...
📌 Code:
github.com/aur...
🔥 Semantic Router Course:
www.aurelio.ai...
👋🏼 AI Consulting:
aurelio.ai
👾 Discord:
/ discord
Twitter: / jamescalam
LinkedIn: / jamescalam

Comments: 29
@thomasheggelund8352 · 7 months ago
Just tested the Semantic Router on my M1 MacBook and I'm blown away by the speed! It was so fast that I thought I was still using OpenAI. Had to turn off my network to believe it wasn't! Excellent work
@jamesbriggs · 7 months ago
Haha yeah it’s pretty awesome, I’m very proud of the team in how quickly they’ve built something that works so well - we are using it across a lot of projects
@felipepadua2229 · 7 months ago
Love this series. I have not followed any YouTuber's channel closely before; James Briggs' was the first. I love the videos and following the latest on AI and LLMs.
@maxlgemeinderat9202 · 7 months ago
Please create a video on using a local LLM with llama.cpp and streaming the response with FastAPI. Would be so nice!
@KarlJuhl · 7 months ago
Great release James 🎉
@user-us6xc7fm7p · 6 months ago
I am working on reducing OpenAI costs - so glad to have found this. Will share it in a blog post. Quick question: is it possible to pass a system message to RouteLayer? I would like the LLM to know today's date so the function responds with the correct parameter. If not, is it possible to do so in _llm? Trying to avoid fine-tuning or writing additional messy code around my function. Thanks again!
@wgpubs · 7 months ago
Cool vid! Do you have any resources on the use of LLM grammars? That's a new one for me and wondering how they work and where they should be considered. Thanks.
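For context, llama.cpp grammars (GBNF) constrain token sampling so the model can only emit strings the grammar accepts. A minimal sketch using llama-cpp-python; the grammar and model path here are illustrative:

```python
from llama_cpp import Llama, LlamaGrammar

# Tiny GBNF grammar: the model may only emit a JSON object with a single
# "time_zone" string field, and nothing else.
GRAMMAR = r'''
root   ::= "{" ws "\"time_zone\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z/_]* "\""
ws     ::= [ \t\n]*
'''

llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf")  # placeholder
grammar = LlamaGrammar.from_string(GRAMMAR)

out = llm(
    "What time zone is Provo, Utah in? Respond in JSON.",
    grammar=grammar,
    max_tokens=50,
)
print(out["choices"][0]["text"])  # e.g. {"time_zone": "America/Denver"}
```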
@danielschoenbohm · 7 months ago
Awesome! Can't wait to test this out myself soon. Curious how this would scale with 100+ functions
@jamesbriggs · 7 months ago
I will try this soon - just need to figure out where to get 100s of interesting functions
@danielschoenbohm · 7 months ago
To test it effectively, it probably makes sense to strike a balance between very distinctive and very similar functions. Also, for testing purposes, the functions don't really need to be fully operational, right? You could perhaps focus on one topic, such as analytics, and have GPT-4 generate 100 analytic function ideas. Then, for each idea, you could generate example prompts to embed. Following that, you can create a test dataset for your use case on which to run tests to benchmark performance. Does that make sense?
@jamesbriggs · 7 months ago
@@danielschoenbohm 100% it's a great plan - coming soon for sure
@animaker7175 · 6 months ago
Hi, is it possible to carry out several simultaneous (or at least sequential) function calls at a time from one request with the semantic router? For instance: User: “turn the lights on and tell me the weather forecast” Assistant: “The lights are on”, “The weather is….”
@jamesbriggs · 6 months ago
Not yet, but we're adding multi-routes, and I feel like a natural progression from there could be something like what you describe here.
@animaker7175 · 6 months ago
@@jamesbriggs Do you mean setting a threshold value for multiple intents (or making it adjustable for each specific one), so that the ones surpassing the threshold would be called, instead of just picking the intent with the maximum predicted probability?
@jamesbriggs · 6 months ago
Yeah that’s probably how it will work
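In sketch form, the thresholding idea described above might look like this (illustrative only, not semantic-router's actual implementation; route utterance embeddings are assumed to be precomputed):

```python
import numpy as np

def multi_route(query_emb: np.ndarray,
                route_embs: dict[str, np.ndarray],
                threshold: float = 0.75) -> list[str]:
    """Return every route whose best utterance similarity clears the
    threshold, instead of only the single top-scoring route."""
    chosen = []
    for name, embs in route_embs.items():
        # Cosine similarity between the query and each example utterance.
        sims = embs @ query_emb / (
            np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb)
        )
        if sims.max() >= threshold:  # per-route thresholds also possible
            chosen.append(name)
    return chosen
```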
@user-sz1iw4zi4y · 7 months ago
Just tested it. When I asked "what time is it in Provo, Utah", it correctly pulled up the time zone 'America/Denver', because all of Utah is in the Denver, Colorado MST time zone. However, when I asked "What time is it in American Fork, Utah", a smaller town in the same county, it didn't know which time zone to pull and gave me 'America/New_York'. As I think about how to scale this idea, at what point should we begin using fine-tuned LLMs that know a lot more about my data or function use cases specifically?
@benoitmialet9842 · 7 months ago
Hi, thank you for this discovery. What is the purpose of the encoder here? We are not doing classic RAG over a collection of text chunks embedded in a vector store, so why does the user query text need an encoder? Isn't it directly interpreted by the LLM? Many thanks.
@jamesbriggs · 7 months ago
No, the route choice is made by the encoder; we do that for speed and scalability. Rather than having an LLM decide the route (which requires many tokens, resulting in higher latency and cost), we perform classification logic with the encoder, which is super fast. This means static routes can be chosen in milliseconds (with an LLM it would be seconds), dynamic routes can be chosen maybe 0.5-1 second faster (depending on the LLM), and we can also scale to 100s or 1000s of routes.
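A minimal sketch of that encoder-based classification, using sentence-transformers directly (illustrative, not the library's internal code; the route names and utterances are made up):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

routes = {
    "get_time": ["what time is it in New York?", "tell me the time in Tokyo"],
    "chitchat": ["how are you doing?", "lovely weather today"],
}
# Embed the example utterances once, up front.
route_embs = {name: model.encode(utts) for name, utts in routes.items()}

def classify(query: str) -> str:
    q = model.encode(query)
    scores = {
        name: float(np.max(
            embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
        ))
        for name, embs in route_embs.items()
    }
    return max(scores, key=scores.get)  # milliseconds, no LLM call

print(classify("what's the time in Berlin right now?"))  # -> "get_time"
```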
@benoitmialet9842 · 7 months ago
@@jamesbriggs Oh of course, that's much clearer now. Indeed a very interesting and simple way to accelerate total inference time. Thank you very much. Looking forward to your next video.
@marktucker8537 · 6 months ago
@@jamesbriggs With some classic NLU models you can do both intent classification and slot filling/entity extraction. I think I understand how the encoder can be used for intent classification. Would the slot filling need to be done by the LLM?
@taredje4664 · 7 months ago
thanks bro
@dr.mikeybee · 7 months ago
Can you set this up using Ollama?
@alx8439 · 7 months ago
Ollama doesn't provide an OpenAI-compliant API. It's not there yet, though I might have seen one being prepared. For now, you'll need a wrapper for that, such as LiteLLM.
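Presumably LiteLLM's proxy is the wrapper meant here. A hedged sketch of pointing the OpenAI client at such a proxy; the port, API key, and model name below are assumptions:

```python
# Hypothetical setup: a LiteLLM proxy running locally (default port 4000)
# exposing an Ollama model behind an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="anything")
resp = client.chat.completions.create(
    model="ollama/mistral",  # LiteLLM's provider/model naming convention
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```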
@GeandersonLenz · 7 months ago
Is there support for memory and RAG?
@tidymonkey81 · 7 months ago
Could I use this to route queries to a specific data store, e.g. "what is the best dog breed?" > query > dog-data vector store?
@jamesbriggs · 7 months ago
yeah 100%, I want to demo this soon
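A hypothetical sketch of that pattern: the route name selects which vector store to query (the stores here are stubs standing in for real clients like Pinecone or Qdrant):

```python
from semantic_router import Route, RouteLayer
from semantic_router.encoders import HuggingFaceEncoder

dogs = Route(name="dogs", utterances=[
    "what is the best dog breed?",
    "how do I train a puppy?",
])
finance = Route(name="finance", utterances=[
    "what is a good index fund?",
    "how do dividends work?",
])
rl = RouteLayer(encoder=HuggingFaceEncoder(), routes=[dogs, finance])

class StubStore:
    """Stand-in for a real vector store client."""
    def __init__(self, label: str):
        self.label = label
    def query(self, q: str) -> str:
        return f"[{self.label}] results for: {q}"

stores = {"dogs": StubStore("dog-data"), "finance": StubStore("finance-data")}

def retrieve(query: str):
    choice = rl(query)  # static route: encoder-only, no LLM call
    store = stores.get(choice.name)  # choice.name may be None below threshold
    return store.query(query) if store else None

print(retrieve("what is the best dog breed?"))
```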
@dr.mikeybee · 7 months ago
I think you can import logging and set it to ERROR.
@jamesbriggs · 7 months ago
Thanks will try :)
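A minimal sketch of that suggestion; note that llama-cpp-python's verbose flag is a separate knob for llama.cpp's own C-level output:

```python
import logging
from llama_cpp import Llama

# Silence Python-side INFO/WARNING chatter; show errors only.
logging.getLogger().setLevel(logging.ERROR)

# llama.cpp's startup/inference output is controlled separately:
llm = Llama(model_path="./mistral-7b-instruct-v0.2.Q4_0.gguf", verbose=False)
```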