Speaker: Matt Squire, CTO, Fuzzy Labs
Open source models have made running your own LLM accessible to many people. It's fairly straightforward to set up a model like Mistral alongside a vector database and build your own RAG application.
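As a rough illustration of how little code such a setup needs, here is a minimal sketch. It assumes ChromaDB as the vector store and a local Ollama server serving Mistral; both are my assumptions, not a stack the talk prescribes.

```python
# Minimal RAG sketch: embed documents into a vector store, retrieve context,
# and ask a locally hosted Mistral model to answer using that context.
# Assumes ChromaDB for storage and an Ollama server on localhost serving "mistral".
import chromadb
import requests

client = chromadb.Client()
collection = client.create_collection(name="docs")

# Index a couple of documents (ChromaDB embeds them with its default model).
collection.add(
    ids=["1", "2"],
    documents=[
        "Mistral 7B is an open-weights large language model.",
        "RAG retrieves relevant documents and passes them to the model as context.",
    ],
)

def answer(question: str) -> str:
    # Retrieve the most relevant documents for the question.
    results = collection.query(query_texts=[question], n_results=2)
    context = "\n".join(results["documents"][0])

    # Ask the model, grounding it in the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
        timeout=120,
    )
    return response.json()["response"]

print(answer("What is Mistral 7B?"))
```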
But making it scale to meet high-traffic demands is another story. LLM inference itself is slow, and GPUs are expensive, so we can't simply throw hardware at the problem. Once you add things like guardrails to your application, latencies compound.
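To see how those latencies stack up, per-stage timing is often enough. The sketch below uses hypothetical check_guardrails, retrieve_context, and generate stubs (with simulated delays) standing in for the real guardrail, retrieval, and inference calls; it is an illustration of the compounding effect, not the talk's actual pipeline.

```python
# Sketch: time each stage of a guarded RAG request to show how latencies add up.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    start = time.perf_counter()
    yield
    timings[stage] = time.perf_counter() - start

# Stubs that simulate each stage's latency; swap in real calls in practice.
def check_guardrails(text: str) -> None:
    time.sleep(0.15)   # a guardrail model adds its own inference time

def retrieve_context(question: str) -> str:
    time.sleep(0.05)   # vector search is usually the cheap part
    return "retrieved documents"

def generate(question: str, context: str) -> str:
    time.sleep(1.2)    # LLM generation dominates
    return "model answer"

def handle_request(question: str) -> str:
    timings: dict[str, float] = {}

    with timed("input guardrails", timings):
        check_guardrails(question)             # e.g. prompt-injection checks
    with timed("retrieval", timings):
        context = retrieve_context(question)   # vector database lookup
    with timed("inference", timings):
        answer = generate(question, context)   # the LLM call itself
    with timed("output guardrails", timings):
        check_guardrails(answer)               # e.g. toxicity checks

    for stage, seconds in timings.items():
        print(f"{stage:>17}: {seconds * 1000:7.1f} ms")
    print(f"{'total':>17}: {sum(timings.values()) * 1000:7.1f} ms")
    return answer

handle_request("What is Mistral 7B?")
```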
In this talk, I'll share the lessons we've learned building and running LLMs at scale for our customers. Using real code examples, I'll cover performance profiling, getting the most out of GPUs, and interactions with guardrails.
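As a taste of the kind of measurement involved in getting the most out of GPUs, here is a small sketch that samples GPU utilisation and memory while the model serves traffic. It uses NVML via the pynvml bindings and assumes a single NVIDIA GPU; it is my own illustration, not code from the talk.

```python
# Sketch: sample GPU utilisation and memory use while the model serves requests.
# Uses NVML via the pynvml bindings; assumes one NVIDIA GPU at index 0.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

try:
    for _ in range(10):  # sample once a second for ten seconds
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(
            f"GPU {util.gpu:3d}%  "
            f"memory controller {util.memory:3d}%  "
            f"VRAM {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB"
        )
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```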