Scaling Vector Database Usage Without Breaking the Bank: Quantization and Adaptive Retrieval

Toronto Machine Learning Series (TMLS)

Speaker:
Zain Hassan, Senior ML Developer Advocate, Weaviate
Abstract:
Everybody loves vector search, and enterprises now see its value thanks to the popularity of LLMs and RAG. The problem is that prod-level deployment of vector search requires boatloads of compute: CPU for search and GPU for inference. The bottom line is that, if deployed incorrectly, vector search can be prohibitively expensive compared to classical alternatives.
The solution: quantizing vectors and performing adaptive retrieval. These techniques let you scale applications into production by balancing and tuning memory costs, latency, and retrieval accuracy reliably.
I’ll talk about how you can perform real-time, billion-scale vector search on your laptop! This includes covering different quantization techniques, including product, binary, scalar, and matryoshka quantization, which compress vectors by trading accuracy for lower memory requirements. I’ll also introduce the concept of adaptive retrieval, where you first perform a cheap, hardware-optimized, low-accuracy search over compressed vectors to identify retrieval candidates, followed by a slower, higher-accuracy search to rescore and correct.
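To make the two-stage idea concrete, here is a minimal sketch of adaptive retrieval using binary quantization. It is not the speaker's or Weaviate's implementation: the function names, the `oversample` parameter, the sign-based binarization, and the random unit-length test data are all assumptions chosen for illustration. Stage 1 ranks every vector by Hamming distance on 1-bit codes; stage 2 rescores only the shortlisted candidates with full-precision dot products.

```python
import math
import random

def binary_quantize(vec):
    """Keep only the sign of each dimension, packed into one int (1 bit per dimension)."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def adaptive_search(query, vectors, codes, k=5, oversample=4):
    """Hypothetical two-stage search: cheap shortlist, then exact rescoring."""
    q_code = binary_quantize(query)
    # Stage 1: low-accuracy pass over compressed codes, over-fetching k * oversample candidates.
    candidates = sorted(range(len(vectors)),
                        key=lambda i: hamming(q_code, codes[i]))[: k * oversample]
    # Stage 2: rescore only the shortlist with the original full-precision vectors.
    rescored = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:k]

# Synthetic data: 500 random unit vectors of 64 dimensions (illustrative only).
random.seed(0)
dim = 64
vectors = [normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(500)]
codes = [binary_quantize(v) for v in vectors]

query = vectors[42]
top = adaptive_search(query, vectors, codes)
print(top)
```

In a real system the first pass would run over a compact in-memory index while the full-precision vectors used for rescoring could live on slower storage; raising `oversample` trades latency for recall.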
When used with well-thought-out adaptive retrieval, these quantization techniques can lead to a 32x reduction in memory requirements at the cost of a ~5% loss in retrieval recall in your RAG stack.
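One way to arrive at a 32x figure is to go from 32-bit floats to 1 bit per dimension, as binary quantization does. The abstract does not state the baseline precision or the vector dimensionality, so the float32 baseline and the 768-dimension, one-billion-vector collection below are assumptions, and the numbers cover raw vector storage only, not index overhead.

```python
def raw_vector_bytes(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, ignoring any index structures."""
    return num_vectors * dims * bits_per_dim // 8

# Hypothetical collection: one billion vectors of 768 dimensions.
n, d = 1_000_000_000, 768

full = raw_vector_bytes(n, d, 32)    # assumed float32 baseline
scalar = raw_vector_bytes(n, d, 8)   # 8-bit scalar quantization
binary = raw_vector_bytes(n, d, 1)   # 1-bit binary quantization

for name, size in [("float32", full), ("scalar 8-bit", scalar), ("binary 1-bit", binary)]:
    print(f"{name:>13}: {size / 1e9:8.0f} GB  ({full // size}x smaller than float32)")
```

Under these assumptions the float32 vectors take 3072 GB, the 8-bit codes 768 GB (4x smaller), and the 1-bit codes 96 GB (32x smaller).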
