CONTEXT CACHING for Faster and Cheaper Inference

1,721 views

Trelis Research

1 day ago

12 comments
@alleskepler9526 · 24 days ago
Bro u a gem
@TrelisResearch · 24 days ago
appreciate it
@heski6847 · 2 months ago
Thank you, as always very useful content!
@TrelisResearch · 2 months ago
you're welcome
@Rishab-l1u · 7 days ago
How do we deal with hallucination resulting from our background info?
@TrelisResearch · 7 days ago
Take a look at my video on synthetic data generation. I cover it there. Unless I’m misreading your Q and it relates to caching?
@MrMoonsilver · 2 months ago
Do you think this will come to open source, self-hosted models?
@TrelisResearch · 2 months ago
Yup, I show SGLang (same approach for vLLM) in this video!
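A minimal sketch of that self-hosted route, assuming SGLang's OpenAI-compatible server (the launch command, port, and model name are illustrative; SGLang reuses KV-cache entries across requests that share a prompt prefix automatically, via its radix-tree cache):

```python
# Assumes an SGLang server was started separately, e.g.:
#   python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
# SGLang automatically reuses KV-cache entries across requests that share
# a prompt prefix, so the long system prompt below is only prefilled once.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

background = "<long shared background document>"  # the reused first part

for question in ["Summarize the context.", "List the key claims."]:
    resp = client.chat.completions.create(
        model="default",  # SGLang serves the launched model under this name
        messages=[
            {"role": "system", "content": background},
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)
```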
@MrMoonsilver · 2 months ago
Super cool, thank you so much.
@explorer945 · 2 months ago
How does it differ from caching by UI libraries like Chainlit, where they use Redis to store embeddings of the prompt and, if there's a match, return the previous response without even hitting the LLM API? Which is better?
@TrelisResearch · 2 months ago
Howdy! What you're mentioning is embedding-based response caching, which is a complete cache (i.e. the whole answer is stored and returned if there's a match). This here is KV caching, a partial cache for LLM inference: when part of a prompt is reused (and it has to be the first part), some intermediate values (the keys and values) can be reused in the forward pass to generate the response.
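A minimal sketch of the two techniques side by side, with a plain dict standing in for the Redis/embedding lookup and vLLM's enable_prefix_caching flag used for the KV-cache side (the model name is illustrative):

```python
# Technique 1: response caching -- the WHOLE answer is stored.
# On a match the stored response is returned and the LLM is never called.
# (A real setup would use Redis plus embedding similarity; an exact-match
# dict keyed on the prompt stands in here.)
response_cache: dict[str, str] = {}

def cached_generate(prompt: str, generate_fn) -> str:
    if prompt in response_cache:
        return response_cache[prompt]      # no LLM call at all
    response = generate_fn(prompt)
    response_cache[prompt] = response      # store the complete answer
    return response

# Technique 2: KV (prefix) caching -- only the attention keys/values for a
# shared prompt PREFIX are reused; the model still runs and produces a
# fresh answer each time.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative
          enable_prefix_caching=True)
params = SamplingParams(max_tokens=128)

background = "<long shared document>\n\n"  # the reused first part

# The first call computes and stores KVs for the shared prefix...
out1 = llm.generate([background + "Question 1: ..."], params)
# ...and the second call reuses them, so its prefill is much cheaper.
out2 = llm.generate([background + "Question 2: ..."], params)
```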
@explorer945 · 2 months ago
@TrelisResearch got it. Why does it have to be the first part? I couldn't quite get that from the video. Also, is it based on initial layers or end layers? How does it help with RAG architectures?
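On the open question: under causal attention, each token's keys and values depend only on the tokens before it, so cached KVs stay valid only for an identical leading prefix, and they are reused at every layer rather than only the first or last ones. For RAG, this helps mainly when the same system prompt or documents recur at the start of the prompt. A minimal sketch of the prefix-matching logic (hypothetical helper name, illustrative token IDs):

```python
# Hypothetical illustration: which cached KV entries can be reused?
# Under causal attention, token i's keys/values depend on tokens 0..i,
# so KVs are reusable exactly up to the longest common PREFIX.
def reusable_prefix_len(cached_tokens: list[int], new_tokens: list[int]) -> int:
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

cached = [101, 7, 42, 9, 13]   # token IDs of a previously seen prompt
new    = [101, 7, 42, 5, 13]   # same start, diverges at position 3

print(reusable_prefix_len(cached, new))
# -> 3: KVs for the first 3 tokens are reused (at every layer).
# Position 4's token (13) matches too, but its cached KVs were computed
# after a different token at position 3, so they cannot be reused --
# which is why only the FIRST part of the prompt can be cached.
```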
Related videos

How to use LLMs for Fact Checking
40:40
Trelis Research
1.6K views
Automated Prompt Engineering with DSPy
45:52
Trelis Research
3K views
How to save money with Gemini Context Caching
11:33
Sam Witteveen
8K views
Fine tuning Pixtral - Multi-modal Vision and Text Model
55:22
Trelis Research
3.4K views
Prompt Caching will not kill RAG
8:37
Yash - AI & Growth
870 views
Fine tune and Serve Faster Whisper Turbo
34:44
Trelis Research
2.3K views
Output Predictions - Faster Inference with OpenAI or vLLM
24:23
Trelis Research
1.3K views
Predicting Events with Large Language Models
25:09
Trelis Research
3.3K views
Long Context Summarization
32:01
Trelis Research
1.7K views
Understanding AI from Scratch - Neural Networks Course
3:44:18
freeCodeCamp.org
444K views