Deep Dive: Optimizing LLM inference

19,496 views

Julien Simon

4 months ago

Open-source LLMs are great for conversational applications, but they can be difficult to scale in production, and they may deliver latency and throughput that are incompatible with your cost-performance objectives.
In this video, we zoom in on optimizing LLM inference and study the key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
01:15 Decoder-only inference
06:05 The KV cache
11:15 Continuous batching
16:17 Speculative decoding
25:28 Speculative decoding: small off-the-shelf model
26:40 Speculative decoding: n-grams
30:25 Speculative decoding: Medusa
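A minimal sketch of the KV-cache idea with Hugging Face Transformers (the gpt2 checkpoint, prompt, and generation length are arbitrary placeholders, not code from the video): the prefill pass processes the whole prompt once and returns the per-layer keys and values, and each decoding step then feeds only the newest token and reuses that cache instead of re-encoding the prompt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("Open-source LLMs are", return_tensors="pt").input_ids

# Prefill: one forward pass over the full prompt returns logits for every
# position plus the per-layer keys and values (the KV cache).
with torch.no_grad():
    out = model(input_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

# Decode: each step feeds only the newest token and reuses the cache, so the
# model never re-encodes the prompt and the per-step cost stays roughly flat.
generated = [next_token]
for _ in range(20):
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=-1)[0]))
```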

Comments: 18
@jiegong529 · 25 days ago
Thanks so much for the crystal-clear explanations! You understand these topics so well, and it's even more impressive how you present them with bullet points and diagrams so that your audience understands them too!
@juliensimonfr · 16 days ago
Glad you liked it, thank you!
@cybermanaudiobooks3231 · 4 months ago
Thanks Julien. Your recent series of videos has been top quality. A future video you might consider making is one about the different prompt formats required when fine-tuning. Why does Llama 2 differ from Mistral? Why left-padding for some models and right-padding for others? How does TRL help simplify things? What is the history of this? What is ChatML? Guanaco? A video that builds a solid foundation for navigating this area would be helpful, to say the least!
@juliensimonfr · 4 months ago
Hi Cyberman, thank you for the kind words. I have a bit of an inference obsession at the moment, but I'll come back to training and fine-tuning after that. Your suggestions sound good, I'll add them to the list :)
@billykotsos4642 · a month ago
Very informative as always!
@juliensimonfr · 16 days ago
Glad it was helpful!
@user-kd2st5vc5t · 2 months ago
You explained it clearly in no time, very impressive! Love from China.
@RoyAAD · 3 months ago
Very interesting. Can we batch with a single copy of the LLM, or do we need multiple copies of it loaded on the GPU in order to batch? And how can we estimate the throughput, say for a 70B model on 2 A100s?
@juliensimonfr · 3 months ago
This works even on a single GPU. Here's the paper if you want to dive deeper: www.usenix.org/conference/osdi22/presentation/yu. Regarding benchmarks, I suggest you run your own to find the right latency/throughput trade-off for your application. This should help: www.databricks.com/blog/llm-inference-performance-engineering-best-practices
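To make the scheduling idea behind continuous batching concrete, here is a toy sketch (pure Python, no model; the slot count and request lengths are made up, and this is not the Orca or TGI implementation): the server admits waiting requests at every decoding step, so a finished sequence frees its slot immediately instead of holding up the whole batch.

```python
import random
from collections import deque

MAX_BATCH_SIZE = 4                                    # concurrent slots on the GPU (made up)
waiting = deque(f"request-{i}" for i in range(10))    # requests queued by clients
running = {}                                          # request -> tokens left to generate

step = 0
while waiting or running:
    # Admit waiting requests into free slots at every step: this is the
    # "continuous" part, no waiting for the whole batch to finish.
    while waiting and len(running) < MAX_BATCH_SIZE:
        req = waiting.popleft()
        running[req] = random.randint(3, 8)           # pretend output length
        print(f"step {step}: admitted {req}")

    # One decoding step: every running sequence produces one token.
    for req in list(running):
        running[req] -= 1
        if running[req] == 0:                         # sequence finished, slot is freed
            print(f"step {step}: finished {req}")
            del running[req]
    step += 1
```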
@RoyAAD · 3 months ago
@juliensimonfr Thanks. Do you have a link that explains how to calculate the feasibility for an LLM?
@justwest · 4 months ago
How does the big LLM handle the "predicted" tokens? I mean, how does it check whether they are good or not?
@juliensimonfr · 4 months ago
Detailed explanation in huggingface.co/blog/assisted-generation. In a nutshell, you can run a forward pass with the large model on a speculative sequence and retrieve the logits (i.e. the probabilities) for the next token at each position in the sequence. If a speculative token does have the highest logit, then it's the token the large model would have generated.
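A minimal sketch of that verification step, assuming greedy acceptance (the gpt2 checkpoint stands in for the large model, and the draft tokens are hard-coded placeholders rather than the output of a real draft model): a causal LM returns logits for every position in a single forward pass, so all draft tokens can be checked at once.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")     # stands in for the large model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
# Tokens a small draft model might have proposed (hard-coded here for clarity).
draft_ids = tokenizer(" Paris, a city", add_special_tokens=False, return_tensors="pt").input_ids

# Single forward pass of the large model over prompt + draft: the logits have
# shape [batch, seq_len, vocab], i.e. a next-token distribution at EVERY position.
sequence = torch.cat([prompt_ids, draft_ids], dim=-1)
with torch.no_grad():
    logits = model(sequence).logits

# The logits at position i predict token i+1, so the predictions that matter
# start at the last prompt position and cover every draft position.
preds = logits[:, prompt_ids.shape[-1] - 1 : -1].argmax(dim=-1)
matches = (preds == draft_ids)[0]

# Greedy acceptance: keep draft tokens until the first mismatch; the large
# model's own prediction then replaces the first rejected token.
n_accepted = int(matches.long().cumprod(dim=0).sum())
print(f"accepted {n_accepted} of {draft_ids.shape[-1]} draft tokens")
```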
@justwest · 4 months ago
@juliensimonfr Thanks a lot (for the videos in general too, of course). I think I'm still missing a point here: why do you get the logits *at each position in the sequence*? Isn't the output of the model just the probabilities for the *next* token? If I wanted this *at each position*, wouldn't I have to run multiple forward passes? Thanks!
@bibiworm · 2 months ago
Would you mind sharing the slides, please? Thank you!
@rbrowne4255 · 4 months ago
Thanks for the video, great job!!! In terms of speculative decoding, can you provide any additional feedback on its impact on GPU performance and memory, e.g. KV-cache usage or overall GPU memory consumption?
@juliensimonfr · 3 months ago
The only overhead is the assistant model, which can share layers with the large model. For example, see huggingface.co/blog/whisper-speculative-decoding, which reports only about 8% RAM overhead.
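For reference, a short sketch of assisted generation with the Transformers `assistant_model` argument; the gpt2-large / gpt2 pair is just an example of a large target model plus a small draft model that share the same tokenizer, not the checkpoints used in the video.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
model = AutoModelForCausalLM.from_pretrained("gpt2-large")   # large target model
assistant = AutoModelForCausalLM.from_pretrained("gpt2")     # small draft model: the only extra memory

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt")
outputs = model.generate(
    **inputs,
    assistant_model=assistant,   # draft tokens come from the small model, the large model verifies them
    max_new_tokens=64,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```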
@Gerald-xg3rq · 3 months ago
Hi, great video! How should WAITING_SERVED_RATIO, MAX_BATCH_SIZE, MAX_BATCH_TOTAL_TOKENS, MAX_BATCH_PREFILL_TOKENS, etc. be set for the highest throughput? I'm looking at llama2-7b-chat and llama3-8b-instruct on an NVIDIA A10.
@juliensimonfr · 3 months ago
Hi, the documentation is available at huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher. I would increase the batch size and measure.