Deep Dive: Optimizing LLM inference

18,580 views

Julien Simon

4 months ago

Open-source LLMs are great for conversational applications, but they can be difficult to scale in production, and they may deliver latency and throughput that are incompatible with your cost-performance objectives.
In this video, we zoom in on optimizing LLM inference and study the key mechanisms that help reduce latency and increase throughput: the KV cache, continuous batching, and speculative decoding, including the state-of-the-art Medusa approach (see the code sketch after the chapter list below).
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
01:15 Decoder-only inference
06:05 The KV cache
11:15 Continuous batching
16:17 Speculative decoding
25:28 Speculative decoding: small off-the-shelf model
26:40 Speculative decoding: n-grams
30:25 Speculative decoding: Medusa
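
To make the KV cache idea concrete, here is a minimal greedy-decoding sketch using Hugging Face transformers (the model name, prompt, and token count are placeholders; real code would add sampling and stopping criteria). Passing past_key_values back into the model means each decoding step only processes the newly generated token instead of re-running attention over the whole sequence:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder: any decoder-only model works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

input_ids = tokenizer("The KV cache speeds up decoding because", return_tensors="pt").input_ids
past_key_values = None  # the KV cache, empty before the prefill step

with torch.no_grad():
    for _ in range(20):
        # After the prefill step, only the last generated token needs to be processed
        ids = input_ids if past_key_values is None else input_ids[:, -1:]
        out = model(input_ids=ids, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values  # cached keys/values of all previous tokens
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))

Without use_cache=True, every step would recompute keys and values for the full sequence; that recomputation is exactly the cost the KV cache removes.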

Comments: 16
@jiegong529 2 days ago
Thanks so much for the crystal clear explanations! You understand them so well and it's even more amazing how you show them in bullet points and graphs to make your audience understand as well!
@billykotsos4642 7 days ago
Very informative as always!
@cybermanaudiobooks3231 4 months ago
Thanks Julien. Your recent series of videos has been top quality. A future video you might consider making is one about the different prompt formats required when fine-tuning. Why does llama2 differ from mistral? Left-padding for some, right-padding for others; how does trl help simplify things? What is the history of this? What is chatml? Guanaco? etc. A video that builds a solid foundation for navigating this area would be helpful, to say the least!
@juliensimonfr 4 months ago
Hi Cyberman, thank you for the kind words. I have a bit of an inference obsession at the moment, but I'll come back to training and fine-tuning after that. Your suggestions sound good, I'll add them to the list :)
@RoyAAD 2 months ago
Very interesting. Can we batch with a single loaded copy of the LLM, or do we need multiple copies loaded on the GPU in order to batch? And how can we estimate throughput, say for a 70B model on 2 A100s?
@juliensimonfr 2 months ago
This works even on a single GPU. Here's the paper if you want to dive deeper: www.usenix.org/conference/osdi22/presentation/yu. Regarding benchmarks, I suggest you run your own to find the right latency/throughput trade-off for your application. This should help: www.databricks.com/blog/llm-inference-performance-engineering-best-practices
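
A rough back-of-envelope for the 70B-on-2xA100 question, assuming FP16 weights and memory-bandwidth-bound decoding (all numbers are approximate and illustrative only): at batch size 1, an upper bound on tokens per second is roughly total HBM bandwidth divided by the bytes of weights read per token.

# Rough estimate for a 70B-parameter model on 2 x A100 80GB (approximate numbers).
params = 70e9
bytes_per_param = 2                           # FP16/BF16 weights
weight_bytes = params * bytes_per_param       # ~140 GB: already needs two 80 GB cards

hbm_bandwidth_per_gpu = 2.0e12                # ~2 TB/s per A100 80GB, approximate
total_bandwidth = 2 * hbm_bandwidth_per_gpu   # idealized tensor-parallel split over 2 GPUs

# Memory-bandwidth-bound decoding: each generated token re-reads the weights once.
tokens_per_s = total_bandwidth / weight_bytes
print(f"~{tokens_per_s:.0f} tokens/s upper bound at batch size 1")

# Batching amortizes those weight reads across requests, which is why
# continuous batching raises aggregate throughput until compute or
# KV-cache memory becomes the bottleneck.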
@RoyAAD 2 months ago
@juliensimonfr Thanks. Do you have a link that explains how to calculate the feasibility for an LLM?
@bibiworm 1 month ago
Would you mind sharing the slides please Sir? Thank you!
@justwest 3 months ago
how does the big LLM handle the "predicted" tokens? I mean, how does it check whether these are good or not?
@juliensimonfr 3 months ago
Detailed explanation in huggingface.co/blog/assisted-generation. In a nutshell, you can run a single forward pass with the large model on a speculative sequence and retrieve the logits (i.e., the next-token scores) at each position in the sequence. If a speculative token has the highest logit, then it's the token the large model would have generated anyway.
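
A minimal sketch of that verification step, assuming greedy decoding and a small/large model pair sharing a tokenizer (the model names and the "drafted" tokens below are placeholders). The key point is that one forward pass returns logits of shape [batch, seq_len, vocab], i.e. a next-token prediction at every position, so all drafted tokens can be checked at once:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: any small/large pair sharing a tokenizer would do.
tok = AutoTokenizer.from_pretrained("gpt2")
large = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()

prompt_ids = tok("Speculative decoding works because", return_tensors="pt").input_ids
# Pretend the small draft model proposed these tokens:
draft_ids = tok(" the large model can verify them", return_tensors="pt",
                add_special_tokens=False).input_ids

candidate = torch.cat([prompt_ids, draft_ids], dim=-1)
with torch.no_grad():
    logits = large(candidate).logits  # [batch, seq_len, vocab]: one pass, every position

# Greedy check: the large model's pick after position i must equal the drafted token at i+1.
n_prompt = prompt_ids.shape[1]
accepted = 0
for i in range(draft_ids.shape[1]):
    predicted = logits[0, n_prompt + i - 1].argmax()
    if predicted == draft_ids[0, i]:
        accepted += 1                 # token confirmed, keep checking
    else:
        break                         # first mismatch: discard the rest of the draft
print(f"accepted {accepted} of {draft_ids.shape[1]} drafted tokens")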
@justwest 3 months ago
@juliensimonfr Thanks a lot (for the vids in general also, of course). I think I'm still missing a point here. Why do you get the logits *at each position in the sequence*? Isn't the output of the model just probabilities for the *next* token? If I wanted to have this *at each position*, wouldn't I have to run multiple forward passes? Thanks!
@rbrowne4255 3 months ago
Thanks for the video, great job! In terms of speculative decoding, can you provide any additional feedback on its impact on GPU performance/memory, i.e., KV cache usage or overall GPU memory resources?
@juliensimonfr 3 months ago
The only overhead is the assistant model, which can share layers with the large model. For example, see huggingface.co/blog/whisper-speculative-decoding, which reports only an 8% RAM overhead.
@user-kd2st5vc5t 1 month ago
You explained it clearly in no time, impressive! Love from China.
@Gerald-xg3rq 2 months ago
Hi, great video! How should I set WAITING_SERVED_RATIO, MAX_BATCH_SIZE, MAX_BATCH_TOTAL_TOKENS, MAX_BATCH_PREFILL_TOKENS, etc. for the highest throughput? I'm looking at llama2-7b-chat and llama3-8b-instruct on an NVIDIA A10.
@juliensimonfr 2 months ago
Hi, the doc is available at huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher. I would increase the batch size and measure.
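
As a sketch of that "increase and measure" loop, here is a simple client-side benchmark against a running TGI endpoint (the URL, prompt, token count, and concurrency are placeholders for your setup). Raise the launcher's batch limits and the client concurrency together, and watch how per-request latency and aggregate tokens/second move:

import time
import concurrent.futures
import requests

TGI_URL = "http://localhost:8080/generate"        # placeholder: your TGI endpoint
PAYLOAD = {"inputs": "Explain continuous batching in one paragraph.",
           "parameters": {"max_new_tokens": 128}}

def one_request(_):
    start = time.perf_counter()
    r = requests.post(TGI_URL, json=PAYLOAD, timeout=120)
    r.raise_for_status()
    return time.perf_counter() - start

concurrency = 16                                  # raise together with the batch limits
t0 = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
    latencies = sorted(pool.map(one_request, range(concurrency)))
elapsed = time.perf_counter() - t0

print(f"p50 latency: {latencies[len(latencies) // 2]:.2f} s")
print(f"aggregate: {concurrency * 128 / elapsed:.1f} tokens/s (assuming full 128-token completions)")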