Speculative Decoding Explained

3,503 views

Trelis Research

A day ago

One Click Templates Repo (free): github.com/TrelisResearch/one...
Advanced Inference Repo (Paid Lifetime Membership): trelis.com/enterprise-server-...
Affiliate Links (support the channel):
- Vast AI - cloud.vast.ai/?ref_id=98762
- Runpod - tinyurl.com/4b6ecbbn
LLM Updates Newsletter: Trelis.Substack.com
Chapters
0:00 Faster inference with Speculative Decoding
0:22 Video Overview
1:09 How speculative decoding works
5:45 Naive speculative decoding
7:16 Prompt based n-gram speculation
9:31 Lookahead decoding
11:48 Assisted decoding
13:41 Summary of Decoding Techniques
15:20 Performance Testing
32:28 Summary of Results
34:24 Tips for faster inference

17 comments
@nirsarkar 7 months ago
It reminds me of the same problem that was solved with pipelines in microprocessors many years ago using speculative branch prediction. Excellent content by the way, thanks!
@TrelisResearch 7 months ago
Ah, that's really interesting. I didn't realise that, but it makes sense.
@WinsonDabbles 7 months ago
Purchased access to both of your repos! Eagerly waiting for your next video! Merry Xmas and Happy New Year!
@mangeshdiyewar4408 7 months ago
Hey Trelis, I'm a big fan of your work and the research you do. Can you make a video on LLM evaluation with metrics for a given task?
@TrelisResearch 7 months ago
Howdy, the closest existing vid to that is the one where I compare DeepSeek with CodeLlama and GPT-4. I'll consider where I can make some more eval vids, thanks for the tip.
@mangeshdiyewar4408 7 months ago
@TrelisResearch thank you for replying
@electricskies1707 7 months ago
Could this guesser model be placed into the Mixtral MoE? I'm sure the training would take extensive time, but hypothetically could this speed up inference on that model type? Also, you previously did LoRA training on the MoE, but how is the router part trained and combined with the models? Any videos on this would be great.
@TrelisResearch 7 months ago
Hmm, I like the questions!
- Training a guesser model (Medusa-style) should be possible, but it would be custom because of the routers and experts.
- However, speculative decoding is less effective in general for MoE because rarely (actually very rarely if it's a 2-out-of-8 choice) will the n+2 token be sent to the same expert as the n+1 token. If that's the case, there's no parallel compute, because different experts aren't inferenced in parallel. Maybe there is a way to build a guesser onto each expert, though, and make it so that whatever expert the n+1 token is going to, the n+2 token goes there too.
- Regarding LoRA training: only the attention layers are trained; the routers and the experts (which are linear layers) are left the same. You'll see that mentioned if you go back through (maybe I should have emphasised it more). A minimal sketch of that target-module setup is below.
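For illustration, here's a minimal sketch of that attention-only LoRA setup using Hugging Face PEFT, assuming a Mixtral-style checkpoint (the model id and hyperparameters are placeholders, not the exact settings from the video): adapters go only on the attention projections, so the router ("gate") and the expert MLPs stay frozen.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder model id, for illustration; loading this needs a lot of memory.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Adapters only on the attention projections; the router ("gate")
    # and the expert linear layers are left frozen.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction should be trainable
```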
@TrelisResearch 7 months ago
Thinking more on this... even if the guesser runs on the expert chosen for the n+1th token (to come up with guesses), there is still the problem of verifying those guesses. That verification will mostly need to be done on a different expert than the one chosen for the n+1th token, meaning no parallelisation is possible. If running very large batch sizes, where all experts are engaged, there is perhaps an opportunity to do parallel decoding, but at that point the GPU is already being used for lots of parallel calculations, so there would be a limit to how much speculation could be added.
@electricskies1707 7 months ago
Thanks for your comments. I remember you don't train the router; I guess my question was more how the router is made and combined with the other models to construct the MoE (this may be proprietary info, I haven't yet looked for papers on it). We don't know at the moment what models make up Mixtral and what their specialisms/biases are, do we?
@TrelisResearch 7 months ago
@electricskies1707 Actually, everything is typically trained at once in a MoE (although, to speed things up, they probably copy-pasted Mistral 7B as each expert and then switched the linear layers to trainable). BTW, the routers are also trained simultaneously; you backprop through the whole network. It's not like the specialisms are interpretable either (i.e. one isn't English or maths etc.; it's quite abstract how they naturally specialise). So when they train Mixtral they backprop through everything (routers included), but when we fine-tune we fix the linear layers in each expert, fix the routers, and only train the attention (which is the same in each expert). Sorry, I may just be repeating myself and not answering your question. A rough sketch of a router plus experts is below.
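To make the "routers are trained along with everything else" point concrete, here is a toy top-2-of-8 MoE layer in PyTorch (a sketch for intuition, not Mixtral's actual implementation): backprop through the output reaches the router as well as the experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-2-of-8 MoE layer: the router and the experts are ordinary
    nn.Modules, so backprop updates all of them at once."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # trained jointly with the experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalise over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
loss = moe(torch.randn(16, 64)).pow(2).mean()
loss.backward()
print(moe.router.weight.grad is not None)      # True: the router gets gradients too
```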
@user-yu8sp2np2x 7 months ago
Tell me one thing: for fine-tuning Mistral-7B-Instruct-v0.2, if I am following the Llama-2 chat format for the prompt, how can I get the output in a specific format, say a Python list, or something such that I can extract the result?
@TrelisResearch 7 months ago
Probably have a look at the function calling video. It's analogous to function calling: you want to focus the training loss on the response. The video on structured responses will help, and then a little bit the recent function calling vid. A minimal sketch of response-only loss masking is below.
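For the "focus the training loss on the response" part, a common recipe (a sketch of the general idea, not necessarily the exact code from those videos) is to set the prompt positions in `labels` to -100 so the cross-entropy loss ignores them and is computed only on the response tokens:

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss

def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response ids; mask the prompt positions so
    the loss is computed only on the response."""
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(prompt_ids) + response_ids)
    return input_ids, labels

# Hypothetical token ids standing in for "[INST] ... [/INST]" and a short answer.
prompt_ids = [1, 733, 16289, 28793, 12628, 733, 28748, 16289, 28793]
response_ids = [28792, 28740, 28725, 28750, 28793, 2]

input_ids, labels = build_labels(prompt_ids, response_ids)
print(labels)  # -100 over the prompt positions, real ids over the response
```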
@lhxperimental 7 months ago
What do you mean by correct? How do you define it?
@TrelisResearch 7 months ago
Howdy. Can you give me a timestamp for the part of the video you're referring to?
@Kopern1k990 7 months ago
I guess this only works for greedy decoding, which makes it useless in practice?
@TrelisResearch 7 months ago
I'm not 100% sure, but I believe the n-gram method is OK for non-greedy decoding. n-gram speculation is close to free from a guessing standpoint, i.e. little compute is used to generate the guesses; they are just drawn from the prompt and the already-generated tokens. Guessing is more expensive with a draft model or an in-built model. With non-greedy decoding, the guesses will be correct less often, which makes the benefit (particularly from expensive guessing methods) less worthwhile. A rough sketch of prompt-based n-gram guessing is below.
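For anyone curious what prompt-based n-gram speculation looks like, here is a toy sketch (operating on strings instead of token ids, and not the exact implementation from the video or from any particular library): find the most recent earlier occurrence of the trailing n-gram and propose the tokens that followed it as drafts, which the main model then verifies in one forward pass.

```python
def ngram_candidates(tokens, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram against the
    prompt and previously generated tokens."""
    pattern = tokens[-ngram_size:]
    # Scan backwards, excluding the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == pattern:
            follow = tokens[start + ngram_size:start + ngram_size + num_draft]
            if follow:
                return follow  # draft tokens for the main model to verify
    return []  # no match: fall back to ordinary one-token-at-a-time decoding

# Hypothetical "tokens" where the trailing n-gram "the cat sat" appeared earlier.
context = ["the", "cat", "sat", "on", "the", "mat", ".", "then", "the", "cat", "sat"]
print(ngram_candidates(context))  # ['on', 'the', 'mat', '.', 'then']
```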