Speculative Decoding: When Two LLMs are Faster than One

14,376 views

Efficient NLP

1 day ago

Comments: 50
@dev.22.-1 13 days ago
Super clear explanation of speculative decoding. I have been working with this for a while but this clarified some of my questions
@kaenovama 1 year ago
Really well done, a light bulb went on in my brain when you showed the table! Thank you, keep it up!
@oc1655 2 months ago
I recently wanted to brush up on this because it's been a while since I read the paper. I browsed a few tutorials/blog posts, and it's funny how many people wrote about this without understanding it. You certainly did understand it, and you do a great job of breaking it down. Thank you very much.
@decycle2912 1 year ago
Really informative! One thing that I don't understand is how the LLM knows the previous probability distributions in a single pass. I thought decoder LLMs only output the new token's probability distribution.
@EfficientNLP 1 year ago
This is due to the way the transformer architecture is set up: during decoding it takes as input all previous tokens and computes the hidden states for all previous tokens at each layer. Since we have the final layer hidden states, it is possible to obtain the probability distributions for all previous tokens.
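A minimal sketch of what this looks like in practice, using the Hugging Face transformers API; the gpt2 checkpoint and the prompt here are illustrative stand-ins, not the models from the video:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in target model; in speculative decoding this would be the large model.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The quick brown fox jumps over the", return_tensors="pt").input_ids

# One forward pass returns logits for every position in the sequence,
# so the distributions over all previous tokens come for free.
with torch.no_grad():
    logits = model(ids).logits          # shape: (1, seq_len, vocab_size)

# probs[0, t] is the model's distribution for the token at position t+1,
# conditioned on tokens 0..t. This is what lets the target model score
# all K draft tokens in a single pass.
probs = torch.softmax(logits, dim=-1)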
@henkjekel4081 6 months ago
@@EfficientNLP Now I get it. Did you also explain this somewhere in the video? Maybe link to it in the description, because this is also where I didn't understand it.
@shairuno 5 months ago
We can compute the probability for each token in a batched fashion. The trick is to use masked attention.
@igorfilippov1221 5 months ago
Very clear explanation, thank you!
@einsteinsapples2909 11 months ago
Thank you for the video. When I first heard this idea in February I was wondering how it made sense, because I was picturing a large K. Now, seeing that the recommended K is about 3, I understand how most of the output will be the same.
@meisherenow 27 days ago
Clear and succinct. Well done.
@kevintai8656 8 months ago
Great work! Thank you
@rajmankad2949 3 months ago
What an amazing explanation! Thank you so much
@vukrosic 6 months ago
Thank you for explaining it!
@shairuno 5 months ago
I haven't read the paper yet, but my understanding is that we sample from q(x) - p(x) because we want the most surprising token that the draft model does not anticipate. It should maximize the entropy, but then it should have a log in the equation. Anyway, I've got to read the paper to understand the math.
@甘楽-u7v 4 months ago
Very easy to understand. Thanks so much.
@420_gunna 7 months ago
Love your video, thanks! If I had to give one request/critique, it'd be that I wish there were some slides in here, similar to Samuel Albanie's videos: information-dense recaps that could be lifted out of the presentation and put into our notes (or into a PowerPoint for a paper club, or something).
@EfficientNLP 7 months ago
Interesting idea, though my videos often contain animations, drawings, screencasts, etc., and are not directly a recording of PowerPoint slides. Feel free to take screenshots of my videos for any educational purposes though!
@magalodontestnoneworld9043 9 months ago
Very good video!
@paull923 8 months ago
Great explanation, thank you.
@kylewilliams9214 1 year ago
Google and DeepMind doing the Spiderman meme 😅
@jackzhang9053 21 days ago
Thanks for creating this amazing video! I’m wondering if you could open source the slides as well?
@EfficientNLP 20 days ago
I'm glad you enjoyed it! I'm not planning to release the slides, but I'm happy to answer any questions.
@rexyl547 3 months ago
Great video! Keep it up!
@anshumansinha5874 21 days ago
Hi, thanks for the great content. I have a question. Let's say during speculative decoding (with a vocabulary of only 4 tokens) we get to a stage where the draft model's next-token distribution is [0.35, 0.3, 0.15, 0.2] and the target model's is [0.4, 0.5, 0.05, 0.05]. So now the probability of token 1 under the draft is 0.35 and under the target it is 0.4. What will the speculative algorithm do? If the speculative algorithm picks token 1 from the vocabulary, can we still say that we decode the exact same tokens that the larger model would decode? Thanks
@EfficientNLP 20 days ago
Yes, the algorithm will always decode the same tokens (more precisely, generate tokens from the same distribution) as the larger model, no matter what the draft model does. However, if the draft model picks token 1, that does not always mean token 1 ends up in the final output; whether it gets accepted depends on the ratio of probabilities. That's the rejection sampling procedure described in the video.
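In symbols, the acceptance rule being described (using the DeepMind-style notation from the video, where p is the draft distribution and q is the target distribution; a paraphrase of my understanding, not a quote from either paper) is:

  accept a draft token x ~ p(x) with probability min(1, q(x) / p(x));
  on rejection, resample x' from the residual distribution max(q(x') - p(x'), 0) / Σ_x'' max(q(x'') - p(x''), 0).

In the 4-token example above, token 1 has p = 0.35 and q = 0.4, so it is accepted with probability min(1, 0.4/0.35) = 1, i.e. always.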
@waynelau3256 1 year ago
Thanks for this! I've been enjoying your videos! Do you think you could do a review/explanation of flash-decoding by Tri Dao? I have been reading the PyTorch blog post, but I don't really understand it.
@EfficientNLP 1 year ago
Thanks for the suggestion, I will add it to my list of future topics!
@dorianlin491 1 year ago
Really helpful video!
@Basant5911 6 months ago
Made very simple, but one more variable is choosing the right draft model. If one chooses a draft model that is too far from the larger model's distribution, then that's also a problem.
@EfficientNLP 6 months ago
If the draft model is far from the target model's distribution, then speculative decoding will be less effective because it will have a higher rejection rate, thus reducing the speedup. However, the algorithm guarantees that the output sequence will be identical; therefore, even if the draft model is of poor quality, the text generation quality will not be affected.
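To put a number on "less effective" (a paraphrase of the analysis in the speculative decoding papers, not something stated in this reply): the probability that a single draft token is accepted is roughly

  α = Σ_x min(p(x), q(x)),

which shrinks as the draft distribution p drifts away from the target q. With K draft tokens per iteration, the expected number of tokens emitted per target-model forward pass is about (1 - α^(K+1)) / (1 - α). A poor draft model therefore lowers α and the speedup, but never the output quality.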
@mingzhou2213 8 months ago
Thank you for the explanations and the visuals. Does speculative decoding work with beam search? I understand that for LLMs we generally just do greedy decoding in one pass, but for translation models like Whisper, performance increases significantly if we use beam search. I have even seen an official Hugging Face post discussing how speculative decoding improves Whisper large inference speed by 2x, but to be honest, for non-English audio data, Whisper with greedy decoding is barely usable...
@EfficientNLP 8 months ago
Interesting idea, but I don't know of any attempts to combine them. Speculative decoding relies heavily on random sampling, whereas beam search is deterministic, so probably they are incompatible. For speeding up whisper inference, you might try using smaller or quantized models on faster engines like CTranslate2 or faster-whisper.
@laulinky334 6 months ago
Thanks for sharing. I am wondering how the target model checks the generated tokens of the draft model and produces the probability distribution q(x) for each token?
@EfficientNLP 6 months ago
This is due to the parallel nature of transformers - when given a sequence of tokens, it can generate the logits for all of them in parallel, unlike generation which must be done autoregressively.
@saiashwalkaligotla6639 3 months ago
Interesting. Curious whether we can use multiple different fine-tuned small models to do the same task along with a bigger model.
@EfficientNLP 3 months ago
That is an interesting idea, like using multiple smaller models to generate several completions and having the large model choose the best one. I'm not sure if anyone has tried this.
@saiashwalkaligotla6639 3 months ago
@@EfficientNLP Yeah, hopefully it might increase (or at least match) the accuracy of the original model.
@ArtemAlekhin 1 year ago
Great video 👍
@feixyzliu5432 10 months ago
Why does the target model running on K new tokens spend almost the same computation as on just 1 new token? I know K new tokens can be computed in parallel in a single forward pass, but self-attention over K new tokens does need more work than over 1 token (assuming a KV cache is used), doesn't it?
@EfficientNLP 10 months ago
It's true that the computation time in the two scenarios might not be exactly the same due to KV cache and other implementation details; however, for simplicity, we can assume one forward pass through a model as taking one unit of time. Both decoding one token and checking the probabilities of multiple tokens require one forward pass.
@ariellubonja7856 8 months ago
Thanks! But doesn't the Google paper define Mq as the draft model, i.e., flip the definitions?
@EfficientNLP 8 months ago
You are right; the Google paper uses a different notation from the DeepMind paper. In this video I'm using the DeepMind notation.
@domenvake3077 7 months ago
This is great! Is there any chance you could demonstrate something like this in code?
@EfficientNLP 7 months ago
Good question. It appears that neither the DeepMind nor the Google paper has an official source code implementation, but there are several implementations of this idea on GitHub; I have not looked at them closely.
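For anyone who wants to see the shape of it, here is a rough, self-contained sketch of the loop (my own illustration in PyTorch, not the official code of either paper; p_draft and q_target are assumed to be functions that map a token prefix to a next-token distribution over the vocabulary):

import torch

def speculative_step(prefix, p_draft, q_target, K=3):
    # 1) The draft model proposes K tokens autoregressively.
    draft_tokens, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(K):
        p = p_draft(ctx)                      # 1-D tensor: distribution over the vocab
        x = torch.multinomial(p, 1).item()
        draft_tokens.append(x)
        draft_probs.append(p)
        ctx.append(x)

    # 2) The target model scores all K+1 positions; conceptually this is a single
    #    batched forward pass, written here as a loop for clarity.
    q_probs = [q_target(list(prefix) + draft_tokens[:i]) for i in range(K + 1)]

    # 3) Accept each draft token with probability min(1, q(x)/p(x));
    #    at the first rejection, resample from the residual max(q - p, 0).
    out = list(prefix)
    for i, x in enumerate(draft_tokens):
        p, q = draft_probs[i], q_probs[i]
        if torch.rand(()).item() < min(1.0, (q[x] / p[x]).item()):
            out.append(x)
        else:
            residual = torch.clamp(q - p, min=0)
            out.append(torch.multinomial(residual / residual.sum(), 1).item())
            return out                        # stop at the first rejection

    # 4) All K tokens accepted: sample one bonus token from the target's distribution.
    out.append(torch.multinomial(q_probs[K], 1).item())
    return out

A toy pair of p_draft/q_target functions (for example, softmaxed lookup tables over a small vocabulary) is enough to check empirically that the outputs follow the target model's distribution.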
@gnorts_mr_alien 7 months ago
For this to work, the two models need to have identical tokenization, right? Is there any way around it?
@EfficientNLP 7 months ago
That's right - the two models need to use the same vocabulary so that we can compare their logits meaningfully.
@gnorts_mr_alien 7 months ago
@@EfficientNLP Thank you for the quick response. That makes sense!
@murtazanazir9997 6 months ago
Not necessarily. We can retokenize the text predicted by the draft model, though that can be slow.
@christospapadopoulos7894 5 months ago
Nothing really new about this; it seems that big tech companies really do have it easier when publishing research.
@EfficientNLP 5 months ago
That’s the way it tends to go! One small step at a time