Output Predictions - Faster Inference with OpenAI or vLLM

  1,355 views

Trelis Research

1 day ago

Comments: 20
@notanape5415 7 days ago
Thanks for the awesome video.
@TrelisResearch 7 days ago
You’re welcome
@BiMoba 18 days ago
LoRAX has something very similar. The approach is also very interesting: you can take Llama 8B, train a speculative decoding LoRA adapter, and speed up generation in a lot of cases.
@TrelisResearch 16 days ago
Nice, that's interesting. Does the adapter end up being something like the SVD of the weight matrices? I guess you're trying to get the adapter to simulate the full model?
@NLPprompter 19 days ago
Your video is exactly what the world needed; I shared it with my fellow AI engineering newbies. Honestly, with low IQ and dyslexia it is really hard to gain skill in this field, but I'll never give up, because helpful videos like this are really helping me a lot.
@TrelisResearch 18 days ago
Nice!
@btaranto 19 days ago
Very interesting... will try... thanks 🎉
@MultiTheflyer 18 days ago
Great video! What I don't get is how this actually saves inference compute. To be sure that my "predicted output" is correct, don't I need to compute the output tokens as I traditionally would? E.g. "the cat is on the" -> "table": I already have "table" in my predicted output, but to be sure it's correct, don't I need to run inference on the LLM anyway? Thanks a lot!
@TrelisResearch 16 days ago
Yup, it doesn't save compute. In fact, it not only costs compute to check your prediction, it actually costs even more, because it first verifies your full prediction and then spends extra tokens on everything that needs fixing. But because you supply the prediction up front, it can do all of that compute in one call, so it gives a speedup. More speed, but more compute.
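For anyone who wants to measure that speed/compute tradeoff themselves, below is a minimal sketch of OpenAI's Predicted Outputs feature as documented at launch; the model choice and file contents here are placeholder assumptions:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A file we want lightly edited. Most tokens should survive unchanged,
    # which is exactly the regime where Predicted Outputs pays off.
    code = """
    class User:
        first_name: str = ""
        last_name: str = ""
        username: str = ""
    """

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Predicted Outputs launched on gpt-4o models
        messages=[
            {
                "role": "user",
                "content": "Add an email field to the User class. "
                "Respond only with the full updated code.",
            },
            {"role": "user", "content": code},
        ],
        # The prediction: pass the current file, betting most tokens match.
        prediction={"type": "content", "content": code},
    )

    print(completion.choices[0].message.content)
    # usage reports accepted/rejected prediction token counts, making the
    # "more compute, more speed" tradeoff above directly visible.
    print(completion.usage)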
@MultiTheflyer 18 days ago
Nice. Is there a similar feature in transformers?
@TrelisResearch 16 days ago
Transformers isn't really an inference library. Hugging Face has TGI for inference, and yes, you can add speculation there. Check my earlier video on speculative decoding.
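On the vLLM side (the open-source route in the video title), speculation is configured on the engine rather than per request. A minimal offline sketch; the argument names follow the vLLM 0.5/0.6-era API and may differ in newer releases, and the draft model is just an assumed choice:

    from vllm import LLM, SamplingParams

    # Draft-model speculative decoding: the small model proposes tokens,
    # the 8B target verifies a chunk of them per forward pass.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # assumed draft
        num_speculative_tokens=5,  # proposals per verification round
    )

    params = SamplingParams(temperature=0.0, max_tokens=128)
    out = llm.generate(["Rewrite with type hints: def add(a, b): return a + b"], params)
    print(out[0].outputs[0].text)

Swapping in speculative_model="[ngram]" with ngram_prompt_lookup_max set gives prompt-lookup speculation, which drafts from tokens already in the prompt and is the closest open-source analog to Predicted Outputs; newer vLLM releases group these options into a single speculative_config dict.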
@fatshaddy-rz2wn 18 days ago
Please make something on MLX and fine-tuning small LMs locally on a not-so-high-end MacBook, and more.
@TrelisResearch 18 days ago
Howdy, what specific kind of dataset and application are you looking for guidance on?
@fatshaddy-rz2wn 17 days ago
@TrelisResearch Something more on building Anthropic-style computer-use functionality.
@eado9440 17 days ago
Pros vs cons: faster, but uses more tokens? Seems worse than caching, which is faster and cheaper. It doesn't seem efficient to regenerate everything for just a small change.
@TrelisResearch 16 days ago
Yeah, probably true for expensive models. But for cheaper models the speedup is probably worth paying for. As I understand it, Cursor uses a 70B model for fast apply, and they wouldn't do fast apply if users didn't like the speedup. So yes, I think you're right, but there's still a case for the speedup, maybe more so for cheaper models.
@darkmatter9583 19 days ago
Please cover Nvidia datacenters and supercomputing: AI factories, even the NVIDIA DGX SuperPOD and beyond, because there will be more. Just help.
@TrelisResearch 18 days ago
What do you mean?
@mdrafatsiddiqui 18 days ago
Hi Ronan, do you have an email address I can reach you at?
@TrelisResearch 16 days ago
Best to leave a comment here, or else on a relevant page of trelis.com.