Output Predictions - Faster Inference with OpenAI or vLLM

  1,355 views

Trelis Research

1 day ago

Comments: 20
@notanape5415 7 days ago
Thanks for the awesome video.
@TrelisResearch 7 days ago
You’re welcome
@BiMoba 18 days ago
LoRAX has something very similar. The approach is also very interesting: you can take Llama 8B, train a speculative decoding LoRA adapter, and speed up generation in a lot of cases.
@TrelisResearch 16 days ago
Nice, that's interesting. Does the adapter end up being something like the SVD of the weight matrices? I guess you're trying to get the adapter to simulate the full model?
@NLPprompter 19 days ago
Your video is exactly what the world needed; I shared it with my fellow AI engineering newbies. Honestly, with low IQ and dyslexia it is really hard to gain skill in this field, but I'll never give up, because helpful videos like this are really helping me a lot.
@TrelisResearch 18 days ago
Nice!
@btaranto 19 days ago
Very interesting... will try... thanks 🎉
@MultiTheflyer 18 days ago
Great video! What I don't get is how this actually saves inference compute. To be sure that my "predicted output" is correct, don't I need to compute the output tokens as I traditionally would? E.g. "the cat is on the" -> "table": I already have "table" in my predicted output, but to be sure it's correct, don't I need to run inference on the LLM anyway? Thanks a lot!
@TrelisResearch 16 days ago
Yup, it doesn't save compute. In fact, it not only costs compute to check your prediction, it actually costs even more, because it first verifies your full prediction and then spends extra tokens on everything that needs fixing. But because you supply the prediction up front, it can do all of that compute in one call, so it gives a speedup. More speed, but more compute.
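For anyone who wants to measure that speed/compute tradeoff themselves, below is a minimal sketch of OpenAI's Predicted Outputs feature as documented at launch; the model choice and file contents here are placeholder assumptions:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # A file we want lightly edited. Most tokens should survive unchanged,
    # which is exactly the regime where Predicted Outputs pays off.
    code = """
    class User:
        first_name: str = ""
        last_name: str = ""
        username: str = ""
    """

    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # Predicted Outputs launched on gpt-4o models
        messages=[
            {
                "role": "user",
                "content": "Add an email field to the User class. "
                "Respond only with the full updated code.",
            },
            {"role": "user", "content": code},
        ],
        # The prediction: pass the current file, betting most tokens match.
        prediction={"type": "content", "content": code},
    )

    print(completion.choices[0].message.content)
    # usage reports accepted/rejected prediction token counts, making the
    # "more compute, more speed" tradeoff above directly visible.
    print(completion.usage)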
@MultiTheflyer 18 days ago
Nice. Is there a similar feature in transformers?
@TrelisResearch 16 days ago
Transformers isn't really an inference library. Hugging Face has TGI for inference, and yes, you can add speculation there. Check my earlier video on speculative decoding.
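On the vLLM side (the open-source route in the video title), speculation is configured on the engine rather than per request. A minimal offline sketch; the argument names follow the vLLM 0.5/0.6-era API and may differ in newer releases, and the draft model is just an assumed choice:

    from vllm import LLM, SamplingParams

    # Draft-model speculative decoding: the small model proposes tokens,
    # the 8B target verifies a chunk of them per forward pass.
    llm = LLM(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        speculative_model="meta-llama/Llama-3.2-1B-Instruct",  # assumed draft
        num_speculative_tokens=5,  # proposals per verification round
    )

    params = SamplingParams(temperature=0.0, max_tokens=128)
    out = llm.generate(["Rewrite with type hints: def add(a, b): return a + b"], params)
    print(out[0].outputs[0].text)

Swapping in speculative_model="[ngram]" with ngram_prompt_lookup_max set gives prompt-lookup speculation, which drafts from tokens already in the prompt and is the closest open-source analog to Predicted Outputs; newer vLLM releases group these options into a single speculative_config dict.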
@fatshaddy-rz2wn 18 days ago
Please make something on MLX and fine-tuning small LMs locally on a not-so-high-end MacBook, and more.
@TrelisResearch 18 days ago
Howdy, what specific kind of dataset and application are you looking for guidance on?
@fatshaddy-rz2wn 17 days ago
@TrelisResearch Something more on building Anthropic-style computer-use functionality.
@eado9440 17 days ago
Pros vs cons: faster, but uses more tokens? Seems worse than caching, which is faster and cheaper. It doesn't seem efficient to regenerate everything for just a small change.
@TrelisResearch 16 days ago
Yeah, probably true for expensive models. But for cheaper models the speedup is probably worth paying for. As I understand it, Cursor uses a 70B model for fast apply, and they wouldn't do fast apply if users didn't like the speedup. So yes, I think you're right, but there's still a case for the speedup, maybe more so for cheaper models.
@darkmatter9583 19 days ago
Please cover Nvidia datacenters and supercomputing: AI factories, even the NVIDIA DGX SuperPOD and beyond, because there will be more. Just help.
@TrelisResearch 18 days ago
What do you mean?
@mdrafatsiddiqui 18 days ago
Hi Ronan, do you have an email address I can reach you at?
@TrelisResearch 16 days ago
Best to leave a comment here, or else on a relevant page of trelis.com.