LoRAX has something very similar. The approach is also very interesting: you can take Llama 8B, train a speculative-decoding LoRA adapter, and speed up generation in a lot of cases.
@TrelisResearch 16 days ago
Nice, that's interesting. Does the adapter end up being like the SVD of the matrices? I guess you're trying to get the adapter to simulate the full model?
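For reference, here is a rough sketch of what the LoRAX approach mentioned above could look like from the client side. The endpoint, adapter id, and response shape are assumptions based on LoRAX's TGI-style REST API, not tested artifacts:

```python
import requests

# Sketch only: the server URL and adapter id are placeholders.
# LoRAX serves one base model (e.g. Llama 8B) and can load an adapter
# per request; a Medusa-style speculative-decoding adapter would speed
# up generation on top of the same base weights.
resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Summarize: LoRAX serves many adapters on one base model.",
        "parameters": {
            "adapter_id": "my-org/llama-3-8b-medusa",  # hypothetical adapter
            "max_new_tokens": 64,
        },
    },
)
print(resp.json()["generated_text"])
```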
@NLPprompter 19 days ago
Your video is totally what the world needed. I shared it with my fellow AI engineer newbies... honestly, with low IQ and dyslexia it is really hard to gain skills in this field... I'll never give up, because helpful videos like this are really helping me a lot.
@TrelisResearch 18 days ago
Nice!
@btaranto 19 days ago
Very interesting... will try... thanks 🎉
@MultiTheflyer 18 days ago
Great video! What I don't get is how this actually saves inference compute. To be sure that my "predicted output" is correct, don't I need to compute the output tokens as I traditionally would? E.g. "the cat is on the" -> "table": I already have "table" in my predicted output, but to be sure that's correct, don't I need to run inference on the LLM anyway? Thanks a lot!
@TrelisResearch 16 days ago
Yup, it doesn't save compute. In fact, it not only costs compute to check your prediction, it can actually cost even more, because first it verifies your full prediction and then it uses more tokens for everything that needs fixing. But because you give it the prediction, it can do all of that compute in one call, so it gives a speed-up. More speed, but more compute.
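For anyone who wants to try this, here is a minimal sketch using OpenAI's Predicted Outputs parameter; the model name and file contents are placeholders, and the mechanics described in the comments reflect the tradeoff discussed above:

```python
from openai import OpenAI

client = OpenAI()

# The existing file doubles as the prediction. The server verifies these
# tokens in parallel and only generates fresh tokens where the output
# diverges, so latency drops even though total compute goes up.
original_code = 'def greet():\n    print("hello")\n'

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; use a model that supports predictions
    messages=[{
        "role": "user",
        "content": f"Rename the function to `hello` and return the full file:\n{original_code}",
    }],
    prediction={"type": "content", "content": original_code},
)
print(response.choices[0].message.content)
```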
@MultiTheflyer 18 days ago
Nice, is there a similar feature in transformers too?
@TrelisResearch 16 days ago
Transformers isn't so much an inference library. Hugging Face has TGI for inference, and yes, you can add speculation there. Check my earlier video on speculative decoding.
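As a pointer, here is a rough sketch of enabling speculation in TGI; the --speculate flag and the client call are from memory of the TGI docs, so treat the details as assumptions:

```python
# Launch TGI with speculation (shown as a comment; --speculate enables
# n-gram speculation, or Medusa heads if the model ships with them):
#   text-generation-launcher --model-id meta-llama/Meta-Llama-3-8B-Instruct --speculate 2
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # assumed local TGI endpoint

# Speculation is server-side and transparent to the client; requests are
# unchanged, the server just verifies several drafted tokens per forward pass.
print(client.text_generation("The capital of France is", max_new_tokens=20))
```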
@fatshaddy-rz2wn 18 days ago
Please make something on MLX and fine-tuning small LMs locally on a not-very-high-end MacBook, and more.
@TrelisResearch 18 days ago
howdy, what specific kind of dataset and application are you looking for guidance on?
@fatshaddy-rz2wn 17 days ago
@TrelisResearch something more on creating Anthropic-style computer-use functionality
@eado9440 17 days ago
Pros vs. cons: faster but uses more tokens? Seems worse than caching, which is faster and cheaper. Doesn't seem efficient to regenerate everything for just a small change.
@TrelisResearch 16 days ago
Yeah, probably true for expensive models. But for cheaper models you'd probably pay for the speed-up. As I understand it, Cursor uses a 70B model for fast apply, and they wouldn't do fast apply if users didn't like the speed-up. So yeah, I think you're right, but there's still a case for the speed-up, maybe more so for cheaper models.
@darkmatter9583 19 days ago
Please cover data centers, NVIDIA supercomputing, AI Factories, even NVIDIA DGX SuperPOD and beyond, because there will be more. Please help.
@TrelisResearch 18 days ago
What do you mean?
@mdrafatsiddiqui 18 days ago
Hi Ronan, do you have an email ID to connect?
@TrelisResearch 16 days ago
Best to leave a comment here, or on a relevant page of trelis.com.