How to Speed Up Inference in LM Studio

1,257 views

Fahd Mirza

A day ago

This video shares tips and tricks to speed up inference in LM Studio when talking with models locally.
🔥 Buy Me a Coffee to support the channel: ko-fi.com/fahd...
🔥 Get 50% Discount on any A6000 or A5000 GPU rental, use following link and coupon:
bit.ly/fahd-mirza
Coupon code: FahdMirza
▶ Become a Patron 🔥 - / fahdmirza
#lmstudio
PLEASE FOLLOW ME:
▶ LinkedIn: / fahdmirza
▶ KZbin: / @fahdmirza
▶ Blog: www.fahdmirza.com
RELATED VIDEOS:
▶ Resource lmstudio.ai
All rights reserved © 2021 Fahd Mirza

Comments: 16
@Ayushsingh019
@Ayushsingh019 3 months ago
Nice video. I am trying vLLM to reduce LLM inference time, and will now try ExLlama for the same.
@fahdmirza
@fahdmirza 3 months ago
Keep it up
@ZIaIqbal
@ZIaIqbal 3 months ago
What is the type and memory of the GPU? And how much RAM does your machine have?
@kironlau
@kironlau 3 months ago
As shown in the video at 9:07: RAM: 47.13 GB, VRAM: 47.4 GB; the GPU is probably 2x RTX 4090 (24 GB each).
@ZIaIqbal
@ZIaIqbal 3 months ago
@@kironlau Thank you, and you are right, the video does show the specs. Just curious, have you done any llama.cpp testing with CPU-only models to see how big a model can be successfully run from RAM?
@kironlau
@kironlau 3 months ago
@@ZIaIqbal Yes, I have run Ollama (which uses llama.cpp) on my SBC (an RK3588 board); believe me, running on CPU is extremely slow, unless you have a 32+ core CPU server (and even then it still can't beat a 4060, I think). For general use, the VRAM and RAM needed to load a model are the same: if the model size (after quantization) is 8 GB, add about 10% of its size as a buffer (that is the RAM usage for short, no-context questions). With a longer context, RAM usage can even double. Take the official GLM-4-Chat 9B test as an example (it is a GPU test, but CPU behaves similarly):

Precision  RAM usage  Prefilling  Decode speed    Remarks
INT4       8 GB       0.2 s       23.3 tokens/s   input length 1000
INT4       10 GB      0.8 s       23.4 tokens/s   input length 8000
INT4       17 GB      4.3 s       14.6 tokens/s   input length 32000
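The rule of thumb in the comment above (model file size, plus roughly 10% buffer, plus KV cache that grows with context length) can be sketched as a rough estimator. The function name and the per-token KV-cache figure below are illustrative assumptions, not measured values; real cache cost depends on the model's architecture and quantization.

```python
def estimate_ram_gb(model_size_gb: float, context_tokens: int = 0,
                    kv_cache_gb_per_1k_tokens: float = 0.25) -> float:
    """Rough RAM estimate for running a quantized model locally.

    model_size_gb: on-disk size of the quantized model file.
    context_tokens: expected prompt + generation length in tokens.
    kv_cache_gb_per_1k_tokens: assumed KV-cache cost; varies per model.
    """
    buffer = model_size_gb * 0.10  # ~10% overhead for short prompts
    kv_cache = (context_tokens / 1000) * kv_cache_gb_per_1k_tokens
    return model_size_gb + buffer + kv_cache

# An 8 GB quant with a short prompt needs roughly 8.8 GB:
print(round(estimate_ram_gb(8.0), 2))  # -> 8.8
```

With a 32,000-token context the same model lands well above the short-prompt figure, matching the "usage can even double" observation in the benchmark rows above.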
@fahdmirza
@fahdmirza 3 months ago
I have also pasted the link to the GPU site with a discount coupon, thanks.
@nexx4582
@nexx4582 3 months ago
Nice one, Ivan!
@fahdmirza
@fahdmirza 3 months ago
Thanks, good friend. Please also subscribe to the channel.
@deepaktej7781
@deepaktej7781 3 months ago
But the results are not as good as before, because the temperature value is high, which makes the model more creative.
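For context on the point above: temperature divides the model's logits before sampling, so higher values flatten the token distribution and make low-probability tokens more likely ("more creative" but less precise). A minimal sketch, with made-up logit values:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, scaling by temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
low = softmax_with_temperature(logits, temperature=0.5)   # peaked, near-greedy
high = softmax_with_temperature(logits, temperature=2.0)  # flatter, more creative
```

At temperature 0.5 the top token dominates; at 2.0 the probabilities spread out, which is why high temperatures trade accuracy for variety.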
@fahdmirza
@fahdmirza 3 months ago
ok
@Abelius
@Abelius 3 months ago
Er... the only reason you get more tokens per second is that the second config offloads the entire model to VRAM, whereas the first config offloads nothing (not even one layer). Also, at 12:00 you can see the min_p and top_p settings get reverted to acceptable values (they are in the 0-1 range).
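As background for the point about the 0-1 range: top_p and min_p are both probability thresholds, which is why values outside that range make no sense. A minimal sketch of the two filters, simplified from what llama.cpp-style samplers do (the example probabilities are made up):

```python
def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return sorted(kept)

def min_p_filter(probs, min_p=0.05):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.9))   # -> [0, 1, 2]
print(min_p_filter(probs, 0.2))   # -> [0, 1, 2]
```

Setting top_p or min_p above 1 would keep every token on every step (or, in some implementations, misbehave entirely), which matches the commenter's point that the out-of-range values shown earlier in the video had to be reverted.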
@fahdmirza
@fahdmirza 2 months ago
Thanks for the feedback.
@Pauluz_The_Web_Gnome
@Pauluz_The_Web_Gnome 27 days ago
The speed has improved, but the answers are very bad... OMG!
@fahdmirza
@fahdmirza 23 days ago
that depends on the model
@Pauluz_The_Web_Gnome
@Pauluz_The_Web_Gnome 23 days ago
@@fahdmirza What do you recommend? I have an RTX 3090 Ti 24 GB OC Gaming.