Go Production: ⚡️ Super FAST LLM (API) Serving with vLLM !!!

27,814 views

1littlecoder

10 months ago

vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels
vLLM is flexible and easy to use with:
Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
vLLM seamlessly supports many Hugging Face models.
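For orientation, here is a minimal sketch of querying the OpenAI-compatible server once it is running; the model name, host, and port are illustrative placeholders, not values from the video:

```python
# Minimal sketch: query vLLM's OpenAI-compatible /v1/completions endpoint.
# Assumes the server was started separately, e.g. (placeholder model and port):
#   python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8000
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",  # must match the model the server was launched with
        "prompt": "San Francisco is a",
        "max_tokens": 64,
        "temperature": 0.7,
    },
)
print(resp.json()["choices"][0]["text"])
```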
vLLM - github.com/vllm-project/vllm
Google Colab - colab.research.google.com/dri...
❤️ If you want to support the channel ❤️
Support here:
Patreon - / 1littlecoder
Ko-Fi - ko-fi.com/1littlecoder

Comments: 99
@seanmurphy9273 10 months ago
My dude, you're the superhero of these tutorials! I was just thinking about how annoyed I am that these LLMs take so long to respond, and bam, you posted this wonderful video! Thank you!
@1littlecoder 10 months ago
Thanks so much for the kind words :)
@sujantkumarkv5498 10 months ago
Great work, man... can't thank you enough. Thanks again. Great to see more Indian AI tech talent out there :D
@mohegyux4072 10 months ago
Thank you, your videos are becoming a daily thing for me.
@1littlecoder 10 months ago
Happy to hear that!
@marilynlucas5128 10 months ago
❤ Great job as always. Keep it up!
@gpsb121993 1 month ago
Fantastic video! Just what I wanted to see.
@jankothyson 10 months ago
Wow, this is awesome!
@karamwise1 10 months ago
Awesome video with great value.
@1littlecoder 10 months ago
Thanks for watching!
@deabyam 10 months ago
Thanks, you are the vLLM of this space; love the speed of your videos. Colab lets more of us learn with less $$.
@1littlecoder 10 months ago
Absolutely, thanks for the support!
@MarceloLimaXP 10 months ago
Wow. Thank you for always bringing news to us ;)
@1littlecoder 10 months ago
My pleasure!
@1littlecoder 10 months ago
What would you love to see more of on this channel? It might help me prioritize new content.
@MarceloLimaXP 10 months ago
@@1littlecoder One thing I believe could be very useful is the ability to 'talk' with sales reports. Something that makes the AI understand that what it's accessing is a sales report and not just a bunch of CSV data. This would go far beyond 'talk to your PDF' ;)
@riyayadav8468 10 months ago
That's good 🔥🔥
@shamaldesilva9533 10 months ago
Inference was the main bottleneck of LLMs, so this is amazing, thank you so much 🤩🤩🤩. Please make a video on the PagedAttention algorithm 🤩🤩
@1littlecoder 10 months ago
Glad you liked it. Thanks for the suggestion!
@prestonmccauley43 8 months ago
Fantastic share!
@1littlecoder 8 months ago
Thank you! Cheers!
@thedoctor5478 10 months ago
You saved our lives.
@JohnVandivier 10 months ago
Great!
@shivayshakti6575 10 months ago
You are the LLM angel :)
@moondevonyt 10 months ago
Mad props to the creator for breaking down vLLM and its advantages over traditional LLM serving. That PagedAttention tech sounds lit, giving it those crazy throughput numbers. But, not gonna lie, using Google Colab as a production environment? Kinda sus. Still, respect the hustle for making it accessible to peeps without fancy GPUs. Mad respect for that grind.
@1littlecoder 10 months ago
Is this an AI comment?
@nic-ori 10 months ago
Thanks.
@arjunm2467 10 months ago
Great information, really appreciate it 🎉🎉🎉. If possible, can you show how we can add our own data (an Excel report) alongside the model, so that the LLM can answer from our data too?
@mdmishfaqahmed2138 10 months ago
Nice one.
@1littlecoder 10 months ago
Thank you! Cheers!
@anki1289 10 months ago
Amazing 🔥🔥. By the way, any idea how we can hook this into Gradio? That way sharing and access would be much easier.
@santoshshetty6 5 months ago
Thanks for this wonderful video. I want to know if we can do RAG over the model with vLLM. Also, can we run vLLM in a Kubernetes cluster?
@pavanpraneeth4659 10 months ago
Awesome, man. Does it work with LangChain?
@alx8439 7 months ago
Does it support quantised models? Is it supported in Oobabooga already? Quite an interesting topic - I hear people are using it in production often.
@davidfa7363 10 days ago
Hi. Really interesting and great work. If I am using a model via an OpenAI-like API, how can I implement a RAG system on top of it? How can I pass the prompt and the context to the model?
@aliissa4040 8 months ago
Hello, in your opinion, which is better to use in production, TGI or vLLM?
@sakshatkatyarmal2303 8 months ago
Awesome video. But while inferencing and hitting the v1/completions endpoint in Postman, it shows that the Jupyter notebook server is running instead of returning the answer from the LLM.
@cmeneseslob 10 months ago
Thanks for the video. The only problem is that I'm getting a "torch.cuda.OutOfMemoryError: CUDA out of memory." error when trying to serve the LLM in Google Colab. Is there a way to change the batch size using vLLM parameters?
@ilianos 10 months ago
So if I had to choose, one of the best LLMs from that selection would be Falcon?
@rageshantony2182 10 months ago
I read that it doesn't support quantized models. Using ExLlama for quantized LLaMA models is faster, with a low memory footprint.
@nat.serrano 9 months ago
Thanks for the confirmation. So what is the best option to expose LLaMA models? FastAPI?
@HarshVerma-xs6ux 10 months ago
Hey, amazing video, dude. Is it possible to run GGML or GPTQ models through vLLM?
@1littlecoder 10 months ago
Thanks. Not at this moment.
@aozynoob 10 months ago
How does it compare to 4-bit quantization?
@marilynlucas5128 10 months ago
It's like another inference engine I've seen for LLMs called OpenLLM.
@1littlecoder 10 months ago
Exactly. Nice observation. OpenLLM is on my list to cover 🚀
@marilynlucas5128 8 months ago
@@1littlecoder It doesn't give an OpenAI API token?
@Techonsapevole 10 months ago
Cool, does it also work CPU-only, like llama.cpp?
@1littlecoder 10 months ago
It currently doesn't support quantization, so I don't think a CPU would be powerful enough to run those models.
@davidlazer3641 10 months ago
Hey, your videos are nice. Can you please share the steps for testing my fine-tuned Llama 2 model? I already trained the Llama 2 7B chat model on my data using transformers, merged it with the base model, and pushed it to my Hugging Face repo.
@bashamsk1288 8 months ago
Does it support something like device_map="auto" for loading a model across multiple GPUs?
@True_Feelingsss... 4 months ago
How do I load a custom fine-tuned model using vLLM?
@nat.serrano 9 months ago
Why does it only support a few models? What are the limitations? When are they going to support Vicuna? Why use vLLM over FastAPI? Sorry for so many questions!
@SloanMosley 10 months ago
Does this support serverless? Also, how would you host it with SageMaker?
@mohamedsheded4143 9 months ago
Why do I get a runtime error when I create the API endpoint? Anyone facing the same issue?
@rageshantony2182 10 months ago
Please compare ExLlama vs vLLM.
@VijayasarathyMuthu 10 months ago
Could you show how to run this on Cloud Run or a similar service?
@ghaithkhelifi66 10 months ago
Hey my friend, I have this setup: a Ryzen 9 5900X with 48 GB DDR4 RAM and an MSI RTX 3090 OC. If you need help with testing, reply; I can give you remote access to my PC so you can help yourself, and I'll learn from you if I can.
@1littlecoder 10 months ago
That's so kind of you. I'll let you know here in a reply if such a setup is ever required. Honestly, every YouTuber has to pick a niche, and my niche is mostly people without powerful NVIDIA GPUs.
@MarceloLimaXP 10 months ago
@@1littlecoder Exactly. I live in Brazil, and here the price of a GPU machine is desperate =P
@loicbaconnier9150 10 months ago
So if I understand correctly, it doesn't work for GPTQ and GGML models? Does the API include chat completion and embeddings? Is it possible to use an instruct model?
@1littlecoder 10 months ago
You're correct. It doesn't work with quantized models yet. I'll check on the chat completion part.
@chiggly007 10 months ago
Does it support a chat completion endpoint?
@Gerald-xg3rq 3 months ago
Hi, nice video. How can I use vLLM on AWS SageMaker?
@nithinbhandari3075 10 months ago
Not able to replicate the result. Even after 10 minutes it was stuck at "pip install vllm". Let's see in a few months. By the way, I was trying serverless GPUs on RunPod. The cold start is 30 seconds (for the first request). It is just awesome, pay as you go. If you know any other method to reduce inference time, please share. Thanks.
@1littlecoder 10 months ago
Strange. Did it work?
@nithinbhandari3075 10 months ago
@@1littlecoder vLLM is not working, at least for me. RunPod serverless is working. (That's a totally different topic I'm talking about, not related to vLLM.)
@user-fc5em1rk1s 9 months ago
Can we use LangChain along with vLLM? When we use QA chains we actually create an LLM instance using LangChain. In that case, how can we use vLLM?
@larsuk9578 9 months ago
Exactly what I am trying to do!
@prudhvithtavva7891 10 months ago
I have fine-tuned Falcon-7B on a custom dataset using QLoRA. Can I use vLLM with the fine-tuned model instead of the pre-trained one?
@1littlecoder 10 months ago
I guess so, if you have pushed the final merged model to the HF Hub. Yes, you can (most likely).
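A minimal sketch of what that could look like with vLLM's offline Python API; the repo name below is a placeholder, not a real model:

```python
# Hypothetical example: serving a merged, fine-tuned model from the Hugging Face Hub with vLLM.
# "your-username/falcon-7b-merged" is a placeholder repo name.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/falcon-7b-merged", trust_remote_code=True)
sampling = SamplingParams(temperature=0.8, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence:"], sampling)
print(outputs[0].outputs[0].text)
```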
@mtteslian9159 10 months ago
Is it possible to use this solution through LangChain?
@solomonaryeetey7370 10 months ago
Hey buddy, can you show how to deploy vLLM with SkyPilot?
@unimposings 10 months ago
Can it run on Colab 24/7? How much would it cost to run for one month?
@pointlesspos8440 10 months ago
Hey, do you know of a solution for this: I'm looking for something similar to ChatGPT, where you host and serve one LLM and multiple users can access it. Or do you have to serve one LLM per user? I'm looking to build a chatbot with a QLoRA adapter trained on it for tech support/sales.
@pramodpatil2883 7 months ago
Hey, did you find a solution for this? I'm facing the same problem; your help would be appreciated.
@pointlesspos8440 7 months ago
@@pramodpatil2883 No. What I have found is that self-served LLMs really start to lag after a short period of time. For most of my purposes it would be fine, since I'm just doing tech support chat. Multi-user could also work with several small models spun up; I have 48 GB, so maybe I could run three or four chat sessions with 7B models. I can run a 70B, which is good, but not with four simultaneous users. Even so, I haven't been able to get a model running as well as ChatGPT with my own docs embedded. What kind of solution are you looking to build?
@TechieBlogging 7 months ago
Does vLLM work on OpenAI Whisper models?
@Ryan-yj4sd 10 months ago
How do I do batch inference?
@rkp23in 4 months ago
Can we launch an Ollama model as an API running in Google Colab?
@Gokulhraj 9 months ago
Can we use it with LangChain?
@viratchoudhary6827 10 months ago
Hi bro, can you give me a reference on "how to hide files like whisper-jax does on Hugging Face"?
@brandomiranda6703 4 months ago
Why can't you use vLLM for training?
@urisrssfeeds 10 months ago
How long does the pip install take? Mine has been going for about 45 minutes in Google Colab.
@1littlecoder 10 months ago
I guess it took about 20 minutes in my case.
@davidlazer3641 10 months ago
I ran the exact command you gave in the free Colab tier and it gives me CUDA out of memory. What can I do? Any suggestions?
@1littlecoder 10 months ago
Did you use the same model as mine or some other, bigger model?
@don-jp2rs 1 month ago
But why use vLLM when you can use the ChatGPT API?
@fxhp1 5 months ago
You don't need a tunnel if you set --host 0.0.0.0.
@yosefmoatti3633 5 months ago
Very interesting video, thanks! Unfortunately, I ran into problems with the initial "! pip install vllm": ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. lida 0.0.10 requires kaleido, which is not installed. lida 0.0.10 requires python-multipart, which is not installed. tensorflow-probability 0.22.0 requires typing-extensions
@michabbb 10 months ago
Because it's another Indian guy talking way too fast, I have to slow down the playback speed to understand anything..... 🙄
@1littlecoder 10 months ago
Did you manage to understand it when you slowed down the speed?
@heythere6390 10 months ago
Can I host my own fine-tuned Falcon model from HF using the same mechanism? E.g. iamauser/falcon-7b-finetuned?
@1littlecoder 10 months ago
Yes. I think it uses transformers to download the model, so it should ideally work.
@heythere6390 10 months ago
@@1littlecoder Thanks!
The EASIEST way to RUN Llama2 like LLMs on CPU!!!
8:15
1littlecoder
12K views
Fast LLM Serving with vLLM and PagedAttention
32:07
Anyscale
18K views
All You Need To Know About Running LLMs Locally
10:30
bycloud
118K views
Build Blazing-Fast LLM Apps with Groq, Langflow, & Langchain
1:01:18
"okay, but I want Llama 3 for my specific use case" - Here's how
24:20
vLLM - Turbo Charge your LLM Inference
8:55
Sam Witteveen
15K views
Why I Switched from Python to Rust for AI Deployment
9:57
Code In a Jiffy
18K views
Ollama UI Tutorial - Incredible Local LLM UI With EVERY Feature
10:11
Matthew Berman
87K views
SkyPilot: Run AI on Any Cloud
30:09
Anyscale
1.8K views
I wish every AI Engineer could watch this.
33:49
1littlecoder
57K views
host ALL your AI locally
24:20
NetworkChuck
788K views