Deploying Llama3 with Inference Endpoints and AWS Inferentia2

6,301 views

Julien Simon

2 months ago

Are you curious about deploying large language models efficiently? In this video, I'll show you how to deploy a Llama3 8B model using Hugging Face Inference Endpoints and the powerful AWS Inferentia2 accelerator. I'll be using the latest Hugging Face Text Generation Inference container to demonstrate the process of running streaming inference with the OpenAI client library. Stay tuned as I also delve into Inferentia2 benchmarks, offering insights into its performance.
The Hugging Face Inference Endpoints provide a seamless way to deploy models, and when coupled with the AWS Inferentia2 accelerator, you can achieve remarkable efficiency. Don't miss out on this opportunity to enhance your deployment game!
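The streaming setup described above can be sketched as follows. TGI exposes an OpenAI-compatible chat API, so the standard `openai` client can be pointed at the endpoint URL. The endpoint URL, token, and `model="tgi"` value below are illustrative placeholders, not values taken from the video:

```python
# Sketch of streaming inference against a Hugging Face Inference Endpoint
# running Text Generation Inference (TGI), using the OpenAI client library.

def delta_text(chunk):
    """Extract the incremental text carried by one streamed chunk."""
    return chunk.choices[0].delta.content or ""

def stream_chat(endpoint_url, hf_token, prompt, max_tokens=256):
    """Stream a completion from the endpoint, printing tokens as they arrive."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=endpoint_url, api_key=hf_token)
    stream = client.chat.completions.create(
        model="tgi",  # the endpoint serves a single model; "tgi" is conventional
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    answer = ""
    for chunk in stream:
        piece = delta_text(chunk)
        print(piece, end="", flush=True)  # tokens appear as they are generated
        answer += piece
    return answer

# Example call (placeholder URL and token):
# stream_chat("https://<endpoint-name>.endpoints.huggingface.cloud/v1/",
#             "hf_...", "Why is the sky blue?")
```

Streaming keeps perceived latency low: the first tokens reach the user while the rest of the answer is still being generated.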
#LargeLanguageModels #HuggingFace #AWSInferentia2 #Deployment #MachineLearning
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
Inference Endpoints:
huggingface.co/docs/inference...
Model:
huggingface.co/meta-llama/Met...
Notebook:
gitlab.com/juliensimon/huggin...
Inferentia2 benchmarks:
awsdocs-neuron.readthedocs-ho...

Comments: 9
@suhasshirbavikar9496 2 months ago
great video
@juliensimonfr 2 months ago
Thanks!
@user-vo5ce6kn5t 1 month ago
Hi, I have a question. I have deployed a fine-tuned Llama3 on AWS, but it generates repeated answers, and if I adjust the payload it cuts off the end of the response. Please suggest a solution; I have changed the max length and max token size multiple times and still face this issue.
@fadauto 2 months ago
Thanks for the great video! The costs per M tokens seem really high in the Neuron inf2 benchmark table. For Llama-3-70B they are around $115 per M tokens. Even with 3-year reserved instances, the prices would still be way above GPT-4 token pricing, with similar throughput. Are vLLM or TensorRT-LLM on GPUs better options than Inferentia2? Or can vLLM be combined with this?
@juliensimonfr 2 months ago
Thanks! Yes, your numbers are correct (ref: awsdocs-neuron.readthedocs-hosted.com/en/latest/general/benchmarks/inf2/inf2-performance.html), and no, they're not great. Having said that, I meet a lot of enterprise customers and none so far have mentioned 70B models. For conversational apps, the huge majority relies on 7-8B models with RAG, and very often fine-tuning. 1M tokens for Llama3 8B costs $5 or $6 on inf2.48xlarge (on demand price), which is hard to beat. With new small models like Phi-3, cost should be even lower :)
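The $5-6 figure above can be sanity-checked with quick arithmetic. The hourly price (~$12.98 on demand for inf2.48xlarge) and the sustained throughput (~650 tokens/s across concurrent requests) used below are illustrative assumptions, not official benchmark numbers:

```python
# Back-of-the-envelope check of the ~$5-6 per 1M tokens figure
# for Llama3 8B on inf2.48xlarge.

def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    """USD cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Assumed on-demand price and throughput-optimized configuration:
print(f"${cost_per_million_tokens(12.98, 650):.2f} per 1M tokens")  # ~$5.55
```

The same formula explains the latency/throughput trade-off discussed in this thread: at low batch sizes the aggregate tokens/s drops sharply, so the same hourly price spreads over far fewer tokens and the per-token cost rises.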
@juliensimonfr 2 months ago
A good alternative for larger models (70-100B) is our TGI inference server, combined with your cloud of choice. See huggingface.co/blog/tgi-benchmarking
@fadauto 1 month ago
@@juliensimonfr Thanks Julien! That cost looks great, but apparently it only applies to the throughput-optimized configurations; I see that the cost for latency-optimized configurations is above $22 per 1M tokens for Llama-3-8B. So using inf2 instances wouldn't make much sense for real-time applications with that setting, right? Unless combined with other techniques like continuous batching, which would bring the costs closer to the throughput-optimized figures?
@samyrockstar 20 days ago
Is a SageMaker endpoint a good option for Llama3 production use?
@juliensimonfr 20 days ago
Sure! SageMaker is solid :)