Deploying Llama3 with Inference Endpoints and AWS Inferentia2

6,301 views

Julien Simon

2 months ago

Are you curious about deploying large language models efficiently? In this video, I'll show you how to deploy a Llama3 8B model using Hugging Face Inference Endpoints and the powerful AWS Inferentia2 accelerator. I'll be using the latest Hugging Face Text Generation Inference container to demonstrate the process of running streaming inference with the OpenAI client library. Stay tuned as I also delve into Inferentia2 benchmarks, offering insights into its performance.
The Hugging Face Inference Endpoints provide a seamless way to deploy models, and when coupled with the AWS Inferentia2 accelerator, you can achieve remarkable efficiency. Don't miss out on this opportunity to enhance your deployment game!
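The streaming setup described above can be sketched as follows. TGI exposes an OpenAI-compatible chat API, so the standard `openai` client can be pointed at the endpoint URL. The endpoint URL, token, and `model="tgi"` value below are illustrative placeholders, not values taken from the video:

```python
# Sketch of streaming inference against a Hugging Face Inference Endpoint
# running Text Generation Inference (TGI), using the OpenAI client library.

def delta_text(chunk):
    """Extract the incremental text carried by one streamed chunk."""
    return chunk.choices[0].delta.content or ""

def stream_chat(endpoint_url, hf_token, prompt, max_tokens=256):
    """Stream a completion from the endpoint, printing tokens as they arrive."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=endpoint_url, api_key=hf_token)
    stream = client.chat.completions.create(
        model="tgi",  # the endpoint serves a single model; "tgi" is conventional
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    answer = ""
    for chunk in stream:
        piece = delta_text(chunk)
        print(piece, end="", flush=True)  # tokens appear as they are generated
        answer += piece
    return answer

# Example call (placeholder URL and token):
# stream_chat("https://<endpoint-name>.endpoints.huggingface.cloud/v1/",
#             "hf_...", "Why is the sky blue?")
```

Streaming keeps perceived latency low: the first tokens reach the user while the rest of the answer is still being generated.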
#LargeLanguageModels #HuggingFace #AWSInferentia2 #Deployment #MachineLearning
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
Inference Endpoints:
huggingface.co/docs/inference...
Model:
huggingface.co/meta-llama/Met...
Notebook:
gitlab.com/juliensimon/huggin...
Inferentia2 benchmarks:
awsdocs-neuron.readthedocs-ho...

Comments: 9
@suhasshirbavikar9496 2 months ago
great video
@juliensimonfr 2 months ago
Thanks!
@user-vo5ce6kn5t 1 month ago
Hi, I have a question. I have deployed a fine-tuned Llama3 on AWS, but it generates repeated answers, and if I adjust the payload it cuts off the end of the response. Please suggest a solution; I have changed the max length and max token size multiple times and still face this issue.
@fadauto 2 months ago
Thanks for the great video! The costs per M tokens seem really high in the Neuron inf2 benchmark table. For Llama-3-70B they are around $115 per M tokens. Even with 3-year reserved instances, the prices would still be way above GPT-4 token pricing, with similar throughput. Are vLLM or TensorRT-LLM on GPUs better options than Inferentia2? Or can vLLM be combined with this?
@juliensimonfr 2 months ago
Thanks! Yes, your numbers are correct (ref: awsdocs-neuron.readthedocs-hosted.com/en/latest/general/benchmarks/inf2/inf2-performance.html), and no, they're not great. Having said that, I meet a lot of enterprise customers and none so far have mentioned 70B models. For conversational apps, the huge majority relies on 7-8B models with RAG, and very often fine-tuning. 1M tokens for Llama3 8B costs $5 or $6 on inf2.48xlarge (on demand price), which is hard to beat. With new small models like Phi-3, cost should be even lower :)
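The $5-6 figure above can be sanity-checked with quick arithmetic. The hourly price (~$12.98 on demand for inf2.48xlarge) and the sustained throughput (~650 tokens/s across concurrent requests) used below are illustrative assumptions, not official benchmark numbers:

```python
# Back-of-the-envelope check of the ~$5-6 per 1M tokens figure
# for Llama3 8B on inf2.48xlarge.

def cost_per_million_tokens(hourly_price_usd, tokens_per_second):
    """USD cost to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# Assumed on-demand price and throughput-optimized configuration:
print(f"${cost_per_million_tokens(12.98, 650):.2f} per 1M tokens")  # ~$5.55
```

The same formula explains the latency/throughput trade-off discussed in this thread: at low batch sizes the aggregate tokens/s drops sharply, so the same hourly price spreads over far fewer tokens and the per-token cost rises.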
@juliensimonfr 2 months ago
A good alternative for larger models (70-100B) is our TGI inference server, combined with your cloud of choice. See huggingface.co/blog/tgi-benchmarking
@fadauto 1 month ago
@@juliensimonfr Thanks Julien! That cost looks great, but apparently it only applies to the throughput-optimized configurations; I see that the cost for latency-optimized configurations is above $22 per 1M tokens for Llama-3-8B. So using inf2 instances wouldn't make much sense for real-time applications with that setting, right? Unless combined with other techniques like continuous batching, which would bring the costs closer to the throughput-optimized figures?
@samyrockstar 20 days ago
Is a SageMaker endpoint a good option for Llama3 production use?
@juliensimonfr 20 days ago
Sure! SageMaker is solid :)