Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

12,750 views

MLOps.community

9 months ago

Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
// Abstract
Getting the right LLM inference stack means choosing the right model for your task, and running it on the right hardware, with proper inference code. This talk will go through popular inference stacks and set-ups, detailing what makes inference costly. We'll talk about the current generation of open-source models and how to make the best use of them, but we will also touch on features currently missing from the open-source serving stack as well as what the future generations of models will unlock.
// Bio
Timothée Lacroix, aged 31, is Chief Technical Officer in charge of technical issues relating to product efficacy and research. He started as an engineer at Facebook AI Research in New York in 2015, where he completed his thesis between 2016 and 2019, in collaboration with École des Ponts, on tensor factorization for recommender systems. He continued his career at Meta until 2023, when he co-founded ‪@Mistral-AI‬.
// Sign up for our Newsletter to never miss an event:
mlops.community/join/
// Watch all the conference videos here:
home.mlops.community/home/col...
// Check out the MLOps Community podcast: open.spotify.com/show/7wZygk3...
// Read our blog:
mlops.community/blog
// Join an in-person local meetup near you:
mlops.community/meetups/
// MLOps Swag/Merch:
mlops-community.myshopify.com/
// Follow us on Twitter:
/ mlopscommunity
// Follow us on LinkedIn:
/ mlopscommunity

Comments: 18
@evermorecurious91 7 months ago
This is gold!!!
@iandanforth 9 months ago
There seems to be a mistake in the cost estimate at 21:53. It uses the price for the A10 but the throughput of the H100. I believe the actual cost estimate would be $48, not $15.
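
To put numbers on this kind of estimate, here is a minimal sketch of how a dollars-per-tokens figure combines a GPU's hourly price with its sustained throughput; the prices and throughputs below are illustrative assumptions, not the figures from the slides:

```python
# Napkin math for a GPU inference cost estimate. All prices and
# throughputs are illustrative assumptions, not the slide's figures.
A10_PRICE = 1.0      # assumed on-demand $/hr for an A10
H100_PRICE = 3.2     # assumed on-demand $/hr for an H100
A10_TOKS = 900.0     # assumed sustained tokens/s on an A10
H100_TOKS = 3000.0   # assumed sustained tokens/s on an H100

def cost_usd(price_per_hr, tokens_per_s, n_tokens):
    """Dollar cost to generate n_tokens at a given throughput and price."""
    return price_per_hr * (n_tokens / tokens_per_s) / 3600.0

N = 1_000_000_000  # one billion tokens, say
# Pairing one GPU's price with another GPU's throughput understates the cost:
print(cost_usd(A10_PRICE, H100_TOKS, N))   # optimistic mix-up
print(cost_usd(H100_PRICE, H100_TOKS, N))  # consistent H100 pairing
```

The point the comment makes is exactly this pairing error: price and throughput must come from the same GPU for the ratio to mean anything.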
@mndflctzn 8 months ago
This is awesome. Thanks for sharing, super useful.
@windmaple 9 months ago
Great talk!
@frank96997 5 months ago
Great talk! Is there a link to the slides?
@Gerald-iz7mv 4 months ago
Hi, what benchmark did he run to generate the plots? Any open-source GitHub links?
@boussouarsari4482 5 months ago
It's possible that I'm misunderstanding, but given our use of a significantly large key-value cache (2 GB multiplied by the batch size), can we still assert that memory-bandwidth usage is driven solely by the model's weights?
@yaxiongzhao6640 29 days ago
The KV cache's size follows directly from the attention layers' dimensions, which in turn are proportional to the total weight count. So the model weights still proportionally determine the KV cache size, hence the statement.
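
For concreteness, a minimal sketch of the sizing being debated here, assuming a Llama/Mistral-style 7B decoder in fp16 (my assumption, not stated in the thread). It also shows how a "2 GB times batch size" figure is consistent with vanilla multi-head attention at a 4k context, while grouped-query attention shrinks it:

```python
# Back-of-envelope KV-cache sizing. Architecture numbers are assumptions
# for a Llama/Mistral-style 7B in fp16, not figures from the talk.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # 2x for keys and values, stored per layer, per head, per position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

# Vanilla multi-head attention (32 KV heads): ~2 GiB per 4k-token sequence,
# matching the "2 GB multiplied by the batch size" figure quoted above.
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30)  # 2.0

# Grouped-query attention as in Mistral 7B (8 KV heads) cuts this 4x:
print(kv_cache_bytes(32, 8, 128, 4096, 1) / 2**30)   # 0.5
```

Note that at large batches and long contexts the KV-cache reads do add materially to memory traffic, so "bandwidth is driven by weights alone" is an approximation that degrades as sequence length times batch size grows.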
@janilbolswong1953 8 months ago
@5:40 Why do we need to load the entire model all the time? Can't we just load it once? If so, we could reduce the memory movement, and the intersection would shift left.
@jjh5474 8 months ago
I guess "memory movement" means movement from GPU memory (HBM) to the GPU's compute units. Model parameters are stored in GPU memory, not in the compute units, so they must be moved from HBM to the compute units on every forward pass.
@fraternitas5117 3 months ago
Yes, it needs to be loaded in the GPU all the time. Advanced users optimize their applications by transferring as many bytes as the memory system allows each cycle, to fully utilize the available bandwidth.
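
A minimal sketch of the trade-off this thread describes, with assumed A100-class specs (~312 TFLOP/s bf16, ~2 TB/s HBM bandwidth) and a 7B fp16 model. The weights stay resident in HBM, but every decoding step still streams all of them through the compute units, so the memory time is fixed per step while compute time grows with batch size:

```python
# Why "loading once" doesn't remove the traffic: each decoding step streams
# all weights from HBM to the compute units, whatever the batch size.
# Hardware numbers are assumptions (A100-class), not taken from the talk.
P = 7e9                # parameters in a 7B model
WEIGHT_BYTES = 2 * P   # fp16/bf16 weights, 2 bytes each
MEM_BW = 2e12          # HBM bandwidth, bytes/s (assumed)
FLOPS = 312e12         # peak bf16 FLOP/s (assumed)

t_mem = WEIGHT_BYTES / MEM_BW            # fixed cost per step: ~7 ms
t_compute = lambda B: 2 * P * B / FLOPS  # grows with batch size B

# The two lines intersect where 2P/MEM_BW == 2*P*B/FLOPS, i.e.
# B* = FLOPS / MEM_BW. The exact constant depends on bytes-per-weight
# and on which FLOPs you count, which is what the thread below debates.
print(FLOPS / MEM_BW)  # ~156 under these assumptions
```

Batching amortizes the fixed weight traffic over more tokens, which is why the crossover batch size matters so much for cost.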
@aneeinaec 1 month ago
Is that Ryan Gosling? ❤
@eduardoalvarez7152 6 months ago
The math around 6:50 for the A100 batch size isn't working out. It would be great if the values used to calculate the batch size of 400 were provided. Based on the equations given for compute time and model load time, the point of intersection is FLOPS/(2*MemoryBandwidth), NOT the (2*FLOPS)/MemoryBandwidth shown in the video.
@TheAIEpiphany 2 months ago
I believe it was just napkin math. In reality he didn't account for the KV cache at all in the P/mem_bandwidth line, which is a function of sequence length; that seems like the biggest approximation error here. For the second line he ignored attention FLOPs and used just the MLP FLOPs (the error of that approximation grows with sequence length and depends on model size; e.g. for a 7B model with a long sequence, the attention term might actually matter). Additionally, peak FLOPs is a function of the data type and the operation being executed; he's assuming bf16/fp16, which is what Mistral 7B uses, giving ~312 TFLOP/s on an A100. All in all, this is useful if you understand exactly which assumptions he's making.
@Venkat2811 2 months ago
@TheAIEpiphany Yes, I was wondering about the KV cache as well. Your explanation makes sense.
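
As a sketch of the refinement described in this thread, here is the same napkin model with the KV-cache traffic added back in. The architecture and hardware numbers are assumptions (a Mistral-7B-like model with grouped-query attention on an A100-class GPU), not values from the talk:

```python
# Extends the weights-only napkin model with KV-cache read traffic.
# All numbers are assumptions, not figures from the slides.
P = 7e9                                    # parameters, fp16 (2 bytes each)
FLOPS = 312e12                             # peak bf16 FLOP/s (assumed)
MEM_BW = 2e12                              # HBM bandwidth, bytes/s (assumed)
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # layers * kv_heads * head_dim, fp16

def step_time(batch, seq_len):
    # memory side: weights plus the KV cache of every sequence in the batch
    t_mem = (2 * P + KV_BYTES_PER_TOKEN * seq_len * batch) / MEM_BW
    # compute side: MLP-dominated estimate, ~2 FLOPs per parameter per token
    t_comp = 2 * P * batch / FLOPS
    return max(t_mem, t_comp)

# Once KV reads per sequence exceed the compute cost per sequence, larger
# batches no longer move you out of the memory-bound regime:
for seq_len in (512, 4096, 32768):
    print(seq_len, round(step_time(64, seq_len) * 1e3, 2), "ms")
```

Under these assumptions the memory-bound line itself grows with batch size at long contexts, which is why omitting the KV cache is the biggest approximation error in the weights-only version.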
@MLOps 3 months ago
Join us at our first in-person conference on June 25 all about AI Quality: www.aiqualityconference.com/
@AbdulK-kr2jv 3 months ago
What a horrible, unethical response on the ethics of training data.