Exploring the Latency/Throughput & Cost Space for LLM Inference // Timothée Lacroix // CTO Mistral

Views: 18,608

Days ago

Comments: 20

@iandanforth 1 year ago

There seems to be a mistake in the cost estimate at 21:53. It uses the price for the A10 but the throughput of the H100. I believe the actual cost estimate would be $48, not $15.
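A minimal sketch of why mixing one GPU's price with another GPU's throughput skews a cost estimate. The prices and throughputs below are hypothetical placeholders, not the figures from the talk, and `cost_per_million_tokens` is an illustrative helper name.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


# Hypothetical numbers for illustration only (NOT taken from the talk).
small_gpu = {"price_per_hour": 1.0, "tokens_per_second": 500.0}
large_gpu = {"price_per_hour": 4.0, "tokens_per_second": 2500.0}

# Consistent estimate: price and throughput come from the same GPU.
consistent = cost_per_million_tokens(
    small_gpu["price_per_hour"], small_gpu["tokens_per_second"]
)

# Mixed estimate: the cheap GPU's price with the fast GPU's throughput.
mixed = cost_per_million_tokens(
    small_gpu["price_per_hour"], large_gpu["tokens_per_second"]
)

print(f"consistent: ${consistent:.3f} per 1M tokens")
print(f"mixed:      ${mixed:.3f} per 1M tokens")
print(f"the mixed estimate is too low by a factor of {consistent / mixed:.1f}")
```

With these placeholder inputs the mixed estimate understates the cost by exactly the throughput ratio between the two GPUs, which is the kind of gap the comment points out.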

@evermorecurious91 1 year ago

This is gold!!!

@mndflctzn 1 year ago

This is awesome. Thanks for sharing, super useful.

@eduardoalvarez7152 1 year ago

The math around 6:50 for the A100 batch size isn't working out. It would be great if the values used to calculate the 400 batch size were provided. Based on the equations given for compute time and model load time, the point of intersection is FLOPS / (2 * MemoryBandwidth), not the (2 * FLOPS) / MemoryBandwidth shown in the video.
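A rough sketch of the intersection being discussed, assuming compute time is 2 * P * B / FLOPS and load time is bytes_per_param * P / bandwidth. The 312 TFLOP/s figure is the A100 bf16/fp16 peak quoted elsewhere in this thread; the 1.555 TB/s bandwidth (A100 40GB) is an assumption, since the comment asks for exactly these inputs and they are not given here.

```python
def crossover_batch_size(peak_flops: float, mem_bandwidth: float,
                         bytes_per_param: float) -> float:
    """Batch size B* at which compute time equals weight-load time.

    compute time = 2 * P * B / peak_flops
    load time    = bytes_per_param * P / mem_bandwidth
    Setting them equal, P cancels out:
        B* = bytes_per_param * peak_flops / (2 * mem_bandwidth)
    """
    return bytes_per_param * peak_flops / (2 * mem_bandwidth)


A100_FLOPS = 312e12      # dense bf16/fp16 peak, FLOP/s
A100_BANDWIDTH = 1.555e12  # assumed HBM bandwidth, bytes/s

# The comment's formula, FLOPS / (2 * bandwidth): load time taken as P / bandwidth.
b_comment = crossover_batch_size(A100_FLOPS, A100_BANDWIDTH, bytes_per_param=1)

# Same equations, but counting 2 bytes per parameter (fp16 weights).
b_fp16 = crossover_batch_size(A100_FLOPS, A100_BANDWIDTH, bytes_per_param=2)

# The expression shown in the video, (2 * FLOPS) / bandwidth.
b_video = 2 * A100_FLOPS / A100_BANDWIDTH

print(f"FLOPS / (2 * BW): {b_comment:.0f}")
print(f"FLOPS / BW:       {b_fp16:.0f}")
print(f"(2 * FLOPS) / BW: {b_video:.0f}")
```

Under these assumed inputs the video's expression lands near 400, which suggests the slide's number came from that form, while the formula derived in the comment gives roughly a quarter of that.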

@TheAIEpiphany 8 months ago

I believe it was just a piece of napkin math. In reality he didn't count the KV cache at all in the P / mem bandwidth line, and the KV cache is a function of sequence length. That seems like the biggest approximation error I see here. For the second line he left out attention FLOPs and used just the MLP FLOPs. The error of this approximation increases as the sequence grows and depends on the model size you're using (e.g. for a 7B model with a long sequence, that term might actually be important). Additionally, peak FLOPS is a function of the data type and the operation you're executing. He's assuming bf16/fp16, which is what Mistral 7B uses, and that gives you ~312 TFLOP/s for the A100. All in all, this is useful if you understand exactly the assumptions he's making.
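To get a feel for how much the attention term matters, here is a back-of-the-envelope sketch. It assumes a generic 7B-class shape (7e9 parameters, 32 layers, model width 4096) and the common estimate of 4 * seq_len * d_model attention FLOPs per layer per generated token; none of these values are taken from the talk.

```python
def decode_flops_per_token(n_params: float, n_layers: int,
                           d_model: int, seq_len: int) -> tuple[float, float]:
    """Return (weight_flops, attention_flops) for generating one token."""
    # One multiply and one add per parameter.
    weight_flops = 2 * n_params
    # Per layer: scores against seq_len cached keys (2 * seq_len * d_model),
    # then the weighted sum over cached values (another 2 * seq_len * d_model).
    attention_flops = 4 * n_layers * seq_len * d_model
    return weight_flops, attention_flops


# Assumed 7B-class shape, for illustration only.
N_PARAMS, N_LAYERS, D_MODEL = 7e9, 32, 4096

ratios = {}
for seq_len in (512, 4096, 32768):
    weight_flops, attention_flops = decode_flops_per_token(
        N_PARAMS, N_LAYERS, D_MODEL, seq_len
    )
    ratios[seq_len] = attention_flops / weight_flops
    print(f"seq_len={seq_len:>6}: attention / weight FLOPs = {ratios[seq_len]:.3f}")
```

Under these assumptions the neglected attention term is a few percent of the weight FLOPs at short contexts and becomes comparable to them at very long ones, which matches the comment's point that the error grows with sequence length.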

@Venkat2811 8 months ago

@TheAIEpiphany Yes, I was looking for the KV cache as well. Your explanation makes sense.

@iogbole 4 months ago

The right continuous profiling solution can help you find B* (7:23) with much less effort. 18:23 is where the power of low-level tracing with eBPF comes in; otherwise, the performance overhead is simply too high.

@frank96997 1 year ago

Great talk! Is there a link to the slides for this talk?

@janilbolswong1953 1 year ago

@5:40 Why do we need to load the entire model every time? Can't we just load it once? If so, we could reduce the amount of memory movement, and the intersection would shift left.

@attention42 1 year ago

I guess "memory movement" means movement from GPU memory (HBM) to the GPU's compute units. The model parameters are stored in GPU memory, not in the compute units, so they have to be moved from HBM to the compute units on every forward pass.
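As a rough illustration of what that per-pass movement costs, the sketch below computes the time needed just to stream the weights from HBM once. It assumes a 7B-parameter model in fp16 (2 bytes per parameter) and an HBM bandwidth of 1.555 TB/s (A100 40GB); both are assumptions for the sake of the arithmetic, and the function name is illustrative.

```python
def min_decode_step_seconds(n_params: float, bytes_per_param: float,
                            mem_bandwidth: float) -> float:
    """Lower bound on one forward pass: time to read every weight from HBM once."""
    return n_params * bytes_per_param / mem_bandwidth


# Assumed values: 7B parameters, fp16 weights, 1.555e12 bytes/s of HBM bandwidth.
step_seconds = min_decode_step_seconds(7e9, 2, 1.555e12)
max_steps_per_second = 1 / step_seconds

print(f"weight streaming floor: {step_seconds * 1e3:.1f} ms per forward pass")
print(f"upper bound: {max_steps_per_second:.0f} forward passes per second")
```

Because this floor does not depend on the batch size, every sequence added to the batch reuses the same weight read, which is why larger batches raise throughput until compute time catches up.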

@fraternitas5117 9 months ago

Yes, it needs to be loaded in the GPU all the time. Advanced users optimize their applications by sending as many bytes as the memory maximum allows, so that all of the memory is utilized in each clock cycle.

@boussouarsari4482 11 months ago

It's possible that I'm misunderstanding, but given our use of a significantly large key-value cache (2GB multiplied by the batch size), can we still assert that the memory bandwidth is solely influenced by the model's weights?

@yaxiongzhao6640 7 months ago

The KV cache's size follows directly from the attention layers' size, which in turn is proportional to the model's total weight count. So the model weights still proportionally determine the KV cache size, hence the statement.
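The 2GB-per-sequence figure from the question above can be reproduced with a short calculation. This sketch assumes a 7B-class shape without grouped-query attention (32 layers, 32 KV heads, head dimension 128), a 4096-token context, and fp16 values; these are assumptions chosen because they reproduce the 2GB figure, not values confirmed by the talk.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """KV cache size for ONE sequence, in bytes."""
    # Factor 2: one key vector and one value vector per token, layer, and head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed shape: 32 layers, 32 KV heads, head_dim 128, 4096 tokens, fp16.
per_sequence = kv_cache_bytes(32, 32, 128, 4096)
# Same shape with 8 KV heads (grouped-query attention) for comparison.
per_sequence_gqa = kv_cache_bytes(32, 8, 128, 4096)

WEIGHT_BYTES = 7e9 * 2  # assumed 7B parameters in fp16


def bytes_read_per_step(batch_size: int) -> float:
    """Bytes streamed from HBM per decode step when every sequence is at full length."""
    return WEIGHT_BYTES + batch_size * per_sequence


print(f"KV cache per sequence:       {per_sequence / 2**30:.1f} GiB")
print(f"KV cache per sequence (GQA): {per_sequence_gqa / 2**30:.1f} GiB")
for batch_size in (1, 8, 32):
    share = batch_size * per_sequence / bytes_read_per_step(batch_size)
    print(f"batch={batch_size:>2}: KV cache is {share:.0%} of bytes read")
```

Under these assumptions the KV cache traffic overtakes the weight traffic at a batch size of about 7, so at full context length the memory time is not set by the weights alone. The cache size scales with the model's shape, but also with batch size and sequence length.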