Mastering LLM Inference Optimization: From Theory to Cost-Effective Deployment - Mark Moyou

6,329 views

AI Engineer

1 day ago

LLM inference is not a typical deep learning model deployment, nor is it trivial to manage at scale for performance and cost. Sizing a production-grade LLM deployment requires understanding the model(s), the compute hardware, quantization and parallelization methods, KV Cache budgets, input and output token length predictions, model adapter management, and much more.
If you want to deeply understand these topics and their effects on LLM inference cost and performance, you will enjoy this talk.
This talk will cover the following topics:
Why LLM inference is different from standard deep learning inference
Current and future NVIDIA GPU overview - which GPU(s) for which models and why
Understanding the importance of building inference engines
Deep recap of the attention mechanism, along with popular attention variants used in production
Deep dive on KV Cache and managing KV Cache budgets to increase throughput per model deployment
Parallelism (reducing latency) - mainly tensor parallelism, but data, sequence, pipeline, and expert parallelism will be highlighted
Quantization methods on weights, activations, and KV Cache to reduce engine sizes for more effective GPU utilization
Increasing throughput with in-flight batching and other techniques
Detailed performance analysis of LLM deployments, looking at time to first token, inter-token latency, LLM deployment characterization, and more to help reduce deployment costs
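As a rough illustration of the KV Cache budgeting theme above, here is a back-of-the-envelope sizing sketch. All model and hardware figures (a 7B-class model with 32 layers, 32 KV heads, head dimension 128, FP16 weights on an 80 GB GPU, ~10% overhead reserve) are illustrative assumptions, not numbers from the talk:

```python
# Back-of-the-envelope KV cache sizing for an LLM deployment.
# All dimensions below are illustrative assumptions, not talk figures.

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

# Assumed 7B-class model: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes_per_token(32, 32, 128, 2)

# Assumed 80 GB GPU; weights take 7B params * 2 bytes in FP16;
# reserve ~10% of the GPU for activations and engine overhead.
gpu_bytes = 80e9
weight_bytes = 7e9 * 2
kv_budget = gpu_bytes - weight_bytes - 0.1 * gpu_bytes

# Tokens that fit in the KV cache budget, and how many
# concurrent sequences that supports at a 4K context length.
max_tokens = kv_budget // per_token
max_seqs_at_4k = max_tokens // 4096
print(int(max_tokens), "tokens,", int(max_seqs_at_4k), "concurrent 4K sequences")
```

Quantizing weights or the KV Cache (e.g. to FP8) halves the relevant terms above, which is exactly why the quantization and KV Cache topics interact with throughput per GPU.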
The main inference engine referenced in the talk is TensorRT-LLM (TRT-LLM), served with the open-source inference server NVIDIA Triton.
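For the performance metrics mentioned (time to first token, inter-token latency), a minimal measurement sketch follows. The `fake_generate_stream` generator is a stand-in for a streaming LLM endpoint, not a real TRT-LLM or Triton API:

```python
import time

def fake_generate_stream(prompt):
    # Stand-in for a streaming LLM endpoint; yields tokens with a delay.
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)
        yield tok

def measure_latency(stream):
    start = time.perf_counter()
    ttft = None
    stamps = []
    for _ in stream:
        now = time.perf_counter()
        if ttft is None:
            ttft = now - start  # time to first token (TTFT)
        stamps.append(now)
    # Inter-token latency: mean gap between successive tokens.
    gaps = [b - a for a, b in zip(stamps, stamps[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, itl

ttft, itl = measure_latency(fake_generate_stream("hi"))
```

TTFT is dominated by prefill (prompt processing), while inter-token latency reflects the decode phase, which is why the two are characterized separately when costing a deployment.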
Recorded live in San Francisco at the AI Engineer World's Fair. See the full schedule of talks at www.ai.enginee... & join us at the AI Engineer World's Fair in 2025! Get your tickets today at ai.engineer/2025
About Mark
Dr. Mark Moyou is a Senior Data Scientist at NVIDIA on the Retail team focused on enabling scalable machine learning for the nation's top Retailers. Before NVIDIA, he was a Data Science Manager in the Professional Services division at Lucidworks, an Enterprise Search and Recommendations company. Prior to Lucidworks, he was a founding Data Scientist at Alstom Transportation where he applied Data Science to the Railroad Industry in the US. Mark holds a PhD and MSc in Systems Engineering and a BSc in Chemical Engineering. On the side, Mark is the host of The AI Portfolio Podcast, The Caribbean Tech Pioneers, Progress Guaranteed Podcast and Director of the Southern Data Science Conference in Atlanta.
