LLM inference optimization: Architecture, KV cache and Flash attention

3,220 views

YanAITalk

1 day ago

Comments: 5
@cliffordino • A month ago
Nicely done and very helpful! Thank you!! FYI, the stress is on the first syllable of "INference", not the second ("inFERence").
@yanaitalk • A month ago
Copy that! Thank you😊
@johndong4754 • 2 months ago
I've been learning about LLMs over the past few months, but I haven't gone into too much depth. Your videos seem very detailed and technical. Which one(s) would you recommend starting off with?
@yanaitalk • 2 months ago
There are excellent courses from DeepLearning.ai on Coursera. To go even deeper, I recommend reading the technical papers directly, which gives you more depth of understanding.
@HeywardLiu • A month ago
1. Roofline model
2. Transformer architecture > bottleneck of attention > flash attention
3. LLM inference can be divided into a prefilling stage (compute-bound) and a decoding stage (memory-bound)
4. LLM serving: paged attention, radix attention
If you want to optimize inference performance, this review paper is awesome: "LLM Inference Unveiled: Survey and Roofline Model Insights"
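The outline above matches the video's flow. As a rough illustration of point 3, the split between the compute-bound prefill and the memory-bound, KV-cache-driven decode, here is a minimal single-head attention sketch in NumPy; the sizes and the `attention` helper are hypothetical stand-ins for illustration, not code from the video.

```python
# Minimal single-head attention with a KV cache (illustrative only; sizes
# and helper names are hypothetical, not taken from the video).
import numpy as np

d_model, n_prompt = 64, 16
rng = np.random.default_rng(0)

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = q @ K.T / np.sqrt(d_model)   # (1, t): one dot product per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                    # (1, d_model)

# Prefill: the whole prompt is processed in one batch of matrix multiplies
# (large GEMMs -> high arithmetic intensity -> typically compute-bound).
prompt = rng.normal(size=(n_prompt, d_model))
K_cache = prompt.copy()   # stand-in for prompt @ W_k
V_cache = prompt.copy()   # stand-in for prompt @ W_v

# Decode: one token at a time. Each step appends a single key/value pair and
# re-reads the whole cache (matrix-vector work -> low arithmetic intensity
# -> typically memory-bound). Without the cache, K and V for every previous
# token would be recomputed at every step.
for step in range(4):
    x_new = rng.normal(size=(1, d_model))   # stand-in for the new token's projections
    K_cache = np.vstack([K_cache, x_new])
    V_cache = np.vstack([V_cache, x_new])
    out = attention(x_new, K_cache, V_cache)
    print(f"step {step}: cache length = {K_cache.shape[0]}, output shape = {out.shape}")
```

Paged attention and radix attention (point 4) are about how this ever-growing K/V cache is laid out and shared in memory across many concurrent requests, which the toy example above ignores.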
Mixture of Experts: Mixtral 8x7B
39:42
YanAITalk
238 views
Deep Dive: Optimizing LLM inference
36:12
Julien Simon
24K views
Parameter-efficient Fine-tuning of LLMs with LoRA
48:25
YanAITalk
132 views
Dynamic Deep Learning | Richard Sutton
1:04:32
ICARL
5K views
The KV Cache: Memory Usage in Transformers
8:33
Efficient NLP
43K views
Speculative Decoding: When Two LLMs are Faster than One
12:46
Efficient NLP
14K views