Deep dive - Better Attention layers for Transformer models

9,018 views

The self-attention mechanism is at the core of transformer models. As amazing as it is, it requires a significant amount of compute and memory bandwidth, leading to scalability issues as models get more complex and context length increases.
In this video, we'll quickly review the computation involved in the self-attention mechanism and its multi-head variant. Then, we'll discuss newer attention implementations focused on compute and memory optimizations, namely Multi-Query Attention, Grouped-Query Attention, Sliding Window Attention, Flash Attention v1 and v2, and Paged Attention.
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at / julsimon or Substack at julsimon.substack.com. ⭐️⭐️⭐️
03:00 Self-attention
07:20 Multi-Head Attention (MHA)
12:32 Multi-Query Attention (MQA)
18:45 Grouped-Query Attention (GQA)
22:47 Sliding Window Attention (SWA)
26:17 Flash Attention
31:28 Flash Attention v2
34:36 Paged Attention
39:00 The Hugging Face LLM performance leaderboard
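
Several of the variants listed above differ mainly in how many key/value heads the query heads share (MHA, GQA, MQA) or in which past positions a token may attend to (SWA). The sketch below illustrates both ideas in plain Python. The function names and the window convention (a token sees itself plus the `window - 1` tokens before it) are assumptions made for illustration, not code taken from the video or from any particular library.

```python
def kv_head_for_query_head(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the index of the key/value head that a query head reads from.

    num_kv_heads == num_query_heads  -> Multi-Head Attention (one K/V head per query head)
    1 < num_kv_heads < num_query_heads -> Grouped-Query Attention (K/V shared within a group)
    num_kv_heads == 1                -> Multi-Query Attention (one K/V head for all queries)
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    if not 0 <= query_head < num_query_heads:
        raise ValueError("query_head is out of range")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding-window mask: mask[i][j] is True if position i may attend to j.

    Assumed convention: position i sees positions i - window + 1 .. i.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Hypothetical configuration: 8 query heads.
    for name, kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
        mapping = [kv_head_for_query_head(h, 8, kv_heads) for h in range(8)]
        print(name, mapping)

    # Each row shows which earlier positions a token can attend to.
    for row in sliding_window_mask(seq_len=5, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Fewer key/value heads mean fewer K and V tensors to store and read per token, which is where the memory-bandwidth savings of MQA and GQA come from.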

Comments: 15

@HieuNguyen-qs6ig 2 months ago

Thanks Julien, your videos are really helpful.

@juliensimonfr 1 month ago

Glad you like them!

@jacehua7334 5 months ago

Your videos are always so good, thank you so much. Congrats on 10k!

@juliensimonfr 5 months ago

Thank you so much!!

@cybermanaudiobooks3231 5 months ago

Thanks Julien! I really liked this video. Many thanks for sharing your knowledge and experience.

@juliensimonfr 5 months ago

My pleasure!

@teddybear7949 5 months ago

Thanks Julien, it really kept my attention and made me understand a lot about attention

@juliensimonfr 5 months ago

That was my goal, thank you!

@justwest 5 months ago

nice overview, thx julien!

@juliensimonfr 5 months ago

Glad you liked it!

@thepresistence5935 3 months ago

Julien, outstanding video. Where can I get the decks?

@rbrowne4255 1 month ago

Thanks for the overview!! Excellent!!! In terms of sizing for inference, is there a way to calculate the KV cache size, or maybe the overall HBM usage, based on these optimizations?

@juliensimonfr 1 month ago

You're welcome. The model papers have some numbers, but I think the best resource is huggingface.co/spaces/optimum/llm-perf-leaderboard. You can see how much RAM is required to load a particular model, and of course compare same-size models based on different attention layers :)
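
For the sizing question above, a common back-of-the-envelope estimate of the KV cache is: 2 (keys and values) × layers × KV heads × head dimension × cached tokens × batch size × bytes per value. The sketch below applies it to a hypothetical model (32 layers, 32 query heads, head dimension 128, 16-bit values); these numbers are assumptions chosen for illustration and do not describe any specific model. The `window` parameter assumes a sliding-window model that keeps only the last `window` tokens in the cache. The estimate covers the cache only and ignores weights, activations, and allocator overhead, so measured numbers such as those on the leaderboard will be higher.

```python
from typing import Optional


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,      # 2 bytes for fp16/bf16
    window: Optional[int] = None,  # assumed: cache capped at the attention window
) -> int:
    """Rough KV cache size in bytes for one forward pass configuration."""
    cached_tokens = seq_len if window is None else min(seq_len, window)
    # Factor 2: one tensor for keys and one for values, per layer.
    return 2 * num_layers * num_kv_heads * head_dim * cached_tokens * batch_size * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 32 query heads, head_dim 128, 4096-token context.
    for name, kv_heads in [("MHA", 32), ("GQA (8 KV heads)", 8), ("MQA", 1)]:
        size = kv_cache_bytes(num_layers=32, num_kv_heads=kv_heads, head_dim=128, seq_len=4096)
        print(f"{name}: {size / 1024**2:.0f} MiB")
```

Under these assumptions, the cache scales linearly with the number of KV heads, so moving from 32 KV heads to 8 cuts it by a factor of four.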