Deep dive - Better Attention layers for Transformer models

9,018 views

Julien Simon

1 day ago

The self-attention mechanism is at the core of transformer models. As amazing as it is, it requires significant compute and memory bandwidth, leading to scalability issues as models grow and context lengths increase.
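For reference, the computation at the heart of all of this is scaled dot-product attention. Here is a minimal single-head NumPy sketch (the function and variable names are illustrative, not taken from the video):

import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_head)
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # project inputs to queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq_len, seq_len) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # (seq_len, d_head) weighted values

Note the (seq_len, seq_len) score matrix: compute and memory grow quadratically with context length, which is exactly the bottleneck the variants covered below address.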
In this video, we'll quickly review the computation involved in the self-attention mechanism and its multi-head variant. Then, we'll discuss newer attention implementations focused on compute and memory optimizations, namely Multi-Query Attention, Grouped-Query Attention, Sliding Window Attention, Flash Attention v1 and v2, and Paged Attention.
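To make the Multi-Query / Grouped-Query idea concrete before watching: every query head keeps its own projection, but several query heads share one key/value head, shrinking the KV cache by a factor of n_heads / n_kv_heads. A minimal sketch under assumed tensor shapes (again, not the video's code):

import numpy as np

def grouped_query_attention(q, k, v):
    # q: (n_heads, seq_len, d_head); k, v: (n_kv_heads, seq_len, d_head)
    # n_kv_heads == n_heads -> Multi-Head; == 1 -> Multi-Query; in between -> Grouped-Query
    n_heads, n_kv_heads = q.shape[0], k.shape[0]
    group = n_heads // n_kv_heads                   # query heads per K/V head
    outputs = []
    for h in range(n_heads):
        kh, vh = k[h // group], v[h // group]       # K/V shared within the group
        scores = q[h] @ kh.T / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)          # row-wise softmax
        outputs.append(w @ vh)
    return np.stack(outputs)                        # (n_heads, seq_len, d_head)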
⭐️⭐️⭐️ Don't forget to subscribe to be notified of future videos. Follow me on Medium at /julsimon or on Substack at julsimon.substack.com. ⭐️⭐️⭐️
03:00 Self-attention
07:20 Multi-Head Attention (MHA)
12:32 Multi-Query Attention (MQA)
18:45 Grouped-Query Attention (GQA)
22:47 Sliding Window Attention (SWA)
26:17 Flash Attention
31:28 Flash Attention v2
34:36 Paged Attention
39:00 The Hugging Face LLM performance leaderboard

Comments: 15
@HieuNguyen-qs6ig 2 months ago
Thanks Julien, your videos are really helpful.
@juliensimonfr 1 month ago
Glad you like them!
@jacehua7334 5 months ago
Your videos are always so good, thanks so much! Congrats on 10k!
@juliensimonfr 5 months ago
Thank you so much!!
@cybermanaudiobooks3231 5 months ago
Thanks Julien! I really liked this video. Many thanks for sharing your knowledge and experience.
@juliensimonfr 5 months ago
My pleasure!
@teddybear7949 5 months ago
Thanks Julien, it really kept my attention and helped me understand a lot about attention!
@juliensimonfr 5 months ago
That was my goal, thank you!
@justwest 5 months ago
nice overview, thx julien!
@juliensimonfr 5 months ago
Glad you liked it!
@thepresistence5935 3 months ago
Julien, outstanding video! Where can I get the decks?
@rbrowne4255 1 month ago
Thanks for the overview, excellent! In terms of sizing for inference, is there a way to calculate the KV cache or the overall HBM memory usage based on these optimizations?
@juliensimonfr 1 month ago
You're welcome. The model papers have some numbers, but I think the best resource is huggingface.co/spaces/optimum/llm-perf-leaderboard. You can see how much RAM is required to load a particular model, and of course compare same-size models based on different attention layers :)
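For a rough back-of-the-envelope estimate, you can also just multiply out the cache dimensions. A sketch of the usual formula (illustrative, assuming an fp16 cache and Llama-2-7B-like shapes):

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # one K and one V vector per layer, per KV head, per token;
    # bytes_per_elem=2 assumes an fp16/bf16 cache
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Llama-2-7B-like shapes: 32 layers, 32 KV heads, head_dim 128, 4k context
print(kv_cache_bytes(32, 32, 128, 4096, 1) / 2**30)  # -> 2.0 GiB

With GQA or MQA, n_kv_heads drops below the number of query heads, which is exactly where the memory savings come from.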
@xinzhou3744 3 months ago
Your video is great! Could you paste all the code links in the video description? Much appreciated!
@juliensimonfr 3 months ago
Hi, I'm not sure what you mean. If you're talking about the pseudocode boxes, they're available in the research papers mentioned on the slides.