Flash Attention Machine Learning

3,752 views

Stephen Blum

1 day ago

Comments: 34
@aflah7572
@aflah7572 3 months ago
Thanks for the video! Your enthusiasm is infectious 😁
@StephenBlum
@StephenBlum 3 months ago
Thanks for watching! Happy and excited about this technology. It's great. The idea of fused kernels: multiple math operations are done all at once on the GPU, in a single kernel, without needing to communicate back with the CPU between steps. It's a lot faster 🎉😄
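A minimal sketch of the fused-kernel idea (my own illustration, not code from the video). PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel on supported GPUs, so the softmax(QK^T)V steps run as one kernel instead of several separate ops with large intermediate tensors:

import math
import torch
import torch.nn.functional as F

# toy tensors: (batch, heads, sequence length, head dimension)
q = torch.randn(1, 8, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)

# unfused: every op materializes an intermediate, including the full attention matrix
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
weights = torch.softmax(scores, dim=-1)
out_unfused = weights @ v

# fused: one call; on supported hardware the backend avoids materializing the full matrix
out_fused = F.scaled_dot_product_attention(q, k, v)

print(torch.allclose(out_unfused, out_fused, atol=1e-5))  # same result, different kernels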
@siddharthmanumusic
@siddharthmanumusic 8 days ago
Great video! Loved that all my C++ knowledge is still relevant lol
@StephenBlum
@StephenBlum 8 days ago
@@siddharthmanumusic thank you! 😄 Yes, absolutely, you are right. C++ is relevant and heavily used in advanced AI frameworks.
@samkitjain8007
@samkitjain8007 6 months ago
At 6:46 you mention that V doesn't have any positional encodings. But both K and V are derived from the positional encodings (cosine embeddings) of the original tokens in the sequence. Am I missing something here?
@StephenBlum
@StephenBlum 6 months ago
Yes, you are correct 😄 Q and K receive positional encodings. V is not required to receive enrichment from positional encodings. Q, K, and V start from the same vectors generated by a word embedding layer like Word2Vec. Q and K receive positional enrichment such as sine/cosine encodings or a linear bias (ALiBi: Attention with Linear Biases) to add position information that the model can attend to. The model itself has no inherent notion of order: positional encodings are added to the embedding output before the model processes it. The flow is: "string of words" -> tokenize -> embedding -> QKV -> positional encoding for Q and K (optionally V). Note that there are more ways to enrich the vectors 🙌 so the model can see more properties to attend to. Basically you get to pick what the model "sees", then what the model "responds" with. It's amazing! You can add positional encodings to V as well; it can help the model learn faster. Excluding positional encoding from V can regularize the input and create a smoother gradient in the dense layers. You can try different enrichments for each of Q, K, and V. Lots of options! 🎉
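A rough sketch of the flow described above, with position information added to Q and K but not V. The dimension sizes, the learned positional tensor, and the separate projection layers are my own illustration, not the video's exact code:

import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 1000, 64, 16

embed = nn.Embedding(vocab_size, d_model)        # token id -> embedding vector
pos = torch.randn(seq_len, d_model)              # stand-in positional encodings
w_q = nn.Linear(d_model, d_model, bias=False)    # separate learned projections for Q, K, V
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)

tokens = torch.randint(0, vocab_size, (1, seq_len))  # "string of words" -> token ids
x = embed(tokens)                                    # (1, seq_len, d_model)

q = w_q(x + pos)   # Q and K are enriched with position information
k = w_k(x + pos)
v = w_v(x)         # V left without positional encoding (add "+ pos" here to try the other variant)

attn = torch.softmax(q @ k.transpose(-2, -1) / d_model ** 0.5, dim=-1)
out = attn @ v     # (1, seq_len, d_model)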
@StephenBlum
@StephenBlum 6 months ago
Hi Samkit. I am planning a video for next week to cover the QKV calculations. When I started learning the transformer model I had two major questions: 1. What is QKV? 2. How are the multi-heads created? Looking to cover those in next week's video 😄
@StephenBlum
@StephenBlum 6 months ago
Hi Samkit, I posted a video about Query, Key and Value (the Q, K and V variables): kzbin.info/www/bejne/j5OWmGatn6mmjpY
@SinanAkkoyun
@SinanAkkoyun 5 months ago
Duude, you're the only one who explains it like he truly understands it, keep it up!!! ❤
@StephenBlum
@StephenBlum 5 months ago
Thank you! 😊 Happy to explain it. Will keep it up! The AI industry uses a lot of math language, which makes AI difficult to approach. We can explain how AI works, and still provide the same details, using common language.
@PhantomKenTen
@PhantomKenTen 4 months ago
This is great, but an implementation with a model (blip-2) from huggingface would be even more useful.
@StephenBlum
@StephenBlum 4 months ago
wow! blip-2 nice! checking this out today. Thank you!
@StephenBlum
@StephenBlum 4 months ago
Amazing! huggingface.co/docs/transformers/main/en/model_doc/blip-2 will be a great topic to cover. Thank you Kenan 😄🎉
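For anyone wanting to try it right away, a hedged sketch based on that Hugging Face docs page. The checkpoint name "Salesforce/blip2-opt-2.7b", the image path, and the prompt are illustrative choices, not something specified in this thread:

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # any local image
inputs = processor(
    images=image,
    text="Question: what is in the picture? Answer:",
    return_tensors="pt",
).to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))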
@PhantomKenTen
@PhantomKenTen 4 months ago
Yes, I'm trying to run inference with those vLLMs with flash attention, but it's tough
@StephenBlum
@StephenBlum 4 months ago
​@@PhantomKenTen ah yes, that may be tricky. Multi-head attention is often split across CPUs: each head is computed in parallel, so on a machine with 128 cores you can run 128 heads. Many frameworks do this because it parallelizes easily on CPUs. Or if you have a multi-GPU server you can use the 8 GPUs as separate attention multipliers. It makes sense that flash attention may be tricky to implement, since it's an all-in-one GPU kernel, and this is highly hardware dependent. The model you are working with needs to run the multi-head attention on GPUs using the Flash Attention kernel (v3 now).
@PhantomKenTen
@PhantomKenTen 4 months ago
@@StephenBlum I definitely don't have 128 CPU cores. I was hoping for a speedup by compressing the attention steps into one CUDA kernel for inference. I don't think I can use Flash3 since I'm on an A10G-24GB, and Flash3 is for Hopper.
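On that point, a hedged sketch of how a FlashAttention-2 backend is typically requested through Hugging Face transformers (FlashAttention-2 supports Ampere GPUs like the A10G; it's v3 that targets Hopper). The model id "facebook/opt-1.3b" is just a placeholder, and the except clause covers setups where the flash-attn package or a supported GPU isn't available:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-1.3b"  # placeholder model id

try:
    # requires the flash-attn package and a supported GPU (Ampere or newer for FlashAttention-2)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, attn_implementation="flash_attention_2"
    ).to("cuda")
except (ImportError, ValueError):
    # fall back to PyTorch's fused scaled-dot-product attention kernel
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, attn_implementation="sdpa"
    ).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Flash attention speeds up", return_tensors="pt").to("cuda")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))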
@bettylin7035
@bettylin7035 7 months ago
Thank you for the video! Please post more content like this 😍, and keep up the great work :))
@StephenBlum
@StephenBlum 7 months ago
Thank you! Will do! 😄 More videos planned 🙌 Will focus more on AI/ML going forward.