Thanks for the video! Your enthusiasm is infectious 😁
@StephenBlum · 3 months ago
Thanks for watching! Happy and excited about this technology. It's great. The idea of fused kernels is that multiple math ops are done all at once on the GPU, without needing to communicate back with the CPU between steps. It's a lot faster 🎉😄
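For anyone curious what that looks like in practice, here is a minimal sketch using PyTorch's torch.compile, which can fuse elementwise ops into fewer GPU kernels. The function name and shapes are illustrative, and it assumes a CUDA GPU:

```python
import torch

def attention_scores(q, k):
    # Unfused: each op launches its own GPU kernel and writes
    # its intermediate result back to GPU memory.
    scores = q @ k.transpose(-2, -1)
    scores = scores / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1)

# torch.compile can fuse the scale + softmax into fewer kernels,
# avoiding round trips through GPU memory between ops.
fused_scores = torch.compile(attention_scores)

q = torch.randn(8, 128, 64, device="cuda")
k = torch.randn(8, 128, 64, device="cuda")
out = fused_scores(q, k)  # same result, fewer kernel launches
```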
@siddharthmanumusic · 8 days ago
Great video! Loved that all my C++ knowledge is still relevant lol
@StephenBlum · 8 days ago
@@siddharthmanumusic thank you! 😄 Yes, absolutely, you are right: C++ is relevant and heavily used in advanced AI frameworks.
@samkitjain8007 · 6 months ago
At 6:46 you mention that V doesn't have any positional encodings. Both K and V are derived from the positional encodings (cosine embeddings) of the original tokens in the sequence. Am I missing something here?
@StephenBlum · 6 months ago
Yes, you are correct 😄 Q and K receive position encodings; V is not required to receive that enrichment. Q, K and V start from the same vectors, generated by a word embedding layer like Word2Vec. Q and K are then enriched with position information the model can attend to, e.g. sine/cosine positional encodings or a linear bias (ALiBi: Attention with Linear Biases). The model itself has no inherent notion of order, so the positional encodings are added to the embedding outputs before the model processes them.

The pipeline: string of words -> tokenize -> embedding -> QKV projections -> positional encoding for Q and K (optionally V). Note that there are more ways to compute the attention correlations 🙌 and more properties you can give the model to attend to. Basically you get to pick what the model "sees", then what the model "responds" with. It's amazing!

You can add positional encodings to V as well; it can help the model learn faster. Excluding positional encoding from V can regularize the input and create a smoother gradient in the dense layers. You can try different enrichments for each of Q, K and V. Lots of options! 🎉
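A minimal sketch of that pipeline in PyTorch, assuming sinusoidal encodings and toy dimensions (all names here are illustrative, not from the video):

```python
import torch

def sinusoidal_encoding(seq_len, dim):
    # Classic sine/cosine positional encodings from "Attention Is All You Need".
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, dim, 2).float()
    angle = pos / torch.pow(10000.0, i / dim)
    enc = torch.zeros(seq_len, dim)
    enc[:, 0::2] = torch.sin(angle)
    enc[:, 1::2] = torch.cos(angle)
    return enc

seq_len, dim = 16, 64
x = torch.randn(seq_len, dim)  # token embeddings (e.g. Word2Vec-style)

# Separate learned projections produce Q, K, V from the same input vectors.
w_q = torch.nn.Linear(dim, dim)
w_k = torch.nn.Linear(dim, dim)
w_v = torch.nn.Linear(dim, dim)
q, k, v = w_q(x), w_k(x), w_v(x)

# Enrich Q and K with position information; V is left as-is here,
# though you could add the encoding to V as well.
pe = sinusoidal_encoding(seq_len, dim)
q, k = q + pe, k + pe
```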
@StephenBlum · 6 months ago
Hi Samkit. I am planning a video for next week to cover the QKV calculations. When I started learning the transformer model I had two major questions: 1. What is QKV? and 2. How are the multi-heads created? Looking to cover both in that video 😄
@StephenBlum · 6 months ago
Hi Samkit, I posted a video about Query, Key and Value (the Q, K and V variables): kzbin.info/www/bejne/j5OWmGatn6mmjpY
@SinanAkkoyun · 5 months ago
Duude, you're the only one who explains it like he truly understands it, keep it up!!! ❤
@StephenBlum · 5 months ago
Thank you! 😊 Happy to explain it. Will keep it up! The AI industry uses a lot of math language, which makes AI difficult to approach. We can explain how AI works, with the same level of detail, using common language.
@PhantomKenTen · 4 months ago
This is great, but an implementation with a model (BLIP-2) from Hugging Face would be even more useful.
@StephenBlum · 4 months ago
Wow, BLIP-2, nice! Checking this out today. Thank you!
@StephenBlum · 4 months ago
Amazing! This will be a great topic to cover: huggingface.co/docs/transformers/main/en/model_doc/blip-2. Thank you, Kenan 😄🎉
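For anyone who wants to try it, a minimal BLIP-2 inference sketch with the Hugging Face transformers library, assuming the published Salesforce/blip2-opt-2.7b checkpoint, a CUDA GPU, and a hypothetical local image file:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"  # one of the published BLIP-2 checkpoints
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("example.jpg")  # hypothetical local image path
inputs = processor(
    images=image,
    text="Question: what is in this photo? Answer:",
    return_tensors="pt",
).to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```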
@PhantomKenTen · 4 months ago
Yes, I'm trying to run inference with those vLLMs with flash attention, but it's tough
@StephenBlum · 4 months ago
@@PhantomKenTen ah yes, that may be tricky. Multi-head attention is often split across compute units, since each head is computed in parallel: on a machine with 128 cores you can run 128 heads at once, and many frameworks take this route because it parallelizes easily. On a multi-GPU server you can likewise spread the heads across, say, 8 GPUs as separate attention multipliers. It makes sense that Flash Attention is tricky to implement, since it's an all-in-one GPU kernel and highly hardware dependent. The model you are working with needs to route its multi-head attention through the Flash Attention kernels (v3 now) on the GPU.
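A minimal sketch of requesting the flash backend in PyTorch 2.x; sdpa_kernel and scaled_dot_product_attention are PyTorch APIs, and the shapes here are illustrative:

```python
import torch
from torch.nn.attention import sdpa_kernel, SDPBackend

# Batch of 2, 8 heads, sequence of 1024, head dim 64 (illustrative shapes).
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Ask PyTorch to use its FlashAttention backend for this region;
# it raises an error if the hardware/dtype combination can't support it.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
```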
@PhantomKenTen · 4 months ago
@@StephenBlum I definitely don't have 128 CPU cores. I was hoping for a speed-up by compressing the attention steps into one CUDA kernel for inference. I don't think I can use Flash 3 since I'm on an A10G-24GB; Flash 3 is for Hopper.
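One quick sanity check, a sketch using PyTorch to read the GPU's compute capability (FlashAttention-3 targets Hopper at compute capability 9.x; the A10G is Ampere at 8.6):

```python
import torch

# FlashAttention-3 targets Hopper GPUs (compute capability 9.x).
# FlashAttention-2 supports Ampere (8.x) cards like the A10G.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")

if major >= 9:
    print("Hopper-class GPU: FlashAttention-3 should be usable.")
elif major >= 8:
    print("Ampere-class GPU (e.g. A10G): use FlashAttention-2 instead.")
else:
    print("FlashAttention needs at least an Ampere GPU.")
```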
@bettylin7035 · 7 months ago
Thank you for the video! Please post more content like this 😍, and keep up the great work :))
@StephenBlum · 7 months ago
Thank you! Will do! 😄 More videos planned 🙌 I will put more focus on AI/ML going forward.