Cross Attention | Method Explanation | Math Explained

21,572 views

Outlier

A day ago

Cross Attention is one of the most important methods in modern deep learning. It is what enables many models to work the way they do and produce the amazing results seen with #stablediffusion #imagen #muse etc.
In this video I give a visual and (hopefully) intuitive explanation of how Cross Attention works, with some simple examples to make the math easy to understand.
00:00 Introduction
01:10 Self Attention explained
07:40 Cross Attention explained
11:28 Summary
12:25 Outro
#crossattention #attention
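As a rough companion to the chapters above, here is a minimal numpy sketch of the scaled dot-product attention the video walks through. The toy sizes and variable names are illustrative only, not the exact notation used on screen.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_src, kv_src, d_attn=8, seed=0):
    """q_src: (n_q, d_q) source of the queries; kv_src: (n_kv, d_kv) source of keys/values."""
    rng = np.random.default_rng(seed)            # stand-ins for learned projection matrices
    W_Q = rng.normal(size=(q_src.shape[1], d_attn))
    W_K = rng.normal(size=(kv_src.shape[1], d_attn))
    W_V = rng.normal(size=(kv_src.shape[1], d_attn))
    Q, K, V = q_src @ W_Q, kv_src @ W_K, kv_src @ W_V
    scores = softmax(Q @ K.T / np.sqrt(d_attn))  # (n_q, n_kv) similarity weights
    return scores @ V                            # (n_q, d_attn)

image_tokens = np.random.randn(16, 32)   # e.g. a 4x4 grid of 32-dim image tokens
text_tokens  = np.random.randn(5, 64)    # e.g. 5 word embeddings, 64-dim each

self_attn  = attention(image_tokens, image_tokens)  # self attention: image attends to itself
cross_attn = attention(image_tokens, text_tokens)   # cross attention: image attends to the text
print(self_attn.shape, cross_attn.shape)            # both (16, 8): the queries set the output shape
```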

Comments: 88
@AICoffeeBreak · a year ago
Wow, what a great video this is, thanks! You do awesome explanations: you put things well into a nutshell but also spell them out later with awesome visuals. 👏
@outliier · a year ago
Thank you so much! So nice to hear that!
@ayushroy6208 · 10 months ago
Hi ma'am, big fan... a subscriber
@AICoffeeBreak · 10 months ago
@@ayushroy6208 Thanks! Happy to see you here! 🙌
@InturnetHaetMachine · a year ago
Another stellar explanation. Hope you continue making videos.
@Mike.. · 6 months ago
Had to pause and rewind a lot and look stuff up, but now I feel like I have a much better understanding of how this stuff works! Thank you
@user-jx5pm9nx8p · 11 months ago
The best DL explanation content I've ever seen!!! Thank you so much, sir!
@JidongLi-lb3zt · 4 days ago
thanks for your detailed introduction
@user-dn8vx7hm2q · a year ago
Well done, just the right amount of details for anyone to understand the main concepts quickly!
@evilby · a year ago
Just the content I'm looking for! great video! Please keep up the good work!
@user-rl9px9jg5e · 2 months ago
Thank you so much!!! I was finally able to understand Attention with your video!!!
@jayhu2296 · 2 months ago
Thanks for the crystal clear explanation!
@moajjem04 · a year ago
Thank you very much. The video cleared up the concept of cross attention a lot.
@TheSlepBoi · a month ago
Amazing explanation and thank you for taking the time to properly visualize everything
@miquelespinosa8704 · a year ago
This video is truly amazing!! Thanks!
@felixvgs.9840 · 11 months ago
Another awesome watch, keep up the awesome work!!!
@RyanHelios · 26 days ago
really nice video, helps me understand a lot❗
@raphaelfeigl1209 · 16 days ago
Amazing explanation, thanks a lot! Minor improvement suggestion: add a pop filter to your microphone :)
@kevon217 · 9 months ago
Very intuitive explanation, thanks!
@mr_tpk · a year ago
Thank you for this. Please continue making videos.
@fritmore · 5 months ago
Wow, excellent visualisation and explanation of the algebra ops involved
@user-px7he7dg3c · a year ago
Thanks for your video. It's so great!
@sureshmaliofficial · 2 months ago
This is really great. Makes it so easy to grasp such complex stuff
@Crouchasauris · a year ago
Great explanation!
@100deep1001 · 9 months ago
Story telling skills and animation are exceptionally impressive.
@cupatelj · 5 months ago
Very well explained. Thank you sir.
@ahmadhassan8560 · 4 months ago
Cleared my doubts about cross attention in a flash. Thanks a lot fam
@aurkom · a year ago
Thank you for the video!
@bigtimetimmyj · a year ago
I've been struggling to understand image and textual cross attention. This video helped immensely, thanks a lot.
@user-eo7jy3dd7z · 6 months ago
Thanks for the explanation, it really helped!
@Melina36295 · a year ago
The tone of your voice and your explanations are excellent.
@kadises5032 · 5 months ago
One of the best videos I've ever seen
@alejandrorequenabermejo6608 · 5 months ago
Amazing video!
@420_gunna · 3 months ago
Incredible, please create more!
@matthewprestifilippo7673 · 2 months ago
Great content
@george6248 · a year ago
Very helpful video👍
@leeda36 · 5 months ago
awesome, thanks!
@jingyang4949 · 7 months ago
This is really clear!!
@outliier · 7 months ago
Thanks!
@dan1ar · 10 months ago
Such a great video! Instant subscribe)
@alioo4692 · 9 months ago
thank you very very much
@user-by2gz5qr1z · a year ago
Hi I really enjoy your video! Can you make a video on how you make high-quality videos like this?😃
@r00t257 · a year ago
Welcome back!
@leerichard5542 · a year ago
Long time no see!
@fyremaelstrom2896 · a year ago
Top tier content for ML folk.
@afasadaaffa3321 · a year ago
Nice video! Why do we take the Values from the text if we want to produce an image?
@user-ol3fe3lj8j · a year ago
Amazing video and great visuals! Nice work man. I have two questions for you: Why is it that with cross attention we can achieve better evaluation of our model? In which cases (regarding the nature of the conditional information) do those models perform better than self-attention ones? Thank you and keep up the nice work
@outliier · a year ago
Hey there. Thank you for the nice words. What do you mean by “higher evaluation”? Do you mean better performance? For the last part, you can't really compare self attention and cross attention in that way. They are both attention, just between different things (self attention between the input and itself, cross attention between the input and the conditioning). So you can't really say that cross attention performs better than self attention, because they do different things. Maybe I understood it wrong tho, so feel free to help me understand it
@user-ol3fe3lj8j · a year ago
@@outliier no you get my point, thanks a lot 🙏
@masbulhaider4225 · 7 months ago
Great explanation. But in the case of image-to-image translation using a U-Net, how can I use cross attention? Say for faded-image-to-colorized-image conversion, how do I prepare the query for cross attention, and should the K and V come from the colorized image as the conditioning image? If yes, should I use a transformer encoder to get K and V?
@malavbateriwala9282 · a year ago
Great explanation. Which topic are you planning to work on next?
@outliier · a year ago
Paella 🥘
@XuLei-ml2vd · a year ago
You explain it so clearly 🤩. But I still have a question. Why is the text condition converted into K and V, and the image converted into Q? Why do they correspond to each other like this? Can the image be converted into K and V instead? Looking forward to your reply
@outliier · a year ago
Hey there, thank you so much for the kind words! The reason the text becomes k and v is that you first want to attend to the image. If you take the text to be q and v, for example, then the final shape of the output will not be correct: q determines the shape of the attention output. I had the same question once, but a friend explained it to me too! It might sound a bit confusing, but just do the calculation on paper and try switching around the matrices for q, k and v, and you will see that q has to come from the image and k, v from the extra conditioning. Let me know if that helps, otherwise I can give a concrete example :D
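A tiny shape check along those lines (toy sizes; the softmax and 1/sqrt(d) scaling are left out because they don't change any shapes):

```python
import numpy as np

n_img, n_txt, d = 16, 5, 8
Q = np.random.randn(n_img, d)   # queries projected from the 16 image tokens
K = np.random.randn(n_txt, d)   # keys projected from the 5 text tokens
V = np.random.randn(n_txt, d)   # values projected from the 5 text tokens

out = (Q @ K.T) @ V             # (16, 5) @ (5, 8) -> (16, 8)
print(out.shape)                # one updated vector per image token

# With Q taken from the text instead, the result would have 5 rows and could no
# longer be added back onto the 16 image tokens.
```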
@XuLei-ml2vd · a year ago
@@outliier Thank you very much for your reply. 😊 I did it on paper. I found that the first dimension of the attention output depends on Q, so in order to keep the output consistent with the dimensions of the original image, the image must correspond to Q. But I have another question: can the matrix shapes of W_CA^K and W_CA^V be different? For example, set their shapes to (6,2) and (6,3) respectively, and then set the subsequent W_out matrix shape to (3,3). Looking forward to your reply :)
@outliier · a year ago
@@XuLei-ml2vd Hey! That's exactly what I was thinking about too. In theory yes, though I don't know if people are actually doing it. In my mind it could make more sense to use a smaller dimension for K, since it "only" contributes to the similarity lookup, and a bigger dimension for V. But it is important to keep in mind that similarity lookups perform best in high dimensions, so a dimension of 10 is probably way too low to embed the inputs. But maybe there is a sweet spot that performs better than giving K and V the same dimensions. I would love it if someone made a comparison (or maybe there already are plenty?)
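Picking up the shapes from the question, here is a small numpy sketch of that idea (purely hypothetical sizes). The only hard constraint is that Q and K share a width so QK^T is defined; V, and therefore W_out, can use a different width.

```python
import numpy as np

n_img, n_txt, d_txt = 4, 5, 6
text = np.random.randn(n_txt, d_txt)  # 5 text tokens, 6-dim each
q    = np.random.randn(n_img, 2)      # image queries, already projected to width 2

W_K  = np.random.randn(d_txt, 2)      # W_CA^K with shape (6, 2): must match q's width
W_V  = np.random.randn(d_txt, 3)      # W_CA^V with shape (6, 3): free to differ
K, V = text @ W_K, text @ W_V

sim   = q @ K.T                       # (4, 5) similarity matrix
out   = sim @ V                       # (4, 3)
W_out = np.random.randn(3, 3)         # the (3, 3) W_out from the question
print((out @ W_out).shape)            # (4, 3)
```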
@XuLei-ml2vd · a year ago
​@@outliier Thank you so much for your reply! I really appreciate it✨
@harshaldharpure9921 · 3 months ago
I have two features, x → text feature and y → image feature, plus rag → rag_feature (an extra feature). I want to apply cross attention between rag and (x/y). How should I apply it?
@vaibhavsingh1049 · a year ago
Thanks for the video. Can you explain how an extra image (like in img2img) is used in cross-attention along with the text input? In all there are 3 inputs: the noisy latent x_t, the conditional input image, and the text. How are they mixed together?
@outliier · a year ago
Normal image2image doesn't use the image embedding in cross attention. You can just take your image, noise it (to about 60-80%) and then do the sampling from there. So the image is not used in the cross attention at all
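A rough sketch of that img2img procedure, for anyone who wants to see it written out. The function `denoise_step`, the schedule tensor and the argument names are placeholders, not any particular library's API; only the idea of partially noising the latent and denoising from there comes from the reply above.

```python
import torch

def img2img(latent, text_embedding, alphas_cumprod, denoise_step, strength=0.7):
    """Noise the input latent part of the way (here ~70%), then denoise from there."""
    T = len(alphas_cumprod)
    t_start = int(T * strength)                  # e.g. 60-80% of the noise schedule
    a_bar = alphas_cumprod[t_start]
    x = a_bar.sqrt() * latent + (1 - a_bar).sqrt() * torch.randn_like(latent)
    for t in reversed(range(t_start)):
        # the text conditioning only enters inside the UNet's cross-attention blocks
        x = denoise_step(x, t, text_embedding)
    return x
```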
@NoahElRhandour · a year ago
Yo, is this manim, or how did you make all of this (the numbers in the matrices or e.g. the arrows)??? Insanely good!!!
@outliier · a year ago
Yes, with manim
@user-hm6sh6pl7r · 10 days ago
Thanks for the explanation, it's awesome! But I have a question. In cross attention, if we set the text as V, the final attention output can be viewed as a weighted sum of the words in V (the "weighted" part comes from the Q, K similarity). If I understand correctly, the final attention output should then contain values from the text domain, so why can we multiply by a W_out projection and get a result in the image domain (added to the original image)? Would it make more sense to set the text condition as Q, and the image as K, V?
@outliier · 10 days ago
If the text conditioning is q then it would not have the same shape as your image. So q needs to be the image
@akkokagari7255 · 23 days ago
Wonderful explanation! Not sure if this is in the original papers, but I find it very odd that there is no nonlinearity after V and before W_out. It seems like a waste to me, since Attention@V is itself a linear function, so W_out won't necessarily change the content of the data beyond what Attention@V already would have done through training.
@akkokagari7255 · 23 days ago
Whoops, I meant the similarity matrix, not Attention
@rikki146 · a year ago
12:17 While I understand Q and K are just matrices used to calculate similarities, why is V the text instead of the latent representation of the picture? I thought Softmax(QK^T/sqrt(d))V is an input to a UNet that accepts a latent representation of a picture? Or is the UNet trained to accept weighted word embeddings instead? From what I understand, the UNet accepts a latent representation of the image as input and outputs the predicted noise (in latent space). So why would weighted word conditions work? Let me know if I have any misunderstandings... thanks.
@outliier · a year ago
Hey. Cross attention is only used as blocks inside the UNet, so it's part of it. The input to the UNet is the noised image, and then you usually have a stack of ResNet, Attention and Cross Attention blocks. The Cross Attention blocks are used to inject the conditioning. And V is the text because this way the latent representations of the image can attend to the word tokens and choose where to draw information from. Does that help?
@rikki146 · a year ago
@@outliier To be honest, I am still puzzled lol. I will go back and study it more closely. I just thought that in order to predict noise, your input must be a noisy image (in latent representation), instead of weighted word embeddings. Thanks for the answer though.
@outliier · a year ago
@@rikki146 And that's totally right. The input is a noised image. The text only comes in as conditioning through the cross attention, to give the model more information about what the noised image could be. And the way this works is by letting the image pixels attend to the word embeddings. Then, through the skip connection, the pixels are modified based on the word embeddings
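To make the "blocks inside the UNet" picture concrete, here is a rough PyTorch sketch of such a block. It is simplified and hypothetical: real implementations (e.g. Stable Diffusion's transformer blocks) additionally use layer norms, multi-head attention and a feed-forward layer.

```python
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, d_img, d_txt, d_attn=64):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_img, num_heads=1, batch_first=True)
        self.to_q   = nn.Linear(d_img, d_attn)   # queries come from the image tokens
        self.to_k   = nn.Linear(d_txt, d_attn)   # keys come from the text tokens
        self.to_v   = nn.Linear(d_txt, d_attn)   # values come from the text tokens
        self.to_out = nn.Linear(d_attn, d_img)

    def forward(self, img_tokens, text_tokens):
        # self attention: image tokens attend to each other, added via a skip connection
        h, _ = self.self_attn(img_tokens, img_tokens, img_tokens)
        x = img_tokens + h
        # cross attention: image tokens (queries) attend to the text tokens (keys/values)
        q, k, v = self.to_q(x), self.to_k(text_tokens), self.to_v(text_tokens)
        attn = (q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5).softmax(dim=-1)
        return x + self.to_out(attn @ v)          # skip connection injects the conditioning
```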
@kaanvural2920 · 10 months ago
That's a great explanation and presentation, thank you. I have a question about the "condition" part. Can cross attention also use image information as the condition instead of text? If yes, what does the condition mean for images then? Thanks in advance
@outliier · 10 months ago
Do you mean whether it's possible to use cross attention just with images, without any text?
@kaanvural2920 · 10 months ago
@@outliier Yes, that's what I was trying to say 😅
@outliier · 10 months ago
@@kaanvural2920 Then it naturally becomes self-attention, which is what I explained in the first part of the video
@kaanvural2920 · 10 months ago
@@outliier I remember that I came across cross attention when I was reading about detection or segmentation models. Maybe I am mistaken, I'll check. Thanks again 🙏
@outliier · 10 months ago
@@kaanvural2920 Ah, I think I understand. As soon as you are talking about conditional information in any form (text information, image information (e.g. in the form of segmentation maps) or anything else), you can use cross attention
@EngRiadAlmadani · a year ago
The best! From now on I will not put self attention in front of a large feature map, it would take too long 😆
@Tremor244 · 5 months ago
The math makes sense, but how it results in a coherent image will forever be a mystery to me lol
@lewingtonn · a year ago
this channel is certainly an outlier in terms of how good these fucking videos are
@juansmith4217 · a year ago
'Promo sm'
@swaystar1235 · 8 months ago
naruhodo (does not naruhodo)
@VerdonTrigance · 5 months ago
I'm lost on Self Attention. Man, you are too fast! It feels like you're explaining it to someone who already knows it, but for the rest of us it takes time to chew on. By the time I understand the previous sentence you are already far ahead, talking about something else. In the end I still don't get it, but thanks for trying, especially for the mathematical approach.
@saeed577 · 11 months ago
Great presentation video, bad explanations.
@outliier · 11 months ago
Thank you. Let me know what, in your opinion, I could do differently for the explanation part
@youtubercocuq · a year ago
perfect video thanks
@bootyhole · 4 months ago
Awesome!
@rajpulapakura001 · 6 months ago
Thanks for the crystal clear explanation!