This was the best video on ViTs I found, and I looked at way too many. Seeing the clearly written code with comments makes everything so much clearer. Thank you so much for sharing!
@Explaining-AI 9 months ago
Really happy that the video was of some help to you!
@lucaherrmann77 9 months ago
@@Explaining-AI All three ViT videos were fantastic :D
Amazing. The only part I did not understand is classification. Does this achieve zero-shot learning, or do we need to fine-tune the pretrained model with hard labels to make it a classifier? Thanks.. amazing transformers series.. best best best!!!!
@Explaining-AI 5 months ago
Yes, you would need to fine-tune/train it on your dataset. Typically you would have an fc layer on top of the CLS token representation, and through training the model (say on MNIST), it will learn to attend to the right patches and build a CLS representation that allows it to correctly classify which digit it is.
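To make that concrete, here is a minimal sketch of such a fine-tuning setup in PyTorch. It assumes a pretrained ViT encoder that returns per-token outputs of shape (batch, num_tokens, embed_dim) with the CLS token at position 0; the `encoder` argument, dimensions, and class count are placeholders, not the exact code from the video.

```python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    # Hypothetical wrapper: `encoder` is any pretrained ViT that returns
    # token representations of shape (batch, num_tokens, embed_dim),
    # with the CLS token assumed to sit at position 0 of the token axis.
    def __init__(self, encoder, embed_dim=768, num_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, num_classes)  # fc layer on top of CLS

    def forward(self, x):
        tokens = self.encoder(x)       # (B, N, D) per-token representations
        cls_repr = tokens[:, 0, :]     # (B, D) CLS token representation
        return self.head(cls_repr)     # (B, num_classes) logits

# Fine-tuning loop sketch (e.g. on MNIST with hard labels):
# logits = model(images)
# loss = nn.functional.cross_entropy(logits, labels)
# loss.backward(); optimizer.step()
```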
@BhrantoPathik 2 months ago
The CLS token should be *out[0, :]*, correct? You have selected the row instead. Also, the part where you explained how the attention layers are learning patterns doesn't seem clear to me. Would you mind clarifying it? Does this visualization imply that different attention heads learn different components from the image?
@Explaining-AI 2 months ago
In the implementation, the first index is the batch index and the second index is the token position in the sequence, which is why selecting the 0th token gives us the CLS token's representation. Regarding the attention layers, could you tell me the specific visualization/timestamp you are referring to? Is it the images @6:15? Different attention heads do learn to capture different notions of similarity, which are then combined to give the contextual representation for each token. However, in this video I did not get into analyzing each head separately; rather, the goal was to use rollout (arxiv.org/pdf/2005.00928) to visualize which spatial tokens the CLS token was attending to. And like the rollout paper, we averaged the attention weights over all heads for each layer.
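For reference, here is a rough sketch of attention rollout along those lines. It assumes the per-layer attention weights have already been collected during the forward pass (the `attn_maps` list here, one tensor per layer of shape (batch, heads, tokens, tokens), is a hypothetical input), and it follows the average-over-heads and identity-for-residual steps from the rollout paper rather than reproducing the exact code from the video.

```python
import torch

def attention_rollout(attn_maps):
    # attn_maps: list of per-layer attention tensors, each of shape
    # (batch, num_heads, num_tokens, num_tokens), collected during a forward pass.
    rollout = None
    for attn in attn_maps:
        attn = attn.mean(dim=1)                                      # average over heads
        attn = attn + torch.eye(attn.size(-1), device=attn.device)   # account for residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)                 # renormalize rows
        rollout = attn if rollout is None else attn @ rollout        # compose layers
    # Row 0 of the result shows which tokens the CLS token attends to;
    # dropping column 0 keeps only the spatial (patch) tokens.
    return rollout[:, 0, 1:]
```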
@BhrantoPathik 2 months ago
@@Explaining-AI Correct, I missed the batch dimension. Yeah, I was talking about the attention heads themselves. I will go through the paper. Thank you. Will you please cover the RT-DETR paper? Currently I am working on a project where I am trying to extract text from images. Tesseract OCR on its own wasn't helpful, so I tried object detection models (YOLO, RT-DETR) and passed the detected bboxes through the Tesseract engine for text extraction. It was somewhat successful, although not very accurate. Any suggestions for this?
@signitureDGK 8 months ago
Hey, really great video. Could you make a video explaining latent diffusion models (DDIM samplers) and how inpainting works in latent space, etc.? Also, with OpenAI Sora released, I think diffusion models will be even more popular, and I saw Sora works on a sort of ViT architecture. Thanks!
@Explaining-AI 8 months ago
Thank you @signitureDGK. Yes, I have a playlist kzbin.info/aero/PL8VDJoEXIjpo2S7X-1YKZnbHyLGyESDCe covering diffusion models, and in my next video in that series (Stable Diffusion Part II) I will cover conditioning in LDMs, in which I will make sure to also go over inpainting. I will soon follow that up with a video on different sampling techniques.