This was the best video on ViTs I found, and I looked at way too many. Seeing the clearly written code with comments makes everything so much clearer. Thank you so much for sharing!
@Explaining-AI 9 months ago
Really happy that the video was of some help to you!
@lucaherrmann77 9 months ago
@@Explaining-AI All three ViT videos were fantastic :D
Amazing. The only part I did not understand is classification. Does this achieve zero-shot learning, or do we need to fine-tune the pretrained model with hard labels to make it a classifier? Thanks.. amazing transformers series.. best best best!!!!
@Explaining-AI 5 months ago
Yes, you would need to fine-tune/train it on your dataset. Typically you would have an fc layer on top of the CLS token representation, and through training the model (say on MNIST), it will learn to attend to the right patches and build a CLS representation that allows it to correctly classify which digit it is.
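To make that concrete, here is a minimal sketch of such a fine-tuning setup in PyTorch. It assumes a pretrained ViT encoder that returns per-token outputs of shape (batch, num_tokens, embed_dim) with the CLS token at position 0; the `encoder` argument, dimensions, and class count are placeholders, not the exact code from the video.

```python
import torch
import torch.nn as nn

class ViTClassifier(nn.Module):
    # Hypothetical wrapper: `encoder` is any pretrained ViT that returns
    # token representations of shape (batch, num_tokens, embed_dim),
    # with the CLS token assumed to sit at position 0 of the token axis.
    def __init__(self, encoder, embed_dim=768, num_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, num_classes)  # fc layer on top of CLS

    def forward(self, x):
        tokens = self.encoder(x)       # (B, N, D) per-token representations
        cls_repr = tokens[:, 0, :]     # (B, D) CLS token representation
        return self.head(cls_repr)     # (B, num_classes) logits

# Fine-tuning loop sketch (e.g. on MNIST with hard labels):
# logits = model(images)
# loss = nn.functional.cross_entropy(logits, labels)
# loss.backward(); optimizer.step()
```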
@BhrantoPathik 2 months ago
The CLS token should be *out[0, :]*, correct? You have selected the row instead. Also, the part where you explained how the attention layers are learning patterns doesn't seem clear to me. Would you mind clarifying it? Does this visualization imply that different attention heads learn different components from the image?
@Explaining-AI 2 months ago
In the implementation, the first index is the batch index and the second index is the token position in the sequence, which is why selecting the 0th token gives us the CLS token's representation. Regarding the attention layers, could you tell me the specific visualization/timestamp you are referring to? Is it the images @6:15? Different attention heads do learn to capture different notions of similarity, which are then combined to give the contextual representation for each token. However, in this video I did not get into analyzing each head separately; rather, the goal was to use rollout (arxiv.org/pdf/2005.00928) to visualize which spatial tokens the CLS token was attending to. And like the rollout paper, we averaged the attention weights over all heads for each layer.
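For reference, here is a rough sketch of attention rollout along those lines. It assumes the per-layer attention weights have already been collected during the forward pass (the `attn_maps` list here, one tensor per layer of shape (batch, heads, tokens, tokens), is a hypothetical input), and it follows the average-over-heads and identity-for-residual steps from the rollout paper rather than reproducing the exact code from the video.

```python
import torch

def attention_rollout(attn_maps):
    # attn_maps: list of per-layer attention tensors, each of shape
    # (batch, num_heads, num_tokens, num_tokens), collected during a forward pass.
    rollout = None
    for attn in attn_maps:
        attn = attn.mean(dim=1)                                      # average over heads
        attn = attn + torch.eye(attn.size(-1), device=attn.device)   # account for residual connection
        attn = attn / attn.sum(dim=-1, keepdim=True)                 # renormalize rows
        rollout = attn if rollout is None else attn @ rollout        # compose layers
    # Row 0 of the result shows which tokens the CLS token attends to;
    # dropping column 0 keeps only the spatial (patch) tokens.
    return rollout[:, 0, 1:]
```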
@BhrantoPathik 2 months ago
@@Explaining-AI Correct, I missed the batch dimension. Yeah, I was talking about the attention heads themselves. I will go through the paper. Thank you. Will you please cover the RT-DETR paper? Currently I am working on a project where I am trying to extract text from images. Tesseract OCR on its own wasn't helpful, so I tried object detection models (YOLO, RT-DETR) and passed the detected bboxes through the Tesseract engine for text extraction. It was somewhat successful, although not very accurate. Any suggestions for this?
@signitureDGK 8 months ago
Hey, really great video. Could you make a video explaining latent diffusion models (DDIM samplers) and how inpainting works in latent space, etc.? Also, with OpenAI Sora released, I think diffusion models will be even more popular, and I saw Sora works on a sort of ViT architecture. Thanks!
@Explaining-AI 8 months ago
Thank you @signitureDGK. Yes, I have a playlist kzbin.info/aero/PL8VDJoEXIjpo2S7X-1YKZnbHyLGyESDCe covering diffusion models, and in my next video in that series (Stable Diffusion Part II) I will cover conditioning in LDMs, in which I will make sure to also go over inpainting. I will soon follow that up with a video on different sampling techniques.