Vision Transformer Quick Guide - Theory and Code in (almost) 15 min

DeepFindr
104,494 views

Comments: 68
@JessSightler 8 months ago
I've changed the output layer a bit... to this:

self.head_ln = nn.LayerNorm(emb_dim)
self.head = nn.Sequential(nn.Linear(int((1 + self.height/self.patch_size * self.width/self.patch_size) * emb_dim), out_dim))

Then in forward:

x = x.view(x.shape[0], int((1 + self.height/self.patch_size * self.width/self.patch_size) * x.shape[-1]))
out = self.head(x)

The downside is that you'll likely get a lot more overfitting, but without it the network was not really training at all.
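For readers who want to try this idea, a minimal self-contained sketch of such a flattened classification head is below. The names (height, width, patch_size, emb_dim, out_dim) are assumptions chosen to mirror the comment, not necessarily the notebook's exact code.

import torch
import torch.nn as nn

class FlattenedHead(nn.Module):
    # Classification head that flattens all token embeddings (CLS + patches)
    # instead of classifying from the CLS token alone, as suggested above.
    def __init__(self, height, width, patch_size, emb_dim, out_dim):
        super().__init__()
        n_tokens = 1 + (height // patch_size) * (width // patch_size)  # CLS + patches
        self.head_ln = nn.LayerNorm(emb_dim)
        self.head = nn.Linear(n_tokens * emb_dim, out_dim)

    def forward(self, x):              # x: (batch, n_tokens, emb_dim)
        x = self.head_ln(x)
        x = x.flatten(start_dim=1)     # (batch, n_tokens * emb_dim)
        return self.head(x)

# quick shape check: 144x144 image, patch size 8, 128-dim embeddings, 10 classes
head = FlattenedHead(144, 144, 8, 128, 10)
print(head(torch.randn(2, 1 + 18 * 18, 128)).shape)  # torch.Size([2, 10])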
@DeepFindr 8 months ago
Hi, thanks for your recommendation. I would probably not use this model for real-world data, as many important details are missing (for the sake of providing a simple overview). I will pin your comment for others who also want to use this implementation. Thank you!
@geekyprogrammer4831 a year ago
This is a very underrated channel. You deserve way more viewers!!
@betabias 7 months ago
Keep making content like this; I am sure you will get very good recognition in the future. Thanks for such amazing content.
@hmind9836 a year ago
You're awesome, man!!! I clicked on your video so fast; you're one of my favorite AI YouTubers. I work in the field, and I think you have a wonderful ability to explain complex concepts in your videos.
@DeepFindr a year ago
Thanks for the kind words :)
@florianhonicke5448 a year ago
Really great explanation. Nice visuals
@DeepFindr a year ago
Much appreciated!
@gayanpathirage7675 7 months ago
There was an error in your published code that is not in the video:
attn_output, attn_output_weights = self.att(x, x, x)
It should be:
attn_output, attn_output_weights = self.att(q, k, v)
Anyway, thanks for sharing the video and code base. It helped me a lot while learning ViT.
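For context, a minimal sketch of what the corrected attention block could look like. The names att, q, k, v are taken from the comment; the surrounding structure is an assumption, not the notebook's exact code.

import torch
import torch.nn as nn

class Attention(nn.Module):
    # Self-attention block: project x to q, k, v, then run multi-head attention.
    def __init__(self, dim, n_heads, dropout=0.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.att = nn.MultiheadAttention(dim, n_heads, dropout=dropout, batch_first=True)

    def forward(self, x):                                      # x: (batch, n_tokens, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn_output, attn_output_weights = self.att(q, k, v)   # the fix from the comment
        return attn_output

x = torch.randn(2, 325, 128)                      # e.g. CLS token + 324 patch tokens
print(Attention(dim=128, n_heads=4)(x).shape)     # torch.Size([2, 325, 128])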
@hemanthvemuluri9997 a year ago
Awesome man!! You code and explain with such simplicity.
@tenma5220 a year ago
This channel is amazing. Please continue making videos!
@cosminpetrescu860 a year ago
Why are the positional embeddings learnable? It doesn't make sense to me
@trendingtech4youth989 a year ago
Because the positional embedding represents the address, or position, of each image patch's information.
@cosminpetrescu860 a year ago
@trendingtech4youth989 In "Attention is all you need", afaik, the positional embeddings are not learnable.
@Omsip123 8 months ago
@trendingtech4youth989 So... they are given just like the patches, why should they be learned?
@sohangundoju8940 8 months ago
Imagine you have the sentence "I made a pizza and put it in an oven, it was tasty". We know that "it" refers to the pizza, but for a model it could mean the pizza is tasty or the oven is tasty. Positional encodings are learned relative to each other: the position of "it" here is defined with respect to "pizza" and "oven". Therefore it is a learned parameter. The video mentions that this doesn't have a significant advantage over simple numbering, but when you teach the model to identify features, it learns each feature together with the position of one patch with respect to the other patches, which is what makes it a learnable parameter.
@isaakcarteraugustus1819 7 months ago
It's not really needed. The original Transformer paper already compared learnable positional embeddings with plain, non-learnable sinusoidal ones and found no big difference. GPT-2 also uses learnable positional embeddings; nowadays it's mostly RoPE doing the positional encoding.
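To make the two options in this thread concrete, here is a minimal sketch of a learnable positional embedding (ViT / GPT-2 style) next to a fixed sinusoidal encoding (original Transformer); the shapes are illustrative only.

import math
import torch
import torch.nn as nn

n_tokens, dim = 325, 128   # e.g. CLS token + 324 patch tokens, 128-dim embeddings

# Option 1: learnable positional embedding - trained like any other weight.
learned_pos = nn.Parameter(torch.randn(1, n_tokens, dim) * 0.02)

# Option 2: fixed sinusoidal encoding - computed once, never trained.
pos = torch.arange(n_tokens).unsqueeze(1)
div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
sinusoid_pos = torch.zeros(n_tokens, dim)
sinusoid_pos[:, 0::2] = torch.sin(pos * div)
sinusoid_pos[:, 1::2] = torch.cos(pos * div)

tokens = torch.randn(8, n_tokens, dim)         # a batch of token embeddings
x_learned = tokens + learned_pos               # broadcasts over the batch
x_fixed = tokens + sinusoid_pos.unsqueeze(0)
print(x_learned.shape, x_fixed.shape)          # torch.Size([8, 325, 128]) twice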
@dfdiasbr 2 months ago
Awesome! Thanks for this video!
@romanlyskov9785 11 months ago
Awesome! Thanks for the excellent explanation!
@beratcokhavali a year ago
At 05:08, how was that calculated? When I calculated the patch shape I got a different result. Could someone explain?
@chandank5266 8 months ago
Yes, exactly, I have the same doubt; for me it's 192 instead of 324.
@user-wm8xr4bz3b 7 months ago
The original image size was 144 (h) x 144 (w). After patch embedding it is transformed into 324 patches, each with an embedding dimension of 128. The 324 comes from (144/8) x (144/8) = 18 x 18 = 324 patches, where 8 is the patch size.
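A quick way to sanity-check these numbers (the 144x144 image, patch size 8 and embedding dimension 128 are the values from the comment above; the Conv2d route is just one common way to implement patch embedding, not necessarily the notebook's):

import torch
import torch.nn as nn

height, width, patch_size, emb_dim = 144, 144, 8, 128
n_patches = (height // patch_size) * (width // patch_size)
print(n_patches)                       # 18 * 18 = 324

# Patch embedding via a strided convolution: one emb_dim-vector per 8x8 patch.
patchify = nn.Conv2d(3, emb_dim, kernel_size=patch_size, stride=patch_size)
img = torch.randn(1, 3, height, width)
patches = patchify(img).flatten(2).transpose(1, 2)
print(patches.shape)                   # torch.Size([1, 324, 128])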
@xxyyzz8464 5 months ago
Why use dropout with GeLU? Didn’t the GeLU paper specifically say one motivation for GeLU was to replace ReLU+dropout with a single GeLU layer?
@KacperPaszkowski-s4b a year ago
Hello, first of all, great tutorial video. I've tried running the provided training code, but after ~400 epochs the loss is still the same (~3.61) and the model always predicts the same class. Do you have an idea what the problem might be?
@DeepFindr a year ago
Hi, have you tried a lower learning rate? Also, is the training loss decreasing, or is it stuck as well?
@KacperPaszkowski-s4b a year ago
@DeepFindr Actually, I've already found one bug in the notebook. In the forward method of the Attention module, the input is passed directly to MultiheadAttention, bypassing the linear layers. Changing the learning rate doesn't affect training at all. Also, while training I've noticed that the model's output converges to all zeros. I've checked the gradients in the network, and it turns out that the gradient flow stops at the PatchEmbedding layer. All layers after it have non-zero gradients. I still don't know why this happens.
@DeepFindr a year ago
Thanks for finding this bug. But I actually think it's not super relevant for this issue - I experimented with the attention previously and tried both ways (with linear layers and without), which is how this bug was created in the first place. When I started the training back then the loss was definitely decreasing, but I didn't expect it to get stuck at some plateau. Typically, when a model always predicts the same class there can be a couple of reasons. I already checked these:
- Input data is normalized
- Too few / too many parameters (I would recommend counting the model parameters to get a feeling for this)
- Learning rate
- SGD optimizer (seems to work a bit better)
- Batch size, I set it to 128
- Embedding size, make it a bit smaller
After 100 epochs the loss also converges to 3.61, but the model predicts different classes. Maybe the dataset is not big enough. What about trying another dataset? Alternatively, try data augmentation. As stated in the video, transformers need to see a lot of examples.
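Two of the checks mentioned above (counting parameters and spotting where gradient flow stops) can be done with a few generic lines like these; model stands for whatever ViT instance is being trained:

import torch

def count_parameters(model):
    # Total number of trainable parameters - useful to judge model capacity.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def report_gradients(model):
    # Print each parameter's gradient norm after loss.backward();
    # an all-zero (or None) gradient points to a layer where flow stops.
    for name, param in model.named_parameters():
        grad_norm = None if param.grad is None else param.grad.norm().item()
        print(f"{name:60s} grad norm: {grad_norm}")

# usage after a forward/backward pass:
#   loss = criterion(model(images), labels)
#   loss.backward()
#   print(count_parameters(model)); report_gradients(model)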
@VoltVipin_VS 3 months ago
The best part of Vision Transformers is the built-in support for interpretability, compared to CNNs where we had to compute saliency maps.
@netanelmad a year ago
Thank you! Very clear and informative.
@kristoferkrus a year ago
Nice video! However, I think it's incorrect that you would get separate vectors for the three channels. This is not how they do it in the paper; there they say that the number of patches is N = HW/P^2, where H and W are the height and width of the original image and (P, P) is the resolution of each patch, so the number of color channels doesn't affect the number of patches you get.
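In other words, following the paper's notation (with C the number of channels), the channels end up in the length of each patch vector, not in the number of patches. A quick check, which also explains the 192 mentioned earlier in the thread (8 x 8 x 3 = 192):

import torch

H, W, P, C = 144, 144, 8, 3
img = torch.randn(C, H, W)

# Cut into non-overlapping P x P patches and flatten each one across channels.
patches = img.unfold(1, P, P).unfold(2, P, P)        # (C, H/P, W/P, P, P)
patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, C * P * P)

print(patches.shape)      # torch.Size([324, 192]) -> N = HW/P^2 patches of length P*P*C
print(H * W // (P * P))   # 324, independent of C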
@anightattheraces a year ago
Very helpful video, thanks!
@comunedipadova1790 a month ago
Has anyone been able to make it converge? What hyperparameters did you modify?
@frommarkham424 2 months ago
As a robot myself, i can confirm that an image really is worth 16x16 words
@vero811 8 months ago
I think there is some confusion between the CLS token and the positional embedding at 6:09?
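For reference, the usual ViT ordering is: prepend the CLS token to the patch embeddings, then add the positional embedding to all tokens. A minimal sketch, with shapes assuming the 324-patch, 128-dim setup discussed elsewhere in the comments:

import torch
import torch.nn as nn

batch, n_patches, dim = 8, 324, 128
patch_tokens = torch.randn(batch, n_patches, dim)

cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embedding = nn.Parameter(torch.randn(1, n_patches + 1, dim) * 0.02)

cls = cls_token.expand(batch, -1, -1)        # one CLS token per image in the batch
x = torch.cat([cls, patch_tokens], dim=1)    # prepend CLS: (batch, 325, dim)
x = x + pos_embedding                        # positions added to CLS and patches alike
print(x.shape)                               # torch.Size([8, 325, 128])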
@adosar7261 6 months ago
Isn't the embedding layer redundant? I mean, we then have the projection matrices, meaning that embedding + projection is a composition of two linear layers.
@josephmargaryan 7 months ago
Is this better for the MNIST challenge than a simple conv network like LeNet?
@datascienceworld 11 months ago
Great tutorial
@frommarkham424 2 months ago
An image is worth 16x16 words🗣🗣🗣🗣🗣🗣🗣💯💯💯💯💯💯💯🔥🔥🔥🔥🔥🔥🔥
@MrMadmaggot 10 months ago
Is the Colab using CUDA? If so, how can I tell whether it is using CUDA?
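A quick way to check from inside the notebook, and to make sure the model and batches are actually moved to the GPU:

import torch

print(torch.cuda.is_available())   # True if the Colab runtime has a GPU attached
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# model = model.to(device)         # then move model and each batch to this device
# images, labels = images.to(device), labels.to(device)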
@abhinavvura4973 8 months ago
Hi there, I have used the code for binary classification, but I'm encountering a problem with accuracy: it shows 100% accuracy only on label 1 and sometimes on label 2. It would be helpful if you could provide a solution.
@DeepFindr 8 months ago
Hi, please see the pinned comment. Maybe this helps :)
@justsomeone3375 a year ago
Can someone help me with the training code in the Google Colab link in the description?
@kitgary a year ago
Awesome video! But I wonder if you reversed the order of LayerNorm and Multi-Head Attention? I think the LayerNorm should be applied after Multi-Head Attention, but your implementation applies the LayerNorm before it.
@DeepFindr a year ago
Hi! Thanks! There is a paper that investigated pre- vs. post-LayerNorm in transformers (see arxiv.org/pdf/2002.04745). The "pre" variant seems to perform better, as opposed to the traditional suggestion in the transformer paper. This is also what most public implementations do :)
@chinnum9716 a year ago
Hey, great video. I have a question though. Isn't the entire point of "pre"-norm that the normalization is applied before the attention computation? But from the code, norm = PreNorm(128, Attention(dim=128, n_heads=4, dropout=0.)), it seems like you are performing attention first and then normalizing, aka post-norm. Please correct me if I'm wrong :)
@DeepFindr a year ago
Hi! In the forward pass of the PreNorm layer there is this line:
self.fn(self.norm(x), **kwargs)
So normalization is applied first and then the function (such as attention in this example). The line you are referencing is just the initialization, not the actual call. Hope that helps :)
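A minimal sketch of such a PreNorm wrapper (the names follow the snippets quoted in this thread; the real notebook may differ in details):

import torch
import torch.nn as nn

class PreNorm(nn.Module):
    # Apply LayerNorm to the input *before* the wrapped function (attention or MLP).
    def __init__(self, dim, fn):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn

    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)   # normalize first, then attend / project

# usage, mirroring the line quoted above (Attention as defined earlier in the comments):
# norm_attention = PreNorm(128, Attention(dim=128, n_heads=4, dropout=0.))
# out = x + norm_attention(x)   # typically wrapped in a residual connection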
@chinnum9716 a year ago
@DeepFindr Stupid of me not to see that first. Thank you for the reply.
@efexzium 8 months ago
Can you please make a video on how to perform inference with a ViT, like Google's open-source Vision Transformer?
@muhammadtariq7474 8 months ago
Where can I get the slides used in the video?
@marcossrivas a year ago
Cool video! What do you think about applying ViTs to signal processing (spectrogram analysis), for example to audio? What advantages could they have over classic convolutional networks?
@sorvex9 8 months ago
Take a look at GPT-4o ("omni") to find out, lol
@murphy1162 a year ago
Hope you could explain Swin Transformer object detection in a new video, please.
@newbie8051 5 months ago
Ah, tough to understand. I guess I'll have to read more on this to fully understand it.
@VoltVipin_VS 3 months ago
You need a deep understanding of the transformer architecture to understand this.
@지능시스템트랙신현수 11 months ago
Thank you!!
@ycombinator765 5 months ago
bro is educated!
@hautran-uc8gz 10 months ago
thank you
@RAZZKIRAN a year ago
Thank you!
@avirangal2044 8 months ago
The video is great, but the training in the code didn't work over the entire 1000 epochs. Even though the code looks logical, there are endless things that can go wrong, so I think it would have been better to do the tutorial with a working ViT notebook.
@DeepFindr 8 months ago
Hi! I think this is because the dataset is too small. Transformers are data-hungry. It should work with a bigger dataset.
@DeepFindr 8 months ago
Also have a look at the pinned comment, maybe that helps :)
@0x00official 11 months ago
And now Sora uses the same algorithm. This video aged so well.
@simpleplant606 10 months ago
Sora is using a DiT (Diffusion Transformer).
@Saed7630 6 months ago
Bravo!