Amazing; few people can explain this line by line. A great contribution to democratizing AI knowledge!
@sayakpaul3152 · 3 years ago
When you specify `from_logits=True`, softmax is applied first and then the cross-entropy is taken.
@connor-shorten · 3 years ago
Thanks again Sayak, really appreciate it!
@CristianGarcia · 3 years ago
This is the main idea, but internally `log_softmax` is used for performance. Actually, if you pass `from_logits=False`, Keras turns the output of the softmax back into logits via log: github.com/tensorflow/tensorflow/blob/85c8b2a817f95a3e979ecd1ed95bff1dc1335cff/tensorflow/python/keras/backend.py#L4908
@sayakpaul3152 · 3 years ago
Yes, totally correct. I didn't mention it for simplicity. But giving it another thought, I should have been clearer in my answer. Thank you!
@santhoshckumar7367 · 1 year ago
Appreciate your additional clarification. Thanks.
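The `from_logits` discussion above can be sketched in plain Python (a toy illustration of what the loss does conceptually, not Keras's actual implementation, which uses fused log-softmax kernels):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_from_logits(logits, label):
    # What from_logits=True does conceptually: log-softmax + negative log-likelihood.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - log_sum_exp)

def cross_entropy_from_probs(probs, label):
    # What from_logits=False expects: the model output is already a distribution.
    return -math.log(probs[label])

logits = [1.5, 3.5, 2.5]
label = 1
a = cross_entropy_from_logits(logits, label)
b = cross_entropy_from_probs(softmax(logits), label)
assert abs(a - b) < 1e-9  # the two paths agree
```

In Keras, the first path corresponds to `SparseCategoricalCrossentropy(from_logits=True)` applied to raw model outputs with no final softmax activation.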
@sinancalsr726 · 3 years ago
Hi, thanks for the video :) At 10:35, I guess the -1 comes from the number of patches. For example, if batch_size=2, the output dimension of the tf.reshape call will be 2x144x108, since there are 144 patches inside a 72x72 image (patch_size=6). Also, in the plotting loop, we are looping through the second dimension, which has 144 elements.
@connor-shorten · 3 years ago
Thank you so much for the clarification on this!
@sayakpaul3152 · 3 years ago
-1 inside reshaping is a handy trick. Let's say you want to flatten a tensor of shape (batch_size, 512, 16). You can easily do that with something like tf.reshape(your_tensor, (batch_size, -1)). You don't need to explicitly specify the flattened dimension.
@connor-shorten · 3 years ago
Thanks Sayak! I was really confused about that haha
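The patch arithmetic claimed in this thread, and the dimension that `-1` infers, can be checked with a few lines of plain Python (illustrative bookkeeping only; the variable names are not the notebook's):

```python
# Patch bookkeeping for a 72x72 RGB image with patch_size=6, plus the -1 trick.
image_size, patch_size, channels, batch_size = 72, 6, 3, 2

num_patches = (image_size // patch_size) ** 2          # 12 * 12 = 144
patch_dim = patch_size * patch_size * channels         # 6 * 6 * 3 = 108

# tf.reshape(patches, (batch_size, -1, patch_dim)) infers the -1 dimension as
# total elements divided by the product of the known dimensions:
total_elements = batch_size * image_size * image_size * channels
inferred = total_elements // (batch_size * patch_dim)

assert num_patches == 144 and patch_dim == 108
assert inferred == num_patches  # so the output shape is (2, 144, 108)
```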
@sz4746 · 2 years ago
It's so easy to implement ViT. I used to be afraid of using those big models because I thought they would be hard to implement, but Keras and PyTorch both have multi-head attention built in!
@abdurrahmansefer2548 · 2 years ago
Hello, thanks! I want to ask a question: in the input section (the extra learnable [class] embedding), what is index zero (0) used for, and what information does it contain?
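A toy sketch of what the index-0 [class] token does in the paper's formulation (the values here are made up; in a real model the token is a learned parameter and the encoder mixes information into it via attention):

```python
# Prepend a learnable [class] token at index 0 of the patch sequence.
embed_dim = 4
patch_embeddings = [[float(i)] * embed_dim for i in range(3)]  # 3 patch embeddings
cls_token = [0.0] * embed_dim  # learnable parameter, trained like any other weight

sequence = [cls_token] + patch_embeddings  # length 4: index 0 is the [class] token
# After the Transformer encoder (identity here, for illustration), the
# classifier head reads only index 0, which by then has attended to every patch
# and so carries a summary of the whole image.
image_representation = sequence[0]

assert len(sequence) == len(patch_embeddings) + 1
assert image_representation == cls_token
```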
@CristianGarcia · 3 years ago
Since TF 2.0 you can use the regular plus (+) operator instead of the Add layer.
@connor-shorten · 3 years ago
Thanks! Definitely cleans it up a bit
@yaswanth1679 · 3 years ago
Can we implement this ViT on our own dataset?
@ferdoussedjamai1954 · 2 years ago
Did you try to do it?
@annicetrazafindratovolahy1512 · 2 years ago
Hello! Please, can you do a video on how to use Swin Transformer in an autoencoder architecture? Thank you in advance. I have difficulty restoring the patches back into an image (for the decoder part).
@sakibulislam4463 · 2 months ago
Can you share the link to this notebook?
@DiogoSanti · 3 years ago
Cool job... For the `from_logits=True` part: the loss expects only the logits (without the softmax activation), and SparseCategoricalCrossentropy will apply the softmax for you with that option. Just be careful: if people set from_logits=True and still apply a softmax at the end of their network, the loss function (with its softmax) will be applied to what is already a probability distribution.
@connor-shorten · 3 years ago
Thank you so much for the clarification, really appreciate it! What would be the major problem with double softmaxes? I guess slow computation and a massive blowup of large densities come to mind.
@DiogoSanti · 3 years ago
@@connor-shorten Happy I could help, thanks for all the good content!
@LiveLifeWithLove · 2 years ago
@@connor-shorten Softmax does two things: it makes the values sum to 1 (a probability distribution), and it pulls far-apart logits closer together. Apply it once and far-apart logits are transformed into values that are relatively close but still maintain a nice separation. Apply it again and the outputs get even closer; apply it yet again and they are so close that you can no longer find the pattern. Logits X = (1.5, 3.5, 2.5), X1 = softmax(X) ≈ (0.09, 0.67, 0.24), X2 = softmax(X1) ≈ (0.25, 0.45, 0.30).
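The flattening effect described in this comment is easy to verify numerically (a toy check in plain Python, using the same example logits):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [1.5, 3.5, 2.5]
once = softmax(logits)     # ≈ (0.09, 0.67, 0.24): clear winner
twice = softmax(once)      # ≈ (0.25, 0.45, 0.30): much flatter
thrice = softmax(twice)    # flatter still, drifting toward uniform 1/3

# Each extra softmax squeezes the distribution toward uniform,
# so the separation the classifier relies on is progressively destroyed.
assert max(once) > max(twice) > max(thrice)
```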
@pakistanproud8123 · 2 years ago
Can anybody explain this paragraph to me: Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head.
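The quoted paragraph contrasts two ways to turn the encoder's output into one vector for the classifier head. A shape-only sketch of the difference (toy numbers matching the video's setting; this is bookkeeping, not the notebook's code):

```python
# Shape bookkeeping contrasting the two readout strategies.
num_patches, projection_dim = 144, 64

# Paper: a [class] token is prepended, so the encoder sees 145 tokens and
# the head reads only the token at index 0 -> a 64-dim representation.
cls_readout_dim = projection_dim

# Keras example: no [class] token; the final block's output of shape
# (num_patches, projection_dim) is flattened into one long vector
# via layers.Flatten() -> a 144 * 64 = 9216-dim representation.
flatten_readout_dim = num_patches * projection_dim

assert cls_readout_dim == 64
assert flatten_readout_dim == 9216
```

Both are valid readouts; the flatten variant just hands the head every token instead of a single summary token.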
@Bomerang23 · 2 years ago
Maybe it's a silly question, but does ViT work on grayscale pictures?
@khalladisofiane9195 · 2 years ago
Please, I have a custom dataset with 3 folders, i.e. 3 classes. How can I use the ViT to do classification?
@NehadHirmiz · 3 years ago
Thank you very much for these amazing videos. Your contribution is key to the applications of these methods.
@وذكرفإنالذكرىتنفعالمؤمنين-ق7ز · 2 years ago
Your explanation is amazing, thank you very much. But I want to ask a question: what is the projection dimension, and why is it 64 when there are 144 patches per image and the index runs from 0 to 143? Thank you very much again for your attention!
@connor-shorten · 2 years ago
Thank you! The projection dimension is analogous to the embedding dimension in, say, word embeddings or any kind of categorical encoding. In the end you transform the feature set into a 144 x 64 representation: 64 dimensions encoding each of the 144 patches.
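In other words, 144 counts the tokens while 64 is the width of each token after a shared linear projection. A minimal sketch of that projection in plain Python (a zero weight matrix stands in for the learned dense layer; only the shapes matter here):

```python
# Each of the 144 flattened patches (6*6*3 = 108 values) is mapped through a
# shared dense layer to projection_dim = 64, giving a (144, 64) token matrix.
num_patches, patch_dim, projection_dim = 144, 108, 64

# Stand-in for the dense layer's weight matrix (the real one is learned).
W = [[0.0] * projection_dim for _ in range(patch_dim)]

def project(patch, W):
    # patch: patch_dim values -> projection_dim values (matrix-vector product)
    return [sum(patch[i] * W[i][j] for i in range(len(patch)))
            for j in range(len(W[0]))]

patch = [1.0] * patch_dim
token = project(patch, W)       # one 64-dim token
tokens = [token] * num_patches  # one token per patch

assert len(token) == projection_dim
assert (len(tokens), len(tokens[0])) == (144, 64)
```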
@lifted1785 · 2 years ago
Was the pun at the end intended? 😂 Funny.
@jason-yb9qk · 1 year ago
Guys, how do I modify the code so I can use a dataset from Kaggle?
@sendjasniabderrezzaq9347 · 3 years ago
Hi, thank you for the explanation. I have a question regarding the variable `position_dim`: how was it chosen? If I change the patch size, do I need to change it too?
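Assuming `position_dim` refers to the number of positions in the positional-embedding table (the name is taken from the comment, not verified against the notebook), it follows directly from the image and patch sizes, so yes, changing the patch size changes it:

```python
# How the number of positional embeddings follows from image_size and
# patch_size (illustrative helper, not the notebook's exact code).
def num_positions(image_size, patch_size):
    assert image_size % patch_size == 0, "patch must tile the image exactly"
    return (image_size // patch_size) ** 2

assert num_positions(72, 6) == 144  # the video's setting
assert num_positions(72, 8) == 81   # a different patch_size changes the count
```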
@chaymaebenhammacht1618 · 2 years ago
Hi, thank you for this video, it's very useful. But I ran into some problems when I used this model for my own image classification on multiple malware classes. I tried many times to solve the problem, but unfortunately couldn't. Can you help me, please?
@billiartag · 3 years ago
Might be a stupid question, but how do you visualize the attention? I'm honestly confused about extracting the attention maps.
@isaacbaffoursenkyire1018 · 3 years ago
Hello, thanks for the video. I have a question, please. I have written the exact same lines of code in Google Colab, but I don't get any results, i.e. after running def run_experiment(model) I don't get any output (the epochs with the accuracy). Is there anything I am not doing right?
@ferdoussedjamai1954 · 2 years ago
Did you find a solution to this problem?
@isaacbaffoursenkyire1018 · 2 years ago
@@ferdoussedjamai1954 No, unfortunately not.
@javaqtquicktutorials1131 · 2 years ago
Hi sir, can I use this code on a custom dataset?
@connor-shorten · 2 years ago
Yes. Be mindful of the resolution and how it changes the hard-coded parameters for the patching; it can be a bit tricky. I recommend borrowing the same matplotlib code to plot the patches and make sure you did it correctly.
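Before retraining on a new dataset, the hard-coded patch parameters can be sanity-checked up front (an illustrative helper, not part of the original notebook):

```python
# Derive and validate patch parameters for a new dataset's resolution.
def patch_config(image_size, patch_size, channels=3):
    if image_size % patch_size != 0:
        raise ValueError(f"patch_size {patch_size} does not evenly tile "
                         f"image_size {image_size}")
    num_patches = (image_size // patch_size) ** 2
    return {"num_patches": num_patches,
            "patch_dim": patch_size * patch_size * channels}

cfg = patch_config(224, 16)  # e.g. an ImageNet-style resolution
assert cfg == {"num_patches": 196, "patch_dim": 768}
```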
@nitishsingla9057 · 3 years ago
I have checked the GitHub link given in the original paper. Is this Keras code different from what is in that GitHub repo?
@Paul_johns · 3 years ago
Great job!! Quick question: I see that the labels in both CSV files are different from previous CNN vision CSV files. Is this because the data needs to be encoded? By any chance, do you know how to encode it? If not, that's okay. Thanks for your videos!
@khalladisofiane9195 · 2 years ago
How can I use this code on my custom data with 3 classes, please?
@draaken0 · 1 year ago
Just change the image size input and set num_classes=3. You can also play with the patch size according to your image shape.
@JoseMiguel_____ · 2 years ago
Great explanation! Keep doing this.
@pesky_mousquito · 3 years ago
Where is the CLS token read?
@mahdiyehbasereh · 1 year ago
It was very helpful, thanks a lot.
@sayakpaul3152 · 3 years ago
I second your thoughts on complementary priors. In fact, BotNets are, IMO, a step in that direction. DeiT as well.
@connor-shorten · 3 years ago
Thanks Sayak! Yeah, DeiT's distillation from CNN activations is incredibly interesting. I think large-scale pre-training data could act as a complementary prior with respect to global aggregation, and to just needing a lot of data to learn it. I hope data augmentations can also be customized to the global prior of ViTs vs. the local prior of CNNs.
@sayakpaul3152 · 3 years ago
@@connor-shorten Yes, seconded. As I mentioned earlier along those lines, BotNet seems to be a really good proposal not only for image classification but for other tasks as well (instance segmentation, object detection) where modeling long-range dependencies is crucial.
@jayakrishnankv1681 · 2 years ago
Thank you for the video.
@suke933 · 3 years ago
Hi Henry, could you kindly explain how this can be used for binary classification problems?
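There are two common ways to adapt the head for the binary case, assuming the notebook's loss setup: keep num_classes=2 with SparseCategoricalCrossentropy(from_logits=True), or use a single output unit with BinaryCrossentropy(from_logits=True). The two are numerically equivalent, as this plain-Python sketch checks (toy math, not the notebook's code):

```python
import math

def bce_from_logit(z, y):
    # Binary cross-entropy on a raw logit z with label y in {0, 1}
    # (conceptually what BinaryCrossentropy(from_logits=True) computes).
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def softmax_ce(logits, label):
    # Softmax cross-entropy from logits (the two-class head's loss).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - log_sum_exp)

# A single-logit head with logit z behaves like a two-logit head [0, z]:
z, y = 0.8, 1
assert abs(bce_from_logit(z, y) - softmax_ce([0.0, z], y)) < 1e-9
```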
@AbdennacerAyeb · 3 years ago
We miss the weekly updates in AI...
@connor-shorten · 3 years ago
Thank you so much for your interest in the series! I’m hoping to get back to it soon
@dome8116 · 3 years ago
@@connor-shorten Yes, please bring them back, sir.
@WahranRai · 2 years ago
Too much animation. What about reducing your speed and letting us examine the slides?
@squirrel4635 · 1 year ago
How much coffee did you drink?
@m.hassan8142 · 3 years ago
I came here from the BERT model.
@graceln2480 · 3 years ago
Too fast; illustrations with figures as you explain would be more useful.
@yifeipei5484 · 3 years ago
If you want to explain the code, you should understand every part of it. For "from_logits", if you didn't know what it does, you should have looked it up in the TensorFlow API reference before the tutorial. However, you didn't, which was very lazy.
@khalladisofiane9195 · 2 years ago
Hi, thanks. Can you help me? I want to use ViT on my custom dataset for classification. Please, can I get your email?