Amazing; few people can explain this line by line. A great contribution to democratizing AI knowledge!
@sayakpaul3152 · 3 years ago
When you specify `from_logits=True`, softmax is applied first and then the cross-entropy is taken.
@connor-shorten · 3 years ago
Thanks again Sayak, really appreciate it!
@CristianGarcia · 3 years ago
This is the main idea, but internally `log_softmax` is used for performance. Actually, if you pass `from_logits=False`, Keras turns the output of the softmax back into logits via log: github.com/tensorflow/tensorflow/blob/85c8b2a817f95a3e979ecd1ed95bff1dc1335cff/tensorflow/python/keras/backend.py#L4908
@sayakpaul3152 · 3 years ago
Yes, totally correct. I didn't mention it for simplicity. But giving it another thought, I should have been clearer in my answer. Thank you!
@santhoshckumar7367 · 1 year ago
Appreciate your additional clarification. Thanks.
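The `from_logits` discussion above can be sketched in plain Python (a toy illustration of what the loss does conceptually, not Keras's actual implementation, which uses fused log-softmax kernels):

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy_from_logits(logits, label):
    # What from_logits=True does conceptually: log-softmax + negative log-likelihood.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - log_sum_exp)

def cross_entropy_from_probs(probs, label):
    # What from_logits=False expects: the model output is already a distribution.
    return -math.log(probs[label])

logits = [1.5, 3.5, 2.5]
label = 1
a = cross_entropy_from_logits(logits, label)
b = cross_entropy_from_probs(softmax(logits), label)
assert abs(a - b) < 1e-9  # the two paths agree
```

In Keras, the first path corresponds to `SparseCategoricalCrossentropy(from_logits=True)` applied to raw model outputs with no final softmax activation.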
@sinancalsr726 · 3 years ago
Hi, thanks for the video :) At 10:35, I guess the -1 comes from the number of patches. For example, if batch_size=2, the output dimension of the tf.reshape call will be 2x144x108, since there are 144 patches inside a 72x72 image (patch_size=6). Also, in the plotting loop, we are looping through the second dimension, which has 144 elements.
@connor-shorten · 3 years ago
Thank you so much for the clarification on this!
@sayakpaul3152 · 3 years ago
-1 inside reshaping is a handy trick. Let's say you want to flatten a tensor of shape (batch_size, 512, 16). You can easily do that with something like tf.reshape(your_tensor, (batch_size, -1)). You don't need to explicitly specify the flattened dimension.
@connor-shorten · 3 years ago
Thanks Sayak! I was really confused about that haha
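The patch arithmetic claimed in this thread, and the dimension that `-1` infers, can be checked with a few lines of plain Python (illustrative bookkeeping only; the variable names are not the notebook's):

```python
# Patch bookkeeping for a 72x72 RGB image with patch_size=6, plus the -1 trick.
image_size, patch_size, channels, batch_size = 72, 6, 3, 2

num_patches = (image_size // patch_size) ** 2          # 12 * 12 = 144
patch_dim = patch_size * patch_size * channels         # 6 * 6 * 3 = 108

# tf.reshape(patches, (batch_size, -1, patch_dim)) infers the -1 dimension as
# total elements divided by the product of the known dimensions:
total_elements = batch_size * image_size * image_size * channels
inferred = total_elements // (batch_size * patch_dim)

assert num_patches == 144 and patch_dim == 108
assert inferred == num_patches  # so the output shape is (2, 144, 108)
```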
@sz4746 · 2 years ago
It's so easy to implement ViT. I used to be afraid of using those big models because I thought they would be hard to implement, but Keras and PyTorch both have multi-head attention built in!
@abdurrahmansefer2548 · 2 years ago
Hello, thanks! I want to ask a question: in the input section (the extra learnable [class] embedding), what is index zero (0) used for, and what information does it contain?
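A toy sketch of what the index-0 [class] token does in the paper's formulation (the values here are made up; in a real model the token is a learned parameter and the encoder mixes information into it via attention):

```python
# Prepend a learnable [class] token at index 0 of the patch sequence.
embed_dim = 4
patch_embeddings = [[float(i)] * embed_dim for i in range(3)]  # 3 patch embeddings
cls_token = [0.0] * embed_dim  # learnable parameter, trained like any other weight

sequence = [cls_token] + patch_embeddings  # length 4: index 0 is the [class] token
# After the Transformer encoder (identity here, for illustration), the
# classifier head reads only index 0, which by then has attended to every patch
# and so carries a summary of the whole image.
image_representation = sequence[0]

assert len(sequence) == len(patch_embeddings) + 1
assert image_representation == cls_token
```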
@CristianGarcia · 3 years ago
Since TF 2.0 you can use the regular plus (+) operator instead of the Add layer.
@connor-shorten · 3 years ago
Thanks! Definitely cleans it up a bit
@yaswanth1679 · 3 years ago
Can we implement this ViT on our own dataset?
@ferdoussedjamai1954 · 2 years ago
Did you try to do it?
@annicetrazafindratovolahy1512 · 2 years ago
Hello! Please, can you do a video on how to use Swin Transformer in an autoencoder architecture? Thank you in advance. I have difficulty restoring the patches back into an image (for the decoder part).
@sakibulislam4463 · 2 months ago
Can you share the link to this notebook?
@DiogoSanti · 3 years ago
Cool job... For the `from_logits=True` part: the loss expects only the logits (without the softmax activation), and SparseCategoricalCrossentropy will apply the softmax for you with that option. Just be careful: if people set from_logits=True and still apply a softmax at the end of their network, the loss function (with its softmax) will be applied to what is already a probability distribution.
@connor-shorten · 3 years ago
Thank you so much for the clarification, really appreciate it! What would be the major problem with double softmaxes? I guess slow computation and a massive blowup of large densities come to mind.
@DiogoSanti · 3 years ago
@@connor-shorten Happy I could help, thanks for all the good content!
@LiveLifeWithLove · 2 years ago
@@connor-shorten Softmax does two things: it makes the values sum to 1 (a probability distribution), and it pulls far-apart logits closer together. Apply it once and far-apart logits are transformed into values that are relatively close but still maintain a nice separation. Apply it again and the outputs get even closer; apply it yet again and they are so close that you can no longer find the pattern. Logits X = (1.5, 3.5, 2.5), X1 = softmax(X) ≈ (0.09, 0.67, 0.24), X2 = softmax(X1) ≈ (0.25, 0.45, 0.30).
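The flattening effect described in this comment is easy to verify numerically (a toy check in plain Python, using the same example logits):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [1.5, 3.5, 2.5]
once = softmax(logits)     # ≈ (0.09, 0.67, 0.24): clear winner
twice = softmax(once)      # ≈ (0.25, 0.45, 0.30): much flatter
thrice = softmax(twice)    # flatter still, drifting toward uniform 1/3

# Each extra softmax squeezes the distribution toward uniform,
# so the separation the classifier relies on is progressively destroyed.
assert max(once) > max(twice) > max(thrice)
```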
@pakistanproud8123 · 2 years ago
Can anybody explain this paragraph to me: Unlike the technique described in the paper, which prepends a learnable embedding to the sequence of encoded patches to serve as the image representation, all the outputs of the final Transformer block are reshaped with layers.Flatten() and used as the image representation input to the classifier head.
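The quoted paragraph contrasts two ways to turn the encoder's output into one vector for the classifier head. A shape-only sketch of the difference (toy numbers matching the video's setting; this is bookkeeping, not the notebook's code):

```python
# Shape bookkeeping contrasting the two readout strategies.
num_patches, projection_dim = 144, 64

# Paper: a [class] token is prepended, so the encoder sees 145 tokens and
# the head reads only the token at index 0 -> a 64-dim representation.
cls_readout_dim = projection_dim

# Keras example: no [class] token; the final block's output of shape
# (num_patches, projection_dim) is flattened into one long vector
# via layers.Flatten() -> a 144 * 64 = 9216-dim representation.
flatten_readout_dim = num_patches * projection_dim

assert cls_readout_dim == 64
assert flatten_readout_dim == 9216
```

Both are valid readouts; the flatten variant just hands the head every token instead of a single summary token.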
@Bomerang23 · 2 years ago
Maybe it's a silly question, but does ViT work on grayscale pictures?
@khalladisofiane9195 · 2 years ago
Please, I have a custom dataset with 3 folders, i.e. 3 classes. How can I use the ViT to do classification?
@NehadHirmiz · 3 years ago
Thank you very much for these amazing videos. Your contribution is key to the applications of these methods.
@وذكرفإنالذكرىتنفعالمؤمنين-ق7ز · 2 years ago
Your explanation is amazing, thank you very much. But I want to ask a question: what is the projection dimension, and why is it 64 when there are 144 patches per image and the index runs from 0 to 143? Thank you very much again for your attention!
@connor-shorten · 2 years ago
Thank you! The projection dimension is analogous to the embedding dimension in, say, word embeddings or any kind of categorical encoding. In the end you transform the feature set into a 144 x 64 representation: 64 dimensions encoding each of the 144 patches.
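In other words, 144 counts the tokens while 64 is the width of each token after a shared linear projection. A minimal sketch of that projection in plain Python (a zero weight matrix stands in for the learned dense layer; only the shapes matter here):

```python
# Each of the 144 flattened patches (6*6*3 = 108 values) is mapped through a
# shared dense layer to projection_dim = 64, giving a (144, 64) token matrix.
num_patches, patch_dim, projection_dim = 144, 108, 64

# Stand-in for the dense layer's weight matrix (the real one is learned).
W = [[0.0] * projection_dim for _ in range(patch_dim)]

def project(patch, W):
    # patch: patch_dim values -> projection_dim values (matrix-vector product)
    return [sum(patch[i] * W[i][j] for i in range(len(patch)))
            for j in range(len(W[0]))]

patch = [1.0] * patch_dim
token = project(patch, W)       # one 64-dim token
tokens = [token] * num_patches  # one token per patch

assert len(token) == projection_dim
assert (len(tokens), len(tokens[0])) == (144, 64)
```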
@lifted1785 · 2 years ago
Was the pun at the end intended? 😂 Funny.
@jason-yb9qk · 1 year ago
Guys, how do I modify the code so I can use a dataset from Kaggle?
@sendjasniabderrezzaq9347 · 3 years ago
Hi, thank you for the explanation. I have a question regarding the variable `position_dim`: how was it chosen? If I change the patch size, do I need to change it too?
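Assuming `position_dim` refers to the number of positions in the positional-embedding table (the name is taken from the comment, not verified against the notebook), it follows directly from the image and patch sizes, so yes, changing the patch size changes it:

```python
# How the number of positional embeddings follows from image_size and
# patch_size (illustrative helper, not the notebook's exact code).
def num_positions(image_size, patch_size):
    assert image_size % patch_size == 0, "patch must tile the image exactly"
    return (image_size // patch_size) ** 2

assert num_positions(72, 6) == 144  # the video's setting
assert num_positions(72, 8) == 81   # a different patch_size changes the count
```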
@chaymaebenhammacht1618 · 2 years ago
Hi, thank you for this video, it's very useful. But I ran into some problems when I used this model for my own image classification on multiple malware classes. I tried many times to solve the problem, but unfortunately couldn't. Can you help me, please?
@billiartag · 3 years ago
Might be a stupid question, but how do you visualize the attention? I'm honestly confused about extracting the attention maps.
@isaacbaffoursenkyire1018 · 3 years ago
Hello, thanks for the video. I have a question, please. I have written the exact same lines of code in Google Colab, but I don't get any results, i.e. after running def run_experiment(model) I don't get any output (the epochs with the accuracy). Is there anything I am not doing right?
@ferdoussedjamai1954 · 2 years ago
Did you find a solution to this problem?
@isaacbaffoursenkyire1018 · 2 years ago
@@ferdoussedjamai1954 No, unfortunately not.
@javaqtquicktutorials1131 · 2 years ago
Hi sir, can I use this code on a custom dataset?
@connor-shorten · 2 years ago
Yes. Be mindful of the resolution and how it changes the hard-coded parameters for the patching; it can be a bit tricky. I recommend borrowing the same matplotlib code to plot the patches and make sure you did it correctly.
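Before retraining on a new dataset, the hard-coded patch parameters can be sanity-checked up front (an illustrative helper, not part of the original notebook):

```python
# Derive and validate patch parameters for a new dataset's resolution.
def patch_config(image_size, patch_size, channels=3):
    if image_size % patch_size != 0:
        raise ValueError(f"patch_size {patch_size} does not evenly tile "
                         f"image_size {image_size}")
    num_patches = (image_size // patch_size) ** 2
    return {"num_patches": num_patches,
            "patch_dim": patch_size * patch_size * channels}

cfg = patch_config(224, 16)  # e.g. an ImageNet-style resolution
assert cfg == {"num_patches": 196, "patch_dim": 768}
```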
@nitishsingla9057 · 3 years ago
I have checked the GitHub link given in the original paper. Is this Keras code different from what is in that GitHub repo?
@Paul_johns · 3 years ago
Great job!! Quick question: I see that the labels in both CSV files are different from previous CNN vision CSV files. Is this because the data needs to be encoded? By any chance, do you know how to encode it? If not, that's okay. Thanks for your videos!
@khalladisofiane9195 · 2 years ago
How can I use this code on my custom data with 3 classes, please?
@draaken0 · 1 year ago
Just change the image size input and set num_classes=3. You can also play with the patch size according to your image shape.
@JoseMiguel_____ · 2 years ago
Great explanation! Keep doing this.
@pesky_mousquito · 3 years ago
Where is the CLS token read?
@mahdiyehbasereh · 1 year ago
It was very helpful, thanks a lot.
@sayakpaul3152 · 3 years ago
I second your thoughts on complementary priors. In fact, BotNets are, IMO, a step in that direction. DeiT as well.
@connor-shorten · 3 years ago
Thanks Sayak! Yeah, DeiT's distillation from CNN activations is incredibly interesting. I think large-scale pre-training data could act as a complementary prior with respect to global aggregation, and to just needing a lot of data to learn it. I hope data augmentations can also be customized to the global prior of ViTs vs. the local prior of CNNs.
@sayakpaul3152 · 3 years ago
@@connor-shorten Yes, seconded. As I mentioned earlier along those lines, BotNet seems to be a really good proposal not only for image classification but for other tasks as well (instance segmentation, object detection) where modeling long-range dependencies is crucial.
@jayakrishnankv1681 · 2 years ago
Thank you for the video.
@suke933 · 3 years ago
Hi Henry, could you kindly explain how this can be used for binary classification problems?
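There are two common ways to adapt the head for the binary case, assuming the notebook's loss setup: keep num_classes=2 with SparseCategoricalCrossentropy(from_logits=True), or use a single output unit with BinaryCrossentropy(from_logits=True). The two are numerically equivalent, as this plain-Python sketch checks (toy math, not the notebook's code):

```python
import math

def bce_from_logit(z, y):
    # Binary cross-entropy on a raw logit z with label y in {0, 1}
    # (conceptually what BinaryCrossentropy(from_logits=True) computes).
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def softmax_ce(logits, label):
    # Softmax cross-entropy from logits (the two-class head's loss).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return -(logits[label] - log_sum_exp)

# A single-logit head with logit z behaves like a two-logit head [0, z]:
z, y = 0.8, 1
assert abs(bce_from_logit(z, y) - softmax_ce([0.0, z], y)) < 1e-9
```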
@AbdennacerAyeb · 3 years ago
We miss the weekly updates in AI...
@connor-shorten · 3 years ago
Thank you so much for your interest in the series! I’m hoping to get back to it soon
@dome8116 · 3 years ago
@@connor-shorten Yes, please bring them back, sir.
@WahranRai · 2 years ago
Too much animation. What about reducing your speed and letting us examine the slides?
@squirrel4635 · 1 year ago
How much coffee did you drink?
@m.hassan8142 · 3 years ago
I came here from the BERT model.
@graceln2480 · 3 years ago
Too fast; illustrations with figures as you explain would be more useful.
@yifeipei5484 · 3 years ago
If you want to explain the code, you should understand every part of it. For "from_logits", if you didn't know what it does, you should have looked it up in the TensorFlow API reference before the tutorial. However, you didn't, which was very lazy.
@khalladisofiane9195 · 2 years ago
Hi, thanks. Can you help me? I want to use ViT on my custom dataset for classification. Please, can I get your email?