Great explanation with detailed notation. Most of the videos on YouTube offer only an oral explanation, but this kind of symbolic notation is very helpful for grasping the real picture, especially if you want to re-implement the method or build new ideas on top of it. Thank you so much. Please continue helping us by making videos like these.
@ai_lite 9 months ago
Great explanation! Good for you! Don't stop making ML guides!
@mmpattnaik97 2 years ago
Can't stress enough how easy to understand you made it.
@drakehinst271 2 years ago
These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)
@thecheekychinaman6713 a year ago
The best ViT explanation available. Understanding this is also key to understanding DINO and DINOv2.
@drelvenkee1885 a year ago
The best video so far. The animation is easy to follow and the explanation is very straightforward.
@adityapillai3091 10 months ago
Clear, concise, and overall easy to understand for a newbie like me. Thanks!
@valentinfontanger4962 2 years ago
Amazing! I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!
@randomperson5303 2 years ago
lol, same
@sheikhshafayat6984 2 years ago
Man, you made my day! These lectures were golden. I hope you continue to make more of these
@aimeroundiaye1378 3 years ago
Amazing video. It helped me to really understand the vision transformers. Thanks a lot.
@thepresistence5935 2 years ago
15 minutes of heaven 🌿. Thanks a lot, understood clearly!
@Peiying-h4m a year ago
Best ViT explanation ever!!!!!!
@soumyajitdatta9203 a year ago
Thank you. Best ViT video I found.
@vladi21k 2 years ago
Very good explanation, better than many other videos on YouTube. Thank you!
@MonaJalal 3 years ago
This was a great video. Thanks for your time producing great content.
@arash_mehrabi 2 years ago
Thank you for your Attention Models playlist. Well explained.
@DerekChiach 3 years ago
Thank you, your video is way underrated. Keep it up!
@wengxiaoxiong666 a year ago
Good video, what a splendid presentation. Wang Shusen is the GOAT.
@MenTaLLyMenTaL 2 years ago
@9:30 Why do we discard c1... cn and use only c0? How is it that all the necessary information from the image gets collected & preserved in c0? Thanks
@abhinavgarg5611 2 years ago
Hey, did you get answer to your question?
@swishgtv7827 3 years ago
This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!
@ronalkobi4356 6 months ago
Wonderful explanation!👏
@nova2577 2 years ago
If we ignore the outputs c1 ... cn, what do c1 ... cn represent then?
@nehalkalita a year ago
Nicely explained. Appreciate your efforts.
@NisseOhlsen 3 years ago
Very nice job, Shusen, thanks!
@rajgothi2633 a year ago
You have explained ViT in simple words. Thanks
@hongkyulee9724 a year ago
Thank you for the clear explanation!!☺
@lionhuang9209 3 years ago
Very clear, thanks for your work.
@sehaba9531 2 years ago
Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.
@muhammadfaseeh5810 2 years ago
Awesome Explanation. Thank you
@mmazher5826 a year ago
Excellent explanation 👌
@chawkinasrallah7269 8 months ago
The class token output c0 has the embedding dimension; does that mean we should add a linear layer from the embedding dimension to the number of classes before the softmax for classification?
@xXMaDGaMeR a year ago
Amazing, precise explanation.
@ASdASd-kr1ft a year ago
Nice video!! Just a question: what is the argument behind discarding the vectors c1 to cn and keeping only c0? Thanks
@DrAIScience 7 months ago
How is A trained? I mean, what is the loss function? Does it use only the encoder, or both the encoder and decoder?
@aryanmobiny7340 3 years ago
Amazing video. Please do one for Swin Transformers if possible. Thanks a lot.
@t.pranav2834 3 years ago
Awesome explanation man, thanks a tonne!!!
@parmanandchauhan6182 5 months ago
Great explanation. Thank you!
@user-wr4yl7tx3w a year ago
In the job market, do data scientists use transformers?
@tallwaters9708 2 years ago
Brilliant explanation, thank you.
@ervinperetz5973 2 years ago
This is a great explanation video. One nit: you are misusing the term 'dimension'. If a classification vector is linear with 8 values, that's not '8-dimensional'; it is a 1-dimensional vector with 8 values.
@deeplearn6584 2 years ago
Very good explanation, subscribed!
@jidd32 2 years ago
Brilliant. Thanks a million
@boemioofworld 3 years ago
thank you so much for the clear explanation
@saeedataei269 2 years ago
Great video, thanks. Could you please explain the Swin Transformer too?
@mariamwaleed2132 2 years ago
Really great explanation, thank you.
@bbss8758 3 years ago
Can you please explain the paper "Your Classifier Is Secretly an Energy Based Model and You Should Treat It Like One"? I want to understand these energy-based models.
@medomed1105 2 years ago
Great explanation
@BeytullahAhmetKINDAN a year ago
that was educational!
@sudhakartummala4701 2 years ago
Wonderful talk
@ThamizhanDaa1 2 years ago
Why does the transformer require so many images to train? And why doesn't ResNet keep getting better with more training data, compared to ViT?
@ansharora3248 3 years ago
Great explanation :)
@zeweichu550 2 years ago
great video!
@DrAhmedShahin_707 2 years ago
The simplest and most interesting explanation, many thanks. I'm asking about object detection models: have you explained them before?
@fedegonzal 3 years ago
Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, with a CNN you can "see" where the neural network is focusing, but how about with transformers?
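One common way to do this (not covered in the video) is to inspect the attention weights from the [CLS] token to the patch tokens and reshape them onto the patch grid; the grid size and attention weights below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
side = 4                                # assume a 4 x 4 grid of patches
n = 1 + side * side                     # [CLS] token + 16 patch tokens

# Fake attention logits for one head; real ones come from the trained model.
logits = rng.standard_normal((n, n))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Row 0 holds the [CLS] query's weights; dropping its self-weight and
# reshaping the rest onto the patch grid gives a coarse "where it looks" map.
saliency = attn[0, 1:].reshape(side, side)
print(saliency.shape)                   # (4, 4)
```

Plotting `saliency` as a heatmap over the image gives the CNN-style focus visualization; fancier methods (e.g. attention rollout) combine the maps across layers.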
@ME-mp3ne 3 years ago
Really good, thx.
@ogsconnect1312 2 years ago
Good job! Thanks
@DungPham-ai 3 years ago
Amazing video. It helped me to really understand vision transformers. Thanks a lot. But I have a question: why do we only use the CLS token for the classifier?
@NeketShark 3 years ago
It looks like, thanks to the attention layers, the CLS token is able to extract all the information it needs for good classification from the other tokens. Using all the tokens for classification would just unnecessarily increase computation.
@Darkev77 3 years ago
@@NeketShark that’s a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of vector “c” into “p”? I thought softmax would only change the entries of a vector, not its dimensions
@NeketShark 3 years ago
@@Darkev77 I think c0 first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. This was probably omitted in the video for simplicity.
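That head can be sketched in a few lines of numpy; the sizes here are hypothetical (768 matches ViT-Base's width, 8 classes is arbitrary). The linear layer changes the dimension, and softmax only normalizes the result:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())                # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, n_classes = 768, 8                # hypothetical sizes

c0 = rng.standard_normal(d_model)          # encoder output for the [CLS] token

# The linear layer maps d_model values to n_classes logits...
W = 0.02 * rng.standard_normal((n_classes, d_model))
b = np.zeros(n_classes)
logits = W @ c0 + b                        # shape (8,)

# ...and softmax only turns those 8 logits into 8 probabilities.
p = softmax(logits)
print(p.shape)                             # (8,)
```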
@shamsarfeen2729 3 years ago
If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right? I mean, those dense layers act just like the filters of a CNN.
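The filter analogy is exact for the patch-embedding step: splitting the image into patches and projecting each with one shared dense layer is the same computation as a convolution with kernel size = stride = patch size. A tiny numpy sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, P, d_model = 3, 8, 8, 4, 16    # toy sizes chosen for illustration

img = rng.standard_normal((C, H, W))

# Cut the image into non-overlapping P x P patches and flatten each one.
patches = img.reshape(C, H // P, P, W // P, P)
patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, C * P * P)   # (4, 48)

# One shared dense layer projects every patch: equivalent to a Conv2d with
# kernel_size = stride = P, i.e. one bank of convolution filters.
E = 0.02 * rng.standard_normal((C * P * P, d_model))
tokens = patches @ E                     # (num_patches, d_model) = (4, 16)
print(tokens.shape)
```

The difference from a CNN is everything after this step: self-attention mixes all patches globally, rather than stacking local filters.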
@parveenkaur2747 3 years ago
Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on a local computer?
@ShusenWangEng 3 years ago
Unfortunately, no. Google has TPU clusters. The amount of computation is insane.
@parveenkaur2747 3 years ago
@@ShusenWangEng Actually, I have my project proposal due today. I was proposing this on the FOOD-101 dataset; it has 101,000 images. So it can't be done? What size of dataset can we train on a local PC?
@parveenkaur2747 3 years ago
Can you please reply? Stuck at the moment. Thanks
@ShusenWangEng 3 years ago
@@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.
@palyashuk42 3 years ago
Why do the authors evaluate and compare their results with the old ResNet architecture? Why not use EfficientNets for comparison? Looks like not the best result...
@ShusenWangEng 3 years ago
ResNet is a family of CNNs. Many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.
@sevovo a year ago
CNN on images + positional info = Transformers for images
@yinghaohu8784 3 years ago
1) You mentioned the pretrained model uses a large-scale dataset and is then fine-tuned on a smaller dataset. Does that mean everything up to c0 stays almost the same, and only the final softmax layer is adjusted to the new number of classes before training on the fine-tuning dataset? Or are there other different settings? 2) Another doubt: there is no masking at all in ViT, right? Since the idea comes from MLM... um...
@swishgtv7827 3 years ago
The concept has similarities to the TCP protocol in terms of segmentation and positional encoding. 😅😅😅