Vision Transformer for Image Classification

123,893 views

Shusen Wang

Comments: 84
@UzzalPodder 3 years ago
Great explanation with detailed notation. Most of the videos found on YouTube are some kind of oral explanation, but this kind of symbolic notation is very helpful for grasping the real picture, especially if anyone wants to re-implement the method or add new ideas to it. Thank you so much. Please continue helping us by making these kinds of videos.
@ai_lite 9 months ago
Great explanation! Good for you! Don't stop giving ML guides!
@mmpattnaik97 2 years ago
Can't stress enough how easy to understand you made it.
@drakehinst271 2 years ago
These are some of the best, hands-on and simple explanations I've seen in a while on a new CS method. Straight to the point with no superfluous details, and at a pace that let me consider and visualize each step in my mind without having to constantly pause or rewind the video. Thanks a lot for your amazing work! :)
@thecheekychinaman6713 1 year ago
The best ViT explanation available. Also key for understanding DINO and DINOv2.
@drelvenkee1885 1 year ago
The best video so far. The animation is easy to follow and the explanation is very straightforward.
@adityapillai3091 10 months ago
Clear, concise, and overall easy to understand for a newbie like me. Thanks!
@valentinfontanger4962 2 years ago
Amazing, I am in a rush to implement a vision transformer as an assignment, and this saved me so much time!
@randomperson5303 2 years ago
lol, same
@sheikhshafayat6984 2 years ago
Man, you made my day! These lectures were golden. I hope you continue to make more of these
@aimeroundiaye1378 3 years ago
Amazing video. It helped me to really understand the vision transformers. Thanks a lot.
@thepresistence5935 2 years ago
15 minutes of heaven 🌿. Thanks a lot, understood it clearly!
@Peiying-h4m 1 year ago
Best ViT explanation ever!!!!!!
@soumyajitdatta9203 1 year ago
Thank you. Best ViT video I found.
@vladi21k 2 years ago
Very good explanation, better than many other videos on YouTube, thank you!
@MonaJalal 3 years ago
This was a great video. Thanks for your time producing great content.
@arash_mehrabi 2 years ago
Thank you for your Attention Models playlist. Well explained.
@DerekChiach 3 years ago
Thank you, your video is way underrated. Keep it up!
@wengxiaoxiong666 1 year ago
Good video, what a splendid presentation. Wang Shusen yyds (the GOAT).
@MenTaLLyMenTaL 2 years ago
@9:30 Why do we discard c1... cn and use only c0? How is it that all the necessary information from the image gets collected & preserved in c0? Thanks
@abhinavgarg5611 2 years ago
Hey, did you get an answer to your question?
@swishgtv7827 3 years ago
This reminds me of Encarta encyclopedia clips when I was a kid lol! Good job mate!
@ronalkobi4356 6 months ago
Wonderful explanation!👏
@nova2577 2 years ago
If we ignore outputs c1 ... cn, what do c1 ... cn represent then?
@nehalkalita 1 year ago
Nicely explained. Appreciate your efforts.
@NisseOhlsen 3 years ago
Very nice job, Shusen, thanks!
@rajgothi2633 1 year ago
You have explained ViT in simple words. Thanks
@hongkyulee9724 1 year ago
Thank you for the clear explanation!!☺
@lionhuang9209 3 years ago
Very clear, thanks for your work.
@sehaba9531 2 years ago
Thank you so much for this amazing presentation. You have a very clear explanation, I have learnt so much. I will definitely watch your Attention models playlist.
@muhammadfaseeh5810 2 years ago
Awesome Explanation. Thank you
@mmazher5826 1 year ago
Excellent explanation 👌
@chawkinasrallah7269 8 months ago
The class token output c0 has the embedding dimension; does that mean we should add a linear layer from the embedding dimension to the number of classes before the softmax for classification?
@xXMaDGaMeR 1 year ago
Amazing, precise explanation.
@ASdASd-kr1ft 1 year ago
Nice video!! Just a question: what is the argument behind getting rid of the vectors c1 to cn and keeping only c0? Thanks
@DrAIScience 7 months ago
How is A trained? I mean, what is the loss function? Is it only using the encoder, or both the encoder and decoder?
@aryanmobiny7340 3 years ago
Amazing video. Please do one for Swin Transformers if possible. Thanks a lot.
@t.pranav2834 3 years ago
Awesome explanation, man, thanks a ton!!!
@parmanandchauhan6182 5 months ago
Great explanation. Thank you.
@user-wr4yl7tx3w 1 year ago
In the job market, do data scientists use transformers?
@tallwaters9708 2 years ago
Brilliant explanation, thank you.
@ervinperetz5973 2 years ago
This is a great explanation video. One nit: you are misusing the term 'dimension'. If a classification vector is linear with 8 values, that's not '8-dimensional' -- it is a 1-dimensional vector with 8 values.
@deeplearn6584 2 years ago
Very good explanation, subscribed!
@jidd32 2 years ago
Brilliant. Thanks a million
@boemioofworld 3 years ago
thank you so much for the clear explanation
@saeedataei269 2 years ago
Great video, thanks. Could you please explain the Swin Transformer too?
@mariamwaleed2132 2 years ago
Really great explanation, thank you.
@bbss8758 3 years ago
Can you please explain the paper "Your Classifier is Secretly an Energy Based Model and You Should Treat it Like One"? I want to understand these energy-based models.
@medomed1105 2 years ago
Great explanation
@BeytullahAhmetKINDAN 1 year ago
that was educational!
@sudhakartummala4701 2 years ago
Wonderful talk
@ThamizhanDaa1 2 years ago
Why does the transformer require so many images to train? And why does ResNet not keep getting better with more training data, unlike ViT?
@ansharora3248 3 years ago
Great explanation :)
@zeweichu550 2 years ago
great video!
@DrAhmedShahin_707 2 years ago
The simplest and most interesting explanation, many thanks. I am asking about object detection models; have you explained them before?
@fedegonzal 3 years ago
Super clear explanation! Thanks! I want to understand how attention is applied to the images. I mean, using a CNN you can "see" where the neural network is focusing, but how do you do that with transformers?
@ME-mp3ne 3 years ago
Really good, thx.
@ogsconnect1312 2 years ago
Good job! Thanks
@DungPham-ai 3 years ago
Amazing video. It helped me to really understand vision transformers. Thanks a lot. But I have a question: why do we only use the CLS token for the classifier?
@NeketShark 3 years ago
It looks like, thanks to the attention layers, the CLS token is able to extract all the information it needs for a good classification from the other tokens. Using all tokens for classification would just unnecessarily increase computation.
@Darkev77 3 years ago
@@NeketShark That's a good answer. At 9:40, any idea how a softmax function was able to increase (or decrease) the dimension of vector "c" into "p"? I thought softmax would only change the entries of a vector, not its dimension.
@NeketShark 3 years ago
@@Darkev77 I think it first goes through a linear layer and then through a softmax, so it's the linear layer that changes the dimension. In the video this was probably omitted for simplicity.
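A minimal PyTorch sketch of the idea discussed in this thread: keep only c0 (the [CLS] token output), map it from the embedding dimension to the number of classes with a linear layer, then apply softmax. The sizes and names are illustrative assumptions, not taken from the video.

import torch
import torch.nn as nn

embed_dim, num_classes = 768, 1000        # illustrative sizes

head = nn.Linear(embed_dim, num_classes)  # this layer, not the softmax, changes the dimension

# encoder_out: [batch, 1 + num_patches, embed_dim]; index 0 is the [CLS] token
encoder_out = torch.randn(8, 197, embed_dim)
c0 = encoder_out[:, 0, :]                 # keep c0, discard c1 ... cn
p = head(c0).softmax(dim=-1)              # class probabilities, shape [batch, num_classes]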
@shamsarfeen2729 3 years ago
If you remove the positional encoding step, the whole thing is almost equivalent to a CNN, right? I mean, those dense layers act just like the filters of a CNN.
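The patch-embedding step, at least, is indeed often written as a convolution: applying the same dense layer to every non-overlapping patch is equivalent to a convolution whose kernel size and stride both equal the patch size. A small PyTorch sketch under assumed sizes (16x16 patches, 224x224 input):

import torch
import torch.nn as nn

patch_size, embed_dim = 16, 768   # assumed values for illustration

# One dense layer per 16x16x3 patch == a convolution with kernel = stride = patch size
patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

img = torch.randn(1, 3, 224, 224)
x = patch_embed(img)              # [1, embed_dim, 14, 14]
x = x.flatten(2).transpose(1, 2)  # [1, 196, embed_dim]: one embedding vector per patch

The self-attention layers that follow are still quite different from stacked CNN filters, though, so the equivalence only goes so far.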
@parveenkaur2747 3 years ago
Very good explanation! Can you please explain how we can fine-tune these models on our own dataset? Is it possible on a local computer?
@ShusenWangEng 3 years ago
Unfortunately, no. Google has TPU clusters. The amount of computation is insane.
@parveenkaur2747 3 years ago
@@ShusenWangEng Actually, I have my project proposal due today. I was proposing this on the Food-101 dataset; it has 101,000 images. So it can't be done? What size of dataset can we train on a local PC?
@parveenkaur2747 3 years ago
Can you please reply? Stuck at the moment. Thanks
@ShusenWangEng 3 years ago
@@parveenkaur2747 If your dataset is very different from ImageNet, Google's pretrained model may not transfer well to your problem. The performance can be bad.
@palyashuk42 3 years ago
Why do the authors evaluate and compare their results with the old ResNet architecture? Why not use EfficientNets for comparison? It doesn't look like the strongest baseline...
@ShusenWangEng 3 years ago
ResNet is a family of CNNs. Many tricks are applied to make ResNet work better. The reported numbers are indeed the best accuracies that CNNs can achieve.
@sevovo 1 year ago
CNN on images + positional info = Transformers for images
@yinghaohu8784 3 years ago
1) You mentioned the pretrained model: it is trained on a large-scale dataset and then a smaller dataset is used for fine-tuning. Does that mean the way c0 is produced stays almost the same, except that the last softmax layer is adjusted to the number of classes and then trained on the fine-tuning dataset? Or are there other settings that differ? 2) Another doubt for me: there is no mask at all in ViT, right? Since the idea comes from MLM... um...
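For the fine-tuning part, a common recipe in practice is to keep the pretrained backbone and re-initialize only the classification head for the new number of classes. A sketch using the timm library; the model name, class count, and the frozen-backbone choice are assumptions for illustration, not from the video:

import timm
import torch.nn as nn

# Load a ViT pretrained on a large dataset; passing num_classes makes timm
# replace the classification head with a freshly initialized linear layer,
# while the patch embedding, transformer blocks, and [CLS] token keep their
# pretrained weights.
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=101)

# Optionally freeze everything except the new head and train with the usual
# cross-entropy loss for classification.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith('head')
loss_fn = nn.CrossEntropyLoss()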
@swishgtv7827 3 years ago
The concept has similarities to the TCP protocol in terms of segmentation and positional encoding. 😅😅😅
@mahmoudtarek6859 2 years ago
great
@st-hs2ve 3 years ago
Great great great
@seakan6835 2 years ago
Actually, I think it would be even better if the uploader explained this in Chinese 🥰🤣
@boyang6105 2 years ago
There is also a Chinese version ( kzbin.info/www/bejne/eJPdgI1via2ln7s ); different languages reach different audiences.
@顾小杰 1 year ago
👏
@mahdiyehbasereh 1 year ago
That was great and helpful 🤌🏻
@randomperson5303 2 years ago
Not All Heroes Wear Capes
@yuan6950 2 years ago
This English really leaves me speechless.
@kutilkol 8 months ago
Is this supposed to be English?
@tianbaoxie2324 2 years ago
Very clear, thanks for your work.
@Raulvic 3 years ago
Thank you for the clear explanation