Vision Transformer (ViT) - An image is worth 16x16 words | Paper Explained

41,453 views

Aleksa Gordić - The AI Epiphany

Comments: 63
@TheAIEpiphany 3 years ago
Transformers are ruining everything! They first ruled the NLP world and finally they are killing it in computer vision as well. I make 2 predictions in this video: 1. We can expect much bigger transformers being used in computer vision (the same trend as in NLP). 2. We can expect a smaller patch size combined with efficient transformers (Reformer, Linformer, Longformer, etc.) any time soon. I forgot to mention one interesting thing: the transformer is in a way more general than CNNs and LSTMs (i.e. it has fewer inductive biases). It turns out that the transformer is a special case of a GNN (graph neural network), in particular GAT (well, everything is a graph haha, 0 Shannons here, but still). Check out this blog: thegradient.pub/transformers-are-graph-neural-networks/
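To make the "16x16 words" idea concrete, here is a minimal sketch (not from the video; the image size, patch size, and embedding width are assumptions) of how an image gets split into flattened patches and linearly projected into a token sequence:

```python
import torch
import torch.nn as nn

# Assumed sizes: a 224x224 RGB image with 16x16 patches gives (224 / 16) ** 2 = 196 tokens,
# each flattened to 16 * 16 * 3 = 768 numbers before the learnable projection.
patch_size, embed_dim = 16, 768
image = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)

# Extract non-overlapping patches and flatten each one (channels included).
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 196, 768])

# One shared linear layer turns every flattened patch into a token embedding.
to_token = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = to_token(patches)  # (1, 196, 768): this sequence is what the Transformer encoder consumes
```

The class token and position embeddings (discussed further down in the comments) are added to this sequence before it enters the encoder.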
@DavenH 3 years ago
I am so pumped to see what happens with Performers* on whole documents, books, images, movies, audio files... I'm sure multiple companies are training 1T+ parameter Performer models as we type. It's going to be a great year ahead of us. *linear scaling Transformers that are better than linformers/linear xf/reformer/sparse xf etc
@TheAIEpiphany 3 years ago
@@DavenH Me too! I am also excited about many other areas of AI, especially graph neural nets and RL! Mostly because I am going to dig deeper into them over the next period! 😅 I just researched AlphaStar a bit; it also uses transformers to beat pro gamers at StarCraft II. It uses RL as well, but I was happy to see transformers in there too! 😂
@ibrahimaba8966 3 years ago
Hello. Can we use GAT with image patches to do the same work 🤔?
@masteronepiece6559 3 years ago
This guy will hit 500K subs by the end of 2021. You're the only person on YouTube who gives 100000....000% effort in his videos. I'm learning a lot from you.
@TheAIEpiphany 3 years ago
Hehe, 500k might be overkill considering this is a niche channel, but I'll give it my best shot! Thanks a lot for your kind words!
@irinelmanolache601 2 years ago
Thanks man, I really enjoy watching your explanations!
@tongluo9860 1 year ago
Thank you for this great video, it explains ViT very well. It took me a lot of time to understand the Transformer and BERT series. Your video makes the vision part much easier to understand.
@NasheedYasin08 3 years ago
Informative as anything. Definitely 25 min well spent.
@TheAIEpiphany 3 years ago
Glad to hear that 😁
@ameynaik2743 2 years ago
Thanks for the detailed overview. What exactly is class embedding? Why is it required?
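For anyone else wondering about this, here is a minimal sketch (assumed shapes, not taken from the video or the official code) of what the class embedding is mechanically: one extra learnable token prepended to the patch sequence, whose final state feeds the classification head.

```python
import torch
import torch.nn as nn

embed_dim, num_patches, batch = 768, 196, 1                # assumed ViT-Base-like sizes
patch_tokens = torch.randn(batch, num_patches, embed_dim)  # output of the patch projection

# A single learnable vector, shared across all images, prepended as an extra "token".
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
tokens = torch.cat([cls_token.expand(batch, -1, -1), patch_tokens], dim=1)
print(tokens.shape)  # torch.Size([1, 197, 768])

# After the encoder, only the first position is read out for classification, e.g.:
# logits = mlp_head(encoder(tokens)[:, 0])
```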
@meidachen8489 1 year ago
Nice work! Thank you! (Really nice prediction that big tech will have some large Transformer coming, now "Segment Anything" is here😁)
@SH94 3 years ago
Keep up the good work bro!
@TheAIEpiphany 3 years ago
Thanks man!
@SpesMagisteriiGradus 11 months ago
Thank you so much!
@lukkuuu6368 3 years ago
I read your blog post on your journey to becoming a DeepMind engineer. It was very, very inspiring. Thank you for spending the time to write that!
@TheAIEpiphany 3 years ago
Thank you! 🙏
@clapathy 2 years ago
Nice job! Thank you very much!
@JohnSmith-ut5th 2 years ago
There are so many applications for AI/ML. I'm curious, why am I not seeing this being applied? So many people and companies are claiming to do "AI/ML" but I'm not seeing commercial applications.
@aleksabisercic1410 3 years ago
Love it!
@TheAIEpiphany 3 years ago
Glad to hear that!
@XX-vu5jo 3 years ago
I love this paper. But I hate that I cannot train shit with this LOL
@lifted1785 2 years ago
Bro this shit was helpful as fuck, you just helped me do my fucking capstone for MIT. Good looking out homie! I'm subbing.
@navinbondade5365 3 years ago
Can you make a coding video on it?
@TheAIEpiphany 3 years ago
Will do, check out my DINO video and let me know what you think.
@jinorohit2921 2 years ago
@@TheAIEpiphany thanks? ahahha great video btw
@TheAIEpiphany 2 years ago
@@jinorohit2921 failed to edit it lolz
@TheAIEpiphany 2 years ago
@@jinorohit2921 thanks! 🤣
@fast_harmonic_psychedelic 3 years ago
How do I convert a VIT-L_32.npz checkpoint to .VIT-L_32.pt so I can load it with CLIP? Anyone know?
@fast_harmonic_psychedelic 3 years ago
Why can't we just use THEIR pre-trained model? They already did it once; what's the sense in doing the same process over, wasting energy, when they already have the model?
@DavenH 3 years ago
Excellent, love these paper breakdowns. Keep it up!
@TheAIEpiphany 3 years ago
Thanks man! Appreciate it. If people find these kinds of breakdowns useful (which I'll know if I get enough similar feedback), I'll definitely make more of them. I learn a lot by doing this as well. As a side note, I am noticing a huge gap in the community: on the one hand, people don't know how to start learning ML, and on the other hand, people need help understanding the papers. I'm trying to balance it out. Still not sure what the best strategy is, but I'll continue covering seminal papers from different areas.
@DavenH 3 years ago
@@TheAIEpiphany My opinion is not to worry about the people just starting ML, unless that's what you want to do, of course. There are many good resources for beginners: free courses, blogs, playlists, etc. However, I find that there aren't many channels aimed at the level of actual ML practitioners like yourself. Yannic Kilcher's channel is one such beacon, and it's doing very well. I also want to say you present stuff in a very clear and digestible way, and a 30+ minute video is more than fine. There are so many papers coming out every day on arXiv that it's impossible to keep up, so having any help distilling them is wonderful. One symptom of this bottleneck, I notice, is that people generally just read the highlights (papers from Google Brain, FAIR, DeepMind, OpenAI). Unfortunately, by dominating the mindshare this has pulled the research into areas well beyond the compute capabilities of PhD students, independent practitioners, or startup companies. I read the wav2vec 2.0 paper yesterday and got all excited to try to apply their methods, until I saw they trained for multiple days on 100 GPUs, expensive V100s at that. Google papers are even worse this way; they never train on anything less than 1000 TPUs, it seems! I guess they are probably the highest-quality papers too, so there's a feedback effect. But there are gems that surely get missed by all the universities, which I would assume focus less on scaling and more on theory or other insights.
@TheAIEpiphany 3 years ago
@@DavenH Extremely valuable feedback, thank you! I somehow tend not to go over 30 minutes; for now I am still figuring it out. I agree, I even started writing down the amount of compute needed in my OneNote. 😅 And it's crazy. I agree there is a lot of valuable research that will be done aside from mainstream deep learning, of that I'm certain. Judea Pearl's ideas, etc.
@Deshwal.mahesh 3 years ago
The flattened patch is not 14*14 only. To flatten it, you have to take the channels into consideration too, so it's 14*14*3. Please correct me if I am wrong.
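That reading matches the paper: a patch of size P x P over C channels flattens to P * P * C numbers before the linear projection. A quick sanity check (sizes assumed for the example):

```python
import torch

P, C = 14, 3                  # assumed patch size and RGB channel count
patch = torch.randn(C, P, P)  # one patch as cut out of the image
flat = patch.flatten()        # channels are flattened together with the spatial dims
print(flat.shape)             # torch.Size([588]), i.e. 14 * 14 * 3, not 14 * 14
```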
@taekwondosplit 3 years ago
Excellent explanation! Thank you.
@present-bk2dh 1 month ago
Crazy to see that this was just 3 years ago.
@DavenH 3 years ago
The discussion in this video about the inductive bias of resnets vs the unbiased Transformers got me thinking. Right now I'm doing a fun network architecture search (NAS) project, and it evolves architectures and tests each one on small amounts of data to compare against other architectures. This is somewhat similar to how Transformers learn dynamic routing, whereas the genetic search algorithm in the NAS-space is "learning" this routing by discrete methods. So if the comparison is valid -- the learned inductive bias of Transformers and the searched inductive bias of a NAS, I wonder which of these methods is more compute efficient overall? NAS is slow indeed, as it churns through many small, yet specialized architectures. But Transformers are perhaps equally slow, as they are searching over a similar space. I suspect given constant compute resources, a Transformer-based arch vs a NAS-search over smaller nets, the Transformer probably comes out ahead still as its search method is at least exploiting a gradient, while evolutionary strategies scale poorly with parameter size.
@TheAIEpiphany 3 years ago
Nice connection! Interesting way to look at it! To me it looks like NAS is probably more flexible, but it's also more compute intensive. I haven't gotten to play with NAS so far, so I can't get into any serious discussion without taking (at least) a day to investigate it a bit more. Nice thoughts, keep them coming! Did you play with GNNs? They are even more general than transformers, GAT for example.
@kassem6436 3 years ago
Great work, keep going!
@Raulvic 3 years ago
Thx for the video!
@lukasugar94 3 years ago
Very nice!
@quanhua92 3 years ago
How would you implement this Vision Transformer? I think ImageNet is the choice; however, it is still worse than ResNet there. I would start with the Imagenette dataset from fastai for fast iterations, then switch to ImageNet, which I would train on the Lambda Labs GPU cloud with 4 GPUs.
@TheAIEpiphany 3 years ago
I am not 100% sure I understood your question. Did you mean to ask: 1. How would you train it, i.e. given the amount of compute it requires, what would be the correct machine setup? 2. How would we train it given that JFT-300M is not a public dataset? 3. Or did you mean to ask how to implement it? That is fairly simple, as it's almost a regular transformer (except for the input preprocessing part contained in the stem of the model).
@quanhua92 3 years ago
@@TheAIEpiphany Likely 3. I want to implement it from scratch to understand all the details. So I will need to find a replacement for the dataset and maybe a smaller version of the model. I don't think it makes sense to train for 600 GPU days as an individual researcher.
@TheAIEpiphany 3 years ago
@@quanhua92 Neither could you, unless your dad is Bill Gates haha. Hm, check out my GitHub project, and also The Annotated Transformer and Jay Alammar's blog. I did a couple of videos on how I did it; maybe that could help as well. Your question is basically "how do I implement the transformer". The preprocessing step is really simple, and that's what's beautiful about this model.
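As a rough illustration of that point (the patch stem plus an otherwise standard Transformer encoder), here is a minimal sketch of a ViT-style classifier; it is not the official implementation, and every hyperparameter below is an assumed small-scale setting:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch stem + off-the-shelf Transformer encoder."""
    def __init__(self, image_size=224, patch=16, dim=384, depth=6, heads=6, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch) ** 2
        # The "stem": a strided conv is equivalent to flattening each patch and projecting it.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)  # pre-norm, as in ViT
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # prepend CLS token, add positions
        return self.head(self.encoder(tokens)[:, 0])               # classify from the CLS state

model = TinyViT()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 10])
```

Something this small can be trained on CIFAR-10 or Imagenette on a single GPU, which is enough to verify the plumbing even if it will not match the paper's numbers.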
@marijnspijker5199 3 years ago
​@@quanhua92 Did you get this working? I am looking for a thesis topic and I am wondering if it is feasible to make this work. Thanks.
2 years ago
Great work explaining the paper, Aleksa!
@TheAIEpiphany 2 years ago
Thanks! 🙏😄
@shandi1241 3 years ago
Thank you man, this was really helpful!
@TheAIEpiphany 3 years ago
You're welcome Alexander!
@sathishkumarthirumalai3722 1 year ago
Are the position encodings learnt in the Vision Transformer? In the "Attention Is All You Need" transformer, positions are not learnt.
@amirzarei4558 2 years ago
Thanks a lot for your great explanation of how the Vision Transformer works.
@НиколайНовичков-е1э 2 years ago
Great work! Thank you!
@samirelzein1095 2 years ago
make more of these :)
@nire-hj9pe 2 years ago
super cool
@user-or7ji5hv8y 3 years ago
Cool
@TheAIEpiphany 3 years ago
Thanks!
@mickeymilo2753 3 years ago
Yes! NEW CLIP!
@TheAIEpiphany 3 years ago
It's burning baby!
@parker1981xxx 3 years ago
What happens if the positional embeddings are not trainable (so they are constant)?
@TheAIEpiphany 3 years ago
They are constant for ViT; they didn't gain much by learning them.
@parker1981xxx 3 years ago
@@TheAIEpiphany That is exactly my observation: if the data is scarce and/or there are too many patches then trainable positional embeddings become a liability. Your message just confirmed it, thanks.
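For anyone who wants to experiment with the trade-off discussed in this thread, here is a minimal sketch (assumed shapes; not the video's or the paper's code) of the two options, a learnable position embedding versus a fixed sinusoidal one:

```python
import math
import torch
import torch.nn as nn

num_tokens, dim = 197, 768   # assumed: 196 patches + 1 class token, ViT-Base width

# Option 1: learnable position embeddings, updated by the optimizer like any other weight.
learned_pos = nn.Parameter(torch.zeros(1, num_tokens, dim))

# Option 2: fixed sinusoidal embeddings in the style of "Attention Is All You Need"
# (inside a module this would typically be registered as a buffer so it gets no gradients).
position = torch.arange(num_tokens).unsqueeze(1)
div_term = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
fixed_pos = torch.zeros(1, num_tokens, dim)
fixed_pos[0, :, 0::2] = torch.sin(position * div_term)
fixed_pos[0, :, 1::2] = torch.cos(position * div_term)

# Either tensor is simply added to the token sequence before the encoder:
# tokens = tokens + learned_pos    or    tokens = tokens + fixed_pos
```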