Transformers are ruining everything! They first ruled the NLP world and now they are killing it in computer vision as well. I make 2 predictions in this video: 1. We can expect much bigger transformers being used in computer vision (same trend as in NLP) 2. We can expect a smaller patch size combined with efficient transformers (Reformer, Linformer, Longformer, etc.) any time soon. Forgot to mention 1 interesting thing: the transformer is in a way more general than CNNs and LSTMs (i.e. it has fewer inductive biases). It turns out that the transformer is a special case of a GNN (graph neural network), in particular GAT (well, everything is a graph haha, 0 shannons here, but still). Check out this blog: thegradient.pub/transformers-are-graph-neural-networks/
@DavenH 3 years ago
I am so pumped to see what happens with Performers* on whole documents, books, images, movies, audio files... I'm sure multiple companies are training 1T+ parameter Performer models as we type. It's going to be a great year ahead of us. *linear scaling Transformers that are better than linformers/linear xf/reformer/sparse xf etc
@TheAIEpiphany 3 years ago
@@DavenH Me too! I'm also excited about many other areas of AI, especially graph neural nets and RL, mostly because I'm going to dig deeper into them over the next period! 😅 I just researched AlphaStar a bit: it also uses transformers to beat pro gamers at StarCraft II. It's RL as well, but I was happy to see transformers in there! 😂
@ibrahimaba8966 3 years ago
Hello. Can we use GAT with image patches to do the same work 🤔?
@masteronepiece6559 3 years ago
This guy will hit 500K subs by the end of 2021. You're the only person on YouTube who gives 100000....000% effort in his videos. I'm learning a lot from you.
@TheAIEpiphany 3 years ago
Hehe. 500k might be overkill considering this is a niche channel, but I'll give it my best shot! Thanks a lot for your kind words!
@irinelmanolache601 2 years ago
Thanks man, I really enjoy watching your explanations!
@tongluo9860 1 year ago
Thank you for this great video, explaining ViT very well. It took me a lot of time to understand the Transformer and BERT series. Your video makes the vision part much easier to understand.
@NasheedYasin08 3 years ago
Informative as anything. Definitely 25 min well spent.
@TheAIEpiphany 3 years ago
Glad to hear that 😁
@ameynaik2743 2 years ago
Thanks for the detailed overview. What exactly is class embedding? Why is it required?
@meidachen8489 1 year ago
Nice work! Thank you! (Really nice prediction that big tech will have some large Transformer coming, now "Segment Anything" is here😁)
@SH94 3 years ago
Keep up the good work bro!
@TheAIEpiphany 3 years ago
Thanks man!
@SpesMagisteriiGradus 11 months ago
thank you so much
@lukkuuu6368 3 years ago
I read your blog post on your journey to a deepmind engineer. It was very very inspiring. Thank you for spending time writing that!
@TheAIEpiphany 3 years ago
Thank you! 🙏
@clapathy 2 years ago
Nice job! Thank you very much!
@JohnSmith-ut5th 2 years ago
There are so many applications for AI/ML. I'm curious, why am I not seeing this being applied? So many people and companies are claiming to do "AI/ML" but I'm not seeing commercial applications.
@aleksabisercic1410 3 years ago
Love it !
@TheAIEpiphany 3 years ago
Glad to hear that!
@XX-vu5jo 3 years ago
I love this paper. But I hate that I cannot train shit with this LOL
@lifted1785 2 years ago
Bro this shit was helpful as fuck, you just helped me do my fucking capstone for MIT. Good looking out homie! I'm subbing
@navinbondade5365 3 years ago
Can you make a coding video on it?
@TheAIEpiphany 3 years ago
Will do. Check out my DINO video and let me know what you think.
@jinorohit2921 2 years ago
@@TheAIEpiphany thanks? ahahha great video btw
@TheAIEpiphany 2 years ago
@@jinorohit2921 failed to edit it lolz
@TheAIEpiphany 2 years ago
@@jinorohit2921 thanks! 🤣
@fast_harmonic_psychedelic 3 years ago
How do I convert a VIT-L_32.npz checkpoint to VIT-L_32.pt so I can load it with CLIP? Anyone know?
@fast_harmonic_psychedelic 3 years ago
Why can't we just use THEIR pre-trained model? They already did it once; what's the sense in repeating the same process and wasting energy when they already have the model?
@DavenH 3 years ago
Excellent, love these paper breakdowns. Keep it up!
@TheAIEpiphany 3 years ago
Thanks man! Appreciate it. If people find these kinds of breakdowns useful (which I'll know if I get enough similar feedback), I'll definitely make more of these; I learn a lot by doing this as well. As a side note, I'm noticing a huge gap in the community: on the one hand, people don't know how to start learning ML, and on the other hand, people need help understanding the papers. I'm trying to balance it out. Still not sure what the best strategy is, but I'll continue covering seminal papers from different areas.
@DavenH 3 years ago
@@TheAIEpiphany My opinion is not to worry about the people just starting ML, unless that's what you want to do, of course. There are many good resources for beginners: free courses, blogs, playlists, etc. However, I find that there aren't many channels aimed at the level of actual ML practitioners like yourself. Yannic Kilcher's channel is one such beacon, and it's doing very well. I also want to say you present stuff in a very clear and digestible way, and a 30+ minute video is more than fine. There are so many papers coming out every day on arXiv that it's impossible to keep up, so having any help distilling them is wonderful. One symptom of this bottleneck, I notice, is that people generally just read the highlights (papers from Google Brain, FAIR, DeepMind, OpenAI). Unfortunately, by dominating the mindshare this has pulled the research into areas well beyond the compute capabilities of PhD students, independent practitioners, or startup companies. I read the wav2vec 2.0 paper yesterday and got all excited to try to apply their methods, until I saw they trained for multiple days on 100 GPUs, expensive V100s at that. Google papers are even worse this way; they never seem to train on anything less than 1000 TPUs! I guess they are probably the highest-quality papers too, so there's a feedback effect. But there are surely gems from all the universities that get missed, which I would assume focus less on scaling and more on theory or other insights.
@TheAIEpiphany 3 years ago
@@DavenH Extremely valuable feedback, thank you! I somehow tend not to go over 30 minutes; for now I'm still figuring it out. I agree. I even started writing down the amount of compute needed in my OneNote 😅 and it's crazy. I'm certain there is a lot of valuable research that will be done aside from mainstream deep learning: Judea Pearl's ideas, etc.
@Deshwal.mahesh 3 years ago
The flattened patch is not 14*14 only. To flatten it, you have to take the channels into consideration too, so 14*14*3. Please correct me if I am wrong.
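For context on the thread above, here is a small sketch of the channel point (the exact patch size depends on the ViT variant; 16x16 is used below as an illustrative choice, while the comment mentions 14*14, but the channel argument is the same either way):

```python
import numpy as np

# Illustrative numbers: a 16x16 RGB patch (patch size varies by ViT variant).
patch_size, channels = 16, 3
patch = np.random.rand(patch_size, patch_size, channels)

# Flattening must include the channel dimension:
flat = patch.reshape(-1)
print(flat.shape)  # (768,) = 16 * 16 * 3, not just 16 * 16 = 256
```

So the commenter is right that the channel dimension has to be folded into the flattened vector before the linear projection.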
@taekwondosplit 3 years ago
Excellent explanation! Thank you.
@present-bk2dh a month ago
crazy to see that this was just 3 years ago
@DavenH 3 years ago
The discussion in this video about the inductive bias of ResNets vs the unbiased Transformers got me thinking. Right now I'm doing a fun neural architecture search (NAS) project: it evolves architectures and tests each one on small amounts of data to compare against other architectures. This is somewhat similar to how Transformers learn dynamic routing, whereas the genetic search algorithm in the NAS space is "learning" this routing by discrete methods. So if the comparison is valid, between the learned inductive bias of Transformers and the searched inductive bias of a NAS, I wonder which of these methods is more compute efficient overall? NAS is slow indeed, as it churns through many small yet specialized architectures. But Transformers are perhaps equally slow, as they are searching over a similar space. I suspect that, given constant compute resources, in a Transformer-based arch vs a NAS search over smaller nets, the Transformer probably still comes out ahead, as its search method is at least exploiting a gradient, while evolutionary strategies scale poorly with parameter size.
@TheAIEpiphany 3 years ago
Nice connection! Interesting way to look at it! To me it looks like NAS is probably more flexible, but it's also more compute intensive. I didn't get to play with NAS so far, so I can't get into any serious discussion without taking (at least) a day to investigate it a bit more. Nice thoughts, keep them coming! Did you play with GNNs? They are even more general than transformers, e.g. GAT.
@kassem6436 3 years ago
Great work, keep going!
@Raulvic 3 years ago
Thx for the video!
@lukasugar94 3 years ago
Very nice!
@quanhua92 3 years ago
How would you implement this Vision Transformer? I think ImageNet is the choice; however, it is still worse than ResNet there. I would start with the Imagenette dataset from fast.ai for fast iterations, then switch to ImageNet, trained on Lambda Labs' GPU cloud with 4 GPUs.
@TheAIEpiphany 3 years ago
I am not 100% sure I understood your question. Did you mean to ask: 1. How would you train it like given the amount of compute it requires what would be the correct machine setup? 2. How would we train it given that JFT-300 is not a public dataset? 3. Or did you mean to ask how to implement it? Which is fairly simple as it's almost a regular transformer (except for the input preprocessing part contained in the stem of the model).
@quanhua92 3 years ago
@@TheAIEpiphany likely 3. I want to implement from scratch to understand all the details. So I will need to find a replacement for the dataset and maybe a smaller version of the model. I don't think that it makes sense to train for 600 GPU days as an individual researcher.
@TheAIEpiphany 3 years ago
@@quanhua92 Neither could you, unless your dad is Bill Gates haha. Hmm, check out my GitHub project, and also The Annotated Transformer and Jay Alammar's blog. I did a couple of videos on how I did it; maybe that could help as well. Your question is basically "how do I implement the transformer". The preprocessing step is really simple, and that's what's beautiful about this model.
@marijnspijker5199 3 years ago
@@quanhua92 Did you get this working? I am looking for a thesis topic and I am wondering if it is feasible to make this work. Thanks.
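The preprocessing "stem" discussed in this thread can be sketched in a few lines. This is a minimal NumPy sketch with made-up, ViT-ish dimensions (16x16 patches, embedding dim 192) and random placeholder weights, not the real model: split the image into patches, flatten each, project linearly, prepend a class token, and add position embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 224; C = 3; P = 16; D = 192   # image size, channels, patch size, embed dim (illustrative)
N = (H // P) * (W // P)               # number of patches = 196

img = rng.standard_normal((H, W, C))

# 1) Split into P x P patches and flatten each one to a P*P*C vector.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4).reshape(N, P * P * C)

# 2) Linear projection to the embedding dimension (random weights as a stand-in).
W_proj = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ W_proj             # (N, D)

# 3) Prepend a [class] token and add position embeddings (both learnable in the real model).
cls_tok = rng.standard_normal((1, D))
pos_emb = rng.standard_normal((N + 1, D))
x = np.concatenate([cls_tok, tokens], axis=0) + pos_emb

print(x.shape)  # (197, 192): this sequence is fed to a standard Transformer encoder
```

Everything after this step is a regular Transformer encoder, which is the point made above about the model's simplicity.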
2 years ago
Great work explaining the paper Aleksa
@TheAIEpiphany 2 years ago
Thanks! 🙏😄
@shandi1241 3 years ago
Thank you man, it was really helpful!
@TheAIEpiphany 3 years ago
You're welcome Alexander!
@sathishkumarthirumalai3722 1 year ago
Are the position encodings learnt in the Vision Transformer? In the "Attention Is All You Need" transformer, positions are not learnt.
@amirzarei4558 2 years ago
Thanks a lot for your great explanation of how the Vision Transformer works.
@НиколайНовичков-е1э 2 years ago
Great work! Thank you!
@samirelzein1095 2 years ago
make more of these :)
@nire-hj9pe 2 years ago
super cool
@user-or7ji5hv8y 3 years ago
Cool
@TheAIEpiphany 3 years ago
Thanks!
@mickeymilo2753 3 years ago
Yes! NEW CLIP!
@TheAIEpiphany 3 years ago
It's burning baby!
@parker1981xxx 3 years ago
What happens if the positional embeddings are not trainable (so they are constant)?
@TheAIEpiphany 3 years ago
They are actually learned in ViT (standard 1D learnable position embeddings); the authors just didn't gain much from fancier 2D-aware variants.
@parker1981xxx 3 years ago
@@TheAIEpiphany Interesting! My observation has been that if the data is scarce and/or there are too many patches, then trainable positional embeddings become a liability. Thanks for clarifying.
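For anyone curious what a constant (non-trainable) alternative to a learned position table looks like: a sketch of the sinusoidal encoding from "Attention Is All You Need", which could in principle be swapped in for ViT's learned embeddings. The dimensions below (197 tokens, embed dim 192) are illustrative, not the paper's defaults.

```python
import numpy as np

def sinusoidal_positions(n_tokens: int, dim: int) -> np.ndarray:
    """Fixed (non-trainable) sinusoidal position encodings, per Vaswani et al. (2017)."""
    pos = np.arange(n_tokens)[:, None]            # (n_tokens, 1)
    i = np.arange(dim // 2)[None, :]              # (1, dim/2)
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((n_tokens, dim))
    enc[:, 0::2] = np.sin(angles)                 # even dims: sine
    enc[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return enc

pe = sinusoidal_positions(197, 192)   # e.g. 196 patch tokens + 1 class token
print(pe.shape)                       # (197, 192)
```

The trade-off raised in this thread is that such a table has zero trainable parameters, so it cannot overfit when data is scarce, whereas a learned table of the same shape is optimized with the rest of the model.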