DINO: Emerging Properties in Self-Supervised Vision Transformers (Facebook AI Research Explained)

126,249 views

Yannic Kilcher

A day ago

Comments: 151
@YannicKilcher 3 years ago
OUTLINE:
0:00 - Intro & Overview
6:20 - Vision Transformers
9:20 - Self-Supervised Learning for Images
13:30 - Self-Distillation
15:20 - Building the teacher from the student by moving average
16:45 - DINO Pseudocode
23:10 - Why Cross-Entropy Loss?
28:20 - Experimental Results
33:40 - My Hypothesis why this works
38:45 - Conclusion & Comments
Paper: arxiv.org/abs/2104.14294
Blog: ai.facebook.com/blog/dino-paws-computer-vision-with-self-supervised-transformers-and-10x-more-efficient-training
Code: github.com/facebookresearch/dino
My Video on ViT: kzbin.info/www/bejne/iqPHlql8gMSUo5Y
My Video on BYOL: kzbin.info/www/bejne/j4HJhpyFgr6Ce6c
@samanthaqiu3416 3 years ago
From the paper it is not clear AT ALL that they detach gradients of the teacher via the center (C) variable. Will have to look at their repo to see what is going on. Typically things like the mean still propagate gradients in PyTorch.
@samanthaqiu3416 3 years ago
Yep, it didn't help much that they seem to code like 9-year-olds, but from line 304 of main_dino.py ( github.com/facebookresearch/dino/blob/a15f6afee2f3b868f44a5021a0951f718c8e2dd5/main_dino.py#L304 ) it seems clear they are NOT DETACHING all gradients from the teacher network via the `update_center` method.
@samanthaqiu3416 3 years ago
It shouldn't be a problem, since they don't seem to be using those gradients anywhere, although I haven't verified it.
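For readers following the detach discussion above, here is a minimal sketch of one DINO-style training step, following the paper's pseudocode rather than the repo's actual code (the names `tps`, `tpt`, and `m` are my shorthand for the student/teacher temperatures and the center momentum). Wrapping the whole teacher branch, including the center update, in `torch.no_grad()` keeps gradients from flowing into the teacher:

```python
import torch

# Hedged sketch of one DINO-style step (my simplification of the paper's
# pseudocode, not the actual repo code).
def dino_step(student, teacher, x1, x2, center, tps=0.1, tpt=0.04, m=0.9):
    s1, s2 = student(x1), student(x2)      # student sees both views
    with torch.no_grad():                  # no gradients flow into the teacher
        t1, t2 = teacher(x1), teacher(x2)
        # running center over raw teacher outputs (the `update_center` role);
        # no_grad detaches this update by construction
        center = m * center + (1 - m) * torch.cat([t1, t2]).mean(dim=0)
        p1 = ((t1 - center) / tpt).softmax(dim=-1)   # centered, sharpened targets
        p2 = ((t2 - center) / tpt).softmax(dim=-1)
    # cross-entropy between teacher targets on one view and the student on the other
    loss = -(p1 * (s2 / tps).log_softmax(dim=-1)).sum(-1).mean() \
           - (p2 * (s1 / tps).log_softmax(dim=-1)).sum(-1).mean()
    return loss, center
```

After `loss.backward()`, only the student accumulates gradients, which matches the observation above that any teacher-side gradients are never used.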
@mathildecaron1821 3 years ago
Thanks a lot Yannic for covering DINO, that's really an honor! I'm a big fan of your channel :D
2 years ago
Hi. I enjoyed the paper and the explanation given in this video. Thank you both. Are you aware of any robustness analysis (in the context of adversarial examples) done for DINO?
@tiro0oO5 A year ago
I know version 2 is out. Still, congrats on this breakthrough work!
@fatmaguney3598 3 years ago
"learning to predict cat from cat ear" is a good summary of this paper.
@Metaloid-wv4kz 3 years ago
simplest form of AI lol
@sohamroy9868 2 years ago
I am super impressed by how you nailed the pronunciation of every single name of the authors of the paper.
@sabyasachibandyopadhyay8558 3 years ago
Your comment on augmentations is spot on! I have worked with BYOLs in clinical images for a while now, and choosing the correct augmentations makes a heck of a difference, and there is no way to know the right augmentation without trial and error! I think that's a major downside of BYOL, which will obviously percolate to DINO as well. Thanks for your presentation of the paper!
@rahuldeora5815 3 years ago
Surprisingly fluent pronunciation of the authors... bet that took more takes than one would expect :)
@patf9770 3 years ago
As often said, what a time to be alive!
@GeekProdigyGuy 3 years ago
wrong channel xd
@michaelwangCH 3 years ago
If you are the guy from "Two Minute Papers", excellent work - we are living in an extraordinary time of human history.
@vsiegel 3 years ago
@@michaelwangCH Let's enjoy it, the progress is exponential, and we are at a steep region! I decided to ignore the question of whether we are near the end of human history, at the time the progress curve goes to infinity... Actually, I'm not afraid: progress is exponential, making progress means adding knowledge, tools and scientists, and that allows faster progress. But I think it is actually a logistic development, which looks very much like exponential growth, but instead of reaching the singularity, the curve begins to get less steep. That happens when a finite resource is involved. But as a physicist, I say: no problem, the observable universe is finite.
@michaelwangCH 3 years ago
@@vsiegel We have a log curve between scientific output (progress) and the resources we put in - the work of a researcher is getting harder, and the hardness increases every year - in other words, the hardness increases exponentially with time - that is bad for scientific progress and the societal distribution of resources. E.g. CERN, with over 6000 scientists and $14B+ fixed costs per year; those resources could probably be used more productively in other areas of science.
@gaypaul5635 10 months ago
@@vsiegel That assumes that knowledge can be added like chocolate cakes, but that's a wrong hypothesis for individual humans and for humanity. Focus and more knowledge on one topic mean other topics are given less attention and are forgotten. This is why the definition of "progress" must be chosen, and according to certain definitions, DINO does not represent progress in itself, as it can have negative indirect effects like any digital technology.
@jaakjpn 3 years ago
Cool paper, thanks for the review! About centering vs sharpening. You are right: centering avoids the collapse, as each unsupervised class gets pushed to an equal running average, i.e., each unsupervised class should pick up 1/K of the images because the means of their logits are 0. This way, the model cannot collapse to picking the same class each time. Sharpening makes sure that each time one class is being picked (otherwise a flat uniform distribution could be the result).
@rahuldeora5815 3 years ago
It can still collapse at 0, as the output of a neuron can be 0 (or a very small value) and its running mean also 0. If most of the neurons have very small means and outputs, then isn't it possible for a few classes to always dominate? (This wouldn't happen if we divided by the standard deviation, btw.)
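The centering/sharpening interplay discussed above can be seen on toy numbers (my own illustration, not from the paper): centering alone flattens a collapsing output, and a low softmax temperature re-sharpens it.

```python
import torch

# Toy logits (mine): a batch that keeps "voting" for class 0, i.e. on its
# way to collapse onto a single class.
logits = torch.tensor([[5.0, 0.0, 0.0],
                       [6.0, 1.0, 0.0]])
center = logits.mean(dim=0)   # the real method uses a running mean over batches

flat = ((logits - center) / 1.0).softmax(dim=-1)   # centering alone: near-uniform
sharp = ((logits - center) / 0.1).softmax(dim=-1)  # low temperature: confident again
```

Centering alone would drift toward the uniform distribution (one collapse mode); sharpening alone would let one class dominate (the other collapse mode); DINO applies both so the two effects balance.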
@astudent8885 10 months ago
Thank you for this presentation. You made sure to explain all background concepts so someone with limited ml knowledge can still understand. I found that really helpful. Thank you so much!
3 years ago
One point to note in this paper is that the dataset consists of object-centred images, and the augmentation method relies on cropping, which learns to represent the images invariant to the cropping position. This forms a strong inductive prior that produces representations focused on the objects of interest in the image. The main learning signal that guides the self-supervised learning process comes from the cropping augmentation, so I don't see how such a method could be trained without augmentation. My hypothesis is that this method would not work with datasets that don't have object-centred images, like a dataset of images of rooms, since in that case cropping would yield different objects that have little in common, which would effectively eliminate the learning signal.
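The cropping signal described above comes from DINO's multi-crop scheme. Here is a rough sketch with a hand-rolled stand-in for torchvision's `RandomResizedCrop`; the crop counts, output sizes and scale ranges are assumptions for illustration, not the repo's exact settings.

```python
import random
from PIL import Image

def random_resized_crop(img, out_size, scale):
    """Minimal stand-in for torchvision's RandomResizedCrop (square crops only)."""
    w, h = img.size
    area = w * h * random.uniform(*scale)          # crop area as a fraction of the image
    side = min(int(area ** 0.5), w, h)
    left = random.randint(0, w - side)
    top = random.randint(0, h - side)
    return img.crop((left, top, left + side, top + side)).resize((out_size, out_size))

def multi_crop(img, n_local=8):
    # 2 big "global" views plus several small "local" views; only the global
    # views go through the teacher, while the student sees all of them
    views = [random_resized_crop(img, 224, (0.4, 1.0)) for _ in range(2)]
    views += [random_resized_crop(img, 96, (0.05, 0.4)) for _ in range(n_local)]
    return views
```

On an object-centred photo every crop still contains (part of) the object, which is exactly the prior the comment above points at; on a cluttered room scene, two crops may share nothing.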
@redseventyfiveprime5018 3 years ago
In reinforcement learning similarity of teacher and student responses can probably be used to move an agent into a position where an object is centered in its view.
@oncedidactic 3 years ago
I think you could extend the system by pre-training on object centered, and then expanding to more natural imagery, such as scenes as you say. But the cropping augmentation would probably still need adjustment.
@mdrayedbinwahed2172 3 years ago
Excellent insight. Sounds like a good follow up to this paper.
@kaveh_shh A year ago
Yeah. In this case we cannot expect the model to give us the same representation for, e.g., "sky" and a "cat" cropped from different parts of the same image.
@susdoge3767 8 months ago
what an insight! thanks for making me think!
@mfpears 2 years ago
Last 10 minutes are really a great explanation of a few concepts
@shivamshrirao2374 3 years ago
Was just going through the paper and there's already a video. Noiceeee !!
@nauman.mustafa 3 years ago
I like how they include PyTorch code which makes it so easy to implement compared to heavy latex math
@herp_derpingson 3 years ago
Why don't more papers do this?
@Metalwrath2 3 years ago
@@herp_derpingson Because most papers don't have reproducible results.
@kanal7523 2 years ago
@@Metalwrath2 sheeeeeeeesh
@Omsip123 3 months ago
Outstanding video on the topic. I have watched almost everything on the subject on YT and this is one of the best explanations of VAEs. I very much appreciate the helpful animations and diagrams; I guess it is a lot of work to get these done. Please keep it up, your channel will take off eventually, and I hope you get reward from the fact that the few who find your channel learn a lot from your work. Thanks a lot, and of course I subbed and liked.
@iandanforth 3 years ago
This video made clear to me the strong occlusion prior being introduced by the local/global student/teacher training method. I hadn't picked up on that in my first read through. Thank you!
@florianjug 3 years ago
Thanks! I hoped you’d be fast to cover DINO... and you delivered! :)
@originalsingh 3 years ago
Yannic : Nobody takes a picture of dirt and grass and posts it on SM GameDev artists : Woah look at this dirt patch!
@Kram1032 3 years ago
This is super cool! A really clever way to kinda do contrastive stuff without doing contrastive stuff, and the results speak for themselves.
@justwiredme 2 years ago
Great presentation. I like how you show the visual part, the best part for me as a beginner. I am very excited to learn this algorithm; this is very useful information for me, because sometimes in everyday life I can't read, so the audio is so helpful. Thank you.
@EyedMoon 3 years ago
God damnit, every time there's a new method/architecture I want to try out and can't find the time to really use. Thanks for the video; I read the paper, but the hindsight and small pieces of knowledge you give us about these methods and why they work are reaaaally good.
@_tnk_ 3 years ago
Very interesting and amazing results
@XX-vu5jo 3 years ago
Stupid! I submitted a similar concept before and was rejected because I am not a well-known person. Now, just because FB made it, they get glorified! This is crazy!
@herp_derpingson 3 years ago
Sad
@samanthaqiu3416 3 years ago
That's why there is arXiv. Didn't you think about publishing there?
@herp_derpingson 3 years ago
@@samanthaqiu3416 For many PhD programs, publishing on arXiv is not good enough.
@chndrl5649 2 years ago
I love these paper summaries!!!
@neworldemancer 2 years ago
tnx Dr. Kilcher, what you do is useful af! ;)
@dinoscheidt 3 years ago
I really like the acronym of this method. 👀
@mathildecaron1821 3 years ago
🦖
@dinoscheidt 3 years ago
Yeah... maybe not. Already getting messages with “See! DINO has attention issues”... 😶 thanks fb
@saurabheights 3 years ago
@@dinoscheidt Could you expand on those messages? Interested in "DINO has attention issues"!
@tzjtjktzjtzjztjztj 3 years ago
Great insight and comments, thanks Yannic
@zenchiassassin283 A year ago
Very interesting hypothesis !
@yaoweili681 3 years ago
great video, mate! The segmentation results are so good!
@robertgirard5659 3 years ago
Saved for later! Yannic dude love your vids!
@anassbairouk953 3 years ago
The data augmentation is important to avoid using clustering, which is not scalable with a huge dataset, because you get a huge cluster-centroid matrix that you need to store and update each time.
@kiachi470 3 years ago
Amazing explanation and paper too. Very interesting.
@odin12 3 years ago
When will the code for Generative Minimization Networks: Training GANs Without Competition be released?
@ivanr7725 3 years ago
Thanks a lot! A dinosaur should be on the cover.
@vsiegel 3 years ago
What confused me: TL;DR: it seems like it requires video as input, but it works on still images. In the intro at 0:55, there are examples shown, and all of them are videos. At first sight, it seemed obvious to me that it was detecting the moving object. Looking more closely, something more is going on: the movement of the waves is ignored, in a clean way. But still, the information for the separation is available in a very salient way. It took a while until I understood that it is about still images. Now I think the frames of the example videos are processed individually.
@pensiveintrovert4318 3 years ago
It pays attention to patches with maximal change. Of course we, the erect monkeys, also pay attention to visual fields with maximum change, to get food, or escape danger. Why? Because it works and that is how we have evolved, because it worked.
@ensabinha 9 months ago
29:15 - It achieves better results with ViT when compared to the "best ResNet", of course, but it's 3.6 times larger in the number of parameters. They're comparing a ~3.6x LARGER modern architecture (which probably employs an arsenal of training tricks) with a ResNet. Shocking, truly groundbreaking: you can get better results with a larger model.
@DamianReloaded 3 years ago
The attention maps look really good, especially the ones in video. It'd be interesting to see what it does when you occlude the thing in the scene it attended to the most, and how many things in the scene it would be capable of telling apart as you remove the ones it already attended to. Regarding the cooking video, I think it would have been better if it had been 90% about the language model and 10% about cooking. I personally would like to see more programming, and possibly interviews with the authors of the papers you review. my2c
@oncedidactic 3 years ago
I had a similar idea. If you paint out the objective attention using another system, what happens? Like Yannic's comment about pictures of roads and grass 😂
@zebrg 3 years ago
kzbin.info/www/bejne/nmTMm2Z8aMiDf80
@DistortedV12 3 years ago
Yannic, some constructive feedback: turn up your volume!
@korota199905 3 years ago
Absolutely yesss!
@mobilexia6285 3 years ago
Two quick notes: 1. The video can replace CVPR. 2. If the cat can be recognised by its ear, would that mean some 'generative power' has been created within the student?
@momeho A year ago
Thanks for your great video. Do you have any video on DINOv2?
@sanj1772 8 months ago
Amazing video. Can you please make one on DINOv2?
@oliverchalkley1187 A year ago
Great video, thanks! Surely the reason for the softmax is that the cross-entropy equation requires probabilities, and the softmax function turns the outputs into probabilities?
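That is indeed the usual reading: cross-entropy H(p, q) = -Σ_k p_k log q_k is only well defined between probability distributions, which is exactly what the softmax provides. A small sketch on toy logits (my own numbers, not from the paper):

```python
import torch

# Softmax turns raw logits into probability vectors, which cross-entropy
# H(p, q) = -sum_k p_k log q_k is defined over.
teacher_logits = torch.tensor([2.0, 0.5, -1.0])
student_logits = torch.tensor([1.5, 0.0, -0.5])

p = teacher_logits.softmax(-1)            # teacher targets, sums to 1
log_q = student_logits.log_softmax(-1)    # student log-probabilities
h = -(p * log_q).sum()                    # cross-entropy, minimized when q = p
```

By Gibbs' inequality, H(p, q) ≥ H(p) with equality only when the student matches the teacher, which is what makes this a sensible training signal.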
@Hydroslyde 3 years ago
Great video! So are we going to get a PAWS video next? Pretty please???
@mrwu6565 A year ago
Thank you Yannic!!! Can you do a video about CutLER? :)
@0_0bserver27 3 years ago
I don't exactly understand how distillation prevents collapse in this model in the explanation at 13:53. At 19:59 it is mentioned again that the student cannot output the same thing every time because it is prevented, but how exactly? Does someone want to elaborate?
@nahakuma 3 years ago
Nice final comments. Totally agree that augmentations should be internal to the learning process. As I see it, we humans do something similar by remembering previously seen patterns, as well as by imagining how things would change if we perturbed them (or by actually performing the perturbation). With respect to the global and local crops, does the teacher really only see global crops? Because according to the pseudo-code, both x1 and x2 go into both models.
@JamesAwokeKnowing 3 years ago
For augmentation we could substitute noisy input. For the dataset, a reconstructive loss and a world model should give basic objects and cause the model to prefer images with more significant (less random) semantic meaning. Then at dream time it can train on the meaningful images.
@francoisplessier9913 2 years ago
Great explanations, thank you for this quality video! I loved the 34:38 insight on augmentations! And I found your concern about the meme culture quite funny :-)
@alastairfinlinson898 3 years ago
Love the videos! Will you be providing your valuable insight on the papers "Multiscale Vision Transformers", "Vision Transformers for Remote Sensing Image Classification" and "Vision Transformers for Dense Prediction"?
@harambe2552 3 years ago
The softmax bounds the embedding space to a hypersphere. Otherwise your embedding space is unbounded and gives you an infinite projection space.
@yesno3071 2 years ago
Keep going :) very good
@sheggle 3 years ago
Would love to see time changes in natural video instead of augmentations, to see if "Why AI is Harder Than We Think" holds any water.
@danielalorbi 3 years ago
@Robert w No it isn't. We invented a whole new term and everything.
@danielalorbi 3 years ago
@Robert w Your comment changed. I don't recall exactly what it was initially but the meaning has changed.
@vasylcf 3 years ago
Thanks. It's really interesting.
@samdavidson5511 3 years ago
Awesome vid, thanks! And I see they are linking this video of yours in their git repo!
@miladaghajohari2308 3 years ago
well done!
@_arshadm 3 years ago
Great explainer video. I'm not sure I agree with your conclusion that augmentation may be a major source of the signal the approach is latching onto. My own suspicion is that the cropping is the main reason this approach works.
@odin12 3 years ago
This paper looks insane
@Bryan-jb6xu 2 years ago
Please make a video explaining EsViT. Thanks!
@Niels1234321 3 years ago
Maybe we should try using consecutive frames of a video as augmentations of the same thing; it requires less augmentation engineering, and you could argue that it resembles the data that humans learn from as children.
@yb801 4 months ago
Clearly explained, thanks.
@jonatan01i 3 years ago
Right now the images for the student model are sampled from the image at different x,y coordinates. What we could also do is sample them at different timestamps from a video.
@iftekharniloy913 3 years ago
I am just curious to see people use self-supervision on images which have multiple classes of interest.
@yimml4246 3 years ago
The cooking video did not really do "terribly." Yes, perhaps a bit less than the average video, but I watched it and it was adequate. Nonetheless, sometimes we need to try random things to avoid getting stuck in a local maximum. Keep it up!
@TechyBen 3 years ago
Terminator misspelt "Facebook" in the movies.
@pauljones9150 3 years ago
Have my updoot. I loved the cooking video btw. Maybe have a separate channel for cooking-style videos so you don't get tanked by the algo.
@rakshithv5073 3 years ago
Looking at the pseudocode, the block diagram (Figure 2) isn't a good representation of what's actually happening, right? At first sight, I thought x2 only goes through the teacher network and x1 through the student network.
@andrewcutler4599 3 years ago
ViT for augmentations when?
@akhilezai 3 years ago
There's no temporal aspect to it?
@RoboticusMusic 3 years ago
What's the framerate for 1080p? Is it realtime?
@piku1920 3 years ago
Hi - what does it mean to threshold the self-attention maps to keep 60% of the mass? What does mass represent here?
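One plausible reading (an assumption on my part, not the repo's exact code): treat the attention weights over patches as a distribution, sort them, and keep the smallest set of patches whose weights sum to 60% of the total attention, i.e. 60% of the "mass":

```python
import torch

def threshold_attention(attn, keep=0.6):
    """Boolean mask over patches whose largest attention values sum to `keep`
    of the total attention (my reading of 'keeping 60% of the mass')."""
    flat = attn.flatten()
    order = flat.argsort(descending=True)          # patches, strongest first
    csum = flat[order].cumsum(0) / flat.sum()      # cumulative fraction of mass
    k = int((csum < keep).sum()) + 1               # smallest top-k reaching the threshold
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[order[:k]] = True
    return mask.view_as(attn)
```

Binarizing the map this way is what produces the clean object segmentations shown in the video.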
@SakvaUA 3 years ago
Thanks for the video! Enlightening as always. The audio volume is a bit too low, though.
@anibalgonzalez7990 A year ago
Could anyone tell me how the teacher knows there are 'k' classes to be identified in a picture? Cheers!
@iiiiaaaa4548 2 years ago
Which model is used for downstream tasks? The student or the teacher?
@susdoge3767 8 months ago
I didn't properly understand sharpening and centering. Can anyone help me understand it intuitively?
@444haluk 3 years ago
Augmentations are so simple in nature that they could be part of the evolutionary dynamics of how human perception develops over time. Maybe in your sleep, different crops of the occipital cortex play this game of augmentation. Maybe you weren't born a tabula rasa, but born with augmentation dynamics.
@godsondeep241 3 years ago
Can we use this for the object detection task?
@michaelwangCH 3 years ago
What is the intuition behind it? How does it work so well without labels? Yannic, can you explain the intuition?
@susdoge3767 8 months ago
The intuition is that you try to make the network learn that an image of a cat's ear and a complete image of the cat should have the same representation. The hypothesis is that by forcing the model to learn consistent representations across scales (patch vs. whole image), it can grasp transferable features that are generally useful for computer vision tasks.
@michaelwangCH 8 months ago
@@susdoge3767 Thank you. Unsupervised learning is only possible if the latent-space representations are similar to each other (minimizing the distance in latent space). That is the reason we can observe emergent properties of LLMs; e.g. Google's translator, trained on English, can surprisingly translate Hindi or other languages it was not trained on. The only reason it works is that human languages have a similar structure, which is related to human biology, i.e. brain function; those processes in the brain are similar for all humans, independent of color, gender, nationality or race.
@susdoge3767 8 months ago
@@michaelwangCH That's another cool insight I didn't know!
@michaelwangCH 8 months ago
@@susdoge3767 Happy to help; knowledge belongs to the entire human race, not to a small group of people.
@Amin-wd4du A year ago
super
@DanFrederiksen 3 years ago
If it's truly unsupervised, why is it blind to vegetation and ocean waves? It seems they somehow managed to impose the simplistic notion that an image has only one classification.
@jeroenput258 3 years ago
Exactly. One of the images shows a dog on a sofa and only pays attention to the dog. What if I'm more interested in the sofa than the dog? It seems to impose a very subjective notion of importance on the image content. Besides, segmentation is highly task-dependent, so how could it know whether to segment the dog or its limbs, for instance? If you ask me, it just seems to learn from ImageNet to predict the most salient object and then use the features to perform a segmentation.
@randomthoughts3009 3 years ago
This is a visual artifact due to plot normalization. The central object has heatmap values that are relatively much higher than the background. Check the running dog example on the project page and look at the last frame where the dog is absent.
@DanFrederiksen 3 years ago
@@randomthoughts3009 Well, that it has very faint recognition of other things isn't really an excuse. But I guess it can be a simple result of the focus of the training set. The initial dog video tracks the dog, so there is naturally a heavy bias towards single-object classification.
@rahuldeora5815 3 years ago
The paper says "We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality. This dynamic was not observed in previous works [28, 56]." How does this make sense, given that the teacher is updated much more slowly than the student?
@jeroenput258 3 years ago
That's one thing I don't get either...
@mathildecaron1821 3 years ago
Polyak averaging
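Polyak averaging here means the teacher's weights are an exponential moving average of the student's, so the teacher behaves like an ensemble over recent student snapshots, which is one way the teacher can outperform any single student. A minimal sketch (the momentum value is illustrative; the paper ramps it toward 1 over training):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # teacher <- momentum * teacher + (1 - momentum) * student, parameter-wise
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_((1 - momentum) * ps)
```

Called once per step after the student's optimizer update, this keeps the teacher a smoothed, lagging copy of the student.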
@roughr4044 3 years ago
What clustering algorithm does it use on the features?
@roughr4044 3 years ago
Linear and kNN, got it...
@freemind.d2714 3 years ago
Basically: DINO = BYOL + Transformers
@JoshuaGAlbert 3 years ago
The volume is low in this video.
@sebbecht 3 years ago
WTF. I found this and was just about to suggest it to you over LinkedIn, then thought: what if I just check whether there are any YouTube videos on it first...
@calaldred2526 3 years ago
Yannic “Lightspeed” Kilcher strikes again
@lannguyende 3 years ago
I've read the paper, and sadly I didn't find anything new. They just gathered some techniques that already existed and implemented them in a self-supervised way. The funny thing is DINO: DIstill NO labels - but normal distillation training doesn't use any labels at all 😂
@louislouis7388 3 years ago
Many papers do it that way. Although it is very simple, they try to dress it up to make it seem complicated and plausible. I found this paper not impressive at all.
@larrybird3729 3 years ago
Would I rather watch Gordon Ramsay review the latest AI paper, or would I rather watch Yannic? That might answer your question, Yannic 😆
@mrburns366 2 years ago
Skynet is coming
@ssssssstssssssss 3 years ago
This seems to be an unsupervised clustering algorithm to me. I guess calling it "self-supervised" sounds sexier.
@preethamgali3023 3 years ago
It looks like double Q-learning. What do you think?
@444haluk 3 years ago
The dataset argument is weak as well, because every human you know had a parent or somebody who looked after them in their childhood; no human grows up alone with the wolves. Hence "where to look" may be a social aspect of the human species - hell, every species. I know cows have a type of attention and understanding which we would refer to as autistic: wherever they walk, if something unknown is in the proximity, they freeze and freak out. Maybe they are not good cow-culture teachers after all.
@laurenpinschannels 3 years ago
Off-topic: would you be open to adding donation options in a proof-of-stake coin? I don't have strong opinions about which one; I'd convert to whatever you think is a good option. I don't want to fund GPU demand with my donation :)
@cunningham.s_law 3 years ago
Seems like attention is all you need
@aminabbasloo 3 years ago
Seems like a game theory problem to me!
@lwang9175 3 years ago
You can see the stripes of the horse... Sorry, it's a zebra 🦓 hahaha
@HughesPerreault 3 years ago
Commenting for the algo.
@djfl58mdlwqlf 3 years ago
The cooking video was good lol
@scottmiller2591 3 years ago
"Cooking video" - Wat.
@陈宸-r7g 3 years ago
So fast~
@444haluk 3 years ago
Dude, the cooking video did terribly because in the thumbnail there is a "brown" object on the plate and it is pixelated. People may have associated it with, I don't know, LITERAL SHIT?