You are so good at distilling the paper's knowledge and surfacing the top insights. Thanks.
@ensabinha 7 months ago
Essentially, they pre-train with contrastive learning and fine-tune, then do pseudo-labeling (but using the full probability distribution over the labels rather than hard labels) and retrain on that.
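The retrain-on-soft-pseudo-labels step can be sketched in plain NumPy (a toy illustration with made-up logits and a hypothetical temperature value, not the paper's actual code):

```python
import numpy as np

def softmax(logits, tau=1.0):
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Cross-entropy between the teacher's softened output distribution
    and the student's prediction: soft pseudo-labels, not an argmax."""
    p_teacher = softmax(teacher_logits, tau)
    log_p_student = np.log(softmax(student_logits, tau))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Hypothetical logits for a batch of 2 unlabeled examples, 3 classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.5, 1.2, 0.3], [0.5, 2.5, 0.2]])
loss = distillation_loss(student, teacher)
```

Note the loss is minimized when the student matches the teacher's whole distribution, which is exactly why the full probabilities carry more signal than a hard pseudo-label.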
@Progfrag 4 years ago
Wow! So self-distillation is basically label smoothing, but smoothing in the right places instead of evenly.
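To make that analogy concrete (a toy NumPy illustration with made-up numbers, not from the paper): label smoothing spreads the off-target mass uniformly, whereas a teacher's soft output puts it on the classes it actually finds plausible:

```python
import numpy as np

num_classes = 5
true_class = 2
eps = 0.1

# Label smoothing: (1 - eps) on the true class, eps spread uniformly.
smoothed = np.full(num_classes, eps / num_classes)
smoothed[true_class] += 1.0 - eps

# A hypothetical teacher distribution: the off-target mass is
# concentrated on the classes the teacher confuses with class 2.
teacher = np.array([0.01, 0.07, 0.90, 0.015, 0.005])

assert np.isclose(smoothed.sum(), 1.0) and np.isclose(teacher.sum(), 1.0)
```

Both are valid soft targets; the teacher's version simply encodes which wrong answers are "less wrong".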
@CosmiaNebula 2 years ago
The "projection layer" is not an architecture but a job description: any module that performs the job of a projection is a "projection layer".

simCLR is an abstract framework for self-supervised contrastive learning. It consists of the following components:
1. Data augmentation: turning data points into data-point pairs (or triples, or n-tuples), to be used for contrastive learning.
2. Working layer: a module for turning data points into general representations.
3. Projection layer: a module for turning the general representation into a specific representation adapted to a specific purpose.
4. Student network: a different network for distilling the teacher network.

In the paper, simCLRv2 is concretely instantiated as:
1. Data augmentation: random cropping, color distortion, and Gaussian blur.
2. Working layer: ResNet-152.
3. Projection layer: a 3-layer MLP.
4. Student network: a ResNet smaller than ResNet-152.

The idea of a projection layer is to let the working layer focus on learning the general representation, instead of learning both a general representation AND the specific task during self-supervised training. Even self-supervised training is not a general task; it is specific! As they said in the simCLRv1 paper:

> We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h.

This is similar to how, in iGPT (2020), the authors found that linear probing works best in the middle layers. Probably because in the middle, the Transformer has fully understood the image, and then starts to focus back on predicting the next pixel.
Imagine its attention as a spindle: starting local, then global, finally local again.
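The framework described above can be sketched end to end in a few lines (a minimal NumPy version of an NT-Xent-style contrastive loss; the "working layer" and "projection layer" here are stand-in random linear maps, not the paper's ResNet and MLP):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N projections, where rows i and i+N are the two
    augmented views of the same image (the positive pair)."""
    z = l2_normalize(z)
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # never contrast a view with itself
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])  # partner index
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (-(sim[np.arange(n2), pos] - logsumexp)).mean()

# Toy pipeline with hypothetical sizes.
W_f = rng.normal(size=(8, 16))   # working layer (stand-in for the ResNet)
W_g = rng.normal(size=(16, 4))   # projection layer (stand-in for the MLP)

x = rng.normal(size=(6, 8))                 # batch of 6 "images"
view1 = x + 0.1 * rng.normal(size=x.shape)  # "augmentation" = small noise
view2 = x + 0.1 * rng.normal(size=x.shape)

h = np.concatenate([view1, view2]) @ W_f    # general representation h
z = h @ W_g                                 # task-specific projection z
loss = nt_xent(z)
```

For downstream tasks one would keep h and discard z, per the quoted conjecture.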
@tylertheeverlasting 4 years ago
I think there are two reasons to make a big deal out of that extra projection layer. 1. It's not standard practice, so their comparisons with previous methods aren't fully fair, and doing this might improve other methods as well. 2. The last layers of ResNet-50 are CNN -> Activation -> Global Average Pooling, so it's somewhat different from regular models with only a single linear layer on top of the CNN.
@quAdxify 3 years ago
There usually isn't an activation in front of GAP, I think at least. But yeah, it's basically not just a stacked matrix multiplication (which would be equivalent to just using a wider layer), because of GAP. It's pretty obvious why it works better, though: they are essentially bringing back the fully connected layer that was commonplace before GAP replaced it for most cases, so there simply is more representational power. We shouldn't forget that a fully connected layer has orders of magnitude more weights than a conv layer (depending on the number of filters, but let's keep that reasonable). I'd bet it wouldn't matter if they just replaced GAP with a regular fully connected layer.
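To put rough numbers on the "orders of magnitude" claim (back-of-envelope Python, assuming a ResNet-50-style 7x7x2048 final feature map and 1000 classes; purely illustrative):

```python
# Final feature map of a ResNet-50-style backbone: 7x7 spatial, 2048 channels.
H, W, C, classes = 7, 7, 2048, 1000

# Option A (standard): GAP collapses 7x7x2048 -> 2048, then one linear layer.
gap_head = C * classes            # 2,048,000 weights

# Option B (pre-GAP style): flatten and use a fully connected layer.
fc_head = H * W * C * classes     # 100,352,000 weights

# A 3x3 conv with 2048 input/output channels, for comparison.
conv3x3 = 3 * 3 * C * C           # 37,748,736 weights

ratio = fc_head / gap_head        # 49x more weights in the FC head
```

So the FC head has about 49x the weights of the GAP head, which is the extra capacity being discussed.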
@ralf2202 3 years ago
Yannic, you are a great teacher network! Thank you.
@sudhanshumittal8921 4 years ago
Thanks a lot Yannic for the latest updates.
@teslaonly2136 4 years ago
I was stunned when I saw the broader impact section.
@sathisha2394 4 years ago
You are so good at explaining things in a non-mathematical way, which helps me grasp the insights very quickly. I feel like I have gained so much knowledge just from watching your videos. Thank you so much, keep posting. Can you do a video about SIREN?
@salmaalsinan8612 3 years ago
It's been a while since I laughed while listening to something technical :D Excellent review, and I appreciate the funny commentary, as I had similar questions.
@PhucLe-qs7nx 4 years ago
Self-distillation is bootstrapping / self-play in RL. The recent paper BYOL also uses bootstrapping to ditch the negative samples altogether. I guess the reason these self-play or distillation methods work is the initial inductive bias in the random initialization + architecture. If you can't bootstrap to learn from the initial inductive biases, no learning is possible. And because we know learning is possible, even from zero labels, as long as the inductive biases and procedure are correct, bootstrapping / self-distillation / self-play must work.
@herp_derpingson 4 years ago
Great paper. Definitely the quality you expect from Hinton. Fun fact: his great-great-grandfather was George Boole (of Boolean algebra). . 21:20 I think it's worth noting that ResNet-50 probably went through some extensive hyperparameter tuning to do exactly what it was supposed to do, and thus had a fixed number of dense layers at the end. So perhaps adding a new layer just happens to help in the problem we are trying to solve, i.e. the teacher-student thing instead of one-hot. . 18:43 The whistling in the background. Is someone snoring?
@YannicKilcher 4 years ago
Wow, didn't know Hinton had royal blood :D Yeah, I agree this extra layer is super problem-specific, but I don't get why they don't just say the encoder is now bigger; instead they claim that this is part of the projection head. And no, I have no clue what that noise is O.o
@AdnanKhan-cx9it 2 years ago
That horrific background sound at 18:43. Btw, excellent explanation as always.
@drdca8263 4 years ago
Regarding the broader impact statement: while I generally agree that many broader impact statements appear not to be useful, I do think the "where it is more expensive or difficult to label additional data than to train larger models" point, along with the example of needing clinicians to carefully create annotations for medical applications, was probably worth saying. That part points to a specific area in which this improvement is useful. Of course, it would still be interesting even if it couldn't be used for anything, but I do think that detail is worthy of note. I imagine (with no real justification) that they mentioned crop yield because they felt obligated to include at least one negative example, but wanted the positive examples they listed to outnumber the negative ones, so they needed a second one. Another beneficial use case where getting labeled data is especially expensive or difficult compared to other use cases, and where that is clearly the case, might have been better than the part about food, but eh.
@YannicKilcher 4 years ago
Yeah, it's kind of like a job interview where they ask you about your weaknesses and you want to say something that's so minor it's almost irrelevant :D Jokes aside, it's actually awesome that you don't have to collect as many labels. But that doesn't belong in the broader impact section, at least not as it is defined by NeurIPS, because it still deals with the field of ML. In the BI section, you're supposed to preview how your method will influence greater society.
@Guesstahw 3 years ago
Thanks a lot @Yannik for the video, you did a great job. On the intuition behind the figure 1 plots and why it is so, here are my 2 cents: you just have to think in terms of the percentage of trainable parameters for the downstream task. To elaborate, first keep in mind that growing the model size means growing only the encoder; the size of the classification (linear) head remains constant. Now, since in fine-tuning only the head parameters are trained, as you grow the self-supervised encoder, the ratio of trainable parameters (corresponding to the head) shrinks with respect to the total model parameters. Therefore a downstream task with fewer labels benefits more from this drop in the percentage of trainable parameters (as the encoder grows) than its counterparts with more labels. I think this is the intuition behind the observed larger gains. In other words, the fewer the labels, the more expressive an encoder is required, to capture as much information about the structure and geometry of the (unlabeled) data as possible and compensate for the shortage of labels.
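This ratio argument can be checked with toy numbers (hypothetical parameter counts, not the paper's actual models): with a fixed-size linear head, the head's share of the total parameters shrinks as the encoder grows:

```python
head_params = 2_048_000  # fixed linear classification head (illustrative)

# Hypothetical encoder sizes, small -> large.
for encoder_params in [25e6, 100e6, 400e6]:
    frac = head_params / (encoder_params + head_params)
    print(f"encoder={encoder_params / 1e6:.0f}M  head share={frac:.1%}")
```

The head's share drops monotonically, which is the quantity the commenter's intuition hinges on.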
@jaesikkim6218 4 years ago
Really awesome explanation! Easy to understand!
@victorrielly4588 4 years ago
Very interesting hypothesis about why a bigger model provides better improvements through self-supervised learning. However, I would caution that bigger models do not necessarily mean more learned features: for instance, suppose you use a giant model where the last layer is 1-dimensional. In fact, the dimensionality of the feature space is not at all dependent on the size of the model, but on the dimensionality of the output layer.
@ProfessionalTycoons 4 years ago
Such a great paper, still so many secrets to unravel.
@rustists 3 years ago
Very good presentation. Thank you Yannic!!
@bengineer_the 4 years ago
Hi Yannic, this set of ideas feels like gold. This is how 'we' as humans learn. Children are allowed to experience the world with as few 'adult labels' as possible, "to get a feel" for the world. We then come along and explain things; they kind of memorise what you said, but then years later come back going, "Ahh, now I get it on my own terms...". So perhaps the warning about future abuse of this technique is somewhat valid. Can we now make a form of "accumulative consciousness scheme" (over all time) that could then be queried & labelled in the future, retroactively plucking the knowledge after you become aware of the concept label? This could be quite far-reaching.
@bengineer_the 4 years ago
Hmm, how about teaching such a system as described, then [later] giving it an inference-based connection to the internet (letting it search) and letting it figure out the labels later? Going on a tangent, but I wonder if there has been much research into clustering multiple image classifiers & NLP transformers into a label-acquisition learning scheme?
@bengineer_the 4 years ago
This form of learning (minimal-labelling combined with jittered-input forms) gives the network time to breathe. A bad teacher barks the answer. I like it a lot.
@YannicKilcher 4 years ago
Super interesting suggestions, I think what you're describing really goes into the direction of AGI where the system sort-of learns to reflect on what it knows and how it can learn the things it doesn't!
@SungEunSo 4 years ago
Thank you for the great explanation!
@eelcohoogendoorn8044 4 years ago
So.. putting a bunch of cool existing methods together works pretty well? Sarcasm aside, the extensive experiments are appreciated.
@alviur 3 years ago
Thanks a lot Yannic!
@RohitKumarSingh25 4 years ago
So the only novel idea in this paper is just adding the self-training or distillation part, right? I wonder how we never thought of it before for unlabelled data, given it seems so obvious, especially after realising the benefits of label smoothing and the mix-up technique.
@slackstation 4 years ago
In this one video, I feel like I've learned so many different insights. I'm still trying to level up to where I can read the math annotations as easily and clearly as Mr. Kilcher, but the insights here are amazing. If I could suggest a paper/video to explain, SIREN: kzbin.info/www/bejne/h2PJfYp9d8qUn6s Paper: arxiv.org/abs/2006.09661 The video does a decent job of explaining the concept and application. I'm more interested in your opinion on what impact this could have on the rest of the field by replacing ReLU and others with SIREN. As always, thank you for your work.
@RohitKumarSingh25 4 years ago
Agreed. Yannic, please review this paper if you get time.
@grinps 2 years ago
Thanks for the great review. What app did you use to read and annotate the PDF in this video?
@SachinSingh-do5ju 4 years ago
You have fans... and many of them 😛 I am one now.
@sayakpaul3152 4 years ago
22:13 why did you mention you were wrong in the supervised loss part? Sorry if this is a redundant question.
@YannicKilcher 4 years ago
I just re-watched it and I can't figure it out myself :D
@sayakpaul3152 4 years ago
@@YannicKilcher no worries man. I think these little traits make us human. Anyway, great explanation as always.
@theodorosgalanos9663 4 years ago
Thanks Yannic, this is great! I wonder, are you aware of any approach that deals with domains where augmentation, at least most of it, is not available? The best I remember is the ablation study on augmentations from... I forget which paper; it might have been v1 of this one. In my domain, most augmentations other than random crop invalidate the image completely (they are physics simulations). I wonder if anyone has tested whether the SSL approach still helps in these cases.
@YannicKilcher 4 years ago
No idea. Yes, I also recall that crop is the main driver of performance here.
@hexinlei6250 4 years ago
Really good presentation!!! Btw, may I ask what the presenting app is?
@YannicKilcher 3 years ago
OneNote
@authmanapatira3016 4 years ago
Love all your videos.
@johnkrafnik5414 4 years ago
Great video, thanks for making this so digestible. I am curious what the long-term goal is here; it feels like we are piling on hack after hack to improve by small percentage points. I understand that the overall goal of transitioning to semi-supervised learning is important, but so far it feels very incremental.
@phsamuelwork 4 years ago
Broader impact... it is like something one puts in an NSF proposal.
@sudhanshumittal8921 4 years ago
And that saturates the semi-supervised image classification performance. The community needs more realistic/harder benchmarks.
@sam.q 4 years ago
Thank you!
@RobNeuhaus 4 years ago
Do you have more information or intuition on self-distillation? Why does distilling the same model/architecture on unlabeled data, using an identical architecture, improve the student over the teacher?
@YannicKilcher 4 years ago
because it sees more data than the teacher
@mhadnanali 2 years ago
You are really good at paper reading. How does one gain this skill?
@christianleininger2954 4 years ago
I really like your videos. Maybe you would like to make a video about the paper "Accelerating Online Reinforcement Learning with Offline Datasets".
@JavierPortillo1 2 years ago
Thanks! Very clearly explained! Could you please explain the SwAV model?
@vitocorleone1991 2 years ago
I salute you sir!!!
@christianleininger2954 4 years ago
Great job! Amazing.
@johngrabner 4 years ago
Wow, maybe this paper discovered why we dream.
@shivanshu6204 3 years ago
Damn you went hard after the broader impact lol.
@theodorosgalanos9663 4 years ago
So SSL gives us access to a sort of large feature space, and distillation filters out which of those features are important for the task at hand? I wonder if there is an experiment without distillation to see if that extra noise in the feature space hurts (so fine-tune and predict without the student). Okay, I'll stop being lazy and check!
@YannicKilcher 4 years ago
Yes, the first experiments in the paper are without distillation, as far as I understand (it's not explicitly clear, though)
@nopnopnopnopnopnopnop 2 years ago
I still don't get the self-distillation part. If the teacher and the student are the same network, then they produce the same outputs, so what is there to even learn? In this case the student didn't have the additional projection layer, so at least the networks aren't identical (though I still don't understand what there is to learn). But Kilcher made it sound like it would be useful even if the networks were the same.
@antonio.7557 4 years ago
One thing confuses me about distillation/self-supervised learning: some methods enhance the pseudo-label, some use confidence thresholds, some use augmentations for the student input, but this paper doesn't do any of those?
@rpcruz 4 years ago
It uses augmentation. From the paper: "SimCLR learns representations by maximizing agreement [26] between differently augmented views of the same data example via a contrastive loss in the latent space. (...) We use the same set of simple augmentations as SimCLR."
@antonio.7557 4 years ago
@@rpcruz Ah ok, thanks! Makes a lot more sense then.
@MastroXel 4 years ago
You mentioned that it's not well understood why we get a better model after distillation. Let me push the question further: if that's the case, why can't we now take the student, treat it as a new teacher, and obtain an even better student? That doesn't make too much sense, does it?
@YannicKilcher 4 years ago
People do that, but there are diminishing returns.
@andres_pq 4 years ago
What is the difference between contrastive loss and triplet loss?
@YannicKilcher 4 years ago
Haven't looked at triplet loss yet, but the contrastive loss has an entire set of negatives.
@snippletrap 4 years ago
Triplet loss is a kind of contrastive loss
@rajeshdhawan4624 4 years ago
I want to connect about the same... kindly let me know how?
@twobob 2 years ago
agree
@sacramentofwilderness6656 4 years ago
I would like a neural network to slow down the time to keep up with the advances in machine learning and AI
@Sileadim 3 years ago
"That would be ridiculous. Well, I guess in this day and age nothing is ridiculous." xD
@alonsorobots 3 years ago
pure alchemy...
@sayakpaul3152 4 years ago
21:20 I think the representations do pass through a non-linearity; there's a sigma there. But anyway, the notation is more complicated than it needed to be, frankly.
@scottmiller2591 4 years ago
If I were cynical, I would think you don't see much value in broader impact statements. If I were cynical.
@YannicKilcher 4 years ago
hypothetically
@snippletrap 4 years ago
Lol. The same busybodies and morality police sticking their noses into open source communities, renaming NIPS, etc. Why does no one say No to these humorless twats and control freaks?
@deeplearner2634 3 years ago
Crop yields... haha. Did the researchers suffer from a food shortage??