You are so good at distilling the paper's knowledge and surfacing the top insights. Thanks.
@ensabinha 7 months ago
Essentially, they pre-train with contrastive learning and fine-tune, then do pseudo-labeling (but using the full probability distribution over the labels rather than hard labels) and retrain on that.
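The retrain-on-soft-pseudo-labels step can be sketched in plain NumPy (a toy illustration with made-up logits and a hypothetical temperature value, not the paper's actual code):

```python
import numpy as np

def softmax(logits, tau=1.0):
    z = logits / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Cross-entropy between the teacher's softened output distribution
    and the student's prediction: soft pseudo-labels, not an argmax."""
    p_teacher = softmax(teacher_logits, tau)
    log_p_student = np.log(softmax(student_logits, tau))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Hypothetical logits for a batch of 2 unlabeled examples, 3 classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student = np.array([[3.5, 1.2, 0.3], [0.5, 2.5, 0.2]])
loss = distillation_loss(student, teacher)
```

Note the loss is minimized when the student matches the teacher's whole distribution, which is exactly why the full probabilities carry more signal than a hard pseudo-label.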
@Progfrag 4 years ago
Wow! So self-distillation is basically label smoothing, but smoothing in the right places instead of evenly.
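To make that analogy concrete (a toy NumPy illustration with made-up numbers, not from the paper): label smoothing spreads the off-target mass uniformly, whereas a teacher's soft output puts it on the classes it actually finds plausible:

```python
import numpy as np

num_classes = 5
true_class = 2
eps = 0.1

# Label smoothing: (1 - eps) on the true class, eps spread uniformly.
smoothed = np.full(num_classes, eps / num_classes)
smoothed[true_class] += 1.0 - eps

# A hypothetical teacher distribution: the off-target mass is
# concentrated on the classes the teacher confuses with class 2.
teacher = np.array([0.01, 0.07, 0.90, 0.015, 0.005])

assert np.isclose(smoothed.sum(), 1.0) and np.isclose(teacher.sum(), 1.0)
```

Both are valid soft targets; the teacher's version simply encodes which wrong answers are "less wrong".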
@CosmiaNebula 2 years ago
The "projection layer" is not an architecture but a job description: any module that performs the job of a projection is a "projection layer".

simCLR is an abstract framework for self-supervised contrastive learning. It consists of the following components:
1. Data augmentation: turning data points into data-point pairs (or triples, or n-tuples), to be used for contrastive learning.
2. Working layer: a module for turning data points into general representations.
3. Projection layer: a module for turning the general representation into a specific representation adapted to a specific purpose.
4. Student network: a different network for distilling the teacher network.

In the paper, simCLRv2 is concretely instantiated as:
1. Data augmentation: random cropping, color distortion, and Gaussian blur.
2. Working layer: ResNet-152.
3. Projection layer: a 3-layer MLP.
4. Student network: a ResNet smaller than ResNet-152.

The idea of a projection layer is to let the working layer focus on learning the general representation, instead of learning both a general representation AND the specific task during self-supervised training. Even self-supervised training is not a general task; it is specific! As they said in the simCLRv1 paper:

> We conjecture that the importance of using the representation before the nonlinear projection is due to loss of information induced by the contrastive loss. In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h.

This is similar to how, in iGPT (2020), the authors found that linear probing works best in the middle layers. Probably because in the middle, the Transformer has fully understood the image, and then starts to focus back on predicting the next pixel.
Imagine its attention as a spindle: starting local, then global, finally local again.
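The framework described above can be sketched end to end in a few lines (a minimal NumPy version of an NT-Xent-style contrastive loss; the "working layer" and "projection layer" here are stand-in random linear maps, not the paper's ResNet and MLP):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def nt_xent(z, tau=0.5):
    """NT-Xent over 2N projections, where rows i and i+N are the two
    augmented views of the same image (the positive pair)."""
    z = l2_normalize(z)
    n2 = z.shape[0]
    n = n2 // 2
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)  # never contrast a view with itself
    pos = np.concatenate([np.arange(n, n2), np.arange(0, n)])  # partner index
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (-(sim[np.arange(n2), pos] - logsumexp)).mean()

# Toy pipeline with hypothetical sizes.
W_f = rng.normal(size=(8, 16))   # working layer (stand-in for the ResNet)
W_g = rng.normal(size=(16, 4))   # projection layer (stand-in for the MLP)

x = rng.normal(size=(6, 8))                 # batch of 6 "images"
view1 = x + 0.1 * rng.normal(size=x.shape)  # "augmentation" = small noise
view2 = x + 0.1 * rng.normal(size=x.shape)

h = np.concatenate([view1, view2]) @ W_f    # general representation h
z = h @ W_g                                 # task-specific projection z
loss = nt_xent(z)
```

For downstream tasks one would keep h and discard z, per the quoted conjecture.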
@tylertheeverlasting 4 years ago
I think there are two reasons to make a big deal out of that extra projection layer. 1. It's not standard practice, so their comparisons with previous methods aren't fully fair, and doing this might improve other methods as well. 2. The last layers of ResNet-50 are CNN -> Activation -> Global Average Pooling, so it's somewhat different from regular models with only a single linear layer on top of the CNN.
@quAdxify 3 years ago
There usually isn't an activation in front of GAP, I think at least. But yeah, it's basically not just a stacked matrix multiplication (which would be equivalent to just using a wider layer), because of GAP. It's pretty obvious why it works better, though: they are essentially bringing back the fully connected layer that was commonplace before GAP replaced it for most cases, so there simply is more representational power. We shouldn't forget that a fully connected layer has orders of magnitude more weights than a conv layer (depending on the number of filters, but let's keep that reasonable). I'd bet it wouldn't matter if they just replaced GAP with a regular fully connected layer.
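To put rough numbers on the "orders of magnitude" claim (back-of-envelope Python, assuming a ResNet-50-style 7x7x2048 final feature map and 1000 classes; purely illustrative):

```python
# Final feature map of a ResNet-50-style backbone: 7x7 spatial, 2048 channels.
H, W, C, classes = 7, 7, 2048, 1000

# Option A (standard): GAP collapses 7x7x2048 -> 2048, then one linear layer.
gap_head = C * classes            # 2,048,000 weights

# Option B (pre-GAP style): flatten and use a fully connected layer.
fc_head = H * W * C * classes     # 100,352,000 weights

# A 3x3 conv with 2048 input/output channels, for comparison.
conv3x3 = 3 * 3 * C * C           # 37,748,736 weights

ratio = fc_head / gap_head        # 49x more weights in the FC head
```

So the FC head has about 49x the weights of the GAP head, which is the extra capacity being discussed.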
@ralf2202 3 years ago
Yannic, you are a great teacher network! Thank you.
@sudhanshumittal8921 4 years ago
Thanks a lot Yannic for the latest updates.
@teslaonly2136 4 years ago
I was stunned when I saw the broader impact section.
@sathisha2394 4 years ago
You are so good at explaining things in a non-mathematical way, which helps me grasp the insights very quickly. I feel like I have gained so much knowledge just from watching your videos. Thank you so much, keep posting. Can you do a video about SIREN?
@salmaalsinan8612 3 years ago
It's been a while since I laughed while listening to something technical :D Excellent review, and I appreciate the funny commentary, as I had similar questions.
@PhucLe-qs7nx 4 years ago
Self-distillation is bootstrapping / self-play in RL. The recent paper BYOL also uses bootstrapping to ditch the negative samples altogether. I guess the reason these self-play or distillation methods work is the initial inductive bias in the random initialization + architecture. If you can't bootstrap to learn from the initial inductive biases, no learning is possible. And because we know learning is possible, even from zero labels, as long as the inductive biases and procedure are correct, bootstrapping / self-distillation / self-play must work.
@herp_derpingson 4 years ago
Great paper. Definitely the quality you expect from Hinton. Fun fact: his great-great-grandfather was George Boole (of Boolean algebra). . 21:20 I think it's worth noting that ResNet-50 probably went through some extensive hyperparameter tuning to do exactly what it was supposed to do, and thus had a fixed number of dense layers at the end. So perhaps adding a new layer just happens to help in the problem we are trying to solve, i.e. the teacher-student thing instead of one-hot. . 18:43 The whistling in the background. Is someone snoring?
@YannicKilcher 4 years ago
Wow, didn't know Hinton had royal blood :D Yeah, I agree this extra layer is super problem-specific, but I don't get why they don't just say the encoder is now bigger; instead they claim that this is part of the projection head. And no, I have no clue what that noise is O.o
@AdnanKhan-cx9it 2 years ago
That horrific background sound at 18:43. Btw, excellent explanation as always.
@drdca8263 4 years ago
Regarding the broader impact statement: while I generally agree that many broader impact statements appear not to be useful, I do think the "where it is more expensive or difficult to label additional data than to train larger models" point, along with the example of needing clinicians to carefully create annotations for medical applications, was probably worth saying. That part points to a specific area in which this improvement is useful. Of course, it would still be interesting even if it couldn't be used for anything, but I do think that detail is worthy of note. I imagine (with no real justification) that they mentioned crop yield because they felt obligated to include at least one negative example, but wanted the positive examples they listed to outnumber the negative ones, so they needed a second one. Another beneficial use case where getting labeled data is especially expensive or difficult compared to other use cases, and where that is clearly the case, might have been better than the part about food, but eh.
@YannicKilcher 4 years ago
Yeah, it's kind of like a job interview where they ask you about your weaknesses and you want to say something that's so minor it's almost irrelevant :D Jokes aside, it's actually awesome that you don't have to collect as many labels. But that doesn't belong in the broader impact section, at least not as it is defined by NeurIPS, because it still deals with the field of ML. In the BI section, you're supposed to preview how your method will influence greater society.
@Guesstahw 3 years ago
Thanks a lot @Yannik for the video, you did a great job. On the intuition behind the figure 1 plots and why it is so, here are my 2 cents: you just have to think in terms of the percentage of trainable parameters for the downstream task. To elaborate, first keep in mind that growing the model size means growing only the encoder; the size of the classification (linear) head remains constant. Now, since in fine-tuning only the head parameters are trained, as you grow the self-supervised encoder, the ratio of trainable parameters (corresponding to the head) shrinks with respect to the total model parameters. Therefore a downstream task with fewer labels benefits more from this drop in the percentage of trainable parameters (as the encoder grows) than its counterparts with more labels. I think this is the intuition behind the observed larger gains. In other words, the fewer the labels, the more expressive an encoder is required, to capture as much information about the structure and geometry of the (unlabeled) data as possible and compensate for the shortage of labels.
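This ratio argument can be checked with toy numbers (hypothetical parameter counts, not the paper's actual models): with a fixed-size linear head, the head's share of the total parameters shrinks as the encoder grows:

```python
head_params = 2_048_000  # fixed linear classification head (illustrative)

# Hypothetical encoder sizes, small -> large.
for encoder_params in [25e6, 100e6, 400e6]:
    frac = head_params / (encoder_params + head_params)
    print(f"encoder={encoder_params / 1e6:.0f}M  head share={frac:.1%}")
```

The head's share drops monotonically, which is the quantity the commenter's intuition hinges on.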
@jaesikkim6218 4 years ago
Really awesome explanation! Easy to understand!
@victorrielly4588 4 years ago
Very interesting hypothesis about why a bigger model provides better improvements through self-supervised learning. However, I would caution that bigger models do not necessarily mean more learned features: for instance, suppose you use a giant model where the last layer is 1-dimensional. In fact, the dimensionality of the feature space is not at all dependent on the size of the model, but on the dimensionality of the output layer.
@ProfessionalTycoons 4 years ago
Such a great paper, still so many secrets to unravel.
@rustists 3 years ago
Very good presentation. Thank you Yannic!!
@bengineer_the 4 years ago
Hi Yannic, this set of ideas feels like gold. This is how 'we' as humans learn. Children are allowed to experience the world with as few 'adult labels' as possible, "to get a feel" for the world. We then come along and explain things; they kind of memorise what you said, but then years later come back going, "Ahh, now I get it on my own terms...". So perhaps the warning about future abuse of this technique is somewhat valid. Can we now make a form of "accumulative consciousness scheme" (over all time) that could then be queried & labelled in the future, retroactively plucking the knowledge after you become aware of the concept label? This could be quite far-reaching.
@bengineer_the 4 years ago
Hmm, how about teaching such a system as described, then [later] giving it an inference-based connection to the internet (letting it search) and letting it figure out the labels later? Going on a tangent, but I wonder if there has been much research into clustering multiple image classifiers & NLP transformers into a label-acquisition learning scheme?
@bengineer_the 4 years ago
This form of learning (minimal-labelling combined with jittered-input forms) gives the network time to breathe. A bad teacher barks the answer. I like it a lot.
@YannicKilcher 4 years ago
Super interesting suggestions, I think what you're describing really goes into the direction of AGI where the system sort-of learns to reflect on what it knows and how it can learn the things it doesn't!
@SungEunSo 4 years ago
Thank you for the great explanation!
@eelcohoogendoorn8044 4 years ago
So.. putting a bunch of cool existing methods together works pretty well? Sarcasm aside, the extensive experiments are appreciated.
@alviur 3 years ago
Thanks a lot Yannic!
@RohitKumarSingh25 4 years ago
So the only novel idea in this paper is just adding the self-training or distillation part, right? I wonder how we never thought of it before for unlabelled data, given it seems so obvious, especially after realising the benefits of label smoothing and the mix-up technique.
@slackstation 4 years ago
In this one video, I feel like I've learned so many different insights. I'm still trying to level up to where I can read the math annotations as easily and clearly as Mr. Kilcher, but the insights here are amazing. If I could suggest a paper/video to explain, SIREN: kzbin.info/www/bejne/h2PJfYp9d8qUn6s Paper: arxiv.org/abs/2006.09661 The video does a decent job of explaining the concept and application. I'm more interested in your opinion on what impact this could have on the rest of the field by replacing ReLU and others with SIREN. As always, thank you for your work.
@RohitKumarSingh25 4 years ago
Agreed. Yannic, please review this paper if you get time.
@grinps 2 years ago
Thanks for the great review. What app did you use to read and annotate the PDF in this video?
@SachinSingh-do5ju 4 years ago
You have fans... and many of them 😛 I am one now.
@sayakpaul3152 4 years ago
22:13 why did you mention you were wrong in the supervised loss part? Sorry if this is a redundant question.
@YannicKilcher 4 years ago
I just re-watched it and I can't figure it out myself :D
@sayakpaul3152 4 years ago
@@YannicKilcher no worries man. I think these little traits make us human. Anyway, great explanation as always.
@theodorosgalanos9663 4 years ago
Thanks Yannic, this is great! I wonder, are you aware of any approach that deals with domains where augmentation, at least most of it, is not available? The best I remember is the ablation study on augmentations from... I forget which paper; it might have been v1 of this one. In my domain, most augmentations other than random crop invalidate the image completely (they are physics simulations). I wonder if anyone has tested whether the SSL approach still helps in these cases.
@YannicKilcher 4 years ago
No idea. Yes, I also recall that crop is the main driver of performance here.
@hexinlei6250 4 years ago
Really good presentation!!! Btw, may I ask what the presenting app is?
@YannicKilcher 3 years ago
OneNote
@authmanapatira3016 4 years ago
Love all your videos.
@johnkrafnik5414 4 years ago
Great video, thanks for making this so digestible. I am curious what the long-term goal is here; it feels like we are piling on hack after hack to improve by small percentage points. I understand that the overall goal of transitioning to semi-supervised learning is important, but so far it feels very incremental.
@phsamuelwork 4 years ago
Broader impact... it is like something one puts in an NSF proposal.
@sudhanshumittal8921 4 years ago
And that saturates the semi-supervised image classification performance. The community needs more realistic/harder benchmarks.
@sam.q 4 years ago
Thank you!
@RobNeuhaus 4 years ago
Do you have more information or intuition on self-distillation? Why does distilling the same model/architecture on unlabeled data, using an identical architecture, improve the student over the teacher?
@YannicKilcher 4 years ago
because it sees more data than the teacher
@mhadnanali 2 years ago
You are really good at paper reading. How does one gain this skill?
@christianleininger2954 4 years ago
I really like your videos. Maybe you would like to make a video about the paper "Accelerating Online Reinforcement Learning with Offline Datasets".
@JavierPortillo1 2 years ago
Thanks! Very clearly explained! Could you please explain the SwAV model?
@vitocorleone1991 2 years ago
I salute you sir!!!
@christianleininger2954 4 years ago
Great job! Amazing.
@johngrabner 4 years ago
Wow, maybe this paper discovered why we dream.
@shivanshu6204 3 years ago
Damn you went hard after the broader impact lol.
@theodorosgalanos9663 4 years ago
So SSL gives us access to a sort of large feature space, and distillation filters out which of those features are important for the task at hand? I wonder if there is an experiment without distillation to see if that extra noise in the feature space hurts (so fine-tune and predict without the student). Okay, I'll stop being lazy and check!
@YannicKilcher 4 years ago
Yes, the first experiments in the paper are without distillation, as far as I understand (it's not explicitly clear, though)
@nopnopnopnopnopnopnop 2 years ago
I still don't get the self-distillation part. If the teacher and the student are the same network, then they produce the same outputs, so what is there to even learn? In this case the student didn't have the additional projection layer, so at least the networks aren't identical (though I still don't understand what there is to learn). But Kilcher made it sound like it would be useful even if the networks were the same.
@antonio.7557 4 years ago
One thing confuses me about distillation/self-supervised learning: some methods enhance the pseudo-label, some use confidence thresholds, some use augmentations for the student input, but this paper doesn't do any of those?
@rpcruz 4 years ago
It uses augmentation. From the paper: "SimCLR learns representations by maximizing agreement [26] between differently augmented views of the same data example via a contrastive loss in the latent space. (...) We use the same set of simple augmentations as SimCLR."
@antonio.7557 4 years ago
@@rpcruz Ah ok, thanks! Makes a lot more sense then.
@MastroXel 4 years ago
You mentioned that it's not well understood why we get a better model after distillation. Let me push the question further: if that's the case, why can't we now take the student, treat it as a new teacher, and obtain an even better student? That doesn't make too much sense, does it?
@YannicKilcher 4 years ago
People do that, but there are diminishing returns.
@andres_pq 4 years ago
What is the difference between contrastive loss and triplet loss?
@YannicKilcher 4 years ago
Haven't looked at triplet loss yet, but the contrastive loss has an entire set of negatives.
@snippletrap 4 years ago
Triplet loss is a kind of contrastive loss
@rajeshdhawan4624 4 years ago
I want to connect about the same... kindly let me know how?
@twobob 2 years ago
agree
@sacramentofwilderness6656 4 years ago
I would like a neural network to slow down the time to keep up with the advances in machine learning and AI
@Sileadim 3 years ago
"That would be ridiculous. Well, I guess in this day and age nothing is ridiculous." xD
@alonsorobots 3 years ago
pure alchemy...
@sayakpaul3152 4 years ago
21:20 I think the representations do pass through a non-linearity; there's a sigma there. But anyway, the notation is more complicated than it needed to be, frankly.
@scottmiller2591 4 years ago
If I were cynical, I would think you don't see much value in broader impact statements. If I were cynical.
@YannicKilcher 4 years ago
hypothetically
@snippletrap 4 years ago
Lol. The same busybodies and morality police sticking their noses into open source communities, renaming NIPS, etc. Why does no one say No to these humorless twats and control freaks?
@deeplearner2634 3 years ago
Crop yields... haha. Did the researchers suffer from a food shortage??