This channel is insanely good. Deserves even more recognition. Great work! Subscribed
@MachineLearningStreetTalk 3 years ago
This is a really important paper, I suggest people pay particular attention to Yannic's "robustness to data shift" section if you are short on time. I hope we can get the authors on to discuss this!
@ghostlv4030 3 years ago
The idea is so simple, and yet it's hard to believe it is this effective! Okay, I see, NLP is really useful in vision now.
@jonatan01i 3 years ago
Thank you so much for this, especially for not keeping the promise on cutting the video short!
@ПавелШтыков-х9т 2 years ago
Man, you have a talent for explaining hard things! And your English is awesome!!
@shengyaozhuang3748 3 years ago
Interestingly, similar training methods have been explored in information retrieval for finding documents relevant to a given query. So a good application of CLIP could be searching for a photo on the internet using a text query.
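(Editor's note: a minimal sketch of such a text-query photo search using OpenAI's open-source clip package; the photo file names are hypothetical placeholders.)

```python
# Minimal sketch: rank a photo collection against a text query with CLIP.
# Assumes the "clip" package (pip install git+https://github.com/openai/CLIP.git)
# and that the listed image files exist; both are illustrative assumptions.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

photos = ["beach.jpg", "dog.jpg", "city.jpg"]  # hypothetical photo collection
images = torch.stack([preprocess(Image.open(p)) for p in photos]).to(device)
query = clip.tokenize(["a dog playing in the snow"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(images)
    text_emb = model.encode_text(query)

# Normalize so the dot product is cosine similarity, then rank the photos
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(1)
best = scores.argmax().item()
print(f"best match: {photos[best]} (score {scores[best].item():.3f})")
```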
@dreadfulbodyguard7288 3 years ago
Google images?
@charrylee3671 3 months ago
Such a good video. I understand CLIP after your explanation.
@G12GilbertProduction 3 years ago
In a 8 of 20 examples presented in this paper review is really measured by different compilers of models, but not only this same in 20, 45, 60 bites for a 1mm³ pixel outer the third output layer.
@alteshaus3149 3 months ago
Thank you so much for this video! It really helped me understand CLIP! Best regards from Vienna!
@MeatFingerSteam 3 years ago
Absolutely loved the Alec meme, thanks!
@key2thacity87 1 year ago
Hey @YannicKilcher /all, it seems like OpenAI is only referring to performance on the banana class at 39:05 (figure 13), not claiming that zero-shot CLIP outperforms ResNet in general on ImageNet. Earlier in the paper (8:15) they achieve 40% accuracy on ImageNet. Is 39:05 (figure 13) showing 72% accuracy on bananas, or overall?
@p.z.8355 3 years ago
Do they have a specific strategy for sampling the batches? Maybe sampling totally unrelated captions initially (e.g. dogs and planes), then at a later stage of training sampling more subtly differing captions (e.g. different breeds of dogs).
@YannicKilcher 3 years ago
I think it's just purely random.
@p.z.8355 3 years ago
@@YannicKilcher thanks :)
@bukovelby 2 years ago
Just a brilliant overview!
@vsiegel 3 years ago
Trained on "the internet" - so technically speaking, it is a porn classifier, right? Except if it used a separate algorithm for "adult image filtering". Fascinating! (And funny!)
@naifalkhunaizi7847 1 year ago
Truly great explanation!
@ophir1080 3 years ago
Great video, thanks for sharing! Just one wonder of mine: why are we 100% sure that all these old, well-known datasets are not just subsets of the images CLIP was trained on?
@jeshweedleon3960 3 years ago
Imagine this but with more sensory data - audio, video, text, hell any string of bytes even. Wild...
@ashrafg4668 3 years ago
Thank you for the explanation!
@TechNewsReviews 27 days ago
Very good explanation 👌
@aminasadi1040 2 years ago
Thanks a lot for this awesome video! The explanations are very digestible even for a beginner.
@oflasch 3 years ago
Great explanation! 👍
@SaidAzizov-c3l 1 year ago
Excellent! Thank you a lot!
@theocachet6496 2 years ago
Do they check that T_i != T_j for i != j, where (i, j) are indices within a minibatch? If not, the contrastive loss can sometimes conflict with itself (maximizing the similarity of a matched pair while minimizing the similarity of an identical-caption pair in the same computation). Do we agree?
@44Kokoloko 2 years ago
Am I understanding this right? CLIP training produces both a text encoder and an image encoder that can numerically represent the proximity between words and images, as vectors, and these encoders can then be used on different datasets to good effect.

In other words, it builds on the findings around text embeddings (word2vec) to train corresponding "image embeddings" in a way that allows matching an image embedding to a text embedding. Text embeddings having proved able to encode relations between concepts in vector space (king - man + woman = queen), you can then move between the text and image representations of these concepts. Does that sound right?

Also, what is the pretraining done on?
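(Editor's note: that reading matches the paper. The pretraining is done on WebImageText (WIT), roughly 400 million image-text pairs collected from the internet. Below is a minimal, runnable PyTorch sketch of the symmetric contrastive objective following the paper's Figure 3 pseudocode; the encoders are stubbed with random features, since only the loss is illustrated here.)

```python
# Sketch of CLIP's symmetric contrastive loss (per the paper's Figure 3).
# Encoder outputs are replaced by random tensors for illustration.
import torch
import torch.nn.functional as F

n, d = 8, 512                       # batch size, joint embedding dimension
image_features = torch.randn(n, d)  # stand-in for image_encoder output
text_features = torch.randn(n, d)   # stand-in for text_encoder output
log_temp = torch.tensor(2.66)       # learned log temperature; exp(2.66) ≈ 1/0.07

# L2-normalize so the dot product is cosine similarity
img = F.normalize(image_features, dim=-1)
txt = F.normalize(text_features, dim=-1)

# n x n similarity matrix; entry (i, j) pairs image i with text j
logits = img @ txt.T * log_temp.exp()

# The diagonal holds the true pairs: cross-entropy in both directions
labels = torch.arange(n)
loss_i = F.cross_entropy(logits, labels)    # each image picks its text
loss_t = F.cross_entropy(logits.T, labels)  # each text picks its image
loss = (loss_i + loss_t) / 2
print(loss.item())
```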
@jenishah9825 3 years ago
I can't thank you enough for making such useful videos.
@Xaelum 3 years ago
Just imagine a version of CLIP trained on random YouTube video frames + title or subtitles.
@ThetaPhiPsi 3 years ago
Just for people watching this later: they revised the results for STL-10 in another version of the paper. On p. 40 they write "We updated the STL10 scores from the previous version of this paper after fixing a CUDA-related bug."
@antonio.7557 3 years ago
Thanks Yannic, great video! But the biggest question I have is how they got this dataset of images + descriptions 🤔
@akhilezai 3 years ago
Hey Yannic! I wanna know what software you use to "extend" your PDF with empty space that you use to write notes. Please tell us
@tsunamidestructor 3 years ago
OneNote, afaik
@fayeq1745 3 years ago
I was also wondering about that and figured out it might be OneNote.
@akhilezai 3 years ago
So I found another way to do it: using LaTeX's includepdf.
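(Editor's note: a minimal sketch of the includepdf trick the commenter describes, assuming the annotated file is paper.pdf; the page dimensions and offset are illustrative, not tested values.)

```latex
% Re-emit each page of paper.pdf onto a taller page, leaving blank
% note-taking space below; requires the geometry and pdfpages packages.
\documentclass{article}
\usepackage[paperwidth=21cm,paperheight=42cm,margin=0cm]{geometry}
\usepackage{pdfpages}
\begin{document}
% pages=- includes all pages; noautoscale keeps their original size;
% offset shifts each page upward so the empty space appears underneath.
\includepdf[pages=-,noautoscale=true,offset=0 10.5cm]{paper.pdf}
\end{document}
```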
@tsunamidestructor 3 years ago
@@akhilezai you could also use LiquidText if you have an iPad
@akhilezai 3 years ago
@@tsunamidestructor thanks! I was sure it was possible on some apps on iPad, but I own Samsung tab s7+
@uniqued4ve 2 years ago
I'm missing your critique points a bit here! But thanks, good intro to CLIP.
@ShivamSingh-xf8nb 1 year ago
Amazing explanation!
@user-vr3bl6cn9e 4 months ago
After understanding the paper, how do we approach understanding the code?
@prabhupadpradhan489 3 years ago
Is the dataset that was used for pretraining the model (referred to in the paper as WebImageText) available for public use?
@bhavikdhandhalya 8 months ago
I thought you would explain how the images and words are processed so that they end up connected. No issue.
@omkarpanhalkar1857 3 months ago
What is linear probing between two visual models?
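(Editor's note: in the paper, linear probing means freezing a model's visual encoder and training only a linear classifier, logistic regression in CLIP's case, on the frozen features; two visual models are then compared by the test accuracy of their probes. A minimal sketch with scikit-learn, using random stand-in features:)

```python
# Linear probing sketch: train only a logistic regression on frozen
# encoder features and compare accuracy across models. The feature
# matrices below are random stand-ins for two models' frozen features.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(1000, 512))   # model A's frozen features
feats_b = rng.normal(size=(1000, 768))   # model B's frozen features
labels = rng.integers(0, 10, size=1000)  # stand-in class labels

def probe_accuracy(features, labels, split=800):
    # Fit the linear probe on the first `split` samples, score on the rest
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features[:split], labels[:split])
    return clf.score(features[split:], labels[split:])

# Whichever model's frozen features yield the higher probe accuracy is
# said to have learned the better representation
print("model A:", probe_accuracy(feats_a, labels))
print("model B:", probe_accuracy(feats_b, labels))
```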
@Qumeric 3 years ago
It's weird that ImageNet-A performance is higher than ordinary ImageNet performance.
@norik1616 3 years ago
Could it be because the images are more artistic ≈ closer to the labeled images people put on the internet?
@srinathtangudu4899 2 years ago
Your videos are so good. Thanks :)
@frankd1156 3 years ago
Very good, Yannic.
@Kram1032 3 years ago
Can't wait for this to be done to, like, entire movies: "just" take the actual movie scripts as text input and the resulting movies (the frames) as image input, and add the modality of sound on top. You could also add a bunch of other production data if available (say, concept art, unmixed voices and music, making-of documentaries and interviews, or the entire books those movies are based on, etc.).

Between (such versions of) CLIP and DALL-E you could probably make entire movies from scratch by just writing out scripts, and then refine them by giving some concept art or something. I expect that level is a long way off, mostly due to how much data needs to fit into a model that has to stay coherent over long time spans; just the memory requirements as of right now would be quite insane. But *in principle* I think this could be possible.

Resource needs aside, I suspect adding a sound modality wouldn't even be that difficult in CLIP, right? You'd basically do the same symmetric contrastive classification but add a third term dealing with sound.
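(Editor's note: a speculative sketch of the commenter's third-modality idea, an image-text-audio symmetric contrastive loss; this extension is not from the paper, and the encoders are stubbed with random features.)

```python
# Speculative sketch: extend CLIP's pairwise symmetric contrastive loss
# to three modalities by summing one term per modality pair.
import torch
import torch.nn.functional as F

def pair_loss(a, b, temp=0.07):
    """Symmetric contrastive (InfoNCE-style) loss between aligned batches."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / temp                 # (i, j) similarity matrix
    labels = torch.arange(a.shape[0])       # diagonal = true pairs
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

n, d = 8, 512
img, txt, aud = (torch.randn(n, d) for _ in range(3))  # stand-in encoder outputs

# All three encoders would share one embedding space; average the pair terms
loss = (pair_loss(img, txt) + pair_loss(img, aud) + pair_loss(txt, aud)) / 3
print(loss.item())
```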
@GiangNguyen-of4qf 3 years ago
Best video explanation ever, Yannic :)
@florianhonicke5448 3 years ago
New video from Yannic!!! Saved my day :D
@raphaelsaeed 1 year ago
Well explained
@morkovija 3 years ago
Chuckled at that narrator cut! x)
@maryamaghili1148 3 years ago
Thank you for your great work! Is there any way we could find the actual labels (text) they used for training? I need to use this model for some classification tasks of mine, but I am wondering how to organize the labels; I have only images, with no annotations.
@yaka169 2 years ago
Is how it works similar to a Siamese network, or something else? I'm quite confused.
@dl569 1 year ago
Thank you a lot!
@GUINTHERKOVALSKI 1 year ago
24:55 "i think prompt engineering will become quite a bit more relevant"
@Abdulazizab2 3 years ago
Great explanation! But I wonder how they measure the accuracy of zero-shot prediction. Is it only by whether the output contains the original label word, or some sort of combination? The output of zero-shot CLIP would be a sentence, I assume.
@gocomputing8529 1 year ago
It is a bit late, but I'll answer for future readers. As shown in the video, classification is performed by creating a prompt for each label; for example, if you know the images are photos, you would use 'a photo of a {label}'. The image is then assigned the label whose prompt embedding is most similar to the image embedding. As the video shows, the prompt you choose is really important for some applications (datasets).
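(Editor's note: a minimal sketch of this zero-shot procedure using OpenAI's open-source clip package; the image path and label set are hypothetical placeholders.)

```python
# Zero-shot classification sketch: one prompt per label, pick the label
# whose text embedding best matches the image embedding.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "cat", "banana"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {l}" for l in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # The model's forward pass returns temperature-scaled similarities
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print(labels[probs.argmax().item()], probs.tolist())
```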
@xingjian417 9 months ago
Thanks for sharing!
@eliteari 8 months ago
Great video!
@ranam 3 years ago
Can I make an OCR text recognizer with it?
@black-snow 3 years ago
"random GoPro fallen into a bunch of bananas" xD
@simonstrandgaard5503 3 years ago
Mindblown again.
@jonatan01i 3 years ago
29:38 Voice borrowed from Josh from Let's Game It Out
@morkovija 3 years ago
Funny how we all watch the same channels.
@h3rtc 3 years ago
That Alec meme is fire haha!
@willrazen 3 years ago
"We'll forgive it"
@herp_derpingson 3 years ago
18:20 This symmetric classification looks like a good idea. I wonder if we can use this for all classification tasks in general.
28:40 If you look at the datasets it is weak at, they involve some form of arithmetic.
This paper is a big deal. Kudos to the authors.
@YannicKilcher 3 years ago
Good thought, but if you apply this to standard classification, you always have the same N labels, which would just reduce to the classic cross-entropy loss.
@imranq9241 2 years ago
Is it zero-shot if you consider image captioning as a single task?
@DajesOfficial 1 year ago
It is zero-shot in the sense of not using dataset-specific data. Otherwise it is obviously heavily trained.
@chandrahmmouleb9611 2 years ago
Super hit!
@emilyme9478 2 years ago
👍👍
@antonio.7557 3 years ago
Shouldn't this easily beat the ImageNet state of the art if you actually fine-tune it on the full ImageNet dataset?
@p.z.8355 3 years ago
Why do you even need a prompt? Can't you just use the original label set?
@DajesOfficial 1 year ago
They show in the paper, and it is demonstrated in the video, that prompt engineering adds 5 percentage points to accuracy.
@nakshatrasingh9202 3 years ago
Switch Transformer, Google. Video please 😭😭🙏🙏🙏
@Lee-vs5ez 3 years ago
Better with vision to do NLP.
@harinkumar1073 3 years ago
44:00 "human model" lmao
@jointcc2 2 years ago
"logit" XDDDD
@chenjieY-z3q 1 day ago
One of the worst explanations. It's surprising how you can take a lot of clear, well-defined, and neat concepts/terms, put them together, and end up with a presentation that is a mess.