It's so cute that the coffee bean takes a pause when you take a breath! Thank you, this video was more conclusive than anything I've seen on CLIP; it really explained the intuition of embedding image-text pairs into a shared vector space and what that means.
@mikewise992 (11 months ago)
Thanks!
@AICoffeeBreak (11 months ago)
Wow, thank YOU!
@ravivarma5703 (4 years ago)
This channel is Gold - Excellent
@AICoffeeBreak (4 years ago)
Thank you, you're very kind!
@xxlvulkann6743 (5 months ago)
Very well and succinctly explained! This channel is a great educational resource!
@pixoncillo1 (4 years ago)
Wow, Letiția, what a piece of gold! Love your channel
@AICoffeeBreak (4 years ago)
Thank you!
@anilaxsus6376 (a year ago)
I like the fact that you talked about the ingredients they used, thank you very much for that.
@b_01_aditidonode43 (3 months ago)
Such an incredible explanation!
@AICoffeeBreak (3 months ago)
Thank you, it's great to hear this, especially about such an old video!
@nasibullah1555 (4 years ago)
Great job again. Thanks to Ms. Coffee Bean ;-)
@AICoffeeBreak (4 years ago)
Our pleasure! Or rather Ms. Coffee Bean's pleasure. I am just strolling along. 😅
@satishgoda (9 months ago)
Thank you so much for this succinct and action packed overview of CLIP.
@AICoffeeBreak (9 months ago)
Thank you for visiting! Hope to see you again.
@EpicGamer-ux1tu (2 years ago)
Amazing video! This definitely deserves more views/likes. Congratulations. Much love.
@vince943 (4 years ago)
Thank you for your continued research. ☕😇
@AICoffeeBreak (4 years ago)
Any time! Except when I do not have the time to make a video. 🤫
@郭正龙-s2n (10 months ago)
Thanks! Very clear!👍
@cogling57 (2 years ago)
Wow, such amazing clear, succinct explanations!
@AICoffeeBreak (2 years ago)
Thanks! ☺️
@yuviiiiiiiiiiiiiiiii (3 years ago)
This is an excellent video, congrats.
@AICoffeeBreak (3 years ago)
Thanks! Glad to have you here!
@romeoleon1118 (3 years ago)
Amazing content ! Thanks for sharing :)
@gkaplan93 (a year ago)
Couldn't find the link to the Colab that lets us experiment, can you please attach it to the description?
@AICoffeeBreak (a year ago)
The Colab link has become obsolete since the video went up (a lot is happening in ML). Now you can use CLIP much more easily, since it has been integrated into Hugging Face Transformers: huggingface.co/docs/transformers/model_doc/clip. This is the link included in the description right now.
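For anyone looking for a starting point, here is a minimal sketch of that Hugging Face usage for zero-shot image-text matching. The checkpoint name, image URL, and prompts below are illustrative placeholders, not the original Colab content:

```python
# Minimal sketch of zero-shot image-text matching with the Hugging Face CLIP integration.
# The checkpoint name, image URL, and prompts are illustrative placeholders.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarities; softmax turns them into probabilities.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(texts, probs[0].tolist())))
```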
@Youkouleleh (4 years ago)
Thanks for the video :)
@AICoffeeBreak (4 years ago)
As always, it was Ms. Coffee Bean's pleasure! 😉
@OguzAydn (4 years ago)
underrated channel
@elinetshaaf75 (3 years ago)
absolutely!
@henkjekel4081 (7 months ago)
Thank you so much :) So the vectors T from the text encoder and I from the image encoder are the latent representations of the last word/pixel from the encoders that would normally be used to predict the next word?
@AICoffeeBreak (7 months ago)
Not the last word/image region, but a summary of the entire picture / text sentence. :)
@henkjekel4081 (7 months ago)
@AICoffeeBreak Thank you for your quick reply :) Let me see, I do understand the transformer architecture quite well. The text encoder will just consist of the decoder part of a transformer. Due to all the self-attention going on, the latent representation of the last word at the end of the decoder will contain the meaning of the entire sentence. That is why the model is able to predict the next word based on just the last word's latent representation. So could you elaborate on what you mean by the summary of the text sentence? Which latent representations are you talking about?
@henkjekel4081 (7 months ago)
Hmm, I'm reading something about a CLS token, maybe that's it?
@AICoffeeBreak (7 months ago)
@henkjekel4081 Yes, exactly! So, the idea of CLIP is that they need a summary vector for the image and one for the text, to compare them via inner product. It is a bit architecture-dependent how exactly to get them. CLIP in its latest versions uses a ViT where the entire image is summarised in the CLS token. But the authors experimented with convolutional backbones as well: the original implementation had two variants, one using a ResNet image encoder and the other using a Vision Transformer. The ViT variant became more widely available and popular. And yes, the text encoder happens to be a decoder-only autoregressive (causal attention) LLM, but it could have just been a bidirectional encoder transformer as well. The authors chose a decoder LLM to make future variants of CLIP generate language too. But for CLIP as it is in the paper, all one needs is a neural net that outputs an image summary vector, and another one that outputs a text summary vector of the same dimensionality as the image vector.
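To make the "two summary vectors compared via inner product" idea concrete, here is a small sketch using the Hugging Face CLIP interface mentioned in the description. The checkpoint name, image path, and caption are assumptions for illustration only:

```python
# Sketch: pull out CLIP's two summary vectors (image summary and text summary)
# and compare them with an inner product. Checkpoint name and inputs are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("some_local_image.jpg")          # placeholder path
text = "a photo of a golden retriever"

with torch.no_grad():
    img_vec = model.get_image_features(**processor(images=image, return_tensors="pt"))
    txt_vec = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))

# After L2 normalisation, the inner product is the cosine similarity that CLIP's
# contrastive training pushes up for matching image-text pairs.
img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)
txt_vec = txt_vec / txt_vec.norm(dim=-1, keepdim=True)
print((img_vec @ txt_vec.T).item())
```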
@harumambaru (3 years ago)
Thanks for teaching me something new today! I will try to return the favour and point out that the English word for a dog race is "breed" :) But as a non-native English speaker, it made perfect sense to me.
@harumambaru (3 years ago)
Hunderasse is a pretty good word :)
@AICoffeeBreak (3 years ago)
You're right! Hunderasse was a false friend to me, thanks for uncovering it for me. 😅 Do you also speak German?
@harumambaru (3 years ago)
@AICoffeeBreak I am only learning it. After I moved to Israel for work and learned Hebrew, I decided not to stop the fun and to continue learning new languages. I made the bold assumption that living in Heidelberg makes you speak German, then I went to the Wikipedia page "Dog breed", found the German version of the page, and my assumption was confirmed.
@AICoffeeBreak (3 years ago)
@harumambaru True detective work! :) It's great you are curious and motivated enough to learn new languages. Keep going!
@mishaelthomas3176 (3 years ago)
Thank you very much, ma'am, for such an insightful video tutorial. But I have one doubt. Suppose I trained the CLIP model on a dataset consisting of two classes, i.e. dog and cat. After training, I tested my model on two new classes, for example horses and elephants, in the same way as described in OpenAI's CLIP blog. Will it give me a satisfactory result? As you said, it can perform zero-shot learning.
@AICoffeeBreak (3 years ago)
Hi Mishael, this is a little more complicated than that. If you train CLIP from scratch on two classes (dog and cat), it will not recognize elephants, no. The zero-shot capabilities of CLIP do not come from a magical understanding of the world and generalization capabilities, but from the immense amount of data CLIP has seen during pretraining. In my humble opinion, true zero-shot does not exist in current models (yet). It is just our human surprise that "the model has learned how to read", combined with our ignorance of the fact that the model had a lot of optical character recognition (reading) to do during pre-training. Or: look, it can make something out of satellite images, while its training data was full of those, but with a slightly different objective. The current state of zero-shot in machine learning is that you have trained on task A (e.g. aligning images containing text with the text transcription) and the model can then do another, similar task B (e.g. distinguishing writing styles or fonts). I am sorry this didn't come across in the video so well and that it left the impression that zero-shot is more than it is. Experts in the field know the limitations of this very well but like to exaggerate it a little bit to get funding and papers accepted; but also because even this limited type of zero-shot merits enthusiasm, because models have not been capable of this at all until recently. I might make a whole video about "how zero-shot is zero-shot". A tangential video on the topic is this one: kzbin.info/www/bejne/rqLHeZmCp8qpq9E, where it becomes clear how the wrong interpretation of the "magic of zero-shot" led to mislabeling some behavior of CLIP as an "adversarial attack", which it is not.
@AICoffeeBreak (3 years ago)
One thing to add: if you *fine-tune* CLIP on dogs and cats and test the model on elephants, it might recognize elephants; not because of your fine-tuning, but because of all the pre-training that has been done beforehand. But even this is not guaranteed: while fine-tuning, the model might catastrophically forget everything from pre-training.
@compilations6358 (2 years ago)
What's that music at the end?
@AICoffeeBreak (2 years ago)
Voices - Patrick Patrikios
@lewingtonn (2 years ago)
LEGENDARY!!!
@andresredondomercader2023 (3 years ago)
Hello, CLIP is impressive :) Is there a listing of all possible tags/results it can return?
@AICoffeeBreak (3 years ago)
CLIP can compute image-text similarity for any piece of text you input, as long as it is made of words it has seen during training. I do not know the entire vocabulary exactly, but you can think of at least 30k English words.
@andresredondomercader2023 (3 years ago)
@AICoffeeBreak Many thanks, Letitia. I think we are trying to use CLIP the other way around: it seems that the algorithm is great if you provide keywords to identify images containing objects related to those keywords. But we are trying to obtain keywords from a given image, and then categorise those keywords to understand what is in the image. Maybe I'm a bit lost on how CLIP works?
@AICoffeeBreak (3 years ago)
@andresredondomercader2023 CLIP computes similarities between image and text. So what you can do is take the image and compute similarities to every word of interest. When the similarity is high, the image is likely to contain that word, and you have an estimate of what is in the image, right?
@andresredondomercader2023 (3 years ago)
@AICoffeeBreak Thanks so much for taking the time to respond. In our project, we have about 300 categories: "Motor", "Beauty", "Electronics", "Sports"... Each category could be defined by a series of keywords; for instance, "Sports" is made up of keywords like "Soccer", "Basketball", "Athlete"..., whilst "Motor" is made of keywords such as "Motorbike", "Vehicle", "Truck"... Our goal would be to take an image and obtain the related keywords (items in the image) that would help us associate the image with one or more categories. I guess we could invert the process, i.e. push into CLIP the various keywords we have for each category and then analyse the results to see which sets of keywords resulted in the highest probability, hence identifying the related category, but that seems very inefficient, since for each image we'd do 300 iterations (we have 300 categories). However, if given an image CLIP returned the keywords most appropriate to it, we could more easily match those keywords with our category keywords. Not sure if I'm missing something, or maybe CLIP is just not suitable in this case. Thanks so much!
@AICoffeeBreak (3 years ago)
@andresredondomercader2023 You are right, it would be inefficient to do 300 iterations per image just so one can use CLIP out of the box without changing much about it. But I would argue that: 1. Inference is not that costly, and you can do the following optimizations. 2. For one image: since the image stays the same during the 300 queries, you only have to run the visual branch once, which saves you a lot of compute. Then you only have to encode the text 300 times for the 300 labels, but that is quite fast because your textual sequence length is so small (mostly one word). 3. For all images: you only have to compute the textual representations (run the textual branch) 300 times. Then you have the encodings. So a tip would be: compute the 300 textual representations (vectors), store them, and for each image run the visual backbone and take the dot product of the image representation with the 300 stored textual representations.
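A rough sketch of that caching strategy, assuming the Hugging Face CLIP interface linked in the description (the checkpoint name, keywords, and image path are made-up placeholders):

```python
# Sketch of the caching workflow described above: encode the ~300 category keywords once,
# then per image run only the visual branch and take dot products with the stored text vectors.
# Checkpoint name, keywords, and image path are made-up placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

keywords = ["soccer", "basketball", "motorbike", "truck", "lipstick"]   # ... up to ~300
prompts = [f"a photo of {w}" for w in keywords]

with torch.no_grad():
    # Text branch: run once, normalise, and store (e.g. on disk) for reuse across all images.
    text_vecs = model.get_text_features(**processor(text=prompts, return_tensors="pt", padding=True))
    text_vecs = text_vecs / text_vecs.norm(dim=-1, keepdim=True)

    # Image branch: run once per image; a single matrix product gives all keyword similarities.
    image = Image.open("some_image.jpg")
    img_vec = model.get_image_features(**processor(images=image, return_tensors="pt"))
    img_vec = img_vec / img_vec.norm(dim=-1, keepdim=True)

    scores = (img_vec @ text_vecs.T).squeeze(0)
    top = scores.topk(3)
    print([(keywords[i], round(scores[i].item(), 3)) for i in top.indices.tolist()])
```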
@y.l.deepak5107 (a year ago)
The Colab isn't working, ma'am, please kindly check it once.
@AICoffeeBreak (a year ago)
Thanks for noticing! The Colab link has become obsolete since the video went up (a lot is happening in ML). Now you can use CLIP much more easily, since it has been integrated into Hugging Face Transformers: huggingface.co/docs/transformers/model_doc/clip. I've updated the video description as well. :)
@joaquinpunales4365 (3 years ago)
Hi everybody :) We have been working locally with CLIP and exploring what we can achieve with the model. However, we are still not sure if CLIP can be used in a production environment, I mean for commercial usage. We have read CLIP's licence doc but it's still not clear, so if someone has a clear idea whether that's allowed or not, I'd be more than grateful!
@renanmonteirobarbosa8129 (3 years ago)
Make LSTMs great again, they are sad :/
@AICoffeeBreak (a year ago)
I've been prompted by someone to think about whether LSTMs should still be part of fundamental neural network courses. What do you think? Should it be CNNs and then Transformers directly? Or are LSTMs more than a historical digression?
@renanmonteirobarbosa8129 (a year ago)
@AICoffeeBreak The concepts, and understanding why it works, are more important. LSTMs are fun.