Great video series on ViT and its derivatives; I watched all of it. Thank you very much for sharing.
@PyMLstudio 2 months ago
Glad you enjoyed it.
@klaverhenrik 1 month ago
Your videos are amazing! Clear and well-structured. Are the slides available anywhere?
@PyMLstudio 1 month ago
Thanks, I'm glad you found the videos useful! Sure, I'm uploading the slides to GitHub; you can find the PDF slides at github.com/PyML-studio/mlstudio/tree/main/Slides
@SebastianRaschka 5 months ago
Very nice video! I can also imagine that predicting the caption text exactly isn't only more difficult, but it would also be more likely to result in (more) overfitting if it were learned that way. At 5:43, are the pair-wise similarities basically like cross-attention scores?
@PyMLstudio 5 months ago
Yes, in a way it's analogous to cross-attention: taking the dot product between the features from the text encoder and the image encoder. This dot-product similarity is used as the final output of the model to determine whether an image and a text caption are related. Good question, thanks for the comment!
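To make that concrete, here is a minimal sketch of how CLIP-style pairwise similarities can be computed. It is not the video's or the paper's exact code; the embedding arrays and the temperature value are illustrative assumptions.

```python
import numpy as np

# Illustrative stand-ins for encoder outputs: N images and N captions,
# each already mapped to a d-dimensional embedding (values are made up).
N, d = 4, 512
rng = np.random.default_rng(0)
image_features = rng.standard_normal((N, d))  # from the image encoder
text_features = rng.standard_normal((N, d))   # from the text encoder

# L2-normalize so the dot product equals cosine similarity.
image_features /= np.linalg.norm(image_features, axis=1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=1, keepdims=True)

# Pairwise similarity matrix: entry (i, j) scores image i against caption j.
# The temperature scaling (value assumed here) sharpens the scores before
# the symmetric cross-entropy loss used in CLIP-style contrastive training.
temperature = 0.07  # assumed, for illustration
logits = image_features @ text_features.T / temperature

# The diagonal holds the matched (image, caption) pairs; off-diagonal
# entries are the mismatched pairs the contrastive loss pushes down.
print(logits.shape)  # (4, 4)
```

The key difference from true cross-attention is that these scores are the model's final output rather than weights used to mix value vectors.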
@fouziaanjums6475 5 months ago
Please cover the FasterViT model too...
@PyMLstudio 5 months ago
Absolutely, I'll cover that. I have a few other topics lined up, then I'll get to FasterViT. Thanks for the suggestion!
@fouziaanjums6475 10 days ago
@PyMLstudio Please make a video on the above topic soon...
@randomstuff39280 3 months ago
Thank you for explaining! Very clear! But I'm wondering, how do you know the WiT dataset is based on 50,000 queries and 20,000 pairs for each query? I can't find it in the paper.
@PyMLstudio 3 months ago
Thanks for the comment! Please see page 3, Section 2.2 ("Creating a sufficiently large dataset"). But it's 500,000 queries, balanced with up to 20,000 (image, text) pairs per query.