Great video series on ViT and its derivatives; I watched all of it. Thank you very much for sharing.
@PyMLstudio 2 months ago
Glad you enjoyed it.
@klaverhenrik 1 month ago
Your videos are amazing! Clear and well-structured. Are the slides available anywhere?
@PyMLstudio 1 month ago
Thanks, I'm glad you found the videos useful! Sure, I'm uploading the slides to GitHub; you can find the PDF slides at github.com/PyML-studio/mlstudio/tree/main/Slides
@SebastianRaschka 5 months ago
Very nice video! I can also imagine that predicting the caption text exactly isn't only more difficult, but it would also be more likely to result in (more) overfitting if it were learned that way. At 5:43, are the pair-wise similarities basically like cross-attention scores?
@PyMLstudio 5 months ago
Yes, in a way it's analogous to cross-attention: taking the dot product between the features from the text encoder and the image encoder. This dot-product similarity is used as the final output of the model to determine whether an image and a text caption are related. Good question, thanks for the comment!
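To make that concrete, here is a minimal sketch of how CLIP-style pairwise similarities can be computed. It is not the video's or the paper's exact code; the embedding arrays and the temperature value are illustrative assumptions.

```python
import numpy as np

# Illustrative stand-ins for encoder outputs: N images and N captions,
# each already mapped to a d-dimensional embedding (values are made up).
N, d = 4, 512
rng = np.random.default_rng(0)
image_features = rng.standard_normal((N, d))  # from the image encoder
text_features = rng.standard_normal((N, d))   # from the text encoder

# L2-normalize so the dot product equals cosine similarity.
image_features /= np.linalg.norm(image_features, axis=1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=1, keepdims=True)

# Pairwise similarity matrix: entry (i, j) scores image i against caption j.
# The temperature scaling (value assumed here) sharpens the scores before
# the symmetric cross-entropy loss used in CLIP-style contrastive training.
temperature = 0.07  # assumed, for illustration
logits = image_features @ text_features.T / temperature

# The diagonal holds the matched (image, caption) pairs; off-diagonal
# entries are the mismatched pairs the contrastive loss pushes down.
print(logits.shape)  # (4, 4)
```

The key difference from true cross-attention is that these scores are the model's final output rather than weights used to mix value vectors.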
@fouziaanjums6475 5 months ago
Please cover the FasterViT model too...
@PyMLstudio 5 months ago
Absolutely, I'll cover that. I have a few other topics lined up, then I'll get to FasterViT. Thanks for the suggestion!
@fouziaanjums6475 10 days ago
@PyMLstudio Please make a video on the above topic soon...
@randomstuff39280 3 months ago
Thank you for explaining! Very clear! But I'm wondering, how do you know the WiT dataset is based on 50,000 queries and 20,000 pairs for each query? I can't find it in the paper.
@PyMLstudio 3 months ago
Thanks for the comment! Please see page 3, Section 2.2 ("Creating a sufficiently large dataset"). But it's 500,000 queries, balanced with up to 20,000 (image, text) pairs per query.