At 58:22, the weights W_i and W_t are projections into the shared embedding space from the image model output and text model output respectively (allowing for a change in embedding dimension). This means more generic text and image models with different output dimensions can all map to the same embedding dimension.
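A minimal PyTorch sketch of these projections. The dimensions here are hypothetical placeholders (not from the video), just to show how encoders with different output widths map into one shared space:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only:
d_img, d_txt, d_emb = 2048, 768, 512  # image encoder width, text encoder width, shared space

# W_i and W_t are learned linear projections into the shared embedding space
W_i = nn.Linear(d_img, d_emb, bias=False)
W_t = nn.Linear(d_txt, d_emb, bias=False)

img_features = torch.randn(8, d_img)  # batch of image encoder outputs
txt_features = torch.randn(8, d_txt)  # batch of text encoder outputs

img_emb = W_i(img_features)  # shape (8, 512)
txt_emb = W_t(txt_features)  # shape (8, 512) — now directly comparable
```

Because both projections land in the same d_emb-dimensional space, cosine similarities between image and text embeddings are well defined regardless of the original encoder widths.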
@Qzariuss · 5 months ago
going to try this tomorrow
@johntanchongmin · 5 months ago
1:07:31 This is a mistake on my end: this is not the ImageNet Supervised Learning model. Li et al. is actually the Visual N-gram model, where they predict n-grams (n words) for each picture. arxiv.org/pdf/1612.09161.pdf Here, I believe they did not even implement that model (its performance is quite low, at 10+% accuracy on ImageNet), but rather just borrowed its method of using the class name text directly, and applied that to CLIP. Basically, the paper was misleading: they did not even need to refer to Li et al. for that chart, as the methodology is totally different. It is just CLIP with ImageNet class names without any added prompt engineering.
@johntanchongmin · 5 months ago
For the loss function at 1:00:15, they use Cross Entropy Loss with unnormalised logits as input. That is why the resultant cosine similarity matrix is multiplied by the exponentiated learned temperature t to form the logits. Inside the Cross Entropy Loss function, each input term is exponentiated and divided by the sum of all exponentiated input terms (i.e. normalised via softmax). See pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for details.
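The symmetric loss described above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation, not the official CLIP code; the fixed temperature value is a placeholder (CLIP learns it as a parameter), and `F.cross_entropy` performs the exponentiation and normalisation internally:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    # L2-normalise so dot products become cosine similarities
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # cosine similarity matrix scaled by the temperature (unnormalised logits)
    logits = img_emb @ txt_emb.T / t
    # matching image-text pairs sit on the diagonal
    labels = torch.arange(len(img_emb))
    # cross_entropy applies softmax internally: exp(logit) / sum(exp(logits))
    loss_img = F.cross_entropy(logits, labels)    # image -> text direction
    loss_txt = F.cross_entropy(logits.T, labels)  # text -> image direction
    return (loss_img + loss_txt) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Averaging the two directions gives the symmetric contrastive objective: each image must pick out its caption among all captions in the batch, and vice versa.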
@johntanchongmin · 5 months ago
CLIP's loss function has also been described as the InfoNCE loss, a common loss for contrastive learning. See builtin.com/machine-learning/contrastive-learning for details. It is essentially Cross Entropy over cosine similarity terms, which is exactly what CLIP does.
@johntanchongmin · 5 months ago
Jupyter Notebook Code can be found here if you want to do your own experiments too: github.com/tanchongmin/TensorFlow-Implementations/tree/main/Paper_Reviews/CLIP/CLIP%20Code
@awakenwithoutcoffee · 10 days ago
Bookmarking this one! Question: have you tried using OCR-capable LLMs like Gemini as an alternative to CLIP embeddings?
@johntanchongmin · 10 days ago
Have not yet, but I would be very interested to see if there are good alternatives to CLIP embeddings. Multimodal LLMs and their applications will be on the rise in the near future.