At 58:22, the weights W_i and W_t are projections into the shared embedding space from the image model output and text model output respectively (allowing for a change in embedding dimension). This means more generic text and image models with different output dimensions can all map to the same embedding dimension.
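A minimal PyTorch sketch of these projections. The dimensions here are hypothetical placeholders (not from the video), just to show how encoders with different output widths map into one shared space:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only:
d_img, d_txt, d_emb = 2048, 768, 512  # image encoder width, text encoder width, shared space

# W_i and W_t are learned linear projections into the shared embedding space
W_i = nn.Linear(d_img, d_emb, bias=False)
W_t = nn.Linear(d_txt, d_emb, bias=False)

img_features = torch.randn(8, d_img)  # batch of image encoder outputs
txt_features = torch.randn(8, d_txt)  # batch of text encoder outputs

img_emb = W_i(img_features)  # shape (8, 512)
txt_emb = W_t(txt_features)  # shape (8, 512) — now directly comparable
```

Because both projections land in the same d_emb-dimensional space, cosine similarities between image and text embeddings are well defined regardless of the original encoder widths.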
@Qzariuss · 5 months ago
going to try this tomorrow
@johntanchongmin · 5 months ago
1:07:31 This is a mistake on my end: this is not the ImageNet Supervised Learning model. Li et al. is actually the Visual N-gram model, where they predict n-grams (n words) for each picture. arxiv.org/pdf/1612.09161.pdf Here, I believe they did not even implement that model (its performance is quite low, at 10+% accuracy on ImageNet), but rather just borrowed its method of using the class name text directly, and applied that to CLIP. Basically, the paper was misleading: they did not even need to refer to Li et al. for that chart, as the methodology is totally different. It is just CLIP with ImageNet class names without any added prompt engineering.
@johntanchongmin · 5 months ago
For the loss function at 1:00:15, they use Cross Entropy Loss with unnormalised logits as input. That is why the resultant cosine similarity matrix is multiplied by the exponentiated learned temperature t to form the logits. Inside the Cross Entropy Loss function, each input term is exponentiated and divided by the sum of all exponentiated input terms (i.e. normalised via softmax). See pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for details.
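The symmetric loss described above can be sketched in a few lines of PyTorch. This is an illustrative re-implementation, not the official CLIP code; the fixed temperature value is a placeholder (CLIP learns it as a parameter), and `F.cross_entropy` performs the exponentiation and normalisation internally:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    # L2-normalise so dot products become cosine similarities
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # cosine similarity matrix scaled by the temperature (unnormalised logits)
    logits = img_emb @ txt_emb.T / t
    # matching image-text pairs sit on the diagonal
    labels = torch.arange(len(img_emb))
    # cross_entropy applies softmax internally: exp(logit) / sum(exp(logits))
    loss_img = F.cross_entropy(logits, labels)    # image -> text direction
    loss_txt = F.cross_entropy(logits.T, labels)  # text -> image direction
    return (loss_img + loss_txt) / 2

loss = clip_loss(torch.randn(8, 512), torch.randn(8, 512))
```

Averaging the two directions gives the symmetric contrastive objective: each image must pick out its caption among all captions in the batch, and vice versa.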
@johntanchongmin · 5 months ago
CLIP's loss function has also been described as the InfoNCE loss, a common loss for contrastive learning. See builtin.com/machine-learning/contrastive-learning for details. It is essentially Cross Entropy over cosine similarity terms, which is exactly what CLIP does.
@johntanchongmin · 5 months ago
Jupyter Notebook Code can be found here if you want to do your own experiments too: github.com/tanchongmin/TensorFlow-Implementations/tree/main/Paper_Reviews/CLIP/CLIP%20Code
@awakenwithoutcoffee · 10 days ago
Bookmarking this one! Question: have you tried using OCR-capable LLMs like Gemini as an alternative to CLIP embeddings?
@johntanchongmin · 10 days ago
Have not yet, but I would be very interested to see if there are good alternatives to CLIP embeddings. Multimodal LLMs and their applications will be on the rise in the near future.