OpenAI CLIP Embeddings: Walkthrough + Insights

784 views

John Tan Chong Min

1 day ago

Comments: 8
@johntanchongmin
@johntanchongmin 5 months ago
At 58:22, the weights W_i and W_t are the projections from the image model output and the text model output respectively into the shared embedding space (allowing for a change in embedding dimension). This allows more generic text and image models with different output dimensions to be used, since they can all map to the same embedding dimension.
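A minimal sketch of that projection step in PyTorch (the dimensions and variable names here are illustrative, loosely following the pseudocode in the CLIP paper):

```python
import torch
import torch.nn.functional as F

d_i, d_t, d_e = 2048, 768, 512          # image encoder dim, text encoder dim, shared embedding dim (illustrative)
W_i = torch.randn(d_i, d_e)             # learned image projection
W_t = torch.randn(d_t, d_e)             # learned text projection

img_features = torch.randn(8, d_i)      # output of any image encoder (batch of 8)
txt_features = torch.randn(8, d_t)      # output of any text encoder

# project both modalities into the same d_e-dimensional space, then L2-normalise
img_emb = F.normalize(img_features @ W_i, dim=-1)
txt_emb = F.normalize(txt_features @ W_t, dim=-1)
```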
@Qzariuss
@Qzariuss 5 months ago
going to try this tomorrow
@johntanchongmin
@johntanchongmin 5 months ago
1:07:31 This is a mistake on my end - this is not the ImageNet Supervised Learning model. Li et al. is actually the Visual N-grams model, where they predict n-grams (sequences of n words) for each picture: arxiv.org/pdf/1612.09161.pdf. Here, I believe the CLIP authors did not re-implement that model (it has quite low performance, around 10+% accuracy on ImageNet); rather, they only borrowed the idea of using the class name text directly and applied it to CLIP. Basically, the chart is misleading - they did not even need to refer to Li et al. for it, since the methodology is totally different. It is just CLIP with ImageNet class names, without any added prompt engineering.
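As a rough sketch of what "class names without added prompt engineering" looks like with the openai/CLIP package (the class list and image path below are made up for illustration):

```python
import torch
import clip
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

class_names = ["tabby cat", "golden retriever", "sports car"]   # illustrative subset of class names
text = clip.tokenize(class_names)                                # raw class names, no "a photo of a ..." template

image = preprocess(Image.open("example.jpg")).unsqueeze(0)       # hypothetical image path

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)                     # zero-shot class probabilities
```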
@johntanchongmin
@johntanchongmin 5 months ago
For the loss function at 1:00:15, they use Cross Entropy Loss with the unnormalised logits as input. The logits are the cosine similarity matrix multiplied by the exponential of the learnable temperature parameter t, which is why the cosine similarity matrix is scaled before being passed in. Inside the Cross Entropy Loss function, each exponentiated logit is then divided by the sum of the exponentiated logits over all the other entries (i.e. softmax-normalised). See pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html for details.
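A minimal sketch of that computation, following the pseudocode in the CLIP paper (the batch size and variable names are my own):

```python
import torch
import torch.nn.functional as F

n = 8                                                  # batch size (illustrative)
img_emb = F.normalize(torch.randn(n, 512), dim=-1)     # L2-normalised image embeddings
txt_emb = F.normalize(torch.randn(n, 512), dim=-1)     # L2-normalised text embeddings
t = torch.log(torch.tensor(1 / 0.07))                  # learnable temperature parameter (paper initialisation)

# cosine similarity matrix scaled by exp(t) -> unnormalised logits
logits = img_emb @ txt_emb.T * t.exp()

# matching image-text pairs sit on the diagonal
labels = torch.arange(n)

# symmetric cross entropy; softmax normalisation happens inside F.cross_entropy
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```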
@johntanchongmin
@johntanchongmin 5 months ago
CLIP's loss function has also been described as InfoNCE loss, a common loss term for contrastive learning. See builtin.com/machine-learning/contrastive-learning for details. It is essentially Cross Entropy over cosine similarity terms, which is what is done in CLIP.
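Written out (this is the standard form of InfoNCE, not copied from the paper), for one image embedding $z_i$ with its paired text embedding $z_i^{+}$ in a batch of $N$ pairs, where sim is cosine similarity and $\tau$ the temperature:

$$\mathcal{L}_i = -\log \frac{\exp(\mathrm{sim}(z_i, z_i^{+})/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(z_i, z_j)/\tau)}$$

CLIP averages this over the images and the symmetric version over the texts.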
@johntanchongmin
@johntanchongmin 5 months ago
Jupyter Notebook Code can be found here if you want to do your own experiments too: github.com/tanchongmin/TensorFlow-Implementations/tree/main/Paper_Reviews/CLIP/CLIP%20Code
@awakenwithoutcoffee
@awakenwithoutcoffee 10 days ago
Bookmarking this one! Question: have you tried using OCR-capable LLMs like Gemini as an alternative to CLIP embeddings?
@johntanchongmin
@johntanchongmin 10 days ago
Have not tried yet, but I will be very interested to see if there are good alternatives to CLIP embeddings. Multimodal LLMs and their applications will be on the rise in the near future.