In a Colab notebook we code a visualization of the last layer of the Vision Transformer (ViT) encoder stack and analyze the visual output of each of the 12 attention heads for a given image. This shows why a merely pre-trained ViT (even one pre-trained with the DINO method) cannot always succeed at a downstream image classification task: the fine-tuning step is still missing, and it is essential for better performance.
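A minimal sketch of the core step behind such a visualization (not the notebook's exact code): for each of the 12 heads in the last encoder layer, take the attention the CLS token pays to every image patch and reshape it into a 2D map. In the real notebook this attention tensor would come from a HuggingFace ViT model called with `output_attentions=True`; here a random, row-normalized tensor stands in for the model output, and the function and parameter names (`cls_attention_maps`, `patch_grid`) are illustrative assumptions.

```python
import numpy as np

def cls_attention_maps(attn_last_layer, patch_grid=14):
    """attn_last_layer: array of shape (heads, tokens, tokens),
    where tokens = 1 CLS token + patch_grid**2 patch tokens.
    Returns one (patch_grid, patch_grid) map per head: how much
    the CLS token attends to each image patch."""
    n_heads = attn_last_layer.shape[0]
    # attention FROM the CLS token (row 0) TO every patch token (columns 1:)
    cls_attn = attn_last_layer[:, 0, 1:]                 # (heads, patches)
    return cls_attn.reshape(n_heads, patch_grid, patch_grid)

# Dummy stand-in for the model's last-layer attention: 12 heads,
# 197 tokens (1 CLS + 14*14 patches), as a ViT with 16x16 patches
# produces for a 224x224 image.
rng = np.random.default_rng(0)
raw = rng.random((12, 197, 197))
attn = raw / raw.sum(axis=-1, keepdims=True)             # each row sums to 1

maps = cls_attention_maps(attn)
print(maps.shape)  # (12, 14, 14) -> one heatmap per attention head
```

Each of the 12 maps can then be plotted (e.g. with `matplotlib.pyplot.imshow`) and overlaid on the input image to see what image regions each head focuses on.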
Based on the COLAB NB by Niels Rogge, HuggingFace (all rights with him):
colab.research...
In one of my next videos we will code the fine-tuning of a pre-trained Vision Transformer (ViT) from scratch, for better image classification performance.
#ai
#vision
#technology