BLIP-2: How Transformers Learn to 'See' and Understand Images
Author(s): Arnavbhatt

Originally published on Towards AI.

This is a step-by-step walkthrough of how an image moves through BLIP-2: raw pixels → frozen Vision Transformer (ViT) → Q-Former → final query representations that are fed into a language model. You'll understand …
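Before diving in, here is a minimal sketch of that pixel → ViT → Q-Former → LM flow, using the Hugging Face `transformers` implementation of BLIP-2. The checkpoint name (`Salesforce/blip2-opt-2.7b`), the image file, and the shape comments (ViT-g/14 at 224×224, 32 learned queries, OPT-2.7B hidden size 2560) are assumptions about one common variant, not the only configuration the walkthrough covers:

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed checkpoint: the ViT-g + OPT-2.7B variant of BLIP-2.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
model.eval()

image = Image.open("example.jpg")  # hypothetical local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
# pixel_values: (1, 3, 224, 224) -- raw pixels, resized and normalized

with torch.no_grad():
    # Stage 1: the frozen ViT turns pixels into patch embeddings.
    image_embeds = model.vision_model(pixel_values).last_hidden_state
    # image_embeds: (1, 257, 1408) -- 256 patches + 1 [CLS] token for ViT-g/14

    # Stage 2: the Q-Former's 32 learned queries cross-attend to the ViT output.
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_output = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
    ).last_hidden_state
    # query_output: (1, 32, 768) -- a fixed-size visual summary of the image

    # Stage 3: a linear projection maps the queries into the LM's embedding space.
    lm_inputs = model.language_projection(query_output)
    # lm_inputs: (1, 32, 2560) -- ready to be prepended to the LM's text embeddings
```

In practice a single call to `model.generate(**inputs)` runs these same three stages end to end and decodes a caption; the staged version above is only meant to expose the intermediate representations we'll trace step by step.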