
BLIP-2: How Transformers Learn to ‘See’ and Understand Images
Author(s): Arnavbhatt
Originally published on Towards AI.
This is a step-by-step walkthrough of how an image moves through BLIP-2: from raw pixels → frozen Vision Transformer (ViT) → Q-Former → final query representations that get fed into a language model. You’ll understand what the “queries” are, where they come from, and how they evolve.
Introduction
If you know Transformers but always wondered how vision and language models like BLIP-2 actually connect, this guide is for you.
We’ll trace exactly what happens, tensor by tensor, when a 224×224×3 image moves through BLIP-2.
You’ll see every shape, every step: how patch embeddings are made, how Q-Former “queries” work, and how the final summary is plugged into the LLM.
Table of Contents
- What is BLIP-2 Trying to Do?
- Vision Transformer (ViT): Breaking Down the Image
- Q-Former: Where “Queries” Meet Vision
- How the Q-Former Actually Works (With Dimensions)
- From Q-Former Output to LLM: Turning Vision into Language
- Putting It All Together: An End-to-End Example
- Conclusion: Why Is This Design Powerful?
1. What is BLIP-2 Trying to Do?
BLIP-2 bridges images and language by:
- Taking an image → Turning it into a handful of compact, information-rich embeddings → Feeding those directly to an LLM to generate text or answer questions.
But images are huge, so BLIP-2:
- Uses a frozen Vision Transformer (ViT) to summarize the image as 196 patch embeddings
- Then runs a small trainable Transformer (Q-Former) on top, which distills those 196 vectors down to 32 special “queries”
Those 32 queries are what get fed to the language model.

Source: BLIP-2 paper
2. Vision Transformer (ViT): Breaking Down the Image
2.1 Splitting the Image into Patches
- Input image: 224×224×3 (RGB)
- Patch size: 16×16
- Number of patches: 14×14 = 196
- Each patch is flattened into a 16×16×3 = 768-dimensional vector
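To make the shapes concrete, here is a minimal PyTorch sketch of the patch-splitting step (a real ViT implements this with a strided convolution; the variable names here are just illustrative):

```python
import torch

# One RGB image: [channels, height, width] = [3, 224, 224]
image = torch.randn(3, 224, 224)

# Cut the image into non-overlapping 16x16 patches:
# unfolding height and width gives a 14 x 14 grid of 16x16x3 patches.
patches = image.unfold(1, 16, 16).unfold(2, 16, 16)        # [3, 14, 14, 16, 16]
patches = patches.permute(1, 2, 0, 3, 4).reshape(196, 768)

print(patches.shape)   # torch.Size([196, 768])
```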
2.2 Patch Embedding + Positional Encoding
- Each patch: passed through a learnable linear projection (768 → 768)
- Learnable positional embedding added (so the model knows where the patch came from)
- Result: 196×768 matrix (patches × features)
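Continuing the sketch, the embedding step boils down to one linear layer plus a learned positional table (a simplified sketch, assuming the 768 → 768 projection described above):

```python
import torch
import torch.nn as nn

num_patches, patch_dim, embed_dim = 196, 768, 768

proj = nn.Linear(patch_dim, embed_dim)                         # learnable linear projection 768 -> 768
pos_embed = nn.Parameter(torch.zeros(num_patches, embed_dim))  # learnable positional embeddings

patches = torch.randn(num_patches, patch_dim)                  # flattened patches from the previous step
tokens = proj(patches) + pos_embed                             # [196, 768]: patches x features
```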
2.3 ViT Encoder
- Stack of 12 Transformer encoder layers (self-attention + MLP)
- All 196 patch tokens attend to each other
- Output: 196×768 tensor, each row a globally-contextualized patch embedding
In BLIP-2 the ViT is completely frozen: no gradients, no updates. It’s just a fixed feature extractor. So how do we learn anything? Wait and watch.
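Freezing is literally just switching gradients off. A minimal sketch, using a generic `nn.TransformerEncoder` as a stand-in for the actual pretrained ViT:

```python
import torch
import torch.nn as nn

# Stand-in for the pretrained 12-layer ViT encoder (not the real BLIP-2 weights).
vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)

vit.eval()
for p in vit.parameters():
    p.requires_grad = False            # frozen: these weights never get updated

tokens = torch.randn(1, 196, 768)      # patch embeddings from the previous step (batch of 1)
with torch.no_grad():                  # and no gradients flow through it either
    patch_feats = vit(tokens)          # [1, 196, 768]
```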
#IMAGE: (A grid image showing image patchification; underneath, a row of 196 patch vectors lined up; an arrow to a stack labeled “12 Transformer layers”; output as 196 vectors)
3. Q-Former: Where “Queries” Meet Vision
3.1 What Are the “Queries”?
Picture it like this:
- The frozen ViT outputs a matrix of shape [196 x 768] — 196 patches, each with 768 features.
- These 196 patch vectors together hold all the image info ViT can provide.
- But: Feeding 196 tokens into a language model is slow, expensive, and not what the LLM expects.
Enter the Q-Former.
- The Q-Former doesn’t operate on the 196 patch tokens directly.
- Instead, it starts with 32 trainable vectors, each of size 768.
Q(0): [32 x 768] — 32 queries, each a 768-dimensional vector.
- These 32 vectors are called “queries” (just a name — they’re parameters, not outputs from ViT).
- At the start of training, they’re random, or sometimes initialized from BERT or similar.
- The entire goal of Q-Former is:
Let these 32 queries repeatedly “ask questions” to the 196 patch features and learn to summarize all important aspects of the image into just 32 slots.
Analogy:
Imagine 196 students (ViT patches) know everything about the image. Instead of reading 196 essays, you send 32 interviewers (the queries) to talk to the whole group and each write a summary.
Q-Former is a mini-Transformer whose only job is to make these 32 interviewers ask better, more focused questions over time.
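In code, the 32 queries are nothing more than a learnable parameter tensor (a minimal sketch; the initialization scale is just illustrative):

```python
import torch
import torch.nn as nn

num_queries, dim = 32, 768

# Q(0): the 32 query vectors. They are trainable parameters of the Q-Former,
# not something computed from the image.
queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)

print(queries.shape)   # torch.Size([32, 768])
```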
3.2 Why Not Feed 196 Patches Directly?
- Memory: 196 tokens × 768 dimensions = 150,528 values per image.
- Efficiency: Instead of fine-tuning the whole ViT to produce better patch features, we train a small Q-Former that learns how to extract the valuable information out of these 196 patches.
- Specialization: Each query can learn to focus on a different “aspect” of images (object, color, theme, …)
In short: Q-Former is both a compressor and a learner.
4. How Q-Former Actually Works
Now that you know Q-Former is itself a Transformer, let’s break down exactly what happens in each layer, step by step.
Inputs to Each Q-Former Layer
- Q (queries, current layer): shape [32 x 768] — 32 query vectors, each 768-dimensional.
- V (ViT output, always frozen): shape [196 x 768] — 196 patch embeddings from the Vision Transformer.
4.1 Self-Attention (Within the Queries)
Each of the 32 queries “talks to” all the other queries.
Purpose: Queries coordinate, share information, and can specialize so that no two queries are redundant.
How it works:
- Layer Normalization:
Take Q, apply LayerNorm → result is still [32 x 768]
- Multi-Head Self-Attention:
Q gets projected to queries, keys, and values for each head (12 heads, each 64-dim).
Attention is computed between all pairs of queries (a 32 x 32 attention matrix). Output is [32 x 768]
- Residual Connection:
Add this output back to the original Q (elementwise addition), so the shape stays [32 x 768]
Analogy:
Imagine 32 interviewers in a room, all sharing notes before they go out to ask questions about the image. Each tweaks their “interview topic” based on what the others are focusing on.
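Here is a minimal sketch of this self-attention sub-block, using PyTorch’s `nn.MultiheadAttention` as a stand-in for the real Q-Former implementation:

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
Q = torch.randn(1, 32, dim)                                      # current query states (batch of 1)

ln = nn.LayerNorm(dim)
self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # 12 heads of 64 dims each

x = ln(Q)
attn_out, attn_weights = self_attn(x, x, x)    # every query attends to every other query
Q = Q + attn_out                               # residual connection, still [1, 32, 768]
print(attn_weights.shape)                      # torch.Size([1, 32, 32])
```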
4.2 Cross-Attention (Queries to ViT Outputs)
Now, each query vector looks at all 196 patch features from the ViT (the actual “information” from the image).
Purpose: This is where queries extract their relevant info from the image.
How it works:
- LayerNorm:
Normalize the queries again (shape [32 x 768])
- Cross-Attention:
Each of the 32 queries (rows) attends to all 196 ViT outputs (patches).
This is like a “weighted summary” of all patches for each query.
Output: [32 x 768]
- Residual Connection:
Add this output back to the queries (still [32 x 768])
Analogy:
Now each interviewer walks up to the 196 students (patches) and says, “Tell me what you know about my focus area.” The interviewer gets a custom answer.
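And the cross-attention sub-block, sketched the same way (again a stand-in, not BLIP-2’s actual code):

```python
import torch
import torch.nn as nn

dim, heads = 768, 12
Q = torch.randn(1, 32, dim)        # query states after self-attention
V = torch.randn(1, 196, dim)       # frozen ViT patch embeddings

ln = nn.LayerNorm(dim)
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

x = ln(Q)
# Keys and values come from the ViT output: each query builds a weighted summary of the patches.
attn_out, attn_weights = cross_attn(x, V, V)
Q = Q + attn_out                   # residual connection, still [1, 32, 768]
print(attn_weights.shape)          # torch.Size([1, 32, 196])
```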
4.3 Feed-Forward + Residual
Each query is then individually processed by a tiny neural network (MLP):
- Two linear layers:
The first layer expands to 3072, applies a GeLU activation, then projects back to 768.
- LayerNorm and Residual:
Result is still [32 x 768]
That is the inner working of just one layer; the same block is repeated for 6 layers.
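The feed-forward sub-block in the same sketch form (the exact LayerNorm placement differs between implementations; this is just one common arrangement):

```python
import torch
import torch.nn as nn

dim, hidden = 768, 3072
Q = torch.randn(1, 32, dim)            # query states after cross-attention

ffn = nn.Sequential(
    nn.Linear(dim, hidden),            # expand 768 -> 3072
    nn.GELU(),
    nn.Linear(hidden, dim),            # project back 3072 -> 768
)
ln = nn.LayerNorm(dim)

Q = Q + ffn(ln(Q))                     # residual connection; still [1, 32, 768]
```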

BLIP-2 Q-Former: each query attends to all frozen ViT patch embeddings, interacts with other queries, and passes through a feed-forward layer — stacked over several layers to distill image features for language tasks.
Source: BLIP-2 paper
5. From Q-Former Output to LLM: Turning Vision into Language
5.1 How Does the LLM Use the Queries?
Input Format:
- The 32 query outputs are treated just like the first 32 tokens in a text sequence.
- A [DEC] token is added next (to separate the visual tokens and text tokens).
- Then come the text tokens — this could be a prompt or a question.
So the full input for the LLM looks like this:
[visual tokens (32)] + [DEC token (1)] + [text tokens (M)]
The LLM just sees one long sequence, where each “token” is a 768-dimensional vector.
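A rough sketch of how that sequence is assembled. In the real model the query outputs are first projected to the LLM’s embedding size; here everything stays at 768 to match the article’s simplified numbers, and the tensors are just random placeholders:

```python
import torch

visual_tokens = torch.randn(1, 32, 768)   # Q-Former output (the 32 query representations)
dec_embed = torch.randn(1, 1, 768)        # embedding of the [DEC] token
text_embeds = torch.randn(1, 12, 768)     # embeddings of M = 12 prompt tokens

llm_input = torch.cat([visual_tokens, dec_embed, text_embeds], dim=1)
print(llm_input.shape)                    # torch.Size([1, 45, 768]) = 32 + 1 + 12
```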
5.2 How Attention Works: The Causal Mask
- The visual tokens (those 32 image queries) can only “see” each other. They don’t look ahead into the text.
- The text tokens can look back at everything — both the visual summary and any earlier words in the prompt.
- At every step as it generates text, the LLM is allowed to use the image context (the 32 queries), the [DEC] token, and whatever words have been produced so far.
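A toy sketch of such a mask (True means “may attend”; the [DEC] token is ignored here for brevity, and the token counts are made up):

```python
import torch

num_visual, num_text = 32, 5
n = num_visual + num_text

allowed = torch.zeros(n, n, dtype=torch.bool)     # True = may attend
# Visual tokens attend only to each other (bidirectionally), never to the text.
allowed[:num_visual, :num_visual] = True
# Each text token attends to all visual tokens and to text tokens up to itself.
for i in range(num_visual, n):
    allowed[i, : i + 1] = True

print(allowed[32])   # first text token: sees all 32 visual tokens plus itself
```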

Source: BLIP-2 paper
6. Putting It All Together: An End-to-End Example
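In place of the original figure, here is a shape-only recap of the whole path, using the same simplified numbers as above:

```python
# Shape-only walkthrough of the full pipeline (simplified numbers used throughout this article):
#
# image                        [3, 224, 224]
#   patchify (16x16)       ->  [196, 768]
#   frozen ViT (12 layers) ->  [196, 768]   contextualized patch features
#
# Q(0) learned queries         [32, 768]
#   Q-Former layers x 6    ->  [32, 768]    final query representations Z
#
# LLM input = [Z (32 tokens)] + [DEC] + [text tokens (M)]
#   -> the LLM generates the caption / answer conditioned on Z
```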

7. Conclusion: Why Is This Design Powerful?
- Frozen ViT: Saves memory; no gradients flow through the giant image encoder.
- Small bottleneck: Only train Q-Former and 32 queries, not ViT.
- Efficient for LLMs: 32 prefix tokens is manageable; aligns with how LLMs expect text.
- Versatility: Can be used for captioning (if no text prompt is given), VQA, or multi-turn dialogue (just keep the 32 query outputs at the front and keep appending text).
If this helped you understand BLIP‑2 better, please leave a few claps or a comment. It helps others discover it, and it really motivates me to keep writing.
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.