
Qwen2.5-VL: A hands-on code walkthrough
Last Updated on September 14, 2025 by Editorial Team
Author(s): tangbasky
Originally published on Towards AI.
Twin article: "Qwen2-VL: A hands-on code walkthrough" (understanding the working mechanism of multimodal LLMs), available on medium.com.
Qwen-VL can be hard to understand on a first read. The key barrier lies not in the algorithm or model framework, but in data preprocessing. Therefore, this article again takes a single sample as an example, as in the previous article on Qwen2-VL, to elaborate on the workflow of Qwen2.5-VL.
Overview
The Enhancements Compared to Qwen2-VL
- Window attention is introduced in the ViT: bidirectional attention occurs solely within each predefined image window. To further save computation, the maximum size of each window is set to 112*112 (8*8 patches/tokens, with each patch/token sized 14*14).
- For windows smaller than the maximum size, no padding is applied, so images are processed as close as possible to their native resolution.
- Dynamic FPS sampling: a manually configured sampling frame rate is used to sample frames from the original videos.
- Revised position_id calculation for 3D MRoPE along the T dimension: the original default interval (1) is replaced with the actual time interval between frames.
- The scale of the pretraining dataset has been significantly increased, from 1.2T tokens to 4T tokens.
Three stages for training
- Step 1: Training a New ViT Encoder
A new ViT encoder is trained first. The training data comprises image captions, visual knowledge (e.g., celebrities, landmarks, animals, and plants), and OCR data. Training starts from a CLIP pretrained model and adopts the (image, text) data format.
- Step 2: Joint Training of the ViT and the Qwen-VL Decoder
The ViT encoder (trained in Step 1) is combined with the Qwen-VL decoder, and joint training is conducted on the integrated model.
- Step 3: Long-Context Pretraining
A long-context understanding task is curated to train the model.
Data input format
Below is a step-by-step explanation of how to process data in (image, text) format, using a single example.
Loading Model
First, we import the core libraries and functions required for Qwen2.5-VL’s image-text inference. These include the model class for conditional generation, the preprocessor for data formatting, and the utility function for vision data processing.
""
Image Processing Module for Qwen2.5-VL Inference
(Used to handle image input and convert it into a format compatible with the Qwen2.5-VL model)
"""
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
Next, we load the pre-trained Qwen2.5-VL model weights. There are two options: a basic loading method, and a recommended method with FlashAttention-2 enabled (for faster inference and lower memory usage, especially for multi-image or video tasks).
- Basic loading method
# ======================================================================================================
# Load Model Weights (Basic Version)
# Recommend using the commented code below to enable FlashAttention-2 for better acceleration and memory saving,
# especially in multi-image and video scenarios.
# ======================================================================================================
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",  # Model identifier (3B parameter instruct version, suitable for general tasks)
    torch_dtype="auto",             # Automatically select tensor type (e.g., float16 for GPU, float32 for CPU)
    device_map="auto"               # Automatically assign model layers to available devices (avoids manual device configuration)
)
- FlashAttention-2 loading method
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
import torch  # needed for torch.bfloat16 below

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",            # 7B parameter version (higher precision than 3B, for more complex tasks)
    torch_dtype=torch.bfloat16,               # Use bfloat16 (reduces memory usage vs. float32, retains sufficient precision)
    attn_implementation="flash_attention_2",  # Enable FlashAttention-2 (speeds up attention computation by 2-4x)
    device_map="auto",                        # Same as above: automatic device assignment
)
Loading data
We then load the AutoProcessor, which is responsible for two core tasks: formatting text prompts into the model’s required chat template, and converting images into standardized pixel data. The comments below detail key parameters (e.g., token count per image) and the rationale for the pixel-to-token mapping.
# ======================================================================================================
# Load AutoProcessor for Image-Text Preprocessing
# Core functions: 1) Format text into chat template; 2) Resize/normalize images; 3) Control token count per image.
# Key Details:
# (1) Default token range per image: 4–16384 tokens (balances visual detail preservation and computational cost).
# (2) Customizable token count: Adjust min_pixels/max_pixels to trade off performance and cost (formula: pixel_size = token_count × 28 × 28).
# (3) Rationale for 28×28 pixel-per-token: ViT uses 14×14 base patches; 2×2 patches are merged into 1 token → 14×2 = 28.
# ======================================================================================================
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct") # Matches the model identifier
# Example of customizing min_pixels/max_pixels (uncomment to use):
# min_pixels = 256 * 28 * 28 # Min 256 tokens per image (requires ≥256×28×28 pixels)
# max_pixels = 1280 * 28 * 28 # Max 1280 tokens per image (capped at 1280×28×28 pixels)
# processor = AutoProcessor.from_pretrained(
# "Qwen/Qwen2.5-VL-7B-Instruct",
# min_pixels=min_pixels,
# max_pixels=max_pixels
# )
Define Inference Prompt (Text + Image)
We construct the input prompt in a “role-content” format (compatible with Qwen2.5-VL’s chat template). A single text prompt can be paired with multiple images (each image is represented as a dictionary), and we can customize image parameters (e.g., min_pixels) for individual images.
# ======================================================================================================
# Define Inference Prompt (Text + Image)
# Format Rules:
# (1) One text can pair with multiple images (each image is a dict in the "content" list).
# (2) Per-image parameter customization: Override global min_pixels/max_pixels by adding them to the image dict (see official docs).
# ======================================================================================================
messages = [
    {
        "role": "user",  # Role label (required for the chat template to distinguish user/model turns)
        "content": [
            {
                "type": "image",  # Content type (marks this as an image input)
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",  # Image URL (can also use local file paths)
                # Optional: Add min_pixels/max_pixels here to customize this image alone
            },
            {"type": "text", "text": "Describe this image."},  # User’s text query (asks the model to describe the image)
        ],
    }
]
Apply Chat Template (Preprocess Text Prompt)
We use the processor to format the messages into a text sequence that the model can understand. This step adds special tokens (e.g., <|im_start|>, <|vision_start|>, <|image_pad|>, <|vision_end|>, <|im_end|>) and a default system prompt to ensure the model parses the input correctly.
# ====================================================================================================
# Step 1: Chat Template Application (Preprocess Text Prompt)
# Purpose:
# - Add default system prompt ("You are a helpful assistant.") if not provided.
# - Wrap each role’s message with <|im_start|> (start) and <|im_end|> (end) tokens.
# - Reserve space for images with <|vision_start|><|image_pad|><|vision_end|> (placeholder for image tokens).
# Output Example:
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# <|vision_start|><|image_pad|><|vision_end|>Describe this image.<|im_end|>
# <|im_start|>assistant
# ====================================================================================================
text = processor.apply_chat_template(
    messages,
    tokenize=False,             # Return plain text (not tokenized tensors) for subsequent processing
    add_generation_prompt=True  # Append <|im_start|>assistant to trigger model generation
)
Preprocess Vision Data (Image Resizing & Validation)
We use process_vision_info to standardize the input image: validate its aspect ratio, align its dimensions to be divisible by 28, and scale it to fit the min_pixels/max_pixels range. This ensures the image is compatible with the ViT encoder’s patch processing logic.
# ====================================================================================================
# Step 2: Vision Data Preprocessing (Image Resizing & Validation)
# Purpose: Convert raw images into standardized PIL images compatible with ViT.
# Outputs:
# - image_inputs: List of preprocessed PIL images (length = number of images in the prompt).
# - video_inputs: Empty list (reserved for video processing; marked as TODO).
# Core Logic for Each Image:
# 1. Aspect ratio check: Reject images with max(h,w)/min(h,w) > 200 (avoids extreme distortion).
# 2. Dimension alignment: Round h/w to values divisible by 28 (for ViT patch merging).
# 3. Size scaling: Resize to fit min_pixels/max_pixels while preserving aspect ratio.
# 4. Final resize: Use bilinear interpolation to adjust to aligned dimensions (no cropping/padding).
# ====================================================================================================
image_inputs, video_inputs = process_vision_info(messages) # Process images in the "messages" prompt
Batch Tokenization (Combine Text + Image into Model Inputs)
We convert the formatted text (text
) and preprocessed images (image_inputs
) into PyTorch tensors. This step replaces the <|image_pad|>
placeholder with actual image tokens and adds padding for batch compatibility—generating the final input tensors for the model.
# ====================================================================================================
# Step 3: Batch Tokenization (Combine Text + Image into Model Inputs)
# Purpose: Convert text and images into tensors (input_ids, attention_mask, pixel_values, etc.) that the model can process.
# Key Tensors:
# - input_ids: Tokenized text + image tokens (shape: [batch_size, sequence_length]).
# - attention_mask: Marks valid tokens (1) and padding (0) (shape: [batch_size, sequence_length]).
# - pixel_values: Normalized image pixel patches (shape: [total_image_patches, 3×2×14×14]).
# - image_grid_thw: Token grid dimensions for each image (shape: [num_images, 3]).
# Note: Image tokenization (pixel → token) occurs in model.forward(), not here.
# ====================================================================================================
inputs = processor(
    text=[text],          # List of formatted text (batch size = 1 here)
    images=image_inputs,  # Preprocessed images (from Step 2)
    videos=video_inputs,  # Empty (no video data)
    padding=True,         # Add padding to make all sequences in the batch the same length
    return_tensors="pt",  # Return PyTorch tensors (compatible with Qwen2.5-VL)
)
inputs = inputs.to(model.device)  # Move tensors to the same device as the model (GPU/CPU)
Inference & Output Decoding
# ====================================================================================================
# Step 4: Inference — Generate Output Text
# Purpose: Generate a description of the image and decode the result into plain text.
# Core Steps:
# 1. model.generate(): Generate text (caps output length with max_new_tokens=128).
# 2. Trim input tokens: Keep only the model’s response (remove the original prompt).
# 3. Batch decode: Convert token tensors back to text (skip special tokens).
# ====================================================================================================
# Generate text response
generated_ids = model.generate(**inputs, max_new_tokens=128) # Max 128 new tokens (avoids overly long outputs)
# Trim: Remove input prompt tokens (retain only the model’s generated content)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# Decode: Convert trimmed tokens to plain text
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,            # Exclude special tokens like <|im_start|>
    clean_up_tokenization_spaces=False   # Preserve original spacing (avoids formatting distortion)
)
# Print the final image description
print(output_text)
Architecture

This architecture primarily contains three modules: process_vision_info, ViT and Qwen2.5 LM Decoder.
- process_vision_info
This preprocessor performs tasks such as dynamic image size adjustment and dynamic frame extraction for videos.
- ViT
It uses 3D convolution to split the input visual data into a series of 14*14 patches. Inside the encoder, window attention is applied to reduce computation. On the output side, an MLP layer merges every 2*2 patches into a single merged patch, which serves as one input token for Qwen2.5-VL.
- Qwen2.5 LM Decoder
It uses 3D MRoPE for both text and visual tokens, which are jointly fed into the model for processing.
Next, we will discuss the details of each module.
process_vision_info
Image data processing
image_inputs, video_inputs = process_vision_info(messages)
The process_vision_info module follows the subsequent steps:
(1) Check whether the image aspect ratio (max(h,w) / min(h,w)) falls within a predefined range. If the aspect ratio exceeds the threshold (currently 200), an error is raised.
(2) Reset the height and width of the image so that both values are divisible by 28.
(3) If an image is too large (i.e., its total number of pixels exceeds max_pixels), new height and width values are calculated while preserving the original aspect ratio, such that the total number of pixels in the resized image does not exceed max_pixels.
(4) After completing the above steps, we obtain the target dimensions (resized_height, resized_width). The image is then scaled to these dimensions, producing the image_inputs used by the model.
From the above steps, we discover that:
- The traditional crop-if-large, pad-if-small strategy for resizing images to a fixed resolution has been abandoned. We just scale images within a reasonable range while preserving the original information. Notably, each image has a different resolution.
- Each image involves a different number of patches.
Both of these points illustrate the concept of dynamic resolution in Qwen2.5-VL.
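A minimal sketch of this resizing logic is shown below (a simplified version of the resize helper in qwen_vl_utils; the constants mirror the defaults discussed above and the function name is my own):

import math

# Simplified sketch of the dynamic-resolution resize described above.
FACTOR = 28                    # each token covers 28x28 pixels (2x2 merged 14x14 patches)
MIN_PIXELS = 4 * 28 * 28       # lower bound on total pixels (4 tokens)
MAX_PIXELS = 16384 * 28 * 28   # upper bound on total pixels (16384 tokens)
MAX_RATIO = 200                # maximum allowed aspect ratio

def smart_resize_sketch(height: int, width: int) -> tuple[int, int]:
    # (1) Reject extreme aspect ratios.
    if max(height, width) / min(height, width) > MAX_RATIO:
        raise ValueError("aspect ratio exceeds the allowed threshold")
    # (2) Round both sides to multiples of 28.
    h_bar = max(FACTOR, round(height / FACTOR) * FACTOR)
    w_bar = max(FACTOR, round(width / FACTOR) * FACTOR)
    # (3) If outside the pixel budget, rescale while preserving the aspect ratio.
    if h_bar * w_bar > MAX_PIXELS:
        beta = math.sqrt((height * width) / MAX_PIXELS)
        h_bar = math.floor(height / beta / FACTOR) * FACTOR
        w_bar = math.floor(width / beta / FACTOR) * FACTOR
    elif h_bar * w_bar < MIN_PIXELS:
        beta = math.sqrt(MIN_PIXELS / (height * width))
        h_bar = math.ceil(height * beta / FACTOR) * FACTOR
        w_bar = math.ceil(width * beta / FACTOR) * FACTOR
    return h_bar, w_bar

# Example: smart_resize_sketch(1080, 1920) -> (1092, 1932), both divisible by 28.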
Video data processing
Video data processing primarily involves two functions: _read_video_decord() and smart_nframes().
- _read_video_decord()
This function is the core of video preprocessing: it loads videos from various sources, samples frames uniformly, converts data formats, and outputs model-compatible tensors.
def _read_video_decord(
    ele: dict,
) -> (torch.Tensor, float):
    """read video using decord.VideoReader
    Args:
        ele (dict): a dict contains the configuration of video.
        support keys:
            - video: the path of video. support "file://", "http://", "https://" and local path.
            - video_start: the start time of video.
            - video_end: the end time of video.
    Returns:
        torch.Tensor: the video tensor with shape (T, C, H, W).
    """
    import decord
    video_path = ele["video"]
    st = time.time()
    vr = decord.VideoReader(video_path)
    # TODO: support start_pts and end_pts
    if 'video_start' in ele or 'video_end' in ele:
        raise NotImplementedError("not support start_pts and end_pts in decord for now.")
    # =====================================================================================
    # The following are the original properties of the video
    # total_frames: Total number of frames in the original video
    # video_fps: FPS (frames per second) of the original video
    # =====================================================================================
    total_frames, video_fps = len(vr), vr.get_avg_fps()
    logger.info(f"decord: {video_path=}, {total_frames=}, {video_fps=}, time={time.time() - st:.3f}s")
    # =====================================================================================
    # nframes: The total number of frames to sample from this video after calculation.
    # See below for an explanation of the calculation logic.
    # =====================================================================================
    nframes = smart_nframes(ele, total_frames=total_frames, video_fps=video_fps)
    # =====================================================================================
    # idx: IDs of the selected video frames, sampled in a uniform manner
    # (The purpose of uniform sampling is to preserve information from all time axes of the original video as much as possible)
    # For example: Assume total_frames = 20 (original video has 20 frames), video_fps = 5 (5 frames per second)
    # The original video is then 4 seconds long, with frames grouped by second as:
    # [[0,1,2,3,4], [5,6,7,8,9], [10,11,12,13,14], [15,16,17,18,19]]
    # If nframes = 10 (we need to sample 10 frames finally), the sampled idx will be:
    # [0, 2, 4, 6, 8, 11, 13, 15, 17, 19]
    # =====================================================================================
    idx = torch.linspace(0, total_frames - 1, nframes).round().long().tolist()
    video = vr.get_batch(idx).asnumpy()  # Extract frames at the corresponding positions and convert to a numpy array
    video = torch.tensor(video).permute(0, 3, 1, 2)  # Convert to TCHW format (Time, Channels, Height, Width)
    # =====================================================================================
    # sample_fps: Represents the frame rate of the sampled video.
    # For example: total_frames = 20 (original video has 20 frames), video_fps = 5 (5 frames per second)
    # The original video is 4 seconds long. If nframes = 10, the sampled FPS is:
    # nframes / (total_frames / video_fps) = 10 / 4 = 2.5 fps
    # =====================================================================================
    sample_fps = nframes / max(total_frames, 1e-6) * video_fps
    # =====================================================================================
    # video: Sampled video data, converted to a tensor with shape (T, C, H, W)
    # sample_fps: Frame rate (FPS) of the sampled video
    # =====================================================================================
    return video, sample_fps
- smart_nframes()
This helper function calculates the valid number of frames to sample (nframes) by reconciling user configuration (fps/nframes) with model constraints (e.g., token limits, frame grouping requirements).
def smart_nframes(
    ele: dict,
    total_frames: int,
    video_fps: int | float,
) -> int:
    """calculate the number of frames for video used for model inputs.
    Args:
        ele (dict): a dict contains the configuration of video.
            support either `fps` or `nframes`:
            - nframes: the number of frames to extract for model inputs.
            - fps: the fps to extract frames for model inputs.
                - min_frames: the minimum number of frames of the video, only used when fps is provided.
                - max_frames: the maximum number of frames of the video, only used when fps is provided.
        total_frames (int): the original total number of frames of the video.
        video_fps (int | float): the original fps of the video.
    Raises:
        ValueError: nframes should in interval [FRAME_FACTOR, total_frames].
    Returns:
        int: the number of frames for video used for model inputs.
    """
    # =====================================================================================
    # Both nframes and fps are derived from the user's configuration in the message.
    # - nframes: Determines the total number of frames to sample from the video finally.
    # - fps: Assuming the original video duration remains unchanged, this value represents the
    #   frame rate (frames per second) the user wants to use for sampling (default = 2).
    #   Theoretically: Total duration of original video * user-configured fps = Total frames of the sampled video.
    #   In practice: Due to QwenVL constraints (preventing video data from occupying too few/too many tokens),
    #   the actual FPS of the sampled video may not exactly match the user-configured fps.
    # Based on the above definitions: You must configure either fps or nframes, not both.
    # =====================================================================================
    assert not ("fps" in ele and "nframes" in ele), "Only accept either `fps` or `nframes`"
    # =====================================================================================
    # If nframes is configured: Round it to a multiple of FRAME_FACTOR (default FRAME_FACTOR = 2)
    # (Because we want to process 2 video frames together before inputting them into the ViT (Vision Transformer))
    # =====================================================================================
    if "nframes" in ele:
        nframes = round_by_factor(ele["nframes"], FRAME_FACTOR)
    # =====================================================================================
    # If fps is configured (use the default value FPS = 2 if not configured)
    # =====================================================================================
    else:
        fps = ele.get("fps", FPS)
        # Minimum total frames required for video data (default = 4), ensuring it is a multiple of FRAME_FACTOR
        min_frames = ceil_by_factor(ele.get("min_frames", FPS_MIN_FRAMES), FRAME_FACTOR)
        # Maximum total frames allowed for video data (default = 768), ensuring it is a multiple of FRAME_FACTOR
        max_frames = floor_by_factor(ele.get("max_frames", min(FPS_MAX_FRAMES, total_frames)), FRAME_FACTOR)
        # Theoretically: Total frames we need = Original video length (in seconds) * Artificially defined sampling FPS
        nframes = total_frames / video_fps * fps
        if nframes > total_frames:
            logger.warning(f"smart_nframes: nframes[{nframes}] > total_frames[{total_frames}]")
        # Apply constraints to nframes: Ensure it stays within [min_frames, max_frames] and does not exceed total_frames
        nframes = min(min(max(nframes, min_frames), max_frames), total_frames)
        # Round down nframes to the nearest multiple of FRAME_FACTOR
        nframes = floor_by_factor(nframes, FRAME_FACTOR)
    # Validate: Ensure nframes is within the valid range [FRAME_FACTOR, total_frames]
    if not (FRAME_FACTOR <= nframes and nframes <= total_frames):
        raise ValueError(f"nframes should in interval [{FRAME_FACTOR}, {total_frames}], but got {nframes}.")
    return nframes
The results of these two functions are used in the subsequent 3D RoPE calculation step to compute the positional information along the T dimension.
Notably, an important parameter, second_per_grid_ts, is involved; it is computed as follows:
second_per_grid_ts = (1 / sample_fps) * temporal_patch_size  # temporal_patch_size defaults to 2
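To make these quantities concrete, here is a small hypothetical example (the numbers are invented for illustration):

# Hypothetical example: a 60-second video with 1800 frames (video_fps = 30),
# sampled with the default fps = 2 and temporal_patch_size = 2.
total_frames, video_fps, fps = 1800, 30.0, 2.0
nframes = total_frames / video_fps * fps            # 120 frames to sample
sample_fps = nframes / (total_frames / video_fps)   # 2.0 fps after sampling
second_per_grid_ts = (1 / sample_fps) * 2           # 1.0 second covered by each temporal grid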
The entire workflow of video data processing is demonstrated below:

Processor
In the previous steps, we have completed the following:
- The data has undergone apply_chat_template processing, where start/end, image, and video placeholder tokens are added.
- The visual data has undergone preprocessing such as dynamically resizing images, extracting frames from videos, and dynamically resizing the extracted frames.
However, the text data has not yet been tokenized, and the visual data still needs further processing.
# ====================================================================================================
# Step 3: Batch Tokenization (Combine Text + Image into Model Inputs)
# Purpose: Convert text and images into tensors (input_ids, attention_mask, pixel_values, etc.) that the model can process.
# Key Tensors:
# - input_ids: Tokenized text + image tokens (shape: [batch_size, sequence_length]).
# - attention_mask: Marks valid tokens (1) and padding (0) (shape: [batch_size, sequence_length]).
# - pixel_values: Normalized image pixel patches (shape: [total_image_patches, 3×2×14×14]).
# - image_grid_thw: Token grid dimensions for each image (shape: [num_images, 3]).
# Note: Image tokenization (pixel → token) occurs in model.forward(), not here.
# ====================================================================================================
inputs = processor(
    text=[text],          # List of formatted text (batch size = 1 here)
    images=image_inputs,  # Preprocessed images (from Step 2)
    videos=video_inputs,  # Empty (no video data)
    padding=True,         # Add padding to make all sequences in the batch the same length
    return_tensors="pt",  # Return PyTorch tensors (compatible with Qwen2.5-VL)
)
inputs = inputs.to(model.device)  # Move tensors to the same device as the model (GPU/CPU)
In this function, self.image_processor is responsible for processing the visual data and self.tokenizer for processing the text data.
Qwen2VLImageProcessor
self.image_processor inherits from Qwen2VLImageProcessor, which performs the following processes:
(1) do_resize / do_rescale / do_normalize
- do_resize: adjusts the image size.
- do_rescale (multiply by 1/255): scales pixel values to the range [0, 1].
- do_normalize: normalizes images channel-wise using predefined mean and std.
These three operations are optional.
(2) Each image is duplicated temporal_patch_size times (default value 2). This ensures the image processing path aligns with that of videos.
(3) After completing the above steps, an image's pixel data has shape (grid_t * grid_h * grid_w, channel * temporal_patch_size(2) * patch_size(14) * patch_size(14)), where
* grid_t: the number of grids along the T (temporal) dimension, i.e., how many 2-frame segments the video is divided into. For a single image, grid_t is set to 1.
* grid_h = resized_height / patch_size: the number of patches along the height dimension when divided by the patch size (14).
* grid_w = resized_width / patch_size: the number of patches along the width dimension.
Meanwhile, the grid dimensions (grid_t, grid_h, grid_w) are retained for later use.
Then all the images in a batch are concatenated, so the final pixel_values has shape [sum(grid_t * grid_h * grid_w), channel * temporal_patch_size(2) * patch_size(14) * patch_size(14)], while image_grid_thw stores the (grid_t, grid_h, grid_w) of each image.
(4) Notably, in this shape, the first dimension (grid_t * grid_h * grid_w) represents the number of patches, and spatially adjacent 2×2 patch blocks are arranged into 4 consecutive positions in the sequence.

For a single video, the processing workflow is similar to that of an image. The key difference is that grid_t is not set to 1; instead, it is calculated as the total number of frames / temporal_patch_size.
Tokenization
To concatenate text and images into a unified input, an image placeholder <|image_pad|> is used, where one <|image_pad|> initially represents one image. Since each image is processed into multiple tokens, the <|image_pad|> is duplicated multiple times. The number of duplications is (grid_t * grid_h * grid_w) / (merge_size**2), where merge_size (default value 2) controls how many patches are merged into one token: after partitioning the original image into 14×14 patches, every group of merge_size × merge_size (2×2) neighboring patches is merged into a single token representation.
Once images are processed in this way, they are fed into the tokenizer along with the text data.
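A rough sketch of this placeholder expansion (simplified; the real processor operates on the chat-template text together with the tokenizer, but the idea is the same, and the helper name below is my own):

merge_size = 2

def expand_image_pads(text: str, image_grid_thw: list[tuple[int, int, int]]) -> str:
    # Expand each <|image_pad|> into one placeholder per visual token of its image.
    for grid_t, grid_h, grid_w in image_grid_thw:
        num_tokens = (grid_t * grid_h * grid_w) // merge_size ** 2
        text = text.replace("<|image_pad|>", "<|placeholder|>" * num_tokens, 1)
    return text.replace("<|placeholder|>", "<|image_pad|>")

# e.g. a (1, 30, 40) grid expands the single <|image_pad|> into 1200 / 4 = 300 copies.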
ViT Encoder
After processing the text and image data, the processed visual data is fed into the ViT model. The detailed steps are as follows:
3D Convolution
Before entering the ViT, the visual data is provided as flatten_patches together with the corresponding grid information (grid_t, grid_h, grid_w), consistent with the data shown in Figure 3.
The 3D convolution applies hidden_size kernels to the patch blocks: each kernel computes over a patch block to produce one value of the hidden vector, and once all kernels have completed their computation, the full hidden_size feature representation is obtained.

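The patch-embedding step can be sketched as follows (a minimal illustration of the idea, not the exact Qwen2.5-VL module; hidden_size and the input tensor below are assumed values):

import torch
import torch.nn as nn

# Sketch: embed flattened (3 x 2 x 14 x 14) patch blocks with a 3D convolution.
channel, temporal_patch_size, patch_size, hidden_size = 3, 2, 14, 1280  # assumed values
proj = nn.Conv3d(channel, hidden_size,
                 kernel_size=(temporal_patch_size, patch_size, patch_size),
                 stride=(temporal_patch_size, patch_size, patch_size),
                 bias=False)

flatten_patches = torch.randn(1200, channel * temporal_patch_size * patch_size ** 2)
x = flatten_patches.view(-1, channel, temporal_patch_size, patch_size, patch_size)
hidden_states = proj(x).view(-1, hidden_size)  # [1200, 1280]: one embedding per patch block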
2D RoPE
The next step in the computation is the RoPE inside the ViT. Notably, the reason we use 2D RoPE instead of 3D RoPE is that the ViT's main purpose is to extract features from a single image or frame (in particular, window attention, introduced in a subsequent step, does not operate along the temporal (T) dimension, so 3D RoPE is not needed here). Instead, the information along the T dimension is retained for the core computation in Qwen-VL, where 3D RoPE is formally introduced.
Below is the 1D RoPE formulation:

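For reference, the standard 1D RoPE rotation matrix for a token at position m, with θ_i = 10000^(−2(i−1)/d), is:

$$
R_m=\begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & \cdots & 0 & 0\\
\sin m\theta_1 & \cos m\theta_1 & \cdots & 0 & 0\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2}\\
0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}
$$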
Since R_m is a sparse matrix in which most of the values are 0, we generally adopt the following computation format for convenience.

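For reference, this element-wise form is equivalent to:

$$
R_m x=\begin{pmatrix}x_1\\ x_2\\ x_3\\ x_4\\ \vdots\\ x_{d-1}\\ x_d\end{pmatrix}\otimes\begin{pmatrix}\cos m\theta_1\\ \cos m\theta_1\\ \cos m\theta_2\\ \cos m\theta_2\\ \vdots\\ \cos m\theta_{d/2}\\ \cos m\theta_{d/2}\end{pmatrix}+\begin{pmatrix}-x_2\\ x_1\\ -x_4\\ x_3\\ \vdots\\ -x_d\\ x_{d-1}\end{pmatrix}\otimes\begin{pmatrix}\sin m\theta_1\\ \sin m\theta_1\\ \sin m\theta_2\\ \sin m\theta_2\\ \vdots\\ \sin m\theta_{d/2}\\ \sin m\theta_{d/2}\end{pmatrix}
$$

where ⊗ denotes element-wise multiplication, and the key relative-position property is $(R_m q)^\top(R_n k)=q^\top R_{n-m}k$.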
Notice that the choice of R_m is not unique. For instance, the 1D RoPE matrix R_m can also be represented in the following form, since it still satisfies the property shown in the third formula.

Then, for a single token in the ViT’s 2D RoPE, its 2D positional encoding is as follows (assume vit_hidden_size=12, and the token’s position is (h,w)):

Although the formula may be difficult to interpret, we can roughly understand it as follows: half of head_dim is used to apply RoPE based on the h-axis position, and the other half of head_dim is used to apply RoPE based on the w-axis position, with both parts sharing a single set of θ (angle parameters).
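One way to write this (a restatement of the figure, assuming a per-head dimension d and shared base angles θ_i) is to build the angle vector of a token at grid position (h, w) as

$$
\theta^{(h,w)} = \big(\,h\theta_1,\ h\theta_2,\ \ldots,\ h\theta_{d/4},\ \ w\theta_1,\ w\theta_2,\ \ldots,\ w\theta_{d/4}\,\big),
$$

where each angle is applied to one (cos, sin) pair of dimensions: the first d/4 angles come from the h coordinate and the remaining d/4 from the w coordinate.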
window attention

The idea of window attention is illustrated in Figure 5: the attention mechanism operates solely between patches within the same window. However, the flattening order used for 2D RoPE cannot guarantee that tokens in the same window are arranged contiguously in the sequence.

To perform window attention, we apply a window mask in the attention mechanism:

The method for reordering the ViT's tokens is implemented here. We use this reordered token sequence as the input to the ViT for computation. Importantly, after obtaining the outputs from the ViT, the reordered sequence must be restored to its original order.
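A minimal sketch of the window-mask idea (a hypothetical helper of my own; the real implementation works with reordered token indices and cu_seqlens inside the ViT):

import torch

def window_attention_mask(window_ids: torch.Tensor) -> torch.Tensor:
    # Given one window id per (reordered) token, allow attention only within the same window.
    same_window = window_ids.unsqueeze(0) == window_ids.unsqueeze(1)  # [N, N] boolean
    mask = torch.full(same_window.shape, float("-inf"))
    mask[same_window] = 0.0                                           # 0 where attention is allowed
    return mask

# e.g. eight tokens grouped into two windows of four tokens each after reordering:
window_ids = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
attn_bias = window_attention_mask(window_ids)  # block-diagonal additive mask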
MRoPE
In the 3D RoPE, regardless of whether the input is text or images, position_ids are represented in the format (t, h, w); for a text token, the three values of its position_id are all equal.

In Figure 8, a video containing a dog is processed into 3 frames. Each dashed cube represents a vision token, which is then fed into the Qwen-VL decoder. From this, we can see that the position_id values along the height (h) and width (w) dimensions are assigned according to each token's spatial location.
Additionally, in the temporal dimension, the default interval between two adjacent frames (or images) is 1. Thus, in Figure 8, the values in the temporal dimension increase from 0 to 2.
In Qwen2-VL, the initial position_id values for a new modality are derived from the maximum position_id value of the previous modality plus 1. Here are some examples:
- In Figure 8, the maximum position_id value of the visual modality is 3 (i.e., max(t,h,w) = 3). Thus, the initial position_id for the text modality is 3 + 1 = 4, and the position_id of the first text token is (4, 4, 4).
- In Figure 8, the maximum position_id value (i.e., max(t,h,w)) reaches 12 by the last text token. When appending new visual data after the text tokens, the initial position_id values of the new visual data become (0+13, 0+13, 0+13), (0+13, 0+13, 1+13), (0+13, 0+13, 2+13), and so on. This is because the position_id values of the visual data, before adding the offset from the previous modality, are (0, 0, 0), (0, 0, 1), (0, 0, 2), and so on.
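A small sketch of the bookkeeping in the two examples above (simplified to some text followed by one vision-token grid, with a helper name of my own choosing):

# Simplified sketch: build (t, h, w) position_ids for [text tokens] + [one image grid].
def build_position_ids(num_text_tokens: int, grid_t: int, grid_h: int, grid_w: int):
    pos = []
    # Text tokens: all three components are equal and increase by 1 per token.
    for i in range(num_text_tokens):
        pos.append((i, i, i))
    # A new modality starts from (max id of the previous modality) + 1.
    offset = max(pos[-1]) + 1 if pos else 0
    for t in range(grid_t):
        for h in range(grid_h):
            for w in range(grid_w):
                pos.append((t + offset, h + offset, w + offset))
    return pos

# e.g. 4 text tokens followed by a (1, 2, 2) vision-token grid:
# [(0,0,0), (1,1,1), (2,2,2), (3,3,3), (4,4,4), (4,4,5), (4,5,4), (4,5,5)]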
Now, we have introduced the 3D position_id of Qwen2-VL. The only difference between Qwen2.5-VL and Qwen2-VL lies in the initial position_id value setting for the T dimension for the visual data.
Based on the previously introduced second_per_grid_t, we now introduce a new hyperparameter, tokens_per_second. This hyperparameter is somewhat abstract, but I will attempt to illustrate it clearly.

In the original visual data, two tokens lie along the temporal dimension between the first group of video frames and the second group. Importantly, these tokens are not the ones fed into the Qwen-VL decoder; instead, they correspond to 14×14 patches in the original video frames.
At first glance, tokens_per_second may seem equivalent to temporal_patch_size. However, since this hyperparameter can be set manually, it is better interpreted as "the number of tokens that the user considers reasonable to pass along the T dimension within one second." Thus, the default value of tokens_per_second is set to temporal_patch_size.
With this understanding, second_per_grid_t * tokens_per_second expresses, in token units, the absolute time interval between two adjacent groups of video frames. This product determines the position step along the T dimension, so it substitutes for the default T-dimension position_id interval (which is 1). Here is an example:
- Assume the original T-dimension position_id values are [0, 1, 2] (with a default interval of 1).
- If second_per_grid_t * tokens_per_second = 50, this value is much larger than the default of 1, indicating a significantly longer time interval between adjacent groups of video frames. This scenario typically occurs when the original video is long but only a small number of frames are extracted, resulting in large gaps between adjacent frames; using 50 instead of 1 better reflects this long interval.
- Thus, the updated temporal-dimension position_id values are [0, 1, 2] * 50 = [0, 50, 100].
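In code, this temporal scaling can be sketched as follows (a simplified illustration with hypothetical numbers, not the exact implementation):

# Simplified sketch of the Qwen2.5-VL temporal position_ids for one video.
grid_t = 3
tokens_per_second = 2      # default: equal to temporal_patch_size
second_per_grid_t = 25.0   # hypothetical: each temporal grid spans 25 seconds of video
t_index = [int(t * second_per_grid_t * tokens_per_second) for t in range(grid_t)]
# Qwen2-VL:   t positions = [0, 1, 2]
# Qwen2.5-VL: t positions = [0, 50, 100]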
3D MRoPE
Assuming we know that a token's position_id is (t, h, w), how is its RoPE matrix constructed?
Assume head_dim = 128. From the config file, we can determine how many of the 128 dimensions are used for RoPE along the temporal, height, and width dimensions respectively. With the default config, we assume:
- T = 16 * 2 = 32
- H = 24 * 2 = 48
- W = 24 * 2 = 48
Therefore, the RoPE is:

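As a simplified illustration (my own sketch, not the library implementation) of how the per-dimension angles can be assembled from a (t, h, w) position with the 16/24/24 split above (i.e., 32/48/48 of the 128 dims):

import torch

head_dim = 128
mrope_section = [16, 24, 24]  # rotary pairs assigned to T, H, W (16 + 24 + 24 = 64 = head_dim / 2)
inv_freq = 1.0 / (10000 ** (torch.arange(0, head_dim, 2).float() / head_dim))  # 64 base frequencies

def mrope_angles(t: int, h: int, w: int) -> torch.Tensor:
    # The first 16 frequency pairs rotate by t, the next 24 by h, the last 24 by w.
    positions = torch.tensor([t] * mrope_section[0] + [h] * mrope_section[1] + [w] * mrope_section[2])
    return positions * inv_freq  # 64 angles; each drives one (cos, sin) pair of dimensions

angles = mrope_angles(t=4, h=5, w=7)  # hypothetical position (4, 5, 7)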
Reference
- [1] Qwen2.5-VL Technical Report
- [2] An Illustrated In-depth Analysis of Qwen2.5-VL Implementation Details in a Ten-thousand Character Article
Published via Towards AI
Note: Content contains the views of the contributing authors and not Towards AI.