Fine-Tuning DeepSeek-VL2 for Multimodal Instruction Following: A Comprehensive Technical Guide

Author(s): Ojasva Goyal

Originally published on Towards AI.

Fine-tuning large-scale vision-language models with detailed error breakdowns and best practices. Unlocking advanced vision-language capabilities with parameter-efficient adaptation

Image by Alex Shuper on Unsplash

Introduction

DeepSeek-VL2 is a multimodal large language model (MLLM) capable of interpreting and responding to both textual and visual instructions. Fine-tuning it allows tailoring its understanding and generation capabilities to specific multimodal tasks, significantly boosting performance compared to zero-shot applications. In this detailed guide, I’ll walk you through each step to effectively fine-tune DeepSeek-VL2, making complex multimodal applications accessible and achievable.

We’ll cover:

  • Dataset preparation and preprocessing
  • Model and processor configuration details
  • Training workflow with LoRA and Hugging Face’s PEFT
  • Common errors encountered (and clear step-by-step solutions)
  • Practical inference examples

Whether you’re a researcher exploring VLMs or a builder creating domain-specific AI agents, this guide will help you avoid many pitfalls and understand the process in depth.

📌 Problem Statement & Dataset Format

My goal: fine-tune DeepSeek-VL2 on a custom Visual Question Answering (VQA) dataset, where each entry includes a question, an answer, and a path to an image.

Sample JSON format:

{
  "question": "What should I do and why according to the visual?",
  "answer": "You should attend the awards because...",
  "image_path": "/path/to/image.jpg"
}

Challenges faced initially:

  • Variations in image dimensions causing model incompatibilities.
  • Texts needed to be converted into structured conversations.
  • Required multi-modal fusion: image + text.

I used datasets.Dataset for loading and preprocessing, PIL for images, and formatted each sample to fit DeepSeek-VL2’s chat template.
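As a concrete illustration, here is a minimal sketch of that per-sample formatting step, assuming the JSON fields shown above and DeepSeek-VL2's special tokens (the helper name `build_conversation` is hypothetical):

```python
import json

def build_conversation(sample: dict) -> list:
    """Turn one VQA record into DeepSeek-VL2 chat-template messages."""
    return [
        {
            "role": "<|User|>",
            # The <image> token marks where the visual input is injected.
            "content": sample["question"] + "\n<image>",
            "images": [sample["image_path"]],
        },
        # The assistant turn is left blank so the model learns to generate it.
        {"role": "<|Assistant|>", "content": ""},
    ]

record = json.loads("""{
  "question": "What should I do and why according to the visual?",
  "answer": "You should attend the awards because...",
  "image_path": "/path/to/image.jpg"
}""")
conversation = build_conversation(record)
```

In the real pipeline, each such conversation (plus the `PIL` image it references) is what gets handed to the processor.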

Fine-Tuning Workflow Overview

Here’s the pipeline at a glance:

[Dataset JSONs] 
⬇️
[Preprocessing: Format Text + Load Image]
⬇️
[Tokenizer & Processor: Add Chat Template, Image Tokens]
⬇️
[Custom Collator: Handle dynamic input_ids, single batch]
⬇️
[DeepSeek-VL2 Model + LoRA Adapters]
⬇️
[Training with Trainer API]
⬇️
[Model + Processor Save]
⬇️
[Test Inference]

Each step brought its own set of bugs and learnings, which we’ll now dive into deeply.

Understanding DeepSeek-VL2’s Chat Template

When working with large language models — especially ones trained for conversational tasks like DeepSeek-VL2 — you can’t just feed in plain text.

The input has to follow a specific conversation structure, known as the chat template.

This format helps the model understand:

  • Who is speaking (User or Assistant)
  • What type of content follows (text, image, etc.)
  • Where the model should begin its response

In DeepSeek-VL2, a typical chat sequence looks like:

<|User|>
What should I do and why should I do according to the visual advertisement?
<image>
<|Assistant|>

Important Points:

  • <|User|> and <|Assistant|> special tokens guide the model.
  • <image> token explicitly indicates where visual inputs are injected into text sequences.
  • Assistant response is kept empty initially during training so the model learns to generate it.

This format had to be manually created while pre-processing every sample.
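To make that manual formatting concrete, here is a stdlib-only sketch that flattens role/content messages into the raw prompt layout shown above (`render_chat` is a hypothetical helper; in the real pipeline the processor applies the template for you):

```python
def render_chat(messages) -> str:
    """Flatten chat messages into DeepSeek-VL2's raw prompt layout."""
    lines = []
    for m in messages:
        lines.append(m["role"])   # e.g. <|User|> or <|Assistant|>
        if m["content"]:          # a blank assistant turn emits only the role tag
            lines.append(m["content"])
    return "\n".join(lines)

prompt = render_chat([
    {"role": "<|User|>",
     "content": "What should I do and why should I do according to the visual advertisement?\n<image>"},
    {"role": "<|Assistant|>", "content": ""},
])
```

Note how the empty assistant content leaves the prompt ending at `<|Assistant|>`, which is exactly where generation should begin.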

Mistake we initially made:

  • We omitted the <eos> (end-of-sequence) token after the assistant turn, causing downstream tensor-shape mismatches.

Fix: Leave the Assistant's text as an empty string ""; the processor then appends <eos> automatically.

Without this format, you may get token mismatch errors or incorrect outputs. During preprocessing and inference, we always applied this structure using processor(..., conversations=[...], images=[...]). This ensured the model knew the question context and when to generate the answer.

Model Setup & LoRA Integration

Fine-tuning large models can be computationally expensive. DeepSeek-VL2 is too large to fine-tune fully on a single GPU. Hence, we opted for LoRA (Low-Rank Adaptation) fine-tuning using the PEFT library. LoRA significantly reduces computational and memory requirements by only adjusting a small subset of model parameters.

We used:

  • DeepSeek-VL2 from Hugging Face (manually cloned & patched)
  • LoRA via PEFT for parameter-efficient tuning
  • Trainer from transformers

LoRA configuration:

Here’s an ideal LoRA configuration:

from peft import LoraConfig, TaskType

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

Why these settings?

  • r=8: the rank of the low-rank update; a good balance between adapter capacity and parameter count.
  • lora_alpha=16: scales the LoRA update (effective scale alpha/r = 2), which helps keep training stable.
  • target_modules: the attention query and value projections, which gave the best multimodal results.
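To see why this is parameter-efficient, here is a back-of-the-envelope count (the 2048 hidden size is a made-up example; DeepSeek-VL2's actual projection shapes vary by layer):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

full = 2048 * 2048                   # weights in one full projection matrix
lora = lora_params(2048, 2048, r=8)  # trainable weights LoRA adds instead
ratio = lora / full                  # fraction of the matrix LoRA trains
```

With r=8 the adapter trains well under 1% of the parameters of the matrix it adapts, which is what makes single-GPU fine-tuning feasible.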

Compatibility fixes:

  • Patched tokenizer to add special tokens (<image>, <|User|>, etc.)
  • Forced all tensors to bfloat16
  • Monkey-patched xFormers and DeepSeek’s custom attention to use PyTorch’s fallback for CUDA compatibility

Key setup steps:

  • Load the base DeepSeek-VL2 model:

model = DeepseekVLV2ForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

  • Attach the LoRA adapters:

model = get_peft_model(model, lora_cfg)

A Quick Note on Mixture-of-Experts (MoE)

DeepSeek-VL2 uses a Mixture-of-Experts (MoE) architecture: only a subset of model “experts” are active for each input, making it possible to scale up the model without linearly scaling compute. For fine-tuning, LoRA adapters sit on top of the main attention and MLP layers, while the MoE routing logic remains unchanged. This is a big reason why LoRA + MoE is so memory-efficient compared to full fine-tuning.
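The routing idea can be sketched with a toy example (the gate scores, expert functions, and top-k choice below are all made up for illustration; the real model routes per token with learned gates):

```python
import math

def route_top_k(gate_scores, k=2):
    """Indices of the k highest-scoring experts."""
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:k]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and softmax-weight their outputs."""
    idx = route_top_k(gate_scores, k)
    weights = [math.exp(gate_scores[i]) for i in idx]
    total = sum(weights)
    return sum(w / total * experts[i](x) for w, i in zip(weights, idx))

# Four toy "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(10.0, experts, gate_scores=[0.1, 2.0, 0.5, 1.5], k=2)
```

Only k of the experts run per input, so adding experts grows capacity without growing per-token compute, which is the property the article relies on.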

Key Errors and Deep Debugging Journey

During the project, I hit many non-trivial errors. Here are major ones with cause, effect, and final solution:

🔴 1. AttributeError: 'BatchCollateOutput' object has no attribute 'items'

Cause: Processor output was a BatchCollateOutput, not a standard dict.

Impact: Model forwarding failed as it couldn’t accept it.

Solution:

inputs = dict(processor(...))

This explicitly converts the processor output back into a plain dictionary that the model's forward pass accepts.

🔴 2. RuntimeError: expected sequence of length 1658 at dim 1 (got 1833)

Cause: Different input_ids lengths per batch sample.

Impact: default_data_collator failed.

Solution:

  • Created a custom collator that assumes batch size = 1.
  • No padding needed; we only add the batch dimension (unsqueeze) manually:

batch = {k: v.unsqueeze(0) for k, v in sample.items()}
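Putting that together, the whole batch-size-1 collator can be as small as this sketch (`collate_single` is a hypothetical name; it assumes every value in the sample exposes a torch-style `.unsqueeze(0)`):

```python
def collate_single(sample: dict) -> dict:
    """Add a leading batch dimension to every tensor in one sample.

    No padding is needed because each batch holds exactly one sample,
    which sidesteps the variable-length input_ids problem entirely.
    """
    return {k: v.unsqueeze(0) for k, v in sample.items()}
```

Passing this as `data_collator` replaces `default_data_collator`, which was the component that crashed on mismatched sequence lengths.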

🔴 3. RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

Cause: Image tensors were in float32 while model expected bfloat16.

Impact: Crashed at the convolutional layer.

Solution: Explicit conversion during preprocessing:

images = images.to(torch.bfloat16)

🔴 4. AttributeError: 'Dummy' object has no attribute 'sft_format'

Cause: Wrong format being passed to DeepSeek’s processor.

Impact: Processor expected an SFT format (chat + image tuple).

Solution: Ensure proper conversation format during preprocessing using:

processor(prompt=None, conversations=..., images=...)

🔴 5. NotImplementedError: No operator found for memory_efficient_attention_forward

Cause: xFormers wasn’t built for the CUDA/PyTorch version.

Impact: Attention operator missing at runtime.

Solution: Monkey-patch the xFormers entry point to fall back to PyTorch's native scaled-dot-product attention:

fmha.memory_efficient_attention = lambda *args, **kwargs: F.scaled_dot_product_attention(*args, **kwargs)

Caveat: this passthrough only works where call sites pass q, k, v positionally; xFormers' attn_bias keyword does not map directly onto SDPA's attn_mask, and the two APIs generally expect different tensor layouts, so keyword-using call sites would need explicit remapping.

🔴 6. AssertionError: input_ids[-1] != eos_id

Cause: Assistant message didn’t include an <eos> token.

Impact: Processor expected the Assistant part to terminate correctly.

Solution: During testing, we made sure to leave the assistant message empty ("") so the processor appends <eos> automatically.

{"role": "<|Assistant|>", "content": ""}

Training Pipeline

After solving all the preprocessing, device, and datatype mismatches and injecting LoRA, I launched training with:

trainer.train()

We also set logging_steps and gradient_accumulation_steps, and increased num_train_epochs to 10 for better convergence.
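As a configuration sketch, the Trainer setup looked roughly like the following. The output_dir, batch size, and learning rate are illustrative assumptions (not the article's exact values), and `model`, `train_ds`, and `collate_fn` stand for the LoRA-wrapped model, preprocessed dataset, and custom collator built earlier:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./vl2_finetuned_lora",  # assumed path
    per_device_train_batch_size=1,      # matches the batch-size-1 collator
    gradient_accumulation_steps=8,      # simulate a larger effective batch
    num_train_epochs=10,
    logging_steps=10,
    bf16=True,                          # keep everything in bfloat16
    learning_rate=2e-4,                 # common LoRA default, assumed
    remove_unused_columns=False,        # keep image tensors in the batch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=collate_fn,
)
trainer.train()
```

Note `remove_unused_columns=False`: without it, Trainer silently drops columns the model signature doesn't name, which can strip the image tensors from multimodal batches.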

Weights were saved post-training with:

model.save_pretrained("./vl2_finetuned_lora_saved")
processor.save_pretrained("./vl2_finetuned_lora_saved")

Training logs showed a steadily decreasing loss, a good sign!

Inference Example

Instead of relying on dataset samples, we picked an explicit image:

test_path = "/path/to/test_image.jpg"
test_img = Image.open(test_path).convert("RGB")

We generated conversation inputs and passed them through the model:

test_conv = [
    {"role": "<|User|>", "content": "What should I do and why should I do according to the visual?"},
    {"role": "<|Assistant|>", "content": ""},
]
inputs = dict(processor(prompt=None, conversations=test_conv, images=[test_img], return_tensors="pt"))
outputs = model.generate(**{k: v.to(device) for k, v in inputs.items()})
print(processor.decode(outputs[0]))

✅ Final Learnings and Pro Tips

Fine-tuning DeepSeek-VL2 is possible even with limited compute — if you know how to handle custom tokenization and bfloat16 mismatches and avoid memory pitfalls from external dependencies like xFormers.

  • Always check dtype and device compatibility manually before model invocation.
  • Patch external libraries like xFormers if running into missing ops.
  • Respect model-specific formats (chat templates, special tokens).
  • Create a tiny test batch first before starting full training.
  • Log every stage of preprocessing.

With this project, we not only fine-tuned DeepSeek-VL2 but also understood how fine-tuning complex VLMs involves model surgery, datatype control, and attention to structural details.

Results & Next Steps

On my custom VQA dataset, fine-tuning with LoRA reduced VRAM usage from 80GB to 24GB and improved task accuracy from ~62% (zero-shot) to ~89%. Inference latency also dropped significantly.

Next up:

  • Public GitHub repo with code and a pre-configured inference notebook
  • Experiments with video modality and federated learning
  • Tips for deploying on edge devices

Ethical Considerations

When deploying powerful multimodal models, consider:

  • Content filtering (e.g., CLIP-based NSFW detection)
  • Bias mitigation (diverse data sampling)
  • Transparency (documenting model limitations and intended use)

Stay tuned for a GitHub repo with the code and a pre-configured inference notebook! 🚀

Thanks for reading! 🙌

If you enjoyed this post, feel free to follow me on LinkedIn and GitHub, and check out my upcoming Medium articles for more deep-dive AI tutorials!


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.