Fine-Tuning DeepSeek-VL2 for Multimodal Instruction Following: A Comprehensive Technical Guide

Author(s): Ojasva Goyal

Originally published on Towards AI.

Fine-tuning large-scale vision-language models with detailed error breakdowns and best practices. Unlocking advanced vision-language capabilities with parameter-efficient adaptation

Image by Alex Shuper on Unsplash

Introduction

DeepSeek-VL2 is a multimodal large language model (MLLM) capable of interpreting and responding to both textual and visual instructions. Fine-tuning it allows tailoring its understanding and generation capabilities to specific multimodal tasks, significantly boosting performance compared to zero-shot applications. In this detailed guide, I’ll walk you through each step to effectively fine-tune DeepSeek-VL2, making complex multimodal applications accessible and achievable.

We’ll cover:

  • Dataset preparation and preprocessing
  • Model and processor configuration details
  • Training workflow with LoRA and Hugging Face’s PEFT
  • Common errors encountered (and clear step-by-step solutions)
  • Practical inference examples

Whether you’re a researcher exploring VLMs or a builder creating domain-specific AI agents, this guide will help you avoid many pitfalls and understand the process in depth.

📌 Problem Statement & Dataset Format

My goal: fine-tune DeepSeek-VL2 on a custom Visual Question Answering (VQA) dataset, where each entry includes a question, an answer, and a path to an image.

Sample JSON format:

{
  "question": "What should I do and why according to the visual?",
  "answer": "You should attend the awards because...",
  "image_path": "/path/to/image.jpg"
}

Challenges faced initially:

  • Variations in image dimensions causing model incompatibilities.
  • Texts needed to be converted into structured conversations.
  • Required multi-modal fusion: image + text.

I used datasets.Dataset for loading and preprocessing, PIL for images, and formatted each sample to fit DeepSeek-VL2’s chat template.
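As a concrete illustration, here is a minimal sketch of that per-sample formatting step, assuming the JSON fields shown above and DeepSeek-VL2's special tokens (the helper name `build_conversation` is hypothetical):

```python
import json

def build_conversation(sample: dict) -> list:
    """Turn one VQA record into DeepSeek-VL2 chat-template messages."""
    return [
        {
            "role": "<|User|>",
            # The <image> token marks where the visual input is injected.
            "content": sample["question"] + "\n<image>",
            "images": [sample["image_path"]],
        },
        # The assistant turn is left blank so the model learns to generate it.
        {"role": "<|Assistant|>", "content": ""},
    ]

record = json.loads("""{
  "question": "What should I do and why according to the visual?",
  "answer": "You should attend the awards because...",
  "image_path": "/path/to/image.jpg"
}""")
conversation = build_conversation(record)
```

In the real pipeline, each such conversation (plus the `PIL` image it references) is what gets handed to the processor.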

Fine-Tuning Workflow Overview

Here’s the pipeline at a glance:

[Dataset JSONs] 
⬇️
[Preprocessing: Format Text + Load Image]
⬇️
[Tokenizer & Processor: Add Chat Template, Image Tokens]
⬇️
[Custom Collator: Handle dynamic input_ids, single batch]
⬇️
[DeepSeek-VL2 Model + LoRA Adapters]
⬇️
[Training with Trainer API]
⬇️
[Model + Processor Save]
⬇️
[Test Inference]

Each step brought its own set of bugs and learnings, which we’ll now dive into deeply.

Understanding DeepSeek-VL2’s Chat Template

When working with large language models — especially ones trained for conversational tasks like DeepSeek-VL2 — you can’t just feed in plain text.

The input has to follow a specific conversation structure, known as the chat template.

This format helps the model understand:

  • Who is speaking (User or Assistant)
  • What type of content follows (text, image, etc.)
  • Where the model should begin its response

In DeepSeek-VL2, a typical chat sequence looks like:

<|User|>
What should I do and why should I do according to the visual advertisement?
<image>
<|Assistant|>

Important Points:

  • <|User|> and <|Assistant|> special tokens guide the model.
  • <image> token explicitly indicates where visual inputs are injected into text sequences.
  • Assistant response is kept empty initially during training so the model learns to generate it.

This format had to be manually created while pre-processing every sample.
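To make that manual formatting concrete, here is a stdlib-only sketch that flattens role/content messages into the raw prompt layout shown above (`render_chat` is a hypothetical helper; in the real pipeline the processor applies the template for you):

```python
def render_chat(messages) -> str:
    """Flatten chat messages into DeepSeek-VL2's raw prompt layout."""
    lines = []
    for m in messages:
        lines.append(m["role"])   # e.g. <|User|> or <|Assistant|>
        if m["content"]:          # a blank assistant turn emits only the role tag
            lines.append(m["content"])
    return "\n".join(lines)

prompt = render_chat([
    {"role": "<|User|>",
     "content": "What should I do and why should I do according to the visual advertisement?\n<image>"},
    {"role": "<|Assistant|>", "content": ""},
])
```

Note how the empty assistant content leaves the prompt ending at `<|Assistant|>`, which is exactly where generation should begin.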

Mistake we initially made:

  • We omitted the <eos> (end-of-sequence) token after the assistant turn, causing downstream tensor-shape mismatches.

Fix: Leave the Assistant's text as an empty string ""; the processor then appends <eos> automatically.

Without this format, you may get token mismatch errors or incorrect outputs. During preprocessing and inference, we always applied this structure using processor(..., conversations=[...], images=[...]). This ensured the model knew the question context and when to generate the answer.

Model Setup & LoRA Integration

Fine-tuning large models can be computationally expensive. DeepSeek-VL2 is too large to fine-tune fully on a single GPU. Hence, we opted for LoRA (Low-Rank Adaptation) fine-tuning using the PEFT library. LoRA significantly reduces computational and memory requirements by only adjusting a small subset of model parameters.

We used:

  • DeepSeek-VL2 from Hugging Face (manually cloned & patched)
  • LoRA via PEFT for parameter-efficient tuning
  • Trainer from transformers

LoRA configuration:

Here’s an ideal LoRA configuration:

from peft import LoraConfig, TaskType

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)

Why these settings?

  • r=8: the rank of the low-rank update; a good balance between adapter capacity and parameter count.
  • lora_alpha=16: scales the LoRA update (effective scale alpha/r = 2), which helps keep training stable.
  • target_modules: the attention query and value projections, which gave the best multimodal results.
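To see why this is parameter-efficient, here is a back-of-the-envelope count (the 2048 hidden size is a made-up example; DeepSeek-VL2's actual projection shapes vary by layer):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

full = 2048 * 2048                   # weights in one full projection matrix
lora = lora_params(2048, 2048, r=8)  # trainable weights LoRA adds instead
ratio = lora / full                  # fraction of the matrix LoRA trains
```

With r=8 the adapter trains well under 1% of the parameters of the matrix it adapts, which is what makes single-GPU fine-tuning feasible.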

Compatibility fixes:

  • Patched tokenizer to add special tokens (<image>, <|User|>, etc.)
  • Forced all tensors to bfloat16
  • Monkey-patched xFormers and DeepSeek’s custom attention to use PyTorch’s fallback for CUDA compatibility

Key setup steps:

  • Load the base DeepSeek-VL2 model:

model = DeepseekVLV2ForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)

  • Attach the LoRA adapters:

model = get_peft_model(model, lora_cfg)

A Quick Note on Mixture-of-Experts (MoE)

DeepSeek-VL2 uses a Mixture-of-Experts (MoE) architecture: only a subset of model “experts” are active for each input, making it possible to scale up the model without linearly scaling compute. For fine-tuning, LoRA adapters sit on top of the main attention and MLP layers, while the MoE routing logic remains unchanged. This is a big reason why LoRA + MoE is so memory-efficient compared to full fine-tuning.
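The routing idea can be sketched with a toy example (the gate scores, expert functions, and top-k choice below are all made up for illustration; the real model routes per token with learned gates):

```python
import math

def route_top_k(gate_scores, k=2):
    """Indices of the k highest-scoring experts."""
    return sorted(range(len(gate_scores)), key=lambda i: -gate_scores[i])[:k]

def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and softmax-weight their outputs."""
    idx = route_top_k(gate_scores, k)
    weights = [math.exp(gate_scores[i]) for i in idx]
    total = sum(weights)
    return sum(w / total * experts[i](x) for w, i in zip(weights, idx))

# Four toy "experts" that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(10.0, experts, gate_scores=[0.1, 2.0, 0.5, 1.5], k=2)
```

Only k of the experts run per input, so adding experts grows capacity without growing per-token compute, which is the property the article relies on.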

Key Errors and Deep Debugging Journey

During the project, I hit many non-trivial errors. Here are major ones with cause, effect, and final solution:

🔴 1. AttributeError: 'BatchCollateOutput' object has no attribute 'items'

Cause: Processor output was a BatchCollateOutput, not a standard dict.

Impact: Model forwarding failed as it couldn’t accept it.

Solution:

inputs = dict(processor(...))

This explicitly converts the processor output back into a plain dictionary that the model's forward pass accepts.

🔴 2. RuntimeError: expected sequence of length 1658 at dim 1 (got 1833)

Cause: Different input_ids lengths per batch sample.

Impact: default_data_collator failed.

Solution:

  • Created a custom collator that assumes batch size = 1.
  • No padding needed; we only add the batch dimension (unsqueeze) manually:

batch = {k: v.unsqueeze(0) for k, v in sample.items()}
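Putting that together, the whole batch-size-1 collator can be as small as this sketch (`collate_single` is a hypothetical name; it assumes every value in the sample exposes a torch-style `.unsqueeze(0)`):

```python
def collate_single(sample: dict) -> dict:
    """Add a leading batch dimension to every tensor in one sample.

    No padding is needed because each batch holds exactly one sample,
    which sidesteps the variable-length input_ids problem entirely.
    """
    return {k: v.unsqueeze(0) for k, v in sample.items()}
```

Passing this as `data_collator` replaces `default_data_collator`, which was the component that crashed on mismatched sequence lengths.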

🔴 3. RuntimeError: Input type (float) and bias type (c10::BFloat16) should be the same

Cause: Image tensors were in float32 while model expected bfloat16.

Impact: Crashed at the convolutional layer.

Solution: Explicit conversion during preprocessing:

images = images.to(torch.bfloat16)

🔴 4. AttributeError: 'Dummy' object has no attribute 'sft_format'

Cause: Wrong format being passed to DeepSeek’s processor.

Impact: Processor expected an SFT format (chat + image tuple).

Solution: Ensure proper conversation format during preprocessing using:

processor(prompt=None, conversations=..., images=...)

🔴 5. NotImplementedError: No operator found for memory_efficient_attention_forward

Cause: xFormers wasn’t built for the CUDA/PyTorch version.

Impact: Attention operator missing at runtime.

Solution: Monkey-patch the xFormers entry point to fall back to PyTorch's native scaled-dot-product attention:

fmha.memory_efficient_attention = lambda *args, **kwargs: F.scaled_dot_product_attention(*args, **kwargs)

Caveat: this passthrough only works where call sites pass q, k, v positionally; xFormers' attn_bias keyword does not map directly onto SDPA's attn_mask, and the two APIs generally expect different tensor layouts, so keyword-using call sites would need explicit remapping.

🔴 6. AssertionError: input_ids[-1] != eos_id

Cause: Assistant message didn’t include an <eos> token.

Impact: Processor expected the Assistant part to terminate correctly.

Solution: During testing, we made sure to leave the assistant message empty ("") so the processor appends <eos> automatically.

{"role": "<|Assistant|>", "content": ""}

Training Pipeline

After solving all the preprocessing, device, and datatype mismatches and injecting LoRA, I launched training with:

trainer.train()

We also set logging_steps and gradient_accumulation_steps, and increased num_train_epochs to 10 for better convergence.
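As a configuration sketch, the Trainer setup looked roughly like the following. The output_dir, batch size, and learning rate are illustrative assumptions (not the article's exact values), and `model`, `train_ds`, and `collate_fn` stand for the LoRA-wrapped model, preprocessed dataset, and custom collator built earlier:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./vl2_finetuned_lora",  # assumed path
    per_device_train_batch_size=1,      # matches the batch-size-1 collator
    gradient_accumulation_steps=8,      # simulate a larger effective batch
    num_train_epochs=10,
    logging_steps=10,
    bf16=True,                          # keep everything in bfloat16
    learning_rate=2e-4,                 # common LoRA default, assumed
    remove_unused_columns=False,        # keep image tensors in the batch
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=collate_fn,
)
trainer.train()
```

Note `remove_unused_columns=False`: without it, Trainer silently drops columns the model signature doesn't name, which can strip the image tensors from multimodal batches.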

Weights were saved post-training with:

model.save_pretrained("./vl2_finetuned_lora_saved")
processor.save_pretrained("./vl2_finetuned_lora_saved")

Training logs showed a steadily decreasing loss, a good sign!

Inference Example

Instead of relying on dataset samples, we picked an explicit image:

test_path = "/path/to/test_image.jpg"
test_img = Image.open(test_path).convert("RGB")

We generated conversation inputs and passed them through the model:

test_conv = [
    {"role": "<|User|>", "content": "What should I do and why should I do according to the visual?"},
    {"role": "<|Assistant|>", "content": ""},
]
inputs = dict(processor(prompt=None, conversations=test_conv, images=[test_img], return_tensors="pt"))
outputs = model.generate(**{k: v.to(device) for k, v in inputs.items()})
print(processor.decode(outputs[0]))

✅ Final Learnings and Pro Tips

Fine-tuning DeepSeek-VL2 is possible even with limited compute — if you know how to handle custom tokenization and bfloat16 mismatches and avoid memory pitfalls from external dependencies like xFormers.

  • Always check dtype and device compatibility manually before model invocation.
  • Patch external libraries like xFormers if running into missing ops.
  • Respect model-specific formats (chat templates, special tokens).
  • Create a tiny test batch first before starting full training.
  • Log every stage of preprocessing.

With this project, we not only fine-tuned DeepSeek-VL2 but also understood how fine-tuning complex VLMs involves model surgery, datatype control, and attention to structural details.

Results & Next Steps

On my custom VQA dataset, fine-tuning with LoRA reduced VRAM usage from 80GB to 24GB and improved task accuracy from ~62% (zero-shot) to ~89%. Inference latency also dropped significantly.

Next up:

  • Public GitHub repo with code and a pre-configured inference notebook
  • Experiments with video modality and federated learning
  • Tips for deploying on edge devices

Ethical Considerations

When deploying powerful multimodal models, consider:

  • Content filtering (e.g., CLIP-based NSFW detection)
  • Bias mitigation (diverse data sampling)
  • Transparency (documenting model limitations and intended use)

Stay tuned for a GitHub repo with the code and a pre-configured inference notebook! 🚀

Thanks for reading! 🙌

If you enjoyed this post, feel free to follow me on LinkedIn and GitHub, and check out my upcoming Medium articles for more deep-dive AI tutorials!


Published via Towards AI



Note: Content contains the views of the contributing authors and not Towards AI.