
Dense Passage Retrieval (2020) and Contriever (2021): The Models That Paved the Way for Future, Smarter LLMs
Author(s): Saif Ali Kheraj
Originally published on Towards AI.
Dense Passage Retrieval (DPR) marked a turning point in open-domain question answering when it was introduced in 2020. It demonstrated that dense vector representations, learned by deep neural networks, could outperform traditional sparse retrieval methods like BM25, especially on top-k recall benchmarks. Since DPR's release, the dense retrieval landscape has evolved rapidly. This post provides a deep dive into the architectures, advancements, and key lessons learned.
This is intended for AI researchers, MLOps engineers, and data scientists building cutting-edge retrieval systems.
Dual-Encoder (Two-Tower) Models
Architecture
A dual-encoder model (also called a bi-encoder or Siamese network) consists of two independent BERT-based encoders: one for queries and one for documents/passages.

Each encoder maps its input into a fixed-length dense vector:
- Query embedding:
Q = Encoder(query)
- Document embedding:
D = Encoder(document)
These vectors are compared using a similarity score, typically a dot product or cosine similarity:

sim(Q, D) = Q · D (dot product), or sim(Q, D) = Q · D / (‖Q‖ ‖D‖) (cosine similarity)
Unlike cross-encoders, which jointly encode the query and document to model their interactions, dual-encoders encode them independently, enabling fast and scalable retrieval. During backpropagation, the gradients from the shared loss flow through each encoder individually, updating both sets of parameters separately, even though they are part of the same computation graph.
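To make the two-tower setup concrete, here is a minimal PyTorch sketch using the Hugging Face transformers library. The model name (bert-base-uncased), CLS pooling, and dot-product scoring are common conventions assumed for illustration; this is not DPR's exact training code.

```python
# Minimal dual-encoder sketch (illustrative, not DPR's exact implementation).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")    # ~110M parameters
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")  # ~110M parameters (separate copy)

def encode(encoder, texts):
    """Map a list of strings to fixed-length dense vectors via the [CLS] token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # (batch, 768) CLS embeddings

Q = encode(query_encoder, ["What is the capital of Italy?"])
D = encode(passage_encoder, ["Rome is the capital of Italy...",
                             "Paris is the capital of France..."])

scores = Q @ D.T  # dot-product similarity: higher = more relevant
print(scores)
```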
Goal
To maximize retrieval performance in tasks like:
- Open-domain Question Answering (QA)
- Dense information retrieval
- Large-scale document search
Parameters
Typically ~220 million (using two BERT-base encoders)
- Each encoder has ~110M parameters
- Can be initialized separately or from the same base model
Embedding Space
Produces two independent embedding spaces for queries and documents. These are aligned during training so that relevant pairs are closer in vector space. Semantic similarity, not lexical overlap, drives retrieval.
Training
Supervised Dataset (Triplet Format):
- Query (Q): A natural question (e.g., “What is the capital of Italy?”)
- Positive passage (D⁺): A relevant answer passage (e.g., “Rome is the capital of Italy…”)
- Negative passages (D⁻): Irrelevant or hard negatives (e.g., “Paris is the capital of France…”)
Batch Construction:
Each training batch pairs every query with its positive passage; the positives belonging to the other queries in the batch then double as in-batch negatives, and DPR optionally adds a hard negative retrieved with BM25.
What the Model Learns
Semantic alignment, not just keyword matching. The model understands:
- Synonyms
- Paraphrases
- Semantic relations
In doing so, it trains a ranking function that distinguishes relevant from irrelevant documents.
Loss Function
A softmax-based cross-entropy loss (negative log-likelihood) over the similarity scores:

L(Q, D⁺) = −log [ exp(sim(Q, D⁺)) / Σᵢ exp(sim(Q, Dᵢ)) ]

Where:
- D⁺ = the positive passage
- Dᵢ = all passages scored for the query (the positive plus the in-batch and hard negatives)
Objective: Maximize similarity between query and positive passage, minimize similarity to all negative passages.
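As a rough sketch of this objective, the snippet below computes the in-batch softmax cross-entropy in PyTorch. The embeddings are random placeholders standing in for encoder outputs; batch size and dimensions are illustrative.

```python
# Sketch of the DPR-style in-batch negative log-likelihood (illustrative).
import torch
import torch.nn.functional as F

batch_size, dim = 8, 768
q_emb = torch.randn(batch_size, dim, requires_grad=True)  # query embeddings
p_emb = torch.randn(batch_size, dim, requires_grad=True)  # positive passage embeddings

# scores[i, j] = sim(query_i, passage_j); the diagonal holds the positives,
# every off-diagonal passage acts as an in-batch negative for query_i.
scores = q_emb @ p_emb.T                      # (batch, batch)
labels = torch.arange(batch_size)             # index of each query's positive
loss = F.cross_entropy(scores, labels)        # softmax cross-entropy == negative log-likelihood
loss.backward()                               # gradients flow into both encoders
```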
Zero-Shot Retrieval Use Case
DPR works especially well for Open-Domain Question Answering tasks — for example, using the Natural Questions dataset.
Example:
- Query: “What is the capital of Italy?”
- Positive passage (D⁺): “Rome is the capital of Italy and is known for the Colosseum.”
- Negative passage (D⁻): “Paris is the capital of France and a popular tourist destination.”
During training, DPR learns how to match questions with correct answers — not by memorizing exact pairs, but by learning general patterns.
As a result, even if it never saw this exact question during training, at test time it can still retrieve the right answer. This is called zero-shot retrieval — the model can handle new, unseen questions without needing to be retrained. Open QA tasks are ideal to test this ability, because they contain diverse, open-ended questions — perfect for evaluating zero-shot generalization.
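Below is a hedged example of this zero-shot behaviour, assuming the publicly released DPR checkpoints on the Hugging Face hub (facebook/dpr-question_encoder-single-nq-base and facebook/dpr-ctx_encoder-single-nq-base) and their pooler outputs as embeddings:

```python
# Zero-shot style retrieval with the released DPR checkpoints (checkpoint names
# assumed from the Hugging Face hub); the query and passages are the example above.
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_enc = DPRQuestionEncoder.from_pretrained(q_name)
c_tok = DPRContextEncoderTokenizer.from_pretrained(c_name)
c_enc = DPRContextEncoder.from_pretrained(c_name)

passages = [
    "Rome is the capital of Italy and is known for the Colosseum.",
    "Paris is the capital of France and a popular tourist destination.",
]
with torch.no_grad():
    q_vec = q_enc(**q_tok("What is the capital of Italy?", return_tensors="pt")).pooler_output
    p_vec = c_enc(**c_tok(passages, padding=True, truncation=True,
                          return_tensors="pt")).pooler_output

scores = (q_vec @ p_vec.T).squeeze(0)   # one similarity score per passage
print(passages[int(scores.argmax())])   # expected: the Rome passage
```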
Benefits and Limitations
Benefits:
- Outperforms sparse methods like BM25 on top-k recall for open-domain QA
- Fast, scalable retrieval: passages are encoded independently of the query, so their embeddings can be precomputed and indexed
Limitations:
- Depends on supervised query-passage training data
- Zero-shot generalization to new domains is only moderate (see the BEIR results below)
- Two separate encoders roughly double the parameter count (~220M)
Evaluation
BEIR Benchmark (Zero-Shot Evaluation)
DPR's performance on the BEIR benchmark, which assesses zero-shot retrieval capabilities across diverse datasets, is moderate:
- Average nDCG@10: approximately 39–40%
Note: While DPR excels in QA-specific datasets, its zero-shot performance on BEIR is moderate, highlighting challenges in generalizing to diverse domains without fine-tuning.
Shared Encoder Models: Contriever (2021)
Contriever, released by Meta AI in 2021, marked a significant evolution in dense retrieval by moving from supervised, dual-encoder architectures (like DPR) to unsupervised, shared encoder models. Unlike DPR, which uses separate encoders for queries and documents, Contriever uses a single shared BERT encoder for both, trained without labeled QA data.
This shift enabled powerful zero-shot generalization, outperforming supervised models like DPR on BEIR without any fine-tuning. This section explores Contriever’s architecture, training method, inference setup, and performance.
Architecture: Shared Encoder (Siamese Network)
Contriever uses one shared encoder:
- Single BERT-base model for both query and document
- Input agnostic: treats all inputs as generic text (no special query/document role)

Similarity is computed using cosine or dot product between embeddings. This design forces the embeddings to live in a unified semantic space, improving generalization across tasks.
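As an illustration, the sketch below encodes queries and documents with the public facebook/contriever checkpoint (listed in the references), using mean pooling over token embeddings as described on the model card. Treat it as a usage sketch rather than the authors' exact pipeline.

```python
# Shared-encoder retrieval sketch with the public facebook/contriever checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")  # one BERT-base for both roles

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_emb = encoder(**batch).last_hidden_state       # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # ignore padding tokens
    return (token_emb * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling

# Queries and documents go through the exact same function and weights.
q = embed(["Benefits of intermittent fasting"])
d = embed(["Time-restricted eating improves insulin sensitivity.",
           "Paris is the capital of France."])
print(q @ d.T)  # dot-product similarities in the single shared space
```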
Goal
- Train a general-purpose retriever that works zero-shot on diverse domains
- Avoid dependence on QA-specific supervision
- Strong performance on BEIR and domain-shifted tasks
Parameters
Single BERT-base (110M parameters). Lower compute/storage cost vs DPR (which has 2x BERT).
Embedding Space
A shared space for both queries and documents. No role-specific specialization. Embeddings capture generic semantic structure, useful for broad retrieval tasks.
Training
Training Data:
No human supervision or QA pairs; training uses generic text corpora such as Wikipedia and Common Crawl.
Method: Unsupervised Contrastive Pretraining
Follows the InfoNCE loss:

L(x, x⁺) = −log [ exp(sim(x, x⁺)/τ) / Σᵢ exp(sim(x, xᵢ)/τ) ]

where τ is a temperature hyperparameter and the sum runs over the positive and all negatives in the batch.
- Positives (x⁺): Augmented views of the same passage (e.g., span masking, cropping)
- Negatives (x⁻): Other texts in the batch

Data Augmentation:
- Random span masking (30%)
- Random (independent) cropping
- Dropout noise (p=0.1)
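The snippet below sketches how such augmentations can produce two positive "views" of the same passage. The cropping ratio and helper functions are illustrative assumptions, not the paper's verbatim pipeline.

```python
# Illustrative augmentation for contrastive pretraining: two "views" of the
# same passage via random cropping and span masking.
import random

def random_crop(tokens, ratio=0.5):
    """Keep a random contiguous span covering roughly `ratio` of the tokens."""
    span = max(1, int(len(tokens) * ratio))
    start = random.randint(0, len(tokens) - span)
    return tokens[start:start + span]

def span_mask(tokens, mask_ratio=0.3, mask_token="[MASK]"):
    """Replace a random subset of tokens with [MASK]."""
    return [mask_token if random.random() < mask_ratio else t for t in tokens]

passage = "Time restricted eating improves insulin sensitivity in adults".split()
view_a = span_mask(random_crop(passage))   # positive pair: two augmented views
view_b = span_mask(random_crop(passage))   # of the same passage
print(view_a, view_b, sep="\n")
```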
What the Model Learns
Semantic similarity via data augmentation, not explicit answers. Learns to embed similar texts close together. Excellent zero-shot capabilities due to general-purpose alignment.
Loss Function
InfoNCE (contrastive loss). Embedding similarity maximized for augmented pairs, minimized for other texts.
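A minimal PyTorch sketch of this loss, with placeholder embeddings for the two augmented views and an assumed temperature value:

```python
# InfoNCE sketch (illustrative): each embedding's augmented view is its positive,
# all other embeddings in the batch serve as negatives.
import torch
import torch.nn.functional as F

batch, dim, tau = 16, 768, 0.05                        # temperature is an assumption
emb_a = torch.randn(batch, dim, requires_grad=True)    # encoder output, view 1
emb_b = torch.randn(batch, dim, requires_grad=True)    # encoder output, view 2
z_a, z_b = F.normalize(emb_a, dim=-1), F.normalize(emb_b, dim=-1)

logits = (z_a @ z_b.T) / tau             # cosine similarities scaled by temperature
targets = torch.arange(batch)            # positive = matching row/column index
loss = F.cross_entropy(logits, targets)  # pull augmented pairs together, push others apart
loss.backward()
```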
Zero-Shot Use Case
Contriever is trained to retrieve relevant supporting facts or arguments, even in new domains (such as biomedical, legal, or news), without requiring domain-specific supervision.
Example:
- Input 1: “Benefits of intermittent fasting”
- Input 2 (positive passage): “Time-restricted eating improves insulin sensitivity.”
Like DPR, Contriever learns semantic similarity patterns through contrastive learning. However, Contriever's training does not rely on QA-specific supervision; it uses unsupervised data augmentation instead. As a result, it generalizes better in zero-shot settings, especially to domains it was never fine-tuned on. Contriever also uses a shared encoder: both queries and documents pass through the same model, making it more flexible for general-purpose retrieval.
Benefits and Limitations
Benefits:
- No labeled QA data required: training is fully unsupervised
- A single shared encoder means lower compute and storage cost than DPR's two towers
- Strong zero-shot generalization across diverse domains (BEIR)
Limitations:
- No query/document specialization, since both roles share one encoder
- Without fine-tuning, it can still trail supervised retrievers on the specific QA datasets they were trained on
Shared Encoder vs. Separate Encoders
Dual-encoder setups can either use the same underlying network (shared weights) for queries and documents or use two separate networks. DPR originally used separate BERT encoders for questions and passages (allowing them to specialize), whereas some later models (like Contriever) use one BERT model for both queries and docs (tying weights).

Shared encoders force the query and document embeddings into one common space, which can improve zero-shot transfer (the model sees generic text regardless of role). Separate encoders, however, can accommodate query/document length differences or specialized training (e.g., short questions vs. long documents).
In practice, many systems keep separate encoder instances but initialize them identically or from the same pre-trained model. The key is ensuring the two embeddings are compatible — query vectors and doc vectors must lie in the same semantic space to be comparable. Community discussions often stress avoiding encoder mismatch, i.e. not using different model families or training objectives for query vs. doc, since that would yield incomparable embeddings.
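The difference between the two setups often comes down to a few lines of initialization code. The sketch below, with an illustrative base checkpoint, contrasts tying the weights against keeping two separately trainable copies:

```python
# Two ways to instantiate a dual encoder (sketch): tied weights vs. two copies
# initialized from the same checkpoint. The model name is illustrative.
from transformers import AutoModel

base = "bert-base-uncased"

# (a) Shared encoder (Contriever-style): one module serves both roles.
shared = AutoModel.from_pretrained(base)
query_encoder = doc_encoder = shared            # same object, tied parameters

# (b) Separate encoders (DPR-style): identical init, parameters diverge in training.
query_encoder = AutoModel.from_pretrained(base)
doc_encoder = AutoModel.from_pretrained(base)   # can now specialize for long passages
```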
While both shared encoder models and DPR (Dense Passage Retriever) aim to improve dense retrieval for open-domain question answering, their training approaches differ due to their architectural designs.
Performance Comparison
On BEIR, DPR averages roughly 39–40% nDCG@10, while Contriever, trained without any labels and with half the parameters, outperforms it in zero-shot evaluation across the benchmark's diverse domains. On the QA datasets DPR was trained on, supervised training still pays off.
Conclusion
The evolution from DPR to Contriever represents a fundamental shift in how we approach dense retrieval. DPR demonstrated that dense representations could outperform sparse methods, but required supervised training data. Contriever showed that unsupervised training with a shared encoder could achieve even better zero-shot performance across diverse domains.
These models laid the foundation for modern retrieval systems used in RAG pipelines and continue to influence the development of more sophisticated retrieval architectures. The key insights — that semantic similarity matters more than lexical overlap, that unsupervised pretraining can be surprisingly effective, and that encoder alignment is crucial — remain relevant as we build the next generation of retrieval systems.
Future work in this space continues to build on these foundational concepts, exploring reasoning-aware retrieval, multi-modal approaches, and hybrid architectures that combine the best of both supervised and unsupervised methods.
References
[1] https://aclanthology.org/2020.emnlp-main.550/
[2] https://arxiv.org/abs/2112.09118
[3] https://github.com/facebookresearch/contriever
[4] https://huggingface.co/facebook/contriever
[5] https://genai-course.jding.org/rag/index.html
Published via Towards AI