
Dense Passage Retrieval (2020) and Contriever (2021): The Models That Paved the Way for Future, Smarter LLMs
Author(s): Saif Ali Kheraj
Originally published on Towards AI.
Dense Passage Retrieval (DPR) marked a turning point in open-domain question answering when it was introduced in 2020. It demonstrated that dense vector representations, learned by deep neural networks, could outperform traditional sparse retrieval methods like BM25, especially on top-k recall benchmarks. Since DPR's release, the dense retrieval landscape has evolved rapidly. This post provides a deep dive into the architectures, advancements, and key lessons learned.
This is intended for AI researchers, MLOps engineers, and data scientists building cutting-edge retrieval systems.
Dual-Encoder (Two-Tower) Models
Architecture
A dual-encoder model (also called a bi-encoder or Siamese network) consists of two independent BERT-based encoders: one for queries and one for documents/passages.

Each encoder maps its input into a fixed-length dense vector:
- Query embedding:
Q = Encoder(query)
- Document embedding:
D = Encoder(document)
These vectors are compared using a similarity score, typically a dot product or cosine similarity:

sim(Q, D) = Q · D (dot product), or sim(Q, D) = Q · D / (‖Q‖ ‖D‖) (cosine similarity)
Unlike cross-encoders, which jointly encode the query and document to model their interactions, dual-encoders encode them independently, enabling fast and scalable retrieval. During backpropagation, the gradients from the shared loss flow through each encoder individually, updating both sets of parameters separately, even though they are part of the same computation graph.
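To make the two-tower setup concrete, here is a minimal PyTorch sketch using the Hugging Face transformers library. The model name (bert-base-uncased), CLS pooling, and dot-product scoring are common conventions assumed for illustration; this is not DPR's exact training code.

```python
# Minimal dual-encoder sketch (illustrative, not DPR's exact implementation).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")    # ~110M parameters
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")  # ~110M parameters (separate copy)

def encode(encoder, texts):
    """Map a list of strings to fixed-length dense vectors via the [CLS] token."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # (batch, 768) CLS embeddings

Q = encode(query_encoder, ["What is the capital of Italy?"])
D = encode(passage_encoder, ["Rome is the capital of Italy...",
                             "Paris is the capital of France..."])

scores = Q @ D.T  # dot-product similarity: higher = more relevant
print(scores)
```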
Goal
To maximize retrieval performance in tasks like:
- Open-domain Question Answering (QA)
- Dense information retrieval
- Large-scale document search
Parameters
Typically ~220 million (using two BERT-base encoders)
- Each encoder has ~110M parameters
- Can be initialized separately or from the same base model
Embedding Space
Produces two independent embedding spaces for queries and documents. These are aligned during training so that relevant pairs are closer in vector space. Semantic similarity, not lexical overlap, drives retrieval.
Training
Supervised Dataset (Triplet Format):
- Query (Q): A natural question (e.g., “What is the capital of Italy?”)
- Positive passage (D⁺): A relevant answer passage (e.g., “Rome is the capital of Italy…”)
- Negative passages (D⁻): Irrelevant or hard negatives (e.g., “Paris is the capital of France…”)
Batch Construction:
Each training batch pairs every query with its positive passage; the positives belonging to the other queries in the batch then double as in-batch negatives, and DPR optionally adds a hard negative retrieved with BM25.
What the Model Learns
Semantic alignment, not just keyword matching. The model understands:
- Synonyms
- Paraphrases
- Semantic relations
In doing so, it trains a ranking function that distinguishes relevant from irrelevant documents.
Loss Function
A softmax-based cross-entropy loss (negative log-likelihood) over the similarity scores:

L(Q, D⁺) = −log [ exp(sim(Q, D⁺)) / Σᵢ exp(sim(Q, Dᵢ)) ]

Where:
- D⁺ = the positive passage
- Dᵢ = all passages scored for the query (the positive plus the in-batch and hard negatives)
Objective: Maximize similarity between query and positive passage, minimize similarity to all negative passages.
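As a rough sketch of this objective, the snippet below computes the in-batch softmax cross-entropy in PyTorch. The embeddings are random placeholders standing in for encoder outputs; batch size and dimensions are illustrative.

```python
# Sketch of the DPR-style in-batch negative log-likelihood (illustrative).
import torch
import torch.nn.functional as F

batch_size, dim = 8, 768
q_emb = torch.randn(batch_size, dim, requires_grad=True)  # query embeddings
p_emb = torch.randn(batch_size, dim, requires_grad=True)  # positive passage embeddings

# scores[i, j] = sim(query_i, passage_j); the diagonal holds the positives,
# every off-diagonal passage acts as an in-batch negative for query_i.
scores = q_emb @ p_emb.T                      # (batch, batch)
labels = torch.arange(batch_size)             # index of each query's positive
loss = F.cross_entropy(scores, labels)        # softmax cross-entropy == negative log-likelihood
loss.backward()                               # gradients flow into both encoders
```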
Zero-Shot Retrieval Use Case
DPR works especially well for Open-Domain Question Answering tasks — for example, using the Natural Questions dataset.
Example:
- Query: “What is the capital of Italy?”
- Positive passage (D⁺): “Rome is the capital of Italy and is known for the Colosseum.”
- Negative passage (D⁻): “Paris is the capital of France and a popular tourist destination.”
During training, DPR learns how to match questions with correct answers — not by memorizing exact pairs, but by learning general patterns.
As a result, even if it never saw this exact question during training, at test time it can still retrieve the right answer. This is called zero-shot retrieval — the model can handle new, unseen questions without needing to be retrained. Open QA tasks are ideal to test this ability, because they contain diverse, open-ended questions — perfect for evaluating zero-shot generalization.
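Below is a hedged example of this zero-shot behaviour, assuming the publicly released DPR checkpoints on the Hugging Face hub (facebook/dpr-question_encoder-single-nq-base and facebook/dpr-ctx_encoder-single-nq-base) and their pooler outputs as embeddings:

```python
# Zero-shot style retrieval with the released DPR checkpoints (checkpoint names
# assumed from the Hugging Face hub); the query and passages are the example above.
import torch
from transformers import (DPRContextEncoder, DPRContextEncoderTokenizer,
                          DPRQuestionEncoder, DPRQuestionEncoderTokenizer)

q_name = "facebook/dpr-question_encoder-single-nq-base"
c_name = "facebook/dpr-ctx_encoder-single-nq-base"
q_tok = DPRQuestionEncoderTokenizer.from_pretrained(q_name)
q_enc = DPRQuestionEncoder.from_pretrained(q_name)
c_tok = DPRContextEncoderTokenizer.from_pretrained(c_name)
c_enc = DPRContextEncoder.from_pretrained(c_name)

passages = [
    "Rome is the capital of Italy and is known for the Colosseum.",
    "Paris is the capital of France and a popular tourist destination.",
]
with torch.no_grad():
    q_vec = q_enc(**q_tok("What is the capital of Italy?", return_tensors="pt")).pooler_output
    p_vec = c_enc(**c_tok(passages, padding=True, truncation=True,
                          return_tensors="pt")).pooler_output

scores = (q_vec @ p_vec.T).squeeze(0)   # one similarity score per passage
print(passages[int(scores.argmax())])   # expected: the Rome passage
```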
Benefits and Limitations
Benefits:
- Outperforms sparse methods like BM25 on top-k recall for open-domain QA
- Fast, scalable retrieval: passages are encoded independently of the query, so their embeddings can be precomputed and indexed
Limitations:
- Depends on supervised query-passage training data
- Zero-shot generalization to new domains is only moderate (see the BEIR results below)
- Two separate encoders roughly double the parameter count (~220M)
Evaluation
BEIR Benchmark (Zero-Shot Evaluation)
DPR's performance on the BEIR benchmark, which assesses zero-shot retrieval capabilities across diverse datasets, is moderate:
- Average nDCG@10: approximately 39–40%
Note: While DPR excels in QA-specific datasets, its zero-shot performance on BEIR is moderate, highlighting challenges in generalizing to diverse domains without fine-tuning.
Shared Encoder Models: Contriever (2021)
Contriever, released by Meta AI in 2021, marked a significant evolution in dense retrieval by moving from supervised, dual-encoder architectures (like DPR) to unsupervised, shared encoder models. Unlike DPR, which uses separate encoders for queries and documents, Contriever uses a single shared BERT encoder for both, trained without labeled QA data.
This shift enabled powerful zero-shot generalization, outperforming supervised models like DPR on BEIR without any fine-tuning. This section explores Contriever’s architecture, training method, inference setup, and performance.
Architecture: Shared Encoder (Siamese Network)
Contriever uses one shared encoder:
- Single BERT-base model for both query and document
- Input agnostic: treats all inputs as generic text (no special query/document role)

Similarity is computed using cosine or dot product between embeddings. This design forces the embeddings to live in a unified semantic space, improving generalization across tasks.
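As an illustration, the sketch below encodes queries and documents with the public facebook/contriever checkpoint (listed in the references), using mean pooling over token embeddings as described on the model card. Treat it as a usage sketch rather than the authors' exact pipeline.

```python
# Shared-encoder retrieval sketch with the public facebook/contriever checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")  # one BERT-base for both roles

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        token_emb = encoder(**batch).last_hidden_state       # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # ignore padding tokens
    return (token_emb * mask).sum(dim=1) / mask.sum(dim=1)   # mean pooling

# Queries and documents go through the exact same function and weights.
q = embed(["Benefits of intermittent fasting"])
d = embed(["Time-restricted eating improves insulin sensitivity.",
           "Paris is the capital of France."])
print(q @ d.T)  # dot-product similarities in the single shared space
```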
Goal
- Train a general-purpose retriever that works zero-shot on diverse domains
- Avoid dependence on QA-specific supervision
- Strong performance on BEIR and domain-shifted tasks
Parameters
Single BERT-base (110M parameters). Lower compute/storage cost vs DPR (which has 2x BERT).
Embedding Space
A shared space for both queries and documents. No role-specific specialization. Embeddings capture generic semantic structure, useful for broad retrieval tasks.
Training
Training Data:
No human supervision or QA pairs; training uses generic text corpora such as Wikipedia and Common Crawl.
Method: Unsupervised Contrastive Pretraining
Follows the InfoNCE loss:

L(x, x⁺) = −log [ exp(sim(x, x⁺)/τ) / Σᵢ exp(sim(x, xᵢ)/τ) ]

where τ is a temperature hyperparameter and the sum runs over the positive and all negatives in the batch.
- Positives (x⁺): Augmented views of the same passage (e.g., span masking, cropping)
- Negatives (x⁻): Other texts in the batch

Data Augmentation:
- Random span masking (30%)
- Random (independent) cropping
- Dropout noise (p=0.1)
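The snippet below sketches how such augmentations can produce two positive "views" of the same passage. The cropping ratio and helper functions are illustrative assumptions, not the paper's verbatim pipeline.

```python
# Illustrative augmentation for contrastive pretraining: two "views" of the
# same passage via random cropping and span masking.
import random

def random_crop(tokens, ratio=0.5):
    """Keep a random contiguous span covering roughly `ratio` of the tokens."""
    span = max(1, int(len(tokens) * ratio))
    start = random.randint(0, len(tokens) - span)
    return tokens[start:start + span]

def span_mask(tokens, mask_ratio=0.3, mask_token="[MASK]"):
    """Replace a random subset of tokens with [MASK]."""
    return [mask_token if random.random() < mask_ratio else t for t in tokens]

passage = "Time restricted eating improves insulin sensitivity in adults".split()
view_a = span_mask(random_crop(passage))   # positive pair: two augmented views
view_b = span_mask(random_crop(passage))   # of the same passage
print(view_a, view_b, sep="\n")
```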
What the Model Learns
Semantic similarity via data augmentation, not explicit answers. Learns to embed similar texts close together. Excellent zero-shot capabilities due to general-purpose alignment.
Loss Function
InfoNCE (contrastive loss). Embedding similarity maximized for augmented pairs, minimized for other texts.
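A minimal PyTorch sketch of this loss, with placeholder embeddings for the two augmented views and an assumed temperature value:

```python
# InfoNCE sketch (illustrative): each embedding's augmented view is its positive,
# all other embeddings in the batch serve as negatives.
import torch
import torch.nn.functional as F

batch, dim, tau = 16, 768, 0.05                        # temperature is an assumption
emb_a = torch.randn(batch, dim, requires_grad=True)    # encoder output, view 1
emb_b = torch.randn(batch, dim, requires_grad=True)    # encoder output, view 2
z_a, z_b = F.normalize(emb_a, dim=-1), F.normalize(emb_b, dim=-1)

logits = (z_a @ z_b.T) / tau             # cosine similarities scaled by temperature
targets = torch.arange(batch)            # positive = matching row/column index
loss = F.cross_entropy(logits, targets)  # pull augmented pairs together, push others apart
loss.backward()
```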
Zero-Shot Use Case
Contriever is trained to retrieve relevant supporting facts or arguments, even in new domains (such as biomedical, legal, or news), without requiring domain-specific supervision.
Example:
- Input 1: “Benefits of intermittent fasting”
- Input 2 (positive passage): “Time-restricted eating improves insulin sensitivity.”
Like DPR, Contriever learns semantic similarity patterns through contrastive learning. However, Contriever's training does not rely on QA-specific supervision; it uses unsupervised data augmentation instead. As a result, it generalizes better in zero-shot settings, especially to domains it was never fine-tuned on. Contriever also uses a shared encoder: both queries and documents pass through the same model, making it more flexible for general-purpose retrieval.
Benefits and Limitations
Benefits:
- No labeled QA data required: training is fully unsupervised
- A single shared encoder means lower compute and storage cost than DPR's two towers
- Strong zero-shot generalization across diverse domains (BEIR)
Limitations:
- No query/document specialization, since both roles share one encoder
- Without fine-tuning, it can still trail supervised retrievers on the specific QA datasets they were trained on
Shared Encoder vs. Separate Encoders
Dual-encoder setups can either use the same underlying network (shared weights) for queries and documents or use two separate networks. DPR originally used separate BERT encoders for questions and passages (allowing them to specialize), whereas some later models (like Contriever) use one BERT model for both queries and docs (tying weights).

Shared encoders force the query and document embeddings into one common space, which can improve zero-shot transfer (the model sees generic text regardless of role). Separate encoders, however, can accommodate query/document length differences or specialized training (e.g., short questions vs. long documents).
In practice, many systems keep separate encoder instances but initialize them identically or from the same pre-trained model. The key is ensuring the two embeddings are compatible — query vectors and doc vectors must lie in the same semantic space to be comparable. Community discussions often stress avoiding encoder mismatch, i.e. not using different model families or training objectives for query vs. doc, since that would yield incomparable embeddings.
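The difference between the two setups often comes down to a few lines of initialization code. The sketch below, with an illustrative base checkpoint, contrasts tying the weights against keeping two separately trainable copies:

```python
# Two ways to instantiate a dual encoder (sketch): tied weights vs. two copies
# initialized from the same checkpoint. The model name is illustrative.
from transformers import AutoModel

base = "bert-base-uncased"

# (a) Shared encoder (Contriever-style): one module serves both roles.
shared = AutoModel.from_pretrained(base)
query_encoder = doc_encoder = shared            # same object, tied parameters

# (b) Separate encoders (DPR-style): identical init, parameters diverge in training.
query_encoder = AutoModel.from_pretrained(base)
doc_encoder = AutoModel.from_pretrained(base)   # can now specialize for long passages
```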
While both shared encoder models and DPR (Dense Passage Retriever) aim to improve dense retrieval for open-domain question answering, their training approaches differ due to their architectural designs.
Performance Comparison
On BEIR, DPR averages roughly 39–40% nDCG@10, while Contriever, trained without any labels and with half the parameters, outperforms it in zero-shot evaluation across the benchmark's diverse domains. On the QA datasets DPR was trained on, supervised training still pays off.
Conclusion
The evolution from DPR to Contriever represents a fundamental shift in how we approach dense retrieval. DPR demonstrated that dense representations could outperform sparse methods, but required supervised training data. Contriever showed that unsupervised training with a shared encoder could achieve even better zero-shot performance across diverse domains.
These models laid the foundation for modern retrieval systems used in RAG pipelines and continue to influence the development of more sophisticated retrieval architectures. The key insights — that semantic similarity matters more than lexical overlap, that unsupervised pretraining can be surprisingly effective, and that encoder alignment is crucial — remain relevant as we build the next generation of retrieval systems.
Future work in this space continues to build on these foundational concepts, exploring reasoning-aware retrieval, multi-modal approaches, and hybrid architectures that combine the best of both supervised and unsupervised methods.
References
[1] https://aclanthology.org/2020.emnlp-main.550/
[2] https://arxiv.org/abs/2112.09118
[3] https://github.com/facebookresearch/contriever
[4] https://huggingface.co/facebook/contriever
[5] https://genai-course.jding.org/rag/index.html
Published via Towards AI