Beyond Vectors: A Deep Dive into Modern Search in Qdrant
Last Updated on December 29, 2025 by Editorial Team
Author(s): Ashish Abraham
Originally published on Towards AI.

Years back, I read a book called “I Thought I Knew How to Google”. It showed numerous ways to write Google queries with operators like AND, NOT, quotation marks, and other tricks, to get relevant results for our keywords every time. Those tricks worked well then and still work now; but it is not just Google: the entire concept of search engines has changed over the years. When was the last time you actually typed something like “best smartphone AND battery life NOT iPhone” into a search box? Today, we mostly type natural-language questions, rely on autocomplete suggestions, or simply expect the search engine to understand our intent without us having to craft the query carefully. Search has quietly shifted from telling the machine exactly what to do to trusting the machine to figure out what we mean.
Search Has Outgrown “Just Keywords”
Search engine queries we use now increasingly resemble questions or loosely described intent rather than precise keywords or exact terms. For example, instead of typing “Section 80C tax deduction India,” users are more likely to ask, “How can I save tax in India this year?” At the same time, you can also search for something like “Something casual and green from Levi’s for winter”, where Levi’s is a brand name and should be processed exactly as it is. Such a shift in user behavior alongside advances in the AI landscape requires developers to consider a blend of approaches. Pure keyword matching struggles with intent-heavy, conversational queries, while pure semantic search can miss critical constraints like exact product names, IDs, dates, prices, or compliance terms. Modern search systems must combine the following: semantic similarity to understand meaning, exact match to respect precision, filters to narrow down results by category or metadata, and numeric context to interpret ranges, limits, and comparisons correctly. This is called hybrid search.
The problem is not to build a search engine but to build one that resonates with how real people actually search. I experimented with a few options — hosted vector databases that focused mainly on semantic similarity, Elasticsearch setups for hybrid retrieval, and even custom pipelines where dense retrieval, keyword filtering, and reranking were handled by separate components arranged sequentially. Each could handle only part of the problem, and none felt comfortable once I moved beyond demo queries. Pure vector stores like ChromaDB delivered great semantic matches, but filters felt secondary. As soon as I added constraints like brand, price, or availability, I was spending more time tuning recall and latency than improving relevance.
I also tried stitching multiple systems together. One for vectors, one for keywords, and custom logic to merge results. It worked, but it added operational and conceptual complexity to a simple search engine. Out of every approach, Qdrant stood out because hybrid search wasn’t something you assembled in it. It was native. Dense vectors, sparse vectors, full-text search, filters, and ranking all lived in one engine. That made it easier to reason about search as a single system, not a collection of workarounds.
In this article, I’ll walk you through the different concepts I explored for building search engines for modern user queries. I’ll share what worked, what didn’t, and show how Qdrant’s features and trade-offs helped me create a search system that’s both smart and precise.
Table Of Contents
· Dense Vector Search: Semantic Backbone of Modern Retrieval
∘ Similarity Measures in Vector Search
∘ Implementation
· Sparse Vector Search: Bringing Back Precision and Rare Terms
∘ Search Mechanics
∘ Implementation
· Full-Text Indexing: Lightweight but Surprisingly Powerful
∘ What Happens When You Hit Search
∘ Implementation
· ASCII-Folding: Making Multilingual Text Search Actually Work
∘ ASCII Folding in Hybrid Search
∘ Implementation
· ACORN: Filter-Aware Vector Search
∘ How Does ACORN Work?
∘ Implementation
· Leveling Up the Modern Search Stack
∘ Reranking
∘ Multilingual Tokenization
∘ Performance & Cost Improvements
· Putting It All Together: Designing a Hybrid Retrieval Pipeline
∘ Set Up Qdrant
∘ Data Ingestion
∘ Payload Indexing
∘ Hybrid Search
· Wrapping Up
· References
Dense Vector Search: Semantic Backbone of Modern Retrieval
Embeddings, or dense vectors, need little introduction: they lay the foundation for much of modern AI. At a high level, they are numerical representations that capture the meaning of text, images, or almost any other data modality, allowing machines to compare and reason about them meaningfully.
Text: "Running shoes for daily jogging"
↓
[0.12, 0.75, -0.33, ...]
← Dense vector →
For example, imagine embedding a few common queries and labels into a 2D space:

- “Running shoes for daily jogging” and “lightweight sneakers for morning runs” appear close together because both describe the same intent — comfortable footwear for regular running, even though the wording is different.
- “Wireless noise-cancelling headphones” and “Bluetooth headphones with ANC” cluster together since both refer to the same product category and feature set, expressed using different terminology.
- “iPhone 14 charging cable”, “Lightning cable for Apple phone”, and “Apple fast charger wire” form another cluster, grouped by accessory compatibility and charging intent.
Even though the exact words differ, embeddings place related items near each other because they mean roughly the same thing. These embeddings are computed by specialized embedding models, each designed for particular tasks and available in a variety of sizes depending on performance and cost needs.
Similarity Measures in Vector Search
Before performing a vector search, the whole corpus of data that you need to search in is stored as embeddings and indexed in a vector database or vector store. Once you hit search, your query is converted into a vector by the same embedding model used to prepare the database. This query embedding is then compared against each embedding, and the closest data points are retrieved. Let’s understand how this comparison actually works in real-time.
A common way to compare vectors is cosine similarity, calculated as the dot product of the vectors divided by the product of their magnitudes. The closer the result is to 1, the more similar the vectors are:

cos(A, B) = (A · B) / (||A|| × ||B||)
Other common similarity measures used are:
- Euclidean Distance (L2)
This is literally the straight-line distance between two vectors in space. It usually works well in situations where the embedding magnitude carries meaning.
- Manhattan Distance (L1)
It measures the sum of absolute differences across dimensions, or in other words, how many small steps apart the vectors are.
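To make these measures concrete, here is a minimal pure-Python sketch; the two vectors are made-up 4-dimensional examples, not real embeddings:

```python
import math

# Two toy 4-dimensional vectors (not real embeddings)
a = [0.12, 0.75, -0.33, 0.41]
b = [0.10, 0.80, -0.30, 0.38]

def cosine_similarity(u, v):
    # dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    # straight-line (L2) distance in space
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def manhattan_distance(u, v):
    # sum of absolute per-dimension differences (L1)
    return sum(abs(x - y) for x, y in zip(u, v))

print(cosine_similarity(a, b))   # close to 1 -> very similar
print(euclidean_distance(a, b))  # close to 0 -> very similar
print(manhattan_distance(a, b))
```

Note how the two distance measures approach 0 for similar vectors, while cosine similarity approaches 1.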
Implementation
The code is pretty straightforward. You need to define a collection with the required configuration for the similarity measure.
from qdrant_client import QdrantClient, models
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="simple_vector_search",
    vectors_config=models.VectorParams(
        size=384,  # dimensionality of BAAI/bge-small-en-v1.5 embeddings
        distance=models.Distance.COSINE,
    ),
)
Populate the database using any embedding model.
from fastembed import TextEmbedding
embed_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
def embed(text: str):
    return list(embed_model.embed([text]))[0].tolist()
texts = [
    "Running shoes for daily jogging",
    "Apple fast charger wire",
]
points = [
    models.PointStruct(
        id=i,
        vector=embed(text),
        payload={"text": text},
    )
    for i, text in enumerate(texts, start=1)
]
client.upsert(collection_name="simple_vector_search", points=points)
Once this is done, you can query the collection directly.
query = "Running shoes for daily jogging"
query_vector = embed(query)
results = client.search(
    collection_name="simple_vector_search",
    query_vector=query_vector,
    limit=3,  # top-k
)
Sparse Vector Search: Bringing Back Precision and Rare Terms
My first instinct was to go all-in on dense vectors. It worked beautifully for a while. Queries like “running shoes for daily jogging” suddenly matched results I’d never tagged explicitly. But the cracks showed quickly. A search for “comfortable mid-sole support Nike running shoes” returned ortho slippers that were semantically similar but not actually correct. Dense vectors were necessary but clearly not sufficient. This is where sparse vectors come in handy.
Sparse vectors are high-dimensional representations where most values are zero, but a few key dimensions are active, corresponding directly to important keywords or tokens. You can think of them as an advanced evolution of traditional keyword-based search methods (like TF-IDF), adapted for modern retrieval engines. They are the preferred method when you want relevance ranking beyond exact matches but don’t need the full semantic understanding that dense vectors provide.
Text: "HP 15-eg2018TU original battery"
↓
[0, 0, 0, 0, 1.2, 0, 0, 0, 0, 0.9, 0, 0, 0, 1.5, 0, 0, 0, 0, 0, 0, ...]
↑ ↑ ↑
"HP" "15-eg2018TU" "Battery"
(1.2) (0.9) (1.5)
More compactly, this can be represented as (index, value) pairs:
[(4,1.2), (9,0.9), (13,1.5)]
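In this pair form, a sparse vector behaves like a dictionary from dimension index to weight, and scoring a query against a document is just a dot product over the indices they share (a toy sketch, not Qdrant’s internals):

```python
# Sparse vectors as {dimension index: weight} dictionaries
doc = {4: 1.2, 9: 0.9, 13: 1.5}   # "HP 15-eg2018TU original battery"
query = {4: 1.0, 13: 1.1}         # query activating "HP" and "battery"

def sparse_dot(q, d):
    # dot product over the indices both vectors share;
    # all other dimensions are zero and contribute nothing
    return sum(w * d[i] for i, w in q.items() if i in d)

print(sparse_dot(query, doc))  # 1.0*1.2 + 1.1*1.5 = 2.85
```

Because most dimensions are zero, only the shared non-zero indices ever need to be touched, which is what makes sparse retrieval fast at scale.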
Product descriptors like ‘Apple iPhone 17 Pro 256GB Cosmic Orange’ and ‘Adidas Terrex rain.rdy’ are typically terse and dense with specific identifiers. Sparse vector models (such as SPLADE) preserve this term-level precision, ensuring that queries containing such identifiers retrieve the right records accurately, even in the absence of detailed metadata.
Search Mechanics
Sparse vector search is commonly implemented through algorithms like BM25 (Best Match 25). BM25 determines the most relevant documents for a given query by considering two factors:
- Term Frequency (TF): How often do the query words appear in each document? (The more, the better.)
- Inverse Document Frequency (IDF): How rare are the query words across the entire document set? (The rarer, the better.)
The BM25 score for a document D with respect to a query Q is the sum of scores for the individual query terms:

score(D, Q) = Σᵢ IDF(qᵢ) · f(qᵢ, D) · (k₁ + 1) / (f(qᵢ, D) + k₁ · (1 − b + b · |D| / avgdl))

where f(qᵢ, D) is the frequency of term qᵢ in D, |D| is the document length, avgdl is the average document length in the corpus, and k₁ and b are free parameters.
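As a toy illustration (not Qdrant’s implementation), a minimal BM25 scorer in pure Python might look like this, with k1 and b as the standard free parameters:

```python
import math

# Toy corpus of tokenized documents
corpus = [
    "adidas terrex rain rdy trail shoes".split(),
    "running shoes for daily jogging".split(),
    "apple fast charger wire".split(),
]
k1, b = 1.5, 0.75  # typical BM25 defaults
avgdl = sum(len(d) for d in corpus) / len(corpus)
N = len(corpus)

def idf(term):
    # rarer terms across the corpus get a higher weight
    n = sum(term in d for d in corpus)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc):
    score = 0.0
    for term in query:
        f = doc.count(term)  # term frequency in this document
        score += idf(term) * f * (k1 + 1) / (
            f + k1 * (1 - b + b * len(doc) / avgdl)
        )
    return score

query = "adidas terrex".split()
scores = [bm25(query, d) for d in corpus]
print(scores)  # the Adidas document scores highest
```

Documents that contain none of the query terms score exactly zero, which is the keyword-precision behavior dense vectors lack.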
Another approach is SPLADE (Sparse Lexical and Expansion Model). It takes a learned, neural approach to sparse vector search. Instead of directly counting term occurrences, SPLADE uses a transformer model to generate a high-dimensional sparse vector, where each dimension corresponds to a token in the vocabulary. The values in this vector represent how important each term is to the meaning of the document. This allows SPLADE to combine the strength of keyword-based precision like BM25 and learned relevance and term expansion from neural models.
Similarity between a query and a document is computed using a simple dot product between the sparse vectors.
Another learned sparse-lexical approach is miniCOIL (based on COIL, Contextualized Inverted List). While SPLADE expands documents into a vocabulary-wide sparse vector, miniCOIL takes a more token-centric view of retrieval.
miniCOIL encodes each token in a document into a contextualized embedding using a lightweight transformer model. Instead of producing one large sparse vector per document, it builds an inverted index of token embeddings, where each token retains its semantic context. At query time, query tokens are encoded in the same way, and relevance is computed by matching query tokens against document tokens using efficient dot-product similarity.

Implementation
The configuration is a bit different from vector search. Here’s how you create the collection.
client.create_collection(
    collection_name="simple_sparse_vector_search",
    vectors_config={},  # no dense vectors
    sparse_vectors_config={
        "bm25_sparse_vector": models.SparseVectorParams(
            modifier=models.Modifier.IDF
        )
    },
)
Optionally, setting on_disk=True (via models.SparseIndexParams in the index field of SparseVectorParams) stores the sparse index on disk, which saves memory but may slow down search. If set to False, the index is still persisted on disk, but it is also loaded into memory for faster search.
Use any sparse embedding method to create the vectors.
from fastembed import SparseTextEmbedding
bm25_model_name = "Qdrant/bm25"
bm25 = SparseTextEmbedding(model_name=bm25_model_name)
def bm25_embed(texts: list[str]):
    return list(bm25.embed(texts))

# Sample documents to index
docs = [
    {"id": 1, "text": "Adidas Terrex rain.rdy trail shoes"},
    {"id": 2, "text": "Apple fast charger wire"},
]
doc_texts = [d["text"] for d in docs]
bm25_embeddings = bm25_embed(doc_texts)  # list of SparseEmbedding
points = []
for doc, emb in zip(docs, bm25_embeddings):
    points.append(
        models.PointStruct(
            id=doc["id"],
            # sparse vectors are passed under their configured name
            vector={
                "bm25_sparse_vector": models.SparseVector(
                    indices=emb.indices.tolist(),
                    values=emb.values.tolist(),
                )
            },
            payload={"text": doc["text"]},
        )
    )
client.upsert(collection_name="simple_sparse_vector_search", points=points)
Convert the query into sparse vectors, and you can start the search.
query_text = "Adidas Terrex"
query_emb = bm25_embed([query_text])[0]
query_sparse = models.NamedSparseVector(
    name="bm25_sparse_vector",
    vector=models.SparseVector(
        indices=query_emb.indices.tolist(),
        values=query_emb.values.tolist(),
    ),
)
results = client.search(
    collection_name="simple_sparse_vector_search",
    query_vector=query_sparse,
    limit=3,
)
Full-Text Indexing: Lightweight but Surprisingly Powerful
Full-text indexing may sound old-school in the world of vectors and vector DBs, but it is often exactly the practical tool you need. When used correctly, it delivers fast, precise results with minimal infrastructure.
Full-text indexing shines when users are looking for specific words or short phrases rather than abstract meaning. Imagine you are searching for “Adidas Terrex rain.rdy”. This query is short, precise, and terminology-driven. The user isn’t asking for semantic understanding; the system needs to retrieve records that contain specific terms. Running an embedding model for “Adidas Terrex rain.rdy” doesn’t add much value, as the exact words already carry all the meaning needed to find the right documents. The extra embedding step only adds time, infrastructure complexity, and often money, especially if you’re using large or hosted models. Moreover, such queries demand fast, direct, and predictable results.
Not everything deserves a vector.
What Happens When You Hit Search
The ingested data undergoes tokenization to split into words or subwords, followed by normalization like lowercasing, stemming (e.g., “running” to “run”), and stopword removal (e.g., ignoring “the” or “and”). The processed tokens build an inverted index, such as mapping “GST” to document IDs like [Doc1, Doc3]. The inverted index flips the traditional document-to-terms mapping into a terms-to-documents structure, enabling rapid searches across large text corpora.
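The indexing steps above can be sketched in a few lines of Python (a toy illustration with made-up documents, not a production index):

```python
# Toy inverted index: map each normalized token to the IDs of the
# documents that contain it.
docs = {
    1: "GST rules for small businesses",
    2: "RBI guidelines on digital payments",
    3: "GST filing deadline extended",
}
stopwords = {"for", "on", "the", "and"}

index = {}
for doc_id, text in docs.items():
    for token in text.lower().split():  # tokenize + lowercase
        if token in stopwords:          # stopword removal
            continue
        index.setdefault(token, set()).add(doc_id)

print(index["gst"])  # {1, 3}
```

A query then becomes a couple of dictionary lookups instead of a scan over every document, which is why inverted indexes stay fast on large corpora.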
When a user searches, the query is tokenized similarly, and the engine quickly finds documents containing those tokens and applies scoring based on frequency, position, or relevance rules. Modern implementations go beyond simple “contains word X” logic. You can:
- Search for phrases, not just individual words
- Match any of several terms (e.g., “RBI” or “GST”)
- Combine text search with filters like category, date, or region
This makes full-text indexing surprisingly flexible, especially for structured content with short text fields.
Implementation
When creating the collection in Qdrant, add a payload index with TextIndexParams to create the inverted index for full-text search.
client.create_payload_index(
    collection_name="demo",
    field_name="description",
    field_schema=models.TextIndexParams(
        tokenizer=models.TokenizerType.WORD,
        min_token_len=2,
        max_token_len=15,
        lowercase=True,
    ),
)
When searching, use the MatchText filters to perform a full-text search on the word or phrase you require.
client.search(
    collection_name="demo",
    query_vector=[1, 0, 0, 1],
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="description",
                match=models.MatchText(text="ANC buds"),
            )
        ]
    ),
    limit=3,
)
For a more flexible option, you can use text_any. Now, if the text field contains even one of the query terms “ANC” or “buds”, it is considered a match.
client.search(
    collection_name="demo",
    query_vector=[1, 0, 0, 1],
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="description",
                match=models.MatchTextAny(text_any="ANC buds"),
            )
        ]
    ),
    limit=3,
)
ASCII-Folding: Making Multilingual Text Search Actually Work
Let’s understand the problem first, with an example. Suppose you are searching for “Crème skincare set”. While typing, there is a 99% chance you enter “Creme skincare set” instead. A plain full-text search will not match, because è and e are different characters even though they look almost the same. In such situations, we need something called ASCII folding. ASCII folding converts Unicode accented or non-ASCII characters into their corresponding ASCII equivalents, for example by removing diacritics: é is converted to e and ü to u. Importantly, this happens before the search engine stores or compares tokens, so there’s no runtime penalty. This way, you can search for “Creme skincare set” and still find “Crème skincare set”, or search for “Munchen” and find “München”.
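Python’s standard library can demonstrate the idea: NFKD normalization splits each accented character into a base character plus combining marks, which can then be dropped. This is a sketch of the concept, not Qdrant’s internal implementation:

```python
import unicodedata

def ascii_fold(text: str) -> str:
    # NFKD decomposes "è" into "e" + a combining grave accent;
    # dropping the combining marks leaves the plain ASCII base letters
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(ascii_fold("Crème skincare set"))  # "Creme skincare set"
print(ascii_fold("München"))             # "Munchen"
```

Applying the same folding to both indexed text and incoming queries is what makes “Creme” and “Crème” land on the same token.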
ASCII Folding in Hybrid Search
ASCII folding cannot be considered a standalone feature. It works when blended with broader search approaches. It pairs naturally with:
- Full-text indexing for exact-term retrieval
Users can type simplified versions of names, institutions, or regulatory terms and still hit the right documents, even when the indexed text contains accents or localized spellings. This improves recall while keeping the deterministic behavior that full-text search is valued for.
- Filters on names, tags, and identifiers
Filters often sit on structured fields where users expect binary correctness: either a record matches or it doesn’t. ASCII folding ensures these filters remain reliable across inconsistent data sources and user input methods. Whether the value comes from a form, a CSV upload, or a third-party API, folded normalization prevents subtle character differences from silently excluding valid records.
- Vector search, by improving recall before ranking
Semantic ranking is only effective if the right candidates make it into the result set. ASCII folding increases the likelihood that relevant documents survive early filtering stages. By solving character mismatches upfront, vectors can focus on ranking by meaning rather than compensating for avoidable text inconsistencies.
Implementation
In Qdrant, it is simple to enable ASCII folding. It can be done by setting the ascii_folding parameter to True in TextIndexParams.
client.create_payload_index(
    collection_name="demo",
    field_name="description",
    field_schema=models.TextIndexParams(
        tokenizer=models.TokenizerType.WORD,
        min_token_len=2,
        max_token_len=15,
        ascii_folding=True,
    ),
)
ACORN: Filter-Aware Vector Search
When we ingest documents or data into any vector DB, they must be stored in a way that enables fast lookup and retrieval. This is called indexing, and different databases achieve it with different algorithms. One such algorithm is HNSW (Hierarchical Navigable Small World) indexing. It stores data as a multi-layered graph, which enables fast approximate nearest-neighbor searches across the layers instead of blindly scanning the whole database. All of this operates purely on high-dimensional vectors. But what happens when we apply filters to such searches, as we did before?
HNSW is not aware of filters. It fetches semantically similar data only to realise towards the end that most of them violate the filters. This means wasteful traversals and low recall. ACORN solves this by allowing filter-aware vector searches.
How Does ACORN Work?

Instead of treating filters as an afterthought, ACORN-style search incorporates predicate checks directly into the graph exploration process. The idea is simple but powerful: prefer paths that are more likely to lead to valid points, rather than exploring blindly and discarding results later. If a neighbor is filtered out, it inspects that neighbor’s neighbors (up to configurable depth), effectively bridging gaps in the filtered subgraph without index rebuilds, thereby boosting recall.
More technically speaking, ACORN builds denser HNSW graphs predicate-agnostically, using two-phase neighbor selection with a parameter Mβ (0 ≤ Mβ ≤ M·γ, where γ = 2–4 is an expansion factor).
This involves:
- Retaining the nearest Mβ neighbors exactly (the dense core).
- For farther candidates, expanding to 2-hop neighbors (neighbors-of-neighbors), filtering by distance, and pruning aggressively by compressing or removing redundant edges reachable via intermediates.
When the search happens, start at the top-layer entry point. At each node in the HNSW graph:
Phase 1: Look at the immediate neighbors of the current node. If a neighbor satisfies the filter conditions, compute its distance to the query and add it to the candidate queue.
Phase 2: For invalid first-hop nodes, don’t stop there. Instead, explore their neighbors (one level deeper), apply the same filters, and keep any valid candidates.
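Here is a deliberately simplified sketch of the two phases on a toy adjacency list (illustrative only: the graph and filter are made up, and real HNSW traversal is layered and ordered by distance to the query):

```python
# Toy graph as an adjacency list; passes_filter marks which points
# satisfy the query's payload filter.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["E"],
    "D": [],
    "E": [],
}
passes_filter = {"A": True, "B": False, "C": True, "D": True, "E": False}

def acorn_candidates(node):
    candidates = []
    for neighbor in graph[node]:
        if passes_filter[neighbor]:
            # phase 1: valid first-hop neighbor joins the candidate set
            candidates.append(neighbor)
        else:
            # phase 2: bridge through the filtered-out node and keep
            # its valid neighbors instead
            candidates.extend(n for n in graph[neighbor] if passes_filter[n])
    return candidates

print(acorn_candidates("A"))  # ['D', 'C'] -- D reached through filtered-out B
```

The point to notice is that D is still reachable even though its only link runs through B, which the filter removed; plain filtered HNSW would have lost that path.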
This benchmark table from Qdrant compares ACORN-enabled HNSW vs standard HNSW under heavy filtering.


Implementation
There are two variants of ACORN, and Qdrant offers ACORN-1. ACORN is disabled by default. Once enabled via the enable flag, it activates conditionally when estimated filter selectivity is below the threshold. The optional max_selectivity value controls this threshold; 0.0 means ACORN will never be used, 1.0 means it will always be used. The default value is 0.4. For example, consider this sample data.
client.upsert("fintech_products", points=[
    models.PointStruct(id=1, vector=[0.1] * 128, payload={
        "product": "Fixed Deposit", "bank": "SBI", "yield": 7.5, "min_invest": 10000, "risk": "low"
    }),
    models.PointStruct(id=2, vector=[0.2] * 128, payload={
        "product": "Mutual Fund", "bank": "HDFC", "yield": 12.0, "min_invest": 5000, "risk": "medium"
    }),
    models.PointStruct(id=3, vector=[0.15] * 128, payload={
        "product": "Arbitrage Fund", "bank": "ICICI", "yield": 8.2, "min_invest": 5000, "risk": "low"
    }),
])
For complex search queries as shown, enable ACORN in SearchParams.
results_acorn = client.search(
    collection_name="fintech_products",
    query_vector=[0.12] * 128,  # embedding of "safe high-return liquid investment"
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="risk", match=models.MatchValue(value="low")),
            models.FieldCondition(key="yield", range=models.Range(gte=7.0)),
            models.FieldCondition(key="bank", match=models.MatchAny(any=["SBI", "ICICI"])),
        ],
        should=[
            models.FieldCondition(key="min_invest", range=models.Range(lte=10000)),
            models.FieldCondition(key="liquid", match=models.MatchValue(value=True)),
        ],
        must_not=[
            models.FieldCondition(key="management_fee", range=models.Range(gte=1.5)),
            models.FieldCondition(key="launch_year", range=models.Range(lte=2023)),
        ],
    ),
    search_params=models.SearchParams(
        hnsw_ef=128,
        acorn=models.AcornSearchParams(
            enable=True,
            max_selectivity=0.4,
        ),
    ),
    limit=5,
)
Here is a quick summary of the topics for future reference.

Leveling Up the Modern Search Stack
You have now learned various approaches to building the best search engines. But as the cherry on top, I’ll give you a few extra tips that can elevate your search experience even further, especially when tailored to your specific task or domain.
Reranking
Reranking is an optional but powerful post-processing step in modern search systems. It refines the initial set of results returned by a vector search. While vector similarity can efficiently retrieve the top-k most relevant candidates based on embedding distance, it may not fully capture nuances such as user intent, context, or domain-specific relevance.
In theory, reranking can be implemented in several ways. Commonly, in RAG systems, it is implemented by taking the top-k results and passing them, along with the original query, into a more powerful model, often a cross-encoder or transformer-based reranker. This model examines the query and each result together (rather than independently) and assigns a relevance score based on deeper semantic understanding.
Qdrant offers a feature called score boosting rerankers. For instance, in e-commerce you may want to boost products from a specific manufacturer perhaps because you have a brand promotion or you need to clear inventory. By configuring score boosting rerankers, you can easily influence ranking using metadata like brand or stock status. That is how you bring up products showing “1 remaining” to the top, which boosts the chances of the user buying it faster. The basic idea is to assign scores to specific fields while reranking.
score = score + (stock_status * 0.5) + (brand * 0.35)
In code, it is done by adding these multipliers suitably.
boosted_products = client.query_points(
    collection_name="{collection_name}",
    prefetch=models.Prefetch(
        query=[0.1, 0.45, 0.67],  # dense vector for product similarity
        limit=50,
    ),
    query=models.FormulaQuery(
        formula=models.SumExpression(sum=[
            "$score",  # base similarity score
            # Boost products from a promoted brand
            models.MultExpression(mult=[
                0.35,
                models.FieldCondition(
                    key="brand",
                    match=models.MatchAny(any=["Nike"]),
                ),
            ]),
            # Boost products with low stock to encourage faster purchase
            models.MultExpression(mult=[
                0.5,
                models.FieldCondition(
                    key="stock_status",
                    match=models.MatchAny(any=["low_stock"]),
                ),
            ]),
        ])
    ),
)
Multilingual Tokenization
In modern search engines, vector search alone often relies on English-heavy embeddings or loses nuance in long-tail languages, while keyword filters and full-text indexes still drive precision and faceting. As search systems expand globally, supporting multiple languages is no longer optional but a baseline expectation. Different languages follow very different linguistic rules. Some use spaces between words, others don’t. Some rely heavily on inflections or compound words, while others use accents or non-Latin scripts. A tokenizer designed only for English will struggle when applied to languages like Hindi, Japanese, Arabic, or German. In search engines, poor tokenization leads to:
- Missed matches despite relevant content existing
- Incorrect filtering or ranking
- Lower recall in keyword, sparse, and hybrid search
A multilingual tokenizer lets the text index work consistently across locales: it handles Unicode normalization, script-specific segmentation, and per-language stopwords and stemming, which stabilizes recall and ranking in multilingual datasets. It ensures that text is split, normalized, and indexed in a way that respects the structure of each language.
Here’s how to use it in Qdrant.
client.create_payload_index(
    collection_name="multilingual_products",
    field_name="title",
    field_schema=models.TextIndexParams(
        type=models.TextIndexType.TEXT,
        tokenizer=models.TokenizerType.MULTILINGUAL,
    ),
)
Performance & Cost Improvements
All the techniques discussed above are great, but they don’t come without a catch. Techniques like reranking and ACORN add overhead to each search request. It is therefore crucial to optimize the search engine from the ground up, so that this latency is not fully passed on to the end user.
One such performance vs cost optimization comes in storing the HNSW index. HNSW was designed to be an in-memory index structure. Traversing the HNSW graph involves a lot of random access reads, which is fast in RAM but slow on disk. For instance, querying 1 million vectors with the HNSW parameters m (the number of connections each node maintains to its neighbors) set to 16 and ef (the size of the candidate pool explored during search) to 100 requires approximately 1200 vector comparisons. This is fine in RAM. But on disk, each random access read can take up to 1ms in SSDs, or even longer when using HDDs.
One workaround in disk-based storage is to reduce random reads by leveraging paged reading, where data is read in fixed-size blocks (typically 4KB or more). Traditional tree-based structures like B-trees take advantage of this naturally, but graph-based indexes such as HNSW struggle because nodes can have many arbitrary connections. Qdrant implements this through inline storage, which stores quantized vector data directly inside HNSW nodes. This reduces disk seeks and improves read performance, at the cost of slightly higher storage usage.

According to the Qdrant docs, benchmarking on 1,000,000 vectors with 2-bit quantization and the float16 data type reveals the following results.

As the database grows, there can be constraints on storage space and search latency. Instead of storing each vector as a high-precision float32, quantization converts them into a lower-bit format, dramatically reducing memory usage while preserving most of the semantic similarity. Commonly, int8 or even 1-bit binary representations are possible.
Scalar quantization treats each dimension of the vector independently, mapping its high‑precision floating point value to the nearest bin in a small integer range. This is similar to converting from float32 to int8 for each element. Scalar quantization alone can reduce embedding size by up to 4×, with retrieval performance typically remaining above 99% accuracy.
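As a rough sketch of the idea (not Qdrant’s exact scheme), scalar quantization maps each float to the nearest step in the int8 range, then reconstructs an approximation from the integer codes:

```python
# Toy 4-dimensional "embedding"
values = [0.12, 0.75, -0.33, 0.41]

# One shared scale so the largest magnitude maps to the edge of int8
scale = max(abs(v) for v in values) / 127

quantized = [round(v / scale) for v in values]   # integers in [-128, 127]
reconstructed = [q * scale for q in quantized]   # approximate originals

print(quantized)
# worst-case rounding error is at most half a quantization step
print(max(abs(v - r) for v, r in zip(values, reconstructed)))
```

Each dimension now needs 1 byte instead of 4, which is where the roughly 4× memory reduction comes from, while the reconstruction error stays within half a quantization step per dimension.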
Putting It All Together: Designing a Hybrid Retrieval Pipeline
After multiple iterations and performance tuning, this is the pipeline I ended up trusting in practice.
Set Up Qdrant
If you’re working locally, make sure you have Docker installed and the Docker engine running. Qdrant can be installed by pulling its Docker image:
! docker pull qdrant/qdrant
Then run the Qdrant Docker container:
! docker run -p 6333:6333 \
-v $(pwd)/qdrant_storage:/qdrant/storage \
qdrant/qdrant
Alternatively, a more convenient option is to use Qdrant Cloud. Log in to the cloud platform, create a cluster, and retrieve your API key.
Install the required Python libraries.
! pip install "qdrant-client[fastembed]" datasets fastembed transformers openai
Now you are ready to start a client.
from qdrant_client import models, QdrantClient
from google.colab import userdata
client = QdrantClient(
url="YOUR_QDRANT_CLOUD_INSTANCE_URL",
api_key=userdata.get('qdrant_api_key'),
)
Data Ingestion
To make the search system realistic and representative of a production setup, we need a meaningful dataset to work with. For this purpose, I will be using an Amazon products dataset.
import pandas as pd

df_raw = pd.read_csv('/content/amazon-products.csv')
df_raw.columns
Index(['timestamp', 'title', 'seller_name', 'brand', 'description',
'initial_price', 'final_price', 'currency', 'availability',
'reviews_count', 'categories', 'asin', 'buybox_seller',
'number_of_sellers', 'root_bs_rank', 'answered_questions', 'domain',
'images_count', 'url', 'video_count', 'image_url', 'item_weight',
'rating', 'product_dimensions', 'seller_id', 'date_first_available',
'discount', 'model_number', 'manufacturer', 'department',
'plus_content', 'upc', 'video', 'top_review', 'variations', 'delivery',
'features', 'format', 'buybox_prices', 'parent_asin', 'input_asin',
'ingredients', 'origin_url', 'bought_past_month', 'is_available',
'root_bs_category', 'bs_category', 'bs_rank', 'badge',
'subcategory_rank', 'amazon_choice', 'images', 'product_details',
'prices_breakdown', 'country_of_origin'],
dtype='object')
Remove unnecessary rows and clean up the dataset. You can drop rows that contain null values for certain important fields. Also, adjust the prices, ratings, and stock status to the required data types for your use case.
def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Extract & clean required fields"""
    required_cols = ['asin', 'title', 'brand', 'description', 'categories',
                     'final_price', 'availability', 'rating', 'reviews_count']
    df_clean = df[required_cols].copy()

    # Clean text fields
    df_clean['asin'] = df_clean['asin'].fillna('').astype(str).str.strip()
    df_clean['title'] = df_clean['title'].fillna('').astype(str).str.strip()
    df_clean = df_clean[df_clean['title'] != '']
    df_clean['brand'] = df_clean['brand'].fillna('Unknown').astype(str).str.strip()
    df_clean['description'] = df_clean['description'].fillna('').astype(str).str.strip()

    # Parse categories from string
    def parse_categories(cat_str):
        try:
            if pd.isna(cat_str) or cat_str == '':
                return []
            cat_str = str(cat_str).strip('[]"')
            cats = [c.strip(' "\'') for c in cat_str.split(',')]
            return [c for c in cats if c]
        except (ValueError, TypeError):
            return []
    df_clean['categories'] = df_clean['categories'].apply(parse_categories)

    # Parse price
    def parse_price(price_str):
        try:
            if pd.isna(price_str):
                return 0.0
            return float(str(price_str).strip('"').replace(',', ''))
        except (ValueError, TypeError):
            return 0.0
    df_clean['final_price'] = df_clean['final_price'].apply(parse_price)

    # Stock status detection: check the low-stock phrasing first, since
    # strings like "Only 3 left in stock" also contain "in stock"
    def get_stock_status(avail_str):
        avail_lower = str(avail_str).lower()
        if 'only' in avail_lower or 'left' in avail_lower:
            return 'low_stock'
        elif 'in stock' in avail_lower:
            return 'in_stock'
        return 'out_of_stock'
    df_clean['availability'] = df_clean['availability'].fillna('Out of Stock').astype(str)
    df_clean['stock_status'] = df_clean['availability'].apply(get_stock_status)

    # Numeric fields
    df_clean['rating'] = pd.to_numeric(df_clean['rating'], errors='coerce').fillna(0.0)
    df_clean['reviews_count'] = pd.to_numeric(df_clean['reviews_count'], errors='coerce').fillna(0).astype(int)

    df_clean = df_clean.drop_duplicates(subset='asin', keep='first')
    print(f"✓ Cleaned {len(df_clean)} products")
    print(f"  Brands: {df_clean['brand'].nunique()}, Avg price: ${df_clean['final_price'].mean():.2f}")
    print(f"  Stock: {(df_clean['stock_status']=='in_stock').sum()} in stock, {(df_clean['stock_status']=='low_stock').sum()} low stock")
    return df_clean.reset_index(drop=True)
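Two of the inner helpers above are easy to get subtly wrong, so they are worth sanity-checking standalone. This sketch mirrors their logic; note that the low-stock phrasing must be tested before "in stock", because strings like "Only 3 left in stock" contain both:

```python
def parse_price(price_str):
    """Parse a price string such as '"1,299.00"' into a float, defaulting to 0.0."""
    try:
        if price_str is None or str(price_str).strip() == '':
            return 0.0
        return float(str(price_str).strip('"').replace(',', ''))
    except (ValueError, TypeError):
        return 0.0

def get_stock_status(avail_str):
    """Map a free-text availability string to a coarse stock label."""
    avail_lower = str(avail_str).lower()
    if 'only' in avail_lower or 'left' in avail_lower:  # must run first
        return 'low_stock'
    if 'in stock' in avail_lower:
        return 'in_stock'
    return 'out_of_stock'

print(parse_price('"1,299.00"'))                 # → 1299.0
print(get_stock_status('Only 3 left in stock'))  # → low_stock
```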
Create a collection to get started. The collection is configured with dense vectors for semantic similarity, sparse vectors for keyword-based relevance, and INT8 scalar quantization for reduced memory usage. Quantization is not strictly necessary here, since this dataset is small compared with real production workloads, but it is included for the sake of learning.
def create_collection(collection_name: str = "ecommerce_products"):
    """Create collection with dense+sparse vectors & quantization"""
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass
    client.create_collection(
        collection_name=collection_name,
        vectors_config={
            "dense": VectorParams(
                size=dense_dim,
                distance=Distance.COSINE,
                quantization_config=models.ScalarQuantization(
                    scalar=models.ScalarQuantizationConfig(
                        type=models.ScalarType.INT8,
                        quantile=0.99,
                        always_ram=True
                    )
                ),
                hnsw_config=models.HnswConfigDiff(
                    m=16,
                    ef_construct=100,
                    full_scan_threshold=10000
                )
            )
        },
        sparse_vectors_config={
            "sparse": models.SparseVectorParams()
        },
        optimizers_config=models.OptimizersConfigDiff(
            indexing_threshold=10000,
            memmap_threshold=20000
        )
    )
    print(f"✓ Collection created (dense: {dense_dim}-dim + sparse, INT8 quantized)")
    return collection_name
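To see what INT8 quantization buys, a back-of-envelope estimate helps: scalar quantization stores one byte per dimension instead of four (float32), so the in-RAM copy shrinks roughly 4x. The figures below are illustrative only and ignore HNSW graph and payload overhead:

```python
def vector_memory_mb(num_vectors: int, dim: int, bytes_per_dim: int) -> float:
    """Rough RAM estimate for raw vector storage only."""
    return num_vectors * dim * bytes_per_dim / (1024 ** 2)

n, dim = 100_000, 384  # hypothetical collection size, bge-small dimensionality
print(f"float32: {vector_memory_mb(n, dim, 4):.1f} MB")  # → float32: 146.5 MB
print(f"int8:    {vector_memory_mb(n, dim, 1):.1f} MB")  # → int8:    36.6 MB
```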
You can also see the HNSW configuration in the snippet above. For a quick lookup, here is what to keep in mind when setting its parameters.

Payload Indexing
After setting up the collections, define this function that sets up payload indexes to support fast, flexible filtering and hybrid search. Text fields like title and description use multilingual tokenization with ASCII folding, ensuring robust full-text search across languages and accent variations. Keyword indexes on brand, categories, and stock_status enable exact matching and efficient filtering for faceted navigation and reranking. Numeric indexes on final_price and rating allow range queries and sorting, which are essential for price filters and quality-based ranking.
def create_payload_indexes(collection_name: str):
    """Full-text indexes with ASCII folding & multilingual tokenization"""
    # Title: multilingual + ASCII folding
    client.create_payload_index(
        collection_name=collection_name,
        field_name="title",
        field_schema=models.TextIndexParams(
            type=models.TextIndexType.TEXT,
            tokenizer=models.TokenizerType.MULTILINGUAL,
            min_token_len=2,
            max_token_len=30,
            lowercase=True,
            ascii_folding=True
        )
    )
    # Description: multilingual + ASCII folding
    client.create_payload_index(
        collection_name=collection_name,
        field_name="description",
        field_schema=models.TextIndexParams(
            type=models.TextIndexType.TEXT,
            tokenizer=models.TokenizerType.MULTILINGUAL,
            min_token_len=2,
            max_token_len=20,
            lowercase=True,
            ascii_folding=True
        )
    )
    # Keyword indexes
    for field in ['brand', 'categories', 'stock_status']:
        client.create_payload_index(
            collection_name=collection_name,
            field_name=field,
            field_schema=models.KeywordIndexParams(
                type=models.KeywordIndexType.KEYWORD
            )
        )
    # Numeric indexes
    client.create_payload_index(
        collection_name=collection_name,
        field_name="final_price",
        field_schema=models.IntegerIndexParams(
            type=models.IntegerIndexType.INTEGER,
            lookup=True,
            range=True
        )
    )
    client.create_payload_index(
        collection_name=collection_name,
        field_name="rating",
        field_schema=models.FloatIndexParams(
            type=models.FloatIndexType.FLOAT
        )
    )
Embeddings & Ingestion
Before ingesting the data, you need to define the embedding models and convert each data point to the required vectors. For efficient resource utilization and processing, it is often a good practice in production environments to batch requests for embeddings.
from typing import List
from qdrant_client.models import Distance, VectorParams, PointStruct
from fastembed import TextEmbedding, SparseTextEmbedding
dense_model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")
dense_dim = 384
sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")
def generate_embeddings_batch(texts: List[str], batch_size: int = 32):
    """Generate dense embeddings in batches"""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings.extend(list(dense_model.embed(batch)))
    return embeddings

def generate_sparse_embeddings_batch(texts: List[str], batch_size: int = 32):
    """Generate sparse embeddings in batches"""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        embeddings.extend(list(sparse_model.embed(batch)))
    return embeddings
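The slicing pattern repeated in both helpers can be factored into a small reusable generator (stdlib only; Python 3.12+ also ships `itertools.batched` with similar behavior):

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

print(list(batched(['a', 'b', 'c', 'd', 'e'], 2)))  # → [['a', 'b'], ['c', 'd'], ['e']]
```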
def ingest_data(df: pd.DataFrame, collection_name: str, batch_size: int = 50):
    """Ingest data with dense + sparse vectors"""
    print(f"Ingesting {len(df)} products...")
    # Combine text (title weighted 2x)
    combined_texts = [
        f"{row['title']} {row['title']} {row['brand']} {row['description'][:500]}"
        for _, row in df.iterrows()
    ]
    print("  Generating embeddings...")
    dense_embeddings = generate_embeddings_batch(combined_texts)
    sparse_embeddings = generate_sparse_embeddings_batch(combined_texts)
    # Prepare points
    points = []
    for idx, row in df.iterrows():
        sparse_emb = sparse_embeddings[idx]
        sparse_vector = models.SparseVector(
            indices=sparse_emb.indices.tolist(),
            values=sparse_emb.values.tolist()
        )
        point = PointStruct(
            id=idx,
            vector={
                "dense": dense_embeddings[idx].tolist(),
                "sparse": sparse_vector
            },
            payload={
                "asin": row['asin'],
                "title": row['title'],
                "brand": row['brand'],
                "description": row['description'][:1000],
                "categories": row['categories'],
                "final_price": float(row['final_price']),
                "availability": row['availability'],
                "stock_status": row['stock_status'],
                "rating": float(row['rating']),
                "reviews_count": int(row['reviews_count'])
            }
        )
        points.append(point)
    # Upload in batches
    for i in range(0, len(points), batch_size):
        batch = points[i:i + batch_size]
        client.upsert(collection_name=collection_name, points=batch)
    print(f"✓ Ingested {len(points)} products")
We should consider only the text-rich fields for embeddings; in this case, the title, description, and brand. Here, bge-small-en-v1.5 is used for dense embeddings and BM25 for sparse embeddings. Feel free to use other dense embedding models, or SPLADE or miniCOIL for sparse embeddings, according to your use case.
Hybrid Search
Once the database is ready, you can proceed to the search functions. For simple natural language queries, I will use only dense vector search.
query_embedding = list(dense_model.embed([query]))[0]
results = client.query_points(
    collection_name=collection_name,
    query=query_embedding.tolist(),
    using="dense",
    limit=limit,
    with_payload=True
)
For full hybrid search, both dense and sparse embeddings are used. This helps catch exact keywords such as brand names while preserving semantic relevance.
dense_emb = list(dense_model.embed([query]))[0]
sparse_emb = list(sparse_model.embed([query]))[0]
sparse_vector = models.SparseVector(
    indices=sparse_emb.indices.tolist(),
    values=sparse_emb.values.tolist()
)
results = client.query_points(
    collection_name=collection_name,
    prefetch=[
        models.Prefetch(query=dense_emb.tolist(), using="dense", limit=limit*2),
        models.Prefetch(query=sparse_vector, using="sparse", limit=limit*2)
    ],
    query=models.FusionQuery(fusion=models.Fusion.RRF),
    with_payload=True,
    limit=limit
)
Reciprocal Rank Fusion (RRF) is a simple yet powerful technique used to combine results from multiple search strategies into a single ranked list. It’s especially effective in hybrid search, where different retrieval methods excel at different things.
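Under the hood, RRF scores each document by summing 1/(k + rank) over every result list it appears in, where k is a smoothing constant (commonly 60). Documents ranked highly by several strategies bubble to the top. A minimal pure-Python sketch of the idea, independent of Qdrant's server-side implementation:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked ID lists with Reciprocal Rank Fusion: score = sum of 1/(k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["A", "B", "C"]   # hypothetical dense (semantic) ranking
sparse_hits = ["B", "D", "A"]  # hypothetical sparse (keyword) ranking
print(rrf_fuse([dense_hits, sparse_hits]))  # → ['B', 'A', 'D', 'C']
```

"B" wins because it ranks well in both lists, even though neither ranking put it first by itself.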
For queries involving complex filters, you can use ACORN as shown. Avoid enabling it for every query, since it adds latency, as discussed earlier. Instead, run plain hybrid search by default and switch to ACORN only when a certain condition, such as the number of active filters, is met.
must = [
    models.FieldCondition(key="rating", range=models.Range(gte=min_rating))
]
must_not = [
    models.FieldCondition(key="stock_status", match=models.MatchValue(value="out_of_stock"))
]
should = []
if target_brands:
    should.append(models.FieldCondition(key="brand", match=models.MatchAny(any=target_brands)))
if max_price:
    should.append(models.FieldCondition(key="final_price", range=models.Range(lte=max_price)))
if prefer_low_stock:
    should.append(models.FieldCondition(key="stock_status", match=models.MatchValue(value="low_stock")))
query_embedding = list(dense_model.embed([query]))[0]
results = client.query_points(
    collection_name=collection_name,
    query=query_embedding.tolist(),
    using="dense",
    query_filter=models.Filter(must=must, should=should, must_not=must_not),
    search_params=models.SearchParams(
        hnsw_ef=128,
        acorn=models.AcornSearchParams(
            enable=True,
            max_selectivity=0.4,
        )
    ),
    limit=limit,
    with_payload=True
)
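One simple way to implement that gating is to count the active filter conditions and enable ACORN only past a cutoff. The helper below is a hypothetical sketch; the threshold value is an assumption you would tune against your own latency measurements:

```python
def count_conditions(must=None, should=None, must_not=None):
    """Total number of field conditions across all filter clauses."""
    return sum(len(clause or []) for clause in (must, should, must_not))

def use_acorn(must=None, should=None, must_not=None, threshold=3):
    """Enable ACORN only for filter-heavy queries (hypothetical cutoff)."""
    return count_conditions(must, should, must_not) >= threshold

# Filter-heavy query: 4 conditions in total, so ACORN kicks in
print(use_acorn(must=['rating'], must_not=['stock'], should=['brand', 'price']))  # → True
# Lightly filtered query: plain hybrid search is enough
print(use_acorn(must=['rating']))                                                # → False
```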
For full-text search on title or description, you can also make use of ASCII folding, since these are the fields where character variations typically occur.
match_condition = (
    models.MatchText(text=query) if not match_any
    else models.MatchTextAny(text_any=query)
)
results = client.query_points(
    collection_name=collection_name,
    query_filter=models.Filter(should=[
        models.FieldCondition(key="title", match=match_condition),
        models.FieldCondition(key="description", match=match_condition)
    ]),
    limit=limit,
    with_payload=True
)
return [{"score": r.score, **r.payload} for r in results.points]
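The difference between the two match types can be illustrated with a pure-Python analogy. This mimics only lowercased whitespace tokenization, not Qdrant's actual tokenizer: MatchText requires all query tokens to appear in the field, while MatchTextAny requires at least one.

```python
def text_matches(doc: str, query: str, match_any: bool = False) -> bool:
    """All-tokens vs any-token matching over lowercased whitespace tokens."""
    doc_tokens = set(doc.lower().split())
    hits = [tok in doc_tokens for tok in query.lower().split()]
    return any(hits) if match_any else all(hits)

doc = "Levi's green trucker jacket"
print(text_matches(doc, "green jacket"))                 # → True  (all tokens present)
print(text_matches(doc, "blue jacket"))                  # → False (missing "blue")
print(text_matches(doc, "blue jacket", match_any=True))  # → True  (one token matches)
```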
Combining it all into a simple Streamlit web application, we have:

You can find the complete code in the Notebook and GitHub.
Wrapping Up
Search quality ultimately comes down to understanding how users think and what they expect, not just the sophistication of the algorithms under the hood. Qdrant didn’t magically make queries perfect, but it provided a framework where I could combine dense vectors for semantic understanding, sparse vectors for exact terms, full-text indexing for precision, and filters and ranking for structured relevance, all in one place. This made it easier to reason about search behavior as requirements grew, instead of constantly adjusting glue code between systems. The result was a search setup that handled both intent-driven queries and strict constraints more consistently, without adding unnecessary operational complexity.
Enjoyed This Article?
💖Hit follow and stay tuned for more deep dives! Let’s connect on LinkedIn — I’d love to chat!
References
[1] Liana Patel, Peter Kraft, Carlos Guestrin, Matei Zaharia (2024). ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data.
[2] Thibault Formal, Benjamin Piwowarski, Stéphane Clinchant (2021). SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking.
[3] Qdrant Docs (2025).
[4] Azure AI (2025). Compress Vectors Using Scalar or Binary Quantization.
Images
If not otherwise stated, all images are created by the author.
Published via Towards AI