Master LLMs with our FREE course in collaboration with Activeloop & Intel Disruptor Initiative. Join now!


Leveraging SuperDuperDB to Create a Simple De-duplication System Easily
Latest   Machine Learning

Leveraging SuperDuperDB to Create a Simple De-duplication System Easily

Last Updated on December 11, 2023 by Editorial Team

Author(s): Okoh Anita

Originally published on Towards AI.

Image by Author


I have spent a considerable amount of years in the identity resolution field, trying to identify duplicate customer accounts and associating them into groups.

In my experience, there are two types of B2C new customers:

  1. A True new customer: A customer with no existing account in the company database.

2. A False new customer: A customer that has one or more existing accounts in the company’s database

One common issue most large B2C companies face is customer account duplication, i.e., customers opening multiple accounts in a company. Depending on the company’s new customer incentives, customers with multiple accounts could potentially utilize the incentive more than once. If not tracked, it could lead to significant monetary losses over time as more false new customers increase.

Lately, I have been thinking about how LLMs could help identify and associate customers in real-time, i.e., deciding if a customer is eligible for new customer incentives as soon as they register.

This led to crafting a solution that can be summarized in two steps:

  • Using LLM to find semantic similarity between customer details in the database as soon as a customer registers. However, only a semantic search would not suffice as it can output false positive similarities, and these similarities may hurt the business's reputation, especially when an actual new customer is penalized unfairly.
  • Creating a simple re-ranking logic layer as a postprocessing task or a second layer of validation will help to narrow down true positive results.

Like always, my thought process typically ends up with me finding tools to help build simple MVP demos fast. And this time was not any different.

A new framework I have recently been playing with is SuperDuperDB

SuperDuperDB is an open-source framework that attempts to eliminates complex MLOps pipelines, specialized vector databases — and the need to migrate and duplicate data by integrating AI at the data’s source, directly on top of your existing data infrastructure.
More info about the framework documentation here

I used SuperDuperDB to extend the vector search capability of MongoDB as my backend database. And then use the RecordLinkage library to re-rank the results outputted from the vector search as a postprocessing step

The Python Record Linkage Toolkit is a library to link records in or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package contains indexing methods, functions to compare records and classifiers.

More info about the toolkit documentation here

Code Demo

Creating scenarios

Let's say we have a course website called “UpSite” that gives a 10-day free trial to new accounts. The goal is to decide if the customer is eligible for new customer incentives.

First, let's understand the architecture.

Image by Author

The above gives a high-level view of the simplistic customer de-duplication architecture.

Here is the flow

A new customer inputs the name, phone, address and email to register. These details are, first, combined into a string to serve as a search string. It is then converted into vectors using an embedding model to search for the five nearest similar customer details in the database.

Since no cut-off score is added to the similarity search result, we always expect to get the five closest results, regardless of whether they are actually similar records or not. To reduce these potential false positive results, a second validation is done to compare each field and score the similarity between the new customer details and the similar customer details returned by the vector search.

This involves using basic string similarity algorithms like the Jarowickler method with a threshold similarity score of 0.85 for the names, emails and addresses. The exact match score for the phone number is checked instead.

If a particular returned customer detail has a similarity score sum of greater than 0, then the returned customer detail is added to a response Dataframe.

Image by Author

If, at the end of the validation, the length of the Response DataFrame is greater than zero, then three things are returned as responses.

  • The customer details
  • Similar customers' details
  • An eligibility rejection message: “ Sorry, you are not eligible for the new customer 10-day trial.”
Image by Author

However, if the DataFrame length is zero, only the eligibility acceptance message is returned: Thank you for registering. Verify your email in your inbox and start enjoying your new customer 10-day trial.

Image by Author

To make all these logics come to life, I wrapped it up in a streamlit app.

GIF by Author


Now you understand the flow, let's translate it to code in 5 steps

  1. First, let's convert our MongoDB into a SuperDuperDB Object and input our data
import json
from superduperdb import superduper
from superduperdb import Document
from superduperdb.backends.mongodb import Collection

with open('data.json') as f:
data = json.load(f)


db = superduper(mongodb_uri, artifact_store="filesystem://./data/")

collection = Collection('customer_details')

db.execute(collection.insert_many([Document(r) for r in data]))

All that is done above is to: Convert the MongoDB instance into a SuperDuperDB object and define where the artifacts are to be stored. A Collection object was also instantiated, and then the JSON file containing the customer details was input into the Collection instantiated.


  • Adding the artifacts path is optional. It can be a local file path or an external storage path.
  • If you already have a MongoDB database with datasets, all you need to do is convert it into a SuperDuperDB object using your Mongo URI and instantiate the Collection.

Feel free to check out your dataset afterward by running the code below

result = db.execute(Collection('customer_details').find_one())

This is what the first row of the data looks like

Document({'Full Name': 'Denny Täsche', 'Email': 'denny.tä[email protected]', 'Address': 'Corina-Stumpf-Ring 36 02586 Gransee', 'Phone Number': '03772092016', 'details': 'denny täsche denny.tä[email protected] corina-stumpf-ring 36 02586 gransee 03772092016', '_fold': 'train', '_id': ObjectId('6565c8c620d98740773c2874')})

SuperDuperDB supports other popular databases like Postgres, DuckDB, etc. and can be converted to a Superduperdb object the same way

More info about how I generated the data can be found below

Creating Dataset Sythnetically: Thought-Process

Using domain knowledge to mimic real-world data discrepancies

2. Before adding the vector search functionality, you need to decide and instantiate an embedding model. For this demo, I would use a HuggingFace model

import sentence_transformers
from superduperdb import Model, vector

model = Model(
predict_method='encode', # Specify the prediction method
postprocess=lambda x: x.tolist(), # Define postprocessing function
batch_predict=True, # Generate predictions for a set of observations all at once

Other popular embedding models, including OpenAI API, Cohere API etc, are supported by SuperDuperDB as well. For more info, check here.

3. Next, add the vector search functionality to the SuperDuperDB, include the instantiated model, and specify the field for the vector search. In my case, I am using the field called “details”.

from superduperdb import Listener, VectorIndex

predict_kwargs={'max_chunk_size': 1000},


  • The Listener() function helps to listen to new requests coming in.
  • The key argument specifies which field the vector search will be done on in the collection.
  • The model is named as an identifier. This is useful when you have multiple models in the database and also for versioning purposes.

You can view the model in the database using.'model')

4. Your database is ready to be used for semantic search. What you need to do is

  • Define the search term to search,
  • Specify the number of nearest similar records to return and the embedded model identifier to use for the vector search.

Say the goal in the above is to search for five most similar customer details to

gesche herr, [email protected], 42130 neubrandenburg

The code would look something like this

search = 'gesche herr [email protected] 42130 neubrandenburg'
n= 5
result = db.execute(
.like(Document({'details': search_term}), vector_index=f'pymongo-docs-{model.identifier}', n=n)
.find({}, {'Full Name': 1, 'Email': 1, 'Address': 1, 'Phone Number': 1, 'details': 1, 'score': 1, '_id': 1})

The results can viewed below

for r in result:
{'Full Name': 'Arnim Sauer', 'Email': '[email protected]', 'Address': 'Henning-Fiebig-Allee 696 42911 Lippstadt', 'Phone Number': '0799339976', 'details': 'arnim sauer [email protected] henning-fiebig-allee 696 42911 lippstadt 0799339976', '_id': ObjectId('657056e7fdcaa041b4c800f5'), 'score': 0.5337394750011253}
{'Full Name': 'Meinolf Neureuther', 'Email': '[email protected]', 'Address': 'Auguste-Adler-Platz 0 36440 Belzig', 'Phone Number': '0715451963', 'details': 'meinolf neureuther [email protected] auguste-adler-platz 0 36440 belzig 0715451963', '_id': ObjectId('657056e7fdcaa041b4c800f7'), 'score': 0.5882743380008798}
{'Full Name': 'Gesche Herrmann', 'Email': '[email protected]', 'Address': 'Ilhan-Hermighausen-Gasse 8 42130 Neubrandenburg', 'Phone Number': '0816778617', 'details': 'gesche herrmann [email protected] ilhan-hermighausen-gasse 8 42130 neubrandenburg 0816778617', '_id': ObjectId('657056e7fdcaa041b4c800fb'), 'score': 0.8428972762297998}
{'Full Name': 'Gesche Herrmann', 'Email': '[email protected]', 'Address': None, 'Phone Number': None, 'details': 'gesche herrmann [email protected] ', '_id': ObjectId('657056e7fdcaa041b4c800ff'), 'score': 0.8105624427206899}
{'Full Name': 'Gesche H.', 'Email': None, 'Address': None, 'Phone Number': '0816778617', 'details': 'gesche h. 0816778617', '_id': ObjectId('657056e7fdcaa041b4c80102'), 'score': 0.7008963328169409}

Although the search helps to narrow the search to potentially similar records, we still see above that there are some false positive similarities in the results returned. To reduce this, we add another layer of similarity validation. Hence, the final step

5. Finally, we create a reranking method using RecordLinkage. Here is the code below

method = 'jarowinkler'
threshold = 0.85
target_df = pd.DataFrame([{'Full Name': 'Gesche herr',
'Email': '[email protected]',
'Address':'42130 neubrandenburg',
'Phone Number': None,
'_id': '12344556667788'

comparison_data = [result.unpack() for result in nearest_results]
comparison_df = pd.DataFrame(comparison_data).set_index('_id')

# Create an indexer for record linkage
indexer = recordlinkage.Index()
pairs = indexer.index(target_df, comparison_df)

# Setup comparison criteria
compare = recordlinkage.Compare()
compare.exact('Phone Number', 'Phone Number', label='Phone Number')
compare.string('Full Name', 'Full Name', method=method, threshold=threshold, label='Full Name')
compare.string("Email", "Email", method=method, threshold=threshold, label="Email")
compare.string("Address", "Address", method=method, threshold=threshold, label="Address")

# Compute similarity features
similarity_features = compare.compute(pairs, target_df, comparison_df)
similarity_columns = ['Phone Number', 'Full Name', 'Email', 'Address']
similarity_features['similarity_sum'] = similarity_features[similarity_columns].sum(axis=1)
similarity_features = similarity_features.reset_index()

# Filter results based on similarity sum
similar_ids = similarity_features[similarity_features['similarity_sum'] >= 1.0]['_id_2'].tolist()
filtered_df = comparison_df.loc[similar_ids]

# Sort by score and return
filtered_df.sort_values(by='score', ascending=False)

The idea is to re-compare the search term as a Target DataFrame with the returned search result as a Comparison DataFrame by checking the similarity of each field using the Jarowinkler string similarity algorithm with a threshold of an arbitrary 0.85. For each compared customer detail, if the sum of the similarity is greater or equal to 1, we keep the customer details. If not, we filter out.


  • This final layer can be switched or expanded to other business logic as an extra validation.


You just learned how to create a simplistic customer de-duplication platform. Feel free to check the web app repo

GitHub – anitaokoh/customer_deduplication

Contribute to anitaokoh/customer_deduplication development by creating an account on GitHub.

Key Takeaways

Real-life customer de-duplication platforms are way more complex than this, but at its core, the above demo demonstrates their core essentials, which are

  • A way to block matches or filter potential matches
  • A way to streamline and narrow down those potential matches to get the actual matches or reduce the false positive matches
  • A framework able to make decisions in real-time if that is the business need

Another fascinating thing about the SuperDuperDB framework is its extra functionality to stack models in the database, i.e. I could have stacked the embedding model and the RecordLinkage model together in the database. For more info, check it out below

Creating complex stacks of functionality U+007C SuperDuperDB documentation

With the declarative API, it's possible to create multiple

Subscribe to Medium using this link and get full access to every article on Medium

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Feedback ↓