Leveraging Vector Databases With Embeddings for Fast Image Search and Retrieval
Author(s): Hasib Zunair
Originally published on Towards AI.
Learn the what and why of vector databases and how to use Weaviate vector database with embeddings for searching and retrieving images.
Motivation
Conventional databases (e.g., relational databases) run into performance issues and bottlenecks when storing high-dimensional vectors in tabular format. They also lack efficient similarity search, since they are optimized for exact matches and range queries, which drives up compute cost. Finally, they scale poorly because of their rigid schema and the complexity of managing high-dimensional data. Enter vector databases!
Vector databases are crucial for managing compute costs and scalability while enabling rapid search over unstructured data such as images or text. They excel at storing high-dimensional vectors and efficiently retrieving similar ones.
The article is organized as follows:
- Vector databases
- Comparison with conventional databases
- Image search and retrieval using Weaviate
All code used in this post is available on GitHub.
Vector databases
Vector databases store, index, and query vectors. These vectors, also known as vector embeddings, are simply arrays of numbers that represent data (e.g., images or text) instead of rows and columns.
In a vector database, vectors are first indexed and structured, enabling fast similarity search rather than comparing each vector one by one, which is significantly slower. This process, known as vector indexing, clusters similar vectors together. Specifically, the vectors in a vector database are indexed using algorithms like HNSW, IVFADC, and IVFPQ that enable Approximate Nearest Neighbor (ANN) search. These algorithms map the vectors to a data structure that enables rapid query and retrieval, reducing the compute resources and time needed to find results in large datasets.
After storing and indexing, querying is the step where a query vector is sent to the vector database to quickly search and retrieve similar vectors. More specifically, when the vector database receives a query vector, it compares it against the indexed vectors to determine the nearest neighbors. To compute nearest neighbors, vector databases rely on similarity metrics such as cosine similarity, Euclidean distance, or the dot product between two vectors.
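To make these measures concrete, below is a minimal, illustrative brute-force example in plain NumPy (the vectors and their meanings are made up for illustration). A real vector database avoids this full scan by using the ANN indexes mentioned above.

import numpy as np

# Three stored embeddings (toy values) and one query embedding
stored = np.array([
    [0.1, 0.9, 0.0],   # e.g. a running shoe image
    [0.8, 0.1, 0.1],   # e.g. a hockey stick image
    [0.2, 0.7, 0.1],   # e.g. a trail shoe image
])
query = np.array([0.15, 0.8, 0.05])

dot = stored @ query                                # dot product
euclidean = np.linalg.norm(stored - query, axis=1)  # Euclidean distance
cosine = dot / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

print("nearest by cosine similarity:", int(np.argmax(cosine)))     # higher is closer
print("nearest by Euclidean distance:", int(np.argmin(euclidean))) # lower is closer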
Vectors are useful because they retain the contextual meaning of the data, making it rich in semantics. They place similar entities close together within a vector space (i.e., clustering), which is what makes similarity search possible. They also transform complex unstructured data into arrays of numbers, a representation that is both simple and powerful.
Comparison with conventional databases
Vector databases are different from conventional databases (e.g. relational databases) in a number of ways.
- Conventional databases store structured data in tabular format, while vector databases store unstructured data and vector embeddings.
- Conventional databases store data in columns and retrieve it by keywords. On the other hand, vector databases store embeddings of the original data, which enables efficient similarity search.
- Conventional databases query for rows where the value usually exactly matches the query. In vector databases, a similarity metric is applied to find the vectors most similar to the query. This makes vector search robust to typos and synonyms, since it captures semantic concepts, and often makes results more accurate.
There are some drawbacks of conventional databases, like relational databases (RDBs) for vector search:
- RDBs lead to performance issues and bottlenecks when storing high-dimensional vectors in tabular format. The compute cost of similarity search is also higher, because the query must be compared against every item.
- RDBs lack efficient similarity search, since they are optimized for exact matches and range queries and cannot handle similarity search over high-dimensional vectors.
- RDBs cannot scale due to their rigid schema and the complexity of managing high-dimensional data.
- RDBs do not have vector indexing support; without it, vector search is very slow.
Vector databases prioritize efficient indexing of vector embeddings, enabling faster and more accurate search operations.
Image search and retrieval using Weaviate
Now, we go over an application of vector databases for the task of image search and retrieval. Specifically, we use Weaviate as our vector database where we store our data along with vector embeddings and then query the database with a vector originating from an input image in order to retrieve similar images.
What is the task and why is it important?
Image search and retrieval, which falls under the umbrella of computer vision, has many use cases; we are interested in two of them: product retrieval and visual search. The goal of product retrieval is to return the most similar products given an image of a product along with other related attributes, a feature used by recommendation engines on retail websites. Visual search, on the other hand, aims to return the most visually similar items based on the input image alone. This is useful for customers who are looking for a specific product but do not know its name, saving them from browsing through many items in the product catalog.
Create a Weaviate database
To store our data and vector embeddings, we need to create a database. Head over to weaviate.io and create a Weaviate Cloud (WCD) account. Then, use the free cloud sandbox instance on WCD to create a sandbox cluster, which is your database. After creating a cluster, collect the API key and URL from the Details tab in WCD. An instance is shown below.
Install Weaviate Client library
We use the Python client. Simply run the command below to install the client library (the leading ! is for notebook cells such as Colab; drop it when installing from a terminal).
!pip install -U weaviate-client
Connect to the Weaviate vector database
To connect to the database we just created, we need the API key and the URL. This can be done using the following code snippet:
import weaviate
import json
import os
client = weaviate.Client(
    url="YOUR_WEAVIATE_URL",
    auth_client_secret=weaviate.AuthApiKey(api_key="YOUR_API_KEY"),
)
where YOUR_WEAVIATE_URL and YOUR_API_KEY are your URL and API key, respectively.
Upload dataset with vectors
The data consists of 10 products from Decathlon Canada, for which we have several attributes such as season year, brand label, image URL, sports label, product ID, and country code. Since we want to search and retrieve similar images, we are interested in the image URL. Specifically, we fetch the images from their respective URLs and compute vector embeddings of the images using CLIP. Next, we add the vectors as a new column in our dataset (a minimal sketch of this embedding step is shown after the loading snippet below). The entire process can be found here. We finally load the dataset, which contains the vectors, using the following code snippet.
from google.colab import files
import pandas as pd
import io
uploaded = files.upload()
df = pd.read_csv(io.BytesIO(uploaded['vectors.csv']), sep=';', index_col=0)
df = df.reset_index()
df.head()
where the vector from a row in the dataset looks like:
[-0.01212814450263977, 0.32004088163375854, 0....
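For reference, here is a minimal, hypothetical sketch of how this embedding step could look. The input filename products.csv, the helper function, and the exact preprocessing are assumptions for illustration; the linked notebook shows the actual pipeline used for the dataset.

import io
import requests
import pandas as pd
import torch
import clip
from PIL import Image

# Load the CLIP encoder used to embed the product images
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image_from_url(url):
    # Hypothetical helper: download an image and compute its CLIP embedding
    response = requests.get(url, timeout=10)
    image = Image.open(io.BytesIO(response.content)).convert("RGB")
    image_input = preprocess(image).unsqueeze(0).to(device)
    with torch.no_grad():
        features = model.encode_image(image_input)
    return features.cpu().numpy().tolist()[0]

# "products.csv" is a placeholder for the original product attributes file
df_products = pd.read_csv("products.csv", sep=";")
df_products["vector"] = df_products["image_url"].apply(embed_image_from_url)
df_products.to_csv("vectors.csv", sep=";")  # the file loaded in the snippet above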
Define data collection
We define a data collection (a "class" in the Weaviate schema) to store objects in. This is analogous to creating a table in a relational (SQL) database. Think of it as a bookshelf where we want to store our books. Here, the vectorizer is set to none because we add our own vectors.
# Class definition object.
class_obj = {
    "class": "Images",
    "vectorizer": "none",
}
# Add the class to the schema
client.schema.create_class(class_obj)
Store data in Weaviate
Now, we load selected fields from our dataset and add them to the collection. For each of the 10 products, we store the product ID, the image URL, and the vector embedding of the image in our database.
import ast
vector_lists = [] # add all vectors from the dataset
# Configure a batch process
client.batch.configure(batch_size=100) # Configure batch
with client.batch as batch:
    for index, row in df.iterrows():
        print(f"importing data: {row['image_url']}, {row['product_id']}")
        # one data point
        properties = {
            "product_id": row['product_id'],
            "image_url": row['image_url'],
        }
        # convert string vector to list
        img_vector = ast.literal_eval(row["vector"])
        vector_lists.append(img_vector)
        # add one data point as dict and the vector
        batch.add_data_object(properties, "Images", vector=img_vector)
After this, our database has 10 items, each with three attributes: product ID, image URL, and the image's vector embedding. We can now query the database with another vector to retrieve the most similar items very quickly!
Queries
Given our own image, we want to search and retrieve similar images from the vector database. Below is an example image we will use.
The first step is to compute the vector embeddings of the input image. We again use CLIP for computing embeddings.
import torch
import clip
from PIL import Image
# Load CLIP encoder
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def to_numpy(tensor):
    return (
        tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
    )
Note: You always need to use the same vectorizer when adding vectors in your database and while making queries. This ensures semantic compatibility and accurate similarity search results. Different vectorizers may represent the same data in different ways.
Then upload your image and preprocess it:
# Upload image
uploaded = files.upload()
# Preprocess image and get embedding
image = preprocess(Image.open("shoe.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
image_features_np = to_numpy(image_features).tolist()[0]
image_features_np = [fv for fv in image_features_np]
Since we have an input vector, we use the Near Vector operator to find objects with similar vectors. This returns the top 3 most similar products among the 10 items in our vector database. It also returns a certainty measure, where values close to 1 indicate that the input and retrieved vectors are very similar.
# query vector
nearVector = {
    "vector": image_features_np
}

# get nearest vectors, limited to the top 3 results, with the certainty measure
result = (
    client.query.get("Images", ["product_id", "image_url"])
    .with_near_vector(nearVector)
    .with_limit(3)
    .with_additional(["certainty"])
    .do()
)
print(json.dumps(result, indent=4))
For your own image, the output of the above code snippet will look something like this:
{
    "data": {
        "Get": {
            "Images": [
                {
                    "_additional": {
                        "certainty": 0.7921097576618195
                    },
                    "image_url": "https://contents.mediadecathlon.com/p1291853/sq/1291853.jpg?f=224x224",
                    "product_id": "8493674"
                },
                {
                    "_additional": {
                        "certainty": 0.7880265116691589
                    },
                    "image_url": "https://contents.mediadecathlon.com/m11391253/k$148c4677adee0722821c354c34d12dc5/sq/Chaussures+de+running+Homme+Catamount+Brooks.jpg?f=224x0",
                    "product_id": "018dc1e4-9d0f-4548-baa9-d8b62549fa85"
                },
                {
                    "_additional": {
                        "certainty": 0.7830776572227478
                    },
                    "image_url": "https://contents.mediadecathlon.com/p1855779/sq/1855779.jpg?f=224x224",
                    "product_id": "8612301"
                }
            ]
        }
    }
}
If we go to the image URLs, we will see products such as shoes, hockey sticks, and shorts.
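To inspect them without leaving the notebook, one option (assuming a Colab/Jupyter environment; the loop below simply reuses the result object from the query above) is:

from IPython.display import Image as IPImage, display

# Show each retrieved product image along with its ID and certainty score
for item in result["data"]["Get"]["Images"]:
    print(item["product_id"], item["_additional"]["certainty"])
    display(IPImage(url=item["image_url"], width=224))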
Conclusion
In this article, you learned what vector databases are, why we need them, and how they differ from conventional databases. You also built a workflow for an image search and retrieval use case using Weaviate. Specifically, you created a vector database, connected to it using a client, and stored your data and vectors in the database. Finally, you used your own images as queries to quickly search and retrieve similar images from the vector database.
About the author
I am a Ph.D. candidate at Concordia University in Montreal, Canada, working on computer vision research. At the time of writing, I was also an Applied ML Scientist at Décathlon Canada, where I helped build new ML systems that transform sports images and videos into actionable intelligence. If you're interested in learning more about me, please visit my webpage here.
A special thanks to the members of the AI team at Décathlon Canada for the comments and review, in particular Yan Gobeil and Mitul Patel.