Latest Feature of ChunkDot
Last Updated on January 25, 2024 by Editorial Team
Author(s): Rodrigo Agundez
Originally published on Towards AI.
In my last two blog posts, I introduced ChunkDot, a library for multi-threaded matrix multiplication and cosine similarity calculations at scale, and showed how it can calculate pairwise cosine similarities for very big matrices containing either dense or sparse embeddings.
Cosine Similarity for 1 Trillion Pairs of Vectors
Introducing ChunkDot
pub.towardsai.net
Bulk Similarity Calculations for Sparse Embeddings
ChunkDot support for sparse matrices
pub.towardsai.net
chunkdot
Multi-threaded matrix multiplication and cosine similarity calculations.
pypi.org
In the latest release of ChunkDot, you can calculate the cosine similarity between embeddings stored in two different matrices. In previous versions, you could only do it between embeddings in the same matrix. The embeddings can be stored in either dense or sparse matrices; ChunkDot supports both types.
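Conceptually, the two-matrix mode computes the cosine similarity between every row of one matrix and every row of the other. Here is a minimal plain-NumPy sketch of that computation (the variable names are illustrative, not part of the ChunkDot API, and ChunkDot additionally chunks the work and keeps only the top-k entries per row):

```python
import numpy as np

# Two toy embedding matrices with the same dimensionality.
left = np.random.randn(8, 16)
right = np.random.randn(5, 16)

# Normalize the rows so that a plain dot product equals cosine similarity.
left_n = left / np.linalg.norm(left, axis=1, keepdims=True)
right_n = right / np.linalg.norm(right, axis=1, keepdims=True)

# Every row of `left` against every row of `right`: an (8, 5) matrix.
similarities = left_n @ right_n.T
```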
Cosine Similarity Top K
Let's go over how ChunkDot works when using a second embedding matrix in the calculations. Even though the core of the implementation is in chunkdot.chunkdot (which can be used for general matrix multiplication), I'll focus on the chunkdot.cosine_similarity_top_k function as shown below.
The algorithm consists of three main parts:
- splitting the embeddings matrix into the optimal number of chunks, where the optimal chunk size is calculated from the available memory,
- performing the parallelized matrix multiplications over the chunks, and
- collecting the results from each parallel thread to form the similarity matrix.
A more detailed description of this process can be found in the introductory blog post linked above, Cosine Similarity for 1 Trillion Pairs of Vectors.
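The three parts can be sketched in plain NumPy/SciPy. This is an illustrative single-threaded version, not the library's actual implementation: ChunkDot parallelizes the per-chunk work with Numba and derives the chunk size from the available memory, while here the chunk size is a fixed parameter and the chunks run sequentially.

```python
import numpy as np
from scipy.sparse import csr_matrix

def cosine_similarity_top_k_sketch(embeddings, top_k, chunk_size=1000):
    # Part 1: normalize once, then process the rows in fixed-size chunks.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    n = normed.shape[0]
    data, indices, indptr = [], [], [0]
    for start in range(0, n, chunk_size):
        # Part 2: matrix multiplication of one chunk against all rows.
        sims = normed[start:start + chunk_size] @ normed.T
        # Keep only the top_k most similar items per row.
        top = np.argsort(-sims, axis=1)[:, :top_k]
        for row, cols in enumerate(top):
            data.extend(sims[row, cols])
            indices.extend(cols)
            indptr.append(indptr[-1] + top_k)
    # Part 3: collect the per-chunk results into one sparse CSR matrix.
    return csr_matrix((data, indices, indptr), shape=(n, n))
```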
The addition in the latest release is that the algorithm now supports a second embeddings matrix as input, called embeddings_right, which contains the embeddings you want the main embeddings matrix to be compared against.
This new functionality will allow the user to make similarity comparisons between items that do not necessarily belong to the same "domain". Below are some of the questions this additional feature allows us to answer.
- Which items (e.g., pictures, videos, users, articles) from the last month are the most similar to today's items? This might be useful to show the users the latest content first and then recommend related content that perhaps is not as new.
- Which items on my competitor's platform are the most similar to the items on my platform? This might be useful to identify overcrowded item sections in the market and to spot opportunities to capture the market in a new item category, for example.
Example of using ChunkDot
Imagine we are running an e-commerce site with thousands of items. After creating features out of the items on the site or after creating a recommendation engine that contains some sort of item embedding layer, we convert all 100K items into 256-dimensional embeddings and store them in matrix form, in either a dense or sparse matrix (in this example we will be using a dense matrix).
import numpy as np
import chunkdot

n_items = 100000
embeddings_dim = 256
my_items = np.random.randn(n_items, embeddings_dim)
Our first use case is to find the 50 most similar items for each item on the site, to display similar options to the user. We can use the basic functionality of ChunkDot to compare all pairs of items (1E10 pairs). Even on my old MacBook Air with only 8GB of memory, it takes about a minute.
similarities = chunkdot.cosine_similarity_top_k(my_items, top_k=50)
similarities
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
with 5000000 stored elements in Compressed Sparse Row format>
The similarities matrix contains, for each item, the indices and similarity scores of the 50 most similar items out of the 100K.
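Since the returned CSR matrix stores exactly top_k entries per row, the neighbor indices and scores for any item can be read straight from that row's stored entries. Here is a small self-contained sketch with a toy CSR matrix standing in for the real similarities matrix (the numbers are made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for `similarities`: 4 items, top_k=2 stored entries per row.
top_k = 2
data = np.array([0.9, 0.8, 0.7, 0.6, 0.95, 0.5, 0.85, 0.4])
indices = np.array([1, 2, 0, 3, 3, 0, 2, 1])
indptr = np.arange(0, 9, 2)  # exactly top_k entries per row
similarities = csr_matrix((data, indices, indptr), shape=(4, 4))

# Neighbors and scores for one item come from that row's stored entries.
row = similarities.getrow(0)
neighbor_ids, scores = row.indices, row.data

# Or recover all neighbor lists at once with a single reshape, which works
# precisely because every row stores the same number of entries.
all_neighbors = similarities.indices.reshape(-1, top_k)
```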
Our second use case is to find the percentage overlap between our e-commerce offering and our competitor's, and then provide strategic recommendations on which types of items are overcrowded and which represent an unexploited opportunity, plus a competitive price analysis.
For this use case, the new functionality of ChunkDot comes in handy. Using the same embedding procedure as we used for our items, we convert our competitorsβ 35K items into another embeddings matrix. After that, we calculate the 30 most similar competitor items for each of ours.
n_items = 35000
embeddings_dim = 256
competitor_items = np.random.randn(n_items, embeddings_dim)
similarities = chunkdot.cosine_similarity_top_k(
    my_items, embeddings_right=competitor_items, top_k=30
)
<100000x35000 sparse matrix of type '<class 'numpy.float64'>'
with 3000000 stored elements in Compressed Sparse Row format>
This time, the similarities matrix contains, for each of our items, the indices of the 30 most similar items out of the 35K competitor items. With this information, we can start all kinds of competitor analyses to satisfy the use case.
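As one illustrative example of such an analysis, a "percentage overlap" could be defined as the share of our items whose best competitor match exceeds a similarity threshold. Both the threshold and the toy numbers below are assumptions for the sketch, not part of ChunkDot:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for the 100000x35000 `similarities` matrix: 6 of our items
# vs 4 competitor items, with the top-2 scores stored per row.
data = np.array([0.9, 0.4, 0.8, 0.2, 0.3, 0.1, 0.95, 0.5, 0.6, 0.55, 0.2, 0.15])
indices = np.array([0, 1, 2, 3, 1, 0, 3, 2, 0, 1, 2, 3])
indptr = np.arange(0, 13, 2)
similarities = csr_matrix((data, indices, indptr), shape=(6, 4))

# An item "overlaps" with the competitor's offering when its best match
# exceeds a similarity threshold (0.7 here, an illustrative choice).
best_match = similarities.max(axis=1).toarray().ravel()
overlap_pct = 100 * np.mean(best_match > 0.7)
```

With these toy numbers, three of the six items have a best match above 0.7, so the overlap comes out to 50%.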
Conclusion
I hope you find this blog post and ChunkDot useful! I don't think I will be adding much more to this package, as this addition completes the functionality needed. I'll probably just be updating the dependencies.
Having said that, adding GPU support to accelerate the calculations even further could be a fun addition, and it shouldn't be too difficult since Numba supports GPU acceleration. You're welcome to contribute!
GitHub - rragundez/chunkdot: Multi-threaded matrix multiplication and cosine similarity…
Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for…
github.com
I really enjoyed working on this problem and getting satisfactory results. Thanks for reading.
Published via Towards AI