

Latest Feature of ChunkDot


Last Updated on January 25, 2024 by Editorial Team

Author(s): Rodrigo Agundez

Originally published on Towards AI.


Photo by Emile Guillemot on Unsplash

In my last two blog posts, I introduced ChunkDot, a library for multi-threaded matrix multiplication and cosine similarity calculations at scale, and showed how it can compute pairwise cosine similarities for very big matrices containing either dense or sparse embeddings.

Cosine Similarity for 1 Trillion Pairs of Vectors: Introducing ChunkDot (pub.towardsai.net)

Bulk Similarity Calculations for Sparse Embeddings: ChunkDot support for sparse matrices (pub.towardsai.net)

chunkdot: Multi-threaded matrix multiplication and cosine similarity calculations (pypi.org)

In the latest release of ChunkDot, you can calculate the cosine similarity between embeddings stored in two different matrices; in previous versions, you could only do it between embeddings within the same matrix. These embeddings can be stored in dense or sparse matrices, as ChunkDot supports both.

Cosine Similarity Top K

Let’s go over how ChunkDot works when a second embedding matrix is used in the calculations. Even though the core of the implementation is in chunkdot.chunkdot (which can also be used for plain matrix multiplication), I’ll focus on the chunkdot.cosine_similarity_top_k function, as shown below.

Diagram describing ChunkDot’s Cosine Similarity Top K algorithm

The algorithm consists of three main parts:

  • Splitting the embeddings matrix into the optimal number of chunks, where the optimal chunk size is calculated from the available memory.
  • Performing the parallelized matrix multiplications over the chunks.
  • Collecting the results from each parallel thread to form the similarity matrix.

A more detailed description of this process can be found in the blog post below.

Cosine Similarity for 1 Trillion Pairs of Vectors: Introducing ChunkDot (pub.towardsai.net)
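To make this concrete, below is a minimal, sequential NumPy sketch of those three steps. The function name and its parameters (including the optional second matrix and the fixed chunk_size) are illustrative only and not ChunkDot’s API; ChunkDot runs the chunk loop in parallel Numba threads and derives the chunk size from the available memory.

import numpy as np
from scipy.sparse import csr_matrix

def cosine_similarity_top_k_sketch(emb, emb_right=None, top_k=50, chunk_size=5000):
    """Sequential sketch of chunked top-k cosine similarity (illustrative only)."""
    if emb_right is None:
        emb_right = emb  # compare the matrix against itself
    # Normalize rows so that a plain dot product equals cosine similarity.
    left = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    right = emb_right / np.linalg.norm(emb_right, axis=1, keepdims=True)
    rows, cols, vals = [], [], []
    # ChunkDot parallelizes this loop with Numba; here it is sequential.
    for start in range(0, left.shape[0], chunk_size):
        chunk_sims = left[start:start + chunk_size] @ right.T
        # Keep only the top_k most similar columns for each row.
        top = np.argsort(-chunk_sims, axis=1)[:, :top_k]
        for i, idx in enumerate(top):
            rows.extend([start + i] * len(idx))
            cols.extend(idx)
            vals.extend(chunk_sims[i, idx])
    # Collect all results into one sparse similarity matrix.
    return csr_matrix((vals, (rows, cols)), shape=(left.shape[0], right.shape[0]))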

The addition in the latest release is that the algorithm now supports a second embeddings matrix as input, called embeddings_right in the diagram, which contains the embeddings you want the first embeddings matrix to be compared against.
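In the sketch above, this corresponds to the hypothetical emb_right argument: chunks of the first matrix are multiplied against the transpose of the second one, so the result has shape (n_left, n_right). For example:

# Compare 1,000 of our embeddings against 300 embeddings from another source.
ours = np.random.randn(1000, 256)
others = np.random.randn(300, 256)
sims = cosine_similarity_top_k_sketch(ours, emb_right=others, top_k=30)
print(sims.shape)  # (1000, 300), with 30 stored similarities per row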

This new functionality will allow the user to make similarity comparisons between items that do not necessarily belong to the same “domain”. Below are some of the questions this additional feature allows us to answer.

  • Which items (e.g., pictures, videos, users, articles) from the last month are the most similar to today’s items? This might be useful to show the users the latest content first and then recommend related content that perhaps is not as new.
  • Which items from my competitor’s platform are the most similar to the items on my platform? This might be useful to identify overcrowded item sections in the market and to spot opportunities to capture the market in a new item category, for example.

Example of using ChunkDot

Imagine we are running an e-commerce site with thousands of items. After creating features out of the items on the site, or after building a recommendation engine that contains some sort of item embedding layer, we convert all 100K items into 256-dimensional embeddings and store them in matrix form, either dense or sparse (in this example, we will use a dense matrix).

import numpy as np

n_items = 100000
embeddings_dim = 256
my_items = np.random.randn(n_items, embeddings_dim)

Our first use case is to find the 50 most similar items for each item on the site, so we can display similar options to the user. We can use the basic functionality of ChunkDot to compare all pairs of items (1E10 pairs). Even on my old MacBook Air with only 8GB of memory, it takes about a minute.

import chunkdot

similarities = chunkdot.cosine_similarity_top_k(my_items, top_k=50)
similarities
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
    with 5000000 stored elements in Compressed Sparse Row format>

The similarities matrix contains, for each item, the indices of its 50 most similar items out of the 100K, together with the corresponding similarity values.
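Since the result is a SciPy CSR matrix, the neighbors of any given item can be read directly from its row. A small sketch, using the similarities variable from above:

# Indices and scores of the items most similar to item 0.
row = similarities.getrow(0)
neighbor_idx = row.indices  # column indices of the 50 stored neighbors
neighbor_sim = row.data     # the corresponding cosine similarity values

# Sort from most to least similar and show the top 5.
order = neighbor_sim.argsort()[::-1]
print(neighbor_idx[order][:5])
print(neighbor_sim[order][:5])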

Our second use case is to measure the percentage overlap between our e-commerce offering and our competitor’s, and then provide strategic recommendations about which types of items are overcrowded and which represent an unexploited opportunity, plus a competitive price analysis.

For this use case, the new functionality of ChunkDot comes in handy. Using the same embedding procedure as for our own items, we convert our competitor’s 35K items into another embeddings matrix. After that, we calculate the 30 most similar competitor items for each of ours.

n_items = 35000
embeddings_dim = 256
competitor_items = np.random.randn(n_items, embeddings_dim)
similarities = chunkdot.cosine_similarity_top_k(
    my_items, embeddings_right=competitor_items, top_k=30
)
similarities
<100000x35000 sparse matrix of type '<class 'numpy.float64'>'
    with 3000000 stored elements in Compressed Sparse Row format>

This time, the similarities matrix contains, for each of our items, the indices of the 30 most similar items out of our competitor’s 35K items. With this information, we can start all kinds of competitor analyses to satisfy the use case.
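As a sketch of how such an analysis could start, one way to turn the similarities matrix into an overlap percentage is to count how many of our items have at least one competitor item above a similarity threshold. The 0.8 cutoff below is an arbitrary illustrative choice, not something ChunkDot prescribes:

# Highest competitor similarity for each of our 100K items.
max_sim = similarities.max(axis=1).toarray().ravel()

# Share of our items that have a close competitor counterpart.
threshold = 0.8  # illustrative cutoff; tune for your embeddings
overlap_pct = (max_sim > threshold).mean() * 100
print(f"{overlap_pct:.1f}% of our items overlap with the competitor's offering")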

Conclusion

I hope you find this blog post and ChunkDot useful! I don’t think I will add much more to this package, as this addition completes the functionality needed; I’ll probably just keep the dependencies up to date.

Having said that, adding GPU support to accelerate the calculations even further could be a fun addition, and it shouldn’t be too difficult since Numba supports GPU acceleration. You’re welcome to contribute!

GitHub — rragundez/chunkdot: Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices (github.com)

I really enjoyed working on this problem and getting satisfactory results. Thanks for reading.


