

Latest Feature of ChunkDot


Last Updated on January 25, 2024 by Editorial Team

Author(s): Rodrigo Agundez

Originally published on Towards AI.


Photo by Emile Guillemot on Unsplash

In my last two blog posts, I introduced ChunkDot, a library for multi-threaded matrix multiplication and cosine similarity calculations at scale, and showed how it can compute pairwise cosine similarities for very big matrices containing either dense or sparse embeddings.

Cosine Similarity for 1 Trillion Pairs of Vectors: Introducing ChunkDot (pub.towardsai.net)

Bulk Similarity Calculations for Sparse Embeddings: ChunkDot support for sparse matrices (pub.towardsai.net)

chunkdot: Multi-threaded matrix multiplication and cosine similarity calculations (pypi.org)

In the latest release of ChunkDot, you can calculate the cosine similarity between embeddings stored in two different matrices; in previous versions, you could only do it between embeddings within the same matrix. These embeddings can be stored in dense or sparse matrices, as ChunkDot supports both.

Cosine Similarity Top K

Let’s go over how ChunkDot works when a second embedding matrix is used in the calculations. Even though the core of the implementation is in chunkdot.chunkdot (which can also be used for plain matrix multiplication), I’ll focus on the chunkdot.cosine_similarity_top_k function, as shown below.

Diagram describing ChunkDot’s Cosine Similarity Top K algorithm

The algorithm consists of three main parts:

  • Splitting the embeddings matrix into the optimal number of chunks, where the optimal chunk size is calculated from the available memory.
  • Performing the parallelized matrix multiplications over the chunks.
  • Collecting the results from each parallel thread to form the similarity matrix.

A more detailed description of this process can be found in the blog post below.

Cosine Similarity for 1 Trillion Pairs of Vectors: Introducing ChunkDot (pub.towardsai.net)
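To make this concrete, below is a minimal, sequential NumPy sketch of those three steps. The function name and its parameters (including the optional second matrix and the fixed chunk_size) are illustrative only and not ChunkDot’s API; ChunkDot runs the chunk loop in parallel Numba threads and derives the chunk size from the available memory.

import numpy as np
from scipy.sparse import csr_matrix

def cosine_similarity_top_k_sketch(emb, emb_right=None, top_k=50, chunk_size=5000):
    """Sequential sketch of chunked top-k cosine similarity (illustrative only)."""
    if emb_right is None:
        emb_right = emb  # compare the matrix against itself
    # Normalize rows so that a plain dot product equals cosine similarity.
    left = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    right = emb_right / np.linalg.norm(emb_right, axis=1, keepdims=True)
    rows, cols, vals = [], [], []
    # ChunkDot parallelizes this loop with Numba; here it is sequential.
    for start in range(0, left.shape[0], chunk_size):
        chunk_sims = left[start:start + chunk_size] @ right.T
        # Keep only the top_k most similar columns for each row.
        top = np.argsort(-chunk_sims, axis=1)[:, :top_k]
        for i, idx in enumerate(top):
            rows.extend([start + i] * len(idx))
            cols.extend(idx)
            vals.extend(chunk_sims[i, idx])
    # Collect all results into one sparse similarity matrix.
    return csr_matrix((vals, (rows, cols)), shape=(left.shape[0], right.shape[0]))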

The addition in the latest release is that the algorithm now supports a second embeddings matrix as input, called embeddings_right in the diagram, which contains the embeddings you want the first embeddings matrix to be compared against.
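In the sketch above, this corresponds to the hypothetical emb_right argument: chunks of the first matrix are multiplied against the transpose of the second one, so the result has shape (n_left, n_right). For example:

# Compare 1,000 of our embeddings against 300 embeddings from another source.
ours = np.random.randn(1000, 256)
others = np.random.randn(300, 256)
sims = cosine_similarity_top_k_sketch(ours, emb_right=others, top_k=30)
print(sims.shape)  # (1000, 300), with 30 stored similarities per row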

This new functionality will allow the user to make similarity comparisons between items that do not necessarily belong to the same “domain”. Below are some of the questions this additional feature allows us to answer.

  • Which items (e.g., pictures, videos, users, articles) from the last month are the most similar to today’s items? This might be useful to show the users the latest content first and then recommend related content that perhaps is not as new.
  • Which items from my competitor’s platform are the most similar to the items on my platform? This might be useful to identify overcrowded item sections in the market and to spot opportunities to capture the market in a new item category, for example.

Example of using ChunkDot

Imagine we are running an e-commerce site with thousands of items. After creating features out of the items on the site, or after building a recommendation engine that contains some sort of item embedding layer, we convert all 100K items into 256-dimensional embeddings and store them in matrix form, either dense or sparse (in this example, we will use a dense matrix).

import numpy as np

n_items = 100000
embeddings_dim = 256
my_items = np.random.randn(n_items, embeddings_dim)

Our first use case is to find the 50 most similar items for each item on the site, so we can display similar options to the user. We can use the basic functionality of ChunkDot to compare all pairs of items (1E10 pairs). Even on my old MacBook Air with only 8GB of memory, it takes about a minute.

import chunkdot

similarities = chunkdot.cosine_similarity_top_k(my_items, top_k=50)
similarities
<100000x100000 sparse matrix of type '<class 'numpy.float64'>'
    with 5000000 stored elements in Compressed Sparse Row format>

The similarities matrix contains, for each item, the indices of its 50 most similar items out of the 100K, together with the corresponding similarity values.
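Since the result is a SciPy CSR matrix, the neighbors of any given item can be read directly from its row. A small sketch, using the similarities variable from above:

# Indices and scores of the items most similar to item 0.
row = similarities.getrow(0)
neighbor_idx = row.indices  # column indices of the 50 stored neighbors
neighbor_sim = row.data     # the corresponding cosine similarity values

# Sort from most to least similar and show the top 5.
order = neighbor_sim.argsort()[::-1]
print(neighbor_idx[order][:5])
print(neighbor_sim[order][:5])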

Our second use case is to measure the percentage overlap between our e-commerce offering and our competitor’s, and then provide strategic recommendations about which types of items are overcrowded and which represent an unexploited opportunity, plus a competitive price analysis.

For this use case, the new functionality of ChunkDot comes in handy. Using the same embedding procedure as for our own items, we convert our competitor’s 35K items into another embeddings matrix. After that, we calculate the 30 most similar competitor items for each of ours.

n_items = 35000
embeddings_dim = 256
competitor_items = np.random.randn(n_items, embeddings_dim)
similarities = chunkdot.cosine_similarity_top_k(
    my_items, embeddings_right=competitor_items, top_k=30
)
similarities
<100000x35000 sparse matrix of type '<class 'numpy.float64'>'
    with 3000000 stored elements in Compressed Sparse Row format>

This time, the similarities matrix contains, for each of our items, the indices of the 30 most similar items out of our competitor’s 35K items. With this information, we can start all kinds of competitor analyses to satisfy the use case.
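As a sketch of how such an analysis could start, one way to turn the similarities matrix into an overlap percentage is to count how many of our items have at least one competitor item above a similarity threshold. The 0.8 cutoff below is an arbitrary illustrative choice, not something ChunkDot prescribes:

# Highest competitor similarity for each of our 100K items.
max_sim = similarities.max(axis=1).toarray().ravel()

# Share of our items that have a close competitor counterpart.
threshold = 0.8  # illustrative cutoff; tune for your embeddings
overlap_pct = (max_sim > threshold).mean() * 100
print(f"{overlap_pct:.1f}% of our items overlap with the competitor's offering")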

Conclusion

I hope you find this blog post and ChunkDot useful! I don’t think I will add much more to this package, as this addition completes the functionality needed; I’ll probably just keep the dependencies up to date.

Having said that, adding GPU support to accelerate the calculations even further could be a fun addition, and it shouldn’t be too difficult since Numba supports GPU acceleration. You’re welcome to contribute!

GitHub — rragundez/chunkdot: Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices (github.com)

I really enjoyed working on this problem and getting satisfactory results. Thanks for reading.


