

10 Cool Things You Can Do With Embeddings! [Part 1]

Last Updated on August 29, 2023 by Editorial Team

Author(s): Zubia Mansoor

Originally published on Towards AI.

Apply these concepts to solve real-world industry problems in deep learning

Taking a step away from classical machine learning (ML), embeddings are at the core of most deep learning (DL) use cases. Having a grasp of this concept enables you to do flexible tasks with the feature space, and reframe ML/DL problems differently, especially with high-dimensional data in Computer Vision and Natural Language Processing.

Embeddings have created a notable impact on several areas of applications today, including Large Language Models (LLMs). What the literature is lacking among these great, although scattered, concepts around embeddings is a coherent blueprint of different industry applications and how to get started in that space! That is why this blog will walk you through the different ways you can leverage embeddings and apply them in real-world industry problems.

This article is part one of a two-part series. It is meant to serve as a beginner’s guide to different types of popular open-source models while familiarizing you with the core concept of embeddings.

An Intuitive Explanation of Embeddings

Embeddings are low-dimensional, learned continuous vector representations of discrete variables [1].

We can break up this definition and absorb the important points:

  • lower dimensionality than input data
  • compressed representation of data
  • capture complex non-linear relationships learned by the model as a linear representation
  • reduce dimensions by storing relevant information and discarding the noise
  • usually extracted from the final layers of the neural networks (just before the classifier)

Let us intuitively understand the definition and the true potential of embeddings using an example. Suppose we are analyzing the performance of call centre operators based on their customer survey forms. The forms have been submitted by thousands of customers and contain millions of unique words. To perform any sort of text analysis, our input parameter set will be huge and likely to result in poor performance. This is where embeddings come into play!

Instead of taking the original input parameter set, we take all the unique words and express them as embeddings. This way, we reduce dimensions and bring the input feature set down to, let us say, 9, which is more consumable by the model. If we visualize the embeddings for the operators, it is easy to spot that Operators B and C received similar customer feedback. Another example can be found here [2].

Image by author

What Can We Do With Embeddings?

In part I of the series, we will explore the following applications of embeddings:

  • Use Text Embeddings To Find Similar Texts
  • Use Visual Embeddings To Find Similar Images
  • Use Different Types of Embeddings To Find Similar Items
  • Use Different Weights on Embeddings To Find Similar Items
  • Put Image & Text Embeddings In the Same Space (Multimodal)

Each point builds on the previous one, so the techniques compound. The code snippets are self-contained and beginner-friendly to play around with, and they can serve as building blocks for more complex systems. Now, let us dive deeper into each use case below.

I. Use Text Embeddings To Find Similar Texts

Text embeddings = representing text (words or sentences) as real-valued vectors in a lower dimensional space.

Text Embeddings

Now, suppose we want to know how similar or dissimilar the above sentences are. We can do that using a text encoder. Here, we use a pretrained DistilBERT by Hugging Face to calculate text embeddings and cosine similarity to compute similarity!

Compare DistilBERT text embeddings for different texts using cosine similarity
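A minimal sketch of such a comparison could look like the following. The checkpoint name (`distilbert-base-uncased`), the mean-pooling strategy, and the example sentences are assumptions for illustration, not the article’s exact snippet:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def embed_texts(texts):
    """Mean-pool DistilBERT's last hidden state into one vector per text."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModel.from_pretrained("distilbert-base-uncased")
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # (batch, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    pooled = (hidden * mask).sum(1) / mask.sum(1)       # mean pooling
    return pooled.numpy()


if __name__ == "__main__":
    # Hypothetical sentence pair; semantically close texts score higher
    e1, e2 = embed_texts(["The cat sat on the mat.",
                          "A kitten rested on the rug."])
    print(round(cosine_similarity(e1, e2), 2))
```

Any sentence-level pooling scheme works here; mean pooling over non-padding tokens is just one common choice.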

We observe that texts containing similar and matching words give a higher similarity score. On the other hand, texts that differ in word choice, meaning, and context (depending on the model!) result in a lower similarity score. Thus, text embeddings use the semantic closeness of words to establish a meaningful relationship.

Real-World Applications: Document Search, Information Retrieval, Sentiment Analysis, Search Engine [3]

II. Use Visual Embeddings To Find Similar Images

Visual embeddings = representing images (that is, pixel values) as real-valued vectors in a lower dimensional space.

Instead of text, now we want to know how visually similar the two images are. We can do that using an image encoder. Here, we use a pretrained ResNet-18 to calculate visual embeddings. You can download the images here [4].
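A sketch of the idea: drop ResNet-18’s classifier head so the penultimate-layer features become the embedding. The image file names are placeholders, and the preprocessing values are the standard ImageNet ones:

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def embed_image(path):
    """Extract a 512-d embedding from pretrained ResNet-18 (classifier removed)."""
    import torch
    from PIL import Image
    from torchvision import models, transforms

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = torch.nn.Identity()          # keep the penultimate features
    model.eval()
    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0).numpy()


if __name__ == "__main__":
    cat1, cat2, dog = (embed_image(p) for p in ["cat1.jpg", "cat2.jpg", "dog.jpg"])
    print(cosine_similarity(cat1, cat2), cosine_similarity(cat1, dog))
```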

Drawing parallels with text, embeddings learned from images find similarity based on shared visual features. For instance, the visual embeddings might learn to recognize a cat’s ears, nose, and whiskers, giving a score of 0.81. A dog, on the other hand, has different visual features than a cat, resulting in a lower similarity score of 0.50. Note that there might be some bias based on the images chosen; however, this is the general underlying principle. Thus, visual embeddings use the pixel-level closeness of images to establish a meaningful relationship.

Similarity scores between images on the left and the right

Real-World Applications: Recommender Systems, Image Retrieval, Similarity Search

III. Use Different Types of Embeddings To Find Similar Items

Earlier, we talked about how embeddings are a way of representing input information in a compressed form. This information can be of any type, ranging from various text attributes to visual attributes, and more. Naturally, the more attributes we can use to describe an item, the stronger the signal for similarity.

Let us take a simple example and walk through this concept together. Below, we have a dataset of five food items. I have created the attributes myself and generated the swatch colors using the Pillow library (explore the script here). You can download the data here (product images by [5]). For each item, we have 5 textual attributes and 2 visual attributes.

For each of these attributes, we can extract an embedding vector that represents that specific attribute. A natural question that crops up now: How do we decide what is similar? For instance, we saw that two sentences can be similar but if we were to visualize them, would they look the same? In contrast, two animals might have a similar score because they look similar, but if we were to characterize them using features, are they still similar? Therein lies the opportunity for leveraging different types of embeddings available to us.

We start by creating a single text feature called ‘TEXTUAL_ATTR’ by simply joining the words. Note that we take a simple example here. In real-world use cases, we would need to apply advanced NLP techniques to clean the text data before concatenating.

Combine individual text attributes into a single one
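The join itself can be sketched with pandas. The attribute names and values below are made up for illustration; the article’s actual dataset has five textual attributes per item:

```python
import pandas as pd

# Hypothetical attribute table standing in for the food-item dataset
df = pd.DataFrame({
    "NAME":     ["red velvet cake", "fries"],
    "CATEGORY": ["dessert", "snack"],
    "FLAVOR":   ["sweet", "savory"],
})

# Join the individual text attributes into a single TEXTUAL_ATTR column
text_cols = ["NAME", "CATEGORY", "FLAVOR"]
df["TEXTUAL_ATTR"] = df[text_cols].agg(" ".join, axis=1)
print(df["TEXTUAL_ATTR"].tolist())
# → ['red velvet cake dessert sweet', 'fries snack savory']
```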

The GIFs below demonstrate what happens when we take into account either textual embeddings or visual embeddings.

Similarity using ‘TEXTUAL_ATTR’ embeddings only
Similarity using ‘PRODUCT_IMAGE’ embeddings only

Both approaches have their limitations, as outlined above. Therefore, we can turn to the ensemble approach in machine learning by combining the above embeddings to provide as much information about the item as possible. This helps us get stronger similarity signals. For example, we can find similar items by using textual, product image, and product color embeddings.

Let us see what happens when we combine all the embeddings!
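One simple way to combine them is to L2-normalize each embedding and concatenate, so every attribute contributes on the same scale before computing similarity. The toy vectors below are purely illustrative stand-ins for text, image, and color embeddings:

```python
import numpy as np


def combine_embeddings(*embeddings):
    """Concatenate L2-normalized embeddings so each attribute contributes equally."""
    normed = [e / np.linalg.norm(e) for e in embeddings]
    return np.concatenate(normed)


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy (text, image, color) embeddings for two items
item_a = combine_embeddings(np.array([1.0, 0.0]),
                            np.array([0.5, 0.5]),
                            np.array([0.2, 0.8]))
item_b = combine_embeddings(np.array([0.9, 0.1]),
                            np.array([0.4, 0.6]),
                            np.array([0.3, 0.7]))
print(round(cosine_similarity(item_a, item_b), 3))
```

Normalizing each attribute first prevents one embedding with large magnitudes from dominating the combined similarity.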

Real-World Applications: Recommender Systems, Substitution Search

IV. Use Different Weights on Embeddings To Find Similar Items

This concept builds upon point III and is a more flexible extension. Let us take 3 different scenarios as follows:

  • You are looking for a replacement toaster at Costco. You want the same functionality as your previous toaster; you do not care whether it looks the same as your old one, but you do care about the features it provides (that is, textual features). In this case, you would want to use text embeddings only.
Image by author
  • I tore my favorite pants and because I love them so much, I want a replacement that is as close as it can get. I am willing to sacrifice some of the features that my old pants had, but I do care about the aesthetics and how they looked on me. In this scenario, we can place more weight on the visual embeddings and slightly lower weight on the text embeddings.
  • If the given data has certain background elements, like a human model, we risk learning the human model’s features instead of those of the actual product we are interested in. In the example below, we want to learn the features of the basketball and not the model holding it. In cases such as these, instead of using image segmentation, we can simply place less weight on the full-body images, and more on the textual attributes as well as the color of the product, to capture the desired embeddings.
Image by [6]

In each of these cases, using different weights gives us control over which sets of embeddings are important to us, and how to manipulate them. This idea can be used to build flexible systems to solve different types of business problems.
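A minimal sketch of this weighting scheme: scale each normalized embedding by its weight before concatenating. All vectors and weights below are toy values chosen to mirror the scenarios above:

```python
import numpy as np


def weighted_combine(embeddings, weights):
    """Concatenate L2-normalized embeddings, each scaled by its weight."""
    parts = [w * (e / np.linalg.norm(e)) for e, w in zip(embeddings, weights)]
    return np.concatenate(parts)


def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy (text, image) embeddings for two products: same text, different visuals
text_a, img_a = np.array([1.0, 0.0]), np.array([0.0, 1.0])
text_b, img_b = np.array([1.0, 0.0]), np.array([1.0, 0.0])

# Toaster scenario: only textual features matter
toaster = cosine_similarity(weighted_combine([text_a, img_a], [1.0, 0.0]),
                            weighted_combine([text_b, img_b], [1.0, 0.0]))
# Pants scenario: visuals dominate, text still contributes a little
pants = cosine_similarity(weighted_combine([text_a, img_a], [0.3, 0.7]),
                          weighted_combine([text_b, img_b], [0.3, 0.7]))
print(round(toaster, 2), round(pants, 2))
```

With the image weight set to zero, the visually different products score as identical; once visuals dominate, the score drops, which is exactly the control the scenarios above call for.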

Real-World Applications: Bias Reduction, Adjustable Recommender Systems, Adjustable Similarity Search [7][8]

V. Put Image & Text Embeddings In the Same Space (Multimodal)

So far, we have been looking at text and visual data as entirely separate inputs. The closest we have come is combining the effects at the inference stage in point III. But what if we thought of text and images as objects? How do we collectively represent these objects?

The answer is multi-modality and CLIP (Contrastive Language–Image Pre-training [9]). If you are unsure what multi-modality means, it is simply the practice of combining different types of data (text, image, audio, numeric) to make more accurate predictions. Using CLIP, we can represent objects (in this case, text or images) as numeric vectors (embeddings) and project them into the same embedding space.

The true value of the multi-modal nature of CLIP lies in translating similar concepts in text and images into similar vectors. These vectors then get placed close to each other depending on how similar they are. This means that if a text captures the concepts of an image well, it is placed closer to that image. Similarly, if an image is described well by a sentence, it lies close to that sentence. In the example below, the text “dog playing at the beach” outputs a similar embedding to that of an image of a dog playing at the beach.

Similar objects (e.g., text, images) are encoded close to each other in the vector space. GIF by author & images by [10]

As seen in the illustration above, CLIP unlocks the ability to move across the domain of text and image by combining the semantics of text with the visuals of an image (Image-to-Text, Text-to-Image). We can also stay within a specific domain like Text-to-Text or Image-to-Image. This opens up the possibility for powerful applications ranging from recommendation systems to self-driving cars, document classification, and more!

Let us demonstrate a text-to-image example using OpenAI’s CLIP via Hugging Face. We will use the images of food items [5], as shown in point III. The idea is to traverse from text to image directly.
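A sketch of such a text-to-image lookup, assuming the `openai/clip-vit-base-patch32` checkpoint; the query string and image file names are hypothetical:

```python
def rank_images_by_text(query, image_paths):
    """Score each image against a text query with CLIP; return paths best-first."""
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # logits_per_image: one text-image similarity score per image
        scores = model(**inputs).logits_per_image.squeeze(1)
    order = scores.argsort(descending=True).tolist()
    return [image_paths[i] for i in order]


if __name__ == "__main__":
    print(rank_images_by_text("red velvet cake",
                              ["cake.jpg", "fries.jpg", "fish_and_chips.jpg"]))
```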

Watch the video below as I walk you through a live demo, or try it yourself here!

We see some cool results in the demo above! The dataset contains two images of the food item cake. Upon looking up “cake”, we first get an image containing a chocolate cake. However, when I add more descriptive keywords like “red velvet”, it gives me a slice of red velvet cake. Both images portray a cake, but only one of them is the red velvet cake that the model recognized. Similarly, we get a plate of fish and chips when we look up “fish and chips”, but “chips” alone gives us a box of fries. The model is also able to differentiate between fries and churros even though they look quite similar.

Now, let us do an image-to-text in a similar fashion. Here, we take examples [4] from point II to classify whether an image is of a dog or a cat.
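The reverse direction can be sketched as zero-shot classification: CLIP scores the image against candidate text labels and a softmax turns the scores into probabilities. The checkpoint, prompt template, and file name below are assumptions:

```python
def classify_image(path, labels):
    """Zero-shot classify an image via CLIP probabilities over text labels."""
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    prompts = [f"a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=Image.open(path).convert("RGB"),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # Softmax over labels gives a probability for each candidate text
        probs = model(**inputs).logits_per_image.softmax(dim=1).squeeze(0)
    return dict(zip(labels, probs.tolist()))


if __name__ == "__main__":
    print(classify_image("cat1.jpg", ["cat", "dog"]))
```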

For each image, we get the probability of each text label matching the image. In the case of the cat images, the label “cat” has a high probability; the dog image, in turn, is classified as a dog.

Real-World Applications: Text-to-Image (Lookup), Image-to-Image (Similarity), Image-to-Text (Classification)

In part two of the series, we will look at more applications ranging from art to cool visualizations. Stay tuned!

If you found this article useful, and believe others would too, leave a clap! If you’d like to stay connected, you’ll find me on LinkedIn here.

Zubia is a Senior Machine Learning Scientist and works on product discovery, personalization, and generative designs in fashion. She is a mentor, instructor, and speaker on topics related to computer vision, data science, women/women of color in STEM, and career in tech.


[1]: Definition of embeddings

[2]: Embeddings explained

[3]: Google’s search engine using BERT text encoder

[4]: Images from Dogs vs. Cats (Kaggle)

[5]: Images from Food Images (Food-101)

[6]: Photo by Malik Skydsgaard on Unsplash

[7]: Combining different types of embeddings

[8]: Embeddings for recommender systems

[9]: OpenAI’s CLIP

[10]: Images by Oscar Sutton, Hossein Azarbad on Unsplash

[11]: CLIP blog


