
De-Mystifying Embeddings

Last Updated on June 30, 2024 by Editorial Team

Author(s): Shashank Bhushan

Originally published on Towards AI.

Understanding What Embeddings Are

Embeddings, sometimes also referred to as feature representations, are a widely used technique/concept in neural-network-based machine learning. They are usually taken from an intermediate or hidden layer of a deep neural network. Intuitively, embeddings are a set of numbers representing some information about the input or object that we are working with. But how do these numbers represent the information, and how does a model produce them? To understand this, let's look at the following example.

Images Taken From FashionIQ with modifications by the Author.

Suppose we want to come up with a representation of the visual quality of the images above. To do this, we first need to define what visual quality means; let's use the following three criteria:

  • On a scale of 0 to 1, how blurry is the image?
  • On a scale of 0 to 1, how well-lit is the image?
  • Whether or not the clothes in the image are crumpled.

Based on this, a good image would not be blurry, and it would be well-lit with no crumpled clothes (such as the second image). On the other hand, a bad image would be blurry and dim with crumpled clothes (such as the last image). Given these three criteria, we can now represent each image as a set of 3 numbers: the first capturing how blurry the image is, the second how well-lit it is, and the third a boolean (0 or 1) capturing whether or not the clothes are crumpled. This set of 3 numbers is a 3-dimensional representation, or embedding, of the visual quality (based on our definition) of the image.
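To make this concrete, here is what such hand-crafted 3-dimensional embeddings could look like in code (the numbers below are made up purely for illustration):

```python
# Hand-crafted 3-dimensional "visual quality" embeddings:
# [how blurry (0-1), how well-lit (0-1), clothes crumpled (0 or 1)]
good_image_embedding = [0.05, 0.95, 0]  # sharp, well-lit, not crumpled
bad_image_embedding = [0.85, 0.20, 1]   # blurry, dim, crumpled
```

A model trained to produce this embedding would output these three numbers directly from the image pixels instead of us assigning them by hand.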

Thus, in an embedding with N numbers (also called an N-dimensional embedding), each of the N numbers represents a different characteristic (blurriness, lighting, etc. in the example above) of the input.

Note: The information represented by the N numbers is specific to the task that the embedding was built for and usually (more on this below) does not transfer well to other tasks. For example, we cannot use the above 3-dimensional representation to determine whether or not there is a person wearing the clothes in the image.

Training an Embedding Model

Now that we know how to interpret embeddings, let's see how we would go about training an ML model to generate them. First, we need to convert our criteria into a task that we can train our model on. This is the most important step in the whole process: the closer the task is to our criteria, the better the embeddings will be. Once we have the task figured out, there are two ways we can train an embedding:

  1. Train a model on the task and then use an intermediate model layer as the embedding. This approach is used when the input can be passed to the model as is, such as an image.
  2. Use an embedding layer. This approach is used when working with categorical values such as text tokens. An embedding layer is essentially a learnable lookup table that converts a categorical value into a dense representation (the embedding), which can then be fed into the rest of the ML model (see the sketch right after this list).
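Here is a minimal sketch of the second approach, assuming PyTorch; the vocabulary size, embedding dimension, and token ids are made up for illustration:

```python
import torch
import torch.nn as nn

# A vocabulary of 5 categorical values (e.g. token ids), each mapped to a
# dense 3-dimensional vector by a learnable lookup table.
embedding_layer = nn.Embedding(num_embeddings=5, embedding_dim=3)

token_ids = torch.tensor([0, 2, 4])         # three categorical inputs
dense_vectors = embedding_layer(token_ids)  # shape: (3, 3)
print(dense_vectors)
```

The lookup table's entries are learned jointly with the rest of the model during training.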

Keeping with our example above, let's say we want to create an N-dimensional embedding that captures the visual quality of the images. For our task, we will use simple binary classification: we combine the 3 criteria into a single yes or no answer, where yes means that the image has high visual quality and no means that it does not. As images can be fed directly into models such as ConvNets or ViTs, we can train a neural network (such as a ResNet) to predict whether or not an image has high visual quality. Once the model is trained, we remove the final fully connected layer. The remaining model can now be used to generate (visual quality) embeddings for any given image. The images below capture this process.

Embedding Model Training, Image by Author
Creating Embedding model from Trained model, Image by Author
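Here is a rough PyTorch sketch of this workflow, assuming a recent torchvision and a ResNet-18 backbone; the dummy image and the two-class head stand in for real data and a real training loop:

```python
import torch
import torch.nn as nn
from torchvision import models

# 1. A binary "visual quality" classifier built on a ResNet-18 backbone.
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 2)  # yes / no head
# ... train `model` on (image, quality_label) pairs, e.g. with CrossEntropyLoss ...

# 2. After training, drop the final fully connected layer; what remains
#    maps any image to a 512-dimensional (visual quality) embedding.
embedding_model = nn.Sequential(*list(model.children())[:-1])

image = torch.randn(1, 3, 224, 224)            # a dummy input image
embedding = embedding_model(image).flatten(1)  # shape: (1, 512)
```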

Note: We cannot associate a semantic meaning with any given number in an embedding, i.e., we cannot say that the first number represents a certain characteristic. The ML model is just trying to solve the problem given to it (assign the correct binary label to an image); it does not know or care which features would be the most intuitive for human understanding.

Evaluating an Embedding Model

While this depends on what the embedding will be used for, a few common evaluations include:

  • How well does the embedding cluster objects with similar characteristics?
  • How well does a small model (2–3 hidden layers), trained on top of the embeddings with relatively few samples on a similar task, perform? (A brief sketch of both checks follows this list.)
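Here is a minimal sketch of both checks, assuming scikit-learn; the random arrays stand in for real embeddings and labels, and a logistic-regression probe stands in for the small 2–3 layer model:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score

# embeddings: (num_samples, embedding_dim) array produced by the embedding
# model on a held-out set; labels: ground-truth categories for that set.
embeddings = np.random.randn(200, 512)      # placeholder data
labels = np.random.randint(0, 2, size=200)  # placeholder labels

# 1. Clustering quality: do similar objects land close together?
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embeddings)
print("silhouette score:", silhouette_score(embeddings, clusters))

# 2. Probe: train a small model on the frozen embeddings and check accuracy.
probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("probe accuracy:", probe.score(embeddings, labels))
```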

Gotchas For Embedding and Practical Considerations

While we have already touched upon some of these above, here are the general gotchas one should be aware of when working with embedding models:

  1. Embeddings are task-specific, i.e., embeddings trained on one task might not be useful for another.
  2. Embeddings from two different models cannot be compared, even if the models are trained on the same task and dataset. This is because there can be multiple different representations suitable for the task at hand, so the numbers at a given position will not necessarily have the same meaning across models.
  3. Embeddings might not perform well on objects that they did not come across in their training data. For example, if we train an embedding model on just images of clothes, it might not perform well on images of animals or humans.
  4. Larger embeddings are generally more expressive than smaller ones, though they do come with additional computational and storage costs. Thus, in production settings, smaller embedding sizes are usually preferred.

Common Embedding Training Tasks

Finally, let us look at some common tasks used to train image and text embeddings.

Tasks for Image Embeddings

  • (Supervised) Classification: ML models are trained on a classification task, just like in the example above, and an intermediate layer is then used as the representation.
  • (Supervised) Image-Text Description Pairs: In such a setup, the aim is either to generate an appropriate description given an image or to align the embedding/representation of an image with that of its corresponding description. This setup can be thought of as a generalization of the classification task, as the text description acts as a descriptive, fine-grained label. It is used to train multiple SOTA image embeddings such as CLIP, ALIGN, and CoCa.
  • (Self-Supervised) Contrastive Learning: In this approach, we first augment the images in a given batch by applying multiple random operations such as cropping, rotation, etc. We then ask the model to ensure that the embeddings generated for an original image and its augmented version are closer to each other than to those of all other images in the batch. This task was introduced by Chen et al., and it is a very powerful and popular technique for creating embeddings without having to label anything (a simplified loss sketch follows this list).
  • (Self-Supervised) Masked Patch Prediction: In this approach, random portions of the image are masked, and the model is tasked with predicting the masked pixels. Introduced by He et al., this is another very powerful method that is well-suited to transformer-based image models.
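Here is a simplified PyTorch sketch of the contrastive idea; the real NT-Xent loss from Chen et al. also contrasts examples within each view, so treat this as an illustration rather than a faithful reimplementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_original, emb_augmented, temperature=0.5):
    """Pull each image's embedding toward its augmented view and push it
    away from every other image in the batch."""
    z1 = F.normalize(emb_original, dim=1)   # (batch, dim)
    z2 = F.normalize(emb_augmented, dim=1)  # (batch, dim)
    logits = z1 @ z2.T / temperature        # pairwise cosine similarities
    targets = torch.arange(z1.size(0))      # positive pair sits on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: embeddings for a batch of images and their randomly augmented copies.
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```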

Tasks For Text Embeddings

  • (Self-Supervised) Masked Token Prediction: For this task, the model is asked to predict a word or token given the surrounding tokens/words. Essentially, for a given sentence, we mask out a word/token in the middle and ask the model to predict it given the remaining words. This method was used as a pre-training task for BERT embeddings (see the sketch after this list).
  • (Self-Supervised) Word Context Prediction: This approach is essentially the reverse of the masked token approach: the model is tasked with predicting the context tokens in a sentence for a given token. This method is one of the tasks used to train the famous Word2Vec embeddings.
  • (Self-Supervised) Next Sentence Prediction: This task aims to make the model aware of sentence-level relationships. Two sentences are fed to the model, and the task is to determine whether or not the second sentence follows the first. This task, along with Masked Token Prediction, is one of the two pre-training tasks used to train the BERT embedding model.
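As an illustration of masked token prediction, the Hugging Face transformers fill-mask pipeline runs exactly this task on a pre-trained BERT (this assumes the transformers library and the model weights are available):

```python
from transformers import pipeline

# BERT was pre-trained with masked token prediction; the fill-mask pipeline
# asks the model to fill in the [MASK] token from the surrounding words.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The shirt in the photo looks [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```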

That's all for this one; let me know your thoughts and whether I should dive deep into any of the embedding training tasks that I mentioned.

