Autoencoder Average Distance — a classical way used internally at Microsoft to find out similarity…

Last Updated on August 21, 2022 by Editorial Team


Originally published on Towards AI.

Autoencoder Average Distance — A Classical Way Used Internally at Microsoft To Find Out Similarity Between the Given Datasets

The autoencoder average distance (AAD) is a relatively simple way to measure the distance between two datasets. A neural autoencoder can convert any data item into a vector of numeric values. The idea of AAD is to convert both source datasets into strictly numeric vectors with the same number of values and then compute the difference between the average vector of each dataset.

This technique for computing dataset difference was developed by J. McCaffrey and S. Chen and has been used internally at Microsoft.


In machine learning, data from the application of interest is usually scarce. Considerable data is available for general-purpose settings, while little is available for dedicated investigations. For this reason, there is great interest in methods that can combine, adapt, and transfer knowledge across datasets and domains. Entire research areas are devoted to this, including domain adaptation, transfer learning, and meta-learning, and they are among the active areas of machine learning research.

The Notion of Distance

A basic notion underlying all these areas is the distance (or similarity) between datasets: to judge how similar two datasets are, we measure the distance between them. For instance, transferring knowledge across similar domains should intuitively be easier than across distant ones. Likewise, given a choice of datasets to train a model on, it seems natural to choose the one closest to the task of interest. This effectively increases the amount of data that can be used for a particular task.

However, this notion poses certain problems. Despite its usefulness and simplicity, the distance between datasets is an elusive notion, and quantifying it efficiently and in a principled manner remains largely an open problem. Doing so requires solving various challenges that commonly arise precisely in the settings where this notion would be most useful, such as the ones mentioned above. For example, in supervised machine learning, datasets consist of both features and labels. Defining a distance between features is often (though not always) trivial, but doing so for labels is far from it, particularly if the label sets of the two tasks are not identical (as is often the case for off-the-shelf pretrained models).

Knowing the distance between two datasets can be useful for at least two reasons. First, dataset distance can guide transfer learning, such as using a prediction model trained on one dataset to quickly train a model on a second dataset. Second, it can help with augmenting training data, that is, creating additional synthetic training data that can be used to build a more accurate prediction model.

Ways to Determine Dataset Distance

There are several ways to measure the similarity between two datasets. Many of them involve substantial calculation and rely on advanced mathematical notions, so they often appear heuristic and complex. In transfer learning, a common approach is to compare datasets using proxies. Most of these approaches lack guarantees, are highly dependent on the probe model used, and require training a model to completion (e.g., to find optimal parameters) on each dataset being compared.

Dataset similarity can be measured using several available techniques, and various notions of similarity between data distributions have been proposed in the context of domain adaptation. These include Discrepancy Distance, Dataset Distance via Parameter Sensitivity, the Theory of Optimal Transport, Adversarial Validation, and Finding Distance Metrics between the two datasets. Each of these approaches is distinct and has its own advantages and disadvantages.

Autoencoder Average Distance (AAD)

Other techniques involve higher mathematics and tend to be complex to implement and understand. Autoencoder Average Distance (AAD) uses a relatively simple approach.

In this approach, a neural autoencoder converts each data item into a vector of numeric values. Both datasets being compared are converted into strictly numeric vectors with the same number of values. AAD then computes the average vector of each dataset and compares the two averages to obtain a distance, which indicates how similar the two datasets are.
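
The description above can be written compactly as a formula. The symbols are introduced here for illustration only: $f$ is the trained encoder, and $P$ and $Q$ are the two datasets. The article only says to compute "the difference between the averages", so the Euclidean norm used below is an assumption, not something the source specifies.

```latex
\bar{v}_P = \frac{1}{|P|} \sum_{x \in P} f(x), \qquad
\bar{v}_Q = \frac{1}{|Q|} \sum_{y \in Q} f(y), \qquad
\mathrm{AAD}(P, Q) = \lVert \bar{v}_P - \bar{v}_Q \rVert_2
```

Under this reading, two datasets whose encoded averages coincide have an AAD of zero, and the value grows as the averages move apart.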

For example, consider the MNIST dataset. An autoencoder converts each MNIST item into a vector such as (0.3456, 0.9821, . . . 0.5318). Take another dataset consisting of items like (“male”, 31, $58,000.00, “sales”), each of which is converted into a vector such as (0.1397, 0.7382, . . . 0.0458). Once both datasets are represented as numeric vectors with the same number of values, the next step is to compute the average vector of each dataset and then the difference between the two averages, which indicates how similar MNIST and the other dataset are.
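
The averaging-and-comparing step can be sketched in a few lines of Python with NumPy. This is a minimal sketch, not the implementation used at Microsoft: the function name `autoencoder_average_distance` is hypothetical, the small input arrays are made-up stand-ins for real autoencoder output, and the use of Euclidean distance between the two averages is an assumption.

```python
import numpy as np


def autoencoder_average_distance(encoded_a, encoded_b):
    """Distance between the mean encoded vectors of two datasets.

    encoded_a, encoded_b: arrays of shape (n_items, n_values) holding
    the autoencoder output for each dataset; n_values must match.
    Euclidean distance is an assumption made for this sketch.
    """
    encoded_a = np.asarray(encoded_a, dtype=float)
    encoded_b = np.asarray(encoded_b, dtype=float)
    if encoded_a.shape[1] != encoded_b.shape[1]:
        raise ValueError("both datasets must encode to the same number of values")
    mean_a = encoded_a.mean(axis=0)  # average vector of dataset A
    mean_b = encoded_b.mean(axis=0)  # average vector of dataset B
    return float(np.linalg.norm(mean_a - mean_b))


# Made-up encoded vectors standing in for real autoencoder output.
dataset_a = [[0.2, 0.8, 0.5], [0.4, 0.6, 0.5]]
dataset_b = [[0.3, 0.7, 0.1], [0.3, 0.7, 0.3], [0.3, 0.7, 0.2]]
print(autoencoder_average_distance(dataset_a, dataset_b))  # about 0.3
```

The two datasets may contain different numbers of items. Only the length of each encoded vector has to agree, which is why the function checks the second dimension and nothing else.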

Advantages and Disadvantages of AAD

Other approaches often have a solid mathematical foundation and desirable mathematical properties, but they can be too complex to use in some scenarios. The autoencoder average distance (AAD) metric uses a simpler approach and is therefore much easier to implement.

The advantages of AAD are that it is easier to compute, simpler to understand, and can be used with any type of data, including data with mixed numeric and non-numeric predictor variables.
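
Mixed data still has to reach the autoencoder as numbers. One common way to do this, assumed here and not described in the source, is to one-hot encode the categorical fields and rescale the numeric fields to the range 0 to 1. The sketch below applies that to a record like the one in the earlier example. The function name `encode_record`, the category lists, and the numeric ranges are all hypothetical.

```python
def encode_record(sex, age, income, job):
    """Turn one mixed record into a list of floats in [0, 1].

    The category lists and numeric ranges below are hypothetical;
    real values would come from the dataset being encoded.
    """
    sexes = ["male", "female"]
    jobs = ["sales", "tech", "mgmt"]
    age_min, age_max = 18, 90
    income_min, income_max = 0.0, 200_000.0

    def one_hot(value, categories):
        # 1.0 in the position of the matching category, 0.0 elsewhere
        return [1.0 if value == c else 0.0 for c in categories]

    def min_max(value, low, high):
        # linear rescaling of a numeric value to the range [0, 1]
        return (value - low) / (high - low)

    return (
        one_hot(sex, sexes)
        + [min_max(age, age_min, age_max)]
        + [min_max(income, income_min, income_max)]
        + one_hot(job, jobs)
    )


vector = encode_record("male", 31, 58_000.00, "sales")
print(vector)  # seven values: 2 for sex, 1 for age, 1 for income, 3 for job
```

Every record encoded this way has the same length, so the resulting vectors can be fed to a single autoencoder regardless of which categories a record contains.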

The main disadvantage of AAD is that it captures less information than other approaches, so it may not give the desired results in certain scenarios.

Implementing the Idea of AAD on BraTS Datasets

Let's try to implement the idea of AAD on two Brain Tumour Segmentation (BraTS) datasets. The datasets can be downloaded from Kaggle using the links given below.

The code for the implementation is provided in the notebook linked below.

An interesting thing about the code written above is that it can be used for any two datasets with minor changes.

Please note that the notebook code implements the idea of AAD in a very intuitive way and may need refinement. It is an attempt to reproduce the AAD technique used at Microsoft, not a strict reproduction. The code is based on my own understanding of Autoencoder Average Distance and is not copied from any standard implementation. If it later turns out that the technique is copyrighted by Microsoft and not available for use in the public domain, all rights remain solely with Microsoft.
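
For readers who want to see the whole pipeline in one place, here is a small end-to-end illustration on synthetic data. It is not the notebook code and does not use the BraTS datasets. Several things are assumptions made for this sketch: the autoencoder is a tiny linear one written in NumPy instead of a real neural network, a single autoencoder is trained on all the data pooled together, Euclidean distance is used between the averages, and all function names are hypothetical.

```python
import numpy as np


def train_linear_autoencoder(data, n_hidden, epochs=500, lr=0.1, seed=0):
    """Fit encoder/decoder weight matrices by gradient descent on the
    squared reconstruction error. No biases or nonlinearity: a
    deliberately small stand-in for a real neural autoencoder."""
    rng = np.random.default_rng(seed)
    n_items, n_features = data.shape
    w_enc = rng.normal(scale=0.1, size=(n_features, n_hidden))
    w_dec = rng.normal(scale=0.1, size=(n_hidden, n_features))
    for _ in range(epochs):
        hidden = data @ w_enc          # encode
        error = hidden @ w_dec - data  # reconstruction error
        grad_dec = hidden.T @ error / n_items
        grad_enc = data.T @ (error @ w_dec.T) / n_items
        w_dec -= lr * grad_dec
        w_enc -= lr * grad_enc
    return w_enc, w_dec


def encode(data, w_enc):
    """Map each row of data to its hidden-layer vector."""
    return data @ w_enc


def aad(encoded_a, encoded_b):
    """Euclidean distance between the two average encoded vectors."""
    diff = encoded_a.mean(axis=0) - encoded_b.mean(axis=0)
    return float(np.linalg.norm(diff))


# Synthetic datasets: A and B share a distribution, C is shifted.
rng = np.random.default_rng(1)
set_a = rng.uniform(0.0, 0.6, size=(200, 8))
set_b = rng.uniform(0.0, 0.6, size=(200, 8))
set_c = rng.uniform(0.4, 1.0, size=(200, 8))

# Assumption: one autoencoder trained on the pooled data.
pooled = np.vstack([set_a, set_b, set_c])
w_enc, _ = train_linear_autoencoder(pooled, n_hidden=3)

near = aad(encode(set_a, w_enc), encode(set_b, w_enc))
far = aad(encode(set_a, w_enc), encode(set_c, w_enc))
print(f"A vs B: {near:.4f}   A vs C: {far:.4f}")
```

The distance between A and B should come out much smaller than the distance between A and C, since A and B are drawn from the same distribution. With a real neural autoencoder, only `train_linear_autoencoder` and `encode` would need to be replaced.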


The notion of distance is so basic and fundamental that it is most often used as a primitive from which other tools and methods derive utility. AAD appears to offer a potential solution to the problems arising from the limited availability of task-specific data, and it would most likely be used as a tool within a machine learning pipeline. Because the concept is relatively simple, its potential impact seems broad enough to cover most settings where machine learning is used. Perhaps the most immediate impact would be through its application in transfer learning. Improvements in this approach can have myriad outcomes, ranging from societal to environmental, both within and beyond the machine learning community.

[1] Computing the Similarity of Machine Learning Datasets — Pure AI

[2] [2002.02923] Geometric Dataset Distances via Optimal Transport

[3] Autoencoder — Wikipedia

Autoencoder Average Distance — a classical way used internally at Microsoft to find out similarity… was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
