A Probabilistic Algorithm to Reduce Dimensions: t-Distributed Stochastic Neighbor Embedding (t-SNE)
Last Updated on July 19, 2023 by Editorial Team
Author(s): Rajvi Shah
Originally published on Towards AI.
Data Visualization
One of the best techniques to reduce dimensions and visualize data based on probabilistic scores.
Data visualization plays a crucial role in real-time machine learning applications. Visualizing data often makes it much easier to understand, interpret, and classify. Several techniques can help visualize data by reducing the dimensions of the dataset.
In my previous article, I gave an overview of Principal Component Analysis (PCA) and explained how to implement it. PCA is a basic technique for reducing dimensions and plotting data. Its major limitation is that it does not group similar classes together; it only transforms points into a linear representation that is easier for humans to read. t-SNE is designed to overcome this limitation: it can group similar objects together even when the relationships in the data are not linear.
This article is categorized into the following sections:
- What is t-SNE?
- Need/Advantages of t-SNE
- Drawbacks of t-SNE
- Applications of t-SNE — when to use and when not to use?
- Implementation of t-SNE to MNIST dataset using Python
- Conclusion
What is t-SNE?
It is a dimensionality reduction technique that tries to maintain the local structure of the data-points while reducing the number of dimensions.
Let’s understand the concept from the name (t-Distributed Stochastic Neighbor Embedding). Imagine that all data-points are plotted in a high-dimensional (d-dimensional) space, and that each data-point is surrounded by other data-points of the same class. If we take any data-point x, the data-points surrounding it (y, z, etc.) are called the neighborhood of x. The neighborhood is determined geometrically, by calculating the distance between x and every other data-point, so it contains the points that are closest to x. t-SNE only tries to preserve the distances within each neighborhood.
What is embedding? The data-points plotted in d dimensions are embedded in 2D such that the neighborhood of every data-point is kept, as far as possible, as it was in d dimensions. In other words, for every point in the high-dimensional space there is a corresponding point in the low-dimensional space, placed according to the neighborhood concept of t-SNE.
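t-SNE uses a Gaussian distribution to create a probability distribution that defines the relationships between the points in high-dimensional space. In the low-dimensional space it uses a Student’s t-distribution for the same purpose, which is where the “t” in the name comes from.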
t-SNE is stochastic: it is not deterministic, so its output can change from run to run.
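As a rough illustration of the Gaussian-based probabilities described above, here is a minimal sketch in Python with NumPy. It assumes a single fixed bandwidth (sigma) shared by all points, which is a simplification: t-SNE itself chooses a bandwidth per point from the perplexity. The function name gaussian_affinities and the toy data are made up for this example.

```python
import numpy as np


def gaussian_affinities(X, sigma=1.0):
    """Row-normalised Gaussian similarities p(j|i), using one shared sigma (an assumption)."""
    # Squared Euclidean distance between every pair of rows.
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Gaussian kernel; a point is not counted as its own neighbor.
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)
    # Normalise each row so it becomes a probability distribution over neighbors.
    return weights / weights.sum(axis=1, keepdims=True)


# Toy data: two points close together and one far away.
X = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 0.0]])
P = gaussian_affinities(X, sigma=1.0)
print(np.round(P, 3))
```

Each row of P sums to 1, and the nearby point receives almost all of the probability mass, which is the sense in which close points count as neighbors.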
Why do we need t-SNE?
- Handles non-linearity: When it comes to dimensionality reduction, PCA is widely used because it is easy to use and to understand intuitively. PCA is a linear algorithm: it creates principal components, which are linear combinations of the existing features, and it keeps the directions with the largest spread (variance) of the data-points. It therefore cannot capture complex polynomial relationships between features, and it performs poorly when the relationship between the variables is nonlinear. t-SNE, on the other hand, works well on non-linear data, because it preserves neighborhoods rather than linear directions, which helps overcome this limitation of PCA in some applications.
- Preserves local structure and reveals some global structure: t-SNE is primarily designed to preserve the local structure of the data, while also revealing some global structure such as clusters. Roughly, this means that points which are close to one another in the high-dimensional dataset will tend to be close to one another in the low-dimensional embedding. Distances between well-separated clusters, however, are not reliably preserved. PCA, on the other hand, finds new dimensions that explain most of the variance in the data, so it cares relatively little about local neighbors, unlike t-SNE.
The above graph shows the final output on the MNIST dataset after implementing PCA.
The above graph depicts the final output on the MNIST dataset after implementing t-SNE.
Drawbacks of t-SNE
- Crowding Problem: Suppose four points a, b, c, and d form a square with side length x in 2D, with b and c the two corners adjacent to a, and we want to reduce them to 1D with t-SNE. First a is placed on a line; b is then placed to the left of a at distance x, and c to the right of a at distance x. The neighborhood of a is preserved, but the distance between b and c is not: it was x√2 in 2D and becomes 2x on the line. There is also no place left for d that keeps it at distance x from both b and c, other than on top of a.
- Computationally Complex: t-SNE involves a lot of computation, because it computes pairwise conditional probabilities for the data-points and then iteratively minimizes the difference between the probability distributions in the higher and lower dimensions (measured by the Kullback-Leibler divergence).
- Selection of Hyperparameters: The result depends on the perplexity and the number of steps (iterations), both of which are explored later in the article.
- Cluster size: t-SNE does not preserve the size of the clusters, so the size of a cluster in the plot says little about the spread of that class in the original data.
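To make the “Computationally Complex” point more concrete, the sketch below builds the two n × n probability matrices and the Kullback-Leibler cost for a toy dataset. It is a simplified illustration, not scikit-learn’s implementation: it assumes one shared Gaussian bandwidth (sigma) instead of a per-point bandwidth derived from the perplexity, and the function names and the two hand-made 1D layouts are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(Z):
    """Squared Euclidean distances between all pairs of rows (an n x n matrix)."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def high_dim_affinities(X, sigma=1.0):
    """Symmetric joint probabilities from a Gaussian kernel (shared sigma is an assumption)."""
    w = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)
    cond = w / w.sum(axis=1, keepdims=True)   # conditional p(j|i)
    return (cond + cond.T) / (2.0 * len(X))   # joint p_ij, sums to 1


def low_dim_affinities(Y):
    """Joint probabilities from a Student-t kernel with one degree of freedom."""
    w = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(w, 0.0)
    return w / w.sum()


def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) summed over all pairs with non-zero P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))


# Toy data in 3D: two tight pairs of points that are far from each other.
X = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0],
              [5.0, 5.0, 5.0], [5.5, 5.0, 5.0]])
P = high_dim_affinities(X)

# Two candidate 1D layouts: one keeps the pairs together, one splits them.
Y_good = np.array([[0.0], [0.5], [10.0], [10.5]])
Y_bad = np.array([[0.0], [10.0], [0.5], [10.5]])

kl_good = kl_divergence(P, low_dim_affinities(Y_good))
kl_bad = kl_divergence(P, low_dim_affinities(Y_bad))
print(f"cost when neighbors stay together: {kl_good:.3f}")
print(f"cost when neighbors are split:     {kl_bad:.3f}")
```

The layout that keeps neighbors together has the lower cost. Every pair of points contributes to both matrices, so the work grows roughly with the square of the number of data-points, and t-SNE repeats this kind of evaluation at every optimization step.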
Applications of t-SNE
t-SNE can be applied to high-dimensional data, and the resulting low-dimensional output can then become the input to some other classification model. t-SNE can also be used to investigate, learn about, or evaluate segmentation. Often one selects the number of segments before modeling, or iterates after seeing the results. Because t-SNE can often show clear separation in the data, it can be used before running a segmentation model to select the number of clusters, or afterwards to evaluate whether the segments actually hold up. However, t-SNE is not itself a clustering approach: unlike PCA, it does not preserve the inputs, and the values may change between runs, so it is purely for exploration. It is used to interpret deep neural network outputs in tools such as the TensorFlow Embedding Projector and TensorBoard. A powerful feature of t-SNE is that it reveals clusters of high-dimensional data-points at different scales while requiring only minimal tuning of its parameters. It is widely used in deep learning applications.
Implementation of t-SNE to MNIST dataset using Python
First, download the MNIST dataset as a CSV file. We then load it and look at its columns and data-points. We also separate the label column from the rest of the data and store it in another dataframe.
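The loading step might look like the sketch below, which uses pandas. The original code is not reproduced here, so the layout is an assumption: a column named label followed by one column per pixel. A tiny in-memory CSV stands in for the real file so that the example runs on its own; with the real data, pass the path of the downloaded CSV to pd.read_csv instead. Here the labels are kept as a pandas Series rather than a separate dataframe.

```python
import io

import pandas as pd

# Stand-in for the downloaded file, with the assumed layout:
# a "label" column followed by pixel columns.
csv_text = """label,pixel0,pixel1,pixel2,pixel3
5,0,12,255,0
0,0,0,34,90
4,7,0,0,128
"""
df = pd.read_csv(io.StringIO(csv_text))

# Inspect the columns and the number of data-points.
print(df.shape)
print(df.head())

# Separate the label column from the pixel features.
labels = df["label"]
pixels = df.drop(columns="label")
print(pixels.shape, labels.shape)
```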
Now, as a part of Data Preprocessing, we’ll standardize data as follows:
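A minimal sketch of the standardization step, assuming scikit-learn’s StandardScaler is used (the original code is not shown, so this is one reasonable choice rather than the author’s exact code). Random integers in the pixel range stand in for the real pixel columns.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the pixel data: 100 data-points with 16 features in 0..255.
rng = np.random.default_rng(0)
pixels = rng.integers(0, 256, size=(100, 16)).astype(float)

# Rescale every column to zero mean and unit variance.
standardized_data = StandardScaler().fit_transform(pixels)

print(standardized_data.shape)
print(standardized_data.mean(axis=0).round(6)[:4])
print(standardized_data.std(axis=0).round(6)[:4])
```

On real image data some pixel columns are constant (always 0); StandardScaler leaves such columns at 0 instead of dividing by zero.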
The next step is to implement t-SNE using scikit-learn.
Here, we’ll use the first 1000 standardized data-points for t-SNE and build a t-SNE model from the sklearn module using mostly default parameters. It is advisable to try different values of perplexity and learning rate to see which separate the labels best. Finally, we’ll fit and transform the data with the t-SNE model and plot the result using seaborn as follows:
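The step described above could be sketched as follows. To keep the example self-contained, it uses the small 8×8 digits dataset bundled with scikit-learn (load_digits) as a stand-in for the MNIST CSV; that substitution, the fixed random_state, and the column names Dim_1 and Dim_2 are assumptions made for illustration.

```python
import pandas as pd
import seaborn as sns
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in data: standardize everything, then keep the first 1000 data-points.
digits = load_digits()
data_1000 = StandardScaler().fit_transform(digits.data)[:1000]
labels_1000 = digits.target[:1000]

# Mostly default parameters; random_state is fixed only to make runs repeatable.
model = TSNE(n_components=2, perplexity=30, random_state=0)
embedding = model.fit_transform(data_1000)

# Collect the 2D coordinates and labels in one dataframe for plotting.
tsne_df = pd.DataFrame(embedding, columns=["Dim_1", "Dim_2"])
tsne_df["label"] = labels_1000.astype(str)

# One color per digit class.
ax = sns.scatterplot(data=tsne_df, x="Dim_1", y="Dim_2",
                     hue="label", palette="tab10", legend="full")
ax.set_title("t-SNE with perplexity = 30")
```

When running this as a script, call matplotlib.pyplot.show() (or save the figure) to see the plot.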
Output:
Trying with perplexity = 50:
Output: This looks very similar to the above plot with perplexity = 30.
Trying t-SNE with 5000 iterations instead of 1000:
Output:
Now, with perplexity = 2
Output: Almost all of the structure is lost, and the data-points appear randomly spread:
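The three variations tried above (perplexity = 50, 5000 iterations, and perplexity = 2) can be reproduced with a small helper such as the one sketched below. The helper name run_tsne is hypothetical. As assumptions for the sake of a quick, self-contained example, it uses scikit-learn’s bundled digits dataset and only 300 data-points. The iteration-count argument of TSNE is called max_iter in recent scikit-learn releases and n_iter in older ones, so the sketch checks which name is available.

```python
import inspect

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Pick whichever name this scikit-learn version uses for the iteration count.
_params = inspect.signature(TSNE.__init__).parameters
ITER_ARG = "max_iter" if "max_iter" in _params else "n_iter"


def run_tsne(data, perplexity=30, iterations=1000, seed=0):
    """Fit t-SNE with the given settings and return the embedding and the model."""
    model = TSNE(n_components=2, perplexity=perplexity,
                 random_state=seed, **{ITER_ARG: iterations})
    return model.fit_transform(data), model


# Stand-in data: a standardized subset of the bundled digits dataset.
digits = load_digits()
data = StandardScaler().fit_transform(digits.data)[:300]

results = {}
for perplexity, iterations in [(50, 1000), (30, 5000), (2, 1000)]:
    embedding, model = run_tsne(data, perplexity, iterations)
    results[(perplexity, iterations)] = embedding
    print(f"perplexity={perplexity}, iterations={iterations}: "
          f"final KL divergence = {model.kl_divergence_:.3f}")
```

Each embedding in results can be plotted in the same way as before. The KL divergence values are only comparable between runs that use the same perplexity, so the plots remain the main tool for judging the settings.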
Conclusion
We need to try different values of perplexity and of the number of iterations in order to find the best solution. Try implementing t-SNE with all data-points (it will take some time to execute).
You can find the source code on GitHub.
If you are unsure about any function or class of the library, I recommend checking its documentation.
If you have any corrections, suggestions for improvement, or questions, let me know at Mail / LinkedIn.
For a detailed understanding of drawbacks, check out: https://distill.pub/2016/misread-tsne/
For the application of t-SNE, check out: https://ai.googleblog.com/2018/06/realtime-tsne-visualizations-with.html