What Are Dimensions in Machine Learning?
Author(s): Max Charney
Originally published on Towards AI.
Introduction
When I first started reading deep learning papers, I often encountered discussions of high-dimensional feature spaces and models such as SqueezeNet. I'm pretty sure data can't be captured in the 18th dimension yet, so what is happening? In this article, I will discuss how dimensions play a role in machine learning and where these dimensions appear in real applications.
First, a Distinction
In machine learning, "dimensions" can mean different things in different contexts. Here are two important facets:
- in terms of images
- in terms of model learning/feature space
Image Dimensions. In deep learning, dimensions concerning images often refer to the image's size or shape. In a "three-dimensional image," the dimensions are height, width, and color channels, where each channel holds one color value per pixel, such as the red, green, and blue (RGB) values.
Dimensions can be manipulated and must be accounted for during model training. Some architectures, such as SqueezeNet, deliberately change the channel dimension: its "squeeze" layers use 1×1 convolutions to shrink the number of channels before more expensive layers run.
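To make this concrete, here is a minimal sketch (using PyTorch, which this article doesn't otherwise assume) of how a 1×1 convolution changes only the channel dimension of an image-shaped tensor while leaving height and width untouched:

```python
import torch
import torch.nn as nn

# A batch of 8 feature maps: (batch, channels, height, width)
x = torch.randn(8, 64, 32, 32)

# A 1x1 convolution mixes information across channels at each pixel,
# changing only the channel dimension -- the idea behind SqueezeNet's
# "squeeze" layers.
squeeze = nn.Conv2d(in_channels=64, out_channels=16, kernel_size=1)

y = squeeze(x)
print(x.shape)  # torch.Size([8, 64, 32, 32])
print(y.shape)  # torch.Size([8, 16, 32, 32])
```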
While this is a quick background on image dimensions, the scope of this article primarily pertains to feature spaces.
Feature Dimensions
Dimensions also come into play when considering the number of features, or variables, used to represent each data point in a machine learning model. Let's look at an example.
Imagine a model is trying to predict the price of a given house. For now, three main "features," or variables, will be taken into account:
- Dimension (feature) 1: Number of bedrooms
- Dimension (feature) 2: Square footage of the house
- Dimension (feature) 3: Distance from school
On a 3D graph, the data points (houses) are spread out in space: say the x-axis represents the number of bedrooms, the z-axis the square footage, and the y-axis the distance from school. Houses with similar features will cluster together, and those clusters will likely correlate with price (shown with color).
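Here is a hedged sketch of that picture; the houses and prices are entirely synthetic, generated just so the color visibly tracks the three features:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic houses: each one is a point in a 3D feature space
# (bedrooms, square footage, distance from school).
bedrooms = rng.integers(1, 6, size=100)
sqft = bedrooms * 600 + rng.normal(0, 200, size=100)
school_dist = rng.uniform(0.2, 5.0, size=100)

# A rough stand-in price, so color correlates with the features.
price = 50_000 * bedrooms + 100 * sqft - 10_000 * school_dist

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
sc = ax.scatter(bedrooms, school_dist, sqft, c=price, cmap="viridis")
ax.set_xlabel("Bedrooms")
ax.set_ylabel("Distance from school")
ax.set_zlabel("Square footage")
fig.colorbar(sc, label="Price")
plt.show()
```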
Ultimately, this space, where each dimension corresponds to a different feature, is called the feature space.
Going one step further, knowing the distance between points is essential because it allows us to quantify how similar two data points are, which in turn lets a model assign a given point to the appropriate group or cluster.
The distance between two points $x$ and $y$ in a feature space with $n$ dimensions can be calculated with the following Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
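A minimal NumPy sketch of this formula, reusing the house features from above (the values are made up):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two n-dimensional feature vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two houses in the (bedrooms, sqft, school distance) feature space.
house_a = [3, 1800, 1.2]
house_b = [4, 2100, 0.8]
print(euclidean_distance(house_a, house_b))  # ~300.0
```

Notice that square footage dominates the result simply because its numbers are bigger; in practice, features are usually normalized to comparable scales before distances are computed.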
Another common way to compare points in a multi-dimensional space is cosine similarity, which measures the cosine of the angle between the feature vectors representing the data points. Here's one source to learn more: video.
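A corresponding NumPy sketch; note that cosine similarity cares about the direction of the vectors, not their magnitude:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors (1 = same direction)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity([1, 0], [0, 1]))        # 0.0 (orthogonal)
print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # 1.0 (same direction)
```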
Now that you understand the basics, I'd like to share some interesting applications of these clusters and situations where they're particularly relevant. These examples should solidify your understanding of the concept.
Unsupervised Learning
Recently, I read this paper that used deep learning to predict prognosis in cancer patients. An unsupervised learning approach was used, meaning the model learned patterns and relationships within the data without explicit guidance from labeled outcomes or predefined target variables. To evaluate how well the model extracted meaningful features, that is, whether it picked up on defining characteristics of the data, the different patients (data points) were visualized.
As seen in the image above, clusters of patients with similar feature representations tend to share traits (race, sex, and cancer type) even though the model was not explicitly trained on these variables.
This means the model was able to learn, in an unsupervised fashion, relationships between factors such as sex, race, and cancer type across different modalities, or data types.
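The paper's exact method isn't reproduced here, but the general idea of grouping unlabeled feature vectors can be sketched with scikit-learn's KMeans on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for learned patient representations: 210 points in a
# 32-dimensional feature space, drawn around three hidden centers.
centers = rng.normal(0, 5, size=(3, 32))
features = np.vstack([c + rng.normal(0, 1, size=(70, 32)) for c in centers])

# Group the points without using any labels -- the unsupervised step.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
print(kmeans.labels_[:10])  # cluster assignment for the first 10 points
```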
High-dimensional feature spaces can be visualized using t-SNE (t-distributed Stochastic Neighbor Embedding), which reduces dimensionality while preserving the pairwise similarities between data points. How t-SNE works is beyond the scope of this article, but here's one source that explains the tool well: link.
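As a rough sketch of that workflow with scikit-learn (synthetic data standing in for real learned features):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic high-dimensional data: 150 points in 32 dimensions,
# drawn around three centers so there is structure to recover.
centers = rng.normal(0, 5, size=(3, 32))
features = np.vstack([c + rng.normal(0, 1, size=(50, 32)) for c in centers])
labels = np.repeat([0, 1, 2], 50)

# t-SNE maps the 32-D points to 2-D while preserving local neighborhoods.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)

plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10")
plt.title("t-SNE projection of a 32-dimensional feature space")
plt.show()
```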
Some Other Applications
Large Language Models. When a large language model (LLM) is trained on huge corpora of text, it is never told what different words mean, yet it learns which words are similar to each other. In the example below, t-SNE visualizations show how words used in similar contexts (such as months and names) cluster closely together in the feature space.
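Modern LLMs learn embeddings as part of much larger networks, but the intuition can be sketched with the classic word2vec approach in gensim. The corpus here is a toy, so the learned neighborhoods are only illustrative:

```python
from gensim.models import Word2Vec

# A toy corpus -- real models learn from billions of tokens.
sentences = [
    ["january", "february", "march", "are", "months"],
    ["monday", "tuesday", "wednesday", "are", "days"],
    ["alice", "bob", "carol", "are", "names"],
] * 50  # repeat so the tiny vocabulary gets enough training updates

# Each word becomes a 50-dimensional vector; words used in similar
# contexts end up close together in that feature space.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=0)

print(model.wv.most_similar("january", topn=3))
```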
Recommendation Systems. Cluster-based recommendation systems work by grouping users or items into clusters based on similar characteristics or preferences. Recommendations are generated at the cluster level so that users receive personalized suggestions based on the preferences of others within their cluster.
For instance, imagine a recommendation system for a music streaming service. Users who frequently listen to similar genres, artists, and moods might be clustered together. If a user falls into the "Pop and Indie Enthusiasts" cluster, the system would recommend new songs or playlists that are popular within this cluster.
This approach can be both accurate and specific to users, culminating in an easy and enjoyable user experience.
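Here is a hedged sketch of that clustering step, with made-up listening counts and scikit-learn's KMeans; real systems use far richer features and more sophisticated ranking:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical listening profiles: rows are users, columns are how often
# each user plays a genre (pop, indie, metal, jazz) -- made-up data.
user_genre_counts = rng.poisson(lam=[8, 6, 1, 1], size=(100, 4)).astype(float)

# Group users with similar tastes.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(user_genre_counts)

def recommend_genre(user_idx):
    """Recommend the genre most played within the user's cluster."""
    genres = ["pop", "indie", "metal", "jazz"]
    cluster = kmeans.labels_[user_idx]
    peers = user_genre_counts[kmeans.labels_ == cluster]
    return genres[int(peers.mean(axis=0).argmax())]

print(recommend_genre(0))
```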
Conclusions
Working with many dimensions is central to machine learning; while high-dimensional feature spaces are hard for humans to picture, machines handle the necessary calculations with ease, which is what makes them such a powerful tool.
Thanks for reading, I hope you learned something new!
I've listed some other resources below that may be of interest.
- Paper on multimodal deep learning for prognosis prediction
- Googleβs YouTube video on visualizing high dimensional spaces
- Mathematical and algorithmic approaches to dimensionality