What Are Dimensions in Machine Learning?

Author(s): Max Charney

Originally published on Towards AI.

Source: https://www.reddit.com/r/machinelearningmemes/comments/taelzq/ml_scientists_when_their_excel_spreadsheet_has_4/

Introduction

When I first started reading deep learning papers, I often encountered discussions of high-dimensional feature spaces and of models such as SqueezeNet. I’m pretty sure data can’t be captured in the 18th dimension yet, so what is happening? In this article, I will discuss the role dimensions play in machine learning and where they show up in real applications.

High dimensional space where words represent data points. Source: https://experiments.withgoogle.com/visualizing-high-dimensional-space

First, a Distinction

In machine learning, ‘dimensions’ can be applied in various contexts. Here are two important facets:

  1. in terms of images
  2. in terms of model learning/feature space

Image Dimensions. In deep learning, the dimensions of an image usually refer to its size or shape. A “three-dimensional” image has a height, a width, and a color-channel dimension, where each channel holds one color value per pixel, such as red, green, or blue (RGB).

Color channels (dimensions) of an image. Source: https://www.researchgate.net/figure/11-Image-as-pixel-matrix_fig1_325625631
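To make the height, width, and channel dimensions concrete, here is a minimal NumPy sketch. The image size (4 by 6 pixels) and the pixel values are made up for illustration; real images are simply much larger arrays with the same three axes.

```python
import numpy as np

# A hypothetical RGB image: height 4, width 6, 3 color channels.
height, width, channels = 4, 6, 3
image = np.zeros((height, width, channels), dtype=np.uint8)

# Paint the top-left pixel pure red: (R, G, B) = (255, 0, 0).
image[0, 0] = [255, 0, 0]

print(image.ndim)   # 3 axes: height, width, channel
print(image.shape)  # (4, 6, 3)

# A single color channel on its own is a 2-D height x width grid.
red_channel = image[:, :, 0]
print(red_channel.shape)  # (4, 6)
```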

Dimensions can be manipulated and must be accounted for in model training. Some models deliberately change the number of channels between layers; SqueezeNet, for example, uses 1x1 convolutions to shrink the channel dimension before applying more expensive 3x3 convolutions.
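To make the linear algebra concrete: a 1x1 convolution, the operation SqueezeNet's "squeeze" layers use to reduce channels, multiplies each pixel's channel vector by a weight matrix. The sketch below is a minimal NumPy illustration, not SqueezeNet's actual implementation. The shapes (64 channels reduced to 16) and the random weights are assumptions, and bias and activation are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature map: height 8, width 8, 64 channels.
feature_map = rng.normal(size=(8, 8, 64))

# A 1x1 convolution applies the same linear map to the channel vector
# of every pixel. These weights are random, not trained.
weights = rng.normal(size=(64, 16))

# Matrix multiplication broadcasts over the height and width axes.
squeezed = feature_map @ weights

print(feature_map.shape)  # (8, 8, 64)
print(squeezed.shape)     # (8, 8, 16)
```

The spatial size is unchanged; only the channel dimension shrinks, which is what makes later layers cheaper to compute.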

While this is a quick background on image dimensions, the scope of this article primarily pertains to feature spaces.

Feature Dimensions

Dimensions also come into play when considering the number of features, or variables, used to represent each data point in a machine learning model. Let’s look at an example.

Imagine a model is trying to predict the price of a given house. For now, three main “features,” or variables, will be taken into account:

  • Dimension (feature) 1: Number of bedrooms
  • Dimension (feature) 2: Square footage of the house
  • Dimension (feature) 3: Distance from school

On a 3D graph, each house is a data point: perhaps the x axis represents the number of bedrooms, the y axis the distance from school, and the z axis the square footage of the house. Houses with similar features will cluster together, and those clusters will likely correlate with price (shown with color).

Ultimately, the space in which each dimension corresponds to a different feature is called the feature space.
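In code, a feature space like this is often stored as a table with one row per data point and one column per dimension. The sketch below uses plain Python lists and invented house values. The fourth feature, `year_built`, is a hypothetical addition to show what happens when the space grows past three dimensions.

```python
# Each house is one point in a 3-dimensional feature space.
# All values are invented for illustration.
feature_names = ["bedrooms", "square_feet", "miles_to_school"]
houses = [
    [3, 1500.0, 0.8],
    [4, 2200.0, 2.5],
    [2, 900.0, 0.3],
    [3, 1650.0, 1.1],
]

n_points = len(houses)
n_dimensions = len(houses[0])
print(n_points, n_dimensions)  # 4 3

# Adding a feature adds a dimension. The points can no longer be drawn
# on a 3D graph, but they are stored and processed in the same way.
year_built = [1995, 2010, 1978, 2001]
houses_4d = [row + [year] for row, year in zip(houses, year_built)]
print(len(houses_4d[0]))  # 4
```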

Feature space/cluster graph. Source: https://www.researchgate.net/figure/D-scatter-plot-of-the-DLBCL-data-with-colors-representing-the-true-clustering-labels_fig2_346052105

Going one step further, knowing the distance between points is essential because it tells us how similar two data points are, which lets a machine learning model assign a given data point to the appropriate group or cluster.

The distance between two points x and y in a feature space with n dimensions can be calculated with the following Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
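As a rough sketch, the Euclidean distance between two n-dimensional points can be computed in a few lines of Python. The function name `euclidean_distance` and the example points are my own choices for illustration.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences across all dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two points in a 3-dimensional feature space.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]
print(euclidean_distance(a, b))  # 5.0
```

In practice, features on very different scales (square footage versus number of bedrooms, for example) are usually rescaled first; otherwise the feature with the largest values dominates the distance.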

Cosine similarity is another common way to compare points in a multi-dimensional space. Instead of a distance, it measures the cosine of the angle between the feature vectors that represent the data points, so it compares their direction rather than their magnitude. Here’s one source to learn more: video.
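A minimal sketch of cosine similarity in plain Python follows. The function name and the example vectors are assumptions for illustration. The result is near 1 for vectors pointing the same way, 0 for perpendicular vectors, and near -1 for opposite vectors.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: dot(x, y) / (|x| * |y|)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite directions: similarity close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because the vector lengths are divided out, `[1.0, 2.0]` and `[2.0, 4.0]` count as essentially identical here, even though their Euclidean distance is not zero.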

Now that you understand the basics, I’d like to share some interesting applications of these clusters and times they’re particularly relevant. These examples will solidify your understanding of the concept and provide you with some pretty cool new information.

Unsupervised Learning

Recently, I read this paper that used deep learning to predict prognosis in cancer patients. An unsupervised learning approach was used, meaning that the model learned patterns and relationships within the data without explicit guidance from labeled outcomes or predefined target variables. To evaluate the model’s feature extraction, that is, whether it picks out the defining characteristics of the data, different patients (data points) were visualized.

The left image shows clusters by sex, the middle image shows clusters by race, and the right image shows clusters by cancer type. Source

As seen in the above image, clusters of patients with similar feature representations tend to have the same traits (race, sex and cancer type) even though the model was not explicitly trained on these variables.

This means that the model was able to learn, in an unsupervised fashion, relationships between factors such as sex, race and cancer type across different modalities, or data types.

High-dimensional feature spaces can be visualized using t-SNE (t-distributed Stochastic Neighbor Embedding), which reduces dimensionality while trying to keep similar data points close together. How t-SNE works is beyond the scope of this article, but here’s one source that explains the tool well: link.
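For readers who want to try it, here is a minimal sketch using scikit-learn's `TSNE` class. It is not the setup from the paper above: the data is synthetic (two made-up groups of points in 50 dimensions), and the perplexity value is an arbitrary choice that only needs to be smaller than the number of samples.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic data: two groups of 30 points in a 50-dimensional feature space.
group_a = rng.normal(loc=0.0, scale=1.0, size=(30, 50))
group_b = rng.normal(loc=5.0, scale=1.0, size=(30, 50))
features = np.vstack([group_a, group_b])

# Reduce 50 dimensions to 2 so the points can be drawn on a flat plot.
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
embedding = tsne.fit_transform(features)

print(features.shape)   # (60, 50)
print(embedding.shape)  # (60, 2)
```

Each row of `embedding` is a 2D coordinate for the matching row of `features`, so the result can be passed directly to a scatter plot and colored by any trait of interest.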

Some Other Applications

Large Language Models. When a large language model (LLM) is given huge corpora of text, it is able to learn which words are similar to each other, even though it is never told what the words mean. In the example below, t-SNE visualizations show how words used in similar contexts (such as numbers, months, and names) cluster closely together in the feature space.

Numbers are clustered together, as well as months. Source: Google for Developers
Names are clustered together. Source: Google for Developers

Recommendation Systems. Cluster-based recommendation systems work by grouping users or items into clusters based on similar characteristics or preferences. Recommendations are generated at the cluster level so that users receive personalized suggestions based on the preferences of others within their cluster.

For instance, imagine a recommendation system for a music streaming service. Users who frequently listen to similar genres, artists, and moods might be clustered together. If a user falls into the “Pop and Indie Enthusiasts” cluster, the system would recommend new songs or playlists that are popular within this cluster.

This approach can be both accurate and specific to users, culminating in an easy and enjoyable user experience.
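The music example can be sketched in a few lines of Python. Everything here is hypothetical: the cluster centers, the three features (share of listening time spent on pop, indie, and classical), and the song names are invented, and a real system would learn the clusters from data rather than hard-code them. The sketch assigns a user to the nearest cluster center and recommends the songs most played within that cluster.

```python
import math
from collections import Counter

# Hypothetical cluster centers in a 3-dimensional feature space:
# share of listening time spent on (pop, indie, classical).
centroids = {
    "Pop and Indie Enthusiasts": [0.5, 0.4, 0.1],
    "Classical Listeners": [0.1, 0.1, 0.8],
}

# Hypothetical listening histories of users already in each cluster.
cluster_history = {
    "Pop and Indie Enthusiasts": [
        ["song_a", "song_b"],
        ["song_b", "song_c"],
        ["song_b", "song_a"],
    ],
    "Classical Listeners": [["song_x", "song_y"], ["song_y"]],
}

def nearest_cluster(user_vector):
    """Return the cluster whose center is closest by Euclidean distance."""
    return min(centroids, key=lambda name: math.dist(user_vector, centroids[name]))

def recommend(user_vector, already_heard, k=2):
    """Return the user's cluster and its k most-played songs the user has not heard."""
    cluster = nearest_cluster(user_vector)
    counts = Counter(song for history in cluster_history[cluster] for song in history)
    ranked = [song for song, _ in counts.most_common() if song not in already_heard]
    return cluster, ranked[:k]

cluster, songs = recommend([0.6, 0.3, 0.1], already_heard={"song_a"})
print(cluster)  # Pop and Indie Enthusiasts
print(songs)    # ['song_b', 'song_c']
```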

Conclusions

Working with many dimensions is central to machine learning. The calculations involved are complex, but machines can usually handle them, which makes high-dimensional feature spaces a practical tool.

Thanks for reading! I hope you learned something new.

I’ve listed some other resources below that may be of interest.
