What Are Dimensions in Machine Learning?

Author(s): Max Charney

Originally published on Towards AI.

Source: https://www.reddit.com/r/machinelearningmemes/comments/taelzq/ml_scientists_when_their_excel_spreadsheet_has_4/

Introduction

When I first started reading papers involving deep learning, I often encountered discussions about high-dimensional feature spaces and models such as SqueezeNet. I’m pretty sure data can’t be captured in the 18th dimension yet, so what is happening? In this article, I will discuss the role dimensions play in machine learning and where they show up in real applications.

High-dimensional space in which words represent data points. Source: https://experiments.withgoogle.com/visualizing-high-dimensional-space

First, a Distinction

In machine learning, the word “dimensions” is used in several contexts. Here are two important ones:

in terms of images
in terms of model learning/feature space

Image Dimensions. In deep learning, the dimensions of an image usually refer to its size or shape. In a “three-dimensional image”, the dimensions are height, width, and color channel, where each channel holds one color value, such as red, green, or blue (RGB), for every pixel.
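To make the height, width, and channel layout concrete, here is a minimal sketch that stores a tiny image as nested Python lists. The pixel values and the helper name `shape` are made up for illustration; real pipelines typically use an array library instead.

```python
# A tiny 2x3 RGB "image" stored as nested lists: image[row][column][channel].
# The pixel values are made up for illustration.
image = [
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],        # row 0: red, green, blue
    [[255, 255, 255], [0, 0, 0], [128, 128, 128]],  # row 1: white, black, gray
]


def shape(img):
    """Return (height, width, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))


height, width, channels = shape(image)

# Pull out a single color channel as a 2D (height x width) grid.
red_channel = [[pixel[0] for pixel in row] for row in image]
```

Here `shape(image)` reports three dimensions, and `red_channel` is what remains when the channel dimension is fixed to red.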

Color channels (dimensions) of an image. Source: https://www.researchgate.net/figure/11-Image-as-pixel-matrix_fig1_325625631

Dimensions can be manipulated and must be accounted for in model training. Some models change the number of channels as data flows through the network; SqueezeNet, for example, uses 1x1 convolutions to reduce the channel count before more expensive layers.
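One common way a network changes the channel dimension is a 1x1 convolution, which SqueezeNet uses in its “squeeze” layers. The sketch below is a plain-Python illustration of the idea, not SqueezeNet’s actual code: the function name `conv1x1`, the image, and the weights are all hypothetical.

```python
def conv1x1(image, weights):
    """Apply a 1x1 convolution: at every pixel, mix the input channels
    into len(weights) output channels using the same weight matrix.

    image:   nested list with shape [height][width][in_channels]
    weights: nested list with shape [out_channels][in_channels]
    """
    return [
        [
            [sum(w * value for w, value in zip(row_w, pixel)) for row_w in weights]
            for pixel in row
        ]
        for row in image
    ]


# A 1x2 image with 3 channels per pixel (values are illustrative).
image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]]

# Hypothetical weights that reduce 3 channels to 2.
weights = [
    [1.0, 0.0, 0.0],  # output channel 0 copies input channel 0
    [0.0, 0.5, 0.5],  # output channel 1 averages input channels 1 and 2
]

squeezed = conv1x1(image, weights)
```

The height and width are unchanged; only the channel dimension shrinks from 3 to 2. In a trained network the weights would be learned rather than written by hand.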

That is a quick background on image dimensions; the rest of this article focuses on feature spaces.

Feature Dimensions

Dimensions also come into play when considering the number of features, or variables, used to represent each data point in a machine learning model. Let’s look at an example.

Imagine a model is trying to predict the price of a given house. For now, three main “features,” or variables, will be taken into account:

Dimension (feature) 1: Number of bedrooms
Dimension (feature) 2: Square footage of the house
Dimension (feature) 3: Distance from school

On a 3D graph, the data points (houses) are spread out, where perhaps the x axis represents the number of bedrooms, the y axis the distance from school, and the z axis the square footage of the house. Houses with similar features will cluster together, and those clusters will likely correlate with price (shown with color).

The space in which each dimension corresponds to a different feature is called the feature space.
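As a rough sketch of the house example, each house can be written as a feature vector with one entry per dimension. The class name `House`, the units (square feet, miles), and every value below are assumptions chosen for illustration.

```python
from typing import NamedTuple


class House(NamedTuple):
    bedrooms: int           # dimension (feature) 1
    square_feet: float      # dimension (feature) 2
    miles_to_school: float  # dimension (feature) 3, unit assumed


# Hypothetical houses; the numbers are not real data.
houses = [
    House(bedrooms=3, square_feet=1500.0, miles_to_school=0.8),
    House(bedrooms=4, square_feet=2200.0, miles_to_school=2.5),
    House(bedrooms=2, square_feet=900.0, miles_to_school=0.3),
]

# Each house becomes one point (a feature vector) in the feature space.
feature_vectors = [list(house) for house in houses]
n_dimensions = len(feature_vectors[0])
```

Adding a fourth feature, such as the age of the house, would add a fourth dimension to the same space even though it could no longer be drawn on a 3D graph.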

Feature space/cluster graph. Source: https://www.researchgate.net/figure/D-scatter-plot-of-the-DLBCL-data-with-colors-representing-the-true-clustering-labels_fig2_346052105

Going one step further, knowing the distance between points is essential because it allows us to determine how similar two data points are, which in turn lets a machine learning model assign a data point to the appropriate group or cluster.

The distance between two points x and y in a feature space with n dimensions can be calculated with the following Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
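Euclidean distance is the square root of the summed squared differences across all n dimensions. Here is a minimal sketch of that calculation; the function name `euclidean_distance` and the two sample points are hypothetical.

```python
import math


def euclidean_distance(x, y):
    """Distance between two points with the same number of dimensions."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Two hypothetical points in a 3-dimensional feature space.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

distance = euclidean_distance(a, b)  # sqrt(3^2 + 4^2 + 0^2) = 5.0
```

The same function works unchanged for any number of dimensions. In Python 3.8 and later, the standard library’s `math.dist(a, b)` computes the same quantity.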

Another common measure in multi-dimensional spaces is cosine similarity, which is the cosine of the angle between the feature vectors that represent two data points. Unlike Euclidean distance, it compares the direction of the vectors rather than their magnitude. Here’s one source to learn more: video.
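Cosine similarity can be sketched in a few lines: divide the dot product of two vectors by the product of their lengths. The function name `cosine_similarity` and the sample vectors are illustrative assumptions.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Same direction, different length: similarity is (up to rounding) 1.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: similarity is 0.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

The first pair of vectors is far apart in Euclidean terms but points the same way, which is why the two measures can disagree about which points are “similar”.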

Now that you understand the basics, I’d like to share some interesting applications of these clusters and the situations where they’re particularly relevant. These examples should solidify your understanding of the concept and give you some pretty cool new information.

Unsupervised Learning

Recently, I read this paper that used deep learning to predict prognosis in cancer patients. An unsupervised learning approach was used, meaning that the model learned patterns and relationships within the data without explicit guidance from labeled outcomes or predefined target variables. To evaluate how well the model extracted features, that is, whether it could pick out the defining characteristics of the data, the feature representations of different patients (data points) were visualized.

The left image shows clusters by sex, the middle image shows clusters by race, and the right image shows clusters by cancer type. Source

As seen in the above image, clusters of patients with similar feature representations tend to have the same traits (race, sex and cancer type) even though the model was not explicitly trained on these variables.

This means that the model was able to learn, in an unsupervised fashion, relationships between factors such as sex, race and cancer type across different modalities, or data types.

High-dimensional feature spaces can be visualized using t-SNE (t-distributed Stochastic Neighbor Embedding), which reduces dimensionality while trying to preserve the pairwise similarities between nearby data points. How t-SNE works is beyond the scope of this article, but here’s one source that explains the tool well: link.

Some Other Applications

Large Language Models. When a large language model (LLM) is trained on huge corpora of text, it is never told what different words mean, yet it is able to learn which words are similar to each other. In the example below, visualizations developed using t-SNE show how words used in similar contexts (such as numbers, months, and names) are clustered closely together in the feature space.
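The idea that similar words sit close together can be sketched with a toy lookup. The three-dimensional vectors below are invented for illustration and do not come from any real model, which would learn vectors with many more dimensions; the function names are also hypothetical.

```python
import math

# Made-up 3-dimensional "embeddings"; these values are illustrative only.
embeddings = {
    "january": [0.9, 0.1, 0.0],
    "march": [0.8, 0.2, 0.1],
    "seven": [0.1, 0.9, 0.0],
    "twelve": [0.2, 0.8, 0.1],
    "alice": [0.0, 0.1, 0.9],
}


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


def most_similar(word, table):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in table if w != word)
    return max(others, key=lambda w: cosine_similarity(table[word], table[w]))
```

With these toy vectors, `most_similar("january", embeddings)` picks the other month and `most_similar("seven", embeddings)` picks the other number, mirroring the clusters in the visualizations.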

Numbers are clustered together, as well as months. Source: Google for Developers

Names are clustered together. Source: Google for Developers

Recommendation Systems. Cluster-based recommendation systems work by grouping users or items into clusters based on similar characteristics or preferences. Recommendations are generated at the cluster level so that users receive personalized suggestions based on the preferences of others within their cluster.

For instance, imagine a recommendation system for a music streaming service. Users who frequently listen to similar genres, artists, and moods might be clustered together. If a user falls into the “Pop and Indie Enthusiasts” cluster, the system would recommend new songs or playlists that are popular within this cluster.
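A minimal sketch of this idea assigns a user to the nearest cluster center and returns what is popular there. The cluster centers, feature definitions, song titles, and function names are all hypothetical, and a real system would learn the clusters from data rather than hard-code them.

```python
import math

# Hypothetical cluster centers in a 2D feature space:
# (share of listening time spent on pop, share spent on indie).
centroids = {
    "Pop and Indie Enthusiasts": (0.5, 0.4),
    "Classical Listeners": (0.05, 0.05),
}

# Hypothetical songs that are popular within each cluster.
popular_in_cluster = {
    "Pop and Indie Enthusiasts": ["Song A", "Song B"],
    "Classical Listeners": ["Song C"],
}


def assign_cluster(user_features):
    """Pick the cluster whose center is closest in Euclidean distance."""
    return min(centroids, key=lambda name: math.dist(user_features, centroids[name]))


def recommend(user_features):
    """Recommend the songs that are popular in the user's cluster."""
    return popular_in_cluster[assign_cluster(user_features)]


new_user = (0.6, 0.3)
recommendations = recommend(new_user)
```

Because `new_user` lies closest to the “Pop and Indie Enthusiasts” center, the recommendations come from that cluster’s list.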

This approach can be both accurate and specific to users, culminating in an easy and enjoyable user experience.

Conclusions

Working with many dimensions is central to machine learning. High-dimensional feature spaces are hard for people to picture, but computers can handle the necessary calculations, which is what makes these spaces such a useful tool.

Thanks for reading. I hope you learned something new!

I’ve listed some other resources below that may be of interest.
