K-Means Clustering — What Every Data Scientist Should Know
Last Updated on October 6, 2025 by Editorial Team
Author(s): Aamir Raja
Originally published on Towards AI.
Discussing and simplifying the intricacies of K-means clustering for machine learning.
K-means clustering. What exactly is this?
It’s an extremely useful machine learning model that creates clusters to identify patterns and structures within data.
K-means is an unsupervised machine learning model, which means the data it works with is unlabelled: there is no predefined outcome or target we are aiming for when we feed our data into the model. Instead, K-means helps us identify patterns in the data that can suggest what we should aim for.
This is important because the data a Data Scientist like myself works with can often be unstructured and unlabelled, but this particular technique can do a good job of identifying the key patterns behind all the values.
This article will explain K-means clustering, how it works in practice and mathematically, and the practical real-world applications behind K-means clustering models.
How does K-means clustering work?
K-means works by grouping similar data points together: the model scans the data, learns which points share similar characteristics, and categorises those points into the same group.
For example, you could have different areas of a city, and you may want to group different houses inside of a city into different groups according to the risk of criminal activity in those areas. Such data points could be grouped into three groups:
- High crime risk
- Neutral crime risk
- Low crime risk
This could be because the data points given to the model are house values, and specific house values align with different parts of the city.
K-means also works by minimising the distance between each data point and the centre of its assigned cluster, so that the points grouped together are more likely to demonstrate similar patterns, habits, or shared characteristics. This distance is measured using a mathematical formula referred to as the Euclidean distance, which will be explained further below.
This is important because it ensures we have tightly structured clusters with high cohesiveness. High cohesiveness means the data points inside a cluster are genuinely similar to one another and belong together. Low cohesiveness, on the other hand, means the data points inside a cluster likely do not share characteristics — and therefore shouldn’t be categorised into the same group.
For example, if you have a data point suggesting that one individual is at high risk of developing a disease, based on that individual’s lifestyle habits and personal genetics, you wouldn’t want to group that person with someone who is low risk and say they are the same, as they do not have shared characteristics. Clearly, that would lead to irrelevant or inaccurate pattern identification, which this part of K-means clustering helps to prevent.
d = √((x₂ − x₁)² + (y₂ − y₁)²)
From the above, we can see the Euclidean distance formula, which measures the straight-line distance between two data points. (x₂ − x₁) is the difference between their x-coordinates, and (y₂ − y₁) is the difference between their y-coordinates.
Point x could represent one variable, and y could represent another. We square both of these differences, add them together, and take the square root. This is how K-means clustering calculates the distance between data points, which it aims to minimise within each cluster to maximise the similarity of the points grouped together.
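To make this concrete, here is a minimal Python sketch of the Euclidean distance calculation described above. The function name and example points are just for illustration:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between two 2-D points (x1, y1) = (1, 2) and (x2, y2) = (4, 6)
print(euclidean_distance((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```

The same function works for points with any number of dimensions, which matters because real datasets usually have more than two features.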
In addition, K-means clustering attempts to maximise the distance between the clusters themselves. This is because the model aims to make each cluster as distinct from the others as possible, which helps with pattern identification and recognition.
Now that we have explained how Euclidean distance works, we can look at the full K-means formula below; the formula above demonstrated the underlying idea before it becomes slightly more complex.
Each cluster has a centre point, referred to as a centroid. A centroid is calculated by taking the mean of all the data points assigned to that cluster. This could be the mean of the x-values, y-values, z-values, or more, depending on how many dimensions (features) each data point has.
Below we can observe the primary goal of K-means clustering. Here, the formula is attempting to minimise the distance between each data point inside of a cluster and the centre point of the cluster that it is assigned to.
Since the formula takes this as a sum of squares, we refer to it as minimising the within-cluster sum of squares between the data points and their assigned centroids.
This is done by taking the sum of squared distances inside each cluster, as shown below:

J = Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖², with k running from 1 to K
- xᵢ represents a data point.
- K represents the total number of clusters.
- Cₖ is cluster k, as there can be many.
- μₖ represents the centroid of cluster k.
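As a rough illustration, the within-cluster sum of squares can be computed in a few lines of Python. The points, centroids, and assignments below are made-up values:

```python
def wcss(points, centroids, assignments):
    """Within-cluster sum of squares: the sum of squared distances
    from each point x_i to the centroid u_k of its assigned cluster C_k."""
    total = 0.0
    for point, k in zip(points, assignments):
        total += sum((a - b) ** 2 for a, b in zip(point, centroids[k]))
    return total

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0)]
centroids = [(1.25, 1.5), (8.0, 8.0)]   # centroid of cluster 0 and cluster 1
assignments = [0, 0, 1]                 # which cluster each point belongs to
print(wcss(points, centroids, assignments))  # 0.625
```

A lower value means tighter, more cohesive clusters, which is exactly what the K-means objective tries to achieve.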

How would K-means clustering be applied to a dataset?
To begin the process, we first need to determine the value of K, which denotes the number of clusters we decide to create for our dataset.
Once K has been determined, the model picks K initial centroids, typically at random, and assigns each data point to its nearest centroid. As points are assigned, the mean of each cluster changes, so the centroids are recomputed from their new members. The model repeats this assign-and-recompute process iteratively until the centroids converge and stop moving.
To ensure the clusters are of good quality, each cluster should be distinct from the others, and all of the data points inside a cluster should be similar to one another. This is achieved through Euclidean distance and minimising the within-cluster sum of squares.
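The iterative process described above can be sketched in plain Python. This is a simplified, from-scratch version for illustration rather than a production implementation; the function name, seed, and sample points are assumptions:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means: random centroid initialisation, then alternate
    between assigning points to their nearest centroid and recomputing
    each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: pick k random points
    for _ in range(iterations):
        # step 2: assign each point to its nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # step 4: stop once converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # two centroids, one per well-separated group
```

With these two clearly separated groups, the centroids settle at the mean of each group after a couple of iterations.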
How are K-means models usually applied to real-world scenarios?
K-means models are usually applied for three main purposes:
- Customer segmentation
- Document clustering
- Image segmentation
Customer segmentation is where different groups of customers are assigned to a particular group based on their shared characteristics, habits, or lifestyle choices. For example, customers who purchase similar Amazon products may be assigned to one group. Customers can be grouped based on many shared characteristics, such as:
- Income
- Age
- Spending habits
- Background
- Product preferences
- Location
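As a sketch of customer segmentation in practice, here is how scikit-learn's `KMeans` could be applied to a small made-up table of income and spending-score features (assuming scikit-learn is installed; the feature values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [annual income (k$), spending score (0-100)]
customers = np.array([
    [15, 80], [16, 85], [18, 90],   # low income, high spending
    [90, 15], [95, 10], [85, 20],   # high income, low spending
    [50, 50], [55, 45], [48, 52],   # mid income, mid spending
])

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(customers)  # cluster label per customer
print(labels)
print(km.cluster_centers_)          # one centroid per segment
```

Each of the three labels then corresponds to a customer segment that a marketing team could target differently.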
Document clustering is when similar documents are assigned to a particular cluster. This can make it easier to find similar documents when you have thousands of them and need to group them by topic or contents.
Image segmentation is when we assign particular images to a cluster. These images can be assigned based on the number of pixels in the image or the similarity of the colours of the pixels that make up that image, depending on the goal.
Conclusion:
Overall, K-means clustering is an effective unsupervised machine learning model that takes unlabelled data and helps categorise the data points into specific groups (clusters) that are distinct from one another based on the mathematical formula we listed above.
This can be especially helpful when the dataset you are working with does not have a specific objective. Normally, you would use some variables to predict a target; when no target is defined, K-means can help identify patterns and suggest what you might be able to predict from the data available to you.
Also!
If you enjoyed this article, please feel free to read my other articles where I regularly post about new data science topics and content to help inform you of the latest data science trends and foundational topics.
Have a great week ahead! 👋