K-Means Clustering — What Every Data Scientist Should Know
Last Updated on October 6, 2025 by Editorial Team
Author(s): Aamir Raja
Originally published on Towards AI.
Discussing and simplifying the intricacies of K-means clustering for machine learning.
K-means clustering. What exactly is this?
It’s an extremely useful machine learning model that creates clusters to identify patterns and structures within data.
K-means is an unsupervised machine learning model, which means the data it works with is unlabelled: there is no predefined outcome or target we are aiming for when we feed our data into the model. Instead, K-means helps us identify patterns in the data that can suggest what we should aim for.
This is important because the data a Data Scientist like myself works with can often be unstructured and unlabelled, but this particular technique can do a good job of identifying the key patterns behind all the values.
This article will explain K-means clustering, how it works in practice and mathematically, and the practical real-world applications behind K-means clustering models.
How does K-means clustering work?
K-means works by grouping similar data points together: the model scans the data, learns which points share similar characteristics, and categorises those points into the same group.
For example, you could have different areas of a city, and you may want to group different houses inside of a city into different groups according to the risk of criminal activity in those areas. Such data points could be grouped into three groups:
- High crime risk
- Neutral crime risk
- Low crime risk
This could be because the data points given to the model are house values, and specific house values align with different parts of the city.
K-means also works by minimising the distance between each data point and the centre of its assigned cluster, so that the points grouped together are more likely to demonstrate similar patterns, habits, or shared characteristics. This distance is measured using a mathematical formula referred to as the Euclidean distance, which will be explained further below.
This is important because it ensures we have tightly structured clusters with high cohesiveness. High cohesiveness means the data points inside a cluster are genuinely similar to one another and belong together. Low cohesiveness, on the other hand, means the data points inside a cluster likely do not share characteristics — and therefore shouldn’t be categorised into the same group.
For example, if you have a data point suggesting that one individual is at high risk of developing a disease, based on that individual’s lifestyle habits and personal genetics, you wouldn’t want to group that person with someone who is low risk and say they are the same, as they do not have shared characteristics. Clearly, that would lead to irrelevant or inaccurate pattern identification, which this part of K-means clustering helps to prevent.
d = √((x₂ − x₁)² + (y₂ − y₁)²)
From the above, we can see the Euclidean distance formula, which measures the straight-line distance between two data points. (x₂ − x₁) is the difference between their x-coordinates, and (y₂ − y₁) is the difference between their y-coordinates.
Point x could represent one variable, and y could represent another. We square both of these differences, add them together, and take the square root. This is how K-means clustering calculates the distance between data points, which it aims to minimise within each cluster to maximise the similarity of the points grouped together.
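To make this concrete, here is a minimal Python sketch of the Euclidean distance calculation described above. The function name and example points are just for illustration:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Distance between two 2-D points (x1, y1) = (1, 2) and (x2, y2) = (4, 6)
print(euclidean_distance((1, 2), (4, 6)))  # 5.0 (a 3-4-5 right triangle)
```

The same function works for points with any number of dimensions, which matters because real datasets usually have more than two features.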
In addition, K-means clustering attempts to maximise the distance between the clusters themselves. This is because the model aims to make each cluster as distinct from the others as possible, which helps with pattern identification and recognition.
Now that we have explained how Euclidean distance works, we can look at the full K-means formula below; the formula above demonstrated the underlying idea before it becomes slightly more complex.
Each cluster has a centre point, referred to as a centroid. A centroid is calculated by taking the mean of all the data points assigned to that cluster. This could be the mean of the x-values, y-values, z-values, or more, depending on how many dimensions (features) each data point has.
Below we can observe the primary goal of K-means clustering. Here, the formula is attempting to minimise the distance between each data point inside of a cluster and the centre point of the cluster that it is assigned to.
Since the formula takes this as a sum of squares, we refer to it as minimising the within-cluster sum of squares between the data points and their assigned centroids.
This is done by taking the sum of squared distances inside each cluster, as shown below:

J = Σₖ Σ_{xᵢ ∈ Cₖ} ‖xᵢ − μₖ‖², with k running from 1 to K
- xᵢ represents a data point.
- K represents the total number of clusters.
- Cₖ is cluster k, as there can be many.
- μₖ represents the centroid of cluster k.
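As a rough illustration, the within-cluster sum of squares can be computed in a few lines of Python. The points, centroids, and assignments below are made-up values:

```python
def wcss(points, centroids, assignments):
    """Within-cluster sum of squares: the sum of squared distances
    from each point x_i to the centroid u_k of its assigned cluster C_k."""
    total = 0.0
    for point, k in zip(points, assignments):
        total += sum((a - b) ** 2 for a, b in zip(point, centroids[k]))
    return total

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0)]
centroids = [(1.25, 1.5), (8.0, 8.0)]   # centroid of cluster 0 and cluster 1
assignments = [0, 0, 1]                 # which cluster each point belongs to
print(wcss(points, centroids, assignments))  # 0.625
```

A lower value means tighter, more cohesive clusters, which is exactly what the K-means objective tries to achieve.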

How would K-means clustering be applied to a dataset?
To begin the process, we first need to determine the value of K, which denotes the number of clusters we decide to create for our dataset.
Once K has been determined, the model picks K initial centroids, typically at random, and assigns each data point to its nearest centroid. As points are assigned, the mean of each cluster changes, so the centroids are recomputed from their new members. The model repeats this assign-and-recompute process iteratively until the centroids converge and stop moving.
To ensure the clusters are of good quality, each cluster should be distinct from the others, and all of the data points inside a cluster should be similar to one another. This is achieved through Euclidean distance and minimising the within-cluster sum of squares.
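The iterative process described above can be sketched in plain Python. This is a simplified, from-scratch version for illustration rather than a production implementation; the function name, seed, and sample points are assumptions:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means: random centroid initialisation, then alternate
    between assigning points to their nearest centroid and recomputing
    each centroid as the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: pick k random points
    for _ in range(iterations):
        # step 2: assign each point to its nearest centroid (squared distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # step 3: recompute each centroid as the mean of its cluster
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster) for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # step 4: stop once converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # two centroids, one per well-separated group
```

With these two clearly separated groups, the centroids settle at the mean of each group after a couple of iterations.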
How are K-means models usually applied to real-world scenarios?
K-means models are usually applied for three main purposes:
- Customer segmentation
- Document clustering
- Image segmentation
Customer segmentation is where different groups of customers are assigned to a particular group based on their shared characteristics, habits, or lifestyle choices. For example, customers who purchase similar Amazon products may be assigned to one group. Customers can be grouped based on many shared characteristics, such as:
- Income
- Age
- Spending habits
- Background
- Product preferences
- Location
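As a sketch of customer segmentation in practice, here is how scikit-learn's `KMeans` could be applied to a small made-up table of income and spending-score features (assuming scikit-learn is installed; the feature values are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one customer: [annual income (k$), spending score (0-100)]
customers = np.array([
    [15, 80], [16, 85], [18, 90],   # low income, high spending
    [90, 15], [95, 10], [85, 20],   # high income, low spending
    [50, 50], [55, 45], [48, 52],   # mid income, mid spending
])

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(customers)  # cluster label per customer
print(labels)
print(km.cluster_centers_)          # one centroid per segment
```

Each of the three labels then corresponds to a customer segment that a marketing team could target differently.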
Document clustering is when similar documents are assigned to a particular cluster. This can make it easier to find similar documents when you have thousands of them and need to group them by topic or contents.
Image segmentation is when we assign particular images to a cluster. These images can be assigned based on the number of pixels in the image or the similarity of the colours of the pixels that make up that image, depending on the goal.
Conclusion:
Overall, K-means clustering is an effective unsupervised machine learning model that takes unlabelled data and helps categorise the data points into specific groups (clusters) that are distinct from one another based on the mathematical formula we listed above.
This can be especially helpful when the dataset you are working with does not have a specific objective. Normally, you would use some variables to predict a target; when no target is defined, K-means can help identify patterns and suggest what you might be able to predict from the data available to you.
Also!
If you enjoyed this article, please feel free to read my other articles where I regularly post about new data science topics and content to help inform you of the latest data science trends and foundational topics.
Have a great week ahead! 👋