
K-Means Simplified

Last Updated on October 28, 2020 by Editorial Team

Author(s): Luthfi Ramadhan

Data Science, Machine Learning

Short Introduction and Numerical Example of K-Means Clustering

Source: http://graphalchemist.github.io/Alchemy/images/features/cluster_team.png

What is clustering?

Clustering is the process of grouping a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another but different from objects in other clusters. Clustering is commonly used to explore a dataset to either identify the underlying patterns in it or to create a group of characteristics.

Clustering is sometimes called automatic classification: the objects in a cluster are similar to one another and dissimilar to objects in other clusters, so a cluster can be treated as an implicit class. The difference is that clustering finds the groupings automatically, without predefined class labels. This is a distinct advantage of clustering.

Clustering has been widely used in many applications, one of which is business intelligence. In business intelligence, clustering can be used to organize a large number of customers into groups, where customers within a group share strongly similar characteristics. This makes it easier to develop business strategies for enhanced customer relationship management.

In this context, different clustering methods may generate different clusterings on the same data set. The grouping is performed not by humans but by a clustering algorithm such as K-means. Hence, clustering is useful in that it can lead to the discovery of previously unknown groups within the data.

What is K-means?

K-means is a well-known algorithm used to perform clustering. The idea of k-means is that we assume there are k groups in our dataset. We then try to group the data into those k groups. Each group is described by a single point known as a centroid. The centroid of a cluster is the mean value of the points within the cluster.
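To make the definition of a centroid concrete, here is a minimal Python sketch that computes the coordinate-wise mean of one cluster. The function name `centroid` and the three points are hypothetical and are not taken from the article's dataset.

```python
from statistics import fmean

def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of points."""
    # zip(*points) groups the values of each coordinate (all x's, all y's, ...).
    return tuple(fmean(axis) for axis in zip(*points))

# Hypothetical cluster of three 2-D points (not the article's data).
cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # (3.0, 4.0)
```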

How does K-means actually work?

First, the algorithm randomly creates k centroids, where k is the number of clusters we are aiming for. Each sample is assigned to the cluster it is most similar to, based on the Euclidean distance between the sample and the cluster centroid. Each centroid is then updated to the mean of its assigned samples, and all samples are reassigned using the updated centroids. The iterations continue until the assignment is stable, that is, until the new assignment is identical to the assignment from the previous iteration. The k-means procedure is summarized as follows:

  1. Choose the number of clusters, k.
  2. Initialize k centroids.
  3. Assign each sample to the closest centroid based on Euclidean distance.
  4. Update each cluster's centroid to the mean of its assigned samples.
  5. Reassign each sample to the closest updated centroid. If any reassignment took place, repeat from step 4; otherwise, stop the iteration.
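The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the initial centroids are drawn from the samples themselves (one common choice), an empty cluster keeps its previous centroid, and the function name `kmeans`, its parameters, and the six-point dataset are all hypothetical.

```python
import math
import random

def kmeans(samples, k, max_iter=100, seed=0):
    """Cluster 'samples' (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: initialize k centroids (assumption: pick k distinct samples).
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every sample to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(s, centroids[j]))
            for s in samples
        ]
        # Stop when no sample changed cluster since the last iteration.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned samples.
        for j in range(k):
            members = [s for s, a in zip(samples, labels) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Hypothetical dataset with two well-separated groups (not the article's data).
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
           (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = kmeans(samples, k=2)  # Step 1: we choose k = 2
print(centroids)
print(labels)
```

The `max_iter` cap is only a safety limit; on small data like this the loop usually stops after a few iterations because the assignment no longer changes.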

Example

Here is a dummy dataset containing 10 samples:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Step 1: Choose the number of K Clusters

Let's say we want to cluster the data into 2 clusters.

Step 2: Initialize Centroid

Initialize 2 random centroids:

Photo by: me

Plot the dataset and the centroids using a scatter plot:

Photo by: me

Step 3: Assign each sample to the closest centroid

We then compute the Euclidean distance between each sample and each centroid with the following formula:

Photo by: me
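For reference, the standard Euclidean distance between two points p and q is the square root of the sum of the squared differences of their coordinates; in two dimensions, d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2). A minimal Python sketch of that computation, using a hypothetical sample and centroid rather than the article's values:

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical sample and centroid (not the article's data).
sample = (1.0, 2.0)
centroid = (4.0, 6.0)
print(euclidean(sample, centroid))  # 5.0
```

Python 3.8 and later also provide `math.dist(p, q)`, which computes the same quantity.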

Current Centroid:

Photo by: me

Applying the formula to our data, we get the following result:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distance.

Photo by: me
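The assignment step can be sketched as picking, for each sample, the index of the centroid with the smallest distance. The helper name `assign` and the values below are hypothetical and only illustrate the idea.

```python
import math

def assign(samples, centroids):
    """Return, for each sample, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(s, centroids[j]))
        for s in samples
    ]

# Hypothetical samples and centroids (not the article's data).
samples = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(2.0, 2.0), (7.0, 7.0)]
print(assign(samples, centroids))  # [0, 0, 1, 1]
```

If a sample is exactly equidistant from two centroids, `min` returns the first one, so ties go to the lower cluster index in this sketch.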

Plot the dataset using a scatter plot:

Photo by: me

Step 4: Update the new centroid

We then update each centroid using the mean of the samples assigned to its cluster.

Photo by: me
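A minimal sketch of the update step, assuming the current assignment is stored as a list of cluster indices. The name `update_centroids`, the rule that an empty cluster keeps its old centroid, and the data are assumptions made for illustration.

```python
def update_centroids(samples, labels, old_centroids):
    """Move each centroid to the mean of the samples assigned to it."""
    new_centroids = []
    for j, old in enumerate(old_centroids):
        members = [s for s, label in zip(samples, labels) if label == j]
        if members:
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        else:
            # Assumption: a cluster with no samples keeps its old centroid.
            new_centroids.append(old)
    return new_centroids

# Hypothetical samples and assignment (not the article's data).
samples = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
labels = [0, 0, 1, 1]
print(update_centroids(samples, labels, [(2.0, 2.0), (7.0, 7.0)]))
# [(1.0, 2.0), (9.0, 8.0)]
```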

Plot the dataset using a scatter plot:

Photo by: me

Step 5: Reassign each sample to the new centroid. If any reassignment took place, go back to step 4; otherwise, stop the iteration.

Compute the Euclidean distance between each sample and each new centroid.
Current Centroid:

Photo by: me

Distance between each sample and each centroid:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distance:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Check for reassignment:

Photo by: me
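The reassignment check amounts to comparing the previous assignment with the new one, sample by sample. A small sketch, with a hypothetical helper `count_reassignments` and made-up label lists:

```python
def count_reassignments(previous, current):
    """Count how many samples changed cluster between two iterations."""
    return sum(1 for old, new in zip(previous, current) if old != new)

# Hypothetical assignments from two consecutive iterations.
previous = [0, 0, 1, 1, 0]
current = [0, 1, 1, 1, 0]
changed = count_reassignments(previous, current)
print(changed)  # 1

# The iteration continues while at least one sample changed cluster.
keep_iterating = changed > 0
```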

One reassignment happens, so we go back to step 4. Update the centroid using the mean of the assigned samples in each cluster:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Compute the Euclidean distance between each sample and each new centroid.
Current Centroid:

Photo by: me

Distance between each sample and each centroid:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distance:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Check for reassignment:

Photo by: me

Two reassignments happen, so we go back to step 4. Update the centroid using the mean of the assigned samples in each cluster:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Compute the Euclidean distance between each sample and each new centroid.
Current Centroid:

Photo by: me

Distance between each sample and each centroid:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distance:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Check for reassignment:

Photo by: me

No reassignment happens, so we stop the iteration and get the final result as follows:

Photo by: me

Plot the dataset using a scatter plot:

Final Result, Photo by me

K-Means Disadvantage

The k-means algorithm is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results depend heavily on the initial centroids. To get better results, it is common to run the k-means algorithm several times with different initial centroids and with different values of k.
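One way to act on this advice is to run k-means from several random initializations and keep the run with the lowest sum of squared distances between samples and their centroids. The sketch below assumes that selection criterion (the article does not name one); `kmeans`, `sse`, `best_of`, and the dataset are hypothetical.

```python
import math
import random

def kmeans(samples, k, rng, max_iter=100):
    """One k-means run; initial centroids are k samples drawn with 'rng'."""
    centroids = rng.sample(samples, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(s, centroids[j]))
            for s in samples
        ]
        if new_labels == labels:  # assignment is stable, so stop
            break
        labels = new_labels
        for j in range(k):
            members = [s for s, a in zip(samples, labels) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

def sse(samples, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    return sum(math.dist(s, centroids[a]) ** 2 for s, a in zip(samples, labels))

def best_of(samples, k, n_runs=10, seed=0):
    """Run k-means n_runs times and keep the run with the lowest SSE."""
    rng = random.Random(seed)
    runs = [kmeans(samples, k, rng) for _ in range(n_runs)]
    return min(runs, key=lambda run: sse(samples, *run))

# Hypothetical dataset (not the article's data).
samples = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
           (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, labels = best_of(samples, k=2)
print(centroids, labels, sse(samples, centroids, labels))
```

Comparing runs by this score is only meaningful for a fixed k, because the sum of squared distances generally shrinks as k grows.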

References

[1] J. Han, M. Kamber, and J. Pei, Data Mining Concepts and Techniques (2011)
[2] Sung-Soo Kim, What is Cluster Analysis? (2015)


K-Means Simplified was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.
