
K-Means Simplified

Last Updated on October 28, 2020 by Editorial Team

Author(s): Luthfi Ramadhan


Short Introduction and Numerical Example of K-Means Clustering

Source: http://graphalchemist.github.io/Alchemy/images/features/cluster_team.png

What is clustering?

Clustering is the process of grouping a set of data objects (or observations) into subsets. Each subset is a cluster, such that objects in a cluster are similar to one another but different from objects in other clusters. Clustering is commonly used to explore a dataset to either identify the underlying patterns in it or to create a group of characteristics.

Clustering is sometimes called automatic classification: because objects within a cluster are similar to one another and dissimilar to objects in other clusters, each cluster can be treated as an implicit class. The difference is that clustering finds these groupings automatically, which is a distinct advantage.

Clustering has been widely used in many applications, one of which is business intelligence. In business intelligence, clustering can be used to organize a large number of customers into groups in which customers share strongly similar characteristics. This makes it easier to develop business strategies for enhanced customer relationship management.

In this context, different clustering methods may generate different clusterings on the same data set. The grouping is not performed by humans, but by the clustering algorithm such as K-means. Hence, clustering is useful in that it can lead to the discovery of previously unknown groups within the data.

What is K-means?

K-means is a well-known algorithm used to perform clustering. The idea of k-means is that we assume there are k groups in our dataset. We then try to group the data into those k groups. Each group is described by a single point known as a centroid. The centroid of a cluster is the mean value of the points within the cluster.

How does K-means actually work?

First, it randomly creates k centroids, where k is the number of clusters we are aiming for. Each sample is assigned to the cluster whose centroid is closest, based on the Euclidean distance between the sample and the centroid. Each centroid is then updated to the mean of the samples assigned to it, and all samples are reassigned using the updated centroids. The iterations continue until the assignment is stable, that is, until the new assignment is identical to the assignment from the previous iteration. The k-means procedure is summarized as follows, with a short code sketch of the full loop after the list:

  1. Choose the number of clusters, k.
  2. Initialize k centroids.
  3. Assign each sample to the closest centroid based on Euclidean distance.
  4. Update each cluster's centroid to the mean of its assigned samples.
  5. Reassign each sample to the closest updated centroid. If any reassignment took place, repeat from step 4; otherwise, stop the iteration.
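
To make these steps concrete, here is a minimal from-scratch sketch of the loop in Python with NumPy. It is only an illustration: the dataset, the value of k, and the random seed are hypothetical placeholders, not the numbers used in the worked example below.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialize k centroids by picking k distinct samples at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 3/5: assign each sample to the closest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when no sample changed its cluster
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned samples
        # (empty clusters are not handled in this sketch)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Hypothetical 2-D data; the article's real 10-sample table is shown in the figures
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)
centroids, labels = kmeans(X, k=2)
print(labels)      # e.g. [0 0 0 1 1 1]
print(centroids)   # e.g. [[1.33 1.67], [8.33 8.33]]
```

The step-by-step example below walks through exactly this loop by hand.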

Example

Here is a dummy dataset containing 10 samples:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Step 1: Choose the number of K Clusters

Let's say we want to cluster the data into 2 clusters.

Step 2: Initialize Centroid

Initialize 2 random centroids:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me
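
In code, a common way to do this initialization is to pick k samples at random from the dataset; a minimal sketch, with hypothetical data standing in for the table above:

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed, only for reproducibility

# Hypothetical stand-in for the 10-sample table shown above
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)

k = 2
# Step 2: pick k distinct samples as the initial centroids
centroids = X[rng.choice(len(X), size=k, replace=False)]
print(centroids)
```

Libraries often use smarter initialization schemes such as k-means++ by default, but random samples are enough to illustrate the procedure.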

Step 3: Assign each sample to the closest centroid

We then compute the Euclidean distance between each sample and each centroid with the following formula:

distance = √((x₁ − x₂)² + (y₁ − y₂)²), where (x₁, y₁) is the sample and (x₂, y₂) is the centroid.
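
The same formula in code, with hypothetical sample and centroid coordinates rather than the values from the tables:

```python
import numpy as np

def euclidean_distance(a, b):
    # √((x_a − x_b)² + (y_a − y_b)²); works for any number of dimensions
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

# Hypothetical points, not the ones used in the article's tables
sample = (2.0, 3.0)
centroid = (5.0, 7.0)
print(euclidean_distance(sample, centroid))  # 5.0
```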

Current Centroids:

Photo by: me

Applying the formula to our data, we get the following result:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distances.

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me
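
Putting the distance computation and the assignment together, step 3 looks roughly like this in code (hypothetical data and centroids again, since the article's values are in the figures):

```python
import numpy as np

# Hypothetical data and current centroids
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)
centroids = np.array([[1, 1], [9, 8]], dtype=float)

# Distance from every sample to every centroid: shape (n_samples, k)
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Step 3: each sample goes to the centroid with the smallest distance
labels = distances.argmin(axis=1)
print(labels)  # [0 0 0 1 1 1]
```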

Step 4: Update the new centroid

We then update each centroid using the mean of the samples assigned to its cluster.

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me
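
In code, the update is just a per-cluster mean; a minimal sketch that continues the hypothetical assignment from the previous snippet:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])  # hypothetical assignment from step 3
k = 2

# Step 4: new centroid of each cluster = mean of the samples assigned to it
new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
print(new_centroids)  # ≈ [[1.33 1.67], [8.33 8.33]]
```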

Step 5: Reassign each sample to the new centroid. If any reassignment took place, go back to step 4; otherwise, stop the iteration.

Compute the Euclidean distance between each sample and each new centroid.
Current Centroids:

Photo by: me

Distance between each sample and each centroid:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distances:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Check for reassignment:

Photo by: me
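
In code, this check is simply a comparison between the previous label array and the new one; a minimal sketch with hypothetical labels:

```python
import numpy as np

previous_labels = np.array([0, 0, 0, 1, 1, 0])  # hypothetical, before this iteration
new_labels      = np.array([0, 0, 0, 1, 1, 1])  # hypothetical, after this iteration

# If any label changed, at least one reassignment took place and we keep iterating
if np.array_equal(previous_labels, new_labels):
    print("No reassignment: stop.")
else:
    changed = int((previous_labels != new_labels).sum())
    print(f"{changed} reassignment(s): go back to step 4.")
```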

One reassignment took place, so we go back to step 4 and update each centroid using the mean of its assigned samples:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Compute the Euclidean distance between each sample and each new centroid.
Current Centroids:

Photo by: me

Distance between each sample and each centroid:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distances:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Check for reassignment:

Photo by: me

Two reassignments took place, so we go back to step 4 and update each centroid using the mean of its assigned samples:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Compute the Euclidean distance between each sample and each new centroid.
Current Centroids:

Photo by: me

Distance between each sample and each centroid:

Photo by: me

Assign each sample to the closest centroid based on the previously obtained Euclidean distances:

Photo by: me

Plot the dataset using a scatter plot:

Photo by: me

Check for reassignment:

Photo by: me

No reassignment took place, so we stop the iteration and get the final result:

Photo by: me

Plot the dataset using a scatter plot:

Final Result, Photo by me

K-Means Disadvantages

The k-means algorithm is not guaranteed to converge to the global optimum and often terminates at a local optimum. The results depend heavily on the initial centroids. To get better results, it is common to run the k-means algorithm several times with different initial centroids and with different values of k.
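
In practice, libraries do these repeated runs for you. For example, scikit-learn's KMeans refits the model n_init times with different initial centroids and keeps the run with the lowest inertia (the within-cluster sum of squared distances); the data and parameter values below are hypothetical:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data
X = np.array([[1, 1], [1, 2], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)

# Run k-means 10 times from different initial centroids and keep the best run
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each sample
print(km.cluster_centers_)  # final centroids
print(km.inertia_)          # within-cluster sum of squared distances (lower is better)

# Comparing inertia across several values of k is a common way to choose k
for k in range(1, 5):
    print(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
```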


