K-Nearest Neighbors from scratch

Last Updated on August 1, 2023 by Editorial Team

Author(s): Sayar Banerjee

Originally published on Towards AI.

It’s easier than you think

Hello readers!

As a graduate student in business analytics, I’ve been busy with coursework, research, and assignments. However, after a brief hiatus, I’m glad to be back to writing about analytics and machine learning.

Currently, I am taking a course on “Machine Learning”, where my professor taught us about the K-Nearest Neighbors algorithm. During the class, he challenged us to code up the algorithm from scratch. This provided me with a great opportunity for a new blog post topic.

The Algorithm

The K-nearest neighbor is a trivial algorithm to implement from scratch. Its beauty is in its simplicity.

Here is a simple implementation that I will try to replicate using Python:

For each new observation, we compute the distance metric between it and the training data and store them in a data structure.
Depending on the number of neighbors (or k) the user supplies, we find the top closest observations in the training data to the new observation.
If it is a regression problem, our predicted response is the mean of the k neighbors. If it is a classification problem, our predicted response is the majority vote of k neighbors.

Although at this stage the algorithm looks complete, there are certain choices and trade-offs we have to make that I shall discuss next.

The Choices

Clash of labels in majority vote logic

For classification problems, sometimes we may get results where more than one label contributes to the majority vote logic. In such a case, there are ways to choose a unique label as the final prediction.

For example, let’s say that we have trained the k-nearest neighbors classifier to predict the color of a pixel. Here, we choose k as 5. Consequently, the predicted responses from our neighbors are Blue, Blue, Red, Red, and Green respectively.

Two labels predict blue, and two predict red. We must find a way to simplify this outcome to a single result.

We may opt for the following methods:

Random pick: Here, we just randomly pick one of the outcomes. It’s simple but may not be consistent.
Weighted labels by distance: Here, we simply choose the label with the smaller average distance or higher average similarity.
Reducing k until just one label is the distinct outcome: Here, we essentially reduce the number of neighbors until one distinct label is the predicted response of the majority vote.

To keep this blog post simple, I will stick to SciPy’s default implementation of Mode.

Distance Metric

The choice of distance metric can be useful when considering the type of data. However, each of them has its pros and cons. Some common distance metrics used in the k-nearest neighbors algorithm include:

Euclidean distance: The Euclidean distance computes the distance between two observations in Euclidean space. It is computed by taking the square root of the sum of squared differences between two observations.
Manhattan distance: This Manhattan distance computes the sum of the absolute differences between two observations. It is often useful when features are categorical or when our data has a high number of dimensions.
Cosine similarity: The cosine similarity is computed by the cosine of the angle between two vectors. It is often used when data is sparse such as movie reviews, word documents, and market transaction data, or when the magnitude of the vectors is not essential.
Gower similarity: This distance metric computes the similarity between two observations that are mixed or when the features are categorical. It does so by applying a different similarity metric depending on the type of feature it’s computing similarity on. When a feature is numerical, it uses a variation of the Manhattan distance. When the feature is categorical, it uses measures such as Jaccard or Dice coefficient.

For simplicity, in this blog, I will only implement Euclidean distance.

Regression/Classification Function

The method for predicting classes in a classification problem is as follows:

Similarly, here is the method for predicting a continuous variable:

As you can see, both functions are very similar. The only difference between the two is that for classification, we use mode, whereas, for regression, we use the mean.

The np.linalg.norm the function computes the Euclidean distance of each data point in our training data from each data point in our test data.

The np.argsort function sorts our array and returns the sorted indices.

Finally, we compute our predicted class/value and return the array.

That’s all there is to create a simple implementation of K-nearest neighbors!

Results

For results, we compare our naive implementation to sci-kit-learn’s current state-of-the-art implementation of KNN.

Here they are:

As you can see, our model matches the performance of the current state of the art!

Of course, different distance metrics and subtle implementation differences will change the results.

Furthermore, like any other algorithm, KNN has its own set of challenges that can impact its performance, such as:

Number of dimensions
The scale of the data and outliers
Value of “k” chosen
Imbalanced data in classification problems

Conclusion

In summary, KNN is a simple but powerful algorithm, but it has its own set of challenges that need to be addressed to ensure accurate predictions.

I hope you found this blog enjoyable!

Please feel free to check out some of my other blog posts on Medium and connect with me on LinkedIn. Don’t forget to leave a note!

Until next time! U+270BU+1F3FD

U+1F449 Link to GitHub repo

References:

U+1F449 https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors

U+1F449 https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

Join thousands of data leaders on the AI newsletter. Join over 80,000 subscribers and keep up to date with the latest developments in AI. From research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.

Published via Towards AI

Frequently Used, Contextual References

Resources

Publication

K-Nearest Neighbors from scratch

Author(s): Sayar Banerjee

It’s easier than you think

The Algorithm

The Choices

Clash of labels in majority vote logic

Distance Metric

Regression/Classification Function

Results

Conclusion

References:

Feedback ↓ Cancel reply

Popular posts

Best Laptops for Deep Learning, Machine Learning (ML), and Data Science for 2023

Best Workstations for Deep Learning, Data Science, and Machine Learning (ML) for 2022

Descriptive Statistics for Data-driven Decision Making with Python

Best Machine Learning (ML) Books - Free and Paid - Editorial Recommendations for 2022

Best Data Science Books - Free and Paid - Editorial Recommendations for 2022

Updates

Recent Posts

The Fundamental Mathematics of Machine Learning

Built-In AI Web APIs Will Enable A New Generation Of AI Startups

Auditing Predictive A.I. Models for Bias and Fairness

Why is Llama 3.1 Such a Big deal?

5 AI Real-World Projects To Set Foot in The Door

The World’s Leading AI and Technology Publication.

Company

CONTACT US

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Frequently Used, Contextual References

Resources

Publication

K-Nearest Neighbors from scratch

Author(s): Sayar Banerjee

It’s easier than you think

The Algorithm

The Choices

Clash of labels in majority vote logic

Distance Metric

Regression/Classification Function

Results

Conclusion

References:

Related posts

Feedback ↓ Cancel reply

Popular posts

Updates

Recent Posts

The World’s Leading AI and Technology Publication.

Company

CONTACT US

GDPR CCPA Statement