K-Nearest Neighbors from scratch
Last Updated on August 1, 2023 by Editorial Team
Author(s): Sayar Banerjee
Originally published on Towards AI.
It's easier than you think
Hello readers!
As a graduate student in business analytics, I've been busy with coursework, research, and assignments. However, after a brief hiatus, I'm glad to be back to writing about analytics and machine learning.
Currently, I am taking a course on "Machine Learning", where my professor taught us about the K-Nearest Neighbors algorithm. During the class, he challenged us to code up the algorithm from scratch. This provided me with a great opportunity for a new blog post topic.
The Algorithm
The K-nearest neighbors algorithm is trivial to implement from scratch. Its beauty lies in its simplicity.
Here is a simple outline of the algorithm that I will replicate in Python (a short sketch of the first two steps follows the list):
- For each new observation, we compute the distance metric between it and the training data and store them in a data structure.
- Depending on the number of neighbors (or k) the user supplies, we find the k closest observations in the training data to the new observation.
- If it is a regression problem, our predicted response is the mean of the k neighbors. If it is a classification problem, our predicted response is the majority vote of k neighbors.
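To make the first two steps concrete, here is a minimal sketch in NumPy; the helper name k_nearest_indices and the choice of Euclidean distance are just illustrative choices on my part.

```python
import numpy as np

def k_nearest_indices(X_train, x_new, k):
    """Return the indices of the k training rows closest to one new observation."""
    # Step 1: Euclidean distance from the new observation to every training row
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances
    return np.argsort(distances)[:k]
```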
Although at this stage the algorithm looks complete, there are certain choices and trade-offs we have to make that I shall discuss next.
The Choices
Clash of labels in majority vote logic
For classification problems, the majority vote can sometimes end in a tie, with more than one label receiving the highest number of votes. In such a case, we need a way to choose a unique label as the final prediction.
For example, let's say that we have trained the k-nearest neighbors classifier to predict the color of a pixel, with k set to 5. The labels of the five nearest neighbors turn out to be Blue, Blue, Red, Red, and Green.
Two neighbors vote for Blue and two vote for Red, so we need a tie-breaking rule to reduce this outcome to a single prediction.
We may opt for the following methods:
- Random pick: Here, we just randomly pick one of the tied labels. It's simple but may not be consistent across runs.
- Weighted labels by distance: Here, we choose the tied label whose neighbors have the smaller average distance (or the higher average similarity) to the new observation.
- Reducing k until one label wins: Here, we reduce the number of neighbors until the majority vote produces a single winning label.
To keep this blog post simple, I will stick to the default tie-breaking behavior of SciPy's mode function.
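To see how this plays out on the pixel example above, here is a small demo. Encoding the colors as integers (Blue = 0, Red = 1, Green = 2) is my own choice; in the SciPy versions I have used, mode returns the smallest of the tied values, so Blue wins here.

```python
import numpy as np
from scipy.stats import mode

# Votes of the 5 neighbors: Blue, Blue, Red, Red, Green -> encoded as 0, 0, 1, 1, 2
neighbor_labels = np.array([0, 0, 1, 1, 2])

# keepdims=False requires SciPy >= 1.9
result = mode(neighbor_labels, keepdims=False)
print(result.mode, result.count)  # 0 2 -> Blue wins the tie
```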
Distance Metric
The choice of distance metric depends on the type of data, and each metric has its pros and cons. Some common distance metrics used in the k-nearest neighbors algorithm include:
- Euclidean distance: The Euclidean distance computes the distance between two observations in Euclidean space. It is computed by taking the square root of the sum of squared differences between two observations.
- Manhattan distance: The Manhattan distance computes the sum of the absolute differences between two observations. It is often useful when features are categorical or when our data has a high number of dimensions.
- Cosine similarity: The cosine similarity is computed as the cosine of the angle between two vectors. It is often used when the data is sparse, such as movie reviews, word documents, and market transaction data, or when the magnitude of the vectors is not essential.
- Gower similarity: This metric computes the similarity between two observations when the features are mixed or categorical. It does so by applying a different similarity measure depending on the type of feature it's computing similarity on. When a feature is numerical, it uses a variation of the Manhattan distance; when the feature is categorical, it uses measures such as the Jaccard or Dice coefficient.
For simplicity, in this blog, I will only implement Euclidean distance.
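For reference, here is a quick sketch of how the Euclidean distance, and, for comparison, the Manhattan distance and cosine similarity, can be computed with NumPy. The helper names are just illustrative.

```python
import numpy as np

def euclidean_distance(a, b):
    # square root of the sum of squared differences
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan_distance(a, b):
    # sum of absolute differences
    return np.sum(np.abs(a - b))

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
print(euclidean_distance(a, b))  # ~3.74
print(manhattan_distance(a, b))  # 6.0
print(cosine_similarity(a, b))   # 1.0 (the vectors point in the same direction)
```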
Regression/Classification Function
The method for predicting classes in a classification problem is as follows:
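Below is a minimal sketch of such a function, assuming a brute-force NumPy implementation; the name knn_predict_classification and its exact signature are my own choices.

```python
import numpy as np
from scipy.stats import mode

def knn_predict_classification(X_train, y_train, X_test, k=5):
    """Predict a class label for each row of X_test with brute-force KNN."""
    predictions = []
    for x in X_test:
        # Euclidean distance from this test point to every training point
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k nearest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote over the neighbors' labels (keepdims=False requires SciPy >= 1.9)
        predictions.append(mode(y_train[nearest], keepdims=False).mode)
    return np.array(predictions)
```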
Similarly, here is the method for predicting a continuous variable:
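And a matching sketch for regression, again with an illustrative name and signature.

```python
import numpy as np

def knn_predict_regression(X_train, y_train, X_test, k=5):
    """Predict a continuous response for each row of X_test with brute-force KNN."""
    predictions = []
    for x in X_test:
        distances = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
        nearest = np.argsort(distances)[:k]              # k nearest training points
        predictions.append(y_train[nearest].mean())      # average of the neighbors' responses
    return np.array(predictions)
```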
As you can see, both functions are very similar. The only difference between the two is that for classification we use the mode, whereas for regression we use the mean.
The np.linalg.norm function computes the Euclidean distance between each test observation and every data point in the training data. The np.argsort function then returns the indices that would sort those distances in ascending order, from which we take the first k.
Finally, we compute our predicted class/value and return the array.
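Here is a tiny example of these two calls in isolation; the numbers are made up purely for illustration.

```python
import numpy as np

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
x_new = np.array([0.9, 1.1])

distances = np.linalg.norm(X_train - x_new, axis=1)
print(distances)              # [1.42 0.14 2.83] (approximately)
print(np.argsort(distances))  # [1 0 2] -> the second training point is the closest
```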
That's all there is to creating a simple implementation of K-nearest neighbors!
Results
For results, we compare our naive implementation to scikit-learn's current state-of-the-art implementation of KNN.
Here they are:
As you can see, our model matches the performance of the current state of the art!
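As a rough idea of how such a comparison can be run, here is a sketch; the Iris dataset, k = 5, and the reuse of the knn_predict_classification sketch from above are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# scikit-learn's KNN classifier
sk_pred = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train).predict(X_test)

# The from-scratch sketch defined earlier in this post
our_pred = knn_predict_classification(X_train, y_train, X_test, k=5)

print("scikit-learn accuracy:", accuracy_score(y_test, sk_pred))
print("from-scratch accuracy:", accuracy_score(y_test, our_pred))
```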
Of course, different distance metrics and subtle implementation differences will change the results.
Furthermore, like any other algorithm, KNN has its own set of challenges that can impact its performance, such as:
- Number of dimensions
- The scale of the data and outliers (see the scaling sketch after this list)
- The value of 'k' chosen
- Imbalanced data in classification problems
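For example, because KNN is distance-based, a feature on a much larger scale can dominate the distance computation; standardizing the features first is a common remedy. Here is a sketch using scikit-learn's StandardScaler with made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales: income in dollars and age in years
X_train = np.array([[50_000.0, 25.0], [80_000.0, 40.0], [52_000.0, 60.0]])
X_test = np.array([[60_000.0, 30.0]])

scaler = StandardScaler().fit(X_train)      # learn mean and std on the training data only
X_train_scaled = scaler.transform(X_train)  # each feature now has mean 0 and unit variance
X_test_scaled = scaler.transform(X_test)    # apply the same transform to the test data
```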
Conclusion
In summary, KNN is a simple yet powerful algorithm, but it has its own set of challenges that need to be addressed to ensure accurate predictions.
I hope you found this blog enjoyable!
Please feel free to check out some of my other blog posts on Medium and connect with me on LinkedIn. Don't forget to leave a note!
Until next time! ✋🏽
👉 Link to GitHub repo
References:
👉 https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors
👉 https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm