
# K-Nearest Neighbors from scratch

Last Updated on August 1, 2023 by Editorial Team

#### Author(s): Sayar Banerjee

Originally published on Towards AI.

## It’s easier than you think

As a graduate student in business analytics, I’ve been busy with coursework, research, and assignments. However, after a brief hiatus, I’m glad to be back to writing about analytics and machine learning.

Currently, I am taking a course on “Machine Learning”, where my professor taught us about the K-Nearest Neighbors algorithm. During the class, he challenged us to code up the algorithm from scratch. This provided me with a great opportunity for a new blog post topic.

## The Algorithm

The k-nearest neighbors algorithm is easy to implement from scratch. Its beauty is in its simplicity.

Here is an outline of the algorithm, which I will implement in Python:

1. For each new observation, we compute the distance metric between it and the training data and store them in a data structure.
2. Depending on the number of neighbors (or k) the user supplies, we find the k closest observations in the training data to the new observation.
3. If it is a regression problem, our predicted response is the mean of the k neighbors. If it is a classification problem, our predicted response is the majority vote of k neighbors.
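The first two steps can be sketched as a small helper (a sketch of my own, with assumed names, using Euclidean distance as the metric):

```python
import numpy as np

def k_nearest(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new."""
    # Step 1: distance from the new observation to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Step 2: indices of the k smallest distances.
    return np.argsort(distances)[:k]
```

Step 3 then aggregates the responses at those indices with a mean or a majority vote.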

Although at this stage the algorithm looks complete, there are certain choices and trade-offs we have to make that I shall discuss next.

## Clash of labels in majority vote logic

For classification problems, more than one label may tie for the majority vote. In such a case, there are several ways to choose a unique label as the final prediction.

For example, let’s say that we have trained a k-nearest neighbors classifier to predict the color of a pixel, with k chosen as 5. The labels of the five nearest neighbors are Blue, Blue, Red, Red, and Green.

Two neighbors vote Blue and two vote Red, so we need a rule to reduce this tie to a single prediction.

We may opt for the following methods:

• Random pick: Here, we just randomly pick one of the outcomes. It’s simple but may not be consistent.
• Weighted labels by distance: Here, we simply choose the label with the smaller average distance or higher average similarity.
• Reducing k until just one label is the distinct outcome: Here, we essentially reduce the number of neighbors until one distinct label is the predicted response of the majority vote.
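The second option can be sketched like this (a hypothetical helper of my own, not from the original post): among the labels that tie for the top vote count, pick the one whose neighbors are closest on average.

```python
import numpy as np

def break_tie_by_distance(labels, distances):
    """Among the most frequent labels, return the one with the
    smallest mean distance to the query point."""
    labels = np.asarray(labels)
    distances = np.asarray(distances, dtype=float)
    uniq, counts = np.unique(labels, return_counts=True)
    # Labels that share the maximum vote count.
    tied = uniq[counts == counts.max()]
    # Mean neighbor distance for each tied label; smaller wins.
    means = [distances[labels == lab].mean() for lab in tied]
    return tied[int(np.argmin(means))]
```

On the pixel example, if the two Blue neighbors are nearer than the two Red ones, Blue wins the tie.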

To keep this blog post simple, I will stick to SciPy’s default `mode` implementation (`scipy.stats.mode`), which breaks ties by returning the smallest of the tied values.
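For instance, encoding the votes from the pixel example numerically (0 = Blue, 1 = Red, 2 = Green), SciPy resolves the tie deterministically in favor of the smallest value:

```python
import numpy as np
from scipy.stats import mode

# Neighbor votes, encoded as integers: 0=Blue, 1=Red, 2=Green.
votes = np.array([0, 0, 1, 1, 2])

# Blue and Red tie at two votes each; mode returns the smaller value.
result = mode(votes, keepdims=False)
```

Note that `keepdims=False` is the behavior in recent SciPy versions (1.9+); older releases returned arrays.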

## Distance Metric

The choice of distance metric should take into account the type of data, and each metric has its pros and cons. Some common distance metrics used in the k-nearest neighbors algorithm include:

• Euclidean distance: The Euclidean distance computes the distance between two observations in Euclidean space. It is computed by taking the square root of the sum of squared differences between two observations.
• Manhattan distance: The Manhattan distance computes the sum of the absolute differences between two observations. It is often useful when features are categorical or when our data has a high number of dimensions.
• Cosine similarity: The cosine similarity is computed by the cosine of the angle between two vectors. It is often used when data is sparse such as movie reviews, word documents, and market transaction data, or when the magnitude of the vectors is not essential.
• Gower similarity: This metric computes the similarity between two observations whose features are mixed (numerical and categorical). It does so by applying a different similarity metric depending on the type of feature it’s comparing. When a feature is numerical, it uses a variation of the Manhattan distance. When the feature is categorical, it uses measures such as the Jaccard or Dice coefficient.
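The first three metrics are one-liners in NumPy. As a quick illustration on two orthogonal unit vectors:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# Euclidean: square root of the sum of squared differences.
euclidean = np.linalg.norm(a - b)
# Manhattan: sum of absolute differences.
manhattan = np.abs(a - b).sum()
# Cosine similarity: cosine of the angle between the vectors.
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Here the Euclidean distance is √2, the Manhattan distance is 2, and the cosine similarity is 0 (the vectors are perpendicular).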

For simplicity, in this blog, I will only implement Euclidean distance.

## Regression/Classification Function

The method for predicting classes in a classification problem is as follows:
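The original code block is not reproduced here, but based on the description below (`np.linalg.norm` for distances, `np.argsort` for ranking, SciPy’s mode for the vote), a sketch with assumed names looks like this:

```python
import numpy as np
from scipy.stats import mode

def predict_classification(X_train, y_train, X_test, k):
    """Predict a class label for each row of X_test."""
    # Broadcasting yields an (n_test, n_train) matrix of Euclidean distances.
    distances = np.linalg.norm(
        X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :], axis=2
    )
    # For each test point, indices of the k nearest training points.
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Majority vote among the neighbors' labels.
    return mode(y_train[nearest], axis=1, keepdims=False).mode
```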

Similarly, here is the method for predicting a continuous variable:
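Again as a sketch under the same assumptions, swapping the vote for a mean:

```python
import numpy as np

def predict_regression(X_train, y_train, X_test, k):
    """Predict a continuous response for each row of X_test."""
    distances = np.linalg.norm(
        X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :], axis=2
    )
    nearest = np.argsort(distances, axis=1)[:, :k]
    # Average the neighbors' responses instead of taking a majority vote.
    return y_train[nearest].mean(axis=1)
```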

As you can see, both functions are very similar. The only difference between the two is that for classification, we use mode, whereas, for regression, we use the mean.

The `np.linalg.norm` function computes the Euclidean distance between each data point in our test data and each data point in our training data.

The `np.argsort` function sorts our array and returns the sorted indices.

Finally, we compute our predicted class/value and return the array.
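On a toy example, those two NumPy helpers behave like this:

```python
import numpy as np

X_train = np.array([[0.0, 0.0], [3.0, 4.0]])
x_test = np.array([1.0, 0.0])

# Euclidean distance from the test point to each training row.
d = np.linalg.norm(X_train - x_test, axis=1)  # [1.0, sqrt(20)]

# Indices of the training points sorted from nearest to farthest.
order = np.argsort(d)
```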

That’s all there is to create a simple implementation of K-nearest neighbors!

## Results

For results, we compare our naive implementation to scikit-learn’s implementation of KNN.

Here they are:
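The original results table is not reproduced here, but a comparison along these lines can be run on a standard dataset (Iris is my choice for illustration; the original post may have used different data):

```python
import numpy as np
from scipy.stats import mode
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

k = 5
# From-scratch prediction: distances, k nearest, majority vote.
dist = np.linalg.norm(X_test[:, np.newaxis, :] - X_train[np.newaxis, :, :], axis=2)
nearest = np.argsort(dist, axis=1)[:, :k]
pred_scratch = mode(y_train[nearest], axis=1, keepdims=False).mode

# scikit-learn's optimized implementation with the same k.
pred_sklearn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).predict(X_test)

agreement = (pred_scratch == pred_sklearn).mean()
acc_scratch = (pred_scratch == y_test).mean()
```

The two sets of predictions should agree almost everywhere; small discrepancies can arise from tie-breaking.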

As you can see, our model matches the performance of scikit-learn’s implementation!

Of course, different distance metrics and subtle implementation differences will change the results.

Furthermore, like any other algorithm, KNN has its own set of challenges that can impact its performance, such as:

• Number of dimensions
• The scale of the data and outliers
• Value of “k” chosen
• Imbalanced data in classification problems

## Conclusion

In summary, KNN is a simple but powerful algorithm, but it has its own set of challenges that need to be addressed to ensure accurate predictions.

I hope you found this blog enjoyable!

Please feel free to check out some of my other blog posts on Medium and connect with me on LinkedIn. Don’t forget to leave a note!

Until next time! ✋🏽

