A Basic Implementation of K-Nearest Neighbors in Python

Tyler Etheridge
Sep 24, 2020

What is K-Nearest Neighbors?

K-Nearest Neighbors is an algorithm used in data mining and machine learning for classification and regression tasks. It predicts the unknown attribute values or categorical labels of a given data point through pattern recognition and a majority vote among nearby points. The driving mechanic of K-Nearest Neighbors is the comparison of distances between points in a training data set as a measure of similarity. Each attribute of the data is encoded as an individual dimension of a feature space, so that each object's attributes are expressed by its position in that space. Given a sufficient number of data points and attributes, commonalities and patterns emerge from the distribution of the points. To classify a test point with an unknown label, the k closest neighbors are found, their class labels are tallied, and the most common label is returned as the predicted class label for the tested data point.

Fig. 1: A two dimensional plane with plotted data points color coded into three class labels.
Fig. 2: An example of the prediction varying based on the number of k-neighbors used.
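
For intuition, here is a minimal sketch of the same idea using scikit-learn's KNeighborsClassifier on a tiny made-up dataset (scikit-learn is not used in the from-scratch implementation below; the points and labels here are purely illustrative):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Six made-up training points belonging to three classes, as in Fig. 1
X_train = np.array([[1.0, 1.1], [1.2, 0.9],
                    [5.0, 5.2], [5.1, 4.8],
                    [9.0, 9.1], [8.8, 9.3]])
y_train = np.array([0, 0, 1, 1, 2, 2])

# Classify a new point by majority vote among its 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(knn.predict([[5.3, 5.0]]))  # expected output: [1]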

Implementation of K-Nearest Neighbors

An important component of a successful implementation is identifying points at which the runtime complexity can skyrocket and reducing the runtime of those processes when the complexity itself is unavoidable. My implementation of the KNN algorithm started with the distance function. A downside of K-Nearest Neighbors is that the number of distance calculations it requires grows linearly with the size of the data set, so the runtime of the implementation depends heavily on how streamlined each calculation can be made. Rather than iterate over the vector components and sum their squared differences, as the explicit Euclidean distance formula suggests, I used NumPy's L2 norm function to calculate the Euclidean distance between two vectors.

import numpy as np

def calc_distance(self, vec_a, vec_b):
    return np.linalg.norm(vec_a - vec_b)

This calculation method automatically handles the dimensionality of the vectors, so the number of dimensions does not have to be explicitly known. It significantly reduces the complexity of my distance calculation while drastically outperforming explicit user-developed methods in base Python and offering marginal improvements over functions from purpose-built libraries.

Fig. 3: Comparison of distance calculation performance
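
As a rough sketch of the kind of comparison behind Fig. 3 (the vector length and repetition count here are arbitrary, and absolute timings will vary by machine), the built-in timeit module can be used to contrast a base-Python implementation with np.linalg.norm:

import timeit
import numpy as np

vec_a = np.random.rand(100)
vec_b = np.random.rand(100)

def pure_python_distance(a, b):
    # Explicit Euclidean formula: square root of the summed squared differences
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

# Time 10,000 repetitions of each approach
print(timeit.timeit(lambda: pure_python_distance(vec_a, vec_b), number=10_000))
print(timeit.timeit(lambda: np.linalg.norm(vec_a - vec_b), number=10_000))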

To calculate the distances between the test point and the training data, the distance function is called in a loop, with the test point held fixed and the training points supplied one at a time. Once the list of distances is calculated, NumPy's argsort() function is used to get the indices of the k nearest neighbors. Sorting the distance list directly would destroy the index parity between the training set and the distance list that identifies each point. Argsort() solves this by returning the indices of the distance list in the order they would appear if the list were sorted by distance. Because these indices refer back to the original positions, they can be used to look up the corresponding points in the training data set and return their labels.
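
A quick illustration of this behavior on a hypothetical list of distances:

import numpy as np

distances = [4.2, 0.5, 3.1, 0.9]
order = np.argsort(distances)  # array([1, 3, 2, 0]) -- indices, not sorted values
print(order[:2])               # indices of the two nearest points: [1 3]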

def calc_labels(self, x):
    # Calculate the distance from the test point to every training point
    distance_list = list()
    for x_train in self.X_train:
        distance_list.append(self.calc_distance(x, x_train))
    # Indices of the k smallest distances, then the corresponding labels
    k_index = np.argsort(distance_list)[:self.k]
    k_labels = [self.y_train[idx] for idx in k_index]
    return k_labels

Because the exercise was limited to NumPy and base Python, I created a counting algorithm for the labels using a bucket approach. It returns the most popular label among the k neighbors, which is then used by the final method, the predict method. The predict method simply applies these methods, via a list comprehension, to every point in the test set that is passed in, so a large number of predictions can be processed sequentially and returned as a list (a sketch of it appears after the counting code below).

def calc_most_common_label(self, arr):
    # Get the maximum number of class labels
    maximum = len(set(self.y_train))
    # List comprehension to create a buckets array
    buckets = [0 for i in range(maximum + 1)]
    # Count the labels and increment the buckets
    for value in arr:
        buckets[value] += 1
    # Return the index of the largest bucket
    # AKA return the label with the highest occurrence in the list
    most_common_label = buckets.index(max(buckets))
    return most_common_label
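
The predict method itself is not shown above; a minimal sketch consistent with the description, assuming the class stores X_train, y_train, and k as attributes, might look like this:

def predict(self, X_test):
    # For each test point: find the k nearest labels, then take the majority vote
    return [self.calc_most_common_label(self.calc_labels(x)) for x in X_test]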

I gained many insights into algorithm implementation during this exercise, chief among them the importance of identifying where runtime complexity can skyrocket and streamlining the calculations that cannot be avoided. K-Nearest Neighbors is a very simple algorithm that is appealing for small to medium sized use cases and offers decent flexibility, but it does not scale efficiently to larger datasets without incorporating additional techniques.

Attributions

Figure 1, By Agor153 — Own work, CC BY-SA 3.0

Figure 2, By Antti Ajanki AnAj — Own work, CC BY-SA 3.0

Figure 3, Performance graphic replicated using code found at https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy

Thanks for reading
