# K-Nearest Neighbor (KNN)

Today we'll be looking at KNN, that is, the K-nearest neighbours algorithm. We already saw what the naive Bayes classifier was, and today we'll see what KNN is. So KNN is basically a supervised classification algorithm in which you have some data points, or data vectors, which are separated into several categories, and it tries to predict the classification of a new sample point from that particular population.

Now, the KNN algorithm is a lazy algorithm. What I mean by lazy is that it only memorizes the training data and does not learn a model of its own. Just consider that you have certain points: if you tell KNN to come here, it will come to this point, and if you tell KNN to go there, it will move to that point. So it does not take any decisions of its own, and that's why it's called a lazy algorithm. It classifies any new point based on a similarity measure, and that similarity measure can be, for example, the Euclidean distance.

So it basically calculates the Euclidean distance between two points. It's a proximity measure, and KNN uses it to identify which points are the nearest neighbours. So let's understand the working of KNN. How does it work? Basically, on these axes you have a set of points, say 30 points, that is, 30 vectors or 30 data points. Some data points are labelled in black, some in blue, and some in red. Now we just choose one point, that is, the point we are interested in.

Now we want to find the neighbourhood of this particular point. What KNN does is calculate the distances from this query point to each of the current points. So this is the query point, and all of these are the example points, or the current points. It calculates the distance of the query point from each current point, to this one, to this one, to all these points, and it basically uses the Euclidean distance, or you can use any other distance measure, such as the Manhattan distance or the Minkowski distance.

So for two points $(x_1, y_1)$ and $(x_2, y_2)$ it will basically be $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$, and it builds a distance matrix for all these data points. Then it decides which points are in the neighbourhood of the query point. For example, if you have K equal to 3, then the three points that are closest to the query point will be labelled as its neighbours.
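To make the distance step concrete, here is a minimal sketch of the three distance measures mentioned so far. The function names and the sample coordinates are assumptions chosen for illustration, not something from the lecture.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, r=3):
    """Generalisation of both: r=1 gives Manhattan, r=2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)

# Hypothetical query point and current point
query = (1.0, 2.0)
current = (4.0, 6.0)

print(euclidean(query, current))  # 5.0
print(manhattan(query, current))  # 7.0
```

Calling one of these functions for the query point against every current point gives you one row of the distance matrix.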

Now, there are two things you need to remember about the K parameter. First, K should be an odd number, like 3, 5, or 7. Why should it be odd? Because if you assign K as an even number like 2, 4, or 6, then there can be a tie in the classification of that particular label. For example, you have this query point, and its three nearest neighbours are two blue points and one black point. Which class does it get? That is decided by a majority voting scheme.

So here blue wins, since there are two blue points and only one black point among its neighbours. This query point would be assigned the class label blue, so blue will be the class label of this query point. The second condition is that K should not be a multiple of the number of classes. For example, if you have seven classes, then K should not be equal to 7 × 2, that is, 14. This condition also helps avoid ties in the class assignment.
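The voting and the tie problem can be sketched in a few lines. This is only an illustration: `majority_vote` is a hypothetical helper name, and the colour labels are the ones from the example above.

```python
from collections import Counter

def majority_vote(labels):
    """Return (winner, tied): the most common label, and whether the top count is shared."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return top_label, tied

# K = 3: two blue neighbours and one black neighbour, so blue wins cleanly
print(majority_vote(["blue", "blue", "black"]))  # ('blue', False)

# K = 2: one vote each, so the result is a tie
print(majority_vote(["blue", "black"])[1])  # True

# K = 3 with three classes: K is a multiple of the class count and can also tie
print(majority_vote(["blue", "black", "red"])[1])  # True
```

The last call shows why the second condition matters: an odd K on its own does not rule out a tie once there are more than two classes.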

Now, there is one particular scenario where you have K equal to 1. In that case, what happens is that each data point is assigned a certain region of the space: say, this one gets one region, this one gets another region, and this one gets another. So this is region 1, region 2, region 3, and so on. What it basically does is partition the space into several regions. This is called a Voronoi partition of the space.
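As a rough sketch of the K = 1 case, the code below assigns a query point to whichever labelled point is closest, which is the same as asking which Voronoi region the query falls into. The three labelled points and the name `nearest_label` are made up for illustration.

```python
import math

# Hypothetical labelled points; with K = 1 each one "owns" the region closest to it
points = {
    "circle": (0.0, 0.0),
    "square": (4.0, 0.0),
    "triangle": (2.0, 4.0),
}

def nearest_label(query):
    """Return the label of the single closest point (1-nearest neighbour)."""
    return min(points, key=lambda label: math.dist(query, points[label]))

print(nearest_label((0.5, 0.5)))  # circle
print(nearest_label((3.5, 1.0)))  # square
print(nearest_label((2.0, 3.0)))  # triangle
```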

So what it suggests is that any data point that falls into region 1 would be assigned this class label, that is, this circle, and any data point that falls into this other region would be assigned that region's class label, and so on. So this is the particular case where K is equal to 1, and it divides the space into a Voronoi partition.

Now, let's talk about the algorithm of KNN. How does it work? We basically start with loading the data.

You have a number of data points that you load from a CSV or .xls file. Then you initialize K, which is a hyperparameter in this case. So you just assign the number of nearest neighbours to it, that is, we will assign K equal to 3, or 5, or 7, any odd number. That is the only condition. Then you go through each of the samples in the training data.

You will have your training data and your test data. For each data point in the training data, you have to calculate the distance between the query point and that current point, as I explained here: this will be our query point and this will be our current point. With any distance measure, like the Euclidean distance, the Manhattan distance, or another distance measure, you calculate the distance of the query point to the current point, and you add the distance as well as the index of the example to an ordered collection. That means there would be some collection like this.

So there would be some index, 0, 1, 2, 3, 4, 5, and on each pass it would store the distance against the corresponding index. Now that you have this ordered collection, you need a sorted collection. For that, we sort the ordered collection of distances and indexes from smallest to largest, that is, into ascending order.

That means the first entry is the point that has the minimum distance from the query point. Now you need to pick the first K entries from the sorted collection. So you have 30 points, and from the query point you calculate all the Euclidean distances and keep them in memory, that is, in RAM. You sort them, and then, with the K specified here, you pick the first K entries.
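The collect, sort, and pick steps could look something like the sketch below. It assumes Euclidean distance and a small made-up set of five training points rather than the 30 in the example; `k_nearest` is a hypothetical name.

```python
import math

def k_nearest(training_points, query, k):
    """Return the k (distance, index) pairs closest to the query point."""
    distances = []
    for index, point in enumerate(training_points):
        # Store the distance together with the index of the example
        distances.append((math.dist(query, point), index))
    distances.sort()      # ascending: smallest distance first
    return distances[:k]  # the first K entries

# Hypothetical training points and query point
training_points = [(1.0, 1.0), (5.0, 5.0), (2.0, 1.0), (8.0, 9.0), (1.0, 3.0)]
query = (1.0, 2.0)

neighbours = k_nearest(training_points, query, k=3)
print([index for _, index in neighbours])  # [0, 4, 2]
```

Keeping the index next to each distance is what lets you look up the labels of the selected entries afterwards.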

Next, you need to get the labels of the selected K entries. If it's a regression problem, you return the mean of the K labels, that is, the sum of the labels divided by the number of labels. If it's a classification problem, you return the mode of the K labels, so if K is equal to 3, you return whichever label is most common among those 3. So the KNN algorithm can be used for both regression and classification problems in supervised learning.
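Putting the steps together, a minimal end-to-end sketch might look like this. It is one possible way to write it, assuming Euclidean distance; the function name `knn_predict`, the `task` parameter, and the sample data are all assumptions made for illustration.

```python
import math
from collections import Counter
from statistics import mean

def knn_predict(train_x, train_y, query, k, task="classification"):
    """Predict a label for `query` from its k nearest training points."""
    # Distance and index for every training point, sorted in ascending order
    ranked = sorted((math.dist(query, x), i) for i, x in enumerate(train_x))
    # Labels of the first K entries
    k_labels = [train_y[i] for _, i in ranked[:k]]
    if task == "regression":
        return mean(k_labels)                      # mean of the K labels
    return Counter(k_labels).most_common(1)[0][0]  # mode of the K labels

# Hypothetical training data
train_x = [(1.0, 1.0), (2.0, 1.0), (1.0, 3.0), (6.0, 6.0), (7.0, 5.0)]
colours = ["blue", "blue", "black", "red", "red"]
values = [10.0, 12.0, 14.0, 40.0, 42.0]

print(knn_predict(train_x, colours, (1.0, 2.0), k=3))                    # blue
print(knn_predict(train_x, values, (1.0, 2.0), k=3, task="regression"))  # 12.0
```

The only difference between the two tasks is the last step: the neighbours are found the same way, and then the labels are either averaged or voted on.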