K-nearest neighbors is an incredibly useful algorithm, and one of my favorite things about it is that it lends itself well to drawing pictures. We'll start by looking at using it for classification: taking a data point and saying whether it belongs to category A or category B. In this example, we're looking at mushrooms. We're mushroom hunting, and we would like to be able to identify whether a mushroom is edible, safe to eat, or toxic, meaning it may kill us. Huge disclaimer: I know nothing about identifying mushrooms. Nothing here is factual or will help you identify safe mushrooms to eat. Notionally, this is a fun problem because a trained mushroom hunter can safely gather the delicious mushrooms and avoid the toxic ones, even when the two are fairly similar, by honing in on distinguishing characteristics. That is exactly what we want from a good classification algorithm: to be able to say whether a thing belongs to category A, edible, or category B, toxic. And in this case, the consequences of getting it wrong are quite severe.

So in this imaginary world, the height and cap diameter of a number of mushrooms were collected. The ones known to be toxic are marked with crosses, and the ones known to be edible are marked with circles. You can see that there's definitely a pattern here, but there is no clean way to say that mushrooms taller than a certain height are all toxic or all edible, or that mushrooms with a cap diameter greater than a certain threshold are all toxic or all edible. The two groups are tight, but the boundary between them is wiggly and irregular. What we'd like to be able to do, the classification problem, is this: given a new mushroom whose height and cap diameter we know, determine whether it is toxic or edible. Can we use this set of examples to classify a new mushroom?

The k-nearest neighbors approach is to find the closest neighbors to the point we care about. K refers to the number of neighbors we try to find; in this case, there are five neighbors within that circle. In practice, small values of K tend to work well: five, seven, or even one, depending on your problem, where you just look for the single closest neighbor. When working with a two-class classification problem, toxic or edible, it's really convenient to have a K that's odd, because that way, if you have two of one class and three of the other, you can go with the class that has the larger count. One will always out-vote the other. If you have three classes or more, there's always the possibility of a tie, and you have to handle that tie in some way that makes sense for your data and the problem you're trying to solve. For now, we'll stick with a two-class classification problem and a K that's odd.

So in this image, K is five, and all five of those neighbors are toxic. Therefore, KNN says a mushroom located at the plus sign, with that cap diameter and that height, will also be toxic. If we look at a different location, where the five closest neighbors to the plus sign are all edible, that mushroom would be predicted to be edible. In another location, the five closest neighbors include three edible mushrooms and two toxic mushrooms. The edible votes outweigh the toxic votes, so that point would be declared edible. And you can see by looking at the plot that it definitely falls within the spread of the other edible mushrooms, so that's a very reasonable decision to make.
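To make the voting mechanics concrete, here's a minimal from-scratch sketch in Python. The measurements and labels are invented for illustration; they are not real mushroom data.

```python
import numpy as np

def knn_classify(X_train, y_train, query, k=5):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among those k labels (odd k avoids ties with two classes)
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical measurements: [cap diameter (cm), height (cm)]
X = np.array([[4.1, 6.0], [4.5, 5.8], [5.0, 6.3],   # toxic cluster
              [8.2, 9.1], [8.8, 9.5], [9.1, 8.9]])  # edible cluster
y = np.array(["toxic", "toxic", "toxic", "edible", "edible", "edible"])

print(knn_classify(X, y, np.array([4.4, 6.1])))  # "toxic": 3 of 5 votes
```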
Then if we look at another point whose neighborhood includes, say, four edible mushrooms and one toxic mushroom, it's definitely classified as edible. Here's a border case: in this location, we see a plus sign that sits very close to both the edible mushrooms and the toxic mushrooms. Its neighborhood technically includes three edible mushrooms and two toxic mushrooms, so it should be classified as edible. But if I'm looking at that plot, I'm thinking that one is close enough to the toxic cluster that it might be toxic. This is an example of where you may want to choose something other than a simple majority vote to determine your classification. That's a separate topic, and we'll dive into it more in a later video. But this illustrates the concept of k-nearest neighbors: you find the nearest K data points, and you have them vote on the winning category.

Now, the choice of K makes a difference. Let's say this data set also contained an edible mushroom whose cap diameter and height place it out amidst the toxic mushrooms. If we would like to guess whether a new mushroom with that cap diameter and height is toxic, we could use a K of 1: find the single closest neighbor, which is edible, and say, well, it's very close to this other edible mushroom we found, so it's probably edible too. If we have 100% confidence in our prior data and its labels, this is not a bad way to go. But if there's noise in our data set and we think some of our points may be mislabeled, it's a poor choice, because any mistake gets propagated. A way to protect against this is to use a larger K. If we ramp K up to 5, the neighborhood includes that one edible mushroom and four toxic mushrooms. The four toxic mushrooms out-vote the edible one, and the point is classified as toxic. So what we believe about the accuracy of our data affects the choice of K: a higher K does more correction for noise and mislabeled points, while a lower K gives us greater resolution.

Another thing that makes a big difference in k-nearest neighbors is how you scale your features. Imagine that we measured our mushrooms' cap diameter in centimeters but their height in meters. Effectively, that squishes all of our data down toward the x-axis. It's still a perfectly valid measurement, it's still accurate, and it's still conveying the same information. But if you now draw a circle around this particular point of interest, it captures, if you squint, just barely one toxic mushroom and four edible mushrooms. If we re-plot the exact same data and the exact same neighborhood with the height in centimeters instead, we get a really elongated ellipse that captures, just barely, the one toxic mushroom and the four edible mushrooms. To a casual eyeball, this would clearly be a toxic mushroom: that plus sign is nested right down in there with the other toxic mushrooms. But because of how the features were scaled, the neighborhood captured all of those edible mushrooms above. Feature scaling matters.

This means that when using k-nearest neighbors, it's not enough to blindly feed in your raw data. You have to think carefully about each dimension of your data, imagine a one-unit difference in that dimension, and make sure a one-unit difference means about the same thing in each case. An alternative, which we won't go into in detail here, is to learn a feature scaling: you can adapt and shift those feature scalings to give you the best possible results. I'm going to put a plug in here: the End-to-End Machine Learning course 221 on k-nearest neighbors walks through several examples of implementing KNN on different data sets, and one of the things we do is adaptive feature scaling. So if you'd like to walk through how to actually implement this in Python and have access to the code, please take a look at course 221 at End-to-End Machine Learning.
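Here's a small sketch of that unit effect, with invented numbers. The same four mushrooms and the same query point give different nearest neighbors depending on whether height is recorded in centimeters or meters; standardizing each feature is one common, simpler guard against this than the adaptive scaling the course covers.

```python
import numpy as np

def nearest_indices(X, query, k):
    """Indices of the k points in X closest to `query` (Euclidean)."""
    d = np.sqrt(((X - query) ** 2).sum(axis=1))
    return np.argsort(d)[:k]

# Hypothetical data: columns are [cap diameter (cm), height (cm)]
X_cm = np.array([[4.0, 60.0], [5.0, 62.0], [9.0, 61.0], [4.5, 95.0]])
query_cm = np.array([4.2, 63.0])

# Same measurements with height in meters: data squishes toward the x-axis
X_m = X_cm * np.array([1.0, 0.01])
query_m = query_cm * np.array([1.0, 0.01])

print(nearest_indices(X_cm, query_cm, k=2))  # [1 0]: height differences dominate
print(nearest_indices(X_m, query_m, k=2))    # [0 3]: diameter differences dominate

# One common fix: rescale so each feature has zero mean and unit variance.
# (Any query point must be transformed with the same mean and std.)
mean, std = X_cm.mean(axis=0), X_cm.std(axis=0)
X_scaled, query_scaled = (X_cm - mean) / std, (query_cm - mean) / std
```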
Another thing that makes a difference in k-nearest neighbors is your distance metric: not just how you scale the features, but how you combine them to make a distance. Imagine that in our little space with just two measurements, height and cap diameter, we have a circle that represents our neighborhood, our closest neighbors. What that circle implies is that we're using an L2 norm: we're taking the difference in height squared plus the difference in cap diameter squared, adding them, and taking the square root, and that's the distance. It's the Euclidean distance, the as-the-crow-flies distance if you're looking at a map. It makes a nice circle, and all of the points on that circle are an equal distance from the plus sign.

But that's not the only way to make a distance metric. We could instead just add up the differences in the individual dimensions: say the distance is the difference in height in centimeters plus the difference in cap diameter in centimeters. This is known as the Manhattan distance, the taxicab distance, or the L1 norm, and it's another perfectly valid way to create a distance. When working in just two dimensions, the L2 norm is a really convenient way to visualize distance. But when working with a lot of features, you get a very high-dimensional space, and there are some counterintuitive mathematical results showing that as the dimensionality gets high, everything starts to gravitate toward being roughly the same distance apart under the L2 norm. That's much less of a problem for the L1 norm, so if you have a high-dimensional space, I really like the L1 norm for distance.

This is something to consider when you're using k-nearest neighbors. If you're looking at, say, distances on a map, the L2 norm, the Euclidean distance, makes great sense to use, because physically that is the distance; it's the intuitive motivation for the notion of distance in the first place. But if you have a long laundry list of features, then the L1 norm or some other distance metric may be a better way to go. There are exotic distance metrics you can use, but I really like the L1 norm because it's so simple to explain and compute.
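As a quick sketch, here's what those two metrics look like in code, on hypothetical measurements. Swapping the L1 distance into the classifier sketch from earlier would be a one-line change.

```python
import numpy as np

def l2_distance(a, b):
    # Euclidean, "as the crow flies": root of summed squared differences
    return np.sqrt(((a - b) ** 2).sum())

def l1_distance(a, b):
    # Manhattan / taxicab: sum of absolute differences in each dimension
    return np.abs(a - b).sum()

a = np.array([4.0, 60.0])  # [cap diameter (cm), height (cm)]
b = np.array([7.0, 64.0])

print(l2_distance(a, b))  # 5.0 = sqrt(3**2 + 4**2)
print(l1_distance(a, b))  # 7.0 = 3 + 4
```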
We can use k-nearest neighbors not just with continuous data like height and diameter, but also with categorical data. Imagine now we're looking at mushrooms in a different way: the cap diameter combined with the cap shape, whether it's indented or rounded, whether there's a little divot in the middle or it's nice and convex all the way across. The way we've represented it, cap shape is categorical: it's either indented or rounded. That means all of our mushrooms fall either on the line next to indented or the line next to rounded, while cap diameter is a continuous variable spread along the other axis.

If we want to use k-nearest neighbors here, we can treat it the same way. For instance, if we find a mushroom at this location, indented with this cap diameter, and we'd like to guess whether it's toxic or edible, we look for, in this case, the seven closest neighbors, K equals seven. We find two are toxic and five are edible, so we can declare it edible. But don't forget that scaling matters, and here that means how far apart those two lines sit changes the results we get. Here's another example: looking at this location with K equals five, we find the five closest neighbors and conclude this mushroom would be projected to be edible, even though an eyeball check says, you know what, it's sitting there among the indented cap shapes where all the rest are toxic; it's probably toxic. This is an example of shifting the feature weight, the feature scaling, so that cap shape just doesn't carry much weight. In that case, the edible mushrooms with rounded tops overpower the toxic mushrooms with indented tops and cause that position to be projected as edible. To repeat: feature scaling really matters with categorical data too.

We can also use k-nearest neighbors for regression, that is, when we don't want to assign a category to a data point but a value. Let's step back and say we're no longer interested in whether a mushroom is toxic or edible; we're interested in estimating its mass from its height and cap diameter. We've collected some data points, where the number plotted at each height and cap diameter represents the total mass of that mushroom. You can see that as the cap diameter and the height get larger, the mass gets larger. That makes sense all around. Now if we find a new mushroom and we know its height and cap diameter, we'd like to estimate its mass, and we can use k-nearest neighbors to do that too. For this case, K equals 5: we look at its 5 closest neighbors and list all of their masses: 51, 52, 59, 67, and 73. We can take the average of those, which is 60.4, so by that scheme we would estimate that this point has a mass of 60.4 grams.

Looking at that, you might decide it's a good estimate, or you might not. The point is to the right of the 67, but that 67 already sits kind of low between the 51 and the 59, so maybe 60.4 is about right. If we take the median instead, we get 59, which is even lower than our 60.4. Maybe that's a better answer, maybe it's not. What you use depends on you. This is a choice you get to make about the model; you can think of it as a hyperparameter. And the way to choose is to run both on data you already know and see which gives you the better result. So those are two equally valid ways to do it.

A third way is to take all those data points and weight their influence by how close they are to the point you care about. In this case, we still take our five closest neighbors, but some of them are much closer than others: the 51, 52, and 73 are out near the edge of the circle, the 59 is in the middle, and the 67 is really close. So we give the closest data points the most weight, and if we weight it this way, we might find that the weighted mean comes out higher, closer to 67, maybe 66. Again, depending on the nature of the problem you're solving, you can choose to use a weighted scheme. Such schemes exist for categorical data as well: depending on how close the data points are, they can get a larger or a smaller vote.
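Here's a sketch of all three estimates in one function: the plain mean, the median, and an inverse-distance-weighted mean. The masses are the ones from the example, but the neighbor positions are invented, so the weighted answer here lands around 64 rather than 66; the behavior is the same, though: it gets pulled toward the closest neighbor.

```python
import numpy as np

def knn_regress(X_train, y_train, query, k=5, method="mean"):
    """Estimate a value at `query` from the values of its k nearest neighbors."""
    d = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    values = y_train[nearest]
    if method == "mean":
        return values.mean()
    if method == "median":
        return np.median(values)
    if method == "weighted":
        # Inverse-distance weights: closer neighbors get a bigger say.
        # The tiny constant guards against dividing by zero.
        w = 1.0 / (d[nearest] + 1e-9)
        return (w * values).sum() / w.sum()

# Invented neighbor positions; masses (grams) from the example above
X = np.array([[1.0, 0.0], [0.9, 0.5], [0.5, 0.3], [0.1, 0.1], [0.0, 1.0]])
y = np.array([51.0, 52.0, 59.0, 67.0, 73.0])
query = np.array([0.0, 0.0])

for method in ("mean", "median", "weighted"):
    print(method, knn_regress(X, y, query, method=method))
# mean 60.4, median 59.0, weighted ~63.7 (pulled toward the nearby 67)
```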
So k-nearest neighbors, in my opinion, is highly underrated. It has zero training time. It's lazy, meaning there's no model to train: all the calculation happens when you ask it to do inference, when you ask for the category or value of a new data point. Compare that to the hours or weeks or even computer-centuries used to train other models, deep neural networks, for instance.

Another great thing about k-nearest neighbors is sample efficiency, meaning you don't have to collect very many data points before you can start making good, useful inferences about the space you're working in. This, again, is in contrast with, say, neural networks, where, if you're doing categorization, you might need tens of thousands of examples before you can reliably categorize new points into their classes.

K-nearest neighbors is explainable. If you ask me why the algorithm picked a certain value at a certain point, I can point to the specific measurements around it that influenced the answer and tell you exactly how they contributed. That type of explainability is very rare, and if you're in a system where you need to know why a certain answer was arrived at, you can't do better than k-nearest neighbors.

It's also incredibly easy to add and remove data. GPT-3, for instance, is a natural language processing neural network that, I believe, requires hundreds of millions of dollars of compute time to train at the state of the art. If you wanted to remove the influence of one piece of text, you would have to take it out of your training set and retrain the whole thing, at the cost of another multi-hundred-million-dollar check. It's just not feasible. Likewise, if you want to add new data, you have to add it and then retrain. Now, that's a little dramatic, and there are ways to shortcut it somewhat, but not by much; there's a lot of retraining to do every time you change your training data set. With k-nearest neighbors, it's easy. You just add or remove the data point from your collection, and the next time you make an inference, it won't be there, or the new one will, and you'll get a slightly different result. It's trivial to modify the data and to tell, say, someone represented in the data who wanted their data point deleted: yes, we deleted your data; it will no longer be reflected in the model in any way. It's very nice to be able to say that with confidence and certainty, and to be able to demonstrate to an audience how that's done and why it's the case.

Another, more subtle thing that's nice about k-nearest neighbors is that it's less sensitive to class imbalance, at least global class imbalance. Let's say we had a data set with a thousand edible mushrooms and just ten poisonous ones. K-nearest neighbors is okay with that, as long as locally, within the neighborhood of any new mushroom we collect, there's a good representation of whichever edible and toxic mushrooms happen to be close to it. As long as the local density is balanced, the global imbalance doesn't really matter. That's a rare property among machine learning algorithms as well.

Now, it's not without weaknesses. No algorithm is perfect; everything has trade-offs. K-nearest neighbors is expensive to compute. If you have a lot of data points and you need the five closest, you have to figure out how close all of them are so you can find those five, and calculating all of those distances can get very expensive if you have billions of data points. Also, as we've called out, it's sensitive to feature scaling and to your choice of distance metric, so you have to get those right or you can get really nonsensical answers.

Fortunately, there are workarounds for the weaknesses, so you can make it work for you. As we called out, you can learn your feature scaling. And even though it's expensive to compute the distance to every point in the data set, there are clever data structures, like k-d trees and ball trees, that I won't go into right this moment, which let you compute the distance for just a small chunk of the data. They let you say: I'm looking for a data point in this region, so focus on the data points in that region and just calculate the distance to those. That saves the huge computation of measuring the distance to all of your data points.
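As a sketch of what that can look like in practice: scikit-learn provides a KDTree (and a BallTree with the same interface) that you build once and then query cheaply. The data here is random, just to show the shape of the API.

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((100_000, 2))  # hypothetical: 100,000 mushrooms, 2 measurements

tree = KDTree(X)  # build once; the tree partitions the space into regions

# Query: distances to and indices of the 5 nearest neighbors of one point.
# Only a small fraction of the 100,000 distances actually get computed.
dist, ind = tree.query([[0.5, 0.5]], k=5)
print(ind[0], dist[0])
```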
Also, there's a really clever approach called data reduction. If I have 10,000 edible mushrooms, but 500 of them are nearly identical, I can throw most of them out and keep just one good exemplar, one good prototype, that shows what I'm looking for there. That way, when I have a new mushroom to check, if it's similar to that exemplar, k-nearest neighbors will still pick it up, but I don't have to keep all 500 examples. By doing clever data reduction, making sure the examples you keep are spread nice and evenly across your feature space, you can still get very high-quality answers at a small fraction of the data storage and compute requirements. I will call out that methods for data reduction are a topic of active research; this is not a solved problem. Good data reduction methods depend very much on the nature of the data and the problem you're trying to solve. So here's a plug: if you're looking for a good algorithmic research topic, you can do a lot worse than efficient data reduction for k-nearest neighbors.

I hope I've sold you on the fact that k-nearest neighbors is a solid machine learning algorithm. It doesn't get as much love as deep neural networks, but it should. It is robust. It is the crescent wrench of the toolbox: it works pretty well on almost everything, if approached carefully. If you'd like to learn more, take a look at the course I mentioned, course 221 at End-to-End Machine Learning, where we code it up from scratch in Python. We apply it to four different data sets: one with penguins, one with zoo animals, one with diamonds, and one interpolating elevation on maps. And we show how to implement some of these extras, like automatic feature scaling, that let you step up your k-nearest-neighbors performance. Thanks for tuning in.