Implement the k-nearest neighbors classifier effectively by understanding the significance of choosing the appropriate number of neighbors. Learn how training this supervised machine learning model enables accurate predictions for new data points.
Key Insights
- The k-nearest neighbors (KNN) classifier is a supervised machine learning algorithm that predicts classes by analyzing the majority vote among the closest data points, typically using an odd number like three or five to avoid ties.
- Increasing the number of neighbors makes predictions smoother and less sensitive to noise, but also more computationally expensive; three neighbors is a common starting point, and five has become increasingly popular as computing power has improved.
- Training the KNN model involves fitting the classifier with labeled data points (X, Y coordinates) and their corresponding classes (zero or one), enabling the model to predict the category of new, unseen data points effectively.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's start building a k-nearest neighbors classifier, our supervised machine learning model, from the values we have so far. This time, we'll save the zipped version of X and Y: we'll create a list of data points by zipping X and Y together, and then we can look at the data points. Here they are.
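As a rough sketch of that step, here is what zipping the coordinates into data points might look like in Python. The x, y, and classes values below are made-up placeholders standing in for the values built earlier in the lesson, not the actual ones.

```python
# Hypothetical coordinate and class values standing in for the ones
# created earlier in the lesson.
x = [4, 5, 10, 4, 3, 11, 14, 8, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
classes = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Zip the x and y values into a list of (x, y) data points.
data = list(zip(x, y))
print(data)
```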
This is similar to before, but now we've saved the points because we'll feed them to the k-nearest neighbors classifier. First, we create the classifier itself and store it in a variable we'll call the KNN model.
We set the number of neighbors to three. Three is a typical number, and five is also common. This defines how many neighbors to consider when looking for the majority class.
Including more neighbors makes each prediction smoother, but also more computationally expensive. Three is a good middle ground, and these days people often use five. Since computers keep getting faster, the extra cost matters less than it used to. The right choice also depends on the complexity of your data; a single neighbor is generally considered too low, because the prediction then hinges on one point.
For a two-class problem like this one, using an odd number of neighbors avoids ties when determining the majority class. If we used an even number, like two or four, the vote could split evenly. We want a definitive prediction, not an uncertain one.
That’s why we use an odd number of neighbors. Let’s run that. Now we have a KNN model, but it's not trained yet.
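A minimal sketch of that step, assuming scikit-learn's KNeighborsClassifier is the implementation being used:

```python
from sklearn.neighbors import KNeighborsClassifier

# Create the classifier, considering the three nearest neighbors
# when taking the majority vote.
knn_model = KNeighborsClassifier(n_neighbors=3)
```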
Training, or fitting, the model means giving it features and labels. In this case, the features are the data points, each described by its x and y coordinates, and the labels are the classes those points belong to, either 0 or 1.
Let's run this, and I'll explain it. KNNModel.fit: We'll pass the data points and their corresponding classes. Now we have a trained model!
It is now trained on the data, though in a slightly different way from the previous model. It didn't learn a new formula or set of parameters; instead, it simply stores all the data it needs to apply the k-nearest neighbors algorithm to any new points we test.
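Continuing the same sketch, fitting the model is a single call that hands it the data points and their classes:

```python
# Fit (train) the classifier on the data points and their classes.
# KNN simply stores this data for use at prediction time.
knn_model.fit(data, classes)
```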
Let’s test it, in fact. First, we'll add a new data point. Then we'll test how well it predicts the class of this new point.
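Here is how that test might look, again as a sketch; the new point's coordinates are made up for illustration.

```python
# A hypothetical new point whose class we want to predict.
new_point = (8, 21)

# predict expects a 2D array-like, so wrap the point in a list.
prediction = knn_model.predict([new_point])
print(prediction)  # the predicted class, 0 or 1
```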