Gain clarity on how K-Nearest Neighbors (KNN) effectively categorizes iris species by analyzing multiple dimensions simultaneously. Understand how multidimensional data, challenging for humans, becomes easily manageable for computers.
Key Insights
- K-Nearest Neighbors (KNN) classification visually demonstrates clear clustering of iris species—setosa, versicolor, and virginica—based on sepal width and length.
- While humans easily interpret two-dimensional data visualizations, assessing multiple dimensions such as sepal width, sepal length, petal width, and petal length proves challenging.
- KNN simplifies working with higher-dimensional datasets, as computers efficiently calculate and compare multidimensional distances to classify data points accurately.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's take a look at some images to visualize what we're doing with these irises. If you run this code block, you get an image of a particular iris species called Versicolor, with the sepal length and width marked, and with the sepals and petals pointed out.
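The lesson's code block isn't reproduced here, but a minimal sketch of displaying such an image might look like this, assuming the photo exists as a local file (the filename is hypothetical):

```python
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

# 'iris_versicolor.jpg' is a hypothetical filename standing in
# for the lesson's photo of an iris versicolor
img = mpimg.imread('iris_versicolor.jpg')
plt.imshow(img)
plt.axis('off')  # hide axis ticks; we're just showing a photo
plt.show()
```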
You don't need a lot of domain knowledge about flowers to do this one. We can graph the sepal width against length for every flower in the dataset and get a graph that may help us understand how K-Nearest Neighbors is going to work with this. So that's the next code block right here.
Run that, and the three species we're going to work with are Setosa, Versicolor, and Virginica. Looking at sepal width and length, you can see many setosas clustered in one region, many virginicas in another, and many versicolors in a third. When we get a new item that lands near the virginica cluster, it's pretty obvious that it's going to be a Virginica.
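Here's a minimal sketch of what that plotting block might look like, assuming scikit-learn and matplotlib are available; the lesson's actual code may differ:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the classic iris dataset: 150 flowers, 4 measurements each
iris = load_iris()
X, y = iris.data, iris.target

# Plot sepal length (column 0) against sepal width (column 1),
# with one color per species
for species_idx, species_name in enumerate(iris.target_names):
    mask = y == species_idx
    plt.scatter(X[mask, 0], X[mask, 1], label=species_name)

plt.xlabel(iris.feature_names[0])  # sepal length (cm)
plt.ylabel(iris.feature_names[1])  # sepal width (cm)
plt.legend()
plt.show()
```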
The nearest neighbors are definitely the Virginica ones. However, the issue we'll face is that, while it's pretty easy for us humans to look at this data and say which species any dot belongs to when we're only looking at two dimensions, sepal width and length, it's much harder to eyeball when we're actually looking at sepal width, sepal length, petal length, and petal width.
Now, that is four dimensions, four variables, and it's hard for us to visualize things in four dimensions. But for the computer, it's actually quite easy: it can calculate the distance between a point and its neighbors in four-dimensional space and find the neighbors with the smallest distances just as readily as it can in two.
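To make that concrete, here's a small sketch of the four-dimensional distance calculation using numpy; the new flower's measurements are invented for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # X has shape (150, 4)

# A hypothetical new flower: sepal length, sepal width,
# petal length, petal width (cm)
new_flower = np.array([6.1, 2.9, 4.7, 1.4])

# Euclidean distance from the new flower to every flower in the
# dataset, computed across all four dimensions at once
distances = np.sqrt(((X - new_flower) ** 2).sum(axis=1))

# Indices of the k = 5 nearest neighbors
k = 5
nearest = np.argsort(distances)[:k]
print(iris.target_names[y[nearest]])  # species of the 5 closest flowers
```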
This is very hard for us to work with. Even three-dimensional space becomes much more challenging for us, let alone four, five, or six. So that's where K-Nearest Neighbors will really help us: working with higher-dimensional datasets, as we'll see throughout this lesson.
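As a preview, here's a hedged sketch of how scikit-learn's KNeighborsClassifier can handle all four dimensions at once; again, the sample measurements are made up:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# Fit a KNN classifier on all four features
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(iris.data, iris.target)

# Classify a hypothetical new flower using all four measurements
prediction = knn.predict([[6.1, 2.9, 4.7, 1.4]])
print(iris.target_names[prediction])  # prints the predicted species name
```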