Creating a DataFrame with Iris Dataset

Transform the Iris dataset into an intuitive DataFrame by mapping numeric target labels to meaningful flower species names. Learn to streamline this process using Pandas' apply method with both regular Python functions and lambda expressions.

Key Insights

Utilize Pandas to convert the Iris dataset into a structured DataFrame, clearly labeling columns such as sepal length, sepal width, petal length, and petal width, covering a total of 150 observations.
Map numeric target indicators (0, 1, 2) in the original dataset to their corresponding flower species—setosa, versicolor, and virginica—improving readability by creating an additional "species" column.
Apply Pandas' apply method effectively with standard Python functions and lambda functions as demonstrated in the article, enabling efficient data transformation and enhancing dataset comprehension.


Let's start making this into a DataFrame that we can work with. First, we can take a look at one more attribute, the target names: Setosa, Versicolor, and Virginica. We'll use those shortly. We can also look at the feature names: Sepal Length, Sepal Width, Petal Length, and Petal Width.
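As a minimal sketch of that inspection step: the lesson does not show how `iris_data` was loaded, so this assumes it comes from scikit-learn's `load_iris()`.

```python
from sklearn.datasets import load_iris

# Assumption: iris_data is the object returned by scikit-learn's loader.
iris_data = load_iris()

# The three species names, stored as a NumPy array of strings.
print(iris_data.target_names)

# The four measurement names, stored as a plain Python list.
print(iris_data.feature_names)
```

In scikit-learn's copy of the dataset, the names are lowercase and the feature names carry a unit suffix, for example `sepal length (cm)`.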

So those are going to be our column names, so that we can make this into a proper dataset. Let's do that. We're going to say `iris_df` equals a Pandas DataFrame, where the data is `iris_data.data` and the column names are `iris_data.feature_names`. And then we can look at our DataFrame.

All right, it's split up into these four columns, with each row being, remember, an array of four items. And we've got our column names: Sepal Length, Sepal Width, Petal Length, Petal Width. And there are 150 total rows.
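The DataFrame construction described above might look like the following sketch. It assumes `iris_data` comes from scikit-learn's `load_iris()` and names the DataFrame `iris_df`, matching the code later in this walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the dataset is loaded from scikit-learn.
iris_data = load_iris()

# Each row of iris_data.data holds four measurements for one flower;
# the feature names become the column labels.
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

print(iris_df.head())   # first five rows
print(iris_df.shape)    # (number of rows, number of columns)
```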

Next, we'll take a look at adding the target. Right now we don't know which of these flowers is Setosa, Versicolor, or Virginica. So before we actually add the target as a column, which is our goal here, we can look at `iris_data.target`. And that's an array of zeros, ones, and twos.

And again, those stand for Setosa, Versicolor, and Virginica, in the same order as our target names. Now we can get them onto our data. We can say `iris_df`, add a new column `target`, and it equals `iris_data.target`. And now let's look at `iris_df`.
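A short sketch of adding the target column, again assuming `iris_data` comes from scikit-learn's `load_iris()` and that the DataFrame is named `iris_df`:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the dataset is loaded from scikit-learn.
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)

# iris_data.target has one integer label (0, 1, or 2) per row,
# so it can be assigned directly as a new column.
iris_df["target"] = iris_data.target

print(iris_df.head())  # top of the table
print(iris_df.tail())  # bottom of the table
```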


You can see now it has a target column, starting with zeros at the top and, at the tail end, all twos. All right, now this is all well and good, but I'm definitely going to forget which one's zero, which one's one, and which one's two. I don't know about you, but I'm definitely not going to keep that in mind.

In fact, I don't know it now. So we're going to make a `species` column. To do so, we first need to go over this `target` column.

And for each value, translate it from these numbers (zero, one, or two) to flower names. The flower names are in `iris_data.target_names`, and that's an array: Setosa at index zero, Versicolor at index one, Virginica at index two. We can use the target number as an index into `iris_data.target_names` to get the flower species.

So for every target that's zero, we'll look up index zero in the target names. If it's one, we'll look up index one, and if it's two, we'll look up index two.

To do that for every row, we need to use Pandas' `apply` method. The `apply` method takes in a function and runs it on each value.
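Before bringing in `apply`, the index lookup on its own can be sketched like this. The loading call is an assumption (scikit-learn's `load_iris()`); the lookup itself is ordinary array indexing.

```python
from sklearn.datasets import load_iris

# Assumption: the dataset is loaded from scikit-learn.
iris_data = load_iris()

# Use each target number as a position in the target_names array.
for target_number in (0, 1, 2):
    flower_name = iris_data.target_names[target_number]
    print(target_number, flower_name)
```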

Now we'll do this both with a named function and with a lambda so you can see the different ways to do it. I prefer to start with a regular Python function.

And this is one that takes in a target number and returns the flower name that the target number maps to. We'll call it `get_flower_name`. Inside the function, I'm going to make a `flower_name` variable that holds the entry in `iris_data.target_names` at that target number. So again, if `target_number` is zero, this will be `iris_data.target_names[0]`, which is Setosa.

If `target_number` is one, then it will be the target name at index one, and so on. We'll save that string as `flower_name` and return it. All right, so now what we can do is take that function and give it to Pandas to run on every target.

Right, so for the first row, it'll run the function on the target number, give us back that flower name, and make that the value for `iris_df['species']`. So it's `iris_df['target']`, but applying our `get_flower_name` function. And now let's double-check a couple of these. Actually, let's do `iris_df.sample(10)` to get 10 random flowers.

So we're defining our `get_flower_name` function, and we're saying: apply that function to every target value and save the result as the species value. Let's try that.
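Put together, the named-function version might look like the sketch below. The names `get_flower_name`, `flower_name`, `target_number`, and `iris_df` come from the walkthrough; loading via scikit-learn's `load_iris()` is an assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the dataset is loaded from scikit-learn.
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_df["target"] = iris_data.target


def get_flower_name(target_number):
    """Return the species name for a numeric target (0, 1, or 2)."""
    flower_name = iris_data.target_names[target_number]
    return flower_name


# apply calls get_flower_name once per value in the target column
# and collects the results into a new Series.
iris_df["species"] = iris_df["target"].apply(get_flower_name)

# Ten random rows, so that more than one species is likely to show up.
print(iris_df.sample(10))
```

Because `sample` picks rows at random, the rows printed will differ from run to run.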

There are some random ones. It applied that function to this target and got Versicolor. It applied that function to this one and got Setosa, and so on.

And here are some twos. Now we have a very human-readable species column. If you want to try that with a lambda, we could have skipped defining this function to begin with.

Again, if you're pretty comfortable with Python lambdas, this is a good way to do it. Instead of this line and this function, we could have done it all in one line: `iris_df['species'] = iris_df['target'].apply(lambda target_number: iris_data.target_names[target_number])`.
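As a runnable sketch of the lambda version, with the setup repeated so the block stands alone (loading via scikit-learn's `load_iris()` is an assumption):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the dataset is loaded from scikit-learn.
iris_data = load_iris()
iris_df = pd.DataFrame(data=iris_data.data, columns=iris_data.feature_names)
iris_df["target"] = iris_data.target

# Same lookup as the named function, written inline as a lambda.
iris_df["species"] = iris_df["target"].apply(
    lambda target_number: iris_data.target_names[target_number]
)

print(iris_df.sample(10))
```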

Again, this does the same thing as the function up here. It's just more concise, all in one line. Let's run that. Whoops, I made a typo at the start, so let me try running it again. There we go.

Same result. It just depends on which style you prefer. But either way, we now have a human-readable set of species.


Colin Jaffe
