Transform categorical data into numerical values using one-hot encoding to ensure compatibility with machine learning models. Learn to efficiently apply Pandas' get_dummies method to convert categorical variables into actionable binary columns.
Key Insights
- Convert categorical data like "low," "medium," and "high" into numerical format using one-hot encoding, creating separate binary columns for each category.
- Utilize Pandas' get_dummies function with the integer data type to efficiently transform categorical strings into numeric binary columns.
- Append the resulting one-hot encoded dataframe back to the original dataset, enabling machine learning models to effectively process and predict based on categorical features.
Let's take a look at how we can convert these values into numbers so the computer can model them. If you remember our linear regression, we ended up with sets of arrays that contained just numbers. We can do a similar thing here, except that we are going to convert the values into zeros and ones.
That's because our high, low, and medium aren't really measurements. We can't average them the way we could average actual salary figures, because we're not sure exactly what these categories represent numerically.
Instead, they are just going to be zeros and ones: a one for low, a one for medium, or a one for high. But if a single column held a one for every category, every row would be one and we couldn't tell the categories apart. So instead, we're going to have three separate columns.
For each row, there will be a low column, a medium column, and a high column. Each of these columns holds a one if the row belongs to that category and a zero if it doesn't. So every row will have a one in exactly one of the low, medium, or high columns, and zeros in the others.
That way, the computer just looks at zeros and ones. It doesn't know what these columns represent, but it can find a pattern, for example that rows with a one in a given column tended to stay and rows with a zero tended to leave, or the other way around. That gives the model information it can hopefully predict from, in a format it understands.
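To make the three-column idea concrete, here is a minimal sketch in plain Python, before we bring in Pandas. The sample values in `salaries` are made up for illustration and are not the course's HR data.

```python
# Hypothetical sample values, not the course's HR dataset
salaries = ["low", "medium", "high", "low"]
categories = ["high", "low", "medium"]

# One row per salary value, with a 0/1 entry for each category
one_hot_rows = [
    {category: int(salary == category) for category in categories}
    for salary in salaries
]

for salary, row in zip(salaries, one_hot_rows):
    print(salary, row)
```

Each row ends up with a single one, in the column that matches its category, and zeros everywhere else.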
We'll use a technique called one-hot encoding, which takes categorical data (which category you're in) and converts it into ones and zeros. And we'll use the Pandas get_dummies function to return a new DataFrame with these new columns. The name get_dummies is historical: in statistics, zero-and-one columns like these are called dummy variables.
But that's not how we think of it. We think of this as one-hot encoding. So here's how we're going to do that.
We're going to say salary_OHE, for "one-hot encoding," is the DataFrame we get when we run pandas.get_dummies on a column. In this case, it's the salary column of our HR data, which again is a string: low, medium, or high. The other thing we pass it is the data type, dtype, and we set it to int.
Note that's not a string; that's Python's built-in int type. It tells Pandas to store the new values as integers, so every entry comes out as an integer 0 or 1. Let's look at that salary one-hot encoding DataFrame.
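Here is a minimal sketch of that call. The `hr_data` name and its handful of rows are assumptions standing in for the course's HR dataset; `salary_OHE` follows the name used above, and `dtype` is passed as a keyword argument.

```python
import pandas as pd

# Hypothetical stand-in for the course's HR data; only the salary column matters here
hr_data = pd.DataFrame({"salary": ["low", "medium", "medium", "high", "low"]})

# dtype=int asks for integer 0/1 values in the new columns
salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

print(salary_OHE)
```

The result has one column per distinct salary value, named after that value, and the same number of rows as the column we passed in. Depending on the Pandas version, leaving out `dtype` may give a different column type, which is why we set it explicitly here.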
All right, it's showing us the first five and the last five rows. It has the same number of rows as the original, so it gave us one encoded row for each original row. And we can see this one was a low salary.
These two were medium salaries. These two were high salaries. And our last five were all low salaries.
Sorry, those were low. But you saw what I meant, even though I said it incorrectly. So that's what we've got here.
What we need to do now is append that DataFrame to our original DataFrame so that we have high, low, and medium to work with. We'll do that next.
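As a rough preview of that append step, here is one possible way to do it, assuming `pd.concat` along the columns; the course may well use a different method. The `hr_data` rows and the `left` column are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the course's HR data; "left" is an invented example column
hr_data = pd.DataFrame({
    "salary": ["low", "medium", "high"],
    "left": [1, 0, 0],
})

salary_OHE = pd.get_dummies(hr_data["salary"], dtype=int)

# axis=1 places the new columns side by side with the originals, matching rows by index
hr_with_ohe = pd.concat([hr_data, salary_OHE], axis=1)

print(hr_with_ohe)
```

Because get_dummies keeps the original row index, the encoded rows line up with the rows they came from.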