One-Hot Encoding for Categorical Data in Machine Learning

Convert categorical salary data into numerical format using Pandas' one-hot encoding.

Transform categorical data into numerical values using one-hot encoding to ensure compatibility with machine learning models. Learn to efficiently apply Pandas' get_dummies method to convert categorical variables into actionable binary columns.

Key Insights

  • Convert categorical data like "low," "medium," and "high" into numerical format using one-hot encoding, creating separate binary columns for each category.
  • Utilize Pandas' get_dummies function with the integer data type to efficiently transform categorical strings into numeric binary columns.
  • Append the resulting one-hot encoded dataframe back to the original dataset, enabling machine learning models to effectively process and predict based on categorical features.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's take a look at how we can convert these values into numbers for the computer to be able to model. If you remember our linear regression, we ended up with just these sets of arrays that were just number, number, number, number, number, number. We can do a similar thing here, except that we are going to want to convert them into zeros and ones.

That's because our high, low, and medium aren't really like a measurement. They don't go into a median. There's like a mean salary, and we're not sure exactly what these numbers represent.

Instead, they are just going to be zeros and ones, a one for low, a one for medium, and a one for high. Now, one for everything would mean that everything would be one. So instead, what we're going to do is we're going to have three separate columns.

For each row, there will be a low column, a medium column, and a high column. Each of these columns will simply have a zero if that row is not, say, a high here, or a one if it is. So everything will have a one in either the low column or the medium column or the high column, and zeros in the others.

That way, the computer will just look at zeros and ones and just say, okay, a one here in this column. And again, it doesn't know what these columns represent, but a one in this column must mean that, you know, I'm finding a pattern where ones stayed or zeros left or the other way around. But it will give it predictive information, information it can hopefully predict based on, and it will be in a format that it understands.

Data Analytics Certificate: Live & Hands-on, In NYC or Online, 0% Financing, 1-on-1 Mentoring, Free Retake, Job Prep. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

We'll use a technique, one-hot encoding, that takes this categorical data, what category are you in, and converts it to ones and zeros. And we use the Pandas get dummies to return a new data frame where those values are in there, where it has these new columns. A get dummies is a historic name, historical name, where dummy data is sort of what it's producing here.

But that's not how we think of it. We think of this one-hot encoding. So here's how we're going to do that.

We're going to say salary OHE for one-hot encoding data frame is what you get when you run pandas.get dummies, and you pass it a column. In this case, it's HR data salary, which again is a string, low, medium, or high. And the second thing we pass it is, what's the data type? It's an int.

That's not a string note, that's the Python function int for converting to a function, converting to an int. So it'll run that on each one to make sure it's an integer. Let's look at that salary one-hot encoding data frame.

All right, showing us the first five and the last five. Same number of rows, so it gave us one for each one. And we can see this one was a low salary.

These two were medium salaries. These two were high salaries. And our last five were all low salaries.

Sorry, that was low. But you saw what I meant, even if I said it wrong. So that's what we've got here.

What we need to do now is append that data frame to our actual data frame so that we have high, low, and medium to work with. We'll do that next.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.

More articles by Colin Jaffe

How to Learn Machine Learning

Master machine learning with hands-on training. Use Python to make, modify, and test your own machine learning models.

Yelp Facebook LinkedIn YouTube Twitter Instagram