Data for Readability: Enhancing Index and Column Clarity

Rename columns and rearrange data to improve readability and identify salary as a factor influencing employee retention.

Transform raw data into clear, human-readable insights by renaming indices and reordering columns in your dataset. Learn practical techniques to prepare categorical data effectively for machine learning models.

Key Insights

  • Improve readability by renaming index values from numerical codes (0 and 1) to descriptive labels ("Stayed" and "Left"), making data interpretation intuitive.
  • Reorder dataset columns using the pandas DataFrame methods .pop() and .insert(), such as moving the "High" salary column to a preferred position, to enhance clarity and data analysis.
  • Convert categorical variables like "Stayed," "Left," and salary categories ("Low," "Medium," "High") into numerical representations, enabling computers and machine learning models to process and analyze the data effectively.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Okay, we're going to rename some things and move things around to make this more human-readable. Zeros and ones are what the computer needs for stayed versus left, but they're not very human-readable.

And the columns (high, low, medium) aren't in a sensible order. So let's see what we can do there. First, we can rename our index.

Instead of zero and one, we can take the left-versus-salary crosstab and set its index equal to a list of the row names: "Stayed," which I will hopefully spell correctly, and "Left." And now we can look at that again.
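Here's a minimal sketch of that renaming step. The tiny DataFrame and the variable name left_vs_salary are stand-ins made up for illustration, not the course's actual data or names; the only parts taken from the lesson are the crosstab and the assignment to its index.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataset: "left" is 0 (stayed) or 1 (left).
df = pd.DataFrame({
    "left":   [0, 0, 1, 0, 1, 0, 0, 1],
    "salary": ["low", "medium", "low", "high", "medium", "low", "high", "low"],
})

# Cross-tabulate: one row per "left" value, one column per salary level.
left_vs_salary = pd.crosstab(df["left"], df["salary"])

# The rows come out sorted as 0, 1, so the labels are listed in that same order.
left_vs_salary.index = ["Stayed", "Left"]
print(left_vs_salary)
```

The order of the list matters: the first label replaces the first row (0), the second replaces the second row (1).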

And there, now it's "Stayed" and "Left." Much easier to read. Our next step is a little trickier code-wise: not hard, but definitely a little more complex.

First, we need to remove and save the high column. If we want to move high over to the right, it's not simply a matter of moving it over. We actually have to take it off and then insert it back at the position we want.

So here's how we'll do that. We'll say high_column (though we can name it whatever we want) is what we get when we run .pop("high") on the crosstab, which removes that column and hands it back to us. Now don't run this yet.

If you run the cell with only the pop in it, you will end up losing your high column. You could go back and rerun all your earlier cells to get it back, but let's avoid having to do that at all. So let's run this only once the cell is finished, so that in the same run where we take the column off, we also put it back on.

Now we say insert: insert a column at index two, meaning after zero and one, as the third column. We'll call it "high" again. And its values will be that high column we just saved.
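Put together, the cell might look something like this sketch. The counts are invented and the DataFrame is named crosstab purely for illustration; the pop() and insert() calls are the standard pandas DataFrame methods.

```python
import pandas as pd

# Hypothetical counts, with columns in the alphabetical order a crosstab produces.
crosstab = pd.DataFrame(
    {"high": [9, 1], "low": [7, 3], "medium": [8, 2]},
    index=["Stayed", "Left"],
)

# pop() removes the column from the DataFrame and returns it as a Series.
high_column = crosstab.pop("high")

# insert(loc, column, value) puts it back, in place, at position 2.
crosstab.insert(2, "high", high_column)

print(list(crosstab.columns))
```

Both calls change the DataFrame in place, which is why the pop and the insert belong in the same cell.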

And now we can run this and we should have "high" over on the right. All right, now this is much easier to read and we can see that there's a relationship here between salary—low, medium, and high—and "Stayed" or "Left." Now, again, we're not graphing this for a more visual audience.

This might be good to make a nice chart for. But for now, we can just see as numbers people, as data people, that among low-salary folks, about 70% stayed. For medium, it's higher: about the same number of people stayed, but only about two-thirds as many left.

That means it's about 80-20 for medium-salary folks. Only 20% of them left, versus 30% of low-salary folks. And it's vastly different for high-salary folks.

There are fewer of them, so the sample size is smaller, but it's still fairly large. Without precisely calculating, that's over 90% who stayed. Around 92–93%.

So a lot more people stayed. Salary looks like a significant thing to train the model on, and it's worth examining further.
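If you'd rather calculate than eyeball it, one way is to divide each column by its total. This is only a sketch: the counts below are invented to make the arithmetic easy to follow and are not the figures from the lesson's dataset.

```python
import pandas as pd

# Hypothetical counts chosen only to make the arithmetic easy to follow.
crosstab = pd.DataFrame(
    {"low": [70, 30], "medium": [80, 20], "high": [45, 5]},
    index=["Stayed", "Left"],
)

# crosstab.sum() gives one total per column; dividing aligns on column names,
# so each cell becomes its share of that salary level.
shares = crosstab / crosstab.sum()
percentages = (shares * 100).round(1)
print(percentages)
```

With these made-up counts, the "Stayed" row comes out to 70%, 80%, and 90% for low, medium, and high.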

Now, the computer can't look at "Stayed" and "Left." It can't look at low, medium, and high, because those are words, and they don't mean anything to the computer as words. We need to give it numbers.

That's what computers understand. So the next step, we'll convert these values to numbers.
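As a rough preview of what that conversion can look like, here's one possible sketch using a dictionary and the pandas Series method .map(). The column names, the sample rows, and the particular codes are all assumptions for illustration; the course may well take a different approach.

```python
import pandas as pd

# Hypothetical rows using the human-readable labels from this lesson.
df = pd.DataFrame({
    "status": ["Stayed", "Left", "Stayed"],
    "salary": ["low", "high", "medium"],
})

# Illustrative mappings; the specific codes are assumptions, not the course's.
status_codes = {"Stayed": 0, "Left": 1}
salary_codes = {"low": 0, "medium": 1, "high": 2}

# map() looks up each label in the dictionary and returns the matching number.
df["status_code"] = df["status"].map(status_codes)
df["salary_code"] = df["salary"].map(salary_codes)
print(df)
```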

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.