Transform raw data into clear, human-readable insights by renaming indices and reordering columns in your dataset. Learn practical techniques to prepare categorical data effectively for machine learning models.
Key Insights
- Improve readability by renaming index values from numerical codes (0 and 1) to descriptive labels ("Stayed" and "Left"), making data interpretation intuitive.
- Reorder dataset columns using Python's .pop() and .insert() methods, such as moving the "High" salary column to a preferred position, to enhance clarity and data analysis.
- Convert categorical variables like "Stayed," "Left," and salary categories ("Low," "Medium," "High") into numerical representations, enabling computers and machine learning models to process and analyze the data effectively.
Okay, we're going to rename some things, move things around, and make this more human readable. Zeros and ones are what the computer needs for stayed versus left, but they're not very human readable.
And high, low, medium isn't in order either. So let's see what we can do there. First, we can rename our index.
Instead of leaving it as zero and one, we can say that, for the left-versus-salary crosstab, its .index equals a list of the row names: "Stayed," which I will hopefully spell correctly, and "Left." And now we can look at that again.
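The renaming step just described can be sketched roughly as follows. This is a minimal illustration, not the course's exact code: the DataFrame `df`, its toy values, and the variable name `left_vs_salary` are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical toy data; the real dataset and its counts are not shown here.
df = pd.DataFrame({
    "left":   [0, 0, 1, 0, 1, 0],
    "salary": ["low", "medium", "low", "high", "medium", "low"],
})

# Rows are the 0/1 values of "left"; columns are the salary categories.
left_vs_salary = pd.crosstab(df["left"], df["salary"])

# Replace the 0/1 row labels with readable names.
# The list order must match the existing row order: 0 first, then 1.
left_vs_salary.index = ["Stayed", "Left"]
print(left_vs_salary)
```

Assigning a list to .index relabels the rows in place, so the list needs exactly one name per row, in the same order as the rows it replaces.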
And there, now it's "Stayed" and "Left." Much easier to read. Okay, our next step is a little trickier code-wise: not hard, but definitely a little more complex.
First, we need to remove and save the high column. If we want to move high over to the right, it's not simply a matter of moving it over. We actually have to take it off and then add it back on at the end.
So here's how we'll do that. We'll say high_column (though we can name it whatever we want) is what we get when we run .pop("high"), removing that column. Now don't run this yet.
If you run this partway through, you will end up losing your high column. You could go back and rerun all your earlier cells to get it back, but let's avoid having to do that at all. So let's run this only once the whole cell is written, so that the same run that removes the column also puts it back on.
Now we say .insert(): insert a column at index two, meaning after indexes zero and one, as the third column. We'll call it "high" again. And its values will be that high column we just saved.
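Put together, the pop-and-insert step might look like the sketch below. The counts are hypothetical placeholders, and `left_vs_salary` is an assumed name for the crosstab; only the .pop() and .insert() calls come from the walkthrough.

```python
import pandas as pd

# Hypothetical counts, for illustration only.
left_vs_salary = pd.DataFrame(
    {"high": [9, 1], "low": [7, 3], "medium": [8, 2]},
    index=["Stayed", "Left"],
)

# .pop() removes the column from the DataFrame in place and returns it.
high_column = left_vs_salary.pop("high")

# .insert() puts it back at position 2, which is now the end (third column).
left_vs_salary.insert(2, "high", high_column)

print(list(left_vs_salary.columns))  # ['low', 'medium', 'high']
```

Keeping both lines in the same cell is what makes this safe: the column is never missing once the cell has finished running. An alternative that avoids removing anything, if you prefer it, is to select the columns in the order you want, such as left_vs_salary[["low", "medium", "high"]].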
And now we can run this and we should have "high" over on the right. All right, now this is much easier to read and we can see that there's a relationship here between salary—low, medium, and high—and "Stayed" or "Left." Now, again, we're not graphing this for a more visual audience.
This might be good to make a nice chart for. But for now, we can just see as numbers people, as data people, that for low salary folks, that's about 70% who stayed. For medium, it's higher; about the same number of people stayed among the medium-salary folks, but about two-thirds as many left.
That means it's about 80-20 instead for medium-salary folks. Only 20% of them left versus 30%. And it's vastly different for high folks.
There are fewer of them, so the sample size is smaller, but it's still fairly large. Without precisely calculating, that's over 90%. Around 92–93%.
So a lot more people stayed. This seems like it's a significant thing to train the model on. Given people's salaries, it might be worth examining further.
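If you would rather calculate those percentages than eyeball them, one way is to divide each column by its total. The counts below are hypothetical round numbers chosen only to mirror the rough percentages mentioned above; they are not the real dataset's counts.

```python
import pandas as pd

# Hypothetical counts, NOT the real data: picked to echo roughly 70%, 80%, and 92% stayed.
counts = pd.DataFrame(
    {"low": [70, 30], "medium": [80, 20], "high": [92, 8]},
    index=["Stayed", "Left"],
)

# counts.sum() gives one total per column; dividing aligns on column names,
# so each cell becomes that row's share within its salary level.
shares = counts / counts.sum()

print((shares * 100).round(1))
```

When building the table from the raw rows, pd.crosstab also accepts normalize="columns", which should give the same per-column shares directly.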
Now, the computer can't look at "Stayed" and "Left." It can't look at low, medium, and high, because those are words, and they don't mean anything to the computer as words. We need to give it numbers.
That's what computers understand. So the next step, we'll convert these values to numbers.
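As a preview of what that conversion could look like, here is one possible sketch using a dictionary and .map(). The specific codes (0, 1, 2), the column name `salary_code`, and the toy DataFrame are assumptions for illustration; the course may use a different encoding approach.

```python
import pandas as pd

# Hypothetical toy column; the real dataset is not shown here.
df = pd.DataFrame({"salary": ["low", "high", "medium", "low"]})

# Assumed encoding that preserves the natural order low < medium < high.
salary_order = {"low": 0, "medium": 1, "high": 2}

# .map() looks up each word in the dictionary and returns its number.
df["salary_code"] = df["salary"].map(salary_order)
print(df)
```

Any value missing from the dictionary would come back as NaN, which is a handy way to spot misspelled or unexpected categories.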