Learn how to utilize crosstabulation in Pandas to analyze relationships between variables such as salary and employee retention. Gain insights into effectively organizing data for clearer interpretation and modeling.
Key Insights
- Crosstabulation in Pandas enables analysts to examine relationships between variables, such as salary and employee retention, by producing frequency tables that summarize data numerically.
- Analyzing salary against employee retention through crosstabulation can help identify important features that influence employee decisions to stay or leave, providing valuable inputs for predictive models.
- While Pandas-generated crosstab tables effectively reveal data relationships, initial outputs may require additional steps to reorder columns (e.g., low, medium, high salary) for improved readability and clearer interpretation.
Now we'll do some crosstabs—crosstabulation, that is. This involves comparing variables against each other. It's a different form of data analysis because, unlike a graph, it produces numbers.
We could produce a graph from those numbers if this were a graphing-focused course, but instead we're going to break the data down by salary. Let's examine the impact salary had on employee retention. Again, this is part of data analysis because we want to see which values could help our model.
Which values should we give to our model as features, as inputs? It seems salary could significantly impact whether an employee stays or leaves. Let's run a crosstab, which is built into Pandas. It computes a frequency table of two or more variables: a count of how often each combination of values occurs in the data. That shows us the distribution of values and gives some insight into their relationships.
It simply returns a new table. Again, it's not a graph—just a new DataFrame. All right, let's take a look at how we could create that.
It's very simple. We could call it left versus salary crosstab—that's just a name we're giving it.
We'll ask Pandas to generate a crosstab. Crosstabulate the "left" column of HRData against the "salary" column of HRData. And then we'll look at that crosstab.
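Here is a minimal sketch of that call. The real HRData isn't shown here, so the tiny DataFrame below is a made-up stand-in, and we're assuming "left" is stored as 0 (stayed) and 1 (left); the variable name left_vs_salary_crosstab is also just our choice.

```python
import pandas as pd

# Hypothetical stand-in for the course's HRData DataFrame (values are made up).
# Assumption: "left" is 0 for employees who stayed and 1 for those who left.
HRData = pd.DataFrame({
    "left":   [0, 0, 1, 1, 0, 1, 0, 0],
    "salary": ["low", "medium", "low", "low", "high", "medium", "high", "medium"],
})

# Rows come from the first argument ("left"), columns from the second ("salary").
# Each cell counts how many employees have that combination of values.
left_vs_salary_crosstab = pd.crosstab(HRData["left"], HRData["salary"])

print(left_vs_salary_crosstab)
```

The result is an ordinary DataFrame, so we can index it, slice it, or pass it along like any other table.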
All right, this is good, but it's a little hard to read because the columns come out in alphabetical order (high, low, medium) rather than the low-to-high order we'd expect. That makes it harder to see the relationships here. So, in the next step, we'll reorder this to be more human-readable.
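One possible way to do that reordering, as a sketch only: select the columns in the order we want. The course's next step may use a different approach, and the sample data and the names salary_order and reordered_crosstab are our own assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the course's HRData DataFrame (values are made up).
HRData = pd.DataFrame({
    "left":   [0, 0, 1, 1, 0, 1, 0, 0],
    "salary": ["low", "medium", "low", "low", "high", "medium", "high", "medium"],
})

left_vs_salary_crosstab = pd.crosstab(HRData["left"], HRData["salary"])

# Selecting columns with a list returns them in the order listed,
# so the table now reads low, medium, high from left to right.
salary_order = ["low", "medium", "high"]
reordered_crosstab = left_vs_salary_crosstab[salary_order]

print(reordered_crosstab)
```

Selecting by a list leaves the counts untouched; only the left-to-right order of the columns changes.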