The Impact of Salary on Retention Through Crosstab Analysis

Learn how to utilize crosstabulation in Pandas to analyze relationships between variables such as salary and employee retention. Gain insights into effectively organizing data for clearer interpretation and modeling.

Key Insights

Crosstabulation in Pandas enables analysts to examine relationships between variables, such as salary and employee retention, by producing frequency tables that summarize data numerically.
Analyzing salary against employee retention through crosstabulation can help identify important features that influence employee decisions to stay or leave, providing valuable inputs for predictive models.
While Pandas-generated crosstab tables effectively reveal data relationships, initial outputs may require additional steps to reorder columns (e.g., low, medium, high salary) for improved readability and clearer interpretation.

This lesson is a preview from our Data Science & AI Certificate Online and Python Certification Online.

Now we'll do some crosstabs—crosstabulation, that is. This involves comparing variables against each other. It's a different form of data analysis because, unlike a graph, it produces numbers.

We could produce a graph from those numbers if this were a graphing-focused course. Instead, we're going to break the data down by salary and examine the impact salary had on employee retention. Again, this is part of data analysis because we want to see which values could help our model.

Which values should we give to our model as features, as inputs? It seems salary could significantly impact whether an employee stays or leaves. Let's run a crosstab, which is built into Pandas. It computes a frequency table of two or more variables: a count of how often each combination of values occurs, which shows how the values are distributed and gives some insight into their relationships.

It simply returns a new table. Again, it's not a graph—just a new DataFrame. All right, let's take a look at how we could create that.

It's very simple. We could call it left versus salary crosstab—that's just a name we're giving it.


We'll ask Pandas to generate a crosstab. Crosstabulate the "left" column of HRData against the "salary" column of HRData. And then we'll look at that crosstab.
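As a minimal sketch of that step, the snippet below builds a crosstab with `pd.crosstab`. The lesson's actual dataset isn't shown here, so the tiny `hr_data` DataFrame and the variable name `left_vs_salary_crosstab` are assumptions made up for illustration; only the "left" and "salary" column names come from the lesson.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's HR data (the real dataset is not shown).
# "left" is assumed to be 1 if the employee left and 0 if they stayed.
hr_data = pd.DataFrame({
    "left":   [0, 1, 0, 1, 0, 0, 1, 0],
    "salary": ["low", "low", "medium", "medium", "high", "low", "low", "medium"],
})

# Rows come from "left", columns from "salary".
# Each cell counts the employees with that combination of values.
left_vs_salary_crosstab = pd.crosstab(hr_data["left"], hr_data["salary"])

# The result is an ordinary DataFrame, not a graph.
print(left_vs_salary_crosstab)
```

Because the result is a regular DataFrame, individual counts can be read with the usual indexing, for example `left_vs_salary_crosstab.loc[1, "low"]` for employees on a low salary who left.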

All right, this is good, but it's a little hard to read because the columns (high, low, and medium) appear in alphabetical order rather than the low, medium, high order we'd expect. That makes the relationships harder to see. So, in the next step, we'll reorder this to be more human-readable.
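One possible way to do that reordering, sketched here under assumptions, is to select the columns in the order you want. The next lesson may well take a different approach; the sample `hr_data` and the names `salary_order` and `reordered` are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's HR data (the real dataset is not shown).
hr_data = pd.DataFrame({
    "left":   [0, 1, 0, 1, 0, 0, 1, 0],
    "salary": ["low", "low", "medium", "medium", "high", "low", "low", "medium"],
})

crosstab = pd.crosstab(hr_data["left"], hr_data["salary"])

# Columns come back sorted alphabetically: high, low, medium.
# Selecting them with an explicit list returns a copy in the listed order.
salary_order = ["low", "medium", "high"]
reordered = crosstab[salary_order]

print(reordered)
```

Selecting with a list raises a `KeyError` if one of the salary levels is missing from the data, which can be a useful check that every expected category is present.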

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
