Predict employee retention accurately using logistic regression, a powerful model suited for classification tasks. Examine crucial metrics and data-driven insights to determine whether employees will stay or leave a company.
Key Insights
- Apply logistic regression instead of linear regression for binary outcomes. It is particularly useful for predicting discrete variables such as employee retention ("stayed" or "left").
- Analyze comprehensive employee data including satisfaction level, average monthly hours, number of projects, and promotions received to build predictive models.
- Prepare the data with StandardScaler and a train-test split, then evaluate prediction accuracy with the performance metrics provided by Python libraries.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's talk about what we're doing next. We've used linear regression to predict continuous values like price. Now, what about discrete values? Dog versus cat: those are classification problems.
For that, we use logistic regression. In this case, we're going to predict whether employees stayed at or left their jobs. Given a certain salary, average working hours, department, or whatever other features we feed into the model, we want to predict whether the employee will stay or leave. That calls for a different type of model: logistic regression.
It's not about fitting a line to predict a number. It's about answering yes or no: stayed or left. So let's take a look at the code we have.
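To make the yes-or-no idea concrete, here is a minimal sketch of what logistic regression does with a score: it squashes it into a probability between 0 and 1 and then applies a cutoff. The function names and the 0.5 threshold are assumptions chosen for illustration, not code from the lesson. The labels follow this dataset's convention of 1 for left and 0 for stayed.

```python
import math


def sigmoid(score: float) -> float:
    """Squash any real-valued score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))


def predict_label(score: float, threshold: float = 0.5) -> int:
    """Return 1 (left) if the probability reaches the threshold, else 0 (stayed)."""
    return 1 if sigmoid(score) >= threshold else 0


# Hypothetical scores: strongly negative, neutral, strongly positive.
for score in (-3.0, 0.0, 3.0):
    print(score, round(sigmoid(score), 3), predict_label(score))
```

A large negative score maps to a probability near 0 (stayed), a large positive score maps to a probability near 1 (left), and a score of 0 sits exactly at 0.5.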
We're bringing in almost exactly the same things we brought in last time: StandardScaler and train_test_split. We're also bringing in some new metrics.
We're going to dive a bit deeper into how we can best measure our success or failure. How accurate was the model according to different measurement tools? And instead of bringing in linear regression to create our model, we're bringing in logistic regression. All right, make sure you run that and then run this, which, again, may take a minute if you haven't run it yet, but I already did. Our base URL should be the same.
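As a rough sketch of how these imports fit together, the example below uses scikit-learn on made-up synthetic data rather than the HR dataset. The lesson does not say which metrics it imports, so accuracy_score and confusion_matrix are assumptions, as are the variable names and the split size.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, made-up data purely to show the pieces working together.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 or 0, like left vs. stayed

# Hold out a quarter of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training rows only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# LogisticRegression takes the place of LinearRegression for a yes/no target.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
predictions = model.predict(X_test_scaled)

# Two of the possible ways to measure how the model did.
accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions, labels=[0, 1])
print(accuracy)
print(matrix)
```

Fitting the scaler on the training rows alone keeps information about the test rows from leaking into the preprocessing step.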
And now we're grabbing some human resources analytics data from our CSV. We're going to turn that CSV into a DataFrame and call it HR data.
It's what you get when you run pd.read_csv using the base URL we defined above plus the HR CSV URL. (I'm waiting for this autocomplete to speed up a little bit. There it is.)
And then we can take a look at our HR data, assuming that worked. Here's our data.
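The loading step might look something like the sketch below. The lesson's actual base URL and CSV file name are not shown here, so this version reads a few made-up rows from an in-memory string to stay runnable offline. The column names and the hr_data variable name are assumptions based on the spoken description.

```python
import io

import pandas as pd

# Made-up stand-in for the HR CSV; column names and values are assumptions.
sample_csv = io.StringIO(
    "satisfaction_level,last_evaluation,number_project,average_monthly_hours,"
    "time_spend_company,work_accident,left,promotion_last_5years,department,salary\n"
    "0.38,0.53,2,157,3,0,1,0,sales,low\n"
    "0.80,0.86,5,262,6,0,1,0,sales,medium\n"
    "0.72,0.65,4,180,3,0,0,1,support,high\n"
)

# In the lesson, the argument would be the base URL joined with the HR CSV name;
# pd.read_csv accepts a URL, a file path, or a file-like object like this one.
hr_data = pd.read_csv(sample_csv)

# Take a first look at the rows and the overall shape.
print(hr_data.head())
print(hr_data.shape)
```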
We can see quite a lot of columns that can help out. This is their satisfaction level. How well did they perform on their last evaluation? How many projects did they have? What were their average monthly hours? How many years did they spend at the company? How many work accidents have they had? A lot of zeros, that's good.
Did they leave or stay? We have a lot of ones here. One is for left, zero is for stayed. Our first five people all left, and so did our last five.
How many promotions did they receive in the last five years? Well, none. Maybe that's why they left. We'll see. And what department are they in? Our first five folks are in sales, our last five are in support.
And what is their salary? It is categorized as low, medium, or high. So that's the data we have to work with. We're going to dive into what we'll do with that data in a moment.
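Before doing anything with columns like these, it can help to count the values in the target and check the categories. The sketch below uses a tiny made-up table, and its column names are assumptions. Treating salary as an ordered category is one possible choice, not necessarily what the lesson does next.

```python
import io

import pandas as pd

# Made-up rows with assumed column names; not the real HR data.
sample_csv = io.StringIO(
    "left,department,salary\n"
    "1,sales,low\n"
    "1,sales,medium\n"
    "0,support,high\n"
    "1,support,low\n"
)
hr_data = pd.read_csv(sample_csv)

# Count how many employees left (1) versus stayed (0).
left_counts = hr_data["left"].value_counts()

# Record that salary levels have a natural order: low < medium < high.
salary_order = ["low", "medium", "high"]
hr_data["salary"] = pd.Categorical(
    hr_data["salary"], categories=salary_order, ordered=True
)

print(left_counts)
print(hr_data["department"].unique())
print(hr_data["salary"].min(), hr_data["salary"].max())
```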