Learn how to efficiently fill missing data values using mean and mode techniques to enhance your dataset. This guide demonstrates step-by-step methods to handle incomplete Titanic passenger data by applying practical data cleaning strategies.
Key Insights
- The article outlines a practical method for filling missing age values by calculating and applying gender-specific mean ages—27.9 years for women and 30.7 years for men—to improve data completeness.
- Explains the use of Pandas functions such as apply() combined with custom functions to systematically fill missing numerical data based on conditional logic.
- Demonstrates filling categorical missing data by using the mode (most common value), assigning the value "S" to two missing entries in the Titanic dataset's "Embarked" column.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
We're going to fill in our age values now. We're going to fill them in with the mean age. This is maybe not the perfect way to do it, but it gives us real values we can use, even if about 20% of these ages will be generic.
They'll at least make the rest of the data workable. We're going to calculate the mean age for women. This is just some math, some leveraging of our data frame knowledge and so on.
First, let's make a women's data frame that will be the Titanic data where the sex value is female. So that means it'll be just a filtered version where it'll only be those rows. Now we'll say, okay, great.
I want to get the ages of those. We'll say ages of women is women data frame age column. And now that we've got just a series of numbers, we could say, okay, the mean woman age is the ages of women dot mean, rounded to one decimal place.
And we can check that out. Women's is 27.9. And we'll do the same thing for men. And in fact, I'm going to do a little bit of copy and paste here and change women to men.
Okay. I think that did it. Let's run this again.
Yep. Men's mean age should be 30.7. All right. Now we're going to fill it in.
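The mean-age calculation just described can be sketched roughly as follows. This is a minimal sketch, not the course's exact notebook: the small data frame is a made-up stand-in for the Titanic data, and the column names "Sex" and "Age" and the variable names are assumptions based on how they are spoken in the lesson.

```python
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, not the real dataset).
titanic_data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [20.0, 40.0, 30.0, None, 25.0, None],
})

# Filter to the women's rows, take the Age column, and round its mean to one decimal.
women_df = titanic_data[titanic_data["Sex"] == "female"]
ages_of_women = women_df["Age"]
mean_woman_age = round(ages_of_women.mean(), 1)  # mean() skips missing values

# Same steps for men.
men_df = titanic_data[titanic_data["Sex"] == "male"]
ages_of_men = men_df["Age"]
mean_man_age = round(ages_of_men.mean(), 1)

print(mean_woman_age, mean_man_age)
```

On the real data, the lesson reports 27.9 for women and 30.7 for men; the toy frame here will of course give different numbers.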
We're going to do an apply to apply this age to any empty spots, any NA values. So I'm going to define a function. It could be called, maybe, fill mean age by gender.
We'll take in a passenger. And this is a function that's just in charge of one row at a time. And then we'll say apply this to all the rows.
We'll have an if. We'll say if it's not true that there's no age. This is a little bit of a, you know, computer double-negative kind of thing.
But basically this is saying if there is already an age value. In that case, we want to return that passenger's age. Right? There already was an age value.
So that's what we want. Now, if we're here in this case, elif, then that means that they don't have an age already. We'll say, okay, if their sex is male, then return mean man age.
And else, they must be a woman, return mean woman age. So it's just a little function that's in charge of one passenger, returning the right value for them. If they already have an age, return their age.
If they're a man, return mean man age. Otherwise, return mean woman age. So what we need pandas to do for us is to apply this function to every row, so that every value in age gets filled.
We'll say titanic data at age is now equal to the version where we apply it to every single one. And the axis we want to apply it to is columns, which means the function runs once per row. And let's just take a look at titanic data.
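A sketch of the row-wise fill function described above might look like this. The function name follows the one suggested in the lesson; the toy data, the column names "Sex" and "Age", and the use of pd.isna for the "no age" check are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values, not the real dataset).
titanic_data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [20.0, 40.0, 30.0, None, 25.0, None],
})
mean_woman_age = round(titanic_data.loc[titanic_data["Sex"] == "female", "Age"].mean(), 1)
mean_man_age = round(titanic_data.loc[titanic_data["Sex"] == "male", "Age"].mean(), 1)

def fill_mean_age_by_gender(passenger):
    """Return one passenger's age, substituting the gender mean when it is missing."""
    if not pd.isna(passenger["Age"]):
        # There already was an age value, so keep it.
        return passenger["Age"]
    elif passenger["Sex"] == "male":
        return mean_man_age
    else:
        return mean_woman_age

# axis="columns" passes each row (one passenger) to the function.
titanic_data["Age"] = titanic_data.apply(fill_mean_age_by_gender, axis="columns")
print(titanic_data)
```

The function only ever sees one row, which keeps the conditional logic easy to read; pandas handles the looping.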
Okay. 27.9, 30.7. I don't see any of those values here. Did it work? Well, we could check a couple of things.
First, we could check, are there still any NA values for age? No. So that's good. And now, let's try to see if we could find any ones where they are a man and have that men's age.
We'll say men at mean age equals titanic data where titanic data at age is mean man age. And they're a man. And then let's just look at that.
All right. Looks like they got 30.7. And there are 124 that got fixed. All right.
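The two checks described here could be written along these lines. This is only a sketch on a made-up frame that has already been filled, with a hypothetical mean_man_age of 32.5; on the real data the lesson finds 124 men at the mean age.

```python
import pandas as pd

# Toy data after filling (hypothetical values, not the real dataset).
mean_man_age = 32.5
titanic_data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Age": [20.0, 40.0, 30.0, 32.5, 25.0, 25.0],
})

# Check 1: count any ages that are still missing.
remaining_missing = titanic_data["Age"].isna().sum()

# Check 2: rows where the passenger is a man and has exactly the men's mean age.
men_at_mean_age = titanic_data[
    (titanic_data["Age"] == mean_man_age) & (titanic_data["Sex"] == "male")
]

print(remaining_missing)
print(len(men_at_mean_age))
```

Each condition is wrapped in parentheses because & binds more tightly than == in Python.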
That looks pretty good. We can check the women too. But I'll let you know right now.
It worked. All right. Next, we're going to take care of the embarked values of which there are two NAs.
So the way we're going to do that is we're just going to fill in those two NA values with S. That's because that's the mode. It's the most common value. So let's say titanic data at embarked equals titanic data at embarked, where we fill NA with S. And let's just check the is NA sum.
All right. The only ones left are cabins. And again, we're not going to worry about those.
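The embarked fill can be sketched as below. The lesson fills with "S" directly; computing the mode first, as done here, is an optional extra step to confirm that "S" really is the most common value. The toy frame and the column names "Embarked" and "Cabin" are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in with two missing Embarked values (hypothetical, not the real dataset).
titanic_data = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q", None, "S"],
    "Cabin": [None, "C85", None, "C123", None, None, "E46"],
})

# mode() returns a Series of the most common value(s); take the first one.
most_common_port = titanic_data["Embarked"].mode()[0]

# Fill the missing entries and assign the result back to the column.
titanic_data["Embarked"] = titanic_data["Embarked"].fillna(most_common_port)

# Count remaining missing values per column; only Cabin should be nonzero.
print(titanic_data.isna().sum())
```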
So next, we are going to do some data analysis and try to figure out which of these features seems important enough to train our model on.