Filling Missing Age and Embarked Data for Titanic Analysis

Learn how to efficiently fill missing data values using mean and mode techniques to enhance your dataset. This guide demonstrates step-by-step methods to handle incomplete Titanic passenger data by applying practical data cleaning strategies.

Key Insights

Outlines a practical method for filling missing age values: calculate gender-specific mean ages (27.9 years for women and 30.7 years for men) and apply them to the rows that have no age.
Explains how to combine the pandas apply() method with a custom function to fill missing numerical data based on conditional logic.
Demonstrates filling missing categorical data with the mode (the most common value), assigning "S" to the two missing entries in the Titanic dataset's "Embarked" column.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

We're going to fill our age values now. We're going to fill them in with the mean age. This is maybe not the perfect way to do it, but it gives us real values that we can use, even if about 20% of these ages will be generic.

They'll at least let us work with the rest of the data. We're going to calculate the mean age for women. This is just some math, some leveraging of our DataFrame knowledge and so on.

First, let's make a women's DataFrame that will be the Titanic data where the sex value is female. So that means it'll be just a filtered version with only those rows. Now we'll say, okay, great.

I want to get the ages of those. We'll say ages of women is the women DataFrame's age column. And now that we've got just a series of numbers, we can say, okay, the mean women's age is the ages of women dot mean, rounded to one decimal place.

And we can check that out. Women's mean age is 27.9. And we'll do the same thing for men. And in fact, I'm going to do a little bit of copy and paste here and change women to men.
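
The steps so far might look something like the sketch below. It assumes the column names Sex and Age (the usual names in the Kaggle Titanic CSV, but not confirmed in this lesson), uses a tiny made-up DataFrame in place of the real data, and guesses at variable names such as women_df and mean_woman_age from how they are spoken aloud.

```python
import pandas as pd

# Toy stand-in for the Titanic data; these values are illustrative only,
# not rows from the real dataset.
titanic_data = pd.DataFrame({
    "Sex": ["female", "male", "female", "male", "male"],
    "Age": [29.0, 40.0, 26.0, None, 21.0],
})

# Filter to just the rows where the sex value is female.
women_df = titanic_data[titanic_data["Sex"] == "female"]
# Pull out the age column, then take its mean rounded to one decimal place.
ages_of_women = women_df["Age"]
mean_woman_age = round(ages_of_women.mean(), 1)

# Same thing for men. mean() skips missing (NaN) ages by default.
men_df = titanic_data[titanic_data["Sex"] == "male"]
ages_of_men = men_df["Age"]
mean_man_age = round(ages_of_men.mean(), 1)

print(mean_woman_age)  # 27.5 for this toy data
print(mean_man_age)    # 30.5 for this toy data
```

On the real data, the lesson reports 27.9 for women and 30.7 for men.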

Okay. I think that did it. Let's run this again.

Yep. Men's mean age should be 30.7. All right. Now we're going to fill it in.

We're going to use apply to fill in the age for any empty spots, any NA values. So I'm going to define a function. It should be called, maybe, fill_mean_age_by_gender.

We'll take in a passenger. This is a function that's just in charge of one row at a time, and then we'll apply this to all the rows.

We'll have an if. We'll say if it's not true that there's no age. This is a little bit of, you know, computer double-negative kind of thing.

But basically this is saying if there is already an age value. In that case, we want to return that passenger's age. Right? There already was an age value.

So that's what we want. Now, if we get to the elif, that means they don't have an age already. We'll say, okay, if their sex is male, then return mean man age.

Else, they must be a woman, so return mean woman age. So it's just a little function that's in charge of one passenger, returning the right value for them. If they already have an age, return their age.

If they're a man, return mean man age. Otherwise, return mean woman age. So what we need pandas to do for us is to apply this function to every row.

We'll say Titanic data at age is now equal to the version where we apply it to every single one. And the axis we want to apply it along is columns, which means the function gets called once per row. And let's just take a look at Titanic data.
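
Here is a minimal sketch of that function and the apply call. The column names Sex and Age and the toy rows are assumptions for illustration; the two mean values are the ones calculated earlier in the lesson.

```python
import pandas as pd

# Toy stand-in for the Titanic data; values are illustrative only.
titanic_data = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, None, None, 38.0],
})

# Mean ages reported in the lesson for the real dataset.
mean_man_age = 30.7
mean_woman_age = 27.9


def fill_mean_age_by_gender(passenger):
    """Return the right age for one passenger (one row)."""
    if not pd.isna(passenger["Age"]):
        # There already is an age value, so keep it.
        return passenger["Age"]
    elif passenger["Sex"] == "male":
        return mean_man_age
    else:
        return mean_woman_age


# axis="columns" hands the function one row at a time.
titanic_data["Age"] = titanic_data.apply(fill_mean_age_by_gender, axis="columns")
print(titanic_data)
```

Passing axis="columns" is the same as axis=1. Without it, apply would call the function once per column instead of once per passenger.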

Okay. 27.9, 30.7. I don't see any of those values here. Did it work? Well, we could check a couple of things.

First, we could check, are there still any NA values for age? No. So that's good. And now, let's try to see if we can find any rows where they are a man and have that men's age.

We'll say men_at_mean_age equals Titanic data where Titanic data at age is mean man age and they're a man. And then let's just look at that.

All right. Looks like they've got 30.7. And there are 124 that got fixed. All right.
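
Those two checks could be written roughly as follows. Again, the column names and the toy rows are assumptions; only the name men_at_mean_age comes from the lesson.

```python
import pandas as pd

mean_man_age = 30.7

# Toy stand-in for the already-filled Titanic data; values are illustrative only.
titanic_data = pd.DataFrame({
    "Sex": ["male", "female", "male", "male"],
    "Age": [22.0, 27.9, 30.7, 30.7],
})

# Check 1: how many ages are still missing? Zero means the fill worked.
missing_ages = titanic_data["Age"].isna().sum()
print(missing_ages)

# Check 2: rows where the passenger is a man and the age equals the men's mean.
# Each condition needs its own parentheses when combined with &.
men_at_mean_age = titanic_data[
    (titanic_data["Age"] == mean_man_age) & (titanic_data["Sex"] == "male")
]
print(men_at_mean_age)
print(len(men_at_mean_age))
```

Note that this count would also include any man whose recorded age happened to equal the mean exactly, so it is a sanity check rather than an exact tally of filled rows.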

That looks pretty good. We can check the women too. But I'll let you know right now: it worked.

All right. Next, we're going to take care of the embarked values, of which there are two NAs.

So the way we're going to do that is we're just going to fill in those two NA values with 'S'. That's because that's the mode; it's the most common value. So let's say Titanic data at embarked equals Titanic data at embarked, where we fill NA with 'S'. And let's just check the isna sum.
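
A small sketch of that step, assuming the columns are named Embarked and Cabin and using made-up rows. It also computes the mode first, which is one way to confirm that 'S' is the most common value before hardcoding it.

```python
import pandas as pd

# Toy stand-in for the Titanic data; values are illustrative only.
titanic_data = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q", None],
    "Cabin": ["C85", None, None, "E46", None, None],
})

# mode() ignores missing values and returns a Series; take the first entry.
most_common_port = titanic_data["Embarked"].mode()[0]
print(most_common_port)

# Fill the missing entries with 'S', as in the lesson.
titanic_data["Embarked"] = titanic_data["Embarked"].fillna("S")

# Count the remaining missing values per column.
print(titanic_data.isna().sum())
```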

All right. The only ones left are cabins. And again, we're not going to worry about those.

So next, we are going to do some data analysis and try to figure out which of these features seems important enough to train our model on.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
