Preparing Kaggle Titanic Model for Evaluation

Clean test data by filling missing age values with gender-specific means and missing fare with the overall mean.

Prepare and refine your Titanic dataset predictions, then submit your results to Kaggle for evaluation. This guide walks through crucial data cleanup and preprocessing techniques to improve your model's accuracy.

Key Insights

  • The Kaggle Titanic test dataset consists of 418 passenger entries, around 20% of the full dataset, with missing values including 86 ages, one fare, and 327 cabin values.
  • Use the previously created fill_mean_age_by_gender function to efficiently replace missing age values, calculating separate mean ages for men and women to enhance data accuracy.
  • Address the single missing fare value by computing and inserting the mean fare rounded to two decimal places, finalizing data cleanup before labeling, encoding, scaling, and submitting predictions to Kaggle.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Kaggle is a competition, although I don't think it's a very competitive one. It's just enough to make you want to keep working on this stuff without being something people are viciously competing over. But it is a competition, and therefore they keep the y test to themselves. What we're gonna do is load the X test, do the same work on it that we did above, run our model on it, and get a set of predictions.

And then we're gonna upload that set of predictions to Kaggle and see what they think of it. So the first thing we're gonna do is load the CSV with the X test in it. It's saved in our folder, in the Google Drive we gave you folks.

I'm gonna shorten the name to X_test. We're gonna read a CSV, and the path is our base URL plus CSV slash test Titanic.csv. Let's look at X_test. There it is.
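As a rough sketch of the loading step: the real file lives in the course's Google Drive folder, so the example below reads a few made-up rows from an in-memory string instead. The sample rows and the sample_csv name are assumptions for illustration; only the column layout follows the Kaggle Titanic test file.

```python
import io
import pandas as pd

# Made-up stand-in rows using the Kaggle Titanic test-file columns.
# Note there is no Survived column: Kaggle keeps those labels.
sample_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1001,3,"Doe, Mr. John",male,34.5,0,0,A1,7.83,,Q\n'
    '1002,1,"Doe, Mrs. Jane",female,,1,0,A2,82.27,B45,S\n'
    '1003,2,"Roe, Mr. Richard",male,62.0,0,0,A3,,,S\n'
)

# In the notebook you would pass your own base URL plus the path
# to the test CSV here instead of the in-memory sample.
X_test = pd.read_csv(sample_csv)

print(X_test.shape)  # (rows, columns)
print(X_test.head())
```

With the real file, the shape should report 418 rows.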

All right, we've got our passengers and there are 418 of these. Again, about 20% of the data. Okay, let's see what NA values we have.

It's gonna be quite a lot, and we're gonna have to do some work on that. You folks can see right here that we've got some NaNs for age and a bunch for cabins. Let's actually check it.

X_test.isna().sum(). Yep, we have 86 missing ages, one missing fare, and 327 missing cabins that we don't care about. All right, let's clean up the data. There's quite a lot of work to be done on that, but we already explained the steps last time, so we'll just do them again.
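Here is a minimal sketch of that missing-value check on a tiny made-up frame (the values are illustrative, not the real Titanic data). isna() marks each missing cell as True, and sum() counts those per column.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the same kinds of gaps as the test set.
X_test = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [34.0, np.nan, 62.0, 27.0],
    "Fare": [7.83, 7.0, np.nan, 8.66],
    "Cabin": [np.nan, np.nan, "B45", np.nan],
})

# Count missing values per column.
missing_counts = X_test.isna().sum()
print(missing_counts)
```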

Let's get the women's mean age. We'll say women is X_test at the rows where Sex is female.

Once we've got that, we can say ages_of_women is women at Age. And finally, mean_woman_age is ages_of_women.mean(). We'll also round it to one place past the decimal.

All right, that's our mean woman age for this sample. Let's do the same thing for men, and I'm gonna do some copy and paste here.

And there we go and change this. All right, that looks good. Yeah, looks good.
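The two mean-age calculations might look roughly like this. The data is made up, and the variable names are a guess at what was typed in the video, based on how they were spoken aloud.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing age per gender.
X_test = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 25.0, np.nan, 40.0, 21.0, np.nan],
})

# Women: filter the rows, pick the Age column, take the rounded mean.
women = X_test[X_test["Sex"] == "female"]
ages_of_women = women["Age"]
mean_woman_age = round(ages_of_women.mean(), 1)  # mean() skips NaN by default

# Men: the same steps with the filter changed.
men = X_test[X_test["Sex"] == "male"]
ages_of_men = men["Age"]
mean_man_age = round(ages_of_men.mean(), 1)

print(mean_woman_age, mean_man_age)
```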

All right, now we can reuse the function we wrote before, up above, and just apply it. See the earlier video.

We'll say X_test at Age should now equal X_test.apply. What did I call that function? I'm gonna have to go look and see what it is. Oh, here it is.

fill_mean_age_by_gender. That's what I named it. There we go.

And apply that to the columns. And then we will fill in the missing fare with the mean fare. And we'll do that in one line this time.
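Before moving on to the fare, here is a sketch of the age fill just described. The body of fill_mean_age_by_gender comes from an earlier video and is not shown here, so the version below is an assumption about what such a function could look like, run on made-up data.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing age per gender.
X_test = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 25.0, np.nan, 40.0, 21.0, np.nan],
})

mean_woman_age = round(X_test[X_test["Sex"] == "female"]["Age"].mean(), 1)
mean_man_age = round(X_test[X_test["Sex"] == "male"]["Age"].mean(), 1)

# Hypothetical reconstruction: the real function was written in an earlier lesson.
def fill_mean_age_by_gender(row):
    # Keep an existing age; otherwise use the mean for that passenger's gender.
    if pd.isna(row["Age"]):
        return mean_woman_age if row["Sex"] == "female" else mean_man_age
    return row["Age"]

# axis="columns" hands the function one row at a time.
X_test["Age"] = X_test.apply(fill_mean_age_by_gender, axis="columns")
print(X_test)
```

A row-wise apply is easy to read but slow on large frames. On 418 rows that does not matter.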

mean_fare is the round of our X_test Fare mean to four places. Actually, I'm gonna do two places. Two places past the decimal would be like cents.

And then we'll do a fillna for this one. We'll say X_test Fare equals X_test at Fare, fillna, using the value mean_fare. There should only be one value that's missing.
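A minimal sketch of the fare step, again on made-up values: compute the rounded mean, then let fillna replace the single gap.

```python
import numpy as np
import pandas as pd

# Made-up fares with one missing value, as in the test set.
X_test = pd.DataFrame({"Fare": [7.25, 8.05, np.nan, 26.55]})

# Round to two places so the fill value looks like dollars and cents.
mean_fare = round(X_test["Fare"].mean(), 2)

# fillna only touches the missing entries and leaves the rest alone.
X_test["Fare"] = X_test["Fare"].fillna(mean_fare)
print(X_test)
```

Assigning the result back to the column, rather than filling in place, keeps the step explicit and easy to rerun.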

And then, just to double-check that we did that right, let's check for missing values at the end here. X_test.isna().sum(). All right, looks like we did it.

The only missing values are cabins, which we don't care about. All right, now that we've cleaned up our data, we're gonna label, encode, and scale it in our next step and then submit it.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
