Prepare and refine your Titanic dataset predictions, then submit your results to Kaggle for evaluation. This guide walks through crucial data cleanup and preprocessing techniques to improve your model's accuracy.
Key Insights
- The Kaggle Titanic test dataset consists of 418 passenger entries, around 20% of the full dataset, with missing values including 86 ages, one fare, and 327 cabin values.
- Use the previously created fill_mean_age_by_gender function to efficiently replace missing age values, calculating separate mean ages for men and women to enhance data accuracy.
- Address the single missing fare value by computing and inserting the mean fare rounded to two decimal places, finalizing data cleanup before labeling, encoding, scaling, and submitting predictions to Kaggle.
Kaggle is a competition, although I don't think it's a very competitive one: it's enough to make you want to keep working on this stuff without being something people viciously compete over. But it is a competition, and therefore they keep the y test to themselves. What we're gonna do is load the X test, do the same work we did above, run our model on it, and get a set of predictions.
And then we're gonna upload that set of predictions to Kaggle and see what they think of it. So the first thing we're gonna do is load that CSV with the X test in it. It's saved in our folder, in the Google Drive we gave you folks.
I'm gonna shorten the name to X_test. We're gonna read a CSV, and it's our base URL plus CSV slash test Titanic.csv. Let's see, X_test. There it is.
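As a rough sketch of the loading step, the snippet below reads a CSV into a DataFrame named X_test with pandas. The real base URL and exact file name are not shown in this transcript, so the example reads a tiny made-up stand-in from memory instead; the column names are assumed to follow the standard Kaggle Titanic layout, and the rows are invented for illustration.

```python
import io

import pandas as pd

# Made-up stand-in for the real test file (three invented rows).
# In the course notebook the call would instead be something like
# pd.read_csv(base_url + <path to the test CSV>), where base_url is
# defined earlier in the notebook and is not shown here.
sample_csv = io.StringIO(
    "PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    "1,3,Passenger A,male,34.5,0,0,T1,7.83,,Q\n"
    "2,3,Passenger B,female,,1,0,T2,7.0,,S\n"
    "3,2,Passenger C,male,62.0,0,0,T3,,B45,Q\n"
)

# Empty fields in the CSV are read as NaN by default.
X_test = pd.read_csv(sample_csv)

print(X_test.shape)
print(X_test.head())
```

With the real file, the shape would show the 418 rows mentioned in the lecture; this stand-in only has three.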
All right, we've got our passengers, and there are 418 of them. Again, about 20% of the data. Okay, let's see what NA values we have.
It's gonna be quite a lot, and we're gonna have to do some work on that. You folks can see right here that we've got some NaNs for age and a bunch for cabins. Let's actually check it.
X_test.isna().sum(). Yep, we have 86 missing ages, one missing fare, and 327 missing cabins that we don't care about. All right, let's clean up the data. There's quite a lot of work to be done on that, but we already explained the steps last time, so we'll just do them again.
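The missing-value check above can be sketched on a small example. The DataFrame here is invented for illustration (it is not the Kaggle data), and the column names are assumed to match the standard Titanic layout.

```python
import numpy as np
import pandas as pd

# Invented four-row frame with a few gaps in it.
X_test = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [34.0, np.nan, 22.0, np.nan],
    "Fare": [7.25, 8.05, np.nan, 13.0],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
})

# isna() marks each missing cell as True; sum() counts the Trues per column.
missing_counts = X_test.isna().sum()
print(missing_counts)
```

On the real test set, the same call is what reports the 86 ages, one fare, and 327 cabins.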
Let's get the women's mean age. We'll say women is X_test[X_test['Sex'] == 'female'].
Once we've got that, we can say ages_of_women is women['Age']. And finally, mean_woman_age is ages_of_women.mean(). And we'll also round this to one place.
One place past the decimal. All right, that's our mean woman age for this sample. Let's do the same thing for men, and I'm gonna do some copy and paste here.
And there we go, and change this over to men. All right, that looks good.
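Here is a minimal sketch of the per-gender mean age calculation just described, run on invented data. The variable names mirror the ones spoken in the lecture, but their exact spelling (and the men's variable names, which are not spoken aloud) is an assumption.

```python
import numpy as np
import pandas as pd

# Invented sample: two known ages and one missing age per gender.
X_test = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [20.0, 31.0, np.nan, 40.0, 25.0, np.nan],
})

# Women: filter the rows, take the Age column, average, round to one place.
women = X_test[X_test["Sex"] == "female"]
ages_of_women = women["Age"]
mean_woman_age = round(ages_of_women.mean(), 1)

# Men: the same three steps with the filter changed (names assumed).
men = X_test[X_test["Sex"] == "male"]
ages_of_men = men["Age"]
mean_man_age = round(ages_of_men.mean(), 1)

print(mean_woman_age, mean_man_age)
```

Series.mean() skips NaN values by default, so the missing ages do not affect the averages. A more compact alternative is X_test.groupby("Sex")["Age"].mean().round(1), which returns both means in one call.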
All right, now we can reuse the function we wrote before, up above, and just apply it. See the earlier video.
We'll say X_test['Age'] should now equal X_test.apply. What did I call that function? I'm gonna have to go look and see what it is. Oh, here it is.
fill_mean_age_by_gender. That's what I named it. There we go.
And apply that along the columns, so the function gets one row at a time. Then we'll fill in the missing fare with the mean fare, and we'll do that in one line this time.
mean_fare is round of X_test['Fare'].mean() to four places. Actually, I'm gonna do two places. Two places past the decimal would be like cents.
And then we'll do a fillna for this one. We'll say X_test['Fare'] equals X_test['Fare'].fillna using the value mean_fare. There should only be one value that's missing.
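The two fill steps above might look roughly like the sketch below. The body of fill_mean_age_by_gender was written in an earlier video and is not shown here, so this version is a hypothetical reconstruction that only assumes the function takes a row and returns an age; the data is invented.

```python
import numpy as np
import pandas as pd

# Invented sample with one missing age per gender and one missing fare.
X_test = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Age": [20.0, np.nan, 40.0, np.nan],
    "Fare": [10.0, 20.0, np.nan, 30.5],
})

mean_woman_age = round(X_test.loc[X_test["Sex"] == "female", "Age"].mean(), 1)
mean_man_age = round(X_test.loc[X_test["Sex"] == "male", "Age"].mean(), 1)


def fill_mean_age_by_gender(row):
    # Hypothetical reconstruction: the course's real function may differ.
    # Keep a known age; otherwise use the mean for the passenger's gender.
    if pd.isna(row["Age"]):
        return mean_woman_age if row["Sex"] == "female" else mean_man_age
    return row["Age"]


# axis=1 (same as axis="columns") passes one row at a time to the function.
X_test["Age"] = X_test.apply(fill_mean_age_by_gender, axis=1)

# Fare: mean rounded to two places, then fillna replaces only the NaN cells.
mean_fare = round(X_test["Fare"].mean(), 2)
X_test["Fare"] = X_test["Fare"].fillna(mean_fare)

print(X_test)
```

Assigning the result of fillna back to the column, rather than filling in place, keeps the step explicit and matches the assignment pattern described in the lecture.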
And then, just to double-check that we did that right, let's check for missing values at the end here. X_test.isna().sum(). All right, looks like we did it.
The only missing values are cabins, which we don't care about. All right, now that we've cleaned up our data, we're gonna label, encode, and scale it in our next step and then submit it.