Combine passenger class and gender data using pandas to reveal compelling patterns in Titanic survival rates. Utilize clear visualizations to better understand how class and gender intersected to affect passengers' chances of survival.
Key Insights
- Created a new categorical column (p-class_sex) by combining passenger class (p-class) and gender, so each class-and-gender group can be compared and plotted directly.
- Identified significant survival rate disparities: first-class females had high survival rates (91 survivors to three fatalities), while third-class males fared exceptionally poorly.
- Observed that the gender advantage diminished sharply in third class: third-class women were split evenly (72 surviving and 72 perishing), highlighting the critical impact of socioeconomic status on survival.
We're going to do a little bit of fancy pandas DataFrame work to create a new column, p-class sex. It will be a combination column that joins each passenger's class (first class, second class, or third class) with their gender. So first we'll define a list of possible values:
first class female, first class male, second class female, second class male, third class female, and third class male. Then we're going to build each value by combining the p-class value and the sex value. The way we do that is to assign to titanic_data at p_class_sex.
It's a new column, and it will be p-class plus an underscore plus sex. There's only one more thing we need to do: p-class is a number (1, 2, or 3), while titanic_data['sex'] is a string. To convert p-class to a string so it can be concatenated with the underscore and with the value of titanic_data['sex'], we're going to use astype(str).
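The concatenation step described above might look roughly like the sketch below. The four-row DataFrame is a made-up stand-in for the real data, and the column names `pclass` and `pclass_sex` are assumptions about how the spoken names are spelled in the notebook.

```python
import pandas as pd

# Toy stand-in for the Titanic data (made-up rows; column names are assumed).
titanic_data = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
})

# pclass is numeric, so cast it to str before joining it to the
# underscore and the sex string.
titanic_data["pclass_sex"] = (
    titanic_data["pclass"].astype(str) + "_" + titanic_data["sex"]
)

print(titanic_data["pclass_sex"].tolist())
# ['3_male', '1_female', '3_female', '2_male']
```

Without `astype(str)`, adding an integer column to a string raises a TypeError, which is why the cast comes first.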
And then our last step to make this work is to make it a categorical value. That means it has only specific possible values. We're going to say titanic_data p-class sex is now a pandas Categorical built from titanic_data p-class sex.
And the categories, in order, are the list we defined up above. Then we can take a look at the series titanic_data p-class sex. There we go.
We've got the head and the tail of the series (third class male, first class female, third class female, etc.), all within 891 rows. Great.
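The categorical conversion can be sketched as follows. The labels in `categories` are an assumption that follows the number, underscore, sex pattern described above, and the short series stands in for the real column.

```python
import pandas as pd

# Assumed labels, in the order the list is read out in the lesson.
categories = [
    "1_female", "1_male",
    "2_female", "2_male",
    "3_female", "3_male",
]

# Stand-in for the combined string column (made-up values).
combined = pd.Series(["3_male", "1_female", "3_female", "2_male"])

# Restrict the column to the listed values, keeping the list order.
pclass_sex = pd.Series(pd.Categorical(combined, categories=categories))

print(pclass_sex.dtype)                      # category
print(pclass_sex.cat.categories.tolist())    # same order as the list above
```

One reason to fix the category order is that plots and counts then list the groups in a predictable sequence, and categories with no rows are still reported with a count of zero.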
That's going to be really helpful, because now we can look at it as a graph and see how these three columns (survived, passenger class, and sex) interact. So here our axis is a Seaborn count plot where x is survived and the hue is p-class sex, our new column.
And the data is titanic_data. Now we can see how each group did. Third class males did very poorly.
Barely any of them survived. Second class males also did very poorly. And if you look at the females: among first class females, only three perished.
Ninety-one survived. Among second class females, six perished and 70 survived. It's only when you get to third class that the gender advantage evens out.
Seventy-two survived and seventy-two perished. So gender was maybe not so decisive by the time you get down to third-class passengers; the advantage of being a woman didn't fully counteract the disadvantage of third class. So yeah, we're getting quite a lot of good analysis out of this data.
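The count plot and the numbers read off it can be sketched like this. The data here is rebuilt only from the female counts quoted above (the lesson gives no exact figures for the male groups, so they are left out), and the names `titanic_subset`, `pclass_sex`, and `survived` are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import pandas as pd
import seaborn as sns

# (survived, perished) counts for the female groups quoted in the lesson.
quoted_counts = {
    "1_female": (91, 3),
    "2_female": (70, 6),
    "3_female": (72, 72),
}

# Expand the counts into one row per passenger.
rows = []
for group, (n_survived, n_perished) in quoted_counts.items():
    rows += [{"pclass_sex": group, "survived": 1}] * n_survived
    rows += [{"pclass_sex": group, "survived": 0}] * n_perished
titanic_subset = pd.DataFrame(rows)

# One bar per class-and-gender group, split by survival outcome.
ax = sns.countplot(x="survived", hue="pclass_sex", data=titanic_subset)

# The same information as a table, plus the survival rate per group.
counts = pd.crosstab(titanic_subset["pclass_sex"], titanic_subset["survived"])
rates = titanic_subset.groupby("pclass_sex")["survived"].mean()
print(counts)
print(rates.round(3))
```

On these counts the survival rate is about 0.97 for first class females, about 0.92 for second class females, and exactly 0.5 for third class females, which is the evening-out described above.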
Our next step is to start putting this into data that the computer can read for modeling. Then we'll dive into a random forest classifier and see how it can help us analyze all this data.
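The lesson does not say here which encoding it uses next. As one common possibility, a combined column like this can be turned into numeric indicator columns with `pandas.get_dummies`; the sketch below assumes the column names `pclass_sex` and `survived` and uses made-up rows.

```python
import pandas as pd

# Made-up rows; column names are assumed.
titanic_data = pd.DataFrame({
    "survived": [0, 1, 1],
    "pclass_sex": ["3_male", "1_female", "3_female"],
})

# One 0/1 indicator column per class-and-gender group
# (one-hot encoding, a swapped-in technique, not confirmed by the lesson).
encoded = pd.get_dummies(titanic_data, columns=["pclass_sex"], dtype=int)

print(encoded)
```

Each original label becomes its own column, such as `pclass_sex_3_male`, holding 1 for passengers in that group and 0 otherwise.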