Preparing the Titanic Dataset: Splitting and Scaling Techniques

Prepare Titanic data by selecting features, separating labels, and scaling numerical columns.

Prepare the Titanic dataset by separating features from labels in the training data and standardizing the age and fare columns with StandardScaler. Understand how this preprocessing sets up the data for a Random Forest classifier.

Key Insights

  • Use separate training and testing datasets provided by Kaggle to streamline Titanic data analysis without additional splitting procedures.
  • Apply StandardScaler to standardize the "age" and "fare" columns, centering both around a mean of zero and scaling by standard deviation, thus ensuring model neutrality towards feature magnitude differences.
  • Define features ("Pclass," "embarked," "sex," "age," "fare," "siblings/spouses," and "parents/children") and target labels ("survived") clearly to facilitate accurate predictions using the Random Forest classifier.

Next, we're going to split our data. We don't need to split it as much as we normally do because, in this case, the test data is in another file.

Kaggle has already split the data into training and test sets for us, so everything we've got here is training data. All that's left is to separate the features from the labels.

Based on our domain knowledge and the data analysis we've done, we're going to say X_train is our Titanic data with the following columns: Pclass, embarked, sex, age, fare, siblings and spouses, and parents and children. Let's take a look at that X_train and see if it's what we think it is. It's good.

And we're going to split off Y as well; Y is just the Survived column from the original. And there we go.

Survived or perished: those are our answers, our labels. Okay.
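The feature and label selection described above can be sketched roughly as follows. This is a minimal illustration, not the lesson's actual notebook: the rows are made up, and the column names (Pclass, Embarked, Sex, Age, Fare, SibSp, Parch, Survived) and the variable names titanic, X_train, and y_train are assumptions based on the usual Kaggle Titanic conventions.

```python
import pandas as pd

# Made-up rows for illustration only; the real data comes from Kaggle's training file.
# Column names follow the common Kaggle Titanic convention and are an assumption here.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.25, 8.0, 13.5],
    "Embarked": ["S", "C", "S", "S"],
})

# The features chosen in the lesson: class, port, sex, age, fare,
# siblings/spouses, and parents/children.
feature_columns = ["Pclass", "Embarked", "Sex", "Age", "Fare", "SibSp", "Parch"]

# .copy() gives X_train its own data, so later in-place edits leave `titanic` untouched.
X_train = titanic[feature_columns].copy()

# The labels are just the Survived column (0 = perished, 1 = survived).
y_train = titanic["Survived"]

print(X_train.head())
print(y_train.head())
```

Passing a list of column names to the DataFrame returns a new DataFrame with only those columns, while selecting a single name returns a Series, which is the usual shape for a label vector.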

Now, we're going to want to scale age and fare, because age varies quite a lot, and fare varies quite a lot as well.

They're on different scales. We don't want the model to think that a fare being twice as large as an age has any meaning. To help with that, we're going to center each column around a mean of zero and divide it by its standard deviation.
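To make "a mean of zero, scaled by the standard deviation" concrete, here is a small hand-rolled sketch of that calculation. The numbers are made up, and the standardize helper is a hypothetical name used only for illustration.

```python
import numpy as np

# Made-up values, for illustration only.
age = np.array([22.0, 38.0, 26.0, 35.0])
fare = np.array([7.25, 71.25, 8.0, 13.5])


def standardize(values):
    # Subtract the mean, then divide by the population standard deviation
    # (NumPy's default, ddof=0), which is also what StandardScaler uses.
    return (values - values.mean()) / values.std()


age_scaled = standardize(age)
fare_scaled = standardize(fare)

# After scaling, each column has a mean of about 0 and a standard deviation of about 1.
print(age_scaled.mean(), age_scaled.std())
print(fare_scaled.mean(), fare_scaled.std())
```

Each value ends up expressed as "how many standard deviations away from the column's mean," so age and fare become directly comparable.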

We're going to use our typical tool for that, which is the StandardScaler. We're going to say sc = StandardScaler(). And we don't really need to, as it says here, deal with the Y data.

The Y data is already zero and one. We're going to use a somewhat fancy pandas trick called fancy indexing.

That's actually the community's term for it. We're going to do some fancy indexing to scale age and fare at the same time. We're going to say X_train.loc, all rows, columns age and fare, equals the scaler's fit_transform of that same selection.

That is, of X_train.loc, all rows, columns age and fare. And then we'll take a look at X_train. Did I forget to run this cell? That's exactly what I did.

That's a very common mistake I make, and one other people make as well. But I definitely make it a lot.

All right. Great. So, we can see that age and fare are now on the same scale.

And they are now both centered around zero as a mean and scaled by standard deviation. Yeah. All right.
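The scaling step described above might look something like this in code. It is a sketch under assumptions rather than the lesson's exact cell: the rows are made up, and the column names Age and Fare and the variable names X_train and sc are assumed.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows for illustration only.
X_train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.25, 8.0, 13.5],
})

sc = StandardScaler()
scaled_columns = ["Age", "Fare"]

# .loc[:, [...]] selects all rows and just the listed columns.
# fit_transform learns each column's mean and standard deviation, then
# returns the standardized values, which are written back in place.
X_train.loc[:, scaled_columns] = sc.fit_transform(X_train.loc[:, scaled_columns])

print(X_train)
print(X_train[scaled_columns].mean())       # both close to 0
print(X_train[scaled_columns].std(ddof=0))  # both close to 1
```

Only the two listed columns change; Pclass is left as it was. When the separate test file is prepared later, the usual practice is to call sc.transform on it rather than fit_transform, so the test data is scaled with the training set's mean and standard deviation.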

Now that we've scaled everything, it's time to start talking about the model we're going to use, random forest classifiers.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
