Scaling input data helps machine learning models focus on meaningful patterns rather than differences in numeric scale. Learn how standardizing values around a mean of zero can improve predictive accuracy.
Key Insights
- Scaling input values prevents machine learning models from identifying misleading numerical patterns, such as falsely interpreting horsepower as disproportionately more significant than fuel efficiency or engine size.
- The standardization process involves adjusting values to a mean of zero and measuring all data points in terms of standard deviations from that mean, converting original measurements (e.g., horsepower) into standardized values.
- Using the Standard Scaler from scikit-learn, the X training and testing datasets are transformed, making data devoid of original scale context and ready for the machine learning model training phase.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
We've split the data into X train, Y train, X test, and Y test. Our next step is to take the X values and scale them. Let's take a look at the X values up here, rather than the Y values.
The model doesn't really understand what these values mean or what they signify; it isn't trying to. When it's looking for patterns, it could easily notice something like: in this row, the horsepower is about seven times the fuel efficiency, so maybe that means something. In that row, the horsepower is 15 times the fuel efficiency.
Or here, the horsepower is 50 times the engine size, and in this one the ratio is even greater. It might see patterns that don't really mean anything, patterns that we know don't mean anything.
Those misleading patterns all involve the ratios of numbers: how much bigger one number is than another. To make sure the model doesn't see them, we're going to scale the values so that they're all on the same scale, so that when we look at X, all the columns will numerically look about the same.
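To make that concrete, here's a small sketch with made-up numbers; the horsepower, fuel-efficiency, and engine-size values below are purely hypothetical, not taken from the course dataset:

```python
# Hypothetical unscaled rows: [horsepower, fuel efficiency (mpg), engine size (liters)]
rows = [
    [210, 30, 4.2],   # horsepower is ~7x the mpg and 50x the engine size
    [450, 30, 4.5],   # horsepower is 15x the mpg and 100x the engine size
]

# The raw ratios between columns swing around from row to row,
# but they're only a side effect of the units each column is measured in.
for hp, mpg, engine in rows:
    print(f"hp/mpg = {hp / mpg:.1f}, hp/engine = {hp / engine:.1f}")
```

None of those ratios says anything real about the cars; they're an artifact of the units, which is exactly the kind of pattern we don't want the model chasing.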
Here's what they'll become. Every column will have a mean of zero: if the mean was 50 before, now it's zero, and everything else shifts along with it.
It also rescales each value into standard deviations from that mean. If the mean was 50 and the standard deviation was 15, then 65, which is one standard deviation above the mean, becomes 1.
Why? Because the mean is now set to zero, the standard deviation is 15, and measuring 65 from the mean puts it exactly one standard deviation away.
So something right at the mean becomes zero, something one standard deviation above becomes 1, and something one standard deviation in the other direction becomes negative 1.
Everything ends up centered around zero and measured in standard deviations, rather than in raw units like horsepower. Now the model can look at how values differ from each other and from the average. Knowing that the horsepower is well above average tells it a lot more than knowing the horsepower is 50, which would actually be very low.
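As a quick check on that arithmetic, here's a tiny sketch; the mean of 50 and standard deviation of 15 are just the illustrative numbers from the example above:

```python
# Standardization (z-score): z = (x - mean) / std
mean = 50.0   # illustrative column mean
std = 15.0    # illustrative column standard deviation

for x in (50, 65, 35):
    z = (x - mean) / std
    print(x, "->", z)   # 50 -> 0.0, 65 -> 1.0, 35 -> -1.0
```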
So that's what we're doing with the standard scaler. Let's make it happen. We're going to create a StandardScaler, using the class from scikit-learn that we already imported, and we'll call it sc; that's the conventional name for it.
Now that we've got this sc, let me run that. Don't forget to run your block. What we're going to do is call its fit_transform method. That method takes the data and scales it: if we call sc.fit_transform on the X train data, it returns a scaled version, and we reassign X train to that scaled, transformed version.
We'll do the same thing for X test. We don't need to do it to Y, because Y is the answer the model is trying to predict; it isn't one of the inputs. So let's take a look at these values.
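Here's a minimal sketch of what that looks like in code. The variable names X_train and X_test and the example numbers are assumptions standing in for the transcript's "X train" and "X test"; in this sketch the test set is scaled with transform, which reuses the mean and standard deviation learned from the training set, the usual scikit-learn pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled feature rows: [horsepower, fuel efficiency, engine size]
X_train = np.array([[210.0, 30.0, 4.2],
                    [450.0, 30.0, 4.5],
                    [150.0, 22.0, 1.5]])
X_test = np.array([[300.0, 25.0, 3.0]])

sc = StandardScaler()

# fit_transform learns each column's mean and standard deviation from the
# training data and returns the standardized (z-scored) version.
X_train = sc.fit_transform(X_train)

# transform reuses those training-set statistics to scale the test data.
X_test = sc.transform(X_test)

print(X_train)  # each column of the training data now has mean 0
print(X_test)
```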
X train is now just a list of little lists, each one holding three values, and each value tells you how far that entry was from its column's mean, measured in standard deviations. The same goes for X test. We've turned everything into proper computer numbers, devoid of any context and of their original scale.
Now it's ready for the computer. It doesn't care what a fuel efficiency is; it doesn't know what that is, and it can't calculate it. I don't think I could calculate it either, but the model knows that, given these three numbers, it should produce this Y value, and it should look for patterns. That's what we're going to tell it to do.
Okay. Next, we're actually going to train our model.