Scaling input data helps machine learning models focus on meaningful patterns rather than differences in numeric scale. Learn how standardizing values around a mean of zero can improve predictive accuracy.
Key Insights
- Scaling input values prevents machine learning models from identifying misleading numerical patterns, such as falsely interpreting horsepower as disproportionately more significant than fuel efficiency or engine size.
- The standardization process involves adjusting values to a mean of zero and measuring all data points in terms of standard deviations from that mean, converting original measurements (e.g., horsepower) into standardized values.
- Using the Standard Scaler from scikit-learn, the X training and testing datasets are transformed, making data devoid of original scale context and ready for the machine learning model training phase.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
We've split the data into X train, Y train, X test, and Y test. Our next step is to take the X values and scale them. Let's take a look at the X values up here, rather than the Y values.
The model doesn't really understand what these values mean or what they signify; it isn't trying to. When it's looking for patterns, it could easily notice something like: in this row, the horsepower is about seven times the fuel efficiency, so maybe that means something. In that row, the horsepower is 15 times the fuel efficiency.
Or here, the horsepower is 50 times the engine size, and in this one the ratio is even greater. It might see patterns that don't really mean anything, patterns that we know don't mean anything.
Those misleading patterns all involve the ratios of numbers: how much bigger one number is than another. To make sure the model doesn't see them, we're going to scale the values so that they're all on the same scale, so that when we look at X, all the columns will numerically look about the same.
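To make that concrete, here's a small sketch with made-up numbers; the horsepower, fuel-efficiency, and engine-size values below are purely hypothetical, not taken from the course dataset:

```python
# Hypothetical unscaled rows: [horsepower, fuel efficiency (mpg), engine size (liters)]
rows = [
    [210, 30, 4.2],   # horsepower is ~7x the mpg and 50x the engine size
    [450, 30, 4.5],   # horsepower is 15x the mpg and 100x the engine size
]

# The raw ratios between columns swing around from row to row,
# but they're only a side effect of the units each column is measured in.
for hp, mpg, engine in rows:
    print(f"hp/mpg = {hp / mpg:.1f}, hp/engine = {hp / engine:.1f}")
```

None of those ratios says anything real about the cars; they're an artifact of the units, which is exactly the kind of pattern we don't want the model chasing.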
Here's what they'll become. Every column will have a mean of zero: if the mean was 50 before, now it's zero, and everything else shifts along with it.
It also rescales each value into standard deviations from that mean. If the mean was 50 and the standard deviation was 15, then 65, which is one standard deviation above the mean, becomes 1.
Why? Because the mean is now set to zero, the standard deviation is 15, and measuring 65 from the mean puts it exactly one standard deviation away.
So something right at the mean becomes zero, something one standard deviation above becomes 1, and something one standard deviation in the other direction becomes negative 1.
Everything ends up centered around zero and measured in standard deviations, rather than in raw units like horsepower. Now the model can look at how values differ from each other and from the average. Knowing that the horsepower is well above average tells it a lot more than knowing the horsepower is 50, which would actually be very low.
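As a quick check on that arithmetic, here's a tiny sketch; the mean of 50 and standard deviation of 15 are just the illustrative numbers from the example above:

```python
# Standardization (z-score): z = (x - mean) / std
mean = 50.0   # illustrative column mean
std = 15.0    # illustrative column standard deviation

for x in (50, 65, 35):
    z = (x - mean) / std
    print(x, "->", z)   # 50 -> 0.0, 65 -> 1.0, 35 -> -1.0
```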
So that's what we're doing with the standard scaler. Let's make it happen. We're going to create a StandardScaler, using the class from scikit-learn that we already imported, and we'll call it sc; that's the conventional name for it.
Now that we've got this sc, let me run that. Don't forget to run your block. What we're going to do is call its fit_transform method. That method takes the data and scales it: if we call sc.fit_transform on the X train data, it returns a scaled version, and we reassign X train to that scaled, transformed version.
We'll do the same thing for X test. We don't need to do it to Y, because Y is the answer the model is trying to predict; it isn't one of the inputs. So let's take a look at these values.
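Here's a minimal sketch of what that looks like in code. The variable names X_train and X_test and the example numbers are assumptions standing in for the transcript's "X train" and "X test"; in this sketch the test set is scaled with transform, which reuses the mean and standard deviation learned from the training set, the usual scikit-learn pattern:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical unscaled feature rows: [horsepower, fuel efficiency, engine size]
X_train = np.array([[210.0, 30.0, 4.2],
                    [450.0, 30.0, 4.5],
                    [150.0, 22.0, 1.5]])
X_test = np.array([[300.0, 25.0, 3.0]])

sc = StandardScaler()

# fit_transform learns each column's mean and standard deviation from the
# training data and returns the standardized (z-scored) version.
X_train = sc.fit_transform(X_train)

# transform reuses those training-set statistics to scale the test data.
X_test = sc.transform(X_test)

print(X_train)  # each column of the training data now has mean 0
print(X_test)
```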
X train is now just a list of little lists, each one holding three values, and each value tells you how far that entry was from its column's mean, measured in standard deviations. The same goes for X test. We've turned everything into proper computer numbers, devoid of any context and of their original scale.
Now it's ready for the computer. It doesn't care what a fuel efficiency is; it doesn't know what that is, and it can't calculate it. I don't think I could calculate it either, but the model knows that, given these three numbers, it should produce this Y value, and it should look for patterns. That's what we're going to tell it to do.
Okay. Next, we're actually going to train our model.