Enhance your model's accuracy by effectively identifying and managing outliers in your dataset. Gain insights into how dataset size and variability significantly influence the reliability of model predictions.
Key Insights
- Removing outliers from a dataset slightly improved the linear regression model's accuracy, but the improvement was marginal and inconsistent across multiple test runs.
- Small datasets of around 150 rows, when split into training and test sets, exhibited considerable variability, leading to inconsistent prediction accuracy ranging from as low as 44% to as high as 79%.
- Future analysis should focus on larger datasets, as limited data contributes significantly to prediction variability and impacts model reliability.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's try this model again and see if we get a better answer without the outlier. We may not. Let's give it a shot as part of exploring our data.
Let's create an X2 and a Y2. First, X2 is our car sales data with the columns fuel efficiency, horsepower, and engine size, but not price in thousands, because that's the value we're trying to predict.
All right, and our Y2 is our car sales price-in-thousands column. All right, now let's split that data. We'll say X train.
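Before we split, here's a minimal sketch of what that setup might look like in code. The DataFrame name (car_sales), the file name, and the exact column names are assumptions here and may not match the notebook:

```python
import pandas as pd

# Hypothetical file and column names; adjust to match the actual car sales dataset.
car_sales = pd.read_csv("car_sales.csv")

# X2: the feature columns we'll predict from. Y2: the target we want to predict.
X2 = car_sales[["fuel_efficiency", "horsepower", "engine_size"]]
Y2 = car_sales["price_in_thousands"]
```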
You know, this is a great time, a great opportunity to show that we don't always have this memorized. The order I said was really important. What we're going to do here is call train_test_split and pass in our X2, our Y2, and a test size of 20%.
This returns a tuple, but I literally forget every time what the order of train versus test is, so it's really worth always making sure you get this right; you're not expected to memorize the structure of the return object. So let's go back to where we're doing it. It's X train, X test, Y train, Y test.
So X train, X test, Y train, Y test. Unpack those from the return value, and we should name them, there we go. It was mad at me because I had a slight indentation there, and that's a problem because Python uses indentation to define code blocks.
That's Python. Yeah, Python indentation, something you need to know. Okay, let's run these.
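Here's a minimal sketch of that split, assuming the X2 and Y2 from above; the unpacked variable names are one reasonable choice:

```python
from sklearn.model_selection import train_test_split

# train_test_split returns a tuple in this order:
# (X_train, X_test, y_train, y_test)
X_train, X_test, y_train, y_test = train_test_split(X2, Y2, test_size=0.2)
```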
You can see I failed to actually run them. It's very important that you do that. Let's scale our X values now.
X train equals StandardScaler().fit_transform(X train), and the same thing for X test. Again, here we are scaling things so that they're all centered around the mean of zero and are measured in standard deviation. Okay, run that, and now what we're going to do is make the model.
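Before we do, here's a sketch of that scaling step. It mirrors the lesson's description of calling fit_transform on each split separately; a common refinement is to fit the scaler on the training data only and reuse it to transform the test data, so the test set's statistics don't leak into the scaling:

```python
from sklearn.preprocessing import StandardScaler

# Center each feature on a mean of zero and rescale it to unit standard deviation.
X_train = StandardScaler().fit_transform(X_train)
X_test = StandardScaler().fit_transform(X_test)
```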
Let's call it Model Two, and it's a Linear Regression. Let's train the model, and what did I do wrong here? Oh, we train it on our data, of course—our training data.
It needs some data to train on. We can't just tell it to train on nothing. All right, now let's test it and see how it did without the outliers.
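Putting that together, the model creation and training step might look like this sketch, assuming the scaled splits from above:

```python
from sklearn.linear_model import LinearRegression

# Model Two: the same model type as before, now trained on the data without the outlier.
model_2 = LinearRegression()
model_2.fit(X_train, y_train)
```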
Might have been the same. We'll skip right to, what's the score? Score Two is Model Two using its score method. We give it the X-test and the Y-test data.
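As a sketch, that scoring call might look like this. Note that for LinearRegression, score returns R² (the coefficient of determination), which the lesson is loosely calling accuracy:

```python
# score() predicts from X_test internally and compares against y_test,
# returning the R^2 (coefficient of determination) of those predictions.
score_2 = model_2.score(X_test, y_test)
print(score_2)
```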
Predict using the X-test data and measure accuracy against the Y-test data, and it's significantly worse. So, here's the funny thing. If I run this again, I'll get the same value.
However, there is a fair amount of randomness here. If I run this from here down, run all of these, we will actually get a different number, and that's because sometimes when we're working with small sample sizes, as we are here, we're at around 150 rows, which is not really that many. It's not that much data in the grand scheme of things.
So, when we split it up 80/20, which is what this line does, sometimes we can get quite a lot of variance in how well the training data predicts the test data. How similar are these two datasets? We're talking about roughly 30 random rows out of the 150. So, how similar are they to the original training population? So, if I run this and all the code below it, and there's a command here, Runtime > Run Cell and Below, now it's 79%, which is significantly better than we did before.
Now, there are other ways we could measure this. Now it's 60%. Now it's 69%, only slightly above the original.
44%, another bad one, right? So these numbers vary quite a bit from run to run. But the question is, is it better after removing the outliers? And the answer is, like, a little.
I've run this 10,000 times in a loop to see what we get. I've looked at, on average, which of these models—Model One or Model Two—scores better. Model Two scores slightly better without the outliers, but only very slightly.
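Here's a hedged sketch of what that kind of repeated comparison could look like. The variable names are assumptions: X1 and Y1 hold the original features and target with the outlier, while X2 and Y2 are the outlier-free versions from above; the notebook's actual loop may differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def average_score(X, y, n_runs=10_000):
    """Average test-set score over many random 80/20 splits."""
    scores = []
    for _ in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        X_train = StandardScaler().fit_transform(X_train)
        X_test = StandardScaler().fit_transform(X_test)
        model = LinearRegression().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.mean(scores)

# Hypothetical variable names: X1/Y1 include the outlier, X2/Y2 exclude it.
print("Model One (with outlier):   ", average_score(X1, Y1))
print("Model Two (without outlier):", average_score(X2, Y2))
```

Because each run draws a fresh random split, the individual scores bounce around just as described above, but averaging over many runs gives a more stable comparison between the two models.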
Now, this highlights two things. One, removing the outlier helps slightly. It might help more if there were strong reasons for removing the outlier or might actually hurt if there weren't.
But that's one technique we have in our tool chest to improve our model's accuracy: removing outliers. Another one that's very important that we'll be doing pretty much from now on is working with bigger datasets.
This high amount of variance comes from a significant problem with this data, which is just that there isn't enough of it to consistently get a good model out of it. We'll take a look in future lessons at much bigger datasets and see what we get from those. All right.
We'll move on to the next notebook. I hope that you took a lot out of training your first model, looking at its score, and thinking about ways we can improve it.