Enhance your machine learning model's accuracy by effectively identifying and handling outliers in your data. Learn practical methods to assess when and how to remove outliers without compromising your data integrity.
Key Insights
- Outliers, such as an $80,000 car with a seven-liter engine, can significantly skew predictive analysis and affect model accuracy; the baseline model in this lesson scored 69% before any outliers were removed.
- Determining whether to remove outliers requires careful consideration, as some anomalies represent genuine trends—like increased retail sales during the holiday season.
- Filtering data systematically before training, as illustrated by removing unusually priced cars, can improve data quality and may raise the accuracy of predictions.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Okay, so our score was 69%, our accuracy score. And again, that was very good. Let's see what we can do to make it better.
Now, there are a lot of methods for improving your data and thus your accuracy, and it's outside the scope of the course to discuss every single one. But let's talk about one: outliers. Outliers are values that are far outside the norm.
So there's a car here that has a price of $80,000, and it's the only one. And a seven-liter engine, and it's the only one. Let's actually take a look at our data.
We'll be able to see this outlier visually. If we look at engine size versus price in thousands here, it mostly follows a very specific pattern. And there's this dot way over here.
What are you doing over there, man? Right, so this has an engine size of eight and a price in thousands that's quite high as well. And then we have another couple of dots over here that represent extremely expensive cars. Right, so we might consider that an outlier.
This one is definitely going to skew your line a little bit toward this area. Now, outliers are not inherently bad. And a lot goes into deciding whether you should remove outliers, identifying what they are, and clarifying your reasons.
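To make the "skew your line" point concrete, here is a minimal sketch that fits a straight line to a handful of points with and without one extreme car. The numbers are made-up toy values, not the lesson's dataset, and `fit_line` is a hypothetical helper written for this illustration rather than anything used in the course.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Made-up toy data (NOT the lesson's dataset):
# engine size in liters, price in thousands of dollars.
engine_size = [1.8, 2.0, 2.5, 3.0, 3.5, 4.0]
price = [15.4, 17.0, 21.0, 25.0, 29.0, 33.0]

# Fit using only the "typical" cars.
slope_without, _ = fit_line(engine_size, price)

# Fit again after adding one extreme car: big engine, very high price.
slope_with, _ = fit_line(engine_size + [8.0], price + [80.0])

print(f"slope without outlier: {slope_without:.2f}")
print(f"slope with outlier:    {slope_with:.2f}")
```

With these toy numbers the single extra point pulls the slope upward, which is the kind of shift the transcript describes. Whether that shift helps or hurts depends on the reasons discussed next.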
A really good example of that is if you're looking at sales across the year for a store, total sales will spike during the holiday season. That's not really an outlier. It's a real trend, right? It's a real thing that happens.
It represents a real difference that happens at some times. We can't just throw those numbers out. In fact, it's actually the majority of the sales.
So that's why there are reasons to remove it, reasons not to. An outlier can represent, on the other hand, data that just is outside the norm and doesn't really help us understand most cars. It also might even be an error in the data.
An error in the data is often gonna show up as an outlier. Like, yep, that value's way too high. There might've been a faulty measurement or somebody might've entered the data wrong.
So removing outliers can clean up your data as well. There are various arguments in both directions, and you have to look carefully at what your reasons are and make sure you're thinking about it in a clear, systematic way. But let's see if this outlier is helping or not.
Are we gonna get more accurate if we remove the outlier and run it again or less? All right, so we're gonna filter out some outlier cars. Let's take a look at how we might do that. First, let's just see how many cars we have. What's our X value way back there? It's 153 rows.
Okay, what if we got rid of all cars priced above 80 thousand? And again, this is our X. This is before we split up the data into test and train. We'll say, take X and keep only the rows where the price in thousands was less than or equal to 80. So this should filter out maybe one, I think.
Or, I might have made a mistake in that line; let's see. I might have misspelled price in thousands. We'll take a look.
Errors can definitely happen sometimes. All right, so price in thousands. Let's just take a look at X again.
Oh yes, of course. That's a good learning moment, right? The data has already been split off into our X and Y, so X no longer has the price column. We actually can't filter X by that. What we need to do is filter our cars.
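The failure described here can be reproduced in a small sketch. It assumes the data lives in a pandas DataFrame, that price in thousands was the column split off into Y, and that the columns are named `engine_size` and `price_in_thousands`. The transcript does not confirm any of these, and the rows are made-up toy values.

```python
import pandas as pd

# Toy stand-in for the lesson's data; column names and values are assumptions.
cars = pd.DataFrame({
    "engine_size": [1.8, 2.5, 3.5, 8.0],
    "price_in_thousands": [15.4, 21.0, 29.0, 85.0],
})

# Assumed split: price is the target, so it is left out of X.
X = cars[["engine_size"]]
y = cars["price_in_thousands"]

# Trying to filter X by a column it no longer contains raises a KeyError.
try:
    X[X["price_in_thousands"] <= 80]
    filter_worked = True
except KeyError as err:
    filter_worked = False
    print("KeyError:", err)
```

The error comes from the missing column and not from a misspelling, which is why the fix is to filter earlier in the pipeline.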
So in our next video, we'll take a look at our cars and filter them before splitting them into X and Y.
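As a rough preview of that step, here is a minimal sketch of filtering before the split. It assumes a pandas DataFrame named `cars` with hypothetical column names `engine_size` and `price_in_thousands`, and it uses made-up toy rows in place of the lesson's data. The actual code in the next video may differ.

```python
import pandas as pd

# Toy stand-in for the lesson's data; column names and values are assumptions.
cars = pd.DataFrame({
    "engine_size": [1.8, 2.5, 3.5, 8.0],
    "price_in_thousands": [15.4, 21.0, 29.0, 85.0],
})

print("rows before:", len(cars))  # rows before: 4

# Keep only cars priced at or below 80 (thousand), using a boolean mask.
cars_filtered = cars[cars["price_in_thousands"] <= 80]

print("rows after:", len(cars_filtered))  # rows after: 3

# Only now split into features and target, so X and y stay aligned.
X = cars_filtered[["engine_size"]]
y = cars_filtered["price_in_thousands"]
```

Filtering the full table first means the same rows are dropped from both X and Y, so the two never fall out of step.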