Refine predictive models by removing outliers from datasets, with a focus on automotive sales data. Learn to assess improvements in model accuracy through targeted data preprocessing.
Key Insights
- Removing price outliers above $80,000 from the dataset eliminated two rows, improving the quality of the car sales data.
- Filtering out vehicles with engine sizes greater than seven reduced the dataset by one more entry, leaving a cleaned set of 150 rows.
- The process outlined involves redefining variables X and Y, re-splitting into training and test sets, and retraining the predictive model to assess the impact of outlier removal on accuracy and performance.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's go back a step and, instead of looking at X, look at car_sales, our overall car sales data. That still has all of our values, all 153 rows, because this is before we split it off into X and Y. We want to keep only the rows where engine size and price in thousands are closer to the norm.
We'll remove those outliers. We're at 153 rows. We'll set car sales equal to car sales where the price in thousands column is less than or equal to 80.
All right, that cut out two rows, the two outliers where the price was greater than 80. Let's add one more filter, though it may target rows we've already removed, so we might not actually see fewer rows. Let's see. For car_sales, let's also keep only the rows where the engine size column is less than or equal to seven.
Yeah, that removed one more row. You can see down here, it changed from 151 to 150. All right, so we've removed three outliers in total.
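Here's a minimal sketch of those two filtering steps in pandas. The DataFrame name `car_sales`, the column names `price_in_thousands` and `engine_size`, and the CSV path are assumptions based on how they're described in the narration, not the lesson's exact code.

```python
import pandas as pd

# Hypothetical file path and column names, inferred from the narration above.
car_sales = pd.read_csv("car_sales.csv")
print(len(car_sales))  # 153 rows before filtering

# Keep only rows priced at or below 80 (thousand dollars);
# this drops the two high-price outliers.
car_sales = car_sales[car_sales["price_in_thousands"] <= 80]
print(len(car_sales))  # 151 rows

# Keep only rows with an engine size of 7 or less;
# this drops one more outlier.
car_sales = car_sales[car_sales["engine_size"] <= 7]
print(len(car_sales))  # 150 rows
```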
Now our next step is to use that filtered data to redeclare our X and Y, split them into training and testing sets, retrain our model, and see how it compares. Let's take a look.
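As a rough sketch of that next step, here's how redeclaring X and Y, re-splitting, and retraining might look with scikit-learn. The feature columns, the choice of price in thousands as the target, and the use of LinearRegression are illustrative assumptions; the lesson's actual columns and model may differ.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# car_sales is the filtered DataFrame from the previous sketch.
# Feature and target columns here are assumptions for illustration only.
X = car_sales[["engine_size", "horsepower", "fuel_efficiency"]]
y = car_sales["price_in_thousands"]

# Re-split the filtered data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Retrain the model on the outlier-free training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Score the retrained model so it can be compared with the
# model trained before the outliers were removed.
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```

Comparing this score against the one from the original, unfiltered data shows whether removing those three outliers actually improved the model's accuracy.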