Refine predictive models by removing outliers from datasets, with a focus on automotive sales data. Learn to assess improvements in model accuracy through targeted data preprocessing.
Key Insights
- Removing price outliers above $80,000 from the dataset eliminated two rows, improving the quality of the car sales data.
- Filtering out vehicles with engine sizes greater than seven reduced the dataset by one more entry, leaving a cleaned set of 150 rows.
- The process outlined involves redefining variables X and Y, re-splitting into training and test sets, and retraining the predictive model to assess the impact of outlier removal on accuracy and performance.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's go back a step and, instead of looking at X, look at car_sales, our overall car sales data. That still has all of our values, all 153 rows, because this is before we split it off into X and Y. We want to keep only the rows where engine size and price in thousands are closer to the norm.
We'll remove those outliers. We're at 153 rows. We'll set car sales equal to car sales where the price in thousands column is less than or equal to 80.
All right, that cut out two rows, the two outliers where the price was greater than 80. Let's add one more filter, though it may target rows we've already removed, so we might not actually see fewer rows. Let's see. For car_sales, let's also keep only the rows where the engine size column is less than or equal to seven.
Yeah, that removed one more row. You can see down here, it changed from 151 to 150. All right, so we've removed three outliers in total.
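Here's a minimal sketch of those two filtering steps in pandas. The DataFrame name `car_sales`, the column names `price_in_thousands` and `engine_size`, and the CSV path are assumptions based on how they're described in the narration, not the lesson's exact code.

```python
import pandas as pd

# Hypothetical file path and column names, inferred from the narration above.
car_sales = pd.read_csv("car_sales.csv")
print(len(car_sales))  # 153 rows before filtering

# Keep only rows priced at or below 80 (thousand dollars);
# this drops the two high-price outliers.
car_sales = car_sales[car_sales["price_in_thousands"] <= 80]
print(len(car_sales))  # 151 rows

# Keep only rows with an engine size of 7 or less;
# this drops one more outlier.
car_sales = car_sales[car_sales["engine_size"] <= 7]
print(len(car_sales))  # 150 rows
```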
Now our next step is to use that filtered data to redeclare our X and Y, split them into training and testing sets, retrain our model, and see how it compares. Let's take a look.
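As a rough sketch of that next step, here's how redeclaring X and Y, re-splitting, and retraining might look with scikit-learn. The feature columns, the choice of price in thousands as the target, and the use of LinearRegression are illustrative assumptions; the lesson's actual columns and model may differ.

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# car_sales is the filtered DataFrame from the previous sketch.
# Feature and target columns here are assumptions for illustration only.
X = car_sales[["engine_size", "horsepower", "fuel_efficiency"]]
y = car_sales["price_in_thousands"]

# Re-split the filtered data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Retrain the model on the outlier-free training data.
model = LinearRegression()
model.fit(X_train, y_train)

# Score the retrained model so it can be compared with the
# model trained before the outliers were removed.
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```

Comparing this score against the one from the original, unfiltered data shows whether removing those three outliers actually improved the model's accuracy.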