Use Scikit-learn and Pandas to build a linear regression model that predicts car sales prices. Learn to manage data quality issues, visualize important relationships, and identify key predictive features.
Key Insights
- Apply Scikit-learn's train_test_split to divide the dataset into training and testing subsets, so the model is evaluated on data it has not seen.
- Use Pandas' correlation matrix functionality and visualization tools like heat maps and pair plots to identify significant features influencing car prices.
- Examine data quality thoroughly, noting specific issues such as missing values in critical columns like resale value and price in thousands, and understand their impact on predictive accuracy.
So that's our overview. Let's take a look at what we'll be doing here. We're going to be predicting car sale prices using a linear regression machine learning model.
In the process, we'll work with scikit-learn's machine learning tools, particularly train_test_split, which splits our data into training and testing sets. We'll also visualize the data with a heat map and a pair plot to see exactly what's important in it.
We'll be doing a lot of data analysis. We'll also look at the data from a human perspective, thinking about domain knowledge: what do we know about cars? Okay, let's get started by taking a look at our car data.
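As a rough preview of the split step mentioned above, here is a minimal sketch. The small data frame and its column names (horsepower, engine_size, price_in_thousands) are made-up stand-ins, not the real car sales file, and the 80/20 split and random_state value are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the real notebook loads the car sales CSV instead.
cars = pd.DataFrame({
    "horsepower": [140, 225, 210, 150, 200, 310, 170, 193, 175, 240],
    "engine_size": [1.8, 3.2, 3.5, 1.8, 2.8, 4.2, 2.5, 2.8, 3.1, 3.8],
    "price_in_thousands": [21.5, 28.4, 42.0, 24.0, 34.0, 62.0, 27.0, 33.4, 22.0, 32.0],
})

X = cars[["horsepower", "engine_size"]]  # features the model learns from
y = cars["price_in_thousands"]           # target the model predicts

# Hold out 20% of the rows for testing (an assumed ratio);
# random_state makes the shuffle repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

The model is fit only on the training rows, and the held-out test rows are used afterward to check how well it predicts prices it has never seen.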
So you should have already run these blocks to connect to Google Drive and all of that. You should also run this to make sure that you've got the correct URL for our CSV file. Let's try to create a data frame based on that data.
You may get an error here, in which case, again, go back and make sure that your Google Drive has the Python machine learning bootcamp folder in it. In there, a CSV folder. And in there, the car sales CSV file.
See our setup video if you need more help. All right, let's create a data frame. Let's give this a go.
We'll call it car sales data; this is our Pandas data frame. And we'll base it on... nope, that would be if we were creating it from data that's already in Python.
It's in a CSV, so we need to run read_csv. And we're going to give it a path to that CSV: our base URL, which I thought would autocomplete.
There it is. Plus our car sales URL. And if we do that, we can output the result here. If you got an error here, some red text and an error, then it means your file is not there.
It almost definitely means your file is not there. Okay.
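The loading step just described can be sketched roughly like this. The names base_url and car_sales_url and the file name are assumptions standing in for whatever the notebook's setup cells define, and the tiny CSV is written to a temporary folder only so the example runs outside Google Drive.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-ins for the notebook's base URL and file name.
base_url = tempfile.mkdtemp() + os.sep
car_sales_url = "car_sales.csv"

# Write a tiny made-up CSV so the sketch is runnable anywhere;
# in the notebook, the file is expected to already exist in Drive.
with open(base_url + car_sales_url, "w") as f:
    f.write("Manufacturer,Horsepower,Price_in_thousands\n")
    f.write("Acura,140,21.5\n")
    f.write("Audi,150,23.99\n")

path = base_url + car_sales_url

# A missing file is the most likely cause of red error text at this step.
if not os.path.exists(path):
    raise FileNotFoundError(f"No CSV found at {path}; check the folder layout")

# read_csv loads from a file; the DataFrame constructor is for data already in Python.
car_sales_data = pd.read_csv(path)
print(car_sales_data)
```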
But this is what our car sales data looks like. When you output a data frame in Pandas, if it has many rows, and this one has 157, it will give you the first five and the last five.
All right, we can see we've got a lot of different values here: resale value after one year, vehicle type, how much it costs (this is going to be our target, what we're trying to predict), engine size, horsepower, wheelbase, just a lot of different factors.
That's a lot of data and a lot of noise. Which of these is important? Which of these should we train our model on? That becomes the question. That's a question we'll start to find a method for answering.
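One common first pass at the "which columns matter" question, and the one the Key Insights point to, is a correlation matrix. Here is a minimal sketch on made-up numbers; the column names are assumptions, and the values are invented for illustration rather than taken from the car sales file.

```python
import pandas as pd

# Hypothetical stand-in data, not the real car sales file.
cars = pd.DataFrame({
    "vehicle_type": ["Passenger", "Passenger", "Car", "Passenger", "Car"],
    "horsepower": [100, 150, 200, 250, 300],
    "wheelbase": [105, 101, 108, 103, 110],
    "price_in_thousands": [15, 22, 30, 38, 45],
})

# Correlation only makes sense for numeric columns, so drop the text ones first.
corr = cars.select_dtypes("number").corr()

# Rank the features by how strongly they move with the target.
price_corr = (
    corr["price_in_thousands"]
    .drop("price_in_thousands")
    .sort_values(ascending=False)
)
print(price_corr)
```

A heat map is a colored rendering of a matrix like corr, which makes the strong and weak relationships easier to spot at a glance.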
Before we go any further, let's take a look at the quality of our data. We can ask car sales data: how many NA values do you have? If you're not familiar with Pandas, those are 'not available' values.
And we can actually see many of these in the output. These rows have no value for this column. NaN stands for 'not a number'.
That's Pandas saying: what you gave me is not a number, and it should be a number.
So that's not a valid value. Let's have it sum those up.
Which columns have NA values? Most of these columns are good, but 'resale value after one year' is missing a bunch, and a couple of others are missing values too.
'Price in thousands' is missing values too, and that's important, since it's our target. Then we have other columns where some rows are missing data. We'll clean up our data, which is a big part of this, in a little while.
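The missing-value check walked through above looks roughly like this. The data frame is a made-up stand-in with assumed column names, so the counts it prints are not the real ones from the car sales file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with a few deliberately missing values.
car_sales_data = pd.DataFrame({
    "Manufacturer": ["Acura", "Acura", "Audi", "BMW"],
    "resale_value": [16.36, np.nan, np.nan, 27.5],
    "Price_in_thousands": [21.5, 28.4, np.nan, 33.4],
    "Horsepower": [140, 225, 150, 193],
})

# isna() returns a frame of True/False: True wherever a value is missing.
print(car_sales_data.isna())

# Summing counts the True values, giving the number of missing entries per column.
missing_per_column = car_sales_data.isna().sum()
print(missing_per_column)
```

Columns with a count of zero are complete; anything above zero needs a decision later, such as dropping those rows or filling in the gaps.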
All right, so that's a summary of our data. Let's start thinking about what columns to keep.