Correlation Matrices in Data Analysis with Pandas

Uncover relationships within your data using correlation matrices, a powerful analytical tool provided by pandas. Learn how identifying correlations can refine your analysis and guide strategic decisions.

Key Insights

A correlation matrix is a pandas-generated tool that quantifies the relationship between variables on a scale from -1 (perfect negative correlation) to 1 (perfect positive correlation), helping analysts easily identify variable relationships.
Horsepower strongly correlates with both price and engine size, indicating that higher horsepower typically means larger engine size and higher price; conversely, fuel efficiency negatively correlates with price and engine size, indicating that more fuel-efficient vehicles tend to have smaller engines and lower prices.
Despite initial expectations, data analysis reveals that "sales in thousands" shows minimal correlation with price, horsepower, and engine size, suggesting this variable may not be valuable for predictive modeling.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

We're looking at this data and we've made some decisions as to what we think is important, but let's do a little data analysis. Let's figure out what is correlated with what. Now, we can check that with a correlation matrix.

A correlation matrix is a way to tell which columns in a table, in a pandas DataFrame, are correlated with which other columns. It gives you back a new DataFrame where each column is compared to every other column. If it says 1.0, it's perfectly correlated.

That's because it's comparing it to itself, right? Like the more the price goes up, the more the price goes up. Like, yeah, obviously that's the relationship one-to-one because it's the same thing. It's looking in a mirror.
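As a rough sketch of that idea, the snippet below builds a tiny DataFrame and checks the diagonal of its correlation matrix. The column names and all of the numbers are made up for illustration; they are not the lesson's car data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, invented purely for illustration.
toy = pd.DataFrame({
    "price": [18.0, 25.5, 41.0, 22.0, 33.5],
    "horsepower": [110, 170, 280, 140, 230],
    "fuel_efficiency": [33, 27, 19, 30, 22],
})

# corr() returns a new DataFrame with one row and one column
# for each numeric column in the original.
matrix = toy.corr()
print(matrix.shape)

# The diagonal holds each column compared with itself.
diagonal = np.diag(matrix.to_numpy())
print(diagonal)
```

The matrix is also symmetric: price against horsepower is the same number as horsepower against price, so you only need to read one side of the diagonal.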

So the other values will range from one, meaning the two columns are perfectly correlated, down to negative one, meaning they're perfectly correlated but in the opposite direction. So, whenever the price goes up, sales go down in exact proportion, right? But most of them will be in between.

If it's zero, then there's no linear relationship between them. In straight-line terms, knowing one of these two variables tells you nothing about the other. And the closer it is to one or negative one, the more strongly they're correlated.
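To make the ends of that scale concrete, here is a small sketch using made-up Series (the names `up`, `down`, and `u_shape` are only for this example). It assumes the default Pearson correlation, which measures straight-line relationships only.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

up = 2 * x + 10          # rises in lockstep with x
down = -3 * x + 50       # falls in lockstep with x
u_shape = (x - 3) ** 2   # depends on x, but not along a straight line

r_up = x.corr(up)        # at the positive end of the scale
r_down = x.corr(down)    # at the negative end of the scale
r_u = x.corr(u_shape)    # at zero, despite the clear pattern

print(r_up, r_down, r_u)
```

Two things are worth noticing. The slopes (2 and -3) don't change the result, so a perfect correlation does not mean "one unit for one unit." And the U-shaped column lands at zero even though it is completely determined by `x`, which is why zero is better read as "no linear relationship" than "nothing to do with each other."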


When you're judging how strong a relationship is, it doesn't matter in which direction they're correlated. What matters is whether there's a pattern here. All right, so this is a great way to see these relationships, and it's something given to us directly by pandas, because pandas is wonderful. All right, so we're going to say carSales.corr(). Give me a correlation matrix.
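Here is a minimal sketch of that call. The DataFrame below is a hypothetical stand-in for the lesson's carSales table: the column names are assumptions and the values are invented. It also assumes pandas 1.5 or later, where `numeric_only=True` is available to skip text columns such as a model name.

```python
import pandas as pd

# Hypothetical stand-in for carSales; names and values are made up.
carSales = pd.DataFrame({
    "Model": ["A", "B", "C", "D", "E", "F"],
    "Sales_in_thousands": [12.0, 45.5, 8.3, 30.1, 22.7, 15.4],
    "Price_in_thousands": [18.0, 25.5, 41.0, 22.0, 33.5, 29.0],
    "Engine_size": [1.6, 2.4, 3.8, 2.0, 3.0, 2.5],
    "Horsepower": [110, 170, 280, 140, 230, 190],
    "Fuel_efficiency": [33, 27, 19, 30, 22, 25],
})

# numeric_only=True leaves the text column out of the comparison.
corr_matrix = carSales.corr(numeric_only=True)

# Rounding makes the grid of numbers easier to scan.
print(corr_matrix.round(2))
```

If every column in your table is already numeric, a plain `carSales.corr()` does the same job.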

Let's take a look at that. Okay, now this is a lot of numbers. We'll find a way to visualize this in a moment.

But you can see, running down the diagonal here, all these 1.0s, right? Again, that's comparing horsepower to horsepower. They are perfectly correlated. But the other values are more or less correlated.

Horsepower doesn't seem to be related to how much overall sales you have from selling that car. If we look at horsepower, and we go over to engine size, they're pretty correlated, which makes sense. The bigger the engine, the more horsepower it's going to have.

It's also pretty well correlated, though, with the price. If you look at horsepower, it explains a lot of price. Surprisingly, fuel efficiency could be useful, but in the opposite direction.

The less fuel efficient something is, the more it tends to cost: the higher the price in thousands. Engine size is negatively correlated with fuel efficiency too: the more fuel efficient something is, the smaller the engine tends to be.

And again, when these numbers approach one or negative one, that means there's a high correlation there. What about some of the ones that are very highly correlated, right? We can definitely see that horsepower is very related to the price, right? So that seems like it could be important. Again, this is one of those things where our domain knowledge has paid off to some degree.

Price in thousands, for example, that's something we figured would be related to horsepower, and it is. But price in thousands related to engine size, less correlated. Horsepower seems to matter more.

Fuel efficiency, in the opposite direction, does seem to predict price in thousands, but again, less so than the other values. And sales in thousands is the weakest predictor among these. So, looking at this data analysis, we might conclude that sales in thousands doesn't really belong here.
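One way to make this kind of comparison less of an eyeball exercise is to pull out the price column of the matrix and sort it by strength. This is only a sketch: the DataFrame is a made-up stand-in with assumed column names, so the ordering it prints says nothing about the real car data.

```python
import pandas as pd

# Hypothetical stand-in for carSales; names and values are made up.
carSales = pd.DataFrame({
    "Sales_in_thousands": [12.0, 45.5, 8.3, 30.1, 22.7, 15.4],
    "Price_in_thousands": [18.0, 25.5, 41.0, 22.0, 33.5, 29.0],
    "Engine_size": [1.6, 2.4, 3.8, 2.0, 3.0, 2.5],
    "Horsepower": [110, 170, 280, 140, 230, 190],
    "Fuel_efficiency": [33, 27, 19, 30, 22, 25],
})

price_col = "Price_in_thousands"

# Take price's column of the matrix and drop price-versus-itself.
strength = carSales.corr(numeric_only=True)[price_col].drop(price_col)

# Order by absolute value so strong negative correlations rank
# alongside strong positive ones, but keep the original signs.
order = strength.abs().sort_values(ascending=False).index
ranked = strength.loc[order]

print(ranked.round(2))
```

Sorting on the absolute value matters here because a correlation near negative one is just as informative as one near positive one.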

And honestly, the domain knowledge supports this, right? Because these things are actually sort of a very complex relationship, right? If something costs a lot, does it mean that people are gonna spend more overall money on it, right? This is the price of one of them, price in thousands, and this is how much money the company is making from that model, right? And so there's, hey, maybe at a certain point, the price goes up too high and people just aren't buying them. And maybe if the price goes too low, sales also decrease because you're not making enough money off each car, right? So, it's not very well correlated, and that makes sense.

But if we relied on domain knowledge, we might think sales in thousands and price in thousands should be very related. However, when we look at the actual data analysis, we see that these two are not very highly correlated.

So, I think what we're going to do now is take another look at this, and based on this data analysis, we'll likely conclude that sales in thousands isn't going to be useful for the model.
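If we do decide to leave it out, dropping a column is a one-liner. The sketch below again uses a made-up stand-in for carSales with assumed column names; `features` is just a name chosen for this example.

```python
import pandas as pd

# Hypothetical stand-in for carSales; names and values are made up.
carSales = pd.DataFrame({
    "Sales_in_thousands": [12.0, 45.5, 8.3, 30.1],
    "Price_in_thousands": [18.0, 25.5, 41.0, 22.0],
    "Engine_size": [1.6, 2.4, 3.8, 2.0],
    "Horsepower": [110, 170, 280, 140],
})

# drop() returns a new DataFrame and leaves carSales untouched.
features = carSales.drop(columns=["Sales_in_thousands"])

print(list(features.columns))
```

Because the original DataFrame is left as it was, it's easy to bring the column back later if the model turns out to need it.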


Colin Jaffe
