Mean, Median, and Mode in Pandas for Resale Value Analysis

Learn how to effectively use pandas to calculate mean, median, and mode for analyzing your dataset's resale values. Gain clarity on when each statistical measure offers the most insightful perspective for your data analysis.

Key Insights

The pandas mean method calculates the mathematical average, useful for evaluating the entire data set including outliers, providing an overall view rather than a typical value.
Median, calculated using pandas' median method, identifies a central, representative value that best reflects typical data points without being skewed by extreme outliers.
Mode, determined by pandas' mode method, shows the most frequently occurring values; however, it can be less meaningful for datasets such as resale values where repeated points are uncommon.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

We're going to take a look now at how we can use Pandas to get mean, median, and mode for a column of data, a Pandas Series, in this case, resale value. So let's make some variables. I'm going to make resale_mean equal to cars["Year Resale Value"].mean().

And that's a Pandas method on a Series that gives you the mathematical mean, the average (it has to be a numerical Series). Let's do the same for median, which works the same way. And for mode, which works a little differently. One difference is that mode does not have to be numerical, because it's just looking at which values show up most often.
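As a rough sketch of the three calls just described: the `cars` DataFrame below is a small made-up stand-in for the lesson's dataset (the `Model` column and all the numbers are assumptions for illustration), and `color_mode` is a hypothetical extra to show mode on text data.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's cars DataFrame; the values are made up.
cars = pd.DataFrame({
    "Model": ["A", "B", "C", "D", "E", "F"],
    "Year Resale Value": [12.5, 18.0, 12.5, 20.0, 67.0, None],
})

resale = cars["Year Resale Value"]

resale_mean = resale.mean()      # arithmetic average; missing values are skipped by default
resale_median = resale.median()  # middle value of the sorted non-missing data
resale_mode = resale.mode()      # most frequent value(s), returned as a Series

print(resale_mean, resale_median)
print(resale_mode)

# mode also works on non-numeric data, since it only counts occurrences
color_mode = pd.Series(["red", "blue", "red"]).mode()
print(color_mode)
```

With this toy data the single large value pulls the mean well above the median, which is the pattern the rest of the lesson looks for in the real column.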

Let's print those out. And let's also take a look at what's the type of mode. There we go.

Let's execute that. So the mode is a Series itself, which means that when we're looking at this, we're getting back a column of data. And that sort of is reflected here, the way we're printing it out.

We'll see it a little better as a Series if we just output the value instead of trying to print it. Here's the value. You can see it looks like a Pandas Series.
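Here is a minimal sketch of why mode comes back as a Series: when several values tie for most frequent, each one gets its own row. The numbers are made up so that exactly two values tie, and the names `first_mode` and `mode_list` are hypothetical.

```python
import pandas as pd

# Made-up values in which two numbers tie for most frequent
resale = pd.Series([10.0, 10.0, 15.0, 15.0, 22.0], name="Year Resale Value")

resale_mode = resale.mode()

print(type(resale_mode))  # a pandas Series, not a single number
print(resale_mode)        # the index 0, 1, ... labels each tied mode

# Pull out individual modes if a plain number or list is needed
first_mode = resale_mode.iloc[0]
mode_list = resale_mode.tolist()
```

If only one value is most frequent, the result is still a Series, just with a single row.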


We have the column name, the index values, and these four values, indexed zero to three, which are the modes that tied for the most common value in this column. We'll take a look now and judge which of these measures is helpful. What questions do they answer? The way I'm going to do that is to look at a random set of year resale values, using the built-in Pandas method `sample`, which will give you one random value if you pass no argument, or a certain number of random values from the dataset if you pass a number.
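A small sketch of `sample` on a Series, assuming made-up resale values with one missing entry. The `random_state` argument is not used in the lesson; it is shown only as an optional way to make a draw repeatable.

```python
import pandas as pd

# Made-up values, including a missing entry like the NaNs in the lesson's data
resale = pd.Series(
    [11.2, 13.5, None, 16.4, 18.9, 9.8, 58.0, 14.1],
    name="Year Resale Value",
)

one_value = resale.sample()     # no argument: one random row
five_values = resale.sample(5)  # five random rows, drawn without replacement

# Optional: fixing random_state gives the same draw every time
repeatable = resale.sample(5, random_state=0)

print(one_value)
print(five_values)
```

Each run without `random_state` returns different rows, which is why repeated sampling is a quick way to get a feel for the spread of a column.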

If we look at these ten, we can see that the median is pretty close to answering the question here. Ignore the NaNs for now; we'll return to what those are. But all five of these values are pretty close to the median.

The median is answering the question, What does your typical value look like? What is the approximate typical value? What do they all center around? The mean is taking outliers into account. The median is ignoring those. And the mode is asking which values appear the most in our data.

And this data is not particularly useful for that, because the mode is a poor measure for data where values don't repeat very often. These only appear a couple of times each. And it's more of a backwards-looking, Hey, what is showing up frequently in our data? Let's take a look at another sample of ten, and you'll get that idea.
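One way to check how much weight a mode carries is to count how often it actually appears. This is a sketch with made-up values in which nothing repeats more than twice; `value_counts` is a standard Series method, but the lesson itself does not use it here.

```python
import pandas as pd

# Made-up resale values where repeats are rare
resale = pd.Series([11.2, 13.5, 13.5, 16.4, 18.9, 9.8, 58.0, 14.1, 9.8, 12.0])

counts = resale.value_counts()      # occurrences of each distinct value, most frequent first
top_count = counts.iloc[0]          # how many times a mode appears
share = top_count / resale.count()  # fraction of non-missing rows that one mode accounts for

print(resale.mode().tolist())
print(top_count, share)
```

When the top count is only two or three out of the whole column, the mode says very little about what a typical value looks like.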

Only one of these, again, is above the mean, because the mean is taking into account quite a few of these outlier values. And we still haven't even seen, I don't believe, any of the mode values. Any of those in there? No, we'll see them eventually.

We can see a couple of values above the mean now, and here's where we're starting to see a bit of an outlier well above the mean. And we'll see more of those as we look at more random values just to get a sample. I don't believe I've seen a single one of these mode values, because again, the mode is answering a question we wouldn't actually ask of this data.

There are plenty of times when it is the right question to ask: What particular value or values are showing up quite a lot in our data? And here's one of those outliers I talked about that is skewing the mean quite a lot. But again, if you're looking to take outliers into account and figure out the mathematical middle, as opposed to the more typical middle without outliers, then the mean is what you want. And here's another one that's an outlier, just not quite as extreme.
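To make the skew concrete, here is a sketch comparing the same made-up values with and without a single large outlier. The names `typical` and `with_outlier` are hypothetical.

```python
import pandas as pd

# Made-up values clustered close together
typical = pd.Series([11.0, 12.0, 13.0, 14.0, 15.0])

# The same values plus one extreme outlier
with_outlier = pd.concat([typical, pd.Series([65.0])], ignore_index=True)

print(typical.mean(), typical.median())            # both land in the middle of the cluster
print(with_outlier.mean(), with_outlier.median())  # the mean jumps, the median barely moves
```

One extra value moves the mean by several units while the median shifts by half a unit, which is the difference between the mathematical middle and the typical middle.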

And here at last is one of the mode values. There might have been one that I missed in these random samples. So those are the kind of questions that mean, mode, and median ask.

We're looking at some actual values, and we have to ask which is the right question to ask of this data. Mode doesn't seem very helpful for this particular set of data. Median would be helpful for understanding what our typical value looks like in general.

And mean is valuable when you're trying to look at the entire dataset as a whole, including the outliers, and take those into account.


Colin Jaffe
