Gain clarity on measuring data effectively by understanding the implications of using mean versus median. Learn why median provides a more accurate reflection of 'typical' values, especially when data includes outliers.
Key Insights
- The mean average calculates the mathematical middle of a data set but can be significantly skewed by outliers, as seen when a single low temperature reading of 48 degrees lowered the mean to 79.36.
- Median offers a more representative measure of typical data points, demonstrated by sorting grades numerically and identifying the median (83), which stayed constant despite extreme outliers like scores of 0 or 194.
- Calculating the median in Python is straightforward with libraries such as NumPy, which provide built-in functions like np.median() to quickly identify a data set's middle value.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's talk about averages and the questions you ask of data. When we look at the average, the mean, what question is it actually answering? Yes, it tells us the average temperature. But it's also asking, what's the mathematical middle? By that I mean: if we add up all the numbers and divide by how many there are, what value do we get? For this data set, it's 79.36. But if you look up at the data, that is thrown off a little by an outlier.
We have this 48 degrees, and it's not what you typically see. The typical numbers are going to be higher than that. So this number 48 is actually throwing off this 79 degree mean.
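Here is a minimal sketch of how one low reading pulls the mean down. The readings are hypothetical, invented for illustration. They are not the lesson's data set, so the mean here will not be 79.36. Only the single low reading of 48 mirrors the example.

```python
import numpy as np

# Hypothetical temperature readings (NOT the lesson's data); 48 is the outlier.
temps = [82, 84, 81, 83, 85, 48, 82, 84]

# Mean of all readings, and mean with the single outlier left out.
mean_with_outlier = float(np.mean(temps))
mean_without_outlier = float(np.mean([t for t in temps if t != 48]))

# Count how many readings sit above and below the mean.
above = sum(t > mean_with_outlier for t in temps)
below = sum(t < mean_with_outlier for t in temps)

print("Mean with the outlier:   ", mean_with_outlier)
print("Mean without the outlier:", mean_without_outlier)
print("Readings above / below the mean:", above, "/", below)
```

In this made-up sample, every reading except the outlier sits above the mean, which is the pattern the lesson points out.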
Your typical temperature is actually higher than that. So the mean is answering a different question. What's the average? Sure. But what if we're looking for the typical value? If I pick a reading at random, what is it likely to be close to? It's likely to be higher than 79.36. If we look at these numbers, more of them are higher than that than are lower.
So let's do that with median. Let's talk about what median is.
If you already know your averages, this is a helpful refresher in Python. But I also want you to look at it from a different perspective, if you haven't already: the tools we use to study data should depend on what information we're looking for. What question are we trying to ask of the data? What information are we trying to glean from it? In this case, if we look at these grades and our question is, what's the average, let's look at the mean.
Let's print out np.mean of grades: 72.67, or really 72.6 with the 6 repeating. So how many of these numbers are above 72.6? One, two, three, four, five, six, seven.
Most of them. How many are there in total? Nine.
Of the nine, only two are below that average. So the mean tells us the mathematical average, but it doesn't really answer a more common question: what is your typical number here? Median does that, and it does it in an interestingly simple way.
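The counting done above can be sketched in a few lines. The grades list here is an assumption: the transcript does not show the actual values, so these nine numbers are hypothetical, chosen only so that they agree with the figures quoted in the lesson (nine grades, a mean of about 72.67, a low of 0 and a high of 94).

```python
import numpy as np

# Hypothetical grades (the lesson's actual list is not shown in the transcript).
grades = [83, 94, 0, 78, 90, 50, 85, 94, 80]

mean_grade = float(np.mean(grades))

# Split the grades into those above and those below the mean.
above_mean = [g for g in grades if g > mean_grade]
below_mean = [g for g in grades if g < mean_grade]

print("Mean:", round(mean_grade, 2))
print("Grades above the mean:", len(above_mean))
print("Grades below the mean:", len(below_mean))
```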
To calculate the median, you sort the numbers. So your grades might look more like this, sorted from zero up to 94, or the other way around. It doesn't actually matter.
And then, for the median, you look at the middle number. In this case, it's 83. With a small dataset like this, we can eyeball it very easily.
83 seems like a typical value here, right? Most of the numbers are around 83 in a way that they're not around 72.6. If we're trying to find our typical value, median is a better measurement. Again, it depends on what question you're asking.
There are actually quite a few times when we do want the mean, because we're after the mathematical average as opposed to the middle value. The middle value is the question that median answers. And the nice thing about the middle value is that it doesn't get thrown off by outliers. If this one is zero, our median is still 83.
If this one is 50, our middle is still 83. If this one is zero, our middle is still 83. If this is 94, our middle is still 83.
If this is 194, which is quite an outlier, our middle is still 83. Right? The outliers, the incredibly high or low numbers, shouldn't really affect what the typical value is. If we're looking for the typical value, median answers that better. Let's look at how we would actually get that value.
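The outlier experiment described above can be reproduced with a short sketch. The grades are again hypothetical stand-ins for the lesson's list, and `median_after_change` is a helper name made up for this example.

```python
import numpy as np

# Hypothetical sorted grades (stand-ins for the lesson's list).
grades = [0, 50, 78, 80, 83, 85, 90, 94, 94]

def median_after_change(values, index, new_value):
    """Return the median after replacing one entry (hypothetical helper)."""
    changed = list(values)       # copy so the original list is untouched
    changed[index] = new_value
    return float(np.median(changed))

# Vary the lowest grade, then the highest grade.
low_medians = [median_after_change(grades, 0, v) for v in (0, 50)]
high_medians = [median_after_change(grades, -1, v) for v in (94, 194)]

# For contrast, the mean does move when the top grade becomes 194.
mean_before = float(np.mean(grades))
mean_after = float(np.mean(grades[:-1] + [194]))

print("Medians with different low values: ", low_medians)
print("Medians with different high values:", high_medians)
print("Mean before / after the 194 outlier:", round(mean_before, 2), "/", round(mean_after, 2))
```

The median holds steady here because each changed value stays on the same side of the middle. A change that moves a value across the middle could shift the median.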
The idea is actually very simple. We could do it by hand in plain Python. It'd be a little tricky, but we could do it.
We could sort the list and then look at the middle value: count how many items there are, divide by two to get the middle position, and look at that index. The one right in the middle.
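The by-hand steps above could look something like this. It is a sketch, not the instructor's code: `manual_median` is a made-up name, the grades are hypothetical, and the even-length branch (averaging the two middle values) goes beyond what the lesson describes, though it matches how np.median treats an even count.

```python
def manual_median(values):
    """Median by hand: sort, find the middle position, read that index."""
    if not values:
        raise ValueError("median of an empty list is undefined")
    ordered = sorted(values)      # step 1: sort the numbers
    count = len(ordered)          # step 2: count how many there are
    middle = count // 2           # step 3: divide by two for the middle index
    if count % 2 == 1:
        return ordered[middle]    # odd count: exactly one middle value
    # Even count (an addition to the lesson): average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical grades (the lesson's actual list is not shown in the transcript).
grades = [83, 94, 0, 78, 90, 50, 85, 94, 80]
print("Manual median:", manual_median(grades))
```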
Now, there's a faster way. Feel free to do that, but I'm not going to. I'm going to use something we have built in.
The way we can calculate the median is with NumPy's median function, passing in our grades in this case. And we can compare that to the mean.
Let me label that as well. All right. If I execute that, our mean value is 72.6, and our median value, as we discussed, is 83.
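A labeled comparison along those lines might look like the following. The grades are hypothetical, chosen to agree with the figures quoted in the lesson. np.mean and np.median are the real NumPy functions, while the variable names and labels are this sketch's own.

```python
import numpy as np

# Hypothetical grades (the lesson's actual list is not shown in the transcript).
grades = [83, 94, 0, 78, 90, 50, 85, 94, 80]

mean_grade = np.mean(grades)      # arithmetic average
median_grade = np.median(grades)  # middle value of the sorted data

print(f"Mean:   {mean_grade:.2f}")    # Mean:   72.67
print(f"Median: {median_grade:.2f}")  # Median: 83.00
```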
Okay. In our next video, we'll take a look at another average: mode.