Master the fundamentals of statistical analysis by understanding measures of central tendency, including mean, median, mode, and mid-range. Gain practical insights into how sample adjustments using degrees of freedom (n-1) enhance statistical accuracy.
Key Insights
- Calculating the mean for a sample using "n minus one" (degrees of freedom or Bessel's correction) consistently yields results closer to the actual population mean; in a study involving 676 random numbers conducted 100 times, using "n minus one" provided the closest estimate 41% of the time.
- The median provides a robust measure of central tendency unaffected by outliers, making it ideal for data with extreme values, whereas the mid-range calculation (maximum plus minimum divided by two) emphasizes the effect of outliers and drastically changes when extreme values are present.
- Excel offers specialized functions for measuring central tendency, including AVERAGE for mean, MEDIAN for middle data points, MODE.SNGL and MODE.MULT for identifying single or multiple frequent values respectively; conversely, mid-range must be calculated manually using MAX and MIN functions.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
In this section, we're going to go over the measures of central tendency. Central tendency describes data that falls into the central area, so it's where most of your results will be found. We're going to take a look at several different types of calculations you can use to measure central tendency.
That will include the mean, the median, the mode, and some other visualizations that we'll review to look at the results of central tendency. So what is the mean? The mean is the sum of all the data divided by the total count of the data. Now the total count of the data can be referred to as the population.
That's usually referred to symbolically as the letter n, and so that's used as the denominator to divide the sum of all the data. This is commonly referred to as average in Excel, but the technical term is mean. The mean of a sample, which is not the entire population, uses n minus one as the denominator.
When you have a sample size, it's smaller than the actual population, so you have to make an adjustment when you want to divide the sum of all that data by the count. Since the count is a sample, the calculation we'll use is the total count minus one. So let's take a look at an example visually so you can understand what I'm talking about.
So some people may think that New York is New York City, but New York City is actually just a small part of New York. The entire population goes beyond the five boroughs. So this is a visual display of the entire population.
You'll get a much more accurate result for central tendency when you include the entire population. But it may not be practical to get information from the entire population, so we'll use a sample. And when you're using a sample, you'll have to make adjustments.
So let's take a look and see what the difference is between the average as it relates to an entire population and as it relates to a sample. So here I have some numbers. If we want to calculate the mean, there isn't a mean function in Excel.
There's just the average function. So you'll use average. That will be equal to average.
And then you just select all the values. When I press ENTER, the average is 4.111. Now if I want to get the average of the sample or the mean of the sample, I'm going to have to do a little bit more work. I can't just type average.
I'm going to type sum, open parentheses. I'm going to select all the values, close parentheses, and then I'm going to divide that by the count of all the values now. So far that's just a regular average, but here is where there's an important difference.
I'm going to enter minus one. I want to make sure to put that in parentheses because of the order of operations. So I'll wrap count in parentheses.
Now I will get the average of the sample and that's 4.65. It's relatively close and you may not think that that makes much of a difference, but let's actually put this into practice and see if using n minus one gives us a more accurate result. So let's take a look at this table here. This is an entire table and we have three samples.
We have a yellow sample, a green sample, and an orange sample. Now the mean for the entire table, including all three samples, is 4.67. Now we've calculated the average of the sample and also the average of the entire population. So the average of the entire population is listed first and then the average of the sample is listed next.
So since we know the population of the entire table is 4.67, let's see which is closer, the average of the sample or the average of the entire population. We'll start with the yellow sample. Now the average of the yellow sample is 4.11. The average of the sample, not the entire population, is 4.63. Is that closer to the population of the entire table? It is.
4.63 is closer to 4.67. So I have a dropdown here where I'm going to say the winner is the sample. Let's take a look at another sample. Maybe that was an anomaly.
I'm going to take a look at the green sample here. The average of the entire population for that is 4.889. The average of the sample, which is n-1, is 5.50. So which is closer? Well in this case, it's the population. Now we have a tie.
One for the sample, one for the population. Now let's take a look at the orange sample. Maybe this will decide whether or not the average of the sample or the population is closer.
Now the average of the orange population is 4.0. The average of the sample of that orange area and sample, the average of that is 4.5. Which is closer? Well in this case, it's the sample. Two out of three times we have determined that the average of the sample is closer than the average of the population. And so this type of calculation where you take the count in minus one is referred to as degrees of freedom or Bessel's correction in statistics.
Bessel was the person who came up with that calculation. Now it's not 100% accurate as you can tell, but it's statistically more accurate than the average without correction. Now if you don't think that is enough to say that the average of the sample is more accurate, we do have another sheet where we work with a much larger population with just as many samples.
So I'm going to go to that sheet here. So let me explain what you're looking at here. You're looking at a table of 676 random numbers from 100 to 999.
And we selected three green samples from those random numbers. Now the population mean is 522. If you take all the records and come up with an average, you'll get 522.
Now we want to determine what's closer when it comes to the sample. So for the first green sample, the average, just the regular average, is 449. Now if you come up with the average where you subtract one from the count, it's 506.
In that case, for the first green sample, the average of the sample that's calculated by n minus one is actually closer to the average of the entire table. Let's take a look at the second green sample. Again, n minus one is closer to 522.
And then if we take a look at sample number three, again, the sample for n minus one is much closer to the average of the entire table. We also, just to check and see if it makes a difference, we also used an average that added one to the count. Either way, n minus one was always statistically closer to the average of the entire table.
I'm going to scroll down here. And again, if you didn't feel like that was enough experimentation to figure this out, well, we continued beyond what we have here. And we can tell you that after generating 676 random numbers 100 times and using three sample sizes, like the three sample sizes here, in the average for all the different types of average, 41% n minus one was closer to the population of the entire table.
34% was closer just using the average without n minus one. And only 25% was closer when you added one. So overall, statistically, the best method for obtaining an average that matches the average of the entire table is going to be an average where you use n minus one for the sample.
Now I'm going to go back to central tendency. We took a look at the mean. Now we're going to take a look at the median.
How does the median work? The median is the value that's in the middle of the set of ordered or ranked data. And it's not influenced by outliers. So there is a function in Excel for median, and it's simply equal MEDIAN.
And then you just select your data. I'll go here. I'll press ENTER.
I get five. So that is the median for the values in that box. Now let's take a look at something that's a little easier to understand.
I want to get the median for one, two, three, and four. So I'll go here and I'll type median. Now this is in some ways kind of like average, but let's see what we get here.
What do you think the result would be? Probably somewhere between two and three. When I press ENTER, you are correct. It's 2.5. Now you might say, what's the difference between that and the average? Well, let's calculate the average.
Let's see if there is any difference. I'm going to select the numbers again. One, two, three, four.
I'll press ENTER. I also get 2.5. But here's the difference. What if I increase one to ten? You would think that would make a difference in terms of the median.
Let's test the median and the average by changing this to ten.
And it makes a bigger difference when it comes to the average. Less so for the median. What if I go over here to the right and change this to 400? I'll do something that's a little more drastic.
I'll press ENTER. There's a big difference when it comes to the average.
Which is 103. When it comes to the median, it's 6.5. So outliers don't make as much of a difference when you're working with the median. I'm going to undo this.
So you'll want to use the median calculation when you don't want to have drastic changes due to outliers. Now what if you do care about outliers? So let me change the value here for this calculation. You'll use the mid-range function.
Unfortunately, mid-range isn't a built-in function in Excel. You'll have to manually calculate this by taking the maximum value, adding it to the minimum value, and dividing it by two. So you'll have to actually write this out.
So I'm going to type equal, open parentheses, then type max. This is going to return the highest value out of all the values in that range. Close parentheses.
Then I'm going to add the min, open parentheses, and I'm going to select all the values. Close parentheses, close parentheses. And now I'm going to divide that by two.
So I'm taking two values, the max and the min, I'm adding them together, and I'm dividing it by two. When I press ENTER, I get 4.5. So looking at the values, that's probably what you would expect. I'm going to do the same thing with 10,20,30, and 40.
So all these values increase by increments of 10. Let's take a look and see what the mid-range is. Equals, open parentheses, MAX.
And then what I'm going to do is select 10,20,30, and 40. Close parentheses. I'm going to add that to the min of the same values.
Close parentheses, close parentheses. And then I'm going to divide that by two. Now this should get me a value that's somewhere between 20 and 30, because that's like the mid-range, like literally the mid-range.
When I press ENTER, I get 25. Now what if I go over to 40 and change that to 400? That drastically changes my result. Now my mid-range is 205, even though of the numbers that are available in that range, 10,20, and 30 are much closer together.
But the 400 has a drastic impact on my mid-range. So you'll want to use the mid-range when outliers are important to your result. So I'm going to undo that.
So I went from 25 to close to 205. Now let's take a look at the mode. The mode is a measure of frequency.
The mode seeks to answer the question, which value repeats the most? So I'm now going to go over to practice number four. I'm going to type equal MODE. Now this function has been updated, and MODE is now deprecated.
And you now have two new functions, MODE.SNGL and MODE.MULT. If you have a single repeating value, MODE.SNGL will find that, but if your data has multiple repeating values, then you'll want to use MODE.MULT. Now for this table, I'm going to use MODE.SNGL, and I'm going to select the values here.
When I press ENTER, I will see the value that repeats the most in this range. As I look carefully, I can see that that value is one. It's actually the only repeating value, which is why you would use MODE.SNGL.
I'll press ENTER, and there we go. And that is just the number one. Let me change that font color to black.
So what if you have multiple values that repeat? Well, then you'll want to use MODE.MULT. Now this is an array formula, and an array formula can display multiple values across multiple cells, even though your original function is written in one cell. So I'm selecting the range.
I see one repeats, two repeats, also three repeats. When I press ENTER, one, two, and three are represented in the result. The values that repeat are extended down the row.
So in our analysis for central tendency, we looked at mean, median, mode, and mid-range. Let's take a look at how we can use this when we're working with data where we want to calculate the central tendency of the population as it relates to the manufacturing data below. This is broken up across four weeks.
After we get our values, we're going to use conditional formatting to accentuate the data visualization to determine the best week. So the first thing we want to do is come up with the sum. Now I could select the entire range.
If I use the keyboard shortcut ALT equal, or on the Mac, command shift F, I will automatically get the total for everything. Now I want to get the mean. So I could select the first column.
The mean is the average. So I'll go to the auto sum function and choose average, and that drops it right there. And then I can just auto fill this across.
So I'll show you multiple ways to come up with the result. Now I want to get the max value. I could select one column and get max and then drag it across.
But what you may not know is if I select all four columns and then choose max, I don't have to worry about auto filling across because it will give me the total for all four weeks, the total max value. Now I want to do min. I could reselect the entire range, or I could use a keyboard shortcut.
If I hold down the Shift key and press the up arrow, well now I'm just selecting the four weeks. I'll click the dropdown here and choose min. I want to get the count of values.
Again, I'll hold down the Shift key, press the up arrow. I'll click the dropdown here and choose count. I want to get the median.
Now the median is different from the average. In order to get the median, we want to type median. So equal MEDIAN.
And then we just select the range. So that function exists in Excel. I'll press ENTER.
I can select that original value, hold down the Shift key, press the right arrow. The keyboard shortcut to copy the formula to the right is CTRL+R. Now I would like to get the mode. The mode checks to see which values repeat.
So another way to come up with the answer is to select all the blank cells first and then type my calculation. In this case, I'm going to type equal MODE. And for this, we're going to use MODE.SNGL.
And then I'm going to select the first range. I'm not going to lock any of the values here because I want this to be relative to the column that the calculation is being calculated from. If I press CTRL+Enter, now I have mode for all four columns.
I just want to double check to make sure the right column is being selected. It is. And then finally, we want to do mid-range.
I'll go here. I'll use the same method I just used, equal, open parentheses. Now it's going to be the maximum value plus the minimum value.
Luckily for me, max and minimum both exist already in this column. So I'll select max plus min, close parentheses, and then divide that by two. If I press CTRL+Enter, that will copy that formula all the way across and use the appropriate relative cells.
I'll press CTRL+Enter. And there we go. All right.
Now that we have all our values, we have one last step. We want to be able to use conditional formatting to find out which of the three weeks, which of the four weeks is best. So for that, we could use conditional formatting.
And for conditional formatting, we'll use color scales. So I'm going to select the sum. I'm going to go over to conditional formatting.
And I'm going to choose the first color scale. The first color scale highlights the highest values in green and the lowest values in red. So I want to know which of the four weeks is going to give me the highest value.
I'll choose the first color scale. That is week three. Because three, the value for three has a background color of green.
And that is actually the highest value out of all four. Mean, I'm also interested in the highest value. So I'm going to go over to conditional formatting.
I'll choose color scale. And I'll choose the first color scale. It looks like week three wins again.
Now I'll do max. When I want my high values, I definitely want the highest of the highs. So I'll go to color scale.
This is an example where week three did not win out, but it's actually pretty close to the highest value, which is week four. I'll choose min, the values for min. For that, I'll also want the first color scale.
Out of the lowest values, I want the highest low. Values don't matter here. That's a count.
They're equal to each other. There are 10 values. Median, again, I want the highest value.
I'll choose the first color scale. Mode, in this case, doesn't apply as much. Let's use it anyway.
I'm going to go to conditional formatting. I want to find the number that repeats the most. I'll choose color scale.
Choose the first one. And then finally, we want to use the mid-range. I'll click the dropdown for Conditional Formatting.
Go to color scale. Again, choose the first color scale. I want the highest mid-range.
Now when I look at all the values here, I can see that week three is the one that consistently produces the highest values. For the most part, even in situations where it doesn't produce the highest value such as max, it's pretty close.
It's just about seven points off from the highest value. So, this is a way to use conditional formatting to accentuate the best week based on the values for the sum, the mean, the max, the min, the median, mode, and mid-range. So, that taken all together covers the topic of measures of central tendency.
Thank you for watching.