Correlation and the Relationship Between Variables Explained

Analyze the strength and direction of linear relationships between variables using correlation coefficients.

Uncover the key concepts behind correlation and learn how it measures the linear relationship between variables using correlation coefficients. Gain practical insights into analyzing correlations efficiently with Excel.

Key Insights

  • Correlation measures the strength of linear relationships between two variables, represented by a correlation coefficient (r) ranging from -1 to +1. Positive correlations indicate variables increase or decrease together (e.g., high humidity and rain), negative correlations show inverse relationships (e.g., longer school days and declining student attention), while values near zero imply weak or no correlation.
  • Using Excel's CORREL function simplifies calculating correlation coefficients, as demonstrated with restaurant data showing a strong positive correlation (r = 0.993) between the number of diners and duration of dining.
  • The article demonstrates Excel's Data Analysis ToolPak to compare impact variables, revealing a stronger correlation between the number of employees and revenue (0.87) compared to payroll hours (0.75), emphasizing that correlation does not imply causation.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Correlation. Correlation examines the presence of a linear relationship between different variables. Correlations are detected by calculating a correlation coefficient.

A correlation coefficient, in this case r, describes the strength of a relationship between two variables. Now the r coefficient can range from positive 1 to negative 1. And so there are three types of values that we're talking about. A positive correlation has a correlation value of plus 1. This means where one variable increases, the second variable increases.

Or that could be the opposite, where one value decreases, the other value decreases. The closer to plus 1, the stronger the linear relationship. What's an example? Here are two examples of a strong correlation.

High humidity accompanies rain. There is a direct correlation. High volume of traffic on the roads accompanies more traffic accidents.

There is a direct correlation of plus 1 between those two events. A negative correlation has a correlation value close to negative 1. This means where one variable increases, the second variable decreases. So this is where the values, the activities, and the values for them diverge.

Data Analytics Certificate: Live & Hands-on, In NYC or Online, 0% Financing, 1-on-1 Mentoring, Free Retake, Job Prep. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

So the closer to negative 1, the stronger the negative linear relationship. So what are examples of that? Here's one example. In school, the longer the school day, the less the attention span of the student.

So the longer the school day is, the less attention span you're going to get from the students in the class. The other type of negative correlation is the older your car gets, the less chance it's going to be drivable. And so that's a negative 1 value for the correlation.

Finally, we have the example of a weak correlation. A weak correlation has a correlation value close to 0. This means that there is little to no relationship between the variables. Here's an example.

Kids that eat strawberry ice cream on Sunday don't go to school on Monday. There is no relationship between those two events. Someone could be making something up.

Another type of correlation where there's no direct relationship, although you might think there is, when the groundhog sees his shadow six more weeks of winter versus an early spring. So that has not been scientifically proven, although it is a kind of weather pastime that we watch and may believe that there is some kind of correlation when there's actually not. So for our exercise, we're going to take a look at a correlation between people dining at a restaurant and the average minutes spent at that restaurant.

So one person at the restaurant, we will say, spends about 45 minutes at the table. If two people are at that table and are having dinner, lunch, or breakfast at the restaurant, they're going to spend 55 minutes at the table. Three people, we say 57.

Four people, 62. We want to know, is there a relationship between the number of people dining at a restaurant and the total amount of time spent at the restaurant? We can do the math to calculate this. It's pretty involved.

As you can see here, we're taking the number of people and the amount of minutes spent at the restaurant. We're going to square both the number of people and the amount of time spent at the restaurant. Now for these four columns, we're adding up all the values and getting numbers that are orange, blue, light green, and gray.

Then we're going to multiply X and Y to come up with a value. Let me just confirm if that's the case. Yes, we're multiplying X and Y to come up with a value.

Then we perform all these calculations over on the right. R is then equal to A divided by B times C, which gives us a correlation of 0.993. If you round that up or just round it, that's going to return one. There is a strong correlation between the number of people at a restaurant and how much time they spend there.

If you don't want to do the math, Microsoft has made it really simple for you. In Excel, you can simply enter the corral function, and you will get exactly the same value. I'll type C-O-R-R-E-L.

I'll press TAB, and all I need to do is select array number one, in this case the X values. I'll enter a comma, and then I'll select array number two, which is the Y values. I'll press ENTER, and I get the same result, 0.993. To answer the question, what does this mean? R implies a positive correlation, since the value is close to a positive one.

This means where X increases, Y also increases. Now, a very important point. However, we cannot be sure that there is a direct relationship between those variables.

It's likely, but it's not 100%. So, this brings up the phrase that you may have heard of when it comes to statistics, and that is correlation does not mean causality. It could simply be due to chance.

There is that possibility. So, we'll do a couple of more examples. So, we want to check multiple variables.

We want to see what is the impact of payroll hours and employees on revenue. Does payroll hours impact revenue more than employees? So, let's do some analysis. Now, this is something that you can use the data analysis tool pack to calculate for you.

I'm not going to use the corel function this time. This time, I'm going to go over to the data tab. I'll choose data analysis, and I want to look for the correlation.

I want to look for the correlation tool here in the data analysis tool pack. I'll click okay. So, now I want to select my input range.

My input range is going to be the revenue, the payroll hours, and the employees. I'll include the titles. I just need to indicate that I do have labels in the first row, and then I'm going to choose my output range.

So, I'm going to go over here, and my output range is going to be this cell right here. I'm going to click okay. Correlation is having trouble to offset the input output reference.

So, I'm going to click okay. I'm going to try that again. Perhaps I'm next to some cells that may be blocking the information from showing.

So, I'm going to go over to the side, and I'll choose another range. So, I'll go right here, side by side. Ultimately, I want to be able to create a table that looks exactly like what I have here on the left.

Okay. So, oh, I see what happened. I'm only selecting F68.

So, for some reason, I lost my range. So, I'm going to select D56 to F68. There we go.

Okay. Now, I'll go back to my original output range. Okay.

Perfect. So, I have my range. The information is arranged in columns.

My labels are in the first row, and I'm outputting it to H67. I will now click okay, and there are my values. Now, it doesn't look like what I have here.

So, there are a couple of things that I could do to make it look like that. First, let me grab this and just move it up here so it's side by side. One of the first things I might want to do is change the formatting here.

I want to show these values as displaying only two decimals. So, I'll click the drop down here and choose number format. That goes a long way to making this look just like what I have here.

The other thing that I need to do is fill in the values for payroll hours, employees, and employees again. Now, the values that are here are actually the same values that I have here. They're just transposed.

I can manually type them in, or I could do this trick where I copy the values and I transpose them in the same area. So, I'm going to copy this. I'm going to go over here and I'm going to transpose by pressing control ALT V on the PC, on the Mac, command ALT V, and I'll choose transpose.

These are the numbers that I need up here. So, what I can now do is copy this. I'm going to select the range here and I'm going to bring up pay special.

So, I want to make sure to choose skip blanks. Now, if I don't want to bring in the formatting that I have here with the underline, I'm just going to choose values. I don't want to paste everything about what I copied, just the values.

And then now, I'll click okay, and it pastes just the values that were missing in the table. Now, I can apply the color scale by selecting these values here. I'll head over to conditional formatting.

I'll choose color scale, and I want to choose the first type of color scale where the highest value is going to be green. So, now when I go to evaluate this table, what I can say is when it comes to revenue, there is a stronger correlation between employees and revenue. Because if I look at this revenue column,.87 is closer to one than payroll hours, which is.75. So, the assumption you'll also need to make is that the kind of employees who are being hired are the kind of employees who will make a difference in revenue.

So, it's not necessarily that you're going to hire some back office employees who are not going to make any difference in the revenue. We have one more exercise here. This time, we're taking a look at the impact among shifts.

I'll do this one briefly. You can also do this one on your own. So, I'm going to head over to the data tab.

I'll choose data analysis tool pack. Correlation is selected. I'll click okay.

The range is different this time. So, I'm going to remove or click on this button next to the range and select a new one. Then, the information is in columns.

I am using labels. The output range is going to be different. I'm going to go over here.

I'll backspace here. I'll put the output range right down here. Now, I can click okay.

There are my results. I will change the decimals to two decimal places. I can do that whole thing related to, sorry, I clicked on, I can do that whole thing all over again that relates to copying and pasting.

So, I'm going to copy here. I'll go here, bring up a special, choose transpose. Then, I will copy this, select the range, bring up a special, and then make sure to choose skip blanks and values.

I'll click okay. Now, if I want to use conditional formatting color scale, again, I'll use number one. We're looking to see if there's any significant difference between the shifts.

We don't have any values that are really close to one. So, there is no correlation between one shift and another. So, there's almost no relationship.

All right. I can delete this. So, this section has been about reviewing the possibility of there being a correlation between two independent events.

What you'll need to remember is that correlation does not mean causality. Even though there may be a strong correlation, it could also simply be due to the chance that makes it look like there is a relationship between those two independent events. So, this is correlation.

Thank you for watching.

Garfield Stinvil

Garfield is an experienced software trainer with over 16 years of real-world professional experience. He started as a data analyst with a Wall Street real estate investment company & continued working in the professional development department at New York Road Runners Organization before working at Noble Desktop. He enjoys bringing humor to whatever he teaches and loves conveying ideas in novel ways that help others learn more efficiently.

Since starting his professional training career in 2016, he has worked with several corporate clients including Adobe, HBO, Amazon, Yelp, Mitsubishi, WeWork, Michael Kors, Christian Dior, and Hermès. 

Outside of work, his hobbies include rescuing & archiving at-risk artistic online media using his database management skills.

More articles by Garfield Stinvil

How to Learn Data Analytics

Master data analytics with hands-on training. Data analytics involves the process of drawing insights from data analysis and presenting them to leaders and stakeholders.

Yelp Facebook LinkedIn YouTube Twitter Instagram