Learn how linear regression can help you analyze attendance data and predict concession sales at sporting events. Discover how to use NumPy and Matplotlib to create clear visual insights through scatter plots and best-fit lines.
Key Insights
- Understand linear regression as plotting a best-fit line that minimizes the sum of squared distances from each data point to the line, providing a clear relationship between attendance and concession sales.
- Use NumPy arrays and the np.polyfit() method to efficiently calculate slope (m) and intercept (b) for a linear regression from given attendance and concession sales data.
- Create informative scatter plots with Matplotlib by clearly labeling axes, adjusting axis limits to enhance readability, and visually plotting a regression line to predict future sales based on attendance.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's load up a picture now. Remember, we did this with the bell curve, so we're going to just say we need the base path. We'll say base path.
We may have to go get that again. If you have it from the last file, you can copy and paste it. I closed that file, so it'll be just as easy to go in here. Also, just a reminder: when you've mounted your Google Drive, why would you do that? Because you want to go into this folder icon environment here and get stuff.
And what do we want to get? We want to find our folder, and we want to get our so-called base path, and then we could get a slash after that, and then we could say imgPath. We've got some picture in there, so we'll load it up just like we have some picture, you know, that bell curve picture, and make it a little bit smaller. That's kind of okay, so this is just fun.
It's not like a teachable graphic like the bell curve. This is just for fun, but we want to have more practice loading pictures, and also just this is a big topic we're going to get into. It's part of the name of the lesson.
So Matplotlib line and scatter, we just did that. Now we're getting into linear regression. In the line and scatter part, we learned how to make a line and a scatter and how to do it in one plot system, like one coordinate system.
So where might you see the lines and the dots coexisting? A very common use case is to make what's called a regression line. A regression line is the best fit through many dots, and the best fit is the line that minimizes the collective distance from all the points to the line. If you move the line up or down, or tilt it up or down, you'll get closer to some of the dots, but you'll tend to get farther away from more than you get close to. So this line minimizes the collective distance from the points to the line.
It actually minimizes the squares of the distances, the sum of the squares of the distances, but it's really just, think of it as, it's the closest line. If that was a road and these are all houses, this is the road that requires the shortest driveways collectively to get to the road from the houses. So what we're going to do is we've got this baseball scenario given or ballpark or stadium.
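To make the "sum of the squares" idea concrete, here is a minimal sketch. The five points are invented for illustration and are not the lesson's data, and the helper name sum_squared_distances is hypothetical. One detail worth noting: the distance that gets squared is the vertical gap between each point's y and the line's y at that same X. The sketch fits a line with np.polyfit and then checks that nudging the line up or tilting it makes the total worse.

```python
import numpy as np

# Hypothetical five points, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sum_squared_distances(m, b):
    """Sum of squared vertical gaps between each point and the line y = m*x + b."""
    return float(np.sum((y - (m * x + b)) ** 2))


# Degree 1 asks np.polyfit for a straight line; it returns slope, then intercept.
m, b = np.polyfit(x, y, 1)

best = sum_squared_distances(m, b)
shifted_up = sum_squared_distances(m, b + 0.5)  # same slope, line moved up
tilted = sum_squared_distances(m + 0.2, b)      # same intercept, steeper line

print("best fit:  ", best)
print("shifted up:", shifted_up)
print("tilted:    ", tilted)
```

Both the shifted and the tilted line should print a larger total than the best fit, which is the "shortest driveways" idea in numbers.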
If we had many attendance figures for games and we had the concession sales numbers for all the games, how many beers and hot dogs and whatever souvenirs that they sold per game, there'd be some kind of correlation, right? You'd pretty much be able to figure out that the more people who you have at the game, the more revenue you're going to generate. Concession sales will positively correlate, as they say, with attendance, right? If you have a low attendance, you're not getting the same concessions as if you have high attendance, common sense. Now, there could be some other variables that play in, like maybe a big game attracts, where the prices go higher and attracts more affluent fans.
They spend more, or maybe on weekends, people spend more because they're going to drink more or something. So there are some variables that are, it's not just a matter of how many people are there, right? But it's pretty close. So here we have a list.
These are hypothetical values for attendance and concessions for our imaginary Python, you know, the Pittsburgh Pythonics or whatever the name of the team is. So here's our attendance figures for the Poughkeepsie Pythonics and Poughkeepsie Pythons. I like that.
But, you know, maybe Python Southern would be better. Pensacola Pythons, right? Don't they have Pythons loose in the Everglades? We're going to get a list of numbers for plotting attendance versus concessions, concession sales for the Pensacola Pythons baseball team. That sounds like a minor league team, right? Affiliated with the Marlins or something.
Let's print. Gotta have fun with this stuff. So pardon my indulgence in this.
Now we're going to need arrays. I'm going to make some arrays here. And the reason we need arrays, NumPy arrays, out of these lists is we're going to use some fancy NumPy move to derive a line from all the individual points, like the 10, 12, 15.
What are the length? What is the length of this thing anyway? They're the same length, but we should make sure they're the same length. We do need the same number of items in each list, right? So as you plot for each game, you have a concessions figure. 14 each, good.
We've got two lists of 14 numbers, attendance, concessions. Just eyeballing it, it looks like roughly 30 bucks per person per game, give or take. And then we can print the shape.
And these are both vectors, right? Concessions, attendance. Why would it give me an error? It doesn't like attendance. Oh, list doesn't have an attribute shape.
Of course. You got to make your NumPy arrays out of that. Make NumPy arrays.
And we're not making the NumPy arrays to get the shape, which we know is just a vector. We're doing the NumPy arrays because there's a NumPy operation, a method that's going to require these values coming in as arrays, not plain lists, to take all the dots and from that figure out the line. So it's going to take all, we're going to feed it in many dots, just like this.
And then it's going to calculate the corresponding line using the y equals mx plus b formula for a line, which you may remember from high school. We're making NumPy arrays from the lists. We're going to say attendance array, abbreviate a little bit.
Attend array equals np dot array, if you recall. That's how you make an array from a list. What's the difference between the list and array? Aside from no commas in the output, you've got shape, you've got the ability to reshape, and you've got dimensionality, the concessions array, 14, nothing, 14, nothing, these vectors.
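The list-versus-array step above can be sketched roughly as follows. The 14 attendance and concession figures are hypothetical stand-ins, since the lesson's actual numbers are not shown in this preview, and the variable names attend_arr and concesh_arr are assumed spellings of the names spoken in the video.

```python
import numpy as np

# Hypothetical figures for 14 games (not the lesson's real data).
attendance = [14200, 15800, 17500, 18900, 20100, 21600, 22800,
              24300, 25700, 27200, 28600, 30100, 31500, 33000]
concessions = [430000, 455000, 540000, 560000, 610000, 640000, 700000,
               715000, 780000, 800000, 870000, 890000, 950000, 975000]

# Both lists need the same number of items: one concessions figure per game.
print(len(attendance), len(concessions))  # 14 14

# A plain list has no .shape attribute, which is what caused the error.
print(hasattr(attendance, "shape"))  # False

# np.array() builds an array from a list; a 14-item list becomes a vector.
attend_arr = np.array(attendance)
concesh_arr = np.array(concessions)
print(attend_arr.shape, concesh_arr.shape)  # (14,) (14,)
```

The shape (14,) is the "14, nothing" mentioned above: one dimension with 14 items.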
All righty, let's move on. So this is kind of the statty, mathy lesson. Y equals mx plus b is the famous slope-intercept equation of a line.
I mean, if you, hopefully you recognize this. You know, you don't have to be a math whiz to use this stuff, but I don't know how you got to 10th grade without this. And I mean, if they drummed anything in your head, it would have been something like this, even if you don't really get, I mean, you should recognize it.
If you took high school math, I don't care how long it's been, that should at least ring a bell, I would hope. But I'll talk you through it anyway. If we know y and we know m and we know b, we can calculate X. If we know, so in other words, in this case, we've got four variables.
If you know three of them, you could figure out the other one. You could solve for the other variable. So there is a method of NumPy.
We've already used three NumPy methods in this lesson. We used np.argmin to get the index of the minimum item in the stock prices, right? There's our stock prices. There's the minimum, but where is it? At what index? We used np.argmin. That's a NumPy method.
Takes a list, returns the index of the minimum value. There's another NumPy method, argmax, takes a list, returns the index of the maximum value. And then there's np.array, which is used to make arrays from lists.
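As a quick refresher on those three methods, here is a small sketch. The stock prices are made-up values, not the ones used earlier in the course.

```python
import numpy as np

# Hypothetical stock prices, invented for this refresher.
stock_prices = [31.5, 29.8, 33.2, 28.4, 35.1]

low_idx = np.argmin(stock_prices)   # index of the lowest price
high_idx = np.argmax(stock_prices)  # index of the highest price
print(low_idx, stock_prices[low_idx])    # 3 28.4
print(high_idx, stock_prices[high_idx])  # 4 35.1

# np.array() turns the list into a NumPy array.
prices_arr = np.array(stock_prices)
print(prices_arr.shape)  # (5,)
```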
Now we're on to another NumPy method, np.polyfit. And what it does, it takes an X array and a y array. That's why we had to make arrays from our, we couldn't just leave our attendance and concessions as raw lists, because it's NumPy, right? So it wants what it likes. It likes its NumPy array format.
And then it takes the number of slopes that you want in the line, which NumPy calls the degree. Now, if you're trying to do a classic linear regression, it's called linear, right? It's a straight line. So there's only one slope.
Now, if it looked like a Nike swoosh or something, that would be two slopes, because it changes direction. It goes in this direction, then it hits this min, and then it switches to another direction. So np.polyfit takes those three arguments, and it returns the m and b of the line that you could make through those points.
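A small sketch of what that third argument does, using invented points rather than anything from the lesson. In NumPy's documentation the argument is called deg, the degree of the polynomial. Degree 1 gives back two numbers (m and b), while degree 2, the swoosh-like case, gives back three coefficients for a curve of the form a*x**2 + b*x + c.

```python
import numpy as np

# Hypothetical points that sit exactly on the curve y = x**2 + 1.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2 + 1

# Degree 1: a straight line, so two values come back (slope, intercept).
line_coeffs = np.polyfit(x, y, 1)
print(len(line_coeffs))  # 2

# Degree 2: a curve that can change direction, so three values come back.
curve_coeffs = np.polyfit(x, y, 2)
print(len(curve_coeffs))  # 3
print(np.round(curve_coeffs, 6))  # approximately 1, 0, 1
```

For the attendance and concessions chart, degree 1 is the one that matters, because the goal is a single straight regression line.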
And then what you do is you do a plot, you do a plot where you feed in the X and the y, because you can feed in the X and the y in the plot. You don't, you know, it's optional, but you could feed it in. And the X is your attendance.
And the y, though, is not going to be your concessions, because we don't want dots. Your y is going to be solved in terms of X, where X is the array. And the formula is this mx plus b. So you'll feed the plot your X and your y. When it comes to the y, you feed it the formula to solve for y, and the result is this continuous line.
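Here is a minimal sketch of that idea, without the chart. The data is hypothetical and the names attend_arr and concesh_arr are assumed. One hedge on the "must be arrays" point: according to the NumPy documentation, np.polyfit accepts any array-like input, so plain lists would work for that call too. Where the arrays really earn their keep is in m * attend_arr + b, because multiplying a plain list by a float raises a TypeError.

```python
import numpy as np

# Hypothetical figures for 14 games (not the lesson's real data).
attend_arr = np.array([14200, 15800, 17500, 18900, 20100, 21600, 22800,
                       24300, 25700, 27200, 28600, 30100, 31500, 33000])
concesh_arr = np.array([430000, 455000, 540000, 560000, 610000, 640000, 700000,
                        715000, 780000, 800000, 870000, 890000, 950000, 975000])

# X array, y array, and degree 1 for a straight line.
m, b = np.polyfit(attend_arr, concesh_arr, 1)
print("slope m:", m)
print("intercept b:", b)

# y = mx + b, applied to every attendance value at once.
line_y = m * attend_arr + b
print(line_y.shape)  # (14,)
```

With this made-up data the slope should land near 30, which lines up with the "roughly 30 bucks per person" eyeball estimate.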
Let's get the best fit graphic. We're going to say image. Let's just directly concatenate.
We don't need a variable for the image. Okay, that's a little bit big. So what shall we set it at? Let's go with 700.
Okay, so here you have, you know, it's linear regression, what we're looking at, but this is broken down more. So you've got these five points plotted in xy space. Scatter.
We did that. We did exactly that, right? Here's your min and your max. It's two points in xy space.
Okay, fine. Here's five points in xy space. This line is the best fit line calculated.
Now that's what the NumPy polyfit will do for us. It'll take the knowns, right, the X, the five dots in this case, and then calculate the best fit to run through it. And then we plot that as a separate thing, right? We have, we scatter our five dots and then plot the regression line as xy, where y is represented as mx plus b. We can calculate the y's continuously and plot a line.
So how does the best fit line work? As I said, it's not just the minimum distance, right? The shortest driveways. If you took the squares of the distances and added all that up, this is the line where that sum is the minimum.
Okay. We're going to make a scatter plot using the two arrays, concession sales and attendance. The expected result is 14 scatter dots.
Now for this, it does not matter if you use the NumPy array version or the list version of the data, right? The original list. We'll say plt.scatter. We'll just say attendance, concessions. That's enough to get started.
Run that. There you go. Now the dots are too big.
We'll set the size to five. It's fine. Okay.
So there's our attendance and concessions. And again, it's kind of squeezed. The axis, the y-values don't go any higher than they need to be.
So let's set those. Let's give a little breathing room above and below. We'll go to the ylim.
We'll widen and increase the height of the chart. We'll say plt.xlim to widen it. We'll move a little bit to the left and a little bit to the right.
We'll go 12,000 and 35,000, just a little bit bigger. Okay. And then the y, we don't want the dots squeezing the bottom and top, right? Filling up the bottom and top like that.
A little bit of extra breathing room on the y. We'll increase the height with the y-limit. We'll make 30,000 the min y, because really the min is looking like about 40. And we'll go 90 for the max.
This is actually hundreds of thousands, by the way. Make sure you get enough zeros here. You need five zeros.
300,000 to 990,000, we'll call it. There. Not too pinched, not too much gap.
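The scatter plot with its breathing room might look something like this sketch. The 14 data points are hypothetical, chosen only so they fit inside the limits quoted above. The call to plt.subplots() is an addition of this sketch, not something shown in the video; it just gives a handle on the axes.

```python
import matplotlib.pyplot as plt

# Hypothetical figures for 14 games (not the lesson's real data).
attendance = [14200, 15800, 17500, 18900, 20100, 21600, 22800,
              24300, 25700, 27200, 28600, 30100, 31500, 33000]
concessions = [430000, 455000, 540000, 560000, 610000, 640000, 700000,
               715000, 780000, 800000, 870000, 890000, 950000, 975000]

fig, ax = plt.subplots()

# s sets the dot size; 5 keeps the dots small.
plt.scatter(attendance, concessions, s=5)

# Widen the chart a little to the left and right of the data.
plt.xlim(12_000, 35_000)

# Add room below and above the dots; note the five zeros.
plt.ylim(300_000, 990_000)

plt.show()
```

For a scatter plot either the lists or the NumPy arrays will do, as mentioned above.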
Nice balance. All right. Let's take the name.
All right. The title. plt.title. Concessions.
Pensacola Pythons. Actually, I should put the name in it first. There you go.
We could do the color. There we go. The dots can be a smidge bigger, but the bigger the dots, the less accurate, right? But still.
All right. Get that working before we continue. And the next thing we want to do is let's label the axes.
So the X-axis, which is plt.xlabel. The X-axis is going to say fans. No, paid attendance, right? Players and umpires don't count as attendance. Then the ylabel, concession sales.
No, let me say revenue. Concession sales in USD. U.S. dollars? USD? USD is fine.
Okay. And we could do, it's easier to read the labels if they've got their own color, in my opinion. But you don't want to go crazy with colors.
Keep it, you know, we've already got black, blue, and orange. That's probably good enough. So next, what we want to do is we're going to plot the linear regression.
We increase the range of the y and the X. I already said that. Okay, fine. Right? Title and label the chart.
Right. Y to increase height of chart. Plot linear regression line.
Attendance and concessions need to be arrays. That's what we did. That's why we made them NumPy arrays.
Now there is a formula, y equals mx plus b, that gives you the slope. Well, it's the, it's the line formula, where m is the slope and b is the y-intercept. And X and y are the xy values at any point.
X and y, right? Point on a line. M is the slope, b is the y-intercept, where the line crosses. If we want this formula, where are we going to get the m and the b from? That's where this NumPy method polyfit comes in.
It returns m and b. Obtain m and b. The np.polyfit method returns m, b. So let's try this. We're going to say m, b. That's what it returns. np.polyfit. And it's possible that a method can return more than one thing.
The order usually matters, so you put them in the right order. And what we want to feed in is X, y, and the number of slopes.
All right. So our X, remember, has to be a NumPy array. It's like our X array and our y array.
So that's going to be attend array comma concesh array. And we want one slope because we want a straight line. That'll log the m and the b in.
And then we're going to plot the regression line. The trick is you use plt.plot with X and y, except here, instead of providing y directly, we provide mx plus b, right? Which is how you solve for y. Y equals mx plus b. So instead of putting y here, you put mx plus b. You can't just multiply them by putting them next to each other.
You need the star multiplication operator there. And then you have your slope. X, yes, the attendance.
Y is the y values calculated in terms of X. We'll say plt.plot, attendance array, comma, m, right? We have m from the polyfit method, which it returned. m times attendance array again. That's X, right? We have X again.
Plus b comma one slope. Color, I don't know, purple. There.
Oral again. Maybe we don't want this key. Fewer is more, right? The line weight's too thick, so we'll say 0.25. That line's pretty thick.
A nice thin line. There it is. Your best fit line, real stat.
This is a real statty stats mathy lesson. Math isn't too complex. This is an important stat thing, and it's very useful to know how to do.
And of course, this being data science, we want a nice statty flavor here for some of this stuff. plt.show, show me the regression line. There.
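Putting the pieces together, the finished chart could be sketched like this. The data is hypothetical, the array names attend_arr and concesh_arr are assumed, and the exact title wording is a guess at what was typed in the video. The plt.subplots() call is an addition of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical figures for 14 games (not the lesson's real data).
attend_arr = np.array([14200, 15800, 17500, 18900, 20100, 21600, 22800,
                       24300, 25700, 27200, 28600, 30100, 31500, 33000])
concesh_arr = np.array([430000, 455000, 540000, 560000, 610000, 640000, 700000,
                        715000, 780000, 800000, 870000, 890000, 950000, 975000])

# Slope and intercept of the best fit straight line (degree 1).
m, b = np.polyfit(attend_arr, concesh_arr, 1)

fig, ax = plt.subplots()

# The 14 dots.
plt.scatter(attend_arr, concesh_arr, s=5)

# The regression line: X is attendance, y is m * X + b.
plt.plot(attend_arr, m * attend_arr + b, color="purple", linewidth=0.25)

# Breathing room around the data.
plt.xlim(12_000, 35_000)
plt.ylim(300_000, 990_000)

# Title and axis labels (title wording is assumed).
plt.title("Pensacola Pythons: Attendance vs. Concessions")
plt.xlabel("Paid Attendance")
plt.ylabel("Concession Sales (USD)")

plt.show()
```

The scatter call draws the known games, and the plot call draws the line that np.polyfit worked out from them, both in the same coordinate system.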
Done. Put a bow on it. Pensacola Pythons attendance versus concessions.
If we were given a new attendance, so it's with this, we can try to predict, right? We have 25,000. Okay. You could just look it up on the line.
Oh, about 700,000. All righty. This lesson is done.
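Instead of reading the value off the chart by eye, the same m and b can be used to calculate a prediction. This is a sketch with hypothetical data, so the number it prints will not match the lesson's chart exactly; the names new_attendance and predicted_sales are made up for the example.

```python
import numpy as np

# Hypothetical figures for 14 games (not the lesson's real data).
attend_arr = np.array([14200, 15800, 17500, 18900, 20100, 21600, 22800,
                       24300, 25700, 27200, 28600, 30100, 31500, 33000])
concesh_arr = np.array([430000, 455000, 540000, 560000, 610000, 640000, 700000,
                        715000, 780000, 800000, 870000, 890000, 950000, 975000])

m, b = np.polyfit(attend_arr, concesh_arr, 1)

# Plug a new attendance figure into y = mx + b.
new_attendance = 25_000
predicted_sales = m * new_attendance + b
print(f"Predicted concession sales: ${predicted_sales:,.0f}")
```

The prediction is only as good as the straight-line assumption, so it is most trustworthy for attendance figures inside the range of the games already plotted.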
Thank you very much. We're on to the next lesson 10, which will conclude this introductory Python programming and data science course. Thanks for sticking with it.
We'll see you in the next lesson.