Predict concession sales at baseball games using linear regression. Learn how Python and NumPy simplify creating accurate prediction models from attendance data.
Key Insights
- Utilizing linear regression, the article demonstrates predicting concession sales based on attendance data at baseball games, highlighting practical data science applications.
- The article covers using Python's NumPy library, specifically the np.polyfit() method, to calculate slope (m) and intercept (b) values for the regression line.
- Guidance is provided on plotting data visually, explaining scatter plots for raw data points and line plots for representing the best-fit regression line effectively.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
We're going to run a linear regression, which again means finding a predicted best-fit line (it's a line, and that's why it's called a linear regression), on some more realistic data: attendance and concessions at a baseball game. If the ballpark knows that attendance will be a certain amount today, how much should it expect to receive in concessions? That's the kind of real-world question a lot of data science solves.
Now, let's run this as is to make sure we've got attendance and concessions loaded, and let's make sure the two lists have the same number of values. If we're going to plot them, we can't have an attendance without a concession or a concession without an attendance. So let's check the length of attendance and the length of concessions.
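As a rough sketch of that length check: the numbers below are invented for illustration and are not the course's data set; only the variable names `attendance` and `concessions` come from the lesson.

```python
# Invented illustrative values, NOT the course's actual data.
attendance = [8000, 9500, 11000, 12500, 14000, 15500, 17000,
              18500, 20000, 21500, 23000, 24500, 26000, 27500]
concessions = [40500, 47000, 56000, 61500, 71000, 76500, 86000,
               91500, 101000, 106500, 116000, 121500, 131000, 136500]

# Every attendance figure needs a matching concessions figure.
print(len(attendance), len(concessions))  # 14 14

# Fail early if the two lists ever get out of sync.
assert len(attendance) == len(concessions), "lists must be the same length"
```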
Yep, they're both 14, great. All right, next we've got the code for a scatter plot, which means dots, right? It's going to give us one dot for each attendance and concessions pair, similar to what we had earlier.
Remember the x and the y, but this time the x is attendance, and the y is concessions. Let's execute this as is. We'll play with this more in a minute.
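A scatter plot along these lines might look like the sketch below. The data values are invented, the `from matplotlib import pyplot` import style is a guess at how the notebook is set up, and the `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; not needed in a notebook
from matplotlib import pyplot

# Invented illustrative values, NOT the course's actual data.
attendance = [8000, 12000, 15000, 20000, 24000, 30000]
concessions = [41000, 59500, 76000, 98500, 121000, 149000]

# One dot per (attendance, concessions) pair: x is attendance, y is concessions.
points = pyplot.scatter(attendance, concessions)
pyplot.xlabel("Attendance")
pyplot.ylabel("Concessions")
```

In a notebook the figure would normally render when the cell runs; in a script, `pyplot.show()` or `pyplot.savefig(...)` would be needed to see it.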
These dots represent our data. As humans, we can look at them, imagine a little line through them, and think about what would be a good fit. Then, if the attendance is right here, what will concessions be? We can eyeball it and say it's around here somewhere, but predictions should be more accurate than that. We can do better than human eyeballing.
So instead, we're going to draw the line that best fits this data, the one that minimizes the overall (squared) distance between the line and the dots. That is a linear regression. So here's how we're going to do that. First, we're going to get m and b values.
These come from the famous equation y equals mx plus b. That's how you graph a line, right? m is the slope: as x increases, how much does y increase or decrease? And b is the intercept, or offset: where does the line cross the y-axis? What is y when x is zero? Okay. We need those m and b values, and fortunately, there's a great NumPy function for that. So let's take a look at it.
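To make the roles of m and b concrete, here is a small sketch. The function name `predict_concessions` and the slope and intercept values are hypothetical, chosen only to show how the equation is used.

```python
def predict_concessions(attendance, m, b):
    """Return the y value on the line y = m*x + b for a given attendance x."""
    return m * attendance + b

m = 5.0      # hypothetical slope: each extra attendee adds 5 to concessions
b = 2000.0   # hypothetical intercept: the value of y when x is zero

print(predict_concessions(0, m, b))      # 2000.0
print(predict_concessions(10000, m, b))  # 52000.0
```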
Again, our x is the attendance, our y is concessions. We're going to be able to get concessions in terms of attendance using m and b. When we run np.polyfit, we pass it attendance and concessions. That's our x and y. We also need to tell it the degree of the polynomial to fit.
A straight line is degree one, so we pass 1. And this will actually give us back a NumPy array that we can unpack just like a tuple.
It'll give us back two values, the m and the b. We can declare those variables as just m and b. It's the standard. Okay, now that we've got that, we can plot the line through the dots using pyplot.
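The call described here could be sketched as follows. The data values are invented, and the attendance figure of 18,000 used for the prediction is just an example.

```python
import numpy as np

# Invented illustrative values, NOT the course's actual data.
attendance = [8000, 12000, 15000, 20000, 24000, 30000]
concessions = [41000, 59500, 76000, 98500, 121000, 149000]

# The third argument is the polynomial degree; 1 means a straight line.
# polyfit returns coefficients from the highest power down: [slope, intercept].
m, b = np.polyfit(attendance, concessions, 1)
print(f"slope m = {m:.4f}, intercept b = {b:.2f}")

# Expected concessions for a hypothetical attendance of 18,000.
expected = m * 18000 + b
print(f"expected concessions: {expected:.2f}")
```

Multiplying m by a single number works fine here; applying it to the whole list at once is a different matter.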
We'll say pyplot, give me a line plot. That's just what the plot method does: it draws a line. We give it an x, attendance, and we give it a y, and y is going to be our predicted concessions.
What is this y? We want a y plotted for every x, so it's not the concessions list itself. It should be m times attendance plus b. Our y will be m times x plus b, using the numbers that polyfit gave us to graph this line.
And finally, we'll make it look nice. We'll make the line red and not too thick. And there'll be a little error.
And the reason there's an error here, and we can fix it, is that this kind of vector operation doesn't work on regular Python lists: you can't multiply a whole list by a decimal number like m. What we're going to do next is make a DataFrame, so that we can say, okay, I want every value in the attendance series times m, with b added to each one. All right, that'll be in the next video.
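Here is a sketch of the error and of one possible workaround. The data values are invented, and converting the list to a NumPy array is a swapped-in alternative for illustration; the course itself goes on to use a DataFrame.

```python
import numpy as np

# Invented illustrative values, NOT the course's actual data.
attendance = [8000, 12000, 15000, 20000, 24000, 30000]
concessions = [41000, 59500, 76000, 98500, 121000, 149000]

m, b = np.polyfit(attendance, concessions, 1)

try:
    # attendance is a plain Python list, so element-wise math is not defined.
    predicted = m * attendance + b
except TypeError as err:
    print("List arithmetic failed:", err)

# Alternative shown for illustration only: a NumPy array supports
# "multiply every element by m, then add b" in a single expression.
attendance_arr = np.array(attendance)
predicted = m * attendance_arr + b
print(predicted.shape)  # (6,)
```

An array like `predicted` could then be passed as the y argument of the line plot, for example `pyplot.plot(attendance, predicted, color="red", linewidth=1)`, where the line width of 1 is an arbitrary choice.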