Explore how SQL window functions solve the limitations of GROUP BY, allowing you to retain detailed data insights. Learn the practical method of using subqueries to filter results based on windowed aggregations effectively.
Key Insights
- Demonstrates how standard SQL GROUP BY clauses lose detailed records, while window functions with the OVER and PARTITION BY clauses retain detailed data within grouped results.
- Illustrates that aliases created in the SELECT statement using window functions cannot be referenced in the WHERE clause directly, necessitating the use of subqueries to handle this limitation.
- Highlights that the OVER clause distinguishes a window function, allowing for flexible aggregation either across the entire dataset or partitioned subsets, thus enhancing SQL query capabilities beyond simple GROUP BY aggregations.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
The goal here was, first of all, to be in the other database, switching over to game shows, double-clicking on public there, looking at my Jeopardy table here. All right, so the situation was that we wanted to find, on any given air date, what is the highest point-value question, and we want to see the questions, not just the point values. And so the problem with GROUP BY was we can GROUP BY air date and see the highest point value, but because you do a GROUP BY, you don't see what goes into creating those groups.
That was a problem with GROUP BY. We want to group, but I just don't want to lose what goes into those groups. And the way I think of it is, if you put your data into different groups, you're putting them into different buckets, right? And a window function is like making those buckets clear—like a window that you can see into those buckets—and you can actually see the data that's in there.
So just again, just to review, if I say to GROUP BY, so we can see why GROUP BY is not enough, if I say to GROUP BY air date, and I show people those groups, and I say, let's go find the MAX value, I do see the highest point values per air date, but I lost all the questions. I lost all the other information. I can't see what goes into that calculation.
It's not what I want. I want to see this stuff and this information. So instead of doing that, I'm going to go back.
I'm going to take the data that we have, and I just want to add a column so I can keep all of this. We're just going to add it over there using an OVER function, right? You're going to add it over there. So I'm going to add a new column.
So I'm going to keep all the columns that I have, and I want to do the MAX value, which, if I just try to do that, will fail, right? Because MAX value produces one single value and all of those rows will not work with just one row. If I just do that—failure—because you'd have to do it in a GROUP BY. Or it also could say, or a window function. It could say that I'm going to do it as a window function. So I'm going to say, do it over top of the data.
Now, that does add a column, but as I've done it this way, it just does it as one giant group because I didn't say to GROUP BY—or in this case, PARTITION BY. So it just does one calculation for every question everywhere. That's the highest point value of any question in the history of all of this stuff.
Now, I want to PARTITION BY because I can't use GROUP BY again, because we've already used GROUP BY for some other feature. But functionally, it's the same. The same thing.
I want to PARTITION BY the air dates because I want the MAX value per date. Now, on any given date—let's say I have this air date—for all of these air dates, the highest point value is 1,000. Then if I go to another air date, eventually I'll get to one.
Okay. So this air date is also 1,000. Go to another air date.
Okay. Here's the 12th. Ah, 1,500.
Now, when I look at these questions, some of them are matching. Some of them are not. I only want to see the questions that have a point value that is the same as the MAX value.
So I'm going to want to refer to this. So I'm going to say, I'm going to show all the columns, then this column, and I'll let this be known as FoundMax. So I'll name that my FoundMax.
Okay. I would like to add WHERE this is equal to value. Like, I'd like to say WHERE value = FoundMax. And I'd like to try to use this.
I know it's going to not work, but I want you to see. And it says, "Hey, that doesn't work, " because that alias has not yet been defined. Because WHERE happens before SELECT? When we're looking at the execution order, see how WHERE happens before SELECT?
We're like, okay, well, if I can't use the alias, maybe I can use the column definition. Maybe I could use that. Then we get a different error.
Window functions are not allowed in WHERE. That's my biggest issue—yeah, I can't use it in WHERE for the same issue. Because that window function is what would create it.
That window function is created in the SELECT. When we're doing the WHERE, that new column has not been added yet. So since that new column doesn't exist at the time of the WHERE, it says, "Hey, I can't use that because I'm going to make that column, but I haven't made it yet."
So I'm going to go back. I'm going to move that down. And what I'm going to do is I'm going to say, let's put this into a parent query.
Let's put it as a subquery that I'll SELECT everything FROM. And I'm going to wrap parentheses around this to treat it as a table. And if we treat that as a table, and now we're nesting that execution order—
So it has its own execution order that will be finished and done, baked in. And then to this parent query, that just looks like a fully formed normal table that's already gone through its own execution order. So this starts over again and says, "Hey, to me, that's just a normal-looking table."
Don't know what you did to create it. It's still looking like a normal table to me. Now, when I try to run this set, it says, "Hey, give me an alias."
Just to make SQL happy, I’ll give it an alias—let’s say W for window function.
Or if you just see it, you might just see it as just W. In SQL, if you just see what looks like random letters in certain places, they’re not just random—they’re aliases. Not everybody writes out the word AS, because you can always leave off the word AS—it's just optional.
That just makes it happy, even though we're never going to use it. It's just SQL wants it. Okay.
And then in the parent query, I could put the WHERE in that parent query. And so now this will equal this. On everything, this will equal this. This will equal this.
We know that these questions are the ones on any given air date. Look at how many we got here. Like, just on this date alone, look at how many questions we've got.
Now, some dates only have one. Some dates only have one. So all these—they just have one question each.
Some have two, and you can have any number of them. And that's why when we're doing a GROUP BY, we can't just add a column for which question has that highest point value. Because it's not just one—it could be many.
And that's why I wanted to see through the window—to see the underlying data, to see what's inside. So three different ways to use aggregate functions—all of which are useful—but I do use plain aggregate functions and GROUP BY more commonly than window functions. But window functions are still useful.
Can you use PARTITION BY outside of OVER? No, PARTITION BY is for—so OVER is what creates the window function.
So PARTITION BY is a feature of OVER. Right. You can use OVER without PARTITION BY, because if you just use OVER, you're doing a plain aggregate.
It's like you're doing a plain aggregate function over it. So that OVER is what makes it a window function. And so just like we can do a regular aggregate function or GROUP BY, we can do an OVER or an OVER with PARTITION BY.
Right. So each has like just the regular—the regular aggregate function or the GROUP BY. Right.
So two window functions—just like non-grouped or grouped. Right. So that would be not grouped.
That would be grouped. It's the OVER that makes it the window function.