Discover how SQL window functions can reveal vital details hidden by regular aggregates, offering deeper insights into your data. Learn when and how to effectively utilize these functions to maintain visibility of underlying data while performing calculations.
Key Insights
- Understand that window functions differ from regular aggregate functions and GROUP BY statements in that they add a new column for aggregated calculations instead of replacing existing data. This feature lets users see both individual details (such as employee salaries) and aggregated summaries (such as departmental average salaries) simultaneously.
- Recognize the limitations and workarounds of window functions, specifically that their results cannot directly be filtered in a WHERE clause due to SQL's order of execution. To overcome this, perform the window function within a subquery first, and then filter results using a parent query.
- Identify the use of the keyword "PARTITION BY" in window functions as analogous to the "GROUP BY" statement in regular aggregates. Partitioning groups data into defined segments, enabling calculations (like departmental average salaries) within each segmented group, while retaining visibility of individual records.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
So back when I introduced aggregate functions, I said, aggregate functions are so useful. So there are actually three different ways we can use aggregate functions. We've seen two, we're going to see the third.
Now this is the least used. And so the most common ones are just a regular aggregate function or a group by. Those are the two most common.
Just because we're showing this third doesn't mean that now this should replace the other ones. It's just another option. I would say you're going to use this less than plain aggregate functions or group by.
So those are going to be much more common. But every once in a while, window functions can be useful. When we think about a regular aggregate function, you take all the stuff and you add it all together or do all this stuff and average it out.
Group by does that once per group. The problem with any normal aggregate functions or the group by ones is you can't see what goes into those groups, right? So lots of numbers go in, you get one number out. You don't see what goes into it, you just see what comes out.
And that's fine in certain examples, certain needs. But the window, we can look through the window to see what makes up those groups. We can actually not only see the results, but we can see what goes into those results.
Now you don't always need that, but sometimes you do. So in other words, instead of, when we do an aggregate function, so far, let's say you take a column and you add up everything and you do a sum. It replaces the column with the sum.
You lost the original column and you got it replaced by the sum. What if we could keep the column and just add another column with the sum? We can still keep the original data and we can just add another column for the aggregation. That's a windowed function.
We're adding another column rather than replacing what's already there. So let's see how this works. Like I said, this is not the one that you're going to use most of the time, but in certain situations, it can be very useful.
I'm going to choose our server here and just connect into the standard company database that we've been using a lot. This time, we're going to go back to the employees and departments tables. So I'm going to review, so we can see once again, our employees table, which has a list of employee names, employee emails, and their salaries.
And I don't want to see all the other columns. So right now, I just want to show their name, their departments, so I do want to know what department they work in, and their salary. We hide the rest of the columns.
I don't need to see it all. I'm interested in who works in what department and I'm interested in people's salaries. So just displaying some data here.
When we see this, if you take all the salaries—so let's say for example, I just look at the salary column alone, just that. If you do a plain aggregate function to review those, if you were to sum all of the salaries, this is what this company spends in salary across all of their employees. In total, they spend as much.
Notice how it replaced the information that was there with this new information. Can I see the salaries that went into calculating this? No, I can't see that anymore. So regular aggregate functions replace the data with new.
We lost all those rows and we just get one row of a result. Let's do this instead in a different way. This time, I'm going to go back to my three columns because I like to have my three columns here because I want to see the people, their department, and their salary.
Now I'd like to add another column. And as we saw before, if I just try to do a sum of salary, remember how this failed before? It's going to fail again. Doesn’t work.
They say, hey, that's an aggregation. So you could put it in a group by—God bless you. You could do it in a group by if you want to do it in a group by, but we've talked about this before.
You can't have some columns that have a lot of rows and then one column with just one row. Doesn't work. But I want to add a column.
So what makes it a window function is to say, apply it over the data. Just adding that word 'over'. So it still has the underlying data and we just stick it on top of it as another column.
This word 'over' is what makes it what we call a window function because we're still looking through the window to the underlying data that creates that calculation. This repeating of this information might be what—I think somebody mentioned this. They thought that it would do this by default, right? Which is the regular aggregate function, but it didn't do that by default.
But this is—I forget who mentioned that. It was like, oh, it should do that. Well, you can make it do that.
You just have to use a window function to do that. So you can apply it over. Now, of course, this is a new column that doesn't exist.
We could give it a nice name. We could give it an alias if we want to. We could say that's total salary, just to give it a nice name.
This is why I like to put things on different lines, especially when things get longer. Making things on different lines doesn't change anything—it just makes our code easier to read. Now, yes, we can have it over.
Now, I can't say that in this case, that seeing total salary next to an individual person's salary is very interesting. But if I were to say the average salary—now the average salary right now, because I'm saying let's just do the whole company, right? That sum was the whole company. If you do the average, that's the average for the whole company, right? So across the whole company, Bruce, Mr. CEO, he blows our average, right? So he makes the average really high because everybody else is way below him.
So he skews our average. So I think a company-wide average is not very useful in this situation, but it does work as a company-wide average, just like it was a company-wide sum that we were doing. Do you have a question? A percentage of the total, like how would the math work on that, like? What percentage is that of that total? And what is the percentage to a salary? The total is the sum of all salaries.
Oh, if you want like a percentage—like what percentage is that of this? Right, like, oh, if you do, so for example, if you do sum, and you say sum like this—so like, is this 10% of that, or is it 5% of that or something like that? I'd have to think about how to do that. We could probably do it. When you guys are doing your exercise, let me think about it.
Right, well, what I want to do is I want to do the average, right? But I don't want to do a company-wide average. I think what’s more useful is a department-wide average. I want to do an average per department.
So, for example, within Department One, what's the average of Department One? What's the average of Department Two? Because they work in the same department, there should be more similar salaries within the same department. If you want to do a group by—because that's kind of what I want to do—I want to do a group by here. And let's just go back for a second before I do this with the window function to remind ourselves of if we were grouping by.
If we're grouping by department ID, say we group by department ID, right? And we're doing the average salary. The problem in this case with group by is I can't see people's individual salaries next to the groups. I see the groups, I see the average salaries, but I can't just add another column.
This is the problem we solved before. Group bys can be perfect in some cases, but here, if I want to see the group by plus the underlying stuff, the group by itself won't work. Because the moment I add another column to the select, I have to also add it to the group by—they have to be the same.
So I don't want to create more groups, right? So I can't, in a group by, see the underlying data in this at the same time. But I can with the window function. So I could take this window function that I have here—the one that says: show me the employee name, the department, the salary, and the average salary applied over the data as a column, called whatever I want.
How do I do a group by? The term group by has already been taken. I want the same thing, but I need to use a new name, so they call it partition by. Partition just means to kind of break it into parts, kind of like a group by.
They could have called it part by, I guess. They could have called it whatever they want, but they just couldn't use group by because that name was already taken. So they needed a new name, and it's partition by.
But it's the same thing—same thing as group by. If I was grouping by the department ID before, I'd be partitioning by the department ID now. The partition is the groups that you want to create.
Now when I look at Department One, look at their salaries—100,000 and 110,000. Halfway in between is 105,000. That's the average.
Then with Department Two—88,000 and 75,000—the average is 81,500. There's only one person in Department Three, so they are the average. And then of the people who don't work in the departments, we have their average—for the people who don't.
So partition by is exactly like group by, but done as a windowed function. So you can see the underlying stuff, so you can see what went into the groups as well as your aggregate function. We don't always need this, but when you want to compare things, this can be really useful.
So then you might want to say, well, okay, we've got average salary, and I might want to see the higher or the lower, you know, maybe only above or below a certain criteria. And so you might want to start to do some filtering. Sounds good, right? So let me throw in the WHERE, and you might want to see where this is over 100,000.
So I can say WHERE that is over 100,000. So one, two, three, one, two, three, over 100,000. Or maybe it's less than, it'll do less than.
Error, window functions are not allowed in WHERE? Meaning this column is a window function column. You can't use them in WHERE? That's not very good. I don't like that because I want to filter by that.
Now, let's think a sec. So they're saying I can't use a window function column because that's what that is. I can't use it in my WHERE.
Is there a reason why? There is, actually a very good reason why. And we go back to execution order. Where is that window function defined? It's defined in the SELECT.
Notice that doesn't come from the original database. It gets created by the query. When the WHERE is being executed in the execution order here, the WHERE is being executed well before the SELECT creates that column.
So technically when we do a FROM, if we JOIN, this gives us our base data, all of the columns and rows that we have. Then it goes into the WHERE, it has not added the column for the window function yet. That won't happen until the SELECT eventually adds it.
That's why it doesn't work—because of the order of execution. So you're like, well, that's a bummer. That, if we can't do that, would majorly limit this, but there's a workaround.
So think about using parentheses to change the order of operation. Remember how we said when we did subqueries, how if you put something inside of parentheses, it's going to do that first, and then it's going to pass it to the parent query. So imagine this: each query has this order of execution.
If the nested query, the subquery, has this order of execution and it's done, then if we pass this to a parent query—so the parent query, this is going to look like this is all done. The SELECT will have already happened to the parent query, because it's going to go through all of this. What if we move the WHERE to a parent query? So the child query, the nested one, doesn't try to do the WHERE.
We get two execution orders: one for the nested and one for the parent. We need to use subqueries here. So what I'm going to do is, I want to take all of the information from here, and think of that as a table.
Can I SELECT things from a table? Sure. Now, a lot of times you're used to thinking SELECT from a table in my database. But does this produce a table? Sure.
Can that be a subquery to say, SELECT everything from that? Sure, because that produces a table. So I can put it in a subquery where a table would be expected. So, if I put this and say, SELECT everything FROM a table in a subquery, and I wrap those parentheses around this query, which produces the table, this has its own order of execution.
And it will produce this table. I want to pass it to this parent table. And when I try to run this, it says error.
Subquery in FROM must have an alias. So they're saying SELECT … FROM (SELECT …) AS foo, as an example.
Like they're saying AS foo, as an example. For whatever reason, think of it this way: does this have a name? No.
So SQL's like, please give me a name. Just, I want a name so I can feel good. Even though I'm not going to use it for anything, because I'm not doing a JOIN.
Just give me a name so I can feel good. Like, okay, I'll stroke your ego. Okay, you want a name? Okay, I'll call you whatever.
You call it foo, Manchu, whatever. Like you can call it whatever you want. I don't care what you call it.
So I could call it S for subquery. You can call it E for employee. You can call it anything.
You just got to give it any name for it—just to make it happen. I know we're not doing a JOIN, but just for whatever reason, when you put a table in a subquery, it needs to have a name. But you can put whatever name.
If you want to put W for window, you call it W for window. It doesn't matter. We're not going to use it.
We're just making SQL work. Now, if I put my WHERE into that—oh, now the name is not this. The name is average salary.
That's the name of this—here. That's the name of it.
Because this happens first. This creates a table with this. So this alias here is baked in.
And so after that's all done, that's the column name. To the parent query, this nested query's order of operation was already done. It went through all of that stuff for the nested query.
The SELECT has given it a name, and that's now a real name. That column is now called that. So to the parent query, it just only knows these names.
That's all it knows. We have to use that alias that we gave it to refer to that column. And this works fine because to the parent query, it just views this as a regular table that's already been created.
So this is how we can still use a WHERE with a window function. And you can see the ones that are less than or the ones that are greater than. So you can still filter that if you want.
So that was the last part there. This is where everything comes together. And this is another really good use of a subquery: to change that order of operation.
We get two queries. So you can do one part here, and then it's really kind of just a workaround to use the WHERE. So these are the weird things—SQL works in a very specific way.
So you then learn to use the features to work around the way that it works. It's very different from a lot of other coding languages and things. It's just, it's a different way to think.
It's unusual. If this is the first coding that you've ever done, this is a unique thing. There will be similar concepts that you might learn in other programming languages, but even if you've learned other programming languages, SQL is just a bit different in the way that it works.
It has some unique concepts. So it does help prepare you for some other things. Other things might help prepare you a little bit for this, but it still just has its own quirks and weird ways that it works.
That are sometimes very unintuitive to figure out yourself. I remember when I was learning SQL, I was like, yeah, I never would have figured that out myself when I first learned it. Once you start to understand how it thinks, things start to make more sense, and then you have to work within those limitations.
If you look up PostgreSQL.org, they do have documentation where they list all this stuff. I don't think it's necessarily the best place to learn about all this stuff. I think of it more like a dictionary—it’s going to describe it like a reference.
And so it's not always the best way to learn things. So yes, you can go to the documentation. They do document everything.
It's not always the most helpful for me. So normally when I'm trying to look for something, I'll type in PostgreSQL into Google. I never just type in SQL because there's too many SQL flavors.
So I never just type in SQL. If I'm doing SQL Server, I would type in SQL Server. If I'm doing PostgreSQL, I type in PostgreSQL—or whatever flavor you're doing, right? Type in that flavor and then what you're trying to do.
Also, if you're ever like—let's say we're all learning PostgreSQL here—if at some point you get into a database that's SQL Server, maybe you know how to do something in PostgreSQL. You could say, how do I do this in SQL Server? And this has been around for so long—
Probably somebody else is going to have that same question and say, oh, this is how you do this PostgreSQL feature in SQL Server and show you the code differences if there are differences for that. So I typically find that more useful when I'm looking for a feature or trying to find out something. I often end up on things like Stack Overflow to answer a lot of these questions.
But yes, you can go to PostgreSQL.org, and that is where the documentation is. I just don't think it's that great. And sometimes Google will link you out there, but it's not as nice as I'd like it to be.
Your challenge is going to be this. When we go back to the Jeopardy table, remember how we had that highest point value where we wanted to find out what was the highest point value on any given area? And then we wanted to see the question that it was for. Remember how we used the subquery to do that? We could also use a window function to do that.
We're going to go back and try to solve this in a different way. So instead of using that other subquery technique, we're going to try to use a window function to take care of that. Let's see if we can use a window function.
There it is. All right, so this is 1.1 aggregate windowed functions. This will be our last way of doing aggregates.
Like I said, though, plain aggregates—things with GROUP BYs—are definitely way more common. This is a third way, but this is, I would say, generally speaking, the least used way, but it's still very useful in certain situations.