Discover how SQL's "GROUP BY" clause enhances aggregation functions by calculating data summaries per group rather than just a single total. Explore real-world scenarios like analyzing sales per state or orders per zip code.
Key Insights
- Use SQL's "GROUP BY" clause to perform aggregate calculations like COUNT, AVG, MAX, and MIN for separate categories, such as determining the average salary per department or counting orders per state.
- When grouping data, always ensure columns used in the "GROUP BY" clause appear in the SELECT statement, along with your aggregate function, to properly display summarized results.
- Postgres uniquely allows using column aliases in "GROUP BY" clauses for simpler and more readable queries, unlike SQL Server which lacks this convenience.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
Let's continue on here. Now, earlier we saw that there were aggregation functions, and I mentioned how important aggregation functions were like min, max, count, average. You're going to do these a lot, but one problem sometimes is that aggregation functions, they only give you one value.
That's it. And sometimes we want to do that not just once, but we want to do it for a group. And we want to create different groups, and we want to do those aggregations every single time for each one of those groups.
So for that, we have what's called group by. So group by is going to give us another way to use our aggregation functions. Do not think that this replaces normal, plain aggregation functions.
This just makes aggregation functions even more useful because now they're another way to do your aggregation functions. Don't think of it as replacing it. Just think of it as now I have two ways to use my aggregation functions.
Sometimes you just want one number—perfectly good, like we've been doing—but sometimes you want that aggregation for every single group. So, for example, with all my employees who work in departments, if there is a salary, I might want to find what is the average salary of the whole entire company.
I want to get one number for the entire company. That is a single summary value as a single aggregation function. But what if I want the average salary per department? Every department, I want to see the average per department.
That is not one outcome. It's one outcome per group, but I need to tell it how I want to create my groups. So group by says, I want to create groups with the department IDs.
So for every department ID, I want to find the average salary per department ID. So it creates those groups. And for each group, it says, Oh, I'll take the average for department ID one.
So department one, it takes those and creates the average for those for the twos. It creates the average for those. And so you'll get a row for every single group.
You can have as many groups as you want. If you want one value, plain aggregate function, if you want multiple aggregations, that's group by. Now, group by is a very specific way that you write them. You will say what you're going to group by.
There's some column, like in that case, department ID. If you're going to create groups, you need to show people those groups, because why would you create groups that you're not showing to people? So the select controls, which columns you're showing to people. So the idea is that whatever you're grouping by, you're going to choose a column and you're going to show people who column.
You're always going to have it in both places. And then the other column you're showing is whatever aggregation function you're doing—some min, max, average, whatever that is, count.
That is the syntax that you're always using behind the scenes. What it does is it still applies aggregation functions one at a time, but because of the grouping part, it says, if we're grouping by states, it says, okay, let me split apart the states. Let me take all the New Yorks, make a table out of those behind the scenes.
Let me take all the Connecticuts. So I'll make a table out of those. Let me take all the New Jerseys.
Let me take, make a table out of those. And for each of those tables, it will do the aggregation one at a time because aggregations output a single number or a single value. And then once it's done them all, it just glues them all together into a little group sandwich there and shows you one row per group.
So it's still doing a regular aggregate function. It's just doing it once per group. It applies the same aggregation group by group by group.
So you see those results in your end, you're going to see the group and the aggregate function. So you're going to see one column for the group, one column for the aggregate function. So for example, if you want to know how many orders there were, you would count the orders, plain aggregate function.
If you want to know how many orders each person has made, each user, you want to see a list of users, and next to it, you want to have another column with the count of how many orders each person has made. Oh, aggregation over and over and over again, same thing. It's a count, but it's per person.
So, for example, you want to group by the user ID because in orders, we have a user ID. That's how we know who placed that order. So I'm going to see a list of user IDs.
So I'm going to group by user ID, and I'm going to show people who user ID as a column. And then I'm going to add the aggregate function, which is to count, so that there's going to be two columns—one with a list of user IDs and another column with the count of how many orders they have because count counts rows in orders. Each row is an order.
When you're counting rows, you're counting the number of orders in PostgreSQL. What's nice is if you ever need to use a column alias, like remember how we could name columns, you can use them in your group by, which makes things a lot easier. SQL Server doesn't let you do that.
So it kind of makes things more tedious, but luckily we're doing PostgreSQL here so we can use column aliases in group by should we ever need to. It's one of my favorite features of PostgreSQL. They let us do that.
It might not seem like a big thing, but it's actually a very, very nice feature. We'll see that later. So I'm going to open up my group by here.
I'm going to choose my database up here. We're still using our company data. Now, just to remind us, if I am using a regular aggregation function, if I want to count the number of orders, I would simply say to count.
I get one number out of that. It tells me I have 500 orders. That is a plain aggregate function.
We use them. It's great. Very useful.
Group by does not replace that. It's just another feature that we sometimes want. Let's say now I want to see a list of states and next to each state, I want to have how many orders there were per state.
Because I want to know what's the best state in terms of how many orders we have. When we're doing this, we have all of our orders. And when I say to group by, I think, what do I want to have a list of? These are my groups.
I want to have a list of states. So I'm going to include ship state as my grouping. Whatever you group by, you need to show people those groups.
We need to have one column with those. And this is not done, but if I run the code, I will have all of those states, and I have 50 rows. But the whole point of a group by is to add a column with an aggregate function.
If you don't do that, I may as well just have used distinct on my states. The whole point of groups is that they will aggregate things into groups. Like here's all the California orders.
Here's all the New York orders. But if you don't do anything with those things in those groups, what's the point of grouping them? The whole point is to look in those groups and do some sort of aggregation. In our case, we want to count.
We know how many orders we have. And remember that whenever you're doing counts, you're counting rows in your table. If you're in the orders table, you're counting orders.
If you're in the users table, you're counting users. You're counting whatever your table is. We're counting orders in the orders table.
So I now have a list of all 50 states because we have shipped to all 50 states, and how many orders have been shipped to all those 50 states. I might want to sort by that count column. I'd want the highest at the top.
So I need to switch to descending order. So there you go. Here are the top-performing states in terms of how many orders were shipped to those states.
It's an aggregate function done over and over again for each and every group that we have. So these are now two different ways to use our aggregate functions because they are so useful. Now, in this case, I'm only having one column B for my groups.
What if we want to group by multiple columns? If I have my orders table here, you know how states are broken down into zip codes, right? If I looked at a state and I say, okay, there's 30 orders in a state. If I break that down into zip codes, we're going to break those 30 orders down into like 10 are here, five are there, two are there, right? We're going to break it into a more granular analysis of different zip codes. So let's say I want to group by not only the ship states but also the ship zip code.
Technically speaking, I could just break it down by zip code. But if you look at a zip code, can you just look at a zip code and know what state it's in? If you can, you are a machine, like you're an animal, like that's, that's amazing. Most people can't.
So I want to show state and zip code. Now, whatever I'm showing or whatever I'm grouping by, I want to show people those groups. So this is a formula. Whatever you have in group by also put it into your select.
Those are always going to match, except you also want to add your aggregate function. What are you doing for each of the, one of these groups? I'm counting the number of words. So I'm going to add my count.
Again, I'm counting orders because counts always count rows in whatever table you're at. Now I'm grouping by two columns. So I have more groups. Before I only had 50 groups.
Now I've got a lot more than 50 groups because we have a lot more than 50 zip codes. We have 50 states, and each state is broken down into zip codes. To make this a little cleaner.
I'd like to order by my states. That way, all of the zip codes for a certain state are all put together. There's some mistakes in our data here.
I know there's some four-digit zip codes. I'm going to use my Jedi powers to say, you did not see the four-digit zip codes. Pretend they're not there.
It's I know it's a mistake, but just, you didn't see that. Just ignore that. Just pretend they're all five digits or 10 digits.
Ignore the errors. Now I can see, yeah, within this state here, this zip code had this many orders. Each one of these just had one, you know, some of the zip codes, these are the more popular zip codes here in this state, right? Now we have more groups because we have more columns.
So the more columns you add, you're just going to get more and more and more groups because you have more possibilities of groups to create. So you can group by more than one column if you want to, but no matter what, it's always a formula. Whatever you put in group by, whether they're one or more columns, you will always put that into select except you add the aggregate function.
That is the formula you will always use in group by. You can't add anything else because if you're going to display another column, you must add it into group by. So said another way, what you show people can only be the groups and the aggregate function. You cannot add another column because remember when we did an aggregate function by itself, remember how we couldn't show other columns, right? Same deal here.
You can only show your groups and the aggregate function because if you throw another column in there that's not part of your groups, this won't make sense. You can only show your groups with your aggregate functions. That's all this can do.
It's very specific in the way that it works. It only does this. It's great at doing it, but it only does this.
You can't also see other things. Now let's say these zip codes, if you break zip codes down into the last four, the reason we created those, originally we started with just five digits of codes, by the way, but as we got more and more people, we needed to break those zip codes down into sub-zip codes. Now imagine you're the size of Amazon or something.
If you broke it down into the 10-digit zip codes, you'd have so many zip codes. We might not want to be able to do that, but if you break it down into 10, it might be that granular. Like for example, within this first part here, the last part could be like, there could be 20 different sub-zip codes within the main zip code.
So what if we don't want so many options? I know in this dataset, we're not going to really see this because we have a small dataset, but if you scale up to the size of a real-life company with tons of customers, I don't really want this last part here. So I just want the first five characters. Those first five characters are on the left of this.
This is text. It's A-to-Z. We can use a string function on this. I know we haven't done string functions yet, but we're going to do one right now. They let you manipulate strings and just get a certain part of a string. I just want the left five characters.
That's all I want. I don't want the, what comes after that. So I'm going to take this and I'm going to say for the zip code, I just want the left part of it.
And I want five characters. That's what I want. And I got to do it in both places here.
So this way, I'm going to get fewer groups because I'm chopping off that last part of the zip code. So I'm kind of, in a way, merging them all together so that I just have them all lumped into one zip code. What's cool about these string functions is you're actually going in.
And of course, nothing changes the database here because in a select statement, I can't change the database, but when I'm manipulating these strings, I might not want all the rest of that stuff. I'm wanting to just get rid of it. So I can say for the purpose of my query, just give me the first five characters starting from the left.
So give me the leftmost five characters. So the dash and the four that come after it just get cut off. I just throw it away.
It won't make a difference in terms of the number of groups that this database has because we're so small, but in a real-life database, this would help because otherwise, you would just have way too many zip codes, just too many to deal with. Alright, so I'm going to give you a little time here to work on these challenge questions here.
This is 2.1 group by.