Solution- Outer Join and NULL in SQL

Understand your database structure and data relationships before writing SQL queries.

Master SQL joins by understanding precisely how tables connect and reveal hidden insights within large datasets. Gain clarity on identifying sales trends, user behaviors, and unsold products through effective join strategies.

Key Insights

  • Understand your dataset thoroughly to effectively perform SQL joins; knowing table structure and relationships is essential for accurate querying.
  • Identify outliers such as unsold products by using outer joins; for example, joining the products and line items tables with a left join reveals products without associated orders through null values.
  • Recognize that joins simply match equivalent values across tables and do not necessarily require primary or foreign keys; matching song IDs from unrelated music platforms illustrates how joins can occur even without formal table relationships.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

This is a lesson preview only. For the full lesson, purchase the course here.

So let's talk through this solution challenge question here. So just as a recap of one of the things that I mentioned yesterday, but just to really drive it home until you understand the dataset you're working with. It's kind of hard to understand all the ways in which you can use things.

So your first thing, whenever you start working with a new database, is to look through it, but also to understand the intent of the data. And whenever you're working for a company, if you don't understand how the data works, if you don't understand what it means, talk to somebody else at the company to try to understand what this stuff means. And, you know, some of it you could just figure out maybe on your own, but sometimes there might be certain names that are confusing.

So first you want to make sure you understand what's going on. And just to explain our database in this case, the idea here—if I open up the tables and look at the ER diagram—is this: users place orders. Those orders know who placed the order, when it was made, and who it was shipped to, which could be different from where that person lives.

It does not know any status. It doesn't know what was ordered. Those are broken up by line items.

So associated with a certain order, we know the product that was purchased, the price, the quantity, and the status—you know, was it returned? Was it shipped? That sort of stuff. But we also have this list of products where we have the title of the products, tags for searching when the products were created, what their normal price is. It's possible that somebody paid a different price.

SQL Bootcamp: Live & Hands-on, In NYC or Online, Learn From Experts, Free Retake, Small Class Sizes,  1-on-1 Bonus Training. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

Maybe they got a discount or something. But when we think about products, we may or may not have sold all of these products. We've listed them for sale on the website or app, but we might not have sold all of those.

So let's say we define a new product that goes into the products table. Is there automatically a line item for that product that we just added? No, because somebody needs to purchase it, right? They need to place an order. So the only products that will be in line items are products that have been purchased.

What if a product has not been sold? It will be in products, but it will not be in line items. We're trying to, when we're searching for things, figure out what tables we need to look at and what we need to involve to find what we're looking for. Because if I look in products, I only know my list of products—I don't know whether they've been sold or not.

Line items tell me what's been sold, but they don't tell me what hasn't been sold. I need to see the connection between my list of all the products and the list of sold products to figure out those outliers of products that haven't sold. Kind of like if I was thinking about users and what orders they placed, if I just want to see the users who placed an order, that would be a match.

But what if I want to see users who didn't place an order? Those would be outliers, and I'd have to consult the orders to know whether they placed an order or not. We have to think about the tables that we can connect to give us this information.

There's a created_at column in these, but they're not connected to each other. So let's think about what these are and the way that joins work. So why is user ID connected to user ID? Because these are unique things.

For example, a user has a unique number—user number one, user number two. In orders, we need to know who placed that order.

And so when we think about user number one placed this order. Theoretically, could it be possible that two orders were placed at the same exact time? Or two users could create their accounts at exactly the same time? It would probably be pretty hard, but in theory, if you're getting enough users, even though they track it down to the millisecond, two people could create their account at exactly the same time.

So would the time when users created their accounts be a good way to identify a unique individual person? That wouldn't be a good way, because in theory, two people could create their account at the same time. Now, this is when the user created their account.

This is when the user placed their order. Those two pieces of information have nothing to do with each other as far as connecting tables.

If I look up when an order was placed, does that help me to know who placed that order? No. The only way I can know who placed the order is to look it up by their user ID. When we're doing joins, we're looking for two columns that have data that is equivalent to each other—user number one, user number one, placed that order.

User number two, user number two, placed that order. We're looking for something unique and repeated so that we have the same data in both.

Let's just say we have the same time here and the same time here. So a person created their account and placed an order at exactly the same millisecond? Probably not, right? They're going to have to create their account and then place an order. Those two things don't connect to each other because they're completely different types of data.

They're not like data. I know they're the same kind of data in the sense that they're both time data, but they have nothing to do with building a connection between those two tables. Does that make sense? Also, just to be clear here, when I select user ID and it highlights user ID over here, I really like that DBeaver makes that connection for me.

Not all database apps will do that. Sometimes you just have to look to see: one will be labeled a primary key, one will be labeled a foreign key. Primary and foreign keys are not required to make joins, though.

They are often used to make joins, but all a join needs is to say that there's some column in one table and some column in another table that has the same data that will sometimes match. For example, there was a company I was training that works in the music industry, and they get database dumps of song play counts. At the end of every day, let's say they get a data dump from Amazon, Spotify, and Apple Music.

They get this database dump once a day at the very end of the day, containing play counts for songs. They aggregate all of that data together to say, “This is how many song plays across all of these services they got.”

So they need some way to connect those databases, which they didn't create. They get that database information dumped into their database from these different services. When they create that database there, they have one for Amazon, one for Apple, and one for Spotify, let's say.

Think about products and how products have a UPC code, which is a universal product code. Think about songs: if songs have a unique number, in the Apple, Amazon, and Spotify tables—which are just three unrelated tables—they all have a song identifier because there needs to be some universal song identifier that all these services use so that they know that that's a certain song.

All three tables never get updated for each other. The whole idea of a primary and a foreign key is that in the database, if something ever got changed, it would update the other one—they're literally connected.

But since they're just getting these database dumps coming in, they're not modifying data or anything. They don't have primary and foreign keys set up. They're just three separate unrelated tables.

But we can find a common thread that binds them all together in that song identifier. So all three tables have a song identifier, and it wouldn't necessarily light up like this because they're not primary and foreign keys.

Because which one defines that? None of them define it—they don't say, “I'm the creator of this song ID.” They just all share the same song ID, so they're just three unrelated tables.

So here there would be no connection between those three tables. They're just three separate tables, but they have a common column. They might all be called the same thing, like song ID.

Because they're from different companies, they might not have chosen the same name, yet it's the same kind of data. If one called it apple_song_id and another called it music_id, the data inside is still the same. The name doesn't matter—they can be different names or the same name, and it doesn't matter whether they light up or not as far as primary and foreign keys.

The only thing that's required for a join is to say, “I know these two columns, and the data inside matches.” As long as that data matches, we can join on that data.

If you say, “Well, an order was for user number two, ” and you change the two to a three, you've reassigned that order to a different person because now they have to go to user number three—that’s a different person. If it was accidentally attributed to the wrong person, you could change that here, but you're never changing it in the place that defines it because user number two will always be user number two.

Over here, you could say, “Oh, wait, actually, sorry—that wasn't for this person, it was for another person.” Let me change it there to point to that other person. Then three will point to three, and now we know which user.

So there's always got to be some matches where the data is the same. Many times it's a number, just because we have unique numbers for things, but it doesn't have to be a number. It could be something else that's unique, like some sort of model number or letters and characters. It doesn't have to be a number—we're just saying this equals this.

And that's what a join does. When we say join on, you're saying these are my two columns and they equal each other. When they’re equal, that's when we're joining them.

Joins are very simple when you actually look at what they do—they simply compare values in two columns.

Does this equal this? That's why we join on two columns. We're saying when this equals this, that's a match. We know it's a match; I'll put that row into the result set. They're just looking for matches.

And then if we do an outer join, we're saying, “Don't just give me the matches where they are equal; also give me all the non-matches as well—the outliers.”

I know joins can be kind of confusing in the beginning, but when you look at what they actually do, they're just trying to match rows in two different tables, looking for equal values. That's all they do.

They say, “Does this equal this?” and if they don't find a match, an outer join would include it, but an inner join wouldn't. So they go row by row.

Think of how much work a join is—it's amazing how fast it works. I'm amazed because they go row by row and compare every user to every order and say match, no; match, yes; include the row.

So there are many times they don't include the row, and only a few times they do include the row. But they do a ton of matching. There's actually a little animated GIF.

SQL joins animated GIF on this Data School website—if this helps you understand what it does. Here they're looking through and saying, “Hey, does Matt equal Matt? Does Lisa equal Lisa?” When it flashes red, it means that’s not a match.

So it's like Matt equals Matt? Yes—include the row. Matt is not Lisa. Matt is not Sarah. Matt is not Louis.

Is Lisa all of those? Jeff? And so it goes through—Jeff got forgotten because Jeff had no matches, so he was left off. It's a lot of work to do a join, yet they happen so quickly—it's kind of amazing.

That's all joins do. And then if you do an outer join, it also includes the outer. See, this is just a regular join.

If it were a left join where it included this, then Jeff would have been output into the result set. So the result set of a join starts empty, and only when matches are made does it include them. In the case of an outer join, it also includes rows even if there is no match.

But the question is, what does it put on the right? If there's a match, it puts the stuff on the right from the match. If there is no match, it just puts nulls—because if it matches with nothing, it puts nothing. It can't make up data.

Okay. If the goal here is to find products that have no orders.

Products. And I can know what was ordered by looking at line items, because these are the products that were sold.

There's a product ID here. If I find that product ID in line items, I know it was sold. But I want to see sold or not sold.

Because I want to see all the products, whether or not they've been sold. If I just want to see products that have been sold, I would do an inner join—just the matches. But I want to see products whether or not they have been sold.

Give me the sold ones and the not sold. So I want every product, no matter what. When I do this, I care about products.

Because I want to find products that have no orders, I'm going to start with products. Right now I'm just going to type join, but I'll talk about which kind of join I'm going to do properly in just a moment. I'm going to join with line items, and I'm going to join on the column that has common data, which is product ID.

That connects the two. So in one table—let's say I start with products—it has product ID. Then it equals, in the other table, the same product ID. Because the columns have the same name, I must specify which table they come from.

If the columns had different names, I would not have to specify the table they come from because it would be obvious. It's only when you have two columns with the same name that it becomes ambiguous. Now, this would only give me products that sold.

But if I want all the products, I want a left join, because I want all products, whether they've sold or not. Join would only give me the matches.

Left join says, “I care about this left table; give me all rows from that left table.” So you'll get every product, no matter what, whether there's a match or not.

Which means that on the right-hand side, in the line items, if there's a match with a line item, it'll fill in the actual line item data. But if there's not a line item, what will it fill in on that side? Nulls. So that gives me the join.

You might be able to browse the data to find nulls. If I scroll all the way down to the bottom, there are some nulls down here. This is intentionally a small training database so it helps you visualize the data. But in real-life data, you're not going to deal with 100 users and 500 orders.

You might deal with hundreds of thousands or millions of users, and thousands, millions, or even billions of orders. So you can't always just browse your data to start making sense of it.

Sometimes you can, but when you're looking through millions of rows, it could take you a really long time. That's why we learn on a small training database, so you can start to understand these things.

When you scale up to a full-size database, some of these things you just have to know—you can't just browse for the stuff. You'll have to type in a search. If I want to find those products that didn't sell, I'm looking somewhere in my line items table. And if you think about the columns that are here—are you going to have an order ID for a product you didn't sell? No.

Are you going to have a price, a quantity, a status in the line item side? Really, none of this data would be filled in—it would all be empty. So technically you could search in any of those columns.

I would personally recommend the line_item_id, because that's the primary key. In a primary key, you must have a value. That defines the line item’s identifier—you must have that if you have a line item. So there's no way the database could ever have that be null.

If I look for rows where that line item ID is null, those are the products that have no line item associated with them. But technically speaking, you could also look for order ID—that would work. Really, any of these columns would be required if you actually had a line item.

I personally would look for the primary key, but you could look in any of those columns, and in this particular case it would be fine. You don't have to show all of these in the final result set if you don't want to see them all.

If you're just looking for some of this information, you could just type in those columns that you want to see. If you don't want to have all this line item stuff—if you just want to have the stuff from products—you could say, “From the products table, show me all the columns, ” and that will just show you the columns that are in the products table.

That way you still needed to do the join—because you needed that information for the WHERE to work—but you don't have to show people the empty stuff you don't want them to see.

When I first got started with SQL, I thought the asterisk was so simple—it was asterisk or something else—and I'm like, “Wow, you can actually do a lot with that asterisk.” But the asterisk simply means all columns. If you have it by itself, it's all the columns for everything. But if you put it inside of a table, it's all the columns just in that table.

Just like if you have this table with this column—here you have this table with all the columns. Pretty cool.

Cool things you can do with it. So that was the point of doing this join. The next step now is to look for those outliers and find them based on their nulls.

If I left it on the asterisk, see how we see all of these extra columns that I don't really need? But if I just want to find products that have no orders, why would I want to show people the line item stuff being empty? As far as making a nice result for somebody else, if they just want to see a list of the products, I just want to show them the products information.

So it's just to hide those extra columns that I don't want, and I don't want to list out 1,2, 3,4, 5 columns by writing out the names—because that takes a long time to write out all those column names.

If you use asterisk, you're saying all the columns from all of the tables. I have the products table here on the left, and I have the line item table over here on the right.

So this means all from both. But I could say, “Just show me the line item columns, ” which would be completely pointless in this case. Or, “Just show me the product table columns, ” which is actually what I'm looking for.

You don't have to show all of those columns—this only works when you have a join with columns from multiple tables. And that's one of the reasons why we define the alias, so I can use that shorthand instead of having to write out products.

How do you identify which is the primary key? When you're looking at tables, let's say I double-click on Tables here and look in the ER diagram, the primary keys are listed up here at the top. See how those are up here? So those are the primary keys. That's one way you can identify a primary key.

The stuff on this top line—those are your primary keys. Notice how a primary key connects into another column—so this would be a foreign key.

Not all of these tables have a primary key—some don't have anything listed up there in that top part.

If you see something above this line, those are primary keys. The other way you can see it is if you look at a particular table's output: there's a little orange key—it's small, but it's a little orange key—and that tells you it's a primary key. It's right there in the result set.

Let's say we want to find the names of people who ordered something in a quantity of 5 or greater. People who placed an order. Do we care about people who didn't place an order? No.

So do we want an outer join or an inner join? Inner join. We don't care about those outliers—the people who didn't place an order—we just care about people who did. We'll just do a regular plain inner join.

Keep in mind, it's called a join for a reason. It is the most common kind of join.

I know we learned outer joins after inner joins, but most of the time inner joins are the way to go. Most of the time you're just looking for matches.

The vast majority of times that's your default—your assumption should be, in most cases, you just do join. That's why they call it the normal join—just join—because it is the most common join.

Every once in a while, you do want those outer joins, but that's not the norm, or else they would have called that the normal kind of join.

So just your regular inner join—just called plain join—that is your normal join that you do most of the time. When we're looking for people who did place orders in a quantity of five or greater, we're just going to do a regular join.

We want the names of people, and we want only people who purchased a quantity of five or greater. On any individual line item we're looking for, is the quantity five or more on an individual line item? Because that's where the quantity is—quantity is over here in line items.

So I want the quantity from line items, and I want the name from users, but is there a way for me to connect just those two tables—just those two? Not directly, because there's nothing common between just those two.

Yes, I want data from here, and I want data from here, but there's no way to connect them directly. I can go through orders—I have to go through orders—because I can know who placed an order by user ID, and then I can know what this order is associated with. This line item is associated with this order. So I can go from this line item to this order to this user.

So I have to go through orders. I can either go line items → orders → users, or users → orders → line items. I can go in either direction—I don't care which table you start with.

The one you start with, though—we read left to right—the result set we create goes across. Your first table will be on the left; your next table will go to the right; your next table will go to the right of that. Start with the table that you want to start seeing. So it's up to you which way you want to go.

I would probably start with users because I want to find users. I'd probably start with those, but it doesn't really matter—it's just ordering my columns in this case.

If I start with my users and then I join because I just want to do a regular join, it's the most common. Just want people who did place an order and I join in my orders. Every join requires two columns to say I'm joining on the equality of these two.

So what columns am I joining on? Users. User ID. Those two things are equal to each other.

When the users table has a user ID, user ID, that when that equals in the orders table, it's also called user ID. For me, I just go and say, OK, well, I want one column of this and I want one column of this. So just for me, for mentally, I just think, let me find a column in the first table.

And then that equals another column in the second table. It doesn't matter the order of these because two equals two and two equals two, no matter which side they're on. So the order of those don't matter.

But just for my mental health, I just think I'll do one from the first table and then I'll do one from the second table. Because I've got to have a column in each table to say this equals this or this equals this. The order doesn't matter.

They're equal to each other. So that gets me my user table over here on the left. And then as I go across, it gives me the order data over here.

So this is all the order data. Now I know that, you know, this user here, Elma, she placed this order at this time, shipped it to this person. But what I don't have is the line items that comes next.

Each join, you just do one join at a time. Now we want to join in again. And I just want when orders match up with the lineup.

Not looking for outliers here. And now I want to join from orders to line items on what? Order ID. When order ID equals order ID, that's what I want to join on.

So I want to join in line items. Line items on. And just for my mental health, I just think, well, I need one column from here.

So there's called order ID. I need one column from the other table. And it's also called order ID there.

So I've got one column in each table. Because I have one column in each table. When those two are equivalent, I know there's a match between that order and that line item ID.

They match. They go together. Now I have even more columns.

This user data is on the left. And as I scroll to the right, I now will see order data, which comes next. And then as I keep scrolling over, eventually I get to line item data, which is on the right.

So each new table just gets more and more columns on the right. I just keep getting a bigger and bigger table, not just columns, but also rows. Because I'm matching line items to orders and there might be multiple line items per order.

So I'm just adding not only columns, I'm adding rows as well. Now in any given row on an individual row, I know who placed what order for which items all on any given individual line. Now whenever I look at a line, I just want to see people who have a quantity of five or greater.

And so I've got my quantity over here. I can say only show me rows where that quantity is greater than or equal to five. Now I'm hiding many rows.

Now, as far as columns, I'm really only interested in the user data. Maybe just their name and email. If that's all I want to see in my list of columns, I can say, just show me name and email.

And I know that those are the people who purchased a quantity of five or greater in any individual line item. And here I'm saying to display name, email and quantity. So let's also add quantity here so that I know how much they purchased.

If I want to put the highest quantities at the top, I could say order by quantity. Now that would put the lowest at the top. If I want to flip it, I have to add descending because the default is ascending.

So that puts the highest at the top. Secondarily, I'd like these people to also be put alphabetically. So for all the people within 10, I want to alphabetize them by their name.

So I'm going to put a comma and then say name. So first in the list, it'll organize, it'll sort by quantity of 10. After that, within all the 10s, it'll sort alphabetically by name.

Within the fives, it'll alphabetize those within the groups of five. So it kind of breaks them into groups. First, it does it by quantity.

Secondly, it alphabetizes by name. This default is ascending for name. Although if I wanted to do it reverse, I could add descending to there.

And that would do a reverse alphabetical order. So I would do Z to A rather than A-to-Z for the name. I want to be A-to-Z. So each column, you can say which order you want to be.

One to three or three to one, A-to-Z or Z to A. Your limit is how many columns you have. You can only sort by columns you have. That's your only limit.

So you can say the first thing you want to sort by, the second thing you want to sort by. But also at some point when you get to like three or four, do you really have groups that you can sort within? Because all the tens will be sorted alphabetically. But do you really have these? Like there's no sense in sorting the email like that won't change sort at this point.

Also, I see some rows that are a duplicate of other rows. If I want to get rid of those, how would I get rid of those? Select distinct, yes. Distinct eliminates duplicate rows.

Now anybody that did it more than once is going to be gone. I only need to see them once. And so you take things one step at a time.

And in the end, you've gotten a big query, but you've done it one step at a time, adding piece by piece. You keep filtering and filtering and filtering down to what you want. You start with lots of data and you whittle it down to only showing what you want.

Once you're done, if you want to export this and share it with somebody, down here, you can right click or control click if you're on a Mac, and you can export this data to give it to somebody. Let's say they want to get it in Excel or something. When you right click anywhere in this result set and export your data, you want to choose CSV, which is a comma-separated value file.

You can open those up in Excel or Apple's Numbers or something. You can import those into data visualization software. If you want to take it into Tableau or Power BI or something like that to visualize the data, you could do that.

So CSV is what you want. You click next, next, next. When you get to the directory, this is where it's going to put the file.

You just hit the browse button and choose the folder where you want to put it. Let's say I choose my desktop, and then I hit proceed.

So it just made a CSV file on my desktop right here. I could open that up in Excel or put it into my data visualization software and do something else. That's how you get it out and give it to somebody.

photo of Dan Rodney

Dan Rodney

Dan Rodney has been a designer and web developer for over 20 years. He creates coursework for Noble Desktop and teaches classes. In his spare time Dan also writes scripts for InDesign (Make Book JacketProper Fraction Pro, and more). Dan teaches just about anything web, video, or print related: HTML, CSS, JavaScript, Figma, Adobe XD, After Effects, Premiere Pro, Photoshop, Illustrator, InDesign, and more.

More articles by Dan Rodney

How to Learn SQL

Master SQL with Hands-on Training. SQL is One of the Most In-demand Programming Languages and is Used Across a Variety of Professions.

Yelp Facebook LinkedIn YouTube Twitter Instagram