Solution- Filtering with WHERE in SQL

Explore the essentials of filtering and sorting data with SQL queries to efficiently narrow down database results. Learn practical techniques to enhance query accuracy and productivity in your workflow.

Key Insights

Understand how to effectively filter SQL query results using the WHERE clause, such as retrieving all orders associated with a specific user ID or products with quantities greater than or equal to three.
Discover the significance of using arithmetic operations (e.g., price multiplied by quantity) to sort data accurately, enabling identification of the true most expensive items in your result set.
Gain proficiency with aliasing columns for clearer output and learn the importance of distinct queries to eliminate duplicate entries, resulting in streamlined and precise data retrieval.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

This is a lesson preview only. For the full lesson, purchase the course here.

So let's talk through these challenges here. Again, the overall idea is that a lot of times when you start off with a query, you might want to just quickly browse the whole thing, but you're normally not going to want to stop there. You're going to want to filter it down and only show the parts that you want to see.

So even if I say I want to look at the orders table and I look at the orders table quickly, I'm not going to stop with this query. I'm going to want to filter it to say, I don't want to see all of the orders. But somebody might say, “Well, hey, I want to see all the orders for a particular person.”

In this case, I have unique user IDs for each person. So looking at the orders, I don't have the name of the person or anything in this table. Later, we'll see how we can join tables together to get the name of this person.

But for right now, if I just know that user ID number 30, I want to pull all the orders for that person. That means that when I look through the user ID, notice how this same person placed both of these orders—two different orders placed at different times, potentially shipped to different places.

I want to find all of the orders for a certain person—in this case, user ID number 30. So I’d be looking in that column to see where the user ID equals 30, which is a number, so I don’t have to put any quotes around it.

SQL Bootcamp: Live & Hands-on, In NYC or Online, Learn From Experts, Free Retake, Small Class Sizes, 1-on-1 Bonus Training. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

These will be all the rows—in this case, five orders. And I can see their orders, where they were shipped, and when they were placed, for this one particular person.

By the way, when we're running our queries here—yes, you can click the little play button. But on a Mac, you can press Command + Return to run the query, or on Windows, Control + Return to run the query. So yes, you can click the button, but using Command + Return (or Control + Return on Windows) can be much faster.

You're going to run queries a lot. So getting used to that keyboard shortcut—Command + Return or Control + Return—means you can just click into a query and run it more efficiently.

All right, so I want to select all the columns.

I want to see all the columns in line items. So I could start by just showing myself all of the data in the line items table for now, just so I can see what data is there. But I don't want to see everything.

I just want to see where people purchased a quantity of three or more. So, what do I add to this? A WHERE clause: quantity greater than or equal to three.

Now, notice that my quantities here are five or ten. It’s possible that on this website, there’s a preset list such as 1,2, 5, and 10. Maybe they don’t allow people to choose a quantity of three.

Or maybe nobody has chosen that option. Just because you look for something doesn’t necessarily mean it’s there. There could have been quantities of 3,4, 5,6, 7,8, or 9.

But just because you’re looking for it doesn’t mean you’ll find it, right? Like if you’re looking for customers from Florida, what if your result set is empty? That would tell you what? That no one is from Florida, right? An empty result set could still be an answer. It might not be the answer you thought you would find, but as long as your code is correct, it could simply mean that nothing gets returned. You may just have no customers from Florida, for example.

I can look in my bank account for a million dollars—doesn’t mean I’m going to find it. You can look in your database for something, and it doesn’t mean you’re going to find it. You might not have customers who meet those criteria.

So your end result set might return absolutely nothing. For instance, did anybody order in a quantity of 30 or more? Nothing. Nobody ordered 30 or more. I now know something valuable—there’s no error in my code, it’s just that nobody placed an order that large.

Ten? Yes—people ordered quantities of ten. But nobody ordered more than ten. For example, if I search for more than ten, nothing appears. Ten was the highest quantity found here.

I want to see—so I’ll return this to the correct condition—I want to see all the rows from line items that have a price less than $30. If these are all the line items, I only want to see the ones with prices below $30.

Now, we are being specific here. We’re saying “less than $30.” When people ask you a question, it’s a good idea to consider: does it matter whether I include that number or not? For example, if I follow exactly what they say—“less than 30”—then that would mean not including 30.

Oh, that’s supposed to be a WHERE clause—sorry. So, this is everything less than 30. And if I had a product that was $30—which I don’t believe I have—but let’s just see.

Nope, no products that are $30. So for me, it wouldn’t really matter if I used “less than or equal to 30, ” because I don’t have any $30 products. But you never know—in the future, I might get a $30 product.

If somebody is asking you for something and they're speaking plain English—not like a programmer—they’re not going to use phrases like “less than or equal to $30.” They’re not going to use that kind of language because they’re not programmer-type people. So you do always want to question: ask yourself whether you want to include $30 products in the results or not. I think, a lot of times when people say it, they probably mean to include those things. But this is where you shouldn’t be overly literal about how a non-programmer speaks.

If you’re not sure, go back and ask them follow-up questions so you can understand the intent of their request. In this case, it doesn’t really matter, but I would want to clarify with the person just to be sure. Now here, if I’m looking for $30 or more, that would be “greater than or equal to 30.”

This would include everything that’s $30 or more. Now, if I just move on—did I remember everything? Oh no, it also says “Ordered by most expensive first.” When you do something—especially when you’re first getting used to SQL—I find that, at least for myself, I would get caught up in trying to make something work, and I’d be like, “Yay! Yay! I did it!” And then I’d forget to go back and re-read the original question.

Sometimes I’d forget things they mentioned because I was just happy that I got one thing fixed or working—and then I’d forget to go back. So always, when you’re done with a query, go back and re-read the person’s question and make sure that you’ve answered every single part of it, and that there wasn’t anything you forgot. Ordered by most expensive first. So how do I do that? ORDER BY price, which—if I just did that—the cheapest would appear at the top.

So as you said, I need to add DESC to put the most expensive at the top. Now, actually, let’s think about this for a second. I know we didn’t tell you this, but here’s a quick sanity check.

We said “the most expensive, ” but when I said that, I was referring to the product price. If we’re just factoring in price, is there anything we’re missing from this? Yes—quantity. For example, this line item is significantly less than another one, because only one unit was purchased, whereas another had five units purchased.

If you’re only looking at the product price, you’re missing out on the full value of the line item. So truly, to calculate this accurately—and this is a little sneaky, because we didn’t explicitly say this—if you don’t factor in quantity, are you truly identifying the most expensive line item? Wouldn’t we have to multiply price by quantity so SQL can compute the total value? That’s how it would actually sort.

And oh, look—some of these items were ordered in quantities of 10. Even though the product price is lower, the total value is much greater than something else with a higher price but only one unit ordered. So you can actually perform calculations in your query.

Interestingly, I didn’t even show people the math, but SQL is still doing it behind the scenes and sorting based on it. That’s powerful. This is where I say: what you’re generating in the result set, and what you’re doing down here, is not limited to what already exists in the database.

For example, could I create a column that shows price times quantity? Yes. I can add a column for price times quantity. It’s not a column that exists in my database, but I can still use it—because a comma indicates that we’re creating a new column.

The asterisk by itself means “all the current columns in the database.” But when you use an asterisk between two values such as price * quantity, that means multiplication. You can add, subtract, divide, or multiply in SQL. So price * quantity creates a new column that doesn’t technically exist in the database but will be visible in the result set if you include it.

Do I have to show that column to users? No—I can still ORDER BY something that isn’t shown in the results. SQL will still calculate it in the background. Yes, SQL can perform calculations and math for you.

The comma says “add another column.” If I say price, quantity, and then price * quantity, I’m instructing SQL to return the price column, then the quantity column, and then another column. Each comma separates a column in the output.

The last column is created by price * quantity. Notice there may be a question mark for that column’s header—SQL doesn’t know what to call it because it doesn’t exist in the database. It’s something I’m creating using data from the database.

Of course, I might want to give that column a friendly name. So I could use the AS keyword to assign it a label. Maybe I call it value.

And that is just a name that I came up with. This is called a column alias. A column alias is like saying you are also known as something else.

Technically, my name is Daniel, but I go by Dan. Yes, I’m Daniel, but I go by Dan—that’s my alias. Now, some people go by a completely different alias, like if they have fake IDs—that’s also an alias. So this is just like saying, “Hey, I don’t like the default column name. Let me give it a new one.”

You could technically rename any column. If you don’t like price, you could call it product_price using AS. You can rename any column to be whatever you want, even if it’s just to make the name more readable or user-friendly.

But when there is no column name—like when you do a calculation—because a column created by a calculation has no default name, you’ll want to give it a nice, readable name. And when we’re working with strings, we use single quotes. Double quotes are reserved for aliases.

If you called this column Line Item Value and wanted to put quotes around it, you'd have to use double quotes. That’s because names with spaces require quotation marks. Technically, column names can use quotes if you want to create nice-looking, user-friendly names. You can even capitalize them if you’d like.

"Line Item Value"—you can capitalize that. If you want cleaner, more presentable column names, you can rename them however you like using a column alias.

Double quotes indicate a column alias, whereas single quotes are used for string values—like when you’re comparing values in a WHERE clause. I know we didn’t mention this earlier, but I wanted to help you start thinking more broadly and learn to look back at your queries and question yourself.

That’s the sniff test—did I truly understand what was being asked? Did I properly answer the question? Speed is great—as long as you’re correct. But I don’t care how fast you are if you’re wrong.

Always look back at your query and ask, “Did I get this right?” You might be happy you didn’t get an error at first. You think, “Yeah, I didn’t make a typo.” But you also have to ask, “Did I give the correct answer?” It’s really important that we’re accurate in our results.

I know this was a bit of a tricky case and we didn’t explain every detail, but that’s part of why I wanted to use this as an example—to ask, “Are we really doing this the correct way?” We didn’t technically require it, but ideally, we want to limit the results to see just the top 20 most expensive line items. If I’m viewing all the columns from line_items, those represent all of the line items in the table.

If I’m ordering by just the price, I’d be sorting based on that column. If I want the most expensive items at the top, I’d have to use DESC to sort in descending order. And if I just want to see the first 20 rows, I’d need to limit my results to 20 rows so I’m not displaying the full dataset.

Again, technically speaking, do you want the top 20 most expensive based on unit price? Or based on price * quantity? That would give you very different results because the latter factors in quantity to determine the total value of the line item.

Now, if we want to find orders shipped to one of these zip codes, looking over here, I see that the zip code is a string—it’s stored as text (A-to-Z). So I’m going to use single quotes, not double quotes, because we’ve seen that double quotes are used for column aliases.

If I retrieve everything from orders, I’ll see all the orders. But I don’t want to view all the rows from all the orders—I just want to see the ones that were shipped to a specific zip code.

So I want to filter by zip code, and there are a couple of ways I could do this. One way is to say, “Let’s find all the records where the zip code matches the first one listed here.”

I could view all the records that were shipped to that zip code. This could return multiple rows, though in our case, we only have one order. Or I could filter for another specific zip code instead.

Notice there were two orders shipped to that zip code. This is one way I could do it, but there is a more efficient way. Does anyone have a more efficient way of writing this? Yes—IN.

So, is the zip code in this list? I can use parentheses to define the list. Is the zip code equal to this zip code or that zip code? Both of those queries will return the same results. But obviously, using IN involves a lot less writing.

I will always say the most important thing is that you're correct. Both of these queries are correct. But this one uses less code.

So I would personally prefer the IN version. Does that zip code appear in this list? If so, include it in the result set. So, this is the one I’d recommend as best practice.

But the other one works perfectly fine as well.

So why doesn’t BETWEEN work here? BETWEEN works for ranges—like numerical ranges. You can be between 1 and 10, right? But in this case, we’re working with text, not numbers.

If the zip code were a number, then BETWEEN could work. But I also don’t want all the zip codes from one to the other—I only want these specific two.

I don’t want a range of zip codes from one to the other, which would include a large number of zip codes. Also, BETWEEN doesn’t work here because it’s stored as text, not as numeric values.

You can use BETWEEN with number ranges, like BETWEEN 1 AND 20. But these are text values, not numbers, because they’re stored as strings.

And that’s not a mistake. If I browse this data, I can see that the reason it needs to be stored as text is because of these dashes. Dashes are not numbers—they're characters.

Also, some of these zip codes start with a zero. If a number is stored as a numeric data type, leading zeros are removed because they’re not mathematically necessary. But in this case, I need to preserve those leading zeros.

So there are very good reasons why the zip codes are stored as a string. But this is also why you must always check what data type you're working with—not what you assume it should be.

Always look at your column and simply ask: what do I have? What was I given? Whether there’s a good reason for it or not doesn’t change the fact—you must work with the actual data type provided.

Although there was a good reason for this—it’s text, so I must use single quotes.

And that’s why BETWEEN won’t work. BETWEEN could work—for example, if I want to get orders from users 10 through 20, I could use BETWEEN 10 AND 20 because those are numeric values.

You can click anywhere in a query, and when you run it, it executes the whole query. However, you can also highlight a portion of the query to run just that part.

But if you just click any line within the query—regardless of where you are—it’ll run the entire query. It’s smart enough to recognize that. This is why you need to be careful: if you don’t use semicolons, this may result in an error, because SQL won’t know that one query has ended and another has begun.

If you write multiple queries together without proper separation, they need either a blank line between them or a semicolon. If you insert an extra line space, DBeaver is smart enough to understand that, due to the separation, these are two distinct queries. The best practice, however, is always to use semicolons.

Semicolons clearly indicate the end of one query and the start of another—even without line breaks. That’s why you can sometimes get away with using a blank line between queries, but using semicolons ensures consistency and prevents confusion.

Now, in this result set, we have Mr. D. Lubovitz listed twice and Governor listed once. I only want to see a unique list of their names. I only want to see the names column.

If I only display the ship_name, notice that I have some duplicate rows. I don’t want to see duplicates. What do I add? DISTINCT.

DISTINCT tells SQL to remove duplicate rows. If you don’t limit your SELECT to just the name—if you kept it as an asterisk—are these rows the same? Not exactly. Even if the names are the same, the rows are different because they represent different orders.

So, SQL doesn’t eliminate the rows unless they’re identical across all displayed columns. You have to be specific about only looking at names if you want to eliminate duplicate names. Remember, DISTINCT removes duplicate rows based on the data shown in the result set.

If your result set includes only one column, you can easily eliminate duplicates in that column. But the more columns you add, the fewer duplicate rows you’re likely to have—because there’s more room for differences.

You cannot apply DISTINCT to just one column while displaying multiple others. For example, if you show all columns, DISTINCT still removes entire duplicate rows. You can’t say, “Only eliminate duplicate values in one column but keep everything else, ” because that would result in that one column having fewer rows than the others—and rows are meant to stay together logically.

So DISTINCT removes full duplicate rows—and it worked here. I just wanted to see the names, so I now have a list of names without duplicates.

Those are some bonus tips. But basically, we’ve now covered those techniques, so I want to continue on from here.

Key Insights

Dan Rodney

How to Learn SQL