Navigating databases can seem daunting at first, but tools like ER diagrams simplify data management and help prevent common query mistakes. Discover best practices for using wildcards, structuring SQL queries, and improving clarity in your database searches.
Key Insights
- Utilize Entity Relationship (ER) diagrams, such as those easily created with dbeaver, to quickly visualize database tables, columns, and relationships, simplifying the process of locating data without manually opening multiple tables.
- Use wildcards sparingly and precisely in SQL queries; excessive wildcard usage can produce unintended false positives, so always verify results independently to ensure accuracy, especially in large databases.
- Select the clearest and most explicit SQL query approach over concise but ambiguous alternatives, as clearly written queries improve readability and reduce potential errors for yourself and future collaborators.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
Let's talk through these challenges here. So, if I want to find all the people with a Gmail address, first of all, I have to know what table it lives in. So, as you get familiar with your database, you'll start to remember which tables things are in, but it's just like whenever you move to a new city.
You don't remember what streets things are on. You don't remember, "Hey, wait, does this city have a certain restaurant? Do we have an Arby's? Do we have a Wendy's? Where is it?" As you get used to things, you won't have to keep referencing your database as much. But in the beginning, when you're not used to the names that other people have created for these things, you'll have to spend more time getting used to things.
And by the way, if you want to get kind of a bird's-eye view of all the stuff—so instead of having to open up users, instead of having to open up products—if you double-click on tables in the ER diagram, this is the Entity Relationship diagram. This shows you all of the tables with all of their columns, and you can fit things on your screen much more easily. So, here we can see the relationships of all of these, and I can very quickly kind of look through all of these without having to open and close things.
So, this ER, or Entity Relationship diagram, which we're going to talk more about in the next set of slides—this is a quick way to just kind of see what you have in your various tables. Easier than having to open it up and then try to have all these on screen. You might not be able to fit all of these things on your screen all at the same time.
All database apps can do an ER diagram. So, DBeaver does it very easily. I don't want to save the changes to this.
DBeaver does it very easily by just double-clicking. In SQL Server Management Studio, we have to create the ER diagram. We have to, like, right-click and say create, and then choose which tables we want to put in there, just in case you don't want them all to be there.
So, I would say that DBeaver creates it the easiest, but all database applications should have some way to create it, even if you have to say to create it and then choose which tables you want to put into it. DBeaver just is the fastest one because you can just double-click and select the tab, and it's already made. But they all should be able to do it somehow, even if it's not quite as few clicks, because you want to be able to see the relationships that are in your database, and you want to be able to see the columns very easily.
So, they can all do it in some way. So, here I can see that users have an email, and that's where their email is stored. So, I want to find their email.
So, I'm going to look in the users table, and specifically, I just want the email column. So, I just want to look at that column. Well, actually, no, I don't need to look at the column.
I want to get all the user data. Yeah, I want to see all the user data, but I only want to see it for Gmail users. So, I want to see their names.
I want to see all the other stuff, but just if the email contains "@gmail.com, " I only want to see those rows. So, what did you put for WHERE? So, let's say… So, first of all, in our database, which is small—like how many users do I have in my database? I only have 100 users because there's 100 rows. So, I don't have a lot of users.
It's a very small training database. It's nice to learn on a small database, but in real life, you're probably going to have thousands, hundreds of thousands, millions of users. So, there's going to be a lot more people.
So, that means there's a lot more room for variation. So, if you did this in our database, it's probably going to be fine in our database. Same thing with this one and probably even the same thing with this one.
But you have to understand that there could be more variation in other databases. So, for example, what about if somebody had this email address: flingmail.com? Is gmail.com in there? Is "gmail" in there? They can have that email, right? Or what if it was like huntingmail.com, right? I don't know that these are real things. I'm just coming up with—there could be somethingingmail.com, right? There could be all sorts of things that have "gmail" in it, but that's not the same as gmail.com, right? As its own thing.
So, we have to be very careful with wildcards, especially the bigger your dataset. Think of each wildcard as a little bomb that could explode in your face and be like, "No, I included way too much." You have to be careful of what are called false positives, where you're including more than you wanted to.
They are wild. So, they can go really wild, and they can include way more than you intended them to mean. So, this is where the sniff test comes in handy.
Let me look around and let me verify that I didn't accidentally include things that I didn't mean to include. The computer is going to think and take you literally of all the possibilities that it could include. So, the problem with "gmail" like this is it doesn't include the ".com, " or like what if it was somethingingmailing.com? Do I need something after that? I don't need to include a wildcard.
So, the goal here is to include as few wildcards as possible. Don't come up with a reason to just throw wildcards in there. That's like just popping grenades.
Like, just—we want to use as few as possible. The more wildcards we add, the more things could accidentally include things that we didn't mean for them to include. My goal is to have as few wildcards as possible, not as many wildcards as possible.
I think when people first get started, they're like, "I'm just going to throw them in here just in case. I want to make sure I have them all." But what you don't realize is you accidentally include things that you didn't mean to.
So, then you need to rein them back in. While technically these queries are correct on our database, in a bigger database, these could go wrong. So, I wouldn't use this one.
Way too dangerous. And even this one without the at symbol is also too dangerous—it could include too many. So, I think the safest option is to say, "I don't know what comes before the @gmail.com, but I know it's going to be @gmail.com." I know that for a fact.
The only thing I don't know is what comes before the @ symbol. So, I would definitely include "@gmail.com" and not put anything afterwards. I think that's the best way to get all of the Gmail addresses that I want and not accidentally get the things that I don't want.
I know we don't have those as examples in this database, but as you scale up, that will be a problem. And the bigger your dataset is, the more accidental false positives you might include. So, I don't want to scare you away from wildcards, but you need to respect the wildcards.
They can be a little more wild than you think, and they can accidentally include things that you didn't mean to include. So, always double-check your results to make sure what you're looking for are not false positives—things that you didn't mean to include, but you accidentally included.
So, always be very diligent about when you're using wildcards. And that's why I only use them when I have to, right? When I know there's some unknown quantity. I don't look for an excuse to use them. I'm going to try not to use them whenever possible because I know they can sometimes include more than I thought.
So, just use them only when necessary. I want to see all of the orders shipped to Florida. So, if I'm looking in my orders table, right now I'm seeing all of the orders—but I don't want to see all of the orders.
I want to see orders that were shipped to Florida or Texas. So, I only want to see rows where the state is… and what comes next? There are a couple of ways you could do this.
IN ('FL', 'TX'). So, close parentheses. That's the most code-efficient way to do it, because you're saying, "Is the ship state found in this list? Is it equal to one of the states in this list?" So, I can see that there are some Texas and some Florida entries.
So, that's the least amount of coding that you could do. But other ways we could do it are like: OR, yes, okay.
So, they were saying it equals Florida or it equals Texas. Yeah. So, that's what I was looking for there.
This would also work too. So, you could say it's Florida OR it's Texas. That would also work perfectly fine.
Same exact answer. This is just less typing. So, I think when we get used to using IN, I know it takes a little bit of time to get used to thinking that way.
But if you're thinking about finding something in a list, it's easier to just say, "Find it in that list." It's always equality, but we know we're looking for equality here. But this works perfectly fine.
Just a little bit more code, but both of these perform equally well. So, whichever one—either one—works. There's not always just one answer.
Sometimes you can write it in different ways. Can you use IN with LIKE? No. So, IN is always equality—not using wildcards.
So, if you're using LIKE, you could do it here. And, you know, if you needed to put wildcards in here, you'd have to use ILIKE instead of equals, and you'd do it this way. Yeah.
So, IN is always equality. So, you can't use wildcards with IN. But let's see—five most recent orders shipped to New York.
So, kind of like we were doing here, right? But we're looking for New York orders. But once we found all the New York orders, I want the five most recent. So, how do I get the five most recent? ORDER BY created_at, right? And DESCENDING, because if you don't do descending, that would put your oldest at the top—2021.
So, we need the descending to flip it. And then also LIMIT it to five. And you must do it in this order.
So, you can't put LIMIT before ORDER BY. This is Structured Query Language. We've got to do it in this order to get it done.
I want to see all the products that have "plate" and are more than $20. So, I'd be in the products table. The title is going to have "plate" somewhere in it.
I don't know where. And I want the price to be more than $20. Now, you can do this in either order that you want.
Like, if you want to take care of the $20 first, you could do that first. You can say WHERE the price is more than $20. So, it would be greater than.
You might want to throw the equals in there. Like, sometimes when people say, "Oh, it's more than 20 bucks, " they probably mean 20 or more. Depends on how specific they're being.
But normal people don't speak in less than, more than, greater than. They just say, "Oh, it's more than or less than." They don't add the equals in there. Not that it would matter in our case, because we don't have anything that's equal to exact whole dollar amounts. But this would filter us down.
Now we're down to 45. Originally, for our products, we had 52 products. I can see the 52 down here.
Now we're down from 52—we're down to 45. We got rid of some of the products.
But now I only want to see ones here that have the word "plate." So, what do I add here? AND, right? Because I want it to be both—that they're more than $20, and… and then what do I put next?
Title—yeah. ILIKE, so I'm not case sensitive. Plate with percentage. Yeah.
Now here, if I just put one percentage sign at the end, this would mean it must start with "plate." If I just put it at the beginning, that would mean it ends with "plate." But what if "plate" is somewhere in between? This would allow "plate" to be in between.
So, there could be things before or after. I don't know what I have when I'm first getting started with things. In this case, I don't have anything other than things ending in "plates."
But this does leave the option that "plate" could be somewhere in there—like "silver plated, " right? If "plate" is in there, it would still find it. So, there are only two plates that cost more than $20. Also, depending on how you order this—like, I could rearrange these.
I could put this up here. And why might you want to rearrange this? For example, here, if I select the first part, these are all of my plates. And then, if I do the whole query, those are plates that are more than $20.
So, it's kind of nice that you can select part of a query—like, here's the whole table, all my products. Here's products that are plates. Here's products that are plates that are more than $20.
So, I think it's kind of cool that you can select part of your code and run just that selected part. That's very different compared to most programming languages that just run entire files. They don't normally do it based on selection.
That's kind of unique to SQL. Now, let's say somebody is allergic to rubber. So, they don't want to see any of the rubber products.
If I look at all my products, they're going to include everything, including rubber—like those sleek rubber pants. If I want to get rid of the rubber, I like to start by finding the rubber, making sure that I'm only finding the rubber, and then I flip the NOT and turn it on. If I'm looking in that title for where it's ILIKE—so I don't worry about case sensitivity—and then it would be "rubber."
And I don't know if there are characters before it or after it. I just know "rubber" is somewhere inside. In this case, it happens to always be in the middle, but it could be at the beginning, middle, or end.
I have now successfully found all the rubber products. But I want to see everything that's NOT rubber. So, I can add NOT here, or I could add it over here.
In this case, you can put it in either place. Because when you have a word, you can put NOT with the word. Either way works just fine.
You'll get the same result. But again, I just always like to find the positive—make sure I didn't make any silly little typos, like only having one "b" and not finding anything. I make sure I can find it, and then I don't accidentally include things that I didn't mean to—especially with wildcards.
Can you imagine every place you could have gone wrong? Probably not. Verify your results, especially with wildcards. Then throw the NOT in, right? So, add the NOT.
Now, if I want to find products that are "gray" with an "e, " or "gray" with an "a, " because they can spell it either way, what would we do here? This is another place where people might do different things. What should I be looking for here? I want "gray" with an "e" or "gray" with an "a." Tags, IN, gray, grey.
Okay, yeah, that's one way to do it. Okay.
Does anybody have a different way of doing it? Use ILIKE. Okay. And then "gray" like this.
And did you use a percentage sign here? Where did you put it? You used underscore. Yeah, because we never used underscore, right? Underscore means exactly one character. And we know there’s exactly one character there, right? Although you could also try the percentage.
And again, in our database, it's a small database. So all of these queries will technically work. But let's say you scale up and you get a bigger database.
Let's talk about the benefits—the pros and cons—of each of these. Now, let me ask you, if I didn't give you the query, if I didn't say what we were trying to find, if you looked at this, would you know what we're trying to find? If you just looked at that query, or if you looked at this query, would you know the code? Would you be able to envision what we're trying to find? Not very clear. If you look at this query, do you know what we're trying to find? Yes.
Also, what if next year "groi" becomes a thing? I don’t know. It’s the year of the "groi, " right? New words get invented. I can’t imagine any other word other than "gray" or "grey"—like these two spellings.
But this does open up the possibility. Also, this could be—what if something is "greeny, " right? That could be "greeny, " or "grainy." So this is much more dangerous.
That would be the worst. This would at least be better because it would only be one letter. So I can imagine a lot fewer ways it could go wrong.
I can’t really imagine this going wrong, technically. But what I don’t like about it is the lack of clarity. When you're coding, you have to think about other people who you're going to be sharing code with.
Now, you might say, "But I work by myself." Okay. Your future self will be looking at your past self's code.
You’re going to write code, and you’re going to look back at it and you’re going to say, “What did I write two weeks ago?” Because at the moment, you were in the right mindset. But two weeks from now, are you going to remember what the heck you wrote? I’ve looked back at code that I’ve written—I know I wrote this, but I don’t remember why I wrote it this way.
Because it wasn’t very clear. This, I think, is the best code. Because, number one, it will always be correct.
Number two, it’s really easy for you and anyone else to understand. It’s very clear. So, I think clear code, in this case, is more important than saving a couple of characters.
Yeah, you didn’t type in quite as many characters. But I think the time you spend trying to figure out what that even does, and the possibility that it could go wrong, are both reasons why I wouldn’t go with either of those solutions. Sometimes we try to avoid wildcards.
We avoid them whenever possible, to be honest. We only use them when we have to. You have to think about all the possibilities that you could accidentally get that you don’t want to.
So, in this case, we only have two alternatives. It’s "gray" or it’s "grey." So IN, I think, is perfect for doing that.
So I would definitely highly recommend this solution as opposed to the other ones. There’s just too much of a risk, and they’re too unintuitive for other people to look at that code. And if you added a comment to say "gray or grey, " did you end up saving any typing? No, you probably actually typed even more than this.
So you know, why not just go with something that is more straightforward, that can’t be wrong, rather than trying to explain what code you were trying to write? And by the way, that’s why the double dashes are there—so that you can leave comments to yourself and others. So you can explain to your future self or to other people.
You can leave little notes around so that you can remember or communicate with other people. If there’s still something you need to add, you can leave a little note to yourself to go back and add this feature or to fix this thing if something’s not quite working right.
Now, we only want to find line items that have been returned. So, if I'm in the line items table—so here we've got a status—and these statuses would be something that people would choose from a menu, because we don't want people typing in statuses. So, we only have a few, like "Pending, " "Shipped, " "Canceled, " and "Returned."
So, "Returned"—it's always going to be that that status is "Returned." It's always going to equal that. No need for wildcards or anything.
We just simply say it equals "Returned, " and so it finds all of the returns. So, now we just see the returns. Now, people might be interested in those returns.
They want to see the most expensive returns—most expensive by price times quantity. We're talking about the value.
So, we have a column here—or sorry, I should say we have all the columns. I could add a column for price times quantity, where I keep all the current existing columns, and then I add a new one for price times quantity. I'm going to call that one "Value."
That way I get a nice name here. Now I have the value. You can call it whatever you want—call it "Line Item Value."
I'm just making up the name "Value." That's what I'm choosing here. So, if I then want to order by that column, ORDER BY, I use that name to say order by that column with that name.
And I can say ORDER BY that value, which by default puts the cheapest ones at the top. Or I could do DESCENDING to put the most expensive ones at the top. So, these are the most expensive returns—starting with the most expensive at the top and working my way down.
So, if I'm finding the most expensive returns, the most expensive ones are at the top here.