Discover how databases efficiently organize data using primary and foreign keys, helping you connect tables seamlessly. Learn why splitting data into multiple tables prevents duplication and improves database performance.
Key Insights
- Understand the importance of primary keys as unique identifiers within a table, typically numeric values that remain constant and cannot be null or duplicated, ensuring each entry's uniqueness.
- Learn how relational databases utilize primary and foreign keys to connect tables, such as linking users to orders, preventing data duplication and boosting efficiency, especially critical for large-scale databases like Amazon's.
- Explore the common relationships between tables, including one-to-one (e.g., a user and their passport) and one-to-many (e.g., a single user making multiple orders), and how Entity Relationship diagrams visually represent these connections in tools like dBeaver.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
To understand some things that we're going to learn in just a moment, we're going to learn about joins. Joins are a way to take two tables and put them together because we have data in two different tables and we've only been working with one table at a time. To understand how this works, we need to understand a few bits of information about how databases work and some of the terminology.
What I want you to focus on is not so much about the names of things, but understanding the underlying implication of like, what does this mean? Like, why do I care about this stuff? Sometimes you can get confused about like primary keys and foreign keys. Don't focus so much about the names, focus on like, what's the understanding behind the concept here? So what this really is about is we need to understand the relationships of tables that connect to each other. And there are primary keys and foreign keys typically used to connect these tables together.
So what do we talk about when we talk about primary keys and foreign keys? So let's take a look at, let's say a users table. A users table, think about names. Could there be multiple people named John Smith? Yes.
Could there be multiple people with johnsmith@gmail.com? No, but could one John Smith work for a company, leave the company, and then another John Smith be hired and could that email be reassigned to the new John Smith? So you could give up an email and it could go to somebody else. When considering how to identify a unique person, is an email a great way to identify a person given that they could give their email to somebody else? Not a great way to identify a unique person. A good way is just assigning them a number, like user number one.
I don't care if user number one changes their name, changes their email, that's still the same person, right? They're user number one. So behind the scenes, they will never know if they are user number one or user number two. That's going to be in the database.
But the database will always refer to that person as user number one. What's nice about this is they will never be able to change their user number. So I know forever that user number one will always refer to this person, whatever their name is, this person, whatever their email address is.
It will always be the same person, whoever user number one is. If they ever delete their accounts and it gets erased from our database, user number one will go away and there will never again be a user number one. There could be now, if they sign back up, they could be a new user with a new account, but nobody else will come in and take over user number one.
That will be permanently relinquished, never to be used again. We know that user IDs or user keys are a way to uniquely identify each person. So a primary key is a way to identify each person in a unique way.
Most of the time, they are simply a number that increments by one. But the primary key is important for identifying that person uniquely. That user ID, if it's a primary key, it cannot be empty, meaning they must have a user ID.
You can't have it be null or empty. So they have to have a number and that number will increase for each person. Now, when you then connect a user to an order, so let's say a user makes a purchase.
The users table contains the user information. The orders table does not have the user information, but it does need to reference a user because who placed that order? We need a way to identify a person and what's a unique way that we can use to identify a person who never changes? We can use their user ID. For clarity, I put a U as user, but the U would not be in the database.
That's just for us to help mentally connect these things. If user number one is John Smith and the second order was placed by user number one, we know that we can look up in the other table that user number one is John Smith. Let's say user number one, oh, sorry, order number one.
Order number one was placed by user three. Who is user three? Sue Parker. See, what you're doing here is you're connecting, you're looking up things.
We've got our data in two different tables. The first thing people sometimes ask is, hey, why don't we just throw it all into one table? Why do we have to split things into different tables? The reason is if you have just one table, you duplicate your data a lot. Think about every order that a person makes.
So Daisy Jones, Daisy Jones placed two orders. If you just have one big table, you have to repeat the person's name in every single row. You have to repeat the person's email in every single row.
What happens if they want to change their name? What if they want to change their email? You have to update every single row. Imagine you're amazon.com and somebody's placed thousands of orders. You're storing their email thousands of times.
What happens if they want to change their email? I hope your database doesn't crash mid-update and then you have some with the old email and some with the new email. But also just think about how much duplicate data there is and how inefficient from a storage standpoint that is. You waste storage space.
And when you're at the level of Amazon, your database is huge because you've got millions and millions or billions of transactions. Every duplicate you do, that's a waste of storage space. We are efficient by splitting our data into different tables to avoid duplication of our data.
That's why we split into two tables. But we still have to connect those for the sake of our query. Because for a query, which doesn't get stored on a computer somewhere, it just gets run and thrown into the result set.
For our query, we may need to access both user data and order data. We need to join those together. We need a way to associate or link these two.
And keys match up, primary key and foreign key. And the way that I think of this is the primary key is the primary way we identify something. And when you leave your home country and go to a foreign land, you're considered a foreigner.
You refer back and you come back home. So in the orders table, when we refer to a user, you're saying, let's go to the users table to look up that user. If you forget which one is primary, which one is foreign, it doesn't matter.
Just know that you're looking for two columns that have common information that is used to connect these two tables. And notice what I'm doing here is to say that this order was placed by this user, and I'm gonna go look up that user in this other table. And I'm connecting one row of one table to another row of another table.
So essentially, when we're trying to connect these things, we're taking one row from one table and another row from a second table. And we're joining them into one big row, as though they originally lived in the same table, but they don't live in two different tables. And we do that to keep our database small.
We're not repeating our data. We're saying, hey, these two orders, which are two different orders, were placed by the same person. When that gets output, we'd see that this order was placed by this person, and this other order was also placed by this person.
So there'd be some duplicates in our result set, but that result set doesn't get stored anywhere. It's just in our query that we're doing this. When we think about relationships, which is what this is all about, for a U.S. passport, can one citizen have more than one U.S. passport? No.
Is a passport for multiple people, or is it for one person? One person. On both sides, it's one, so this is a one-to-one relationship. If you have a list of people in one table, and a list of passports in another, the primary key person number one identifies John Smith.
User number one over here in the passports table, where it says this passport number, when it expires, is for user number one. We know that John Smith has this passport, and this passport is for John Smith. It's not for anybody else.
What's very common in the case of, let's say, orders, can John Smith place more than one order on our website? Yes. If you're gonna say, no, only one order for you, you're not gonna be in business for very long. You get one order, and that's it.
No, of course, you can place many orders. Right? So like you have an Amazon account. You place lots of orders. Now let me ask you, for any one of those orders, is anybody else gonna see that order, or are only you gonna see it in your account? Only you're gonna see the order that you placed, right? If Amazon's looking at an order, they know that it's for you.
So an order is for one person, but one person can place many orders. So this is a one-to-many relationship. It's the most common relationship you'll encounter because one person can place multiple orders, but each order is only for one person.
It's the most common thing. That's what we're gonna see in our database. Now the ER diagrams, just ignore the SQL Server part, but in SQL Server, you need to create a diagram, versus in dBeaver, where you could just double-click on tables.
And here you can see in the ER, or Entity Relationship Diagram, you can see the relationships. And you'll notice that some things are not connected. There are no relationships.
They're a standalone piece of data. Here I can see that employees are connected to the departments they work in. And I can see that users place orders, which are made up of line items, which relate to our products.
And what's cool about dBeaver, which I don't, so not every program does it quite as nice as dBeaver, but watch when I click on User ID. See how it highlights over here, User ID? I wish every database app did this, but for example, Microsoft SQL Server Management Studio does not do this. They don't show you the connection.
For example, Order ID, see how it highlights Order ID? So they're showing you that this connection is made on Order ID. So the idea is that users, right, which are defined in the Users table, there's gonna be an order, and we're gonna connect it based on the User ID. By the way, this is the primary key up here.
These are the primary keys. And you can actually see when you do a query here, you don't see it over here, but when you do a query here, let's say I do a query. I'm gonna create a new script here, and I'm gonna say select everything from Users, for example.
Actually, let me do Orders. All right, so let's do Orders here. See how there's a little key right here? That little key right there, it's a little like orange key, I know it's very small, but that shows that that is a key.
That's the primary key. Also, when I hover over, it says P key on the far right. P is for primary.
That's the primary key. And this one here, that refers to the Users table. So that's a little orange arrow.
It's kinda hard to see these icons, but they refer out. I think it's much easier to see this in the ER diagram. So this is the primary key up here at the top of each table.
And normally speaking, in most situations, that primary key is gonna be referred to. When somebody places an order, we say, hey, go look up the User ID. When somebody places an order in line items, line items are knowing which order they are associated with by the Order ID.
So for every line item, you might have like four different products. They're all for the same order, so they share the same Order ID, because you can have four products on an order, or you could have one product on an order. You would never have no products on an order, because you wouldn't have an order then.
So you can have one or more products, and they'd be associated with the same Order ID. And then when it comes to the Product ID, it says, hey, go look up in the Products table, look up what the Product ID is to know the title, the tags, when the product was created. So there's always some connection here, and I like that the green highlights that actual connection.
I wish every database app did that highlighting. Not all of them provide this feature.