Discover practical strategies for handling missing data in your datasets by using effective methods like filling with averages or removing unnecessary rows and columns. Learn how these essential data-cleaning techniques can enhance your data analysis and visualization.
Key Insights
- Understand that using df.dropna() indiscriminately can lead to significant data loss; for example, dropping rows with any missing values could reduce a dataset from 4,195 rows to 3,552 rows, removing approximately 15% of the data.
- Explore a better alternative by filling missing data with calculated means, such as replacing absent median age values with the column's average (25.5) or missing fertility rates with their mean (3.89), resulting in cleaner and more complete datasets.
- Learn to save cleaned data frames to CSV files using the to_csv method, including how to manage the index correctly to prevent unwanted columns like "Unnamed: 0".
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Okay, data cleaning. Now, df.dropna() called on a data frame will drop all rows where any column is missing a value. That's a little drastic, because you're going to lose the whole row in that case.
It's like you have an employee record, and the only thing you don't know is their favorite color, while you've got 20 other pieces of data on them: their salary, how long they've been there, and all this other important data. But we don't know their favorite color, so let's just drop the row, because we're missing one piece of data. You wouldn't necessarily do that. But you could.
If we call dropna on the data frame, it would get rid of all 595 of these rows, plus any of these 113 that didn't overlap with the 595. So you'd probably throw out a good 15% of your data.
You can also clean up a specific column. You could take df, the column name, and call fillna with a value. You could call that on the whole data frame or on a specific column, which is more typical. It will fill all the missing data in that column with the value you give it. If we're missing the median age, we wouldn't want to fill it with zero, because that might mess up calculations and averages and skew the data. But you might want to calculate the average and fill with that.
That's a common fill move: just fill missing numbers with a mean. There can be a little more going on, though. If you have many cars and you want to fill in their missing mileage with averages, you'd probably also want to know the year, so that you're not putting 40,000 miles on a new car. So there are potentially multiple factors involved in coming up with a good average; it's not one size fits all. But filling missing data with means is very common. You can also drop the other way.
You could say: I don't want to drop the rows, because I want to keep all 4,195 of them. I just want to get rid of these columns, because there's so much missing data in them that I'd rather lose the whole column.
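Dropping sparse columns instead of rows might look like this sketch. The data frame and column names here are made up to stand in for the lesson's dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the lesson's dataset (made-up values).
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Population": [1000, 2000, 3000],
    "Median Age": [30.0, np.nan, np.nan],  # mostly missing
})

# Drop every column that contains any missing value, keeping all rows.
fewer_cols = df.dropna(axis="columns")
print(fewer_cols.shape)  # all 3 rows survive; the sparse column is gone
```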
And that's more typical and probably better. In fact, we're not even going to keep any of these columns ultimately, but we're going to make a new df where we can practice some of these operations. So let's make one where we do this first move: dropping rows.
It's a kind of drastic move, which is why we're going to do it on a copy. We do not want to drop any rows off our 4,195; we need them all. But that's not to say we can't practice on a copy.
We'll call dropna on the original, and it'll return a copy with the change. We'll say no_null_df = pop_df.dropna(), and you don't even have to feed anything in here; whatever is missing, drop it. Then we're going to print the shape of that and the head.
So it works. If you scroll into the middle, we might get lucky and see some of the missing ones. Nah, but some number of rows did get thrown out.
Anyway, that's not the important thing. The thing we care about is the shape. The original had 4,195 rows, and we whittled that down to 3,552, so we threw out a lot of rows: all 595 of these went, and then some number of these on top of it.
The original df is unchanged. That's important: we do not want to change the original. The dropna move doesn't operate in place on the original unless you tell it to; it returns a copy, which is what we have here. Now, let's make another copy.
Well, we can do this off the original, because we're just going to drop the columns anyway. So let's say we want to fill. We've got this median age column that's missing a ton of data, and a fertility rate column missing a ton of data too. You can go into any column and calculate: you can add up the values of the whole column, and you can also get the mean of all the values in a column.
So let's calculate the mean: the average of the entire median age column, this column with all the missing data. We'll call it mean_median_age; it's the median age column, but we want the mean of all the median ages. That's going to be the pop_df median age column, and what do we want to do to it? Get the mean of it. Just run it.
And it's this big float, 25.49-something. All right, let's round that off to one decimal. It's still a NumPy float, so let's cast it with float(). Okay, now we have a plain number. Then let's round the float. There we go: one decimal.
So the mean median age is 25.5. Now, that's not the most scientific way to go about it; you'd probably want to calculate it independently based on how industrialized each country is, since the median age varies drastically between countries. But whatever.
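That mean calculation might look like this, with toy numbers and an assumed column name:

```python
import numpy as np
import pandas as pd

# Toy column; the real dataset has ~4,195 rows with many NaNs.
pop_df = pd.DataFrame({"Median Age": [30.0, np.nan, 21.0, np.nan, 25.5]})

# .mean() skips NaN by default; cast to a plain float and round.
mean_median_age = round(float(pop_df["Median Age"].mean()), 1)
print(mean_median_age)
```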
That's fine. 25.5, a great age. Now we're going to fill any missing median age anywhere with that 25.5. We're going to target the median age column.
Again, we're doing this to the original, but that's fine, because we're going to get rid of all but three columns ultimately anyway. We're going to target that column and fill it with itself, with a fillna on it. And what are we filling with? We pass the mean into the fillna method. So you go to the median age column and fill all missing values with that mean, 25.5. That returns the result, which you save: you stamp that change back onto the original median age column. Then we should probably sample 20 rows just to see if it works.
So what we're looking for is some 25.5s. Here's one; there's one. And notice the NaNs still in the other columns. Those are the rows where you had the three columns with missing data; we took care of the median age NaN. Let me give you a little challenge.
Challenge: fill the missing fertility rates with the mean value for that column, following the same procedure you just did. Pause, and come back when you're ready. All right. To get the mean fertility rate, we're going to target the... what's the name of the column exactly? The fertility rate column.
Target the fertility rate column and get the mean of that. It's a NumPy float, and we want a real float, just a plain simple float. And we want to round that number off, probably to one decimal, unless they're doing two decimals in the dataset. Oh, they're doing two: the fertility rate uses two decimals. Okay, great, we'll round to two decimals. There's your float, there's your round, there you go.
3.89 is the mean fertility rate. So then you would fill: fill missing fertility rates with 3.89. We'll say the pop_df fertility rate column equals itself with .fillna, and we pass in this number. And if we do another sample, there's a 25.5 in median age and there's the 3.89. So it's working.
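Sketched with toy values (the column name "Fertility Rate" is an assumption about the dataset), the challenge solution might be:

```python
import numpy as np
import pandas as pd

pop_df = pd.DataFrame({"Fertility Rate": [1.70, np.nan, 6.08, np.nan]})

# Same move as before: mean, rounded to two decimals this time, then fillna.
mean_fert = round(float(pop_df["Fertility Rate"].mean()), 2)
pop_df["Fertility Rate"] = pop_df["Fertility Rate"].fillna(mean_fert)
print(mean_fert)
```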
We'll leave the last one just null. That's enough to get an idea of how to fill with a mean, which is very useful. It's a very standard data-cleaning operation: fill missing numbers with a mean. Now, after you make all these changes, what if you wanted to save? We loaded up a data frame and manipulated it.
Well, we filled all these missing values, and that's about all we've done, but still: what if we wanted to save that for later use? The best way is to save it to a CSV file so we can load it right back up. The data frame has a method called to_csv where you feed in a path, and then index=False, which means you do not want it to make a new column out of the data frame's index. In our case we wouldn't want that column, because the index is just 0 to 4,194; you wouldn't need that as data. But if the index values were meaningful data, then you would not say index=False, because you'd want to preserve the index by letting it be relegated to column status: it would live on as a column. And you give the path; we'd want to set the path to be our base URL, all the way to our CSV file.
index=False prevents the index from becoming a new column called "Unnamed: 0". And then you'd just boomerang it back up with read_csv, loading the file back to make sure you got it. So what we're going to do is save the no_null_df to a CSV in case we want to use it someday. Let's not specify index=False, so that we get that new "Unnamed: 0" column; otherwise you won't know what the heck I'm talking about.
All right. So that's the path we're going to. We've got our base URL. We're going to say no_null_df, that's the data frame, .to_csv, that's the method. The path is the base URL, or base path, plus the rest of the path, ending in the name you want; you declare the name as part of the path. The name doesn't exist yet; you're making it right now. We are deliberately not specifying index=False, so you can see what that means.
Run. Now let's look: did we get the file? Go into your CSV folder. pop_no_null... oh, we got it a couple of times. The 3:06 p.m. one, there it is; that's the one from just now. Well, we only need one of them.
We're going to load it back up. So it made the file; now load the saved CSV right back up again. We'll say no_null_loaded_df = pd.read_csv, and pass in your base path plus the path to the file. Check the shape.
Load the head and note the shape: 15 columns now. Remember, it was 14 columns before. Why? Because you got a new column. What's in it? The index values from the previous data frame, the one the CSV was made from. Why would it do that when they're the same numbers? Because the index values are not necessarily the same numbers; in case that was useful data, it preserved it as a new column.
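The whole round trip might look like this sketch (toy frame, hypothetical file name):

```python
import pandas as pd

# Write with the default index, read back, and the old index shows up
# as a new first column called "Unnamed: 0".
no_null_df = pd.DataFrame({"Country": ["A", "B"], "Population": [1, 2]})
no_null_df.to_csv("pop_no_null.csv")  # hypothetical file name

no_null_loaded_df = pd.read_csv("pop_no_null.csv")
print(no_null_loaded_df.columns.tolist())  # one extra column now
```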
So keep going: run all those moves, check your folder, and make sure it's working. And there it is: we got a weird column called "Unnamed: 0". We meant to do that.
This is the index saved to a new column in case it was useful data. Now let's save the data frame to CSV again, but this time we're going to add index=False to prevent that new "Unnamed: 0" column from being made.
And we're going to change the name of the file as well, to version 2. We could use the new loaded df, but let's go back to no_null_df and just run this line again. We're going for file 2, and this time we're going to say index=False to prevent that weird "Unnamed: 0" column from being made. pop_no_null... let's put index_false in the name so we really understand, so you can see it.
Okay, did we get our index_false file? It takes a little while to make it. There it is.
index_false. Now what we're going to do is load that right back up again. We'll say no_null_index_false_df = pd.read_csv, base path plus the name of the file with .csv in it, so this string right here. Boom. And did it work? Of course it worked.
The shape should now have only 14 columns. Yep. We loaded it back up, and it doesn't have that "Unnamed: 0" column. So that's how to save your dfs as CSVs with and without a new column called "Unnamed: 0". Get that working before you move on. All right.
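The index=False version of the round trip might be sketched like this (again with a toy frame and a hypothetical file name):

```python
import pandas as pd

no_null_df = pd.DataFrame({"Country": ["A", "B"], "Population": [1, 2]})

# index=False keeps the index out of the file entirely.
no_null_df.to_csv("pop_no_null_index_false.csv", index=False)

loaded = pd.read_csv("pop_no_null_index_false.csv")
print(loaded.columns.tolist())  # no "Unnamed: 0" this time
```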
We're going to make a new df now with only three columns. Remember, we only need three columns, as I kept saying: the country, the year, and the population. We'll say pop_3_col_df, and that's going to equal... we're going back to the original now; we're done with the no_null practice. pop_df, and we want just these three columns.
You can just say df with a list of these three columns, and it will automatically give you all rows; you don't have to use loc or iloc. And it works. But you can also use loc or iloc. So let's override this by saying all rows, comma, these columns, and we're doing it with loc, because we're using the column names. That would also work. Boom, there.
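Both selection styles might look like this (toy frame; the column names are assumptions about the dataset):

```python
import pandas as pd

pop_df = pd.DataFrame({
    "Country": ["A", "B"], "Year": [2020, 2020],
    "Population": [1, 2], "Median Age": [30.0, 25.5],
})
cols = ["Country", "Year", "Population"]  # assumed column names

# Plain bracket selection gives all rows for those columns...
pop_3_col_df = pop_df[cols]

# ...and .loc spells the same thing out: all rows, these columns.
same_df = pop_df.loc[:, cols]
print(pop_3_col_df.shape)
```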
All right, challenge: make a new df called pop_2020_df, from pop_3_col_df, that only has the year 2020. There are 200 of those rows, so you'll know it works if you get 200; the whole year column will be nothing but 2020s, 200 of them, one for each of the 200 countries sampled in that year. So pause, try it, and come back when you're ready for the answer. Okay.
We're going to make a df called pop_2020_df, made from pop_3_col_df, but we're going to do one of those boolean checks. We're putting in a filter condition here: year is 2020. It's basically going to go through the data frame row by row, look at every single year, and check whether the year equals 2020. If it does, that boolean is true, and that true row goes into the new result. So it's an accumulation of rows where the boolean inside the square brackets is true. Run it. Yep: 200 by 3, and they're all 2020s. 200 rows, meaning 200 countries; no country appears in the same year twice, it's just unique country after unique country.
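The boolean filter might be sketched like this (toy rows; "Year" is an assumed column name):

```python
import pandas as pd

pop_3_col_df = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [2019, 2019, 2020, 2020],
    "Population": [10, 20, 11, 21],
})

# Boolean mask: keep only the rows where Year is 2020.
pop_2020_df = pop_3_col_df[pop_3_col_df["Year"] == 2020]
print(pop_2020_df.shape)
```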
Let's get the top 10 of those populations. That's easy enough: just slice off the first 10 rows with square brackets. There it is: your 10 most populous nations. We could also save that as a top_10_2020_df. If you get the shape of that, of course it's 10 by 3, not 200 anymore, because you said: I just want the top 10. We had all 200 of them and shaved off the top 10.
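Taking the top 10 might look like this sketch, assuming the frame is already sorted by population, descending (toy data):

```python
import pandas as pd

# 12 toy rows already sorted by population, descending.
pop_2020_df = pd.DataFrame({
    "Country": [f"C{i}" for i in range(12)],
    "Year": [2020] * 12,
    "Population": sorted(range(12), reverse=True),
})

# head(10) (equivalently, a [:10] slice) keeps just the top 10 rows.
top_10_2020_df = pop_2020_df.head(10)
print(top_10_2020_df.shape)
```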
It's already sorted by population, so that's your top-10 leaderboard for 2020 population. Now, if you have a column where every piece of data is exactly the same, like this year column, you can probably drop the column. You just note the year and that's it; you don't need to keep 10 copies of the same piece of data. So what we're going to do is drop the year column.
Since it only has one value, you could drop the column, or you could just take a slice: make a new df where you slice off country and population and don't keep the year. So let's do that. We're just going to re-declare it, except this time we're going to shave it down a little more. We shaved it to three columns; now we're shaving it to two. And now we've whittled that shape down to 10 by 2.
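That re-declare might be sketched like this (toy rows, assumed column names):

```python
import pandas as pd

top_10_2020_df = pd.DataFrame({
    "Country": ["A", "B"], "Year": [2020, 2020], "Population": [9, 8],
})

# Year is the same in every row, so re-declare with just two columns.
top_10_2020_df = top_10_2020_df[["Country", "Population"]]
print(top_10_2020_df.shape)
```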
Think where we came from: 4,195 by 14, and now it's 10 by 2. And believe it or not, we're going to make it even smaller than that. We're going to take the countries and move them over to become the index.
And the reason for that is that we're going to do a horizontal bar chart, like we did with the parental edu data, and the bars will show the index as their tick labels. If the index is numbers, you'd just see bars with numbers next to them; they wouldn't have countries next to them. So we're going to take the country column and reassign it as the index, so that our bars can be labeled.
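The index move might look like this sketch; the plotting call in the comment is the usual pandas shortcut, shown here as a pointer rather than executed:

```python
import pandas as pd

top_10_2020_df = pd.DataFrame({
    "Country": ["A", "B"], "Population": [9, 8],
})

# Move Country into the index so a horizontal bar chart
# (e.g. top_10_2020_df.plot.barh()) labels each bar with a country
# name instead of a row number.
top_10_2020_df = top_10_2020_df.set_index("Country")
print(top_10_2020_df.index.tolist())
```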