Discover how to transform and visualize data effectively using Python, including pivot tables and dynamic bar chart races. Learn to manage complex datasets and enhance your insights with compelling visualizations.
Key Insights
- Learned how to pivot datasets in Python, transforming rows into columns, an essential step demonstrated using a country population dataset (4,195 rows and 14 columns).
- Explored the creation of dynamic visualizations including pie charts and bar chart races using Python's specialized Bar Chart Race (BCR) library, highlighting the trends of the top 30 countries by population from 1955 to 2020.
- Reviewed techniques to inspect, clean, and prepare data by identifying unique values, managing missing data, and organizing datasets, emphasizing best practices in data management and preparation workflows.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Hi, welcome back to this Python Programming and Data Science course. My name is Brian McLean. Thanks for joining me for lesson 10.
This is the final lesson, and we've got a really great one for you. So let's dive in. Here's what we're going to learn.
We're going to learn how to pivot data. If you come from an Excel background, you might know about that. That would involve moving columns to become rows and vice versa.
We'll get into it. It's a little hard to explain without being in the middle of doing it, so let's hold off on that for now.
But it involves quite a transformation of your dataset. We're also going to revisit line charts and bar charts, and we'll add two new charts: we're going to learn how to make a pie chart, and then the crowning achievement, kind of the capstone project here, is what's known as a bar chart race.
A bar chart race is really cool. It's probably the only kind of chart that can get its own YouTube video and, you know, 50,000 views of people just looking at a chart. And why? Check this out.
You've seen these. Watch. The data is racing.
We've got the top 30 countries by population from 1955 to 2020. It's feeding off a dataset we're going to load from a CSV file, and we have to use this pivot to rearrange the data.
So, in a bar chart race in Python, the columns are racing. We have to load up some data, go into the country column, and turn each country into its own column. Hence the pivot part.
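Pivoting is easier to see than to describe, so here's a minimal sketch with made-up numbers (the country names and population values are illustrative, not the lesson's real data):

```python
import pandas as pd

# A tiny hypothetical slice of a country-population dataset
pop_df = pd.DataFrame({
    "year":       [2019, 2020, 2019, 2020],
    "country":    ["China", "China", "India", "India"],
    "population": [1404, 1411, 1367, 1380],   # millions, illustrative only
})

# pivot: the rows for each country become a column named after that country,
# leaving one row per year -- exactly the shape a bar chart race wants
pivoted = pop_df.pivot(index="year", columns="country", values="population")
print(pivoted)
```

After the pivot, each column is one "racer" and each row is one frame of the race.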
And then we're using this special library called bar_chart_race (BCR), which involves an install. It's not like NumPy or something where you can just import it; it has to be installed first.
If out-of-the-box Python (things like print, type, and int) is like the junk drawer in your kitchen where you grab a roll of tape or scissors, then importing is like going into the garage to get a drill out of the toolbox. Installing, though, is like going to Home Depot to rent the floor sander you only need for a day: something you don't use all the time.
It takes a little extra work to get it. We're going to install the module that makes this possible, then import it, then do our stuff. And the bar_chart_race module comes with a namesake method that has a ton of properties.
If you thought we were passing a lot of arguments into the parentheses of various functions before, this is going to be like dozens of properties. But it's cool, it works, and you can impress your friends and co-workers—people gathering around your computer at work while you show them this on the company data.
All right, so we're going to need the big three: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt. We don't have any graphics files, so we don't need the image module, but we do need to load the CSV file.
We'll say from google.colab import drive, then drive.mount('/content/drive'). Let's run these. All right, let's get our base path.
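The setup described above, sketched as one cell (the Drive mount only works inside Google Colab, so it's commented out here):

```python
# The "big three" imports for this lesson
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In Google Colab you would also mount your Drive to reach the CSV file
# (Colab-only, so commented out in this sketch):
# from google.colab import drive
# drive.mount('/content/drive')

print(np.__version__, pd.__version__)
```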
It's the same base path from the last lesson if you want to just go in there and grab it. I already closed that lesson, though, and I want to remind you how to find the base path, so we'll go grab it even though it hasn't changed.
Copy path. We're going into the csv folder, where we've got this country populations CSV. We've already got the base path written out, so we don't need to copy that part; we've got our path. Put a slash at the end, or you can put the slash at the start of the next part—you do need a slash to separate them. We'll say csv_path, and then let's load the CSV.
We'll call it pop_df—not pop like the remove method, but pop as in population. We try pd.csv with our base path plus the csv path, and we get an error: pandas has no attribute 'csv'. Right, of course not—it's read_csv. Let's also get the shape of this, and that gives us a tuple: (4195, 14).
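Since we can't reach the real Drive file from here, this sketch uses a hypothetical four-row CSV to stand in for the pd.read_csv(base_path + csv_path) call:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real country-populations file
csv_text = """year,country,population
2020,China,1439323776
2019,China,1433783686
2020,India,1380004385
2019,India,1366417754
"""
pop_df = pd.read_csv(io.StringIO(csv_text))
print(pop_df.shape)    # a (rows, columns) tuple; the lesson's file is (4195, 14)
print(pop_df.head())
```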
We have 4,195 rows of data in 14 columns. Let's just get the head of that, and it's China five times. So basically, the way the data is organized is that every country is represented, and every country gets one row per year sampled. Let's look at the first 40 rows: China all the way down for 18 rows, then India for 18 more, then the United States.
So the data is already sorted by population in descending order. The samples go backwards from 2020, one year at a time until 2015, and then in steps of five back to 1955—18 samples per country. Then it repeats for the next most populous country. If you do the tail (or slice from -40 to the end), you get the last 40 rows: countries with obviously much smaller populations. And we can also just get a sample.
Let's get a sample of 40 random rows. So there's a lot more data here than just the population and the country, but those two are all we care about. Nevertheless, we will look at the data, see how much is missing, and talk about how we could fill in the missing data, even though we won't be needing it.
In other words, we only need the first three columns, but the other columns will provide good opportunities for practicing filling in missing data. If you recall, in the students_df project, that CSV file of a thousand students had no missing data, so we weren't able to practice cleaning real data from a CSV. So we'll fix up and clean some data from columns that we're not even going to use, just for practice.
All right, so typically when you load up a dataset from a CSV file, you do two things. You explore the data—take stock of it, get a feel for it—and then you clean it, organize it, and filter it the way you want. So first you have to see what you have, and then you make your determinations and do what you want to do to the data.
So let's extract a Series of populations. Let's go into the population column and grab the entire column as a vector, called a Series. We'll say pop_series = pop_df['population']—you're just square-bracketing the column name. Then we're going to get the data type of that as well as its length. The length should be 4195, because that's how many items there are.
We don't want to print that many, but we could print, say, the first 20. Let's try that out: extract the population column values into a Series. The data type is Series, there are 4,195 values, and there are the first 20 populations with their companion keys.
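A quick sketch of that extraction, using a hypothetical three-row stand-in for the lesson's pop_df:

```python
import pandas as pd

# Hypothetical stand-in for the lesson's pop_df
pop_df = pd.DataFrame({
    "country":    ["China", "India", "USA"],
    "population": [1439323776, 1380004385, 331002651],
})

# Square-bracket the column name to pull the whole column out as a Series
pop_series = pop_df["population"]
print(type(pop_series))   # a pandas Series
print(len(pop_series))    # 3 here; 4195 in the lesson's dataset
print(pop_series[:20])    # first values, printed with their index labels
```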
The key-value pairs—those keys are the row indexes for these values. It's kind of giving it to us as key-value pairs. In fact, we can make a dictionary out of that: pass the pop_series to dict(), with the key being the index of the row and the value being the value from the population column. Well, in a dictionary we can't say "give me the first 20," because dictionaries aren't sliceable by position. We just get this long thing that goes and goes: a row index, then its value.
It's a humongous dictionary of 4,195 key-value pairs. Now let's make a list. We made a dictionary of key-value pairs from the Series; we'll make a list from the Series' values. That will be pop_list, the listification of the Series: list(pop_series). We don't want all of them, right?
Let's just get the first 10. There they are. There's your list of 10.
And we can also make a NumPy array. Just a reminder: we don't really need one for this project, but let's remember how to make one from a Series (or from a list, for that matter). We'll say pop_arr = np.array(pop_series), and now we can get the shape and the first 10 items. There it is—there's your shape: it's a vector, (4195,), 4195 comma nothing. And there are the first 10 items in the resulting array. Note the lack of columns: these are all basic, vector-y, 1D things, right?
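The array move, sketched with the same hypothetical values:

```python
import numpy as np
import pandas as pd

pop_series = pd.Series([1439323776, 1380004385, 331002651])  # hypothetical values

# Series -> 1-D NumPy array
pop_arr = np.array(pop_series)
print(pop_arr.shape)    # (3,) -- "n comma nothing", i.e. a vector, no columns
print(pop_arr[:10])
```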
List, Series, array. Now, a couple of new moves.
What about the unique values? There is a method that you can call on a column called unique. And it will return all the unique values from a column. So the country column.
How many countries are there? Let's find out. It comes back as an array, and we'll listify it: unique_countries equals our pop_df country column with the unique method called on it. We'll listify that, get the len of it, and maybe print, I don't know, the first 20.
There you go. And since the countries go in order of population, you're looking at the 20 most populous countries—that's why China gets listed first.
So there are 235 countries and here are the top 20, etc. Now let's get the unique years. We'll make a challenge out of that.
Get all the unique values from the year column and print the length and the first 10—no, actually, print all of them. Get all the unique values from the year column into a list, then print the length of the list as well as the entire list, just because it's not that many items and they're short. It's not like the 235 countries, right? We don't want to print every country. I mean, we could, though—there aren't that many. There you go: 235.
All right, pause. Here's the challenge, simplified: get all the unique years into a list. Pause—see if you can do it.
It's very much like what we just did with the countries. Okay, we're back. We're going to say unique_years equals pop_df, the year column, and unique.
We'll print the length of that, and then the entire thing, and there you go. So I guess we don't need to listify it after all. Let's see—yeah, it works without listifying it. Great, it's already list-like. Okay, and there are your 18 unique years.
And now, you wouldn't do that to a column with lots of continuous values. You wouldn't do that on the population column: if you got all the unique populations, you'd probably get as many items as there are rows, right? Every single population is probably unique.
The numbers are so big, there aren't likely to be any ties—maybe among the really small-population countries. Okay, now it's time for some data cleaning, or at least inspection of the data so we know whether we need to clean it. We'll say pop_df.info(), and that gives us a report on all the columns and their non-null values.
If you see 4,195 non-null, that means all 4,195 rows have data, so those columns are clean and good to go. And that's nice, because year, country, and population, again, are really the only columns we're going to use for this project.
That cool bar chart race, which is going to be the culmination of this, only requires those exact three columns. But if you scroll down a little, you'll see we've got several columns with a significant amount of missing data. We'll discuss more about what we want to do with that missing data.
Even though we're not going to use it, we might want to practice cleaning it. We can also run isna().sum() to see a different view of the data. Okay, now it's showing us not what's non-null, but what is missing.
That's a little clearer, really. Okay, so we can assume that in all three of those columns, it's the same 595 rows that are missing. Likewise, the same 113 rows.
We could also probably assume there's some overlap, so the total number of rows with at least one missing field would be in the 600-to-650-ish range, maybe.
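Both inspection views side by side, on a hypothetical frame with some holes punched in it:

```python
import pandas as pd

# Hypothetical frame with missing values, loosely mimicking the lesson's columns
pop_df = pd.DataFrame({
    "country":    ["China", "India", "USA", "Monaco"],
    "population": [1439323776, 1380004385, 331002651, None],
    "density":    [153.0, None, None, 26337.0],
})

pop_df.info()               # one view: non-null counts per column
print(pop_df.isna().sum())  # the inverse view: how many values ARE missing
```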