Discover how to transform and visualize data effectively using Python, including pivot tables and dynamic bar chart races. Learn to manage complex datasets and enhance your insights with compelling visualizations.
Key Insights
- Learned how to pivot datasets in Python, transforming rows into columns, an essential step demonstrated using a country population dataset (4,195 rows and 14 columns).
- Explored the creation of dynamic visualizations including pie charts and bar chart races using Python's specialized Bar Chart Race (BCR) library, highlighting the trends of the top 30 countries by population from 1955 to 2020.
- Reviewed techniques to inspect, clean, and prepare data by identifying unique values, managing missing data, and organizing datasets, emphasizing best practices in data management and preparation workflows.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Hi, welcome back to this Python Programming and Data Science course. My name is Brian McLean. Thanks for joining me for lesson 10.
This is the final lesson, and we've got a really great one for you. So let's dive in. Here's what we're going to learn.
We're going to learn how to pivot data. If you come from an Excel background, you might know about that. That would involve moving columns to become rows and vice versa.
We'll get into it. It's a little hard to explain without being in the middle of doing it, so let's hold off on that for now.
But it involves quite a transformation of your dataset. We're also going to revisit line charts and bar charts, and we'll add two new charts: we're going to learn how to make a pie chart, and then the crowning achievement, kind of the capstone project here, is what's known as a bar chart race.
A bar chart race is really cool. It's probably the only kind of chart that can get its own YouTube video and, you know, 50,000 views of people just looking at a chart. And why? Check this out.
You've seen these. Watch. The data is racing.
We've got the top 30 countries by population from 1955 to 2020. It's feeding off a dataset we're going to load from a CSV file, and we have to use this pivot to rearrange the data.
So, in a bar chart race in Python, the columns are racing. We have to load up some data, go into the country column, and turn each country into its own column. Hence the pivot part.
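Pivoting is easier to see than to describe, so here's a minimal sketch with made-up numbers (the country names and population values are illustrative, not the lesson's real data):

```python
import pandas as pd

# A tiny hypothetical slice of a country-population dataset
pop_df = pd.DataFrame({
    "year":       [2019, 2020, 2019, 2020],
    "country":    ["China", "China", "India", "India"],
    "population": [1404, 1411, 1367, 1380],   # millions, illustrative only
})

# pivot: the rows for each country become a column named after that country,
# leaving one row per year -- exactly the shape a bar chart race wants
pivoted = pop_df.pivot(index="year", columns="country", values="population")
print(pivoted)
```

After the pivot, each column is one "racer" and each row is one frame of the race.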
And then we're using this special library called bar_chart_race (BCR), which involves an install. It's not like NumPy or something where you can just import it; it has to be installed first.
If out-of-the-box Python (things like print, type, and int) is like the junk drawer in your kitchen where you grab a roll of tape or scissors, then importing is like going into the garage to get a drill out of the toolbox. Installing, though, is like going to Home Depot to rent the floor sander you only need for a day: something you don't use all the time.
It takes a little extra work to get it. We're going to install the module that makes this possible, then import it, then do our stuff. And the bar_chart_race module comes with a namesake method that has a ton of properties.
If you thought we were passing a lot of arguments into the parentheses of various functions before, this is going to be like dozens of properties. But it's cool, it works, and you can impress your friends and co-workers—people gathering around your computer at work while you show them this on the company data.
All right, so we're going to need the big three: import numpy as np, import pandas as pd, import matplotlib.pyplot as plt. We don't have any graphics files, so we don't need the image module, but we do need to load the CSV file.
We'll say from google.colab import drive, then drive.mount('/content/drive'). Let's run these. All right, let's get our base path.
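The setup described above, sketched as one cell (the Drive mount only works inside Google Colab, so it's commented out here):

```python
# The "big three" imports for this lesson
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# In Google Colab you would also mount your Drive to reach the CSV file
# (Colab-only, so commented out in this sketch):
# from google.colab import drive
# drive.mount('/content/drive')

print(np.__version__, pd.__version__)
```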
It's the same base path from the last lesson if you want to just go in there and grab it. I already closed that lesson, though, and I want to remind you how to find the base path, so we'll go grab it even though it hasn't changed.
Copy path. We're going into the csv folder, where we've got this country populations CSV. We've already got the base path written out, so we don't need to copy that part; we've got our path. Put a slash at the end, or you can put the slash at the start of the next part—you do need a slash to separate them. We'll say csv_path, and then let's load the CSV.
We'll call it pop_df—not pop like the remove method, but pop as in population. We try pd.csv with our base path plus the csv path, and we get an error: pandas has no attribute 'csv'. Right, of course not—it's read_csv. Let's also get the shape of this, and that gives us a tuple: (4195, 14).
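Since we can't reach the real Drive file from here, this sketch uses a hypothetical four-row CSV to stand in for the pd.read_csv(base_path + csv_path) call:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for the real country-populations file
csv_text = """year,country,population
2020,China,1439323776
2019,China,1433783686
2020,India,1380004385
2019,India,1366417754
"""
pop_df = pd.read_csv(io.StringIO(csv_text))
print(pop_df.shape)    # a (rows, columns) tuple; the lesson's file is (4195, 14)
print(pop_df.head())
```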
We have 4,195 rows of data in 14 columns. Let's just get the head of that, and it's China five times. So basically, the way the data is organized is that every country is represented, and every country gets one row per year sampled. Let's look at the first 40 rows: China all the way down for 18 rows, then India for 18 more, then the United States.
So the data is already sorted by population in descending order. The samples go backwards from 2020, one year at a time until 2015, and then in steps of five back to 1955—18 samples per country. Then it repeats for the next most populous country. If you do the tail (or slice from -40 to the end), you get the last 40 rows: countries with obviously much smaller populations. And we can also just get a sample.
Let's get a sample of 40 random rows. So there's a lot more data here than just the population and the country, but those two are all we care about. Nevertheless, we will look at the data, see how much is missing, and talk about how we could fill in the missing data, even though we won't be needing it.
In other words, we only need the first three columns, but the other columns will provide good opportunities for practicing filling in missing data. If you recall, in the students_df project, that CSV file of a thousand students had no missing data, so we weren't able to practice cleaning real data from a CSV. So we'll fix up and clean some data from columns that we're not even going to use, just for practice.
All right, so typically when you load up a dataset from a CSV file, you do two things. You explore the data—take stock of it, get a feel for it—and then you clean it, organize it, and filter it the way you want. So first you have to see what you have, and then you make your determinations and do what you want to do to the data.
So let's extract a Series of populations. Let's go into the population column and grab the entire column as a vector, called a Series. We'll say pop_series = pop_df['population']—you're just square-bracketing the column name. Then we're going to get the data type of that as well as its length. The length should be 4195, because that's how many items there are.
We don't want to print that many, but we could print, say, the first 20. Let's try that out: extract the population column values into a Series. The data type is Series, there are 4,195 values, and there are the first 20 populations with their companion keys.
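A quick sketch of that extraction, using a hypothetical three-row stand-in for the lesson's pop_df:

```python
import pandas as pd

# Hypothetical stand-in for the lesson's pop_df
pop_df = pd.DataFrame({
    "country":    ["China", "India", "USA"],
    "population": [1439323776, 1380004385, 331002651],
})

# Square-bracket the column name to pull the whole column out as a Series
pop_series = pop_df["population"]
print(type(pop_series))   # a pandas Series
print(len(pop_series))    # 3 here; 4195 in the lesson's dataset
print(pop_series[:20])    # first values, printed with their index labels
```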
The key-value pairs—those keys are the row indexes for these values. It's kind of giving it to us as key-value pairs. In fact, we can make a dictionary out of that: pass the pop_series to dict(), with the key being the index of the row and the value being the value from the population column. Well, in a dictionary we can't say "give me the first 20," because dictionaries aren't sliceable by position. We just get this long thing that goes and goes: a row index, then its value.
It's a humongous dictionary of 4,195 key-value pairs. Now let's make a list. We made a dictionary of key-value pairs from the Series; we'll make a list from the Series' values. That will be pop_list, the listification of the Series: list(pop_series). We don't want all of them, right?
Let's just get the first 10. There they are. There's your list of 10.
And we can also make a NumPy array. Just a reminder: we don't really need one for this project, but let's remember how to make one from a Series (or from a list, for that matter). We'll say pop_arr = np.array(pop_series), and now we can get the shape and the first 10 items. There it is—there's your shape: it's a vector, (4195,), 4195 comma nothing. And there are the first 10 items in the resulting array. Note the lack of columns: these are all basic, vector-y, 1D things, right?
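The array move, sketched with the same hypothetical values:

```python
import numpy as np
import pandas as pd

pop_series = pd.Series([1439323776, 1380004385, 331002651])  # hypothetical values

# Series -> 1-D NumPy array
pop_arr = np.array(pop_series)
print(pop_arr.shape)    # (3,) -- "n comma nothing", i.e. a vector, no columns
print(pop_arr[:10])
```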
List, Series, array. Now, a couple of new moves.
What about the unique values? There is a method that you can call on a column called unique. And it will return all the unique values from a column. So the country column.
How many countries are there? Let's find out. It comes back as an array, and we'll listify it: unique_countries equals our pop_df country column with the unique method called on it. We'll listify that, get the len of it, and maybe print, I don't know, the first 20.
There you go. And since the countries go in order of population, you're looking at the 20 most populous countries—that's why China gets listed first.
So there are 235 countries and here are the top 20, etc. Now let's get the unique years. We'll make a challenge out of that.
Get all the unique values from the year column and print the length and the first 10—no, actually, print all of them. Get all the unique values from the year column into a list, then print the length of the list as well as the entire list, just because it's not that many items and they're short. It's not like the 235 countries, right? We don't want to print every country. I mean, we could, though—there aren't that many. There you go: 235.
All right, pause. Here's the challenge, simplified: get all the unique years into a list. Pause—see if you can do it.
It's very much like what we just did with the countries. Okay, we're back. We're going to say unique_years equals pop_df, the year column, and unique.
We'll print the length of that, and then the entire thing, and there you go. So I guess we don't need to listify it after all. Let's see—yeah, it works without listifying it. Great, it's already list-like. Okay, and there are your 18 unique years.
And now, you wouldn't do that to a column with lots of continuous values. You wouldn't do that on the population column: if you got all the unique populations, you'd probably get as many items as there are rows, right? Every single population is probably unique.
The numbers are so big, there aren't likely to be any ties—maybe among the really small-population countries. Okay, now it's time for some data cleaning, or at least inspection of the data so we know whether we need to clean it. We'll say pop_df.info(), and that gives us a report on all the columns and their non-null values.
If you see 4,195 non-null, that means all 4,195 rows have data, so those columns are clean and good to go. And that's nice, because year, country, and population, again, are really the only columns we're going to use for this project.
That cool bar chart race, which is going to be the culmination of this, only requires those exact three columns. But if you scroll down a little, you'll see we've got several columns with a significant amount of missing data. We'll discuss more about what we want to do with that missing data.
Even though we're not going to use it, we might want to practice cleaning it. We can also run isna().sum() to see a different view of the data. Okay, now it's showing us not what's non-null, but what is missing.
That's a little clearer, really. Okay, so we can assume that in all three of those columns, it's the same 595 rows that are missing. Likewise, the same 113 rows.
We could also probably assume there's some overlap, so the total number of rows with at least one missing field would be in the 600-to-650-ish range, maybe.
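Both inspection views side by side, on a hypothetical frame with some holes punched in it:

```python
import pandas as pd

# Hypothetical frame with missing values, loosely mimicking the lesson's columns
pop_df = pd.DataFrame({
    "country":    ["China", "India", "USA", "Monaco"],
    "population": [1439323776, 1380004385, 331002651, None],
    "density":    [153.0, None, None, 26337.0],
})

pop_df.info()               # one view: non-null counts per column
print(pop_df.isna().sum())  # the inverse view: how many values ARE missing
```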