Exploring Matplotlib for Data Visualization

Intro to Matplotlib, load CSV with Pandas, clean/filter/sort data, compute averages, show descriptive stats.

Dive into Matplotlib, Python's powerful visualization library, and discover how to efficiently load large CSV data sets for insightful data analysis. Learn to manipulate and visualize data effectively to reveal meaningful patterns and trends.

Key Insights

  • Introduced Matplotlib, a fundamental data visualization library in Python, crucial for generating various charts (bar, line, pie) after organizing data using NumPy and Pandas.
  • Demonstrated connecting Google Colab to Google Drive to efficiently load large CSV data sets, such as the 1,000-row student performance CSV file, for comprehensive data analysis.
  • Explained data manipulation techniques, including renaming columns, computing average scores through vector operations, and filtering data frames based on multiple conditions (e.g., filtering for female students scoring above 90 in math, reading, and writing).

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Hi, welcome back to this course on Python programming and data science. My name is Brian McClain. Thanks for coming back for lesson eight.

In this lesson, we're going to get into another data science library. We looked at Pandas in lesson seven, and in lesson six we learned about NumPy.

And now we're going to get into Matplotlib, which is a plotting library for making all your charts and visualizations. So think of it this way: Matplotlib is one of the big three data science libraries.

You have NumPy, Pandas, and Matplotlib. They're the big three, and you kind of want to learn them in that order as well, in my opinion. NumPy is the structural underpinning: the matrix and vector operations that we then build on to make two-dimensional data frames.

We learn all the selection moves in NumPy to select two-dimensionally, and we use that in Pandas as well. And then Matplotlib comes in once you've got your data in and you've cleaned it and sorted it the way you like. We learned a lot of operations in the last lesson: how to add columns, remove them, add new rows, clean up, fill missing data, all sorts of manipulations. So once you get your data the way you like it, you can then chart it: bar charts, line charts, pie charts, and so forth.

So that's what we're going to get into. And also, in the last lesson, we made all of our data from scratch, spending a lot of time on that chessboard and food data frame. But that was all our own practice data.

Now we're going to start loading CSV files. So big piles of data, large datasets from an external file, and to do that we'll need to connect to Google Drive. So let's begin by importing our… So CSV, comma-separated values, this is the format for the datasets that we're loading.

In a CSV file, the columns are separated by commas. So here's a CSV file. This is a comma-separated values file where the first row is your column headers, separated by commas.

The commas indicate the break between columns, and then you have your multiple rows. And underneath each column header, you can see where the row data lines up. So there's your date, your weekday, your region, your employee, your item units, and so forth.

So this is the kind of data we load up. And you can save an Excel spreadsheet as CSV. We're going to import what we could now call the big three of data science modules.

We'll import pandas as pd. It doesn't really matter the order, but typically you import NumPy first: numpy as np.

And then we're going to import matplotlib.pyplot, which is Matplotlib's plotting interface, as plt. And if you recall, we imported that chessboard image, so we need images too.

We're going to say from IPython.display import Image. Run that. You'll recognize all of these except Matplotlib, which is new.
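As a minimal sketch, the import cell described above might look like this. The aliases np, pd, and plt are the conventional ones named in the lesson. The lesson also imports Image from IPython.display; that line is left as a comment here only so the sketch also runs outside a notebook environment.

```python
# Conventional aliases for the "big three" data science libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # pyplot is Matplotlib's plotting interface

# In a notebook, the lesson also imports the image display helper:
# from IPython.display import Image

print(np.__name__, pd.__name__, plt.__name__)
```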

Now here is something else that's new. We're going to connect to Google Drive. As we go over to Google Drive, there's our class files.

There's your .csv files for the course, and we also have some images here. We need to load the data. We're going to connect to Google Drive so we can load data into our Google Colab notebook.

We're going to say from google.colab import drive. And then we're going to say drive.mount, and in quotes, slash content slash drive. Run that.

And it's going to prompt you to connect to Google Drive, which we'll go ahead and do. When it's done, it'll give you a little confirmation message: Mounted at /content/drive. There we go.
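The mounting step can be sketched as follows. The google.colab module only exists inside a Colab runtime, so the try/except guard is an addition for this sketch (not something the lesson types) that lets the cell run elsewhere without failing. The names MOUNT_POINT and in_colab are illustrative.

```python
MOUNT_POINT = "/content/drive"  # the mount location used in the lesson

try:
    # google.colab is only importable inside a Colab runtime.
    from google.colab import drive

    drive.mount(MOUNT_POINT)  # prompts you to authorize Drive access
    in_colab = True
except ImportError:
    # Assumption for this sketch: skip mounting when not running in Colab.
    in_colab = False

print("Running in Colab:", in_colab)
```

In Colab itself, the two lines inside the try block are all you need.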

And let's come over to the side now. All these little buttons on the side. Click on the folder and you'll see drive because your drive is mounted.

And we want to go into MyDrive. And then from MyDrive, we want to go into our class folder, right? Noble, Python, Boot Camp, et cetera, et cetera. Let's scroll for it.

Noble, Python, Boot Camp. I've got a lot of these. Okay, here we go.

Bingo. We're going to copy, mouse over the big folder here, and we're going to copy the path. This is sort of the base URL that'll get us into all the subfolders of our project, our course files.

So let's copy, paste the CSV. We'll first copy and paste the base and CSV file URLs. So the base URL is kind of the URL that just takes you into the main folder there.

We'll call it base URL equals and then quotes, just paste whatever it gave you. Okay, so we've got this URL now. And then the CSV URL, you can go in here and we just want this thing called student performance.

We can mouse over the dots, rename file, and just copy. We don't need the whole path because we've got the base URL. We just want the name of the file.

So CSV URL, or path, really. I mean, it is kind of a URL, but I'll call these the base path and the CSV path. And we're going to put a slash at the end of the base path.

And then in the CSV path, we'll say CSV, slash, and then the student performance file name, dot csv. So with the two paths concatenated together, you get the path all the way to that student performance CSV file, which we're going to load up here.

We're going to load the student performance CSV file into a data frame. So come on down, and we'll say students_df, using our standard suffix. That's going to be pd dot, not DataFrame, since we're not making a data frame from scratch; we're saying read_csv. And then whatever we read in will become a data frame.

We'll say base path plus CSV path. Between the two of them concatenated together, that should get us our data frame. And did it work? We can find out by getting the shape and the head, which would be the first five rows.

It worked. So 1000 by 7. Remember our tuple for our two-dimensional shape: we have 1000 rows and seven columns. Some student data, looks like high school kids, probably grade school.
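Here is a hedged sketch of the load step. The real Drive folder path and file name are not given in the transcript, so this example writes a tiny made-up CSV to a temporary folder and reads it back. The values of base_path and csv_path, and the three data rows, are stand-ins; only the column headers come from the lesson.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the folder path copied from the Colab file
# browser. Note the trailing slash, as in the lesson.
base_path = tempfile.mkdtemp() + "/"
csv_path = "csv/student-performance.csv"  # hypothetical file name

# Write a tiny made-up CSV so this sketch runs without the course files.
os.makedirs(base_path + "csv")
rows = [
    "gender,parental level of education,lunch,test preparation course,"
    "math score,reading score,writing score",
    "female,bachelor's degree,standard,none,72,72,74",
    "female,some college,standard,completed,69,90,88",
    "male,associate's degree,free/reduced,none,47,57,44",
]
with open(base_path + csv_path, "w") as f:
    f.write("\n".join(rows) + "\n")

# Concatenate the two paths and read the file into a DataFrame.
students_df = pd.read_csv(base_path + csv_path)

print(students_df.shape)   # (rows, columns) tuple
print(students_df.head())  # first five rows (only three exist here)
```

With the course file in place, base_path would be the pasted Drive folder path and the shape would be the 1000 by 7 reported in the lesson.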

We've got gender, parental level of education, lunch, test prep course, math, reading and writing scores. Now are we missing any data? We've got 1000 rows. And one thing we can do is say sample and say 10.

And that'll just give you 10 random rows, which is maybe useful. I'm not seeing any here; let's run it again. So running a random sample of 10 a few times, I don't see any missing data. There might not be any. Let's find out.

Remember, there are two ways we can find out if we're missing data. We can say students_df.info(), and it says 1000 non-null for each column, so that means we're good. And then there's, of course, isna. If you remember that one, that's pretty good.

We've only looked at it a couple of times. Chain sum onto it, and this will tell you how many are missing in each column. It's more concise to print it that way.

So nothing missing, no missing data. isna, right? Is missing. The sum of is missing is zero on every single column. We're not missing any data.
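The missing-data checks described above can be sketched on a small made-up frame (the real one has 1000 rows). The variable missing_counts is an illustrative name.

```python
import pandas as pd

# Made-up stand-in rows; the lesson's frame has 1000 rows.
students_df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 47, 90, 76],
    "reading score": [72, 57, 95, 78],
})

print(students_df.sample(2))  # two random rows to eyeball

students_df.info()            # prints a non-null count per column

# isna() marks each missing cell as True; sum() counts them per column.
missing_counts = students_df.isna().sum()
print(missing_counts)
```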

Okay, so what we want to do next is describe the data. Let's get a numeric stats breakdown. students_df.describe(): remember, that gives you eight stats on the numeric columns.

The count, the mean, the standard deviation, the min, the 25th, 50th, and 75th percentiles, and the max value. So just to recap standard deviation: okay, so the average math score is 66, and the standard deviation is 15, if you remember how that works.

The way the bell curve works is 66 is the average, the middle of the curve. So about two-thirds of the results would be within one standard deviation, plus or minus.

68 percent of the results are within plus or minus one standard deviation. Well, if the average is 66, you have to go left and right by one standard deviation. In this case, 15 points.

So add 15, subtract 15, and you get 51 to 81. That means that roughly 68 percent of the math scores are in the 51 to 81 range. That's how standard deviation works.
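A short sketch of the two ideas above: describe() returning eight stats, and the one-standard-deviation range worked out from the rounded figures quoted in the lesson (mean 66, standard deviation 15). The five scores in scores_df are made up purely to show the shape of the describe() output.

```python
import pandas as pd

# Made-up scores, only to show what describe() returns.
scores_df = pd.DataFrame({"math score": [51, 66, 81, 60, 72]})
stats = scores_df.describe()
print(stats)  # count, mean, std, min, 25%, 50%, 75%, max

# Rounded figures quoted in the lesson for the real math scores.
mean, std = 66, 15
low, high = mean - std, mean + std
print(f"About 68% of scores fall between {low} and {high}")
```

The 68 percent figure assumes the scores follow a roughly bell-shaped distribution, as the lesson does.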

All right, we're going to load up the bell curve image, which we have, so we don't have to keep googling it. We've got our base path. Now we're going to set the image path, and we probably don't need the image as big as it's going to come in, so we'll say 450.

Oh, there it is. So standard deviation, okay, right? You can just look. So the average reading score is 69, the average writing score is 68, standard deviation 15.

For writing, that would mean subtract 15 to get 53, add 15 to get 83. 68 percent of all the writing scores are between 53 and 83. Go out two standard deviations, and you get to 95 percent.

Subtract another 15, add another 15: 38 to 98. 95 percent of the writing scores are between 38 and 98. That sounds right.

That's how standard deviation works. Okay, so renaming. We learned about this in the last lesson.

We had a column called sales, and we renamed it revenue. So you call the rename method on the data frame and set the columns argument equal to a dictionary with a key of the old name and a value of the new name. We've got some kind of long column names that would be better shortened up a little bit.

We're going to change parental level of education and test preparation course to parental edu and test prep. We've got them written out right here. And we can do it in place.

So rather than set the df equal to itself, we'll use that inplace equals True that we learned about in the last lesson as well. We'll say students_df.rename, columns equals. It's a dictionary of key-value pairs.

The first key will be parental level of education, and we'll call that parental edu. Rename that one. And the next key-value pair will be test preparation course.

We'll rename that test prep. And we should be able to say inplace equals True. Remember, inplace equals True means you do not have to assign the result back to the data frame, because it modifies the data frame directly and does not return a new one.

Does it work? Boom. It worked. All right.
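A minimal sketch of the in-place rename, on a made-up two-row frame. The exact spelling of the new names is not shown in the transcript, so parental_edu and test_prep (with underscores) are assumptions. Capturing the return value in result is only there to show that nothing comes back when inplace=True.

```python
import pandas as pd

# Made-up stand-in rows with the lesson's original long column names.
students_df = pd.DataFrame({
    "gender": ["female", "male"],
    "parental level of education": ["bachelor's degree", "some college"],
    "test preparation course": ["none", "completed"],
})

# Keys are the old names, values are the new names (spelling assumed).
result = students_df.rename(
    columns={
        "parental level of education": "parental_edu",
        "test preparation course": "test_prep",
    },
    inplace=True,
)

print(result)                     # None: the frame was changed directly
print(list(students_df.columns))
```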

Challenge. Rename the score columns. Change math score to math, reading score to reading, and writing score to writing.

We get it that they're scores. You can just take out the word score. Rename them all.

So look at how we did the parental level of education and the test prep. How did we rename those? Check it out. Bring that technology into the challenge.

Pause. Come back when you're ready. Okay.

We could copy paste. Maybe it's better practice for us to write it out. We'll say students_df.rename, columns equals, then squiggles, curly braces, because it's a dictionary of key-value pairs.

We're going to target the math score. Colon, we'll just call it math, comma, the reading score, which we're going to change to reading. So these we can copy paste, right? They're sitting right here.

And I'll just copy this for reading and writing, like so. Run. And we've tightened the screws on the column names.
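One possible solution to the challenge, sketched on a single made-up row. It follows the same rename pattern as before; using inplace=True here is an assumption, since the transcript does not say whether it was passed.

```python
import pandas as pd

# One made-up row with the original score column names.
students_df = pd.DataFrame({
    "math score": [72],
    "reading score": [72],
    "writing score": [74],
})

# Drop the word "score" from each of the three column names.
students_df.rename(
    columns={
        "math score": "math",
        "reading score": "reading",
        "writing score": "writing",
    },
    inplace=True,
)

print(list(students_df.columns))
```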

Oh, notice we can say just give the head, or you can also say the tail, which is the last five rows. If you just say the whole df, it's not going to show you every row, because there's a thousand rows. It'll show you the first five, dot, dot, dot, the last five.

But if you specify the first 40, it'll show you all of them, up to 40 or 50. We could say I want the first 40 by twos. I just want the even ones.

First 40, every other one. And maybe start at two. I don't want to see the zero.

Okay. And maybe I want to go to 40. So just there you go.

Remember slicing. There's the first 10 by twos. First 11, really.

No, that's 10, two, four, six. There's five. There's 10.
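The row slicing talked through above can be sketched like this. The transcript does not show the exact slice that was typed, so 2:41:2 (start at row 2, stop before 41, step by 2) is one reading of "start at two, go to 40, by twos". The 50-row frame is a stand-in.

```python
import pandas as pd

# Fifty stand-in rows so the slices have something to select from.
students_df = pd.DataFrame({"math": range(50)})

first_five = students_df.head()  # first five rows
last_five = students_df.tail()   # last five rows

# Row slice with a step: start at row 2, stop before row 41, every other row.
even_rows = students_df[2:41:2]

print(even_rows.index.tolist())
```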

Yeah. All right. Challenge.

Make a new column called average, which is the mean of the three scores. That would be a vector operation. Hint: vector operation, math between columns, right? You can just add columns and it'll do the math on the corresponding items row by row.

So pause, add the new average column. Let's lowercase it. We don't have capitalized letters in the other column names.

Math, reading, writing. You could call it mean or average. Let's call it average.

All right. Pause. Make the average column.

Come back for the solution. Okay. We're back.

We're going to say students_df, square bracket, average. So remember, to add a new column, you do it like you're declaring a key on a dictionary, and that is going to be the sum of math, reading, writing. Oh, that's not the average.

That's the sum. Better do it over. Okay.

We have to divide that by three, right? You add it up and divide by three, and we'd probably like to round the number off. We'll say round to one decimal. There you go.

72.7, or maybe two decimals. So here's what we're doing: we're adding up the three scores, dividing by three, and then we're taking that piece of math, that number, passing it to the round function, and saying we want to round it off to one or two decimals.

Two is probably good. And we have to wrap the three numbers we're adding together in parentheses, right? Because the order of operations, otherwise it'll do the division first, which we don't want. There we go.

We have our averages. And again, I'm forgetting to specify don't give us every single row. So there you go.
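A sketch of the average column as a vector operation, on three made-up rows. The parentheses around the sum matter: without them, only the writing column would be divided by three.

```python
import pandas as pd

# Made-up scores; column names follow the renamed columns in the lesson.
students_df = pd.DataFrame({
    "math": [72, 69, 90],
    "reading": [72, 90, 95],
    "writing": [74, 88, 93],
})

# Add the three columns row by row, divide by 3, round to two decimals.
students_df["average"] = round(
    (students_df["math"] + students_df["reading"] + students_df["writing"]) / 3,
    2,
)

print(students_df.head())
```

Calling .round(2) on the resulting column is an equivalent way to do the rounding.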

It's working great. All right. Challenge.

Filter challenge. You learned all about this in the previous lesson. We'll be hitting you with some challenges right away.

Save to a new DF, new data frame, only male students with parental EDU equal to bachelor's degree. So you have to do a double filter here. Two conditions.

Pause. Try it. Turn the recording back on when you'd like the solution.

Okay. Here we go. We're going to say the new DF, what do I want to call it? We'll call it male parental bach degree.

I don't know. DF equals students DF. We have two conditions.

We're going to use the and operator, the ampersand, and we'll bundle each condition inside parentheses. And it'll be easier to read if we put them on their own lines with the and in between, like so. We'll do it in place. No, no, no, not in place; we want to return a new data frame.

Okay. So the first condition will be students_df gender equals male. The second condition will be students_df parental edu equals bachelor's degree.

Just like so. We've got two conditions, both of which must be true. That's what the and is for.

This is not an or situation. So let's print out the shape and print the head of this. We'll do a sample.

How's that? There. It worked. We have 55 by eight.

That makes sense. There's 55 rows. That was about right.

55 results, 55 males whose parents have exactly a bachelor's. Okay. Review, right? We learned this in lesson seven.
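The two-condition filter can be sketched on a made-up four-row frame. The column name parental_edu and the variable name male_parental_bach_degree_df are assumed spellings of the names spoken in the lesson.

```python
import pandas as pd

# Made-up stand-in rows.
students_df = pd.DataFrame({
    "gender": ["male", "female", "male", "male"],
    "parental_edu": [
        "bachelor's degree",
        "bachelor's degree",
        "some college",
        "bachelor's degree",
    ],
    "math": [70, 88, 65, 91],
})

# Each condition goes in its own parentheses; & means both must be true.
male_parental_bach_degree_df = students_df[
    (students_df["gender"] == "male")
    & (students_df["parental_edu"] == "bachelor's degree")
]

print(male_parental_bach_degree_df.shape)
print(male_parental_bach_degree_df.head())
```

The parentheses are required because & binds more tightly than ==.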

Okay. Next up challenge, save to a new DF, only the top 10 females with reading and math and writing scores of at least 90 each. That's three conditions you have to consider.

All of which must be true. It's not that they can have an average of 90. Each individual score has to be 90 plus.

So you could have a total of 270, which is exactly a 90 average, but not be at 90 or above in every one. And they have to be female. We don't care about the average here, but there you go.

And we only want the top 10. So what you can do is use square brackets to select off the top 10 when you're done with your filter. Give it a try. Pause, come back for the answer when you're ready.

Okay. We're going to go with the answer. I would recommend retyping it just so you can get good at this stuff.

We'll say, what do we want to call this DF? We'll say female, min 90 all scores. So that'll be equal to students_df with our conditions.

We actually have four conditions here: female and three different tests. Inside each one of these parentheses, we'll put another condition. So the first condition will be the gender check, gender equals female.

Let me open this up a little bit. The second condition will be a score check. We'll say math greater than or equal to 90.

And we'll just do that two more times. So all four of these must be true. So you're not going to get a ton of results, maybe 20.
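The four-condition filter can be sketched on made-up rows. Note the second row: its average is above 90, but it is excluded because one individual score falls below 90. The variable name female_min_90_df is an assumed spelling of the name spoken in the lesson.

```python
import pandas as pd

# Made-up stand-in rows.
students_df = pd.DataFrame({
    "gender": ["female", "female", "male", "female"],
    "math": [95, 95, 99, 90],
    "reading": [92, 89, 99, 90],
    "writing": [91, 99, 99, 90],
})

# All four conditions must be true for a row to be kept.
female_min_90_df = students_df[
    (students_df["gender"] == "female")
    & (students_df["math"] >= 90)
    & (students_df["reading"] >= 90)
    & (students_df["writing"] >= 90)
]

print(female_min_90_df.shape)
print(female_min_90_df)
```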

Let's see. Oh, and we want the answer in descending order. We'd like to get the highest score at the top.

And we're doing a top 10, but let's just first see how many we did get altogether. Shape. Okay, 20.

20 girls met this standard. Fine, great. Now, we'd like that in order, though it's not in order.

So what we're going to do is we could do it all in one move, but it gets hard to read when it gets really long. So let's sort the results. All right.

We'll say sort results by average from high to low, right? I mean, every single score has to be at least 90. But it says sort the results by the highest average score. We're not going by the highest math score or anything, just the average.

We'll say dot sort values. If you recall, that's the method. And then you say by and you feed in the column you want to sort by.

And ascending equals False. We want it to be descending. We have to switch from the default.

If we don't say that, it'll give us low to high. We want high to low. And we actually want top 10.

And we want to do that in place so that we don't have to save it to anything. Let's run. And what do we get here? Caveats.

Slicing. Dicing. Okay.

Sort values. Ascending false. Okay, that's all right.

It doesn't like the in place move there. So fine. Notice it didn't work.

So instead we set this equal to itself to save the return value. Now since we're not doing inplace equals True, this should put it in order. There it is.

The average of 100 coming up at the top. But again, we only want the top 10. So let's just grab the top 10.

There you go. Top 10. The creme de la creme.

If you got the entire 20 of them, it goes down to 90. So top 10. Three, four, five, six, seven, eight, nine, ten. All right.

Above 97 for those top 10. That's the answer to that one.
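The sort and top-10 steps can be sketched as follows, using the assign-back approach the lesson ends up with rather than inplace=True. The twelve averages are made up, and the frame stands in for the already-filtered result; female_min_90_df and top_10_df are illustrative names.

```python
import pandas as pd

# Made-up averages standing in for the already-filtered frame.
female_min_90_df = pd.DataFrame({
    "average": [91.0, 100.0, 97.33, 93.0, 99.0, 98.67,
                92.33, 96.0, 95.0, 99.67, 94.0, 97.0],
})

# sort_values returns a new sorted frame, so assign it back to the name.
female_min_90_df = female_min_90_df.sort_values(by="average", ascending=False)

# Slice off the first ten rows of the sorted result.
top_10_df = female_min_90_df[:10]

print(top_10_df)
```

Writing head(10) instead of the [:10] slice selects the same rows.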

Brian McClain

Brian is an experienced instructor, curriculum developer, and professional web developer, who in recent years has served as Director for a coding bootcamp in New York. Brian joined Noble Desktop in 2022 and is a lead instructor for HTML & CSS, JavaScript, and Python for Data Science. He also developed Noble's cutting-edge Python for AI course. Prior to that, he taught Python Data Science and Machine Learning as an Adjunct Professor of Computer Science at Westchester County College.
