Discover how pandas data frames transform NumPy arrays into versatile, spreadsheet-like tools for Python-based data science. Explore the powerful functionalities of pandas to efficiently handle, manipulate, and analyze structured datasets.
Key Insights
- Understand how pandas data frames build upon NumPy arrays to create spreadsheet-like structures, offering additional properties such as column headers, indexes, and versatile methods for data manipulation, filtering, and sorting.
- Learn that single rows or columns extracted from a pandas data frame are known as series—one-dimensional data structures that retain column and row references, facilitating data identification and analysis.
- Gain practical knowledge of pandas essentials such as importing pandas, creating data frames from NumPy arrays, selecting data subsets using methods like loc and iloc, and filtering data based on complex conditions (e.g., salaries over $100,000 or employees with more than five years' tenure).
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Hi, welcome back to this course on Python for Programming and Data Science. My name is Brian McLean. I'm so thrilled that you're back for lesson seven.
We're into the second lesson of the data science portion. The first lesson was the last one, that being NumPy, where we learned about how you can take lists and build arrays that can be manipulated, reshaped into two-dimensional structures. And that's important because in this lesson, we get on to what are called pandas data frames, pandas for panel data.
And under the hood of a data frame, which is basically a spreadsheet living in Python, is that two-dimensional matrix structure that we were working with in the NumPy lesson. So here we are, pandas data frames. Pandas stands for panel data, where panel basically refers to a table or spreadsheet that is data arranged in rows and columns.
So in the last lesson, we were dealing with NumPy in rows and columns, but that was just raw data. The data frame would have column headers, rows, column names, different kinds of data, more like a spreadsheet that you're used to, and we'll see that shortly. But under the hood again of a data frame is a two-dimensional matrix NumPy style.
In pandas, this spreadsheet-like data structure is called a data frame. Unlike a spreadsheet in an Excel worksheet, a pandas data frame exists as a variable. So sometimes students will ask, well, what's the difference between working with a dataset in pandas and working with it in Excel? You can, of course, even load Excel spreadsheets and CSV files into pandas, into Colab, into a Jupyter notebook, and we will do that.
So the question is, well, why would I even do that when I have Excel? Well, part of the answer is because in Excel, the unit that you're working with is a worksheet, and in Python, the unit that you're working with is a variable. So you have more programmatic possibilities. A data frame is a two-dimensional matrix of rows and columns, like a NumPy array.
In fact, it is built on NumPy arrays under the hood, and as such it has a shape property as a tuple, rows comma columns. So a 10 row by 4 column data frame has a shape of 10 comma 4. A single row or column of a data frame exists as a one-dimensional vector called a series. So as we saw when we began working in the last lesson with NumPy arrays, we were starting with lists and then making NumPy arrays from them, and the result would be this one-dimensional vector with a shape of number comma nothing that we could then reshape into two-dimensional structures, right? Take the 9 Xs and Os in the vector and turn it into a 3 × 3 tic-tac-toe board, for example.
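To make the shape idea concrete, here is a minimal sketch. The 10 by 4 array of consecutive integers and the names data, df, and first_col are made up for illustration; they are not from the lesson notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 consecutive integers arranged as 10 rows by 4 columns
data = np.arange(40).reshape(10, 4)
df = pd.DataFrame(data)

print(df.shape)        # (10, 4) -> rows comma columns

# With no column names given, columns are labeled 0, 1, 2, 3,
# so df[0] pulls out the first column as a one-dimensional Series
first_col = df[0]
print(type(first_col))
print(first_col.shape)  # (10,) -> number comma nothing
```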
Generate the 64 random unique numbers from 1 to 99 and turn them into an 8 × 8 chessboard. Now, in the case of having a data frame, that is, a spreadsheet living in Python, if you extract a row or column, then you've got a vector again, right? But that vector, that one-dimensional structure, will have a reference to, or knowledge of, its rows and columns, like where it came from, right? If you pull out a row of data from a spreadsheet, like first name, last name, age, salary, telephone, email, job title, start date, etc., from the employees table, you have all this data for one employee in a vector because you pulled out one row, but that vector knows what column each piece of data represents, so it's more like a set of key-value pairs, and it is known as a series. So, a one-dimensional vector extracted from a data frame is called a series.
So, you'll see that as we go. You'll hear the terms series, data frame, matrix, and vector a lot going forward. NumPy terms such as shape, dimensions, matrix, vector, and tuple all apply to data frames just as they do to NumPy arrays, because pandas is built on top of NumPy. What we were working with in the last lesson was a lot of 2D NumPy arrays, the tic-tac-toe board and the chessboard, in preparation for saying, okay, now let's look at a 2D dataset that's more like an Excel spreadsheet with rows and columns.
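Here is a small sketch of a series remembering where it came from. The employees table, its column names, and its values are all invented for illustration, and the row is pulled out by position with iloc, which the lesson covers later.

```python
import pandas as pd

# Hypothetical employees table; names and numbers are made up
employees_df = pd.DataFrame({
    "first_name": ["Ana", "Ben", "Cy"],
    "last_name": ["Lopez", "Kim", "Young"],
    "age": [34, 45, 29],
    "salary": [95000, 120000, 70000],
})

# Pull out one row by position: the result is a Series
row = employees_df.iloc[1]
print(type(row))
print(row.index.tolist())  # ['first_name', 'last_name', 'age', 'salary']
print(row["salary"])       # 120000, looked up by column name like a key

# Pull out one column: also a Series, labeled by the row index
salaries = employees_df["salary"]
print(salaries.index.tolist())  # [0, 1, 2]
```

The row acts like key-value pairs: the keys are the column names, and the values are that one employee's data.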
Now, like NumPy, pandas must be imported. So, in this lesson, we're going to learn how to do all kinds of stuff: making a pandas data frame from lists and from dictionaries, and how to select ranges of rows and columns, which we were getting into in the previous lesson. Remember, we had the tic-tac-toe board.
We would go into the 8x8 and get the upper left. We have the, you know, 8x8 chessboard, and inside the chessboard, we'd go and get the 3x3 tic-tac-toe boards in the opposite corners. That kind of two-dimensional slicing you do in pandas data frames as well.
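As a preview of that two-dimensional slicing on a data frame, here is a hedged sketch. It assumes an 8x8 board of consecutive integers 1 to 64 and uses iloc, the pandas position-based indexer. The names board_df, upper_left, and lower_right are illustrative only.

```python
import numpy as np
import pandas as pd

# Assumed 8x8 board holding the integers 1 through 64
board_df = pd.DataFrame(np.arange(1, 65).reshape(8, 8))

# iloc slices by position: [rows, columns], just like a 2D NumPy array
upper_left = board_df.iloc[:3, :3]     # first 3 rows, first 3 columns
lower_right = board_df.iloc[-3:, -3:]  # last 3 rows, last 3 columns

print(upper_left)
print(lower_right)
```

Each slice is itself a 3x3 data frame, and it keeps the original row and column labels of the corner it came from.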
There are some other keywords used to do this, namely loc and iloc, which we'll get into. We're going to learn how to filter data by condition. So, you've got all these employees.
You just want to see the ones who make more than $100,000 or have been there more than five years, that kind of thing, and you just filter by a condition, a Boolean condition. Lots of stuff: adding columns, adding rows, removing columns, getting individual values from a data frame, changing or targeting individual values, just all kinds of stuff. It's probably too much to rattle off and won't make a lot of sense yet. So, listen, I'm a big believer in just diving in and getting going, because you can't just talk at the student like, blah, blah, blah, right? We're not going to go there.
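For a taste of what filtering by a Boolean condition looks like, here is a minimal sketch. The employees table, its column names (name, salary, years), and its values are hypothetical and not from the lesson files.

```python
import pandas as pd

# Hypothetical employees table; all values are made up
employees_df = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "salary": [95000, 120000, 70000, 150000],
    "years": [6, 3, 2, 8],
})

# A comparison on a column gives a Series of True/False values;
# putting that inside the brackets keeps only the True rows
high_paid = employees_df[employees_df["salary"] > 100000]
long_tenure = employees_df[employees_df["years"] > 5]

# Combine conditions with | (or) and & (and); each needs its own parentheses
either = employees_df[(employees_df["salary"] > 100000) | (employees_df["years"] > 5)]

print(high_paid)
print(long_tenure)
print(either)
```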
It's a lot of stuff. This is a very long, dense lesson. You remember lesson one being super long? This one's super, super long.
Once you get over this one, you're over the hump, you know, the heartbreak hill of all this. It's a big lesson.
All right, not to psych you out, get you psyched. Let's get pumped. You're ready to run a marathon.
Get your Gatorade, get going, stretch out those hammies. We're going to import NumPy as np, and we're going to import pandas, this is a new one, as pd. Go ahead and run that.
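The import cell described here would look something like this. The version printout is an optional extra, not part of the lesson.

```python
import numpy as np   # np is the conventional alias for NumPy
import pandas as pd  # pd is the conventional alias for pandas

# Optional check that both libraries loaded
print(np.__version__, pd.__version__)
```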
Give it a few seconds to bake that cake. There we go. And we're going to make that chessboard again, right? This time we'll just use consecutive integers from one to 64.
So, you know, we're not going to do that random thing. We'll say nums64 equals list of range from 1 to 65. And then we'll print nums64 and we'll get the length of nums64 to make sure that it does have 64 values.
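A sketch of that step, using the name nums64 as spoken in the lesson:

```python
# range(1, 65) stops before 65, so this gives the integers 1 through 64
nums64 = list(range(1, 65))

print(nums64)
print(len(nums64))  # 64
```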
Yep, we got 64 items from 1 to 64. What do we want to do with that? We're going to take that list and make an 8 × 8 array out of it. We'll call it chessboard again.
Chessboard array equals np.array. Hopefully this looks a little familiar. We're going to pass in our list of 64. You could pass in the range directly if you like, but let's do it in pieces.
Let's print the shape and the ndim, the number of dimensions of that. It should be two dimensions and 8 × 8. Oh, we haven't reshaped it. We want to reshape it, right? So, we're going to say reshape 8 × 8. Remember, you can reshape right as you make your array.
We looked at that the last time. All right, there's your 8 × 8. It's got a shape, a tuple shape of 8 comma 8, and it's got two dimensions, and there's your 1 through 64 numbers in an 8 × 8 arrangement. Chessboard.
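The array step might be sketched like this. The variable name chessboard_arr is a guess at the spoken "chessboard array" and may differ from the lesson notebook.

```python
import numpy as np

nums64 = list(range(1, 65))

# Make the array and reshape it to 8 rows by 8 columns in one go
chessboard_arr = np.array(nums64).reshape(8, 8)

print(chessboard_arr.shape)  # (8, 8)
print(chessboard_arr.ndim)   # 2
print(chessboard_arr)
```

Without the reshape, the array would be one-dimensional, with a shape of (64,) and an ndim of 1.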
Now what we're going to do is make a data frame from the chessboard. So, this is new business. We'll call it chessboard_df; underscore df is a common suffix.
We're going to say pd. That's the pandas module alias. pd.DataFrame, and we're going to pass it our chessboard array.
So, we're going to then get the shape of that and then output. You don't print your data frames. You just output.
There you go, and it's kind of like the 8 × 8 array, but it's all dressed up now, right? Look, it's got this gray white. It's got column headers there. We didn't tell it to have any column names, so it's defaulting to indexes for rows and columns.
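Here is a hedged sketch of the data frame step. The names chessboard_arr and chessboard_df are assumed from the narration. In a notebook you would end the cell with the bare variable name to get the styled table; print is used here so the sketch also works as a plain script.

```python
import numpy as np
import pandas as pd

chessboard_arr = np.arange(1, 65).reshape(8, 8)

# Wrap the 2D array in a data frame
chessboard_df = pd.DataFrame(chessboard_arr)

print(chessboard_df.shape)             # (8, 8)

# No names were supplied, so rows and columns default to 0 through 7
print(chessboard_df.columns.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7]
print(chessboard_df.index.tolist())    # [0, 1, 2, 3, 4, 5, 6, 7]

print(chessboard_df)
```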
Now, in the case of a spreadsheet, quite often you're perfectly happy to have the rows just numbered sequentially like we have them. Maybe start at one, but whatever. It's, you know, just numeric, usually the row names.
However, for the columns, you typically want names, first name, last name, start date, salary, you know, that kind of stuff, like in a spreadsheet. This is basically a Pythonic spreadsheet now. And because it's not just a raw NumPy array but a pandas data frame built on top of a NumPy array, the data frame has more methods and properties, more things you can do with it, that are not found in a regular NumPy array: filter values, sort, add columns, remove columns, move them around, all the kinds of manipulations that you would expect to be able to do with a spreadsheet, basically.
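A few of those spreadsheet-style manipulations can be sketched as follows. The column names a through h and the row_total column are invented for illustration. The lesson itself has not named any columns at this point.

```python
import numpy as np
import pandas as pd

file_names = list("abcdefgh")  # hypothetical column names

chessboard_df = pd.DataFrame(
    np.arange(1, 65).reshape(8, 8),
    columns=file_names,
)

# Sort rows by one column, largest first (returns a new data frame)
sorted_df = chessboard_df.sort_values("a", ascending=False)

# Add a column: the sum across each row
chessboard_df["row_total"] = chessboard_df[file_names].sum(axis=1)

# Remove a column (returns a new data frame; the original keeps "h")
trimmed_df = chessboard_df.drop(columns=["h"])

# Filter rows by a condition on a column
big_rows = chessboard_df[chessboard_df["a"] > 32]

print(sorted_df.head(3))
print(trimmed_df.columns.tolist())
print(big_rows)
```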