DataFrames and Series in Pandas

Explain creating pandas DataFrames, extracting rows and columns, converting Series to lists or dictionaries, and selecting data using integer-based indexing or column-name indexing.

Learn to clearly differentiate between pandas Series and DataFrame structures, and understand how subtle bracket distinctions impact dimensionality. Master the essentials of indexing and selecting data efficiently by column name or numeric position.

Key Insights

  • Construct pandas DataFrames from scratch by initializing columns as dictionary keys with lists as values, ensuring all column lists match in length to avoid errors.
  • Differentiate clearly between pandas Series (a one-dimensional structure similar to a dictionary or vector) and a DataFrame (a two-dimensional structure), noting specifically how single or double square brackets using iloc affect the dimensionality of the extracted data.
  • Select data precisely by numeric index using iloc, or by column names for more intuitive and readable code, avoiding the need to memorize numeric positions of important dataset columns.



Hey, welcome back. We are still on file 07, the marathon file. We left right in the middle.

We just made a new DataFrame from scratch and then supplied columns, declared them basically as you would declare keys in a dictionary, and then the values of each of those columns were lists. The length of each list turned out to be how many rows we have. So let's get back into the file and check out where we left off.

We declared a new empty DataFrame. Remember, you called the DataFrame constructor on pd, which is the alias we assigned to pandas when we imported it—worth repeating because it's the first file in which we've used pandas. And now we've got these four lists.

Each list supplies the values for a different column, and there are four items per list. If you've got a list of four values for each column, and you've got four columns, then that would have to be four rows long.

These lists correspond to the columns. We have the name of the food item and the corresponding price, calories, and vegan Boolean. We've got a string, a float, an int, and a bool data type for our data.


Declare a new empty DataFrame using pd.DataFrame(). We're not passing anything into it like we did when we made our chessboard. Then we declare columns for the DataFrame.

To create a column for a DataFrame, you declare it as if the DataFrame were a dictionary and you were adding a key. The value is a list, and the length of the list will be the number of rows you are making.

And be careful because the lengths have to match. It wouldn’t work if you supplied, say, an uneven number. Here, let’s run everything.

If you supplied a ragged number of items—say, calories had five, prices had three, and food had four—it wouldn’t work. The inputs to the DataFrame expect that every row and column has a value. So let’s see what we get.

So there we have our DataFrame—four columns declared as key-like entities, right? On the DataFrame, we’re declaring the columns as you would declare keys in a dictionary. Then the values are these lists. The length of the list turns out to be how many rows you get.
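Here is a minimal sketch of the build described above. The "item" column name, the steak and pizza rows, and pizza's 19.99 price, 500 calories, and non-vegan flag come from the lesson; the other column names and values are placeholders made up for illustration.

```python
import pandas as pd

# Start with an empty DataFrame, then add columns the way you would add dictionary keys.
food_df = pd.DataFrame()

# Each list supplies one column; every list has four items, so we get four rows.
# Only "steak", "pizza", 19.99, 500 and pizza's False come from the lesson; the rest are made up.
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

print(food_df)
print(food_df.shape)  # (4, 4): four rows, four columns
```

The first column you assign sets the row count, which is why every later list has to be the same length.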

But what happens if we were to say 999, and then maybe take out a value, so we have four, five, three, and four as the lengths of our lists? Let’s see what we get. Doesn’t like it.
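The ragged-length experiment can be reproduced with a small sketch like this one. The values are placeholders, and the try/except is only there so the script keeps running and we can read the message.

```python
import pandas as pd

food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]  # four values, so four rows

error_message = None
try:
    # Five prices for a four-row DataFrame: pandas refuses the assignment.
    food_df["price"] = [34.99, 19.99, 12.50, 15.75, 999]
except ValueError as err:
    error_message = str(err)

print(error_message)
```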

"Length of values (5) does not match length of index (4)." So it already establishes from the first column you assign that it wants four and it expects all columns to supply the same number of items. That makes sense.

Let’s run it again and now it’s all good to go. Shape being four by four—four rows and four columns. Okay, so we’re now going to get a row as a Series.

So let’s say pizza_series. Remember, a Series is a vector. It’s the extracted values of a single row or column.

We’re going to come into the pizza row and extract that. We’ll say food_df.iloc[1] for integer location, and the pizza is at index one, right? Zero being the steak. So let’s see what we get.

The shape of it should be (4,) and the number of dimensions should be one, right? It’s a vector. Yep. And that’s the Series itself.
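As a quick sketch of that extraction (the DataFrame is rebuilt here with placeholder values so the snippet runs on its own; only the pizza row matches the lesson):

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

# Single square brackets with iloc: the row at integer position 1 comes back as a Series.
pizza_series = food_df.iloc[1]

print(pizza_series.shape)  # (4,)
print(pizza_series.ndim)   # 1
print(pizza_series)        # column names on the left, values on the right
```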

Kind of like key-value pairs. Now let’s make a list from the Series, which will purge the column names, leaving just the values. When you extract a Series, you don’t just get, for example, pizza, 19.99, 500, and False.

You don’t just get the values. You get the references to the columns they came from. So really what you’re getting are key-value pairs.

The Series is closer to a dictionary than to any other data structure, and you can actually pass the Series to the dict() function and get back a dictionary. But first let’s make a list. Now the list wouldn’t have key-value pairs.

So to do that, we would have to purge the column names and just keep the values. We’ll listify the Series and then print the list. Yep, and it’s giving us np.float. It’s giving us a data type without us asking for it.

I wonder if we could list the list and just get rid of that. Nope. I wonder if this is a new thing, because it should just say pizza, 19.99, 500, and False.

It really shouldn’t be calling out the NumPy data types like that. But it does show the data types anyway. Okay, now we’ll make a dictionary: we’ll say dict(pizza_series).
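A sketch of the listify step follows. One likely explanation for the np.float64-style output, assuming NumPy 2.0 or later is installed, is that NumPy scalars now print with their type name; the values themselves are unchanged. The `.item()` clean-up at the end is an optional extra, not something from the lesson, and the non-pizza rows are placeholders.

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

pizza_series = food_df.iloc[1]

# list() drops the column names and keeps only the values.
pizza_list = list(pizza_series)
print(pizza_list)  # the numbers may display wrapped in NumPy type names

# Optional: NumPy scalars have an .item() method that returns the plain Python value.
plain_list = [v.item() if hasattr(v, "item") else v for v in pizza_series]
print(plain_list)
```

Either list compares equal to the plain values; the difference is only in how the items are displayed and typed.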

We’ll pass pizza_series to the dict function. Run that, and there we go. And now we get a dictionary.
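In sketch form, with the same placeholder data as before, the dictionary conversion looks like this. The `to_dict()` call is a pandas alternative added here for comparison; the lesson itself uses dict().

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

pizza_series = food_df.iloc[1]

# The column names become the keys, and the row's values become the values.
pizza_dict = dict(pizza_series)
print(pizza_dict)

# Series.to_dict() builds the same mapping.
pizza_dict_alt = pizza_series.to_dict()
print(pizza_dict_alt)
```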

Again, it’s doing this NumPy move on the data types as opposed to just providing the numbers. All right, food_df.iloc[num]. Now why, when we extracted the pizza row, did it give us a one-dimensional vector? Why didn’t it give us the pizza row as a little 2D, four-column, one-row DataFrame? And the reason is, it’s as simple as counting how many square brackets you have around the index.

So you have one square bracket. I mentioned earlier—yesterday or earlier in the lesson—that if you count the number of square brackets before you see a value, that’s the number of dimensions of your structure. So there’s only the one square bracket, which means you’ve only got the one dimension.

Unless you come in and do something to provide that dimensionality. If you want a DataFrame slice of the DataFrame, not just a 1D Series, double-wrap that number: use iloc with double square brackets. The double square brackets around the index indicate a 2D selection.

So let’s say pizza_df this time equals food_df.iloc[[1]]. It’s a subtle difference, but it makes a big difference. So it’s the same exact move as we did up here with food_df.iloc[1], except now we’re double-wrapping the square brackets, which will give us two dimensions.

And we’re going to say we want the shape and we want the number of dimensions, and then we’ll print the DataFrame.

And it’s a one by four: one row of four columns across two dimensions. Not the same as a Series, which has a dimensionality of one and a shape of (4,). That "comma nothing" with only one number indicates that we’re looking at a vector here.
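The single-versus-double bracket difference can be checked side by side. This is a sketch with placeholder data apart from the pizza row.

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

pizza_series = food_df.iloc[1]   # single brackets: 1D Series
pizza_df = food_df.iloc[[1]]     # double brackets: 2D DataFrame with one row

print(pizza_series.shape, pizza_series.ndim)  # (4,) 1
print(pizza_df.shape, pizza_df.ndim)          # (1, 4) 2
print(pizza_df)
```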

Like a number line, not a 2D thing like a spreadsheet. A vector is like a number line—just a line—one-dimensional, right?

Think of it: a point is zero dimensions, a line is one dimension, a square is two dimensions. There you go. Zero dimensions is a point: just the xyz values of a speck floating in space with no width, height, or depth.

One is a line—that’s a vector. That’s what we’re dealing with when we extract a Series or just have a list, like the fruits list or NumPy array of numbers.

2D—spreadsheet, right? Two dimensions.

And 3D—cube. There are definitely three-dimensional arrays in data science, but you can’t think of the three dimensions like a cube—that’s not really how it works. We’ll get into that later.
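If it helps to see those dimension counts as numbers, NumPy reports them through the ndim attribute. This is only an illustrative sketch with made-up values, and it says nothing yet about how to interpret a 3D array.

```python
import numpy as np

point = np.array(5)                       # a single value, no axes
line = np.array([1, 2, 3])                # a vector, like a Series or a list
grid = np.array([[1, 2, 3], [4, 5, 6]])   # rows and columns, like a spreadsheet
stack = np.zeros((2, 3, 4))               # three axes

print(point.ndim, line.ndim, grid.ndim, stack.ndim)  # 0 1 2 3
print(line.shape)  # (3,)
print(grid.shape)  # (2, 3)
```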

So anyway, let’s just get back to it.

One-dimensional vector = Series

Two-dimensional table = DataFrame

Pizza = DataFrame of one row.

Now, to get an individual piece of data like the word "pizza" or the 19.99 price, we can come in to get it by row and column—just like you would in a NumPy array.

We could say—let’s try this—we’ll say pizza or just food_df. We’ll go to the whole food_df.

food_df.iloc[row, column]

So the pizza is at row one and the word "pizza" is at column zero. The word "pizza" is at row one.

No—the entire pizza row is row one (index one), and then index zero in columns.

Remember, the columns under the hood are indexed as well. The names are just labels covering the zero, the one, the two, and the three.

So column zero is the "item" column. So the intersection of row one, column zero gives you the word "pizza."

Yep. And if you were to say food_df.iloc[1, 1], you’d get the price.
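Here is that row-and-column lookup as a sketch, again with placeholder data for everything except the pizza row.

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

# iloc takes row position first, then column position.
pizza_name = food_df.iloc[1, 0]   # row 1, column 0 (the "item" column)
pizza_price = food_df.iloc[1, 1]  # row 1, column 1 (the "price" column)

print(pizza_name)
print(pizza_price)
```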

Now what if you want to select by name and not number?

And that makes a lot of sense, because what we just did to select the word "pizza," we had to know that "item" is column zero.

But a lot of times when you’re dealing with data, you don’t know the number, the index number, that is, the position in the columns of "first name," "last name," right?

You want to just go into the "last name" column and get people.

You want to go into the "salary" column and get everybody who makes more than fifty thousand.

You don’t need—or want—to know that "salary" is column seven in the index. It’s column eight if you look at it because it’s starting from zero.

So especially in the case of referencing columns, it’s more user-friendly and convenient if you can just refer to them by their name.
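Selecting by name might look like the sketch below. The column names other than "item" are assumptions, the price threshold is arbitrary, and the use of loc here is one common pandas way to combine a row label with a column name, not necessarily the approach the full lesson goes on to use.

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

prices = food_df["price"]          # one column by name: a 1D Series
prices_df = food_df[["price"]]     # double brackets: a 2D, one-column DataFrame
pizza_name = food_df.loc[1, "item"]  # row label 1, column name "item"

# Filter rows by a condition on a named column, no column numbers needed.
pricey = food_df[food_df["price"] > 15]

print(prices.ndim, prices_df.ndim)  # 1 2
print(pizza_name)
print(pricey)
```

The same pattern applies to the salary example: filter on the named column rather than on its position.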

Now, in the case of the rows, it’s not usually as vital to be able to refer to them by name, because the index and the name are essentially the same, right? Assuming you haven’t changed them.

A lot of times they’re just the numbers.
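To make that concrete, here is a sketch showing that with the default index, looking up a row by label and by position returns the same thing. The set_index step is a hypothetical extra to show what changes once rows do get names; the data is placeholder apart from the pizza row.

```python
import pandas as pd

# Placeholder data; only the pizza row reflects the lesson's values.
food_df = pd.DataFrame()
food_df["item"] = ["steak", "pizza", "salad", "burger"]
food_df["price"] = [34.99, 19.99, 12.50, 15.75]
food_df["calories"] = [700, 500, 250, 650]
food_df["vegan"] = [False, False, True, False]

print(list(food_df.index))  # [0, 1, 2, 3]

# With the default index, label 1 and position 1 point at the same row.
same_row = food_df.loc[1].equals(food_df.iloc[1])
print(same_row)  # True

# If you rename the rows, labels and positions are no longer the same thing.
by_name = food_df.set_index("item")
pizza_row = by_name.loc["pizza"]  # by label
print(pizza_row)
```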

Brian McClain

Brian is an experienced instructor, curriculum developer, and professional web developer, who in recent years has served as Director for a coding bootcamp in New York. Brian joined Noble Desktop in 2022 and is a lead instructor for HTML & CSS, JavaScript, and Python for Data Science. He also developed Noble's cutting-edge Python for AI course. Prior to that, he taught Python Data Science and Machine Learning as an Adjunct Professor of Computer Science at Westchester County College.
