Creating Data Frames and Cleaning Data in Pandas

Demonstrate creating, concatenating, cleaning, indexing, modifying, and dropping DataFrames and columns using Python pandas.

Discover how to efficiently create and manage pandas DataFrames using dictionaries, and learn essential techniques for handling missing data and maintaining data integrity. Gain practical insights into data cleaning, indexing, and concatenating DataFrames for effective data manipulation.

Key Insights

  • Create pandas DataFrames directly from dictionaries, automatically converting dictionary keys into column headers without needing explicit column declarations.
  • Combine multiple DataFrames with the pandas concat method, resetting indices afterward using reset_index(drop=True) to maintain proper consecutive indexing and avoid unintended index columns.
  • Effectively manage missing (NaN) data by either deleting columns entirely, filling numeric data with a mean or zero, or filling missing string data with sensible default values, such as replacing missing bread types with "dinner roll."

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

This is a lesson preview only. For the full lesson, purchase the course here.

Now let's look at another way to make a DataFrame. We can make a DataFrame from a dictionary. We made a DataFrame from that 8x8 chess board we fed into the DataFrame constructor method. We also made a new empty DataFrame and then assigned columns the way we would declare keys on a dictionary and assign the values to be lists.

But we could also make a dictionary and feed that to the DataFrame constructor: `pd.DataFrame(dictionary)`, with the dictionary as the argument. The advantage of that would be that the dictionary keys will automatically become the column names, so you don't have to specify any column headers. So let's make several lists, and we're going to assign these lists to keys.

These lists are going to become the values of keys. So you can see what we're doing here—we're making six rows' worth of data, right? Each of these lists has six items, so six rows' worth. But these are not going to just be a list of lists.

We're going to do key-value pairs here. These are going to be the values of the keys of a dictionary. We'll call it—these are all sandwiches—so let's call it `sandwiches_dictionary`.

Set that equal to squiggles—curly braces—and the first key will be `item`. We're going to match this up with the current names of our columns. So these keys correspond to columns.


We're going to make the `sandwiches_df` out of this and then glue it onto the bottom of the `food_df`—concatenate them together. We want the column names to match. Now here's a little curveball.

We don't have a `bread` column in the `food_df`, but sandwiches—you might want to have that. So let's pretty print `pprint.pprint` that `sandwiches_df`—the `sandwiches_dictionary`, rather. There you go, a little JSONic thing—dictionary, key-value pairs.

Now a DataFrame from this. The keys will be used as column names, so we don't have to declare any column names the way we did with the chess board. Remember we had to tell it a through h. This is just going to use the keys as the column headers automatically.

`sandwiches_df`, rather, is going to equal `pd.DataFrame`, feeding in the `sandwiches_dictionary`. Here we'll get the shape of that: (6, 5)—six rows, five columns.
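Here's a minimal sketch of that whole move. The lesson doesn't show the actual prices, calories, or bread types, so those values below are made up; only the column names and the two sandwiches mentioned later (BLT and hot dog) come from the lesson:

```python
import pandas as pd
import pprint

# Each key will become a column header; each list holds six items,
# so six rows' worth of data. Values here are illustrative placeholders.
sandwiches_dictionary = {
    'item': ['BLT', 'hot dog', 'reuben', 'club', 'grilled cheese', 'hamburger'],
    'price': [7.99, 3.50, 9.25, 8.75, 5.99, 6.50],
    'cals': [420, 310, 650, 540, 480, 560],
    'vegan': [False, False, False, False, False, False],
    'bread': ['toast', 'bun', 'rye', 'toast', 'white', 'bun'],
}

pprint.pprint(sandwiches_dictionary)  # the "little JSONic thing": key-value pairs

# Keys become the column names automatically -- no columns= argument needed.
sandwiches_df = pd.DataFrame(sandwiches_dictionary)
print(sandwiches_df.shape)  # (6, 5) -- six rows, five columns
```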

Very much like our existing `food_df`, right? `item`, `price`, `cals`, `vegan`. Same, same, same, same. We actually could take this and concatenate it or add it onto the `food_df` and just have six more rows.

Only problem is: what do we do with this `bread` column, which is not found in the `food_df`? We'll get into that. That was done on purpose—to see how we deal with combining two DataFrames that don't have exactly matching columns, which is a very realistic use case.

So concatenating two DataFrames together—we're going to declare the main DataFrame, which is `food_df` in our case, equal to `pd.concat([food_df, sandwiches_df])`. That will return a new uber DataFrame combo. So let's say:

`food_df = pd.concat([food_df, sandwiches_df])`

We're saving the result. It's going to make a new DataFrame, but we're saving it under the name of `food_df`, so we don't have yet another name.

There you go. We've now got 16 rows and five columns.

Sure, it's much longer—look at all these sandwiches we just added—but we also added a new column called `bread`, and there are no bread values for the existing food items from the original `food_df`. So that `NaN` does not mean “naan,” the bread—it means “not a number.” We've got missing data now.
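If you want to see that mismatched-columns behavior in isolation, here's a tiny self-contained sketch (these stand-in frames aren't the lesson's actual data):

```python
import pandas as pd

# Only the second frame has a 'bread' column.
food_df = pd.DataFrame({'item': ['fruit salad', 'soup'],
                        'price': [4.50, 3.25]})
sandwiches_df = pd.DataFrame({'item': ['BLT'],
                              'price': [7.99],
                              'bread': ['toast']})

combined = pd.concat([food_df, sandwiches_df])
print(combined)
# The rows from food_df get NaN in the 'bread' column --
# missing data that will need cleaning.
```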

So that brings us to the next important concept, which is cleaning data.

There are a few different ways to approach cleaning data, and it's something for you to figure out—what's best depends on many things. Let me just walk you through it so you'll have a better sense of what you're going to do when you encounter bad data in a dataset—which happens a lot. You might have missing, null, or `NaN` values such as what you have here. These can be dealt with in one of three ways.

If the column itself is rife with bad values, perhaps you just delete the entire column. Maybe you say, “Well, the bread was interesting for when it was just sandwiches, but now we're in this greater context—fruit salad and such—and the bread's not as relevant anymore,” so we're just going to get rid of the `bread` column altogether.

That's option one. The pros are: it gets rid of all the bad data in that column—because it gets rid of the whole column. The cons are: it might also get rid of some useful data that you might have wanted to keep.

So maybe there's another way. Option two would be the other way. You would keep the `bread` column, and you could fill with zero or a mean—but that would only be if there were missing numbers.

These are just missing strings, so you're not going to fill a missing string with a mean (there is no mean for a string) or with a zero. So in the case of a string, you could fill with some default name.

Something arbitrary but sensible. Like—you wouldn't fill it with “apple” or “bunny”—but you might fill it with “roll” or “biscuit” or something bread-related. In the case of strings, where there's no mean to fall back on, you just provide some sensible arbitrary default. We could say:

`food_df['bread'] = food_df['bread'].fillna('biscuit')`—or “bun” or “roll” or “rye” or whatever we wanted to call the filler.

`df.info()` gives a report of the columns, so let's run `food_df.info()` now—and it's going to tell you that in the case of `bread`, you have only six non-null values out of 16.

That means you've got 10 null values: all the original food items don't have a `bread` value.

And we could also do this: `food_df.isna().sum()`—that gives you the same results, but kind of backwards. Rather than non-null counts, you get null counts.

So here we're seeing that we're missing 10 items from the `bread` column—right—all the original 10.
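As a quick sketch, here are the two checks side by side; the counts in the comments reflect the 16-row frame described above:

```python
# Column-by-column report: names, non-null counts, dtypes.
food_df.info()               # bread: 6 non-null (out of 16 entries)

# The inverse view: how many values ARE missing per column.
print(food_df.isna().sum())  # bread: 10, every other column: 0
```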

You could also shorten that up.

So what we want to do is come to `food_df['bread']`. Let's actually take that out and type it.

And what do we want to fill it with? Bun? Biscuit? Fajita? No? I don't know…

We could make it a “dinner roll”—kind of a standard thing you get with a lot of meals.

We're going to say: `food_df['bread'] = food_df['bread'].fillna('dinner roll')`.

There it is—they're all filled with “dinner roll.”

Here's another issue: when we combine the two DataFrames, let's look at the indexing.

So the original `food_df` indexed the rows from 0 to 9 for the 10 rows, and then the `sandwiches_df` had its own indexing of 0 to 5.

And when we concatenated the two DataFrames into one big one, it did not automatically reset or continue the indexing properly. In fact, it rolled over—it just started again with 0 for the sandwiches.

We're going to use what is called the `reset_index()` method, which you call on a DataFrame, and that will reset the index numbers—so instead of restarting at 0, it'll go to 10, 11, and so on.

Concatenating two DataFrames into one messes up the consecutive row indexing, as we're seeing here.

We're going to reset the index, and then we're going to say `drop=True` to prevent the old, wonky indexing from being saved to a new column.

What that means is: if, when you reset the index, you do not provide a `drop=True` argument to the `reset_index()` method, it takes the old, wonky index, thinks it's important, and makes a brand-new column called "index" with the old unwanted values—which we don't want.

So let's say: `food_df = food_df.reset_index(drop=True)`

`drop=True`, rather—and `True` is capitalized in Python.

That worked. Index keeps right on incrementing consecutively by one.
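Here's a minimal self-contained sketch of the rollover and the fix, using stand-in frames rather than the lesson's data:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})   # indexed 0, 1, 2
b = pd.DataFrame({'x': [4, 5]})      # also indexed 0, 1

combined = pd.concat([a, b])
print(list(combined.index))          # [0, 1, 2, 0, 1] -- rolled over

combined = combined.reset_index(drop=True)
print(list(combined.index))          # [0, 1, 2, 3, 4] -- consecutive again
```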

Now let's do that again but… let's back up and redo these moves—they're worth looking at again.

So—now let's not do that—it's fine, let's just keep rolling.

Let's pretend that the indexes are not fine, that they're still messed up—because I want to show what happens if you reset the index without passing `drop=True`.

What do you get?

We can reset the index—even though it's fine—let's just do that.

We're going to `reset_index()` but without `drop=True`. This gives a new column of unwanted old index numbers.

Which won't look wonky because the index is fine, but it will make this new column.

We're going to say: `food_df = food_df.reset_index()`, and we're going to take out the `drop=True` and see what that gives us.

Aha! Look—it made a new column called "index".

So it just kind of does that as a hedge—because sometimes these row names have some kind of numeric label value, kind of like column headers—they might have some meaningful data there.

And when you reset them to numbers, you would lose that data—so you have to provide the opportunity for that data to get preserved as a new column.

Which we didn't do the first time—we didn’t want that—but we do want to see this in action.

So there's that unwanted "index" column.
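In sketch form, with a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# Without drop=True, the old index is preserved as a new "index" column.
df = df.reset_index()
print(df.columns.tolist())   # ['index', 'x']
```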

Which brings us to the next point—how to drop a column by name.

We want to drop the "index" column by name.

So let's do that. We'll say: drop the unwanted "index" column that got made during `reset_index()`.

We're going to say: `food_df = food_df.drop(columns='index')`.

You give the name of the column, and if you're dropping multiple columns, you pass the names as a list. If you're just dropping one, you don't need a list (there's a sketch of both forms below).

And let's see if that works.

Yep—it works.

And just be careful not to run this again because it'll throw an error, since there is no more "index" column to drop.
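As a quick self-contained sketch of the one-column versus many-columns forms (column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

df = df.drop(columns='a')         # one column: a bare string is fine
df = df.drop(columns=['b', 'c'])  # several columns: pass a list of names
print(df.shape)                   # (1, 0) -- all three columns gone
```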

Now let's add another challenge: add a new column called `foo`, the value of which is `'bar'`.

Now in programming, foo and bar—believe it or not—are common placeholder variable names.

Like foo as the name of a variable and bar as the value.

So like: `foo = 'bar'`. That's used when we want placeholder values, and it also tells anyone reading the code that these are fake values.

We're going to make a new foo column. Not to be confused—it's not a shorthand for food. It's foo—as in foo/bar.

So make a new column called foo, the value of which is "bar".

If you just set—if you declare foo as a column and set it equal to "bar"—it'll just be "bar", "bar", "bar" in every single row.

And that's only so we can drop it.

Then you're going to drop the foo column—this is more of the challenge.

So pause recording, see if you can declare a new column. Think of how you did the columns for `item`, `price`, and `cals` to begin with.

You make a new column the way you would add a key to a dictionary.

So pause, come back when you're ready.

Okay here we are.

We're going to say: `food_df['foo'] = 'bar'`

And you've got this new foo column: bar, bar, bar.

And why did we make a nonsense column? Only so we can practice dropping it.

Now if you just want to see if it worked, we could say `.head()`—and that gives you the first five rows.

We could say `.head(3)`—that gives the first three rows.

Right—we only want to see enough to know that it worked—we don't need to print the whole thing.

Now what we want to do is drop the foo column using what you just learned about how to drop.

We'll say—well that's part of the challenge, right?

Pause—drop the foo column. Turn on the recording when you're ready to see the answer.

Okay, so how do we drop the foo column?

We'll say: `food_df = food_df.drop(columns='foo')`

I think that will work.

Yep—got rid of the foo column.

All right—check for NaN values.

Do we have any NaN values anymore?

No, we don’t.

Remember how to do that? We'll say: `food_df.info()`—that's one way to do it.

And it’s all non-null.

So that means we’re not missing anything.

We can check for NaN another way—it’s a review, right?

`food_df.isna().sum()`—we can print that.

And now you’re finding out—you’re looking at every column and it’s telling you how many are missing.

How many rows have missing data in each column?

And the answer is zero—because you’re not missing any data.

And `info()` is like the inverse of that—it's telling you on a column-by-column basis how many are not missing.

So 16 of them are not missing any data.

It’s another way of saying they all have data. And they all have data. All 16. All 16 rows. All columns.

Let’s drop—how would we drop a row?

Let’s say we want to get rid of a row.

You can drop by index—but then once it’s dropped by index, you kind of just don’t worry about the gap.

Right? If you get rid of the 12—it’s okay.

So what we’re going to do is—we’ll drop the BLT, which is a row.

We’re going to say, rather than drop a column, we’re just going to drop a row.

We’re going to say: food_df = food_df.drop(12)

Then run—and look—it’s back. The BLT is back.

It looks like we dropped it—and it’s back.

And this is because we did not—remember—we learned about methods that return values?

Like when you uppercase or lowercase a string—if you don’t save it to itself, it doesn’t work. It doesn’t stick. It just kind of transiently makes the change.

You have to set it equal to itself to catch and preserve the operation—because you’re storing the return value.

These operations return a new DataFrame—they don’t mutate the original.

Kind of like a string method. When you `.replace()`—you catch the return value. You don't change the original, immutable string.

So the reason it appeared to work but then came back—the BLT—was because we just ran the command without saving the value.
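So this time we catch the return value by saving it back to the same name:

`food_df = food_df.drop(12)`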

Now—there. BLT is gone.

Let’s drop a row by reference to the item name—not the index—because what if you don’t know the index?

So let’s drop the hot dog by reference to the item name.

We’re going to do this food_df.drop() thing again.

`food_df.drop(…)`

Now—the hot dog is at 14—but instead of typing that in, because we might not know it, we're going to say:

`food_df[food_df['item'] == 'hot dog'].index`

So you’re just looking for whatever item returns true on a check for “hot dog.”

If the food item equals “hot dog”—we’re going to drop that one.

And it gives us an error—because we have to drop—it’s not just the…

We need the Boolean comparison inside on the column.

We have to be on the big DataFrame, then on the column—like that.

Yep. And then kaboom—and it still doesn't like it—because we have to say `.index`.

Oh—it’s a property—it’s not a method.

All right—there we go. Boom.
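Spelled out, the line that finally ran looks like this:

`food_df.drop(food_df[food_df['item'] == 'hot dog'].index)`

The inner expression filters for rows where `item` equals `'hot dog'`, and the `.index` property hands the matching row labels to `drop()`.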

Okay—so that’s the move.

That is how you drop—basically on a conditional search, right?

We’re not just typing in 12 to get rid of the BLT.

What if you didn’t know where the BLT was?

We’re not just typing in 14 to get rid of the hot dog.

We’re saying: drop where you do this condition, and then on the condition we’re looking at the index of whatever meets that condition.

That would be the 14. Little power move there.

And the BLT and the hot dog are removed—whoa.

Oh yeah—but we didn’t save it again.

We did not set this equal to itself. We're going to do it again, and this time save the result.
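Saving the result back to `food_df` so the drop sticks:

`food_df = food_df.drop(food_df[food_df['item'] == 'hot dog'].index)`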

And now—it’s missing the 12 and the 14. The BLT and the hot dog are gone.

Now let’s put the BLT back in.

So whatever these numbers are—we just want the length of the DataFrame, and that’s going to become the new index.

Remember how we did that dynamically?

We’re going to say:

food_df.loc[len(food_df)] = ['BLT', …]—or whatever the full list of column values is.

We’re going to go to the food_df, go to the end of it, and we’re going to set that equal to the new food item.

Which doesn’t even need a variable—we can just directly do the list—right—as if that were called new_item.

And there is your BLT at the end.

Brian McClain

Brian is an experienced instructor, curriculum developer, and professional web developer, who in recent years has served as Director for a coding bootcamp in New York. Brian joined Noble Desktop in 2022 and is a lead instructor for HTML & CSS, JavaScript, and Python for Data Science. He also developed Noble's cutting-edge Python for AI course. Prior to that, he taught Python Data Science and Machine Learning as an Adjunct Professor of Computer Science at Westchester County College.
