Filtering Dataframes in Python Pandas

Discover how to effectively filter data using multiple conditions in Python's pandas library. Learn practical techniques for updating data frame values and adding new columns with random data.

Key Insights

Combine multiple filtering conditions in pandas data frames using logical operators "&" (for "and") and "|" (for "or") bundled with parentheses, enabling precise data selection such as items priced under $13 or containing at least 700 calories.
Update specific data frame entries using conditions rather than hardcoded row numbers, such as changing the price of "Bison Burger" to $19.50 by referencing its item name directly.
Create new columns in pandas data frames filled with randomly generated data using numpy's "randint" method, allowing for quick population of values within a specified numerical range.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

This is a lesson preview only. For the full lesson, purchase the course here.

All right, now what about selecting and filtering by multiple conditions? Because you know we filtered by calories and price, but what if you want to filter by calories and price? Then we would use the and and or operators that we learned about in Lesson Two: If/Else logic. And we bundle up the conditions with the ampersand and the pipe, as opposed to using the words “and” or “or” like in vanilla Python. We're going to make a DataFrame of just max $15 price and min 600 calories.

It'll be the food_df, and in it we're going to say: food_df['price'] >= 15. We're going to bundle that in parentheses. You can hit ENTER when you're doing these multiple conditions—you probably want to do it on more than one line since they can get kind of long and run off the side of the screen. Then we're going to say: food_df['calories'] >= 600 and food_df['price'] <= 15. See what we get: less than 15, greater than or equal to 600.

It worked. So there's one condition and another condition. Make a DataFrame of just max 13 or min 700. Let's call that a challenge—based on what you just saw, what we just did with the double value, but using or instead of and. The or operator is the pipe. Go ahead and try to make a DataFrame of just max $13 price or min 700 calories. Not and—a change in operator. You get more results if it's or, right? You're not demanding two conditions be met.

So let's… okay, take a pause, try the challenge, come back.

Okay, we're back. We're going to say max 13, min 700, but these don't both have to be met—just one or the other.

Python for Data Science Bootcamp: Live & Hands-on, In NYC or Online, Learn From Experts, Free Retake, Small Class Sizes, 1-on-1 Bonus Training. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

We're going to come in and change the and to a pipe, and then say max 13 price. Price has to be less than or equal to 13, and the calories have to be greater than or equal to 700.

So anything less than 13 or having more than 700 calories. There you go—less than 13, it's either one or the other; it could be both, but at least one or the other.

There's your less. There's both. Both—oh no, excuse me—that does not satisfy the 17, but it does satisfy the more than 700.

So and or or, multiple conditions. And you can have more than two.

I've done stuff with baseball—you know, I like—there's this big baseball dataset I like to explore where I'm looking: okay, show me all the seasons where a guy hit 30 homers, had 30 doubles, 30 steals, hit.300, struck out less than 100 times, had 100 runs scored, and 100 RBI.

So the more conditions you add, the fewer and fewer results you're going to get. And it's kind of interesting what you get when you go in and filter five, six conditions on this dataset with many, many thousands of rows.

It's like one row per player per year—so like one player who played 20 years, he's got 20 rows himself.

So you're going in over 100,000+ rows and you're just pulling out six things with these really fine conditions—with lots and lots of variables, lots and lots of conditions.

Okay, we could do an and/or combo.

Let's get all the items that include the word "salad" in the name, and the price is at least 13 or the calories are at least 700.

Because we're going to use this 13 and 700 calories thing with or, but then we’re going to make—on top of that—it has to be a salad.

So you’ll get one, two… two results. There are only two.

So once you've filtered by less than 13 or greater than or equal to 700—which the green salad satisfies the price, and the chicken salad satisfies both, actually—but they're the only two salads.

And then there are many salads that don't, you know, meet this standard.

We want this or but also the requirement that it has to be a salad.

We'll say salads—so it’s the whole long name: salads of max $13 price, min 700 calories. It’s really or for this.

You know, this is or and and, or uh.

We've got our or, which we will double wrap now—like so—just bundle that as an or.

And then we're going to say and, and then the and condition will be food_df['item'].str.contains('salad', case=False).

Let’s back this up a smidge.

And it’s whining about… it doesn’t need all that wrapping.

Okay—just this big outer wrapper, and then you've got your inner wrapper, and then you've got your food wrapper.

Yep, just a set—oh yeah, tuna salad sandwich—I didn’t notice that one, right?

So you have three salads.

Pause, try it. It’s tricky, but you can get used to it if you practice it.

You'll have to do many of these. I finally got good at this because I was doing it with that baseball database—dozens and dozens of these, just sitting there for hours fiddling with the baseball database, writing queries.

Increase the price of the bison burger—this is kind of like a little review—to 20.25. Bison burger is at row 5.

Is it at row 5? Yeah.

All right, so we’re going to say—now that’s not ideal to go by row 5 because what if you don’t know? But that’s okay.

We’ll say: food_df.loc[5, 'price'] = 20.25.

Not a percent increase—a hard-coded increase. Bison burger is 20.25—good.

Now let’s do another thing. We did one where we set hot dog—we went and targeted the hot dog.

Oh, the bread. We set the bread—we looked up an item by a condition—not by row—so we went for the row that met the condition where the item equals hot dog.

Or in this case: bison burger. Set any individual item value such as the price of bison burger, without knowing its row number.

Find the item’s row number by looking it up by Boolean condition. In this case, item equals bison burger.

We're going to try to do the same thing again—except not type in 5, because what if we don’t know where it is?

We don’t know where the bison burger is.

So instead of saying 5, we’re going to say: food_df['item'] == 'bison burger', and then that price will be—we’ll go back to 19.50.

And it's back to 19.50.

We targeted it by a condition as opposed to a hard-coded 5, which is better.

Change numeric value relative to a current value—we did this—raising prices by 10%.

Do this as a challenge.

Pause—you're going to start with the hot dog—setting the hot dog price to $4, then you're going to increase the hot dog price by $1.50, then multiply it by 10%, then discount it by 20%, then increase all items by 8%, and round the prices to two decimal places.

Boom, boom, boom.

Big challenge. Try to do it, see how you do, and come back for the answer.

Okay, challenge: salads now on sale.

Reduce the price of all salads by 20%. To do this, apply it to the full food_df.

Okay, so we're going to say: food_df['price'] = food_df['price'] * 0.8.

Right—there it is. And because we didn’t round it off, it gave us these multi-decimals.

We need to round to two digits, because it's money, right? That's fine—we can fix it again.

Here, let’s multiply it by 1.0—so it doesn’t change—but now we're going to round it off.

We're going to round it to two decimals, like so.

Yeah—that’s showing three decimal places. Round, times equals… oh, you’re just multiplying by itself.

Okay—we’re not changing it—we’re setting it equal to itself. There you go. Round it off to two.

Set this price equal to itself, rounded to two decimals.

There we go.

Next up: challenge.

Start by setting the hot dog price to four.

Bonus: find the hot dog by condition—item equals hot dog. In other words, do not hard code the row number, which is 13, I believe… no, 15. Okay.

We're going to set the hot dog price to four.

We're going to say: food_df.loc[food_df['item'] == 'hot dog', 'price'] = 4.

And the hot dog, which is currently priced at $3.60, should become $4.00.

And it worked.

Next up: += the hot dog price by $1.50.

Once again, we’re going to find the hot dog by the row—whatever row has an item that equals hot dog—and then we're going to target the price column. Whatever it is, we're going to += it by a buck and a half, so that it will become 5.50.

Hot dog is 5.50.

Next up: we're going to multiply the hot dog price by 10%.

So the 5.50 will get 55 cents added to it—it’ll be 6.05.

So once again: food_df.loc[food_df['item'] == 'hot dog', 'price'] *= 1.1

And now the hot dog is 6.05.

Rolling…

We're going to knock 20% off the hot dog’s price.

We can either multiply by 0.8—it’s actually easier to multiply by 0.8.

We’ll say, once again, we’re looking for the row in the food_df where the item is hot dog, we’re targeting the price, and we’re saying: *= 0.8 to reduce it by 20%, which will knock the price down to 4.84.

Lastly, we're going to knock the price of everything down by 8%.

That's every single row—so we don't even need a condition.

We'll just say: food_df['price'] *= 1.08.

And that gives you more decimals.

So to fix that—price = round(price, 2)—at all costs, because it's currency.

We’re going to come in and say: the price of everything equals the rounded price of everything to two decimal places.

Whoa—oh—*=, gosh, yikes.

Okay yeah—I accidentally *='d it by itself. I meant to—I basically squared the price.

Let’s run it again, then. Oh yeah. All right—I messed that one up pretty good.

Okay—so I’m just going to set that equal to itself and do a Run All.

Doesn’t like the shrimp salad. It’s not finding all that.

Okay, here we go—put it on down here.

Gonna move on. This file is enormous. I should have broken it up into a couple of lessons, right?

All right.

Adding a new column within a range of values—a random range of values, no less.

We have random imported, but we’re also going to use NumPy.

We can say: np.random.randint() and we can do a min, max, and n would be the sample size.

It's kind of like random.sample, but it's randint, where you can feed in not just a min and max, but also the number of items that you want.

They won’t be unique like sample, but you’ll get multiple values.

We’re going to make a list of randint in the 10 to 50 range, with a length equal to the food_df.

So that n value is going to be the length of the food_df.

And these values are not all necessarily going to be unique the way they would be with random.sample, but that is good enough.

We're going to say:

qty_list = np.random.randint(10,50, len(food_df))

Why do we want to do this anyway?

Because we want to have many numbers that we can supply—a list, a vector of numbers—that we could supply as the values of a new column.

There you go. So those are the numbers for our new column.

Notice they are a NumPy array, which is fine.

random.sample(range(min, max), count) can also produce a list of random ints, but they’re all unique.

Just as a reference—we could say:

unique_qty = random.sample(range(10,50), len(food_df))

And that is going to give us not a NumPy array but just an ordinary list.

And the thing is—all the values will be unique.

Takes three positional arguments, but four were given—uh right—we want a range.

Okay:

range(10,50)

There we go. So these are all unique and it's a list.

Note the commas.

The NumPy array gave us the same number of items, but not necessarily unique. In fact, I see two 45s in here.

So they’re not unique, but that’s fine. The quantity list doesn’t have to be unique.

Now what we’re going to do is make a new quantity column.

Add a new column.

We’ll say:

food_df['qty'] = qty_list

Remember—we declare a column the way you would declare a new key on a dictionary.

And then the food_df will have a new quantity column.

If we run the file—there it is—a new qty column.

That’s how many units we have, or sold, or whatever.

Key Insights

Brian McClain

How to Learn Python