Using Boolean Conditions to Filter Data in Python

Unlock the power of conditional filtering and data manipulation in Pandas, from creating filtered DataFrames to modifying data dynamically. Discover efficient techniques to analyze and update datasets using Boolean conditions and DataFrame indexing.

Key Insights

Learn to filter DataFrames effectively by applying Boolean conditions, such as selecting items with calories less than or equal to 500, and creating subsets based on non-vegan food items.
Understand dynamic DataFrame updates using row indexing (df.loc) and length-based indexing (len(df)), allowing the addition of new items like 'Fruit Salad' and 'Bison Burger' without hard-coding positions.
Utilize Pandas' describe method to obtain key statistical summaries for numeric columns within the dataset, including count, mean, standard deviation, minimum, maximum, and percentiles.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.


Now, selecting by a condition, a Boolean condition. You remember those? Well, you can use Booleans to set a condition and say, I just want everything that's got no more than 500 calories. We don't have a lot of data, obviously, but we can still try this. We'll add more data and then be able to do better filters later.

But let's just try a filter before we start adding on. So let's make a new DF. We'll just do it as a new DF this time.

We'll say max 500 cals df is going to equal food df. Now, let's look at the syntax. We're making a new df.

That's going to equal the existing df, and inside that, a condition where the column name is subjected to a Boolean condition. So the column we're interested in is calories. We're going to say cals.

So inside the df itself, we're filtering, inside these square brackets, we're doing a Boolean condition off a column. What is the condition? Calories less than or equal to 500. Run that, and it won't print yet because we saved it to a variable.


There you go. So it's just the two items that are no more than 500 calories. So the pizza and the garden salad.

Let's go look at the whole thing. The hamburger and the steak are above 500, so they didn't make it into the filter. Filtered out.

It's basically going to go through the whole data frame, row by row, take the calorie value in each row, and subject it to this Boolean. Only the rows that return True get saved and accumulated in the resulting new data frame.
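The filter described above can be sketched in two steps. This is a minimal sketch, not the lesson's actual notebook: the identifiers `food_df` and `max_500_cals_df` are assumed spellings of the spoken names, and every value in the table is invented for illustration. Note that pandas evaluates the comparison for the whole column at once, which gives the same result as checking row by row.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; all values are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad"],
    "price": [12.50, 9.75, 24.00, 8.25],
    "cals": [450, 750, 900, 250],
    "vegan": [False, False, False, True],
})

# Step 1: comparing a column to a number yields one True/False per row.
mask = food_df["cals"] <= 500

# Step 2: indexing the DataFrame with that mask keeps only the True rows.
max_500_cals_df = food_df[mask]

print(mask)
print(max_500_cals_df)
```

The original `food_df` is left unchanged; the filtered rows keep their original index labels.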

All right. We're going to make a new df of only the non-vegan food items. That is your challenge.

You haven't done many of these, so it might be a little much. But at least when we do the solution, it'll click a little better if you had a chance to struggle with it first. We'll make a df called non-vegan, and that's going to be set equal to the food df.

Inside here, in these square brackets, we're going to put our Boolean condition, set on the column that you want to target, which is vegan. What do we care about in the vegan column here? For non-vegan, that would be equals False, which is three of the four.

Everything but the garden salad. Yep, we got everything but the garden salad. If vegan is false, we want it.
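A minimal sketch of the non-vegan filter, again with invented values and assumed identifier spellings (`food_df`, `non_vegan_df`). The second form, using the `~` operator to flip a Boolean column, is not from the lesson; it is shown only as an equivalent alternative.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; all values are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad"],
    "price": [12.50, 9.75, 24.00, 8.25],
    "cals": [450, 750, 900, 250],
    "vegan": [False, False, False, True],
})

# Keep the rows where the vegan column equals False.
non_vegan_df = food_df[food_df["vegan"] == False]

# Equivalent alternative (not from the lesson): ~ inverts a Boolean column.
also_non_vegan_df = food_df[~food_df["vegan"]]

print(non_vegan_df)
```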

If vegan is false, we want to accumulate it in our new data frame.

Now, how do we add new data? Adding a new row to a data frame: df.loc[num] gives you the row at that row number.

To add a row, we specify the next new row number and set it equal to a list of values. Let's say we want a new item. We would need four values.

That's name, price, cals, and vegan: a string, a float, an int, and a bool. We'll call this new item. We've got this new item and we'd like to add it to the food df.

Now, the food df has, just as a reminder, four rows: 0, 1, 2, and 3. If we could declare a new row at index 4 and set it equal to this new item, we could slot it right in. Let's do that. We're going to say food df.loc, and the number is going to be four, and the value is going to be new item, which is this list.

It's going to slide right in and each value will fill in one column. New item is not defined. Okay, fine, I didn't run that one.

There you go. Fruit salad is in. Slid right in.
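Adding the row at index 4 might look like the sketch below. The price, calorie, and vegan values for the fruit salad are invented, and `new_item` is an assumed spelling of the spoken variable name. Assigning to a row label that does not exist yet makes `.loc` append a new row.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; all values are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad"],
    "price": [12.50, 9.75, 24.00, 8.25],
    "cals": [450, 750, 900, 250],
    "vegan": [False, False, False, True],
})

# One value per column, in column order: item, price, cals, vegan.
new_item = ["Fruit Salad", 7.50, 300, True]

# Row label 4 does not exist yet, so this assignment adds it.
food_df.loc[4] = new_item

print(food_df)
```

The list must have exactly as many values as the data frame has columns, or pandas raises an error.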

Slotted right in at index 4. All right, now, challenge. Add bison burger to the df as the next available row position. So, you can just keep this new item variable going.

We can do it in one line as well. Pause. Go ahead and do the solution.

All right, so the solution would be, we're going to add the new item, but we're going to do it at position 5. And there you go. There's the bison burger. Next, let's increase the new item price by a dollar with the plus equal operator.

So, we're going to come into bison burger, go to the price, and increase it by 1 to make it 19.50. So, that's kind of tricky. What are we going to do? We're going to say food df dot loc 5. We want row 5, and we want column price.

That's why we have price here. Loc goes by name. So, we're going to row 5, price column.

And we could set it equal to itself plus 1, or use the shorthand plus equals 1. If we run the food df again, that bison burger should be 19.50 now. And it went up by a dollar. It worked.
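The bison burger steps can be sketched as follows: add the row at position 5, then raise one cell with `+=`. The starting price of 18.50 is inferred from the lesson's result of 19.50; all other values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; values are invented.
columns = ["item", "price", "cals", "vegan"]
rows = [
    ["Pizza", 12.50, 450, False],
    ["Hamburger", 9.75, 750, False],
    ["Steak", 24.00, 900, False],
    ["Garden Salad", 8.25, 250, True],
    ["Fruit Salad", 7.50, 300, True],
]
food_df = pd.DataFrame(rows, columns=columns)

# Add the bison burger as row 5, the next available position.
food_df.loc[5] = ["Bison Burger", 18.50, 800, False]

# .loc[row label, column name] targets a single cell; += 1 raises it by one.
food_df.loc[5, "price"] += 1

print(food_df.loc[5, "price"])
```

The longhand form `food_df.loc[5, "price"] = food_df.loc[5, "price"] + 1` does the same thing.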

Now, double all the prices. Well, it tells you how to do it. So, food df dot loc, all rows.

Colon means all. Price is the column. So, it's every single row in the column multiplied by 2 with the shorthand times equals 2. So, the prices went through the roof.

Now, let's cut all the prices in half back to normal. Let's do that as a challenge. Turn off the recording, pause it, and come back after you've knocked all the prices down by half.

OK, here's the solution for that. We're going to go to food df dot loc, and we want every row, comma, price column. Then we use the shorthand divide equals 2. Or you could say times equals 0.5, if you like.
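Doubling the whole price column and bringing it back down can be sketched together. The prices are invented; the point is that `:` in the row position selects every row, so the operator applies to the entire column.

```python
import pandas as pd

# Simplified two-column stand-in for the lesson's food df; values are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad"],
    "price": [12.50, 9.75, 24.00, 8.25],
})
original_prices = food_df["price"].tolist()

# Double every price: all rows (:), price column.
food_df.loc[:, "price"] *= 2
doubled_prices = food_df["price"].tolist()

# Halve them again with divide-equals.
food_df.loc[:, "price"] /= 2
halved_prices = food_df["price"].tolist()

# Multiplying by 0.5 has the same effect as dividing by 2.
food_df.loc[:, "price"] *= 2
food_df.loc[:, "price"] *= 0.5

print(food_df)
```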

That would also work. Multiplying by 0.5 is the same as dividing by 2. And the prices are back how they were. Next up: len df.

Now, if we wanted to add yet another item, so here, we have new item again. Let's say new item. And this time, it's a Caesar salad.

It's got the name string, the price float, the calorie int, and the vegan Boolean, all in the correct order. We could just slot it in at row 6. We did the bison burger at row 5. But the problem with that is, what if we don't know what the number is? We don't know how long this is. We just want to go in at the end.

So for that, we would say length df. If you get the length of the df, it's kind of like the length of a list or of an array. It'll just tell you how many rows there are.

So that's your value. You don't write a 6. You just come in at the length. And since you're starting from 0, like right now, we're 6 rows, but that's 0 to 5. We want the length of 6, right? There's 6 items.

So the new index would be 6. The next new index is the current length. We're going to say food df.loc. And instead of saying 6, we're going to say length, the len of the food df, which is 6. And that's going to equal the new item. And then the food df, there it is.

There's your Caesar salad dynamically added. And if you added another new item, well, the length just got longer by 1, right? It's dynamic. And you could just keep going that way.
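A minimal sketch of the length-based append. It assumes the data frame still has its default 0, 1, 2, ... row labels, which is what makes `len(food_df)` equal to the next free label. All values are invented.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; values are invented.
columns = ["item", "price", "cals", "vegan"]
rows = [
    ["Pizza", 12.50, 450, False],
    ["Hamburger", 9.75, 750, False],
    ["Steak", 24.00, 900, False],
    ["Garden Salad", 8.25, 250, True],
    ["Fruit Salad", 7.50, 300, True],
    ["Bison Burger", 19.50, 800, False],
]
food_df = pd.DataFrame(rows, columns=columns)

new_item = ["Caesar Salad", 10.50, 480, False]

# Six rows are labeled 0 to 5, so len(food_df) is 6: the next free label.
next_row = len(food_df)
food_df.loc[next_row] = new_item

print(food_df.tail(2))
```

If the row labels are not a plain 0-to-n sequence (for example, after filtering), `len(food_df)` may collide with an existing label and overwrite that row instead of appending.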

Pause and type all this stuff. Got to do it. Now, get rid of the last row.

Why would we want to do that? Here, let's run this command again. Look, you've got two Caesar salads now. Why? Because the length is going up dynamically.

Now let's get rid of the last row. We'll say, what we'll do is we'll take a slice of itself and just leave off the last row. Food df is going to equal itself, dot loc.

And we'll start at 0 and go to 6. And then we want all columns. We want rows 0 to 6, and we want all columns. There we go.

We shaved off the bottom row, right? We didn't go all the way to 7. That is one way. You can also pop or remove rows a different way. But we just slice from top to bottom, leaving off the last row.
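The slice that drops the duplicate last row can be sketched like this, using a simplified two-column frame with invented prices. The comparison with `.iloc` is added only to show the difference in how the end of the slice is treated.

```python
import pandas as pd

# Simplified stand-in with the Caesar salad added twice; prices are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad",
             "Fruit Salad", "Bison Burger", "Caesar Salad", "Caesar Salad"],
    "price": [12.50, 9.75, 24.00, 8.25, 7.50, 19.50, 10.50, 10.50],
})

# .loc slices by label and includes the end label: rows 0 through 6.
trimmed_df = food_df.loc[0:6, :]

# .iloc slices by position and excludes the end: rows 0 through 5.
positional_df = food_df.iloc[0:6, :]

food_df = trimmed_df
print(food_df)
```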

Remember, loc being inclusive. We do get the 6. All right, now, what about a loop? We learned about loops. That was a while ago.

That was lesson 4. But we need to know them and remember them and use them. So, we saw that we could just keep running this cell and keep adding Caesar salads, because the length increases every time. So it'll go to 7 and 8 and 9 and so on.

So what we're going to do is use a loop, though, taking advantage of the fact that the length will keep getting longer every time you run, every time you add an item. We've got these three items that we're going to add all at once. We'll say there's one list, there's two, and there's three.

And these are going to be bundled as the three items of a parent list. We'll call it new items with an S. And what do we want to do here? Well, we can just keep it rolled up. We're going to loop it.

And we're going to loop the new items, adding the current item to the food df each time as a new row using len df to get the new length, the new longer length, the longer by one value each time. So the length is going up by one every time, which is perfect because you want to slot in the next item at the new next higher number every time. We're going to say for item in new items, food df dot loc len new item.

No, not len new items, len food df, right? The food df is getting longer and longer. Whoa. Oh, I didn't run the new items.

OK, there we go. Whoa, what are we doing here? Exception. We have a list here, seven.

Really? Don't say. Item in new items, loc. Oh, we didn't assign it to anything.

There we go. We have to assign this to something. All right.

We're taking the current item, which is this whole list, and we're slotting it in as the new row at the new next row number. And I just added it twice. Look, I just went and added it.

I ran the command again, so I did it twice. We've got chicken, chef. OK, chicken salad, chef salad, big kahuna.
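The loop, once it is assigned correctly, might look like the sketch below. The item names come from the lesson; the prices, calories, and vegan flags are invented, and `new_items` is an assumed spelling of the spoken name.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; values are invented.
columns = ["item", "price", "cals", "vegan"]
rows = [
    ["Pizza", 12.50, 450, False],
    ["Hamburger", 9.75, 750, False],
    ["Steak", 24.00, 900, False],
    ["Garden Salad", 8.25, 250, True],
    ["Fruit Salad", 7.50, 300, True],
    ["Bison Burger", 19.50, 800, False],
    ["Caesar Salad", 10.50, 480, False],
]
food_df = pd.DataFrame(rows, columns=columns)

# A parent list holding three row lists.
new_items = [
    ["Chicken Salad", 11.00, 520, False],
    ["Chef Salad", 11.50, 560, False],
    ["Big Kahuna", 14.00, 880, False],
]

# len(food_df) grows by one on each pass, so each item gets the next row number.
for item in new_items:
    food_df.loc[len(food_df)] = item

print(food_df.tail(3))
```

Running the loop a second time appends the same three rows again, because nothing checks for duplicates.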

All right, so we want to shave off one set of them. This time, instead of loc, let's use iloc. We want to go from the beginning to negative three.

We want all the columns. Let's see if that does it. Yep, now we just have the one set.
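A sketch of trimming the last three rows with `.iloc`, on a simplified two-column frame with invented prices. The outer `for _ in range(2)` loop is only there to simulate running the cell twice by accident.

```python
import pandas as pd

# Simplified two-column stand-in; prices are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad",
             "Fruit Salad", "Bison Burger", "Caesar Salad"],
    "price": [12.50, 9.75, 24.00, 8.25, 7.50, 19.50, 10.50],
})
new_items = [["Chicken Salad", 11.00], ["Chef Salad", 11.50], ["Big Kahuna", 14.00]]

# Simulate running the loop cell twice: six rows get added instead of three.
for _ in range(2):
    for item in new_items:
        food_df.loc[len(food_df)] = item
rows_after_double_run = len(food_df)

# Keep everything up to, but not including, the third row from the end.
food_df = food_df.iloc[:-3, :]

print(food_df)
```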

Get rid of the last three if you ran the loop twice. Right, if you ran the loop twice, you got the three new items twice, so we're shaving them off by selecting from the first row up to the third-from-last row, exclusive: colon negative three, comma, colon for all columns. Let's change the name.

Let's target chef salad and change it to house salad, so how would we do that? Well, we'd have to come in and get it. Now, we need to know the location. Let's just start with that.

Let's see here. All right, we'll just say location 8 is chef salad. Food df dot loc, 8, and item, right? We're changing the item to house salad. Run, and it's a house salad now.
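Renaming one cell by row label and column name could be sketched as below, on a simplified two-column frame with invented prices. Row 8 is where the chef salad sits in the lesson's data frame at this point.

```python
import pandas as pd

# Simplified two-column stand-in; prices are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad", "Fruit Salad",
             "Bison Burger", "Caesar Salad", "Chicken Salad", "Chef Salad",
             "Big Kahuna"],
    "price": [12.50, 9.75, 24.00, 8.25, 7.50, 19.50, 10.50, 11.00, 11.50, 14.00],
})

# Row label 8, column name "item": overwrite that single cell.
food_df.loc[8, "item"] = "House Salad"

print(food_df.loc[8])
```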

We've already divided the prices and stuff. We don't really need to do that again. Let's do another filter.

Challenge, get just the minimum 15 price items. If you look at all these items, some of them are less than 15, so we want to go in and get just the ones that are 15 or more, the minimum of 15. So turn off the recording, pause, and come back when you have had a chance to try it.

OK, so we'll make a new df. We'll call it min 15 price df, and it equals food df. What do we want to do? We want to put in our Boolean condition here.

We'll say food df price is greater than or equal to 15. Run, and there it is. Only those items, at least 15, go in.
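The price filter follows the same pattern as the calorie filter. In this sketch the prices are invented, with one item priced at exactly 15.00 to show that `>=` includes the boundary; `min_15_price_df` is an assumed spelling of the spoken name.

```python
import pandas as pd

# Simplified two-column stand-in; prices are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad",
             "Fruit Salad", "Bison Burger", "Big Kahuna"],
    "price": [12.50, 9.75, 24.00, 8.25, 7.50, 19.50, 15.00],
})

# Keep only the rows whose price is 15 or more.
min_15_price_df = food_df[food_df["price"] >= 15]

print(min_15_price_df)
```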

OK, challenge, change house salad again. Now we're going to change it to shrimp salad. Let's use iloc, even though it is not as user-friendly.

So for iloc, we'd have to say 8 for the row and 0 for the column, not item. We'll say food df.iloc, 8 for the row, 0 for the column, and that will be set equal to shrimp salad. See if that works.
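The same kind of update with `.iloc` uses integer positions for both the row and the column. This sketch uses a simplified two-column frame with invented prices. The `columns.get_loc` lookup is not from the lesson; it is shown as one way to find a column's position rather than counting by hand.

```python
import pandas as pd

# Simplified two-column stand-in; prices are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad", "Fruit Salad",
             "Bison Burger", "Caesar Salad", "Chicken Salad", "House Salad",
             "Big Kahuna"],
    "price": [12.50, 9.75, 24.00, 8.25, 7.50, 19.50, 10.50, 11.00, 11.50, 14.00],
})

# .iloc takes positions only: row position 8, column position 0.
food_df.iloc[8, 0] = "Shrimp Salad"

# Look up a column's position by name instead of hard-coding it.
item_col = food_df.columns.get_loc("item")

print(food_df.iloc[8, item_col])
```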

Yep, it's a shrimp salad. OK, we've got two numeric columns. You can get stats on numeric columns, and to do that, we use what's called the describe method.

So let's say food df.describe(). Run that, and it gives you a statistical breakdown of every numeric column. The count would be the number of rows, then the mean, the average value of each column, and the standard deviation, which is how far you have to go, plus or minus, to move one standard deviation along the bell curve. Plus and minus one standard deviation covers a total of about 68%, so 34% in each direction.

So what is the mean? Your mean calorie value is 637, and your standard deviation is 250.

So that means if your mean, 637, is right here at the top of the bell curve, you would have to go 250 calories to the right to reach plus one standard deviation, and 250 calories down the slope to the left to reach minus one standard deviation. The total range from negative 1 to positive 1 standard deviation is 68%. So if you add 250 to 637, you get 887. If you subtract 250 from 637, you get 387.

So 387 to 887, that's about two thirds of all the food items, assuming the calories follow a bell curve. That's how it works. And going out another standard deviation, you're at about 95%.
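The arithmetic above can be checked in a few lines. The mean of 637 and standard deviation of 250 are the figures quoted in the lesson; the 68% and 95% shares are rules of thumb that hold for roughly bell-shaped data, not guarantees for a small table like this one.

```python
# Figures quoted in the lesson's describe output for the calories column.
mean_cals = 637
std_cals = 250

# One standard deviation either side of the mean (about 68% for a bell curve).
one_sd_low = mean_cals - std_cals
one_sd_high = mean_cals + std_cals

# Two standard deviations either side (about 95% for a bell curve).
two_sd_low = mean_cals - 2 * std_cals
two_sd_high = mean_cals + 2 * std_cals

print(f"Within one standard deviation: {one_sd_low} to {one_sd_high}")
print(f"Within two standard deviations: {two_sd_low} to {two_sd_high}")
```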

Same idea for the price. That's a bell curve. Very important concept in statistics.

Now, we also have your minimum value, your max value, and the percentiles. A percentile means that that percent of the items is lower than you. If a food item has 520 calories, then only 25% of the items have fewer calories than that, which makes sense if the mean is 637.

It also makes sense that the 50th percentile would be pretty close to the mean. We don't have a lot of data, so it's not that close. But if you had thousands of items, your 50th percentile would be pretty close to your mean.

75th percentile, that means if your dishes, your meal, or whatever, your food item is 700 calories, then 75% of the items have fewer calories. So those are the eight statistical values that you get when you run the describe method on a data frame. And notice, again, it only gives you a report on the numeric columns.

It couldn't do that on strings or booleans. Wouldn't make sense.
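A minimal sketch of the describe call on an invented four-row table, so the numbers will not match the lesson's output. It shows the eight rows of the summary and that, by default, only the numeric columns are reported.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's food df; all values are invented.
food_df = pd.DataFrame({
    "item": ["Pizza", "Hamburger", "Steak", "Garden Salad"],
    "price": [12.50, 9.75, 24.00, 8.25],
    "cals": [450, 750, 900, 250],
    "vegan": [False, False, False, True],
})

# describe() returns a new DataFrame: one row per statistic,
# one column per numeric column of the original.
stats = food_df.describe()

print(stats)
print(stats.loc["mean", "cals"])   # a single statistic can be read with .loc
```

The string column (item) and the Boolean column (vegan) are left out of the default report.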


Brian McClain
