Master the nuances of Pandas' LOC, ILOC, and slicing techniques to efficiently manipulate your DataFrames. Explore practical scenarios like randomized grouping and data aggregation to streamline your data analysis workflow.
Key Insights
- Distinguish clearly between LOC (inclusive with labels) and ILOC (exclusive with integer positions) indexing methods to select specific rows and columns, notably when column names are non-continuous or scattered.
- Implement randomization techniques using the
sample()
method to create randomized splits of data, such as dividing 1,000 students into two randomized groups of 500 each. - Utilize Pandas'
groupby()
function effectively to aggregate data, demonstrated by grouping 1,000 students based on 'parental EDU' and counting occurrences, followed by renaming and sorting the resulting grouped DataFrame.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
This is a lesson preview only. For the full lesson, purchase the course here.
Now, let's do a quick review of .loc
and .iloc
. Remember, you can slice. Here, we have a little slice of 10—the first 10.
Let's just—now, we don't have to make a DataFrame out of everything. Sometimes you can just output. Let's just take a slice.
Let's get just the six students at index 50 to 55, inclusive, and only three columns. We only want gender
, parental EDU
, and average
, which aren’t even continuous or connected.
So to do that, we could say students_df.iloc[50:56, [0,1, 7]]
, right? The 56 would be exclusive. So you get 50,51,52,53,54,55—that’s six—comma, column.
You want every single column? Fine. But we don't want every column. We're going to feed in a list of the three columns that we do want.
And it's whining—bitterly whining. Oh, because you can't use .iloc
for names. There we go.
average
—not an index? Oh yeah, it’s called average
. Fair. We’re using the column names, not the index position, so we have to use .loc
.
So to get a slice of rows, we did that with the top 10 here, and we could get away without using .loc
or .iloc
on something like that. But when you're saying, "I want a range of rows and a range of columns, " you kind of have to do—you’ve got to have a start and end row and a start and end column. If, in the case of the rows, they're continuous, we can use a colon to do a range from 50 to 56 exclusive—or inclusive, actually.
Remember, .iloc
is exclusive. .loc
is inclusive. If it says get index 50 to 55, then since we're using .loc
because we're using column names, not indexes, we just go 50 to 55 on the rows.
There we go. And again, the rows are continuous, so we do start:stop. The columns, however, are not continuous.
They are not in a range. They do have gaps, right? gender
and parental EDU
are together, but average
is all the way at the end. So you have to list out the three columns that you want.
Now, you can also avoid .loc
. You can get away without using .loc
or .iloc
. To do this, you use square brackets twice.
You use double square brackets. So here we have the move to do .loc
, where you say 50 to 55 and then the columns—that's exactly what we just did here. But you can also do it another way, an alternative version of the above .loc
without .loc
or .iloc
using double square brackets.
We're going to say students_df[50:56][['gender', 'parental EDU', 'average']]
. So the first set of square brackets will be your rows, and these are going to be exclusive. And then the next set is going to be the column names in one set of square brackets representing the columns.
And then the columns themselves, being not continuous, go in a nested set of square brackets. It works though. Boom. Okay. So that’s two different ways of doing the same thing here.
Let me shut off the output. So here you have two ways to get the same thing. One set of wrapper square brackets: start:stop row, comma, your columns—or no comma, set of square brackets, another set of square brackets.
.loc
version of the above—we did that already. Okay. .iloc
version of the above is a little bit harder, right? If you want to do .iloc
, that means you can’t use the names of the columns, and the closing row is exclusive. So you’d put a 56 and then you just say, okay, the gender is column 0, parental EDU is column 1, and average is column 7.
That would also work. Oh yeah, but it would be .iloc
. There you go.
That’s another way to do it. Not so user-friendly because now you’re on the hook for knowing the index values of the columns. And typically you don’t want to have to know that.
You don’t want to have to know that the parental EDU column is really indexed with 1. All right, but it works. We’ve got .loc
, we’ve got no .loc
(but double sets of square brackets), and we’ve got .iloc
.
Let’s get five random rows. We already did that. That would be students_df.sample(5)
. And now let’s try this.
What if we want to divide the thousand students into two randomized groups of 500? So what we can do is shuffle the thousand students by making a sample of a thousand and saving it. We could say students_randomized_df = students_df.sample(,000)
. Right? If sample(5)
gives you five random ones, then sample(,000)
would give you a thousand random ones—which is all of them randomized.
We would shuffle them. And then we would save that to the new DataFrame to prove it worked—top five. Yep. And then get the last five.
So that totally works. So what we could do with that is—once we’ve shuffled the thousand, randomized them—if we wanted to divide the thousand students into two randomized groups of 500 for some kind of, let’s say, study and compare how they do over time or something, well, we could just randomize them all and then take the first 500 to be Group A and the second 500 to be Group B. So that’s what we’re going to do.
Let’s assign the first 500 to group_A_df
and the last 500 to group_B_df
. And they’re already all randomized. So the first 500 will be random. And so will the last 500. We’ll say group_A_df = students_randomized_df[:500]
.
We can do it—it’ll let us have the first 500 that way. We don’t have to use .iloc
or .loc
.
And we’ll get group_B_df
as well. That would be—you would start at 500 and go to the end. We’ll check the head of each of these.
So the head of—so if you’ve got a thousand randomized ones and you take the first five, that’s going to match the first five of the first 500, right? So Group A is the first five of the thousand as well as the 500. And then Group B is from 500 to 504 or 505. That’s the second half.
We’ve divided the students into two groups of 500 by first just shuffling—randomizing—the thousand, then taking the first 500 to Group A, second 500 to Group B. All right, so here’s what we’re going to do now. What if we want to just grab five students at random—even though it’s all randomized, let’s still just get five students at random.
We’ll use sample
. We’ll say students_df.sample(5)[['column1', 'column2', 'column3', 'column4']]
. Let’s not even save it to a new DataFrame—just run it. And there we go. Five random students, four different scores.
It’s like slicing off the columns that we want. What if we want to get just the first 15 odd-index students—all columns? We have to use a step. We’ll say students_df[1:30:2]
.
That should be it. All right, there’s your first 15 odd all the way from 1 to 29. We start at 1—that sets the odd, right? We want odd, so we want every other one. So our step is 2, but you’ve got to start at 1 if you want odd, because it’ll be 1,3, 5.
If you start at 0, you’ll get even ones. All right, challenge: get every 50th student.
So student at index 0,50,100,150, etc. That’ll be 20 students altogether, right? Because there’s 1,000—or 21 maybe. Yeah, no, 20—we’ll get it.
Every 50th student. Let’s see. We’re going to pause, do it, come back. Okay, here we go.
students_df[::50]
. There it is—start at the beginning, go to the end, get every 50th one.
All right, so a lot of that is review. Okay, this is brand new though: grouped_df
. We can group by.
We can make a DataFrame that’s representing the data in clusters. So let’s say we want to group by a column name and get the count. We’d like to know how many students there are with each of these parental EDU categories.
Like how many students fall into the parental bachelor degree category? How many in some college? There are six or seven categories. We’d like the breakdown of the 1,000 students assigned to each category. We’re going to say edu_group = students_df.groupby('parental EDU').count()
.
And we have 222 and 118,118. Now, of course, at this stage of the game, all we really want is to know how many fall into each one of these categories. We don’t need all these columns anymore.
So what we will do is just pick a column—say gender
—and just take that. Same as the above, but just one column since all columns have the same value. So let’s take this and let’s say we want only gender
.
We’re going to slice that column right there. And we need to wrap it in its own list. Then—okay, more like it.
Although now we want to rename the column, because that shouldn’t say gender
—that should just say count
or something. We’re going to rename—remember how to do that? Challenge: rename the gender
column to count
. And you get this little evaluate pair hint.
So pause, do it, come back. Okay, we’re back. It’s actually the grouped students—edu_group.rename(columns={'gender': 'count'}, inplace=True)
.
Let’s see—it worked. That’s what we want.
Now let’s sort the DataFrame by the only sortable value, which is the count
, right? These aren’t in any kind of order. We’ll say edu_group.sort_values(by='count', ascending=False, inplace=True)
.
And maybe we won’t have trouble with inplace=True
here with the sort—but we may. Let’s also get the shape of this thing. Eh, it doesn’t like it.
All right, so we’re not going to do inplace=True
on the sort. Oh, of course, gender
has already been changed—doesn’t exist anymore—it’s count
now. There we go. All right.
Now if we run edu_group
again, it’s not sorted anymore because we didn’t save it to itself and we didn’t say inplace=True
. So the return value that you see only got printed. It didn’t get stored or stamped into—there we go.
There we go. So inplace=True
to make the change really take place—stamp the existing instance. All right, and our shape is (6,1) because we’ve only got one column.
You might think, wait, I see two columns, but that parental EDU
—that is not a column. That’s just named index values, right? Like if you reset the index, this would become 0,1, 2,3, 4,5, 6.