Learn how to extract pagination data using Beautiful Soup, and set the stage for powerful web scraping. This guide demonstrates parsing HTML elements step-by-step to identify and handle pagination effectively.
Key Insights
- Inspect the HTML to identify the target element, specifically the <li> tag with the class current, to extract the pagination text "Page 1 of 50."
- Use Python's Beautiful Soup library to locate and isolate that element, then use its .text attribute and the string .split() method to obtain and parse the pagination content.
- Convert the extracted page count from a string to an integer to avoid data processing errors later, setting the foundation for looping through multiple pages to retrieve comprehensive data.
Let's take a look at how we could do this step by step. The first step is a little exploration to figure out how we can hook into this element. Let's inspect it.
It is an <li> tag; that's the name of the tag, just like the <p>, <a>, and <h3> tags that we've been working with. It has a class attribute with the value "current." The <li> next to it has the class "next"; that's not the one we want.
The one whose class equals "current" has the text we want: Page 1 of 50. All right, let's take a look.
Now that we know it's an <li> with a class of "current," we could say pagination element equals soup.find. We use find because we just want to find one: the <li> with the class of "current." All right, that should do that.
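The lookup just described can be sketched as follows. This is a minimal example, not the course's exact code: the HTML string is hypothetical sample markup standing in for the real page, and the variable name pagination_element is an assumption based on the narration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for the real page's pagination block.
html = """
<ul class="pager">
  <li class="current">
    Page 1 of 50
  </li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
# The keyword is class_ (trailing underscore) because class is reserved in Python.
pagination_element = soup.find("li", class_="current")

print(pagination_element.name)          # li
print(pagination_element.get("class"))  # ['current']
```

Because find() returns None when there is no match, it can be worth checking the result before reading its text on pages whose layout you have not inspected.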
Now I want the text in it. I'm just going to break this up. It could potentially be done all in one line or two.
Let's do it in three. That's the text that's in it. And now for our bonus, which we definitely want to do: ultimately we want the maximum number of pages. It's that same pagination content, but I want it split into words.
And that's what .split will do. It will take a string and make it into a list of words.
And we can check that out. For example, from "Page 1 of 50," we get the list of words "Page," "1," "of," and "50." I want the last word in that list, so after I split it, give me index -1.
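The split-and-index step can be tried on its own with a plain string. This is a minimal sketch: the string literal is a stand-in for the element's text (assumed here to come back padded with whitespace), and the names pagination_content, words, and last_word are illustrative.

```python
# Stand-in for the element's .text; real pages often pad it with whitespace.
pagination_content = "\n    Page 1 of 50\n  "

# split() with no arguments splits on any run of whitespace and
# ignores leading and trailing whitespace, so no strip() is needed first.
words = pagination_content.split()
print(words)      # ['Page', '1', 'of', '50']

# Index -1 is the last item in the list.
last_word = words[-1]
print(last_word)  # 50
```

Note that the printed 50 is still a string at this point, even though it looks like a number.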
And there it is. Ooh, it is the string "50." We should probably convert it to an integer.
And there we go. Not a string now. Left as a string, that could have caused problems later.
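To see why the conversion matters, here is a small sketch. The starting value is a stand-in for the last word taken from the pagination text, and the names max_pages and page_numbers are assumptions for illustration.

```python
last_word = "50"  # stand-in for the last word of the pagination text

# As a string, arithmetic fails: adding 1 to "50" raises TypeError.
try:
    last_word + 1
    string_math_failed = False
except TypeError:
    string_math_failed = True

# int() converts the numeric string into an integer.
max_pages = int(last_word)

# range() needs integers; the + 1 makes the final page inclusive.
page_numbers = list(range(1, max_pages + 1))
print(string_math_failed, page_numbers[0], page_numbers[-1])  # True 1 50
```

With max_pages as an integer, it can serve directly as the upper bound when looping over page numbers.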
Okay, our next step is to take this and do a very complex and beautiful loop to hit up every single page on this site and get all that beautiful data.