Extracting Pagination Data: Navigating Web Elements

Learn how to extract pagination data using Beautiful Soup, and set the stage for powerful web scraping. This guide demonstrates parsing HTML elements step-by-step to identify and handle pagination effectively.

Key Insights

Inspect the HTML to identify the target element, specifically the <li> tag with the class "current," to extract the pagination text "Page 1 of 50."
Apply Python's Beautiful Soup library to locate and isolate the desired HTML element, then use the .text attribute and the string .split() method to obtain and parse the pagination content.
Convert extracted pagination values from strings into integers to avoid potential data processing errors, setting the foundation for looping through multiple pages to retrieve comprehensive data.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's take a look at how we could do this step-by-step. The first step is a little exploration to figure out how we can hook into this element. Let's inspect it.

It is an <li> tag. That's the name of the tag, just like the <p>, <a>, and <h3> tags that we've been working with. It has a class attribute to identify it, with the value "current." The one next to it has the class of "next"; that's not the one we want.

The one whose class equals "current" has the text we want: Page 1 of 50. All right, let's take a look.

Now that we know it's an <li> with a class of "current," we can set pagination_element equal to the result of soup.find. We just want to find one element: the <li> with the class of "current." All right, that should do that.
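As a minimal sketch of that lookup: the HTML below is a made-up stand-in for the page's pagination markup (the <ul> wrapper and the link are assumptions), and the variable name pagination_element is one reasonable spelling of the name used in the lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page's pagination markup.
html = """
<ul class="pager">
  <li class="current">Page 1 of 50</li>
  <li class="next"><a href="page-2.html">next</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
# class_ has a trailing underscore because class is a reserved word in Python.
pagination_element = soup.find("li", class_="current")
print(pagination_element)
```

If find() returns None, the selector did not match anything, which is worth checking before going further.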

Now I want the text in it. I'm just going to break this up. It could potentially be done all in one line or two.


Let's do it in three. Reading the element's text gives us the pagination content. And now for our bonus, which we definitely want to do: we ultimately want the maximum number of pages. That's in the pagination content, but we need it split into words.
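Here is a small sketch of reading the text out of the element. The markup is again hypothetical, with extra whitespace added on purpose because real pages often pad the text; calling .strip() is an optional extra, not something the lesson requires.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the surrounding whitespace is deliberate.
html = '<li class="current">\n    Page 1 of 50\n</li>'
soup = BeautifulSoup(html, "html.parser")

pagination_element = soup.find("li", class_="current")

# .text is an attribute (no parentheses) holding the text inside the tag.
pagination_content = pagination_element.text

# repr() makes any leading or trailing whitespace visible.
print(repr(pagination_content))
print(repr(pagination_content.strip()))
```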

And that's what .split() will do. It will take a string and make it into a list of words.

And we can check that out. For example, from "Page 1 of 50," we get a list of four words. I want the last word in that list, so after I split it, give me index -1.
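A quick sketch of the split-and-index step, using the literal string from the lesson so it runs without any HTML. The names words and last_word are illustrative.

```python
pagination_content = "Page 1 of 50"

# With no arguments, split() breaks on any run of whitespace,
# so padding around the text does not produce empty strings.
words = pagination_content.split()
print(words)  # ['Page', '1', 'of', '50']

# Index -1 means "the last item", whatever the length of the list.
last_word = words[-1]
print(repr(last_word))  # '50'
```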

And there it is. Ooh, it is the string "50." We should probably convert it to an integer.

And there we go. Not a string now. That could have caused problems later.
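To illustrate why the conversion matters, here is a short sketch; max_pages is an assumed variable name. Passing the string to range() fails, while the integer works.

```python
last_word = "50"

# int() parses the digits in the string into a real number.
max_pages = int(last_word)
print(type(last_word).__name__, type(max_pages).__name__)  # str int

# A string cannot be used where a number is expected.
try:
    range(last_word)
    string_worked = True
except TypeError:
    string_worked = False
print(string_worked)  # False

# The integer version works as a loop bound.
page_numbers = list(range(1, max_pages + 1))
print(len(page_numbers))  # 50
```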

Okay, our next step is to take this and do a very complex and beautiful loop to hit up every single page on this site and get all that beautiful data.
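As a rough preview of where this is heading, the sketch below only builds a list of page URLs from the page count; it does not make any requests. The URL template is entirely hypothetical, since the site's real address pattern is not given here.

```python
max_pages = 50

# Hypothetical URL pattern; substitute the real site's pattern.
url_template = "https://example.com/catalogue/page-{}.html"

# Pages are numbered from 1, so the range runs from 1 through max_pages.
page_urls = [url_template.format(n) for n in range(1, max_pages + 1)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```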

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
