Data Scraping: Navigating Unique Web Structures

Demonstrate web scraping with BeautifulSoup to extract text from specific HTML tags.

Learn how to scrape web data by identifying HTML elements and extracting text using BeautifulSoup. Understand common pitfalls and adapt your scraping strategy to handle irregular web page structures.

Key Insights

  • Use BeautifulSoup's find method to precisely locate HTML elements by their tags and attributes, such as retrieving text from an a tag with a specific attribute (name="1.1.13").
  • Apply list comprehensions in Python to efficiently extract text from multiple HTML elements, demonstrated by grabbing content from the first 10 a tags.
  • Stay attentive to the actual structure and quirks of web pages rather than relying on conventional standards, as pages may deviate from typical usage, requiring flexible scraping strategies.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let’s solve these challenges. First, the line "Our court shall be a little Academe." Let’s look here.

Here’s that text. If I want, I can inspect the page and find out exactly where it is: it’s the a tag whose name attribute is 1.1.13. So what we want is line = soup.find('a', {'name': '1.1.13'}). Then we’ll run line.get_text().
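As a rough sketch of that lookup, here is the same find call run against a small stand-in for the page. The SAMPLE_HTML string and its placeholder text are assumptions made up for illustration, not the real page's markup. The attributes go in a dictionary because find's first parameter is itself called name, so name= cannot be passed as a keyword filter.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup; the real page's HTML is not reproduced here.
SAMPLE_HTML = """
<html><body>
<a href="index.html">Home</a>
<a name="1.1.12">Sample text for line 1.1.12</a><br>
<a name="1.1.13">Sample text for line 1.1.13</a><br>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find the first <a> tag whose name attribute is exactly "1.1.13".
line = soup.find("a", {"name": "1.1.13"})

# find returns None when nothing matches, so guard before calling get_text.
line_text = line.get_text() if line is not None else None
print(line_text)
```

The None check is optional in a quick notebook session, but it avoids an AttributeError if the attribute value is mistyped.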

And there it is. All right, the second one’s a little tougher. Not so tough, a little tougher.

The second challenge: all a tags. We can store them in a variable called, I don’t know, tags. That seems like a fine name; it’s pretty generic, but we’re doing something pretty generic.

I want to find all a tags. That’s it. Just find them all, which is what soup.find_all('a') does.
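A minimal sketch of that step, assuming a small made-up document (SAMPLE_HTML is hypothetical). Note that find_all returns every a tag regardless of which attributes it carries.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup with three <a> tags of two different kinds.
SAMPLE_HTML = """
<a href="index.html">Home</a>
<a name="1.1.1">Sample line one</a><br>
<a name="1.1.2">Sample line two</a><br>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all collects every matching tag, in document order.
tags = soup.find_all("a")
print(len(tags))
```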

Now, let’s print out the text for the first 10. Let’s make a tag_texts list. It’s a new list.

We’ll do a list comprehension. I kind of prefer to write the loop part first: for tag in tags.

Now I can go back and say I want tag.get_text() for every tag in tags. And again, I only want the first 10 of these, so we slice. And there we go.
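Put together, the comprehension might look like the sketch below. The generated markup is a hypothetical stand-in with twelve a tags, so that the slice to the first 10 is visible. Slicing tags before the comprehension is one option; slicing the finished list would give the same result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: twelve <a> tags with placeholder text.
sample_html = "".join(
    f'<a name="1.1.{n}">Sample line {n}</a><br>' for n in range(1, 13)
)

soup = BeautifulSoup(sample_html, "html.parser")
tags = soup.find_all("a")

# Take the text of each tag, keeping only the first 10 tags.
tag_texts = [tag.get_text() for tag in tags[:10]]
print(tag_texts)
```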

We have Shakespeare’s homepage, Love’s Labour’s Lost, and then some content from the play. Now, these were maybe not what you were expecting. The tags up here at the top are more what we think of as links.

That’s what a tags typically are. The rest are doing kind of a funky thing on this page. But it’s something to pay attention to as you’re scraping data: pages don’t have to follow the regular standards for how to build a page.

Our job as data scrapers is to figure out what this page actually does, not what it should be doing. If we pay attention only to what pages typically do, rather than to what this page is doing, we may get more, fewer, or just different texts than we expect.
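One way to check what a page is doing with its a tags is to split them by attribute. This is only a sketch on made-up markup (SAMPLE_HTML is hypothetical), and it assumes the page separates the two uses the way this lesson describes: href for navigation links, name for the text itself. Passing True as an attribute value matches any tag that has that attribute at all.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup mixing conventional links and named anchors.
SAMPLE_HTML = """
<a href="index.html">Home page</a> | <a href="play.html">Play index</a>
<a name="1.1.1">Sample line one</a><br>
<a name="1.1.2">Sample line two</a><br>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# <a> tags that carry an href: the conventional, clickable links.
link_texts = [tag.get_text() for tag in soup.find_all("a", href=True)]

# <a> tags that carry a name attribute: here, the ones wrapping the text.
anchor_texts = [tag.get_text() for tag in soup.find_all("a", attrs={"name": True})]

print(len(link_texts), "links;", len(anchor_texts), "named anchors")
```

Comparing the two counts before writing the real extraction code gives a quick picture of how a particular page uses the tag.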

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
