Data Scraping: Navigating Unique Web Structures

Demonstrate web scraping with BeautifulSoup to extract text from specific HTML tags.

Learn how to scrape web data by identifying HTML elements and extracting text using BeautifulSoup. Understand common pitfalls and adapt your scraping strategy to handle irregular web page structures.

Key Insights

  • Use BeautifulSoup's find method to precisely locate HTML elements by their tags and attributes, such as retrieving text from an a tag with a specific attribute (name="1.1.13").
  • Apply list comprehensions in Python to efficiently extract text from multiple HTML elements, demonstrated by grabbing content from the first 10 a tags.
  • Stay attentive to the actual structure and quirks of web pages rather than relying on conventional standards, as pages may deviate from typical usage, requiring flexible scraping strategies.

Let's solve these challenges. First: "Our court shall be a little Academe." Let's look here.

Here's that text. If I want, I can inspect it and find out exactly where it is. It's the a tag with the name attribute 1.1.13. Okay, so what we want is to say line = soup.find() for an a tag whose name attribute is "1.1.13". And then we'll run line.get_text().
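As a minimal sketch of that step, assuming a small hand-written HTML snippet in place of the real page (the html string below is a hypothetical stand-in, not the page's actual markup), the lookup might be written like this. One detail worth noting: because find() already uses name as its first parameter (the tag name), the HTML name attribute is passed through attrs rather than as a keyword argument.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's HTML, written by hand for illustration.
html = """
<a href="index.html">Love's Labour's Lost</a>
<blockquote>
<a name="1.1.12">Navarre shall be the wonder of the world;</a><br>
<a name="1.1.13">Our court shall be a little Academe,</a><br>
</blockquote>
"""

soup = BeautifulSoup(html, "html.parser")

# "name" is also find()'s own first parameter (the tag name), so the HTML
# attribute goes in the attrs dictionary instead of a keyword argument.
line = soup.find("a", attrs={"name": "1.1.13"})

print(line.get_text())
```

If no tag matches, find() returns None, so on a real page it may be worth checking the result before calling get_text().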

And there it is. All right, second one's a little tougher. Not so tough, a little tougher.

All a tags. So that would be, we can say, I don't know, tags. That seems like a fine name, pretty generic, but we're doing something pretty generic.

I want to find all a tags. That's it. Just find them all.
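A rough sketch of this step, again assuming a short made-up HTML snippet rather than the real page: find_all returns every matching tag in document order, whether it is a conventional link or a named anchor.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one conventional link and two named anchors.
html = """
<a href="index.html">Love's Labour's Lost</a>
<a name="1.1.12">Navarre shall be the wonder of the world;</a><br>
<a name="1.1.13">Our court shall be a little Academe,</a><br>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all("a") collects every a tag, regardless of its attributes.
tags = soup.find_all("a")

print(len(tags))
```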

Now to print out the text for the first 10. Let's make a tag_texts list. It's a new list.

We'll do a list comprehension. We want tag. I kind of prefer to write the loop part first: for tag in tags.

And now I can go back and say I want tag.get_text() for every tag in tags. Okay, and now I only want, again, the first 10 of these. And there we go.
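The comprehension and the "first 10" step could look roughly like the following. The markup is generated here purely for illustration (two made-up navigation links followed by twelve placeholder line anchors), and slicing the finished list is only one way to keep the first 10; slicing tags before the comprehension would work as well.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: 2 navigation links followed by 12 named line anchors.
nav = '<a href="/">Home</a> <a href="play.html">Play</a>'
lines = "".join(f'<a name="1.1.{n}">Line {n}</a><br>' for n in range(1, 13))
soup = BeautifulSoup(nav + lines, "html.parser")

tags = soup.find_all("a")

# Build a new list holding the text of every a tag.
tag_texts = [tag.get_text() for tag in tags]

# Keep only the first 10 entries.
first_ten = tag_texts[:10]

print(first_ten)
```

Note that the navigation links come first in the result, so only eight of the ten entries are line text in this sample.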

We have Shakespeare's homepage, Love's Labour's Lost, and then some content from it. Now, these were maybe not what you were expecting. The tags up here are more what we think of as links.

That's what a tags typically are. These are doing kind of a funky thing on this page. But it's something to pay attention to as you're scraping data: pages don't have to follow regular standards of how to make a page.

Our job as data scrapers is to figure out what this page does, not what it should be doing. We may get more, less, or just different text than we expect if we're not paying very careful attention, not to what pages typically do, but to what this page is doing in this case.
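One way to check what a page is actually doing with its a tags, sketched here under the assumption of a small made-up snippet, is to tally the attributes the tags carry and then filter on them. The split into links and anchors is an illustration, not something the lesson itself performs.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical sample markup mixing a conventional link with named anchors.
html = """
<a href="index.html">Love's Labour's Lost</a>
<a name="1.1.12">Navarre shall be the wonder of the world;</a><br>
<a name="1.1.13">Our court shall be a little Academe,</a><br>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("a")

# Count how often each attribute name appears across all a tags.
attr_counts = Counter(attr for tag in tags for attr in tag.attrs)

# True matches any value, so these select tags that simply have the attribute.
links = soup.find_all("a", href=True)
anchors = soup.find_all("a", attrs={"name": True})

print(attr_counts)
print(len(links), len(anchors))
```

A tally like this makes it easier to see, before extracting any text, whether a page's a tags are mostly links or are being used for something else.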

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.