Learn how to scrape web data by identifying HTML elements and extracting text using BeautifulSoup. Understand common pitfalls and adapt your scraping strategy to handle irregular web page structures.
Key Insights
- Use BeautifulSoup's `find` method to precisely locate HTML elements by their tags and attributes, such as retrieving text from an `a` tag with a specific attribute (`name="1.1.13"`).
- Apply list comprehensions in Python to efficiently extract text from multiple HTML elements, demonstrated by grabbing content from the first 10 `a` tags.
- Stay attentive to the actual structure and quirks of web pages rather than relying on conventional standards, as pages may deviate from typical usage, requiring flexible scraping strategies.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's solve these challenges. First, "Our court shall be a little academe." Let's look here.
Here’s that text. If I want, I can inspect it and find out exactly where it is. It’s the A tag with the name attribute 1.1.13. Okay, so what we want is to say line = soup.find('a', {'name': '1.1.13'}). And then we’ll run line.get_text().
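As a rough sketch of that lookup, here is a self-contained version. The short `html` string is a hypothetical stand-in for the downloaded page, and the `line_text` variable and the `None` check are additions for illustration, not part of the lesson's code.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the real page (assumption).
html = """
<a name="1.1.12">Navarre shall be the wonder of the world;</a><br>
<a name="1.1.13">Our court shall be a little Academe,</a><br>
"""

soup = BeautifulSoup(html, "html.parser")

# Match an <a> tag whose name attribute equals "1.1.13".
# The attribute goes in a dict because find()'s first parameter is itself
# called `name` (the tag name), so a name=... keyword would collide with it.
line = soup.find("a", {"name": "1.1.13"})

# find() returns None when nothing matches, so guard before get_text().
line_text = line.get_text() if line is not None else None
print(line_text)
```

The guard matters on real pages: calling `get_text()` on a missing element raises an `AttributeError`.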
And there it is. All right, the second one’s a little tougher. Not so tough, a little tougher.
All A tags. So, that would be, we can say, I don’t know, tags. That seems like a fine name, pretty generic, but we’re doing something pretty generic.
I want to find all A tags. That’s it. Just find them all.
Now, let’s print out the text for the first 10. Let’s make a tag_texts list. It’s a new list.
We’ll do a list comprehension. We want tag. I kind of prefer to write it like this: for tag in tags.
And now I can go back and say I want tag.get_text() for every tag in tags. Okay, and now I only want, again, the first 10 of these. And there we go.
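The steps just described might look something like the sketch below. The page content is generated here as a hypothetical stand-in (one ordinary link followed by twelve named anchors), so the texts are placeholders rather than what the real page returns.

```python
from bs4 import BeautifulSoup

# Hypothetical page: one ordinary link plus twelve named anchors (assumption).
anchors = "".join(f'<a name="1.1.{i}">Line {i}</a>' for i in range(1, 13))
html = '<a href="/home">Home page</a>' + anchors

soup = BeautifulSoup(html, "html.parser")

# Find every <a> tag on the page, in document order.
tags = soup.find_all("a")

# Slice to the first 10 tags, then pull the text out of each one.
tag_texts = [tag.get_text() for tag in tags[:10]]
print(tag_texts)
```

Slicing `tags[:10]` before the comprehension avoids extracting text from tags that are about to be discarded. Slicing the finished list gives the same result.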
We have Shakespeare's homepage, 'Love's Labour's Lost,' and then some content from it. Now, these were not maybe what you were expecting. These are the tags up here that are more what we think of as links.
That’s what A tags typically are. These are doing kind of a funky thing on this page. But it’s something to pay attention to as you’re navigating scraping data: hey, they don’t have to follow regular standards of how to make a page.
Our job as data scrapers is to figure out what this page actually does, not what it should be doing. We may get more, fewer, or just different texts than we expect if we pay attention to what pages typically do rather than to what this page is doing in this case.
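One way to check what a page is actually doing with its A tags is to split them by attribute. This is only a sketch: the `html` string is a hypothetical stand-in for the page, and the `links` and `named_anchors` names are introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mixing ordinary links and named anchors (assumption).
html = """
<a href="/Shakespeare/">Shakespeare homepage</a>
<a name="speech1"><b>FERDINAND</b></a>
<a name="1.1.1">Let fame, that all hunt after in their lives,</a>
"""

soup = BeautifulSoup(html, "html.parser")
all_a = soup.find_all("a")

# Tags with an href are links in the usual sense.
links = [tag.get_text() for tag in all_a if tag.has_attr("href")]

# Tags with only a name attribute are being used as markers for content.
named_anchors = [tag.get_text() for tag in all_a if tag.has_attr("name")]

print(links)
print(named_anchors)
```

Comparing the two lists shows quickly whether "all A tags" means navigation links, page content, or a mix of both.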