Learn how to scrape web data by identifying HTML elements and extracting text using BeautifulSoup. Understand common pitfalls and adapt your scraping strategy to handle irregular web page structures.
Key Insights
- Use BeautifulSoup's `find` method to precisely locate HTML elements by their tags and attributes, such as retrieving text from an `a` tag with a specific attribute (`name="1.1.13"`).
- Apply list comprehensions in Python to efficiently extract text from multiple HTML elements, demonstrated by grabbing content from the first 10 `a` tags.
- Stay attentive to the actual structure and quirks of web pages rather than relying on conventional standards, as pages may deviate from typical usage, requiring flexible scraping strategies.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's solve these challenges. First: "Our court shall be a little Academe." Let's look here.
Here's that text. If I want, I can inspect it and find out exactly where it is. It's the a tag with the name attribute 1.1.13. Okay, so what we want is to say line equals soup.find, looking for an a tag whose name attribute is 1.1.13. And then we'll run line.get_text().
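The lookup described above can be sketched roughly as follows. This is a minimal sketch, not the course's exact code: the `html` string is a made-up stand-in for the page being scraped, and the `html.parser` choice is an assumption. The attribute is passed through `attrs` because `name` is already the first parameter of `find` (the tag name), so it cannot also be used as a keyword filter.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's HTML; the real page would be
# downloaded first (for example with requests) and is likely much larger.
html = """
<html><body>
<a href="index.html">Home page</a>
<a name="1.1.12">An earlier line</a><br>
<a name="1.1.13">Our court shall be a little Academe,</a><br>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches.
line = soup.find("a", attrs={"name": "1.1.13"})

# Guard against None so a missing element does not raise AttributeError.
line_text = line.get_text() if line is not None else None
print(line_text)
```

If the element is missing, `line_text` ends up as `None` instead of the script crashing, which is handy while you are still exploring a page.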
And there it is. All right, second one's a little tougher. Not so tough, a little tougher.
All a tags. So that would be, we can say, I don't know, tags. That seems like a fine name, pretty generic, but we're doing something pretty generic.
I want to find all a tags. That's it. Just find them all.
Now to print out the text for the first 10. Let's make a tag_texts list. It's a new list.
We'll do a list comprehension. We want tag. I kind of prefer to write them like this: for tag in tags first.
And now I can go back and say I want tag.get_text() for every tag in tags. Okay, and now I only want, again, the first 10 of these. And there we go.
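Put together, the steps above might look something like this. It is a sketch under assumptions: the `html` string is invented sample markup (two link-style a tags followed by ten named a tags), and the variable name `first_ten` is hypothetical; only `tags` and the idea of a text list come from the walkthrough.

```python
from bs4 import BeautifulSoup

# Made-up sample markup: 2 navigation links, then 10 named a tags.
links = '<a href="/">Home page</a> | <a href="play.html">Play title</a>'
lines = "".join(f'<a name="1.1.{n}">Line {n}</a><br>' for n in range(1, 11))
html = f"<html><body>{links}<blockquote>{lines}</blockquote></body></html>"

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
tags = soup.find_all("a")

# List comprehension: one text string per tag.
tag_texts = [tag.get_text() for tag in tags]

# Slice to keep only the first 10 entries.
first_ten = tag_texts[:10]
print(first_ten)
```

Slicing with `[:10]` is safe even when the page has fewer than 10 matching tags; it simply returns whatever is there.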
We have the Shakespeare homepage, Love's Labour's Lost, and then some content from it. Now, these were maybe not what you were expecting. The tags up here are more what we think of as links.
That's what a tags typically are. These a tags are doing kind of a funky thing on this page. But it's something to pay attention to as you're scraping data: hey, pages don't have to follow regular standards of how to make a page.
Our job as data scrapers is to figure out what this page does, not what it should be doing, right? We may get more, less, or just different text than we expect if we're not paying very careful attention to what this page is doing, rather than to what pages typically do.
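One way to check what a page is actually doing with its a tags is to split them by the attributes they carry. The sketch below is an illustration, not part of the original walkthrough: the `html` string is invented, and the names `link_tags` and `anchor_tags` are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample: some a tags are real links (href), others only
# label a spot in the page (name) and wrap ordinary text.
html = """
<html><body>
<a href="/">Home page</a> | <a href="play.html">Play title</a>
<blockquote>
<a name="1.1.1">First line of text</a><br>
<a name="1.1.2">Second line of text</a><br>
<a name="1.1.3">Third line of text</a><br>
</blockquote>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("a")

# has_attr reports whether a tag carries a given attribute.
link_tags = [tag for tag in tags if tag.has_attr("href")]
anchor_tags = [tag for tag in tags if tag.has_attr("name")]

print(len(tags), "a tags in total")
print(len(link_tags), "with href:", [tag.get_text() for tag in link_tags])
print(len(anchor_tags), "with name:", [tag.get_text() for tag in anchor_tags])
```

A quick count like this shows whether "all a tags" means mostly links or mostly something else before you commit to a scraping strategy.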