Master precise HTML parsing techniques by learning to extract attribute values and select elements nested within others using Python's Beautiful Soup library. Enhance your web scraping skills by effectively navigating HTML structures and handling nested elements with clarity.
Key Insights
- Use Beautiful Soup's .find_all() method to locate specific HTML elements within other elements, such as extracting all a tags contained inside blockquote elements.
- Access attribute values directly from HTML elements, demonstrated by retrieving the name attribute from targeted a tags, enabling precise selection of HTML attributes.
- Flatten the nested lists that result from parsing HTML with Python tools such as .extend() or list concatenation, which makes the data simpler to work with.
Let's show you a couple more complex queries you may need. One is getting attribute values from HTML elements, and the other is finding elements that are inside other elements. We'll demonstrate this by looking at these a tags and their name attributes.
Look at the values 1.1.1 and 1.1.2. What if we want the value 1.1.1 itself: not the text content and not the name= part, but literally the value, 1.1.1 or 1.1.2? Or really all of them, which is what we want. To get that, we need to access the attribute. When we searched for all a tags, we also found some others. Let's see if I can find them.
They're in here somewhere. Here's the a tag for the Shakespeare Homepage and this one for Love's Labour's Lost.
So these are the links up here. And what we want ideally is to not have those. We want to be more specific.
In this case, we want the value of the name attribute for every a tag that's in a blockquote. Not these ones up here; they don't even have name attributes.
We'll get an error (a KeyError) if we try to read a name attribute from them. Instead, we'll say we want only the ones that are in blockquotes. Here's how we do that.
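To make that error concrete, here is a minimal sketch. The HTML string is a hypothetical stand-in for the page (one navigation link without a name attribute, one line anchor with one), and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the real page: the first a tag has no name
# attribute, the second one does.
html = '<a href="index.html">Home</a><a name="1.1.1">First line</a>'
soup = BeautifulSoup(html, "html.parser")
link, line = soup.find_all("a")

line_name = line["name"]  # the attribute exists, so this works

try:
    link["name"]  # no name attribute on this tag
    missing_raises = False
except KeyError:
    missing_raises = True

# .get() returns None instead of raising when the attribute is absent
safe_value = link.get("name")

print(line_name, missing_raises, safe_value)
```

If a page mixes tags with and without the attribute, .get() is one way to avoid the exception; narrowing the search to the blockquotes, as the lesson does, is another.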
First, to get all the blockquotes, we could say blockquotes = soup.find_all("blockquote"). Okay. Now that we've got that, we want to find all a tags that are inside those blockquotes.
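As a rough illustration of why narrowing to blockquotes helps, the sketch below compares a page-wide search with a blockquote search. The HTML is a made-up miniature of the structure described in the lesson (two navigation links, then speeches in blockquotes), not the actual page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page structure described in the lesson.
html = """
<a href="index.html">Home link</a>
<a href="play.html">Play link</a>
<blockquote>
  <a name="1.1.1">First line of speech one</a>
  <a name="1.1.2">Second line of speech one</a>
</blockquote>
<blockquote>
  <a name="1.1.3">First line of speech two</a>
</blockquote>
"""
soup = BeautifulSoup(html, "html.parser")

all_a_tags = soup.find_all("a")            # includes the two navigation links
blockquotes = soup.find_all("blockquote")  # only the speech containers

print(len(all_a_tags), len(blockquotes))
```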
Maybe a_tags equals... well, instead of soup.find_all(), we're going to start with the blockquotes. Every single element that soup gives you back has its own query methods. Beautiful Soup elements have a method for finding all things inside them. And in fact, for these a tags that are in the blockquotes, if we need to find something inside one of them, we can also call a_tag.find_all(). So it looks like calling find_all("a") on blockquotes would find all the a tags within the blockquotes, but not quite.
That's because find_all() is a method on a single blockquote, not on all of them. The full result is a Python list, and a list doesn't have a find_all() method. But every element in the list, every blockquote, does. So this won't quite work.
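The list-versus-element distinction can be checked directly. This sketch assumes a hypothetical two-blockquote snippet; it calls find_all() on the whole result (which fails) and then on a single element (which works).

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML for illustration.
html = """
<blockquote><a name="1.1.1">Line one</a><a name="1.1.2">Line two</a></blockquote>
<blockquote><a name="1.1.3">Line three</a></blockquote>
"""
soup = BeautifulSoup(html, "html.parser")
blockquotes = soup.find_all("blockquote")

# The result behaves like a Python list, so it has no find_all() method.
try:
    blockquotes.find_all("a")
    list_call_failed = False
except AttributeError:
    list_call_failed = True

# Each element in the list is a tag, and tags do have find_all().
a_tags_in_first = blockquotes[0].find_all("a")

print(list_call_failed, len(a_tags_in_first))
```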
Instead, we need to loop through bqs, the list of blockquotes. I'm actually going to simplify this. I'm going to make a names list.
I'm pretty sure there's a way to do this with some overly complex list comprehensions. But at that point, it's like, yeah, let's do a loop. We're going to loop through every blockquote in blockquotes.
And for each one, we'll say okay, let's get the a tags. Now that we're in one blockquote, it has a find_all() method on it. So let's say a_tags = blockquote.find_all("a"). And you see I'm getting autocomplete here.
I wasn't getting it before because lists don't have find_all() methods. Now that I've got that, I could do another loop, but I think at this point a list comprehension will work just fine.
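The loop described above might look like the following sketch. The sample HTML and the counts list are hypothetical additions, included only to show that each blockquote is searched on its own.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML for illustration.
html = """
<blockquote><a name="1.1.1">Line one</a><a name="1.1.2">Line two</a></blockquote>
<blockquote><a name="1.1.3">Line three</a></blockquote>
"""
soup = BeautifulSoup(html, "html.parser")
blockquotes = soup.find_all("blockquote")

counts = []  # hypothetical helper list: number of a tags per blockquote
for blockquote in blockquotes:
    # blockquote is a single tag here, so find_all() is available
    a_tags = blockquote.find_all("a")
    counts.append(len(a_tags))

print(counts)
```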
So I'm going to use names.append() (that's not quite right, but it's close enough) and pass it a new list comprehension over every tag in a_tags. For every single one, I want to do something with that tag.
What I want is to get its name attribute, and it works just like accessing any key-value pair: tag["name"], as if tag were a dictionary. It isn't literally a dictionary, but a tag supports dictionary-style access to its attributes. Anytime I'm doing find_all() or find(), I get back tag objects that have their own find methods, attributes, and properties, like get_text(). But instead of get_text(), I want the name value.
And again, that's this value right here. I don't want the visible text; I want the name attribute.
This isn't going to quite work, but we're getting pretty close.
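Here is a sketch of that first attempt, using hypothetical sample HTML. It shows why append() is "not quite right": each comprehension result is added as a single element, so names ends up as a list of lists.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML for illustration.
html = """
<blockquote><a name="1.1.1">Line one</a><a name="1.1.2">Line two</a></blockquote>
<blockquote><a name="1.1.3">Line three</a></blockquote>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
for blockquote in soup.find_all("blockquote"):
    a_tags = blockquote.find_all("a")
    # tag["name"] reads the attribute value, not the visible text.
    # append() adds the whole inner list as ONE element of names.
    names.append([tag["name"] for tag in a_tags])

print(names)
```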
Let's look at names. We encountered an error here. Oh yeah, I said this wouldn't quite work, but I forgot to comment it out.
Let's try that again. All right, we're almost there. We're pretty close.
This comes down to understanding Python lists. We have lists within our big list. Here's a list of all the lines that go with speech one, then another list for speech two, then another for speech three, and so on.
What we actually want is called flattening the list: getting rid of the nested lists. Maybe we could do that instead of saying “put a new list in that names list.”
Instead, we could say, “I want to extend that list with the elements in this list.” So it's going to concatenate those lists together. All right, if I run that now, we get a much flatter list with all the line numbers in it.
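A minimal sketch of the .extend() version, again on hypothetical sample HTML. The only change from an append-based loop is the method call, which adds each item of the inner list individually and modifies names in place.

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML for illustration.
html = """
<blockquote><a name="1.1.1">Line one</a><a name="1.1.2">Line two</a></blockquote>
<blockquote><a name="1.1.3">Line three</a></blockquote>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
for blockquote in soup.find_all("blockquote"):
    a_tags = blockquote.find_all("a")
    # extend() adds each item of the new list, so names stays flat
    names.extend([tag["name"] for tag in a_tags])

print(names)
```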
We also could have done that (I scrolled too far up) with names = names + the new list. Let's try that. Yep, same result.
If that's clearer to you than .extend(), either one will work.
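The concatenation variant can be sketched the same way. The sample HTML is hypothetical; the point is that + builds a new flat list on each pass and rebinds names to it, giving the same result as .extend().

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML for illustration.
html = """
<blockquote><a name="1.1.1">Line one</a><a name="1.1.2">Line two</a></blockquote>
<blockquote><a name="1.1.3">Line three</a></blockquote>
"""
soup = BeautifulSoup(html, "html.parser")

names = []
for blockquote in soup.find_all("blockquote"):
    a_tags = blockquote.find_all("a")
    # list + list creates a new combined list; names is rebound to it
    names = names + [tag["name"] for tag in a_tags]

print(names)
```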
Okay, so again, we're showing you two things. Here's how we get an attribute, and here's how we find all elements that are inside another element: find all a tags that are in this particular blockquote.
Keep in mind, as always, that the find_all() and find() methods live on each individual element, never on the list itself. bqs is the list we get back here. It doesn't have a find_all() method on it, but every element in it does. That's a bit of a gotcha: thinking about a list versus what's on each element of the list. We're going to use all of these techniques in our next big project.