Learn to parse HTML content using Python's BeautifulSoup library. Gain practical skills to extract specific data from webpages programmatically.
Key Insights
- Use Python's BeautifulSoup library to parse HTML content returned by `requests.get()`, enabling extraction and querying of webpage elements such as headings (`h3` tags).
- After parsing HTML with BeautifulSoup, leverage built-in methods like `find_all()` and `get_text()` to retrieve specific page content, like act and scene names from Shakespearean texts.
- Accessing webpage elements programmatically reduces manual effort and increases accuracy compared to manually copying and pasting content, especially when gathering numerous data points.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Make sure to run this code block that reassigns `url`; otherwise your URL will still be our API URL from earlier. Let's hit up that URL with the same basic code.
We'll run `requests.get` on that URL, and that will give us back a response. Let's make sure the response status code is equal to 200. If it's not, if we don't get back a response like "Yep, that's the page you wanted, here it is, everything went great", then we want to print out an error, "Request for data failed", to let people know that happened. If I run this, we should get no output. Great, because the status code was 200.
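The fetch-and-check step described above can be sketched as a small function. The URL here is a placeholder, since the actual Shakespeare page URL isn't shown in this excerpt:

```python
import requests

def fetch_page(url):
    """Return the raw HTML bytes of a page, or None if the request failed."""
    response = requests.get(url)
    if response.status_code != 200:
        # Anything other than 200 means the server didn't hand back the page we wanted.
        print("Request for data failed")
        return None
    return response.content

# Hypothetical URL standing in for the Shakespeare page used in the lesson.
url = "https://example.com/loves-labours-lost.html"
```

If the request succeeds, the function is silent and simply hands back the page content, which is exactly the "no output" behavior described above.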
Alright, so we don't get back JSON. Last time we said, "Hey, give me `response.json()`." And what that did was convert the JSON into a regular Python dictionary. Instead, if we print out `response.content`, we'll see it's an extremely long string: all the HTML of the page.
Very, very long one-line string. You can see here there's some good old "Have sworn for three years' term to live with me", some good Shakespearean language here. You know, "Still and contemplative in living art."
That certainly sounds like some Shakespeare. Alright, so that’s `response.content`. It’s not JSON. It’s not data we can convert to a dictionary.
Instead, it’s something we’re going to need to work with as HTML. And to do that, we’re going to use our library, BeautifulSoup. To BeautifulSoup, we pass `response.content`—that thing we just printed out—all the HTML.
And we’re going to say, “Parse that as HTML. Give me an HTML parser for that content.” What we actually get back is a parser—something we can look at and say, “Okay, I want this bit. I want that bit. I want all of those bits. Give me all the bits.”
So what this gives us back is a parsable, queryable object we can use to filter down to the data we actually want. This is typically called `soup`, in recognition of BeautifulSoup being the library name. Okay, now that we’ve got that, let’s run it—make sure we haven’t made a mistake.
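The parsing step looks like this. Since the page content isn't reproduced in this excerpt, this sketch parses a small inline HTML fragment shaped like the Shakespeare page, with act and scene names in `h3` headings:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: a tiny fragment with the same structure
# as the page in the lesson (act and scene names in h3 tags).
html = b"""
<html><body>
  <h3>ACT I</h3>
  <h3>SCENE I. The king of Navarre's park.</h3>
  <p>Let fame, that all hunt after in their lives...</p>
</body></html>
"""

# Ask BeautifulSoup to parse the content as HTML, giving back a
# queryable object, conventionally named soup.
soup = BeautifulSoup(html, "html.parser")
```

With a real page you'd pass `response.content` in place of the inline `html` bytes; everything downstream works the same way.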
No errors? Great. Let’s find all the `H3` tags. We can say `soup.find_all("h3")`. This method is built in because BeautifulSoup was created specifically to parse HTML.
Find all `H3` tags. And if you're familiar with JavaScript's way of accessing the DOM, it’s very similar. Alright, so we say “Find all `H3` tags.”
Now, if I save that somewhere—maybe I save it as `h3s`. Let’s print out `h3s`. Let’s see if that’s what we think it is.
Alright, it’s a list—a Python list—but it’s not actually the text. It looks like text. It’s the printed representation of the HTML tags.
BeautifulSoup is doing some work to give us something that prints nicely, but it's not actually plain text. You can tell because there are no quote marks. These aren't strings. If I print out `h3s[0]`, I get back that tag. But what about its type? It's a `bs4.element.Tag`.
`bs4` stands for BeautifulSoup version 4. It’s an element tag—not a string. It’s got all kinds of methods and functionality on it.
One of those is `.get_text()`. Let’s say I’ve got `h3s[0]`, the very first one in the list. I can call its `.get_text()` method, and that will output the text.
It’s the words “Act One.” Now it’s a string. Now it’s actual text.
So these aren’t just plain strings, these `h3s`. They’re objects—HTML elements with methods like `.get_text()`.
If we want to get the text from all of them, we can make a new list that contains just the text values. That's a list comprehension, if you know your Python. We could say `h3_texts = [h3.get_text() for h3 in h3s]`. But let's give it a better name. Let's call it `act_and_scene_names` to be more descriptive: a new list where we call `h3.get_text()` for every `h3` in the `h3s` list.
If we take a look at `act_and_scene_names` now—yep, there it is. It’s now just the actual strings for all the act and scene names, stored as a list. The type of each item in this list is now a string.
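The list comprehension described above, once more over a small stand-in fragment in place of the real page:

```python
from bs4 import BeautifulSoup

html = b"<html><body><h3>ACT I</h3><h3>SCENE I</h3><h3>SCENE II</h3></body></html>"
soup = BeautifulSoup(html, "html.parser")
h3s = soup.find_all("h3")

# Call get_text() on every tag to build a list of plain strings.
act_and_scene_names = [h3.get_text() for h3 in h3s]
print(act_and_scene_names)  # ['ACT I', 'SCENE I', 'SCENE II']
```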
Alright, so that was our first data scraping. Congratulations. Let’s take a look at one more example.
Let's say we want to find Act One, Scene One, Line 1119. Here's how we're going to do that: we'll inspect this part of the page. Line 1119 reads "that war against your own affections." We can see it right there on the page, and that's the text we want to grab.
Sure, we could copy and paste that one quote. But if we wanted to get all the act and scene names, copying and pasting them manually would take forever, and we might even miss some. That's the great thing about doing this programmatically: we can just say, "Hey, get me all the act and scene names," and we've got code that can do that.
Even when we're doing something more trivial, less real-world, like getting all the act and scene names, or even a single quote like "that war against your own affections," we could just copy and paste it.
But we're trying out new concepts, and the concept is: what if we want just one specific thing? The way we're going to do that is to find the tag with that name. If you want something that specific, we can definitely do that, and in the next video, we'll show you how.