Understanding HTML Structure for Effective Web Scraping

Inspect a page's HTML structure to identify the tags that hold the data you want to scrape.

Gain a foundational understanding of HTML and learn how it structures web pages, enabling effective data scraping. Using Shakespeare's public domain plays as an example, master identifying HTML elements necessary for data analysis.

Key Insights

  • Understand that HTML is a markup language used to define the structure and presentation of web content, including headings (like H3), speeches, and text blocks.
  • Practice analyzing HTML structure through inspection tools in browsers by examining a website featuring Shakespeare's Love's Labour's Lost, available online since 1993.
  • Learn how to identify specific HTML tags (such as H3) to effectively scrape structured data like act and scene titles from web pages.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

In order to understand the shape of a page—to understand how these pieces fit together, which piece we need, and how to access it—we need to understand HTML. I'm not going to spend an entire course on HTML. There are whole courses on HTML.

Instead, I'm going to give you the briefest introduction: just enough HTML to be dangerous, enough knowledge to get started. So here we have a URL that we'll be scraping some data from. If you click on it, it leads to the full script of Love’s Labour’s Lost by Shakespeare, which is very easy to scrape.

That is a public domain work. Shakespeare has been dead for several hundred years. I think that link actually doesn't work anymore.

This is a very old page. I wanted to show you what page this is from. It’s from the Complete Works of William Shakespeare.

“The web’s first edition of the Complete Works of William Shakespeare. This site has offered Shakespeare’s plays and poetry to the internet community since 1993.” Wow.

And if you look at it—you know, its full news archive goes all the way back to December 1993. The creator was apparently trying to avoid finishing a paper on Othello and created this server instead. A very typical MIT student’s way of putting off doing English work—create a server to host the full works of the author.

Very MIT-like. So this page has been around a long time. They won’t mind if we scrape public domain data from it.

And that’s why we’ve chosen it. When we look at this data, we can see a visible structure. We have “Act One,” “Scene One,” and text that appears to be similarly sized.

These act and scene labels are consistent. Then we have some text for the scene, and then a character name—slightly smaller than the heading text but larger than the scene dialogue text.

And then a full speech by Ferdinand, and then another name—slightly larger—followed by a speech by Longaville, another name, and so on. This structure continues through to the next scene, which is further down.

Shakespeare plays are not short. Okay, here we go—Scene Two. Again, slightly larger text indicates a new section, a new scene.

So there’s a structure here, and that structure is repeated. That structure is actually built using HTML. In your browser, in any browser, you can right-click—and by right-click, I mean usually clicking the lower-right corner of your trackpad.

If you have a mouse, it’s the right mouse button. On some computers, Control-click or ALT-click may do the same. Either way, you can open a contextual menu, and clicking on any of the text on the page will give you the option to “Inspect.”

When you inspect, you'll see many code-looking elements—that’s the HTML. And when you hover your cursor over those elements in the right-hand panel, it highlights the corresponding element on the actual webpage.

So here is what's called an H3. You can see the tag H3 associated with “Act One.”

That’s this text right here. Again, if I hover over “Act One,” it highlights the HTML on the side. The same goes for “Scene One,” which is another H3 tag.

As we noted, this was the same size. H3 is essentially a heading level 3—it's a medium-sized heading.

Then we go down and see a speech tag, and inside it, a blockquote tag, and so on. This divides our page into areas with structured meaning. That’s what HTML is.

It’s a markup language. It marks up your text to say: “This part should look like this,” or “This part serves this purpose.”
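To make that nesting concrete, here is a minimal Python sketch that prints an indented outline of a small piece of HTML, roughly the way the Inspect panel shows it. The fragment is a hypothetical stand-in modeled on the structure described above, not the real page source, and the names `SAMPLE_HTML`, `OutlineParser`, and `outline` are made up for this illustration. It uses only Python's standard-library `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical fragment modeled on the structure described in the lesson;
# this is an assumption, not the actual source of the Shakespeare page.
SAMPLE_HTML = """
<h3>Act One</h3>
<h3>Scene One, the King of Navarre's Park</h3>
<a name="speech1"><b>FERDINAND</b></a>
<blockquote>Let fame, that all hunt after in their lives,</blockquote>
"""


class OutlineParser(HTMLParser):
    """Record every tag and text node, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # An opening tag starts a new, deeper level of structure.
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag returns to the enclosing level.
        self.depth -= 1

    def handle_data(self, data):
        # Skip the whitespace between tags; keep real text.
        text = data.strip()
        if text:
            self.lines.append("  " * self.depth + text)


def outline(html):
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.lines


print("\n".join(outline(SAMPLE_HTML)))
```

Each level of indentation in the printed outline corresponds to one tag sitting inside another, which is the "areas with structured meaning" idea in miniature.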

When we put text in H3 tags, we’re saying “Act One” should be a large heading. When we put “Scene One, the King of Navarre’s Park” in H3 tags, we’re saying that’s also a big heading. When we use something like an a tag with name="speech", we’re saying this is a slightly different kind of element: not as big as a heading, but still structurally distinct.

It’s not just text. The word H3 doesn’t show up visually on the page—it’s information about the text. It provides structure.

Inside that structure, we have the actual visible content like “Scene One, the King of Navarre’s Park.” That’s what HTML is. And when you create an HTML page, you use HTML code like H3 to set aside and mark off sections of specific types.
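One way to see the difference between the markup and the visible content is to split them apart. The sketch below is only an illustration under assumptions: the one-line HTML string is a hypothetical example based on the heading mentioned above, and `MarkupSplitter` and `split_markup` are names invented here. It relies on the standard-library `html.parser` module.

```python
from html.parser import HTMLParser


class MarkupSplitter(HTMLParser):
    """Keep tag names and visible text in two separate lists."""

    def __init__(self):
        super().__init__()
        self.tags = []     # information about the text (never shown on the page)
        self.visible = []  # what a reader actually sees

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.visible.append(text)


def split_markup(html):
    parser = MarkupSplitter()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.visible


# Hypothetical one-line example based on the heading described in the lesson.
tags, visible = split_markup("<h3>Scene One, the King of Navarre's Park</h3>")
print(tags)     # ['h3']
print(visible)  # ["Scene One, the King of Navarre's Park"]
```

The tag name lands in one list and the readable text in the other, which mirrors the point that H3 is information about the text rather than part of it.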

We can utilize that structure when scraping—if we want all the act and scene names, for example, we can target H3 tags. That’s not a rule of the internet—that’s just how the creator of this page structured it.

We can look at it, examine the structure, and say, “Okay, if we want the list of acts and scenes on this page, we need to look for H3 tags.” And that’s the process we’ll be exploring as we work through these data scraping problems.
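As a rough sketch of that idea, the code below collects the text of every H3 tag from a small sample. Everything here is an assumption for illustration: the sample HTML is a hypothetical fragment rather than the real page, `HeadingCollector` and `find_headings` are invented names, and the standard-library `html.parser` module is used only to keep the example self-contained. The course itself may well use a different scraping library.

```python
from html.parser import HTMLParser

# Hypothetical fragment modeled on the page structure described in the lesson;
# this is an assumption, not the actual page source.
SAMPLE_HTML = """
<h3>Act One</h3>
<h3>Scene One, the King of Navarre's Park</h3>
<a name="speech1"><b>FERDINAND</b></a>
<blockquote>Let fame, that all hunt after in their lives,</blockquote>
<h3>Scene Two</h3>
"""


class HeadingCollector(HTMLParser):
    """Collect the text found inside one kind of tag (h3 by default)."""

    def __init__(self, target="h3"):
        super().__init__()
        self.target = target
        self.inside = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == self.target:
            self.inside = True
            self.headings.append("")  # start a new heading

    def handle_endtag(self, tag):
        if tag == self.target:
            self.inside = False

    def handle_data(self, data):
        # Only keep text that appears between <h3> and </h3>.
        if self.inside:
            self.headings[-1] += data


def find_headings(html, target="h3"):
    parser = HeadingCollector(target)
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


for heading in find_headings(SAMPLE_HTML):
    print(heading)
```

The speaker name and the speech are skipped because they sit in other tags. On a differently structured page, the target tag would need to change, since using H3 for acts and scenes is this page's convention and not a general rule.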

Let’s get started.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
