Web Scraping: Extracting Non-Truncated Titles and Prices with Python

Scrape book titles and prices from a webpage using requests and BeautifulSoup.

Learn how to extract book titles and pricing details from webpages using Beautiful Soup and Python. Gain practical tips for refining your data scraping by handling truncated text and converting prices to numerical formats.

Key Insights

  • Implement web scraping using Beautiful Soup, utilizing methods like find_all() and list comprehensions to extract data from HTML tags such as H3 and paragraph tags.
  • Resolve truncated titles by retrieving the complete title attribute from anchor tags instead of their displayed text.
  • Convert price data from strings containing currency symbols into usable numerical float values, facilitating easier data analysis and integration into structures like data frames.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's take a look at some ways we might have solved this problem. We're going to make a response and set it equal to requests.get of that URL. And we should check whether the status code is not 200.

If it isn't 200, we report an error: books not available to be scraped. Seems like a fine thing to say. Okay, let's get the titles from the page.
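The request and status check described above might look like the following minimal sketch. The function name fetch_html, the timeout value, and the choice to raise an exception are assumptions made for illustration; the transcript never states the URL, so it is left as a parameter.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Anything other than 200 means we did not get the page we asked for.
    if response.status_code != 200:
        raise RuntimeError(
            f"Error: books not available to be scraped (status {response.status_code})"
        )
    return response.content
```

Raising an exception is one design choice; printing the message and stopping, as the walkthrough suggests, works just as well in a notebook.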

Now if we look at a title here and inspect it, we see it's the text in an A tag. We need to find an A tag inside an H3 and get its value. So first we need to look for all H3s.

Then we can look for the A tags inside them. So to do that, let's get all the H3 tags. We'll call them title_tags, maybe.

Okay, soup. Oops, we haven't created soup yet. soup equals BeautifulSoup of response.content, with "html.parser" as the parser. All right, title_tags equals soup.find_all of "h3".
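As a rough sketch of this step, the snippet below parses a small inline HTML sample instead of a live response so that it runs without a network connection. SAMPLE_HTML is made up for illustration: it only mimics the structure described here (an A tag inside each H3), and the second book title is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for response.content; the real page will differ.
SAMPLE_HTML = """
<h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="#" title="A Second Hypothetical Book Title">A Second Hypothetical ...</a></h3>
"""

# Parse the HTML with Python's built-in parser.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns every matching tag, here every H3.
title_tags = soup.find_all("h3")
print(len(title_tags))
```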


Now that we've found all the H3s, we can get, for every H3, the text of the A inside it. I'm going to do it with a loop. I could potentially do this with a list comprehension, but I'm going to go with a loop.

For every tag in title_tags, let's make a titles list and append to it. First, I believe every H3 has just one A tag in it, so we'll say tag.find of "a", and take that thing's title attribute.

No, actually, just that thing's text. I'm getting ahead of myself. That's the get_text method.

Find an A tag, get its text, and append that to titles. Actually, I'm feeling like I should go back and do this as a list comprehension. Let's try it as a list comprehension and see if it's comprehensible.

titles equals a list where, for every tag in title_tags, I want the tag's find of "a", then get_text. I think that's pretty readable. titles is a list where, for every tag in title_tags, we find the A inside it and get its text.

Yeah, I find that pretty readable. Feel free to do a loop version. If you want to do a loop version, that's how you do it.
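Both versions can be compared side by side in the sketch below. It reuses a made-up inline HTML sample (SAMPLE_HTML, with one hypothetical title) in place of the real page, and the name titles_loop is introduced only to keep the two results apart.

```python
from bs4 import BeautifulSoup

# Made-up sample that mimics the structure described in the walkthrough.
SAMPLE_HTML = """
<h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="#" title="A Second Hypothetical Book Title">A Second Hypothetical ...</a></h3>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tags = soup.find_all("h3")

# Loop version: build the list one append at a time.
titles_loop = []
for tag in title_tags:
    titles_loop.append(tag.find("a").get_text())

# List comprehension version: the same logic in a single expression.
titles = [tag.find("a").get_text() for tag in title_tags]

print(titles)
```

Both produce the same list of displayed (possibly truncated) titles, so the choice comes down to which one reads more clearly to you.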

But I'm going to prefer the list comprehension as long as it's reasonably readable. Even a little longer than this, and I'd say it's no longer readable. All right, now prices. Let's take a look at where the prices are.

The prices are right here. Let me inspect that. All right, it's a P with the class price_color, and I want the text inside it.

Okay, we can get that using a query, and we'll query by the attribute. I'm going to do another list comprehension over every paragraph, where paragraph is just the HTML P tag. These prices are wrapped in P tags.

That P is not for price, it is for paragraph. So: for every paragraph in soup.find_all of "p" with the attribute class of price_color. Is that right? Yes, price underscore color.

And that should be in quotes. All right, so for every P in that soup.find_all, I just want the P's get_text. There we go.
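A minimal sketch of the price query follows, again against a made-up inline sample rather than the real page. The attrs dictionary is one way to filter by class; Beautiful Soup also accepts class_="price_color". The sample prices and the extra "In stock" paragraph are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Made-up sample: two price paragraphs plus one unrelated paragraph.
SAMPLE_HTML = """
<p class="price_color">£22.60</p>
<p class="instock availability">In stock</p>
<p class="price_color">£50.10</p>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the P tags whose class is price_color, then take their text.
prices = [p.get_text() for p in soup.find_all("p", attrs={"class": "price_color"})]

print(prices)
```

The unrelated paragraph is skipped because its class does not match, which is why querying by attribute is safer than grabbing every P tag.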

Again, this one is a little more complicated on the right and pretty simple on the left. For the titles, we had already made title_tags beforehand. We didn't try to do it all in one line, so that comprehension was pretty short. Here we did the find_all right inside the list comprehension, which made it a little longer.

Feel free to mix it up and do it however you'd like, whatever is most readable to you. As always, readable is in the eye of the beholder.

Now the bonuses. First, the non-truncated version of the title. Well, let's take a look at these.

Let's print titles, and let's print prices. So the titles are good, but they're pretty truncated. What does that mean? When they go past a certain length, they get cut off.

They end in a dot, dot, dot, and the prices are actually strings with the little pound symbol in them. We probably don't want that. So that's what our bonuses are about: putting them in the right format.

Let's get the non-truncated version of the title. The hint here is that if we look at a title in the HTML, there's actually a longer version. "A Light in the ..." is the text of the A tag, but each one has a title attribute with the full title. So that's pretty easy to fix. Instead of finding the A and getting its text, let's find the A and get its title value.

Rerunning that: yep, "A Light in the Attic". We get the full version. Again, all we had to do was change the .get_text call.

Don't get the text that's inside that HTML element. Instead, give me its title attribute value. title equals "A Light in the Attic", whereas the text in it is "A Light in the ...".
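The difference between the displayed text and the title attribute can be seen in this short sketch. As before, SAMPLE_HTML is a made-up stand-in for the page, and the second title is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample: the link text is truncated, the title attribute is not.
SAMPLE_HTML = """
<h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
<h3><a href="#" title="A Second Hypothetical Book Title">A Second Hypothetical ...</a></h3>
"""
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
title_tags = soup.find_all("h3")

# Displayed text: what get_text() returns, truncated with dots.
truncated_titles = [tag.find("a").get_text() for tag in title_tags]

# Attribute value: index the tag like a dictionary to read its title attribute.
full_titles = [tag.find("a")["title"] for tag in title_tags]

print(truncated_titles)
print(full_titles)
```

Indexing with ["title"] raises a KeyError if the attribute is missing; tag.find("a").get("title") returns None instead, which may be preferable on less predictable pages.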

Okay, so that solves that one. The prices are slightly harder, but not so much harder. What we want is to convert them to decimal point numbers, floats instead of strings, and we also want to get rid of the pound sign.

We need to remove the pound sign first, because we can't convert a string to a number if it has non-numerical symbols in it. We'll say, okay, prices equals prices.strip. Nope, that's not right. .strip is a great method, but it works on each individual string, not on a list, which means we're going to need a list comprehension.

For every price in prices, I'm going to say price.strip of that pound symbol. I don't know where on my keyboard the pound symbol is, but that's okay, because it's right here in these prices and I can copy and paste it. Yeah, I think it's going to work. There we go.

Let's try printing this out again. It should print out without the pound symbols. There we go.
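Here is a small sketch of that stripping step. The prices list is made-up sample data, and the name stripped_prices is introduced only for illustration (the walkthrough reassigns prices).

```python
# Made-up sample of scraped price strings.
prices = ["£22.60", "£50.10"]

# A list has no strip method; strip belongs to each individual string,
# so we apply it once per item inside a list comprehension.
stripped_prices = [price.strip("£") for price in prices]

print(stripped_prices)
```

Note that strip only removes the given characters from the start and end of each string, which is enough here because the pound symbol is always the first character.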

The pound symbols are stripped out. Now we can convert those to floating point numbers, and we can do it right here in the same comprehension.

We could round, but first we'll just convert each one to a float. Strip out the pound symbol, then run the result through the float function, and there we go.
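Combining the strip and the float conversion in one comprehension might look like this sketch. The prices are made-up sample data, and price_values is a hypothetical name for the result.

```python
# Made-up sample of scraped price strings.
prices = ["£22.60", "£50.10"]

# Strip the pound symbol first, then convert the remaining digits to a float.
price_values = [float(price.strip("£")) for price in prices]

print(price_values)
```

Calling float on the unstripped string would raise a ValueError, which is why the order of the two steps matters.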

We might have some cases where we get 22.6 when we want it to be 22.60. We could certainly play around with that, but I think we've solved the problem we were asked to solve. Maybe you want to stretch it and print it out with the currency symbol, as something like £50.10, but then I think you're getting back to the string we started with. If we want them to be numerical, this is the way to do it, and this might be how we want it if we want to put it into a data frame.
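If you do want the two-decimal display, one option is to keep the floats for calculation and format them only when printing. This is a sketch of that idea, not part of the original solution; the values and the name formatted_prices are assumptions for illustration.

```python
# Made-up numeric prices, as produced by the float conversion step.
price_values = [22.6, 50.1]

# Format for display only: two decimal places with the currency symbol.
# The results are strings again, so keep price_values for any math.
formatted_prices = [f"£{value:.2f}" for value in price_values]

print(formatted_prices)
```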

Now we have full titles and numerical values that we can work with. Let's put them into a data frame in the next step.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
