Dive into supervised machine learning as we explore how algorithms predict outcomes from labeled data. Learn key concepts such as training data, features and labels, and train/test splits to build predictive models effectively.
Key Insights
- Supervised machine learning involves training algorithms on labeled datasets, such as predicting car prices based on features like horsepower, engine size, and fuel efficiency.
- Models use numerical inputs called features to generate predictions; for example, classifying images of animals into cats and dogs using pixel-level data represented numerically.
- Effective model development includes splitting data into training (80%) and testing sets (20%) to evaluate how well algorithms perform on unseen data.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Hey folks, welcome to Notebook 2.0, where we'll start really diving into machine learning. And I hope that Notebooks 1 and 1.5 were good preparation, a good warm-up, and a good refresher on your data science skills. Before we actually get started, I do want to make sure we're all set up with our notebook.
We don't want to be wondering whether what's going wrong is our notebook infrastructure or our code. So I'm going to run this. Oh, it's already mounted, but for you it's probably going to come with a whole set of dialog boxes for configuring Google Drive, connecting it to this particular notebook.
It also will take a moment to just start running. Let's also run our import blocks. Here's this one, and this one.
And we're importing a lot of our typical tools: NumPy, Pandas, and Matplotlib's pyplot. Also Seaborn, we'll be using that, and our ability to display images. And for the first time, we're going to import some things from scikit-learn, which is a Python library devoted to machine learning.
We're going to import a linear regression model, which is different from our linear regression formulas. We'll also import the ability to split data into training and testing sets; we'll be talking about that in a moment. And StandardScaler is a tool scikit-learn provides for standardizing our data.
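As a reference, here's a rough sketch of what those setup cells might look like. This assumes Google Colab for the Drive mount, and your notebook's exact cells may differ slightly:

```python
# Cell 1: mount Google Drive (Colab-specific; this is where the dialog boxes appear)
from google.colab import drive
drive.mount('/content/drive')

# Cell 2: our usual data science stack
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import Image, display  # for displaying images inline

# Cell 3: scikit-learn tools for this notebook
from sklearn.linear_model import LinearRegression     # the linear regression model
from sklearn.model_selection import train_test_split  # splitting into train/test sets
from sklearn.preprocessing import StandardScaler      # standardizing our data
```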
All right, so make sure you've run all three of those so that you're all set up. Now let's talk about what we're doing here: supervised machine learning.
It involves helping an algorithm learn, and we'll talk a lot about what exactly that means. What does it mean for a model to learn? We're going to train it based on some data.
We'll train it to make a prediction: it'll provide an output, which is called a prediction. That term trips people up a little bit.
We're going to be talking about features, and those are just inputs. Think columns in a data frame: which characteristics of our data are important? Those are the ones we'll feed to the model as features. Now, when we train the model, we'll give it the features, all of the data we want it to train on, and we'll also give it the answers: the target.
So the first thing we'll be doing, for example, is trying to predict car prices. How much will this car sell for? What will its price be, given things like horsepower, engine size, fuel efficiency, and several other attributes of the car? Now, what we'll give it is: hey, here's all the data we think could be helpful for predicting a car's price.
And also: here are the prices, here are the answers. So it can learn from that: hey, given X and Y, Z was the answer.
Given this X and Y, Z was the answer. The overall goal is for the model to be able to predict the answer, and to do that it needs to see enough of the data to reverse engineer exactly what formula gets it from the inputs to the output. And we also call the correct answer the label, since we've labeled the data.
This is the answer, this is the answer, this is the answer, et cetera. The term 'label' will make even more sense when we get to classifying data, but we'll see that in a little while.
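To make the features-plus-labels idea concrete, here's a minimal sketch for the car-price case. The column names and all the numbers are made up purely for illustration; this is not the dataset from the notebook:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one row per car.
cars = pd.DataFrame({
    "horsepower":      [120, 200, 150, 300, 95],
    "engine_size":     [1.6, 3.0, 2.0, 4.0, 1.2],           # liters
    "fuel_efficiency": [35, 22, 30, 18, 40],                 # miles per gallon
    "price":           [18000, 32000, 24000, 55000, 14000],  # the label / target
})

X = cars[["horsepower", "engine_size", "fuel_efficiency"]]  # features: the inputs
y = cars["price"]                                           # target: the correct answers

model = LinearRegression()
model.fit(X, y)          # training: learn a formula from features to price
print(model.predict(X))  # predictions for the cars it trained on
```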
Okay, so we're going to give it training data: inputs (features) paired with the correct labels. When we train it, we'll give it these characteristics (fuel efficiency, engine size, horsepower), typically one row per car, and also the label, the correct answer: what was the price for that car? Then we hope it'll learn which values of those features lead to which values of our target, the price. That's for cars. If we wanted classification instead, we'd have pictures of dogs and cats: the features would be those images, and the labels (which, again, make more sense now that we're classifying) would say we label this one a dog, this one a cat. If the model sees enough pictures of dogs and cats, it'll learn: what about this picture makes it a cat? What about this picture makes it a dog? Now, it's not 'seeing' in any literal sense; it's just looking at numbers, typically a Python list of numbers.
So a color photo of a cat would actually be a whole list of pixel values: red, green, and blue numbers. How much red, how much green, how much blue? Because it's just numbers to the machine; it doesn't know cats, it doesn't know dogs, it just knows data. And all the outputs are going to be numbers too. So for cat versus dog, we might label cat the number one and dog the number two, and the model says, hey, when I see a set of pixels that looks a little bit like this, that looks like it goes with one.
When I see pixels that have these numbers in them, that looks like it goes with two. So then I can predict: oh, this one looks like a one, this one looks like a two.
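Here's a tiny sketch of that idea. The pixel values and the cat-equals-one, dog-equals-two numbering are made up purely for illustration:

```python
import numpy as np

# A tiny "photo": 2 x 2 pixels, each with red, green, and blue values (0-255).
photo = np.array([
    [[200, 180, 150], [190, 170, 140]],
    [[ 60,  50,  40], [ 55,  45,  35]],
])

# To the model, the image is just one long list of numbers: the features.
features = photo.flatten()
print(features)  # [200 180 150 190 170 140  60  50  40  55  45  35]

# The labels are numbers too: here, cat -> 1 and dog -> 2.
labels = {"cat": 1, "dog": 2}
```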
We're also going to split off our data, not just into training data, but into testing data as well. That's data we'll withhold from the model; we won't give it everything. Instead, we'll hold some data in reserve so that we can say: okay, you learned on this data; now here's a quiz, here's the test data. How good are you, given what you learned? Run your predictions on the features of this test data, on the inputs, on the values we give you, and what's your prediction? And we'll measure that against the actual results for that testing data.
So this is like, 'Okay, you saw what a dog looked like, you saw what a cat looked like, you think you got the idea? Here's a picture of a dog, here's a picture of a cat. Which one do you think—is this a cat or a dog? Is this a cat or a dog?' And it's never seen those exact cats or dogs before. It's never seen these cars before; it doesn't know what prices they have.
And if it's learned its lesson well, it'll do a good job predicting new data based on old data. So it wouldn't be very helpful if it was just like, 'Yeah, I've seen that picture; you told me it was a cat, I remember it, that's a cat. You showed me this picture of a dog—I remember it's a dog.'
'I remember you told me.' It's like, 'Okay, we'll show you a new picture.' It's like, 'I've never seen that before, but I think, based on what I've learned about cats and dogs, that one's a dog.'
And you're like, 'Ooh, no—that one's a cat. Ooh, sorry—heartbreaker.' But hopefully our model will be pretty good at it.
And generally, we're going to split our dataset into training and testing sets, and typically we use 80% of the data for training and 20% for testing. That's the ratio people have found works well in practice.
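In scikit-learn, that split is a single function call. This sketch reuses the hypothetical X and y from the car-price example above; random_state is just an arbitrary seed so the split is reproducible:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hold back 20% of the rows as a "quiz" the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% testing, 80% training
    random_state=42,  # arbitrary seed, just for reproducibility
)

model = LinearRegression()
model.fit(X_train, y_train)  # learn only from the training data

# Quiz time: predict prices for cars the model has never seen.
print(model.predict(X_test))
print(list(y_test))          # the actual prices, for comparison
```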
All right, that's our introduction. Let's get started.