Prepare for machine learning by revisiting essential Python libraries, statistical concepts, and data visualization methods. This article offers a concise refresher on NumPy, SciPy, Pandas, and Matplotlib to strengthen your data science foundation.
Key Insights
- The article provides a refresher on essential Python libraries used in machine learning, including NumPy for numerical computations, SciPy for statistical functions, and Matplotlib for data visualization.
- It emphasizes the importance of correctly setting up Jupyter Notebooks and importing standard libraries using common aliases, such as NumPy as "np," Pandas as "pd," and Matplotlib PyPlot as "plt."
- The author highlights common troubleshooting steps, particularly checking file paths and URLs when importing datasets (e.g., car sales data) into Pandas data frames from Google Drive.
Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.
Let's clean up after the last video. We had a code cell. We can go over here and delete it.
We don't want to be running that. Again, it will cause an error if we've already deleted the name. And let's just talk briefly about what we're going to be covering in this lesson.
So 1.0 is about regression. It's about statistics. It's about a little bit of Python and a Data Science refresher.
And getting our heads around Jupyter Notebooks and getting a little used to them before we dive into machine learning proper. Okay, so here's what we're going to cover in this lesson. We're going to cover some statistical modules, some libraries from Python, NumPy, SciPy, and Matplotlib.
And we're going to cover the statistics that we need, like distributions. And we're going to cover plots.
That's what we're going to cover. Let's get started. So since I restarted the kernel, the Python environment, I'm going to rerun this.
It says drive's already mounted. Great. Now, here with this, you should be familiar with most of these.
But this is Pandas. Hopefully you're reasonably familiar with that. But we'll get a little refresher on it, as well as on NumPy, which is the Python numerical math library that undergirds a lot of other things.
Then there's stats from SciPy. This is our library for displaying images in Jupyter Notebook, which used to be known as IPython, for interactive Python. The random module is for generating random values.
And from Matplotlib, we'll grab PyPlot. And typically, we call that plt. These are the standard things to import for the things we'll be doing.
And they'll also have the standard names for them: plt, pd for Pandas, and np for NumPy. Now, we also want to run the URL variables here that we've given to you.
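As a rough sketch, the import block described above might look like the following. The exact cell in the course notebook isn't shown in this transcript, so treat this as an assumption; the IPython display import is left out because the transcript doesn't name what is imported from it.

```python
# A sketch of the standard imports and aliases described in the lesson.
import random                      # standard-library module for random values

import numpy as np                 # numerical arrays and math
import pandas as pd                # DataFrames
from scipy import stats            # statistical functions
import matplotlib.pyplot as plt    # plotting

# Quick sanity check that each alias points at the expected module.
for alias, module in [("np", np), ("pd", pd), ("stats", stats), ("plt", plt)]:
    print(alias, "->", module.__name__)
```

Sticking to these aliases makes your code look like the examples you'll find in documentation and other tutorials.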
These should point exactly to the files that we uploaded to Google Drive. And if you didn't get that setup right, if this becomes an error for you (we're not done with it yet), then you can take a look at one of the earlier videos where we do our Google Drive setup. All right.
So, we've made these URLs. We can just check what base URL plus car sales URL comes out to. And we can execute that.
And it should read like this. Check yours just to make sure the slashes are right. It would be really easy to have two slashes before the CSV file name or no slash at all if you've got a typo in there.
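The slash check can also be done in code. This is a minimal sketch, not part of the course notebook: the names base_url and car_sales_url, the folder name, and both helper functions are hypothetical stand-ins for whatever your notebook defines.

```python
# Hypothetical values; substitute the ones from your own notebook.
base_url = "/content/drive/My Drive/bootcamp-folder/"
car_sales_url = "car_sales.csv"


def join_path(base: str, name: str) -> str:
    """Join two path pieces with exactly one slash between them."""
    return base.rstrip("/") + "/" + name.lstrip("/")


def concat_is_clean(base: str, name: str) -> bool:
    """Return True if plain concatenation already has exactly one slash at the seam."""
    return base + name == join_path(base, name)


print(base_url + car_sales_url)
print(concat_is_clean(base_url, car_sales_url))            # one slash at the seam
print(concat_is_clean(base_url, "/" + car_sales_url))      # doubled slash
print(concat_is_clean(base_url.rstrip("/"), car_sales_url))  # missing slash
```

Printing the concatenated string, as the lesson does, is usually enough; the helper just makes the two typo cases explicit.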
And, again, this is the URL to your Google Drive in general and to this car sales CSV data in particular. You may get an error here.
Let's check that path to make sure we can create a Pandas DataFrame based on the cars CSV. So, we can say cars equals; that's a common name for a DataFrame like this, right? It's the cars data. And we can say cars equals pd, for Pandas, dot read_csv.
And what we pass to read_csv is a path to the file. If it was a local file, it would start with something like a dot slash. But here it's our base URL plus our car sales URL.
The thing we printed out above. And if there's a problem, you'll get an error here. And, again, if you do, you'll go back and look at your Google Drive and make sure that you've got this folder and it's the one that we gave you.
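Here is a self-contained sketch of that read_csv call. Since the real car sales file isn't available here, it writes a tiny stand-in CSV to a temporary folder first; the variable names, the file name, and the columns and rows are all made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the notebook's URL variables.
    base_url = tmp + "/"
    car_sales_url = "car_sales.csv"

    # Create a tiny placeholder CSV so the example can run anywhere.
    Path(base_url + car_sales_url).write_text(
        "make,price\nToyota,21000\nHonda,19500\n"
    )

    # The call from the lesson: pass the full path to pd.read_csv.
    cars = pd.read_csv(base_url + car_sales_url)

print(cars.shape)
```

In the course notebook, the path would point at the mounted Google Drive folder instead of a temporary directory.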
Let's run this. Yeah, I do have an error. Fantastic.
Name 'pd' is not defined. This is a very common error, and I'm glad to demonstrate it. And I probably will again.
Sometimes I'm talking about code, and I forget to actually execute it. You can see that this import block does not have a check mark next to it. Probably not the error you got.
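To see this error outside a notebook, here is a small sketch. The empty dictionary stands in for a freshly restarted kernel where the import cell has not been run yet; that setup is an illustration, not how the notebook itself works internally.

```python
# An empty namespace plays the role of a kernel with no imports executed.
fresh_kernel = {}

try:
    # Same kind of line as in the lesson, run before "import pandas as pd".
    exec('cars = pd.read_csv("car_sales.csv")', fresh_kernel)
except NameError as err:
    message = str(err)
    print("NameError:", message)  # prints: NameError: name 'pd' is not defined
```

The fix is simply to run the import cell first, then re-run the cell that failed.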
Again, if you got an error, it's probably a different one, saying that the URL is wrong. Well, let's just delete something so you can see that error. But first, let's run this and see what it looks like if we have no error.
It just shows a check mark. Great. We're not outputting anything.
So there's nothing to display there. If you have an error, let's say you didn't put it in My Drive. Let's say it's actually in a different folder.
Let's just mess this up. I just deleted a letter. Execute that.
You can see that machine learning is misspelled here. Now if I execute this, I'll get this error. It's probably "No such file or directory," with the full path we've got up there.
If you've got an error here, it's not going to be because you misspelled this, because we gave you this. With the starting Python notebook, you have the entire bit of code there. But if you got an error here, it's again because your Python Machine Learning Bootcamp folder is not uploaded directly to My Drive, or you simply don't have any files in that folder.
You don't have the CSVs in that folder. It's possible. All right.
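If you want to tell those two cases apart (folder missing versus folder present but empty), a quick check like the one below can help. It is a sketch only: describe_data_folder is a hypothetical helper, and the folder and file names are placeholders, demonstrated against a temporary directory rather than a real Google Drive.

```python
import tempfile
from pathlib import Path


def describe_data_folder(folder) -> str:
    """Report whether a folder exists and which CSV files it contains."""
    folder = Path(folder)
    if not folder.is_dir():
        return f"Folder not found: {folder}"
    csv_files = sorted(p.name for p in folder.glob("*.csv"))
    if not csv_files:
        return f"Folder exists but contains no CSV files: {folder}"
    return "Found CSV files: " + ", ".join(csv_files)


with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp) / "bootcamp-folder"      # placeholder folder name
    missing_msg = describe_data_folder(data_dir)  # folder not created yet

    data_dir.mkdir()
    empty_msg = describe_data_folder(data_dir)    # folder exists, no CSVs

    (data_dir / "car_sales.csv").write_text("make,price\n")
    found_msg = describe_data_folder(data_dir)    # one CSV present

print(missing_msg)
print(empty_msg)
print(found_msg)
```

In the notebook, you would pass your own base path to the helper instead of the temporary folder.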
So, I fixed the misspelling for you. If you got this error, make sure that you have now put the Python Machine Learning Bootcamp folder containing the CSV files directly into My Drive. See the earlier video.
And now you run it, and it should run just fine. Mine does not update because I need to re-execute this. Again, if I've changed this code, that's great.
But I haven't re-executed it. So base URL still has the incorrect value. Let's try executing this.
Great. It's been updated. Now executing this.
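The point about re-executing can be modeled in plain Python. In this sketch, a dictionary stands in for the running kernel and a string stands in for the cell's text; the folder name is a made-up placeholder. Editing the text changes nothing until the cell is executed again.

```python
kernel = {}  # stands in for the notebook's running Python session

# The cell as first written, with a typo in the folder name.
cell_source = 'base_url = "My Drive/machine-learnin-bootcamp/"'
exec(cell_source, kernel)

# Fix the typo in the cell text only, without running the cell.
cell_source = cell_source.replace("learnin-", "learning-")
value_before_rerun = kernel["base_url"]   # still the misspelled value

# Re-execute the edited cell; now the variable is updated.
exec(cell_source, kernel)
value_after_rerun = kernel["base_url"]

print(value_before_rerun)
print(value_after_rerun)
```

Any cell that uses the variable also has to be re-run afterward to pick up the corrected value.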
Great. No error. But let's actually look at our cars.
So the last evaluated value in a cell will be output. Cars. What is cars? Here it is.
It's a Pandas DataFrame based on a CSV. And we'll dive more into that in the next video.
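Beyond typing the name as the last line of a cell, a few quick calls give a first look at a DataFrame. This sketch uses an in-memory CSV with invented columns and rows, since the real car sales data isn't reproduced here.

```python
import io

import pandas as pd

# Made-up stand-in for the car sales CSV.
csv_text = "make,price\nToyota,21000\nHonda,19500\nFord,23250\n"
cars = pd.read_csv(io.StringIO(csv_text))

print(type(cars).__name__)   # the kind of object read_csv returns
print(cars.shape)            # (number of rows, number of columns)
print(cars.head(2))          # first two rows
```

In a notebook, ending a cell with just cars or cars.head() displays the table without needing print.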