Splitting Data into Training and Testing Sets for Modeling

Split the iris dataset into training and testing sets for features and targets.

Prepare your dataset effectively by splitting features and targets for accurate machine learning analysis. Learn to implement training and testing data partitions to ensure robust model performance.

Key Insights

  • The article outlines the process of defining the feature inputs (X) by selecting the sepal length, sepal width, petal length, and petal width columns from the iris dataset.
  • It demonstrates assigning the "target" column of the dataset as output labels (Y), resulting in 150 labeled rows categorized numerically (0, 1, and 2).
  • The content explains using scikit-learn's train-test split method to randomly partition the dataset into training and testing subsets, specifying a 20% test size to yield 30 randomly selected samples for validation.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's take a look at preparing our data into X and Y training and testing sets. The dataframe looks ready for use: we have our four features, sepal length, sepal width, petal length, and petal width, along with the target column and a species column to help us interpret the data.
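If you're working outside the course notebook, a dataframe like the one described here can be assembled from scikit-learn's bundled copy of the dataset. This is only a minimal sketch: the variable name iris matches the walkthrough, but the load_iris call, the raw helper variable, and the way the species column is built are assumptions, since the lesson's own setup cell isn't shown.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build the iris dataframe: four measurement columns, a numeric target,
# and a human-readable species column. (Variable names here are assumptions.)
raw = load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris["target"] = raw.target
iris["species"] = raw.target_names[raw.target]

iris.head()
```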

All right, so X represents the features for training, and that means that X should be the iris dataframe with the columns: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Let's see if we got that right. It's very easy to get a typo there.
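Assuming the iris dataframe built above, that selection might look like the sketch below; putting the column names in a list (feature_columns is a name introduced here for illustration) makes the typo-prone part easier to proofread.

```python
# The feature matrix X: just the four measurement columns, spelled exactly
# as they appear in the dataframe, "(cm)" included.
feature_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
X = iris[feature_columns]
X.head()
```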

It looks like we nailed it. Okay, those are our inputs. Let's get our target, our Y, and that's very simple.

Y is the iris dataframe's target column, and there it is: 150 rows of zeros, ones, and twos (each number representing a species). Now let's use train_test_split as we did previously.
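Assuming the same dataframe, the target selection is a one-liner:

```python
# The labels Y: the numeric species codes 0, 1, and 2, one per row.
Y = iris["target"]
Y
```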

We want X_train, X_test, Y_train, and Y_test to be the result of train_test_split on our X data and our Y data, with a test size of 0.2 (20%). Then let's take a look at the test data. Here are the targets for our test data, and you can see that they're randomly assigned: 30 samples, since that's 20% of 150.
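Here is a sketch of that call, assuming the X and Y defined above. No random_state is set, matching the random assignment described, though you could pass one for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Randomly split the 150 rows: 80% for training, 20% (30 rows) for testing.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print(Y_test)        # 30 randomly selected target labels
print(X_test.shape)  # (30, 4): the feature rows that go with those labels
```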


And the corresponding test data—30 samples. These are the inputs that go with those answers. All right, next up we'll create our model, train it, and get it working.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
