Splitting Data into Training and Testing Sets for Modeling

Split the iris dataset into training and testing sets for features and targets.

Prepare your dataset effectively by splitting features and targets for accurate machine learning analysis. Learn to implement training and testing data partitions to ensure robust model performance.

Key Insights

  • The article outlines the process of defining the feature inputs (X) by selecting the sepal length, sepal width, petal length, and petal width columns from the iris dataset.
  • It demonstrates assigning the "target" column of the dataset as output labels (Y), resulting in 150 labeled rows categorized numerically (0, 1, and 2).
  • The content explains using scikit-learn's train-test split method to randomly partition the dataset into training and testing subsets, specifying a 20% test size to yield 30 randomly selected samples for validation.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's take a look at preparing our data into X and Y training and testing sets. The dataframe looks ready for use: we have our four features, sepal length, sepal width, petal length, and petal width, along with the target column and a species column to help us interpret the data.
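If you're working outside the course notebook, a dataframe like the one described here can be assembled from scikit-learn's bundled copy of the dataset. This is only a minimal sketch: the variable name iris matches the walkthrough, but the load_iris call, the raw helper variable, and the way the species column is built are assumptions, since the lesson's own setup cell isn't shown.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Build the iris dataframe: four measurement columns, a numeric target,
# and a human-readable species column. (Variable names here are assumptions.)
raw = load_iris()
iris = pd.DataFrame(raw.data, columns=raw.feature_names)
iris["target"] = raw.target
iris["species"] = raw.target_names[raw.target]

iris.head()
```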

All right, so X represents the features for training, and that means that X should be the iris dataframe with the columns: sepal length (cm), sepal width (cm), petal length (cm), and petal width (cm). Let's see if we got that right. It's very easy to get a typo there.
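Assuming the iris dataframe built above, that selection might look like the sketch below; putting the column names in a list (feature_columns is a name introduced here for illustration) makes the typo-prone part easier to proofread.

```python
# The feature matrix X: just the four measurement columns, spelled exactly
# as they appear in the dataframe, "(cm)" included.
feature_columns = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]
X = iris[feature_columns]
X.head()
```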

It looks like we nailed it. Okay, those are our inputs. Let's get our target, our Y, and that's very simple.

Y is the iris dataframe's target column, and there it is: 150 rows of zeros, ones, and twos (each number representing a species). Now let's use train_test_split as we did previously.
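Assuming the same dataframe, the target selection is a one-liner:

```python
# The labels Y: the numeric species codes 0, 1, and 2, one per row.
Y = iris["target"]
Y
```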

We want X_train, X_test, Y_train, and Y_test to be the result of train_test_split on our X data and our Y data, with a test size of 0.2 (20%). Then let's take a look at the test data. Here are the targets for our test data, and you can see that they're randomly assigned: 30 samples, since that's 20% of 150.
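Here is a sketch of that call, assuming the X and Y defined above. No random_state is set, matching the random assignment described, though you could pass one for reproducibility.

```python
from sklearn.model_selection import train_test_split

# Randomly split the 150 rows: 80% for training, 20% (30 rows) for testing.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print(Y_test)        # 30 randomly selected target labels
print(X_test.shape)  # (30, 4): the feature rows that go with those labels
```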


And the corresponding test data—30 samples. These are the inputs that go with those answers. All right, next up we'll create our model, train it, and get it working.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
