Data Frames: Concatenating Columns for Effective Splitting

Learn how to effectively concatenate new columns into your data frame to streamline your data preparation process. Follow straightforward techniques for ensuring data aligns correctly before splitting into training and testing datasets.

Key Insights

Concatenating new columns such as "high," "medium," and "low" into a data frame requires using the pandas concat function, specifying column-wise concatenation to properly align them to the right side.
The concatenation operation involves passing a list of data frames to the concat function, combining the original data frame with additional columns (such as the "salary 100 encoding") and assigning the result back to the original data frame.
After the concatenation, the updated data frame includes original columns along with newly added "high," "medium," and "low" columns, facilitating further data processing such as splitting into training and testing subsets.

This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.

Now, we have these columns for high, low, and medium. We want to CONCATENATE them to the end of our data frame so that we can later split the data frame into testing and training datasets. At that point, we’ll have all the necessary columns.

Here’s how we’ll do that. We’ll CONCATENATE them onto the right side, adding these new columns to each row. These three columns will be placed on the right side of our data frame.

We need to do a couple of things to achieve this. One step is to assign this concatenation result back to our data frame. We’ll say that our data frame is now the result of concatenating the old data frame with the new one.

The `CONCAT` function takes in a list of data frames. So, we’ll pass the old data frame and the new one (the salary one-hot encoding). Finally, we need to specify that the concatenation should be done by columns.

Otherwise, it will assume rows and place the high, low, and medium columns at the bottom of the data instead of the right side.

Data Analytics Certificate: Live & Hands-on, In NYC or Online, 0% Financing, 1-on-1 Mentoring, Free Retake, Job Prep. Named a Top Bootcamp by Forbes, Fortune, & Time Out. Noble Desktop. Learn More.

If we’ve done that, we can now check the HR data to see the result. We’ll still have all our previous columns, including salary, low, medium, and high, but we’ll exclude the original salary column.

Instead, we’ll include the high, low, and medium columns from the right side.

Now, our next step is splitting the data. Let’s proceed.

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.

Key Insights

Colin Jaffe

How to Learn Machine Learning