Learn how to effectively concatenate new columns into your data frame to streamline your data preparation process. Follow straightforward techniques for ensuring data aligns correctly before splitting into training and testing datasets.
Key Insights
- Concatenating new columns such as "high," "medium," and "low" into a data frame requires using the pandas `concat` function, specifying column-wise concatenation so that they are aligned on the right side.
- The concatenation operation involves passing a list of data frames to the `concat` function, combining the original data frame with the additional columns (such as the salary one-hot encoding) and assigning the result back to the original data frame.
- After the concatenation, the updated data frame includes the original columns along with the newly added "high," "medium," and "low" columns, facilitating further data processing such as splitting into training and testing subsets.
This lesson is a preview from our Data Science & AI Certificate Online (includes software) and Python Certification Online (includes software & exam). Enroll in a course for detailed lessons, live instructor support, and project-based training.
Now, we have these columns for high, low, and medium. We want to concatenate them to the end of our data frame so that when we later split the data frame into testing and training datasets, we’ll have all the necessary columns.
Here’s how we’ll do that. We’ll concatenate them onto the right side of our data frame, so each row gains these three new columns.
We need to do a couple of things to achieve this. One step is to assign this concatenation result back to our data frame. We’ll say that our data frame is now the result of concatenating the old data frame with the new one.
The `concat` function takes in a list of data frames. So, we’ll pass the old data frame and the new one (the salary one-hot encoding). Finally, we need to specify that the concatenation should be done by columns, using the `axis` argument. Otherwise, it will assume rows and place the high, low, and medium columns at the bottom of the data instead of on the right side.
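The steps described above can be sketched roughly as follows. This is a minimal illustration, not the lesson's exact code: the names `hr_data` and `salary_one_hot`, the sample values, and the use of `pd.get_dummies` to build the one-hot columns are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's HR data (column names and values are assumed).
hr_data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "high"],
})

# Assumed way of producing the high/low/medium columns: one-hot encode salary.
salary_one_hot = pd.get_dummies(hr_data["salary"])

# Default behavior (axis=0) stacks by rows: the new columns end up at the
# bottom, padded with missing values, which is not what we want here.
stacked = pd.concat([hr_data, salary_one_hot])

# axis=1 (or axis="columns") attaches the new columns on the right side,
# and we assign the result back to the original data frame.
hr_data = pd.concat([hr_data, salary_one_hot], axis=1)

print(stacked.shape)   # rows doubled
print(hr_data.shape)   # same rows, three extra columns
```

When joining by columns, `concat` lines rows up by their index labels, so this works cleanly here because both data frames share the same index.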
Once we’ve done that, we can check the HR data to see the result. We still have all our previous columns, including salary, and now low, medium, and high as well. Going forward, we’ll exclude the original salary column and instead include the high, low, and medium columns from the right side.
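One way to check the result and set the original salary column aside might look like the sketch below. The data frame contents and the name `features` are hypothetical, and dropping the column with `drop(columns=...)` is just one possible approach; the lesson may select columns differently.

```python
import pandas as pd

# Hypothetical data frame as it might look after the concatenation step.
hr_data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "high"],
    "high": [0, 0, 1],
    "low": [1, 0, 0],
    "medium": [0, 1, 0],
})

# Inspect the first rows: salary is still present, with the new columns on the right.
print(hr_data.head())

# Assumed approach: leave out the text-valued salary column and keep the
# numeric high/low/medium columns for later use.
features = hr_data.drop(columns=["salary"])
print(list(features.columns))
```

Because `drop` returns a new data frame, `hr_data` itself still contains the salary column after this step.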
Now, our next step is splitting the data. Let’s proceed.