Learn how to effectively concatenate new columns into your data frame to streamline your data preparation process. Follow straightforward techniques for ensuring data aligns correctly before splitting into training and testing datasets.
Key Insights
- Concatenating new columns such as "high," "medium," and "low" onto a data frame uses the pandas `concat` function, with column-wise concatenation specified so the new columns land on the right side.
- The concatenation operation involves passing a list of data frames to the `concat` function, combining the original data frame with the additional columns (the salary one-hot encoding) and assigning the result back to the original data frame.
- After the concatenation, the updated data frame includes the original columns along with the newly added "high," "medium," and "low" columns, ready for further processing such as splitting into training and testing subsets.
Now, we have these columns for high, low, and medium. We want to concatenate them to the end of our data frame so that we can later split the data frame into testing and training datasets. At that point, we’ll have all the necessary columns.
Here’s how we’ll do that. We’ll concatenate them onto the right side of our data frame, so every row gains these three new columns.
We need to do a couple of things to achieve this. One step is to assign this concatenation result back to our data frame. We’ll say that our data frame is now the result of concatenating the old data frame with the new one.
The `concat` function takes in a list of data frames. So, we’ll pass the old data frame and the new one (the salary one-hot encoding). Finally, we need to specify that the concatenation should be done by columns, which in pandas means passing `axis=1`.
Otherwise, it will assume rows and place the high, low, and medium columns at the bottom of the data instead of the right side.
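The steps above can be sketched as follows. This is a minimal illustration, not the course's exact code: the names `hr_data` and `salary_one_hot`, the tiny sample rows, and the use of `pd.get_dummies` to build the one-hot columns are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical stand-in for the HR data; the real data frame has more columns.
hr_data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "high"],
})

# Assumed way of building the one-hot columns (high, low, medium).
salary_one_hot = pd.get_dummies(hr_data["salary"])

# Without axis=1, concat stacks by rows: the new columns end up in extra
# rows at the bottom, with missing values wherever columns do not match.
stacked = pd.concat([hr_data, salary_one_hot])
print(stacked.shape)

# Pass a list of data frames, concatenate by columns, and assign the
# result back to the original name.
hr_data = pd.concat([hr_data, salary_one_hot], axis=1)
print(hr_data.shape)
```

Comparing the two shapes makes the difference visible: the row-wise version doubles the number of rows, while the column-wise version keeps the row count and only widens the data frame.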
If we’ve done that, we can now check the HR data to see the result. We still have all our previous columns, including salary, and the low, medium, and high columns now appear on the right side.
Going forward, we’ll exclude the original salary column and use the high, low, and medium columns in its place.
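One way to check the result and leave out the original salary column is sketched below. The data frame here is a small hypothetical stand-in for the concatenated HR data, and dropping the column with `drop` is an assumption; the course may exclude it differently, for example by selecting feature columns by name.

```python
import pandas as pd

# Hypothetical stand-in for the HR data after concatenation.
hr_data = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "salary": ["low", "medium", "high"],
    "high": [0, 0, 1],
    "low": [1, 0, 0],
    "medium": [0, 1, 0],
})

# Inspect the column names to confirm the new columns sit on the right.
print(hr_data.columns.tolist())

# Keep everything except the original text column for later modeling.
features = hr_data.drop(columns=["salary"])
print(features.columns.tolist())
```

Note that `drop` returns a new data frame, so `hr_data` itself still contains the salary column unless the result is assigned back.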
Now, our next step is splitting the data. Let’s proceed.