Understanding Random Forest Classifiers: How They Work

Explain how random forest classifiers average multiple diverse decision trees for robust prediction.

Uncover the workings of random forest classifiers, a powerful machine learning technique leveraging diverse decision trees for highly accurate predictions. Learn how randomness and feature diversity help handle complex datasets effectively.

Key Insights

  • Random forest classifiers enhance prediction accuracy by creating numerous diverse decision trees, each analyzing random subsets of data and input features to prevent dominance by a single characteristic like passenger class.
  • This method efficiently manages both small and large datasets and effectively handles data outliers, making it particularly suitable for datasets with irregularities such as the Titanic dataset.
  • Users can optimize random forest models by adjusting hyperparameters such as the criterion (commonly "entropy"), the number of decision trees (estimators), and random state, which ensures reproducibility of random processes.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

Let's talk about random forest classifiers. Let's load this image first. These are decision trees.

This is tree 1, tree 2, many other trees, all the way up to tree 600. Each of these trees takes the data and splits it up bit by bit. For each piece of data, it asks questions: was the passenger male or female? Was the ticket first class or second class? Then it makes a prediction.
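As a minimal sketch of that idea, assuming scikit-learn and a tiny made-up Titanic-style table (the column names and values here are hypothetical, not the lesson's actual data):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy Titanic-style data: sex (0 = male, 1 = female), pclass (1-3)
X = pd.DataFrame({
    "sex":    [0, 1, 1, 0, 0, 1],
    "pclass": [3, 1, 2, 1, 3, 3],
})
y = [0, 1, 1, 0, 0, 1]  # survived (1) or not (0)

tree = DecisionTreeClassifier(random_state=1)
tree.fit(X, y)  # learns split questions like "is sex female?" or "is pclass <= 1?"

# Predict for a hypothetical first-class female passenger
passenger = pd.DataFrame({"sex": [1], "pclass": [1]})
print(tree.predict(passenger))
```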

Each tree does this, and each one has a different method of doing so. A random forest classifier takes this decision tree idea and, as the name would imply (a forest is a collection of trees), builds many, many trees.

A random forest classifier classifies something, such as survival (survived or not). It does this by looking at lots of different possibilities, lots of different methods, and averaging them all together. So how is this a helpful method? Well, each tree looks at a random subset of the data.

This means that each tree is diverse. They're looking at lots and lots of different pieces of the data. So there's a lot of diversity of ideas here.
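To make that concrete, here is a rough sketch of the "random subsets of the data" idea, known as bootstrap sampling. This is illustrative only, not scikit-learn's internal code:

```python
import numpy as np

rng = np.random.default_rng(42)
rows = np.arange(10)  # pretend we have 10 passengers

# Each tree trains on a sample drawn with replacement, so every tree
# sees a slightly different view of the data: some rows repeat,
# others are left out entirely.
for tree_number in range(3):
    sample = rng.choice(rows, size=len(rows), replace=True)
    print(f"tree {tree_number} trains on rows: {sorted(sample)}")
```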

You can consider each tree in this computer model as one of those ideas. The classifier also uses random features, random inputs. For example, one tree might consider age and fare.

Another might consider class and port of embarkation. Ultimately, this prevents any single dominant feature, like class (which is probably the most important feature), from being the only factor in the model. Random forest classifiers examine a diverse group of features and combine all of them rather than relying on any single one.
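In scikit-learn, this feature randomness is governed by the max_features hyperparameter, which, strictly speaking, limits the features considered at each individual split rather than per tree. A quick sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# max_features caps how many columns are examined when choosing each split.
# "sqrt" (scikit-learn's default for classifiers) means each split considers
# a random subset of features, so a strong feature like pclass can't drive
# every decision in every tree.
model = RandomForestClassifier(max_features="sqrt")
```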

So you get high accuracy because of this robust randomness. It works with large datasets and small datasets, and it also handles outliers really well.

There are certainly some strange outliers in this data, so a random forest classifier is ideal for the Titanic dataset. Now, here are the hyperparameters we will use.

Hyperparameters are not the parameters within the data. They're similar to metadata. They're the parameters of training the model.

Criterion, number of estimators, and random state. We'll use 10 decision trees. There are a couple of different criteria for splitting our data.

Entropy is a good one, and a very common choice these days; the other option is Gini impurity. Setting a random state ensures the randomness is reproducible.
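In scikit-learn terms, those three hyperparameters map directly onto constructor arguments. A minimal sketch, where the specific random_state value is an arbitrary choice of ours, not prescribed by the lesson:

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    criterion="entropy",  # split-quality measure; "gini" is the other common choice
    n_estimators=10,      # build 10 decision trees
    random_state=1,       # fix the randomness so runs are reproducible
)
```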

These hyperparameters can be tuned later if we decide the model needs improvement. What if we increase the number of trees or change the criterion? Changing the random state shouldn't meaningfully affect the outcome.

However, these two definitely could. We'll stick with these hyperparameters for now, but tuning them is an important part of working with random forest classifiers and other models.
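When that tuning time comes, scikit-learn's GridSearchCV is one standard way to try combinations systematically. The grid values below are illustrative, not prescribed by the lesson, and the fit call assumes training data that has already been prepared:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 100, 300],    # try more trees
    "criterion": ["entropy", "gini"],  # try both split criteria
}
search = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=5)
# search.fit(X_train, y_train)   # assumes X_train, y_train already exist
# print(search.best_params_)
```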

Alright, we'll start creating a random forest classifier, which of course is not going to be that much code.

And we're going to see how it does.
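Here is a minimal end-to-end sketch of what that code might look like. The toy feature table below stands in for the real, prepared Titanic data, so the accuracy it prints is meaningless; only the workflow is the point:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder features standing in for the prepared Titanic table
X = pd.DataFrame({
    "sex":    [0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
    "pclass": [3, 1, 2, 1, 3, 3, 2, 1, 3, 2],
    "fare":   [7.25, 71.28, 13.0, 53.1, 8.05, 7.92, 26.0, 80.0, 7.75, 30.0],
})
y = [0, 1, 1, 0, 0, 1, 0, 1, 0, 1]  # survived?

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

forest = RandomForestClassifier(criterion="entropy", n_estimators=10, random_state=1)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # mean accuracy on the held-out passengers
```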

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
