Domain Knowledge and Data Analysis in Model Training

Understand how combining domain knowledge with data analysis enhances predictive model accuracy. Gain insights into effectively selecting relevant data features for informed modeling decisions.

Key Insights

Utilize domain knowledge to identify and remove irrelevant data columns, such as recognizing that car sales dates (odd or even days) likely do not impact pricing, thus preventing misleading predictions.
Combine subjective human expertise with objective data analysis to validate assumptions—such as the belief that higher fuel efficiency, horsepower, and engine size may correlate with increased car prices.
Prepare targeted data sets by selecting specific, meaningful columns (e.g., sales in thousands, fuel efficiency, horsepower, engine size, price in thousands) for focused and efficient predictive modeling.

Note: These materials offer prospective students a preview of how our classes are structured. Students enrolled in this course will receive access to the full set of materials, including video lectures, project-based assignments, and instructor feedback.

As we think through what to train our model on and which data is important, there are two ways we typically decide: data analysis and domain knowledge.

We're about to do both. We'll come to data analysis a little later, but first let's talk about domain knowledge. Domain knowledge means knowledge about a particular area of the world.

In this case, what do we know about cars? And I'll admit, I don't know a lot about cars. That's fine, but I do know more than the computer knows about cars. This computer—again, this model—is just going to know numbers.

It doesn't know what a car is, or that a particular column is actually meaningless, and it might see value there anyway. It might see patterns that aren't really there, or patterns that are there but aren't predictive.

Maybe cars sold on an odd day of the month, like the first, third, fifth, or seventh, show a pattern in the data: those cars sell for more. But we know that isn't going to be predictive or meaningful, and that if we ran it on more and more data, it would lead to incorrect predictions.

Domain knowledge is what we know about cars: what we, as humans with a bit of understanding, can bring to the table to inform our model. When you're thinking about what data needs to be considered, domain knowledge is a great beginning, but it is subjective. Maybe there is something significant about odd- and even-numbered days of the month.

That makes no sense, but lots of things in life don't make any sense. When you analyze them objectively, as the computer does, without a subjective bias saying "no, that can't possibly be a pattern," you sometimes find patterns that are actually meaningful. So it's important to keep that in mind as we decide what could be valuable and what could not. All right, let's do that a little bit.
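To make the odd-versus-even-day idea concrete, here is a minimal sketch using simulated data, not the course's car sales file. The function name, the price distribution, and the sample sizes are all assumptions chosen for illustration. Prices are generated independently of the sale day, so any gap between odd and even days is pure chance.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def odd_even_price_gap(n_cars):
    """Mean price on odd days minus mean price on even days (simulated sales)."""
    # Hypothetical sale days 1..28; prices are drawn independently of the day.
    day_of_month = rng.integers(1, 29, size=n_cars)
    price = rng.normal(loc=27.0, scale=10.0, size=n_cars)
    is_odd = day_of_month % 2 == 1
    return price[is_odd].mean() - price[~is_odd].mean()

small_sample_gap = odd_even_price_gap(30)        # a small sample can show a sizable gap
large_sample_gap = odd_even_price_gap(200_000)   # with lots of data the gap shrinks toward zero

print(f"gap with 30 cars:      {small_sample_gap:.2f}")
print(f"gap with 200,000 cars: {large_sample_gap:.2f}")
```

A model trained only on the small sample might treat that gap as a real signal, which is where a human saying "that column shouldn't matter" helps.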

Let's filter our data down to five columns to start working with. We'll take our car sales and make a new data frame out of certain columns: sales in thousands, fuel efficiency, horsepower, and engine size, plus our target value, price in thousands. Let's take a look at car sales now.
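The column selection described here might look something like the following in pandas. This is a sketch under assumptions: the column names (such as `Sales_in_thousands`) and the tiny stand-in data frame are hypothetical, since the transcript doesn't show the actual file or its headers.

```python
import pandas as pd

# Hypothetical stand-in for the full car sales data; the values are made up.
car_sales_full = pd.DataFrame({
    "Manufacturer": ["A", "B", "C"],
    "Sales_in_thousands": [16.9, 39.4, 14.1],
    "Fuel_efficiency": [28.0, 25.0, 26.0],
    "Horsepower": [140, 225, 210],
    "Engine_size": [1.8, 3.2, 3.0],
    "Price_in_thousands": [21.5, 28.4, 33.9],
    "Sale_day_of_month": [1, 14, 7],
})

# Four inputs chosen with domain knowledge, plus the target (price).
selected_columns = [
    "Sales_in_thousands",
    "Fuel_efficiency",
    "Horsepower",
    "Engine_size",
    "Price_in_thousands",
]

# Indexing with a list of names keeps every row but only those columns.
car_sales = car_sales_full[selected_columns].copy()

print(car_sales.shape)
print(car_sales.head())
```

Selecting columns this way never drops rows, so the row count before and after should match.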

So here they are: the same cars, the same 157 rows. We haven't lost any cars, but we've decreased how much data is here. That's because we've decided that the inputs we think are important are these four, and this one is our target. Again, we used some domain knowledge to think about the problem: maybe the better the fuel efficiency, the more expensive the car; maybe the higher the horsepower or the bigger the engine size, the more expensive the car.

We're going to take a look and do some data analysis to see if our domain knowledge answer of what seems important is right or not.
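One common first check of those hunches is to see how strongly each input correlates with price. The sketch below is an assumption about what that analysis could look like, not necessarily the approach the course takes next; the column names and the five rows of data are made up for illustration.

```python
import pandas as pd

# Hypothetical filtered data frame: four inputs plus the target. Values are made up.
car_sales = pd.DataFrame({
    "Sales_in_thousands": [16.9, 39.4, 14.1, 8.6, 20.4],
    "Fuel_efficiency": [28.0, 25.0, 26.0, 22.0, 27.0],
    "Horsepower": [140, 225, 210, 300, 150],
    "Engine_size": [1.8, 3.2, 3.0, 4.2, 2.0],
    "Price_in_thousands": [21.5, 28.4, 33.9, 42.0, 23.9],
})

# Pearson correlation of every column with the target, minus the target itself.
correlations = (
    car_sales.corr()["Price_in_thousands"]
    .drop(labels="Price_in_thousands")
)

# Values near +1 or -1 suggest a strong linear relationship; near 0, a weak one.
print(correlations.sort_values(ascending=False))
```

Correlation only measures linear relationships and says nothing about cause, so it is a way to test a domain-knowledge guess, not a replacement for it.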

Colin Jaffe

Colin Jaffe is a programmer, writer, and teacher with a passion for creative code, customizable computing environments, and simple puns. He loves teaching code, from the fundamentals of algorithmic thinking to the business logic and user flow of application building—he particularly enjoys teaching JavaScript, Python, API design, and front-end frameworks.

Colin has taught code to a diverse group of students since learning to code himself, including young men of color at All-Star Code, elementary school kids at The Coding Space, and marginalized groups at Pursuit. He also works as an instructor for Noble Desktop, where he teaches classes in the Full-Stack Web Development Certificate and the Data Science & AI Certificate.

Colin lives in Brooklyn with his wife, two kids, and many intricate board games.
