Understanding the Math of Data Science

Learn the core mathematical subjects pivotal in data science: Probability, Statistics, and Linear Algebra. These subjects form the basis for understanding how data is analyzed, interpreted, and applied in various professional scenarios.

Key Insights

Probability is a fundamental concept in data science, underpinning detection tasks and Bayes' Theorem, which updates the likelihood of an event based on prior knowledge.
Probability distributions map every possible outcome of an event to its chance of occurrence, allowing for the prediction of a range of values like weather forecasts or sports betting spreads.
Statistics is integral to working with sample data, a smaller dataset that speaks for all possible data. It's crucial in hypothesis testing, a scientific method widely used in data science.
A/B testing is a practical application of statistical hypothesis testing used by companies to introduce new features and evaluate consumer response.
Linear and logistic regression models, which predict numerical values and binary outcomes respectively, are the foundational building blocks of machine learning.
Linear Algebra enables the simultaneous consideration of multiple data factors, allowing for the creation of more comprehensive data models. These models, when trained with algorithms that minimize errors, are the crux of machine learning applications.

Inevitably, everyone who has any interest in Data Science faces a wall, and that wall’s name is Mathematics. As much as we would love to avoid the intricacies of math in data science, when a stakeholder asks, “Can you explain the answer to me?” the answer is probably a complicated mouthful consisting of a whole pipeline of ideas involving data and mathematical calculations. As important as it is to know the correct algorithm to do the work, it is equally important to conceptually understand the math behind the algorithm. We’ll look at some of the key mathematical concepts so that we can slowly recreate the engine and learn the mechanics of data science.

Probability

Probability is the first of the three core mathematical subjects to learn. Probability is the study of events and possibilities and is seen as a theoretical science because it takes every outcome into account. For example, when a football team plays a game, we know it can win, lose, or tie.

Normally people see Probability as coin flips and dice rolls; however, the most straightforward data science example of why Probability is important is detection. In data science, we often see a computer or device detect a fingerprint or a face, and each detection carries a probability that determines whether the match is accepted as true or rejected as false.

Another topic in Probability is Bayes' Theorem, which describes how prior knowledge about related conditions changes the probability of an event. For example, if the computer already knows that a picture shows a dog, the chance that the picture shows a poodle is higher than it would be for an arbitrary picture.
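The dog and poodle example can be worked through with Bayes' Theorem, P(A | B) = P(B | A) × P(A) / P(B). The sketch below is only an illustration: the function name `bayes` and all three probabilities are invented for this example, not taken from real data.

```python
def bayes(prior, likelihood, evidence):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only for illustration.
p_poodle = 0.02            # P(poodle): share of all pictures that show a poodle
p_dog = 0.20               # P(dog): share of all pictures that show any dog
p_dog_given_poodle = 1.0   # P(dog | poodle): every poodle is a dog

p_poodle_given_dog = bayes(p_poodle, p_dog_given_poodle, p_dog)

print(f"P(poodle)       = {p_poodle:.2f}")            # 0.02
print(f"P(poodle | dog) = {p_poodle_given_dog:.2f}")  # 0.10
```

Under these assumed numbers, knowing that the picture is a dog raises the chance of a poodle from 2 percent to 10 percent.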

Lastly, we will look at probability distributions, which map every possible outcome of an event to the chance that it will happen. We can then assign a numerical value to each outcome, like the sum of two dice or the final score of a sporting event. From this, we can obtain important values such as the expected value (the average value, or mean) and the variance. However, we can also look at ranges of values, such as a greater than 30 percent chance of rain, or in sports betting, a line, spread, or over-under. The final probability is an exact number, but to arrive at it in practice, we need to employ statistics.
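As a small illustration of a probability distribution, the sketch below builds the distribution for the sum of two dice, assuming both dice are fair and six-sided, and then computes the expected value, the variance, and the probability of a range of values. The variable names are chosen for this example only.

```python
from fractions import Fraction
from itertools import product

# Map every possible sum of two fair six-sided dice to its probability.
dist = {}
for a, b in product(range(1, 7), repeat=2):
    dist[a + b] = dist.get(a + b, Fraction(0)) + Fraction(1, 36)

# Expected value: each value weighted by its probability.
expected = sum(value * p for value, p in dist.items())

# Variance: expected squared distance from the mean.
variance = sum((value - expected) ** 2 * p for value, p in dist.items())

# Probability of a range of values: a sum of 10 or more.
p_at_least_10 = sum(p for value, p in dist.items() if value >= 10)

print(expected)       # 7
print(variance)       # 35/6
print(p_at_least_10)  # 1/6
```

Exact fractions are used here instead of floating-point numbers so the results match the hand calculation.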

Statistics

Usually, we see Probability and Statistics lumped into one topic, and that's because they are closely related and share much of their terminology. With Probability, we are working with hard rules with strict numerical results. But most of the time, we don't have perfect information, and we need to make assumptions about a larger population from a smaller dataset, known as sample data. Since we are working with sample data, we are now working in the field of statistics, because this sample is now speaking for all of the possible data as a whole. Jumping deeper into statistics allows us to evaluate how “good” the sample data may be.

When working with sample data, we draw conclusions through a process called hypothesis testing. This type of testing, which is based on the scientific method, is what makes data science a science. Learning statistics can demystify the A/B test, an experiment that companies run all the time before rolling out new features. We often see this when only a small group of users gets a different version of Facebook or Amazon. That new version is the B variant, tested against the original A, and if the response to B is clearly better, it will be adopted. How sure we can be that B truly differs from A, rather than differing by chance, is expressed as statistical confidence. If our confidence is high, we can treat the rates observed in the sample as probabilities for the wider population.
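One common way to check whether a B variant differs from A is a two-proportion z-test. The sketch below is a minimal version of that test, not a description of how any particular company runs its experiments. The function name and the visitor and conversion counts are made up for illustration, and the test assumes samples large enough for the normal approximation to hold.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 2,000 visitors per version.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("B differs from A at the 95% confidence level")
```

A small p-value means a difference this large would be unlikely if A and B really performed the same, which is what "high confidence" refers to in practice.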

Eventually, with statistics, we will model the first two fundamental building blocks of machine learning, which are linear and logistic regression. Linear regression predicts a numerical value, such as the price of a car. Logistic regression predicts a binary outcome, such as true/false, yes/no, or hot dog/not hot dog. Linear regression does this through minimizing error and creating the best fit line, whereas, in logistic regression, we analyze a function curve known as the sigmoid function, which basically softens a hard yes/no into a maybe with a given percentage of yes.
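The two ideas in this paragraph can be sketched in a few lines. The first function fits a best-fit line to one feature by minimizing squared error, and the second is the sigmoid function that turns any number into a value between 0 and 1. The car ages and prices are invented for illustration, and the function names are assumptions of this sketch.

```python
import math

def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical data: car age in years and price in thousands of dollars.
ages = [1, 2, 3, 4, 5]
prices = [27, 25, 21, 18, 14]

slope, intercept = fit_line(ages, prices)
print(f"price = {slope:.1f} * age + {intercept:.1f}")

# Linear regression predicts a number: the price of a 6-year-old car.
print(f"predicted price at age 6: {slope * 6 + intercept:.1f}")

# Logistic regression passes a score through the sigmoid to get a
# probability of "yes" instead of a hard yes/no.
print(f"sigmoid(0) = {sigmoid(0):.2f}")
print(f"sigmoid(2) = {sigmoid(2):.2f}")
```

A score of 0 maps to an even 50 percent, and larger positive scores move the probability of "yes" toward 100 percent.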

Linear Algebra

Linear Algebra is the third of the major topics. Linear Algebra exists so that we can take multiple factors into consideration. Imagine if we wanted to determine if someone was good at basketball. We can look at the features of that individual, such as their height, weight, or free throw percentage. Usually, we would analyze one feature, and set everything else equal, such as only looking at height. However, if we wanted to look at everything as a whole, we can see that each of these features creates its own dimension. With linear algebra, we can take all of the necessary features into account simultaneously and thus build a better model.

With linear algebra, we can arrange our data in a matrix whose rows are data points and whose columns are features. Our prediction model gives us another matrix of weights, and multiplying the data matrix by the weights produces a prediction, a process called a linear transformation. We then look at the difference between our prediction and real-world results, which is known as the error. We find the weights by using algorithms that minimize the error, such as ordinary least squares (OLS) for linear regression. The model often improves as we add new data or run the algorithm multiple times while training it. This is where the machine learning happens. This is why people must press their thumbs several times to set up a phone's fingerprint reader or turn their faces in a circle for Face ID. Once we have the model in place, the computer can evaluate new data, and thus the detection cycle is complete. Learn how to build these models in our Data Science Certificate.
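The matrix view of a model can be sketched without any libraries. Below, each row of `X` is a player and each column is a feature, with a leading column of ones for the intercept. The weights are found with ordinary least squares by solving the normal equations (XᵀX)w = Xᵀy. The player data and the "true" weights are invented so that the answer is known in advance, and the helper functions are written for this example. In real work a library such as NumPy would normally handle the matrix math.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply matrix a by matrix b (both given as lists of rows)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

# Hypothetical players: [intercept term, height in meters, free throw rate].
X = [
    [1.0, 1.80, 0.70],
    [1.0, 1.90, 0.80],
    [1.0, 2.00, 0.65],
    [1.0, 1.75, 0.90],
    [1.0, 2.05, 0.75],
]

# Made-up "true" weights used to generate the target scores.
true_weights = [[1.0], [2.0], [3.0]]
y = matmul(X, true_weights)

# Ordinary least squares via the normal equations.
Xt = transpose(X)
weights = solve(matmul(Xt, X), matmul(Xt, y))

# Prediction is one matrix multiplication; error is prediction minus reality.
predictions = matmul(X, weights)
errors = [p[0] - t[0] for p, t in zip(predictions, y)]
sse = sum(e ** 2 for e in errors)

print([round(w[0], 6) for w in weights])
print(f"sum of squared errors: {sse:.2e}")
```

Because the targets were generated without noise, the recovered weights match the made-up ones and the error is essentially zero. With real data the error would not vanish, and OLS would return the weights that make it as small as possible.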

Putting It All Together

People are often intimidated by the math in the field of data science. However, knowing the core concepts and principles behind the math makes it possible to explain the overall process to stakeholders. In our Python programming courses in NYC, students learn to bridge mathematical concepts and programming principles to leverage the power of Data Science. Though the concepts may seem difficult now, we will break down this engine piece by piece in our classes so that we can demystify the math and focus on the data and its outcomes.
