Inevitably, everyone who has any interest in Data Science faces a wall, and that wall’s name is Mathematics. As much as we would love to avoid the complexities of the math in data science, when a stakeholder asks, “Can you explain the answer to me?” the honest reply usually describes a whole pipeline of ideas that took data and performed mathematical calculations to arrive at the result. As important as it is to know the correct algorithm to do the work, it is equally important to conceptually understand the math behind the algorithm. We’ll look at some of the key mathematical concepts so that we can slowly recreate the engine and learn the mechanics of data science.
Probability is the first of the three core mathematical subjects to learn. Probability is the study of events and how likely they are to occur, and it is seen as a theoretical science because it accounts for every possible outcome of an event. For example, if a football game is played, a team can win, lose, or tie, and the probabilities of those three outcomes add up to 1.
People normally think of probability in terms of coin flips and dice rolls. However, the simplest data science example of why probability matters is detection. In data science, we often see a computer or device detect a fingerprint or a face, and a probability attached to that detection determines whether the match is accepted as true or rejected as false.
Another topic in probability is Bayes’ Theorem, which describes how prior knowledge about one event changes the probability of a related event. For example, if the computer already knows that a picture shows a dog, the chance that the picture shows a poodle is much higher than it would be for an arbitrary picture.
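To make the dog and poodle example concrete, here is a minimal sketch of Bayes’ Theorem in Python. The function name bayes_posterior and all three probabilities are hypothetical, chosen only for illustration and not taken from any real image dataset.

```python
def bayes_posterior(prior, likelihood, evidence):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence


# Hypothetical numbers, assumed only for this illustration.
p_poodle = 0.02            # P(poodle): share of all pictures that show a poodle
p_dog = 0.25               # P(dog): share of all pictures that show a dog
p_dog_given_poodle = 1.0   # P(dog | poodle): every poodle picture is a dog picture

p_poodle_given_dog = bayes_posterior(p_poodle, p_dog_given_poodle, p_dog)

print(f"P(poodle)       = {p_poodle:.2f}")
print(f"P(poodle | dog) = {p_poodle_given_dog:.2f}")
```

With these assumed inputs, knowing that the picture is a dog raises the probability of a poodle from 2 percent to 8 percent.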
Lastly, we will look at probability distributions, which map every possible outcome of an event to the chance that it will happen. We can then assign a numerical value to each outcome, like the sum of two dice or the final score of a sporting event. From this, we can obtain important summary values such as the expected value (the mean, or long-run average) and the variance. We can also look at ranges of values, such as a greater than 30 percent chance of rain, or in sports betting, a line, spread, or over-under. These end probabilities are exact, but to arrive at them from real data, we need to employ statistics.
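The sum of two dice is small enough to work out in full. The sketch below, which assumes two fair six-sided dice, builds the distribution with exact fractions and then computes the expected value, the variance, and the probability of a range of values. The variable names are our own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely rolls produce each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Map each possible sum (2 through 12) to its exact probability.
distribution = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

# Expected value: each outcome weighted by its probability.
expected_value = sum(total * p for total, p in distribution.items())

# Variance: the average squared distance from the expected value.
variance = sum((total - expected_value) ** 2 * p for total, p in distribution.items())

# Probability of a range of values: rolling a sum of 10 or more.
p_at_least_10 = sum(p for total, p in distribution.items() if total >= 10)

print(expected_value)  # 7
print(variance)        # 35/6
print(p_at_least_10)   # 1/6
```

Using Fraction instead of floating-point numbers keeps every probability exact, which matches the idea that these values follow from hard rules.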
Usually, we see Probability and Statistics lumped into one topic, and that’s because they share much of their terminology. With probability, we are working with hard rules with strict numerical results. But most of the time, we don’t have perfect information, and we need to draw conclusions about a larger population from a smaller dataset, known as sample data. Once we rely on sample data, we are working in the field of statistics, because the sample now speaks for all of the possible data as a whole. Going deeper into statistics allows us to evaluate how “good” the sample data may be.
We evaluate sample data through a process called hypothesis testing. This type of testing, which is based on the scientific method, is what makes data science a science. Learning statistics can demystify the “A/B” test, an experiment that companies use all the time before rolling out new features. We often see this when only a few people are shown a different version of Facebook or Amazon. That version is the B test to the original A, and if the response to B is better than the response to A by more than chance alone would explain, B will be adopted. Statistics lets us state how confident we are that B is truly different from A, an idea known as statistical significance. The more confident we are in that result, the more we can rely on our findings when estimating how all users will respond.
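One common way to check an A/B test is a two-proportion z-test. The sketch below is one possible version, not the method any particular company uses. The function name two_proportion_z_test, the visitor and conversion counts, and the 0.05 significance level are all assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 2000 visitors saw each version.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)

alpha = 0.05  # assumed significance level
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("B differs from A" if p_value < alpha else "No significant difference")
```

A small p-value means a gap this large would rarely appear by chance if A and B truly performed the same, which is the evidence needed before adopting B.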
Eventually, with statistics, we will build the first two fundamental building blocks of machine learning, which are linear and logistic regression. Linear regression predicts a numerical value, such as the price of a car. Logistic regression predicts a binary outcome, such as true/false, yes/no, or hot dog/not hot dog. Linear regression does this by finding the best-fit line, the line that minimizes the error between its predictions and the data. Logistic regression instead uses a curve known as the sigmoid function, which maps any number to a value between 0 and 1 and so softens a hard yes/no into a probability of yes.
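Both ideas fit in a few lines of Python. The sketch below fits a best-fit line to one feature using the least-squares formulas, then shows how the sigmoid function turns a raw score into a probability. The helper names fit_line and sigmoid are our own, and the car ages and prices are made-up data that happen to lie exactly on a line.

```python
from math import exp


def fit_line(xs, ys):
    """Return the least-squares (slope, intercept) for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))


# Hypothetical data: car age in years vs. price in thousands of dollars.
ages = [1, 2, 3, 4, 5]
prices = [27, 24, 21, 18, 15]

slope, intercept = fit_line(ages, prices)
predicted_price = intercept + slope * 6  # predict the price of a 6-year-old car
print(f"price = {intercept:.1f} + ({slope:.1f}) * age")
print(f"predicted price at age 6: {predicted_price:.1f}")

# The sigmoid softens a score into a probability of "yes".
for score in (-2, 0, 2):
    print(f"score {score:+d} -> probability of yes {sigmoid(score):.2f}")
```

A score of 0 maps to exactly 0.5, the point of maximum uncertainty, while larger positive or negative scores move the probability toward 1 or 0. A full logistic regression would also learn the weights that produce the score, which this sketch leaves out.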
Linear Algebra is the third of the major topics. Linear Algebra exists so that we can take multiple factors into consideration. Imagine we wanted to determine whether someone is good at basketball. We can look at the features of that individual, such as their height, weight, or free throw percentage. Usually, we would analyze one feature and hold everything else equal, such as only looking at height. However, if we want to look at everything as a whole, each of these features creates its own dimension, and with linear algebra, we can take all of the necessary features into account simultaneously and thus build a better model.
With linear algebra, we can arrange our data in a matrix whose rows are data points and whose columns are features. Our prediction model gives us a second matrix of weights, and multiplying the two through matrix multiplication produces a prediction, a process called a linear transformation. We then look at the difference between our prediction and real-world results, which is known as the error. We find the weights by using algorithms that minimize the error, such as ordinary least squares (OLS) for linear regression. The model often improves as we add new data or run the algorithm multiple times while we train the model. This is where the Machine Learning happens. This is why a phone asks for multiple thumb presses, or asks us to turn our face in a circle for FaceID. Once we have the model in place, the computer can evaluate new data, and thus the detection cycle is complete. Learn how to build these models in our Data Science Certificate.
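The sketch below walks through that cycle in plain Python: a data matrix, a weight vector, predictions from matrix multiplication, and an error that shrinks as the algorithm runs repeatedly. Note that it uses gradient descent rather than the closed-form OLS solution, because gradient descent makes the repeated passes visible. The helper names, the basketball data, the learning rate, and the number of passes are all assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def mean_squared_error(X, y, weights):
    """Average squared difference between predictions and real results."""
    predictions = matvec(X, weights)
    return sum((p - t) ** 2 for p, t in zip(predictions, y)) / len(y)


def gradient_step(X, y, weights, learning_rate):
    """Apply one gradient-descent update on the mean squared error."""
    n = len(y)
    errors = [p - t for p, t in zip(matvec(X, weights), y)]
    gradient = [
        2 / n * sum(err * row[j] for err, row in zip(errors, X))
        for j in range(len(weights))
    ]
    return [w - learning_rate * g for w, g in zip(weights, gradient)]


# Made-up data: each row is a player, and the columns are
# [constant 1 for the intercept, height scaled to 0-1, free throw rate 0-1].
X = [
    [1.0, 0.2, 0.60],
    [1.0, 0.5, 0.70],
    [1.0, 0.7, 0.85],
    [1.0, 0.9, 0.75],
]
y = [0.30, 0.50, 0.75, 0.80]  # made-up skill scores

weights = [0.0, 0.0, 0.0]
initial_error = mean_squared_error(X, y, weights)

# Each pass nudges the weights in the direction that lowers the error.
for _ in range(2000):
    weights = gradient_step(X, y, weights, learning_rate=0.1)

final_error = mean_squared_error(X, y, weights)
print(f"error before training: {initial_error:.4f}")
print(f"error after training:  {final_error:.4f}")

# Evaluate new data: a hypothetical player the model has not seen.
new_player = [[1.0, 0.6, 0.80]]
print(f"predicted score: {matvec(new_player, weights)[0]:.2f}")
```

In practice a library such as NumPy would handle the matrix arithmetic, but writing it by hand shows that a prediction is nothing more than each feature multiplied by its weight and added up.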
Putting it all together
People are often intimidated by the math in the field of data science. However, knowing the core concepts and principles behind the math makes it possible to explain the overall process to stakeholders. In our Python classes, students will learn to bridge mathematical concepts and programming principles to leverage the power of Data Science. Though the concepts may seem difficult now, we will break down this engine piece by piece in our classes so that we can demystify the math and focus on the data and its outcomes.