Demystifying Data Science
The term “data science” is used in a variety of fields and disciplines. But what exactly is it?
In simple terms, data science is exactly what it sounds like: the science of using data to obtain information and, ideally, to make decisions based on that information. In many industries, the goal of using data science is to make strategic decisions or to create innovative products that solve frustrating problems. After all, why collect data if you aren't going to do anything with it?
Let’s look at a simple example.
Predicting market trends is one of the most difficult things a financial analyst can do. After all, share prices can be volatile and the market can change rapidly based on a variety of economic and psychological factors.
What if we could use data science and machine learning to make more accurate predictions based on existing data and current market trends?
With Python, we can use linear regression to predict share prices for the next thirty days, and use libraries like Matplotlib to create visual representations of that data. We can pull historical S&P 500 data from a financial data API, import and manipulate this data, and make strategic decisions regarding share prices and other market trends.
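As a minimal sketch of this idea, the example below fits a linear regression to ninety days of synthetic closing prices and extrapolates thirty days ahead. The prices here are randomly generated stand-ins; in practice you would load real historical data from an API or a CSV file.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headlessly
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic closing prices for the last 90 trading days (an upward trend
# plus noise). Real analysis would substitute actual market data here.
rng = np.random.default_rng(0)
days = np.arange(90).reshape(-1, 1)
prices = 100 + 0.3 * days.ravel() + rng.normal(0, 2, 90)

# Fit a simple linear trend and extrapolate 30 days into the future.
model = LinearRegression().fit(days, prices)
future_days = np.arange(90, 120).reshape(-1, 1)
forecast = model.predict(future_days)

# Visualize history and forecast with Matplotlib.
plt.plot(days, prices, label="observed")
plt.plot(future_days, forecast, "--", label="30-day forecast")
plt.xlabel("Trading day")
plt.ylabel("Price ($)")
plt.legend()
plt.savefig("forecast.png")
```

A straight-line fit is, of course, far too simple for real markets; it is shown here only to make the workflow (load, fit, predict, visualize) concrete.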
Data science has the potential to revolutionize the way we do business. Whether we are trying to predict stock prices, analyze political data ahead of the next election, or look at years worth of health data to determine whether to adjust health insurance premiums, there is no question that data plays a crucial role in how we create, market, and deliver products.
In addition, analyzing massive amounts of data can help companies create innovative and purposeful marketing strategies. Marketing costs a lot of money, so it’s in a company’s best interest to make informed decisions about which products to market to whom, and where to invest their marketing budget. Data science takes the guesswork out of this.
What do data scientists do?
A skilled data scientist could develop algorithms that predict market trends based on a company’s prior performance in the stock market, helping financial analysts make strategic short-term and long-term decisions.
Likewise, a skilled data scientist could analyze millions or billions of pieces of data to help an insurance company make informed choices about spending, to identify potential instances of fraud, to help developers optimize the company’s user interface, to guide marketing decisions, and to identify company-wide opportunities for improvement.
Data scientists are skilled at statistical analysis, interpreting huge amounts of quantitative and qualitative data, creating machine learning tools for companies, and programming—usually in Python.
But why exactly is programming a necessary component of data science?
Data scientists need to obtain data from local files or from remote databases or they will have nothing to work with! Unlike the kinds of data analysis that we might do on a small scale (such as reviewing how much money a small business spends on X each year), which we can do easily with Excel, data scientists work with thousands or millions of pieces of data at a time.
This data comes from a variety of sources and databases, and must be aggregated, cleaned, and analyzed so that it can be used to make strategic decisions and accurate predictions. In these cases, it’s necessary to use some form of scientific computing to analyze and create visualizations of this data.
What’s the difference between data science and data analytics? What is machine learning?
Data science and data analytics are similar fields, but data science is actually an umbrella term that encompasses data analytics, machine learning, and several other data-related disciplines.
A data analyst performs basic data visualization and statistical analysis and draws conclusions from data sets. A data scientist goes a few steps further, handling complex data visualization and modeling, data cleaning, and extensive analysis.
Machine learning is an important component of data science. With machine learning, algorithms are used to analyze data and to predict market trends. A data scientist is generally skilled with both machine learning and data analytics. A machine learning expert is generally skilled at data science, but might have a special aptitude for probability, statistics, and programming in multiple languages.
Overall, data science, machine learning, and data analytics require many of the same skills, but the practical application of each specialty differs a bit.
Languages Used For Data Science
Python is the most commonly used programming language in data science—with almost 70% of data scientists reporting that they use it—for a few reasons:
- Python is a general-purpose programming language.
- Python is easier to read and write than most other general-purpose languages, especially for analytical computing and quantitative data analysis.
- Python is open source.
- Python has numerous libraries and packages designed specifically for data science.
It surpassed R for the number one spot and has maintained this position due to its ease of use, powerful libraries and packages, clear and user-friendly documentation, and abundant community support.
Data scientists are already handling complex analysis of data, so they don’t need their programming language to be complicated, too. Python is known for its simple syntax and ease of use—even for beginners.
While some other languages (like Ruby) have clean and simple syntaxes, they don’t offer the same variety of scientific computing and machine learning libraries as Python.
Our Python classes and bootcamps will help you get started with programming fundamentals and data science projects.
Approximately 40% of data scientists report using SQL, too.
SQL is a non-procedural language; unlike a general-purpose language, it cannot be used to write entire applications. Instead, SQL facilitates communication with databases, allowing data scientists to access, edit, and reorganize data. Since SQL is rarely used on its own, and many Python packages and libraries work with it, Python and SQL are often used together.
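Here is a minimal sketch of SQL and Python working together, using Python's built-in sqlite3 module and an in-memory database. The "trades" table and its columns are made up for illustration; the key point is that SQL does the querying and aggregation while Pandas receives the result as a DataFrame.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; "trades" is a hypothetical example table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAPL", 150.0), ("AAPL", 152.5), ("MSFT", 250.0)])
conn.commit()

# SQL selects and aggregates; Pandas turns the result into a DataFrame.
df = pd.read_sql_query(
    "SELECT symbol, AVG(price) AS avg_price FROM trades GROUP BY symbol",
    conn,
)
print(df)
conn.close()
```

The same pattern scales to production databases: swap the sqlite3 connection for one pointing at PostgreSQL, MySQL, or another engine, and the Pandas side stays the same.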
For statistical analysis, R has a set of tools that is unmatched in depth and sophistication. Statisticians and other academics have been contributing to R for over 25 years, guaranteeing that for any statistical technique you can think of there will be a high-quality tool ready and waiting. If you have a background in statistics, you may find R easier to use because the terminology will be consistent with your training. However, since R is a specialized tool for statistical analysis, you may want to consider a more general-purpose language like Python if your interests are broader.
What Are Libraries and Packages?
One thing that sets Python apart from other general-purpose programming languages is the sheer quantity of libraries and packages available for data science. There are thousands of libraries in the Python Package Index, which makes Python a desirable choice for data scientists.
Some of the most useful libraries and packages are Pandas, NumPy, Matplotlib, and scikit-learn.
NumPy is a powerful numerical computing package for Python, built around fast multi-dimensional arrays and linear algebra routines. Many other libraries (Pandas, Matplotlib, and scikit-learn, for example) depend on NumPy. NumPy has extensive documentation and can be installed quickly and easily.
NumPy works with multi-dimensional arrays in Python. Lists can be converted into arrays, random arrays can be generated, and numerous operations can be performed on these arrays. This is a crucial feature because element-wise arithmetic (addition, subtraction, multiplication, and division, for example) is not supported on standalone Python lists, but works naturally on NumPy arrays. Since data scientists often need to perform operations on entire data sets at once, NumPy is an invaluable tool.
NumPy allows you to find the min, max, standard deviation, and variance on an array. It allows you to combine different arrays to form a single array.
Overall, NumPy arrays are faster, easier to use, and use less memory than Python lists. When working with massive data sets, convenience and ease of use are two big selling points.
Because arbitrary data types can be defined with NumPy, the package is able to connect with a variety of different databases. This adds to its versatility and makes it an important component of any data scientist’s technical repertoire.
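A short sketch ties these NumPy features together: converting a list into an array, doing element-wise arithmetic, computing summary statistics, and combining arrays. The price values are made up for illustration.

```python
import numpy as np

# Element-wise arithmetic fails on plain lists but works on arrays.
prices = [10.0, 12.5, 9.75, 11.0]   # a plain Python list
arr = np.array(prices)              # convert the list into an array

doubled = arr * 2                   # multiplies every element by 2
gains = arr - arr.min()             # subtracts a scalar from every element

# Summary statistics in single calls.
print(arr.min(), arr.max(), arr.std(), arr.var())

# Combine two arrays into a single array.
combined = np.concatenate([arr, doubled])
```

Note the contrast with plain lists: `prices * 2` would repeat the list rather than doubling its values, which is exactly why arrays are the right tool for numerical work.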
Pandas is an open-source library that provides high-performance, user-friendly data analysis tools for Python. It is one of the most popular libraries and, as such, has excellent documentation.
Pandas essentially takes data (from a CSV file or a SQL database, for example) and creates a Python object called a DataFrame. A DataFrame organizes data in a format that resembles a table, so it is easy to read, easy to analyze, and easy to work with.
Pandas depends on NumPy, and can optionally be used with Matplotlib for data plotting and visualization. It can be installed on its own, or through a distribution like Anaconda, which installs all required dependencies.
Pandas is usually used in one of three ways:
- To convert a list, dictionary, or array into a data frame
- To open a local CSV file or a related data file
- To open a remote file (CSV, JSON, SQL database, etc.)
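The three approaches above look like this in practice. Only the first line actually runs here (the file paths in the commented lines are hypothetical placeholders):

```python
import pandas as pd

# 1. Convert a dictionary (or list, or array) into a DataFrame.
df = pd.DataFrame({"product": ["A", "B", "C"], "units": [30, 45, 12]})

# 2. Open a local CSV file (hypothetical path):
# df = pd.read_csv("sales.csv")

# 3. Open a remote file or database (hypothetical URL):
# df = pd.read_json("https://example.com/sales.json")

print(df.head())
```

Whichever route the data takes in, the result is the same table-like DataFrame, so the analysis code that follows does not need to care where the data came from.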
After opening the file that you'd like to work with, you can use a number of different commands to analyze the data. You can perform statistical analysis (mean, median, standard deviation, and so on), retrieve specific data points, and filter, sort, or group data as you see fit.
Another important feature is the ability to clean data by checking for null values within the data set. It is difficult to work with data that has not been cleaned; unintentional null values within data sets can skew your results or make the results difficult to analyze. Pandas addresses this concern by identifying pieces of data that might be missing, incomplete, or otherwise incorrect so that you can get the most accurate results from your analysis.
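A minimal cleaning sketch, using a tiny made-up data set with deliberately missing values, shows the core tools: counting nulls, dropping incomplete rows, and filling gaps with a column statistic.

```python
import numpy as np
import pandas as pd

# A small DataFrame with deliberately missing values (synthetic data).
df = pd.DataFrame({"age": [34, np.nan, 29],
                   "income": [52000, 61000, np.nan]})

print(df.isnull().sum())        # count missing values in each column
clean = df.dropna()             # drop any row containing a null
filled = df.fillna(df.mean())   # or fill nulls with each column's mean
```

Whether to drop or fill depends on the analysis: dropping discards information, while filling introduces an assumption, so the right choice varies from data set to data set.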
Matplotlib is another popular library that allows data scientists to visualize data. Data visualization is a crucial step in making data accessible. It allows you to identify outliers and patterns quickly, while making data interpretation easier overall. People generally absorb visual representations of data far more readily than raw numbers, making Matplotlib an invaluable resource in data science.
Matplotlib is free, easy to install, and has robust features. Data can be rendered as a histogram, a pie chart, a line graph, a box plot, and so on. There are enough features to satisfy advanced users, but even entry-level users can create powerful visualizations of data.
Consider an enormous data set that encompasses countless data points over a long period of time. While this data can be displayed in an array or in another numerical format, it would take a while to read and analyze. There is a potential for human error when manually reading and interpreting massive lists of data. Naturally, human error is something that data scientists try to avoid.
Matplotlib allows you to choose the specific data that you’d like to work with and arrange it in any visual format that you can imagine. Data can be rendered and displayed in almost any format with a few quick commands. Because Matplotlib is so easy to use and works seamlessly with other Python libraries and packages, it is a top choice for data scientists who use Python.
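A short sketch shows several of the chart types mentioned above rendered from one small made-up data set, side by side in a single figure:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display window needed
import matplotlib.pyplot as plt

# Illustrative values only.
values = [23, 45, 12, 67, 34]
labels = ["A", "B", "C", "D", "E"]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(labels, values)           # bar chart
axes[1].pie(values, labels=labels)    # pie chart
axes[2].plot(values, marker="o")      # line graph
fig.savefig("overview.png")
```

Swapping one chart type for another is a one-line change, which is what makes exploring different views of the same data so quick.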
Many data scientists begin their analysis and evaluation of data with Pandas before moving over to scikit-learn for machine learning. scikit-learn is a machine learning library for Python that works with NumPy arrays and focuses on modeling data rather than loading or manipulating it (NumPy and Pandas handle that).
Its modeling options include clustering, classification, regression, parameter tuning, and cross-validation. scikit-learn also comes with standard sample data sets (for practicing classification and regression, for example), and techniques like linear regression can be used to make predictions from input data.
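A minimal modeling sketch using one of scikit-learn's bundled sample data sets (the diabetes regression set) illustrates the core loop: split the data, fit a model, score it on held-out data, and cross-validate.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# One of scikit-learn's bundled sample data sets (a regression problem).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training portion, then score on data the model never saw.
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))

# Cross-validation: score the model across five different train/test folds.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("Cross-validated R^2 scores:", scores)
```

The same four steps apply whether the estimator is a linear regression, a clustering algorithm, or a more complex model, which is much of what makes scikit-learn's API so approachable.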
Other Libraries and Packages
These libraries and packages, among others, are one of the main reasons that Python is so popular in data science. The options to import, manipulate, operate on, clean up, visualize, and model data are unmatched by any other programming language’s libraries.
In our data science courses, we cover Python in depth, and we home in on NumPy, Pandas, Matplotlib, and scikit-learn to help you make the most of your data.
What is big data? What is small data?
In data science, big data is often defined by the three Vs: volume, variety, and velocity. Big data is tremendous in volume, full of a variety of different data types, and generated at high velocity, meaning it arrives continuously and must often be processed quickly. Big data is rich with insights, but those insights are not always accessible due to the sheer amount of analysis required to extract usable information from the data set.
Small data is exactly what it sounds like: a smaller subset of useful data, often derived from big data. Small data is organized and visualized in a way that is accessible and understandable so that we can make the most of our analysis. As a rule of thumb, anything that can be processed in Excel is small data.
There is a place for both big and small data in data science. It takes time and skill to process big data and to derive something meaningful from it. Fortunately, there are enough powerful Python libraries to make this possible.
Tools like Hadoop make it even easier to handle big data. Hadoop is an open-source framework that allows data scientists to process huge amounts of data across numerous computers. It also allows data scientists to store massive amounts of unstructured data, such as videos, images, and text files, even if the data is not being used right away.
Companies like eBay, Facebook, and Twitter use Hadoop to optimize their search engines and to store copies of internal log files. LinkedIn uses it for its "People You May Know" feature. All of these companies process and analyze massive amounts of big data on a daily basis.
Nonetheless, it’s crucial to begin with the end in mind; companies must decide what they need from the data before they decide how much of it to analyze.
Sometimes, visualizing and modeling small data provides a company with all of the insights that it needs to make data-driven decisions, but the types of decisions and predictions that can be made with small data are limited and short-term.
The Growing Job Market for Data Scientists
A 2017 CareerCast study noted that data science was a relatively new career path at the time, with job growth predicted to be the highest of any job in the United States. The study concluded that because the field was so new, data science jobs were some of the hardest to fill. Nonetheless, it predicted massive job growth.
As we fast forward to 2019, it appears that this prediction has come true.
According to Indeed, data science is still a fast-growing field. On average, data scientists are making $120,000-$140,000 per year, with some earning over $200,000 annually. As of March 2019, there were over 90,000 vacancies for entry-level and mid-level data scientists on Indeed alone. There is no shortage of opportunities for aspiring data scientists.
Data scientists work for health insurance companies, pharmaceutical companies, manufacturers, credit card companies, start-ups, retail stores, tech giants, and more, across a wide variety of industries. Every industry has a growing need for data scientists, which makes this a remarkably versatile career option.
In Noble Desktop’s data science bootcamp, we jump into hands-on data science practices right away. Students hit the ground running with Python fundamentals, including the analysis of real-world data sets using loops, functions, and objects.
As the course progresses, students will learn to work with tabular data (such as that found in CSV files or databases). They’ll learn to combine and aggregate data, using packages and libraries to create visualizations and to do advanced computing.
By the end of our Python for Data Science bootcamp, students will be able to extract information from complex data sets and use it to make predictions and data-driven decisions with NumPy, Pandas, Matplotlib, and scikit-learn. From there, students will be able to explore a variety of other Python libraries and packages. The possibilities are endless.