This Is The Data Science Process All Data Scientists Use
Nowadays, data science is no longer restricted to the tech giants; it is used by organizations across many industries, from FMCG to financial services. Although the exact data science process varies depending on the specific use case, there is a common framework for delivering data science solutions: the Cross-Industry Standard Process for Data Mining (CRISP-DM). The CRISP-DM framework is widely regarded as the industry standard for developing predictive analytics solutions.
CRISP-DM consists of six major phases:
- Developing business understanding
- Gaining data understanding
- Conducting data preparation
- Doing predictive modeling
- Evaluating the results against your business problem/question of interest
- Deploying your solution/changes
First phase (business understanding): The first step in CRISP-DM is understanding which problem is valuable to the business and asking the right question. For example, a startup in hypergrowth mode would be interested in new user acquisition; in this context, the data science question could be: what is the most effective way to acquire new customers? In healthcare, a pharmaceutical company would probably be interested in studying whether a new drug can cure a certain disease; in this context, the question could be: is the new treatment significantly better than a control treatment?
Second phase (data understanding): Businesses tend to collect as much data as their resources allow. This often results in huge amounts of data that need to be processed and analyzed; an exploratory data analysis helps build data understanding. However, if the data are not yet available, you need to plan a data acquisition effort designed to gather a dataset that can answer the relevant business questions.
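To make the data understanding step concrete, here is a minimal exploratory sketch in Python with pandas, using a small made-up sign-up table (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical dataset: a small table of new-user sign-ups
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "channel": ["ads", "referral", "ads", "organic"],
    "spend": [12.5, None, 30.0, 8.0],
})

# Quick structural overview: shape, column types, missing counts
print(df.shape)
print(df.dtypes)
print(df.isna().sum())  # "spend" has one missing value

# Simple summary statistic per acquisition channel
print(df.groupby("channel")["spend"].mean())
```

A few lines like these (`shape`, `dtypes`, `isna().sum()`, grouped summaries) typically reveal the size, structure, and data-quality issues you will face in the next phase.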
Third phase (data preparation): Even when the data have already been gathered for you, data preparation (a.k.a. data wrangling or munging) is often still the most tedious part of the whole data science process. It involves a lot of programming and scripting for data cleaning: selecting the right variables/columns, generating plots and charts to inspect the data, handling missing values, and so on. You may also need to join multiple data tables and match columns against their schema/description.
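Joining tables is one of the most common data preparation chores. A minimal sketch with pandas, assuming two hypothetical tables that share a key column:

```python
import pandas as pd

# Hypothetical tables: orders, plus a separate customer lookup table
orders = pd.DataFrame({
    "order_id": [101, 102, 103],
    "customer_id": [1, 2, 2],
    "amount": [20.0, 35.0, 15.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "segment": ["new", "returning"],
})

# Left join on the shared key: every order keeps its row,
# enriched with the matching customer attributes
merged = orders.merge(customers, on="customer_id", how="left")
print(merged[["order_id", "segment", "amount"]])
```

A left join is usually the safe default here: it preserves all rows of the primary table even when the lookup table has no match.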
Datasets often contain missing values, encoded in different formats (e.g. NA, blanks, zeros). This is probably the most frequently encountered artifact when working with a dataset, and while there are no hard and fast rules, the following pointers should help:
- Missing values in the data should be handled differently depending on the business requirement.
- Missing values should be handled before fitting a predictive model.
- Three common ways to handle missing values are: removing them, imputing them, or working around them.
- Removing data points with missing values can bias the predictive model in the next phase, so I would rather engineer some new features around these missing values when it makes sense to do so (e.g. a count of missing values for every observation).
- In some cases, it would be OK to remove the missing values: when the values are due to technical/measurement errors (& imputation might potentially lead to misleading values) or when the missing values are simply not needed (& they are only a small fraction of your data).
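The three options above can be sketched in a few lines of pandas, on a small made-up table (the columns and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in the "income" column
df = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "income": [50000.0, np.nan, 62000.0, np.nan],
})

# Option 1: remove rows that contain missing values
dropped = df.dropna()

# Option 2: impute with a simple statistic (here, the median)
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Option 3: keep the rows, but engineer a feature that counts
# missing values per observation, so the model can use missingness
flagged = df.copy()
flagged["n_missing"] = flagged.isna().sum(axis=1)

print(len(dropped), imputed["income"].isna().sum(), flagged["n_missing"].tolist())
```

Which option is right depends on the business requirement and on why the values are missing, as the pointers above explain.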
Fourth phase (predictive modeling): Predictive models are often built with a machine learning library (e.g. scikit-learn in Python). In a supervised learning problem, we are interested in predicting a specific response given some known factors. For example, a common problem in the recruitment industry is to predict a candidate's expected salary given their background and qualifications; here the response is the expected salary.
The process of creating a predictive model from data is called model training or model fitting. The portion of your dataset that is used for model fitting is called the training data, and the other portion that is used for evaluating how well your model performs is called the test data.
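The train/test workflow above can be sketched with scikit-learn. Since the original salary data is not available, this example uses a synthetic regression dataset as a stand-in:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a salary-prediction dataset:
# X holds candidate features, y the continuous response
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Hold out a portion of the data as test data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Model fitting (training) happens on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on data the model has never seen
print(model.score(X_test, y_test))  # R^2 on the test data
```

Keeping the test data out of the fitting step is what makes the evaluation in the next phase an honest estimate of real-world performance.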
Fifth phase (evaluation): If your model performs well on the training data but poorly on the test data, it is overfitting. Overfitting is a situation in which the model doesn't generalize well to new instances it has never seen before. Ideally, your model should perform well on both the training and the test data.
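A simple way to see overfitting is to score the same model on both splits. In this sketch (synthetic data, for illustration), an unconstrained decision tree memorizes the training set while scoring noticeably lower on the test set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, an illustration rather than a real dataset
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can grow until it fits the training data exactly
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # near-perfect: the tree memorized the training data
test_acc = tree.score(X_test, y_test)     # lower: the model does not generalize as well
print(train_acc, test_acc)
```

The gap between the two scores is the warning sign; limiting the tree's depth (or choosing a simpler model) typically narrows it.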
Sixth phase (deployment): Once the predictive model is ready to be used, it can be deployed (a.k.a. productionized). A very common deployment is having the model run in real time (or on a fixed schedule) to automate certain tasks (e.g. recommending products to buy or tagging uploaded photos). Another deployment scenario, often encountered in advanced analytics settings, is to carve actionable insights out of your findings.
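A common first step toward deployment is serializing the fitted model so a separate serving process can load it later. A minimal sketch with Python's built-in pickle module (synthetic training data, for illustration; in practice scikit-learn users often reach for joblib instead):

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small model on synthetic data (illustration only)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Serialize the fitted model so a production service can load it later
blob = pickle.dumps(model)

# ...later, in the serving process: restore the model and predict
restored = pickle.loads(blob)
print(restored.predict(X[:3]))
```

The restored model produces exactly the same predictions as the original, which is the property a batch job or real-time service relies on.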