Data Science: Looking for the Value in the Data?
Data science has been enriched by machine learning and technological advances. It is revolutionizing many industries, from banking to manufacturing, as well as many Internet services, such as social media. Typical problems include identifying credit card fraud, tagging people in photos, increasing e-commerce sales, and recommending products, among many others. And these are only business problems; the applications in research are vast.
The first question we might ask ourselves is: what does a data scientist do? A data scientist finds the value hidden in the data, and how that happens can be seen in the data science process (Figure 1).
Figure 1
We can clearly see how, starting from raw data, data science is responsible for analysing the data and generating solutions that add value to decision-making, and consequently for finding the hidden treasure in the data. In this process, those of us who work with data face several questions: What is the problem, or the right question, to answer with the analysis? How do I approach the problem? Will the result delivered by the model influence the decision to be made? Is the investment in the analysis justified?
As Mike Gualtieri from Forrester Research expressed:
"If analysis doesn't lead to more informed decisions and more effective actions, why do it?"
Why a methodology to tackle problems in data science?
To tackle data science problems, it is best to have a recipe that increases the probability of success of our foray. We can define a methodology as an iterative system of methods that guides data scientists toward the ideal approach to solving problems with data science, through a prescribed sequence of steps. In other words, it gives us an organized way to define the problem correctly and deliver a valuable answer using data science.
The scientific method has been a guide to addressing scientific questions since the 1700s. It is an iterative method for standardizing the process of conducting experiments, so that experiments produce reproducible, more valuable and more reliable results. Likewise, the methodology we use in data science should guide us to valuable knowledge that contributes to decision-making and to the scenario predictions that executives expect, and should therefore justify the investment.
There are several methodologies for data mining; however, CRISP-DM is the most frequently used, as shown in Graph 1.
Graph 1
What is CRISP-DM?
CRISP-DM is a methodology created in 1996 to guide the development of data mining projects. The name is an acronym for Cross-Industry Standard Process for Data Mining. It consists of six stages, which can be iterated in cycles according to the needs of data scientists and developers. The stages are: Business understanding, Data understanding, Data preparation, Modeling, Evaluation, and Deployment (implementation). See Figure 2.
Figure 2 gives us a summary of the stages and interactions that exist in the methodology.
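To make the idea of stages with cycle iterations more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names Stage, ALLOWED_MOVES and is_valid_path are hypothetical, the final stage is called DEPLOYMENT (the name used in the CRISP-DM 1.0 guide for the implementation stage), and the allowed moves between stages are assumed from the commonly published CRISP-DM diagram, not taken from Figure 2 itself.

```python
from enum import Enum


class Stage(Enum):
    """The six CRISP-DM stages, in their nominal order."""
    BUSINESS_UNDERSTANDING = "Business understanding"
    DATA_UNDERSTANDING = "Data understanding"
    DATA_PREPARATION = "Data preparation"
    MODELING = "Modeling"
    EVALUATION = "Evaluation"
    DEPLOYMENT = "Deployment"


# Assumption: moves allowed between stages, following the usual CRISP-DM
# diagram (forward steps plus the feedback loops that make it iterative).
ALLOWED_MOVES = {
    Stage.BUSINESS_UNDERSTANDING: {Stage.DATA_UNDERSTANDING},
    Stage.DATA_UNDERSTANDING: {Stage.BUSINESS_UNDERSTANDING, Stage.DATA_PREPARATION},
    Stage.DATA_PREPARATION: {Stage.MODELING},
    Stage.MODELING: {Stage.DATA_PREPARATION, Stage.EVALUATION},
    Stage.EVALUATION: {Stage.BUSINESS_UNDERSTANDING, Stage.DEPLOYMENT},
    Stage.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of stages is an allowed move."""
    return all(nxt in ALLOWED_MOVES[cur] for cur, nxt in zip(path, path[1:]))


# A project that loops once between data preparation and modeling.
iterated_path = [
    Stage.BUSINESS_UNDERSTANDING,
    Stage.DATA_UNDERSTANDING,
    Stage.DATA_PREPARATION,
    Stage.MODELING,
    Stage.DATA_PREPARATION,
    Stage.MODELING,
    Stage.EVALUATION,
    Stage.DEPLOYMENT,
]

# A project that jumps straight from the business question to modeling.
shortcut_path = [Stage.BUSINESS_UNDERSTANDING, Stage.MODELING]

print(is_valid_path(iterated_path))
print(is_valid_path(shortcut_path))
```

In this sketch, going back from modeling to data preparation counts as a normal part of the process, while jumping from business understanding directly to modeling is flagged as a path the methodology does not describe.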
Why CRISP-DM?
CRISP-DM was not developed in an academic or theoretical way from technical principles, nor by elite committees of gurus working behind closed doors. CRISP-DM is successful because it is based entirely on practical, real-world experience of how people conduct data mining projects (The CRISP-DM consortium).
The CRISP-DM methodology is intuitive, flexible and simple. There have even been experiments in which students were given a data science challenge without a methodology to address it, and they naturally tended toward CRISP-DM, identifying its stages and interactions.
Teams using CRISP-DM generally perform better than teams that use other methodologies to address problems.
The Business Understanding stage is useful for aligning technical work with user needs and for data scientists to have a proper understanding of the objectives of the problem to be addressed.
The deployment (implementation) stage also addresses important considerations for closing the project and transitioning to solution maintenance and operations.
The flexible and cyclical nature of CRISP-DM can provide many of the benefits of Agile. By accepting that a project begins with important unknowns, the user can walk through the steps, each time gaining a deeper understanding of the data and the problem. The empirical knowledge learned from previous cycles can feed into subsequent cycles.
The following table presents the stages of the methodology with the generic tasks (in bold) and the outputs (in italics) that we should obtain from each one. The table provides a complete, summarized view of the CRISP-DM methodology.
We now have a guide for developing a data science project. However, it often happens that, in order to move forward, some teams execute some stages only partially. The result is faster progress, but at a very high risk and cost for the project.
The next article will cover bad practices when using CRISP-DM.
If you want to tackle a data science project, at Datarunner we support you.
Sources:
CRISP-DM 1.0, Step-by-step data mining guide.