DATA MINING PROCESS
The recent literature offers several useful definitions of the concept of data mining, some of which are below.
“Data Mining is the efficient discovery of valuable, non-obvious information from a large collection of data. (...) The idea is that the raw material is the business data, and the data mining algorithm is the excavator, sifting through the vast quantity of raw data looking for the valuable nuggets of business information.” - Joseph P. Bigus
"Data Mining is an inductive learning strategy that builds models to identify hidden patterns in data. A model created by a Data Mining algorithm is a conceptual generalization of data. This generalization can take the form of a tree, a network, an equation, or a set of rules." - R. J. Roiger, M. W. Geatz
"The process of exploring and analyzing, in a totally or partially automated way, a large amount of data in order to identify significant patterns and rules not known prior." - M.J.A. Berry, G.S. Linoff
"Data Mining is a process of discovering new and significant correlations, relationships and trends, sifting through large amounts of data stored in repositories, using reporting techniques and statistical and mathematical techniques." - Gartner Group
“The promise of Data Mining is to find the interesting patterns lurking in all these billions and trillions of bytes. Merely finding patterns is not enough. You must respond to the patterns and act on them, ultimately turning data into information, information into action, and action into value.” - M.J.A. Berry, G.S. Linoff
The term data mining is based on an analogy with the work of miners, who dig through large quantities of low-value material in order to find gold. In our field, the gold is previously unknown information, the low-value material is the data, and the excavation is nothing more than the set of data exploration techniques. It is important to note that data mining is not tied to specific techniques; very often, the best results are obtained by combining a number of distinct techniques.
The most important thing to keep in mind on this topic is that data mining does not make decisions. It provides decision-makers (those who are called upon to make decisions on the subject studied through data mining) with the information necessary to deal with the difficulties of competitive markets, or more generally with the uncertainty of real life. Therefore, the real critical success factors of a mining project are knowledge of the topic under analysis and the experience accumulated over time by the person who will make the decisions. This knowledge, along with useful information from mining analytics, creates a strong synergy that leads to good and fast decisions.
Therefore we can say that the hallmarks of data mining are:
- Analyze large amounts of data within which the Data Miner will find something interesting; it is crucial for the success of the model to check that the results obtained are not wrong;
- The new information extrapolated from the data must bring an advantage to the business;
- The goal for the Data Miner is to find something that is not intuitive; the more the information deviates from the obvious, the greater its potential value.
To achieve this value, the Data Miner must follow a number of steps ranging from setting goals to assessing results.
There are many variations in the literature for the schematization of the data mining process, but the most important thing to note is that although the number of stages can vary greatly, the fundamental concepts differ only slightly.
The model presented here is the Cross-Industry Standard Process for Data Mining (CRISP-DM). It originated as a project funded by the European Commission, aimed at defining a standard approach to data mining projects regardless of business type. CRISP-DM divides the life cycle of a data mining project into six main phases, but their sequence is not rigid; in fact, it is often necessary to go back and forth between the different stages.
Cross-Industry Standard Process for Data Mining CRISP-DM.
The arrows in the diagram indicate the most important and frequent dependencies between phases, while the outer circle of the diagram symbolizes the iterative nature of the mining process. The process almost always continues even after a solution has been deployed, so that future iterations can benefit from previous ones and find better and better solutions.
Step 1 – Understanding the business.
The objective of the first phase is to identify the goals and clearly define what needs to be accomplished. To do this, it is important to understand what data will be needed to complete the analysis and where to find it.
At the beginning of this phase, the cost of the project and the revenue expected from the analysis should be estimated. In other words, we should consider how much this information is worth in monetary terms, given that for a project to be worthwhile, revenues must exceed costs. However, this aspect will not be covered here, as this analysis is carried out for educational purposes only.
Step 2 – Understanding the data.
The second step is to collect the data for use in the analysis process, to understand the variables available and to create new variables that might be useful in achieving the desired results.
There are two types of data:
- Qualitative: they take on discrete values that cannot be ordered; they serve to identify one category and distinguish it from others. A special case of this type is the dichotomous (or dummy) variable, which takes only two values, usually 0 and 1, indicating respectively the absence or presence of the characteristic the variable represents.
- Quantitative: they take on countable discrete values or continuous values. These values have full numerical meaning and can therefore be ordered.
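As an illustration of dummy variables, the following minimal sketch turns one qualitative variable into a set of 0/1 columns, one per category. The function name `one_hot` and the sample values are assumptions made for this example; they do not come from a specific library.

```python
def one_hot(values):
    """Encode a qualitative variable as 0/1 dummy columns, one per category.

    Hypothetical helper written for illustration only.
    """
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


colors = ["red", "blue", "red", "green"]
dummies = one_hot(colors)
print(dummies["red"])  # [1, 0, 1, 0]
```

Each resulting column answers a yes/no question ("is this observation red?"), which is exactly the absence/presence meaning described above.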
Step 3 – Prepare your data.
Before the statistical analysis is carried out and useful information is extracted, it is necessary to check the available data thoroughly, to verify that they have the characteristics needed for subsequent processing. These operations, called pre-processing, sit upstream of the statistical analysis, which means you need to perform them before you build the actual model and feed the data into it. In short, data cleaning is a process that can guarantee, within a certain threshold of reliability, the correctness of a data set.
The aim of this phase, therefore, is to ensure the quality of the chosen data, and to do so it is necessary to treat anomalous and missing data.
The anomalous data are those that differ significantly from the rest. By definition, they are observations that, being atypical or erroneous, differ markedly from the behaviour of the other data with respect to the type of analysis considered. This definition matters because it emphasizes the type of analysis being carried out: values often do not look anomalous when the variables are examined individually, but become so when the variables are considered together.
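One simple way to flag values that differ significantly from the rest is a z-score check on a single variable. The sketch below is only an assumption about how such a check might look: the threshold of 2 standard deviations and the function name `flag_outliers` are illustrative choices, and a univariate check like this will not catch values that are anomalous only when several variables are considered together.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations (illustrative rule)."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        # All values are identical, so nothing stands out.
        return []
    return [v for v in values if abs(v - center) / spread > threshold]


measurements = [10, 12, 11, 13, 12, 11, 50]
print(flag_outliers(measurements))  # [50]
```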
In this context, it is also important to locate duplicate (or, more precisely, redundant) data, which can easily be discovered through correlation analysis. Such redundancies could jeopardize the results of the model in unpredictable ways. Because databases are generally very large, summary statistics such as the mean and variance are used to obtain the information needed to evaluate the variables themselves.
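A minimal sketch of the correlation analysis mentioned above: it computes the Pearson correlation between every pair of numeric variables and reports the pairs that are almost perfectly correlated. The 0.95 threshold, the function names and the sample columns are assumptions made for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.95):
    """List pairs of variables whose absolute correlation reaches the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                pairs.append((first, second))
    return pairs


columns = {
    "price_eur": [10.0, 20.0, 30.0, 40.0],
    "price_cents": [1000.0, 2000.0, 3000.0, 4000.0],
    "quantity": [3.0, 1.0, 4.0, 2.0],
}
print(redundant_pairs(columns))  # [('price_eur', 'price_cents')]
```

Here the two price columns carry the same information in different units, so one of them could be dropped before modeling.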
Finally, missing values represent lost information and may be caused by:
- malfunctions in data collection systems,
- inconsistency with values of other attributes of the dataset (for example, when the same field has different values in different tables, very often this is caused by updates made incorrectly),
- the data have not been entered due to misunderstandings,
- some data may not be considered important at the time of insertion,
- failure to record changes in data.
Missing data can be handled in several ways, such as deleting the records that contain missing data, or replacing the missing values with the class average or with values observed for similar observations. It is important to consider two things before doing anything about this data:
- replacing missing values with estimates can have a significant impact on the results, and
- you need to carefully consider the techniques you want to use, as some of them are able to recognize and handle missing data, while others require that all values be present.
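The two strategies mentioned above (deleting incomplete records, or replacing a missing value with the class average) can be sketched as follows. The record layout, the field names `segment` and `income`, and the use of `None` to mark a missing value are all assumptions made for this illustration.

```python
def drop_missing(records, field):
    """Strategy 1: discard the records where `field` is missing (None)."""
    return [r for r in records if r[field] is not None]


def impute_class_mean(records, field, class_field):
    """Strategy 2: replace a missing `field` with the mean of the
    observed values in the same class. Returns new records."""
    observed = {}
    for r in records:
        if r[field] is not None:
            observed.setdefault(r[class_field], []).append(r[field])
    means = {cls: sum(vals) / len(vals) for cls, vals in observed.items()}

    result = []
    for r in records:
        filled = dict(r)  # copy, so the original data stays untouched
        if filled[field] is None:
            # Stays None when the class has no observed value at all.
            filled[field] = means.get(filled[class_field])
        result.append(filled)
    return result


records = [
    {"segment": "A", "income": 30.0},
    {"segment": "A", "income": None},
    {"segment": "A", "income": 50.0},
    {"segment": "B", "income": 20.0},
]
print(len(drop_missing(records, "income")))                      # 3
print(impute_class_mean(records, "income", "segment")[1]["income"])  # 40.0
```

Deletion loses a whole record for one missing field, while imputation keeps the record but introduces an estimated value; which trade-off is acceptable depends on the analysis.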
Once all this is done, you move on to data transformation: depending on the technique used, the variables may need to be reworked to meet certain requirements of the model. By the term "transformation of a variable" we mean the derivation of new variables through the application of functions to the original ones.
A very common transformation is normalization, which modifies values so that they all fall within a given range; in other words, the values are "scaled" to defined ranges. This allows you to compare different distributions.
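One widely used form of normalization is min-max scaling, which maps the smallest value to the lower bound of the target range and the largest to the upper bound. The sketch below assumes a default target range of 0 to 1; the function name is illustrative.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so they fall within [new_min, new_max]."""
    low, high = min(values), max(values)
    if high == low:
        # No spread in the data: map every value to the lower bound.
        return [new_min for _ in values]
    span = high - low
    return [new_min + (v - low) * (new_max - new_min) / span for v in values]


print(min_max_scale([10, 20, 30]))  # [0.0, 0.5, 1.0]
```

After scaling, two variables measured in very different units (for example age in years and income in euros) share the same range and can be compared directly.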
Another very important aspect to evaluate at this stage is the size of the data available to us. Databases can often contain terabytes of data, so complex data mining techniques may take a long time to process them. To work around this problem, you reduce the data: the initial dataset is represented by a smaller dataset that is intended to produce the same results (or nearly the same). However, if you have a large but not excessive database, I always recommend trying to run the model with all the data at your disposal (cleaned, of course, of the problems discussed above), as in recent years software has developed considerably and lets you carry out many operations in an extremely short time. If you then realize at run time that the processing times are unsustainable, you can stop the execution and reduce the data. It is also a good rule of thumb not to have a number of variables much greater than the number of observations.
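Simple random sampling is one possible way to reduce the data; the text above does not prescribe a specific method, so the sketch below is an assumption. The function name, the 10% fraction and the fixed seed (used only to make the sample reproducible) are illustrative.

```python
import random


def sample_records(records, fraction, seed=0):
    """Return a simple random sample holding roughly `fraction` of the records."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not records:
        return []
    size = max(1, round(len(records) * fraction))
    # A dedicated Random instance keeps the sample reproducible for a given seed.
    return random.Random(seed).sample(records, size)


records = list(range(1000))
subset = sample_records(records, fraction=0.1)
print(len(subset))  # 100
```

Whether a sample gives nearly the same results as the full dataset has to be checked case by case, for example by comparing summary statistics before and after the reduction.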
Step 4 – Modeling.
At this stage, the main objective is to apply one or more data mining techniques to build models that provide useful information with respect to the objectives of the research. Some techniques require the data to meet certain characteristics, so it is very often necessary to return to the data preparation phase and adapt the initial data to the new needs. For this phase to be successful, you need in-depth knowledge of the different statistical techniques and of the areas in which to apply them. It is also important to know, for each algorithm, the type of variables it accepts as input and the type of variables you expect to get as output.
Thanks to the rapid development of new technologies, today, it is possible to develop models containing various types of variables.
Step 5 – Evaluation.
The evaluation phase is needed to compare the results of the models built. Usually, these comparisons are made on supervised problems, for which the correct results are known. If no technique performs well, you may need to go back and review some of the previous steps.
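For a supervised problem, one basic way to compare models is to measure how often each one predicts the known result. The sketch below uses accuracy as the metric; the metric choice, the model names and the hard-coded predictions are assumptions made for illustration, and other metrics may be more appropriate depending on the problem.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the known results."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Known results for a small held-out set, and the predictions of two
# hypothetical models on the same observations.
actual = [1, 0, 1, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0],
    "model_b": [1, 1, 0, 1, 1],
}

scores = {name: accuracy(pred, actual) for name, pred in predictions.items()}
best = max(scores, key=scores.get)
print(scores)  # {'model_a': 0.8, 'model_b': 0.4}
print(best)    # model_a
```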
To achieve good results it is essential to have quality data. It is unrealistic to expect to gain knowledge, or even to reach satisfactory results, if the database available to us does not contain the information we are looking for; however correct and refined a model is, it can never make up for incorrect or distorted information provided as its input. The upstream data environment should be as robust and reliable as possible.
Step 6 – Implementation.
Once a model has been selected and its correctness has been verified, the results obtained can be integrated into the decision-making processes. It is important to define the application areas in which the knowledge produced, through the model, can bring real benefits.