Preliminary Data Analysis with Automated EDA: A CRISP ML(Q) Approach
In the ever-evolving world of data science and analytics, the foundation of any successful project lies in effectively exploring and understanding the data. This critical phase, known as Exploratory Data Analysis (EDA), sets the stage for informed decision-making, hypothesis generation, and, ultimately, model building. With the advent of sophisticated tools and techniques, automating the EDA process has emerged as a powerful ally that makes this iterative work more efficient. Let’s delve into these concepts and see how Auto EDA can be integrated with the CRISP ML(Q) methodology to build an ML pipeline for a production-level implementation.
The Essence of EDA
Exploratory Data Analysis is the initial phase of the data analysis lifecycle. It is commonly estimated that about 60% to 80% of the overall effort in an analytics project is spent in the EDA phase, where we examine datasets to summarize their main characteristics, often using visual methods and statistical computation. EDA is not just about statistics; it's about understanding the data's structure, patterns, anomalies, and relationships.
A few of the key activities in EDA include:
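As an illustration, here is a minimal pandas sketch of what manual EDA often looks like in practice. The toy DataFrame, its column names, and the particular checks chosen (structure, univariate summaries, missing values, duplicates, correlation) are assumptions made for this example, not a definitive list of EDA activities.

```python
import pandas as pd

# Hypothetical toy data; in a real project df would come from pd.read_csv(...)
df = pd.DataFrame({
    "age": [23, 35, 41, None, 29, 35],
    "income": [32000, 54000, 61000, 58000, None, 54000],
    "segment": ["A", "B", "B", "A", "C", "B"],
})

# Structure: number of rows/columns and the type of each column
shape = df.shape
dtypes = df.dtypes

# Univariate summaries: numeric statistics and category frequencies
numeric_summary = df.describe()
segment_counts = df["segment"].value_counts()

# Anomalies: missing values per column and fully duplicated rows
missing = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# Relationships: pairwise correlation between the numeric columns
corr = df[["age", "income"]].corr()

print(shape, n_duplicates)
print(missing)
print(corr)
```

Each of these steps is a single line here, but repeating them by hand for every column of a wide dataset is where most of the manual effort goes.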
Automated EDA
Traditional EDA typically begins with univariate analysis and relies heavily on manual coding and the analyst's intuition about the data. Automated EDA instead leverages machine learning and advanced algorithms to streamline and enhance the process. Automated EDA tools such as AutoViz, D-Tale, Pandas Profiling, and Sweetviz can perform comprehensive data analysis with minimal human intervention. The benefits of Automated EDA are:
Let’s discuss a few Python Libraries for Automated EDA:
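import pandas as pd

df = pd.read_csv("data.csv")  # assumed input file, the same one passed to AutoViz below

# Pandas Profiling (recent releases are published as ydata-profiling)
from ydata_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_file("output.html")  # writes a standalone HTML report

# Sweetviz
import sweetviz as sv
report = sv.analyze(df)
report.show_html("sweetviz_report.html")

# AutoViz
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
AV.AutoViz("data.csv")  # AutoViz reads the file itself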
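To make the idea behind these tools more concrete, the following toy sketch loops over every column and picks a summary based on its type. It only illustrates the general principle; it is not how any of the libraries above is actually implemented, and the function name `auto_summary` and the sample data are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def auto_summary(df: pd.DataFrame) -> dict:
    """Toy stand-in for an automated EDA pass: one summary per column, chosen by dtype."""
    summary = {}
    for name in df.columns:
        col = df[name]
        info = {"missing": int(col.isna().sum()), "unique": int(col.nunique())}
        if is_numeric_dtype(col):
            # Numeric columns get basic descriptive statistics
            info.update(kind="numeric", mean=float(col.mean()),
                        min=float(col.min()), max=float(col.max()))
        else:
            # Everything else is treated as categorical: report the most frequent value
            top = col.mode(dropna=True)
            info.update(kind="categorical", top=top.iloc[0] if not top.empty else None)
        summary[name] = info
    return summary


# Hypothetical sample data with one numeric and one categorical column
toy = pd.DataFrame({
    "price": [10.0, 12.5, None, 9.0],
    "color": ["red", "blue", "red", None],
})
result = auto_summary(toy)
for column, info in result.items():
    print(column, info)
```

Real Automated EDA tools add visualizations, correlation analysis, and warnings on top of this kind of per-column dispatch, which is why a single call can replace a large amount of hand-written code.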
Integrating EDA and Automated EDA with CRISP ML(Q)
The CRISP ML(Q) methodology, an extension of the CRISP-DM framework, provides a structured approach to machine learning projects with a strong focus on quality assurance. The phases of CRISP ML(Q) include:
Within this framework, EDA plays a pivotal role during the Data Understanding and Data Preparation phases.
How many Automated EDA libraries have you explored? Let us know about your experience with AutoEDA libraries in the comments.
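One way to connect EDA findings to the quality focus of CRISP ML(Q) is to turn them into an explicit check that runs before data moves on to preparation and modeling. The sketch below is one possible shape for such a gate; the function name `data_quality_report`, the 20% missing-value threshold, and the chosen checks are all assumptions for illustration, not part of the CRISP ML(Q) specification.

```python
import pandas as pd


def data_quality_report(df: pd.DataFrame, max_missing_ratio: float = 0.2) -> dict:
    """Hypothetical quality gate built from simple EDA checks."""
    issues = []

    # Flag columns whose share of missing values exceeds the (assumed) threshold
    for col, ratio in df.isna().mean().items():
        if ratio > max_missing_ratio:
            issues.append(f"{col}: {ratio:.0%} missing")

    # Flag fully duplicated rows
    n_dup = int(df.duplicated().sum())
    if n_dup:
        issues.append(f"{n_dup} duplicate rows")

    # Flag columns that carry no information because they hold a single value
    for col in df.columns:
        if df[col].nunique(dropna=True) <= 1:
            issues.append(f"{col}: constant column")

    return {"passed": not issues, "issues": issues}


# Hypothetical raw data with a sparse column and a constant column
raw = pd.DataFrame({
    "sensor": [1.0, None, None, 4.0],
    "site": ["x", "x", "x", "x"],
    "reading": [0.1, 0.2, 0.3, 0.4],
})
report = data_quality_report(raw)
print(report)
```

A pipeline could refuse to proceed when `passed` is false, which makes the outcome of Data Understanding an auditable artifact rather than an informal impression.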