Preliminary Data Analysis with Automated EDA: A CRISP ML(Q) Approach

Sharat Manikonda

Director - Data Scientist, Data Engineering & MLOps

Published Jun 17, 2024

In the ever-evolving world of data science and analytics, the foundation of any successful project lies in the effective exploration and understanding of data. This critical phase, known as Exploratory Data Analysis (EDA), sets the stage for informed decision-making, hypothesis generation, and ultimately, model building. With the advent of sophisticated tools and techniques, automation of the EDA process has emerged as a powerful ally, enhancing the efficiency of this iterative work. Let’s delve into these concepts and understand how we can integrate Auto EDA with the CRISP ML(Q) methodology to build an ML pipeline for a production-level implementation.

The Essence of EDA

Exploratory Data Analysis is the initial phase of the data analysis lifecycle. It is commonly estimated that about 60% to 80% of the overall effort in an analytics project is spent in the EDA phase, where we examine datasets to summarize their main characteristics, often using visual methods and statistical computation. EDA is not just about statistics; it's about understanding the data's structure, patterns, anomalies, and relationships.

A few of the key activities in EDA include:

Descriptive Statistics: Calculating measures such as mean, median, mode, variance, and standard deviation to summarize data.
Data Visualization: Creating plots like histograms, scatter plots, and box plots to visualize data distributions and relationships.
Missing Value Analysis: Identifying and handling missing data points.
Outlier Detection: Detecting anomalies that may skew the analysis or indicate special phenomena.
Feature Relationships: Examining correlations and interactions between variables.
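
The activities listed above can be sketched in a few lines of pandas. This is only a minimal illustration, not a complete EDA: the dataset and its column names (age, income, segment) are made up, and the 1.5 * IQR rule is just one common convention for flagging outliers. Plotting is left out to keep the sketch free of extra dependencies.

```python
import numpy as np
import pandas as pd

# Small made-up dataset, used only for illustration
df = pd.DataFrame({
    "age": [23, 35, 31, np.nan, 42, 29, 38, 95],
    "income": [32000, 54000, 48000, 51000, np.nan, 40000, 60000, 58000],
    "segment": ["a", "b", "b", "a", "c", "a", "b", "c"],
})

# Descriptive statistics for the numeric columns (count, mean, std, quartiles)
summary = df.describe()

# Missing value analysis: number of missing entries per column
missing_counts = df.isna().sum()

# Outlier detection with the 1.5 * IQR rule on a single column
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Feature relationships: pairwise Pearson correlation of the numeric columns
correlations = df.select_dtypes(include="number").corr()

print(summary, missing_counts, outliers, correlations, sep="\n\n")
```

Writing these steps by hand for every column and every dataset is the repetitive part that Automated EDA tools aim to take over.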

Automated EDA

Traditional EDA usually begins with univariate analysis and relies heavily on manual coding and the analyst's intuition about the data. Automated EDA leverages machine learning and advanced algorithms to streamline and enhance the process. Automated EDA tools such as AutoViz, D-Tale, Pandas Profiling, and Sweetviz can perform comprehensive data analysis with minimal human intervention. The benefits of Automated EDA are:

Rapid Insights: Quickly generate visualizations and summary statistics, saving valuable time.
Scalability: Handle large and complex datasets efficiently.
Consistency: Ensure that no critical aspect of the data is overlooked by following a systematic approach.
Exploration Depth: Utilize advanced algorithms to uncover hidden patterns and relationships that might be missed in manual EDA.

Let’s discuss a few Python Libraries for Automated EDA:

Pandas Profiling: Provides a detailed report of the dataset, including descriptive statistics, correlations, missing values, and data types.

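import pandas as pd
import pandas_profiling as pp  # newer releases are published as ydata-profiling (import ydata_profiling)

df = pd.read_csv('data.csv')  # placeholder path; load your own dataset here

profile = pp.ProfileReport(df)  # builds the profiling report

profile.to_file("output.html")  # saves the report as an HTML file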

Sweetviz: Generates beautiful, high-density visualizations with a few lines of code.

import sweetviz as sv

report = sv.analyze(df)  # df is a pandas DataFrame that has already been loaded

report.show_html('sweetviz_report.html')  # writes the HTML report to this file

AutoViz: Automatically visualizes any dataset with one line of code.

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()

AV.AutoViz('data.csv')  # pass the path to your own CSV file

Integrating EDA and Automated EDA with CRISP ML(Q)

The CRISP ML(Q) methodology, an extension of the CRISP-DM framework, provides a structured approach to machine learning projects with a strong focus on quality assurance. The phases of CRISP ML(Q) include:

Business and Data Understanding: Define business objectives and requirements and understand the data in business context.
Data Preparation: Clean, transform, and prepare data for analysis.
Modeling: Build and evaluate predictive models.
Evaluation: Assess the model’s performance and alignment with business goals.
Deployment: Implement the model in a production environment.
Monitoring and Maintenance: Continuously monitor and refine the model.

Within this framework, EDA plays a pivotal role, most directly during the Business and Data Understanding and Data Preparation phases, and it continues to inform the later phases:

Business and Data Understanding: EDA helps stakeholders gain a clear understanding of the data landscape, aligning business objectives with data realities. Automated EDA tools can accelerate this process by providing quick insights.
Data Preparation: EDA techniques are crucial for cleaning and transforming data. Automated tools can identify and address missing values, outliers, and anomalies more efficiently, ensuring high-quality data for modeling.
Modeling and Evaluation: Insights gained from EDA inform the choice of features and modeling techniques. Automated EDA can suggest feature engineering strategies and highlight potential data issues that could affect model performance.
Monitoring and Maintenance: Continuous EDA is essential for monitoring data quality and model performance over time, because the data seen in production may drift away from the data the model was trained on. Automated tools can provide real-time insights and alert stakeholders to any deviations.
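
To make the monitoring point more concrete, here is a minimal sketch of one way to flag drift in a single numeric feature, using the Population Stability Index (PSI). This is an illustrative choice rather than something prescribed by CRISP ML(Q) or by any of the libraries above; the function name, the bin count, and the simulated samples are all assumptions made for the example.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10):
    """Compare two numeric samples bin by bin; larger values mean more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant reference sample

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clipped into the edge bins
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # A small floor avoids log(0) when a bin is empty
        return [max(c / len(values), 1e-6) for c in counts]

    ref_shares = bin_shares(reference)
    cur_shares = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_shares, cur_shares))


# Simulated feature values, for illustration only
random.seed(0)
baseline = [random.gauss(50, 10) for _ in range(5000)]   # values seen at training time
same_dist = [random.gauss(50, 10) for _ in range(5000)]  # new data without drift
shifted = [random.gauss(60, 10) for _ in range(5000)]    # new data with a shifted mean

psi_stable = population_stability_index(baseline, same_dist)
psi_drifted = population_stability_index(baseline, shifted)
print(f"PSI without drift: {psi_stable:.3f}")
print(f"PSI with drift:    {psi_drifted:.3f}")
```

In a pipeline, a check like this could run on each new batch of data, with an alert raised when the index exceeds a threshold chosen for the use case.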

How many Automated EDA libraries have you explored? Let us know your experience with AutoEDA libraries in the comments.
