The Limitations of the Data in Predictive Analytics

Vartul Mittal

Top 40 Under 40 Digital Transformation Leader | Top 100 CIO/CDO/CTO | GCC & Digital Strategy | Technology Solutions | Managed Services | Intelligent Automation (BPM, RPA, IDP, GenAI, Data Analytics) | SaaS Alliances

Published Mar 7, 2021

As with many aspects of any business system, data is a human creation — so it’s apt to have some limits on its usability when you first obtain it. Here’s an overview of some limitations you’re likely to encounter:

The data could be incomplete. Missing values, even the lack of a section or a substantial part of the data, could limit its usability.
For example, your data might cover only one or two conditions of a larger set that you’re trying to model — as when a model built to analyze stock market performance only has data available from the past 5 years, which skews both the data and the model toward the assumption of a bull market.
The moment the market undergoes any correction that leads to a bear market, the model fails to adapt — simply because it wasn’t trained and tested with data representing a bear market.

Make sure you’re looking at a timeframe long enough to capture the natural fluctuations of your data. A sample that covers only part of a seasonal cycle will mistake a seasonal swing for a lasting trend.
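One way to check timeframe coverage is to look at the span of the observation dates and at which calendar months never appear. The sketch below is only an illustration: the function name coverage_report, the min_years threshold, and the sample dates are assumptions made here, not part of any particular tool.

```python
from datetime import date

def coverage_report(dates, min_years=2):
    """Summarize how much of the calendar a set of observation dates covers."""
    months_seen = {d.month for d in dates}
    span_days = (max(dates) - min(dates)).days
    return {
        # Length of the observed window, in years
        "span_years": round(span_days / 365.25, 2),
        # Calendar months with no observations at all
        "missing_months": sorted(set(range(1, 13)) - months_seen),
        # True when the window is at least min_years long (hypothetical threshold)
        "covers_min_years": span_days >= min_years * 365.25,
    }

# Hypothetical sample: observations from a single summer only
summer_only = [date(2020, month, 15) for month in (6, 7, 8)]
report = coverage_report(summer_only)
print(report["missing_months"])    # months the model would never see
print(report["covers_min_years"])  # False for this sample
```

A report with missing months or a short span suggests the model would be trained on only part of the cycle it is meant to predict.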

If you’re using data from surveys, keep in mind that people don’t always provide accurate information. Not everyone will answer truthfully about (say) how many times they exercise — or how many alcoholic beverages they consume — per week. People may not be dishonest so much as self-conscious, but the data is still skewed.
Data collected from different sources can vary in quality and format. Data collected from such diverse sources as surveys, e-mails, data-entry forms, and the company website will have different attributes and structures, and the fields from one source often do not map cleanly onto the fields of another. Such data requires major preprocessing before it’s analysis-ready. The accompanying sidebar provides an example.

Data collected from multiple sources may have differences in formatting, duplicate records, and inconsistencies across merged data fields. Expect to spend a long time cleaning such data — and even longer validating its reliability.
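As a rough illustration of that cleanup, the sketch below maps source-specific field names onto shared ones, tidies string values, and drops duplicate records. The field names, the sample records, and the choice of e-mail address as the matching key are all hypothetical; real integration usually needs fuzzier matching than this.

```python
def normalize(record, field_map):
    """Rename source-specific fields to shared names and tidy string values."""
    out = {}
    for source_field, value in record.items():
        target = field_map.get(source_field, source_field)
        out[target] = value.strip().lower() if isinstance(value, str) else value
    return out

def merge_sources(sources, key="email"):
    """Combine records from several sources, keeping the first record per key.

    Assumes every record has the key field after normalization.
    """
    merged = {}
    for records, field_map in sources:
        for record in records:
            row = normalize(record, field_map)
            merged.setdefault(row[key], row)  # later duplicates are ignored
    return list(merged.values())

# Hypothetical records: the same person appears in both sources
survey = [{"Email": " Ann@Example.com ", "Age": 34}]
web_form = [
    {"email_address": "ann@example.com", "age": 34},
    {"email_address": "bob@example.com", "age": 41},
]

customers = merge_sources([
    (survey, {"Email": "email", "Age": "age"}),
    (web_form, {"email_address": "email"}),
])
print(customers)
```

Keeping the first record per key is a simplification. In practice you would also decide which source to trust when two records for the same key disagree.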

To determine the limitations of your data, be sure to:

Verify all the variables you’ll use in your model.
Assess the scope of the data, especially over time, so your model can avoid the seasonality trap.
Check for missing values, identify them, and assess their impact on the overall analysis.
Watch out for extreme values (outliers) and decide on whether to include them in the analysis.
Confirm that the pool of training and test data is large enough.
Make sure each variable’s data type (integer, decimal, character, and so forth) is correct, and set upper and lower bounds on its possible values.
Pay extra attention to data integration when your data comes from multiple sources.
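Several of the checks above (missing values, bounds, and outliers) can be sketched for a single numeric column as follows. The 1.5 × IQR fence is one common outlier convention rather than the only option, and the column of ages and its 0 to 120 bounds are hypothetical.

```python
import statistics

def profile_column(values, lower, upper):
    """Report missing, out-of-bounds, and outlying values for one numeric column."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartiles
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "missing": len(values) - len(present),
        "out_of_bounds": [v for v in present if not lower <= v <= upper],
        "outliers": [v for v in present if v < low_fence or v > high_fence],
    }

# Hypothetical column of ages, with two missing entries and one data-entry error
ages = [34, 29, None, 41, 38, 33, 250, 36, None, 31]
report = profile_column(ages, lower=0, upper=120)
print(report)
```

The function only reports problems. Whether to drop, correct, or keep a flagged value is still a judgment call that depends on the analysis.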

Be sure you understand your data sources and their impact on the overall quality of your data.

Choose a relevant dataset that is representative of the whole population.
Choose the right parameters for your analysis.

Even after all this care and attention, don’t be surprised if your data still needs preprocessing before you can analyze it accurately. Preprocessing often takes a long time and significant effort because it has to address several issues related to the original data — these issues include:

Any values missing from the data.
Any inconsistencies and/or errors existing in the data.
Any duplicates or outliers in the data.
Any normalization or other transformation the data needs.
Any derived data needed for the analysis.
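A minimal sketch of three of these preprocessing steps follows: filling missing values, normalizing, and deriving a new field. Median imputation and min-max scaling are chosen here only as simple, common examples, and the weekly_visits data is made up.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the known values."""
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    return [median if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range 0..1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - low) / (high - low) for v in values]

# Hypothetical survey column with two unanswered entries
weekly_visits = [2, None, 4, 6, None]

filled = impute_median(weekly_visits)  # missing entries become the median
scaled = min_max_scale(filled)         # normalized to 0..1
per_day = [v / 7 for v in filled]      # derived field: average visits per day
print(filled, scaled)
```

Each step changes what the model sees, so it is worth recording which values were imputed rather than observed.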

ABOUT THE AUTHOR

Vartul Mittal is a Technology & Innovation Specialist who focuses on helping clients accelerate their digital transformation journey. He has 14+ years of global business transformation experience in management consulting and global in-house centres, managing technology and business teams in Intelligent Automation, Advanced Analytics and Cloud Adoption. He is passionate about extending customer relationships beyond the current project, with the longer-term goal of becoming a trusted adviser and bringing greater value to businesses via digital disruption.

He has lived and worked across multiple countries and cultures, working with senior client stakeholders from industries such as financial services, FMCG and retail. He has delivered engagements for Fortune 100 organizations such as Coca Cola India, Kotak Mahindra Bank, IBM, Royal Bank of Scotland, Standard Life Insurance, Citibank and Barclays. Vartul Mittal is also a renowned speaker on Analytics, Automation, AI and Innovation at top global universities and international conferences.

