Challenges within the Modern Data Stack

The Appearance of Data in its Early Stages

Ten years ago, the data ambitions of many companies centered largely on business intelligence (BI). Their goals revolved around generating reports and dashboards to manage operational risk, meet compliance requirements, and ultimately make informed business decisions, albeit at a slower pace than today.

Beyond BI, traditional statistical learning found application in business operations within sectors such as insurance, healthcare, manufacturing, and finance. These early use cases, executed by specialist teams, significantly shaped many earlier approaches to data management.

In summary, this is how data appeared in its early stages:

  • Data served specific case analyses rather than strategic decisions.
  • Its primary purpose was automating reporting.
  • The initial steps towards employing statistical modeling for business emerged.

The Traditional Data Stack

The traditional data stack (TDS) refers to on-premises data systems.

Companies were responsible for managing their own infrastructure and hardware, which posed several challenges:

  • Fragility - limited adaptability to change.
  • High maintenance costs - extensive manual work.
  • Scalability limitations - difficulty provisioning new infrastructure as needed.
  • Inflexibility - stemming from bottom-up maintenance.
  • Intricate root cause analysis.

The data landscape then began to change.

Around 2010, a surge of major technology advances brought fresh challenges to the data stack. Companies found themselves grappling with the following:

- Escalating Data Volumes - This necessitated a shift from inflexible governance and extensive modeling within the data warehouse to a more adaptable data lake environment for storage. Additionally, managing the expenses associated with storing such extensive data posed a significant challenge.

- Novel Data Categories - Unprecedented data forms like text, images, and audio emerged. Many organizations of that era were uncertain about how to harness the potential of such unstructured data.

- Expanded Business Applications - With the availability of vast volumes and new data categories, organizations could construct more precise models to enhance their decision-making systems. Natural Language Processing (NLP), Computer Vision, and Recommender Systems became more accessible to all types of businesses.

The Modern Data Stack

Faced with these fresh obstacles, the data stack needed to undergo a transformative process, leading to the emergence of the modern data stack (MDS).

The most significant breakthrough brought about by the MDS was the transition to cloud computing. This shift made data more accessible, easier to query, and simpler to manage. The modern data stack streamlined data collection and ingestion, added support for high-velocity data streams, and provided exceptional scalability at a reasonable cost.

MDS encompasses a suite of interconnected tools designed to facilitate a seamless transition from raw data to valuable business insights.

These tools are distinguished by their straightforward cloud deployment, scalability, and modular structure. Each tool addresses a specific data challenge: separating compute and storage (e.g., Snowflake, Databricks), tracking data lineage (e.g., Stemma, Alation), running transformations (e.g., dbt), orchestrating jobs (e.g., Airflow, Prefect), managing schemas (e.g., Protobuf), handling streaming data (e.g., Kafka), monitoring data health (e.g., Monte Carlo, Bigeye, DataObserve), and more.
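
To make the modular structure concrete, below is a minimal orchestration sketch, assuming Apache Airflow 2.x and the dbt CLI: a daily DAG that runs a hypothetical ingestion step and then hands off to dbt for in-warehouse transformation. The DAG name, task names, and dbt project path are illustrative assumptions rather than a prescribed setup.

```python
# A minimal orchestration sketch, assuming Apache Airflow 2.x; the DAG name,
# task names, and dbt project path are hypothetical illustrations only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def ingest_raw_orders():
    # Placeholder for an ingestion step, e.g. pulling from an API or a Kafka
    # topic into a warehouse staging schema.
    print("ingesting raw orders ...")


with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest_raw_orders", python_callable=ingest_raw_orders)

    # Hand off to dbt for in-warehouse transformations (project path is hypothetical).
    transform = BashOperator(
        task_id="dbt_run_orders",
        bash_command="dbt run --select orders --project-dir /opt/dbt/analytics",
    )

    ingest >> transform
```

The point is not these particular tools: each step is a swappable component, and the orchestrator only knows the order in which they run.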

What issues have arisen?

The MDS emerged as a disjointed assortment of tools that generated intricate pipelines and dumped data into a central repository, leading to unwieldy data accumulations across business domains. These tools were not originally designed to work together seamlessly across the entire data value chain.

Furthermore, data's role has outgrown its initial purpose of feeding executive dashboards, expanding to a multitude of models and dashboards within just a few years. This progression has given rise to a range of challenges:

1. With data flowing in from diverse sources, comprehending data context became more intricate, as data warehouses could no longer faithfully replicate the real world with interconnected entities and tables.

2. Numerous data initiatives recycle the same data under different names or reference neglected tables.

3. Effective testing is a rarity, making debugging a formidable challenge.

4. Teams grapple with pinpointing the definitive source of vital data, leading them to construct personalized tables for impromptu queries, thereby incurring 'data debt' (more on this topic in upcoming posts).

5. Data teams invest months in crafting feature sets for machine learning models, formulating metrics, conducting experiments, and refining data structures.

6. Essential datasets encounter frequent breakdowns without clear accountability and ownership.


We are observing a surge in data debt, a heightened volume of daily bug management, and a substantial loss of control over the data warehouse. Interestingly, the significance of the data warehouse, once the paramount data asset within organizations, has diminished over the past decade.


Most organizations are currently either encountering these issues due to the evolution of their data stack, or they are on the brink of experiencing them as they persist in their data-driven endeavors.

What actions can be taken at this point?

The Modern Data Stack predominantly addressed engineering hurdles around cost and performance, but it introduced new complexity in actually using data to solve business problems.

The core aim of harnessing data has always been and remains enhancing business outcomes and efficacy, and this should be our central concern.

Outlined below are several concepts for mitigating the obstacles between data generation and its effective utilization:

Establishing the Data Warehouse as the bedrock of all analytical efforts

Creating a semantic mapping that interconnects the various data feeds from diverse sources will pave the way for a genuinely efficient Data Warehouse and significantly enhance the data consumer experience. It requires dedicating substantial time to comprehend the interrelationships among distinct data sources and accurately represent the real-world connections.
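
As one hedged illustration, such a semantic mapping can begin as an explicit, reviewable description of which raw feeds roll up into which business entities and how those entities relate. The source systems, table names, and keys in the sketch below are hypothetical.

```python
# A minimal sketch of a semantic mapping between raw sources and warehouse
# entities; all source systems, table names, and join keys are hypothetical.
from dataclasses import dataclass, field


@dataclass
class EntityMapping:
    entity: str                  # conformed business entity in the warehouse
    sources: list[str]           # raw feeds that contribute to this entity
    primary_key: str             # key the entity is resolved on
    relationships: dict[str, str] = field(default_factory=dict)  # related entity -> join key


SEMANTIC_MODEL = [
    EntityMapping(
        entity="customer",
        sources=["crm.contacts", "billing.accounts"],
        primary_key="customer_id",
    ),
    EntityMapping(
        entity="order",
        sources=["shop_db.orders", "events.order_created"],
        primary_key="order_id",
        relationships={"customer": "customer_id"},  # every order belongs to a customer
    ),
]
```

Making these relationships explicit before any tables are built means downstream joins follow one agreed-upon path instead of ad-hoc guesses.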

Integrating Software Engineers (SWE) into the data workflow

Software Engineers are responsible for generating a significant portion of the data utilized in business reports, experiments, and models. Paradoxically, they often lack insight into how their data is being utilized.

This disconnect results in Data Engineers frequently serving as intermediaries, dedicating more time to rectifying pipeline issues stemming from alterations upstream (in back-end or front-end services) than to creating fresh pipelines that drive business potentials. Consequently, certain adjustments are necessary within this workflow:

  • Prioritize Data Contracts - Data contracts encapsulate data expectations, including business context, data quality, and security measures (see the sketch after this list). This empowers data engineers to understand upstream data sources, minimizing the risk of pipeline disruptions caused by upstream changes.
  • Extend Engineering Insight Downstream - Engineering teams should delve deeper into comprehending how the data they generate will be employed in subsequent stages. This proactive understanding will inform their decision-making when implementing alterations.
  • Engineering Accountability for Data Quality - As the creators of the data, engineering teams should shoulder responsibility for data quality, at least until the data is integrated into the central repository.
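
To make the first point concrete, a lightweight data contract can be expressed as a versioned, validated schema that both the producing service and the data team import. The sketch below uses pydantic and a hypothetical 'order_created' event; the fields and ownership notes are illustrative assumptions, not an established standard.

```python
# A minimal data-contract sketch using pydantic; the event name, fields, and
# ownership metadata are hypothetical examples.
from datetime import datetime

from pydantic import BaseModel, Field


class OrderCreatedV1(BaseModel):
    """Contract for an 'order_created' event emitted by a hypothetical orders service.

    Owner: orders-backend team. Consumers: analytics, fraud models.
    Breaking changes require a new version (OrderCreatedV2), not an in-place edit.
    """

    order_id: str = Field(min_length=1)
    customer_id: str = Field(min_length=1)
    amount_cents: int = Field(ge=0)                     # currency minor units, never negative
    currency: str = Field(min_length=3, max_length=3)   # ISO-4217 code
    created_at: datetime


# Producers validate before publishing and consumers validate on ingestion,
# so schema drift is caught at the boundary instead of deep inside a pipeline.
event = OrderCreatedV1(
    order_id="o-123",
    customer_id="c-456",
    amount_cents=1999,
    currency="USD",
    created_at=datetime.utcnow(),
)
```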

Bridging Data Engineering and Business Context

Data engineers play a pivotal role in establishing and overseeing data platforms and their operational workflows. Positioned between software engineers and data scientists/analysts, they serve as intermediaries. However, frequently, they construct pipelines without a grasp of the business context or a clear vision of the ultimate purpose of the tables they create.

Without business context, data engineers struggle to work out the appropriate interconnections among distinct data elements, impeding the development of a data warehouse that accurately reflects the real world.

Adopting a Data Product Mindset

The data team must ensure that the data products they develop address tangible user needs. Approaching data products solely from a technical angle is no longer sufficient; the fit between the data product and the user's problem must become part of the data workflow.

Revolutionizing Data Modeling Framework

Conventional data modeling grapples with challenges like stringent governance, inflexible processes, limited adaptability for iteration, and extended timeframes for insights. While data modeling design was effective in an era of controlled data where teams could ensure incoming data fit designated schemas, the surge in data volume and sources has made applying traditional data modeling increasingly difficult. Consequently, data warehouses began to deviate from their primary purpose.

To navigate this landscape, the data ecosystem must consider ushering in Data Modeling 2.0 tailored for the Modern Data Stack. This entails embracing a decentralized data architecture, where data is dispersed across domains, potentially offering a remedy to this predicament.
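
As a hedged sketch of what such a decentralized arrangement could look like, each domain might register the datasets it exposes as products, with an explicit owner, freshness expectation, and schema version. The domains, dataset names, and SLAs below are hypothetical.

```python
# A minimal sketch of domain-oriented ownership: each domain publishes the
# datasets it exposes as products. All names and SLAs below are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    domain: str           # owning domain team
    name: str             # dataset exposed to the rest of the organization
    owner: str            # accountable contact
    freshness_sla: str    # how often consumers can expect updates
    schema_version: str   # versioned so consumers can track breaking changes


DOMAIN_PRODUCTS = [
    DataProduct("orders", "orders.daily_order_facts", "orders-data@company.example", "daily", "1.2"),
    DataProduct("customers", "customers.customer_profile", "crm-data@company.example", "hourly", "2.0"),
]
```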

Establishing Data Governance Guidelines

In the past few decades, most investment in data initiatives went towards enhancing and expanding technology, with far less emphasis on refining processes and implementing effective data management practices. Here are several approaches to guaranteeing the proficient, secure, and ethical management of data:

Defining Data Governance Measures

  • Data Ownership: Identifying individuals or entities responsible for specific data, promoting clear ownership and accountability.
  • Data Quality Standards: Establishing a comprehensive set of benchmarks to ensure data accuracy, completeness, consistency, and timeliness (a minimal check is sketched after this list). These standards bolster the reliability and credibility of the data.
  • Data Catalogs: Creating a centralized repository encompassing all organizational data products, housing metadata, documentation, and data lineage details. This facilitates the exploration, comprehension, and utilization of data.
  • Data Policies and Procedures: Formulating a structured set of guidelines to govern data collection, storage, processing, validation, and utilization within the organization.
  • Data Lineage: Offering a transparent depiction of data origins and movements, enhancing clarity regarding the trajectory of data.
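
As a concrete example of the quality-standards point, the sketch below runs a few governance-style checks (completeness, uniqueness, validity, timeliness) against a hypothetical daily orders extract using pandas; in practice these checks would typically live in dbt tests or a dedicated monitoring tool, and the table, columns, and thresholds here are assumptions.

```python
# A minimal data-quality check sketch with pandas; column names and thresholds
# are hypothetical examples.
import pandas as pd


def check_orders_quality(df: pd.DataFrame) -> dict[str, bool]:
    """Run a handful of governance-style checks against a daily orders extract."""
    return {
        # Completeness: the business key must never be null.
        "order_id_not_null": bool(df["order_id"].notna().all()),
        # Uniqueness: one row per order.
        "order_id_unique": bool(df["order_id"].is_unique),
        # Validity: amounts must be non-negative.
        "amount_non_negative": bool((df["amount_cents"] >= 0).all()),
        # Timeliness: the newest record must be less than 24 hours old.
        "fresh_within_24h": bool(
            pd.Timestamp.now(tz="UTC") - df["created_at"].max() <= pd.Timedelta(hours=24)
        ),
    }


# Example usage with a tiny in-memory frame.
sample = pd.DataFrame(
    {
        "order_id": ["o-1", "o-2"],
        "amount_cents": [1999, 0],
        "created_at": pd.to_datetime(["2024-01-01T10:00:00Z", "2024-01-02T09:30:00Z"]),
    }
)
print(check_orders_quality(sample))
```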

Is Data Mesh the Ultimate Answer?

From my perspective, the pivotal aspect isn't the framework itself, but rather our competence in leveraging data to genuinely enhance business outcomes.

Data mesh certainly addresses several of the challenges highlighted in this article, yet it isn't a universal panacea, nor can it be implemented without considering an organization's unique context. Moreover, it's still an early-stage framework, and its true essence will evolve as businesses gradually adopt it.

In conclusion, despite my somewhat cautious stance on the Modern Data Stack, I hold a strong optimism for the future of the data industry and our potential to adapt and enhance.
