Challenges within the Modern Data Stack
The Appearance of Data in its Early Stages
Ten years ago, the data ambitions of numerous companies were largely centered around business intelligence (BI). Their goals revolved around generating reports and dashboards to handle operational risk, address compliance requirements, and ultimately make informed business decisions, albeit at a slower pace.
Apart from BI, traditional statistical learning had found application in business operations within sectors such as insurance, healthcare, manufacturing, and finance. These initial use cases, executed by expert teams, significantly shaped many earlier approaches to data management.
In summary, this is how data appeared in its early stages:
The Traditional Data Stack
The traditional data stack (TDS) refers to on-premises data systems.
Companies were responsible for running their own infrastructure and hardware, which brought several problems: fragility (limited adaptability to change), high maintenance costs (extensive manual work), scalability limitations (difficulty provisioning new infrastructure as needed), inflexibility (stemming from bottom-up maintenance), and difficult root cause analysis.
The data environment began to undergo transformation.
Somewhere around 2010, the surge of major technology advancements brought forth fresh challenges to the data stack. Companies found themselves grappling with the following:
- Escalating Data Volumes - This necessitated a shift from inflexible governance and extensive modeling within the data warehouse to a more adaptable data lake environment for storage. Additionally, managing the expenses associated with storing such extensive data posed a significant challenge.
- Novel Data Categories - Unprecedented data forms like text, images, and audio emerged. Many organizations of that era were uncertain about how to harness the potential of such unstructured data.
- Expanded Business Applications - With the availability of vast volumes and new data categories, organizations could construct more precise models to enhance their decision-making systems. Natural Language Processing (NLP), Computer Vision, and Recommender Systems became more accessible to all types of businesses.
The Modern Data Stack
Faced with these fresh obstacles, the data stack needed to undergo a transformative process, leading to the emergence of the modern data stack (MDS).
The most significant breakthrough brought about by MDS was the transition to cloud computing. This shift made data more accessible, easier to retrieve, and easier to manage technically. The modern data stack streamlined data collection and ingestion, added support for high-velocity data streams, and provided scalability at low cost.
MDS encompasses a suite of interconnected tools designed to facilitate a seamless transition from raw data to valuable business insights.
These tools are distinguished by their straightforward cloud deployment, scalability, and modular structure. Each tool addresses a specific data-related challenge, such as separating compute from storage (e.g., Snowflake, Databricks), tracking data lineage (e.g., Stemma, Alation), executing transformations (e.g., dbt), orchestrating jobs (e.g., Airflow, Prefect), managing schemas (e.g., Protobuf), handling streaming data (e.g., Kafka), monitoring (e.g., Monte Carlo, Bigeye, DataObserve), and more.
What issues have arisen?
MDS emerged as a disjointed assortment of tools that produced complex pipelines and dumped data into a central repository, leaving unwieldy accumulations of data across many industries. These tools were not originally designed to work together across the entire data value chain.
Furthermore, data's role has surpassed its initial purpose of supplying executive dashboards and has expanded to encompass numerous models and dashboards within a brief span of years. This progression has given rise to a range of challenges:
1. With data flowing in from diverse sources, comprehending data context became more intricate, as data warehouses could no longer faithfully replicate the real world with interconnected entities and tables.
2. Numerous data initiatives recycle the same data under different names or reference neglected tables.
3. Effective testing is a rarity, making debugging a formidable challenge.
4. Teams grapple with pinpointing the definitive source of vital data, leading them to construct personalized tables for impromptu queries, thereby incurring 'data debt' (more on this topic in upcoming posts).
5. Data teams invest months in crafting feature sets for machine learning models, formulating metrics, conducting experiments, and refining data structures.
6. Essential datasets encounter frequent breakdowns without clear accountability and ownership.
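Item 3 above notes that effective testing is rare. As a rough illustration of what a basic table-level test could look like, here is a minimal sketch using only the Python standard library. The `orders` data, the column names, and the `check_rows` helper are all hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from collections import Counter


def check_rows(rows, key, required):
    """Return a list of human-readable problems found in `rows`."""
    problems = []
    # Uniqueness: the key column should identify exactly one row.
    counts = Counter(row.get(key) for row in rows)
    for value, n in counts.items():
        if n > 1:
            problems.append(f"duplicate {key}={value!r} appears {n} times")
    # Completeness: required columns must be present and non-null.
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing value for {col!r}")
    return problems


# Hypothetical sample data, for illustration only.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 20.0},
    {"order_id": 2, "customer_id": None, "amount": 5.5},
    {"order_id": 2, "customer_id": "c3", "amount": 7.0},
]

problems = check_rows(orders, key="order_id", required=["customer_id", "amount"])
for problem in problems:
    print(problem)
```

Even checks this simple, run on every load, would surface duplicated keys and missing values before they reach a dashboard or model.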
We are observing a surge in data debt, a heightened volume of daily bug management, and a substantial loss of control over the data warehouse. Interestingly, the significance of the data warehouse, once the paramount data asset within organizations, has diminished over the past decade.
Most organizations are currently either encountering these issues due to the evolution of their data stack, or they are on the brink of experiencing them as they persist in their data-driven endeavors.
What actions can be taken at this point?
The Modern Data Stack predominantly addressed engineering hurdles concerning cost and performance, but introduced further complexities in terms of effectively utilizing data to address business issues.
The core aim of harnessing data has always been and remains enhancing business outcomes and efficacy, and this should be our central concern.
Outlined below are several concepts for mitigating the obstacles between data generation and its effective utilization:
Establishing the Data Warehouse as the bedrock of all analytical efforts
Creating a semantic mapping that interconnects the various data feeds from diverse sources will pave the way for a genuinely efficient Data Warehouse and significantly enhance the data consumer experience. It requires dedicating substantial time to comprehend the interrelationships among distinct data sources and accurately represent the real-world connections.
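To make the idea of a semantic mapping more concrete, the sketch below shows one possible shape for it: a lookup that translates each source's field names into shared, canonical names, so that records describing the same real-world customer can be merged into one entity. The source names (`crm`, `billing`), field names, and helper functions are assumptions made for illustration, not a prescribed design.

```python
# Hypothetical mapping from each source's field names to canonical names.
SEMANTIC_MAP = {
    "crm": {"cust_id": "customer_id", "full_name": "name"},
    "billing": {"customerId": "customer_id", "plan_code": "plan"},
}


def to_canonical(source, record):
    """Rename a record's fields to canonical names, dropping unmapped fields."""
    mapping = SEMANTIC_MAP[source]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def build_customers(feeds):
    """Merge records from all feeds into one entity per customer_id."""
    customers = {}
    for source, records in feeds.items():
        for record in records:
            canonical = to_canonical(source, record)
            entity = customers.setdefault(canonical["customer_id"], {})
            entity.update(canonical)
    return customers


# Hypothetical feeds describing the same customer under different field names.
feeds = {
    "crm": [{"cust_id": "c1", "full_name": "Ada Lovelace", "notes": "vip"}],
    "billing": [{"customerId": "c1", "plan_code": "pro"}],
}

customers = build_customers(feeds)
print(customers)
```

The hard part in practice is not the code but agreeing on the mapping itself, which is where the time spent understanding the relationships among sources goes.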
Integrating Software Engineers (SWE) into the data workflow
Software Engineers are responsible for generating a significant portion of the data utilized in business reports, experiments, and models. Paradoxically, they often lack insight into how their data is being utilized.
This disconnect results in Data Engineers frequently serving as intermediaries, spending more time fixing pipeline issues caused by upstream changes (in back-end or front-end services) than building new pipelines that deliver business value. Consequently, certain adjustments are necessary within this workflow:
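One adjustment that could address upstream breakage is an explicit, checkable agreement between the producing service and the data team about the shape of emitted data. The sketch below illustrates that idea only; the article does not prescribe it, and the `EXPECTED_SCHEMA` fields and the `schema_violations` helper are hypothetical.

```python
# Hypothetical schema agreed between the producing service and the data team.
EXPECTED_SCHEMA = {"user_id": str, "event": str, "timestamp": int}


def schema_violations(event, expected=EXPECTED_SCHEMA):
    """Return a list describing how `event` deviates from the expected schema."""
    violations = []
    for field, expected_type in expected.items():
        if field not in event:
            violations.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            actual = type(event[field]).__name__
            violations.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    return violations


# An event that matches the agreement, and one after an upstream change
# renamed a field and switched the timestamp to a string.
ok_event = {"user_id": "u1", "event": "signup", "timestamp": 1700000000}
changed_event = {"userId": "u1", "event": "signup", "timestamp": "2023-11-14"}

print(schema_violations(ok_event))
print(schema_violations(changed_event))
```

Running a check like this in the producing service's own test suite would let software engineers see a breaking change before it reaches downstream pipelines.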
Bridging Data Engineering and Business Context
Data engineers play a pivotal role in establishing and overseeing data platforms and their operational workflows. Positioned between software engineers and data scientists/analysts, they serve as intermediaries. However, they frequently construct pipelines without a grasp of the business context or a clear view of the ultimate purpose of the tables they create.
Absence of business context hinders data engineers from comprehending the appropriate interconnections among distinct data elements, impeding the development of a data warehouse that accurately reflects the real world.
Adopting a Data Product Mindset
The data team must guarantee that the data products they develop effectively address tangible user needs. Approaching data products solely from a technical angle is no longer sufficient. It's imperative to integrate considerations of market alignment between the data product and the user's problem into the data workflow.
Revolutionizing Data Modeling Framework
Conventional data modeling grapples with challenges like stringent governance, inflexible processes, limited adaptability for iteration, and extended timeframes for insights. While data modeling design was effective in an era of controlled data where teams could ensure incoming data fit designated schemas, the surge in data volume and sources has made applying traditional data modeling increasingly difficult. Consequently, data warehouses began to deviate from their primary purpose.
To navigate this landscape, the data ecosystem must consider ushering in Data Modeling 2.0 tailored for the Modern Data Stack. This entails embracing a decentralized data architecture, where data is dispersed across domains, potentially offering a remedy to this predicament.
Establishing Data Governance Guidelines
In the past few decades, the majority of investments in data initiatives were primarily directed towards enhancing and expanding technology. Unfortunately, there was less emphasis on refining processes and implementing effective data management practices. Here are several approaches to guaranteeing the proficient, secure, and ethical management of data:
Defining Data Governance Measures
Is Data Mesh the Ultimate Answer?
In my perspective, the pivotal aspect isn't solely the framework itself, but rather our competence in leveraging data to genuinely enhance business outcomes.
Data mesh certainly addresses several of the challenges highlighted in this article, yet it isn't a universal panacea, nor can it be implemented without considering an organization's unique context. Moreover, it's still an early-stage framework, and its true essence will evolve as businesses gradually adopt it.
In conclusion, despite my somewhat cautious stance on the Modern Data Stack, I am strongly optimistic about the future of the data industry and our ability to adapt and improve.