Self-Service Data Piloting: keys to unlock Single Source of Truth
Credit: Image by Freepik

Self-Service Data Piloting: keys to unlock Single Source of Truth

As companies increasingly leverage data to power digital products, drive decision making, and fuel innovation; understanding the quality and reliability of these most critical assets is fundamental. For decades, organisations have relied on data catalogs to power data governance. But is that enough?


Let’s understand why data governance isn’t meeting the needs of the modern data stack, and how a new approach: Self-Service Data Piloting is needed to better facilitate data and metadata management.


It’s no secret: knowing where your data lives and who has access to it, is fundamental to understanding its impact on your business. In fact, when it comes to building a successful data platform, it’s critical that your data is both organised and centralised, whilst automatically maintain a single source of truth in real-time.


No alt text provided for this image
Source: cjts3rs.files.wordpress.com


Analogous to a physical library catalog (as depicted above), data catalogs serve as an inventory of metadata and give users the information necessary to evaluate data accessibility, quality, location, and have also emerged as a powerful tool for data management and data governance. 


Not surprisingly, for most data leaders – one of their first imperatives is to build a data catalog. 


At the bare minimum, a data catalog should answer: 

  • Where should I look for my data? 
  • Does this data matter? 
  • What does this data represent?
  • Is this data relevant and important? 
  • How can I use this data? 


Still, as data operations mature and data pipelines become increasingly complex, traditional data catalogs often fall short of meeting these requirements.


Here’s why some of the best data engineering teams are innovating their approach to metadata management – and what they’re doing instead:


Data catalogs falling short

While data catalogs have the capability to document data, the fundamental challenge of allowing users to maintain a “single source of truth” and glean meaningful, real-time insights about the quality and reliability of your data has largely remained unsolved.


Data catalogs as we know them are unable to keep pace with this new reality for three primary reasons:

  1. Lacking automation,
  2. Inability to scale, and
  3. Their undistributed format. 


Increased need for automation

Traditional data catalogs and governance methodologies typically rely on data teams to do the heavy lifting of manual data entry, holding them responsible for updating the catalog as data assets evolve. This approach is not only time-intensive, but also requires significant manual toil that could otherwise be automated, freeing up time for data engineers and analysts to focus on projects that move the needle.


As a data professional, understanding the state of your data is a constant battle and speaks to the need for greater, more customised automation. Perhaps this scenario springs to mind: 


Before stakeholder meetings, do you often find yourself frantically pinging #data Slack channels, Teams chat to figure out what data sets feed a specific report or model you are using? And why on earth the data stopped arriving last week? To cope with this, do you and your team huddle together in a room and start whiteboarding all the various connections upstream and downstream for a specific key report?


No alt text provided for this image
Source: xkcd.com


If this sounds familiar, you’re not alone. Many organisations that need to solve this dependency jigsaw puzzle embark on a multi-year process to manually map out all their data assets. Some organisations dedicate resources to build short-term hacks or even in-house tools that allow them to search and explore their data. Even if it gets you to the end goal, this poses a heavy burden on the data organisation, costing your data engineering team time and money that could have been spent on other things, like product development or using the data. 


Ability to scale as data changes

Data catalogs work well when data is structured, but that’s not always the case. As machine-generated data increases and companies invest in machine learning (ML) initiatives, unstructured data is becoming more and more common, accounting for over 90% of all new data produced.


Typically stored in data lakes, unstructured data does not have a predefined model and must go through multiple transformations to be usable and useful. Unstructured data is very dynamic, with its shape, source, and meaning changing all the time as it goes through various phases of processing, including transformation, modeling, and aggregation. What we do with this unstructured data (i.e., transform, model, aggregate, and visualise it), makes it much more difficult to catalog in its “desired state.”


On top of this, rather than simply describing the data that consumers access and use, there’s a growing need to also understand the data based on its intention and purpose. How a producer of data might describe an asset would be very different from how a consumer of this data understands its function, and even between one consumer of data to another there might be a vast difference in terms of understanding the meaning ascribed to the data.


For instance, a data set pulled from CRM has a completely different meaning to a data engineer than it would to someone within the sales team. While the engineer would understand what “SF_v7_23” means, the sales team would be scratching their heads, trying to determine if said data set correlated to their “Sales Forecast FY23” dashboard in CRM. And the list goes on...


No alt text provided for this image
Source: xkcd.com


Static data descriptions are limited by nature. In today’s era, we must accept and adapt to these new and evolving dynamics to truly understand the data.


Data is distributed, catalogs are not

Despite the distribution of the modern data architecture and the move towards embracing semi-structured and unstructured data as the norm, most data catalogs still treat data like a one-dimensional entity. As data is aggregated and transformed, it flows through different elements of the data stack, making it nearly impossible to document.


Nowadays, data tends to be self-describing, containing both the data and the metadata that describes the format and meaning of that data in a single package.


Since traditional data catalogs are not distributed, it’s near to impossible to use as a central source of truth about your data. This problem will only grow as data becomes more accessible to a wider variety of users, from BI analysts to operations teams, and the pipelines powering ML, operations, and analytics become increasingly complex.


No alt text provided for this image
Source: xkcd.com


A modern data catalog needs to federate the meaning of data across these domains. Data teams need to be able to understand how these data domains relate to each other and what aspects of the aggregate view are important. They need a centralised way to answer these distributed questions as a whole – in other words, a distributed, federated data catalog.


Investing in the right approach to building a data catalog from the outset will allow you to build a better data platform that helps your team democratise and easily explore data, allowing you to keep tabs on important data assets and harness their full potential.


Understanding Self-Service Data Piloting

Data catalogs work well when you have rigid models, but as data pipelines grow increasingly complex and unstructured data becomes the golden standard, our understanding of this data (what it does, who uses it, how it’s used, etc.) does not reflect reality.


Believe that next generation catalogs will have the capabilities to learn, understand, and infer the data, enabling users to leverage its insights in a self-service manner. In addition to cataloging data, metadata and data management strategies must also incorporate data piloting – which refers to the process of conducting small-scale experiments / trials to assess the feasibility and effectiveness of any data-related projects or initiatives. While data piloting is not a new approach, the ability to self-serve provides a game-changing alternative that empowers decision-makers at all levels with direct access to data and enables them to make informed choices in real-time. But how do we get there?


Self-Service Data Piloting replaces the need for a data catalog by providing a domain-specific, dynamic understanding of your data based on how it’s being ingested, stored, aggregated, and used by a set of specific consumers. As with a data catalog, governance standards and tooling are federated across these domains (allowing for greater accessibility and interoperability), but unlike a data catalog, Self-Service Data Piloting surfaces a real-time understanding of the data’s current state as opposed to it’s ideal or “cataloged” state.


Self-Service Data Piloting can answer these questions not just for the data’s ideal state but for the current state of the data across each domain:


  • What data set is most recent? Which data sets can be deprecated? 
  • When was the last time this table was updated? 
  • What is the meaning of a given field in my domain? 
  • Who has access to this data? When was the last time this data was used? And by whom? 
  • What are the upstream and downstream dependencies of this data? 
  • Is this production-quality data? 
  • What data matters for my domain’s business requirements? 
  • What are my assumptions about this data, and are they being met?


Believe that the next generation data catalog, in other words, Self-Service Data Piloting will have the following features:


Automation and Orchestration

Data teams should be able to easily leverage their data catalog without a dedicated support team. Self-service, automation, and workflow orchestration for your data tooling removes silos between stages of the data pipeline, and in the process, making it easier to understand and access data. Greater accessibility naturally leads to increased data adoption, reducing the load for your data engineering team.


Scalability as data evolves

As companies ingest more and more data and unstructured data becomes the norm, the ability to scale to meet these demands will be critical for the success of your data initiatives. Self-Service Data Piloting leverages machine learning to gain a bird’s eye view of your data assets as they scale, ensuring that your understanding adapts as your data evolves. This way, data consumers are set up to make more intelligent and informed decisions instead of relying on outdated documentation, or worse – gut-based decision making. 


Data Lineage for distributed data

Self-Service Data Piloting relies heavily on automated table and lineage to map upstream & downstream dependencies between data assets. Data lineage helps surface the right information at the right time and draw connections between data assets so you can better troubleshoot when data pipelines do break, which is becoming an increasingly common problem as the modern data stack evolves to accommodate more complex use cases.


Data quality & reliability to ensure the gold standard of data

The truth is — in one way or another — your team is probably already investing in data piloting. Whether it’s through manual work your team is doing to verify data, custom validation rules your engineers are writing, or simply the cost of decisions made based on broken data or silent errors that went unnoticed. Modern data teams have started leveraging automated approaches to ensuring highly trustworthy data at every stage of the pipeline, from data quality monitoring to more robust, end-to-end data reliability platforms that monitor and alert for issues in your data pipelines. Such solutions notify you when data breaks so you can identify the root cause quickly for fast resolution and prevent future downtime.


Self-Service Data Piloting empowers data teams to trust that their assumptions about data match reality, enabling dynamic data piloting and a high degree of quality and reliability across your data infrastructure, regardless of domain. 



Final thought: If, in some complex situations (or context), bad data is worse than no data, a data catalog without data piloting is worse than not having a data catalog at all. To achieve truly Self-Service Data Piloting, it’s important that your data is not just “cataloged,” but also accurate, clean, and correct for ingestion to consumption – in other words: right quality and reliable.


A self-service approach to data piloting relies on automated and scalable data management, which works with the newly distributed nature of data systems. Therefore, to truly enable Self-Service Data Piloting in an organisation, we need to rethink how we are approaching the data catalog.  Only by understanding your data, the state of your data, and how it’s being used – at all stages of its lifecycle, across domains – can we even begin to trust it.


Vokse fuels revenue initiatives, supercharges customer experience & streamlines collaboration by enabling organisations to pilot their data quality in real-time within a central, intuitive, and collaborative Self-Service Data Piloting platform.


Learn more by visiting www.vokse.eu | www.linkedin.com/company/vokse-dpa OR give me a follow!

To view or add a comment, sign in

More articles by Ankit Mehta

Insights from the community

Others also viewed

Explore topics