Modern data stack in 2022, metadata that brings it all together, some weekly reads, and more
Welcome to this week's edition of the ✨ Metadata Weekly ✨ newsletter.
Every week I bring you my recommended reads and share my (meta?) thoughts on everything around metadata.
This edition includes my predictions about the modern data stack in 2022, some interesting reads on modern data stack trends, what’s missing in the modern data stack (hint: metadata that brings it all together), and more.
P.S. In case you prefer email over social, the newsletter is on Substack. You can subscribe here.
✨ Future of the Modern Data Stack in 2022
It’s that time of the year – to read the yearly predictions and reports on trends, and build your own understanding of what’s to come. Last year was filled with buzz words, some hype, and a lot of events that set the data world on fire.
I spent some of my downtime studying everything that happened in the data world in 2021 and writing up my predictions for 2022. Read the article here. Below are the six ideas about the modern data stack that exploded in the data world last year and don’t seem to be going away:
1. Data Mesh: The idea of the data mesh came from Zhamak Dehghani in 2019, and it was everywhere in 2021! This year, we’ll see a ton of platforms rebrand and offer their services as the “ultimate data mesh platform”. But my advice for data leaders: stick to the first principles at a conceptual level rather than buy into the hype that you’ll inevitably see in the market soon.
2. Metrics Layer: Metrics have often been split across different data tools, with different definitions for the same metric across different teams or dashboards. In 2021, people finally started talking about the metrics layer, the metrics store, and headless BI. Base Case proposed “Headless Business Intelligence” while Benn Stancil talked about the “missing metrics layer” in today’s data stack.
Metrics layers are deeply intertwined with the transformation process, so I’m really excited about dbt announcing their foray into the metrics layer. I think we’ll finally see metrics become first-class citizens in more transformation tools in 2022!
3. Reverse ETL: I’m pretty excited about everything that’s solving the “last mile” problem in the modern data stack. What I’m not so sure about is whether reverse ETL should be its own space or just be combined with a data ingestion tool, given how similar the fundamental capabilities of piping data in and out are, as some tools have already started offering. I believe we’ll see more consolidation and deeper partnerships in this space soon!
4. Active Metadata and Third-Gen Data Catalogs: There’s never been so much buzz about data catalogs and metadata! At the beginning of 2021, I wrote an article to introduce the idea that we’re entering the third generation of data catalogs. This idea got amplified by a huge move from Gartner — introducing “active metadata” as a new category in the data space.
I’m probably biased, given that I’ve dedicated my life to building a company in the metadata space. But I truly believe that the key to bringing order to the chaos that is the modern data stack lies in how we can use and leverage metadata to create the modern data experience.
Where data catalogs in the 2.0 generation were passive and siloed, the 3.0 generation is built on the principle that context needs to be available wherever and whenever users need it. They leverage metadata to improve existing tools like Looker, dbt, and Slack.
5. Data Teams as Product Teams: Emilie Schario, Taylor Murphy, and Eric Weber have been talking about a way to break data teams out of the “service trap” — rethinking data teams as product teams. Of all the hyped trends in 2021, this is the one I’m most bullish on. Data teams today are stuck in a service trap, and only 27% of their data projects are successful. I believe the key to fixing this lies in the concept of the “data product” mindset, where data teams focus on building reusable, reproducible assets for the rest of the team.
6. Data Observability: This idea of “monitoring, tracking, and triaging of incidents to prevent downtime” came out of “data downtime”, which Barr Moses from Monte Carlo first spoke about in 2019. In the past two years, data teams have realized that tooling to improve productivity is not a nice-to-have but a must-have.
While data observability will be a key part of the modern data stack in the future, what I’m not so sure about is whether it will continue to exist as its own category or will be merged into a broader category (like active metadata or data reliability).
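To make the metrics layer idea from item 2 a little more concrete, here is a minimal sketch of "define a metric once, reuse it everywhere". Everything in it is an assumption for illustration: the `Metric` class, the `METRICS` registry, the `compile_metric` helper, and the `orders` table are hypothetical names, and this is not how dbt or any other tool actually implements its metrics layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    """A single, shared definition of a business metric (hypothetical)."""
    name: str
    expression: str    # SQL aggregate that defines the metric
    source_table: str  # table the metric is computed from


# Hypothetical registry: the one place where each metric is defined.
METRICS = {
    "revenue": Metric(name="revenue", expression="SUM(amount)", source_table="orders"),
}


def compile_metric(name: str, group_by: str) -> str:
    """Turn a shared metric definition into a SQL query for one dimension."""
    metric = METRICS[name]
    return (
        f"SELECT {group_by}, {metric.expression} AS {metric.name} "
        f"FROM {metric.source_table} GROUP BY {group_by}"
    )


# Two teams slice by different dimensions but share one definition of revenue.
by_region = compile_metric("revenue", "region")
by_month = compile_metric("revenue", "order_month")
```

The point of the sketch is only that both queries read the same `expression`, so a dashboard built by one team cannot silently drift from another team's definition of the same metric.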
Some additional reads on data trends to help you build your own understanding of how the modern data stack will take shape in 2022:
❤️ Fave Links from This Week
The Metadata Money Corporation by Benn Stancil
“Today’s data stack is quickly fracturing into smaller and more specialized pieces, and we need something that binds it all together. The stack isn’t missing another hub or aspirational standard; it’s missing plumbing.... To get information from every tool in the rest of the stack, vendors would only need to integrate with a single API, no new standards necessary.”
Benn’s words are music to my ears. The only reality in data teams is diversity — a diversity of people, tools, and technology. Diversity leads to chaos and sub-optimal experiences for everyone involved.
I believe that the key to helping us achieve the modern data experience lies in activating metadata. This to me is the obvious next generation of what data catalogs should look like: instead of just collecting metadata and bringing it into a siloed “passive” tool, an active metadata platform should be bidirectional. It should not only bring metadata together into a single store like a metadata lake, but also create intelligent connections between the metadata and the places it came from, improving the experience for everyone involved.
Here’s an example of this in action. When a data user looks at a dashboard, dozens of questions come up: Is this dashboard updated? When was it last updated? Can I trust it? Who owns this dashboard? What does this metric actually mean? Did this data come from our Sales or Finance team?
Ideally, the answers to these questions should be available right there in the dashboarding tool, giving users “context” when they need it, where they need it. But... the answers to these questions are hard to find because the “truth” in the data ecosystem is dispersed. The truth about whether the data was updated is in Airflow or dbt, whether I can trust the data is in Soda or Monte Carlo, who owns it is in people’s heads, and so on. 🙂
Even if a BI vendor attempted to answer these questions, they’d have to integrate with 100+ tools in the stack, which is a non-trivial problem. I wrote about an active metadata platform last year – a platform that brings all the metadata back to one place, uses intelligence to auto-construct connections like lineage, and finally offers this information back to every tool in the ecosystem via a simple API. Metadata goes from a single siloed use case like a catalog to basically being the engine that enriches every tool in the ecosystem. The article shares a lot of how we have built Atlan as an active metadata platform.
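Here is a rough sketch of that flow, under loud assumptions: the `MetadataStore` class, its methods, and the asset names are all hypothetical, the tool names are just labels, and lineage is added by hand here rather than auto-constructed. It is not Atlan's API or any vendor's. It only shows the shape of the idea: many tools push facts about an asset into one store, and any tool can then pull the combined context back through a single call.

```python
from collections import defaultdict


class MetadataStore:
    """Toy single store for metadata pushed from many tools (hypothetical)."""

    def __init__(self):
        self._facts = defaultdict(dict)    # asset -> {field: {"value", "source"}}
        self._upstream = defaultdict(set)  # asset -> its direct upstream assets

    def ingest(self, tool, asset, **fields):
        """Record facts about an asset, remembering which tool reported them."""
        for key, value in fields.items():
            self._facts[asset][key] = {"value": value, "source": tool}

    def add_lineage(self, upstream, downstream):
        """Register that `downstream` is built from `upstream`."""
        self._upstream[downstream].add(upstream)

    def lineage(self, asset):
        """Return every asset that `asset` depends on, directly or indirectly."""
        seen, stack = set(), [asset]
        while stack:
            for parent in self._upstream.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def context_for(self, asset):
        """The single call a BI tool could make to enrich a dashboard."""
        context = dict(self._facts.get(asset, {}))
        context["upstream"] = sorted(self.lineage(asset))
        return context


store = MetadataStore()
store.ingest("airflow", "sales_dashboard", last_updated="2022-01-10")
store.ingest("quality_tool", "sales_dashboard", checks_passed=True)
store.ingest("wiki", "sales_dashboard", owner="finance-team")
store.add_lineage("orders_table", "sales_model")
store.add_lineage("sales_model", "sales_dashboard")

context = store.context_for("sales_dashboard")
```

With this shape, the dashboarding tool integrates once with `context_for` instead of once with every tool that holds a piece of the truth, and each answer still carries the name of the tool it came from.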
I think Benn’s post also cautions us about something that I believe is dangerous to the ecosystem — it's no secret that metadata is HOT in 2022. But, in this market hype, we are falling prey to solving trivial problems while forgetting the real problem in the ecosystem. IMO the real burning need in the metadata space isn’t the need for yet another aspirational “open” standard. It is that we haven’t figured out how to help our end users get value from metadata beyond some initial “scratch the surface” use cases like data cataloging and discovery. And this needs all of us to shift the conversation from yet another argument about push vs. pull architectures to what it’s going to take to help end-users (tools and people) finally drive value from metadata.
Some other reads that I enjoyed:
I'll see you next week with more interesting stuff around the modern data stack. Meanwhile, I'd love for you to subscribe to the newsletter on Substack and connect with me on LinkedIn here.
P.S. Liked this newsletter? Share it with a friend by sending them this link.