The marriage of Azure Purview and CluedIn

With Azure Purview now in GA, what better time to talk about CluedIn's native integration with the new kid in town? Countless companies have asked us how we play with Azure Purview, and we can say with all confidence that the answer is "natively". We thought the best way to talk through the integration would be to paint a picture of what this marriage of CluedIn and Azure Purview solves in the enterprise data space.

Imagine a situation where you have 10 systems: 2 CRM systems, 3 ERP systems, 1 Data Lake, 2 HR systems, a Support Desk and Office 365. You have been asked to bring this data together and start generating insights in Azure Synapse and Power BI. This is an ambition of many companies, yet it is still quite hard to achieve today. We would like to step you through how it is possible. To meet this ambition, many components in Azure need to be stitched together. The end goal of this scenario is not only to generate insights from the data, but to set up a flow that guarantees insights will continue to be generated as the project evolves.

Let's first start by introducing Azure Purview. It is touted as a unified data governance solution that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. Data Governance is, without a doubt, the number one ambition of companies today, and it is great to see Microsoft bringing a solution to the Azure space.

In fact, the eras of Machine Learning and Business Intelligence (although still very popular) are somewhat responsible for shining a big light on the need for a proper Data Governance foundation. On top of this, Azure Purview is tasked with creating a holistic, up-to-date map of your data landscape with automated data discovery, sensitive data classification, and end-to-end data lineage, enabling data consumers to find valuable, trustworthy data.

Our team at CluedIn has been working and integrating with the preview of Purview for months now. With regard to our scenario above, we can say that Purview helps answer the following questions:

What data assets do you have?

 i.e. we should know that there is a Company table in CRM #1 and an Account table in CRM #2. We should know that we have 10 systems, what they are and how to connect to them.

Where is it?

Purview helps us get a location of the data and we can also register custom metadata in Purview to add more details if necessary.

It must be said that Purview also gives us a framework for interacting with it through its REST API, and with that a mechanism to eventually answer other questions like:

  • What happened to the data from source to target?
  • Where is our data being used?

However, we need to do a little extra work to enable this.

Enter CluedIn. CluedIn is a native Azure Master Data Management system like you have never seen before. Throw your preconceptions of MDM out: CluedIn is a cloud-native, modern solution for unifying data from across your business and getting that data ready for generating insights. There are many good reasons to throw your preconceptions out, but for me, the main one is Gartner's statistic that 85% of traditional MDM initiatives fail. There is clearly something wrong here, and clearly a line has to be drawn in the sand that separates traditional MDM approaches from modern ones. I can stand here, hand on heart, and tell you that our failure rate is 12%. You won't find many Vendors talking about failure rates, only success, but this is just one more reason why CluedIn is different. Transparency.

With CluedIn in the story alongside Purview, we can start answering different questions and also move the data closer to the point where it is ready for insight. Adding CluedIn into the mix now helps us answer:

  • Who is responsible for data as it moves throughout the business?
  • What happened to the data along the way?

Now that we have learnt a little bit about both the players, let's discuss, at a high level, the main value that comes from the synergy of CluedIn and Purview. 

What is the big value of a Purview / CluedIn kinship?

  • Purview is about getting the birds-eye view, CluedIn is about zooming in on the details. 
  • Purview brings cross-platform visibility into data movement. To realize this, systems like CluedIn have to remember to register their information back into Purview, or we can't achieve that true end-to-end lineage.
  • Purview provides scanning at source, CluedIn provides scanning at record level (but does ask you to move the data in question). 
  • Purview is your data catalog for assets across your business, CluedIn takes these assets and turns them into ready-for-insight data. 

The first step to data insight is actually knowing what data you have that IS actionable. You don't want to move data and pay processing costs before you know what you want to commit to. Unfortunately, this is a double-edged sword, i.e. you often don't know what you don't know. It is also worth mentioning that, just as the Data Lake earned its bad reputation as a Data Swamp, the Data Catalog faces the same challenge: if it is not managed, it will soon become a mess.

There are ways that Purview helps with this. For example, Purview can support Synonyms, so if your tables, files and assets are not all meticulously called 'Customers', you can tell Purview to also look for files and tables called Accounts, Companies and others.
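To make the synonym idea concrete, here is a minimal sketch in Python. It is not how Purview implements synonyms; the `SYNONYMS` table, the `find_assets` helper and the asset names are all hypothetical and only illustrate how one business term can fan out to differently named assets.

```python
# Hypothetical synonym groups: a business term and the asset names that mean the same thing.
SYNONYMS = {
    "customers": {"customers", "accounts", "companies"},
}


def _normalize(name):
    # Lower-case the name and drop a trailing file extension such as ".csv".
    lowered = name.lower()
    return lowered.rsplit(".", 1)[0] if "." in lowered else lowered


def find_assets(term, asset_names):
    """Return the asset names that match `term` or any of its synonyms."""
    wanted = SYNONYMS.get(term.lower(), {term.lower()})
    return [asset for asset in asset_names if _normalize(asset) in wanted]


assets = ["Accounts", "Companies.csv", "Invoices", "customers"]
matches = find_assets("Customers", assets)
print(matches)  # ['Accounts', 'Companies.csv', 'customers']
```

The point of keeping the synonyms in data rather than in code is that a new name for "customers" becomes a one-line change to the table, not a new branch of logic.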

Now that you have registered what data assets you have, there is still a long journey to be had before this data is ready for insight.  

CluedIn essentially sits on top of Purview and then takes the data to its next step of maturity i.e. "Let's turn this scattered data into consolidated data".

CluedIn evolves the maturity of the data in that it transitions thinking about data at a dataset level, into thinking about data at a record level. We think this is necessary because often you will find that your "customers" are scattered across multiple files, tables and systems instead of being in one nice and clean file. Our job at CluedIn now is to take the assets and turn them into data that is ready for insights. 

At CluedIn, we believe there are two audiences for a Data Catalog (and hence two types of Data Catalog). The first is someone who wants to know what raw datasets they have across their business. Purview is great for this. The second is someone closer to the business, who is not so interested in knowing that customers sit across 4 SQL tables or what the Primary and Foreign Keys are, but is much more interested in just Customers, where the column names have been standardized AND the data has been normalized.

So now, with CluedIn and Purview, what we have brought to the table at this point is:

  • Instead of 4 customer files/tables in different formats and structures, we now have 35,342 customers. In the end, our job at CluedIn is to have these customers all with aligned column names, normalized values and more. This is 35,342 customers that have been de-duplicated, cleaned, enriched and standardized, instead of 4 tables with customers in them that need that attention every time they are used in the future.

What is important to establish is that Purview and CluedIn will NOT be the last target for data. This means that ALL tools need to register their data movement in Purview because, when you think about it, there is no way for Purview, CluedIn, Azure Databricks and other systems to track data past one hop. Once a system like CluedIn or Azure Databricks has pushed data to another system, there is no easy way to track what happens after the data has been "let out of the bag". CluedIn tracks not only data coming directly to it (and not through Purview first), but also what data goes out to other Targets. We write this lineage directly back into Purview, as we believe Purview is the central place where it should be registered. THIS is how we will achieve end-to-end lineage. It does require other vendors, like Azure Databricks or any other tools you use in Azure that handle, move or store data, to support writing their data processing and movement back to Purview as well, OR for Purview to reach out and GET this information from those tools itself.
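As a rough illustration of what "registering movement" could look like: Purview's catalog is built on Apache Atlas, where a lineage hop is typically modelled as a Process entity with inputs and outputs. The sketch below only builds such a payload and makes no call to Purview. The helper name, the qualified names and the use of the generic `DataSet` type for the assets are assumptions for illustration, not CluedIn's actual integration code.

```python
import json


def build_lineage_process(name, qualified_name, inputs, outputs):
    """Build an Atlas-style 'Process' entity linking input and output assets.

    `inputs` and `outputs` are lists of (type_name, qualified_name) tuples.
    Every name passed in below is illustrative, not a real registered asset.
    """

    def ref(type_name, asset_qualified_name):
        # Point at an already-registered asset by its unique qualifiedName.
        return {
            "typeName": type_name,
            "uniqueAttributes": {"qualifiedName": asset_qualified_name},
        }

    return {
        "entity": {
            "typeName": "Process",
            "attributes": {
                "name": name,
                "qualifiedName": qualified_name,
                "inputs": [ref(t, qn) for t, qn in inputs],
                "outputs": [ref(t, qn) for t, qn in outputs],
            },
        }
    }


# Hypothetical hop: two CRM tables consolidated into one Synapse table.
# "DataSet" is a placeholder; real assets would carry their concrete type names.
payload = build_lineage_process(
    name="customer_consolidation",
    qualified_name="example://processes/customer-consolidation",
    inputs=[
        ("DataSet", "example://crm1/tables/Company"),
        ("DataSet", "example://crm2/tables/Account"),
    ],
    outputs=[("DataSet", "example://synapse/tables/Customer")],
)
print(json.dumps(payload, indent=2))
```

Every tool that moves the data on would need to register its own hop in the same way, which is exactly why lineage only becomes end to end once all the tools take part.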

Now that we have Purview serving the data to CluedIn, CluedIn is consolidating the data and making it ready for downstream systems, e.g. Azure Synapse. Let's now talk about how we can slice and dice the data to match the use cases we want to drive. For this, let's talk about the Purview and CluedIn Glossaries.

The Glossary in Purview is about describing assets; the Glossary in CluedIn is about describing data after it has been integrated, standardized and brought to a record level.

For example, in Purview we can answer "Where is my customer data?"

In CluedIn, we can now answer "Who are my customers?"

The bottom line is, you need both. You can't have one without the other. 

CluedIn has worked meticulously to bridge the best of Purview with the best of CluedIn to offer a seamless and fluid experience. Here are some examples of the integration in the flesh.

The Azure Purview Glossary is available directly in CluedIn and vice-versa.

This is such a nice marriage as it allows you to easily transition from an asset level Glossary to a record level Glossary. 

CluedIn can ingest assets that have been registered in Purview.

When registering assets in Purview, one can assign Key Vaults to the resources. Given the right access, CluedIn can read directly from the Purview Registered Resource and the Key Vault to self-authenticate with the source, requiring one less hop in the chain of getting data into your MDM solution, cleaned, and out in the ether and generating insights. 

CluedIn builds on the personal information scanning in Azure Purview and can pinpoint, down to a record level, where the personal data is. It also adds support for personal information scanning in unstructured and semi-structured data, not just structured data. So CluedIn can scan files, mail, PDFs and more.

CluedIn will use the schema set in Purview to automatically map data sets into CluedIn.

At CluedIn, we have developed a zero-modelling approach to MDM that is just revolutionizing the space. That said, we can still leverage the metadata in schemas to hint to CluedIn what data types and constraints should apply to the data, but NOT enforce them on entry into CluedIn. Rather, we want to flag that there are issues and assign them to data stewards to rectify. All other Vendors will simply reject these records, not even giving Data Stewards the chance to fix things!
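The "flag, don't reject" idea can be sketched in a few lines of Python. This is an illustration of the principle only, not CluedIn's implementation; `SCHEMA_HINTS`, `IngestedRecord` and `ingest` are hypothetical names, and the hints stand in for whatever schema metadata the catalog provides.

```python
from dataclasses import dataclass, field

# Hypothetical schema hints, of the kind a catalog schema might provide.
SCHEMA_HINTS = {
    "customer_id": int,
    "name": str,
    "employees": int,
}


@dataclass
class IngestedRecord:
    data: dict
    issues: list = field(default_factory=list)  # problems for a data steward to fix


def ingest(raw):
    """Accept every record; note hint violations as issues instead of rejecting."""
    record = IngestedRecord(data=dict(raw))
    for column, expected in SCHEMA_HINTS.items():
        value = raw.get(column)
        if value is None:
            record.issues.append(f"{column}: missing value")
        elif not isinstance(value, expected):
            record.issues.append(
                f"{column}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return record


rows = [
    {"customer_id": 1, "name": "Contoso", "employees": 250},
    {"customer_id": "A-17", "name": "Fabrikam"},
]
ingested = [ingest(row) for row in rows]
for rec in ingested:
    print(rec.data["name"], rec.issues)
```

Both rows make it in. The second one simply carries two issues with it, which can then be routed to a data steward rather than being lost at the door.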

CluedIn extends the Purview Lineage with detailed processing logs.

This is my favorite. If you can easily describe to me, in detail, what happened to the data within your business on its way from raw to insights, then I would be impressed. Most companies don't have anything anywhere close to this. CluedIn's job is to explain itself as it transforms the data. This means everything from:

  • Why did a record merge with another record?
  • Why did we choose to use the City from one record over another?
  • Where did we get the Industry of the company from?
  • What business rules were triggered on the data?
  • What IDs did a record use to merge with other records?

CluedIn can initiate Purview Scans before a new data ingestion is scheduled.

If sensitive data must not even be MOVED between different systems, then this adds an extra layer of risk mitigation.

Summary

With this stack in place, it is not about having 100% data quality across the board before we generate insights. It is about putting a system and flow in place in which we can improve data quality over time in a tracked, transparent manner. With this stack we can now answer:

  • What data do we have?
  • Where is the data?
  • What is the quality?
  • Who owns the data?
  • Who is responsible for every step of the data journey from start to finish?
  • What happened to the data from raw to insight?

Now that CluedIn and Purview have both done their jobs, CluedIn is now responsible for making this data available to Synapse (in this case) so that the Data Warehousing team can refine the data even more to fit into the insights that need to be generated. 

Although slightly off topic, I wanted to touch on some of the other pieces of Azure and how they fit into the picture. For example, let's ask some obvious questions:

Why can't I just plug Azure Databricks or Azure Synapse over Purview and give it to my Data Engineering team to solve this challenge?

Firstly, I am not saying this can't be done. But there is a good reason that the concept of Master Data even exists. After a while, you realize that some parts of data processing do need Data Engineering and IT involved, while in others, involving IT would result in unmaintainable work. I can say this with confidence: I have been in Software Engineering for 15+ years, and it is very clear that some parts of the data processing journey just do not scale and do not work well if an IT approach is taken.

Here are some very practical examples of things that on the surface can be solved with data engineering, but will not scale. 

 - We have data on companies coming from multiple places, but the city values are not normalized. Sometimes people write SYD, sometimes Sydney, sometimes Cydney! On the surface, this seems like a simple if statement, but now multiply it by all the permutations and all the cities. Let's just say that IT can solve this. The problem is that whenever there is a new permutation, IT has to get involved and the work has to be scheduled.

 - When bringing data together from across multiple sources, it is often not obvious how to merge records together. There are great fuzzy merging libraries in Python, but once again, this will need constant IT attention as anomalies arise.

 - When bringing data together from across multiple sources, you will find that datasets 1 and 3 can talk to each other, but 1 and 2 can't. For datasets 1 and 2 to talk, they have to jump through an ID in datasets 3 and 4. Now add one more dataset and you have 5! possible links to investigate to see if records can be triangulated. This is NOT a scalable approach to integration. We can say this with full confidence, as I can call out one specific CluedIn customer that has 642 data sources. This would be unachievable in ANY other tool.

 - You will most likely be using the same code and logic as new datasets come through, but it is unlikely that you will be reusing it properly. CluedIn can centralize automatic transformations so you don't have to maintain dictionaries of values that should be auto-transformed.
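To show what "triangulating" records through intermediate datasets involves, here is a minimal union-find sketch in Python. It is a generic technique chosen for illustration, not a description of how CluedIn does it, and the record keys and identifiers are made up. Records that share any identifier are linked, and links are followed transitively, so datasets 1 and 2 end up connected through datasets 3 and 4.

```python
from collections import defaultdict


def triangulate(records):
    """Group records that share an identifier directly or through other records.

    `records` maps a record key to a set of (id_type, value) identifiers.
    """
    parent = {key: key for key in records}

    def find(x):
        # Walk up to the root of x's group, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    first_seen = {}  # identifier -> first record key that carried it
    for key, identifiers in records.items():
        for identifier in identifiers:
            if identifier in first_seen:
                union(key, first_seen[identifier])
            else:
                first_seen[identifier] = key

    groups = defaultdict(set)
    for key in records:
        groups[find(key)].add(key)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical records: ds1 and ds2 share no identifier directly.
records = {
    "ds1:42": {("email", "ana@example.com")},
    "ds2:A9": {("vat", "DK123")},
    "ds3:x1": {("email", "ana@example.com"), ("crm_id", "C-7")},
    "ds4:77": {("crm_id", "C-7"), ("vat", "DK123")},
    "ds5:zz": {("email", "bob@example.com")},
}
clusters = triangulate(records)
print(clusters)
# [['ds1:42', 'ds2:A9', 'ds3:x1', 'ds4:77'], ['ds5:zz']]
```

The algorithm itself is short. What does not scale as hand-written pipeline code is deciding, for hundreds of sources, which identifiers are trustworthy enough to link on and keeping those decisions current.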

Is there any place for Azure Synapse directly over Purview?

100%. I can think of a couple of obvious cases but will highlight the main one. Processing data through CluedIn takes time and attention. If you require QUICK, if not real-time, answers on your data, and are happy to accept that the data has holes and issues, then it makes complete sense. However, when it is time to make REAL decisions off this data, it needs to have been matured through a platform like CluedIn, i.e. duplicates need to be removed, data needs to be normalized, ownership of data needs to be applied (and more).

We have to remember that the Spark/Python approach to preparing data is a very different workflow to that of the MDM style. Typically, the Spark/Python approach is much more about taking raw data and scripting notebooks where we can see the code evaluate directly in front of our eyes. It really is a lovely approach to the problem. Once we have solved the challenge, we essentially save these as pipelines that can automate things. In principle, there are many cases where this would work beautifully. But we also assume through this that things won't change and new problems won't bubble up. The bottom line is that they will. You have a choice. You either maintain these pipelines and involve IT again and again every time a new problem comes up, OR you systemize the changes in a system like CluedIn. This is in no way saying that CluedIn replaces Spark/Python. This is saying that there are certain aspects of the data journey that should be solved with Spark/Python and others that should not.

Should CluedIn write back data to Purview?

No, as Purview is not a data store, but rather an application that scans data stores. Hence it could make complete sense for CluedIn to write data back into a folder in the Data Lake called "Cleaned". In my view, this does complicate the architecture, as you lose the lovely left-to-right flow in which the data becomes more mature, less flexible and more ready for insight as it moves along. However, it does play nicer with some of the more modern approaches of Lakes, Lake Houses and Data Warehouses.

Don't MDM systems only support storing Master-style Domains like Customers, Products?

Yes and no. Most MDM platforms will tell you "Yes", you only store certain types of data in an MDM system. We strongly disagree with this mentality. Why? Because we are asking the wrong question of our data. Instead of looking at Master Data as a short list of Domains, we should be asking "what data needs the attention of Master Data Management?" If we look at it from this angle, we then start to ask:

"What structured, semi-structured, unstructured, transactional data needs to be Integrated, Governed, Cleansed, Tracked, Enriched, Deduplicated?" In many situations, I could easily answer that all types of data need this attention. Then we need to ask: is this the BEST place to treat this data? The answer then could be quite different. But that doesn't change my view of the claim that MDM data just falls into a couple of common buckets. I can categorically say that this is just plain wrong. It honestly sounds like "Vendor-Speak" for "ours can't do that, so you shouldn't do it."

My Purview Account is scanning Hive Tables and Data Lakes that literally have TBs or PBs of data. How will CluedIn tackle this scale?

With pure transparency, CluedIn shouldn't. It doesn't make economic sense. Although I can say that our aim is to innovate in a way that does handle this, it just won't make economic sense to host PBs of data in CluedIn today. It is kind of like storing data in a hot-seat. That said, CluedIn can scale immensely, with most of our customers having many, many millions of records. In fact, we have one particular customer with over 1 billion records. But just because it can do this doesn't mean it should. I can add color to this: what CluedIn often surfaces for customers is that they don't even use a large percentage of their data to generate insights anyway.

More articles by Tim Ward
