One of the most important things to better know a data estate is to investigate into the data lineage topic.
In a non scientific definition, data lineage defines …
- Where does my data come from?
- Where does my data go to?
- and What happens on the way?
In this blog post, you’ll see how data lineage from Azure Data Factory and Synapse pipelines are pushed into the Microsoft Purview. There is also a new (at least I found out about it some days ago) functionality that brings metadata-driven, dynamic pipeline runs into the lineage information in your Purview data map.
TL;TR – There is a video for all of that
Data Lineage in Microsoft Purview
In Microsoft Purview, the lineage view allows you to get to know more about the lineage of a certain data asset. Within the chain of your data supply chain, there are different shapes that Purview puts together.
First, there are data stores like Azure Data Lake Storage accounts or Azure SQL Databases that store information and within the Purview context, contain data assets. The Microsoft Purview data map pulls this information from the sources during scan processes.
Second, there are transformation steps that connect data assets. In the example above there are two transformations shown (both are Azure Data Factory (ADF) pipelines – to be more specific ADF Copy activities). In the context of Purview, lineage information is pushed from ADF into the Purview data map.
How to get ADF/Synapse lineage into Microsoft Purview?
If you have not connected your Azure Data Factory (ADF) / Synapse workspace with your Purview account, you might start with the registration of a new source in Purview. BUT – there is no ADF source available in the Purview data governance portal.
In order to connect ADF with Purview, you need to start in ADF. Within the Manage menu, there is a Microsoft Purview section. In there, just connect your ADF instance with an existing Azure Purview account.
This connection comes with two (main) options – search your Purview data catalog from ADF and the lineage push from ADF into Purview.
Are there any additional requirements that lineage is pushed from ADF into Purview?
No. Simply speaking no. You only need an ADF Pipeline containing a copy activity. And – you need to run that pipeline at least once.
Every execution of a pipeline (in an ADF instance connected to Purview) pushes lineage information into the Purview data map. You can check that by a drill-down into the activity run history. Every copy activity execution gets a new icon (lineage status) that indicates the lineage push into Purview. In some cases, the lineage push does not work because of some limitations / requirements for the copy activity (see more in the documentation – https://docs.microsoft.com/en-us/azure/data-factory/tutorial-push-lineage-to-purview)
But what about Dynamic / Meta-data driven pipelines in ADF and Synapse?
In many of our projects, we do not develop ADF pipelines that are “hard-coded” – i.e. that are configured to copy data from a fixed-defined source into a fixed-defined destination. What we do instead is to use metadata driven pipelines.
Within these metadata driven pipelines, usually a lookup activity is used to get the list of objects to load and a ForEach activity loops over the items to load. I will not go into details of metadata driven pipelines (there are tons of blog posts out in the thing called internet).
There was one problem with this kind of pipelines and their lineage push into Purview: it simply did not work. Dynamic Copy activities were not supported and the lineage of these pipelines did not appear in the Purview data map.
But fortunately this changed a few days ago. I have not seen any public announcement of that feature but for me (and my colleagues) this is huge gamechanger in the integration of ADF/synapse and Microsoft Purview.
Let’s run the example pipeline from the screen shot above and view the monitoring information of that run.
What we can see is, that there are three executions of the Copy activity and all of them push the lineage information (indicated by the icon in the column Lineage status) into the Purview data map.
Lineage view in Microsoft Purview
Let’s head over to Microsoft Purview, search for the pipeline and … hmm… there is no lineage information available.. That’s because the pipeline itself does not expose lineage information – it’s the copy activity that does the work!
If you open the data asset of the copy activity, we’ll be successful! Lineage is there. And with that – it’s the dynamic (metadata) lineage shown here.
That’s it for the first look at the dynamic data lineage support in Microsoft Purview. I have not tested it in more details, but as a first conclusion I am really happy that dynamic copy activities are finally supported!