Maybe you’ve heard of Azure Purview, the new Azure data governance service that was released into public preview at the beginning of December 2020.
Very often, when I want to test a new service, I miss some infrastructure and environments that I can start and play with. I am not talking about creating a new Azure Purview account (see my previous blog post – Creating an Azure Purview account). I am talking about the data infrastructure to analyze, catalog and gain knowledge from.
We could start creating such an infrastructure ourselves, BUT: the Purview team created a Starter Kit that quickly creates a data estate and configures everything, so that you can start with Purview within a few minutes.
What you need:
- An Azure subscription
- An existing Azure Purview account (see my blog post for a description of how to create one)
- Download the starter kit (from this page)
The starter kit creates and populates a demo data estate consisting of an Azure Blob storage account, an Azure Data Lake Storage Gen2 account and an Azure Data Factory (ADF) containing a Copy activity pipeline. PLUS: it configures the connection between the ADF and Azure Purview.
What to consider – Do not repeat my mistakes 😉
I do not want to copy the instructions from the Starter Kit page, but I want to emphasize some steps that made me think and that I had to correct because of my naming conventions!
- Just a general hint: if you download zip files from the internet and you trust the source, right-click the file in Windows Explorer and select Unblock.
- Check the IDs you need for the generation script:
  - Azure Tenant ID (get it from the Azure Active Directory properties)
  - Subscription ID (get it from the overview page of either your Purview account or the resource group)
- The Starter Kit creation scripts are PowerShell based, so follow the instructions on the Starter Kit page.
  - PLUS: Start the PowerShell terminal with the option Run as administrator.
- I had some problems in the first runs while executing the .\RunStarterKit.ps1 script…
  - …because I named my resource group for the data estate “starterdata-rg”. This leads to an error because of the hyphen: the script concatenates the resource group name into the storage account name it generates, and hyphens are not a friend of your storage account name :-)
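To catch this kind of naming problem before running the script, a small pre-check can help. The following is only a sketch: Azure storage account names must be 3 to 24 characters long and may contain only lowercase letters and digits, but I do not know exactly how the Starter Kit builds the name, so the `suffix` parameter and the function name `storage_name_problems` are assumptions made up for this illustration.

```python
import re


def storage_name_problems(resource_group: str, suffix: str = "adls") -> list[str]:
    """List the reasons why resource_group + suffix is not a valid storage account name.

    The suffix is hypothetical: it only stands in for whatever the
    Starter Kit script appends to the resource group name.
    """
    candidate = resource_group + suffix
    problems = []

    # Storage account names allow lowercase letters and digits only.
    bad_chars = sorted(set(re.findall(r"[^a-z0-9]", candidate)))
    if bad_chars:
        problems.append(f"invalid characters: {''.join(bad_chars)}")

    # Storage account names must be between 3 and 24 characters long.
    if not 3 <= len(candidate) <= 24:
        problems.append(f"length {len(candidate)} is outside 3-24")

    return problems


print(storage_name_problems("starterdata-rg"))  # my first, failing name
print(storage_name_problems("pvdata"))          # the name I used in the end
```

An empty list means the name should be safe to concatenate; anything else tells you what to change before starting the script.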
Once you have entered the command with your own values, hit Enter and wait a few minutes.
My command prompt looked like this one (replace the parameter values with your own):
PS D:\PurviewStarterKitV4> .\RunStarterKit.ps1 -CatalogName purviewstarter -TenantId xxxxxx -ResourceGroup pvdata -SubscriptionId xxxxx -CatalogResourceGroup purviewstarter-rg
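If you run the kit more than once, it can be handy to assemble the command line from checked values. The sketch below uses the parameter names from my command above. The helper name `build_starter_kit_command` and the all-zero/all-one GUIDs are placeholders invented for this example, and `uuid.UUID` is only a loose format check (it also accepts GUIDs written without hyphens).

```python
import uuid


def build_starter_kit_command(catalog_name, tenant_id, resource_group,
                              subscription_id, catalog_resource_group):
    """Return the RunStarterKit.ps1 call as a list of arguments."""
    # Tenant and subscription IDs are GUIDs; fail early if they do not parse.
    for label, value in (("TenantId", tenant_id), ("SubscriptionId", subscription_id)):
        try:
            uuid.UUID(value)
        except ValueError:
            raise ValueError(f"{label} does not look like a GUID: {value!r}") from None

    return [
        ".\\RunStarterKit.ps1",
        "-CatalogName", catalog_name,
        "-TenantId", tenant_id,
        "-ResourceGroup", resource_group,
        "-SubscriptionId", subscription_id,
        "-CatalogResourceGroup", catalog_resource_group,
    ]


# Placeholder GUIDs, not real IDs.
command = build_starter_kit_command(
    catalog_name="purviewstarter",
    tenant_id="00000000-0000-0000-0000-000000000000",
    resource_group="pvdata",
    subscription_id="11111111-1111-1111-1111-111111111111",
    catalog_resource_group="purviewstarter-rg",
)
print(" ".join(command))
```

The printed line can be pasted into the PowerShell terminal. Note that the hyphen in the catalog resource group name did not cause a problem in my run; only the data estate resource group was affected.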
What is really nice: not only are the Azure services created, the blob storage is also filled with demo data and the ADF pipeline is run (because this triggers the metadata to be pushed to Purview).
Let’s try the Starter Kit demo data
The scanning takes some time (a few minutes) and when done, we can have a look at the assets imported into the Purview metadata store.
On the Purview Home tab, select the Browse assets button to explore the assets in your Purview instance.
In the Starter Kit environment, you should see three sources registered – the Blob Storage, the Azure Storage Account and the Azure Data Factory.
Select the Azure Data Factory button and then the ADF present there (it’s the one that was generated by the PowerShell script). Then drill into the TestCopyPipeline until you get to the Details page of the TestCopyPipeline (the Azure Data Factory Copy Activity).
On the overview page, you basically get the technical details of this data asset, like the qualified name and other tech properties. What is interesting, and what you should have a look at, is the data asset hierarchy shown on the right side of the screen. In this example, the hierarchy is defined by the object types Azure Data Factory – Azure Data Factory Pipeline – Azure Data Factory Copy Activity.
If you want more details, you can open this activity directly in Azure Data Factory (blue button on the top right).
Last but not least, you can have a look at the lineage of the pipeline or at related objects. I will start with the related objects as I want to save the best for the end 🙂
In the Related tab, you can have a look at and browse to related objects of this asset. In the demo environment, you can see links to the ADF and the ADF Pipeline.
And now – I think I already mentioned it once or several times – I am really looking forward to the data lineage possibilities of Purview. The current state of the visualization for the Starter Kit environment looks like this:
- Data sources on the left (Blob storage)
- the ADF Copy Activity in the middle
- and the targets on the right (ADLS Gen2 storage)
In my screenshot, the copy activity is selected, which allows you to a) see more details and b) open the pipeline directly in ADF.
As this Copy activity copies multiple sources, not all paths are displayed. In my demo, Purview shows the first six paths, and the toggle button More lineage allows you to show the “…” bubble on either the source or the target side.
If you select one of the assets, the related assets are highlighted. In my opinion, this visualization is a start – I would not assume that data from all the sources is flowing into the highlighted CreditLimit asset.
Next, navigate (“Switch to asset”) to the details of the Contoso_CreditLimit_[N].ssv asset (highlighted in the last screenshot).
This lineage view makes more sense: it shows the data lineage for credit limit data coming from the credit limit asset in Azure Blob storage, moved by the ADF copy activity and stored in the ADLS storage.
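To make the difference between the two views more tangible, here is a toy model. It is not how Purview stores or queries lineage; the asset names, the `EDGES` list and both functions are hypothetical and only illustrate why looking at a single asset is less ambiguous than looking at the whole copy activity.

```python
# Hypothetical lineage records: (source asset, process, target asset).
EDGES = [
    ("blob/CreditLimit", "CopyActivity", "adls/CreditLimit"),
    ("blob/Customers", "CopyActivity", "adls/Customers"),
    ("blob/Orders", "CopyActivity", "adls/Orders"),
]


def process_view(edges, process):
    """All sources and all targets of one process, without the pairing between them."""
    sources = sorted({s for s, p, _ in edges if p == process})
    targets = sorted({t for _, p, t in edges if p == process})
    return sources, targets


def asset_view(edges, asset):
    """Direct upstream and downstream assets of a single asset."""
    upstream = sorted({s for s, _, t in edges if t == asset})
    downstream = sorted({t for s, _, t in edges if s == asset})
    return upstream, downstream


# Activity-centric view: every source sits next to every target.
print(process_view(EDGES, "CopyActivity"))

# Asset-centric view: only the one source that feeds this target.
print(asset_view(EDGES, "adls/CreditLimit"))
```

In the activity-centric result, nothing tells you which source belongs to which target. The asset-centric result for the credit limit target returns just its own source, which matches what the “Switch to asset” view shows.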
And now, it’s your turn to further browse through the Starter Kit demo environment!
THANK YOU to White and Brown Wooden Box · Free Stock Photo (pexels.com) for the featured image!