In todays blog post I would like to build an end-to-end solution to combine data coming from different sources and stored in different form factors into a single Power BI data model using Azure Synapse Analytics.
The Basics to start with
For those of you, that do not know Azure Synapse Analytics, just a short explanation – taken from the official product documentation page: Azure Synapse Analytics is a limitless analytics service that brings together enterprise data warehousing and Big Data analytics. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources—at scale.
So what does this mean? Well, in theory it should not really matter where your data comes from, in which format it arrives and how you like to transform and query it – Azure Synapse Analytics should solve this for you.. 🙂 In theory… It’s a little bit of work from your side to solve it.. but let’s have a look how we can deal with it.
Over time, two major streams for data in enterprises were created: 1) traditional data warehousing for dealing with well-structured data and reporting requirements and 2) big data analysis to deal with (huge) amounts of less structured data.
Use case 1) involves business users from the enterprises to work with the data warehousing data based on proven security, dependable performance and data stored in a relational structure. Data in use case 2) is often used by data scientists for exploration, big data analysis and machine learning based on semi structured data in a data lake.
In the past, different (Azure) services were needed to solve and deal with these different kind of analytic requirements, storage technologies and performance requirements.
With the announcement of Azure Synapse Analytics, the separate services are now combined under a single bracket – a so-called Azure Synapse Workspace.
Within an Azure Synapse Analytics Workspace,…
- ..you can integrate data using Synapse Pipelines (if you already are familiar with Azure Data Factory, it’s an easy go for you)
- .. the central data storage is a based on an Azure Data Lake Storage Gen2 (needed for metadata storage)
- .. you can use different kind of analytic runtimes to work with your data
- SQL-based runtimes using either a serverless approach or a provisioned flavor (named dedicated SQL Pools and based on the Azure SQL Datawarehouse technology).
- Apache Spark runtimes to use the distributed architecture, notebooks and language of your choice
- .. security is integrated from ground up using Azure Active Directory
- .. Synapse Studio is the development single-stop user interface to create artifacts and work with all objects in the workspace
- .. you can configure multiple linked services and linked data storages to extend the data landscape
- .. you can configure connected services (like Power BI or Azure Machine Learning) which – in the example case of Power BI – allow a direct creation of Power BI reports within Synapse Studio
And now: Action for an end-to-end “Data through Azure Synapse to Power BI” demo
In my example scenario I would like to analyze sensor-data originating from different machines in different location of our company. The sensor values are gathered directly at the machines and sent through Azure Stream Analytics into a directory in our data lake (this process is not covered in this example).
I am not going into detail how to create a new Synapse workspace and how to create new dedicated SQL Pools or managing the connections to linked services etc.
In order to get a first feeling of our data, we can use Azure Synapse Studio and there the Data hub to start with our data lake exploration. The output of our data collection process stores the data into CSV files – one for every machine. An example input directory would look like this
Synapse Studio allows us to preview the content of CSV files – use the context menu and the Preview action. For a first look, Synapse Studio now opens a dialog with a preview of the file content
But we want to work with the data and query the file content. To achieve this, we can use the context menu again and select New SQL Script -> SELECT TOP 100 ROWS. This action opens a new SQL script window containing a script to query the CSV file from our data lake storage.
The magic in this script is based around the keyword OPENROWSET – this function reads the CSV content and returns the data in a relational structure. In the first screenshot the relational structure is not that kind of visible – only one column C1 is returned by the statement. What needs to be done are some additional parameters and tweeks to get a more reliable structure.
Also, this script is executed against / using the SQL Serverless pool (named Built-in). The source for this query is one single file – SensorValues_M_01.csv in the data lake.
An enhanced version of the SQL script deals with some of the points mentioned above:
SELECT * FROM OPENROWSET( BULK 'https://<ourdatalake>.dfs.core.windows.net/data/staging/sensors/SensorValues_M*.csv', FORMAT = 'CSV', PARSER_VERSION='2.0' , FIELDTERMINATOR = ';' ) WITH ( MeasureDT DATETIME2(0), MachineName VARCHAR(20), SensorKey VARCHAR(20), SensorValue DECIMAL(18,2) ) AS [result]
In this version of the SQL script I’ve changed the name of the source file to use a wildcard – i.e. to read all the files in the /staging/sensors/ directory following the specified file name pattern.
Second, the option FIELDTERMINATOR =’;’ was added to OPENROWSET -> this returns the four columns from the CSV file.
Third, by using the WITH clause I declared a relational structure for the statement output – defining column names and datatypes.
Data Import – Sensor Measurements
SQL Serverless pools in Azure Synapse do not store data by themselves, they are metadata containers for external table and view definitions.
What I would like to do for the next step of our example – store the sensor values in a relational store we need to switch to Dedicated SQL pools. For the import we have at least two choices – the first: use Synapse Pipelines to load the data from the data lake.
Second, use the bulk copy feature that Synapse Studio brings out-of-the box. Back in the Data Hub, I select a data file present and in the context menu the action New SQL script -> Bulk load is the door opener to the data integration step. After the wizard-like steps are finished, a COPY SQL Script is generated.
First, select the input files / folder – i.e. the content you want to import. I was not able to select an input folder – Synapse did not like this selection and gave me errors. So I decided to go for a single file (the _M_02.csv file)
Next, configure the input settings – in my case the field terminator ‘;’
The sink of our COPY statement – a dedicated SQL pool and a new destination table – staging.SensorValues.
Just to be sure, have a look at the column mapping.
I’ve changed the generated SQL script to provide some more meaningful column names and to read in not only one file but the content of the folder (using a wildcard).
Our current dataset looks as follows:
Data Import – Masterdata
In addition to our sensor measurements, I want to include some masterdata information into our data warehouse database. The masterdata list for our machines and detailed information for our sensor-measurement keys are stored in an Azure SQL database.
For the data import, a Synapse Pipeline using two copy data activities was created. For each of the sources, an integration dataset was created pointing to the according database table. The sinks for the copy activities are both pointing to the SQL dedicated pool.
To test the data import, use the Add trigger -> Trigger Now option in the menu.
And the Monitor hub to monitor the pipeline execution run.
Data Modelling (in the SQL Pool)
Up next – combine the data islands (sensor measurements and masterdata) into a logical layer to serve as input for the Power BI data model.
Using Synapse Studio and the Develop hub, I’ve created Views to create the logical layer.
CREATE VIEW [reporting].[SensorValues] AS SELECT [MeasureDT] ,[MachineID] ,[SensorID] ,[SensorValue] , MachineID + '_' + SensorID MachineID_SensorID FROM [staging].[SensorValues]; GO CREATE VIEW [reporting].[Sensors] AS SELECT [MachineID] ,[SensorID] ,[SensorName] , MachineID + '_' + SensorID MachineID_SensorID FROM [masterdata].[Sensors]; GO CREATE VIEW [reporting].[Machines] AS SELECT [MachineID] ,[MachineName] ,[Location] FROM [masterdata].[Machine]; GO
And the result looks like:
Create a Power BI Data Model
As the next step, a Power BI data model is needed for our visualizations. As I mentioned in the beginning, with the use of connected services the boundaries of a Synapse Workspace can be extended and external, Azure / Office 365 services can be integrated in Synapse Studio.
In order to do so, switch to the Manage hub and add a new linked service (type Power BI). Select an existing Power BI workspace, name it and that’s it!
As of today, only one Power BI workspace can be linked to a Synapse workspace. I hope this limitation will be lifted in the future..
At the Develop hub, the Power BI workspace appears in the Power BI section: datasets and reports already existing are listed here. It does not matter if these items were created in Power BI directly or in Synapse Studio.
Open the datasets list and add a new one.
Open the downloaded PBIDS (Power BI Datasource file) in Power BI Desktop. The connection to the dedicated SQL Pool is automatically established. Select the three views we created in one of the previous steps and load them into the Power BI model.
For this demo, we will use Import mode. If you want to see a demo using DirectQuery a Data Lake directly, please have a look at my blog post here.
Update the relationships between the tables and the data model looks like this:
I’ve created one sample report page:
Save the PBIX file and publish it to the linked Power BI workspace.
After the publish is finished, switch back to Synapse Studio and use the Close and refresh button.
There are two options now – use the published dataset to create a new report or open the report directly from Synapse Studio:
The Power BI editing experience is fully integrated into Synapse Studio – change the report as you like. Be aware that all the changes you perform on one of the sides (Power BI service or Power BI integration in Synapse Studio) will appear on the other side too 😉
With that, we are at the end of our end-to-end Synapse example. To recap,…
- we’ve explored data files in the data lake directly in Synapse Studio using the preview option and the SQL serverless OPENROWSET command
- we’ve used the bulk load wizard to generate a COPY statement to load sensor data into a dedicated SQL Pool
- we’ve used Synapse pipelines to load masterdata from an Azure SQL databae into a Synapse dedicated SQL pool
- we’ve created views in the dedicated SQL pool as a logical layer
- we’ve created a Power BI data model based on the views
- we’ve used the Synapse Power BI integration to open and edit the report