The usage of Data Lakes in a Modern Data Warehouse / Data Platform is getting more and more popular. Fed by different kind of data sources, a data lake is very often used as a cheap, easy to handle and powerful data storage service. With the introduction of Azure Data Lake Storage (ADLS) Gen2, the analysis of Big Data (and not only big data ) got a very powerful technical foundation.
With hierarchical namespaces, the content and files in an ADLS Gen2 storage account are structured into a folder hierarchy, like the way that the file system on your computer is organized. The process of data lake design (= how you structure your directories, subdirectories, and files in a data lake) should be a major and important part of your data lake strategy. The reason behind this lies in the fact, that if you structure your data lake appropriate, the following analysis queries can benefit from a good, suited directory structure (i.e. only parts of your data lake are read when you want for example only data for a specific year or time period).
What to do with your data lake content?
In many cases, the files stored in your data lake are used as input for further analysis like performing Machine Learning processes or other data processing on it. Sometimes, files in the data lake are directly used for analysis in a business intelligence tool like Power BI for example.
Use Power BI and connect directly to the Data Lake
There are different ways how you – as a Power BI user – can access data files in the data lake using Power BI (Desktop).
If you open the Get Data dialog, and search for data lake, you’ll find an entry to connect to an ADLS Gen2 account.
Select that one and enter the URL to your data lake storage account (including the container name). There are two options to use – either File System View or CDM Folder View. I will use the first option (File System View) in my sample. After you have entered your authentication details, you are ready to connect to your data lake.
If you need to work and analyze multiple files in your data lake, you can do so by selecting the action Combine & Transform Data in this dialog to read every file included in your selected path in the data lake. Be aware, that multiple (many) files can lead to performance impacts while reading.
One thing that I would like to mention here is, that this approach to access files in the data lake is IMPORT mode only, there is no option to tell Power BI (Desktop) to access files in direct Query mode. Which leads to the fact, that for up to date information out of your data lake, Power BI needs to perform a (scheduled) refresh action.
Would not it be nice to have Direct Query for a data lake?
Sometimes however, would not it be nice to access the data lake in Direct Query mode – to get the most up to date information for every report view? I would say: yes … but how can you achieve this? The options natively provided by ADLS Gen2 and Power BI are not sufficient to solve this requirement. But: there are options to achieve this and, in this post, I would like to show you the possibilities using Azure Synapse Analytics to build a query layer on top of a ADLS Gen2 storage account.
What is Azure Synapse Analytics?
Announced in late 2019, Azure Synapse is a (partly) new analytics service that brings together enterprise data warehousing and Big Data analytics. Parts of the new Azure Synapse Analytics have their roots in the Azure SQL Datawarehouse (the MPP data warehousing solution from Microsoft in the Azure Cloud).
I will not go into details (today) about the different parts of an Azure Synapse workspace, but some basic terms are needed to follow my sample. Within an Azure Synapse workspace, you will find different kinds of capabilities to analyze data coming from and residing in different data stores. Every Synapse workspace has one required data stored – a primary Data Lake Storage account that is needed to store system and metadata information required for Synapse itself. With the concept of linked services and many options to connect to other data storages, Synapse can bridge the gap between different data services and data structures.
Data needs to be analyzed and Synapse brings different flavors of analysis capabilities – Apache Spark based capabilities (not the topic of today) and SQL capabilities. The SQL analysis approach is provided by the main resources called SQL pools which comes in two different consumption models – serverless and dedicated.
The following graphic (taken from here) shows the different aspects of SQL Pools in Azure Synapse Analytics.
The two SQL pools serve different aspects of data processing – serverless is mainly used for data exploration, whereas the dedicated SQL Pools are the foundation of data warehousing. The pricing is also different for these two approaches – dedicated SQL pools are billed based on the consumption level provisioned, whereas serverless pools are billed based on the data processed (currently per TB data processed).
And now – let us explore the data.
In my example, I will use a SQL serverless pool within Azure Synapse to create a query (meta) layer on-top of files stored in an ADLS Gen2 storage account.
This blog post will not contain an how-to create a Synapse workspace – I’ve already done that in before and the sample environment looks like this:
- One Synapse Workspace (synapse4wolfgang) is created
A primary ADLS Gen2 storage account is attached
- This storage account also contains the sample files
- Only a serverless SQL Pool exists in this Synapse Workspace (it is created by default during setup)
There are two files stored in the data lake, one containing sales from the years 2015 to 2019 (sales_2015_2019.csv), the other file contains sales from the year 2020 (sales_2020.csv).
The two files contain data following the same structure, which can be previewed using a right click on one of the files and selecting the action Preview. Sales are structured based on the country, year, article and the number of articles sold in that time period.
If you want to query the content of a data file using a serverless SQL Pool, select the context menu action New SQL script -> Select TOP 100 rows.
In a new window, a SQL script is generated which uses OPENROWSET to read the content of the selected file and return the content using the serverless SQL Pool (named Built-in). Have a look at the output of the query – it’s only one column containing all the fields from the file.
To get a more structured output, I will modify the SQL Script a little bit and tune it with additional options for the OPENROWSET function. The parameter FIELDTERMINATOR defines the character used to divide columns and HEADER_ROW takes the first row in the file as header row (and thereby names the columns accordingly).
Create a VIEW based on data lake data
As a next step, the statement above will be wrapped within a VIEW definition to be accessible from other tools using the serverless SQL Pool. As a first step, let’s create a new database within the SQL Pool:
Now, change the database context for your SQL script to work in the new created database. If the database is not shown, refresh the database list and select the new one afterwards.
Modify the content of your SQL script to include a CREATE VIEW statement that reads all the lines from the file. Execute the script (run it in the right database!!)
If you included no typos, the script should execute fine and you can check the structure of the new view in the Data hub and there in the Workspace and Databases section. If the new database is not listed, please use the refresh action from the menu. As marked in the screenshot, the view contains four columns named correctly, but the data types are not the best choices I would say.
Going back to the SQL script, the OPENROWSET function can be enriched with a specified schema definition (WITH clause). If you change the script to an ALTER VIEW command plus the structure definition, the view columns datatypes should look better afterwards.
Use the VIEW as data source in Power BI
As a next step, the created VIEW is used as a data source in Power BI (Desktop). There are multiple ways to create a Power BI dataset based on a Synapse SQL Pool – I will use the integrated approach using a linked service definition to a Power BI workspace. In my demo environment, the Power BI workspace Synapse Integration is linked to my Synapse workspace. At start, the workspace is empty – no datasets and reports exist in there. Use the + New Power BI dataset action to start the steps..
Create a Power BI dataset
As Power BI datasets can only be created in Power BI Desktop, this tool is the place where we are going to create the new dataset. Hit the Start button and select the SQL Pool (in our case the database we created in our script).
As a next step, download the Power BI datasource file (.pbids -> see documentation here) and open the file in Power BI Desktop.
Use your credentials to sign in and select the SalesView we created in the SQL pool before.
- Remember, this view directly grabs the content of the sales data from the CSV file stored in our data lake!!
As we do not want to transform the structure of the source, select the Load button and use the generated data table to create a first report page. Before that, there is one very important setting to define – use either Import or DirectQuery to connect to the source!?
- VERY Important: select DirectQuery to get a live connection to your data lake data!
As you can see in the following screenshot, data from the years 2015 until 2019 is displayed. This data is read every time the Power BI report is opened or refreshed. With the option DirectQuery the data is always up to date. So, in every case, if the source file is changed the new values are displayed in Power BI.
What if we want to include the sales data from 2020 into our Sales View? There are two options:
- Option 1: add a second query into the view definition and UNION the result rows..
- Option 2: use wildcards in the view definition and not select a specific file but a list of files..
Use wildcards in the OPENROWSET Function
Let’s dive deeper into option 2 – Wildcards in the OPENROWSET function and therefore read a list of data files and provide their content into the view.
If you change the View definition like in the following screenshot, every time the view is accessed data from all the files following the specified file pattern are read. In addition to wildcards in file names, you can also use wildcards in parts of the path definition like …/primary/samples/*/sales_*.csv.
Update the View definition and refresh your Power BI report – et voila: Data from 2015 until 2020 is available!
Easy, wasn’t it?
To sum it up:
- Power BI is not natively able to provide a DirectQuery dataset based on data stored in Azure Data Lake Storage (ADLS) Gen2
- With Azure Synapse Analytics data coming from different sources can be analyzed and brought to a SQL structure
- By using Azure Synapse SQL serverless pools, we were able to create a query layer based on files in the data lake (created a new database and VIEW in that database)
- By using wildcards in the OPENROWSET source, multiple files can be read
- That’s the way to roll Power BI DirectQuery on your CSV files in the data lake!
Happy data exploration and have fun swimming in the data lake,