A look back at the past of Microsoft Analytics
Do you remember the good old days when it was “just” the SQL Server database engine, Integration Services, Analysis Services and Reporting Services and your data warehouse solution was done?
And then – we moved into the cloud with a huge variety of services to build analytics solutions – Azure Data Factory, Azure Data Lake Storage, Azure SQL Database and Power BI. These puzzle pieces could be combined in many different ways, and when we look back at the cost estimation discussions with customers, that wasn’t an easy task.
And then in 2019, the preview of Azure Synapse Analytics was released. I really liked the idea of an integrated analytics solution combining different ways of storing data (data lake and data warehouse) plus options for (almost) everybody to work with and transform data (SQL based engines, Spark, ADX). But again, the management and especially the overall pricing were not easy to estimate. Every piece of Azure Synapse Analytics was billed individually and every compute engine used a different format for storing data: Spark could use data lake storage and/or dedicated SQL pools, Data Explorer used its own Data Explorer database format and the SQL engines used their own storage model.
Just some thoughts about an ideal analytics solution…
- What if you could start an analytics solution without worrying about costs of the different pieces in that context (compute and storage aside)?
- What if you could log into a single portal and work on all your data tasks without switching tools?
- What if you could re-use your knowledge about current Azure Data services like Azure Data Factory, SQL Serverless, SQL Dedicated pools, Spark and Azure Data Lake without the need to learn “yet another analytics tool”?
- What if this analytical solution is provided to you as a Software-as-a-Service product where you do not have to worry about instantiating analytics runtimes, remembering to provision storage services, …?
- In the past, even within an integrated analytics solution, data was copied from one place to another. Think about a Data Lake as an archiving/staging layer, a SQL Dedicated Pool as the data warehouse and Power BI as the semantic and reporting layer. Between each of those stages, data was duplicated.
- One of the reasons why data was duplicated between the different compute and analytic engines was their different (internal) storage formats. What if an analytics solution has one format for all its different analytic workloads? And imagine a world where results produced by engine #1 can directly be accessed and used by engine #2 or #3?
- Last but not least, what if an analytics solution provides you end-to-end security, data lineage and compliance under a single umbrella?
The future of Microsoft Analytics is here – Introducing Microsoft Fabric!
Announced at MS Build 2023 (announcement post), Fabric brings the Microsoft analytics stack and products to the next level. The public preview of the service was announced at the same event.
“Microsoft Fabric is the most important milestone in the history of Microsoft Data since the introduction of SQL Server” – Satya Nadella at MS Build Keynote 2023
Starting from the past with a huge set of different tools to build your solutions, Fabric brings them all under a single umbrella. The overall idea and purpose is to provide “end-to-end analytics from the data lake to the business user”.
If you are more into videos – I’ve recorded a short Fabric summary video -> https://youtu.be/FEzQnJFUvx4
Let’s dive a little bit deeper into the main pillars of Microsoft Fabric.
- Data Storage – OneLake to store all your information in one place and using – by default – one open standard storage format.
- Compute workloads for different personas to work on every aspect of analytics – from data integration, data engineering, machine learning up to reporting.
- Structure your data in workspaces and share these data pieces across them in your organization.
- Re-use already well-known and well-established Microsoft data services without the need for learning new tools and integrate them into a single Software-as-a-Service analytics platform.
Data Storage in Fabric – OneLake
OneLake is THE storage location in the Fabric scope. With the SaaS approach, OneLake is a single unified logical data lake for your whole organization. You do not have to worry about instantiating and configuring storage accounts – it’s done in the background for you. Even if your organization is distributed around the globe, OneLake helps by providing storage accounts in different regions around the world. On top of those technical details, OneLake combines them into one single logical data lake.
Data is stored using the open standard Delta format (Parquet files plus transaction logs), which allows every tool that is able to read and write the Delta format to consume the data within Fabric. It’s not only the data lake based workloads that store their data in Delta format; data warehousing and even Power BI also interact directly with the Delta based data in OneLake.
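To make the “Parquet files plus transaction logs” idea a bit more tangible, here is a minimal sketch of how a reader could work out which data files belong to the current version of a table by replaying the log. It is a strong simplification and not the full Delta protocol: only `add` and `remove` actions are handled, and the function name `live_files` and the file names are made up for illustration.

```python
import json


def live_files(commits):
    """Replay simplified Delta-style commits in order and return the
    data files that make up the current table version.

    Each commit is a string of newline-delimited JSON actions. Only
    "add" and "remove" are handled here (an assumption for brevity;
    the real protocol defines more action types).
    """
    files = set()
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


# Two hypothetical commits: the first writes two Parquet files, the
# second rewrites one of them (remove the old file, add a new one).
commit_0 = "\n".join([
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
])
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),
])

print(live_files([commit_0, commit_1]))
# ['part-0001.parquet', 'part-0002.parquet']
```

Because the log (and not the folder listing) decides which files are part of the table, any engine that understands the log sees the same consistent table version.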
When talking about open access to data: OneLake is based on Azure Data Lake Storage Gen2 technology, and therefore your existing API calls and tools will continue to work with OneLake too.
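As a small illustration of that compatibility, the sketch below builds an ADLS Gen2 style path for an item in OneLake. The URI layout (workspace, then `<item>.<itemtype>`, then the path, on the host `onelake.dfs.fabric.microsoft.com`) follows the OneLake documentation as far as I know it; please verify it against the current docs. The helper `onelake_uri` and the workspace and lakehouse names are hypothetical, and the helper assumes names without special characters.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_uri(workspace, item, item_type, path="", scheme="abfss"):
    """Build an ADLS Gen2 style URI that points into OneLake.

    Hypothetical helper for illustration only; no escaping is done,
    so names are assumed to contain no special characters.
    """
    relative = f"{item}.{item_type}/{path.lstrip('/')}".rstrip("/")
    if scheme == "abfss":
        return f"abfss://{workspace}@{ONELAKE_HOST}/{relative}"
    if scheme == "https":
        return f"https://{ONELAKE_HOST}/{workspace}/{relative}"
    raise ValueError(f"unsupported scheme: {scheme}")


# Made-up workspace and lakehouse names.
print(onelake_uri("SalesWorkspace", "SalesLakehouse", "lakehouse", "Tables/orders"))
# abfss://SalesWorkspace@onelake.dfs.fabric.microsoft.com/SalesLakehouse.lakehouse/Tables/orders
```

A path like this can then be handed to tools that already speak the ADLS Gen2 API, for example a Spark reader that understands `abfss://` paths.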
Structure your OneLake with Workspaces
Workspaces are the concept used to structure your OneLake. Every workspace is assigned to a Fabric capacity and is – as of now – the level at which security and access are defined.
Workspaces themselves are not closed data silos. With Shortcuts you can provide – like symbolic links in your file system – connections to other pieces of data in OneLake itself. This clears the way for a single copy of data and no duplication in your analytics solution. Shortcuts can even be extended to services outside of Microsoft’s data universe, like Amazon storage.
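The symbolic link analogy can be sketched as a tiny path resolver. This is purely a mental model and not how Fabric implements shortcuts; the function `resolve`, the workspace and table names and the `s3://example-bucket` target are all made up.

```python
def resolve(path, shortcuts, max_hops=10):
    """Follow shortcut prefixes until the path points at the data itself.

    `shortcuts` maps a shortcut location to its target location.
    Toy model only: the hop limit guards against accidental cycles.
    """
    for _ in range(max_hops):
        for prefix, target in shortcuts.items():
            if path == prefix or path.startswith(prefix + "/"):
                # Swap the shortcut prefix for its target, keep the rest.
                path = target + path[len(prefix):]
                break
        else:
            # No shortcut matched, so this is where the data lives.
            return path
    raise RuntimeError("too many shortcut hops (possible cycle)")


shortcuts = {
    # A Marketing workspace re-uses the Sales orders table without copying it.
    "Marketing/Lakehouse/Tables/orders": "Sales/Lakehouse/Tables/orders",
    # A shortcut that points outside of OneLake.
    "Sales/Lakehouse/Files/external": "s3://example-bucket/exports",
}

print(resolve("Marketing/Lakehouse/Tables/orders/part-0001.parquet", shortcuts))
# Sales/Lakehouse/Tables/orders/part-0001.parquet
```

The consumer keeps working with its own path while the data stays in exactly one place.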
Compute Workloads in Microsoft Fabric
Based on the one place to (not only logically) store your data in your organization – the OneLake – there are different tasks your data needs to go through.
First, you need
- to integrate and collect your data,
- transform and engineer your data (“massage your data” as some of those Guys from a cube named it 😉)
- and maybe then build a data warehouse on your data
- sometimes – you learn more about your data by using Data Science.
In some projects, your data flows in real-time and needs to be analyzed in real-time too.
Besides that, no data pool is complete without proper reporting and business intelligence.
In Fabric all those different workloads are based on OneLake data. No different ways of organizing data, no need to transfer data from one system to another. Every workload can directly read the output of another workload.
So what workloads are there in Microsoft Fabric?
- Data Integration is based on the well-known Azure Data Factory technology and the Dataflows you may know from Power Query Online. Both are there to read, transform and bring data into the Fabric OneLake (workspace).
- Data Engineering is based on the same Spark engine and technology you are already familiar with. One of the main concepts in Fabric is to build a lakehouse for all your organizational data. And the data in your lakehouse is not hidden and limited to Spark only. Because of OneLake and the Delta format used by every workload, the transformed data tables can be used by all the other workloads. By default, the generated Delta tables in OneLake are even discovered & registered to provide an integrated file to table experience.
- Data Warehousing. Think about Azure Synapse SQL Dedicated Pools on fire: Delta tables as the storage format and a serverless engine with auto-scale to do the compute work. In addition, use T-SQL to work with your data in the way you are already familiar with.
- Data Science. Spark, MLflow, Cognitive Services and Notebooks are already your friends. Good – because in Fabric these technologies will continue to exist and be the foundation of your Data Science work. Expect a deep integration (because of OneLake and the overall Fabric umbrella) but also some exciting extensions for a better, enhanced and more productive way to do Data Science.
- Real Time Analytics. Real-time analytics has become important in many scenarios in the enterprise world, such as Log Analytics, cybersecurity, predictive maintenance, supply chain optimization, customer experience, energy management, inventory management, quality control, environmental monitoring, fleet management, health, and safety. Streaming, real-time data is handled by the technology previously known from Azure Data Explorer (ADX/Kusto). But this time, it is directly integrated and based on the same storage format. Data in your real-time context is automagically mirrored to your central lakehouse – no data copy is needed here.
- Business Intelligence. In Fabric, a semantic layer (aka Power BI dataset) on your lakehouse is generated by default. Yes – by default and automagically. On top of that, this dataset does not use Import or DirectQuery mode; it uses a new connection mode – the Power BI Direct Lake mode.
- Direct Lake mode – This connection mode combines the best parts of Import and DirectQuery mode – fast dataset performance and up-to-date information. In Direct Lake mode, Power BI reads data directly from OneLake whenever that part of the dataset is requested. In order to return data from OneLake fast, these Delta tables are optimized (V-Order) but still 100% compatible with the open standard Delta format (and thereby can be consumed by all the other applications).
No workload without compute power (capacities)…
As Microsoft Fabric is a SaaS product, you do not need to instantiate and administer the different workload components as you had to – for example in Azure Synapse or in a “plain” data solution based on Azure services.
But there is one thing you need to get started: a Fabric capacity, which powers the universal compute capacities.
A workspace in Fabric needs a capacity assignment because storage and compute are separated in this environment. Therefore no data transformation, analysis or warehousing is possible without a capacity assigned.
To my knowledge, capacities will come in different performance levels and will be available on a subscription basis.
To sum it up…
Phew.. I think we all need some more time to digest these announcements and the options that the new Microsoft Fabric stack brings with it.
With the public preview announced at Build we all get the chance to try Microsoft Fabric on our own workloads (head over to https://aka.ms/try-fabric), learn how to interact with it and especially how to design solutions based on best practices. Keep in mind that it is a first public preview version and therefore expect some hiccups, learning curves, maybe some (small) steps back, but with our combined feedback we can shape the Analytics Platform of the future.
I am more than excited about the new possibilities and look forward to starting some projects based on Fabric in the near future.
- Find out more about Microsoft Fabric: https://aka.ms/try-fabric
- Webinar series: https://aka.ms/fabric-webinar-series
- Microsoft Fabric blog site: https://aka.ms/fabric-tech-blog
- Getting started e-book: https://aka.ms/Getting-Started-eBook
MS Build Content
Introducing Microsoft Fabric:
- Introducing Data Factory in Microsoft Fabric: https://blog.fabric.microsoft.com/en-us/blog/introducing-data-factory-in-microsoft-fabric/?WT.mc_id=DP-MVP-5001676
- Introducing Synapse Data Engineering in Microsoft Fabric:
- Introducing Synapse Data Warehouse in Microsoft Fabric: https://blog.fabric.microsoft.com/en-us/blog/introducing-synapse-data-warehouse-in-microsoft-fabric/
- Microsoft OneLake in Fabric: https://blog.fabric.microsoft.com/en-us/blog/microsoft-onelake-in-fabric-the-onedrive-for-data/
- Data Activator in Fabric: https://blog.fabric.microsoft.com/en-us/blog/driving-actions-from-your-data-with-data-activator/
- Real-Time Analytics in Microsoft Fabric: https://blog.fabric.microsoft.com/en-us/blog/sense-analyze-and-generate-insights-with-synapse-real-time-analytics-in-microsoft-fabric/
- Data Science in Microsoft Fabric: https://blog.fabric.microsoft.com/en-us/blog/introducing-synapse-data-science-in-microsoft-fabric/