Like SQL Data Warehouse, Microsoft once again named a product by what it does: Power BI Datamart. It is an Azure SQL database, with a visual query designer so users don’t need to write SQL to get the data out to BI tools, Excel or ML algorithms. It also generates a Power BI dataset automatically, which we can use to create Power BI reports.
Loading data into Datamart
Power BI Datamart uses a built-in Power BI Dataflow for loading data into it. After we create a Datamart, we click on Get Data and use Power Query to connect to various data sources such as Azure SQL database, Azure Data Lake Storage, Databricks (Spark), Azure Synapse, HDInsight, SharePoint, Excel files, text files, JSON files, Parquet files, APIs, Salesforce, Snowflake, Redshift, BigQuery, Oracle DB, IBM DB2, Teradata, PostgreSQL, MySQL, Access, SAP, Analysis Services and ODBC (link):

The above list of data sources is so comprehensive that I could not believe it at first. This is a game changer. Gone are the days when we struggled in SSIS, Informatica or ADF to load data from those varied sources. Power Query has all the connectors and interfaces for every single one of those databases, files, data lakes and big data platforms.
We then select the tables and specify the transformations, and Power Query builds the ETL rules in that built-in Dataflow, creates the relationships and loads the data from those source tables into the Datamart. We can then see all those tables and relationships in the Table Tools tab:

Afterwards, we can create relationships (again) and measures. We can also hide the tables and columns that we don’t need.
In the workspace we automatically get a dataset:

We can schedule when we want to reload data from those data sources into the Datamart (this is called Scheduled Refresh). We can also set up incremental load, i.e. only changed rows are loaded into the Datamart, based on a column that we choose (known as the “watermark column”, link):

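Power BI configures the incremental refresh for us in that dialog, but for readers who want to see the underlying idea, below is a generic SQL sketch of the watermark pattern. The table and column names are made up, and this is not what Power BI does behind the scenes (it partitions the table on the chosen datetime column); it only illustrates the principle of picking up the changed rows.

-- Generic watermark pattern: only load source rows that changed since
-- the last load. dm.FactSales and src.Sales are hypothetical names.
DECLARE @watermark datetime2 =
    (SELECT MAX(LastModifiedDate) FROM dm.FactSales);

INSERT INTO dm.FactSales (SalesKey, CustomerKey, Amount, LastModifiedDate)
SELECT s.SalesKey, s.CustomerKey, s.Amount, s.LastModifiedDate
FROM   src.Sales AS s
WHERE  s.LastModifiedDate > @watermark;   -- the "watermark column"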
Security and access control
Power BI Datamart uses row-level security and roles to restrict user access to the Datamart. In the example below (link), users are restricted so that they can only access rows with a category ID of 1 or 2. Once that role is defined, we can assign users to it, as we can see below:


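The role itself is defined in the Datamart editor, not in SQL, but for readers who know classic SQL Server row-level security, the effect is comparable to putting a filter predicate on the table, along these lines. This is only an analogy with a made-up table name; the Datamart’s SQL endpoint is read-only, so this DDL is not something you would run there.

-- T-SQL analogy of the role above: members of the role only ever see
-- rows whose category ID is 1 or 2 (dbo.FactSales is hypothetical).
CREATE FUNCTION dbo.fn_CategoryFilter (@CategoryID int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS allowed WHERE @CategoryID IN (1, 2);
GO

CREATE SECURITY POLICY dbo.CategoryFilterPolicy
ADD FILTER PREDICATE dbo.fn_CategoryFilter(CategoryID) ON dbo.FactSales
WITH (STATE = ON);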
Querying data
As I said at the start of this article, we can use a visual query designer to get the data out without writing SQL (link). As we can see below, the Visual Query Designer is similar to Markit EDM, SSIS and Informatica.

We can select rows, filter rows, select columns, split columns, remove duplicates, join tables, change data types, transpose tables and pivot columns. We can also do error handling, such as removing or replacing errors (link). And we can create custom functions (link).


Of course we can also write SQL:

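For example, a hand-written query against the Datamart could look like the one below, doing the same kind of filtering, joining and grouping that the visual query designer produces. The table and column names here are made up purely for illustration.

-- A made-up example: total sales per category for 2022, joining a fact
-- table to a dimension, filtering on date and aggregating.
SELECT   c.CategoryName,
         SUM(s.SalesAmount) AS TotalSales
FROM     FactSales AS s
JOIN     DimCategory AS c
         ON c.CategoryKey = s.CategoryKey
WHERE    s.OrderDate >= '2022-01-01'
  AND    s.OrderDate <  '2023-01-01'
GROUP BY c.CategoryName
ORDER BY TotalSales DESC;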
Accessing Power BI Datamart from SSMS
We can also query the data in the Power BI Datamart using SSMS (SQL Server Management Studio), and using Azure Data Studio too (link):

Not only can we use SSMS for querying data in the Datamart, but also for adding users and configuring security. We can also use SSMS to look at the query statistics. This is how it looks from SSMS (link):

We can see above that the Power BI Datamart tables are listed as views under the “model” schema. The relationships in the Power BI Datamart are stored in two views called metadata.relationships and metadata.relationshipColumns, as shown above (link).
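If you want to explore this yourself from SSMS or Azure Data Studio, queries along these lines should do it. I am using SELECT * on the metadata views because their exact columns may vary.

-- The Datamart tables are exposed as views in the model schema.
SELECT TABLE_SCHEMA, TABLE_NAME
FROM   INFORMATION_SCHEMA.VIEWS
WHERE  TABLE_SCHEMA = 'model';

-- The relationship metadata mentioned above.
SELECT * FROM [metadata].[relationships];
SELECT * FROM [metadata].[relationshipColumns];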
Conclusion
I think Power BI Datamart is a game changer.
- It enables us to gather data from databases, data lakes, files and APIs into one data mart.
- We can reload data every day/week and we can do incremental loading.
- We can query the data using SSMS (can also do admin stuff).
- We can build Power BI reports easily (the dataset is automatically created).
- We can query the data using the visual query designer or by writing SQL.
- We can easily get the data out to BI or ML tools, and to Excel.
It is a very powerful tool. Anyone in data warehousing and BI should consider it, or at least take a look.
And I think the name suits it well: the Power BI Datamart is indeed a good data mart, with all its bells and whistles.
Hi – how different is this from ADF? I know ADF doesn’t have BI functionality. Aren’t they developing multiple tools with overlapping functionality?
Comment by Radnor — 1 November 2022 @ 11:11 am
Hi Radnor, Power BI Datamart is a database, whereas Azure Data Factory is an ETL tool. So they are different.
You are right that there is some overlapping functionality: both can do ETL. But the ETL in the Datamart is limited to loading data into itself, whereas the ETL in ADF can load into anything.
Comment by Vincent Rainardi — 2 November 2022 @ 3:32 am
Thanks for your response. A couple of questions:
a.) In the cloud, folks are trying to convert existing ETL tool code to Python/PySpark. Can you share your thoughts on code rewriting? Would it have helped if the old/legacy warehouses had loaded data into a landing area (similar to cloud storage) and then used SQL or PL/SQL to load the data, so that any ETL tool could technically be eliminated? Wouldn’t the rewrite of those applications be easier then, since one could use the same SQL in PySpark SQL?
b.) Are we missing something along these lines when trying to rewrite in Python/PySpark in Databricks? And what if Databricks gets replaced with Snowflake, do they end up rewriting it all over again?
c.) For data, dataframes and databases, the underlying way of interacting should basically be SQL. Is my understanding correct?
d.) Also, why is Databricks trying to define the architecture in Delta Lake as Gold > Silver > Bronze? I mean, no ETL tool ever defined the layers (it used to be ODS > staging > final) as part of their documentation, and yet they have kind of come around to regular warehousing concepts. Is my understanding correct here?
Comment by Radnor — 2 November 2022 @ 9:26 pm
Hi Radnor, those are very deep questions and it would take a long time to find the answers. Thanks for putting them forward, much appreciated. My initial thoughts are as follows, and please correct me if I’m wrong.
SQL is one of the best and most tested ways of processing data, for example joining, filtering and grouping. But there are many things at which Pandas dataframes are better, for example string processing, date processing, column operations, matrix operations and mathematical operations. In PySpark we have both.
About code rewriting, I agree with you. It would have helped if the legacy data warehouse had processed the data using SQL from the staging tables to the dimensional model, because that SQL could be reused in PySpark.
In Snowflake we use PUT, GET and COPY INTO to load files into staging tables, but you are right that we use SQL to populate the dimensional model, i.e. INSERT, UPDATE, DELETE and MERGE.
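As a rough sketch of what I mean (the stage, file and table names are made up, and PUT is run from the SnowSQL client rather than a worksheet):

-- Upload the file to an internal stage, then load it into a staging table.
PUT file:///data/sales_2022.csv @sales_stage;

COPY INTO staging.sales
FROM @sales_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Then populate the dimensional model with plain SQL.
MERGE INTO dw.fact_sales AS t
USING staging.sales AS s
  ON t.sales_id = s.sales_id
WHEN MATCHED THEN UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN INSERT (sales_id, amount) VALUES (s.sales_id, s.amount);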
There is so much sales spin around Gold, Silver and Bronze data in the data lake business. You don’t have to use them. You can just put the data into different folders according to the data source. Each data source goes to its own folder (as Parquet or Delta files) and its own Databricks database. These databases function as a staging area (one per data source), from which we load into the dimensional model.
Hope this helps and I look forward to your comments. And everybody else’s too.
Comment by Vincent Rainardi — 3 November 2022 @ 8:44 am