Data Warehousing and Data Science

30 December 2022

Data Lake in Azure: Databricks or Synapse?

Filed under: Data Warehousing — Vincent Rainardi @ 4:17 pm

Which one is better for a data lake in Azure: Databricks or Synapse Analytics? Databricks is easy to query using SQL from Power BI, Qlik or Tableau, as it has a SQL endpoint and Partner Connect. On the other hand, Synapse Analytics can run on a SQL database, on a Spark cluster or directly on a storage account, all of which can be queried with SQL, so they are easy to query too.
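
As an illustration, a Python script (or any tool speaking the same protocol as Power BI or Tableau) can query Databricks through its SQL endpoint. Below is a minimal sketch using the databricks-sql-connector package; the hostname, HTTP path, access token and table name are all hypothetical.

```python
# A minimal sketch of querying a Databricks SQL endpoint from Python.
# Requires: pip install databricks-sql-connector
# The hostname, HTTP path, access token and table name are hypothetical.
from databricks import sql

with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",  # hypothetical
    http_path="/sql/1.0/warehouses/abcdef1234567890",              # hypothetical
    access_token="dapi-xxxx",                                      # hypothetical
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT order_id, amount FROM sales.orders LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```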

In terms of loading data in, both of them use Azure Data Factory (ADF). ADF has a Databricks Notebook activity, which enables us to transform data and ingest files from an Azure storage account (either containers or file shares) into Databricks tables on ADLS2. This ETL process can be automated using Delta Live Tables and Auto Loader (link).
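
For example, a Databricks notebook (invoked from an ADF Databricks Notebook activity) can use Auto Loader to pick up new files landing in a storage container and append them to a Delta table. This is a rough sketch, assuming a notebook where spark is already available; the container paths and table name are made up.

```python
# A rough sketch of Auto Loader in a Databricks notebook (PySpark).
# The container paths, checkpoint location and table name are hypothetical.
(spark.readStream
      .format("cloudFiles")                    # Auto Loader source
      .option("cloudFiles.format", "csv")      # format of the incoming files
      .option("cloudFiles.schemaLocation",
              "abfss://lake@mystorage.dfs.core.windows.net/_schemas/orders")
      .load("abfss://landing@mystorage.dfs.core.windows.net/orders/")
      .writeStream
      .option("checkpointLocation",
              "abfss://lake@mystorage.dfs.core.windows.net/_checkpoints/orders")
      .trigger(availableNow=True)              # process what is there, then stop
      .toTable("bronze.orders"))               # Delta table in the lake
```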

Both ADF pipelines and Synapse pipelines can be used to ingest data into Synapse Analytics from many data sources using the Copy activity. Synapse Analytics also has Data Explorer, which can pull data in batches or as a stream from various sources, validate the data and do data conversion too.

Synapse Analytics has changed a lot in the last 2 years. Previously it was synonymous with SQL Data Warehouse (previously known as Parallel Data Warehouse, PDW). It runs PolyBase SQL queries (PolyBase is a distributed query engine). The main drawback was that this SQL DW or PDW was very expensive (this was 2010: link). Today the cost perception has completely changed: with $50/month we can start development on Synapse. I created a Synapse workspace with a SQL pool and a Spark pool with 3 nodes, set to disable after 15 minutes of inactivity, no Data Explorer, and ended up with about $10/month. So it is now possible for small and medium companies to use it, unlike SQL DW 10 years ago.

The key to the answer, I think, is the “data lake”. How does Databricks compare with Synapse when they are used as a data lake? As a data lake, Databricks is better, because we can easily create metadata-driven data ingestion. Meaning that when we have new data (from a file or a table), we can automatically create the Databricks ingestion notebooks that ingest the new data into the data lake (see the sketch below).
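
To illustrate what I mean by metadata-driven ingestion, the sketch below loops over a hypothetical metadata table that lists source paths, file formats and target tables, and ingests each one into the lake. In practice this would run as a Databricks notebook; all the names are made up.

```python
# A simplified sketch of metadata-driven ingestion in a Databricks notebook.
# The metadata table, source paths and target table names are hypothetical.
metadata = spark.table("config.ingestion_metadata").collect()
# each row has: source_path, file_format, target_table

for row in metadata:
    df = (spark.read
               .format(row.file_format)        # e.g. "csv", "json", "parquet"
               .option("header", "true")
               .load(row.source_path))         # file in the storage account
    (df.write
       .format("delta")
       .mode("append")
       .saveAsTable(row.target_table))         # Delta table in the data lake
    print(f"Ingested {row.source_path} into {row.target_table}")
```

This is one way to do it; another is to generate the ingestion notebooks themselves from the metadata, as described above. Either way, adding a new source is just a new metadata entry.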

On the other hand, as a data warehousing platform Synapse is better, because it has 3 modes, including serverless SQL pool endpoints that we can use to query data files in Azure Data Lake using SQL (link). We can then create a logical data warehouse on top of the raw data without moving the data (see my article here). We can transform the data in the data lake using Transact-SQL and load it into a dimensional model in a Synapse SQL database.
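
For example, the serverless SQL pool can query Parquet files in the lake directly with OPENROWSET, and such queries can be wrapped in views to form the logical data warehouse. Here is a sketch of running one such query from Python via pyodbc; the endpoint, database, credentials and storage path are hypothetical.

```python
# A sketch of querying data lake files through a Synapse serverless SQL pool.
# The endpoint, database, credentials and storage path are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"  # serverless endpoint
    "DATABASE=lakehouse;UID=myuser;PWD=mypassword"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorage.dfs.core.windows.net/lake/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""
for row in conn.cursor().execute(query):
    print(row)
```

Wrapping the OPENROWSET query in a CREATE VIEW statement gives us the “logical data warehouse” layer without moving the data.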

I mean the difference between them (for a data lake) is not that big. In Synapse, data scientists can also query the data lake using OPENROWSET, as in the sketch above. And with a bit of effort we can programmatically create metadata-based ingestion in a Synapse pipeline (not using Copy Into, but using Python notebooks and SQL scripts).

Conclusions

  1. Synapse today is very different to two years ago, so take a look and try it. It is still changing rapidly, so two years from now it will have more new features and may look different.
  2. Which one is better for a data lake in Azure: Databricks or Synapse Analytics? In my opinion Databricks is slightly better for a data lake, but as a complete package for Data Lake + Data Warehouse, Synapse Analytics is better (compared to Databricks Lakehouse).
  3. Both platforms are changing rapidly. For example, a year ago we didn’t have Lakehouse, Photon, SQL Warehouse or Partner Connect on Azure Databricks. And a year ago there was no Data Explorer in Synapse Analytics, and Synapse has completely changed. Two years ago Databricks was way better than Synapse for a data lake, but today the gap has narrowed a lot. A year from now the situation could change or even reverse.

PS. Note that what I’m talking about here is Azure: Databricks in Azure, not in AWS.

Would welcome your opinion and comments.

28 December 2022

Share it, don’t move it

Filed under: Data Warehousing — Vincent Rainardi @ 9:32 am

I came across JP Morgan Chase’s implementation of Data Mesh in AWS (YouTube, AWS, Dave Vellante), and was amazed by the concept of “don’t move the data, but share it”. Many years ago when I was doing a data platform project in a large bank in London, the bank insisted that we should not move the data. They already had a data store, and didn’t want to create another data store. So the concept was not new, but today with data mesh it has become much more popular. Fannie Mae did it (link), Go Daddy did it (link), and many others are implementing this concept of “don’t move the data”.

In my book (2007) I called it Enterprise Information Integration (A below), where the data is shared. Compare this with a Federated Data Warehouse (B below), where the data is moved.

OpenFin

OpenFin technology (link) enables companies to consume data from financial companies such as Factset and Refinitiv, as well as their own internal data. We can use the OpenFin desktop to integrate data from various data sources, without moving the data. That is a new concept. For 20 years I have been integrating data into databases and data lakes. But with OpenFin, we integrate data directly on the user desktop, without integrating it first in a database or a data lake (link).

So the core idea is that rather than integrating data in a data warehouse (or a data lake), companies should integrate data directly on the user desktop, as shown below. Data from Refinitiv, Salesforce, Factset, S&P Global, etc. can be integrated directly on the user desktop (B), without being integrated first in a data warehouse (A). We can also create an application which publishes data from the data warehouse to the user desktop (C).

OpenFin is a desktop/workspace. It is like a browser where you can open many tabs or boxes, each tab or box containing an application. The key feature here is that the tabs and boxes are connected and synchronised. If we change the data in one box, the data in all the other boxes changes automatically. For example, if in Box 1 you change from product A to product B, Box 2 would change from displaying, say, the sales and costs for product A to those for product B. That sales and costs data can come from within your company, or it can come from another company, shared via a protocol called FDC3 (link).
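
FDC3 itself is specified as a JavaScript/TypeScript API, so the sketch below is only a conceptual illustration in Python of the broadcast-and-listen idea behind it (it does not use any real FDC3 library): one box broadcasts a context change, and the other boxes react to it.

```python
# A conceptual sketch of the FDC3-style "broadcast a context" idea.
# Illustrative only; real FDC3 is a JavaScript/TypeScript API standard.
from typing import Callable

class Channel:
    """A toy context channel: boxes broadcast and listen for context changes."""
    def __init__(self) -> None:
        self._listeners: list[Callable[[dict], None]] = []

    def add_context_listener(self, listener: Callable[[dict], None]) -> None:
        self._listeners.append(listener)

    def broadcast(self, context: dict) -> None:
        for listener in self._listeners:
            listener(context)

channel = Channel()

# Box 2 listens: when the selected product changes, it re-queries sales and costs.
channel.add_context_listener(
    lambda ctx: print(f"Box 2: showing sales and costs for {ctx['product']}")
)

# Box 1 changes the selection from product A to product B.
channel.broadcast({"type": "product", "product": "Product B"})
```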

So there are 2 new concepts:

  1. Don’t move the data but share it.
  2. Integrate data directly on user desktop, not in a database or data lake.

Data Mesh

That concept of “don’t move the data but share it” is probably best achieved using Data Mesh. What is Data Mesh, and why does it play a critical role in implementing this “data sharing” concept?

The Data Mesh concept was introduced by Zhamak Dehghani in 2019. First, please read her article on Data Mesh: link. There is no point in me rewriting it here, as she explains it very clearly there.

Data Mesh is based on 4 principles. One of them is the self-service data platform. This data platform should be cloud based (SaaS if possible), where data is treated as products that live in separate business domains.

It should be self-service, meaning data users should be able to get the data themselves. For that, it is critical that the platform provides two things:

  1. A searchable data catalogue to enable the users to find the data they need.
  2. An API which enables the users to download the data, or their applications to consume the data (see the sketch below).
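
As a sketch of what that might look like for a data consumer, the snippet below searches a hypothetical catalogue API and then downloads a data product; the URL, endpoints, field names and authentication scheme are all made up.

```python
# A hypothetical sketch of consuming a data product via the platform's API.
# The URL, endpoints, field names and authentication scheme are all made up.
import requests

BASE_URL = "https://dataplatform.example.com/api/v1"   # hypothetical
HEADERS = {"Authorization": "Bearer <token>"}          # hypothetical auth

# 1. Search the data catalogue for a data product
catalogue = requests.get(
    f"{BASE_URL}/catalogue/search",
    params={"q": "fund performance"},
    headers=HEADERS,
).json()

# 2. Download the first data product found in the catalogue
dataset_id = catalogue["results"][0]["id"]
data = requests.get(f"{BASE_URL}/datasets/{dataset_id}/data", headers=HEADERS).json()
print(f"Retrieved {len(data['rows'])} rows")
```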

Data Processing

That concept of “don’t move the data but share it” is very difficult to do in practice. This is because of the processing required before the data can be consumed. Take for example the performance of funds (3 months, 1 year, etc.), which needs to be attributed to various industry sectors, asset classes and currencies, down to the instrument level (or even trade level). Once this processing is done, we need to store the results. So very quickly, we realise that we need to store the data.

But the output of the attribution calculation is not the raw data. It is new data. If your application produces new data, then you should publish it back to the data platform (e.g. the data lake or data warehouse), so that other data consumers can use that new data.

In fact, the discipline of publishing the results of data processing back into the data platform is the cornerstone of the whole “share it, don’t move it” concept. Otherwise the entire thing won’t work.
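
As a sketch of that discipline, an attribution application could write its results back into the data lake so that other consumers can use them. The lake path, column names and figures below are made up, and writing to abfss:// assumes the Spark cluster is configured for ADLS access.

```python
# A sketch of publishing attribution results back into the data lake.
# The lake path, column names and figures are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("attribution-publish").getOrCreate()

# The output of the attribution calculation: new data, not raw data.
results = spark.createDataFrame(
    [("Fund A", "Technology", 0.012), ("Fund A", "Energy", -0.004)],
    ["fund", "sector", "attribution_3m"],
)

# Publish it back to the data platform so other consumers can use it.
(results.write
        .format("parquet")                   # or "delta" on Databricks
        .mode("overwrite")
        .save("abfss://lake@mystorage.dfs.core.windows.net/curated/attribution/"))
```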

Data Lake

One last thing: because of this “publishing calculation results back to the data platform”, a Data Lake is more suitable than a Data Warehouse as the data platform. A Data Lake is more versatile for that job, and more cost effective.

27 December 2022

Difference between Extract and Ingest

Filed under: Data Warehousing — Vincent Rainardi @ 9:57 am

What is the difference between Extract (as in ETL = Extract-Transform-Load) and Ingest (as in data ingestion)?

Difference #1: The direction of the flow

  • Extract means pulling data out of the operational system.
  • Ingest means putting data into an application.

So the difference is clear. Extract is pulling data out, whilst Ingest is pushing data in.

Difference #2: The reason for doing it

The reason we extract data out of the operational system is to populate a data warehouse. Or a data mart. So we extract in order to store it in another place.

The reason we ingest data into an application is to use it for analysis. Or for prediction (ML). Or for calculating something.

So it is clear that the reason we pull data out of an operational system is to store it (not to use it directly), whereas the reason we push data into an application is to use it directly.

Difference #3: The transformation

The third difference is transformation:

  • We extract data in order to transform it. To change the format, and then to store it in a different format. For example, in a dimensional model format.
  • We ingest data not in order to change the format, but to use it. To consume it (see the sketch below).
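
To make the contrast concrete, here is a small sketch: the first part extracts operational data, reshapes it into a dimensional format and stores it; the second part simply ingests a file and uses it as it is. The file names and columns are made up.

```python
# A small sketch contrasting extract-transform-load with ingest-and-use.
# File names and column names are hypothetical.
import pandas as pd

# Extract: pull data out of the operational system, transform it into a
# different format (here a simple dimensional shape), then store it.
orders = pd.read_csv("operational_orders.csv")                # extract
dim_customer = orders[["customer_id", "customer_name"]].drop_duplicates()
fact_sales = orders[["order_id", "customer_id", "amount"]]
dim_customer.to_parquet("warehouse/dim_customer.parquet")     # load (store)
fact_sales.to_parquet("warehouse/fact_sales.parquet")

# Ingest: push data into the application and use it directly,
# without changing its format.
sales = pd.read_parquet("lake/sales.parquet")                 # ingest
print(sales["amount"].mean())                                 # use it (analysis)
```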

Data Lake

What about data lake? Is it ingest or extract + load?

It is ingest, because we don’t transform the data. We don’t change the format.

Difference #4: The end-to-end picture

It helps our understanding to look at the end-to-end picture. There are 3 use cases below: A, B and C.

In A, the data is extracted into a data warehouse/mart to be used for ML.

In B, the data is ingested into a data lake to be used for ML. Note the use of “ingest”.

In C, the data is ingested into a data lake, then extracted and loaded into a data warehouse, before it is used by a Business Intelligence application.

18 December 2022

Building Prod and Dev Environments

Filed under: Data Warehousing — Vincent Rainardi @ 5:54 pm

About 20 years ago, in a leafy little town in Surrey, I attended IT training for a week. Every year, the company paid for each member of staff to have development training of their choosing. That year I chose “Iterative Development”, after choosing .NET the previous year. The methodology was called Rational Unified Process, which is an iterative development process. Rational is now part of IBM.

In that training class we were taught that it is better to deliver things incrementally. So if the software we were building had 10 different parts, for the first release we should only do 1 part. In the second release we do another part, and so on.

What was remarkable, however, was that the first release was the most difficult, because we would need to set up the development and production environments. Set up the architecture, if you like. So if we call the architecture build Release 0, and then have 10 subsequent releases delivering the various functionalities, the effort for Release 0 would take a very large chunk of the whole project (say 25%).

That was why the training instructor told me that in the first release you should only deliver minimal functionality, because you have all this architecture and environment to deal with: procuring database servers and web/application servers, setting them up, installing and configuring storage, dealing with network configuration, and so on. Setting up the development environment, code repository, test harnesses, release pipelines, and so on.

I understood that infrastructure and architecture took a lot of effort because, prior to that training in Surrey, I had had several experiences configuring IT infrastructure. I had to configure an HP Unix server + storage in one project, and a cabinet of SQL Servers + storage in another. I worked as an environment manager on a mainframe system in one project, and set up a LAN for Ethernet and token ring in another.

That principle was ingrained in my mind. Since then I have worked for no fewer than 10 companies, delivering various data platforms (mostly data warehouse, big data and data lake platforms). And I always try to apply that principle: deliver the architecture first, both the production environment and the development environment. That is an enormous undertaking.

I call it “delivering 2 numbers”. In the first release (what I called Release 0), we just deliver 2 numbers. Not even to the screen, just to a database or data lake. Not even regularly, just as a one-off. But it is in production. The development environment may not be set up properly. Nor the test environment. The release pipeline may not be working properly. User access might not be set up. We may not be using a service account or a key vault. But the data load works in production. And it is only delivering 2 numbers (well, not literally 2 numbers, but a minimum amount of data, like just 1 table).

If we have to deliver 100 tables to the data warehouse/lake and build 10 different reports out of them, the effort for delivering the project is probably something like this (all items include design, build and test):

  1. Build development environment (including release pipeline): 15%
  2. Build production environment (including scheduling): 20%
  3. Build test environment (including automatic testing): 15%
  4. Setup security: 10%
  5. Build data load for 100 tables: 20%
  6. Build 10 reports: 10%
  7. Setup operational process: 10%

So the environment (including security) is 60% of the effort, whereas the business functionality (data load + reports) is 30%. So the environment takes about twice the effort of the functionality.

And that is what the business users often don’t realise. “How come you guys have been working for 3 months and not a single report has been built yet? I mean there are 5 of you! What have you been doing?”

Building the environment is like digging the foundations of a building. It takes a lot of time, and nobody appreciates it. When the foundations are built, the structure of the building can go up on them. Then the kitchen, flooring, etc. can be done very quickly. But it is the foundations that take up a lot of time. For a long time we don’t see any results.

I hope this was helpful. As usual if you have any comments or questions just click the link below. Or you can email me at vrainardi@gmail.com.
