Which one is better for data lake in Azure: Databricks or Synapse Analytics? Databricks is easy to query using SQL from Power BI, Qlik or Tableau, as it has SQL endpoint and Partner Connect. On the other hand, Synapse Analytics can run on a SQL database, on a Spark cluster or on a storage account, all of which use SQL so they are easy to query too.
In terms of loading data in, both of them use Azure Data Factory (ADF). ADF has Databricks Notebook activity which enables us to transform data and ingest files from Azure storage account (either containers or file shares) into Databricks tables on ADLS2. This ETL process can be automated using Delta Live tables and Auto Loader (link).
Both ADF pipeline and Synapse pipeline can be used to ingest data into Synapse Analytics from many data sources using Copy activity. Synapse Analytics has Data Explorer which can pull data in batches or streaming from various sources, validate the data and doing data conversion too.
Synapse Analytics has changed a lot in the last 2 years. Previously it is synonymous to SQL Data Warehouse (previously known as Parallel Data Warehouse, PDW). It runs PolyBase SQL query, which is a distributed query engine. The main drawback was that this SQL DW or PDW was very expensive (this was 2010: link). Today the cost perception has completely changed, with $50/month we can start development on Synapse. I created a Synapse workspace, with SQL Pool and Spark Pool with 3 nodes, disable after 15 mins of inactivity, no Data Explorer, end up with about $10/month. So it is possible for small & medium companies to use it, unlike SQL DW 10 years ago.
The key to the answer I think, is the “data lake”. How does Databricks compare with Synapse, when they are used as a data lake? To be used as a data lake Databricks is better, because we can easily create a metadata driven data ingestion. Meaning that when we have a new data (from a file or a table), we can automatically create the ingestion Databricks notebooks to ingest the new data into data lake.
On the other hand, if it is for data warehousing platform, Synapse is better, because it has 3 modes including serverless SQL pool endpoints that we can use to query data files in Azure Data Lake, using SQL (link). We can then create a logical data warehouse on top of the raw data without moving the data (see my article here). We can transform the data in the data lake using Transact SQL and load it into a dimensional model in a Synapse SQL database.
I mean the difference between them (for a data lake) is not that big. In Synapse, data scientists can also query the data lake using OpenRowSet. And with a bit of an effort we can programmatically create metadata-based ingestion in Synapse pipeline (not using Copy Into but using Python Notebooks and SQL Scripts).
Conclusions
- Synapse today is very different to two years ago, so take a look and try it. It is still changing rapidly so two year from now it would have more new features and may look different.
- Which one is better for data lake in Azure: Databricks or Synapse Analytics? In my opinion Databricks is slightly better for data lake, but as a complete package for Data Lake + Data Warehouse, Synapse Analytics is better (compared to Databricks Lakehouse).
- Both platforms are changing rapidly, for example a year ago we didn’t have Lakehouse, Photon, SQL Warehouse or Partner Connect on Azure Databricks. And a year ago there was no Data Explorer in Synapse Analytics, and Synapse has completely changed. Two years ago Databricks was way better than Synapse for data lake, but today the gap has narrowed a lot. A year from now the status can change or reverse.
PS. Note that what I’m talking about here is Azure. Databricks in Azure. Not in AWS.
Would welcome your opinion and comments.