Data Warehousing and Data Science

24 November 2021

Azure Big Data Analytics

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I’m not going to dwell on “what is big data”. You can read the definition here: link. It is basically data that does not fit well in a database, so we store it as files in Hadoop or a data lake.

In this article I would like to write specifically about Big Data Analytics, i.e. how big data is processed and analysed, and in particular which tools in Azure we can use to analyse it.

Databricks and HDInsight

The most popular approach today is to store big data in Azure Data Lake Storage Gen2 (ADLS2), create a Spark cluster on top of it, and do the analysis using Azure Databricks notebooks. You can use SQL, Python, R or Scala to query and analyse the data. Here is the architecture (link):

The older method is to use Azure HDInsight. We create an HDInsight Spark cluster on ADLS2 storage and put the data in it. We then use Jupyter Notebooks to query and analyse the data, using PySpark, SQL or Scala. We can also use HBase to process NoSQL (schemaless) data and LLAP to query Hive tables interactively. Here is the architecture (link):
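
As a rough sketch of the Databricks approach described above, the snippet below builds the abfss:// URI that Spark uses to address a folder in ADLS2, and shows how a notebook might read that folder and query it with SQL. The storage account `mydatalake`, the container `raw`, the folder `trades/2021`, the view name `trades` and the column `symbol` are all made up for illustration. `spark` is the session object that a Databricks notebook provides, so the second function only works inside such a notebook, with access to the storage account already configured.

```python
def adls2_uri(account: str, container: str, path: str) -> str:
    """Build an abfss:// URI for a folder or file in ADLS Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def count_rows_per_symbol(spark, uri: str):
    """Read Parquet files from the data lake and aggregate them with Spark SQL.

    `spark` is the SparkSession that a Databricks notebook provides.
    The view name `trades` and the column `symbol` are assumptions about the data.
    """
    df = spark.read.parquet(uri)
    df.createOrReplaceTempView("trades")
    return spark.sql("SELECT symbol, COUNT(*) AS n FROM trades GROUP BY symbol")


# Hypothetical account, container and folder names.
uri = adls2_uri("mydatalake", "raw", "/trades/2021")
print(uri)
```

In a notebook we would then call `count_rows_per_symbol(spark, uri).show()`. The same query could equally be written in a SQL, R or Scala cell.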

Stream Analytics

One important kind of big data is streaming data, such as stock market data, social media feeds, web logs, traffic data, weather data and IoT data (sensors, RFIDs).

The most popular method is to use Azure Stream Analytics (ASA) to analyse real-time data streams. ASA can take input from Azure Event Hubs, Azure IoT Hub or Azure Blob Storage. The ASA query language is based on T-SQL (link), which we can use to filter or aggregate the data stream over time windows. Here is the architecture (link):

If we use HDInsight, the older method is to use Kafka to build real-time streaming data pipelines and applications. We can also use Storm to do real-time event processing. Here is the architecture (link):
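
To make "aggregate the data stream over time" concrete, here is a small plain-Python sketch of what a tumbling window does: it cuts the stream into fixed, non-overlapping windows and produces one result per window. In ASA itself this would be a query along the lines of `SELECT DeviceId, AVG(Value) FROM input TIMESTAMP BY EventTime GROUP BY DeviceId, TumblingWindow(second, 10)`. The input name, the column names and the sensor readings below are made up for illustration, and the Python code only imitates the idea; it is not how ASA is implemented.

```python
import math
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average the values per device over fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, device_id, value).
    Each window excludes its start time and includes its end time.
    The result is keyed by (window_end, device_id).
    """
    totals = defaultdict(lambda: [0.0, 0])  # (window_end, device) -> [sum, count]
    for ts, device, value in events:
        # Round the timestamp up to the end of the window it falls into.
        window_end = math.ceil(ts / window_seconds) * window_seconds
        bucket = totals[(window_end, device)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(totals.items())}


# Hypothetical sensor readings: (seconds since start, device, temperature).
readings = [
    (1, "sensor-a", 20.0),
    (4, "sensor-a", 22.0),
    (10, "sensor-a", 24.0),  # exactly on the boundary: belongs to the window ending at 10
    (11, "sensor-a", 30.0),
    (7, "sensor-b", 5.0),
]
result = tumbling_window_avg(readings, 10)
for (window_end, device), avg in result.items():
    print(window_end, device, avg)
```

Unlike this sketch, which needs all events up front, ASA evaluates the query continuously and emits each window's result as the window closes.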

Azure Synapse

The alternative to Databricks and HDInsight is to use Azure Synapse Analytics (link). We put the data in ADLS2 and use Azure Synapse Analytics to analyse it using SQL or Spark. We can also use Data Explorer for time series data. In Synapse Studio we can create pipelines (link) to process data (similar to Azure Data Factory (ADF) pipelines, but not identical: link). Here is the architecture: (link)

Machine Learning

These days, the main form of analytics is not business intelligence or reporting, and not even stream analytics. It is machine learning. Machine learning may not be the most widely used form of analytics, but it is certainly the most powerful, i.e. in terms of prediction capability, understanding the most influential factors, etc.

Whether your data is in Databricks, HDInsight or Synapse, you can use Azure Machine Learning (AML, link). We can use AML to create ML models for prediction, e.g. classification or regression, whether using a Spark MLlib notebook (link), a Python notebook (link), or no code at all (link). Here is the architecture: (link)

As somebody who has developed many ML models, I can tell you that there is often a big gap when it comes to DevOps. Azure Machine Learning covers the complete DevOps cycle, from development to deployment and operational support. This takes away a big headache if you are a development manager trying to do machine learning in your company.
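
To make "regression" concrete, here is a minimal sketch in plain Python of what a regression model does: it fits a line to historical data and uses that line to predict a new value. This is not the Azure Machine Learning API. In practice a library such as Spark MLlib or scikit-learn would do the fitting inside an AML notebook, and the numbers below are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x, and how x and y vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up training data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
predicted = slope * 5.0 + intercept  # prediction for an unseen x = 5
print(slope, intercept, predicted)
```

Real models have many input columns and noisy data, which is why we rely on ML libraries rather than hand-written formulas.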

Data Visualisation

Of course, no discussion about analytics is complete without mentioning data visualisation, i.e. BI and reporting. There is only one tool in the Microsoft toolbox: Power BI. Whether your data is in a data lake, a data warehouse, data marts, CSV files, Excel or an API (or any other form), the Microsoft way is Power BI. No, they don’t promote Python visualisation libraries such as Seaborn, but you can create Python visuals within Power BI: link. Here is the architecture: (link)
