I’m not going to dwell on “what is big data”. You can read the definition here: link. It is basically data that doesn’t fit well in a relational database, so we store it as files in Hadoop or a data lake.
In this article I would like to write specifically about Big Data Analytics, i.e. how big data is processed and analysed, and in particular which tools in Azure we can use to analyse it.
Databricks and HDInsight
The most popular approach today is to store big data in Azure Data Lake Storage Gen2 (ADLS2), create a Spark cluster on top of it, and do the analysis using Azure Databricks Notebooks. You can use SQL, Python, R or Scala to query and analyse the data. Here is the architecture (link):
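As a rough illustration of the Databricks approach above, here is a minimal Python sketch of what a notebook cell might look like. The storage account name, container, folder and the `region` and `amount` columns are all hypothetical, and the sketch assumes the cluster has already been given access to the storage account. The `abfss://` address format is the one ADLS2 uses; `spark` is the Spark session that a Databricks notebook provides.

```python
def adls_gen2_uri(account: str, container: str, path: str) -> str:
    """Build an abfss:// URI for a file or folder in ADLS Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


# Hypothetical storage account, container and folder names.
SALES_URI = adls_gen2_uri("mydatalake", "raw", "/sales/2023/")


def total_sales_per_region(spark, uri: str):
    """Read Parquet files from the lake and total the sales per region.

    `spark` is the SparkSession available in a Databricks notebook.
    The `region` and `amount` columns are assumed for illustration.
    """
    df = spark.read.parquet(uri)
    df.createOrReplaceTempView("sales")  # expose the data to SQL
    return spark.sql(
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )


print(SALES_URI)
```

In a notebook you would call `total_sales_per_region(spark, SALES_URI).show()` to see the result; the same query could equally be written in a SQL, R or Scala cell.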

The older method is to use Azure HDInsight. We create an HDInsight Spark cluster on ADLS2 storage and put the data in that storage. We then use Jupyter Notebooks to query and analyse the data, using PySpark, SQL or Scala. We can also use HBase to process NoSQL (schemaless) data and use LLAP to query Hive tables interactively. Here is the architecture (link):

Stream Analytics
One common kind of big data is streaming data, such as stock market data, social media feeds, web logs, traffic data, weather data and IoT data (sensors, RFID).
The most popular method is to use Azure Stream Analytics (ASA) to analyse real-time data streams. ASA can take input from Azure Event Hubs, Azure IoT Hub or Azure Blob Storage. The ASA query language is based on T-SQL (link), which we can use to filter or aggregate the data stream over time. Here is the architecture (link):
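To make “aggregate the data stream over time” concrete, here is a small Python sketch of a tumbling window, i.e. fixed, non-overlapping time buckets. In ASA itself this is expressed in the query with a `GROUP BY` on a window function such as `TumblingWindow(second, 10)`. The sketch below only mimics that idea on a small in-memory list; the sensor names and readings are made up, and a real ASA job emits results continuously as each window closes rather than at the end.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average the value per (window start, device) over fixed windows.

    `events` is an iterable of (timestamp_seconds, device_id, value) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running sum, count]
    for ts, device, value in events:
        # Every timestamp belongs to exactly one window.
        window_start = (ts // window_seconds) * window_seconds
        acc = sums[(window_start, device)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Hypothetical sensor readings: (seconds since start, device id, temperature)
readings = [
    (1, "sensor-a", 20.0),
    (4, "sensor-a", 22.0),
    (7, "sensor-b", 30.0),
    (12, "sensor-a", 24.0),
    (15, "sensor-b", 31.0),
    (19, "sensor-b", 33.0),
]

averages = tumbling_window_avg(readings, window_seconds=10)
for (window_start, device), avg in averages.items():
    print(f"window starting at {window_start}s, {device}: {avg:.1f}")
```

With a 10-second window, the first two `sensor-a` readings fall into the same bucket and are averaged together, while the reading at 12 seconds starts a new bucket.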

If we use HDInsight, the older method is to use Kafka to build real-time streaming data pipelines and applications. We can also use Storm to do real-time event processing. Here is the architecture (link):

Azure Synapse
An alternative to Databricks and HDInsight is Azure Synapse Analytics (link). We put the data in ADLS2 and use Azure Synapse Analytics to analyse it using SQL or Spark. We can also use Data Explorer for time series data. In Synapse Studio we can create pipelines (link) to process data (similar to ADF pipelines, but with some differences: link). Here is the architecture: (link)

Machine Learning
These days, the main form of analytics is not business intelligence or reporting, and not even stream analytics. It is machine learning. Machine learning may not be the most widely used form of analytics, but it is certainly the most powerful, for example in its ability to predict and to identify the most influential factors.
Whether your data is in Databricks, HDInsight or Synapse, you can use Azure Machine Learning (AML, link). We can use AML to create ML models for prediction or regression, whether using a Spark MLlib notebook (link), using a Python notebook (link), or without coding (link). Here is the architecture: (link)
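For readers who have not built a regression model before, here is a minimal Python sketch of the idea, fitting a straight line by ordinary least squares. It deliberately uses no AML or Spark MLlib calls, so it is not how you would do it in those tools; it only shows what “fit a model, then predict” means. The data points are made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data that happens to lie on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

model = fit_line(xs, ys)
print("slope and intercept:", model)
print("prediction for x = 5:", predict(model, 5.0))
```

In a notebook you would normally let a library do the fitting; the value of a platform such as AML lies in everything around this step, such as tracking, deployment and monitoring.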

As somebody who has developed many ML models, I can tell you that DevOps is a big gap in machine learning. Azure Machine Learning covers the complete DevOps cycle, from development to deployment and operational support. This removes a big headache if you are a development manager trying to do machine learning in your company.
Data Visualisation
Of course, no discussion about analytics is complete without mentioning data visualisation, i.e. BI and reporting. There is only one tool in the Microsoft toolbox: Power BI. Whether your data is in a data lake, a data warehouse, data marts, CSV files, Excel or an API (or any other form), the Microsoft way is Power BI. No, they don’t promote Python visualisation libraries such as Seaborn. In fact, you can create Python visuals within Power BI: link. Here is the architecture: (link)
