Data Platform and Data Science

28 June 2023

Microsoft Fabric for Machine Learning

Filed under: Data Science, Machine Learning — Vincent Rainardi @ 6:56 pm

I use Google Colab to do machine learning. The source data (e.g. from a data lake or warehouse) is prepared as a Pandas dataframe. So I would query the data lake/warehouse using SQL from Python, and store the result in the memory of Colab’s session as a Pandas dataframe. There is a lot of data preparation and processing before the raw data is ready to be fed into the ML models: EDA, correlation, data cleansing, dealing with missing data, outliers, anomalies, incorrect formats and inconsistent values, just to name a few. And then we need to do PCA, feature engineering, splitting, cross validation, etc. After all that is done, I persist the output as pickle files, ready to be fed into the ML models.
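To make this concrete, here is a minimal Python sketch of that preparation step. The connection string, table and column names are assumptions for illustration only; the general shape is: query the warehouse into a Pandas dataframe, clean it, engineer a feature, split it and pickle it.

import pickle
import pandas as pd
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical connection string to the warehouse (replace with your own).
engine = create_engine("postgresql://user:password@warehouse-host:5432/prices")

# Pull the raw data into a Pandas dataframe in the Colab session's memory.
df = pd.read_sql("SELECT trade_date, ticker, close_price FROM daily_prices", engine)

# Basic cleansing: drop duplicate rows and forward-fill missing prices per ticker.
df = df.drop_duplicates()
df["close_price"] = df.groupby("ticker")["close_price"].ffill()

# Simple feature engineering: 5-day return as an illustrative feature.
df["return_5d"] = df.groupby("ticker")["close_price"].pct_change(5)
df = df.dropna()

# Scale and split, then persist the output as pickle files for the ML models.
features = StandardScaler().fit_transform(df[["close_price", "return_5d"]])
X_train, X_test = train_test_split(features, test_size=0.2, shuffle=False)
with open("train.pkl", "wb") as f:
    pickle.dump(X_train, f)
with open("test.pkl", "wb") as f:
    pickle.dump(X_test, f)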

Of course we have many ML models, each having different hyperparameters. We need to run those models using various hyperparameters. Say in this project I’m doing stock price forecasting using various LSTM models. I would have to pick a performance measure to evaluate the results. That measure should be suitable for forecasting, say MAPE or RMSE. Using that measure I would finally find the best model (e.g. the model with the lowest RMSE), say an LSTM model with 2 dense layers of 1000 nodes and 1 dropout layer with rate = 0.2. I then use that best model to forecast on the production data.
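As an illustration, a minimal Keras sketch of such a model and metric could look like the one below. The input shape and the LSTM width are assumptions; only the two 1000-node dense layers and the 0.2 dropout come from the example above. And remember that for RMSE, lower is better.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative input shape: 60 time steps of 1 feature per sample.
model = keras.Sequential([
    layers.LSTM(64, input_shape=(60, 1)),
    layers.Dense(1000, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1000, activation="relu"),
    layers.Dense(1),  # next-day price forecast
])
model.compile(optimizer="adam", loss="mse")

# RMSE helper: the best model is the one with the lowest value.
def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))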

Deploying that best model into production and using it in earnest against production data is a different kettle of fish. The data processing: the SQL queries from Python would need to be able to read the production data lake, and the model would need to be able to write back into that lake (say in Google Cloud or AWS S3). So we need to permission them. The model needs to use the latest data, so the data in the lake would need to be refreshed every day. The forecast results (such as the next 3 months’ forecast for every stock in the portfolio plus the benchmark) need to be stored in a place where the analytics/BI tool can read them (so again, permissioning is required). And finally we will need to employ some sort of orchestrator to manage this process, such as Control-M, ActiveBatch or Redwood. But then again the Control-M agent would need to be permissioned in order to run those processes.
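As a rough sketch, the daily production job could look something like the code below. The bucket paths and the model artifact are hypothetical (the model is treated here as any object with a predict method); in reality the IAM permissions, the daily refresh and the scheduling sit with the platform and the orchestrator.

import pickle
import pandas as pd

# Hypothetical lake locations; reading and writing them needs s3fs plus IAM permissions.
LAKE_PATH = "s3://prod-data-lake/prices/latest.parquet"
OUTPUT_PATH = "s3://prod-data-lake/forecasts/next_3_months.parquet"

def run_daily_forecast():
    # Read the latest refreshed data from the production lake.
    prices = pd.read_parquet(LAKE_PATH)

    # Load the approved "best model" artifact produced during development.
    with open("best_model.pkl", "rb") as f:
        model = pickle.load(f)

    # Score every stock in the portfolio plus the benchmark, then write the
    # results back to the lake where the analytics/BI tool can read them.
    forecasts = pd.DataFrame({
        "ticker": prices["ticker"],
        "forecast": model.predict(prices[["close_price", "return_5d"]]),
    })
    forecasts.to_parquet(OUTPUT_PATH)

# An orchestrator such as Control-M would call this on a daily schedule.
if __name__ == "__main__":
    run_daily_forecast()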

Imagine for a second that you’ve battled for 6 months and all of the above is done. And now the Dev team is saying that they have developed version 2, using a CNN algorithm. This new model is better, faster and has more features. But it requires a little bit of new data. So a new ingestion pipeline will need to be built, and new permissioning will need to be done in order for that new model to work in production. In the end you will need to create some sort of release pipeline. Everything that will be released (code, parameters, etc.) is packaged together, tested together and approved as one bundle. All of this is called “productionalisation”. And ML is no different from any other software or warehouse: it needs productionalisation.

Microsoft Fabric

As Nellie Gustafsson said in the Microsoft Fabric blog (link), we can use Fabric to do ML end-to-end (she called it data science): from data sharing, data preparation and data exploration, to developing models, running models/experiments, deploying models to production, and creating data visualisations.

2 days ago, in an event hosted by Leon Gordon and Pragati Jain, Nellie explained the above diagram. You can see the recording on YouTube: link. Nellie was a Senior Program Manager in the SQL Server product team. She worked on SQL Server 2016 with R Services, ML Services in SQL Server 2017 and Big Data Clusters in SQL Server 2019. But now she is the product management lead for Data Science & AI in Synapse and Microsoft Fabric. It is quite reassuring, I can tell you, to hear her say something like “We don’t have that feature yet, but we will be working on it in the coming months”. She is the product management lead for Microsoft Fabric after all, so of course she knows what her team is working on.

I don’t want to repeat what she already said (please watch the YouTube recording above), but there are a few things in her diagram that I’d like to point out. The first one is Data Wrangler, a new tool for data preparation and processing. It’s a notebook-based tool where we can drop missing values, drop duplicate rows, fill missing values, find and replace, view the summary statistics, split text, strip whitespace, group by column, do one-hot encoding, scale min/max values, sort and do calculations, all in a visual way. For more info see here: link.

p.s. Data Wrangler is also available in Visual Studio Code, as an extension: link.
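To give an idea of what those steps amount to, here is roughly the kind of Pandas code such a visual tool produces behind the scenes. The file and column names are made up for illustration.

import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical input file

df = df.drop_duplicates()                        # drop duplicate rows
df = df.dropna(subset=["customer_id"])           # drop rows with a missing key
df["country"] = df["country"].fillna("Unknown")  # fill missing values
df["name"] = df["name"].str.strip()              # strip whitespace
df = pd.get_dummies(df, columns=["country"])     # one-hot encoding
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min())     # scale min/max values
summary = df.describe()                          # view the summary statistics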

The second one is Batch Predict. Rather than predicting one observation at a time, we can now predict many observations in one go. This feature has been in AWS (link) and GCP (link) for a while, and now it’s in Fabric.
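As a generic illustration of batch prediction (not Fabric’s specific API), an MLflow model can score a whole dataframe of observations in one call. The model URI and file names below are assumptions.

import mlflow.pyfunc
import pandas as pd

# Load a registered model (hypothetical name and version) and score a whole batch.
model = mlflow.pyfunc.load_model("models:/stock_forecaster/1")
batch = pd.read_parquet("scoring_input.parquet")
batch["prediction"] = model.predict(batch)
batch.to_parquet("scoring_output.parquet")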

Third, the Direct Lake mode. The ML model can write the forecasted/predicted values into the Lakehouse tables in Fabric’s OneLake, so Power BI can pick them up.
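For example, a minimal PySpark sketch of writing forecasts into a Lakehouse Delta table could look like this (the table and column names are illustrative; in a Fabric notebook the Spark session already exists).

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
forecast_df = spark.createDataFrame(
    [("AAPL", "2023-07-01", 195.3), ("MSFT", "2023-07-01", 342.1)],
    ["ticker", "forecast_date", "forecast_price"],
)

# Save as a Delta table in the Lakehouse so Power BI can read it in Direct Lake mode.
forecast_df.write.format("delta").mode("overwrite").saveAsTable("stock_forecasts")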

Fourth, MLflow enables us to compare models/experiments and manage the models (version control, code store, etc.). We can also package an ML model and share it with other data scientists. For more info see the full documentation here: link.
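A minimal MLflow tracking sketch, with illustrative experiment, parameter and metric values, looks like this:

import mlflow

mlflow.set_experiment("stock-price-forecasting")  # illustrative experiment name

with mlflow.start_run(run_name="lstm_2dense_dropout02"):
    mlflow.log_param("dense_layers", 2)
    mlflow.log_param("dropout_rate", 0.2)
    mlflow.log_metric("rmse", 1.87)  # metric value shown for illustration
    # mlflow.<flavour>.log_model(model, "model") would also store the model itself
    # (e.g. mlflow.keras.log_model), so it can be versioned and shared.

The runs can then be compared side by side in the experiment view, and the chosen model registered and versioned.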

That’s it. See her YouTube presentation (here’s the link again: link, thanks to Leon Gordon and Pragati Jain for hosting it) and start using Fabric.

26 June 2023

Cloud Data Warehouse

Filed under: Data Platform — Vincent Rainardi @ 7:59 am

A Cloud Data Warehouse (Cloud DW for short) is a data warehouse in a public cloud. For example:

  • Snowflake (29)
  • Amazon Redshift (19)
  • Azure Synapse (16)
  • Google BigQuery (13)

The numbers in brackets are the number of jobs in the UK market over one week (both contract and permanent). When searching you need to drop Amazon, Azure and Google, and just search for Redshift, Synapse and BigQuery. These numbers show that Snowflake is the most popular Cloud DW in the UK (Tejas FYI!), with the other 3 trailing behind.

Note: Azure Synapse was previously called Azure SQL DW. It’s Microsoft’s MPP SQL database engine.

If we build a DW on Amazon RDS or Aurora, does it count as a Cloud DW? Yes it does.
If we build a DW on Azure SQL, does it count as a Cloud DW? Yes it does.
If we build a DW on Databricks (either on AWS, GCP or Azure), does it count as a Cloud DW? Yes.
As long as it is a DW, and it is in a public cloud, it is a Cloud DW.

Back to “Snowflake is the most popular Cloud DW in the UK”: we need to note that Azure SQL is not included in the figures above. So if you ask which one is more popular, Snowflake or Azure, it would be Azure, because the number of DWs built on Azure SQL is far greater than on Azure Synapse. The same with Amazon RDS: the number of DWs built on RDS is far greater than on Redshift. This is because the costs of Azure SQL and RDS are much lower than Synapse and Redshift.

So Snowflake is probably not the most popular Cloud DW in the UK. Instead, it could be Azure SQL or RDS. Of course we can’t know how many of those jobs are for a DW, because the applications built on Azure SQL may or may not be data warehouses. Ditto with RDS: the applications built on RDS may or may not be data warehouses.

But in terms of “big engines”, yes, Snowflake is the most popular platform (with Databricks possibly following closely). Again, note that we don’t know whether the systems built on Databricks are data lakes, data marts/warehouses, or both.

Note that it has to be a DW, not a data lake. So if what you built on Databricks is a data lake, then it’s not a Cloud DW.

WHAT IS A DATA WAREHOUSE?

So what is a data warehouse, what is a data lake and what is a data lakehouse?
You can read about it here:

  • Definition of Data Lake and Data Warehouse: link
  • What is a Data Lake: link
  • What is a Data Lakehouse: link
  • Data Warehouse on Data Lake: link
  • Modern Data Warehouse: link

Summarising them:
A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It is queried for reporting, BI or analytics. A data warehouse has a schema and has data integrity.

A Data Lake is a storage and analytics system which stores structured and unstructured data from all source systems in the company in its raw form. The data is used for operational reporting or to feed AI, or, as is often the case, to feed a data warehouse/mart. A data lake does not have a schema and does not have data integrity.

A Data Lakehouse is a data lake which has data warehouse features such as a star schema, transactions and data integrity. It also has analytics tools, data governance, scalability and data ingestion, and can do “time travel”.

So I repeat, a Cloud DW must be a data warehouse, not a data lake or data lakehouse.

ADVANTAGES OF CLOUD DW

Why do we need to create a Cloud DW? What are the advantages?

  1. The main advantage is the cost.
    A Cloud DW takes a huge cost benefit from being in the cloud. It’s “pay by the hour”, with no big upfront cost like an on-premise DW. That alone is already a big enough reason to create a Cloud DW.
  2. To keep up with technology.
    If today your company is still on premise, then there is a big risk that your technology staff will leave you to work at a company with cloud infrastructure. Especially the younger ones, the ones with many years left in their careers.
  3. Infrastructure as Code.
    Another advantage of being in the cloud is IaC, infrastructure as code. It means that deployments of updates to the servers are scripted, thus much more automated and requiring fewer hours.
  4. More connected.
    With your DW being in the cloud it becomes more connected with data providers. For example, APIs from data providers (which are in their cloud) can be accessed directly from your cloud.
    It is more connected with upstream systems too, which are in the cloud. For example, reading from a data lake or an OMS (order management system).
    It is more connected to downstream consumers too, which are in the cloud. For example: cloud data marts, or Microsoft 365 (e.g. the API on Excel).
    Everybody is in the cloud now, so if you are still on premise, you are making your life more difficult and more isolated.

DISADVANTAGES OF CLOUD DW

There are always 2 sides to the same coin. With all those advantages above, what are the disadvantages? The main disadvantage is that a data warehouse is a monolithic construct, meaning that it is a big structure which takes a long time to build. And because it’s big, once it’s built it is difficult to change.

It takes a long time to build because you need to integrate the data. A data lake, on the other hand, is a lot quicker to build, because you don’t need to integrate the data.

Monolithic is a disadvantage of a data warehouse. Not because it’s a cloud DW, but all DWs have this disadvantage, cloud or not cloud.

DATA LAKE + DATA MARTS

The solution to this monolith issue is to not build a data warehouse, but a data lake + data marts. Or, alternatively, to build a data lakehouse. We need to remember that not everyone needs to do BI/analytics. Some users just want the raw data, for example: quants, data scientists, data analysts, AI users, etc. There are more users wanting the raw data than users doing analytics.

Based on what I see in projects and from talking to people in the DW industry, I’d say that 90% of users need raw data and 30% need analytics (with an overlap of 20%). In other words, almost all users need the raw data, and 2/3 of the analytics users also need the raw data.

We still use all the cloud platforms above, i.e. the likes of Azure SQL, RDS, Databricks and Snowflake, but we don’t build a DW on them. We build marts instead, with the data lake residing on ADLS or S3 and data APIs built on top. These data APIs enable users to access the lake data directly, either from a web browser or from Excel. The user specifies an ID (like a product code, client code, ISIN, etc.) and a date range, and the API serves the data onto the browser or spreadsheet.
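A hypothetical example of calling such a data API from Python (the endpoint, parameters and identifier are illustrative):

import requests

# The data API sits on top of the lake (ADLS or S3) and returns raw data
# for a given identifier and date range.
response = requests.get(
    "https://data-api.example.com/prices",
    params={"isin": "GB0000000001", "from": "2023-01-01", "to": "2023-06-30"},
    timeout=30,
)
rows = response.json()  # the same payload could be pulled into Excel or a browser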

DATA LAKEHOUSE

An alternative to building Data Lake + Data Marts is to build a data lakehouse.

A lakehouse is typically built on Databricks. A Data Lakehouse is a data lake which has data warehouse features such as star schema, ACID transactions, data integrity, time travel and metadata handling. One example of a lakehouse is Delta Lake on Databricks (link). Another example is the Lakehouse on Snowflake: link.
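As an illustration of the time travel feature, here is a minimal PySpark sketch against a Delta table on Databricks (the path, table name, version and date are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read an earlier version of a Delta table by version number.
old_df = (spark.read.format("delta")
          .option("versionAsOf", 3)
          .load("/delta/trades"))

# Or read the table as it was at a point in time, using SQL.
as_of_df = spark.sql("SELECT * FROM trades TIMESTAMP AS OF '2023-06-25'")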

Another alternative to building Data Lake + Data Marts is to build Data Mesh. But that is a topic for other time.

I hope this was useful. The big takeaway is: don’t build a monolithic structure like a data warehouse, but a data lake + data marts. Each mart serves one specific purpose/project. This is far more cost effective than spending 2 years building a DW, because most users require raw data, and only a few require analytics.

p.s. note that 5 years ago (mid 2018) the data needs for ML were not as big as they are today (mid 2023). So in some cases it was still worth building a data warehouse, because the analytics needs were still high. That is not the case any more today.
