I use Google Colab to do machine learning. The source data (e.g. from a data lake or warehouse) is prepared as a Pandas dataframe: I query the data lake/warehouse from Python using SQL, and store the result in the memory of Colab’s session as a Pandas dataframe. There is a lot of data preparation and processing before the raw data is ready to be fed into the ML models: EDA, correlation analysis, data cleansing, dealing with missing data, outliers, anomalies, incorrect formats and inconsistent values, just to name a few. Then we need to do PCA, feature engineering, train/test splitting, cross validation, etc. After all that is done, I persist the output as pickle files, ready to be fed into the ML models.
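A minimal sketch of that preparation stage, using a small synthetic DataFrame in place of the warehouse query (the column names and cleaning rules are made up for illustration):

```python
import numpy as np
import pandas as pd

# In practice this DataFrame would come from the lake/warehouse,
# e.g. via pd.read_sql(query, connection); here it is synthetic.
raw = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB", "BBB"],
    "close":  [10.0, np.nan, 20.0, 21.0, 500.0],  # one missing, one outlier
    "volume": [1000, 1100, np.nan, 1200, 1300],
})

# Basic cleansing: fill missing prices forward per ticker,
# fill missing volumes with the median, clip extreme outliers.
clean = raw.copy()
clean["close"] = clean.groupby("ticker")["close"].ffill()
clean["volume"] = clean["volume"].fillna(clean["volume"].median())
clean["close"] = clean["close"].clip(upper=clean["close"].quantile(0.95))

# Persist the prepared data for the modelling stage.
clean.to_pickle("prepared_data.pkl")
```

The real pipeline would of course have many more steps (feature engineering, PCA, splitting), but the shape is the same: raw dataframe in, cleaned dataframe out, pickled for the models.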
Of course we have many ML models, each with different hyperparameters, and we need to run those models with various hyperparameter combinations. Say in this project I’m doing stock price forecasting using various LSTM models. I would have to pick a performance measure to evaluate the results, one suitable for forecasting, say MAPE or RMSE. Using that measure I would finally find the best model (i.e. the model with the lowest RMSE, since for error measures lower is better), say an LSTM model with 2 dense layers of 1000 nodes and 1 dropout layer with a rate of 0.2. I then use that best model to forecast against the production data.
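A quick sketch of that model selection step with NumPy; the candidate models here are just hypothetical names with pre-computed forecasts, standing in for the real LSTM variants:

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error: lower is better."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.sqrt(np.mean((actual - forecast) ** 2)))

def mape(actual, forecast):
    """Mean absolute percentage error (in %): lower is better."""
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

actual = [100, 102, 105, 103]
candidates = {  # hypothetical LSTM variants with their forecasts
    "lstm_2dense_dropout0.2": [101, 103, 104, 103],
    "lstm_1dense":            [ 98, 100, 110, 100],
}

scores = {name: rmse(actual, fc) for name, fc in candidates.items()}
best = min(scores, key=scores.get)  # lowest RMSE wins
print(best, scores[best])
```

Swapping `rmse` for `mape` in the dictionary comprehension is all it takes to select on MAPE instead.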
Deploying that best model into production and using it in earnest against production data is a different kettle of fish. First, the data processing: the Python SQL code would need to be able to query the production data lake, and the model would need to be able to write back into that lake (say in Google Cloud Storage or AWS S3), so both need to be permissioned. The model needs to use the latest data, so the data in the lake would need to be refreshed every day. The forecast results (such as the next 3 months’ forecast for every stock in the portfolio plus the benchmark) need to be stored in a place where the analytics/BI tool can read them (so again, permissioning is required). And finally we will need to employ some sort of orchestrator to manage this process, such as Control-M, ActiveBatch or Redwood. But then again the Control-M agent would need to be permissioned in order to run those processes.
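The daily job that the orchestrator triggers can be sketched as a single function. Everything here is a placeholder: the paths, the column names, and the stub model standing in for the trained LSTM; in production the reads and writes would go against the lake (e.g. `s3://` or `gs://` URIs the job is permissioned for).

```python
import pandas as pd

def daily_forecast_job(model, input_path: str, output_path: str) -> None:
    """Load the latest refreshed data, forecast, and write the results
    where the BI tool can read them. Paths are placeholders; in
    production they would be lake URIs the job is permissioned for."""
    latest = pd.read_csv(input_path)
    latest["forecast"] = model.predict(latest[["close"]])
    latest.to_csv(output_path, index=False)

class NaiveModel:
    """Stub standing in for the trained LSTM."""
    def predict(self, X):
        return X["close"].to_numpy()  # naive forecast: tomorrow = today

# An orchestrator (Control-M, ActiveBatch, Redwood) would schedule this
# call daily, after the lake refresh completes:
# daily_forecast_job(NaiveModel(), "latest_prices.csv", "forecasts.csv")
```

The permissioning work described above is exactly what makes those two path arguments usable in production.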
Imagine for a second that you’ve battled for 6 months and all of the above is done. And now the Dev team is saying that they have developed version 2, using a CNN algorithm. This new model is better, faster and has more features, but it requires a little bit of new data. So a new ingestion pipeline will need to be built, and new permissioning will need to be done in order for that new model to work in production. In the end you will need to create some sort of release pipeline: everything that is released (code, parameters, etc.) is packaged together, tested together and approved as one bundle. All this is called “productionalisation”. And ML is no different from any other software or warehouse: it needs productionalisation.
Microsoft Fabric
As Nellie Gustafsson said on the Microsoft Fabric blog (link), we can use Fabric to do ML end-to-end (she called it data science): data sharing, data preparation, data exploration, developing models, running models/experiments, deploying models to production, and creating data visualisations.
2 days ago, in an event hosted by Leon Gordon and Pragati Jain, Nellie explained the above diagram. You can see the recording on YouTube: link. Nellie was a Senior Program Manager in the SQL Server product team. She worked on SQL Server 2016 with R Services, ML Services in SQL Server 2017 and Big Data Clusters in SQL Server 2019. But now she is the product management lead for Data Science & AI in Synapse and Microsoft Fabric. It is quite reassuring, I can tell you, to hear her say something like “We don’t have that feature yet, but we will be working on it in the coming months”. She is the product management lead for Microsoft Fabric after all, so of course she knows what her team is working on.
I don’t want to repeat what she already said (please watch the above YouTube video) but there are a few things in her diagram above that I’d like to point out. The first one is Data Wrangler, which is a new tool for data preparation and processing. It’s a Notebook-based tool where we can drop missing values, drop duplicate rows, fill missing values, find and replace, view summary statistics, split text, strip whitespace, group by column, do one-hot encoding, scale values to min/max, sort and do calculations, all in a visual way. For more info see here: link.
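Those visual operations roughly correspond to familiar pandas calls (Data Wrangler shows you the generated code as you go). A rough sketch of a few of them, on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "name":   [" alice ", "bob", "bob", None],
    "city":   ["NY", "SF", "SF", "NY"],
    "amount": [10.0, 20.0, 20.0, None],
})

df = df.drop_duplicates()                                # drop duplicate rows
df["amount"] = df["amount"].fillna(df["amount"].mean())  # fill missing values
df = df.dropna(subset=["name"])                          # drop missing values
df["name"] = df["name"].str.strip()                      # strip whitespace
df = pd.get_dummies(df, columns=["city"])                # one-hot encoding
df["amount_scaled"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)                                                        # scale to min/max
```

The value of the tool is that it builds exactly this kind of script for you, step by step, while showing the effect of each step visually.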
p.s. Data Wrangler is also available in Visual Studio Code, as an extension: link.
The second one is Batch Predict. Rather than predicting one observation at a time, we can now predict many observations in one go. This feature has been in AWS (link) and GCP (link) for a while, and now it’s in Fabric.
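The idea behind batch prediction is simply that the model scores a whole table of observations in one vectorised call rather than one row at a time. A toy sketch, with a hand-rolled linear model standing in for a real trained one:

```python
import numpy as np

# A toy "trained" linear model: y = X @ w + b.
w = np.array([0.5, 2.0])
b = 1.0

def predict_one(x):
    """Score a single observation."""
    return float(x @ w + b)

def predict_batch(X):
    """Score every observation in one vectorised call."""
    return X @ w + b

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

looped  = np.array([predict_one(row) for row in X])
batched = predict_batch(X)
assert np.allclose(looped, batched)  # same answers, one call instead of three
```

The batched form is what the platform runs for you at scale, which matters when "three rows" becomes millions.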
Third, the Direct Lake mode. The ML model can write the forecasted/predicted values into the Lakehouse tables in Fabric’s OneLake, so that Power BI can pick them up.
Fourth, MLflow enables us to compare models/experiments and manage the models (version control, code storage, etc.). We can also package an ML model and share it with other data scientists. For more info see the full documentation here: link.
That’s it. See her YouTube presentation (here’s the link again: link; thanks to Leon Gordon and Pragati Jain for hosting it) and start using Fabric.