Data Warehousing and Machine Learning

27 November 2021

Using Temporal Tables in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 2:38 pm

Can we use temporal tables in data warehousing? Yes for staging area (persistent). Yes for slowly changing dimension, but not recommended. No for standard types fact tables, but yes for particular cases.

What are temporal tables?

Temporal tables keep all the historical changes, i.e. deleted rows and previous values before updates. Here is the documentation: link.

We can query a temporal table to get the latest version or the old versions:

Using Temporal Tables for Staging Area

We can use a temporal table for a staging area. There are 2 types of staging area: transient, which is the one you truncate before every load, and persistent, which you don’t truncate / delete (you keep all the historical data here).

Please read Hans Michials experience here: link (on using temporal table for staging area). The conclusion was: a temporal tables are a lot faster than conventional persistent staging tables. And temporal tables are able to adopt changes in the data model too.

And, out of the box, temporal tables enable us to do historical load into dimensions. And it is simple and reliable.

Using Temporal Tables for Slowly Changing Dimensions

Please read what Microsoft wrote in August 2021 first: link (on using temporal tables for slowly changing dimensions). Then please read what Tim Michell wrote in April 2019: link (scroll down to this part: using temporal tables for slowly changing dimensions). The conclusion: temporal tables is good for slowly changing dimensions. It simplifies the load process. It simplifies changing a dimension from type 1 to type 2.

But it is a bit tricky to implement temporal tables for SCD type 2. The join to the fact table can’t be on surrogate key, but on natural key plus the effective dates. That is not a good, both for performance and conceptually, thanks to Gary who pointed out at the end of Tim’s article.

And Gary also pointed out one important fact: in temporal tables the temporal times can overlap. In SCD type 2, the effective dates must not overlap (because otherwise we won’t know which row was effective at certain datetime.

And the killer point is (again, also from Gary) in SCD type 2 dimension, I’m usually very selective about which attribute we want to set as type 2. If there is no good reason for that attribute to be type 2, it stays as type 1 (including corrections). But in the temporal tables, changes on every column result in new rows. We can’t be selective and choose only 5 columns out of 50, for example.

Because of those reasons, I would not recommend using temporal table for SCD type 2 dimension.

Using temporal tables for fact tables

First please read Paul Te Braak article on using temporal tables for fact tables: link. A SCD type 2 dimension captures which attributes were active for each fact row, whereas a temporal table doesn’t do that.

Let’s have a look at the 2 transaction above from James Hood buying oil filter for his TD5 Discovery II Land Rover car. One was last year (Sep 2020) and was this week (Nov 2021). The attributes of the oil filter are different between last year and this year. The customer attributes are also different. That is what SCD type 2 dimension captures. They captures which attributes were active for each fact row. Whereas temporal tables don’t do that.

Second, please read Martin Schoombee article on how he implemented temporal tables for his retail pricing tool case: link. In this specific case, to me it makes sense to use temporal fact table. Martin’s case is similar to Davide Mauri’s insurance document case that he presented at SQLBits X: link. In this case it also makes sense to use temporal tables for the fact table.

But, temporal tables are not suitable for replacing the 3 Kimball fact table types i.e. periodic snapshot fact table, accumulative snapshot fact table or transaction fact table. In the periodic snapshot fact table, the timing for snapshotting is at particular dates, for example end of month or beginning of month. We control the timing, but in temporal tables we don’t have any control on the snapshot date. Because of all the reasons above we can’t use temporal tables for all 3 Kimball types of the fact tables. But your can use it if your case is like Martin’s or Davide’s cases above.

26 November 2021

Automating Machine Learning using Azure ML

Filed under: Data Warehousing — Vincent Rainardi @ 8:06 am

I have been using Google Colab (see my article here) and Jupyter to build and train various machine learning models including LR, KNN, PCA, SVM, XGBoost, RF, NLP, ANN, CNN, RNN, RL. I have been using Azure (Databricks, Data Lake, Data Factory, SQL, etc.) so I’m intrique to try Azure Machine Learning to see if it is as good as Colab.

The first thing I notice in Azure ML is Automated ML, which enables us to train an ML model without doing any coding. We specify the data, and Azure ML will try various algorithms, build various models, and evaluate them according to the criteria that we give.

This sounds too good to be true, but entirely possible. One of my ML projects is about credit card transactions. In that project I used 6 algorithms (LR, KNN, SVM, DT, RF, XGB) and each model has many hyperparameter. Each of these hyperparameters have many values to try. So to find the best model I had to do are a lot of hyperparameter tuning using GridSearch cross validation on training data. Once I found the best parameter for a model, I had to evaluate the performance on the test data using Area Under ROC Curve, or AUC. Then I had to select the best model based on that evaluation. And on top of that I need to find out the top features. Can all this be automated in Azure ML? Sounds too good to be true, but entirely possible.

First, I loaded the data using a procedure similar to this demo: link. Set the evaluation metric to AUC, set the train-test split using K-fold cross validation with K=3, set the ML algorithm to auto, set explain the best model = True, and set maximum concurrent session to 5. For the compute node use DS12 V2 with 4 CPUs, 28 GB memory and 200 GB SSD space (16×500 IOPS)

The top 10 models came out like this:

I expected XG Boost classifier (XGB) to be the top model and it is (I didn’t enable Neural Network in the AutoML). The top XGB model is using SparseNormalizer, which is expected because the data is skewed on many features. 2m 24s training time on 30k observations/examples on 4 CPUs/28 GB is not quick.

The eta (learning rate) is the step size shrinkage used in update to prevents overfitting. In this case it is 0.4 (the default is 0.3, range is from 0 to 1, see link, link). Gamma is the minimum loss reduction required to make a further partition on a leaf node of the tree, ranging from 0 to infinity (default is 0). It is a regularisation measure and in this case it is conversative (the larger the gamma the more conserative the model is). The maximum depth is 10. For comparison when I tuned by XGB model for credit card fraud data, the eta was 0.2, the gamma was 0 and the max depth was 6.

We can see it in more details by clicking the Algorithm Name, then click View Hyperparameters:

We can see the top influencing features like this:

F is the feature number, such as account age, location, or customer behaviour.

We also get a chart of the top feature against the probability of the predicted variable, like this: (I would prefer charting the top and second top features on the x and y axis but as this is out of the box it looks good and useful)

And we get the Precision-Recall chart out of the box too (you can choose which version of AUC to use, i.e. weighted, macro or micro:

The ROC is True Positive Rate (TPR) on the Y axis against False Positive Rate (FPR) on the X axis, so the above is not an ROC curve. But it gives us a good sense on how we can maximise recall or precision.

We want to recall to be as large as possible, and precision to be as large as possible but the AUC line limit them so it will always be a trade off between them. For example if you take the Weighted Average AUC line, the maximum of (recall – precision) might be point A. But in the case of credit card fraud you would want high recall, so we would choose point B instead of point C which is for high precision.

And AutoML in Azure ML also gives us the data transformation, such below:

We can see above that during the preprocessing of the data, for numerical features AutoML uses MeanInputer to mitigate missing values, whereas for categorical features CharGram count vectoriser and ModeCatInputter label encoder. Then it uses maximum absolute scaler before feeding the preprocessed data to the XGBoost model.

Overall I found that AutoML is useful. It tried various algorithms including Random Forest, Logistic Regression and XG Boost. Over 60 models it tried, in under 2 hours! The test AUC is 94.8% which is a good result for this data. And it gives us features importance as well. It tried various values of hyperparameters for each model, and chose the best values for us, automatically. Very, very easy to use. Welldone Microsoft! Of course, once we get the top models AutoML, then we can tune it further ourselves to get higher AUC. It is finding the top models which is very time consuming (it took me a week, but with AutoML it only took 2 hours).

25 November 2021

How to download files from Azure storage account using Control-M

Filed under: Data Warehousing — Vincent Rainardi @ 8:46 am

Control-M has a dedicated Azure Blob Storage job: (link)

On the Action dropdown we can select Upload, Download or List. Upload means uploading a file from an on-prem folder to an Azure container, whilst Download means the reverse. List means getting the name of the files in an Azure container.

We can also copy a file from a container to another container, delete a file, create and delete a container. Just need to specify the appropriate action on the Action dropdown list.

Note that we must install Azure CLI 2.0 and include the Azure authentication parameters in the connection profile. Or authenticate the CLI.

Upload files to FTP server

Once the files are downloaded to a local folder, we can push it to an FTP server using Manage File Transfer job (MFT, link, link). Note that FiIle Transfer Server must be configured first.

24 November 2021

Azure Big Data Analytics

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I’m not going to dwell on “what is big data”. You can read the definition here: link. It is basically data that doesn’t suit well in a database. So we put them as files in Hadoop or Data Lake.

In this article I would like to write specifically about Big Data Analytics, i.e. how they are processed / analysed. Specifically, what tools in Azure that we can use to analyse big data.

Databricks and HDInsight

The most popular one today is to store big data in Azure Data Lake (ADLS2), create a Spark cluster on top of it, and do the analysis using Azure Databricks Notebooks. You can use SQL Python, R or Scala to query and analyse the data. Here is the architecture (link):

The older method is to use Azure HDInsight. So we create a HDInsight Spark cluster on ADLS2 storage, and put the data in it. We then use Jupyter Notebooks to query and analyse the data, using either PySpark, SQL or Scala. We can also use HBase to process NoSQL data (schemaless) and use LLAP to query Hive tables interactively. Here is the architecture (link):

Stream Analytics

One of the things in big data is data stream such as stock market data, social media feeds, web logs, traffic data, weather data and IoT data (sensors, RFIDs).

The most popular method is to use Azure Stream Analytics (ASA) to analyse real time data stream. ASA can use data from Azure Event Hubs, Azure IoT Hub or from Azure Blob Storage. ASA query is based on T-SQL language (link), which we can use to filter or aggregate the data stream over time. Here is the architecture (link):

If we use HDInsight, the older method is to use Kafka to build real time streaming data pipelines and application. We can also use Storm to do real time event processing. Here is the architecture (link):

Azure Synapse

The alternative to Databricks and HDInsight is to use Azure Synapse Analytics (link). We put the data in ADLS2 and use Azure Synapse Analytics to analyse the data using SQL or Spark. We can also use Data Explorer for time series data. In Synapse Studio we can create pipelines (link) to process data (similar to ADF but different: link). Here is the architecture: (link)

Machine Learning

These days, the main analytics are not business intelligence or reporting. Not even stream analytics. It is machine learning. Machine learning may not be the most widely used analytics, but it is certainly the most powerful analytics, i.e. in terms of prediction capability, understanding the most influential factors, etc.

Whether your data is in Databricks, HDInsight or Synapse you can use Azure Machine Learning (AML, link). We can use AML to create ML models for prediction or regression whether using a Spark MLib notebook (link), using a Python notebook (link), or without coding (link). Here is the architecture: (link)

As somebody who has developed many ML models, I can tell you that there is a big gap in terms of DevOps. Azure Machine Learning has a complete DevOps, from development to deployment and operation support. This takes off a big headache if you are a development manager trying to do machine learning in your company.

Data Visualisation

Of course, no discussion about analytics is complete without mentioning data visualisation, i.e. BI and reporting. There is only one tool in the Microsoft toolbox: Power BI. Whether your data is in a data lake, a data warehouse, data marts, CSV files, Excel or API (or any other form), the Microsoft way is Power BI. No they don’t promote Python visualisation such as Seaborn. In fact, you can do Python visuals within Power BI: link. Here is the architecture: (link)

22 November 2021

Tuning XGBoost Models

Filed under: Machine Learning — Vincent Rainardi @ 7:15 am

I was tuning fraudulent credit card transaction data from Kaggle (link) and found that for classifier, XGBoost provides the highest AUC compared to other algorithms (99.18%). It is a little tricky to tune though, so in this article I’d like to share my experience in tuning it.

What is XGBoost?

XGBoost stands for Extreme Gradient Boosting. So before you read about XGBoost, you need to understand first what is Gradient Boosting, and what is Boosting. Here are good introductions to this topic: link, link. The basis algorithm for XGBoost is Decision Tree. Then many trees are used together in a technique called Ensemble (for example Random Forest). So a complete journey to understanding XGboost from the ground up is:

  1. Decision Tree
  2. Ensemble
  3. Stacking, Bagging, Boosting (link)
  4. Random Forest
  5. Gradient Boosting
  6. Extreme Gradient Boosting

Higgs Boson

The original paper by Tianqi Chen and Carlos Guestrin who created XGBoost is here: link.
XGBoost was used to solve Higgs Boson classification problem, again by Tianqi Chen, and Tong He: link. Higgs Boson is the last elementary particle discovered. It was discovered in 2012 at the Large Hadron Collider at CERN. The particle was predicted by Peter Higgs in 1964.

Reference

A good reference for tuning XGBoost model is a guide from Prashant Banerjee: link (search for “typical value”). Another good one is from Aarshay Jain: link (again, search for “typical value”). The guide from the developers: link and the list of hyperparameters are here: link.

Python Code

Here’s the code in its entirety:

# Import required libraries
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Load the data from Google drive
from google.colab import drive
drive.mount('/content/gdrive')
df = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/creditcard.csv')

# Drop time column as fraudulent transactions can happen at any time
df = df.drop("Time", axis = 1)

# Get the class variable and put into y and the rest into X
y = df["Class"]
X = df.drop("Class", axis = 1)

# Stratified split into train & test data
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split( X, y, test_size = 0.2, stratify = y, random_state = 42 )

# Fix data skewness
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(copy=False)
train_return = pt.fit_transform(X_train)
test_return  = pt.fit_transform(X_test)

# Balance the train and test data using SMOTE
from imblearn.over_sampling import SMOTE
SMOTE = SMOTE(random_state=42)
X_smote_train, y_smote_train = SMOTE.fit_resample(X_train, y_train)
X_smote_test, y_smote_test = SMOTE.fit_resample(X_test, y_test)

# Sample training data for tuning models (use full training data for final run)
tuning_sample = 20000
idx = np.random.choice(len(X_smote_train), size=tuning_sample)
X_smote_tuning = X_smote_train.iloc[idx]
y_smote_tuning = y_smote_train.iloc[idx]

# Import libraries from Scikit Learn
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Create a function to calculate AUC using predict proba
def Get_AUC(Model, X, y):
    prob = Model.predict_proba(X)[:, 1] 
    return roc_auc_score(y, prob) * 100

# Perform grid search cross validation with different parameters
parameters = {'n_estimators':[90], 'max_depth':[6], 'learning_rate':[0.2], 
              'subsample':[0.5], 'colsample_bytree':[0.3], 'min_child_width': [1],
              'gamma':[0], 'alpha':[0.001], 'reg_lambda':[0.001]}
XGB = XGBClassifier()
CV = GridSearchCV(XGB, parameters, cv=3, scoring='roc_auc', n_jobs=-1)

# Hyperparameter tuning to find the best parameters
CV.fit(X_smote_tuning, y_smote_tuning)
print("The best parameters are:", CV.best_params_)

Output: The best parameters are: {'alpha': 0.001, 'colsample_bytree': 0.3, 'gamma': 0, 'learning_rate': 0.2, 'max_depth': 6, 'min_child_width': 1, 'n_estimators': 90, 'reg_lambda': 0.001, 'subsample': 0.5}

# Fit the model with the best parameters and get the AUC
XGB = XGBClassifier(n_estimators = CV.best_params_["n_estimators"], max_depth = CV.best_params_["max_depth"], 
                    learning_rate = CV.best_params_["learning_rate"], colsample_bytree = CV.best_params_["colsample_bytree"], 
                    subsample = CV.best_params_["subsample"], min_child_width = CV.best_params_["min_child_width"],
                    gamma = CV.best_params_["gamma"], alpha = CV.best_params_["alpha"], 
                    reg_lambda = CV.best_params_["reg_lambda"])
Model = XGB.fit(X_smote_train, y_smote_train)
AUC = Get_AUC(Model, X_smote_test, y_smote_test)
print("AUC =", '{:.2f}%'.format(AUC))

Output: 99.18

Tuning Process

So here is the tuning process that I did for XG Boost model, for the above data, using the above code.

Step 1. Broad ranges on the top 3 parameters

First, I read the expected values for the parameters from the guides (see the Reference section above).
Note: In this article when I say parameters I mean hyperparameters.

Then, using the Grid Search cross validation I set the parameters in very broad ranges as follows:

  • n_estimators: 10, 100, 500
  • max_depth: 3, 10, 30
  • learning_rate: 0.01, 0.1, 1

I used only 20k data out of 284,807 transactions so the cross validation process didn’t take hours but only minutes. I tried with 10k, 20k, 50k samples and found that 10k results didn’t represent the whole training data (284k), 50k and above were very slow, but 20k is fast enough and yet it is representative.

I would recommend trying only 3 values for each parameter and only the 3 parameters above to begin with. This way it would take 10 minutes. These 3 parameters are the most influencing factors, we need to nail them down first. They are mentioned in the Reference section above.

Step 2. Narrow down the top 3 parameters

I then narrow down the range of these 3 parameters. For example, for n_estimators out of 10, 100, 500, the Grid Search shows that the best value was 100. So I changed the grid search with 80, 100, 120. Still getting 100 as the best parameter so I did a grid search with 90, 100, 110 and got 90. Finally I did the grid search with 85, 90, 95 and it still gave out 90 as the best n_estimators so that was my final value for this parameter.

But I understood there was interaction between the parameter so when tuning n_estimators I included the max_depth of 3, 10, 30 and learning_rate of 0.01, 0.1, 1. And when the n_estimator was settled at 90, I started narrowing down the max_depth (which was giving out 10) to 7, 10, 14. The result was 7 so I narrowed it down to 6, 7, 8. The result was 6 and that was the final value for this max_depth.

For the learning_rate I started with 0.01, 0.1, 1 and the best was 0.1. Then 0.05, 0.1, 0.2 and the best was 0.2. Tried 0.15, 0.2, 0.25 and the best was 0.2 so that was the final value for the learning_rate.

So the top 3 parameters are: n_estimators = 90, max_depth = 6, learning_rate = 0.2. The max_depth = 6 was the same as the default value, so I could have not used this parameter if I wanted to.

Note:

Note that I didn’t put all the possible ranges/values for all 3 parameters into a grid search CV and let it run for the whole night. It’s all manual and I nailed down the parameters one by one, which only took about an hour. Manual is a lot quicker because from the prevous run I knew the optimum range of parameters, so I could narrow it down further. It’s a very controlled and targetted process, that’s why it’s quick.

Also note that I used only 20k data for tuning, but for getting AUC I fit the full training data and predicted using the full test data.

Step 3. The next 3 parameters

With the top 3 parameters fixed, I tried the next 3 parameters as follows:

  • colsample_bytree: 0.1, 0.5, 0.9
  • subsample: 0.1, 0.5, 0.9
  • min_child_width: 1, 5, 10

I picked these 3 parameters were based on the guidelines given by the XGBoost developers and the blog posts which are in the Reference section above.

The results are as follows: the optimum parameters = colsample_bytree = 0.3, subsample = 0.5, min_child_width = 1. This gives an AUC of 98.69%.

For the explanation about what these parameters are, please refer to the XGBoost documentation here: link.

It is possible that the AUC is lower than the AUC from the previous step. In this case I tried the values for that parameters manually using the full training data. For example, with tuning data (20k) the best min_child_width was 0 but this gives AUC of 98.09% which was lower than the previous AUC value before using min_child_width (98.69%). So I tried 0, 1 and 2 values of min_child_width using the full training data. In other words, the tuning data (20k) is good for narrowing down from the broad range to narrow range, but when it’s narrow range we might need to use the full training data. To do this I replaced the “XGB = … “ in last cell with this:

XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2, 
                    colsample_bytree = 0.3, subsample = 0.5, min_child_width = 1)

Step 4. Three regularisation parameters

Reading from the guides from the Reference section above, it seems that the next 3 most important parameters are gamma, alpha and lambda. They are the regularisation parameters and their value ranges are in the XGBoost documentation (link).

  • gamma: 0, 1, 10. Optimum value: 0.
  • alpha: 0, 10, 1000. Optimum value: 0.001
  • reg_lambda: 0.1, 0.5, 0.9. Optimum value: 0.001

After tuning with these 3 paramters, the AUC increased to 99.18%.

I confirmed the result by replacing the XGB = … in the last cell with this:

XGB = XGBClassifier(n_estimators = 90, max_depth = 6, learning_rate = 0.2, 
                    colsample_bytree = 0.3, subsample = 0.5, min_child_width = 1,
                    gamma = 0, alpha = 0.001, reg_lambda = 0.001)

Note on the imbalanced data

XGBoost has 2 parameters to deal with imbalanced data: scale_pos_weight and max_delta_step. You can read how to implement them in the XG Boost documentation: link.

I did use them, trying the scale_pos_weight values of 1, 10,100 and the optimum value was 10, but it only gave AUC of 96.83%.

So I tried different approaches for handling imbalance data, i.e. random oversampling, SMOTE and ADASYN. SMOTE gave the best result, i.e. the AUC of 99.18% above.

Note on the data skewness

The credit card fraud data is skewed, meaning it is not distributed normally. This is particularly so with the amount feature, which is distributed differently between the left of the mean and the right of the mean. A few other features such as V3 are also like that.

I used PowerTransformer to fix the data skewness, as you can see in the above code. I fixed the skewness separately between the training data and test data. So I split the data first, and then fix the skewness. This is better than fixing the skewness first because when afterwards the data is split, then it would become skewed.

Note on the stratified sampling

Because the data is very imbalanced, I use stratified sampling so that the ratio between the 2 classes are kept the same between the training data and the test data. I use 80-20% split rather than 70-30% split to give the model more data to learn, and because 20% is one fifth which is large enough unseen data to test the trained model against.

I don’t believe 10% test data is fair enough testing, in my opinion 20% is the minimum we should not go lower than that, not even 15%. I verified this in Kaggle i.e. that most practices in Kaggle are using test data of 20%, 25% or 30%. I didn’t see any one uses test data lower than 20% or higher than 30%.

13 November 2021

Google Colab

Filed under: Machine Learning — Vincent Rainardi @ 9:05 am

I started using Google Colab about 6 months ago, as my laptop is 1 core 4 GB. I’m now on Colab Pro (£9.72 a month) which is fantastic. So in this article I’d like to share my experience doing machine learning on Colab.

The Look and Feel

I love the table of contents on the left panel! Jupyter Notebook on local install don’t have it. Well they do (link) but it’s so complicated and so manual. On Google Colab it is available out of the box, without us doing anything (it detects the headings in the markdown, i.e. #, ##, ###, etc.) We use it to jump to different sections in our notebook:

Google Colab also supports dark theme as you can see above. On the left most column (marked A) we have search, code snippet, variables and files. They are very useful when doing development. Psst, if you click the Colab icon on the top left (B), you’ll see all your notebooks in Google Drive!).

On the bottom left (C) there are 2 very useful icons. The first one is Command Pallette where every single Jupyter command and shortcut is listed! The second one is the Terminal, where you can run shell commands (only on Colab Pro).

Google Drive

When starting on Colab almost everyone asked “How do I upload files” and spent time searching.

There you go! Saves you from searching 🙂

Tip: don’t use GPU when you connect to Google Drive for the first time. You could get “403 error daily limit exceeded” (link). Turn off GPU on the Notebook Settings (see below) and you’ll be ok.

Performance

The main reason we use Colab is because it’s free and it’s fast. Yes it’s free. And yes you get GPU and TPU! In my case it’s 10-20x faster than my laptop. So why pay £9.72/month for Pro? (including VAT) Because it disconnects after 1.5 hours, that’s why. To prevent that we need to keep typing! With Pro we also get more power and RAM.

When doing neural network (RNN, CNN) or RL, Colab is an absolute godsend. I can’t do those NN or RL models on my laptop (well it’s 6 years old, no wonder 🙂 Colab is very good with anything to do with Keras. Because of its TPU. Well, with scikit learn too.

Tip: unless you are doing network layers, using TPU doesn’t mean it’s faster than GPU or None. I was using classical models (Logistic Rregression, Random Forest, XG Boost, etc) and found that TPU or GPU doesn’t make it faster. And there were moments when I could not connect using TPU, but with  None I could connect.

Also, most often I don’t need to use the High RAM setting. Only once I ran out of memory, that was when processing a lot of images (augmentation).

Executing Code

After executing code, you can tell how long each cell runs by hovering over the play button (or the green check box). Below you can see “29 seconds” under the green checkbox so why bother hovering? Because once you edit the cell the 29s is gone! But if you hover over you can still see it 🙂

A lot of times when I’m running code it fails. After fixing that particular cell, and testing it, I used to run it from the top again (using “Run before”). Now once I fixed that cell, I use “Run after” to execute the rest of the code, because if you are like me, there will be other cells failing down there 🙂

Sometimes a model runs for a long time and when you click Stop (the rectangle on the left of the cell) it doesn’t stop. In that case I use “Restart runtime” (the last blue arrow above). It’s an absolute godsend! Whatever the code is doing, it will stop. Psst, you will also see this “Restart runtime” if you keep clicking the Stop button on the left of the cell 3 times.

Settings

On the top right of the screen, next to your initial, you can see a wheel icon. This is for Settings.

The dark theme is set here, along with the font, etc. Ever wonder why the tab is 2 spaces not 4 spaces? Well you can set it here. Do you want to see line numbers? Set it here. Mind you it’s the line number within a cell, not the cell number.

Editing

When editing, the first botton (A) makes it a header, i.e. it begins with a “#”. Keep clicking that icon and it changes to level 2 and level 3. Botton B formats whatever we are highlighting as a code and botton C inserts a link. You can even insert an image (D), a line (E) or an Emoji (G). Botton F is LaTeX (it’s a type setting thing, link) and H is to put the preview below rather than on the right.

You can get the link to a particular cell (button I), and you add a comment on individual cell (J). I use button K often, it is for closing the editor (well you can also press Escape). The last button (L) is open up the cell.

File Menu

On the file menu you can open or upload a notebook, rename or save a copy, and download the notebook. I press control-S (save) often, but it’s actually not necessary because the notebook is automatically saving itself very often.

When working on a project for a few days or weeks, I download the notebook to my PC, having different file name every time (file1, file2, file3, etc.) So I can see any previous code that I’ve written and copy it back to my current code if required.

If the automatic save fails and it says something like “auto sync fails because of another copy”, click Runtime menu, Manage sessions. You’ll see two sessions open for the same notebook file. Terminate one of them and the automatic saving is fixed.

There you go I hope it was useful. Any comments, corrections or advise you can contact me on vrainardi@gmail.com.
Happy coding in machine learning!

8 November 2021

What is a Data Lake?

Filed under: Data Warehousing — Vincent Rainardi @ 9:05 am

When asked what a data lake is, most people describe the function, rather than describing what it is physically. That is like answering “What is a car” question with “A vehicle that we can use to move from place A to place B”. This is because not many people know what a data lake is, physically. So in this article I would like to answer that seemly simple question.

Collection of Files

A data lake, physically, is a collection of files. Files can be structured (tabular, hierarchical, graph, etc.) or unstructured (images, audio, videos, documents, etc.) Typically, structured data files are stored as columnar text files such as CSV or pipe delimited. Other commonly found format for data files are JSON, XML and Excel files.

Unstructured data files such as multimedia files are stored in their native formats, such as:

  • Images: JPEG, PNG, GIF, BMP (link)
  • Audio: WAV, MP3, M4A (link)
  • Video: MPEG, AVI, WMV (link)
  • Document: PDF, HTML, DOCX (link)

The other type of files stored in the data lake are:

  • Database backup files (typically .BAK but can be other extensions)
  • Log files (typically .LOG but can be other extensions)
  • Email files (typically MSG, PST, EDB, OST)
  • Social media data such as Facebook and Twitter

BLOB (Binary Large Object)

Let me clarify a terminology which is used often but not usually clear what it is physically: BLOB. A blob, or binary large object, is a file containing binary data, such as multimedia files or executable files. So originally blob does not include human readable files such as text files. But in the data lake world, blob generally means all files, including human readable files.

But if we want to be precise and get physical, blobs are not files. Blobs are a collection of files, which can be stored in different ways.

In Azure, there are 3 types of blobs (or blob storage): block blob, append blob, page blob. When people say “file”, it generally means “page blob”, which is random access file storage. Block blobs are optimised for uploading large amount of data/files, whereas append blob is optimised for appending data at the end of the blob (not efficient for updating or deleting existing blocks)

In AWS, files are called objects. What we call “blob storage” in Azure is called object store in AWS. An object store in AWS uses unique key-values to store objects. An object in AWS consists of a file and a metadata describing the object.

Containers

The files in the data lake are organised in folders. These folders are called “containers” or “buckets”. In Azure they call it a container and in AWS and Google Cloud they call it a bucket. The container can be multi levels (folders within folders).

Each container belong to an account. These accounts are called “storage account”. Users are permissioned to access these storage accounts. But users can also be permissioned to access a container or a file.

HTTP Access

Files in the data lake is accessable via HTTP. Data lake is a RESTful architecture so each file has a URI (means URL).

For example:

Database-like Access

Structured data files in the data lake can be accessed as if they are tables in a database. We can do this in both Azure, AWS and Google Cloud. For that we use technology like Hive and Databricks. In Databricks we create a Hive table which is linked to a data file. In Databricks, a collection of tables is called a “database”.  

Afterwards, we can query that Hive table using an SQL SELECT statement (Spark SQL). To run those Spark SQL queries we need to create a Spark cluster.

Database-like Access is a very important feature of a data lake. It means that the data lake is queryable as if the data is stored in SQL tables. Many BI tools (such as Power BI) can access Databricks tables, be it in Azure, AWS or Google Cloud. So we don’t need to put the data into a SQL Server or Oracle database. The BI tools can directly query the data lake as if they are database tables.

Blog at WordPress.com.