Data Platform and Data Science

28 April 2024

Backup in Snowflake

Filed under: Data Warehousing — Vincent Rainardi @ 10:43 am

Many of us who work with databases like SQL Server and Oracle are used to the concept of database backup. We back up our databases every day, and in the event of a database failure, we restore that backup. Then, to bring the restored database up to date, we restore the transaction log up to a specific point in time.

We are now in the era of Snowflake and Databricks, where there is no concept of backup. What we have in Snowflake is time travel. As you know, we query a database using a SELECT statement. Any database, Snowflake included. For instance: SELECT contract_date, contract_value FROM contract. But in Snowflake, at the end of your SELECT statement you can add "AT", like this: SELECT contract_date, contract_value FROM contract AT '2024-04-15 07:30:00'. You actually need to pass a proper timestamp expression, but you get the idea. And you can put any date and time within the data retention period, which can be up to 90 days.
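For reference, the AT clause accepts a timestamp, an offset in seconds, or a statement (query) ID. A minimal sketch of the three forms (the query ID in the last one is a made-up placeholder):

SELECT contract_date, contract_value FROM contract
  AT (TIMESTAMP => '2024-04-15 07:30:00'::TIMESTAMP);  -- a point in time

SELECT contract_date, contract_value FROM contract
  AT (OFFSET => -3600);                                -- 3600 seconds (1 hour) ago

SELECT contract_date, contract_value FROM contract
  BEFORE (STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');  -- just before a given query ID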

You also have time travel in Databricks (on Delta tables), which works in much the same way. It's just that instead of AT, after the table name you put "TIMESTAMP AS OF", like this: SELECT contract_date, contract_value FROM contract TIMESTAMP AS OF '2024-04-15 07:30:00'. Time travel is an amazing feature! For most day-to-day scenarios it eliminates the need for doing backup and restore on a database.
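Written out, with the table-version alternative as well (the version number here is made up for illustration):

SELECT contract_date, contract_value
FROM contract TIMESTAMP AS OF '2024-04-15 07:30:00';

SELECT contract_date, contract_value
FROM contract VERSION AS OF 123;   -- by table version instead of timestamp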

Not only that, when you accidentally drop a table and shout "Oh no!", you now have no reason to panic. Because just like we have a DROP statement, in Snowflake we also have an UNDROP statement. It is amazing! For example, if we ran "DROP TABLE contract" accidentally, we can just type "UNDROP TABLE contract" to bring our contract table back from the dead. And you can do it at any point within the retention period (up to 90 days) after you dropped the table. And before you ask, yes, in Databricks you can also undrop tables (managed tables in Unity Catalog). Same syntax, i.e. UNDROP TABLE table_name.
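And in Snowflake it's not just tables; you can UNDROP schemas and databases too. For example (my_schema and my_database are just placeholder names):

DROP TABLE contract;
UNDROP TABLE contract;

UNDROP SCHEMA my_schema;      -- placeholder schema name
UNDROP DATABASE my_database;  -- placeholder database name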

In addition to that, in Snowflake you also have CLONE. It is like copying a database without actually doing any copying. For example, to copy DB1 into DB2, we say: CREATE DATABASE DB2 CLONE DB1. Now we have a new database called DB2 which is exactly the same as DB1. But unlike a SQL Server copy database, the Snowflake CLONE statement does not actually make a copy of the data. It only puts a marker on DB1 at a particular point in time, labelled as DB2, and the two share the same underlying storage. As we update DB1, those updates are not applied to DB2. And vice versa: when we update DB2, you will not find that update in DB1.

Oh yes, you can CLONE tables too. So you can say CREATE TABLE Table2 CLONE Table1. And voila, you now have a new table called Table2 which is exactly like Table1. And at the point of cloning it doesn't occupy any extra disk space; storage only grows as the two tables diverge. Isn't it amazing?

But the beauty of time travel in Snowflake is that at the end of that CLONE statement we can add "AT". So we can say: CREATE DATABASE DB2 CLONE DB1 AT '2024-04-15 07:30:00'. And now we have a new database called DB2 which is exactly the same as DB1 was at 07:30 on 15 April. And yes, before you ask, you can do that with tables too. When I found this, my jaw dropped to the floor. That is an amazing feature.
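With the full timestamp syntax it looks like this (the name of the cloned table in the second statement is just an illustration):

CREATE DATABASE DB2 CLONE DB1
  AT (TIMESTAMP => '2024-04-15 07:30:00'::TIMESTAMP);

CREATE TABLE contract_20240415 CLONE contract
  AT (TIMESTAMP => '2024-04-15 07:30:00'::TIMESTAMP);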

Now the boring bit. The caveats. By default the time travel retention is only 1 day. There is a parameter called DATA_RETENTION_TIME_IN_DAYS, which can be set at account, database, schema or table level. If you set this to 90 you'll have time travel for 90 days. And yes, you need Enterprise edition for anything beyond 1 day, which is about 1.5x the per-credit price of the Standard edition.
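For example (assuming you have the privileges to change these settings):

ALTER ACCOUNT SET DATA_RETENTION_TIME_IN_DAYS = 90;         -- account-wide default
ALTER TABLE contract SET DATA_RETENTION_TIME_IN_DAYS = 90;  -- or per table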

Related to backup there is one more thing I'd like to mention: unloading into files. Using COPY INTO you can bulk unload your tables into text files in S3 or ADLS. You can unload all the data into a single file, or into multiple files. You can then load those files back into your tables/database. But who wants to do that when you have the CLONE ... AT and SELECT ... AT features?
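For completeness, a minimal unload sketch; @my_unload_stage here is an assumed external stage pointing at your S3 bucket or ADLS container:

COPY INTO @my_unload_stage/contract/
FROM contract
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
OVERWRITE = TRUE;   -- add SINGLE = TRUE to get one file instead of several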

PS. For those of you who are interested in trying out SELECT ... AT with the timestamp data type, the syntax is like this: SELECT * FROM TABLE1 AT (TIMESTAMP => '2024-04-28 10:58:00'::TIMESTAMP);

14 April 2024

Data Vault

Filed under: Data Warehousing — Vincent Rainardi @ 1:40 pm

Data Vault is a data modelling technique for data warehousing. It is usually used in the middle layer of a data warehouse (the integration layer), whereas the presentation layer uses a star schema, as shown below. Its main strength is flexibility, i.e. we can add a data source without changing the existing data structures in the Data Vault. The other strength is parallel loading. The main disadvantage is that it's not user friendly, because we have to join many tables.

Data Vault modelling was created by Dan Linstedt in the 1990s and published around 2000, and version 2.0 followed in 2013-2016. Besides Dan, other people that I consider authorities on Data Vault are Kent Graziano (currently with Snowflake), who has been doing and teaching Data Vault for over 20 years; Michael Olschimke, who co-wrote the DV 2.0 book with Dan and has been teaching Data Vault for a long time; and Patrick Cuba, author of "Data Vault Guru" (also with Snowflake). It is impossible to Google any particular question on Data Vault without coming across these four people. We should all be grateful to them for educating and encouraging so many data professionals through their books, training and seminars over 20 years, and in turn transforming so many companies in terms of BI and DW.

The Business Vault is a layer sitting on top of the Raw Data Vault as shown below, not a full separate data store after it. The Business Vault contains additional tables which improve query performance, such as point-in-time (PIT) tables and bridge tables, as well as a calendar table.

Inside the Raw Data Vault there are 4 types of tables: hubs, links, satellites and reference tables. It's best to learn from an example, so below is an example for a retail business, such as a shop selling car spare parts over the internet.

To begin with, a hub represents a business key. Because order has a business key (order ID), it becomes a hub rather than a link. The above diagram is not complete; for example, in Link_Order_Detail we usually have a line number column. Also, the colour, size, UOM and packaging columns on Sat_Product may require reference tables.

Satellite tables are typically split one per source system, and also one per speed of change (rapidly changing attributes are put in one satellite, whereas slowly changing attributes are put in another).

System columns

Hub, link and ref tables have 3 system columns: hash_key, load_datetime and record_source. Hash_key is an MD5 or SHA-1 hash of the business key (MD5 and SHA-1 are hashing algorithms, not encryption). Load_datetime is the time when that row was loaded (the same for all rows in the batch, not differing by a few milliseconds). Record_source is the module in the source system (as granular as possible), such as SAP.SO.Product for the product table in the SAP Sales Order Management module.

Satellite tables have one more system column: hash_diff, which is used to identify changes in the descriptive attributes. It is the MD5 or SHA-1 hash of what would be the SCD2 attributes (the columns whose changes we are tracking). The other system column which can be useful is delete_flag, which indicates whether that row was deleted in the source or not.

Hash keys and hash diffs were introduced in DV 2.0 to enable parallel loading, one of Data Vault's main strengths.
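To make the system columns concrete, here is a minimal sketch of a hub and a satellite for the Product example above, using Snowflake data types. The column names and lengths are illustrative only, and CHAR(32) assumes the hashes are stored as MD5 hex strings:

CREATE TABLE Hub_Product (
    product_hash_key   CHAR(32),       -- MD5 hash of the business key
    product_id         VARCHAR(20),    -- the business key
    load_datetime      TIMESTAMP_NTZ,  -- same value for every row in the batch
    record_source      VARCHAR(100)    -- e.g. SAP.SO.Product
);

CREATE TABLE Sat_Product (
    product_hash_key   CHAR(32),       -- same hash key as the hub
    load_datetime      TIMESTAMP_NTZ,
    record_source      VARCHAR(100),
    hash_diff          CHAR(32),       -- MD5 hash of the descriptive attributes below
    colour             VARCHAR(30),
    size               VARCHAR(30),
    uom                VARCHAR(10),
    packaging          VARCHAR(30),
    delete_flag        BOOLEAN
);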

Presentation layer

As Dan and Michael said in their DV 2.0 book, in the presentation layer we need a star schema. They devote a whole chapter to explaining dimensional modelling (chapter 7). They call the presentation layer the "information mart" or "information delivery" layer.

The reason for creating a star schema is that it is business friendly and easy to use. Rather than joining many tables in the data vault, reports and users only need to join fact tables to dimension tables. Other advantages of using a star schema are listed here: 12 Reasons for using Dimensional Model. There are 2 types of dimensional model: star schema and snowflake schema. I don't recommend a snowflake schema (not Snowflake the data platform) for the presentation layer because users and reports will need to do more joins, making it less user-friendly.

Although it is possible to use third normal form for the presentation layer, I don't recommend it because there are at least 12 disadvantages compared to a star schema (e.g. historical data, user friendliness and performance), as I wrote in the above link/article.

Conclusion

I would recommend using Data Vault 2.0 in data warehousing. Data Vault is a good and proven methodology; 20 years is long enough to prove it. Snowflake supports it (link), Databricks supports it (link), Microsoft Fabric supports it (link), WhereScape supports it (link), and Astera supports it too (link). It is not easy though. Like dimensional modelling, implementing Data Vault needs skill: not just to design the data model, but also to design the data ingestion. Experience is very valuable and important here. Is it easy or difficult to learn Data Vault? Read my article here; the answer is similar.

Happy learning! Thanks for reading. Would welcome your comments, feedback and opinion. And corrections!

4 April 2024

Snowflake for AI: prediction for a bank

Filed under: Data Warehousing — Vincent Rainardi @ 5:38 am

A wise man once said to me, that I need to start with the conclusion. So here we go.

In AI, generally speaking, Snowflake still seems to be playing catch-up with Databricks. Databricks started about 10 years ago, whereas Snowflake only started about 3 years ago, as far as I'm aware. But last November Snowflake released Cortex, which enables companies to make predictions based on their data, or detect anomalies. And like everything else in Snowflake, it is supposedly very easy to use.

What better way to spend the Easter bank holiday than doing AI in Snowflake? So today I'm going to do just that: find out for myself how difficult or easy it is to use Snowflake for AI. There was a bank in Portugal which ran a marketing campaign, calling its customers to offer a term deposit product. The data contains 41,188 customers, which I loaded into a Snowflake database. I split the data into two: 28,939 rows of training data (70.3%) and 12,249 rows of testing data (29.7%). I then used Cortex to create a classifier model to predict whether a client will take the term deposit offer or not. For the 12,249 customers in the test data, the model was correct 11,468 times, so the accuracy is 93.6%.

Surprisingly, it is super easy to use Snowflake for AI. Far easier than Scikit-learn on Colab. It took only one SQL statement to train the model. Yes you read that correctly, a SQL statement, not Python! And I only needed one SQL statement to make prediction for the whole test data. That is amazingly simple! I like Snowflake because it is simple. To connect, to load file, to schedule a task, all of them are simple in Snowflake. But this is beyond my expectations. It is super simple to use Snowflake for AI.

So in the next few paragraphs I’ll explain the business case, the data, then the process (ML modelling), then the result.

Business Case

It's a retail bank in Portugal, in 2010. They have 41,188 customers that the customer service centre called to offer a term deposit product in a marketing campaign. The bank has client data such as age, occupation, marital status, education and communication type, as well as campaign data such as the number of calls with the customer during this campaign, the duration of the calls, and the number of days since the last contact. The bank also has credit data for each customer, such as whether the customer has a mortgage or a personal loan, and whether the customer has ever defaulted on their credit (meaning failed to pay an instalment).

The data looks like this:

The last column (column U) is whether the customer took up the term deposit offer or not. Columns P to T contain economic data from when the customer was contacted, such as the consumer price index (in Portugal) and the inter-bank interest rate in Eurozone countries.

Here are the columns:

  1. Age: the customer age.
  2. Occupation: admin, blue-collar, entrepreneur, housemaid, management, retired, self-employed, services, student, technician, unemployed, unknown.
  3. Marital status: married, single, divorced (meaning divorced or widowed), unknown.
  4. Education level: basic 4 years, basic 6 years, basic 9 years, high school, illiterate, professional course, university degree.
  5. Credit default flag: whether the customer has ever defaulted on any credit facility, such as a loan, mortgage, overdraft or credit card.
  6. Mortgage flag: whether the customer has a mortgage or not.
  7. Personal loan flag: whether the customer has a personal loan or not.
  8. Communication type: whether the campaign call was to the customer's mobile phone or a landline telephone.
  9. Last contact month: in which month the campaign call happened.
  10. Last contact day of week: on which day the campaign call happened.
  11. Last contact duration seconds: how long the campaign call lasted.
  12. Number of contacts during campaign: how many times the bank contacted the customer during the campaign (or the customer contacted the bank).
  13. Days since previous campaign: how many days since the customer was contacted in the previous campaign.
  14. Number of contacts before this campaign: how many times the customer has been in contact with the bank before this campaign.
  15. Outcome of previous campaign: what was the outcome of the previous campaign, i.e. success or failed (or there was no previous campaign).
  16. Employee variation rate: indicates whether the bank is hiring more or fewer people.
  17. Consumer price index: an economic indicator of how much prices are rising or falling across many different types of goods and services.
  18. Consumer confidence index: indicates how optimistic or pessimistic consumers feel about their financial well-being and the economy.
  19. EURIBOR 3-month rate: the interest rate that banks in the Eurozone charge each other for a credit line with a maturity of 3 months.
  20. Number of employees: the number of employees working for the bank at the time the customer was contacted.
  21. Subscribed to term deposit: the dependent variable that we are trying to predict, i.e. whether the customer in the end subscribed to the term deposit product or not.

So there are 20 predictors and 1 factor to predict. And it becomes a machine learning classification problem. We try to classify each customer to a Yes category (they opened a term deposit with the bank) or No category (they didn’t open a term deposit with the bank). It would be interesting to understand which predictors are the most influential.

Loading the data

The process is outlined by Snowflake in Ref #2 below, based on research conducted by Sergio Moro, Paulo Cortez and Paulo Rita from the University Institute of Lisbon in 2014 (Ref #3 below).

First I created a database in Snowflake called Nexus Bank, and a schema in that database called ML Prediction (see screenshot below). Then a "stage" called S3 Nexus Bank. A "stage" is a window from Snowflake into the files in an S3 bucket. Through this stage we can see a file called cortex_ml_classification.csv:

The URL for the stage is: s3://sfquickstarts/frostbyte_tastybytes/mlpf_quickstart/. This CSV file in S3 contains the bank customer data that I mentioned above. To load this CSV file I created a file format called CSV_FILE like this:
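For a comma-delimited file with one header row, the file format definition looks something like this (the delimiter and header settings are assumptions on my part):

CREATE OR REPLACE FILE FORMAT CSV_FILE
  TYPE = 'CSV'
  FIELD_DELIMITER = ','   -- assuming a comma-delimited file
  SKIP_HEADER = 1;        -- assuming one header row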

I then loaded the CSV file into a table using the above file format, like this:

CREATE OR REPLACE TABLE BANK_CUSTOMER AS 
SELECT 
    T.$1  AS AGE, 
    T.$2  AS OCCUPATION, 
    T.$3  AS MARITAL_STATUS, 
    T.$4  AS EDUCATION_LEVEL, 
    T.$5  AS CREDIT_DEFAULT_FLAG, 
    T.$6  AS MORTGAGE_FLAG, 
    T.$7  AS PERSONAL_LOAN_FLAG, 
    T.$8  AS COMMUNICATION_TYPE, 
    T.$9  AS LAST_CONTACT_MONTH, 
    T.$10 AS LAST_CONTACT_DAY_OF_WEEK,
    T.$11 AS LAST_CONTACT_DURATION_SECONDS, 
    T.$12 AS NUMBER_OF_CONTACTS_DURING_CAMPAIGN, 
    T.$13 AS DAYS_SINCE_PREVIOUS_CAMPAIGN, 
    T.$14 AS NUMBER_OF_CONTACTS_BEFORE_THIS_CAMPAIGN, 
    T.$15 AS OUTCOME_OF_PREVIOUS_CAMPAIGN, 
    T.$16 AS EMPLOYEE_VARIATION_RATE, 
    T.$17 AS CONSUMER_PRICE_INDEX, 
    T.$18 AS CONSUMER_CONFIDENCE_INDEX, 
    T.$19 AS EURIBOR_3_MONTH_RATE, 
    T.$20 AS NUMBER_EMPLOYEES, 
    T.$21 AS SUBSCRIBED_TO_TERM_DEPOSIT
FROM @S3_NEXUS_BANK/cortex_ml_classification.csv (file_format => CSV_FILE) T;

Then I created a view like this to add a data type to each column:

CREATE OR REPLACE VIEW BANK_CUSTOMER_VIEW AS
SELECT
    CAST(AGE AS INT) AS AGE,
    REPLACE(INITCAP(CAST(OCCUPATION AS VARCHAR(50))),'.','') AS OCCUPATION,
    INITCAP(CAST(MARITAL_STATUS AS VARCHAR(30))) AS MARITAL_STATUS,
    REPLACE(INITCAP(CAST(EDUCATION_LEVEL AS VARCHAR(30))),'.',' ') AS EDUCATION_LEVEL,
    INITCAP(CAST(CREDIT_DEFAULT_FLAG AS VARCHAR(10))) AS CREDIT_DEFAULT_FLAG,
    INITCAP(CAST(MORTGAGE_FLAG AS VARCHAR(10))) AS MORTGAGE_FLAG,
    INITCAP(CAST(PERSONAL_LOAN_FLAG AS VARCHAR(10))) AS PERSONAL_LOAN_FLAG,
    INITCAP(CAST(COMMUNICATION_TYPE AS VARCHAR(30))) AS COMMUNICATION_TYPE,
    INITCAP(CAST(LAST_CONTACT_MONTH AS VARCHAR(10))) AS LAST_CONTACT_MONTH,
    INITCAP(CAST(LAST_CONTACT_DAY_OF_WEEK AS VARCHAR(10))) AS LAST_CONTACT_DAY_OF_WEEK,
    CAST(LAST_CONTACT_DURATION_SECONDS AS INT) AS LAST_CONTACT_DURATION_SECONDS,
    CAST(NUMBER_OF_CONTACTS_DURING_CAMPAIGN AS INT) AS NUMBER_OF_CONTACTS_DURING_CAMPAIGN,
    CAST(DAYS_SINCE_PREVIOUS_CAMPAIGN AS INT) AS DAYS_SINCE_PREVIOUS_CAMPAIGN,
    CAST(NUMBER_OF_CONTACTS_BEFORE_THIS_CAMPAIGN AS INT) AS NUMBER_OF_CONTACTS_BEFORE_THIS_CAMPAIGN,
    INITCAP(CAST(OUTCOME_OF_PREVIOUS_CAMPAIGN AS VARCHAR(20))) AS OUTCOME_OF_PREVIOUS_CAMPAIGN,
    CAST(EMPLOYEE_VARIATION_RATE AS NUMBER(2, 1)) AS EMPLOYEE_VARIATION_RATE,
    CAST(CONSUMER_PRICE_INDEX AS NUMBER(5, 3)) AS CONSUMER_PRICE_INDEX,
    CAST(CONSUMER_CONFIDENCE_INDEX AS NUMBER(3,1)) AS CONSUMER_CONFIDENCE_INDEX,
    CAST(EURIBOR_3_MONTH_RATE AS NUMBER(4, 3)) AS EURIBOR_3_MONTH_RATE,
    CAST(NUMBER_EMPLOYEES AS NUMBER(5, 1)) AS NUMBER_EMPLOYEES,
    INITCAP(CAST(SUBSCRIBED_TO_TERM_DEPOSIT AS VARCHAR(10))) AS SUBSCRIBED_TO_TERM_DEPOSIT,
    CASE WHEN UNIFORM(0::FLOAT, 1::FLOAT, RANDOM()) < .7
        THEN 'Training' ELSE 'Testing' 
        END AS TRAINING_TESTING_SPLIT
FROM BANK_CUSTOMER;

And when I did SELECT * the output looked like the Excel screenshot in the business case section above, except for the last column, which I added to split the data into Training and Testing (70%-30%).

Then I created 2 views based on that last column, one for Training data and one for Testing data, like this:

CREATE OR REPLACE VIEW TRAINING_VIEW AS
SELECT * EXCLUDE TRAINING_TESTING_SPLIT
FROM BANK_CUSTOMER_VIEW
WHERE TRAINING_TESTING_SPLIT = 'Training';

CREATE OR REPLACE VIEW TESTING_VIEW AS
SELECT * EXCLUDE TRAINING_TESTING_SPLIT
FROM BANK_CUSTOMER_VIEW
WHERE TRAINING_TESTING_SPLIT = 'Testing';

The Training view contains 28,939 customers (70.3%), whereas the Testing view contains 12,249 customers (29.7%).
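A quick way to check the split (this is just a sanity-check query, not part of the quickstart):

SELECT TRAINING_TESTING_SPLIT, COUNT(*) AS NUMBER_OF_CUSTOMERS
FROM BANK_CUSTOMER_VIEW
GROUP BY TRAINING_TESTING_SPLIT;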

Machine learning stuff

The machine learning modelling is very simple. As I said before, it only takes one SQL statement to create a classifier model and train it:

CREATE OR REPLACE SNOWFLAKE.ML.CLASSIFICATION TERM_DEPOSIT_CLASSIFIER
(   INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'TRAINING_VIEW'),
    TARGET_COLNAME => 'SUBSCRIBED_TO_TERM_DEPOSIT'
);

The above SQL created a classifier model from the training view, with Subscribed To Term Deposit as the target column.

To predict whether the customer opened a Term Deposit account or not, we just need one SQL statement:

CREATE TABLE PREDICTION_RESULT AS
WITH PREDICTION_OUTPUT AS
(   SELECT TERM_DEPOSIT_CLASSIFIER!PREDICT
        ( INPUT_DATA => object_construct(*) ) AS TIME_DEPOSIT_PREDICTION,
        SUBSCRIBED_TO_TERM_DEPOSIT
    FROM TESTING_VIEW
)
SELECT 
    SUBSCRIBED_TO_TERM_DEPOSIT AS ACTUAL,
    TIME_DEPOSIT_PREDICTION:class::VARCHAR(10) as PREDICTED,
    TIME_DEPOSIT_PREDICTION:probability:Yes as YES_PROBABILITY,
    TIME_DEPOSIT_PREDICTION:probability:No as NO_PROBABILITY
FROM PREDICTION_OUTPUT;

The CTE in the above SQL predicted the Term Deposit for each customer in the Testing view. That prediction output was then put into a table called Prediction Result.

If we query that Prediction Result table it looks like this:

SELECT ACTUAL, PREDICTED, COUNT(*) FROM PREDICTION_RESULT
GROUP BY ACTUAL, PREDICTED ORDER BY 1,2;

So out of 12,249 customers in the Testing data, 10,504 + 964 = 11,468 are predicted correctly, which means that the accuracy of the model is 93.6%.
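By the way, if I read Ref #4 correctly, the classifier object also exposes helper methods for the confusion matrix, the evaluation metrics and the feature importance, which you call like this:

SELECT TERM_DEPOSIT_CLASSIFIER!SHOW_CONFUSION_MATRIX();
SELECT TERM_DEPOSIT_CLASSIFIER!SHOW_EVALUATION_METRICS();
SELECT TERM_DEPOSIT_CLASSIFIER!SHOW_FEATURE_IMPORTANCE();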

Conclusion

I found that it is extremely easy to do machine learning predictions using Snowflake, as I have shown above. To find out more details, such as which predictors have the most influence on the dependent variable, please see the documentation in Ref #4 below. I agree with @Ellery Berk that with Snowflake Cortex, machine learning becomes democratized for everyone (see Ref #5 below). I think Snowflake has done a good job on this; they have managed to make ML prediction really simple.

References:

  1. Data: link
  2. Steps: link
  3. Research: link
  4. Create classification in Snowflake: link
  5. Ellery’s blog: link

Disclaimer: I did not receive any reward or compensation from Snowflake or anyone else for writing this article, monetary or otherwise. All of the above is my own opinion, there was no influence or suggestion from anyone.

Matillion: SaaS ETL Tool

Filed under: Data Warehousing — Vincent Rainardi @ 5:25 am

It looks like quite a lot of companies use Matillion for their data pipelines; I came across some of them last year when looking for a new job. In the SaaS ETL space, Matillion seems to be as popular as Fivetran, and not just for Snowflake but also for Redshift and Databricks. It seems to be mostly on AWS and GCP though, as the Azure space is dominated by ADF. So today I decided to find out for myself how difficult or easy it is to create data pipelines using Matillion.

Surprisingly, it was very easy to use. It is a completely web-based UI, and yet you can do anything you need for transforming and loading your data. Within 15 minutes I managed to connect to a Snowflake database, create a pipeline that aggregates data from a source table, write the result into a target table in Snowflake, and schedule the pipeline to run daily. It only took me 3 goes before the pipeline I created ran successfully all the way through. It is very easy to use. There is no typing or scripting; everything is UI based, point and click, drag and drop. I must say that the UI is very intuitive, hats off to the UI designers at Matillion. And it's SaaS through and through. No installation whatsoever!

So below I'll explain the data pipeline I set up: the data, the process of creating the pipeline, running it, scheduling it and checking the output.

The Data

The data is about the machine learning predictions of a bank in Portugal that ran a campaign calling its customers to offer a term deposit product. It's real data, but it's very old (2010). I wrote about this case here. There are 4 columns in the table: Actual, Predicted, Yes Probability and No Probability, as on the screenshot below.

If the Predicted column matches the Actual column, the prediction is right; otherwise it is wrong. So I'm trying to aggregate the above data into 4 rows, showing how many predictions are right and how many are wrong, like this:
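In SQL terms, the output I want from the pipeline is roughly this (run against the Prediction Result table from the previous post; the count column name is my own choice):

SELECT ACTUAL, PREDICTED, COUNT(*) AS NUMBER_OF_PREDICTIONS
FROM PREDICTION_RESULT
GROUP BY ACTUAL, PREDICTED
ORDER BY ACTUAL, PREDICTED;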

Creating the data pipeline

In Matillion I created a pipeline which reads the source table, filters the rows, aggregates them into 4 rows as mentioned above, and writes those 4 rows into the target table. The pipeline is shown below.

As we can see above, on the left hand side there are many transformation components we can use to process the data (conditional, calculation, comparison, etc.), as well as components for reading, writing, joining and testing the data.

All I needed to do was drag and drop those components onto the canvas, connect them up, and then configure their settings. For example, for the aggregation component I put the Actual column and the Predicted column in the "Grouping", as shown above.

For the Source Table and the Target Table components I connected to the corresponding tables in Snowflake, like below. I was surprised how easy that was: I just needed to supply the account URL (well, just the first part of the URL, before "snowflakecomputing.com") and the username and password to log in to Snowflake.

For the Filter component I added a filtering condition like this:

I set the threshold to 0 above because I wanted to pick up all rows first. Later on I'll change it to 0.87 (see below) and check that the result is different (it should be mostly No).

Once the pipeline was set up I just clicked "Validate" at the top, then "Run" to run the pipeline.

I then queried the Prediction Aggregate table in Snowflake and the result was as expected:
SELECT * FROM PREDICTION_AGGREGATE;

Then I modified the filtering condition to 0.87 like this:

And ran it again. The result looks like this:

As expected, after filtering the rows to No Probability > 0.87, only rows where the Predicted column = No are picked up.

Conclusion

It's very easy to use Matillion to do ETL, be it ingestion into a lake, transformation within the warehouse, or analytic applications. It reminded me of the old days of Informatica PowerCenter, 15 years ago. Everything is point and click, drag and drop. There's no scripting like dbt, no typing like a SparkSQL notebook in Databricks. Very easy to use, easy to set up, and easy to connect to Snowflake, or any other data source. The UI is very intuitive, and the whole thing is extremely easy to learn. I like Matillion. Thumbs up to their UI designers. And their backend designers too. And it's all SaaS cloud. No installation!

Disclaimer: I did not receive any reward or compensation from Matillion or anyone else for writing this article, monetary or otherwise. All of the above is my own opinion, there was no influence or suggestion from anyone.
