Data Warehousing and Data Science

22 May 2022

Watermark in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:54 am

When we load data into a target table, we write down the last row of data we loaded from the source table. Say we have 5 rows in the source table, and we loaded all of them to the target table. So we write down “5” (meaning that the last row we loaded was row 5), see the picture below:

So now in the target table we have 5 rows.

Note: The above image is slightly incorrect. In data warehousing, the pump doesn't "withdraw" the water from the source (pumping it out). Instead it "copies" the water (the water in the source container is not gone but is duplicated). Oh well, I spent almost 2 hours creating that picture and still it's not right!

Let’s say there are 3 more rows in the source table. So now there are 8 rows in the source. We load from row 6 of course, because the last row we loaded was row 5. And we need to load the 3 new rows from source into the target, see the picture below:

After we loaded the 3 new rows, we update the watermark to 8, see the picture below:

This mechanism is well documented in the Microsoft Azure web page for ADF pipeline (link):

Note here that step 2 is done before step 3. This ensures that the data loading is watertight, meaning any row created in the source after the watermark is read in step 2 will be loaded next time.
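To make the mechanism concrete, below is a minimal Python sketch of a watermark-based incremental load. It uses pyodbc, and the connection string, table names and column names are hypothetical; the ADF tutorial implements the same four steps with pipeline activities rather than code.

```python
import pyodbc

# Hypothetical DSN and table/column names, for illustration only.
conn = pyodbc.connect("DSN=MyWarehouse")
cursor = conn.cursor()

# Step 1: read the old watermark (the last row we loaded previously).
old_wm = cursor.execute(
    "SELECT WatermarkValue FROM etl.Watermark WHERE TableName = 'Sales'"
).fetchval()

# Step 2: read the new watermark from the source BEFORE loading,
# so any row created after this point is picked up in the next run.
new_wm = cursor.execute("SELECT MAX(RowId) FROM source.Sales").fetchval()

# Step 3: copy only the rows between the old and the new watermark.
cursor.execute(
    """
    INSERT INTO target.Sales (RowId, Amount)
    SELECT RowId, Amount FROM source.Sales
    WHERE RowId > ? AND RowId <= ?
    """,
    old_wm, new_wm,
)

# Step 4: store the new watermark, ready for the next load.
cursor.execute(
    "UPDATE etl.Watermark SET WatermarkValue = ? WHERE TableName = 'Sales'",
    new_wm,
)
conn.commit()
```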

Also note above that Microsoft uses the word "watermark" for ADF pipelines (as opposed to "delta mark").

But where did the word "watermark" come from? Well, if you look at Wikipedia (link), a watermark is an obscured image on paper (almost hidden). If you look at a £20 note, you'll see a £20 watermark, see the red arrow below:

But that is not the word watermark that we use in data loading 🙂

The word watermark we use in data loading is from this:

It is a water level stick in a dam or a river, with markings on the stick to record the high water level of past events such as floods (source: link, link).

Sometimes the watermark line is drawn on the wall of a house, to mark the level of flooding that hit that house. We can see below left that in 1893 the water level was 41 inches, whereas on the right there are various watermarks going back centuries, the most recent ones being 1970 and 1982 (source: link, link). Those are known as "high water marks".

Below left is the high water mark in the Old Town Hall in the city of Passau, Germany (link, link). And below right is the record flood levels at the Boat Inn in Coalport, near Telford, England (link):

That is where the word "watermark" in data loading comes from! Not the hidden writing on money, but the "high water marks" left by flooding.

Now you know what a “watermark” in data warehousing is, and where it is from 🙂

30 March 2022

Infrastructure in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:43 am

We often forget that underpinning the data mart/lake/warehouse are the storage, network and security. Without them users can't access the data. We can't even store the data! We are so dependent on them and yet we spend so little time on them. Today let's think about them. About the infrastructure.

When we talk about the infrastructure in Data Warehousing, there are 3 things we need to consider:
1. Storage
2. Network
3. Security

1. Storage

Storage means the place where we store the data, e.g. a file storage, a blob storage, a data lake, a database, etc. It all depends on your architecture. So in Azure or AWS we need to create a storage account, or a data lake store, or a database.

This is what we call them in Azure:

  • Blob storage
  • File share
  • Data lake store
  • SQL database

1.1. Storage – SQL Database

If it’s a SQL database, then we need to think about the sizing, i.e. how much computing power, I/O and memory is allocated to it. In Azure we call it DTU. The default is S0 which is 10 DTU (database transaction unit). A DTU is unit of measure for [CPU, memory, read, write] together. S0 has 10 DTU and 250 GB storage. An S0 has very limited power, it’s very slow. For “real” applications you’ll need an S4 or S6. For light weight application it’s an S3 (link).

  • S0 is 10 DTU with 250 GB storage, $18/month
  • S3 is 100 DTU with 250 GB storage, $184/month
  • S4 is 200 DTU with 250 GB storage, $368/month
  • S6 is 400 DTU with 250 GB storage, $736/month

It’s probably easier to visualise DTU using a “bounding box” like this (source: link):

In my experience, a process consisting of a combination of stored procedures and data loading that ran for 2-3 hours on an S0 finished in 1-2 minutes after I changed it to an S6.

1.2. Storage – Files

For storing files we first create a “storage account” (link). We can choose either GPv2 storage account which is hard drive based or FileStorage storage account which is SSD based (a lot faster). We also need to choose between local redundancy (3 copies in one physical location), zone redundancy (3 copies in 3 different zones in 1 region) or geo redundancy (6 copies in 2 different regions), see link.

Then we create a container, which is like a folder (link). Then we upload the files into this container. To enable users to access the files from Windows explorer (like \\location\folder\) we need to create a file share (link).

You can also create an Oracle database in Azure (link), but you need to create it on a VM (virtual machine). If you use AWS, the storage is called “S3” (Simple Storage Service). What we call a “container” in Azure, in S3 we call it a “bucket”. Basically it means a group of files (link).
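As a sketch of what this looks like in code, the snippet below uses the Azure Storage Python SDK (the azure-storage-blob package) to create a container and upload a file into it. The connection string, container name and file name are made up for illustration.

```python
from azure.storage.blob import BlobServiceClient

# Assumption: the connection string of an existing storage account.
conn_str = "DefaultEndpointsProtocol=https;AccountName=mydwstorage;AccountKey=..."

service = BlobServiceClient.from_connection_string(conn_str)

# A container is like a folder (the S3 equivalent is a bucket).
container = service.create_container("raw-files")

# Upload a local file into the container as a blob.
with open("sales_2022.csv", "rb") as f:
    container.upload_blob(name="sales/sales_2022.csv", data=f)
```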

1.3. Storage – Data Lake

In Azure, the storage for data lake is called ADLS2 (Azure Data Lake Storage Generation 2, link). It can be used for Hadoop (HDInsight), Synapse or Databricks. It’s very scalable (exabytes) and cost effective ($0.019 per GB per month, for standard storage, or “hot storage”).

There are 4 different storage tiers in ADLS: hot, cool, archive and premium. Hot is the standard. Cool is for infrequently accessed data and has a 30-day minimum retention period (deleting or moving the data earlier incurs an early deletion charge). Archive is an offline tier for rarely accessed data, with a 180-day minimum retention period. Premium has low latency (meaning fast).

2. Network

Network means access to that storage, i.e. how users can connect to that storage or database. How the reports can access that database or data lake.

2.1 Network for a storage account

After we have a storage account, we configure the "networking" of that storage account. By that I mean we click on the Network pane and configure which IP ranges are allowed to access this storage account. This is done by configuring the "firewall rules" to allow only certain IP ranges to go in, or to go out.

Storage accounts can also have private endpoints, which are accessible through the virtual network rather than the public internet. It means that we assign a private IP address from the VNet to the storage account. A VNet (virtual network) is a group of computers and devices connected through a private network in the cloud. We can put a VM (virtual machine), a storage account and a SQL database into one VNet so they can access each other.

The concept of private endpoint can also be applied to an application, like this (source: link):
(click the image to enlarge)

We can see above that the Europe West and Europe North VNets are both using private IP addresses (meaning 10.something or 192.something), communicating via private endpoints (the green arrows) to a web app front end (link).

2.2. Network for a SQL database

In Azure, a SQL database also has a Network pane on which we can configure which IP ranges are allowed to access this database. This is done by configuring the virtual network rules (VNet rules) or IP rules to allow certain IP ranges to go in, or to go out.

VNet rules are applied to the whole SQL server, not just to one database. Whereas IP rules can be applied to either the SQL server level or the database level.

Networking-wise, in front of a SQL database there is a "database gateway" (the green boxes below, source: link). A database gateway is an Azure service which has a public IP address and listens on port 1433 (the SQL port).

In "redirect mode" the gateway redirects the connection to the right cluster (cluster 54 in the diagram), and inside the cluster the traffic is forwarded to the right database, so after the initial handshake the traffic no longer goes through the gateway. The alternative is called "proxy mode", which means that all traffic always goes through the gateway, and the gateway forwards it to the right cluster.

Of course you don’t have to have a cluster. In most cases the SQL server is “stand alone” (meaning not in a cluster). A database cluster means a group of SQL servers with high availability. Each of the SQL server is called a “node”. One node (called the primary node) is copied to other nodes (called secondary nodes), like this: (source: link)

3. Security

Security is about allowing users to access the data, and protecting the data.

3.1. Data Access

Data Access is about 2 things: authentication and authorisation.

  • Authentication means that we look up the incoming user in our Active Directory and verify the password.
  • Authorisation means that we look up what the user is allowed to access.

Bear in mind that a user can be a service account, not just a human user. And bear in mind that instead of a user ID and password, a service account can also have a "key" and a "secret", which are normally stored in a "key vault".

If you have a database, and a user wants to read a table in this database, then the system will first find out who this user is. This is done by looking the user up in Active Directory (the list of all the computers, servers and users in our organisation). It then checks whether the password supplied is correct (or the "secret" for the "key").
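For example, a service account (or a pipeline) would typically fetch its "secret" from the key vault at run time rather than hard-coding it. Here is a minimal sketch using the Azure Key Vault Python SDK; the vault URL and secret name are made up for illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Hypothetical vault URL and secret name.
vault_url = "https://my-dw-vault.vault.azure.net"

# DefaultAzureCredential tries managed identity, environment variables,
# Azure CLI login, etc. in turn, so the same code works locally and in the cloud.
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)

# The service account retrieves the secret; no password is stored in the code.
sql_password = client.get_secret("sql-password").value
```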

We can also create a "guest user" in our Active Directory (link). A guest user "signs in" at their own company, then comes to our network as a temporary user. So we verify their user ID (or email address) but not their password. But remember, only users from that company are allowed in ("company" here being defined as a range of IP addresses).

3.2. Data Protection

Data protection is about 2 things: protecting the data storage, and protecting the data transport.

Data storage means the place where our data is stored, i.e. the storage accounts, the data lake, or the databases. We do this by allowing certain users to be able to access certain tables (or views, or stored procedures). We do that using Active Directory.

Data transport means that when we transfer data through the internet, no one can know its content. And no one can eavesdrop either. We do this by encrypting the data, and securing the "channel" we use to transport the data. "Securing" means that the public can't access this channel; only authorised users are allowed to access it.

Data Architect

What really strikes me is that as a data architect you need to know those things. The networking, the security and the storage. Earlier in my career, when I had just started working, I was a network guy. Back in uni I installed a Novell Ethernet network in my lab and configured it. And in my first job I was connecting IBM token ring terminals to an AS/400 network. That was my job when I started. Well, not for long, because straight after that I became a COBOL developer on the AS/400, as well as a SQL developer and a C# developer (for security access). Well, when you've just started working you'd do anything your boss asked you to, right? 🙂

So when later on in my career I became a data architect I didn't have much trouble understanding the terminology. Well, it's true that it's been 20 years so I needed to read up a bit. But the basic understanding was already there. Networking and security don't change much. They have a "new face", new names, but the principles are still the same.

But yes, my point is that as a data architect you need to do those things. You need to design the access to your data, how to secure the storage and the transport, and how you are going to store the data. Basically those 3 things above: the storage, the network and the security. It's part of the job of a data architect. We are not data modellers. Or data analysts. We are data architects, meaning that we design the data models, and we architect the data storage, data loading and data access, and everything around them. Not just the data, but also the processes and the architecture.

21 February 2022

NoSQL in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 3:42 am

Where does NoSQL sit in the data warehousing landscape? Does it just sit on the bench, having no role? Do we use it in the staging area? Or does it function as one of the main data stores?

NoSQL has many faces. Document databases, key value pairs, graph databases, they are all NoSQL. So obviously its usage depends on what type of data we have to store. If our data is about individual people (or individual companies) then it lends itself to being stored in a graph database. Because people have connections (ditto companies). Or you can store them (people and companies) as documents.

That is if you focus on one of them at a time. But if you'd like to process them en masse, aggregate them, etc. then tabular storage has its advantages. KVP has a distinct advantage when it comes to schema-on-read. Data from any source can be stored as key value pairs in the data lake, thus deliberately avoiding modelling the data.

The idea of schema-on-read is simple, but a bit difficult to accept. But it is key to understanding NoSQL's role. The basic premise of a data lake is that we don't model the data when storing it in the lake. We model it when we use it. KVP is ideal for this, for a data lake store. All the attributes from all tables are stored in just one column. Obviously this requires a good data dictionary and data lineage. Not just at the entity level (what they contain, how they are used, etc.) but down to each attribute.
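As an illustration of that "all the attributes in one column" idea, here is a small sketch of a source row stored as generic key-value pairs and then reassembled at read time. The entity and attribute names are hypothetical; the point is that the schema is only applied when we read the data back.

```python
# A source row with a fixed schema ...
source_row = {"customer_id": "C001", "name": "ABC Ltd", "country": "UK", "sector": "Retail"}

# ... stored in the lake as key-value pairs: any table, any attribute,
# all in the same three columns (entity, key, value).
kvp_rows = [
    {"entity": "customer/C001", "key": k, "value": str(v)}
    for k, v in source_row.items()
]

# Schema-on-read: the structure is reimposed only when we use the data.
reconstructed = {row["key"]: row["value"] for row in kvp_rows}
print(reconstructed)
```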

Document DB’s role in data warehousing is very different (or any data lakescape for that matter). Document DB has models. Every entity has attributes. And that becomes the basis of the data model. But unlike tabular model, in document DB individual entity can have different attributes. Let’s look at security as an example (as in financial instrument). The attribute of an option is very different to a CDS. An option has a “strike price” attribute for example, and a CDS doesn’t. A CDS has a “reference entity” attribute whereas an option doesn’t.

In a document DB entities have relationships with each other. If it is one to many then the parent IDs are put on the child entities. If it is many to many then we use a "joining document", or we can do embedding and referencing (see ref #6 below).
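To illustrate, here is a sketch of how an option and a CDS might look as documents, with different attributes per entity and a parent ID on a child document for a one-to-many relationship. The attribute names are illustrative, not a real data model.

```python
# Two securities stored as documents; each has its own set of attributes.
option = {
    "id": "SEC001",
    "type": "option",
    "underlying": "ABC Corp",
    "strike_price": 105.0,          # an option has a strike price ...
    "expiry_date": "2022-12-16",
}

cds = {
    "id": "SEC002",
    "type": "cds",
    "reference_entity": "XYZ Plc",  # ... whereas a CDS has a reference entity
    "notional": 10_000_000,
    "maturity_date": "2027-06-20",
}

# One to many: each trade (child) carries the ID of its security (parent).
trade = {
    "id": "TRD123",
    "security_id": "SEC001",        # parent ID on the child document
    "quantity": 50,
    "trade_date": "2022-02-21",
}
```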

There are other types of NoSQL too: wide column store, tuple store, object store, triple store, multi-model DB. We really are limiting ourselves if we only do relational databases. There are many, many different types of data store outside relational DBs. And as data architects, it is our job to know them well. Even if eventually we use SQL for a project, we know very well why we are not implementing it on the other data stores.

A “wide column store” uses rows and columns like in a relational database, but the column names and data types can vary from row to row. The most famous wide column store is probably Cassandra, but both Amazon and Azure have them.

So where does that leave us? Where do we use NoSQL in data warehousing or a data lake? KVP is useful as a data lake store. A graph database would be useful as the main data store if our data is about individual objects (such as people, securities, or companies) which are not processed/queried en masse. A document DB is useful for modelling entities which have variable attributes, such as securities. So NoSQL can be used as a data lake store, or as one of the main data stores.

Reference:

  1. Azure Cosmos DB – Documentation: link
  2. Azure Cosmos DB – Key Value Pair: link
  3. Azure Cosmos DB – Document DB: link (MongoDB)
  4. Azure Cosmos DB – Graph DB: link (Gremlin)
  5. Azure Cosmos DB – Designing Data Structure: link
  6. Azure Cosmos DB – Data Modelling: link
  7. Azure Cosmos DB – Wide Column Store: link (Cassandra)
  8. Amazon NoSQL – Documentation: link
  9. Amazon NoSQL – Key Value Pair: link (DynamoDB)
  10. Amazon NoSQL – Document DB: link
  11. Amazon NoSQL – Graph DB: link (Neptune)
  12. Technical Review of Two wide column store NoSQL Databases: Amazon DynamoDB and Apache Cassandra, by Abigail Udo: link
  13. Amazon Athena: link
  14. Oracle NoSQL Database Documentation: link
  15. Wikipedia on NoSQL: link
  16. The curse of Schema-on-read in Data Lakes, by Sandeep Uttamchandani: link

10 February 2022

Using CNN for Stock Prediction

Filed under: Data Warehousing,Data Science — Vincent Rainardi @ 7:23 am

It really puzzled me when people talked about using CNN for stock market prediction. CNN is for processing images. How can CNN be used for predicting the stock market? Surely we need LSTM for that, because it is a time series?

The key here is to recognise that a time series can be viewed in polar coordinates, like this: (Ref #2 and #3)

Stock charts are time series, which is basically the price of the stock across time. This is in Cartesian coordinates, i.e. X and Y coordinates. A point in X and Y coordinates can be converted into polar coordinates and vice versa, like this: (Ref #4)

This way, the line in the X and Y chart (such as a time series) can be converted into a polar coordinate chart, like above. The reason we convert it to polar coordinates is to make it easier to detect patterns or anomalies (Ref #5).

To identify the temporal correlation across different time intervals we look at the cosine of the sum of the angles for each pair of points (Ref #6 and #7), like this:

The above matrix of cosine is called the Gramian Matrix.

To make it easier to visualise, we convert the Gramian Matrix into an image, like below left: (Ref #3)

The image on the left is a Gramian Angular Summation Field (GASF). If the Gramian Matrix uses sin instead of cos, and the operation is minus instead of plus, then the image is called a Gramian Angular Difference Field (GADF), like above right. Together, GASF and GADF are called GAF (Gramian Angular Field).
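Here is a small numpy sketch of the idea, following Ref #6 and #7: rescale the series to [-1, 1], take the angular value phi = arccos(x), then GASF(i,j) = cos(phi_i + phi_j) and GADF(i,j) = sin(phi_i - phi_j). (The pyts library packages the same transform as pyts.image.GramianAngularField if you prefer not to write it yourself.)

```python
import numpy as np

def gramian_angular_fields(series):
    """Convert a 1D time series into its GASF and GADF matrices."""
    x = np.asarray(series, dtype=float)

    # 1. Rescale the series to [-1, 1] so that arccos is defined.
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1

    # 2. Angular value of each point in polar coordinates.
    phi = np.arccos(np.clip(x_scaled, -1, 1))

    # 3. GASF(i,j) = cos(phi_i + phi_j), GADF(i,j) = sin(phi_i - phi_j).
    gasf = np.cos(phi[:, None] + phi[None, :])
    gadf = np.sin(phi[:, None] - phi[None, :])
    return gasf, gadf

# Example: a toy "price" series of 50 points.
prices = np.cumsum(np.random.randn(50)) + 100
gasf, gadf = gramian_angular_fields(prices)
print(gasf.shape, gadf.shape)  # (50, 50) (50, 50)
```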

So why do we use GAF? What are the advantages? The first advantage is that GAF preserves temporal dependency. Time increases from the top left corner of the GAF image to the bottom right corner. Each element of the Gramian Matrix is a superposition or difference of directions with respect to the time difference between 2 points. Therefore GAF contains temporal correlations.

Second, the main diagonal of the Gramian Matrix contains the values with no time difference between the 2 points, meaning that the main diagonal contains the actual angular values. Using this main diagonal we can construct the time series of the features learned by the neural network (Ref #6 and #7).

Different colour schemes are used when creating GAF chart, but the common one is from blue to green to yellow to red. Blue for -1, green for 0, yellow for 0.3 and red for 1, like this: (Ref 6)

Once the time series become images, we can process them using CNN. But how do we use CNN to predict the stock prices? The idea is to take time series of thousands of stocks and indices from various time periods, convert them into GAF images, and label each image with the percentage up or down the next day, like below.

We then train a CNN to classify those GAF images to predict which series will be up or down the next day and by how much (a regression exercise).

Secondly, for each of the stock charts we produce various indicators such as Bollinger Bands, Exponential Moving Average, Ichimoku Cloud, etc. like below (Ref #15), and convert all of them to GAF images.

We put all these overlays together with the stock/index, forming a 3D image (dimensions: x = time, y = values, z = indicators). We use the same labels, which is the percentage up or down the next day, to train a CNN network using those 3D images and those labels. Now we don't only use the stock prices to predict the next day's movement, but also the indicators.

Of course, we can use the conventional LSTM to do the same task, without converting them to the Polar Coordinate. But the point of this article is to use CNN for stock prediction.

I’ll finish with 2 more points:

  1. Apart from GAF there are other methods for converting time series into images, for example Markov Transition Field (MTF, Ref #9) and Recurrence Plot (RP, Ref #10). We can use MTF and RP images (both the prices and the indicators) to predict the next day's prices.
  2. There are other methods for using CNN to predict stock prices without involving images. The stock prices (and their indicators) remain as time series. See Ref #11 first, then #12 and #13. The time series is cut at different points and converted into a matrix, like below.

If you Google "using CNN for predicting stock prices", the chances are it is this matrix method that you will find, rather than the image approach. Because this matrix method uses X and y (input variables and target variable), we can also use other ML algorithms, including classical algorithms such as Linear Regression, Decision Trees, Random Forest, Support Vector Machines and Extreme Gradient Boosting.
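Here is a minimal numpy sketch of that matrix method: each row of X is a window of past prices and y is the next day's percentage change. The 20-day window length is an arbitrary choice for illustration; the resulting X and y can feed a 1D CNN or any of the classical algorithms above.

```python
import numpy as np

def make_windows(prices, window=20):
    """Cut a price series into overlapping windows (X) and next-day returns (y)."""
    prices = np.asarray(prices, dtype=float)
    X, y = [], []
    for i in range(len(prices) - window - 1):
        X.append(prices[i : i + window])
        # Target: percentage change of the day after the window.
        y.append(prices[i + window + 1] / prices[i + window] - 1)
    return np.array(X), np.array(y)

# Toy price series for illustration.
prices = np.cumsum(np.random.randn(500)) + 100
X, y = make_windows(prices, window=20)
print(X.shape, y.shape)  # (479, 20) (479,)
```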

The third method is using CNN-LSTM. In this method the local perception and weight sharing of CNN is used to reduce the number of parameters in the data, before the data is processed by LSTM (Ref 14).

So there you go, there are 3 ways of using CNN for predicting stock prices. The first one is using images (GAF, MTF, RP, etc), the second one is converting the time series into X and y matrix and the third one is by putting CNN in front of LSTM.

References:

  1. Encoding time series as images, Louis de Vitry (link).
  2. Convolutional Neural Network for stock trading using technical indicators, Kumar Chandar S (link)
  3. Imaging time series for classification of EMI discharge sources, Imene Mitiche, Gordon Morison, Alan Nesbitt, Michael Hughes-Narborough, Brian G. Stewart, Philip Boreha (link)
  4. Keisan online calculator (link)
  5. Sensor classification using Convolutional Neural Network by encoding multivariate time series as two dimensional colored images, Chao-Lung Yang, Zhi-Xuan Chen, Chen-Yi Yang (link)
  6. Spatially Encoding Temporal Correlations to Classify Temporal Data Using Convolutional Neural Networks, Zhiguang Wang, Tim Oates (link)
  7. Imaging Time-Series to Improve Classification and Imputation, Zhiguang Wang and Tim Oates (link)
  8. How to encode Time-Series into Images for Financial Forecasting using Convolutional Neural Networks, Claudio Mazzoni (link)
  9. Advanced visualization techniques for time series analysis, Michaël Hoarau (link)
  10. Financial Market Prediction Using Recurrence Plot and Convolutional Neural Network, Tameru Hailesilassie (link)
  11. How to Develop Convolutional Neural Network Models for Time Series Forecasting, Jason Brownlee (link)
  12. Using CNN for financial time series prediction, Jason Brownlee (link)
  13. CNNPred: CNN-based stock market prediction using several data sources, Ehsan Hoseinzade, Saman Haratizadeh (link)
  14. A CNN-LSTM-Based Model to Forecast Stock Prices, Wenjie Lu, Jiazheng Li, Yifan Li, Aijun Sun, Jingyang Wang (link)
  15. StockCharts.com (link)

28 January 2022

Feature Importance

Filed under: Data Warehousing — Vincent Rainardi @ 9:15 am

Quite often in machine learning we would like to know which variables have the most impact on predicting the target variable. Knowing which product features affect the customer's decision to buy the product is critical for a consumer goods manufacturer. Knowing which financial ratios affect the stock price is very important not only to fund managers, but also to the board of directors. Imagine for a second that you were able to know that, out of 10 different factors, there are 2 factors which have a lot of influence and 1 factor with minor influence, like this:

That would be very useful right? Some people would do anything to get this knowledge.

Well, the good news is: it is possible. In fact, we have been doing it for many decades, since before machine learning. But with machine learning it got better.

Like many things in machine learning, the question is usually: do we have the data? When I first studied ML, I always thought "Do we have the algorithm?" or "Which algorithm should I use?" (link) But after studying many algorithms, I'm now convinced that there will be a suitable one. Given that we have deep learning, which can find patterns in any data, it's quite unlikely that we don't have an algorithm. Even if there isn't one (just for argument's sake), we can make one. Either by enhancing an existing algorithm or by combining several.

No, the question in machine learning is usually: do we have the data? And in this case, yes we do. I work a lot with ESG data (environment, social, corporate governance), such as carbon emissions, air pollution, customer satisfaction, employee well-being, board composition and executive remuneration. There are hundreds of ESG factors for every company and we have their share price (they are listed companies). Naturally, the question everyone wants to know the answer to is: which ESG factors affect the share price most? And not only that, for every company we have many financial ratios, such as return on equity, net profit margin, debt to total assets, earnings per share. Which ratios affect the share price the most?

Different industry sectors have different variables. Imagine the variables in the retail sector, health care sector, oil, banking, etc. They all have different variables, and they all have different target variables. And yes, we can find out which variables affect the target variable the most. So yes, we do have the data!

So it’s totally worth it to spend an hour reading and learning about this. The benefit for us is enormous. And once again, it is called Feature Importance (in machine learning features means input variables).

The best starting point I have found is to read what Jason Brownlee wrote on this topic: link. In that article he explains how to get the feature importance using various algorithms such as Random Forest, Decision Trees, Extreme Gradient Boost, Linear Regression, and Permutation. Both for regression (predicting a value) and classification (grouping data).
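To give a flavour of what that code looks like, here is a minimal scikit-learn sketch on synthetic data: the impurity-based importance of a Random Forest, and the permutation importance of a K Neighbours regressor. The dataset and parameters are made up for illustration, not taken from Jason's article.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression dataset: 10 features, only 5 of them informative.
X, y = make_regression(n_samples=1000, n_features=10, n_informative=5, random_state=1)

# 1. Impurity-based feature importance from a Random Forest.
rf = RandomForestRegressor(random_state=1).fit(X, y)
print("RF importance:", rf.feature_importances_.round(3))

# 2. Permutation feature importance using K Neighbours Regression.
knn = KNeighborsRegressor().fit(X, y)
perm = permutation_importance(knn, X, y, n_repeats=10, random_state=1)
print("Permutation importance:", perm.importances_mean.round(3))
```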

Yes, there are 2 different things in Feature Importance:

  1. Which variables affect the target variable the most (as I explained above), and
  2. Which variables affect the classification the most i.e. grouping the data points (putting similar data points into the same group)

In certain industry sectors #1 is more important, but in other industry sectors #2 is more important. That’s why we need to understand both.

If you read Jason’s article above, you’ll find that for each algorithm he provides a Python code to calculate the feature importance (mostly using Scikit Learn), as well as the result (both the numbers and the bar charts). In order to be comparable, we can scale those numbers to between 0 and 100 like this:

So if the output of a model is between 0 and 1, we multiply by 100. If it is between 0 and 1000, we divide by 10. This way all model outputs will become between 0 and 100.

In the above tables, the legend for the columns are as follows:

  • XGB = Extreme Gradient Boosting
  • RF = Random Forest
  • DT = Decision Trees (more specifically CART, Classification and Regression Trees)
  • LR = Linear Regression or Logistic Regression (depending whether it is for regression or for classification), using the coefficient for each feature
  • Perm = Permutation (using K Neighbours Regression then calculating the permutation feature importance)

But the above is still not easy to compare between models. To make it easier to compare between models, we can express them as percentage of total. So we first calculate the column totals like this:

And then we express each cell as a percentage of total like this:
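That normalisation is a one-liner in pandas. Here is a small sketch with made-up numbers (the column names follow the legend above):

```python
import pandas as pd

# Hypothetical feature importance scores, one column per model, scaled to 0-100.
scores = pd.DataFrame(
    {"XGB": [5, 10, 25, 40, 10, 8, 2],
     "RF":  [4, 12, 22, 38, 12, 9, 3],
     "LR":  [6, 30, 15, 20, 10, 5, 14]},
    index=[f"Factor {i}" for i in range(1, 8)],
)

# Express each cell as a percentage of its column total.
pct_of_total = scores.div(scores.sum(axis=0), axis=1) * 100
print(pct_of_total.round(1))
```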

Now it is easier to see. We can go through line by line, and spot the anomalies like this:

We can see above that for regression, the Linear Regression considers Factor 2 and Factor 7 as much more important compared to the other models.

On the classification side, we can see that Decision Trees consider Factor 7 as more important compared to the other models, but Factors 2 and 6 as less important. And the Logistic Regression model considers Factor 2 as more important than the other models do, but Factor 4 as less important.

It is usually easier to spot the anomalies if we put them into 3D bar chart like this:

Here on the Regression chart (see the red arrows) we can see the 2 yellow bars for Linear Regression model on factor 2 and 7 are higher than other models.

And on the Classification chart (see the green arrows) we can easily see the bars which are higher than other models. But (see the black arrows) it is more difficult to spot bars which are lower than other models.

The question here is: which model is the best one to use? Well, each model has a different way of deciding which variables are important, as Jason explains in his article. Random Forest, for example, uses the decrease in impurity to determine which variables are more important. Specifically, for every variable, the impurity decrease is summed across every tree in the forest, accumulated every time that variable is used to split a node. The sum is then divided by the number of trees in the forest to give an average. By impurity here I mean Gini. Similarly, the Decision Trees (CART) and XGB models also use the impurity decrease, but in a slightly different flavour because the ensemble technique used in those models is different.

So it is not about which model is the best to use. Rather, we have to choose a few models which cover different opinions. The wider the range of opinions, the better. So we need to choose models which contrast with each other. For example, for regression we can choose Random Forest and Linear Regression (coefficients), because Linear Regression points out that Factors 2 and 7 are important, whereas the other models don't.

For classification we can use XGB, Decision Trees and Logistic Regression, because DT and LR have different opinions from XGB with regard to which factors are important. We know from XGB that Factor 4 is the most important, followed by Factors 3, 5 and 6. But DT is saying that Factor 7 is important too. And LR is saying that Factor 2 is important too. This way we can examine Factors 2 and 7 to see if they are significant or not. So the most important thing in choosing the models is to get different opinions about which factors are important.

25 January 2022

Data Lake on Oracle

Filed under: Data Warehousing — Vincent Rainardi @ 8:33 am

A modern data warehouse consists of a data lake and a dimensional data warehouse (link). The data lake is just a collection of files, but there is a SQL layer on top, enabling us to query the data lake using SQL; this is called a Data Lakehouse (link). I knew that Databricks and Snowflake have this feature (link), but it turns out that Oracle has it too (link). It is based on Oracle external tables, which create a SQL layer on top of data lake files (link). It works not only on Oracle Cloud files, but also on S3, Azure and GCP files.

Oracle Autonomous Database is a cloud database that uses machine learning to automate database tuning, security, backups, updates and other routine database management tasks which are usually performed by a DBA (link). Data Catalog is a metadata management service that provides an inventory of assets, a business glossary and a metadata store for data lakes. The Autonomous Database can automatically create external tables for thousands of files harvested by the Data Catalog (link). This way, users can immediately query all the data in the data lake using SQL. For each file in the data lake the Data Catalog creates an entity automatically, without us having to manually define the columns and data types. And this entity metadata is automatically synchronised with the files in the lake (link), so the Autonomous Database always reflects what's in the data lake.

This way, the data lake objects/files are harvested into the Data Catalog, and the Data Catalog is automatically synced into the Autonomous Database, enabling users to query the data lake using SQL, as Martin Gubar describes in the Oracle blog (link):

This makes Oracle a good platform for creating a modern data warehouse, i.e. a data lake + a dimensional data warehouse. Oracle's relational database engine is legendary, making it a very good platform for hosting the dimensional data warehouse (star schema). And the Autonomous Database with the Data Catalog described above can provide a SQL-queryable layer on top of data lakes in AWS S3, Azure, GCP, or Oracle Cloud Infrastructure (OCI, link).

For data movement, on OCI we can use Data Integration to ingest data into the data lake, and into the warehouse. It is a cloud-based, managed service (link) with rich transformation, data explorer and data profiler features. We can monitor the running pipelines using various metrics, notifications or alarms. We can use Data Flow to create and run Spark applications (link). It's a fully managed service so we don't need to create Spark clusters or worry about the underlying infrastructure. We can use Data Transfer to migrate large volumes of data into the data lake (link).

Like Azure, AWS and GCP, OCI also provides a machine learning platform. It is called Data Science. We can use it to build, train and deploy ML models (link). It is fully managed and serverless so we don't need to worry about the underlying infrastructure, such as updating Python library dependencies, Conda or JupyterLab. Anomaly Detection enables us to identify undesirable events in real time (link), such as in IoT. It automatically analyses the dataset to build ML models for multivariate signals by considering their correlations.

For BI & reporting, we can use Analytics Cloud to explore, load, model and analyse data by creating reports and dashboards (link). We can do development operations such as release pipelines, build, deployment approval, code repository and continuous integration using a tool called DevOps (link).

So all in all, OCI provides an end-to-end platform to create a modern data warehouse, i.e. a data lake and a data warehouse. I'm glad to learn this today, because all this time I was wondering what Oracle has been doing in this era of cloud data warehousing. Azure, Amazon and Google are all offering cloud data platforms for data lakes and data warehousing – what about Oracle, the leader in relational databases? It turns out that Oracle also has a good cloud data platform offering.

20 January 2022

Data Lake Architecture

Filed under: Data Warehousing — Vincent Rainardi @ 8:47 am

When we google “data lake architecture” we usually get a diagram showing the functions of a data lake (what it is used for), but not showing what’s inside, like this:

But what we really want to know is what is inside that Data Lake box, not the inputs or the outputs. When we try to find out what’s inside the lake, we’ll find something like this:

They show the layers and functions of a data lake, but not the internal mechanism. Not what is really inside a data lake. Note that instead of raw, clean and curated layers, some people call the layers differently, such as:

  • Bronze, silver, gold
  • Temporal, raw, trusted, clean
  • Raw, core, data mart
  • Transient, raw, trusted, refined
  • Landing, raw, standardised, curated
  • Raw, staged, analytics

After reading a few books on data lakes we’ll find that a data lake is just a collection of files. That’s it. Seriously? Are you telling me that a data lake is just a collection of files? Yup. Wow, how disappointing! That’s what I initially thought.

Everybody agrees that data lakes should be in the cloud. In other words we should use Amazon S3 (AWS), Google Cloud Storage (GCS) or Azure Storage (ADLS) to create a data lake. But it is still just a collection of files.

Then I saw this diagram in Holt Calder’s article (link):

So we can query a data lake in S3 / GCS / ADLS by creating Snowflake external tables on top of it. This way we can use SQL to query the data lake. You can read how to do it on Snowflake documentation: link.

Creating external tables is an old technology. I remember creating an external table in Hive like 10 years ago. Yes, in 2012! At that time there were no data lakes. I was working for an investment bank and all we had was Hadoop (HDFS). On top of this distributed file system we used Apache Hive so we could query the data using SQL. See the Hive documentation on how to do this: link.

Then recently, when working with Databricks, I read Armbrust's seminal paper on Delta tables: link. Well, not just Michael Armbrust; 10 other people wrote it too. In that paper they explain how they store the transaction log, checkpoints and change metadata directly within the object store, like below:

It is similar to Apache Hive ACID, but in Hive ACID the metadata is stored in an RDBMS, not in HDFS or object stores. Armbrust calls this approach Delta Lake. It is similar to Apache Hudi and Apache Iceberg.

In this paper (link) they explain the mechanism in great detail, including the storage format, the log checkpoints, the isolation levels, adding log records atomically, writing transactions and reading from tables.

Until today, whenever I query any table in Databricks, I'm amazed by the wonderful mechanism underneath it. Earlier I just took it for granted, didn't really think about how it works, treating it as just another RDBMS. Well, that's because we can do SQL select, update and delete statements, etc. (link, link) just like in Oracle or SQL Server. And yes, we can do JOINs too! (link). But underneath it's all stored as files in the data lake. What a wonderful mechanism!
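Here is a small PySpark sketch of that mechanism as seen from the user's side, assuming a Databricks (or other Delta-enabled Spark) session where the `spark` object already exists; the path is hypothetical.

```python
from pyspark.sql import functions as F

# Write a dataframe as a Delta table: under the hood this is just Parquet
# files plus a _delta_log folder of JSON transaction logs in the lake.
df = spark.range(1000).withColumn("amount", F.rand())
df.write.format("delta").mode("overwrite").save("/mnt/lake/curated/sales")

# Read it back and query it like any other table.
sales = spark.read.format("delta").load("/mnt/lake/curated/sales")
print(sales.filter(F.col("amount") > 0.5).count())
```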

So Data Lake these days is not just in the cloud, but also represented as tables. We can query it using SQL statements. That is the internal mechanism of cloud data lakes. Either using Snowflake external table, or using Delta table in Databricks.

15 January 2022

Do we still need a data warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 9:07 am

Yesterday someone asked me what I would include in an "introduction to data warehousing" (including data marts). I have been asked that question a lot over the years. This is because I wrote a book on data warehousing, and a blog too. People contacted me through my blog and asked "I want to learn data warehousing, what should I study?" In 2008 to 2010 the answer was not difficult: "Just read my book". But today, in 2022, there has been so much change in data warehousing that what I wrote in 2007 is no longer enough. It still provides the basics, but there are so many new things going on.

For a start, the architecture is now different. See the architecture for a modern data warehouse here: link. Then we have a data lake. See the architecture of data lake + data warehouse here: link. And now we have a Data Lakehouse, a data lake which has data warehouse features: link.

So what would I write or say to someone who is completely new to data warehousing and wants to learn about it? Do I say "Don't learn it"? I thought about this answer for a long time actually. Is data warehousing no longer relevant in 2022? Has it been completely replaced by data lakes? Can it still be useful in the era of AI / machine learning?

Still Relevant

After considering many factors and weighing up everything, I am now sure that the answer is "It is still relevant". You still need to learn data warehousing. Why is that? For a start, a data warehouse integrates data from many different sources. Does a data lake do that? No. A data lake only puts them in one place but does not integrate them. The dimensional model is the right way to integrate customers, to integrate products, to integrate sales from multiple sources (those who have experienced it will understand this). An MDM integrates the master data, but not the transaction data. You need a DW/mart to integrate both.

Secondly, a data warehouse is easy to use. It uses a star schema which provides the granularity at the base level, and yet it is easily rolled up to higher levels. And you can do this roll up on any attribute. It has developed over 20 years, it’s mature and has been implemented on all use cases, thanks to Bill Inmon, Ralph Kimball, Margy Ross, Bob Becker, Joy Mundy, Warren Thornthwaite, Chris Adamson, Nicholas Galemmo, Claudia Imhoff, and thousands of other dimensional practitioners, see my Who’s Who in Data Warehousing here, and all data warehousing authors here.

In hundreds of thousands of companies throughout the world, big and small, data warehouse is the back end of millions of reports and analytics. It supports the business intelligence and now machine learning too. You need to have a working knowledge of data warehousing to work in the reporting team in these companies (almost every company has one). Not just modelling techniques, but also loading techniques. And reporting techniques. And infrastructure knowledge too i.e. database, security, DevOps.

Replaced By Data Lakehouse?

Is it true that data warehousing can be replaced by a Data Lakehouse? I said that here. But I was wrong. Yes, a data lakehouse has ACID properties. But that does not make it a star schema. It only makes it a database with lots of disconnected tables. Can you join the tables? Of course you can, but it is not as slick as a star schema. And nowhere near as integrated (e.g. customers, products, securities). Those who have tried different approaches to data integration will appreciate the power of the dimensional approach.

Can we build a dimensional model in a data lakehouse? Of course we can, but it is not the right platform. It is far better to build the dimensional model on a relational database. Because physically a data lakehouse is just a collection of files, the reason to put a SQL layer on the top is to enable us to access the files easily. But a data lakehouse was not designed to do join, indexing, filtering, etc. that a dimensional model requires. Dimensional model was designed and born in a relational database platform, which supports it very well.

A Data Lakehouse can’t replace a dimensional data warehouse. Because it doesn’t have integration capability and it’s not ease to use/query. If what you need is to put data from different systems in one place, then yes use a Data Lakehouse (which is better than a data lake, see here). But the data will be disintegrated. You won’t have a single customer table. Or a single product table. There won’t be conversion between different measures (say sales from different systems) to integrate them into one table.

Reporting is all about integrating data from various data sources. If you only have one data source, then you can just report from that single source directly. It is because your data is in different places that you need a data warehouse. A data lake or a data lakehouse does not “stitch” the data. Inside a data lake/lakehouse the data is still disintegrated.

Please read James Serra article on this point here: link. I completely agree with James that for small amount of data (say 10 tables), or if the users are all data scientists, or to build a POC or to get a quick win with a report/dashboard, then you can use a Data Lakehouse.

Don’t worry if you don’t know what a Lakehouse is, many people in data warehousing and BI don’t. For that I would recommend you to read the CIDR 2021 paper by Ambrust et al: link. This paper explains the Lakehouse architecture and how it addresses some of the data warehousing challenges.

What to study in data warehousing?

So, these days, what does one need to study in data warehousing? (I include data mart)

First of all, they need to understand the problem, preferably illustrated using several cases. The problem of how difficult it is to report on or analyse data: because the data needs to be transformed, because the data is not conformant to one another, because the data is located in different places. They need to understand the purpose of the whole exercise, i.e. a data warehouse/mart makes reporting seamless and easy because the data is integrated and conformed. The purpose is to enable users to do analysis and reporting.

They can call it BI, AI, or whatever, but they must realise that it is only possible because all the data they need is integrated in one place. Yes, a BI system can connect to the US, Japan and France databases to get the sales figures, but that is not integration. They have to realise that there are FX rates, that one set of data is quarterly while another is daily or monthly, that different data uses different product codes, and different hierarchies. How do you integrate them? They must appreciate what a data warehouse (and all the analysts, developers and testers behind it) goes through to integrate the sales data worldwide, the AUM data worldwide, the customer data worldwide, the financial data worldwide. Not "all the data in one place", but "all the data integrated in one place". The difference between the two is not one little word. The difference, my friends, is a million dollars and 2 years of hard work. It's equally important to understand that if the client is small enough and lucky enough to have just one currency, one office and one system, then they don't need a data warehouse. Report directly from that system. Because of their size, there is no way to justify a million dollar spend, or having 5-10 people in IT working on this for 2 years. If the student understands this perspective, they are on the right track.

Second, they need to understand why we need to use dimensional modelling. Why not use relational models, or transactional reporting. Why not use a data vault or data mesh. Why not use data lake or data lakehouse. They need to understand which architecture is suitable for which cases. If you just want to analyse Twitter big data, then there is no need to build a DW or mart. Forget dimensional modelling. You need to use a data lake.

Until they are able to answer "which cases are suitable for data warehousing and which are not", the learner should not move to the next step. It is better to understand the why and not the what, than the opposite. As a learner, ask yourself "why don't I use a lake or a relational model?" If you can't answer this, keep learning at this second stage. Keep reading and thinking until you can answer that question.

Third, study the architecture. Start with the architecture of a modern data warehouse (i.e. a lake + a warehouse): link. Carefully look at where ML, BI, IoT (sensors), NoSQL, MDM, the DW and the marts are positioned in this architecture. Then study dimensional modelling concepts: link. This is a vast topic and you will need at least 3 months to understand all the details. So please don't get trapped in the details. Instead, keep it to the basics, which you can learn in a week. And as a beginner, limit yourself to only what Kimball wrote (the Kimball Group, not just Ralph Kimball). I can't say this enough: avoid the advanced stuff and keep it to the basics, i.e. fact tables, dimension tables, SCD, hierarchies, and that's it.

After studying the basics of dimensional modelling, ask yourself: why use star schema? What is the advantage compared to other models? (i.e. snowflake, relational, data vault, ODS) Until you can answer this question you shouldn’t move on to the next step. There is no point understanding the what, if you don’t understand the why. Keep reading until you can answer that question: why do you have to arrange the data in dim and facts, why not relational model, snowflake model, data vault model, or ODS model?

So those are the first 3 things to learn in data warehousing: the purpose, the architecture, and dimensional modelling.

How DW is used and populated

After those first 3 things, the next stage is to understand how a DW/mart is used: BI, reporting & analytics. Not how to get the data out of the DW/mart, but how the reports will be used. Many people are against this; they feel that people in IT don't need to know the business. On the contrary, we absolutely need to know the business: what the data will be used for. The whole design is based on how it will be used. So if you don't know the business your DW will be poorly designed, and your reports/dashboards too.

15 years ago, BI & reporting were almost the whole purpose of data warehousing. How about today, are they still relevant? Does one still need to learn BI & reporting tools such as Power BI, Tableau and Looker? Yes, today BI is still a major usage of a data warehouse. And yes, one still needs to learn BI/reporting tools. But I would argue that, considering there are so many things to learn in a modern DW, BI/reporting tools can be considered a separate skill to master. Ditto ML (AI): more and more the DW is used for ML, but does that make ML one of the subjects one needs to learn in DW? No, it is a separate subject. So when one learns data warehousing, I would recommend not learning BI tools and ML at the same time. Finish learning DW first (see below), then do BI or ML (not both; pick only one because each is a very large subject to learn).

Then you need to understand the various techniques of data loading: ETL, ELT, ingestion, ETLTL, streaming, (small) batching, and so on. Knowing the model is one thing, but loading data into that model is a completely different thing. There are various tricks and methods that we need to know. Staging area, data quality, alerts, monitoring, error handling – all these need to be built.

A modern data warehouse

In a Modern Data Warehouse, we have a lake and a warehouse (link). So not only do you need to know how to load data into the dimensional model, but also how to ingest data into the data lake (preferably a data lakehouse, along with loading the metadata too), and how to get the data out of the lake into the warehouse. So it is critical that you know the ins and outs of data lakes: ADLS2, ADF pipelines, Synapse, Databricks, Python notebooks, Spark, Linked Services, Key Vault, APIs, storage accounts, resource groups (if you use Azure; otherwise the corresponding items in Redshift or BigQuery). This is where an old-school DW practitioner usually gets stuck: a modern DW demands cloud technology, and they are stuck in the old SQL world. They are not flexible enough (or humble enough) to adapt and learn the cloud and the data lake technologies.

A modern data warehouse can also have machine learning, streaming data and master data management. Machine Learning is so popular these days, every company is using it. It's no longer just about BI, it's about AI. So you need to know how those data scientists are using the data for making predictions, which is completely different from BI and reporting. Streaming data like social media feeds, IoT feeds (Internet of Things/sensors), and audio visual streams are quite common these days (yes, even in banking). For this we use Kafka or Azure Stream Analytics (or Kinesis/Flink in AWS). Data mastering is sometimes outside a modern DW, but sometimes it is included in the ecosystem.

Data Governance & DevOps

A DW ecosystem must always be accompanied by a data governance programme. Yes, the data quality process is built in (into the modern DW ecosystem), but Data Governance is completely different. Data Governance includes data definition, data ownership, data mastering, data lineage, and data quality. Whether you are doing a lake, a DW, an ODS, a Data Vault or even a transaction system database, you need to understand Data Governance. Otherwise what you are doing doesn't make sense. There is no point creating a data warehouse if we don't understand the principles of Data Governance.

A data warehouse without Data Governance may be able to deliver in the short term, but not in the long term. So we need to know it. Data Governance is a vast world with a myriad of tools, so as a beginner you will need to limit yourself to just the essentials. Just try to understand the concepts, and forget about the tools and how to use each of them.

Any IT development requires DevOps, and data warehousing is no exception. DevOps (development operations) means development processes and production processes. Development processes include coding, automated testing, pull request, branching, release pipelines, as well as the infrastructure such as databases, servers, service accounts, resource groups, storage accounts and networking. Production processes include change control, monitoring data ingestion, setting up alert, error logging, and managing user access. Knowing DevOps would help a lot in data warehousing.

Recap

To recap:

  • I want to learn data warehousing, what should I learn?
    The purpose, the architecture and dimensional modelling.
    Then the business knowledge, BI/reporting and data loading.
    Then data governance and DevOps.
  • Is learning data warehousing still relevant today?
    Yes, learning data warehousing is still relevant today because of its capability to integrate data, it is easy to use and because thousands of companies throughout the world have data warehouses.
  • Will data warehousing be replaced by data lake/lakehouse?
    No, because a data lake/lakehouse doesn’t have integration capability, it is not easy to use/query, and it is not a suitable platform for dimensional model.
    There are business cases which are suitable for data lake (ones which don’t require data integration), and there are business cases which are suitable for dimensional data warehouse. They have different purposes and capabilities. A modern DW includes a data lake.

In the meantime, if you are new in data warehousing, or looking for an inspiration, read the Who’s Who in Data Warehousing: link. Even for an experienced veteran, it is still a good read. You would recognise all the names, which makes it interesting.

If you are a learner, I wish you all the best in your data warehousing journey. Enjoy your learning experience, that’s the most important thing. Not the results. Or achievements. Any questions, just ping me at vrainardi@gmail.com, I’ll be happy to answer them. I now spend half of my time in machine learning, but still work in data warehousing. And still learning data warehousing.

31 December 2021

Why do we use Python in machine learning?

Filed under: Data Warehousing — Vincent Rainardi @ 9:02 am

Yesterday a friend asked me why we used Python in machine learning. He prefers to use Node.js for application development, and he can code in other languages too including C++.

I replied that we use Python in machine learning (ML) because:

  1. ML libraries such as Scikit Learn, Keras, XGBoost are mostly available in Python (some are available in R), and we need those libraries to build and train ML models.
  2. Code examples in ML are mostly in Python e.g. in Kaggle, OpenAI, Medium, Towards Data Science, Analytics Vidhya, ML Mastery.
  3. ML platforms like Colab, Dataiku, Google AI, OpenAI Gym, Data Robot, Azure ML use Python. Some use R but none uses Node.js.

And we both also observed that universities these days use Python more, compared to other programming languages. And ML courses on Coursera, Udemy and Udacity also use Python (again, some use R but it is not as popular as Python).

Let me qualify some of the above remarks, with regards to R:

  • Keras and Tensorflow are available in R too: link.
  • You can use R too in Google Colab: link.
  • Dataiku code recipes are available in R too: link.
  • Databricks ML platform is available in R too: link.
  • You can install OpenAI Gym in R too: link.

So yes you can still do ML in R today. But it is a losing battle. More and more examples and academic papers are available in Python, compared to R.

And yes there is no chance for other languages, such as C++ or Node.js. We can’t use them in ML.

Change or Die

IT in general is a very dynamic world. It keeps changing. Particularly programming (development). I started programming when I was 16, using the Basic language. And dBase III+. At uni (around 19-23 years old) I mainly used Pascal, and I learned C and Fortran too. I had difficulty switching to Assembler, so I took a few months' course. At uni I also took a few odd jobs/projects, using FoxPro and Clipper.

Then I joined Accenture (Andersen Consulting back then) where I used COBOL on AS/400, C++ and PL/SQL. Then I moved to the UK (a company called Paxar in Harlow, now Avery Dennison) and used C#.NET and T-SQL. I then joined Lastminute.com and used Java and PL/SQL. Then in subsequent jobs I used MDX, T-SQL, Python, JavaScript and R.

Yesterday another friend asked me how I changed to Python for ML recently. I replied that I have jumped many times in the past, so the last jump into Python was not difficult. You can see above that I have jumped from one language to the next constantly. I said to my friend (he's a SQL DBA) that the first jump, i.e. into Assembler, was very difficult. But after that it was easy. I jumped into C#, Java, R and Python without difficulty.

And we will have to change again in the future. In the IT world we have to change constantly, not just in terms of languages but also in terms of platforms. For example, we all had to move to the cloud. SQL Server, my DBA friend said yesterday, has moved to the cloud. Ah yes, we have all moved to Azure or AWS.

So yes, I’m afraid in the IT world we have to change, or we die.

Best Language for Machine Learning

I'll close this article with this: there is a survey on LinkedIn this week by Zakash (link) about which will be the best programming language for Data Science and Machine Learning in 2022. The result is overwhelmingly Python (94%). Yes, it is only 136 votes as of today (31st Dec, 9:15am), but click the link above to see the result after a few more days; I am sure it will still be overwhelmingly Python.

29 December 2021

Foreign Keys in Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:47 am

I have written about whether we need to create a physical primary key in the fact table or not (link), including whether that primary key needs to be a single column or a composite. A couple of weeks ago someone asked me whether we need to physicalise the foreign keys in the fact tables or not. This is an interesting topic because we all know that in Kimball dimensional modelling we have surrogate keys in the fact tables, but it has long been debated in the data warehousing community whether the foreign key constraints on those surrogate keys should be physicalised or not.

The thinking amongst many data warehousing practitioners is that the surrogate keys are fully controlled by the dimensional business key lookup process, so the foreign key constraints don't need to be physicalised. Besides, physical constraints would slow down the inserts into the fact tables, and the fact table insert is one of the tasks in data warehouse loading which needs to be kept as efficient as possible because it is resource intensive: fact tables can be very large. That is why we employ techniques such as partition switching, disabling indexes, etc.

The point is, if the dimensional key lookup process ensures that all the surrogate keys (SKs) in the fact tables are valid and correct, then why do we need to physicalise the foreign keys? For many DW practitioners, me included, referential integrity (RI) in the data warehouse should not take the form of physicalised foreign keys, but should be maintained by the dimensional key lookup process (link, link point 5). This lookup process not only ensures that the SKs in the fact tables are valid, but also that they are correct.
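
To make that lookup process concrete, below is a minimal sketch of a dimensional key lookup in Python using pandas. The table and column names (CustomerID, CustomerKey, the unknown-member key 0) are hypothetical, and in a real warehouse this step would typically be done in the ETL tool or in SQL:

    import pandas as pd

    # Hypothetical staged fact rows carrying the business (natural) key.
    stage = pd.DataFrame({"CustomerID": ["C1", "C2", "C9"], "Amount": [100, 250, 75]})

    # Hypothetical customer dimension with surrogate keys.
    dim_customer = pd.DataFrame({"CustomerKey": [1, 2, 3],
                                 "CustomerID": ["C1", "C2", "C3"]})

    # Look up the surrogate key; rows with no match get the "unknown member" key 0,
    # so every SK that lands in the fact table is guaranteed to exist in the dimension.
    fact = stage.merge(dim_customer, on="CustomerID", how="left")
    fact["CustomerKey"] = fact["CustomerKey"].fillna(0).astype(int)
    fact = fact[["CustomerKey", "Amount"]]
    print(fact)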

Practitioners Discussion

Nicholas Galemmo, author of the book “Mastering Data Warehouse Design”, link:
In general, the process of assigning surrogate keys in the ETL process is more than sufficient to ensure FK RI. Enforcing constraints in the database is redundant and not necessary. Doing so slows down the load process. However, some BI tools make use of FK declarations to ease defining relationships between tables. Therefore it is common practice to declare FKs, but not implement constraints to enforce them.

Hang from Brisbane, Australia, link:
Since ETL is a tightly controlled process, any database-level RI requirements in a data warehouse should be superseded by ETL. In an OLTP system, RI constraints are targeted at data entry at the record level and hence can protect data integrity. In a data warehouse, data are loaded in batches by ETL and RI validation is just a bare-minimum part of the whole complex process. RI constraints have performance implications for ETL, and some ETL also has its own pecking order for loading the data, which could require undoing the RI constraints. In my experience, the only value in having database RI is to automatically draw the connection lines between tables in some modelling tools.

Martin Rennhackkamp, link, point 5:
The approach we typically follow is to define the foreign key constraints in the model, and implement and activate them in the development, testing and acceptance environments. However, once you are sure that the foreign key constraints are properly adhered to by the ETL processes, you do not implement and enforce them in the database in the production environment.

What Ralph Kimball Says

One of the old sayings in data warehousing is, “when in doubt, look at what Ralph Kimball says”. And in this case, Ralph Kimball and Joe Caserta say this in their 2004 book, The Data Warehousing ETL Toolkit (link), page 212:

——————————————————

There are three main places in the ETL pipeline where referential integrity can be enforced:

  1. Check the “load tables” before loading them into fact tables:
    Careful bookkeeping and data preparation just before loading the fact table records into the final tables, coupled with careful bookkeeping before deleting any dimension records.
  2. Check while loading into fact table using “DBMS RI Check”:
    Enforcement of referential integrity in the database itself at the moment of every fact table insertion and every dimension table deletion.
  3. Check the integrity of tables at later times:
    Discovery and correction of referential integrity violations after loading has occurred, by regularly scanning the fact table looking for bad foreign keys.

Practically speaking, the first option usually makes the most sense. One of the last steps just before the fact table load is looking up the natural keys in the fact table record and replacing them with the correct contemporary values of the dimension surrogate keys.

The second option of having the database enforce referential integrity continuously is elegant but often too slow for major bulk loads of thousands or millions of records. But this is only a matter of software technology.

The third option of checking for referential integrity after database changes have been made is theoretically capable of finding all violations but may be prohibitively slow. But perhaps the query can be restricted only to the data that has been loaded today. But this is a sensible approach that probably should be used as a sanity check even if the first approach is the main processing technique.

——————————————————

Bear in mind that Ralph Kimball and Joe Caserta wrote this 17 years ago, in 2004, when database technologies were not as advanced as they are today. But it looks like they were in favour of option 1, i.e. maintaining RI in the ETL.
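
As a small illustration of option 3, the post-load check can be as simple as an anti-join from today's fact rows to the dimension. The sketch below uses pandas with hypothetical table and column names; in practice the same check is usually written as a LEFT JOIN in SQL, filtered to the rows where the dimension key is not found:

    import pandas as pd

    # Hypothetical fact rows loaded today and the customer dimension.
    fact_today = pd.DataFrame({"CustomerKey": [1, 2, 99], "Amount": [100, 250, 75]})
    dim_customer = pd.DataFrame({"CustomerKey": [1, 2, 3], "CustomerName": ["A", "B", "C"]})

    # Anti-join: fact rows whose surrogate key has no matching dimension row.
    orphans = fact_today.merge(dim_customer[["CustomerKey"]], on="CustomerKey",
                               how="left", indicator=True)
    orphans = orphans[orphans["_merge"] == "left_only"].drop(columns="_merge")

    if not orphans.empty:
        print(f"RI violation: {len(orphans)} fact row(s) with an unknown CustomerKey")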

Steve Flynn’s Advice

One of the best pieces of advice I found is from Steve Flynn (link), who says the following (link):

  • At design time foreign key constraints should be used. They guarantee referential integrity when the facts are inserted and will throw an error that the developer/admin can catch and fix before the users report issues with the data.
  • Design the ETL as if the constraints are not there; i.e. follow Kimball’s recommendation to enforce RI during data preparation.
  • If ETL performance is an issue and removing the foreign key constraints will make a difference, then remove the constraints. But test the pre- and post-ETL performance to see what's been gained, and only do this if it makes a real difference and other performance optimisations haven't been tried first or have been ruled out for other reasons, such as cost.
  • This gives the assurance that RI cannot be compromised, even if modifications are made to a complex ETL process, while at the same time providing scope to optimise fact table load speed by disabling the foreign key constraints should it be required.

Conclusion

In my opinion, we do not need to implement RI in the fact tables because:

  1. The SK lookup process in the ETL is sufficient to ensure RI.
  2. In my experience, disabling and enabling FKs before and after the fact load takes a lot of time, and not disabling them makes the load take even longer (see the sketch after this list).
  3. We don’t delete from dimension tables.
  4. The argument that “in SQL Server FKs are used to optimise star queries” is not true. It is the “star join” optimisation which is used, not the constraint (link).
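
For completeness, here is a minimal sketch of what disabling and re-enabling the FK constraints around a fact load might look like from Python against SQL Server. The connection string, the fact table name and the choice of pyodbc are assumptions for illustration; the two ALTER TABLE statements can of course be run directly in T-SQL instead:

    import pyodbc

    # Hypothetical connection string and fact table name.
    conn = pyodbc.connect("DSN=MyWarehouse;Trusted_Connection=yes;")
    cur = conn.cursor()

    # Disable all FK constraints on the fact table before the bulk load.
    cur.execute("ALTER TABLE dbo.FactSales NOCHECK CONSTRAINT ALL;")
    conn.commit()

    # ... run the bulk load of the fact table here ...

    # Re-enable the constraints and re-validate the existing rows (WITH CHECK),
    # so that the constraints are trusted again afterwards.
    cur.execute("ALTER TABLE dbo.FactSales WITH CHECK CHECK CONSTRAINT ALL;")
    conn.commit()
    conn.close()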
