Data Warehousing and Data Science

1 November 2022

Power BI Datamart

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 8:39 am

Like SQL Data Warehouse, Microsoft once again named a product by what it does: Power BI Datamart. It is an Azure SQL database, with a visual query designer so users don’t need to write SQL to get the data out to BI tools, Excel or ML algorithms. It also generates a Power BI dataset automatically, which we can use to create Power BI reports.

Loading data into Datamart

Power BI Datamart uses a built-in Power BI Dataflow for loading data into it. After we create a Datamart, we click on Get Data and use PowerQuery to connect to various data sources such as Azure SQL database, Azure Data Lake Storage, Databricks (Spark), Azure Synapse, HDInsight, Sharepoint, Excel files, text files, JSON files, Parquet files, API, Salesforce, Snowflake, Redshift, BigQuery, Oracle DB, IBM DB2, Teradata, PostgreSQL, MySQL, Access, SAP, Analysis Services and ODBC (link):

The above list of data sources is so comprehensive that I could not believe it at first. This is a game changer. Gone are the days when we struggled in SSIS, Informatica or ADF to load data from those varied sources. PowerQuery has the connectors and interfaces to every single one of those databases, files, data lakes and big data platforms.

We then select the tables, specify transformations and PowerQuery would build the ETL rules in that built-in Dataflow, create the relationships and load the data from those tables in the data source into the Datamart. We can then see all those tables and relationships in the Table Tools tab:

Afterwards, we can create relationships (again) and measures. We can also hide the tables and columns that we don’t need.

In the workspace we automatically get a dataset:

We can schedule when we want to reload data from those data sources into the Datamart (called Scheduled Refresh). And we can also set up incremental load, i.e. only changed rows are loaded into the Datamart, based on the column that we want (known as the "watermark column", link):

Security and access control

Power BI Datamart uses row level security and roles to restrict user access to the Datamart. In the example below (link), the users are restricted so that they can only access rows with category IDs 1 and 2. Once that role is defined, we can assign users to it, as we can see below:

Querying data

As I said at the start of this article, we can use a visual query designer to get the data out without writing SQL (link). As we can see below, the Visual Query Designer is like Markit EDM, SSIS and Informatica.

We can select rows, filter rows, select columns, split columns, remove duplicates, join tables, change data types, transpose table and pivot columns. We can also do error handling, such as removing errors or replacing errors (link). And we can also create custom functions (link).

Of course we can also write SQL:
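
For example, a query along these lines (a sketch only; model.FactSales and model.DimProduct are hypothetical table names, but as described in the SSMS section below, the Datamart tables are exposed under the "model" schema):

-- sketch: total sales amount by product category
select p.ProductCategory,
       sum(f.SalesAmount) as TotalSales
from model.FactSales f
inner join model.DimProduct p on p.ProductKey = f.ProductKey
group by p.ProductCategory
order by TotalSales desc;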

Accessing Power BI Datamart from SSMS

And we can also query the data in the Power BI Datamart using SSMS (SQL Server Management Studio). And using Azure Data Studio too (link):

Not only can we use SSMS for querying data in the Datamart, but we can also use it for adding users and configuring security. We can also use SSMS to look at the query statistics. This is how it looks from SSMS (link):

We can see above that the Power BI Datamart tables are listed as views, under the "model" schema. The relationships in the Power BI Datamart are stored in two views, metadata.relationships and metadata.relationshipColumns, as shown above (link).
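
For instance, to inspect the relationships from SSMS we could run something like this (a sketch, assuming both views sit in the metadata schema as described above):

-- sketch: list the relationships defined in the Datamart
select * from metadata.relationships;

-- and the columns participating in each relationship
select * from metadata.relationshipColumns;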

Conclusion

I think Power BI Datamart is a game changer.

  • It enables us to gather data from databases, data lakes, files and APIs into one data mart.
  • We can reload data every day/week and we can do incremental loading.
  • We can query the data using SSMS (can also do admin stuff).
  • We can build Power BI reports easily (the dataset is automatically created).
  • We can query the data using the visual query designer or by writing SQL.
  • We can easily get the data out to BI or ML tools, and to Excel.

That is a very powerful tool. Anyone in data warehousing and BI should consider this tool. At least take a look.

And I think the name suits it well. The Power BI Datamart is indeed a good data mart. With all its bells and whistles.

17 April 2017

Definition of Big Data and Data Warehousing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 5:38 pm

I’m annoyed that many people label the normal Data Warehousing & BI stuff as “Big Data”. It is not. For me, Big Data is the “Hadoop stuff” (HDFS). If it is in SQL Server, Oracle or Teradata, it is just a normal database, not Big Data.

Big Data is stored on HDFS (Hadoop Distributed File System), not in RDBMS. Oracle, IBM, Microsoft, SAP, Teradata, all use Hadoop to store Big Data. Big Data is queried using MapReduce.

The reason why Big Data can't be stored in an RDBMS is because the format is not tabular. Sometimes it is 2 columns, sometimes it is 200 columns, like Twitter data. The second reason is because it is too big. Sensors can make 100 measurements in a second, and in a year that could be petabytes. Web logs are another example. Tracking the ask and offer price of every transaction in every stock market is another example. Yes, we can put petabytes into SQL Server or Oracle, into Netezza or Teradata, but not at this speed (and, more importantly, not at this price!). Hadoop, on the other hand, is designed exactly to cope with this kind of speed and volume (and price).

Now the usage. What is Big Data Analytics? Big Data Analytics is when we do analytics on Hadoop Data.

Is Fraud Detection Big Data Analytics? Not always. Fraud Detection can be done on a normal Data Warehouse or a database. Is Machine Learning Big Data? Not always. Machine Learning can be done on a normal Data Warehouse or a database. If the Fraud Detection or the Machine Learning is done on data stored in Hadoop, then it is Big Data Analytics.

Even if it is only 200 GB, if it is stored in Hadoop, it is Big Data. Even if the data is 5 Petabyte, if it is stored in an SQL Server database, it is not Big Data, in my opinion.

Even if the data is in tabular format (i.e. columns and rows), if it is stored in Hadoop, it is Big Data. But if it is stored in an Oracle database, it is not Big Data.

Every Big Data architecture that I know uses Hadoop. No company (or government) implements Big Data on an RDBMS, or on non-HDFS files. Every single company, every single Big Data case I read about, implements Big Data on Hadoop. I may be wrong and would be happy to be corrected. If you know a case which implements Big Data on a non-Hadoop system/architecture, I would be grateful if you could let me know, either through comments or via vrainardi@gmail.com.


21 April 2016

Why do we need a data warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 4:59 pm

I have written about this in 2012 (link) but this one is different.

One of the most compelling reasons for creating a data warehouse is to reduce cost. Imagine you were a large investment bank, trading fixed income, equities, commodities and currencies. Every month you need to produce hundreds of reports for your clients, for regulatory bodies and for management. The reporting team currently spends a lot of time creating the reports. They all have access to various systems: risk systems, trading systems (order management), compliance systems, fixed income systems, finance systems. They copy and paste data from these systems into Excel and do some calculations in Excel, which is then used as the data source for making the reports in PowerPoint and PDF. Let's find out how much this operation costs. Let's say there are 8 systems, costing $200 per user per month, and there are 10 people in the team. The licence cost is 8 x $200 x 10 = $16k/month, x 12 months = $192k/year, call it $200k. The salary cost is $60k x 10 = $600k per year. Total cost = $800k/year.

If we make a data warehouse, taking 1 year and costing $500k, we will be able to put together the data from those 8 systems into 1 database, and the reporting team won't need access to those systems anymore. That saves $200k/year.

Secondly, we will be able to automate the production of those reports. That will reduce the amount of time required to make them, thus reducing the number of people required and freeing them up for other activities. Assuming that half of the reports can be automated, that would save us $300k per year.

So total saving per year is $500k. Project cost is $500k. Surely a project with this level of “cost vs benefit” is well worth pursuing.

1 April 2016

Data Sourcing

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:16 pm

One of the critical activities in a data warehouse project (and in any data project) is data sourcing. The business users have a few reports which need to be produced from the data warehouse. There is no data for them yet in the data warehouse, so we look at the report and ask ourselves: "Where can I get the data from to produce this report?" The plan is to find the data source, and then bring the data into the data warehouse.

There are 6 steps of data sourcing:

  1. Find the source
  2. Reproduce the numbers
  3. Verify the extraction
  4. Check the coverage
  5. Check the timing
  6. Check the cost

Step 1. Finding the source

We find out how the report was created, who created it, and from where. The report could be a PDF which is created from PowerPoint, which was created from an Excel spreadsheet, which was created from a file, which was exported from a system. It is this system that we need to check. Find the tables, the columns and the rows where the data is coming from.

Step 2. Reproduce the numbers

Once we have located the tables, we then try to reproduce the numbers we see on the report. For example, the report could be like this:

Asset Allocation

Try to reproduce the first number (24.3%) by querying the tables which you think the data is sourced from. In doing this you will need to look at the data processing happening in the Excel spreadsheet. Do they exclude anything? Is it a straightforward sum of the holdings in government bonds? Do they include supranationals? Government agencies? Regional? Municipal? Emerging Market Debt?

If you can match 24.3% then great. If not, investigate if the data is manually overridden before it was summed up. For example, in Excel, there might be a particular bond which was manually classified as Government even though the asset class is Corporate, and this is because it was 80% owned by the government, or because it was backed by the government.

We need to be particularly careful with regard to the totals. For example, on the "Allocation by Country of Risk", if the total portfolio value is $200m but the report excludes FX forwards/swaps, bond futures or certain derivatives, then the denominator could decrease to $198m, making all the percentages slightly higher; if we miss that exclusion, our numbers will not match.
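
As an illustration, the reconciliation query might look something like this (a sketch only; the Holding table, the column names and the exclusion rules are hypothetical and must mirror whatever the spreadsheet actually does):

-- sketch: reproduce the government bond allocation percentage
select sum(case when AssetClass = 'Government Bond' then MarketValue else 0 end)
       / sum(MarketValue) * 100 as GovtAllocationPct
from Holding
where PortfolioCode = 'ABC'                             -- the portfolio on the report
  and SnapshotDate = '2016-03-31'                       -- the report date
  and InstrumentType not in ('FX Forward', 'FX Swap');  -- exclusions applied by the report, if any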

Understanding the logic behind the report is critical in reproducing the numbers. In-depth industry knowledge will be helpful to understand the logic. For example:

  1. The credit rating for each debt security is categorised into IG (Investment Grade) if it is BBB- or above, and into HY (High Yield) if it is BB+ or below, or NR.
  2. The credit rating used for point 1 is the average of S&P, Moody's and Fitch, except a) for own fund use look-through, b) for outside fund use IMMFA
  3. The “Allocation by Country of Risk” table excludes cash and FX forwards/swaps.
  4. Each country is categorised into Developed, Emerging or Frontier market.
  5. When determining the country of risk in point 4, for derivative use the underlying.

In the above case, if we are not familiar with the investment banking industry, it would take a long time for us to understand the logic. So, yes, when doing Data Sourcing, it is best if it is done by a Business Analyst with good knowledge & experience in that industry sector.

Step 3. Verify the extraction

Once we can reproduce the numbers, we need to verify if we can get the data out. A few gotchas are:

  1. The numbers are calculated on-the-fly by the system, and not stored anywhere in the database. If this is the case, find out from the vendor if they have an export utility which produces a file after the numbers have been calculated.
  2. Are we allowed to connect to that database and query it? Do not assume that we can, because I've encountered a few cases where we were not allowed to. It could be because of the system workload / performance (it is a critical transaction system and they don't want any big query hurting the front-end users), or it could be because they have provided daily extract files which all downstream systems must use (instead of querying the database directly). From the system admin's point of view, it makes sense not to allow any external queries to run on the database, because we don't know what kind of damage those external queries can cause; they could block the front-end queries and cause locking.
  3. Losing precision, i.e. the data must be exported from the database, but during the export process the precision decreases from 8 decimal places to 2 decimal places.
  4. There is a security restriction because it is against the "Chinese wall" compliance rules (in investment banking, the public-facing departments must not get data from the M&A department).
  5. The system is being migrated, or rewritten, so it is still in a state of flux and we need to wait a few months.
  6. The system is not solid "production quality", but only a "throwaway", which means that within a few months they could be dropping those tables.

Step 4. Check the coverage

This step is often missed by business analysts. We need to check if all the products that we need are available in that system. If the report we are trying to reproduce covers 4000 products from 100 stores, but the source system/tables only cover 3000 products from 70 stores, then we've got to find out where the other 1000 products and 30 stores are sourced from. Are they produced from a different system?

Not only products and stores. We need to check the coverage in terms of customers/clients, portfolios, securities, lines of business, underwriting classes, asset classes and data providers. And the most important coverage check is on dates, e.g. does the source system have data from 2011 to present? It is possible that the source system only has data from 2014.
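
A couple of quick queries usually answer these coverage questions (a sketch, with hypothetical table and column names):

-- sketch: check product, store and date coverage of the candidate source table
select count(distinct ProductId)  as ProductCount,
       count(distinct StoreId)    as StoreCount,
       min(TransactionDate)       as EarliestDate,
       max(TransactionDate)       as LatestDate
from SourceSystem.dbo.SalesTransaction;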

Step 5. Check the timing

After checking the coverage, we need to check if the data is available when we need it. We need to check these 3 things: data is too late, available extraction window, the data is overwritten.

  1. Data is too late: If our DW load starts at 2.15 am, will the data be available before that? If not, could the business user live with slightly stale data (data from 2 days ago, i.e. if today is Wednesday, the latest data in the DW would be Monday's data)?
  2. Available extraction window: In the production environment, when can we extract the data from that source system? If from 11pm to 6am there is an overnight batch running, and we can't run our extract during that time, then the earliest we can run is 7am. If the DW load takes 3 hours, DW users can access the data at 10am. Is that too late for the users or not?
  3. The data is overwritten: the data from the source system can be updated many times during the day and when we extract it at night, we have no trace of these changes. Is that ok? Do we need intraday, push-driven data load into the DW? Or would 10 minutes data extraction frequency (pull-driven) be enough?

Step 6. Check the cost

There is no free lunch. We need to check how much it would cost us to use that source data.

  1. If the data is valuable (such as prices, yields and ratings from Bloomberg, Reuters and Markit) we would have to pay the data providers. We need to check the cost. The cost could be, say, $5 per security per call, so it could easily be $20-30k per day. The cost is usually shared with other departments.
  2. Check with the data provider: if you use the data only as an input to your calculation, and you don't publish it / send it on to any external parties (clients, etc.), would it still cost you a lot? Even if you don't have to pay the data providers, your DW project might still have to share the cost with other departments. Say the data provider is Markit and your company pays $300k/year for prices data and it is currently shared by 5 departments ($60k each). Your project may have to bear a cost of $50k/year ($300k divided by 6 departments).
  3. The cost could be a killer to the whole thing, i.e. even if #1 to #5 above are all ok, if the cost of the data is $50k, it could force you to cancel the whole project.
  4. Sometimes another department has to create the data for you. Let's say a yield calculation, a risk engine, or an OTC pricing engine, and the requirement from BI/DW is specific so they have to develop it. It could take them 3 months x 3 people and they could cross-charge your project $50k (one off). And that could also be a killer for the DW project.
  5. Developing an interface: some systems do not allow external systems to pull the data out. They insist on developing an export themselves, and charge the cost to your DW project.
  6. Standard data interface system: some large companies (such as multinational banks) have standard interfaces (real time, end of day, etc.), and the central middleware team might charge your DW project a small amount (say $2,000 one off + $50/month) to use that standard data interface system. Say you need FX rate data from the FX system, and there is already a standard message queue for FX rates with publication times of 11am, 1pm and 3pm. So you "subscribe" to this publication MQ and pay the cost (project cross-charge).

29 December 2015

DimMonth, DimQuarter and DimYear

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:21 am

Sometimes the grains of our fact tables are monthly, quarterly, or yearly. In such cases, how do we create DimMonth, DimQuarter and DimYear? Some of the questions in these cases are:

  1. Why do we need to create Month as a dimension? Can't the Month column in the fact table remain as a month, rather than as a dimension key?
  2. What does DimMonth look like? What are the attributes? How about DimQuarter?
  3. Should we create DimMonth as a physical table, or as a view on top of DimDate?
  4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?
  5. Do we need to create a dimension for year? It seems weird because it would only have 1 column (a one-column dimension is normally kept in the fact table as a degenerate dimension)

For example, in the source system we have a table which stores the monthly targets for each store:
Sales Target table

Or quarterly target like this:
Quarterly Target

How do we create a fact table for this? Should we create it like this?
FactSalesTarget

Question 1 above: Why do we need to create Month as a dimension? Why can’t we leave it as Month in the fact table, not as a dim key, like this?
FactSalesTarget - MonthNotAsDimKey

Question 2 above: if we decide to keep the column as MonthKey, what should DimMonth look like? What are the attributes? Should DimMonth be like this?
DimMonth

What attributes should we put there?

  • Do we need quarter and year?
  • Do we need half year? e.g. 2015 H1 and 2015 H2
  • For month name, should we use the short name or long name? e.g. Oct 2015 or October 2015?
  • Do we need an attribute for “Month Name without Year”, e.g. October?
  • Do we need “Month End Date” column? e.g. 2015-10-30
  • Do we need “Month is a Quarter End” indicator column for March, June, Sep and Dec?
  • Do we need “Number of days in a month” column? e.g. 30 or 31, or 28 or 29.

Or should we use “intelligent key” like this?
DimMonth using Intelligent Key

Question 3 above: Should we create DimMonth as a view of DimDate like this?

create view DimMonth as
select min(DateKey) as MonthKey,
MonthNumber, MonthName, Quarter, Year
from DimDate
group by MonthNumber, MonthName, Quarter, Year

DimMonth as a view

What are the advantages of creating DimMonth and DimQuarter as a view of DimDate? (compared to creating it as a physical table) What are the disadvantages?

I think with the above questions and examples we are now clear about what the issue is. Now let’s answer those questions.

Q1. Do we need to create Month as a dimension? Can't the Month column in the fact table remain as a month, rather than as a dimension key, like this?
FactSalesTarget - MonthNotAsDimKey

We need the Month column in the fact table to be a Dim Key to a month dimension because we need to access Year and other attributes such as Quarter.

Bringing Year into the Sales Target fact table like below is not a good idea, because it makes it inflexible. For example, if we want to add Quarter column we have to alter the fact table structure.
Bring year into fact table

Using a Dimension Key to link the fact table to a Month dimension makes it a flexible structure:
DimKey Link
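
In DDL terms the structure might look like this (a minimal sketch; the column names are illustrative):

-- sketch: sales target fact keyed on month
create table FactSalesTarget
( MonthKey    int   not null,  -- references DimMonth, e.g. 201510
  StoreKey    int   not null,  -- references DimStore
  SalesTarget money not null
);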

There is an exception to this: the Snapshot Month column. In a monthly periodic snapshot fact table, the first column is Snapshot Month. In this case we do not need to create this column as a dimension key linking to DimMonth, and we do not need a DimMonth, because we do not need other attributes (like Year or Quarter). A monthly periodic snapshot fact table stores the measures as of the last day of every month, or within that month. For example: number of customers, number of products, number of orders, number of orders for each customer, the highest and lowest price within that month for every product, the number of new customers for that month, etc.

Q2. What does DimMonth look like? What are the attributes?

Obviously, the grain of DimMonth is 1 row per month. So we are clear about what the rows are. But what are the columns? Well it depends on what we need.

I usually put MonthNumber, MonthName, Quarter and Year in DimMonth, because they are frequently used.

I don’t find “Month Name without the Year” as a useful attribute. I rarely come across the need for “Half Year” attribute.

“Month is a Quarter End” column is also rarely used. Instead, we usually use “Quarter” column.

“Month End Date” and “Number of days in a month” are also rarely used. Instead, we usually use “IsMonthEnd” indicator column in the DimDate.

For the month name, should we use the short name (Oct 2015) or the long name (October 2015)? I found that the short name is used more than the long name. But the month number (2015-10) is used even more frequently than the short name.

Q3. Should we create DimMonth as a physical table, or as a view on top of DimDate?

This is really the core of this article. A view on top of DimDate is better in my opinion, because we avoid maintaining two physical tables. And it makes the dimensions less cluttered.

If we make DimMonth and DimQuarter physical dimensions, then in SSMS Object Explorer, when we open the table section, we would see these:
DimDate
DimMonth
DimQuarter
DimYear

But if we create DimMonth and DimQuarter as views, then we will only see DimDate in the Object Explorer’s table section. The DimMonth and DimQuarter will be in the view section.

The main disadvantage of creating DimMonth as a view on DimDate is that it is less flexible: any attribute column that we want to appear in DimMonth must exist in DimDate. But I found that DimMonth usually only needs 2 or 3 attributes, i.e. Month, Quarter, Year, and all of them are available in the DimDate table. So this is not an issue.

Avoiding maintaining 2 physical tables is quite important because when we extend the date dimension (adding more years i.e. more rows) and we forget to extend DimMonth and DimQuarter, then we will cause an error.

The other consideration is of course the performance. I do not find the performance of DimMonth and DimQuarter to be an issue. This is because DimDate is not too large, and more importantly because the monthly and quarterly fact tables are small, less than 1 million rows. They are much smaller than daily fact tables which have millions or billions of rows.

Q4. What is the surrogate key of DimMonth and DimQuarter? Do we create a new SK, or do we use the SK from DimDate?

If we create DimMonth and DimQuarter as physical tables, then the surrogate key can either be pure surrogate (1, 2, 3, …) or intelligent key (201510, 201511, 201512, etc.)

But if we create them as views of DimDate, then the surrogate key can be either the first day of the month (20151001, 20151101, 20151201, etc.) or the month itself (201510, 201511, 201512, etc.). I prefer the latter to the former because it is more intuitive (an intelligent key) and there is no ambiguity, unlike the former.

The script to create the view for DimMonth with SK = 201510, 201511, 201512, etc. is like this:

create view DimMonth as
select distinct convert(int,left(convert(varchar,SK_Date),6)) as MonthKey,
[MonthName] , Quarter, [Year]
from DimDate
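
Joining a monthly fact table to this view is then straightforward, for example (a sketch using the FactSalesTarget structure discussed earlier):

-- sketch: monthly targets rolled up to quarter and year via the DimMonth view
select m.[Year], m.Quarter, sum(f.SalesTarget) as TotalTarget
from FactSalesTarget f
inner join DimMonth m on m.MonthKey = f.MonthKey
group by m.[Year], m.Quarter;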

Q5. Do we need to create a dimension for year?

No we don’t need to create DimYear, because it would only have 1 column.

What should we call the dim key column in the fact table then? Is it YearKey or Year? We should call it YearKey, to be consistent with the other dim key columns.

A dimension which only has 1 column, and is therefore kept in the fact table, is called a Degenerate Dimension. A Degenerate Dimension is usually used to store an identifier from the source table, such as Transaction ID or Order ID. But it is also perfectly valid for dimensions which naturally only have one attribute/column, like the Year dimension. See my article about "A dimension with only one attribute" here: link.

23 December 2015

Data Dictionary

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 8:15 am

During my 10 years or so doing data warehousing projects, I have seen several data dictionary initiatives. A data dictionary is a list of definitions used in the system. Each definition is usually about 10 to 50 words long, and there are about 50 to 200 definitions. What is being defined is mostly technical terms, such as the meaning of each field in the database. This is the origin of the data dictionary: it grew from the need to explain the fields, for example delivery date, product category, client reference number, tax amount, and so on. I saw this from 1998 all the way to today (2015).

This data dictionary was created by the business analyst on the project, who was trying to explain each field to clear up any ambiguity about the terms used in the system. But the business people view it differently: they find it not very useful. The data dictionary is very useful to a new starter on the IT development project, but not so useful for the business. I'll illustrate with several examples below.

In the database there are terms like: CDS, obligor, PD, yield. The BA defines these terms by copying a definition from internet searches, so they ended up with something like this:

  • Credit Default Swap (CDS): a financial contract on which the seller will pay the buyer if the bond defaults
  • Obligor: a person or entity who is legally obliged to provide some payment to another
  • Probability of Default (PD): likelihood of a default over a particular time frame
  • Yield: the interest received from a bond

The CDS definition is trying to explain a big concept in one sentence. Of course it fails miserably: what bond? What do you mean by "defaults"? After reading the whole sentence, the readers are none the wiser. They would get a much better understanding of CDS by reading the Wikipedia page about it.

The Obligor definition is too generic. It is too high level so that it is useless because it doesn’t provide the context to the project/system. In a credit portfolio project, obligor should be defined as the borrower of the loan, whereas in a fixed income portfolio project, obligor should be defined as the issuer of the bond.

The PD definition contains a common mistake in data dictionary: the definition contains the word being defined, because the author doesn’t have a good understanding about the topic. What is a default? It is not explained. Which PD is used in this field, Through The Cycle or Point In Time? Stressed or Unstressed? Is it calculated using linear regression or discriminant analysis? That is what the business wants to know.

The Yield definition does not explain the data source. What the business users need to know is whether it is Yield to Maturity, Current Yield, Simple Yield, Yield to Call or Yield to Worst. There are different definitions of yield depending on the context: fixed income, equity, property or fund. The author has firmly framed it for fixed income, which is good, but within fixed income there are 5 different definitions, so the business wants to know which one.

Delivery mechanism: Linking and Web Pages

To make the life of the business users easier, sometimes the system (the BI system, reporting system, trade system, finance system, etc.) is equipped with hyperlinks to the data dictionary, directly from the screen. So from the screen we can see these words: CDS, obligor, PD, yield are all hyperlinks and clickable. If we click them, the data dictionary page opens, highlighting the definition. Or it can also be delivered as a “balloon text”.

But mostly, data dictionaries are delivered as static web pages. It is part of the system documentation such as architecture diagram, content of the warehouse, support pages, and troubleshooting guide.

It goes without saying that as web pages, it should have links between definitions.

I would say that the delivery mechanism alone doesn't contribute much to the value. It is the content which adds most of the value.

Accessibility

A data dictionary should be accessible by the business, as well as by IT. Therefore it is better if it is delivered as web pages rather than links directly from the system's screens. This is because web pages can be accessed by everybody in the company, and the access can be easily controlled using AD groups.

Data Source

The most important thing in data dictionary is in-context, concise definition of the field. The second most important thing is the data source, i.e. where is this field coming from. System A’s definition of Duration could be different from system B’s. They may be coming from different sources. In System A it might be sourced from Modified Duration, whereas in System B it is sourced from Effective Duration.

Because of this, a data dictionary is per system. We should not attempt to build a company-wide data dictionary (like this: link) before we have completed the system-specific data dictionaries.

How deep? If the field is sourced from system A, and system A is sourced from system B, which one should we put as the source, system A or system B? We should put both. This is a common issue. The data dictionary says that the source of this field is the data warehouse. Well of course! A lot of data in the company ends up in the data warehouse! But where does the data warehouse get it from? That's what the business wants to know.

Not Just the Fields

A good data dictionary should also define the values in the field. For example, if we have a field called Region which has 3 possible values: EMEA, APAC, Americas, we should explain what APAC means, what EMEA means and what Americas means (does it include the Caribbean countries?)

This doesn’t mean that if we have a field called currency we then have to define USD, EUR, GBP and 100 other currencies.  If the value of a field is self-explanatory, we leave it. But if it is ambiguous, we explain it.

If the field is a classification field, we should explain why the values are classified that way. For example: the value of the Asset Class field could be: equity, fixed income, money market instruments (MMI), CDS, IRS. Many people would argue that CDS and MMI are included in fixed income, so why have separate categories for CDS and MMI? Perhaps because MMI has short durations and the business would like to see its numbers separately. Perhaps because the business views CDS as a hedging mechanism rather than an investment vehicle, so they would like to see its numbers separately.

Summary

So in summary, a data dictionary should:

  1. Contain in-context, concise definition of every field
  2. Contain where the field is sourced from
  3. Contain the definition of the values in the field
  4. It should be system specific
  5. It should be delivered as web pages
  6. It should be accessible by both the business and IT

A good data dictionary is part of every data warehouse. I would say that a data warehouse project is not finished until we produce a good data dictionary.
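
If the web pages are generated from a table, the storage structure behind them can be as simple as this (a sketch; the column names and sizes are illustrative):

-- sketch: a simple storage structure for a system-specific data dictionary
create table DataDictionary
( SystemName       varchar(50)   not null,  -- the dictionary is per system
  FieldName        varchar(100)  not null,
  Definition       varchar(500)  not null,  -- in-context, concise definition
  DataSource       varchar(200)  not null,  -- where the field is sourced from, including upstream systems
  ValueDefinitions varchar(1000) null       -- definitions of the values, where they are not self-explanatory
);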

11 November 2015

Indexing Fact Tables

A bit on Primary Key

Yesterday a friend asked me why there was no primary key on a fact table. I explained that we did have a fact table surrogate key (FSK) on that fact table, but I made it as a unique non-clustered index because we needed the physical ordering (the clustered index) to be on the snapshot date as it was a daily periodic snapshot fact table, queried mostly by using the snapshot date.

The purpose of having a primary key (PK) is to enforce uniqueness in one column of the table. We can achieve the same thing, by creating a unique key, hence we do not need a PK in that fact table.

We need to pause and think whether we need a unique identifier of the rows in the fact table. If we need to refer to a fact row from another fact row (self-referencing), then we do need a unique identifier, which is usually a single-column bigint FSK. But this single-column bigint FSK does not have to be a PK; it can be enforced with a unique non-clustered index.

The second purpose of having a PK is to enforce not null. This is really not the function of the PK, but more of a requirement of a PK. A PK requires that the column must be not-nullable. So not-nullable is a property of the column itself, not a property of the PK. And we implement this non-nullability when declaring the column on the table DDL.

We need to bear in mind that a PK has nothing to do with clustered or non-clustered indexes. SQL Server will automatically implement a PK as either a unique clustered index (UCI) or a unique non-clustered index (UNCI), depending on whether a clustered index already exists.

The Clustered Index

A clustered index does two things:

  a) Sorting the table physically
  b) Acting as the row locator in non-clustered indexes

Point a) is for the performance of the queries. If we don’t partition a periodic snapshot fact table on the snapshot date, the next best thing is to cluster the table on the snapshot date.

But point a) is also for the performance of updates and deletes. It is rare, but in some cases we need to update a periodic snapshot fact table (PSFT). So far I have only found 1 case where I needed to update a PSFT, out of about 150 PSFTs that I have created over the last 10 years. When updating a fact table, it is absolutely crucial that the partitioning key, or the clustered index if the table is not partitioned, is on the business date plus the columns used as the joining criteria between the fact staging table and the PSFT. In that case the clustered index should not be on the query criteria columns; it is the job of the non-clustered indexes to support the query.

Point b) means that the narrower the clustered key, the smaller the non-clustered indexes. Some people think that a narrower clustered key also makes the non-clustered indexes perform better, but in my observation this performance increase is negligible.

So that’s the clustered index. For an insert-only PSFT we put the clustered index on the snapshot date plus the query criteria column to support the query performance. For an insert-and-update PSFT we put the clustered index on the joining criteria of the update statement.

For example, if the joining criteria of the update is snapshot date key + order ID (a degenerate dimension, the PK in the source table), whereas the query criteria is snapshot date key + account key, then for insert-only PSFT the clustered index should be snapshot date key + account key whereas for update PSFT the clustered index should be on snapshot date key + order ID.
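
In T-SQL the two variants might look like this (a sketch; the table and column names are hypothetical):

-- sketch: insert-only periodic snapshot fact table,
-- clustered on snapshot date + query criteria column
create clustered index CIX_FactAccountSnapshot
  on FactAccountSnapshot (SnapshotDateKey, AccountKey);

-- sketch: insert-and-update periodic snapshot fact table,
-- clustered on the columns used to join from the fact staging table during the update
create clustered index CIX_FactOrderSnapshot
  on FactOrderSnapshot (SnapshotDateKey, OrderId);

-- the fact surrogate key can still be enforced with a unique non-clustered index
create unique nonclustered index UNCI_FactAccountSnapshot_FactKey
  on FactAccountSnapshot (FactKey);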

The join strategy the SQL engine takes when updating the fact table depends not only on the clustered index of the PSFT, but also on the clustered index of the fact staging table (the source of the merge). If we do use the Merge command, we should consider converting it to separate update and insert commands. See my article here (link) about the Merge command's inefficiency.

Partitioning

We can replace the physical ordering functionality above with partitioning. It is common and it make sense to partition a PSFT, I agree. But I would recommend not to partition the fact table when we create it, but later on. This is because of these reasons:

  1. We should not spend the time unless it is required, and when we create the fact table we don’t yet know if the partitioning is required.
  2. When the table is populated and queried we can learn about its condition and behaviour. Much, much better than imagining. For example, is the partitioning required to support query performance, or loading performance?
  3. We may have purging on the fact table, limiting the volume so that we don’t need to partition it because the performance is OK.
  4. We may need to create a new column for the partitioning key.
  5. When we create the fact table, we don’t yet know how the data will be distributed and queried. Purging can change the data distribution. We don’t know the query pattern, for example, 90% of the queries might be on today’s data so we should put it into a separate partition.
  6. Point 5 above dictates the partition maintenance strategy, i.e. whether we have partitioning functions splitting the table into 10,000 partitions, or into 100 partitions with a "sliding window" maintenance. At creation time, we have very limited knowledge of this.

Non Clustered Indexes

Each surrogate key is ideally indexed. Not combined as 1 index, but as separate indexes. All as non-clustered indexes (NCI). Say we have order date key, customer key, product key, store key, ship date key. Then we create 1 NCI on order date key, 1 NCI on customer key, 1 NCI on product key, 1 NCI on store key and 1 NCI on ship date key.

We should not combine these NCIs into 1 NCI, because the second, third and fourth columns of the combined NCI would rarely be used: an index seek can only use them when the preceding key columns are also filtered on.
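
For the example above, that means something like this (a sketch; FactSales is a hypothetical table name):

-- sketch: one non-clustered index per dimension key, not one combined index
create nonclustered index NCI_FactSales_OrderDateKey on FactSales (OrderDateKey);
create nonclustered index NCI_FactSales_CustomerKey  on FactSales (CustomerKey);
create nonclustered index NCI_FactSales_ProductKey   on FactSales (ProductKey);
create nonclustered index NCI_FactSales_StoreKey     on FactSales (StoreKey);
create nonclustered index NCI_FactSales_ShipDateKey  on FactSales (ShipDateKey);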

See also two of my articles which are related:

  • Indexing Fact Tables in SQL Server (link)
  • Primary Key and Clustered Index in the Fact Table (link)

12 September 2015

EAV Fact Tables

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:20 am

A few weeks ago I came across EAV fact tables. EAV is the Entity-Attribute-Value data model (read this for a background on the EAV structure). It is a data model which enables us to add a column into a table without actually changing the table structure. At first I thought this EAV approach had no place in the Kimball dimensional model. But after thinking and rethinking I saw that it had some advantages too, not just disadvantages, and that in some cases it is appropriate to use.

So below I'll explain what it is, the advantages, the disadvantages and my verdict on it.

What does it look like?

An EAV Fact Table looks like this:

What does an EAV Fact Table look like

The above example is taken from the retail industry (a fashion retailer), where they analysed the profitability of each product line every day. What is a product line? They have 9 product groups: women, men, kids, shoes, handbags, accessories, watches, jewelry, beauty. The women product group consists of 23 product lines: dresses, tops, jumpers, cardigans, shirts, T-shirts, blazers, capes, jackets, coats, skirts, shorts, trousers, jeans, leggings, tights, socks, nightwear, lingerie, leisurewear, swimwear, suits, new.

The above EAV fact table is product line performance fact table. Every day, based on the sales figures, profit margins, direct costs and overheads, they calculated various performance measure for each product line: sales growth, profit margin, product distribution, margin stability, cost effectiveness, price variation, colour variation, size variation, style variation, etc. TypeKey 217 means 1 week sales growth, 218 means 1 week margin stability, 219 is product distribution, and so on. Some measures are time-based, so they have periods such as 1 day, 1 week or 1 month. Some measures have 2 versions: net and gross. Some measures have location variations i.e. local and global. Similar measures are grouped.

Performance measurements are different for each product line: a measurement applicable to one product line may not be applicable to another. The EAV structure fits the shape of this performance measurement data and makes it flexible. Because there are so many product lines with so many different performance measures, almost every week they have a new performance measure. This is because in the business analytics software they can create a new measure at any time: they define the formula for that measure, which product lines the measure is applicable to, the range of time periods applicable to that new measure, whether it is a gross or net measurement, and whether it is global or local.
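
Structurally it is something like this (a sketch of the table described above, with illustrative column names):

-- sketch: an EAV-style fact table for product line performance
create table FactProductLinePerformance
( SnapshotDateKey    int           not null,
  ProductLineKey     int           not null,
  PerformanceTypeKey int           not null,  -- e.g. 217 = 1 week sales growth, 218 = 1 week margin stability
  PerformanceValue   decimal(18,6) not null
);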

What’s bad about it?

The main downside of EAV fact tables is that when we query the fact table we may need to pivot it. If we need to retrieve the weekly margin stability for every product line, we can just filter on Performance Type Key = 218 and we get what we need. But if we need to retrieve all the time variances of sales growth (1 week, 2 weeks, 1 month, etc.) for a certain product line, then we will need to pivot the data.

This pivoting is annoying because we have to hardcode the performance type names to make them as columns. At times it could be so annoying that we wished we had a normal Kimball style fact table so we didn’t have to pivot the data.
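
For example, getting the sales growth time variances as columns requires something along these lines (a sketch against the table above; type key 217 is the 1 week sales growth mentioned earlier, the other keys are hypothetical, and hardcoding them is exactly the annoying part):

-- sketch: pivot sales growth measures into columns, one row per product line
select ProductLineKey,
       max(case when PerformanceTypeKey = 217 then PerformanceValue end) as SalesGrowth1Week,
       max(case when PerformanceTypeKey = 220 then PerformanceValue end) as SalesGrowth2Weeks, -- hypothetical key
       max(case when PerformanceTypeKey = 221 then PerformanceValue end) as SalesGrowth1Month  -- hypothetical key
from FactProductLinePerformance
where SnapshotDateKey = 20150911
group by ProductLineKey;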

What’s good about it?

In a Kimball star schema, the measures in the fact tables are created as fixed columns. If we have a new measure, we need to create a new column. If the retail analytics software is churning out new measures almost every week, our development cycle will not be able to cope with the pace. In a mature warehouse, it will probably take about a month to complete the analysis, design, development, testing and release process just to add 1 column, equal to probably 10 man days.

If the employee cost rate is $550 per day (the sum of salary, bonus, medical, dental, vision, relocation, life insurance, accident insurance, training, pension, cost of HR staff, childcare, cost of facilities, and 401k for a year, say $120k, divided by 250 working days per year minus 30 days of vacation and sick time) that 1 column would cost $5,500. That is not cost effective at all.

Having a flexible structure like the EAV fact table means that we never have to worry about the new performance measures churned out by the retail analytics software almost every week. That saves $5,500 per week, or $275,000 per year, which is a lot of money.

Final verdict

If new measures are created quite often (more than once per quarter) and the measures are different for every product line (or every customer, or another dimensionality such as instrument or asset type), then the EAV Fact Table approach is justified.

If the measures are quite stable (changes are less than once per quarter), and the measures are the same for every dimensionality, then an EAV Fact Table is not justified. We should build it as a normal, Kimball-style fact table.

But that is just my opinion. You could have different experiences, and therefore different opinions. Hence I would like to hear your comments on this topic.

13 February 2015

Estimating the Size of Dimension and Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 11:25 pm

In most cases, when we start designing a data warehouse, the PM needs us, as the data warehouse architect, to tell them what the size of the DW will be, so that infrastructure can start the procurement process, capacity planning, etc. The problem is, we do not know the size of the warehouse until we have finished building the warehouse and the ETL.

Below I describe the technique that I used to estimate the size of dimension and fact tables. Once you get that, you can estimate the size of the staging database based on the warehouse size (say 30-50% of DW size), size of the cubes (say 5-10% of DW size), size of the extract files (say 10-20% of DW size).

This way you can estimate the disk requirement, which is one of the most difficult things to do if you don't have the experience. I've done about 10 data warehousing projects, i.e. designing and building data warehouses and BI, and what I write here is based on that experience. Of course, other DW architects may have different experiences, and if so, please share them so we can all learn from you as well.

OK let’s start with dimension tables, then the fact tables.

Estimating the Size of Dimension Tables

Categorise dimension tables into small, medium, and large, based on how many rows in the dimension. Up to 1000 rows is small, 1000 to 100k is medium, and > 100k rows is large dimension. For example:

  • Country dimension: small
  • Customer dimension: large
  • Currency dimension: small
  • Date dimension: small
  • Salesperson dimension: small
  • Product dimension: large
  • Industry Sector dimension: small
  • Store dimension: medium

We can roughly estimate the size of a small/medium/large dimension based on the KB per row, like this:

  • Small: 1 KB/row x 1000 rows = 1 MB
  • Medium: 5 KB/row x 100k rows = 500 MB
  • Large: 10 KB/row x 1m rows = 10 GB

Then we multiply with the number of dimensions for each of the Small/Medium/Large category, like this:

  • Small: 11 tables x 1 MB = 11 MB
  • Medium: 6 tables x 500 MB = 3 GB
  • Large: 4 tables x 10 GB = 40 GB

Total: 44 GB

We can then cross check this 44 GB using the daily volume of dimension data flowing into the data warehouse, to verify that we are not too far off from the reality.

  1. Total data volume each day loaded by ETL into the data warehouse from all source systems: 15 GB, comprising say 13 GB delta load (incremental extract / transactional) and 2 GB static load (whole-table extract). This 15 GB is the total volume of data extracted from all source systems, which is usually about 5-10% of the size of the source systems, if we can do incremental extraction (use 50% if you can't extract the data incrementally).
  2. Of all data going through the ETL process, about 5-10% is dimension/static data, and 90-95% is fact/transaction data. If you are not sure whether to take 5% or 10%, take 5%: most cases are 5%, and only a few are 10%. So 5% x 15 GB = 0.75 GB. This is the volume of dimension data loaded into the data warehouse every single day.
  3. About 0.1-3% of the rows in the dimension tables are updated or inserted every day: 3% for a very active dimension (like customer and product dimensions), 0.1% for a very static dimension (like country and currency dimensions). So if your customer dimension has 10 million rows, about 300k rows (3%) are updated or inserted every day. If your country dimension has 200 rows, about 0.2 rows (0.1%) are updated every day, which means about 1 row per week on average.
  4. So based on #2 and #3 we can estimate the size of the dimensions. There is 0.75 GB of dimension data flowing into the DW every day, and that is about 1.5% of the dimension size (0.1% for a very static dimension, 3% for a very active dimension, so the average is roughly 1.5%). So the size of all dimension tables combined is 100%/1.5% = 67 times the daily volume of dimension data = 67 x 0.75 GB = 50.25 GB.

So one method gives 44 GB and the other gives 50.25 GB, so we are not too far off. We take the larger of the two, which is 50.25 GB.

I don't like to be vague or unclear: what's the tolerance range for "too far off"?

Answer: Based on my experience, if the second method gives us 500 GB (>10x than the first method), we need to question it and look into it again. But if the second method is 2-3x bigger than the first method (132 GB) that’s fine. This 3x is the limit, beyond 3x we have to question it.

Estimating the Size of Fact Tables

We estimate the size of the fact tables based on the data volume in the source systems, like this:

  1. Say that we have 3 source systems of 100 GB, 10 GB and 400 GB, a total of 510 GB. If we can do incremental extraction, the volume of the extracted data is about 1-10% of the size of the source system (daily). The bigger the source system, the smaller the percentage, i.e. if the source system is > 1 TB, it is more likely to be 2-3% than 8-10%. The newer the source system, the bigger the percentage, i.e. if the source system is 1 year old it will be closer to 10% than to 1%. If you have no idea, don't take any risk, just take 10%. In this case, as the total is less than 1 TB, let's take 9%, assuming the systems are quite new. So the daily volume of extracted data is 9% x 510 GB = 46 GB per day.
  2. If you can't do incremental extraction, use 50%. So if the total volume of the source systems is 510 GB, the size of the daily data extract would be 255 GB. It is extremely unlikely that you can't do incremental extraction at all. In all 10 DW projects I've done, we could do incremental extraction on every single one of them. If it was a good ETL (we could do incremental extraction for almost all tables) then it was something like 5%. But if it was a bad ETL (we could not find ID columns in some tables required for incremental extraction) then it was something like 15%. So before you use 50%, verify that there really isn't anything that you can use for incremental extraction.

For more info please read my article about incremental extraction here: link. I described 4 things that we can use for incremental extraction: timestamp columns, identity columns, triggers, and transaction dates. It would be an extremely rare case that there is nothing in any table of the source system that you can use for incremental data extraction. If that is the case, the source system would be a small system anyway, or a very old one. Modern, large, complex transaction systems were built with incremental extraction in mind, or, at the very least, with a unique row identifier in every table. Even in Hadoop.

  3. Landing area and staging tables

The landing area is the file system (directories) used to store the data dump / export files from the source systems before they are loaded into the data warehouse. Instead of using files, we can keep the data dump from the source systems in staging tables. Staging tables are the tables in the staging database. They have no indexes and no primary keys, and they are usually truncated before each load. In a modern RDBMS, such as SQL Server 2014, staging tables can reside in memory so that loading and reading them is a lot faster than disk-based tables.

To estimate the size of the landing area, we multiply the daily data extract volume with the number of days we want to keep the data. In this example, let’s assume that we want to store the extracted data, which is 46 GB/day (see #1 above), for a week (5 days). So the volume of the landing area is 5 x 46 GB = 230 GB. The same calculation principles apply for calculating the staging tables.

  4. Intermediate fact tables

Sometimes the fact data requires intermediate tables, for example to perform grouping and decile calculations, or to calculate loyalty points. So the fact data flows into a layer of intermediate tables before it is loaded into the final fact tables. This is because the operation cannot be done in "one-pass" SQL: the data needs to be landed in a table first, then a second pass is performed. The reasons for not doing this in a temp table are:

  1. The data volume is large, for example 500 million rows of weather measurement points per day. Using a temp table would affect the performance of other streams in the warehouse.
  2. A comparison of today’s data with previous days is required. If we use a temp table, we would lose previous days’ data. By having a permanent table we don’t have to rebuild previous days’ data again.

Without performing data modelling, how do we estimate the size of intermediate fact tables? By taking a percentage of a day’s worth of data, like this: one day’s data is 46 GB. Of this, 10% is dimension data and 90% is fact data. So the fact data is 41.4 GB. Of this 41.4 GB fact data, most of it does not require intermediate fact tables. Only 10-20% requires double-passes (or triple-passes). Taking the maximum to be safe (20%), 20% x 41.4 GB = 8.3 GB. That’s the size of the intermediate tables to process fact data.

  5. Final fact tables

The core question is, how do we estimate the size of the fact tables, without first designing them. So at this point of the project we do not know yet whether we will have FactAccountBalanceSnapshot, FactPremiumTransaction, etc. We don’t know how many fact tables, and what they are. Yet.

One approach is to look at the size of the fact data that is coming into the warehouse every day. We know that this is 46 GB (see point 1 above). When this data is loaded into a dimensional model, it becomes about 20-40% of its original size, because the textual columns are put into the dimensions, leaving only the numerical columns in the fact tables. Textual columns take up most of the space in a row.

Most of the fact tables are periodic snapshots; only 1-2 are transactional. This is because the business usually needs to analyse the changes to measures over a period of time. Accumulating snapshot fact tables are rare too, usually only 1 or 2 within the whole warehouse. The ratio is about 80-90% by number of tables, and about 95% by data volume (GB), i.e. 95% of the fact data in GB is in the snapshot fact tables. Transactional fact tables are very small compared to periodic snapshot fact tables, because they do not repeat the data every day.

In the case above, assuming that we snapshot all of the incoming fact data, the volume of the periodic snapshot fact tables would be 46 GB x 30%* x 5 days x 52 weeks = 3.6 TB per year. But only 95% (3.42 TB) is in periodic snapshot fact tables, and 5% (0.18 TB) is in transactional and accumulating snapshot fact tables.

*30% is the mid between 20% and 40%, the ratio when the raw fact data is loaded into dimensional model.

It is a common practice that beyond 1 year we don't store daily snapshots, but either weekly or monthly snapshots. So it is something like: daily snapshots for the last 1 year, weekly snapshots for years 2-3, then monthly snapshots for years 3-10. So the number of snapshots is 5 x 52 = 260 for the last 1 year, 2 x 52 = 104 snapshots for years 2-3, and 7 x 12 = 84 snapshots for years 3-10. 260 + 104 + 84 = 448 snapshots.

So to store 10 years of data in the fact tables, the space required would be 46 GB x 30% x 95% x 448 = approximately 5.9 TB.

  6. Infrastructure factors

Depending on the RDBMS chosen, we will then need to add or deduct space for indices, log files, fill factor and database compression. Exadata can compress data to 20% of the original volume. The reference says 10% but the reality is 33% (see Ofir’s article here: link).

In SQL Server and Oracle (and I believe other non MPP RDBMS such as DB2), the space for indices in a DW environment is typically 20-40% of the data. The log files are only a small percentage of the data, say 1-5%. DB maintenance will keep the log files small.

Temp DB is another one to consider, particularly if you are using SQL Server. In some companies it's 20 GB, in others it's 200 GB. It also depends on memory: if the server has 130 GB RAM (say 120 GB usable by SQL), then 20 GB may be enough. But if the server only has 16 GB RAM (say 14 GB usable by SQL), then we may need 200 GB. And crucially it depends on what kind of SQL queries are flowing into the RDBMS: joining 20 tables (3 of them self joins) with grouping and rank functions, with each table being 50 GB in size, could take 100 GB of tempDB space. It depends on the execution plan too, i.e. hash join or nested loop, using recompile or not. But that's specific to SQL Server.

We also need to consider high availability requirements, e.g. mirroring doubles the space, RAID 5 striping adds about 20%, RAID 1+0 doubles the space, etc. Hot or cold standby, snapshot databases, replication, etc. also require additional space.

9 February 2015

Populating Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 12:48 am
Tags:

Fact tables are normally loaded from transaction tables such as order tables or from transactional files, such as web logs. Hence the number of rows to update or insert in a load is much larger than in dimensions. The core of loading fact tables is to change the natural keys into surrogate keys. Let’s start with the basic steps, then continue with some practical experiences such as loading partitioned fact tables, slim lookup tables, deduplication, loading snapshot and delta fact tables, fact table indexes and reloading history data.

Basic Steps

When loading a transaction table from the OLTP into a fact table in the data warehouse, the value columns on the transaction table become fact table measures, the primary keys on the transaction table such as order number become degenerate dimension columns on the fact table, and the natural keys such as date, customer and product become dimensional key columns on the fact table.

As rows of fact tables are loaded, each of the key columns is converted from the natural key. This is why all the dimension tables must be populated before we populate the fact tables: we need the surrogate keys on the dimension tables to translate the source table's natural keys. This is the very basis and the very heart of data warehouse loading, so it is important to understand it. Perhaps the best way to describe this concept is with an example. Consider a simplified order detail table with the following columns: order_id (primary key), order_date, product_id, quantity and price. In reality there should be a line number column as well, but to keep things simple let's not use line numbers in this example, i.e. let us assume that each order contains only one line.

Source table: order_detail

order_id  order_date  product_id  quantity  price  last_update
352       15/07/2006  BGCKQ       1         12.99  30/10/2006
410       30/10/2006  KMSCG       2         7.99   30/10/2006

In the data warehouse let us assume that we have an over simplified star schema consisting of a date dimension, a product dimension and a sales fact table, as describe below.

Date dimension table: dim_date

date_key  date        day       month
2388      15/07/2006  Saturday  July
2485      30/10/2006  Monday    October

Product dimension table: dim_product

product_key  product_id  product_name description
53076        BGCKQ       Aike U2 IDE HD case USB 2
92184        KMSCG       Sony BP71 VIAO CG7 battery

Sales fact table: fact_sales

fact_key  date_key  product_key  order_id  quantity  price  load_time
830923    2388      53076        352       1         11.99  15/07/2006

Notes:

  1. Note that order_id 352 already exists in the data warehouse but the price has changed. When it was loaded on 15/07/2006 the price was 11.99 but now the price is 12.99.
  2. In reality the date dimension would contain many more attributes, such as day of the week (2 columns: name and number), ISO date (YYYY-MM-DD), day of the year (number of elapsed days since the beginning of the year), Julian date (number of elapsed days since 4713 BC), SQL date (in SQL Server datetime format), day of the month, calendar week / month / quarter / year (3 columns for month: full name, short name and number), fiscal week / period / quarter / year, weekday flag, statutory holiday flag, last day of the month flag, etc. Plus day and month names in other languages if your data warehouse is internationalised.
  3. In reality the dimension tables would have standard dimension table columns such as load_time and SCD attributes but for this simplified case these columns are not displayed here.
  4. It is a common practice not to have a fact_key column in the fact table, the argument being that the combination of all the dimension keys makes the record unique. In this case, the primary key of the fact table is a composite key. When loading this kind of fact table we need to compare all dimensional key columns to identify the correct row when updating (see step 3 below).

The basic steps in loading data warehouse fact tables are described below. It is not that different from the steps in loading dimension tables (see my previous article). In this example, it is assumed that the source table, order_detail on the OLTP, has been loaded incrementally into a stage table called order_detail. The stage database name is stg. It is also assumed that we are using SQL Server 2000 or 2005, hence the SQL statements are in Transact SQL. So what we need to do now is to load the fact table records from the stage into the data warehouse.

Step 1. Create The Temp Table

SELECT * INTO #fact_sales FROM dw.dbo.fact_sales WHERE 1 = 0

The temp table after it is created:

fact_key date_key product_key order_id quantity price load_time

Notes:

  1. By deliberately specifying a false condition, we create an empty temp table, taking the structure of the target fact table.
  2. You may want / need to increase the size of the temp database depending on the size of your load.
  3. SELECT INTO is better than CREATE TABLE because it automatically adapts to structural changes in the target table.

Step 2. Populate The Temp Table

SET IDENTITY_INSERT #fact_sales ON

INSERT INTO #fact_sales
(fact_key, date_key, product_key, order_id, quantity, price, load_time)
SELECT
ISNULL(f.fact_key, 0),
ISNULL(d.date_key, 0),
ISNULL(p.product_key, 0),
s.order_id,
ISNULL(s.quantity, 0),
ISNULL(s.price, 0),
@current_load_time
FROM stg.dbo.order_detail s
LEFT JOIN dw.dbo.dim_date d ON s.order_date = d.date
LEFT JOIN dw.dbo.dim_product p ON s.product_id = p.product_id
LEFT JOIN dw.dbo.fact_sales f ON d.date_key = f.date_key
AND p.product_key = f.product_key
WHERE s.load_time BETWEEN @last_run AND @current_run

SET IDENTITY_INSERT #fact_sales OFF

The temp table after it is populated:

fact_key  date_key  product_key  order_id  quantity  price  load_time
830923    2388      53076        352       1         12.99  30/10/2006
0         2485      92184        410       2         7.99   30/10/2006

Source table: order_detail

order_id  order_date  product_id  quantity  price  last_update
352       15/07/2006  BGCKQ       1         12.99  30/10/2006
410       30/10/2006  KMSCG       2         7.99   30/10/2006

Date dimension table: dim_date

date_key  date        day       month
2388      15/07/2006  Saturday  July
2485      30/10/2006  Monday    October

Product dimension table: dim_product

product_key  product_id  product_name description
53076        BGCKQ       Aike U2 IDE HD case USB 2
92184        KMSCG       Sony BP71 VIAO CG7 battery

Notes:

  1. Look how the dimensional keys are looked up in a single step, by joining the stage table to the dimension tables on the natural keys.
  2. Because they are LEFT JOINs, fact rows without corresponding dimensional rows will result in NULL dimensional keys. The ISNULL then converts these NULLs to zeros.
  3. Notice that we don't populate the load_time with getdate(), but with a variable named @current_load_time. This variable is populated with getdate() at the beginning of the loading batch and used by all processes in the batch. This is necessary so that in the event of failure, we know the point in time from which to restart the process. A sketch of how these batch variables could be set is given after these notes.
  4. Notice that we only load the rows between @last_run and @current_run. This is necessary if we keep the records on the stage table for a few days, i.e. if the stage contains more than 1 day's data. If we clear the data from the stage table as soon as we load it into the data warehouse, we don't need to specify this where clause.
  5. The example above uses only 2 dimensions, but in practice we would have to deal with more; 10 to 15 dimension key columns on a fact table are a common occurrence.
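
As a hypothetical illustration of notes 3 and 4, the batch variables could be set from a metadata table at the start of the batch, something like the sketch below (the meta.dbo.etl_process table and its columns are made-up names, not part of this example).

DECLARE @current_load_time datetime, @last_run datetime, @current_run datetime

SET @current_load_time = GETDATE()        -- set once, reused by every process in the batch

SELECT @last_run = last_successful_run    -- when this process last ran successfully
FROM meta.dbo.etl_process
WHERE process_name = 'fact_sales'

SET @current_run = @current_load_time     -- this run's extraction time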

Step 3. Update Existing Records

UPDATE f
SET f.date_key = t.date_key,
f.product_key = t.product_key,
f.order_id = t.order_id,
f.quantity = t.quantity,
f.price = t.price,
f.load_time = t.load_time
FROM dw.dbo.fact_sales f
INNER JOIN #fact_sales t ON f.fact_key = t.fact_key
WHERE t.fact_key <> 0 AND
(f.date_key <> t.date_key OR
f.product_key <> t.product_key OR
f.order_id <> t.order_id OR
f.quantity <> t.quantity OR
f.price <> t.price)

Source table: order_detail

order_id  order_date  product_id  quantity  price  last_update
352       15/07/2006  BGCKQ       1         12.99  30/10/2006
410       30/10/2006  KMSCG       2         7.99   30/10/2006

fact_sales after the update:

fact_key  date_key  product_key  order_id  quantity  price  load_time
830923    2388      53076        352       1         12.99  30/10/2006

Notes:

  1. Here we update the fact table, based on the data in the temp table. In this case the price was updated from 11.99 to 12.99.
  2. We only update the rows where the temp table's fact_key is not 0, i.e. the rows that already exist on the target fact table. The rows where the fact_key is 0 (they do not exist on the fact table yet) will be inserted into the fact table later on.
  3. Notice that when updating rows we update the load time column as well. The last lines of the WHERE clause specify which changes we want to pick up. In most cases we want to pick up changes on all columns, but sometimes there are legitimate business rules specifying that changes to certain columns are to be ignored.

Step 4. Insert New Records

INSERT INTO dw.dbo.fact_sales
(date_key, product_key, order_id, quantity, price, load_time)
SELECT date_key, product_key, order_id, quantity, price, load_time
FROM #fact_sales
WHERE fact_key = 0

Source table: order_detail

order_id  order_date  product_id  quantity  price  last_update
352       15/07/2006  BGCKQ       1         12.99  30/10/2006
410       30/10/2006  KMSCG       2         7.99   30/10/2006

Date dimension table: dim_date

date_key  date        day       month
2388      15/07/2006  Saturday  July
2485      30/10/2006  Monday    October

Product dimension table: dim_product

product_key  product_id  product_name description
53076        BGCKQ       Aike U2 IDE HD case USB 2
92184        KMSCG       Sony BP71 VIAO CG7 battery

fact_sales after the insert:

fact_key  date_key  product_key  order_id  quantity  price  load_time
830923    2388      53076        352       1         12.99  30/10/2006
916912    2485      92184        410       2         7.99   30/10/2006

Notes:

  1. It is a good practice to always declare the column names. This is important for flexibility and maintenance.
  2. Let the RDBMS maintain the fact_key. When setting up the data warehouse, set this column to IDENTITY(1,1).

Logging and Closing

In every step above we need to do error handling and logging. Error handling is important because if the loading fails at any step, we need to be able to recover from the failure. Logging is important so that we know exactly what happened at each step, e.g. how many records were processed.

At the end of the program, we should not forget to clean everything up, i.e. drop the temp table(s), follow the control protocol (e.g. set the process to the completed state), etc.

The above code shows how to do an upsert with native SQL, which is very fast and efficient. It is worth noting, though, that good dedicated ETL tools such as Informatica and Data Integrator have facilities to do in-memory lookups, which also perform very well. Disk-based lookups are definitely not the way to go here, as they are slow: there are a lot of SQL statements to execute, one for each lookup, and each of these statements 'touches' the disk, which is a costly operation.

Now that we have understood the basic steps in loading fact tables, let's familiarise ourselves with a few practical considerations such as loading partitioned fact tables, slim lookup tables, deduplication, loading snapshot and delta fact tables, and dealing with fact table indexes.

Slim Lookup Tables

When a dimension is very large, sometimes it is a significant performance improvement to provide a separate lookup table for key management. For example, if our customer dimension has 100 columns and 10 million rows, we could create a customer key lookup with only 3 columns: customer_id, customer_key and load_time, which would improve the performance of the dimensional key lookup in step 2 above. It is also more suitable for in-memory lookup. If you are using Oracle 8i or above, physical table structures such as IOTs (Index Organised Tables) are useful for performance. On SQL Server it's a good practice to put a clustered index on the natural key of slim lookup tables. On Teradata, we could use a Partitioned Primary Index (PPI) on these lookup tables.
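
As a minimal sketch (SQL Server syntax, with assumed data types), a slim key lookup for the customer dimension could be created like this, clustered on the natural key as suggested above.

CREATE TABLE dbo.lookup_customer
( customer_id  varchar(20) NOT NULL
, customer_key int         NOT NULL
, load_time    datetime    NOT NULL
)

CREATE UNIQUE CLUSTERED INDEX cix_lookup_customer
ON dbo.lookup_customer (customer_id)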

Example of Slim Key Lookup table: lookup_customer

customer_id  customer_key  load_time
493238       30012         02/10/2006
493240       30013         03/10/2006
493241       30014         03/10/2006

The load_time column is useful if we have a very long dimension table, e.g. 25 million rows. This is not uncommon when the data warehouse is used to drive CRM systems (Customer Relationship Management), especially in dot com companies dealing with online campaigns, be they subscription based or tactical campaigns. In this case we can specify a where clause to limit the selection on this slim key lookup, for example where load_time is within the last 2 years. This would cover, say, 98% of the lookups. For the remaining 2%, i.e. those older than 2 years, we then go to the main customer dimension table to find the customer_key.

Although it takes a bit of time to maintain the lookup table, overall we still save time, as the time saved by querying a slim key lookup table is a lot greater than the time required to maintain the lookup table itself. This is especially true for Master Data Management (MDM) dimensions such as customer, product and account, which are used all over the place in the data warehouse and data marts. The time dimension is not that long – 10 years is only about 3,650 rows – hence we tend not to have a lookup for the time dimension. For those of us who think of setting the grain of the time dimension to hours or minutes, the general advice is: don't. Either put a time stamp column on the fact table or have a time of day dimension. But that discussion (time of day dimension) is for another article.

Natural Key Changes

Sometimes a natural key which we thought was a good solid natural key can change and can therefore no longer be a natural key. Perhaps the best way to explain this is with an example. In an OLTP source system the customer table has a composite primary key: the combination of branch code and customer code. The customer code is only unique within a branch. Occasionally a branch is closed and all the customers in that branch are moved or assigned to another branch. For example, see the customer table below.

branch_code  customer_code  customer_name  other_attributes
1            1              Andrew
1            2              John
2            1              Steve
2            2              Paul

When branch 1 is closed and its customers are moved to branch 2, the customer table becomes:

branch_code  customer_code  customer_name  other_attributes
2            1              Steve
2            2              Paul
2            3              Andrew
2            4              John

If we use branch_code and customer_code as the natural key in the data warehouse, we need to handle this branch closure event with care. In a previous project we utilised another column in the customer table to identify a unique record, and we created a special table in the data warehouse to translate the changed natural key to the correct data warehouse key.
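
A hypothetical sketch of such a translation table (the names are illustrative, not from that project): it maps the old composite natural key to the new one, so that the key lookup in step 2 can still resolve the moved customers.

CREATE TABLE dbo.map_customer_natural_key
( old_branch_code   int      NOT NULL
, old_customer_code int      NOT NULL
, new_branch_code   int      NOT NULL
, new_customer_code int      NOT NULL
, effective_date    datetime NOT NULL
)

-- when branch 1 is closed and its customers move to branch 2:
INSERT INTO dbo.map_customer_natural_key VALUES (1, 1, 2, 3, '20061106')
INSERT INTO dbo.map_customer_natural_key VALUES (1, 2, 2, 4, '20061106')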

Unknown Records

If a fact table record does not have a corresponding dimension record, basically we have 2 choices: either we don't load that record into the fact table, or we load it but put 0 as the dimension key, referring to the unknown record in the dimension table. An unknown record is a record in the dimension table with a dimension key of 0 and all the attributes populated with a blank string, the number 0 or a low value date, depending on the data type. Name and description columns are usually populated with the word "Unknown". The load_time column is populated with the date the record was created. This date is normally equal to the date the data warehouse was set up, because the record is created by the data warehouse setup scripts. For example:

Product dimension table: dim_product

product_key  product_id  product_name  description  min_level  valid_until  load_time
0            0           Unknown       Unknown      0          1/1/1900     12/10/2004
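
A minimal sketch of the setup script that would seed this unknown record, assuming product_key is an IDENTITY column as in the earlier steps:

SET IDENTITY_INSERT dw.dbo.dim_product ON

INSERT INTO dw.dbo.dim_product
(product_key, product_id, product_name, description, min_level, valid_until, load_time)
VALUES (0, '0', 'Unknown', 'Unknown', 0, '19000101', '20041012')

SET IDENTITY_INSERT dw.dbo.dim_product OFF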

Here is an example of a row in the source table with a product_id that does not exist in the dimension table:

order_id  order_date  product_id  quantity  price  last_update
358       19/08/2006  BGCKZ       3         2.99   31/10/2006

Below is how fact_sales looks after that record is loaded. Notice that the product key is 0, which refers to the unknown record in the product dimension.

fact_key  date_key  product_key  order_id  quantity  price  load_time
830937    2424      0            358       3         2.99   31/10/2006

If we load it into the warehouse, we need to flag it to the data quality system, so that it can be reported and corrected on a subsequent load. If we don't load it, that should also be captured as a data firewall rule. More importantly, if we don't load it, the totals of the measures on the fact table would not be accurate: in the example above, the total sales amount for that order date (19/08/2006) would be 3 x £2.99 less than it should be. Because of this we tend to load fact table records that do not have a corresponding dimension record and set the dimension key to 0. This way the fact table measure totals are correct; it's just that the sale cannot be traced to a valid product record. All the other dimension keys are still valid, e.g. it can be traced to a valid date dimension record, a valid customer dimension record, etc. And above all, the referential integrity between the fact table and the dimension tables is still valid.

Deletion

There are 2 main reasons why we perform deletion on fact tables: 1. because the data extraction approach is fixed-period extraction, and 2. to accommodate reverse transactions such as cancellations. No 1 is usually physical deletion and no 2 is usually logical deletion. Let's discuss them one by one, using examples to clarify.

No 1, the loading approach. In one of the data warehousing projects I've been involved with, we had difficulties extracting a transaction table from the source system incrementally, because the date stamp was not very reliable. We tried a 3 week tolerance, i.e. where the last updated date or created date was within the last 3 weeks, but we still found some leakage, i.e. a few records were updated without the date stamp columns being updated. Please refer to this article for a discussion of incremental extraction and leakage. Luckily, the source system did not allow users to update records that were more than 6 months old. When a user tried to update a transaction record that was created more than 6 months ago, the system displayed an error, something like "You can not modify this record." So our approach for extracting this transaction table to the stage was to get 6 months of data every time, and consequently the approach for loading the fact table from the stage was to delete the rows on the fact table that exist in the stage, then reload all the stage records into the fact table. It was effectively the same as updating those records, but we found that it was quicker to delete and then reload. We identify the records on the fact table that exist in the stage by comparing all the dimension key columns, i.e. the composite natural keys of the source transaction table.
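
A minimal sketch of that delete-then-reload pattern, using the simplified order example from earlier (here order_id stands in for the composite natural key):

-- remove the fact rows that are about to be reloaded from the stage
DELETE f
FROM dw.dbo.fact_sales f
INNER JOIN stg.dbo.order_detail s ON f.order_id = s.order_id

-- reload them, looking up the dimension keys as in step 2
INSERT INTO dw.dbo.fact_sales
(date_key, product_key, order_id, quantity, price, load_time)
SELECT ISNULL(d.date_key, 0), ISNULL(p.product_key, 0),
       s.order_id, s.quantity, s.price, @current_load_time
FROM stg.dbo.order_detail s
LEFT JOIN dw.dbo.dim_date d ON s.order_date = d.date
LEFT JOIN dw.dbo.dim_product p ON s.product_id = p.product_id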

No 2, accommodating reverse transactions. Sometimes in the source system we have a main transaction table containing normal transactions and a smaller secondary transaction table containing cancellation records. We have 2 options for loading this kind of table into the fact table in the data warehouse: either we load the cancellation as a new record with negative measures, or we load the cancellation as a logical deletion. Reverse transactions such as refunds and credit notes need to be implemented as negative measures, but for cancellations we have these 2 options.

Each approach has its own advantages and disadvantages. Which one is better depends on the situation, for example whether we will be loading from that fact table into OLAP cubes, whether we will be creating a summary fact table, whether we need the historical dates, and whether the secondary source table contains complete data. If we decide to go for logical deletion, then for each cancellation record that exists on the secondary source table, we mark the logical delete column on the fact table record. All processes further downstream, such as loading into data marts, creating summary tables or loading into OLAP cubes, need to be aware of this column and handle it properly.
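
A minimal sketch of the logical deletion option, assuming the fact table has a logical_delete_flag column and the cancellations are staged in a (made-up) stg.dbo.order_cancellation table:

UPDATE f
SET f.logical_delete_flag = 'Y',
    f.load_time = @current_load_time
FROM dw.dbo.fact_sales f
INNER JOIN stg.dbo.order_cancellation c ON f.order_id = c.order_id
WHERE f.logical_delete_flag = 'N'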

Deduplication

When loading records from the stage into fact tables, sometimes we have duplicate records. For example: we declare that the grain of the fact table is 1 row per day, per product, per store. But we find this on the stage table of the fact table:

date       product_id  store_id  quantity
19/8/2006  BGCKZ       309       30
19/8/2006  BGCKZ       309       1

Do we add them up, take the maximum, or take the minimum, or take the average?
It is very important to understand why it happened in the first place. In this case I have always found it useful to go back to the business users or the source system expert. The second record could be an error, in which case we take the earliest record. Or it could be a correction, in which case we sum them up. Or there can only be 1 measurement per day, so when 2 or more records are found it must be a system error and they want us to take the average. Whatever the business rule is, we need to document it, obtain sign off and then implement it in the code. This process is called deduplication. It normally happens when the source system allows duplication, i.e. it does not have the necessary constraints in the database to enforce uniqueness at the fact table grain.
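
For example, if the agreed rule is "take the earliest record", a sketch on SQL Server 2005 could use ROW_NUMBER (the stage table name and the stage_row_id ordering column are assumptions):

SELECT date, product_id, store_id, quantity
FROM ( SELECT date, product_id, store_id, quantity,
              ROW_NUMBER() OVER (PARTITION BY date, product_id, store_id
                                 ORDER BY stage_row_id) AS rn
       FROM stg.dbo.store_inventory ) x
WHERE rn = 1

-- or, if the rule is to sum them up:
-- SELECT date, product_id, store_id, SUM(quantity) AS quantity
-- FROM stg.dbo.store_inventory
-- GROUP BY date, product_id, store_id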

Deduplication does not only happen in fact tables, but also in dimension tables, especially MDM dimensions such as customer and product. Deduplication can also occur when the fact table is loaded from 2 or more different source systems. Many data quality tools, such as Trillium, DataFlux and DQ Global, have facilities to deduplicate data.

Fact Table Indexes

Fact tables can be very large. It is not unusual for them to have millions of rows. To improve query performance, fact tables are normally indexed. The clustered index of a fact table is normally a composite of the dimensional keys, or the fact key column if there is one. For the non-clustered indexes, deciding which columns to index depends on how the fact table is used: whether it is queried directly by end users, used to populate OLAP cubes, or used by reports. SQL Profiler and the Index Tuning Wizard are useful for understanding which indexes would help query performance.

Indexing in data warehousing is quite different from OLTP. In data warehousing it is normal for tables to be indexed heavily to improve query performance; in OLTP, because the records are frequently updated, the indexing is lighter. It is normal for OLTP tables with 10 to 30 columns to have 1 to 3 indexes, whereas in data warehousing it is normal for tables with 10 to 30 columns to have 5 to 15 indexes. I'm talking about Kimball's dimensional model data warehouse here, not Inmon's 3rd normal form data warehouse in the Corporate Information Factory.

Both models of data warehouse are indexed a lot more than OLTP, but generally speaking dimensional model data warehouses are indexed more than the 3NF ones. Because data warehousing tables are heavily indexed, the indexes take much more space than in OLTP; it is normal in a data warehouse for the indexes to take more space than the data. As an example, in a data warehouse implementation I was involved in recently, the 3 main fact tables occupy 100 GB for data, but the indexes for these fact tables take 150 GB. This would never happen in an OLTP system.

If our fact table contains 100,000 rows or less, we just load the fact table with the indexes on; there is no need to worry about dropping indexes when loading. If our fact table contains more than 1 million rows, it may be quicker to drop the indexes before loading the data and recreate them afterwards. If the load is less than 10% of the fact table length, generally speaking we don't need to worry about dropping indexes; chances are we would decrease the loading performance by doing so. But if we are loading 20% or more (of the fact table length, i.e. the number of rows in the fact table) we may want to consider dropping and recreating indexes. It is very difficult to generalise this, as the performance differs depending on what indexes we have and what loading operations we perform. Hence we always need to test it in dev and QA to prove that our load is significantly improved by dropping and recreating indexes, before implementing it in production.
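
A minimal sketch of the drop-and-recreate pattern for one non-clustered index (the index and column names are illustrative; SQL Server 2005 syntax):

DROP INDEX ix_fact_sales_product ON dw.dbo.fact_sales

-- ... bulk load the fact table here ...

CREATE NONCLUSTERED INDEX ix_fact_sales_product
ON dw.dbo.fact_sales (product_key)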

If you use Oracle, bitmap indexes and Index Organised Tables (IOT) are very useful in data warehousing. If implemented correctly, for example on columns with very low cardinality, bitmap indexes can be 10 times faster than a B-tree index. IOTs are normally used for slim lookup tables. These features don't exist in SQL Server; as far as I'm aware SQL Server only supports B-tree indexes.

Fact Table Partitioning

Table partitioning is new in SQL Server 2005, but has been in Oracle since 1997: it was introduced in Oracle 8 and improved in versions 8i and 9i. In SQL Server 2000 we only had partitioned views, not partitioned tables. This article provides a good overview of partitioning in SQL Server 2005. Joy Mundy wrote an article about partitioning on SQL Server 2000. For Oracle, see this document and this manual. In DB2, table partitioning was introduced in version 8 in 2002, and was greatly improved in version 9, which was released in July 2006. Paul McInerney describes DB2 partitioning features for data warehousing in this article.

Fact table partitioning is basically dividing the fact table into several physical parts. Each part is called a partition. Each partition is ideally located in a different physical data file, and ideally each file is located on a different disk. For example, we can divide the fact table so that rows for January 2006 are located in partition 1, rows for February in partition 2, and so on. Each partition can be loaded separately in parallel. Partitioning can improve query performance and increase availability. We can add a new partition, drop an existing partition, truncate (empty) a particular partition, move a partition, split a partition, merge several partitions and exchange/switch partitions. All of these improve maintenance operations and greatly simplify administration tasks.

If our fact table contains more than 1 million rows, we definitely need to consider partitioning it. Even if the fact table contains hundreds of thousands of rows, we should still consider it. It can improve loading performance significantly (we are talking 10-20 times quicker here), because we load the new data into just 1 partition, which may be, say, 12 times smaller than the whole fact table. There are also techniques to load data into a new empty table with exactly the same structure as the fact table, then switch/exchange that new table with a partition of the main table. Queries can be a lot quicker too, again 10-20 times quicker. We can also partition indexes so that each part of the index serves only one table partition, which allows more processes to run in parallel. Partitioning can really be a life saver for a data warehouse.
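
A minimal sketch of the switch-in technique on SQL Server 2005: load a staging table that has exactly the same structure, indexes and filegroup as the target partition, then switch it in (the object names and partition number are illustrative).

-- fact_sales_load holds the new month's rows, constrained to that month's date range
ALTER TABLE dbo.fact_sales_load
SWITCH TO dbo.fact_sales PARTITION 12   -- the empty partition for the new month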

If you use Oracle, always use local partitioned indexes rather than global partitioned indexes. If you use SQL Server 2005, SQL Server automatically partitions the index using the same partition scheme and partitioning column as the table, unless you specify a different partition scheme. For more information on partitioned indexes in SQL Server, please refer to these guidelines.

Internal Data Warehouse Entities

Some data, such as performance targets or budgets, does not exist in any source system but needs to exist in the data warehouse; hence it is known as an internal data warehouse entity. For example, for each store we can calculate the out of stock percentage, i.e. how many times a required product is out of stock. There may be a consensus that it needs to be under 5%. This 5% is not written anywhere in any source system, but it will be required by the reports, so it needs to exist somewhere in the data warehouse.

We have 2 options for getting this performance target into the warehouse: 1) we can build an application which stores or persists the data in its own database, and from there ETL it into staging and into the warehouse, or 2) put it in a simple spreadsheet and import that into the warehouse. In either case we should not allow the data to be entered directly into the data warehouse, for example using direct SQL statements or through SQL Server Enterprise Manager. All data stored in the warehouse must go through the data quality / data firewall rules, so that any exceptions to the rules are recorded and reported.

Loading Snapshot and Delta Fact Tables

A snapshot fact table is a fact table that contains the condition of measures or statuses at a specific point in time. For example:

  • a fact table that contains actual inventory level for each product at 9 am every day
  • a fact table that contains the balance of every savings account in all branches on the first day of every month
  • a fact table that contains the details of all accounts that each customer has every day

There are 2 kinds of snapshot fact tables: periodic and accumulating. Periodic snapshot fact tables contain the condition or status taken at a regular interval, i.e. we take a snapshot of the source table at a regular interval. All 3 examples above are periodic snapshot fact tables. Accumulating snapshot fact tables show the status at any given moment; they are useful for tracking items with a certain lifetime, for example the status of order lines. Please refer to this Ralph Kimball article for more details about snapshot fact tables. An example of an accumulating snapshot can be found here.

How do we load periodic snapshot fact tables? We extract all records that satisfy the criteria from the source table at a certain period. For example: take all active rows from the account table, including the balance amount, automatically once a month on the first day. Loading an accumulating snapshot is rather different: we still take all records from the source table that satisfy the criteria, but then we update the existing fact table rows. For a purchasing accumulating snapshot, for example, every time there is a new piece of information about a particular purchase we update the fact table record; we only insert a new record into the fact table when there is a new purchase requisition.
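
A minimal sketch of a monthly periodic snapshot load for the account balance example (the table and column names are illustrative; @snapshot_date would be set by the batch to the first day of the month):

INSERT INTO dw.dbo.fact_account_balance_snapshot
(snapshot_date_key, account_key, balance_amount, load_time)
SELECT d.date_key, ISNULL(a.account_key, 0), s.balance_amount, @current_load_time
FROM stg.dbo.account s
INNER JOIN dw.dbo.dim_date d ON d.date = @snapshot_date
LEFT JOIN dw.dbo.dim_account a ON s.account_id = a.account_id
WHERE s.status = 'Active'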

A delta fact table is a fact table that we produce by comparing the condition of a source table at 2 different points in time, for example the account table. This table in the source system contains all customer accounts. Say on Monday we have 100,000 active accounts and on Tuesday we have 100,001 active accounts, i.e. 2 new accounts were opened, 1 account was closed and 3 accounts changed. Of the 3 accounts that changed, 1 changed its interest rate (from 5.75% to 5.50%), 1 changed its credit limit (from £3000 to £3500), and 1 changed its interest payment frequency (from daily to monthly). On the delta fact table there will be 6 new rows today (assume today is 6th Nov 2006). Legend for change_type: NA = new account, CA = closed account, IR = interest rate, CL = credit limit, IPF = interest payment frequency.

account_key  change_date  change_type  IR_before  IR_after  CL_before  CL_after  IPF_before  IPF_after
49912        6/11/2006    NA
26077        6/11/2006    CA
32109        6/11/2006    IR           5.75       5.50
19387        6/11/2006    CL                                3000       3500
29462        6/11/2006    IPF                                                    D           M

To populate the delta fact table, we download the source account table every day, compare today's copy with yesterday's copy, and enter the differences in the delta fact table. As an illustration of a practical implementation of delta fact tables, in a previous data warehousing project we used them to track customer retention and customer churn. This is useful for Customer Relationship Management and also for calculating the financial impact of customer churn.
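
A minimal sketch of the comparison for one change type (interest rate), assuming today's and yesterday's copies of the account table are both staged (the table names are illustrative):

INSERT INTO dw.dbo.fact_account_delta
(account_key, change_date, change_type, IR_before, IR_after)
SELECT ISNULL(a.account_key, 0), @today, 'IR', y.interest_rate, t.interest_rate
FROM stg.dbo.account_today t
INNER JOIN stg.dbo.account_yesterday y ON t.account_id = y.account_id
LEFT JOIN dw.dbo.dim_account a ON t.account_id = a.account_id
WHERE t.interest_rate <> y.interest_rate

-- new and closed accounts (NA, CA) would be detected with similar queries
-- using NOT EXISTS against the other day's copy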

Purging Or Pruning A Fact Table

Some people call it purging, some call it pruning. Purging or pruning a fact table is an activity to remove from the fact table rows that satisfy certain criteria. To give us an idea, below are some examples of purging criteria:

  1. older than 2 years
  2. older than 2 years and status is not active
  3. keep daily records for the last 4 weeks then Monday only for the last 2 years

Examples 1 and 2 are normally for transaction fact tables and example 3 normally applies to periodic snapshots. Purging becomes important when our data warehouse is a few years old; it can improve query and load performance significantly. If the fact table is partitioned, it is a common exercise to archive the oldest partitions (say those older than 5 years, partitioned by month) and then drop or truncate those partitions. Archiving can be done by exporting the partition (using exp in Oracle) or by backing up the partition.
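
A minimal sketch of purging criterion 1 on a non-partitioned fact table, assuming date_key increases with the calendar date; a partitioned fact table would archive and then truncate or drop the oldest partitions instead.

DELETE FROM dw.dbo.fact_sales
WHERE date_key < ( SELECT MIN(date_key)
                   FROM dw.dbo.dim_date
                   WHERE date >= DATEADD(yy, -2, GETDATE()) )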

Reloading Fact Table History

Soon after the data warehouse is in production, users may ask us to reload, say, the last 2 years of history from the OLTP source system. Say they want 2 years of history of the order tables. It's not difficult to reload the history; normally we just need to widen the time window in the ETL package. For example, if you use SSIS or DTS, we could set the time window in a table in the metadata database, where the Last Successful Extraction Time (LSET) and Current Extraction Time (CET) are kept. See this article for the concept of LSET and CET.

The problem with reloading fact table history is that we need to reload all the related dimensions too, and most OLTP systems do not keep this history. All orders in the last 2 years? Yes, no problem, they have those. But all customer, product and store details over the last 2 years? Often the customer and product history is not kept; their details are overwritten with new ones. The SCD concept is not known or implemented in OLTP systems, so that history is lost. The best approach is probably to load the latest state of the product, customer and store tables from the source system, and for any order records for which we can't find a match in the product, customer and store tables, reference them to the unknown records. This way the sums of the measures in the fact table will still be valid and the referential integrity will still be intact.
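
A hypothetical sketch of widening the extraction window to reload 2 years of order history, by setting the LSET back 2 years in the metadata table that drives the incremental extract (the table and column names are made up):

UPDATE meta.dbo.data_flow
SET LSET = DATEADD(yy, -2, GETDATE())
WHERE source_table = 'order_detail'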

Conclusions

Fact tables are loaded after the dimension tables are populated. As rows of fact tables are loaded, each of the key columns is converted from the natural key. When loading fact tables sometimes we need to deduplicate the rows. To load fact table record that does not have a corresponding dimension record we set the dimension key to 0, referring to the unknown record in the dimension table. Loading performance could be a lot higher if the fact table is partitioned. Consider removing fact table indexes before loading the data and recreate them afterwards. Purging or pruning fact tables can improve query and load performance significantly.

Vincent Rainardi
12th November 2006
