Data Warehousing and Data Science

21 February 2019

Data Files – Delimiter and Qualifier

Filed under: Data Warehousing — Vincent Rainardi @ 8:06 am

Suppose you get to specify the file spec for a data supplier who will send their data via FTP. Would you ask them to send Excel files or CSV files? Do you prefer a tab delimited text file or pipe delimited? Do you ask them to qualify their strings with double quotes?

Excel or CSV

I would prefer CSV files to Excel files, because the number of columns SSIS can read from an Excel file is limited to 255 (see link), whereas a CSV file has no limitation on the number of columns.

To overcome this limitation we need to import the file twice (or three times) using two data source components and then join them. But in order to join them we need an identity column, which becomes the join criteria (see link). We also need to be careful with performance when joining, because a merge join can be slow.
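The SSIS components themselves can't be shown as code here, but the idea behind the workaround, reading the same sheet twice with different column ranges and rejoining the two halves on an identity column, looks roughly like this in Python/pandas (the file name, column ranges and row_id column are made up for illustration):

    import pandas as pd

    # Read the same wide sheet twice, each time picking a subset of columns.
    # Column A is assumed to hold the identity column (row_id) in both reads.
    left = pd.read_excel("wide_file.xlsx", usecols="A:IU")      # first block of columns
    right = pd.read_excel("wide_file.xlsx", usecols="A,IV:ZZ")  # row_id + remaining columns

    # Rejoin the two halves on the identity column
    full = left.merge(right, on="row_id", how="inner")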

The second issue with Excel files is that the OLE DB provider must be installed on the server where SSIS is running; otherwise we get an error message saying "OLE DB provider Microsoft.ACE.OLEDB.12.0 is not registered" (see link). The "Use 32 bit runtime" setting on the SQL Agent job step and the Run64BitRuntime project property also affect this.

The other disadvantage of importing an Excel file is dealing with leading zeros in a string, such as "007", which automatically becomes the number 7.

There are also numbers in scientific notation, such as 1.23E21, which are imported as a string rather than a number, causing a failure. In a CSV file the same value is written as 1230000000000000000000 and imported as a number.
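If we control how the CSV file is produced, both problems can be avoided by writing the values out explicitly. A minimal sketch in Python (the values and file name are made up):

    from decimal import Decimal

    order_id = "007"                # written as text, so the leading zeros survive
    quantity = Decimal("1.23E21")   # expand scientific notation to plain digits

    with open("orders.csv", "w") as f:
        # "007" stays a 3-character string; 1.23E21 becomes 1230000000000000000000
        f.write(f"{order_id},{quantity:f}\n")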

The other limitation of Excel is that cells containing long strings, such as 2000 characters, are cut to 256 characters. This issue only happens in editions before Excel 2007.

Delimiter

What delimiter do you prefer: comma, pipe or tab?

The problem with a CSV or comma delimited file is that we can have commas in the data. This causes misalignment when the file is imported. For example, if there are 10 columns and one of them contains a comma, that row becomes 11 columns. This problem is known in the IT world as "delimiter collision", see here: link.

A comma in the data is very common in numeric fields, such as "123,000", or in countries like Indonesia which use a comma as the decimal separator, like this: "100,23".

We can enclose the value in double quotes, but why not eliminate the problem in the first place? That's why for text files people prefer a pipe or tab delimiter. A text file with a tab delimiter is known as tab-separated values, or TSV (link), but pipe delimited files are not known as PSV. We do have DSV though, which stands for delimiter-separated values (link).
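To make the collision concrete, here is a small Python sketch: a naive split on comma misaligns the row, whereas either qualifying the field or switching to a pipe delimiter keeps the three columns intact (the sample values are made up):

    import csv, io

    row = ["A123", "123,000", "London"]            # the amount contains a comma

    print(",".join(row).split(","))                # 4 columns: delimiter collision

    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(row)
    print(next(csv.reader(io.StringIO(buf.getvalue()))))   # 3 columns, qualified with quotes

    print("|".join(row).split("|"))                # 3 columns, no qualifier needed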

Pipe is generally preferable because of the perception that it is rarer than tab. Some strings may contain a tab, for example in a commentary field.

Qualifier

Qualifier means enclosing the data in something, such as double quotes. The double quote is the most common qualifier. Other qualifiers (far less common) are the single quote and brackets. There are 3 different types of brackets, i.e. [ ], < > and { }.

The issue with the double quote qualifier is that the data may contain double quotes, such as in commentary fields. This applies to the other qualifiers too. That's why it is ideal to use a pipe delimiter without a qualifier.
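For completeness, the usual CSV convention is to escape an embedded double quote by doubling it, whereas with a pipe delimiter and no qualifier the only check needed is that the data contains no pipes. A rough Python sketch (sample values made up):

    import csv, io

    comment = 'Customer said "urgent"'             # commentary field containing a double quote

    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(["A123", comment])
    print(buf.getvalue())                          # "A123","Customer said ""urgent"""

    assert "|" not in comment                      # pipe delimiter: just validate the data
    print("|".join(["A123", comment]))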

So that's my preferred format for a data file, if I can choose it: a pipe delimited text file without qualifiers. Unfortunately, in reality we very rarely get the opportunity to choose the format. Usually the vendor dictates the format because the same file goes to many customers.

20 February 2019

Transactional Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 9:11 pm

Transactional fact tables are not as popular as periodic snapshot fact tables. In this article I would like to compare transactional and periodic snapshot fact tables, list their advantages and disadvantages, and give two cases from my own experience where I needed to decide between the two. But first let me explain what these two types of fact tables are.

What is a transactional fact table?

In dimensional data modelling we have 3 types of fact tables: transactional, periodic snapshot and accumulating snapshot. I'm going to explain the first two below.

1. Transactional Fact Table

A transactional fact table is a fact table where:

  • Each event is stored in the fact table only once.
  • It has a date column indicating when the event occurred.
  • It has an identifier column which identifies each event.
  • The number of rows is the same as in the source table.

A classic example is when in the source system we have a sales table containing customer orders (for example, in a shop, a restaurant, or a factory). Say Monday we had 80 orders, Tuesday 70 orders, Wednesday 90, and so on. So on Monday night we load the 80 rows for Monday into the data warehouse. On Tuesday night we load the 70 rows for Tuesday, and on Wednesday night we load the 90 rows for Wednesday.

Figure 1. A transactional fact table loading 3 days of data

In the data warehouse we store the customer orders in the Sales Fact Table, which is a transactional fact table.

  • In this Sales Fact Table, every order is stored only once. The 70 orders on Tuesday are different to the 80 orders on Monday. And they are also different to the 90 Wednesday orders.
  • For this we use the order date column. In the above example, Monday, Tuesday and Wednesday are the order dates. This order date column indicates when the event occurred, i.e. when the order happened.
  • In this sales table we also have a sales identifier, such as order number if it is a shop, or ticket number if it is a restaurant.
  • On Wednesday night, after the warehouse load finishes, we have 80+70+90 = 240 rows, the same as in the source system.

In addition to inserts, we also have updates in the source table. In the example above, in addition to the 70 new orders on Tuesday, we also have updates to some of the 80 Monday orders. This is the key difference from the periodic snapshot fact table: a transactional fact table updates existing rows, and therefore loses some history.

Other ideal examples of a transactional fact table are the journal table in an accounting system, a trade table in an investment banking system, a premium table in an insurance system, a payment table in a payment system, a transaction table in a retail banking system, and a call table in a telecommunication system.

2. Periodic Snapshot Fact Table

A periodic snapshot fact table is a fact table where:

  • The whole source table is copied into the fact table regularly.
  • The same event is stored multiple times.
  • It has a snapshot date column indicating when a copy of the source table was created.

An ideal example of a periodic snapshot fact table is the bank account balance. At the end of each day, the balance of every customer account in the bank is stored in this account balance table. Say there were 20,000 accounts on Monday, 22,000 accounts on Tuesday and 24,000 accounts on Wednesday.

Figure 2. Periodic snapshot fact table

  • Every day we copy the whole content of the account balance table into the periodic snapshot fact table.
  • So on Monday night we stored 20,000 rows in the account balances periodic snapshot fact table, on Tuesday night 22,000 rows and on Wednesday night 24,000 rows. So an account is copied every day to the fact table, each day with potentially a different balance amount.
  • In the fact table we have a column called snapshot date. For all the rows created on Monday night, we set the snapshot date column to (for example) 11th Feb 2018. For the rows created on Tuesday night we set the snapshot date to 12th Feb 2018 and for the Wednesday rows we set the snapshot date to 13th Feb 2018.

Of course there are accounts which were closed on Tuesday and are no longer in the account balance table in the source system. In the fact table, the Monday data set contains these accounts, but the Tuesday data set doesn't, and neither does the Wednesday data set.

And there are accounts which were updated on Tuesday. These changes are reflected in the Tuesday snapshot in the fact table, which is different from their Monday rows.
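A minimal sketch of the daily snapshot load described above, with the source extract as a list of dictionaries and the fact table as a plain append-only list (in a real warehouse this would be an INSERT into the fact table; the sample accounts are made up):

    import datetime

    # Toy source extracts: the account balance table as it stands at the end of each day
    monday_rows  = [{"account": "ACC1", "balance": 500.0},
                    {"account": "ACC2", "balance": 120.0}]
    tuesday_rows = [{"account": "ACC1", "balance": 480.0}]    # ACC2 was closed on Tuesday

    fact_account_balance = []    # the periodic snapshot fact table

    def load_snapshot(source_rows, snapshot_date):
        # Copy the whole source table, stamping every row with the snapshot date
        for row in source_rows:
            fact_account_balance.append({**row, "snapshot_date": snapshot_date})

    load_snapshot(monday_rows, datetime.date(2018, 2, 11))
    load_snapshot(tuesday_rows, datetime.date(2018, 2, 12))
    # ACC1 now appears twice (once per snapshot date); the closed ACC2 only in Monday's set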

Other examples are the inventory table in manufacturing, the holdings table in fund management, the billing balance table in telecommunications, and the daily temperature table in a weather system.

Advantages and Disadvantages

The advantages of a transactional fact table are:

  • It mimics the source table
  • It is simpler as we only have one version of each event

The disadvantages of a transactional fact table are:

  • We don’t have the previous values
  • Updates are slow if the fact table is large, which is potentially a performance issue

Now let’s get on with the two cases of transactional fact table implementations.

Case 1: Retail Association

Let's suppose we are running a retail association. Every shop in the country reports its annual sales data on our website, within one month of its financial year end. In January 2019 there were 8,000 shops reporting their sales data on our website, and in February 2019 there were 9,000 shops. There are about 100,000 shops in total.

Every month we get a CSV file containing the previous month's data. The file is called Data_YYYYMMDD.csv and contains a column called Reporting Date. So:

  • The Data_20190131.csv file contains the 8,000 shops reporting in January 2019, with the Reporting Date column containing dates from 1st to 31st Jan 2019.
  • The Data_20190228.csv file contains the 9,000 shops reporting in February 2019, with the Reporting Date column containing dates from 1st to 28th Feb 2019.

Figure 3. Retail association monthly data file loaded into a transactional fact table

Because the February file does not give the January data in full (only the changed and new rows), we can't make this a periodic snapshot fact table. So in this case a transactional fact table is the only option.

The Heart of Transactional Fact Tables

In theory the loading is straightforward: the January file is loaded, then the February file, and so on. But in reality this is rarely the case. A few shops which were supposed to report in January were late and reported in February instead. So in the February file we also get a few rows for the January "late reporter" shops.

Some shops made an error in their January submission and corrected it in February. So in the February file we also have a few rows containing January corrections.

That is the heart of transactional fact tables: performing updates to the previous months’ data. In this case it goes back only one month, but in real cases it could be a few months.
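A minimal sketch of that load, assuming each shop is identified by a shop id and the fact table is keyed on it; in a real ETL this would be a MERGE or an update-else-insert (the column names and figures are made up):

    fact_sales = {}   # transactional fact table, keyed by shop_id

    def load_monthly_file(rows):
        # Insert new reports; late or corrected reports overwrite the existing row
        for row in rows:
            fact_sales[row["shop_id"]] = row        # upsert: the latest version wins

    january_file  = [{"shop_id": 1, "reporting_date": "2019-01-15", "sales": 100}]
    february_file = [{"shop_id": 2, "reporting_date": "2019-02-10", "sales": 200},
                     {"shop_id": 1, "reporting_date": "2019-01-15", "sales": 110}]  # correction

    load_monthly_file(january_file)
    load_monthly_file(february_file)   # shop 1 now holds only the corrected figure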

Case 2: Waste Reporting

Suppose we are working with a government department responsible for implementing the waste regulations in our country. Every company and government agency in the country needs to report the amount and types of its packaging waste, electronic waste, commercial waste and recyclables (e.g. plastic, paper, metal). This report is annual (once a year). Every town council also needs to report the amount and types of its household waste, also annually.

The timing of when they have to submit the report is different for every company, agency and council, depending on their financial year end. But the report covers the same year, which is the previous calendar year. So companies/agencies/councils with a financial year ending on 31st March 2018 need to report the waste produced from Jan to Dec 2017. Those with a financial year ending on 31st August 2018 also need to report the waste produced from Jan to Dec 2017.

The data we get from this central government department is monthly, i.e. every month we get a file. The file contains the waste produced in the previous calendar year. So the files we get throughout 2019 all contain the waste data for 2018, but they grow every month. For example, the Jan 2019 file contains 17,000 entities (i.e. companies, agencies and councils), the Feb 2019 file contains 32,000 entities, the March 2019 file contains 49,000 entities, and so on, as below.

Figure 4. Waste regulation monthly data file loaded into a transactional fact table

The February file contains all 17,000 January entities plus the 15,000 entities that report in February. But the February file also contains corrections for January. In other words, in the February file the waste data for some of those 17,000 January entities might have changed. For example, company A, which in January reported 11,000 kg of waste for the 2018 calendar year, might in February submit a correction changing this figure to 11,500 kg. Corrections may happen up to 6 months after the submission deadline, so a correction to January data can appear as late as the July file.

Now we need to decide whether a transactional fact table is the right approach for this case.

Figure 5. Comparing a transactional fact table and a periodic snapshot fact table

If we build this fact table as a periodic snapshot fact table, we will have every version of the data. We will have the January version, when Entity A was 11,000 kg, and we will also have the February and March versions, when Entity A was 11,500 kg.

If we build it as a transactional fact table, we will only have the latest version.

In my experience many people simply go for the snapshot model "just in case", i.e. in case we need the January version.

The other reason, which is more valid, is traceability. If we use a transactional fact table, the report we produced in Feb 2019 (based on the Jan data) will have different numbers from the report we produce in Apr 2019 (based on the Mar data). If we use the snapshot model, we can explain the differences, i.e. that in January Entity A was 11,000 kg, not 11,500 kg.

So in case 1 we are dealing with “incremental” monthly data files, whereas in case 2 we are dealing with “accumulated” monthly data files.

 

15 February 2019

Data Type

Filed under: Data Warehousing — Vincent Rainardi @ 6:53 pm

In the data warehouse, should we make the data types the same as in the source system? If in the source system the data type of the currency code column is varchar(50), in our data warehouse should we set the data type to varchar(50) or varchar(3)? If an ISIN column is defined as varchar(100) in the source system, should we set the data type to varchar(100) or varchar(12)?

The short answer is: set it to whatever the source system specifies.

The long answer is:

  1. If the source system specifies a ridiculously long data type such as varchar(500) for currency code, varchar(1000) for ISIN, or varchar(8000) for a description column, and the actual data in those columns is much shorter than the data type (e.g. 3, 12 and 200 characters respectively for the above 3 examples), then set the data type to fit the actual data length, plus some spare.
  2. Have a standard set of lengths in the warehouse, for example 1, 5, 10, 20, 50, 100, 200, and pick the next one up. For example, if the actual data length is 12 set the data type for that column to 20, and if the actual data length is 30 set the data type to 50.
  3. In the ETL, if the actual data exceeds the column data type, cut the data. For example, if the data type of the attribute column in the dimension is varchar(100) and the actual data is 101 characters, then cut it to 100 characters, load it into the dimension, and flag it in the ETL error log (see the sketch after this list).
  4. Talk to the source system team. Perhaps there is a reason why the designer of that source column set it wider than it needs to be. Perhaps when there is no ISIN they put another security identifier in there, such as currency + SEDOL, or an internal security identifier, which can be quite long?
  5. If the source system team confirms that the data they put into that column will not exceed 12 characters (even though the column is defined as 100 characters), then set the column in the warehouse to 12. But again, in the ETL cut it to 12 if it exceeds 12, and log it.
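A rough sketch of points 2 and 3, with a simple list standing in for the ETL error log (the standard lengths and column name are illustrative):

    STANDARD_LENGTHS = [1, 5, 10, 20, 50, 100, 200]
    etl_error_log = []

    def warehouse_length(actual_max_length):
        # Point 2: pick the next standard length up from the actual data length
        return next(n for n in STANDARD_LENGTHS if n >= actual_max_length)

    def fit_to_column(value, column_name, max_length):
        # Point 3: cut data that exceeds the column data type and log it
        if len(value) > max_length:
            etl_error_log.append(f"{column_name}: value truncated to {max_length} characters")
            return value[:max_length]
        return value

    print(warehouse_length(12))    # -> 20
    print(warehouse_length(30))    # -> 50
    print(fit_to_column("x" * 101, "attribute_1", 100))   # truncated to 100 and logged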

 
