Data Warehousing, BI and Data Science

11 March 2010

A dimension with only one attribute

Filed under: Data Warehousing — Vincent Rainardi @ 7:10 pm

If a dimension only has one attribute, is it worth putting the attribute in the fact table instead? Why should we bother creating a separate table for that dimension, with 2 columns: the dim key and that single attribute? That would only make the query slower, right? Consider a simple 1 dim, 1 fact scenario:
create table dim1 (dim1key int, attr1 varchar(20))
create table fact1 (dim1key int, measure1 money)

A typical star join query for that mart is:
select d.attr1, sum(f.measure1)
from fact1 f
inner join dim1 d on d.dim1key = f.dim1key
group by d.attr1

If we put the attribute in the fact table:
create table fact1 (attr1 varchar(20), measure1 money)
we could query the fact table directly, which means better performance:
select attr1, sum(measure1) from fact1 group by attr1

Well, there are 3 reasons why we should put that single attribute in its own dimension table:

  1. To keep the fact table slim (a 4-byte int rather than a 20-byte varchar)
  2. If the value changes in the source system, we don’t have to update the big fact table
  3. Yes it’s only 1 attribute today, but in the future we could have another attribute

There is an exception: snapshot month. We use a snapshot month column in a periodic snapshot fact table. A snapshot month in the form of an integer (201003 for March 2010) doesn’t violate the 3 points above:

  1. It is an integer, not char(6).
  2. The value never changes, March 2010 will still be March 2010 forever
  3. There will not be another attribute (snapshot year? there is no such thing)

How about snapshot date (20100311)? And snapshot week (201014)? Same thing, in my opinion. They can go directly on the fact table without having a dim table.
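A minimal sketch of deriving these integer keys in T-SQL (my own illustration; the iso_week convention is an assumption, use whichever week numbering your warehouse uses):

declare @d date = '2010-03-11'
select year(@d) * 100 + month(@d) as snapshot_month -- 201003
, year(@d) * 10000 + month(@d) * 100 + day(@d) as snapshot_date -- 20100311
, year(@d) * 100 + datepart(iso_week, @d) as snapshot_week -- depends on the week numbering convention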

If on the other hand it’s a varchar (or char because the length is constant), then it needs to go to its own dimension table. For example: ‘2010-03-11’ (snapshot date), ‘2010-14’ (snapshot week) and ‘2010-03’ (snapshot month).

There is one more thing I need to mention: junk dimension (Kimball’s term). We could (sometimes should) combine dimensions with one attribute into a junk dimension.
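As a sketch (table and column names are my own illustration): three single-attribute dimensions such as payment type, delivery flag and order channel could be combined into one junk dimension, populated with the cross product of the distinct values found in a staging table (stg_order here is hypothetical):

create table dim_order_junk
( order_junk_key int identity(1,1) primary key
, payment_type varchar(20)
, delivery_flag char(1)
, order_channel varchar(20)
)

-- populate with the cross product of the distinct values
insert into dim_order_junk (payment_type, delivery_flag, order_channel)
select p.payment_type, d.delivery_flag, c.order_channel
from (select distinct payment_type from stg_order) p
cross join (select distinct delivery_flag from stg_order) d
cross join (select distinct order_channel from stg_order) c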

VR, 30/9/13: one other factor we need to consider is the cardinality of the attribute column. Cardinality in database design means “degree of value uniqueness”. A DD (degenerate dimension) attribute with very high cardinality, such as order number, transaction ID or invoice ID, should probably be left in the fact table. This is the convention recommended in Kimball’s Data Warehouse Toolkit book (put DDs in the fact table). But if the DD has very low cardinality (status, for example), I would argue that it is beneficial to put it in a separate dimension table. Factor #3 above (there will be another attribute) would apply to a Status DD. Status could have 3 attributes, not just 1: account status code (A, C, S), account status (Active, Closed, Suspended), and account status description (explaining that Suspended means in dispute pending investigation).
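A sketch of that status dimension (names and column sizes are my assumption):

create table dim_account_status
( account_status_key int identity(1,1) primary key
, account_status_code char(1) -- A, C, S
, account_status varchar(20) -- Active, Closed, Suspended
, account_status_description varchar(200) -- e.g. 'in dispute pending investigation'
)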

SQL 2008 Data Warehousing Features

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 12:06 am
New data warehousing features introduced in SQL Server 2008:
  • Merge: faster loading
  • Change data capture: enables us to drip changes from the DW to DMs, or from the ODS/OS to the DW (see the sketch after this list)
  • Resource governor: queries can be given higher priority than loading/processing
  • Minimally logged insert: faster loading
  • Star join query/bitmap filter: faster reports
  • Backup compression: smaller loading batch window
  • Data compression: less disk space, better value for the investment
  • Parallel query on partitioned tables: faster queries on snapshot fact tables
  • Partition-aligned indexed views: faster queries on fact tables
  • Grouping sets: faster queries from reports
  • Change tracking: real-time, lightweight propagation of data changes into the DW
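As a sketch of the change data capture feature: CDC is enabled per database and then per source table (dbo.CustomerMaster is a hypothetical table name):

exec sys.sp_cdc_enable_db

exec sys.sp_cdc_enable_table
@source_schema = N'dbo'
, @source_name = N'CustomerMaster'
, @role_name = null -- no gating role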

10 March 2010

Merge in SQL Server Data Warehousing

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 7:20 pm

Merge is a Transact-SQL statement that enables us to perform insert, update and delete in one statement.

In SQL Server data warehousing, merge is typically used for:

  • Updating SCD type 1 or type 2 dimensions
  • Incremental loading of transaction or accumulating snapshot fact tables
  • Updating a data mart from a data warehouse, particularly dimension tables

Why use merge instead of SSIS?

  • It’s not really either/or: we still run merge from within SSIS for ETL (rather than in stored procedures), for code maintainability and the ability to reach various platforms.
  • Some projects use SPs for ETL (after landing the data in the DW SQL Server) because they believe it’s faster. But that is for another blog post.

The question should be: why use merge in SSIS instead of the SCD transformation or Lookup + OLE DB components?

  • Because merge is more efficient (better performance)
  • Because merge also does delete, in addition to insert and update
    (like replication but set based)

Example/demo:

dim2 is a dimension table with 3 columns (dimkey2, attr1, attr2) containing 1000 rows. In this example we want to update dim2 to reflect changes (update, insert, delete) in the source table, which is a master data table. First we create the master data table by copying dim2 and modifying it.

select * from dim2

-- Create the source table for dim2
select * into master2 from dim2
alter table master2 drop column dimkey2
select * from master2

-- Create a new row, delete a row and update a row in the source table
update master2 set attr2 = 'Value1a' where attr1 = 'Value1'
delete from master2 where attr1 = 'Value4'
insert into master2 (attr1, attr2) values ('Value4a', 'Value4a')

-- Now do a merge to update dim2
select * into dim2backup from dim2 -- backup first
merge dim2 as T
using master2 as S on (T.attr1 = S.attr1)
when matched and T.attr2 <> S.attr2 then update set T.attr2 = S.attr2
when not matched by target then insert (attr1,attr2) values (S.attr1,S.attr2)
when not matched by source then delete
output deleted.*, inserted.*, $action;
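The merge above is effectively a type 1 update (overwrite in place). For a type 2 dimension, a common pattern is to compose the merge with an outer insert: the merge expires the changed rows and inserts brand new members, and its output clause feeds the outer insert, which creates the new 'current' version of each changed member. A sketch, assuming dim2 also had valid_from, valid_to and is_current columns:

insert into dim2 (attr1, attr2, valid_from, valid_to, is_current)
select attr1, attr2, getdate(), null, 1
from
( merge dim2 as T
  using master2 as S on (T.attr1 = S.attr1 and T.is_current = 1)
  when matched and T.attr2 <> S.attr2 then
    update set T.is_current = 0, T.valid_to = getdate()
  when not matched by target then
    insert (attr1, attr2, valid_from, valid_to, is_current)
    values (S.attr1, S.attr2, getdate(), null, 1)
  output $action as merge_action, S.attr1 as attr1, S.attr2 as attr2
) as m
where merge_action = 'UPDATE';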

Optimisation:

  • It’s a join operation so index on join columns
  • Put the constraint on the when clause, rather than on the join condition
    (No ‘simple parameterisation’ for merge)
  • If you need to put constraints (literals) on the join condition, put the merge in a stored proc so the literals are parameterised. Or specify the ‘parameterization forced’ query hint.
  • Put the constraint/filter on a view (see the sketch after this list)
  • Be careful when using CTEs
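A sketch of the view approach (the 'GB%' filter is made up): the filter lives in a view over the dimension, and is repeated on the source, so the merge join condition stays a simple equality and rows outside the filter are protected from the 'when not matched by source then delete' clause:

create view dim2_gb as
select dimkey2, attr1, attr2
from dim2
where attr1 like 'GB%'
go

merge dim2_gb as T
using (select * from master2 where attr1 like 'GB%') as S
on (T.attr1 = S.attr1)
when matched and T.attr2 <> S.attr2 then update set T.attr2 = S.attr2
when not matched by target then insert (attr1, attr2) values (S.attr1, S.attr2)
when not matched by source then delete;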

5 March 2010

Transaction Dimension

Filed under: Data Warehousing,SQL Server — Vincent Rainardi @ 7:12 pm

Pure Kimball practitioners usually leave degenerate dimension (DD) attributes in the transaction fact table. But in some data warehouse implementations people place these attributes in a dimension called the Transaction Dimension. It usually has the same grain as the transaction fact table (hence it’s discouraged) and contains DD attributes such as the transaction number. In this article I will discuss the transaction dimension: what it is, the pros and cons, and how to implement it.

What is a Transaction Dimension?

A Transaction Dimension is a dimension table containing the textual details of a business transaction that are left over in the source transaction table after all the other attributes have been put into other dimensions. Consider a classic retail scenario: a supermarket. The source transaction tables are the order table and the order line table. The transaction fact table grain is order line, in other words: 1 row in the fact table for each row in the order line table. Columns from these 2 tables are put into 4 main dimensions: customer, product, store and date.

What’s left on these tables are the order ID, order line ID and scan timestamp columns. We could either: a) leave these columns in the fact table, b) put them in the transaction dimension, or c) throw them away.

If we put the columns in the fact table, we will find that the order ID is repeated across several fact rows, because the order ID is the same for all the lines in the order. In some cases we may find other attributes are also repeated across fact rows, for example: cashier, price scheme, advertisements, discount scheme, exchange rate profile, notes, special condition and scan timestamp. Generally, if an attribute is repeated across fact rows, it shouldn’t be put in the fact table; it should go in a dimension table.

Some of these attributes have low cardinality (10,000 distinct values or fewer) and hence should go into their own dimensions (not the transaction dimension), for example: discount scheme dimension, cashier dimension and time of day dimension (truncated to minute or second). But some of them have very high cardinality, which means that they are different for almost every transaction, for example: order ID, order line ID, notes, special condition and scan timestamp. These high cardinality attributes can be put in the transaction dimension.
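A sketch of such a transaction dimension (names and column sizes are my own illustration):

create table dim_transaction
( transaction_key int identity(1,1) not null
, order_id bigint
, order_line_id bigint
, notes varchar(500)
, special_condition varchar(100)
, scan_timestamp datetime
)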

Reasons for Creating a Transaction Dimension (the pros)

There are 4 reasons why we create a transaction dimension:

  1. Performance. By creating a transaction dimension we essentially move the DD columns to a dimension table, away from the ‘centrally traffic heavy’ fact table. DW queries don’t touch those DD columns unless they need to. If you have a few DD attributes totalling, say, 30 bytes, they will be replaced by a 4-byte int column. That makes the fact table slimmer, which is better for queries.
  2. Snapshot. If you make a periodic snapshot fact table, the saving is even greater. Say your monthly snapshot fact stores the last 10 years of data = 120 months = 120 copies of the trans fact table. Rather than repeating the DDs 120 times, they are specified once in the transaction dimension, and all that is left on the fact table is a single slim int column: the transaction key.
  3. Greater grains. Some fact tables have grains greater than the transaction. For example, a financial transaction (say a payment received from a customer) could be posted into 4 different accounts in the general ledger fact table: 1 bank account, 2 income accounts and 1 VAT account (see the sketch after this list). That single financial transaction becomes 4 G/L lines, i.e. 4 fact rows. Because it’s a single financial transaction, it only has 1 row in the transaction dimension; all 4 fact rows will have the same value in the transaction key column. Just like a normal dimension, the granularity of the transaction dimension is lower than the granularity of the fact table, e.g. if the fact table contains 10 million rows, the transaction dimension contains 3 million rows. This is because 1 transaction creates 3 to 4 rows in the fact table but only 1 row in the transaction dimension.
  4. Linked/related transactions. We may also find that some transactions are related. For example, in the mobile industry, the 3rd call that a customer makes that day may get a cheaper tariff, say half price compared to the first 2 calls. Or, in retail, a purchase of a kitchen might need to be created as 2 related orders (delivered on the same date), because the worktop is made to order. In these situations, rather than creating a ‘related order’ column on the fact tables, it is better to create the ‘related order’ columns on the transaction dimension. Because, a) an order can consist of many fact rows (1 row per order line), so the “related order number” would be duplicated across these fact rows, and b) by putting the “related order” column in the transaction dimension, the fact table becomes slimmer and faster.
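To illustrate reason 3, a sketch with made-up keys and amounts (fact_gl is hypothetical): one payment becomes 4 G/L fact rows, all pointing at the same row in the transaction dimension.

insert into fact_gl (transaction_key, account_key, amount) values
  (1001, 10, 117.50) -- bank account
, (1001, 20, -80.00) -- income account 1
, (1001, 21, -20.00) -- income account 2
, (1001, 30, -17.50) -- VAT account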

The Cons

OK, those are the benefits. What are the disadvantages? In what situations is it not suitable?

  1. If your fact table is a transaction fact table and the grain of the transaction dimension is the same as the grain of the fact table, and you only have 1 DD column, perhaps it’s better to leave the DD in the fact table. You don’t gain much space or speed by putting it in a transaction dimension.
  2. If the mart is only used to feed SSAS, then there is little point in having a physical transaction dimension table, because in SSAS you can create the transaction dimension as a “fact dimension”: SSAS creates it “on the fly” from the fact table.
  3. If you use the transaction dimension as a place to put many attributes (as opposed to storing them in the main dimension), with the argument “well, that’s the value of that attribute when the transaction happened, as opposed to the value of the attribute now”, this is not right; use a type 2 SCD for this.
  4. Any dimension with the same grain as the fact table (like the transaction dimension) is questionable: do we really need this dim? Is it really going to be used? Wide textual columns such as notes: are they going to be used? If the purpose of the mart/warehouse is analytics, will that free-text notes column ever be used in the OLAP?
  5. Any dimension with the same grain as the fact table can potentially cause performance issues (unless your fact table is small). Say your fact table is 10m rows. The transaction dimension is also 10m rows. Joining 10m to 10m could potentially be slow, especially if the physical ordering of the trans dim is not on the joining column.

How do we use it?

A very large percentage of data warehouse queries is on the main dimension (banking: account, retail: product, utilities: customer, insurance: policy, healthcare: patient, airline: ticket, telecom: call, etc.) and the date dimension. Only a very small percentage of data warehouse queries is on the transaction dimension, probably less than 1%. This is understandable, as by definition a data warehouse is not a transactional system.

If you find that transactional queries are more than 10%, ask yourself this question: why are these queries not done against the operational systems or the ODS? In some cases the answers are genuine: because we don’t have an ODS. Or worse: because we don’t have a reporting facility in the operational system; our operational system is custom built, and when we built it we decided that all reporting would be done from the warehouse. It doesn’t mean that they are the right reasons though; quite the other way around.

Anyway, I digress. Back to the topic of this section: how do we use a transaction dimension? A typical way of using a transaction dimension is to constrain the star query to a particular transaction ID, then go to the fact table via the transaction key to get the necessary rows.
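For example, a sketch using the hypothetical dim_transaction above and a made-up fact table:

select f.quantity, f.sales_amount, t.order_id, t.order_line_id
from dim_transaction t
inner join fact_sales f on f.transaction_key = t.transaction_key
where t.order_id = 123456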

Another way of using a transaction dimension is to provide a key to the main dimension, e.g. customer or account, so you can go from customer/account dimension directly to the transaction dimension without going through the fact table.

Can we have 2 transaction dimensions? Rare but possible, e.g. in the case of 2 grains (higher level and lower level), or when you have 2 different types of transactions, e.g. premium vs claim in insurance, or mortgage vs checking transactions in banking. In the latter case, each type of transaction is normally located in a different fact table: fact table 1 uses transaction dimension 1, while fact table 2 uses transaction dimension 2.

Indexing and Loading

Create nonclustered indexes on attributes that have a) low density (many distinct values), and b) high selectivity (the where clause / query predicate is specific, e.g. selecting only 1 row out of 1 million). Create a clustered index on the transaction key, just like any other dimension. This is very important, as the transaction dimension is almost always joined to from the fact table; this index will be used for that join.
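A sketch of those indexes on the hypothetical dim_transaction above:

create unique clustered index ix_dim_transaction_key
on dim_transaction (transaction_key)

create nonclustered index ix_dim_transaction_order
on dim_transaction (order_id, order_line_id)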

Consider partitioning the transaction dimension if it contains more than 1 million rows. The question is always “on what key?” A range (left or right doesn’t matter) of transaction ID is a good candidate, as transaction ID will frequently be used in the query predicate. If you do partition it, you still need the clustered index on the transaction key (as the second key, because the first key must be the partitioning key). It is possible to partition on the transaction key (the dim’s PK), but maintaining the partitioning key would require quite a bit of work.
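A sketch of that partitioning (boundary values are made up); the clustered index now leads with the partitioning key, with the transaction key second:

create partition function pf_order_id (bigint)
as range right for values (10000000, 20000000, 30000000)

create partition scheme ps_order_id
as partition pf_order_id all to ([primary])

create clustered index ix_dim_transaction_part
on dim_transaction (order_id, transaction_key)
on ps_order_id (order_id)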

99.9% of the time the transaction dimension is type 1, not type 2. This is because the attributes do not change, as they are DD attributes. If you do make it type 2, it is beneficial to add an “is current” flag; we can’t rely on the validity date columns for query performance.

The daily loading routine is to insert the new rows from the source transaction table into both the fact table and the transaction dimension, as sketched below. The identity column (clustered) on the PK makes it a quick insert. Updates and deletes are very unlikely, as the transaction dimension only contains degenerate dimension attributes.
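A sketch of that daily insert (stg_order_line is a hypothetical staging table fed from the source transaction table):

insert into dim_transaction (order_id, order_line_id, scan_timestamp)
select s.order_id, s.order_line_id, s.scan_timestamp
from stg_order_line s
where not exists
( select 1 from dim_transaction d
  where d.order_id = s.order_id
  and d.order_line_id = s.order_line_id
)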

Like everything else in data warehousing, for Transaction Dimension there are a lot of “it depends” so I’d welcome any discussion, correction and opinion.

Source: Statistics Used by the Query Optimizer in Microsoft SQL Server 2005, by Eric Hanson and Lubor Kollar

3 March 2010

Reasons for Creating a Data Mart from a Data Warehouse

Filed under: Data Warehousing — Vincent Rainardi @ 11:45 pm

There are several reasons why we want to create a data mart from a data warehouse. But before I list them down I would like to clarify the meaning of “data mart” and the meaning of “data warehouse”.

Data mart: in this article I am referring to the Kimball meaning of the word “mart”, which means that it is in dimensional model: it consists of a few fact tables (usually 1 to 5, but not more than 10) and the necessary dimensions (usually about 5 to 10). Quoting my own definition in my book, a dimensional data mart is a group of related fact tables and their corresponding dimension tables containing the measurements of business events, categorised by their dimensions.

The other possible meanings of “data mart” in data warehousing practice (which I’m not talking about in this article) are:

  1. An OLAP cube. Usually quite small in size, created from 1 or 2 fact tables (and the necessary dimensions). Probably not at the lowest grain but at a higher/summarised level.
  2. A relational database that is not dimensional (not fact & dim). Usually it is not in 3rd normal form either; just a collection of tables without (necessarily) having referential integrity / foreign key relationships between them. It’s quite a small database; there are not many tables, probably between 5 and 15.

Whereas for the term “data warehouse”, it’s as I defined it in my book: a data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source systems.

So, with those meanings of data mart and data warehouse in mind, the reasons for creating a data mart from a data warehouse are:

  1. Because we need to create a static copy of a few fact tables (with their corresponding dimensions) which does not change every day like the warehouse, for analysis purposes. Usually it’s only temporary: the users use the data in the mart for a project which does not last long, say a few days to a few weeks, and after that the data mart is deleted.
  2. Because they need a mart for data mining. Processing a data mining model (“training”, “predictive analysis”) creates heavy work load and we don’t want it to affect the performance of the central/core warehouse.
  3. To ease the query workload on the warehouse. Say 50% of the reports are querying one particular fact table (and its dimensions). To lighten the burden on the warehouse we create a copy of that fact table (and its dimensions) in a separate database located on a different server, and point some of the reports that way. This mart is refreshed/updated every day.
  4. Because they want to change the data to simulate some business scenarios. They can’t change the data in the core warehouse (it would affect everybody!) so we provide a data mart for them to run their scenarios in.
  5. To support the query performance of the reports (the mart is structured differently, indexed differently). Probably putting additional tables in it (summary tables/fact at higher grain, or list tables). The mart is usually read only, refreshed/updated every day.
  6. Because the data warehouse is in normalised format (a kind of NDS, probably on a PDW), we need to build a dimensional mart for the analytics/OLAP to get the data from.