Data Warehousing and Data Science

19 February 2011

Where to Store the Business Logic

Filed under: Business Intelligence, Data Warehousing — Vincent Rainardi @ 10:29 am

If we have measure C, which is calculated from measure A and measure B, we need to decide whether to store measure C physically in the fact table, or to calculate it on the fly in the report. The term for this is “physicalise”, as in “do we physicalise that measure?”.

There are a few considerations when deciding this:

  1. For a ratio or percentage, do not store it in the fact table. We need to calculate it on the fly, whether in a cube or in a BI report, so that users can view the ratio correctly at any level of the hierarchy. I’ve written about this in the past (link), particularly with regard to an OLAP cube.
  2. If it is an aggregation, do not store it in the fact table either, but calculate it on the fly. OLAP cubes are clearly better at aggregation than relational reports; after all, that’s what cubes are for: aggregating data. If you store it in a fact table, you have to “pin” the calculation to a particular level, so users won’t be able to drill up and down the hierarchy viewing that aggregate measure. The first SQL sketch after this list illustrates points 1 and 2.
  3. There is the classic argument of “report performance” versus “development time”. This thinking is inherited from OLTP reporting rather than from data warehousing/BI. The idea is to calculate the “heavy” things in advance and store the result as a table. By “heavy” they usually mean aggregations and joins. If it is A+B we don’t have an issue, but if it is joining 15 tables, or aggregating 1 billion rows down to 100 rows with a “group by”, the SQL could run for an hour, making the report useless. Hence the idea of performing that calculation in advance and storing it in a “result” table (or column). The issue is that when the underlying data changes, the result table or column must be updated. So when deciding this, always look at how long it takes to update the result table/column.
    These days RDBMS technology gives us “computed columns”, which means that for simple logic like a modulo or a subtraction we can store the measure as a computed column. The database engine calculates these columns on the fly, as and when the rows are updated/inserted, and they are very efficient. I’ve seen a 200-million-row fact table with 2 computed columns running on SQL Server 2008. The overhead was negligible compared to before we created those 2 columns, as in: no detectable performance degradation when the ETL runs to update that 200-million-row fact table. Of course, it was an incremental update, not a truncate-reload. If you are doing a truncate-reload on a 200-million-row fact table, you should seriously look at incremental extraction, see my article here. The second sketch after this list shows what the computed-column syntax looks like.
  4. There are several places within the DWBI system where you can “physicalise” a calculation or a piece of business logic: a) in the ETL, storing it physically as a measure, b) in a view, whether materialised or not, c) in stored procedures, d) in the BO universe, the Cognos metadata layer, the MicroStrategy metadata layer or the SSAS DSV, e) in the BO/SSRS reports, Cognos Transformer, CubePlayer, Novaview, QlikView, Tableau or SSAS calculations. Each place has advantages and disadvantages, so we need to choose carefully.
    There are three things you might want to consider when selecting the place: a) the earlier it is in the stream, the more accessible it is to downstream components. For example, if you put the business logic in a database view, then both the reports and the cube can access it, but if you put it in the report, the cube can’t access it; the third sketch after this list shows the view option. b) If it is a ratio or an aggregate, it is not suitable to calculate it in the cube or the reports; it needs to be in the ETL or a view. c) In the ETL you can do multiple passes or row-by-row processing, so you have more calculation ability, whereas in a view, a metadata layer or a report you are limited to a single-pass select.
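
To make points 1 and 2 concrete, here is a minimal T-SQL sketch. The fact and dimension tables (FactSales, DimProduct) and their columns are hypothetical, not from any real schema. The first query shows what goes wrong when a row-level ratio is physicalised and then rolled up; the second derives the ratio from the additive base measures at query time, so it stays correct at whatever level the user groups by.

    -- Hypothetical schema: FactSales (ProductKey, SalesAmount, CostAmount,
    -- ProfitAmount, MarginPct) and DimProduct (ProductKey, ProductCategory).

    -- Misleading: rolling up a stored row-level ratio.
    -- AVG weights every row equally, so small orders distort the category figure.
    SELECT d.ProductCategory,
           AVG(f.MarginPct) AS MarginPct_Wrong
    FROM   FactSales f
    JOIN   DimProduct d ON d.ProductKey = f.ProductKey
    GROUP  BY d.ProductCategory;

    -- Correct: aggregate the base measures first, then divide.
    -- Changing the GROUP BY (product, category, brand, ...) keeps it right at any level.
    SELECT d.ProductCategory,
           SUM(f.ProfitAmount) / NULLIF(SUM(f.SalesAmount), 0) AS MarginPct
    FROM   FactSales f
    JOIN   DimProduct d ON d.ProductKey = f.ProductKey
    GROUP  BY d.ProductCategory;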
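
For the computed columns mentioned in point 3, below is a sketch of the SQL Server syntax. The table and column names are made up for illustration; the point is only that the A+B style logic lives in the table definition rather than in the ETL or the report. A PERSISTED computed column is stored and maintained by the engine on insert/update (and can be indexed), while a non-persisted one is evaluated when it is read.

    -- Hypothetical fact table; names are illustrative only.
    CREATE TABLE dbo.FactOrderLine
    (
        OrderLineKey  bigint        NOT NULL,
        SalesAmount   decimal(18,4) NOT NULL,
        CostAmount    decimal(18,4) NOT NULL,
        -- Evaluated when queried; nothing extra is stored.
        ProfitAmount  AS (SalesAmount - CostAmount),
        -- Stored and kept up to date by the engine on insert/update; can be indexed.
        ProfitAmountPersisted AS (SalesAmount - CostAmount) PERSISTED
    );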
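
And for point 4a, a sketch of the “business logic in a view” option, again with hypothetical object names: the profit calculation is defined once in the view, and both the relational reports and the cube (through its data source view) can select from it.

    -- Hypothetical view; reports and the cube read the same logic from one place.
    CREATE VIEW dbo.vwFactSales
    AS
    SELECT f.DateKey,
           f.ProductKey,
           f.SalesAmount,
           f.CostAmount,
           f.SalesAmount - f.CostAmount AS ProfitAmount  -- the business logic, defined once
    FROM   dbo.FactSales f;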