Data Warehousing and Business Intelligence

19 February 2011

Where to Store the Business Logic

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 10:29 am

If we have a measure C that is calculated from measures A and B, we need to decide whether to store measure C physically in the fact table, or to calculate it on the fly in the report. The term for this is “physicalise”, as in “do we physicalise that measure?”.

There are a few considerations when deciding this:

  1. For a ratio or percentage, do not store it in the fact table. Calculate it on the fly, whether in a cube or in a BI report. This enables users to view the ratio correctly at any level in the hierarchy (a ratio sketch follows this list). I’ve written about this in the past (link), particularly with regard to an OLAP cube.
  2. If it is an aggregation, do not store it in the fact table; calculate it on the fly. OLAP cubes are clearly better at aggregation than relational reports. After all, that’s what cubes are for: to aggregate data. If you store it in a fact table, you have to “pin” the calculation to a particular level, so users won’t be able to drill up and down the hierarchy viewing that aggregate measure.
  3. There is a classic argument of “report performance” versus “development time”. This thinking is inherited from OLTP reporting rather than from data warehousing/BI. The idea is to calculate the “heavy” things in advance and store the result as a table. By “heavy” they usually mean “aggregation” and “join”. If it is A+B we don’t have any issue, but if it is joining 15 tables, or aggregating 1 billion rows down to 100 rows using “group by”, then the SQL could run for an hour, making the report useless. Hence the idea of performing that calculation in advance and storing it in a “result” table (or column). The issue is that when the underlying data changes, the result table or column must be updated. So when deciding this, always look at how long it takes to update the result table/column.
    These days RDBMS technology gives us “computed columns”: for simple logic such as modulo or subtraction, we can store the measure as a computed column (a computed-column sketch follows this list). The database engine calculates these columns as and when rows are inserted or updated, and they are very efficient. I’ve seen a 200-million-row fact table with 2 computed columns on SQL Server 2008; the overhead was negligible, as in no detectable performance degradation when the ETL ran to update that fact table. Of course, that was an incremental update, not a truncate-reload. If you are doing a truncate-reload on a 200-million-row fact table, you should seriously look at incremental extraction; see my article here.
  4. There are several places within the DWBI system where you can “physicalise” a calculation or a piece of business logic: a) in the ETL, storing it physically as a measure, b) in a view, whether materialised or not, c) in stored procedures, d) in the BO universe, the Cognos or MicroStrategy metadata layer, or the SSAS data source view, e) in BO/SSRS reports, Cognos Transformer, CubePlayer, Novaview, QlikView, Tableau or an SSAS calculation. Each place has advantages and disadvantages, so we need to choose carefully.
    There are three things to consider when selecting the place: a) The earlier it is in the stream, the more accessible it is to downstream components. For example, if you put the business logic in a database view, then both the reports and the cube can access it (a view sketch follows this list), but if you put it in the report, the cube can’t access it. b) If it is a ratio or an aggregate, then it is not suitable to calculate it in the cube or reports; it needs to be in the ETL or a view. c) In the ETL you can do “multiple passes” or “row-by-row processing”, so you have more calculation ability, whereas in a view, the metadata layer or a report you are limited to a one-pass select.
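
To illustrate point 1, here is a minimal sketch using a hypothetical FactSales table with SalesAmount and CostAmount columns (the table and column names are assumptions for illustration only). Because the components are stored and the division is done at query time, the margin percentage is correct at whatever level the report groups by; averaging a row-level ratio column would not be.

    -- Hypothetical fact table: FactSales (DateKey, ProductKey, StoreKey, SalesAmount, CostAmount)
    -- Correct: aggregate the components first, then divide at the level the report needs.
    SELECT d.CalendarMonth,
           p.ProductGroup,
           SUM(f.SalesAmount - f.CostAmount)
             / NULLIF(SUM(f.SalesAmount), 0) AS MarginPct   -- NULLIF avoids division by zero
    FROM FactSales f
    JOIN DimDate d    ON d.DateKey    = f.DateKey
    JOIN DimProduct p ON p.ProductKey = f.ProductKey
    GROUP BY d.CalendarMonth, p.ProductGroup;

    -- Wrong: AVG(MarginPct) over a physicalised row-level ratio column. A small sale and a
    -- very large sale would then carry equal weight, so the figure changes with the grain.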
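
For point 3, a minimal sketch of a computed column, assuming a hypothetical FactOrder table with MeasureA and MeasureB. The non-persisted form is evaluated at query time; the PERSISTED form is stored when rows are inserted or updated and can be indexed.

    -- MeasureC is maintained by the database engine; no ETL code is needed for it.
    ALTER TABLE dbo.FactOrder
        ADD MeasureC AS (MeasureA - MeasureB);              -- evaluated at query time

    -- PERSISTED stores the value on insert/update (a small ETL overhead, none at query time)
    -- and allows the column to be indexed.
    ALTER TABLE dbo.FactOrder
        ADD MeasureC_Persisted AS (MeasureA - MeasureB) PERSISTED;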
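
And for point 4, a sketch of option b), putting the logic in a view so that both the cube and the reports read the same calculation (again, the table and column names are assumed for illustration):

    -- The business logic is defined once; the cube's data source and the relational
    -- reports both select from the view, so the calculation is never duplicated.
    CREATE VIEW dbo.vwFactOrder
    AS
    SELECT f.DateKey,
           f.ProductKey,
           f.CustomerKey,
           f.MeasureA,
           f.MeasureB,
           f.MeasureA + f.MeasureB AS MeasureC   -- the physicalised business logic
    FROM dbo.FactOrder f;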

4 Comments »

  1. What if we have a calculated column “Performance” that is derived by passing 3 variables (ID, FromDate and EndDate) to a function? How do you store the calculated column in your fact table, since FromDate and EndDate must be provided by the user, i.e. dynamic date values?

    Comment by Obinna Ekeanyanwu — 28 May 2013 @ 2:27 pm | Reply

    • Since FromDate and EndDate are provided at run time by the user, we don’t store/physicalise the Performance measure/metric. We calculate it on the fly in the BI application / front end.
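      For example, a minimal sketch of the report query, assuming a hypothetical scalar function dbo.fnPerformance and report parameters @FromDate and @EndDate supplied by the user at run time:

        -- Performance is computed per request from the user-supplied dates,
        -- so nothing is physicalised in the fact table.
        SELECT f.ID,
               dbo.fnPerformance(f.ID, @FromDate, @EndDate) AS Performance
        FROM dbo.FactPortfolio f;   -- hypothetical fact table name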

      Comment by Vincent Rainardi — 28 May 2013 @ 3:27 pm | Reply

  2. “If it is a ratio or an aggregate, then it is not suitable to calculate it in the cube or reports.”
    Doesn’t this contradict the beginning of your article?

    Comment by Cnuk — 20 July 2014 @ 7:18 am | Reply

    • Thank you for your correction, Cnuk. Yes, the sentence you mentioned is incorrect.
      Here is my opinion with regard to ratios and aggregates:
      1. Ratio: a ratio needs to be calculated in the cube. The numerator and denominator should be stored in the fact table, but the division should be done in the cube (with “if null” or a divide in the MDX to avoid division by zero). This way, users can use the ratio at any level (for example: date-product-store, month-customer, year-product-city, etc).
      Now for reports: report 1 may be operating at date-product-store level, but report 2 may be at month-customer level, so again we should not store the division result (the ratio) in the fact table; we store the numerator and denominator and do the division in the report.
      2. Aggregate
      A cube is very good at producing aggregations and serving them at lightning speed, so we should not store the aggregation (the output of the group by) in the fact table, but must leave the aggregation task to the cube.
      Reports aggregate a lot more slowly than cubes, so it is important to do the aggregation in a summary fact table. If your fact table is at date-product-customer-store grain, with 800 million rows for 2014, and the report is at month-productgroup-city level, then you need to create a summary fact table at month-productgroup-city, so that the report doesn’t need to do a group by at run time.

      In principle, reports must not do joins, “group by” and especially “order by” at run time. All of these must be pre-prepared the night before, in order to achieve a one-second response time.
      If your fact table is under 100k rows there is nothing to worry about; you can do join, group by and order by in your report SQL queries.
      But if your fact table is 20 million rows or more, some good rules are:
      a) no join, group by or order by at run time,
      b) filtering (where clause) must be assisted with partitioning if possible, or at the very least with an index, a columnar index if possible.
      The principle is: the report only has 1 second to display the data to the users. It doesn’t have time to perform any complex operations.
      How do we avoid joins? By preparing a big, wide summary fact table containing all the attributes and measures required by many different reports.
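      For example, a sketch of a nightly ETL step that builds such a summary fact table at month-productgroup-city grain (the table and column names here are illustrative only):

        -- Built the night before, so the report does no join, group by or order by at run time.
        INSERT INTO dbo.FactSalesSummary (CalendarMonth, ProductGroup, City, SalesAmount, SalesQty)
        SELECT d.CalendarMonth,
               p.ProductGroup,
               s.City,
               SUM(f.SalesAmount),
               SUM(f.SalesQty)
        FROM dbo.FactSales f
        JOIN dbo.DimDate d    ON d.DateKey    = f.DateKey
        JOIN dbo.DimProduct p ON p.ProductKey = f.ProductKey
        JOIN dbo.DimStore s   ON s.StoreKey   = f.StoreKey
        GROUP BY d.CalendarMonth, p.ProductGroup, s.City;

        -- The report then reads dbo.FactSalesSummary with a simple where clause only.
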
      Hope this helps,
      Vincent

      Comment by Vincent Rainardi — 20 July 2014 @ 1:13 pm | Reply

