Data Warehousing and Data Science

27 November 2009

Combining Data Warehouse and ERP Data in SSAS Cubes

Filed under: Analysis Services,Data Warehousing — Vincent Rainardi @ 9:50 pm
Tags: ,

One of the fundamental principles in Kimball data warehousing is to build the fact tables at the lowest granularity, not at summary levels; where as in Inmon method the lowest grain is kept in a normalized-form warehouse. Whatever the reason is, sometimes we are in a situation where the data we need for the cube is not the dimensional warehouse. As usual there are multiple ways of addressing this,

  1. Bring that missing data into the warehouse (of course)
  2. Gets the missing data straight from the ERP.

Long ago, before the concept of ‘Unified Data Model’, ‘Self-service BI’ and ‘BI for the masses’ were conceived, we would never consider getting the data directly from the ERP. At that time, bypassing the data warehouse was a big no-no. In the early years of data warehousing, all data must be ‘unified’ first in the data warehouse, where it became the single version of the truth, before it is consumed.

As the law of nature dictates, pragmatic considerations always surpassed idyllic ideas. Then the concept of UDM and Self Service BI was born, where the BI tool can unify data from anywhere. The advantage: faster delivery, shorter development time, greater ROI. In reality, as sure as eggs is eggs, project proposals that deliver greater business benefits in shorter time frame triumph over those that do not. TM1, Qlikview, Board, SSAS and PowerPivot, all jumped into this wave. All the users care about is that they get the BI data they need, quickly. And these BI tools satisfy that need.

Yes it’s bad news for ETL. And for data warehousing. It’s not about ‘how quickly can you build a data warehouse’. It’s about being able to deliver BI, without building and maintaining a DW. What I demonstrated last week at SQLBits using SSAS, and what TM1, Qlikview, Board and PowerPivot have in common is that we deliver ‘BI on the go’, i.e. the BI tool retrieves the data from various sources, integrate them and present it to the users.

It is still difficult for me to believe (about bypassing DW), but it is happening. There are limits of course. Everything has a limit. In this case the limit is the complexity. ‘BI for the masses’, ‘self service BI’ and the like are OK for simple BI. Once it reaches certain complexity, we do need a warehouse.

Back to the title of this post, combining DW and ERP data in the cube, it’s simple. On Data Source View create 2 named query. One points to the DW, and the other points to the ERP. Then when creating the database dimension, we combine those 2 named querys/tables. What about facts/measure groups? If you want to source some facts from DW and some from ERP, create 2 measure groups. One points to the DW, and the other points to the ERP. Then on the dimension relationship link them to the corresponding dimensions. If you want to source the same facts from both DW and ERP, then either a) use linked server (if they are both SQL Server) to join both systems on the DSV, or b) use 2 partitions. One partition points at DW and the other points at ERP.

The challenge of combining DW and ERP data is not sourcing the data. It’s joining them. First, we need to join on the natural keys. Natural keys are the primary keys of the ERP’s ‘master tables’. They are the common fields between the 2 systems. Second, there might be some overlap between the 2 systems. We need to avoid data duplication. I mean row duplication. So we may need to use ‘where not exists’, ‘where not in ()’, and ‘union’ rather than ‘union all’, those sort of things, to filter out duplicate data that exists in both systems.

I’ll write 4 examples/cases,

  1. Dimension: 10 attributes are from DW, 1 attribute is from ERP
  2. Dimension: 300,000 rows (members) are from DW, 3000 members are from ERP
  3. Fact: measure group 1,2,3 are from DW, MG4 is from ERP
  4. Fact: 10 measures on MG1 are from DW, 1 measure is from ERP
  5. Fact: 10m rows are from DW, 100k are from ERP

Well OK, that’s 5. But read #4 below. Instead of ERP, we can also take the data from ODS. It’s not the right place nor the time to explain in great length about ODS now, so I’ll just give a short definition. It’s 8am on Sunday morning! But I’ll bring my ODS article to this blog later and refer to it, it’s in SQL Server at the moment. If you read Imhoff and Inmon’s book on ODS (1995) and my book/articles, basically ODS is a copy of the ERP (structure wise and data wise). Well, that is not an accurate definition, there’s data integration aspect etc, but for the purpose of this discussion it’s suffice. I write about ODS in greater length later, promise. I mean differences between supplementing DW data with ODS, compared to from ERP.

Back to the examples/cases,

1. Dimension: 10 attributes are from DW, 1 attribute is from ERP

In this case the join is left join, like this:

select c1.attr1, c1.attr2, ..., c1.attr10, isnull(c2.attr11, 'Unknown')
from DW.dbo.dim_customer c1
left join ERP.dbo.customer c2 on c2.customer_id = c1.customer_id

The ERP is usually located on the different SQL Server than the DW. The ERP may even on different RDMBS, e.g. Oracle or DB2. In these cases we could a) create a copy of the ERP database, on the DW SQL server, or b) create a linked server from the DW SQL Server to the ERP server.

2. Dimension: 300,000 rows (members) are from DW, 3000 members are from ERP

In this case we use union. The core of this work is to find out from which ERP column each DW attribute is coming from. We do this by looking at the ETL, tracing the target attribute back to the stage, and to the ERP. Of course we can refer to the mapping documentation, as long as it is reliable. Practical experience over the years has taught me that we can’t rely on it, so we need to double check. Once we found out which columns on ERP, we test the ERP data using select distinct, i.e. “select distinct attr1 from customer”, comparing it with the data warehouse data: “select distinct attr1 from dim_customer”.

In doing this we need to be careful about 2 things: a) the data type, and b) the performance. The data type of attr1 in the DW might be varchar(20), but in the ERP it is datetime. So we need to convert. After all, that’s what the ETL does.

About the performance, if the DW table is big (say 10m rows) but the ERP result set is small (say 1000 rows) consider using “union all” with “select distinct” rather than union. Rather than:

select attr1, attr2, …, attr10 from DW.dbo.dim_customer union
select col1 as attr1, col2 as attr2, …, convert(varchar, col10) as attr10 from ERP.dbo.customer


select attr1, attr2, …, attr10 from DW.dbo.dim_customer union all
select distinct col1 as attr1, col2 as attr2, …, convert(varchar, col10) as attr10 from ERP.dbo.customer

If DW.dbo.dim_customer contains 10m rows, ERP.dbo.customer contains 1000 rows the bottom query is faster than the first query. Union (without all) does select distinct between 2 tables so it could take a while.

The main issue about union-ing a DW dim table and the corresponding ERP table is, of course, the dim surrogate key. ERP customer table doesn’t have customer_key. That’s generated in DW, either by ETL or using identity column. So how can we address this? Either a) we prefix the surrogate key with the system code, or b) we use artificial (dummy) key for the ERP rows, such as negative numbers. If the surrogate key of customer dimension is 0 to 1,000,000, then the surrogate key of ERP rows are (for example) -1 to -1000. Prefixing is a bit difficult because we need to change the dimkeys in the fact tables too.

The other thing to consider when union-ing a DW dim table and an ERP master table is, that not all attribute can be supplied. One or two of the DW attributes may not available on the ERP. Or may not be readily available on ERP. In this case a) omit that attribute (set it to blank string), or b) link to the other ERP table where the attribute is located. In the case of the later, we may need to do lookup, transformation, conversion, or formatting first.

3. Fact: measure group 1,2,3 are from DW, MG4 is from ERP

This is normal situation in UDM world/SSAS, that we can source the DSV tables/Named Query from multiple data sources connected to different systems. Even on different RDBMS. What I mean is, DS1 connects to Oracle (ODBC) and DS2 connects to SQL Server (OLEDB). We then create 2 Named Queries on the DSV. NQ1 supplies MG1 and NQ2 supplies MG2.

4. Fact: 10 measures on MG1 are from DW, 1 measure is from ERP

Left join again, like #1, this time is on the fact table rather than dim table.

select dw.m1, dw.m2, …, dw.m10, isnull(e.m11,0)
from DW.dbo.fact1 dw
join ERP.dbo.fact2 e on e.a1=dw.a1 and e.a2=dw.a2 and …

Again, for ERP located on different SQL server or different RDBMS (Oracle for instance) we can a) create a copy on the DW SQL Server or b) create a linked server on the DW server to the ERP server.

Generally speaking it is better put the ERP measure on a separate MG. If we put both DW and ERP measure together we will need to join the fact table on DW with the transaction table from ERP and both could be big tables. This tend to cause performance issue, i.e. could take hours reading them when processing the cube. So I’d suggest #3 rather than #4. The users are not going to bother anyway whether we put the ERP measure on the same MG or different MG because from their point of view, the measure is located on the same folder. Yes, where the measure appears in the cube depends on the Display Folder property of the measure.

Secondly, it is better to put the ERP measure on a separate MG than the DW measures because we can set the dimension relationship more flexibly. The DW MG may link to 5 dimensions whereas the ERP MG may only link to 4 dimensions.

See? It’s only 4 cases in the end, not 5, because this one doesn’t count 🙂
5. Fact: 10m rows are from DW, 100k are from ERP

In this case we union-ing the DW fact table with the ERP transaction table.

select dw.a1, dw.a2, ..., dw.m1, dw.m2, ...
from fact1 dw
select e.c1, e.c2, ..., e.m1, e.m2, ...
from ERP.dbo.transaction_table e

The main work is to make sure that we get c1, c2, … rights (the attribute columns). We may need to join to another ERP table(s) to get the right values of those columns, for example a lookup or decode table, or master-detail style transaction table. Again, check the data type (especially date) and length.

Rather than doing union on the DSV, it is better to do it on cube partition. Create separate partition(s) for the ERP data and specify the ERP query on the new partition. The performance of cube processing is better this way, because separate SQL queries run on both systems at the same time, loading to cube partitions. To do this, we need to create separate Named Queries on the DSV. On the DW partition(s), select Data Source = DW. Whereas on the ERP partition(s) select Data Source = ERP. Both partition have binding type = Query Binding. On the ERP query, we need to name the column output the same as the DW query, so use ‘select as’: select col1 as DWname …

One thing we need to be careful when doing this: double check that there’s no duplication occur between the rows coming from DW and the rows coming from ERP. If you need to do ‘where not exists (…)’ or ‘where not in (…)’ in the ERP partition query to make sure there’s no duplication, then you’d need to do it using union. The idea of combining rows in cube partition only works if we know for certain that there’s no duplication. For example, if we know that DW fact table only contains rows from year 2000 onwards. Or DW does not contain transaction type X. So have a clean cut: we source transactions happening before 2000 from ERP, we source rows for trans type X from ERP, etc.

Well that’s it. If I’m talking rubbish, or make mistakes above (everybody make mistakes), let me know. Any idea you want discuss, just leave a comment or contact me. Thanks.

25 November 2009

Building Cubes From Operational Systems (SQLBits Presentation)

Filed under: Analysis Services,Event — Vincent Rainardi @ 9:40 pm
Tags: ,

My presentation at SQLBits 5 last Saturday is here: Building Cubes From Operational Systems.

SQL Scripts for the DDL and DSV are here:, and the BIDS project files are here:

Any problem downloading please let me know.

23 November 2009

The 4.2 billion tuples limitation

Filed under: Analysis Services — Vincent Rainardi @ 4:14 pm

In the What Are Cubes Bad At post, I mentioned about SSAS 4.2 billion tuples limitation. The fact that moving the attributes about solves the problem really amazed me so I’ll detail it out a little bit.

When we run a complex MDX query, sometimes SSAS returns the following error message:
The expression contains a function that cannot operate on a set with more than 4,294,967,296 tuples.

This error is consistent, i.e. it always happens every time we run that MDX query. It does not matter where we are running the query from, SSMS, BIDS, ProClarity, or MDX Studio, it always returns the same error message.

Oddly enough, when you reorder the attributes on rows or columns the error goes away. It seems that this error only happens on complex MDX queries, i.e. when we have more than 5 attributes on the rows or columns, and only when we use the non empty function. This error happens on both SSAS 2005 and 2008 (I don’t know if it happens on R2 or 2k).

The error occurred because the query results in more than 4.2 billion tuples and therefore the non empty function fails. For this error to occur the user must belong to a role that has dimensional security. The non empty cross join function fails, as Microsoft explained here, because the ‘optimized algorithm for non empty’ can’t handle dimension security on the measures, so SSAS uses the default algorithm.

It is very interesting to learn that when we move the attributes about, the error goes away. Let’s have a look at this MDX query:

SELECT { [Measures].[M1], [Measures].[M2], [Measures].[M3] } ON COLUMNS,
    ( { [Dim1].[Attr1].&[Member1], [Dim1].[Attr1].&[Member2],
        [Dim1].[Attr1].&[Member3], [Dim1].[Attr1].&[Member4] } *
      { [Dim2].[Attr1].&[Member1], [Dim2].[Attr1].&[Member2], [Dim2].[Attr1].&[Member3] } *
      { [Dim3].[Attr1].Members } *
      { [Dim4].[Attr1].Members } *
      { [Dim4].[Attr2].Members } *
      { [Dim3].[Attr2].Members } *
      { [Dim4].[Attr3].Members } *
      { [Dim5].[Attr1].Members } *
      { [Dim6].[Attr1].Members },
      { [Dim7].[Attr1].&[Member1] } *
      { [Dim5].[Attr1].&[Member1], [Dim5].[Attr1].&[Member2] } *
      { [Measures].[M1], [Measures].[M2], [Measures].[M3] }
FROM [Cube1]
WHERE ([Dim7].[Attr1].&[Member1], [Dim4].[Attr4].&[Member1])

The above query results in the 4.2 billion error message. But if we move line 9 and 10 above like this, the error goes away:

SELECT { [Measures].[M1], [Measures].[M2], [Measures].[M3] } ON COLUMNS,
    ( { [Dim1].[Attr1].&[Member1], [Dim1].[Attr1].&[Member2],
        [Dim1].[Attr1].&[Member3], [Dim1].[Attr1].&[Member4] } *
      { [Dim2].[Attr1].&[Member1], [Dim2].[Attr1].&[Member2], [Dim2].[Attr1].&[Member3] } *
      { [Dim3].[Attr1].Members } *
      { [Dim3].[Attr2].Members } *
      { [Dim4].[Attr1].Members } *
      { [Dim4].[Attr2].Members } *
      { [Dim4].[Attr3].Members } *
      { [Dim5].[Attr1].Members } *
      { [Dim6].[Attr1].Members },
      { [Dim7].[Attr1].&[Member1] } *
      { [Dim5].[Attr1].&[Member1], [Dim5].[Attr1].&[Member2] } *
      { [Measures].[M1], [Measures].[M2], [Measures].[M3] }
FROM [Cube1]
WHERE ([Dim7].[Attr1].&[Member1], [Dim4].[Attr4].&[Member1])

The last MDX does not result in the 4.2 billion error message. In this last MDX query, you’ll notice that Dim3’s attributes are put together and Dim4’s attributes are put together; whereas on the first MDX query, the attributes are sandwiched. While this order of attribute might not be the one that suit the user, it is the one that SSAS likes. If we put the attribute from the same dimension together, SSAS only return the tuples that actually exist, rather than returning all possible tuples. It is because of Auto-Exists, as Mosha pointed out here, i.e. SSAS detects that the set it needs to apply Auto-Exists to, can be nicely split into parts. A backgrounder on Auto-Exists is here (by Christian Wade, look under Fact Dimension section) and here (by Teo Lachev). It is basically the MDX Exists function, but automatically put there by SSAS. A full explanation about Auto-Exists is here (look under ‘Cube Space and Auto-Exists’ section).

So, the take away is: put attributes from the same dimension next to each other.

22 November 2009

SQLBits 5

Filed under: Uncategorized — Vincent Rainardi @ 9:27 am

On Friday and Saturday, 20th-21st November 2009 I attended SQLBits 5 at Celtic Manor, Newport, Wales. I felt lucky to get the opportunity to meet and talk to famous SQL Server experts like Chris Webb, Henk van der Valk, Satya Jayanti, Bob Duffy, Rob Farley, Alistair Aitchison, Anthony Howcroft, Simon Sabin, Darren Green, Allen Mitchel, Andre Kamman, Ian Marritt, Matt Feltham, Martin Bell, James Rowland-Jones, Andrew Couch, Andrew Fryer, Ramesh Meyyappan, John Tunnicliffe and Ashwani Roy. I attended the sessions of Donald Farmer (R2 SSRS, MDS), Thomas Kejser (Madison/PDW), James Rowland-Jones (2008 DW) , Bob Duffy (Gemini/PowerPivot), Christian Bolton (Statistics), Allan Mitchell (SSIS 2008) and am grateful to get the opportunity to learn directly from these experts.

I tried to understand what motivates them, why would they want to spend a lot of time and effort, organising SQL Server conference like SQLBits. During the breaks I asked several people in the conference how do they learn new product/version (by reading books, web sites, and using the product), how they solve problems (mainly by googling), and how they keep themselves up-to-date with the technology developments (by going to events and subscribe to newsletters/mailing list). I talked to several vendors and attended their presentations to understand their products: Wherescape (Red), Board, Attunity (Oracle-SSIS CDC), Quest (Spotlight, SSAS) and Solid Quality Mentors. I am particulary impressed with Wherescape Red’s ability to design data warehouses, complete with ETL, cubes and documentation, in very short time, but disappointed with Spotlight’s ability to monitor SSAS. I missed SQLSentry, but I have evaluated their product (with regard to SSAS monitoring) in the past few weeks.

I presented a session about building cubes from operational systems. I explained why would we want to do it (the benefits), how to do it (a step-by-step demo in BIDS to build a stock broker cube from a stock broker database), and what risks/pitfalls should we be aware of (performance implications, date dimension, large dimension, etc). There were about 20-25 people attending my sessions, about 4-5 questions. Had problem with the projector (again!) even though I have tested it just a couple of hours before. Overall I’m very impressed with SQLBits 5, I learned a lot (both technical/SQL Server and non-technical), both from the speakers, sponsors and from the audience. I look forward to next one.

20 November 2009

What Are Cubes Bad At?

Filed under: Uncategorized — Vincent Rainardi @ 11:32 pm

(OLAP) Cubes are built for analysing data by doing slicing and dicing, drilling up and drilling down. Cubes’ distinct advantage (over reports) is their ability to contain millions of numbers and be delivered to the users’ finger tips, allowing them to browse the data at will, at lightning speed. People who does OLAP* as part of their job sometimes feel that the experience is like Minority Report, where Tom Cruise explored the glass surface containing millions of pictures with those special gloves.

This ‘power at your finger tips’ is sometimes so powerful so that the users forgot that is an OLAP tool, not a reporting tool. But when I said that, nearly always they asked back “What’s the difference?” What’s the difference between delivering data warehouse information through reports and delivering it through cubes? What are reports good at but cubes are absolutely rubbish at it?

There are 3 things that cubes are particularly bad at: (they affect OLAP cubes in general, not only SSAS)

  1. Cosmetics
  2. Columnar reports
  3. Text

1. Cosmetics.
OLAP engines are not good at cosmetics/appearance. There is no formatting. No fancy fonts, bold, italics, logo, images, colour, print layout, screen layout. Cube browser like ProClarity delivers the data bare and plain. Just the numbers. OK I might be overemphasised it. Cube browsers do fonts and charting as well. But it is limited. Where as in reporting world, cosmetics are in abundance. Ask any report developer and they will agree that querying the database and getting the data into the report is only half of the job. A large chunk of the work is about formatting and appearance. Both on the screen and when printed (or exported to PDF).

In SQL Server world, this issue is addressed by:

  1. Exporting the data from the cube (say using ProClarity) into Excel and format it in Excel. So charting, layout are all done in Excel. The function of the cube browser is just to extract the data from the cube.
  2. Browse the cube directly from Excel 2007 or 2010, either using Pivot tables or R2’s add-in of PowerPivot. This way we can format the appearance.
  3. Users order dashboards from IT (or KPI, scorecards, indicators). IT uses tools that read the cubes and display the cube data in the form of dashboards (say using PerformancePoint Dashboard Designer). All the users have to do is to look at the fully-baked dashboards and consume/use the information.
  4. Users order reports from IT. IT uses report design tools like SSRS to “wrap” the cube data in nice layout with pretty format. Like dashboards, the solution is fully baked and all the users need to do is to consume the information.

2. Columnar reports

Columns are the worse enemy of OLAP. Column in its tabular sense, as in a vertical line of data. It is probably clearer if we use picture. This is an example of a 7 columns ‘report’:

Attr1 Attr2 Attr3 Attr4 Attr5 Attr6 Attr7 Measure1
member member member member member member member value
member member member member member member member value
member member member member member member member value
member member member member member member member value

When a user uses Excel or ProClarity to compose a ‘report’ from a cube, he or she may think “I can put as many columns as I like” because “the data is stored in database tables”. When we explained that if column1 is an attribute with 1000 members (we usually need to explain what a member is first), and column2 is an attribute with 1000 members, then a 7 column reports can potentially have a lot of rows: 1000 x 1000 x 1000 x … until 7 times. That is a number with 21 zeros. And that will either make the query slow (like taking 5 minutes to complete), or cause the query to fail, as it hits the 4.2 billion tuples limitation.

In practice, we can try to address the issue of ‘too many columns’ using:

  1. Use the Non Empty buttons in ProClarity (automatically applied if the cube is in SSAS 2008) which will make the report ‘blastening fast’. Here’s a simplified way to explain it to the user: suppose out of the 1000 members of attribute1, only 10 members have values for measure1. Like wise attribute2 (the 2nd column), only 10 members have values for measure1. So the number of rows with values on measure1 is: 10 x 10 x 10 x … until 7 times, which equal to 10 millions. So from the gazillion number with 21 zeros we are now down to 10 million. And that’s why the report is a lot quicker, from 5 minutes down to 2-3 seconds.
  2. Put the attributes from the same dimension next to each other. This way, as Mosha explained, SSAS will run inner join against each dimension separately, and then do full cross join between dimensions”. The query will be ‘faster’ because the attributes from the same dimension ‘hang’ off the dimension key, so SSAS doesn’t need to go deeper than the dimension key.
  3. Other ways include: move some attributes from rows to columns or background, specify the members for particular attribute(s), split the report into 2 pages.
  4. If it is a static report or dashboard, i.e. we compose the MDX query ourselves, we can use NonEmpty on each attribute before cross joining across all attributes, as Darren Gosbell suggested here.

3. Text. Text is another thing which OLAP engines are not so good at.

16 November 2009

Cube Developer

Filed under: Uncategorized — Vincent Rainardi @ 8:15 am

Please correct me if I’m wrong but I think two years ago when I wrote “The Unpopular Select Statement“, article there was not a role advertised as “Cube developer”. At that time, we did have “Cognos Architect”, “Cognos ReportNET Developer”, and “Cognos PowerPlay developer”, but in the SQL Server world that role was always combined with SSRS and SSIS in the form of “Microsoft BI developer”, or “Microsoft BI Architect” role. Looking back, I think this was because in most companies they only have 1 or 2 cubes, so there was no need for somebody to be full time at it. Occassionally there was “OLAP developer” job advert came up, but at that time (2 years ago) it was more often meant to be Hyperion and PowerPlay than SSAS. If 2 years ago the number of Analysis Services job advertised in the last 3 months was 738 (see my article I mentioned earlier, source ITJobsWatch), today it is 1118, a 50% growth.

Now, we see “cube developer” roles advertised in IT job web sites and that (nearly) always means SSAS. At least in London’s insurance and investment banking market, there is a requirement for a full time cube developer. The idea of reporting has changed from “Contact IT department to order a report” to “Ask IT to create a cube containing all the data, so we can browse the cube and create the reports ourselves, thank you very much”.

So the role of a cube developer is to develop the cube in the first place, and when the cube is in production his job is to satisfy the change requests, e.g. add new attributes, new measures, etc. He is also responsible for administering the cubes, i.e. add/remove users into/from roles (or use AD group and move this task to helpdesk), make sure that they are refreshed (processed) every day, investigate if users reported any descrepancies in the numbers, help new users to use the client tool to analyse the cube, troubleshoot slow queries, maintain the aggregations.

The role of cube developer also involved maintaining the SSAS server. Installation, configurations, patching, upgrade to 2008 (R2), setting the memory limits, maintaining OLAP Query Log (and perhaps creating some stats out of it), query the DMV to understand some stats/usage, XMLA scripting to automated processing the cube (e.g. not full process but ‘process update’ the dimensions separately then process the latest partitions), do AMO scripting (e.g. to identify when a new dim or partition is added, and regenerate the XMLA script in SQL agent/SSIS), do ADOMD scripting, AD group scripting (using .NET, to maintain a list of who can access what – probably stored in a table).

The role of cube developer could also be responsible for the creation of SSRS reports which access the cubes. And the maintenance of SSIS packages that refresh/process the cube. And the installation, configuration and maintenance of the related BI tools such as PowerPoint Server 2007. and the deployment of the client tools/cube browsers such as Excel 2007, CubePlayer or ProClarity.

15 November 2009

Using Cubes For Reconciliation

Filed under: Uncategorized — Vincent Rainardi @ 11:19 pm

cube can be used for 4 purposes:
1. Data analysis
2. Reporting
3. Reconcilation
4. Testing

By far the most common use of a cube is data analysis. Data from dimensional data warehouse is loaded into cubes so that business users can browse the cubes to analyse the data. Browsing a cube means slicing and dicing the data to get the break down of the measures by the dimensional attributes. It also means drilling down (viewing the data at a more detail level) and drilling up (viewing the data at a higher level).

The second usage of a cube is for reporting. In this case the cube is accessed by a reporting tool such as SSRS, and displayed in tabular format. ‘cube’ reports like this often contain parameters in the form of dropdown lists. Users use the dropdown lists to specify which slice of the cube to display on the report. For example, if the cube has 3 dimensions (product, date, store), and the report is “monthly sales by product” (i.e. product on the rows, and months on the column), user can select which store to display in the report parameter.

Data analysis and reporting probably cover 99% of the cube usage. Not many companies (and other types of organisations) use OLAP cubes for any other purposes. The idea of using cubes for reconciliation is quite simple: it is to compare 2 systems and identify the differences. To do this, data from both systems are loaded into the cube. In the cube, for each measure that we want to compare we create a calculated measure, which is basically the measure from system B minus the measure from system A.

We then browse the cube. If at the very top level this calculated measure shows say £1 million that is the difference between system A and system B. Users can then drill down to get the break down of this £1 million by whatever dimension attributes they want.

For example, say a retail company needs to migrate data from system A to system B. System A is an old system, and system B is the new system. This is not a straight forward task, but it is a very common scenario. All companies need to change their systems sooner or later. Tables table on system A are mapped to the tables in system B. Each column within each table in system A is mapped to the target column system B. And there are various business rules. And there are numerous exceptions. If this than that, if that then this. But if the condition is so and so, then use this calculation instead. And so on and so forth. After a few months of design and development, the team managed to create the ETL, or data migration application. Say it is in SSIS. The team managed to execute the SSIS package successfully (in Test environment of course) and all data from system A has been migrated to system B.

Now comes the checking. In system A, the total sales from all stores in City X in 2005 was $1,300,000. In system B it was also $1,300,000. Checked. The inventory value of product Y in 2006 was $2,900,000. In system B it was $2,800,000. The $100k difference was expected, because there is a business rule affecting the inventory valuation in 2006. Checked. But the total sales of all products, from all stores worldwide, for all years doesn’t match. It differ by say $1,500,000.  Testers and analysts alike all issuing SQL queries to both system A and system B trying to pin point the root cause. After long hours they found the cause: a price list update in such and such store for such and such product in such and such year. And it goes on and on. All in all, a huge chunk of project time is spent on this activity: Reconciliation. Checking and rechecking. Hundreds of “select from where group by” queries issued agains both source and target systems. Weeks and weeks of valuable time is spent on this Reconciliation.

Using the reconciliation cube I described earlier, the testers will see the top level difference. Then they can drill down on any attribute at any level to find out the break down of that figure. They can then decide whether that difference is expected or not. If it is not expected, they can check the ETL package and correct it. Rerun the package and refresh the cube. After several iterations they can declare that the testing is complete and they are ready to sign off the project. At the migration day, the reconciliation cube is then pointed at the production system. After the ETL package is executed on the production database(s), the reconcilation cube is refreshed and reviewed. This approach is much more effective than issuing hundreds of SQL queries to do it manually.

This approach can also be applied for checking data warehouse load, i.e. to reconcile the data that is already loaded in the warehouse against the data in the source system. There are millions of numbers in the source system and millions of numbers in the warehouse. To be able to state confidently that all numbers match is a pretty tall order. This is especially so if your tool is limited to SQL queries. OLAP cubes, on the other hand, is purposely built to handle millions of numbers. And using a simple calculated measure like B-A (or (B-A)/A to get the percentage) the reconcilation process is so much simpler.

What is a Cube?

Filed under: Analysis Services — Vincent Rainardi @ 10:43 pm

I found myself repeatedly need to explain “What is a cube?” in writing post for this blog. So I decided to write a short post explaining what a cube is. So next time I use the term ‘cube’ in my blog post, I just have to refer/link to this post.

In SQL Server world, practically speaking, a cube is an Analysis Services database. People often use the word ‘cube’ as the short form of ‘SSAS cube’.

In computer science world, an OLAP cube, or just ‘cube’ for short, is a multidimensional database. A multidimensional database (MDB) is a structure designed to store measure data (fact) and hierarchical data (dimension), where the combination of members of each dimension points to zero, one or many measure values. This relationship can be described as a discreet function m[1..n] = f (d1, …, dn), in other words m is a function of d1 to dn, where m is the measure value and d1 to dn are the dimension members. The structure stores the aggregate values as well as the base values, typically in compressed multidimensional array format, rather than in RDBMS tables. Aggregate values are pre-computed summary or calculations of the base values. Examples of multidimensional database are Microsoft Analysis Services, Hyperion Essbase and Cognos PowerCube. The terms that people use for multidimensional database are hypercube, cube, OLAP cube, multi-dimensional database (abbreviated as MDB, MDD or MDDB) and multi-dimensional data store (MDDS).

Building Cubes from Operational Systems

Filed under: Uncategorized — Vincent Rainardi @ 8:07 am

Three years ago, I always thought that cubes* must always be built from a data warehouse or data mart*. So I did. I built a data mart first, with the ETL, before creating the cubes. About a year ago a data warehousing colleague of mine asked me if it is possible to build a cube straight from the transactions system, without building a data warehouse first. I explained to him that in theory it was possible, but there were challenges. Since then I have built several cubes straight from the operational systems*. In doing so I have encountered numerous issues, and found ways to overcome them.

*”What is a cube?”, “What is the difference between a data warehouse and a data mart?”, and “What is the difference between an operational system and a transaction system?” are explained at the end of this post.

When we build a cube from an operational system, we face challenges like:

  • there is no surrogate key
  • there is no date table
  • attributes are buried in large transaction tables
  • normalised structure – attributes are scattered on multiple tables
  • there is no ‘time of day’ dimension table
  • attributes and measures are located in the same table
  • data is raw – no calculations / transformations
  • performance implications

So this Saturday at the SQLBits* I will be presenting about that, building cubes from operational systems. I thought the best way to explain it is using a demo. So I plan to spend about 10 minutes laying out the background and describing the benefits, then we will be off to BIDS* and SSMS* building a cube from a stock broker database.

*SQLBits is a SQL Server conference in UK, held since 2007. Currently it is held twice a year, i.e. every 6 months. The first one was on 6/10/07 at the Microsoft campus in Reading. The next one, SQLBits 5, is this week, 19-21/11/09, in Newport. The first one I attended was SQLBits 3, 13/9/08 in Hatfield. At SQLBits 4 in Manchester, 28/3/09, I presented a topic about data warehousing. SQLBits usually have 4 main tracks: SQL DBA, SQL Development, BI and new release. SQLBits is usually free to attend, or at least there is a free day. In SQLBits 5, Thursday and Friday are ‘paid’ days, where as the Saturday is free. I found that it is worth taking a day off work and pay the conference fee, to be taught by experts like Chris Webb, Allan Michell, Donald Farmer and Thomas Kejser. The complete agenda is here.

*BIDS = Business Intelligence Development Studio, is the development tool that we use to build SSAS cube, SSIS package and SSRS reports. It is Visual Studio, but customised specifically to do BI development, as opposed to building .NET applications.

*SSMS= SQL Server Management Studio. It is the tool that we use to administer SQL Server 2005 and 2008, as well as doing design and development e.g. creating stored procedures, table structures, and database diagrams.

*What is a cube? In daily conversation, a cube is an SSAS database. Amongst SSAS developers, we use the word ‘cube’ to differentiate our stuff from ‘database’. In SQL Server community, the word ‘database’ means relational database, where as the word ‘cube’ means OLAP database. I know that this is technically incorrect, and in my younger years I would have protested strongly (my background is engineering physics), so for those who prefered a more scientific definition, here it is: “A multidimensional database (MDB) is a structure designed to store measure data (fact) and hierarchical data (dimension), where the combination of members of each dimension points to zero, one or many measure values. This relationship can be described as a discreet function m[1..n] = f (d1, …, dn), in other words m is a function of d1 to dn, where m is the measure value and d1 to dn are the dimension members. The structure stores the aggregate values as well as the base values, typically in compressed multidimensional array format, rather than in RDBMS tables. Aggregate values are pre-computed summary or calculations of the base values.”

*What is the difference between a data warehouse and a data mart?
A data mart only has one fact table where as a data warehouse has many fact tables. Again, I know it is technically incorrect, and for those who preferred a more formal and precise definition, quoting from my book, “A data warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalized data store. It normally keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batch not every time a transaction happens in the source system.”

*What is the difference between an operational system and a transaction system? An operational system can be a reporting system (mainly read) or a transaction system (mainly write). A transaction system is an IT system which companies use to enter daily transactions such as sales, payments and call logs. A reporting system is an IT system which companies (or other types of organisations) use to extract data from the transaction system either to understand what happened in the business or analyse the business.

12 November 2009

Concatenating attributes to form dimension key

Filed under: Analysis Services — Vincent Rainardi @ 9:14 pm

In SSAS, to create a dimension from a large transaction table, we should avoid using the table’s primary key as the dim key. Instead, it is better to use the concatenation of the attributes as the dim key.

For example, in stock trading we have a transaction table called Trade with these columns: action (whether it’s buy, sell or stop loss), trade type (AT for Automated Trade, O for Ordinary Trade, NK for Block Trade), status (whether it’s outstanding, in progress, or completed), result (failed, rejected or successful). This Trade table has a lot of data, say 50 millions rows, containing 8 years of history. And these attributes (actions, trade types, trade status and trade results) are not stored in separate tables. There is no table called trade types for example; the only place we could find AT, O and NK is in the Trade table.

Now we want to create an Trade dimension with that four attributes. The primary key of the Trade table is trade_ref. If we use “select trade_ref, action, trade_type, status, result from trade”, SSAS will have to do “select distinct” 5 times to build the dimension. It would take probably 10 minutes to build this dimension. From my experience, if we have 15-20 attributes instead of 4, it could take about 30-45 minutes to build the dimension. About 60-70% of this time is to build the trade_ref attribute, which is the dimension’s key attribute. A more efficient way of doing it is to:

a) split it into 4 dimensions, or
b) do “select distinct action + “|” + trade_type + “|” + status + “|” + result + “|” + action as dimkey, trade_type, status, result from trade”

From my experience, users prefer to have related attribute in the same dimension. If we do a), those 4 attributes will be located in separate dimensions. Where as if we do b), we will have those 4 attributes in 1 dimension, and the query would finish in a few seconds, if we have an index in each of the 4 columns. Probably 5 minutes for 15 attributes. This key attribute is hidden anyway, so whether we use trade_ref or the concatenated one, users won’t see it.

In the Named Query the Measure Group (MG), we need to provide a dim key. We can’t use trade_ref for this (select trade_ref, … from Trade – [Q1]). We need to use the concatenated columns (select action + “|” + trade_type + “|” + status + “|” + result + “|” + action as trade_key, … from Trade [Q2]). From experience, the [Q2] query performs as well as the [Q1] query. In a 50 millions transaction table, both queries take about 15 minutes (10 partitions located on separate disks). This is because the MG query is a straight forward query (1 pass), which is very different from the dim query (multiple passes, 1 for each attribute, select distinct).

Next Page »

Blog at