One of the fundamental principles in Kimball data warehousing is to build the fact tables at the lowest granularity, not at summary levels; where as in Inmon method the lowest grain is kept in a normalized-form warehouse. Whatever the reason is, sometimes we are in a situation where the data we need for the cube is not the dimensional warehouse. As usual there are multiple ways of addressing this,
- Bring that missing data into the warehouse (of course)
- Gets the missing data straight from the ERP.
Long ago, before the concept of ‘Unified Data Model’, ‘Self-service BI’ and ‘BI for the masses’ were conceived, we would never consider getting the data directly from the ERP. At that time, bypassing the data warehouse was a big no-no. In the early years of data warehousing, all data must be ‘unified’ first in the data warehouse, where it became the single version of the truth, before it is consumed.
As the law of nature dictates, pragmatic considerations always surpassed idyllic ideas. Then the concept of UDM and Self Service BI was born, where the BI tool can unify data from anywhere. The advantage: faster delivery, shorter development time, greater ROI. In reality, as sure as eggs is eggs, project proposals that deliver greater business benefits in shorter time frame triumph over those that do not. TM1, Qlikview, Board, SSAS and PowerPivot, all jumped into this wave. All the users care about is that they get the BI data they need, quickly. And these BI tools satisfy that need.
Yes it’s bad news for ETL. And for data warehousing. It’s not about ‘how quickly can you build a data warehouse’. It’s about being able to deliver BI, without building and maintaining a DW. What I demonstrated last week at SQLBits using SSAS, and what TM1, Qlikview, Board and PowerPivot have in common is that we deliver ‘BI on the go’, i.e. the BI tool retrieves the data from various sources, integrate them and present it to the users.
It is still difficult for me to believe (about bypassing DW), but it is happening. There are limits of course. Everything has a limit. In this case the limit is the complexity. ‘BI for the masses’, ‘self service BI’ and the like are OK for simple BI. Once it reaches certain complexity, we do need a warehouse.
Back to the title of this post, combining DW and ERP data in the cube, it’s simple. On Data Source View create 2 named query. One points to the DW, and the other points to the ERP. Then when creating the database dimension, we combine those 2 named querys/tables. What about facts/measure groups? If you want to source some facts from DW and some from ERP, create 2 measure groups. One points to the DW, and the other points to the ERP. Then on the dimension relationship link them to the corresponding dimensions. If you want to source the same facts from both DW and ERP, then either a) use linked server (if they are both SQL Server) to join both systems on the DSV, or b) use 2 partitions. One partition points at DW and the other points at ERP.
The challenge of combining DW and ERP data is not sourcing the data. It’s joining them. First, we need to join on the natural keys. Natural keys are the primary keys of the ERP’s ‘master tables’. They are the common fields between the 2 systems. Second, there might be some overlap between the 2 systems. We need to avoid data duplication. I mean row duplication. So we may need to use ‘where not exists’, ‘where not in ()’, and ‘union’ rather than ‘union all’, those sort of things, to filter out duplicate data that exists in both systems.
I’ll write 4 examples/cases,
- Dimension: 10 attributes are from DW, 1 attribute is from ERP
- Dimension: 300,000 rows (members) are from DW, 3000 members are from ERP
- Fact: measure group 1,2,3 are from DW, MG4 is from ERP
- Fact: 10 measures on MG1 are from DW, 1 measure is from ERP
- Fact: 10m rows are from DW, 100k are from ERP
Well OK, that’s 5. But read #4 below. Instead of ERP, we can also take the data from ODS. It’s not the right place nor the time to explain in great length about ODS now, so I’ll just give a short definition. It’s 8am on Sunday morning! But I’ll bring my ODS article to this blog later and refer to it, it’s in SQL Server Central.com at the moment. If you read Imhoff and Inmon’s book on ODS (1995) and my book/articles, basically ODS is a copy of the ERP (structure wise and data wise). Well, that is not an accurate definition, there’s data integration aspect etc, but for the purpose of this discussion it’s suffice. I write about ODS in greater length later, promise. I mean differences between supplementing DW data with ODS, compared to from ERP.
Back to the examples/cases,
1. Dimension: 10 attributes are from DW, 1 attribute is from ERP
In this case the join is left join, like this:
select c1.attr1, c1.attr2, ..., c1.attr10, isnull(c2.attr11, 'Unknown') from DW.dbo.dim_customer c1 left join ERP.dbo.customer c2 on c2.customer_id = c1.customer_id
The ERP is usually located on the different SQL Server than the DW. The ERP may even on different RDMBS, e.g. Oracle or DB2. In these cases we could a) create a copy of the ERP database, on the DW SQL server, or b) create a linked server from the DW SQL Server to the ERP server.
2. Dimension: 300,000 rows (members) are from DW, 3000 members are from ERP
In this case we use union. The core of this work is to find out from which ERP column each DW attribute is coming from. We do this by looking at the ETL, tracing the target attribute back to the stage, and to the ERP. Of course we can refer to the mapping documentation, as long as it is reliable. Practical experience over the years has taught me that we can’t rely on it, so we need to double check. Once we found out which columns on ERP, we test the ERP data using select distinct, i.e. “select distinct attr1 from customer”, comparing it with the data warehouse data: “select distinct attr1 from dim_customer”.
In doing this we need to be careful about 2 things: a) the data type, and b) the performance. The data type of attr1 in the DW might be varchar(20), but in the ERP it is datetime. So we need to convert. After all, that’s what the ETL does.
About the performance, if the DW table is big (say 10m rows) but the ERP result set is small (say 1000 rows) consider using “union all” with “select distinct” rather than union. Rather than:
select attr1, attr2, …, attr10 from DW.dbo.dim_customer union
select col1 as attr1, col2 as attr2, …, convert(varchar, col10) as attr10 from ERP.dbo.customer
Consider:
select attr1, attr2, …, attr10 from DW.dbo.dim_customer union all
select distinct col1 as attr1, col2 as attr2, …, convert(varchar, col10) as attr10 from ERP.dbo.customer
If DW.dbo.dim_customer contains 10m rows, ERP.dbo.customer contains 1000 rows the bottom query is faster than the first query. Union (without all) does select distinct between 2 tables so it could take a while.
The main issue about union-ing a DW dim table and the corresponding ERP table is, of course, the dim surrogate key. ERP customer table doesn’t have customer_key. That’s generated in DW, either by ETL or using identity column. So how can we address this? Either a) we prefix the surrogate key with the system code, or b) we use artificial (dummy) key for the ERP rows, such as negative numbers. If the surrogate key of customer dimension is 0 to 1,000,000, then the surrogate key of ERP rows are (for example) -1 to -1000. Prefixing is a bit difficult because we need to change the dimkeys in the fact tables too.
The other thing to consider when union-ing a DW dim table and an ERP master table is, that not all attribute can be supplied. One or two of the DW attributes may not available on the ERP. Or may not be readily available on ERP. In this case a) omit that attribute (set it to blank string), or b) link to the other ERP table where the attribute is located. In the case of the later, we may need to do lookup, transformation, conversion, or formatting first.
3. Fact: measure group 1,2,3 are from DW, MG4 is from ERP
This is normal situation in UDM world/SSAS, that we can source the DSV tables/Named Query from multiple data sources connected to different systems. Even on different RDBMS. What I mean is, DS1 connects to Oracle (ODBC) and DS2 connects to SQL Server (OLEDB). We then create 2 Named Queries on the DSV. NQ1 supplies MG1 and NQ2 supplies MG2.
4. Fact: 10 measures on MG1 are from DW, 1 measure is from ERP
Left join again, like #1, this time is on the fact table rather than dim table.
select dw.m1, dw.m2, …, dw.m10, isnull(e.m11,0)
from DW.dbo.fact1 dw
join ERP.dbo.fact2 e on e.a1=dw.a1 and e.a2=dw.a2 and …
Again, for ERP located on different SQL server or different RDBMS (Oracle for instance) we can a) create a copy on the DW SQL Server or b) create a linked server on the DW server to the ERP server.
Generally speaking it is better put the ERP measure on a separate MG. If we put both DW and ERP measure together we will need to join the fact table on DW with the transaction table from ERP and both could be big tables. This tend to cause performance issue, i.e. could take hours reading them when processing the cube. So I’d suggest #3 rather than #4. The users are not going to bother anyway whether we put the ERP measure on the same MG or different MG because from their point of view, the measure is located on the same folder. Yes, where the measure appears in the cube depends on the Display Folder property of the measure.
Secondly, it is better to put the ERP measure on a separate MG than the DW measures because we can set the dimension relationship more flexibly. The DW MG may link to 5 dimensions whereas the ERP MG may only link to 4 dimensions.
In this case we union-ing the DW fact table with the ERP transaction table.
select dw.a1, dw.a2, ..., dw.m1, dw.m2, ... from fact1 dw union select e.c1, e.c2, ..., e.m1, e.m2, ... from ERP.dbo.transaction_table e
The main work is to make sure that we get c1, c2, … rights (the attribute columns). We may need to join to another ERP table(s) to get the right values of those columns, for example a lookup or decode table, or master-detail style transaction table. Again, check the data type (especially date) and length.
Rather than doing union on the DSV, it is better to do it on cube partition. Create separate partition(s) for the ERP data and specify the ERP query on the new partition. The performance of cube processing is better this way, because separate SQL queries run on both systems at the same time, loading to cube partitions. To do this, we need to create separate Named Queries on the DSV. On the DW partition(s), select Data Source = DW. Whereas on the ERP partition(s) select Data Source = ERP. Both partition have binding type = Query Binding. On the ERP query, we need to name the column output the same as the DW query, so use ‘select as’: select col1 as DWname …
One thing we need to be careful when doing this: double check that there’s no duplication occur between the rows coming from DW and the rows coming from ERP. If you need to do ‘where not exists (…)’ or ‘where not in (…)’ in the ERP partition query to make sure there’s no duplication, then you’d need to do it using union. The idea of combining rows in cube partition only works if we know for certain that there’s no duplication. For example, if we know that DW fact table only contains rows from year 2000 onwards. Or DW does not contain transaction type X. So have a clean cut: we source transactions happening before 2000 from ERP, we source rows for trans type X from ERP, etc.
Well that’s it. If I’m talking rubbish, or make mistakes above (everybody make mistakes), let me know. Any idea you want discuss, just leave a comment or contact me. Thanks.