Three years ago, I always thought that cubes* must always be built from a data warehouse or data mart*. So I did. I built a data mart first, with the ETL, before creating the cubes. About a year ago a data warehousing colleague of mine asked me if it is possible to build a cube straight from the transactions system, without building a data warehouse first. I explained to him that in theory it was possible, but there were challenges. Since then I have built several cubes straight from the operational systems*. In doing so I have encountered numerous issues, and found ways to overcome them.
*”What is a cube?”, “What is the difference between a data warehouse and a data mart?”, and “What is the difference between an operational system and a transaction system?” are explained at the end of this post.
When we build a cube from an operational system, we face challenges like:
- there is no surrogate key
- there is no date table
- attributes are buried in large transaction tables
- normalised structure – attributes are scattered on multiple tables
- there is no ‘time of day’ dimension table
- attributes and measures are located in the same table
- data is raw – no calculations / transformations
- performance implications
So this Saturday at the SQLBits* I will be presenting about that, building cubes from operational systems. I thought the best way to explain it is using a demo. So I plan to spend about 10 minutes laying out the background and describing the benefits, then we will be off to BIDS* and SSMS* building a cube from a stock broker database.
*SQLBits is a SQL Server conference in UK, held since 2007. Currently it is held twice a year, i.e. every 6 months. The first one was on 6/10/07 at the Microsoft campus in Reading. The next one, SQLBits 5, is this week, 19-21/11/09, in Newport. The first one I attended was SQLBits 3, 13/9/08 in Hatfield. At SQLBits 4 in Manchester, 28/3/09, I presented a topic about data warehousing. SQLBits usually have 4 main tracks: SQL DBA, SQL Development, BI and new release. SQLBits is usually free to attend, or at least there is a free day. In SQLBits 5, Thursday and Friday are ‘paid’ days, where as the Saturday is free. I found that it is worth taking a day off work and pay the conference fee, to be taught by experts like Chris Webb, Allan Michell, Donald Farmer and Thomas Kejser. The complete agenda is here.
*BIDS = Business Intelligence Development Studio, is the development tool that we use to build SSAS cube, SSIS package and SSRS reports. It is Visual Studio, but customised specifically to do BI development, as opposed to building .NET applications.
*SSMS= SQL Server Management Studio. It is the tool that we use to administer SQL Server 2005 and 2008, as well as doing design and development e.g. creating stored procedures, table structures, and database diagrams.
*What is a cube? In daily conversation, a cube is an SSAS database. Amongst SSAS developers, we use the word ‘cube’ to differentiate our stuff from ‘database’. In SQL Server community, the word ‘database’ means relational database, where as the word ‘cube’ means OLAP database. I know that this is technically incorrect, and in my younger years I would have protested strongly (my background is engineering physics), so for those who prefered a more scientific definition, here it is: “A multidimensional database (MDB) is a structure designed to store measure data (fact) and hierarchical data (dimension), where the combination of members of each dimension points to zero, one or many measure values. This relationship can be described as a discreet function m[1..n] = f (d1, …, dn), in other words m is a function of d1 to dn, where m is the measure value and d1 to dn are the dimension members. The structure stores the aggregate values as well as the base values, typically in compressed multidimensional array format, rather than in RDBMS tables. Aggregate values are pre-computed summary or calculations of the base values.”
*What is the difference between a data warehouse and a data mart?
A data mart only has one fact table where as a data warehouse has many fact tables. Again, I know it is technically incorrect, and for those who preferred a more formal and precise definition, quoting from my book, “A data warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalized data store. It normally keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batch not every time a transaction happens in the source system.”
*What is the difference between an operational system and a transaction system? An operational system can be a reporting system (mainly read) or a transaction system (mainly write). A transaction system is an IT system which companies use to enter daily transactions such as sales, payments and call logs. A reporting system is an IT system which companies (or other types of organisations) use to extract data from the transaction system either to understand what happened in the business or analyse the business.