Data Warehousing and Data Science

1 November 2022

Power BI Datamart

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 8:39 am

Like SQL Data Warehouse, Microsoft once again named a product by what it does: Power BI Datamart. It is an Azure SQL database, with a visual query designer so users don’t need to write SQL to get the data out to BI tools, Excel or ML algorithms. It also generates a Power BI dataset automatically, which we can use to create Power BI reports.

Loading data into Datamart

Power BI Datamart uses a built-in Power BI Dataflow for loading data into it. After we create a Datamart, we click on Get Data and use Power Query to connect to various data sources such as Azure SQL Database, Azure Data Lake Storage, Databricks (Spark), Azure Synapse, HDInsight, SharePoint, Excel files, text files, JSON files, Parquet files, APIs, Salesforce, Snowflake, Redshift, BigQuery, Oracle DB, IBM DB2, Teradata, PostgreSQL, MySQL, Access, SAP, Analysis Services and ODBC (link):

The above list of data sources is so comprehensive, I could not believe it at first. This is a game changer. Gone are the days when we struggled in SSIS, Informatica or ADF to load data from those varied sources. Power Query has connectors and interfaces to every single one of those databases, files, data lakes and big data platforms.

We then select the tables, specify transformations, and Power Query builds the ETL rules in that built-in Dataflow, creates the relationships and loads the data from those tables in the data source into the Datamart. We can then see all those tables and relationships in the Table Tools tab:

Afterwards, we can create relationships (again) and measures. We can also hide the tables and columns that we don’t need.

In the workspace we automatically get a dataset:

We can schedule when we want to reload data from those data sources into the Datamart (called Scheduled Refresh). And we can also set up incremental load, i.e. only changed rows are loaded into the Datamart, based on the column that we choose (known as the "watermark column", link):
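As a rough illustration of what a watermark-based incremental load does conceptually (this is not Power BI's internal implementation; the schema, table and column names below are made up), the pattern in T-SQL terms is: remember the highest value loaded so far, and only pick up rows beyond it on the next run.

    -- Illustrative sketch only: the watermark pattern behind incremental loading.
    -- stg.Orders, dm.Orders, etl.WatermarkLog and LastModified are hypothetical names.
    DECLARE @LastWatermark datetime2 = (SELECT WatermarkValue FROM etl.WatermarkLog WHERE TableName = 'Orders');
    DECLARE @NewWatermark  datetime2 = SYSUTCDATETIME();

    -- Pick up only the rows that changed since the last load
    INSERT INTO dm.Orders (OrderId, CustomerId, Amount, LastModified)
    SELECT s.OrderId, s.CustomerId, s.Amount, s.LastModified
    FROM   stg.Orders s
    WHERE  s.LastModified >  @LastWatermark
      AND  s.LastModified <= @NewWatermark;

    -- Record the new high-water mark for the next run
    UPDATE etl.WatermarkLog
    SET    WatermarkValue = @NewWatermark
    WHERE  TableName = 'Orders';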

Security and access control

Power BI Datamart uses row-level security and roles to restrict user access to the Datamart. In the example below (link), users are restricted so that they can only access rows with a category ID of 1 or 2. Once that role is defined, we can assign users to it, as we can see below:
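In the Datamart itself the role is defined through the web UI, not with code, but for readers coming from the database side, the same idea expressed as SQL Server row-level security looks roughly like this (a conceptual sketch only; the function, table and column names are hypothetical):

    -- Conceptual sketch of "this role can only see category IDs 1 and 2" in T-SQL terms.
    -- Hypothetical names; a Power BI Datamart defines this through roles in its UI instead.
    CREATE FUNCTION sec.fn_CategoryFilter (@CategoryId int)
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN
        SELECT 1 AS AllowAccess
        WHERE  @CategoryId IN (1, 2)
            OR IS_ROLEMEMBER('db_owner') = 1;   -- administrators see everything
    GO

    CREATE SECURITY POLICY sec.CategoryPolicy
    ADD FILTER PREDICATE sec.fn_CategoryFilter(CategoryId) ON dbo.Product
    WITH (STATE = ON);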

Querying data

As I said at the start of this article, we can use a visual query designer to get the data out without writing SQL (link). As we can see below, the Visual Query Designer feels similar to Markit EDM, SSIS and Informatica.

We can select rows, filter rows, select columns, split columns, remove duplicates, join tables, change data types, transpose tables and pivot columns. We can also do error handling, such as removing errors or replacing errors (link). And we can also create custom functions (link).

Of course we can also write SQL:
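As a made-up illustration (the actual table and column names depend on what was loaded into the Datamart; the model schema prefix is the one the Datamart exposes when you connect from SSMS, as shown later in this article), such a query could look like this:

    -- Hypothetical query against a Datamart: total sales by category
    SELECT   c.CategoryName,
             SUM(s.SalesAmount) AS TotalSales
    FROM     model.Sales    AS s
    JOIN     model.Category AS c ON c.CategoryId = s.CategoryId
    GROUP BY c.CategoryName
    ORDER BY TotalSales DESC;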

Accessing Power BI Datamart from SSMS

We can also query the data in the Power BI Datamart using SSMS (SQL Server Management Studio), and using Azure Data Studio too (link):

Not only can we use SSMS for querying data in the Datamart, but also for adding users and configuring security. We can also use SSMS to look at the query statistics. This is how it looks from SSMS (link):

We can see above that the Power BI Datamart tables are listed as views, under the "model" schema. The relationships in the Power BI Datamart are stored in two views called metadata.relationships and metadata.relationshipColumns, as shown above (link).
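So, for example, a quick way to inspect the relationship metadata from SSMS (assuming the view names shown in the screenshot above):

    -- List the relationships the Datamart has defined, and the columns involved
    SELECT * FROM metadata.relationships;
    SELECT * FROM metadata.relationshipColumns;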

Conclusion

I think Power BI Datamart is a game changer.

  • It enables us to gather data from databases, data lakes, files and APIs into one data mart.
  • We can reload data every day/week and we can do incremental loading.
  • We can query the data using SSMS (can also do admin stuff).
  • We can build Power BI reports easily (the dataset is automatically created).
  • We can query the data using the visual query designer or by writing SQL.
  • We can easily get the data out to BI or ML tools, and to Excel.

That is a very powerful tool. Anyone in data warehousing and BI should consider this tool. At least take a look.

And I think the name suits it well. The Power BI Datamart is indeed a good data mart. With all its bells and whistles.

23 November 2014

SQL Server 2014 Installation

Filed under: Business Intelligence,SQL Server — Vincent Rainardi @ 4:42 pm

I’m installing SQL Server 2014 on a new laptop and it has not been an easy experience.

First of all, the Books Online is not included and when I tried to “Install Content from Online” I had the issue below (I include the solution).

Secondly, there is no BIDS! BIDS needs to be downloaded from this link, and installed into Visual Studio 2013. But my Visual Studio is 2015 (Preview), so I had to install VS 2013 first.

After installing VS 2013 and downloading SSDT BI (the new name for BIDS), when I installed SSDT BI it said that the architecture was incorrect: my SQL Server 2014 was 64-bit, and the SSDT BI download page said that it could do 32 or 64 bit. But in fact, the SSDT BI installation file was only for 32-bit (SSDTBI_x86_ENU.exe). There is no SSDT BI installation for 64-bit, because Visual Studio is only available in 32-bit.

The trick is, as "g2server" pointed out here: when installing SSDT BI, on the Installation Type page we need to choose "new instance", not "existing instance" (even though we have already installed both SQL Server 2014 and VS 2013).

All sorted now. Hopefully with this article, those of you who want to install the SQL Server 2014 documentation and SSDT BI don't have to spend as much time as I did.

Symptom: I opened the SQL Server 2014 documentation and Microsoft Help Viewer 1.1 opened up. I chose Install Content from Online, selected the content (SQL Server 2014 Books Online) and clicked the Update button. Help Library Manager said "Updating Local Library, Analyzing package delta" for 2 minutes, then displayed this error message: An exception has occurred. See event log for details.

The Event log:

An error occurred while updating local content: Microsoft.Help.CacheLib.CacheLibBitsException: Retrieving the COM class factory for component with CLSID {4991D34B-80A1-4291-83B6-3328366B9097} failed due to the following error: 80080005 Server execution failed (Exception from HRESULT: 0x80080005 (CO_E_SERVER_EXEC_FAILURE)). —> System.Runtime.InteropServices.COMException: Retrieving the COM class factory for component with CLSID {4991D34B-80A1-4291-83B6-3328366B9097} failed due to the following error: 80080005 Server execution failed (Exception from HRESULT: 0x80080005 (CO_E_SERVER_EXEC_FAILURE)). at Microsoft.Help.CacheLib.Downloader.CreateManager() — End of inner exception stack trace — at Microsoft.Help.CacheLib.Downloader.CreateManager() at Microsoft.Help.CacheLib.Downloader.Jobs() at Microsoft.Help.CacheLib.Downloader.CancelAllPendingJobs() at Microsoft.Help.CacheLib.DocumentationCache.DownloadWebContent(IEnumerable`1 webDownloads, Int32 totalDownloads, Int32 completedDownloads, String workingDirectory, StatusCallback`1 statusCallback) at Microsoft.Help.CacheLib.DocumentationCache.Download(ChangeDescription change, StatusCallback`1 statusCallback) at Microsoft.Help.CacheLib.DocumentationCache.Update(VendorName vendorName, Boolean checkForOnlineUpdates, StatusCallback`1 statusCallback, CertificateCheckCallback certCheckCallback) at Microsoft.Help.CacheLib.CacheManager.<>c__DisplayClass24.<UpdateAsync>b__23() at Microsoft.Help.CacheLib.AsyncOperationRunner.Run(Object state)

Source: https://support.microsoft.com/kb/870655?wa=wsignin1.0

In the Registry, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems:

%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16

Added 512 like this: Windows SharedSection=1024,20480,768,512

Still the same error.

Solution: download the 2012 and 2014 documentation .msha files from:

SQL Server 2014: http://www.microsoft.com/en-gb/download/confirmation.aspx?id=42557

Product Documentation for Microsoft SQL Server 2014 for firewall and proxy restricted environments

SQL Server 2012: https://www.microsoft.com/en-us/download/confirmation.aspx?id=347

Product Documentation for Microsoft SQL Server 2012 for firewall and proxy restricted environments

8 November 2013

The 5 Stages of DWBI Journey

Filed under: Business Intelligence — Vincent Rainardi @ 7:39 am

During my years working in MS BI, I have seen that companies' journeys in DWBI, particularly in MS BI, can be grouped into a few stages. I'm going to write down that view, which I hope will be useful.

Stage 1: Report Developer (Cost: £25k/year. Duration: 2-3 years)

In most companies, the "BI Team" usually started with just 1 report developer. The business asked IT to produce a report from "the database" (i.e. the back end database of the business application), and this report developer sat down with the business to "define" the requirements and write the report, say using SSRS, Jasper or Crystal. At this time they were not called BI. The word "BI" was never heard anywhere in the company. It was "Report", not BI, not even MI. There were probably 5 people in IT (out of 100 employees, annual revenue of £5m) and 90% of the IT business was about desktop support, i.e. Exchange/email, file server, network, SQL Server. And this report business was only a part time job. Then the business asked for the second report. And then the third. And the report developer became busier and busier. If the first report took him 3 weeks, the 5th report only took him 3 days. He became conversant with the tool, and with the data structure in the database (of the main business application, e.g. EPOS). No formal report specification was written. Just rough sketches of the report layout when sitting down with the business users.

How much? Annual spending at this stage was probably in the region of £20-25k. £14k for half an FTE (Full Time Employee) for the junior developer (because he's doing desktop support too), plus £6k for training, PC, office, expenses and pension. No cost for software. For SQL Server shops, RS came with SQL. For MySQL shops, Jasper is free. Maybe £500 for Crystal, for the Oracle shops. I don't like the perception that RS is free, so let's calculate it. Say RS is 25% of SQL Server (the other 75% is the relational DB, IS and AS). And SQL Server is say £15k (server hardware + licences). So RS is £4k (25% of £15k). So the annual spending is 14k + 6k + 4k = £24k. In the US the numbers are probably the same, but in $, i.e. $24k. How long? A typical company would probably spend about 2-3 years at this stage. It depends on the business growth of course. And the industry sector. A law firm would be quite different from an asset management firm. The former is not IT intensive whereas the latter is.

Stage 2: More Reports (Cost: £150k/year. Duration: 1-2 years)

As the business grew, the business users requested more and more reports from IT. This is especially true if the business application doesn't have a good reporting tool. If it can only report at transaction-by-transaction level, or customer-by-customer level, then the pressure to have reports developed by IT is bigger. If the business application doesn't have the capability to report things at the company level, or provide "time series" information which shows how the company progresses from week to week and month to month, then the pressure to create reports will be bigger. The number of reports produced by the IT department probably exceeds 50 now. Probably even more than 80. Many divisions are now requesting reports. Not just the sales department, but also stock management and finance. And IT needs its own reports too, out of the helpdesk calls. Report development is no longer a part time job. It is a full time job now. It has probably even grown from 0.5 person to 2 people. The number of IT staff has grown from 5 to 15. The company size is probably about 200 employees. Or even 400 employees, depending on the industry sector. The annual revenue is probably £30m now, with a total IT spend of £0.5m. Again, depending on the industry. They don't just use 1 business application, but probably 4-5 applications, most of them off-the-shelf. The maintenance of these apps is outsourced. They have 1 or 2 full time report developers, and probably 1 or 2 full time application developers, most probably Web-.NET-SQL developers, or PHP-Java-PL/SQL.

I’m going to give an example in MS BI now, but it’s probably a similar situation in other tech stacks. The report developer is now called an MI report developer. He/she uses SSRS to develop the reports, and hosts them in SharePoint. There are a few reports which are scheduled and emailed out. Reports provide value to the business. The quantity of the reports and the workload are both manageable. The business loves the IT department because they can get the information out of these 4-5 business applications, and get this valuable information to the business. It takes 2-3 days to produce a new report, but sometimes 2 weeks because there is a queue. Prioritisation works quite well. Some of the reports are beginning to go unused, particularly those which were developed 2 years ago. They are now superseded by newer reports. So there is an initiative to archive reports which are not used. They monitor the usage of the reports. The IT manager wants to make sure that the reports are worth the time invested to create them. These usage reports are a worthwhile tool for the IT manager to put forward good arguments to those business users who ordered reports but are not using them actively.

How much? The cost of MI is probably about £150k now. The salary of 2 report developers = 2 x £28k = £56k. Maybe 3 developers = £84k. Plus 3 x £6k for expenses and overhead (PC, office, training, pension) = 18k + 84k = £102k. Now IT has a dedicated SQL Server for MI (£15k for H/W + S/W). Probably 2 servers (£30k). Plus some SharePoint licences to host the RS reports (say £10k). Probably plus £5k in consultancy fees to set up/configure RS/SharePoint. So the annual spending is 102k + 30k + 10k + 5k = £147k. How long? This stage is surprisingly fast. Probably only 1-2 years, at which point the company will experience stage 3: too many reports.

Stage 3: Too Many Reports – Self Service BI (Cost: £150k/year + £50k one-off. Duration 1 year)

By the time the report development workload reaches 3 FTEs (Full Time Employees), or £150k per year, the IT manager is very much interested in the idea of Self Service BI. The idea of creating a data warehouse and cubes on top of it becomes a good business proposition from the cost point of view. The users will be able to use Excel to query the cubes. They may consider the idea of buying BI software like BO, Cognos, MicroStrategy or QlikView, but the cost probably becomes a barrier. It’s incredibly difficult to get the prices (if you are not buying) but I’ll take a stab, otherwise we won’t have an idea about the cost. If I’m way off, I’ll be glad if the corresponding vendors correct me. QlikView is probably about £150k for 100 users, i.e. £1500 per named user including document licences. Plus £50k for live and test server licences. Plus £10k for the SharePoint component + workbench. Plus £10k for server/infra cost. Plus £10k for training and £20k for consulting services. So a total of 150k + 50k + 10k + 10k + 10k + 20k = £250k one off cost. Plus 20% annual cost for support (£50k). BO is £35k per 25 CALs, x4 for 100 users = £140k. Plus 3 x £400 = £1200 for developer licences. Say £50k for training, consultants and servers. So £200k one off, plus probably 20% = £40k annually. Cognos and MicroStrategy are probably similar, i.e. in the region of £200-250k one off cost, plus £40-50k annually, for 100 business users. So it’s a no-goer if the Stage 2 cost is £150k/year, because the developer costs won’t go away, but increase, as they have to hire a QlikView/BO/Cognos developer or retrain existing staff.

Hence the idea of doing self service using a data warehouse + SSAS cubes starts to take off, in the head of the IT manager. The numbers look good: SSAS is free. At least that was the initial perception. SharePoint is already used for SSRS, and Excel is widely deployed to each desktop, so no additional cost then! And this data warehousing stuff will only cost 1 new server, with big storage, so £20k. Say 2 new servers, plus storage = £40k. Much more palatable than the £250k of the popular off-the-shelf BI software above. There is a training cost, i.e. SSAS, SSIS, modelling, say £10k. This initiative is double-acting as a “retainer” too, from an HR point of view. Good people stay because they can develop their skills, getting them ready to go to the next level in their career. And the whole DW OLAP thing is a new world for them. Of course they’d stay. Eagerly. At this point they learn about dimensional modelling, about ETL, and about data quality. They learn about metadata, and about the technicalities of building cubes. After struggling for about 6 months, in their spare time, they manage to construct 1-2 fact tables and a few dimensions in the data warehouse, and build a cube on top of it. The demo is a great success; the business absolutely loves it. I have attended a “first business demo” (on SSAS) about 8 times now and I haven’t experienced a single event where the business is not excited seeing their numbers being sliced and diced in front of them, turned into time series charts and gadgets.

Now, with all the data being delivered to them via the cubes, the business can do self service BI. What they love in the first few weeks is that they don’t need to wait 3 weeks for IT to create a report for them. They can directly get the information they need themselves. And what the IT manager loves in the first few months is that the number of report requests goes down a bit. The number of reports each quarter is something like: 100, 120, 150, 200, 160. They manage to knock some reports off the list (in this example 40) because the information needs can be satisfied from the cubes, i.e. self service BI.

At this point they would also consider 2 kinds of software which help the development of the data warehouse: a one click warehouse builder (such as RED and Kalido), which can build a data warehouse very quickly, and data virtualisation software (such as Composite), which can deliver BI without the pain of building a data warehouse. So they try out 1 or 2 of these products, hoping that they could lighten their workload. A few companies might switch and abandon MS BI, but the majority would stay with MS BI and only use tools that complement it.

How much and how long? The one off cost is about £50k: £40k for servers and £10k for training. The annual spend is the same as Stage 2, £150k/year (£100k for 3 people, £50k for infra). This is for 100 business users. How long? Usually companies stay at this stage only for a short period, i.e. 1 year. This is the period when the initial data warehouse is constructed, and the cubes developed. After the initial version is deployed, the IT department will have many requests to put more and more data into the DW and cubes. And that’s Stage 4.

Stage 4: Growth of DW and Cubes (Cost: £200k/year. Duration: 2-3 years)

There were many requests to put various attributes and measures into the cubes and warehouse. Can you put the sales person into the cube please? Oh, and the supplier please. And the store manager names please? There will be a change of product grouping next month, can you accommodate that please? We have a new price list coming this quarter, that needs to go into the warehouse. Can we have the employee hierarchy in the cube? You get the picture. Lots of changes and new requests to add more data into the warehouse and the cubes.

The danger is, the “BI Team” is still new to DW. They just learned yesterday what a dimension is, what a fact table is, and what grain is. They are inexperienced in data modelling. And if they get it wrong, it costs a lot of working days to correct it. Most probably weeks, not days. They read Kimball’s data modelling book, and expect to see a design that suits their case, but there are a lot of specific points which are not mentioned in the book. They try to browse the internet to get information about data modelling, but get even more confused. And they need even more guidance on the ETL side. They are clueless about how to do SCD type 2. They hope that the SSIS SCD2 task will solve it for them, and it does to some degree, but alas, the population of the fact tables now becomes an issue. Plus the performance is an issue, and the metadata is an issue. At this point, they seriously think about hiring a data architect. Someone who has been designing and building data warehouses and cubes in different companies. Someone who knows how the ETL should be built. So they hire a data architect, say £50k/year. He knows about data modelling, ETL, and SSAS. And they hire an ETL developer too. An SSIS + T-SQL guy who has some exposure to cubes. This guy has a lot of experience moving data from one place to another. Data integration experience. One of the 3 report developers now becomes an ETL developer. They are now called a BI Team, with 5 people: 2 report developers, 2 ETL developers, and 1 data architect who also does SSAS.
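For a team at this stage, a minimal sketch of what an SCD type 2 load involves in T-SQL may help (the table and column names are made up; a real implementation also has to handle NULLs, deletes, late-arriving data and so on): expire the current row when an attribute changes, then insert a new version.

    -- Minimal SCD type 2 sketch. dim.Customer and stg.Customer are hypothetical.
    DECLARE @Today date = CAST(GETDATE() AS date);

    -- 1. Close off the current rows whose attributes have changed
    UPDATE d
    SET    d.EffectiveTo = @Today,
           d.IsCurrent   = 0
    FROM   dim.Customer d
    JOIN   stg.Customer s ON s.CustomerCode = d.CustomerCode
    WHERE  d.IsCurrent = 1
      AND  (d.City <> s.City OR d.Segment <> s.Segment);

    -- 2. Insert a new current version for changed customers, and for brand new customers
    INSERT INTO dim.Customer (CustomerCode, City, Segment, EffectiveFrom, EffectiveTo, IsCurrent)
    SELECT s.CustomerCode, s.City, s.Segment, @Today, '9999-12-31', 1
    FROM   stg.Customer s
    LEFT   JOIN dim.Customer d
           ON  d.CustomerCode = s.CustomerCode
           AND d.IsCurrent = 1
    WHERE  d.CustomerCode IS NULL;   -- no current row: either new, or just expired in step 1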

Some of the SSRS reports are now decommissioned. They manage to get the number of reports down, say from 200 to 100. Some of these reports are re-pointed to the data warehouse. Instead of reading from a copy of the business application database, the SSRS reports read from the data warehouse tables. Initially the DW needs to be reconciled with a copy of the business application (of course!), but as time moves on, there are measures created in the warehouse which don’t exist in the live app, for example ranking measures such as deciles and quartiles, and various ratios between original measures such as financial ratios.
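As an aside, ranking measures like those are straightforward to add in the warehouse with window functions; a hypothetical example of deciles and quartiles by customer revenue:

    -- Hypothetical example: decile and quartile of customers by annual revenue,
    -- computed in the warehouse even though the live app has no such measure.
    SELECT CustomerKey,
           AnnualRevenue,
           NTILE(10) OVER (ORDER BY AnnualRevenue DESC) AS RevenueDecile,
           NTILE(4)  OVER (ORDER BY AnnualRevenue DESC) AS RevenueQuartile
    FROM   dim.Customer;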

Business users enjoy the benefit of having a lot of data at their fingertips. They can have the information instantly, without waiting 3 weeks like before for the reports to be created by the report writer.

How much and how long? Surprisingly, instead of decreasing, the cost of BI is increasing. It’s now 5 people instead of 3, so 4 x £28k + 1 x £50k = £162k. Plus £40k of annual infra cost such as additional capacity, replacement servers, etc. So about £200k/year. How long? A typical period is 2-3 years, after which the business users become dependent on BI. The data coming from the DW via cubes and reports really drives the business forward.

Stage 5: Users Become Dependent on BI (Cost: £50k one off + 10k/year. Duration: 3-4 years)

One by one, each business area uses the new technology (DW and cubes). Sales department, manufacturing department, finance department, marketing department, even HR department. In the insurance industry: underwriting, claims, reinsurance, actuarial, finance. In the utilities industry: sales, customer service, operations, finance. The information from BI affects the business decisions in each area: premiums at renewal, pricing policies, CRM strategy, product design, marketing strategy, production cost control, etc. Even more requests come in to put more and more data into the warehouse, cubes and reports.

The companies at stage 5 usually look for BI tools. Visualisation tools. And data warehouse tools such as DQ and ETL. They want to make the lives of their users easier with BI and visualisation tools. They want to make their own lives (IT) easier, using various development tools. As development work grows, they need to tidy up governance, version control and code repositories, request and bug management, release management and data quality. Software like Red Gate, SQL Sentry, JIRA and TFS would come into consideration. Red Gate and SQL Sentry are SQL Server development (and server monitoring) tools. JIRA is a request and bug management tool. TFS is a code repository and version control tool. It also does task management such as Sprint planning.

With regards to data visualisation, software like Cube Player, Power BI, Strategy Companion, Panorama, XLCubed, Pyramid, Verigence Vivid, Bonavista Dimensions and Tableau would come into the picture of any MS BI shop. Out of the 15-20 SSAS clients out there, most companies only know 1 or 2. Probably 30 actually (SSAS clients, I mean); the complete list is at Vidas Matelis’ ssas-info website, the most content intensive website on SSAS. At this stage, companies are trying out these SSAS clients, and eventually get hooked on one of them. I can’t recommend which one (even though I have a personal preference); it would probably be good to write about SSAS clients actually. I just browsed for them and very little is written comparing them. The most notable comparisons are here, here and here.

Apart from SSAS clients, companies also frequently think about reporting tools, i.e. what’s better than RS? If their budget is big enough, they will probably be approached by (or think about) QlikView and Spotfire. If their budget is minimal they probably think about Jasper and Pentaho. QlikView, Spotfire, Hana, TM1 and PowerPivot are all “in memory OLAP”. Unlike SSAS, which stores the multidimensional structure on disk, these tools read relational tables and create the multidimensional structure in memory, much like MicroStrategy. TM1 is not quite like that; it stores its proprietary multidimensional structure as files. Ditto QlikView.

How much and how long? It depends on the tools they select and the industry they are in. For retail, telco, shipment, manufacturing, legal, mining, transport and utilities, it is about 2-3 years. For banks, fund managers, healthcare and insurance, probably 3-4 years. Could be 5. For costing, let’s use Tableau. The cost of implementing Tableau on top of DW + SSAS is surprisingly low: £50k one off + £10k/year. Developers use Desktop Pro, which is $1999 + $199/year maintenance (source: link). Business users become report creators; they use Tableau Online licences, which are $500/year. And report readers are free. So if we have 3 developers and 15 report creators the cost is about (assuming the price is the same in the UK but in £): 3 x £2k = £6k one off, plus an annual cost of 3 x £200 + (15+3) x £500 = £9600/year. Plus £40k for training, servers and consultants, it’s around £50k one off project cost and £10k/year ongoing cost.

After

What’s after stage 5? BI is stable, mature and managed well. Operations are governed. What usually happens at this stage is a complete change of technology. For example: implementation of Cognos or Business Objects. Yes, replacing the whole Microsoft BI stack with the Cognos or BO stack, leaving just the SQL Server DB engine, which could be replaced with Oracle too. The other approach is to stay on the Microsoft BI path, and look at a few things:

a. Parallel Database

For high volume industries such as retail and telecom, as the business grows, the number of transactions per day could be in the millions. This requires some kind of MPP (Massively Parallel Processing), meaning a parallel database engine: Netezza, PDW (Parallel Data Warehouse, Microsoft’s new MPP), Exadata, Greenplum or Teradata. Or a column oriented database such as Vertica or Sybase IQ.

b. Data Mining

For certain industries, CRM and marketing are vital, particularly in the retail sector. They could look into integrating CRM and marketing campaign management with their data warehouse. This requires data mining and forecasting, which can be done in SSAS, but some companies use SAS, SPSS, BIRT, or even do it manually using .NET or R.

c. Big Data

This is where most companies after stage 5 are going. At least trying. MapReduce, Hadoop, MongoDB, Hive, and things like that. The project cost is not high, because it runs on commodity hardware and the software is mostly free. But the benefit is potentially big. Hence the motivation.

d. Text Mining

For certain industries such as healthcare, insurance and retail, the prospect of being able to crawl over millions of patent documents, email correspondence with customers, and thousands of policies has good business value. Software like Clarabridge, Angoss, Attensity and NetOwl can be implemented for this purpose (mining text within documents or emails for information).

For IT managers, I hope this helps broaden the DWBI horizon, whatever stage you are in. For vendors, I’d be grateful if you could advise me if I’m wrong, particularly about the costs. For developers, I hope this article clarifies a bit about the future of your career path, if you stay within the DWBI world. There are many experts out there and I’m not even an expert, so I would be grateful for your comments.

VR, 9/11/14:

  • Andrei Padrei on QlikView and Tableau prices: link
  • Tableau prices: link

14 July 2012

What is Big Data, Data Warehouse, Data Mining

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:35 am

Michael Koploy from Software Advice gathered the opinions of 5 BI experts about the definitions of 3 BI terms that puzzle many people:

  • What is Big Data
  • What is a Data Warehouse
  • What is Data Mining

The article is here. The 5 experts are:

My definitions:

What is Big Data? (from Hadoop in Data Warehousing article in my blog)

What is big data? Again it is easier to learn by example than by definition. Big data is images/photos/video/music streams (and as files too), unstructured text (documents, emails), web site log files, outputs (streams) from scientific/industrial instruments and sensors such as thermometers, digital pneumatic/pressure gauges.

Big data does not have to be big (peta/exabytes). Even 50 GB can be considered big data if the structure is too complex for a normal RDBMS to store.

What is small data? Small data is simple data structures, e.g. numbers (be it monetary, integers, fractions or floating points), strings (names, description, types), dates, times, and all the data we used to know in the last 30 years of data warehousing history.

A large collection of small data is big data. Again I have to put a number here so you have an idea of how big is big. I don’t like being unclear. A collection of simple data structures of more than 1 petabyte is big data.

Bear in mind that it is now 2012. Petabytes (PB) is a 2012 term. But 5 years ago, in 2007, it was terabytes (TB). So in 2014 the definition will change to: a collection of simple data structures of more than 1 exabyte is big data. And in 2015 the definition will change to “more than 1 zettabyte”.

In a few years’ time, these seemingly foreign words will be common words in IT: petabytes, exabytes, zettabytes, yottabytes. Here is the Wikipedia page for their definitions: link.

What is a Data Warehouse?

From my book: (Amazon.com, Amazon.co.uk, Apress, my blog)

A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Below is a diagram of a data warehouse system: (from my book)

Below is a short explanation about each component in the above diagram, and how the components work together in a data warehouse system: (again, this is from my book)

The source systems are the OLTP systems that contain the data you want to load into the data warehouse. Online Transaction Processing (OLTP) is a system whose main purpose is to capture and store the business transactions. The source systems’ data is examined using a data profiler to understand the characteristics of the data. A data profiler is a tool that has the capability to analyse data, such as finding out how many rows are in each table, how many rows contain NULL values, and so on.

The extract, transform, and load (ETL) system then brings data from various source systems into a staging area. ETL is a system that has the capability to connect to the source systems, read the data, transform the data, and load it into a target system (the target system doesn’t have to be a data warehouse). The ETL system then integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The reason for getting the data from the source system into the DDS and then querying the DDS instead of querying the source system directly is that in a DDS the data is arranged in a dimensional format that is more suitable for analysis. The second reason is because a DDS contains integrated data from several source systems.

When the ETL system loads the data into the DDS, the data quality rules do various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems. Bad data can also be automatically corrected or tolerated if it is within a certain limit. The ETL system is managed and orchestrated by the control system, based on the sequence, rules, and logic stored in the metadata. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules, and other information about the data.

The audit system logs the system operations and usage into the metadata database. The audit system is part of the ETL system that monitors the operational activities of the ETL processes and logs their operational statistics. It is used for understanding what happened during the ETL process.

Users use various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to retrieve and analyse the data in a DDS. Some applications operate on a multidimensional database format. For these applications, the data in the DDS is loaded into multidimensional databases (MDBs), which are also known as cubes. A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened.

Tools such as analytics applications, data mining, scorecards, dashboards, multidimensional reporting tools, and other BI tools can retrieve data interactively from multidimensional databases. They retrieve the data to produce various features and results on the front-end screens that enable the users to get a deeper understanding about their businesses. An example of an analytic application is to analyse the sales by time, customer, and product. The users can analyse the revenue and cost for a certain month, region, and product type.
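To make the dimensional data store format concrete, here is a miniature made-up example of a DDS in T-SQL (an illustration of the idea, not taken from the book): one fact table holding the business events, joined to dimension tables by surrogate keys.

    -- A miniature dimensional data store: one fact table and two dimension tables.
    -- All names are made up for illustration.
    CREATE TABLE DimDate (
        DateKey      int      NOT NULL PRIMARY KEY,   -- e.g. 20120714
        CalendarDate date     NOT NULL,
        MonthNumber  tinyint  NOT NULL,
        YearNumber   smallint NOT NULL
    );

    CREATE TABLE DimProduct (
        ProductKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,   -- surrogate key
        ProductCode  varchar(20)       NOT NULL,               -- natural key from the source
        ProductName  varchar(100)      NOT NULL,
        ProductType  varchar(50)       NOT NULL
    );

    CREATE TABLE FactSales (
        DateKey      int           NOT NULL REFERENCES DimDate (DateKey),
        ProductKey   int           NOT NULL REFERENCES DimProduct (ProductKey),
        Quantity     int           NOT NULL,
        SalesAmount  decimal(18,2) NOT NULL
    );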

What is Data Mining? (from my book)

Data mining is the process of exploring data to find the patterns and relationships that describe the data and to predict the unknown or future values of the data. The key value of data mining is the ability to understand why some things happened in the past and the ability to predict what will happen in the future. To refer to predicting the future with regard to data mining, some people use the term forecasting, and some call it predictive analytics. On the other hand, when data mining is used to explain the current or past situation, it is called descriptive modeling, descriptive analytics, or knowledge discovery.

Implementing data mining in the business is growing by the day, both through descriptive and predictive analytics. Using data mining, we can find the correlation between purchase patterns and customer demographics. For example, in our Amadeus Entertainment case study, we can find whether there is a relation between the sales of a music product type and the customer interest or occupation. In the financial services industry, data mining is traditionally used for credit scoring, for example, to predict the ability and willingness to pay the loan based on (each country has different factors) payment behavior, past delinquency, current debt, credit history, number of inquiries, address, bankruptcies filed, and customer demography.

In business intelligence, popular applications of data mining are for fraud detection (credit card industry), forecasting and budgeting (finance), cellular/mobile package development by analyzing call patterns (telecommunication industry), market basket analysis (retail industry), customer risk profiling (insurance industry), usage monitoring (energy and utilities), and machine service times (manufacturing industry).

17 August 2011

How to Choose the Right BI Technology to Suit Your Style

Filed under: Business Intelligence — Vincent Rainardi @ 5:42 pm

By: Tejas Goradia

I want to see reports. Microsoft offers Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), and PerformancePoint Services; which one should I use?

This post should help you choose a Microsoft data visualization tool based on reporting requirements, driven by business process and stakeholder expectation and needs analysis. Typically, reporting styles fall into Self-Service Analysis, Business Reporting, Parameterized & Operational Reporting, Performance Monitoring and Scorecarding.

Figure 1 – Choosing the Right Microsoft BI Tool (Microsoft Excel, PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), and PerformancePoint Services in SharePoint)

Description and Key Characteristics

  • Self-Service Analysis – The best analysts (non-technical users), and often the best assets, in most organizations are those users who love to dig into the data and really understand why the business is performing the way it is. They want to perform self-service analysis (reports, graphs, dashboards, and so forth) without reliance on IT. They want to be able to easily drill down, pivot, filter and format data, and integrate information from a variety of sources without the need for specialized technical skills such as SQL, MDX, or other query languages, using familiar business analysis skills and tools such as Excel. Because Self-Service BI tools empower users to gather data from multiple sources, the need for a structured corporate system to be built, such as a data warehouse, is reduced. Tool Options: Excel, PowerPivot for Excel
  • Business Reporting – While typically not as fluid, fun, or “sexy” as the self-service creations discussed above, Business Reporting is just as important to organizations because of the need for power users in business functions to create highly formatted and distributable reports delivered in multiple formats such as PDF, Excel, HTML, and so on. Reports are typically based upon approved corporate data, and then shared more broadly with managers, teams, or departments. Specialized technical skills such as SQL, MDX, or other query languages may not be required to author reports. IT involvement is moderate, usually overseeing the distribution and monitoring of the reporting environment and building of the structured data layer upon which the reports are built. A data warehouse, data mart and/or online-analytical-processing (OLAP) environment is often in place and used as the data source for business user reporting. Tool Options: Excel, PowerPivot for Excel, Report Builder
  • Parameterized & Operational Reporting – Similar to the Business Reporting style, Parameterized & Operational Reporting is also characterized by fixed-format reports that need to be distributed to the responsible parties for each store, region, or other organizational entity in a timely manner and without requiring user involvement to locate and run the report. The reports, however, are authored and managed centrally by IT instead of business users and usually follow a pixel-perfect format and rendering style. Consistency, scalability, manageability, auto-refresh and automated distribution are some of the key characteristics of this style. Data used to create reports comes from corporate-sanctioned and IT-managed data sources. Reports may be complex and require special technical skills such as advanced SQL, MDX, or other query languages to build. The reports may be delivered via portal, file share, email, or even sent directly to a printer, and may be generated as PDF, Excel, XML, comma delimited text file, TIFF image, HTML or Microsoft Word formats.
    Tool Options: SQL Server Reporting Services (including Report Builder)
  • Performance Monitoring – Just as a dashboard in a car helps a driver quickly understand the real-time operational performance of the automobile, a dashboard report helps business decision makers understand the current health of their business. A dashboard often combines key performance indicators from various business functions on a single page to help provide an at-a-glance view of performance and the ability to drill down further when something is off track or performing extremely well. This style is called “Performance Monitoring (Dashboard)”. This style is generally found in more mature BI environments where data points of interest have been defined, key determinants of business success have been identified, and a comprehensive BI strategy exists. Tool Options: Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), PerformancePoint Services.
  • Scorecarding – Scorecarding is a style that describes highly summarized views with Key Performance Indicators (or KPIs) measured and scored against predefined targets such as a balanced scorecard. This style is generally a part of a performance management program, though it can also be used to measure operational performance. While all of the Microsoft BI tools permit “scorecard-like” reporting, PerformancePoint Services is the Microsoft tool that enables Scorecards by using a methodology such as Balanced Scorecard or by creating your own methodology. PerformancePoint Services facilitates weighted score rollups, measurement against targets, and pre-defined KPIs, in addition to providing the broader set of PerformancePoint functionality to allow integrating the Scorecard into a dashboard alongside other reporting and analysis capabilities. Tool Options: Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), PerformancePoint Services.

References:
Choose the right business intelligence technology to suit your style (white paper)

31 May 2011

Differences Between Data Warehousing and Business Intelligence

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:38 pm

Try asking your colleague what the difference is between business intelligence and a data warehouse. I find that a lot of people, even those who work on BI projects and in the BI industry, do not understand the difference. A lot of people use these 2 terms interchangeably. Some people even prefer to use 1 term instead of the other because it simply “sounds better”. Many people think that business intelligence is not just a data warehouse, but that there is more to it. But when asked “what business intelligence systems are not data warehouse systems?” or “what part of business intelligence systems are not data warehouses?”, most of them have difficulty explaining the answer.

These days, “business intelligence” is the term used by most vendors in the industry, rather than “data warehouse”. Most of them call/classify their tools as business intelligence software, not data warehouse software. The name of the Cognos product is “Cognos 8 Business Intelligence”. BusinessObjects label themselves as a “BI software company” and the “global leader in BI software”. The name of one of Hyperion’s products is “Hyperion System 9 BI+”. SAS Enterprise BI Server provides a fully integrated and comprehensive suite of business intelligence software. Microsoft promotes SQL Server 2005 as the end-to-end business intelligence platform. It seems that only the Kimball Group consistently uses the term data warehouse. Bill Inmon, as the inventor of this term, also uses the term data warehouse.

So, let’s get into the details. This is an example of a data warehouse system:


It includes ETL from the source system, front end applications (those 10 boxes on the right hand side), and everything in between. It has a control system, an audit system and a data quality system (also known as a data firewall). Not all data warehouse systems have all the components pictured above; for example, some data warehouse systems may not have an operational data store (ODS), see this article for details.

The 2 blue items are data warehouse databases. The cylinder is in relational format (labelled as dimensional data store, DDS for short); the box is in multidimensional format (labelled as cubes in the picture above). This blue cube is also known as an online analytical processing cube, or OLAP cube for short.

The yellow items are business intelligence applications. Most business intelligence applications take data from the multidimensional format of the data warehouse, but some take data from the relational format. The whole diagram above is also known as a business intelligence system.

Some business intelligence applications take data directly from the source system. For example, some dashboard systems may get sales summary data from the source system and display it in gauge meter format. In this case, we can not call the system a data warehouse system. It is still a business intelligence system, but it is not a data warehouse system, because it does not have a data warehouse database behind the gauge meter application.

Business intelligence systems, in the past also known as executive information systems or decision support systems, are non-transactional IT systems used to support business decision making and solve management problems, normally used by top executives and managers. Many varied definitions of business intelligence systems exist in the marketplace today; the one from Dr. Jay Liebowitz is arguably one of the better ones. Most people agree that OLAP and data warehouse systems are a major and important part of business intelligence systems. Most business intelligence systems are in the form of a data warehouse system. Yes, there are business intelligence systems that do not use OLAP or data warehouses, as illustrated in the example of the gauge meter application above, but they are rarer than the ones with OLAP or a data warehouse.

According to Ralph Kimball, in his book The Data Warehouse ETL Toolkit, a data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making. He stressed that a data warehouse is not a product, a language, a project, a data model or a copy of a transaction system. In an interview with the Professional Association for SQL Server (PASS) on 30th April 2004, he explained the relationship between data warehousing and business intelligence.

In their latest book, The Microsoft Data Warehouse Toolkit, Joy Mundy and Warren Thornthwaite do not differentiate between data warehouse systems and business intelligence systems. They consistently use the term DW/BI system throughout the book. This is understandable because, as I described above, most business intelligence systems are in the form of a data warehouse system.

Bill Inmon, who invented the term data warehouse, defines a data warehouse as a source of data that is subject oriented, integrated, nonvolatile and time variant for the purpose of management’s decision processes. He pointed out that the term data warehouse was never trademarked or copyrighted. As a result, anyone can call anything a data warehouse. He recently defined a new term, DW 2.0, and this one is trademarked so nobody can change the definition. He explained the architecture in his article in DM Review, along with the differences between the first generation of data warehouses and DW 2.0 and its advantages.

So, as a summary, back to the original question: what is the difference between a data warehouse and business intelligence? Most business intelligence systems are based on data warehouse systems (the ones with a dimensional model, fact tables, dimensions, etc), but some business intelligence systems are not data warehouse systems, i.e. they take data directly from the source system, like the example described above. Business intelligence applications (as opposed to business intelligence systems) are the yellow boxes on the diagram above, i.e. the front end applications. The data warehouse database (or sometimes people drop the word database, so it becomes just ‘data warehouse’) is the blue cylinder and blue box on the diagram above, i.e. the dimensional storage, whether in relational database format or in multidimensional database format.

If people say ‘data warehouse’, be careful because it can mean either data warehouse system (the whole diagram above) or data warehouse database (just the blue items). If people say ‘business intelligence’, it can mean either business intelligence system (the whole diagram above, or a BI system without data warehouse) or business intelligence application (the yellow boxes).

I hope this article makes the terms clearer, but I am open to comments and suggestions. As Ralph Kimball said, if you ask 10 different people what a data warehouse is, you are likely to get 10 different answers.

Vincent Rainardi
1st May 2006

This is a repost from SQLServerCentral.

5 April 2011

Building a Data Warehouse at the Same Time as the Operational System

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:10 am

A data warehouse is usually built when the operational system is up and running. But sometimes we have to build a data warehouse or data mart at the same time as the operational system. This has particular challenges.

  1. We can’t get production quality data, because the system is not in production yet. And for several items we can’t get the data at all, because that part of the system is not built yet.
  2. The source system in general is still buggy and not stable. The database can still change. The functionality can still change.

This doesn’t mean that we can’t do it. We can. But there are certain things that we need to do.

  1. First and foremost, for the benefit of project management (the economics of resources), agree the interface between the operational system and MI. This will speed up the project timeline overall, because the MI / DW team can start development without waiting for the operational system to complete first. We remove the dependency between them.
  2. Unlike a normal warehouse project, where we usually query the source tables directly using our data integration tool, I would advise not to do it that way. It is better if this agreed interface is in the form of flat files (a minimal loading sketch follows after this list). This is because a) the operational database is still changing, and you don’t want to interface with a database which is still changing; by interfacing using files we establish some stability; and b) the files could be used by other downstream systems too, not only the warehouse.
  3. Produce sample files for the interface. If we have the data in the operational system, use real data. If for certain areas we don’t have any data (because the system hasn’t been built yet for that area), then use made-up data. Ask the ops team to describe the functionality of that area, then compose the sample data. Ask the operational team to verify that the sample data makes sense.
  4. Then we go ahead with the usual nitty gritty of DW project development cycle, be it Agile or waterfall. Define the front end of the MI, either cube or report. Define the dim model, create the tables. Build the ETL (can’t believe we still use this term) and build the reports & cubes.
  5. Follow the ops team development. If they develop area 4 first, then area 1, we do so too. If they put area 4 to UAT, and area 1 is still in dev, we do so too. I know we need all the dims to complete a particular star, and some of them could be in area 1 which is still in dev. But the principles still apply: as far as possible, follow the operational team development progress. You don’t want to MI something which hasn’t been developed.
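As a minimal sketch of what point 2 looks like on the warehouse side (hypothetical file path, table name and file layout, assuming a comma-delimited file with a header row):

    -- Hypothetical example: loading an agreed flat-file interface into a stage table
    BULK INSERT stg.Policy
    FROM 'D:\interface\policy_20110405.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');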

As usual I welcome any questions and comments at vrainardi@gmail.com. Vincent 5/3/11.

19 February 2011

Where to Store the Business Logic

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 10:29 am

If we have measure C which is calculated from measure A and measure B, we need to decide whether we want to store measure C physically in the fact table, or whether we want to calculate it on the fly in the report. The term for this is “physicalise”, as in “do we physicalise that measure?”.

There are a few considerations when deciding this:

  1. For a ratio or percentage, do not store it in the fact table. We need to calculate it on the fly, whether in a cube or in a BI report. This is to enable the users to view that ratio correctly at any level in the hierarchy. I’ve written about this in the past (link), particularly with regards to an OLAP cube.
  2. If it is an aggregation, do not store it in the fact table, but calculate it on the fly. OLAP cubes are clearly better here at doing aggregation, compared to relational reports. After all, that’s what cubes are for: to aggregate data. If you store it in a fact table, you would have to “pin” the calculation to a particular level, so users won’t be able to drill up and down the hierarchy viewing that aggregate measure.
  3. There is a classic argument of “report performance” versus “development time”. This thought is inherited from OLTP reporting, rather than originally from data warehousing/BI. The idea is to calculate the “heavy” things in advance, and store them in a table. By “heavy” they usually mean “aggregation” and “join”. If it is A+B we don’t have any issue, but if it’s joining 15 tables, or aggregating 1 billion rows down to 100 rows using “group by”, then the SQL could run for 1 hour, making the report useless. Hence the idea of performing that calculation in advance, and storing it in a “result” table (or column). The issue is, when the underlying data changes, the result table or column must be updated. So when deciding this, always look at how much time it takes to update the result table/column.
    These days RDBMS technology gives us “computed columns”, which means that for simple logic like modulo or subtraction, we can store the result as a computed column (see the sketch after this list). The database engine will calculate these columns on the fly, as and when the rows are updated/inserted. They are very efficient. I’ve seen 200 million rows with 2 computed columns running on SQL Server 2008. The overhead was negligible compared to before we created those 2 columns, as in: no detectable performance degradation when the ETL runs to update that 200-million-row fact table. Of course, it’s an incremental update, not a truncate-reload. If you are doing a trunc-reload on a 200m-row fact table, you should seriously look at incremental extraction, see my article here.
  4. There are several places within the DWBI system where you can “physicalise” a calculation or a business logic: a) in the ETL, then store it physically as a measure, b) in a view, whether it’s materialised or not, c) in stored procedures, d) in the BO universe or Cognos metadata layer or MicroStrategy metadata layer or SSAS DMV, e) in the BO/SSRS reports, Cognos Transformer, CubePlayer, Novaview, QlikView, Tableau or SSAS calculation. There are advantages and disadvantages of each place, so we need to choose carefully.
    There are 3 things you might want to consider when selecting the place: a) the earlier it is in the stream, the more accessible it is for downstream components. For example, if you put the business logic in a database view, then both the reports and the cube can access it. But if you put it in the report, then the cube can’t access it. b) If it is a ratio or an aggregate, then it is not suitable to calculate it in the cube or reports. It needs to be in the ETL or a view. c) In the ETL you can do “multiple passes” or “row-by-row processing”, so you have more calculation ability, whereas in the view, metadata layer or report your ability is limited to a one-pass select.
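Here is the computed column sketch mentioned in point 3 (a hypothetical fact table; in SQL Server you can also mark the column PERSISTED if you want it stored on disk and indexable):

    -- Hypothetical example of a computed column: the engine derives the value on the fly,
    -- so the ETL never has to update it separately.
    CREATE TABLE dbo.FactTrade (
        TradeKey   bigint        NOT NULL PRIMARY KEY,
        Quantity   decimal(18,4) NOT NULL,
        Price      decimal(18,4) NOT NULL,
        GrossValue AS (Quantity * Price)   -- computed on the fly; add PERSISTED to store it
    );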

16 January 2011

Team Setup for DWBI Projects

Filed under: Business Intelligence,Data Warehousing,Other — Vincent Rainardi @ 8:40 am

A few people contacted me with regard to the team setup for a BI/DW project. One piece of feedback I received was something like: I’m a PM, running a BI project. It’s MS BI end to end. DW/DM on SQL05, IS as ETL, RS reports, AS cubes. I have 3 people working on the project: 1 ETL developer, 1 report/cube developer and 1 data architect/modeller. How do I run the project to ensure that everybody is “loaded”, i.e. in the first month when the DW is being modelled/designed, the report/cube guy will not be doing anything, and ditto the ETL guy.

Another piece of feedback I received today was: my experience from my clients is that BI teams most often tend to be (way) too small for the task at hand. In the Kimball books they suggest a team setup with a multitude of roles spread over preferably as many people; when I compare that to my reality the gap is huge: most of the roles do exist, but they are spread over only a handful of people. I find it a bit hard to separate concerns too much, since everything within BI/DW is interconnected; architectural needs are often captured through report demands, two ends of the rope. What's a good team size and setup (of course this is related to the project size, but in general)?

I’ve written about this topic in chapter 3 (DW Development Methodology) of my book, and in the early part of chapter 4. But for now I’ll just focus on directly addressing the 2 situations above.

I’m a believer that we learn better from examples than from theory, so I will explain this team setup using an example. The project is very small: only 4 people on the project.

Team Members

In this project we have 4 people: 1 architect, 1 BA, 1 PM and 1 BI Developer. Because the project is very small, we want a hands-on architect. Do not use architects who can only do data modelling. Advertise for a “BI architect”, not a “data architect”; they are completely different. The architect must have DBA experience, i.e. they know how to partition a table, update statistics, etc. When selecting a PM, choose someone who has delivered at least 3 DWBI projects. And ask which parts of the DWBI project he delivered, because some PMs only did part of the project, e.g. just the reporting part. It is better to pay an experienced PM highly than an inexperienced PM cheaply. Do not use PMs who are not DWBI PMs. I know that a DWBI PM is difficult to get, but it is a must that you use a DWBI PM, not an “App Dev” PM or any other PM.

The BI Developer must be an all-rounder. Throw away people who can only do RS or IS: he or she must be able to do T-SQL, IS, RS and AS. I know they are more expensive, but you have no chance of succeeding if your developer is one-sided (i.e. just RS). And choose someone who has delivered at least 3 BI projects; do not use developers who have done fewer than 3. And ask which part of the BI project they did, i.e. ETL or reports. Pick a BA who really understands the business, ideally one who has been with the business for at least 3 years. The BA will be involved throughout the project, all the way to the end.

The goal of the project is to build a data warehouse on SQL Server 2008, with 10 reports and 3 cubes (TBD), in 6 months. Say it’s a utility company, selling gas & electricity to 3 million customers. The source is 1 business system used for sales, stock, bulk purchases, distribution, etc. They want to analyse sales figures, profitability, supplier performance and distribution performance. The architecture is 1 DDS (see chapter 2 of my book), with no ODS or NDS in the middle.

Break Into Small Parts

The key to managing this project effectively (in terms of its resources and time) is to break the project into several parts, based on functionality. The team first works together to deliver part 1 (say it’s sales analysis). In week 1, the BA & Architect sit down for a day or two, scoping the functionality. Then the architect spends 3 days doing data modelling (designing the DW table structure), just for part 1. @Architect: I would emphasise not worrying too much about getting it 100% right. Just take a first stab at it; you can refine it later. In week 2, the Architect & the BA spend 2 days specifying the mappings (which column in the source goes to which column in the DW). The architect then spends 2 days creating the DW tables. On the first day of week 2, the architect creates the stage tables, which mimic the structure of the source tables (see the sketch below). This way the BI Dev will not be idle in week 2: he/she needs to populate the stage tables. Please do not take the number of days literally, as they depend on the size of the work. You could easily double all the numbers, but the proportions should be more or less the same.
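
To picture what “stage tables mimicking the source” means, here is a minimal T-SQL sketch. The source system, table and column names (SourceServer, SalesOrder, etc.) are hypothetical, purely for illustration; the real extraction would of course be built in IS.

-- A stage table mirroring a hypothetical source table SalesOrder,
-- plus a load timestamp (assumes a "stg" schema already exists).
CREATE TABLE stg.SalesOrder
(
    OrderId       int           NOT NULL,
    CustomerId    int           NOT NULL,
    ProductCode   varchar(20)   NOT NULL,
    OrderDate     date          NOT NULL,
    Quantity      int           NOT NULL,
    Amount        decimal(18,2) NOT NULL,
    LoadTimestamp datetime2     NOT NULL DEFAULT SYSDATETIME()
);

-- The simplest possible population: truncate and reload.
-- SourceServer.SalesDB is a placeholder for the business system.
TRUNCATE TABLE stg.SalesOrder;
INSERT INTO stg.SalesOrder (OrderId, CustomerId, ProductCode, OrderDate, Quantity, Amount)
SELECT OrderId, CustomerId, ProductCode, OrderDate, Quantity, Amount
FROM   SourceServer.SalesDB.dbo.SalesOrder;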

Remember that at this point, what we are doing is Sales Analysis (SA). We do not touch customer profitability, supplier performance or distribution performance. Say the ETL takes 2 weeks (SA only), which is done by the BI Developer. During this time the Architect is designing the reports and the cube (SA only): starting with the UI (which is done together with the BA), then moving on to where the data should be sourced from, and then the SQL statements (rough SQL only; leave the details to the developer). One question that the BA should ask him/herself all the time is: is this what the business needs? And the one question that the Architect should ask him/herself all the time is: will it work?

Testing and Rhythm

The Architect needs to be able to guide the developer. If the Developer is stuck on a particular point (doesn’t know how to do it), it is the Architect’s job to show how to do it. If there are several approaches to how something can be built (say a view versus direct SQL), it is the Architect’s job to select the best approach. While the developer is busy building the reports and the cube, the Architect and the BA do two things: a) create a test plan for SA, and b) design the Customer Profitability (CP) ETL, reports & cube. The test plan will probably take half a week, whereas designing CP will probably take 1-1.5 weeks. By the time they are building CP, they will probably have got the hang of the rhythm (the repeated cycle of timing). And so on with part 3 (supplier performance) and part 4 (distribution performance).

How about testing? The BA writes the test plan. The BA (helped by the Architect) does the system testing & UAT. The architect prepares the DWBI system for testing. The developer does the unit testing.

It is my experience that a pair of Architect + BA will be able to feed 1 ETL developer and 1 Report/Cube developer quite adequately. This is the point where I find the balance: the workload is not too much or too little on either side.

Management

What does the PM do? The PM is ultimately responsible for the delivery of the project. Apart from managing the project (project plan, monitoring progress, providing adequate resources, budgeting), there are 2 main things that the PM must do: a) shield the team from outside pressures, and b) communicate with outside parties/stakeholders. Some people say that the PM role (in this team structure of 4-5 people) is a part-time job, and I tend to agree. The PM can take on 1 more project (or two) while running this BI project. Some PMs run 2 schedules: 1 for the external stakeholders and 1 for the team members. This does not work; within 3 weeks he or she will get caught. Have 1 schedule and be open. As with any project, the name of the game is trust. For the project to run well, you need to trust the other team members.

For a bigger project, the 3 pillars are still there: PM, BA, Architect. These are the people who drive the rest of the project team, from 3 different sides: 1 from the management side, 1 from the business side and 1 from the technical side. You can add 2 more developers and 1 more tester; as long as you break the work into smaller parts as above, and the 3 pillars stand, you will be able to run the project in balance and keep everyone occupied.

Hope this helps. As usual I welcome any questions and discussion at vrainardi@gmail.com or via comments. Vincent Rainardi, 16/1/2011

25 December 2010

Comparing Oracle 11g OLAP Option and SSAS

For my next DWBI project, I’m thinking about implementing Oracle 11g OLAP option instead of SSAS, so I’m comparing them.

  1. Browsability: users can use OBI (e.g. Answers), BO (since 9i AW/2003), Excel or other tools to browse Oracle OLAP. Users are not expected to use SQL to query the cube. In AS we now have Crescent to look forward to; in the meantime we can use Tableau (see Jen’s session at SQLBits 8), Strategy Companion, CubePlayer or Excel 2010.
  2. Oracle OLAP cubes are stored as a true multidimensional store (I’m interpreting this as compressed MOLAP), not as relational tables with materialized views (MVs). This MOLAP storage solves the classic ROLAP problem that MV aggregates can’t cover every permutation of every member of every attribute against other members of other attributes.
  3. Oracle mentions compression & cost-based aggregation (see page 7). SSAS is certainly compressed, but I’m not sure if SSAS aggregations are cost-based.
  4. Oracle OLAP has the ability to blend multidimensional and SQL models within the same query (see Bud’s comments below). This “blend” concept is similar to SQL Server’s BISM concept in Denali (2011). BISM can expose itself through a multidimensional interface (MDX) or a SQL/relational interface (DAX). The difference is that BISM is relational (tables) with an “in-memory column” engine (VertiPaq), whereas 11g’s OLAP is compressed MOLAP.
  5. Many-to-many relationships. If you are not familiar with M2M in SSAS, please read Marco’s paper about the M2M Revolution. This is critical for many business situations; can’t live without it. I have searched the whole library and couldn’t find M2M in 11g OLAP. Bud said we can do it (see below), and from what I understand here, this is done by simply mapping a base measure to the fact table containing the M2M relationship, just like the way we create any base measure. I’m not sure how we hide that measure though (because it is a dummy measure, only used to establish the M2M relationship between the 2 dimensions).
  6. Yes, the 11g OLAP Option can create calculated measures. But as we in the SSAS world know, a calculated measure such as YTD or growth is not a problem if it’s only 1 or 2 measures, but if we have 50 measures we use a date tool aka date calculation dimension (see Chris, Alberto & Marco’s most useful SSAS book here, chapter 6). It seems that we can’t do this in 11g OLAP. Surely we can’t create 50 * 10 calc measures? Well, we can, but what about the usability?
    Update 27/12: I have read Bud’s comment below, and I know we can create a “measure dim” and a “calculation dim”, but in SSAS we need to create a hidden measure and write some “MDX calculations” to establish how the growth, 2-yr growth, etc. will be calculated. I have read about creating calc measures in 11g here, but couldn’t find how that “MDX calculation” is done in Oracle OLAP. Could anybody point me in the right direction?
  7. Complex calculations/MDX, e.g. EXISTING (where the result depends on what’s on the slicer), hidden named sets used by a calc member, etc. How do we do that in 11g OLAP?
    Update 27/11: I have explored Simba’s MDX Provider for Oracle OLAP (link, link, link) and could not find comprehensive OLAP/MDX functions such as this. It seems that Simba’s product only provides connectivity, not the ability to do complex calculations. I would be grateful if somebody could enlighten me (a list of Simba’s MDX functions for Oracle 11g OLAP). I have also looked at Bizzantz’s DeltaMaster methods, but they are far below the richness of AS’s MDX, e.g. scoping, EXISTING, etc.
  8. MDX interface. The 11g OLAP documentation does mention an MDX interface to OBIEE, but I expect this is a basic “select … on columns, … on rows from cube”, without the ability to use MS’s MDX functions such as NECJ, ParallelPeriod, etc. See page 14.
    Update 27/11: same comment as point 7. Update 28/11: see OLAP DML Functions here.
  9. Measure expressions. These are required, for example, for currency conversion. If we don’t have measure expressions in 11g OLAP (and we don’t have M2M either), how do we do currency conversion?
    Update 27/11: we can do this in 11g OLAP using a “calculated measure” (see here); we just need to specify the correct “calculation type” and “calculation”. The functions are quite comprehensive with regard to period/time series computations, but outside periods I’m not sure. Update 28/11: found it. It’s called OLAP DML Functions, see here. It’s very comprehensive.
  10. Dimensional/cell security. In AS I sort of take it for granted that group1 can only see member1 whereas group2 can only see member2, along with giving users different default members based on role. But in 11g OLAP?
    Update 28/11: as Bud said, yes we can do dimensional security in Oracle OLAP, see “Selecting data by criteria” here.
  11. In AS we can control incremental cube processing, i.e. process only the changed partitions, etc. It is mentioned that in 11g OLAP this can be automated (see the end of page 28 re fast cube refresh), and that the cube can be updated incrementally by reading the log files (see the last paragraph of page 6). This is a good thing (compared to AS).

The Oracle OLAP blog is here. The wiki is here. The 11g OLAP user guide is here. The OLAP Option documentation is here.

If you find something incorrect above, I’d be grateful if you could let me know, either via comments or at vrainardi@gmail.com, and I will correct it. Thanks again to Bud Eggress for your corrections.
