Data Warehousing and Business Intelligence

7 February 2015

Data Scientist

Filed under: Business Intelligence,Other — Vincent Rainardi @ 8:57 am

Some people wonder what the term data scientist means. Put simply, a data scientist is an expert in:

  1. Statistical analysis
  2. Data mining
  3. Big data
  4. Data visualisation

In addition to the above technical skills, a data scientist usually has good business knowledge in one or two sectors, such as banking, insurance, investment, oil or biotech.

In this article, I will explain each of the 4 technical areas and then give an example.

Statistical Analysis

I consider the following statistical analysis skills to be important to have for a data scientist:

  1. Generalised Linear Model (Bernoulli distribution, Bayesian methods, Gaussian regressor)
  2. Cluster analysis (k-means, Expectation Maximisation, fuzzy clustering)
  3. Goodness of Fit Tests (Kolmogorov Smirnov, Pearson’s Chi2, Anderson-Darling)
  4. Bayesian Statistics (Naïve Bayes Classifier, Bayesian Network, Hidden Markov Models)
  5. Factor Analysis (Maximum Likelihood Extraction, Principal Component Extraction)
  6. Time Series, Matrix computations, Latent Dirichlet Allocation

For many candidates, this first subject (Statistical Analysis) is the biggest stumbling block. The topics above are advanced statistics; some of them are at PhD level. Very few people in IT know them because they are usually only taught at university in mathematics, physics or engineering degrees.
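To make a couple of the topics above concrete, here is a minimal Python sketch of cluster analysis (k-means) and a goodness-of-fit test (Kolmogorov-Smirnov). It assumes NumPy, scikit-learn and SciPy are installed; the data is synthetic and purely for illustration.

```python
# Minimal sketch: k-means clustering and a Kolmogorov-Smirnov goodness-of-fit test.
# Assumes NumPy, scikit-learn and SciPy are installed; the data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from scipy import stats

rng = np.random.default_rng(seed=42)

# Cluster analysis: 300 customers described by two features (e.g. spend, visits)
customers = np.vstack([
    rng.normal(loc=[20, 5],  scale=2, size=(150, 2)),   # low-spend segment
    rng.normal(loc=[80, 30], scale=5, size=(150, 2)),   # high-spend segment
])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster sizes:", np.bincount(kmeans.labels_))

# Goodness of fit: does a sample look normally distributed?
sample = rng.normal(loc=0, scale=1, size=500)
statistic, p_value = stats.kstest(sample, "norm")
print(f"KS statistic={statistic:.3f}, p-value={p_value:.3f}")
```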

As with anything else, Statistical Analysis requires the use of software tools. The most widely used Statistical Analysis tools are SPSS and SAS; these two are the de facto industry standard. Matlab has a special place: it is not as user-friendly or comprehensive as SPSS and SAS, but it can be programmed to do almost anything and is extremely flexible for any kind of statistical processing.

Statistical analysis also requires good programming skills, particularly in R and S. Knowledge of parallel processing and multithreading will also be useful. R is the de facto standard language for statistical computing, data analysis and data mining. R is derived from S, with lexical scoping semantics added to it, so S provides a good foundation but is practically no longer used. Some of the popular IDEs for writing R are RStudio, Rattle GUI and RWeka.

Data Mining

Data Mining is required for predictive analysis (forecasting the future) as well as descriptive analysis (explaining the past). For this you need to be able to build data mining models, train them and use them for forecasting. Data mining requires a strong mathematical foundation in techniques such as clustering, regression and neural networks. It also requires knowledge of specific tools such as SAS and Analysis Services.
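As a hedged illustration of that build-train-forecast cycle (not tied to SAS or Analysis Services), here is a minimal Python sketch using scikit-learn; the feature names, data and labels are invented.

```python
# Minimal sketch of the build / train / forecast cycle, using scikit-learn.
# The features, data and labels are invented purely for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical claim features: [claim_amount, days_since_policy_start, previous_claims]
X = rng.normal(size=(1000, 3))
# Hypothetical label: 1 = fraudulent claim, 0 = genuine (synthetic rule plus noise)
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)          # train the model
print("Hold-out accuracy:", model.score(X_test, y_test))    # evaluate it
print("Fraud probability for a new claim:",
      model.predict_proba(X_test[:1])[0, 1])                # use it to forecast
```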

Data mining requires knowledge of data warehousing and BI, because data mining can only uncover patterns actually present in the data. As the data is usually a very large set, it is commonly stored in a data warehouse and undergoes data warehousing processes such as data integration, denormalisation modelling, data quality improvement and data cleansing.

Data mining also requires business knowledge, for example of Customer Relationship Management, market basket analysis, credit card fraud, insurance claims, marine underwriting, credit risk or FX trading. Without business knowledge, it is impossible to create a good mining model.

Big Data

A data scientist needs to know about big data because, increasingly, more data is stored in big data architectures, i.e. Hadoop, HBase, MapReduce, EC2/S3. They do not need to know Pig, Oozie, Lucene, Flume or Sqoop in detail, but they need to have experience with the platform that the company uses, such as Hortonworks, Cloudera, BigInsights (IBM) or HDInsight (Microsoft). These platforms are fully equipped with all the tools you need to load and process data in Hadoop: data access layers, streaming, query languages, security, scheduling and governance, all rolled into an enterprise-ready platform.

A data scientist may not write complex MapReduce transformations in Pig Latin and extend the functions using Python or Ruby (that's for the programmer to do), but they do need to understand the basic concepts. For instance, how the Reduce step combines the output tuples from a Map into a smaller set of tuples, what keys and values are in Hadoop, why you need a Shuffle between a Map and a Reduce, and so on. Whether you use Hortonworks, BigInsights or HDInsight, the implementations differ between companies, but the core concepts are always the same.
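To illustrate those core concepts only (this is not a real Hadoop or Pig job), here is a tiny pure-Python word count that mimics the Map, Shuffle and Reduce steps with key/value pairs.

```python
# Tiny pure-Python word count that mimics the Map -> Shuffle -> Reduce flow.
# This is only a conceptual illustration, not a real Hadoop/Pig job.
from collections import defaultdict

documents = ["the cat sat on the mat", "the dog sat"]

# Map: each input record is turned into (key, value) pairs
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values belonging to the same key together
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce: combine each key's values into a smaller set of output tuples
reduced = {key: sum(values) for key, values in shuffled.items()}

print(reduced)   # e.g. {'the': 3, 'cat': 1, 'sat': 2, ...}
```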

Using a Big Data platform such as BigInsights enables data scientists to do data discovery and visualisation. It comes with advanced text analytics, machine learning analytics, large-scale indexing, adaptive MapReduce, compression, security and stream analytics. Without knowledge of such a platform, a data scientist limits their capability to process the data.

Data Visualisation

Data visualisation is the buzzword for BI tools. Tableau, QlikView and Spotfire are the most popular. IBM has five: Cognos Insight, Cognos Express, SPSS Analytic Catalyst, SPSS Visualisation Designer and Watson Analytics. SAP has Lumira. SAS has Visual Analytics and Visual Statistics. And there are tons of other tools: Dundas, iDashboards, Datazen, MicroStrategy, Birst, Roambi, Klipfolio, InetSoft, to name a few.

A good data scientist must have experience creating visualisations using one or two of the popular tools above. None of these tools is difficult to use, compared to programming in Python or Ruby for example, or even compared to Statistical Analysis. Within 2 weeks you will be able to grasp the basics, and within a month you would be able to use them fluently. They are very user-friendly, with highly visual GUIs (point and click, drag and drop, that sort of thing).

Example

One of the most famous data science teams is AIG's. As the Wall Street Journal reported (link): "AIG analyzes reams of data, from massive databases to handwritten notes from claim adjusters, to identify potentially fraudulent activity. It produces a list of around 100 claims to investigate, ranked highest priority to lowest. It also produces charts and other visualizations, like heat maps, that helps investigators understand why machine learning algorithms make the connections they do."

Jobs and Pay

I recently saw a job advert for a data scientist at £800 per day. Admittedly this is the highest I have ever seen. I believe the normal range is £400 to £600 per day, depending on the industry sector: banking pays the highest, insurance and pharma are probably second, and the public sector, including the NHS, the lowest.

  • ITJobsWatch (link) reported that the average salary for a permanent position is £57,500, whereas for a contract position it is £450/day.
  • Glassdoor (link) reported that the national average for a data scientist is £40,000.
  • PayScale (link) reported that the median is £39,691.

It is impossible to get all 4 skills in 1 person, so in many companies this role is carved up into two or three positions: a big data guy, a visualisation developer and a statistical analyst. In some companies, you don't need to know any statistics to be called a Data Scientist; you just need to know SQL. They call the SQL database guy a Data Scientist just to make the role more attractive and look good. Ten years ago the "SQL guy" used to be called a "report developer", but nowadays he could be (mis)labelled a "data scientist".

 

23 November 2014

SQL Server 2014 Installation

Filed under: Business Intelligence,SQL Server — Vincent Rainardi @ 4:42 pm

I’m installing SQL Server 2014 on a new laptop and it has not been an easy experience.

First of all, the Books Online is not included and when I tried to “Install Content from Online” I had the issue below (I include the solution).

Secondly, there is no BIDS! BIDS needs to be downloaded from this link, and installed into Visual Studio 2013. But my Visual Studio is 2015 (Preview), so I had to install VS 2013 first.

After installing VS 2013 and downloading SSDT BI (the new name for BIDS), when I installed SSDT BI it said that the architecture was incorrect: my SQL Server 2014 was 64-bit, and the SSDT BI download page said that it could do 32 or 64 bit. But in fact, the SSDT BI installation file was only for 32 bit (SSDTBI_x86_ENU.exe). There is no SSDT installation for 64 bit, because Visual Studio is only available in 32 bit.

The trick is, as “g2server” pointed out here: when installing SSDT BI, on the Installation Type we need to choose “new instance”, not “existing instance” (even though you already installed both SQL Server 2014 and VS 2013).

All sorted now. Hopefully with this article those of you who want to install the SQL Server 2014 documentation and SSDT BI won't have to spend as much time as I did.

Symptom: I opened the SQL Server 2014 documentation and Microsoft Help Viewer 1.1 opened up. I chose Install Content from Online, selected the content (SQL Server 2014 Books Online) and clicked the Update button. Help Library Manager said "Updating Local Library, Analyzing package delta" for 2 minutes, then displayed this error message: "An exception has occurred. See event log for details."

The Event log:

An error occurred while updating local content: Microsoft.Help.CacheLib.CacheLibBitsException: Retrieving the COM class factory for component with CLSID {4991D34B-80A1-4291-83B6-3328366B9097} failed due to the following error: 80080005 Server execution failed (Exception from HRESULT: 0x80080005 (CO_E_SERVER_EXEC_FAILURE)). —> System.Runtime.InteropServices.COMException: Retrieving the COM class factory for component with CLSID {4991D34B-80A1-4291-83B6-3328366B9097} failed due to the following error: 80080005 Server execution failed (Exception from HRESULT: 0x80080005 (CO_E_SERVER_EXEC_FAILURE)). at Microsoft.Help.CacheLib.Downloader.CreateManager() — End of inner exception stack trace — at Microsoft.Help.CacheLib.Downloader.CreateManager() at Microsoft.Help.CacheLib.Downloader.Jobs() at Microsoft.Help.CacheLib.Downloader.CancelAllPendingJobs() at Microsoft.Help.CacheLib.DocumentationCache.DownloadWebContent(IEnumerable`1 webDownloads, Int32 totalDownloads, Int32 completedDownloads, String workingDirectory, StatusCallback`1 statusCallback) at Microsoft.Help.CacheLib.DocumentationCache.Download(ChangeDescription change, StatusCallback`1 statusCallback) at Microsoft.Help.CacheLib.DocumentationCache.Update(VendorName vendorName, Boolean checkForOnlineUpdates, StatusCallback`1 statusCallback, CertificateCheckCallback certCheckCallback) at Microsoft.Help.CacheLib.CacheManager.<>c__DisplayClass24.<UpdateAsync>b__23() at Microsoft.Help.CacheLib.AsyncOperationRunner.Run(Object state)

Source: https://support.microsoft.com/kb/870655?wa=wsignin1.0

In the Registry, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems:

%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16

Added 512 like this: Windows SharedSection=1024,20480,768,512

Still the same error.

Solution: download the 2012 and 2014 documentation .msha files from:

SQL Server 2014: http://www.microsoft.com/en-gb/download/confirmation.aspx?id=42557

Product Documentation for Microsoft SQL Server 2014 for firewall and proxy restricted environments

SQL Server 2012: https://www.microsoft.com/en-us/download/confirmation.aspx?id=347

Product Documentation for Microsoft SQL Server 2012 for firewall and proxy restricted environments

15 February 2014

Building a Data Warehouse and BI on Low Budget

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 11:34 am

The core activities in building a data warehouse (DW) and BI are:

  1. Design the dimensional DW (fact and dimension tables); a minimal sketch follows this list
  2. Identify the data source for each column in the DW
  3. Build the ETL to populate the DW
  4. Build a cube on top of the DW
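As a rough illustration of step 1, here is a minimal star-schema sketch. The table and column names are invented, and SQLite is used only so the snippet runs anywhere; it is not a recommendation of SQLite as a DW platform.

```python
# Minimal star-schema sketch: one dimension table and one fact table.
# Table and column names are invented for illustration; SQLite is used
# only so the example is self-contained, not because it is a DW platform.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,   -- surrogate key
    customer_id    TEXT,                  -- natural/business key from the source
    customer_name  TEXT,
    segment        TEXT
);

CREATE TABLE fact_sales (
    date_key       INTEGER,
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    product_key    INTEGER,
    quantity       INTEGER,
    sales_amount   REAL
);
""")
conn.commit()
```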

A lot of companies want to know how to build DWBI on a small budget. A medium-sized DWBI system perhaps costs about £400k to build: £100k on hardware, £100k on software, £200k on labour. Many companies don't have £400k. They want to spend £25k on a POC, then another £75k to build the real one. Some even only have £50k.

Building a DW + BI with £400k

The £400k DW is as follows. The £100k hardware is probably something like this:

  • 4 DW servers 32 GB RAM (Dev, UAT, Staging/DR, Production), 6 cores. Blade or rack with cabinet. Dev & UAT could be 16 GB 4 cores, or VM.
  • 5-10 TB disk storage (1-2 TB per environment), RAID 5, 146 GB 15k rpm, e.g. NAS or SAN including enclosure.
  • 2 ETL servers 16 GB RAM, 4 cores (1 for DEV+UAT, 1 for Production + Staging/DR)
  • 2 Reporting & OLAP servers 16 GB RAM

The £100k software budget is to purchase ETL and BI software, such as Informatica, QlikView or Cognos, and database software, such as SQL Server or Oracle. It also covers the OS (e.g. Windows Server 2012), Spotlight and VMware, team collaboration tools (e.g. TFS), and other software such as JIRA, Beyond Compare, Visio, Office, MS Project, SnagIt and Red Gate.

The £200k labour budget is for paying 5-6 FTEs for 1 year: 1 DW architect & data modeller, 1 BA, 1 ETL Developer, 1 Report Developer, 1 tester, ½ PM, ½ Infra.

Building a DW and BI with £50k

To get this down to £50k we need to minimize on all 3 fronts. We need to use only 2 servers: one for Dev + UAT, one for Prod. Each server will be installed with everything: ETL, BI software and the RDBMS. We won't have DR. This could get the H/W and S/W cost down to £20k including storage and OS.

The remaining £30k is to pay 1 FTE for 1 year who does everything: requirements analysis, design, build, test and management; data models, creating databases and tables on the database server, stored procedures, ETL packages, reports and analytic cubes. It will be difficult to find this one person, because the variety of skills required probably won't exist in a single person.

With one server functioning as RDBMS, ETL, Reporting and Analytics, we can only have a small DWBI system: say 1 TB total net storage, which means the DW size is probably about 100 GB max, including staging tables.

We may be able to spend £3k on software, but not more than that. SQL Server is about £3k for the 2012 Standard edition for 10 users (link), and that includes the RDBMS, Reporting and Analytics. Oracle BI Standard Edition is $460 (about £345 including VAT, link) per named user, so 10 users is about £3,450. You would probably need to consider open source BI and databases such as MySQL, Jaspersoft and Pentaho.

I think it is possible to build a DW with £50k, but it has to be a small, simple DW, so that a) the small infrastructure can host it, and b) a single developer can build it within 1 year. Simple means type 1 dimensions and no accumulating snapshot fact tables (only transactional or periodic snapshot).

Building a DW and BI with £100k

With £100k we can definitely buy Microsoft BI (£3k for 10 users), or Oracle BI (£3.5k for 10 users), but not Cognos or Informatica. Cognos 10 BI for 10 users is about £30k (link). This is just Analysis Studio + Report Studio (assuming the same price) and 1 Admin pack, including VAT. BO Edge is about £27k (link). That’s for 5 users and 1 report developer. Informatica PowerCenter starts at $140k (link).

The allocation of £100k is probably something like: £30k hardware, £10k software, £60k labour. To me this sounds right for a small company. The £60k is for 2 developers for a year, with 1 slightly more senior (say a £25k/£35k split). With 2 developers a lot more can be built: the first version can be released after 6 months, leaving another 6 months for enhancing it to the users' needs.

The £30k hardware will buy us 2 servers with an ok-ish spec (something like 4 cores, 32 GB) and a good amount of storage, probably about 5 TB. We won't be able to afford a SAN, but we can afford RAID 5 on DAS.

Building a DW and BI with less than £50k

If the budget is below £50k, or the time frame is less than 6 months, I would recommend not building a proper data warehouse. We can still deliver BI without having a DW. One way is to use data virtualisation such as Composite. Another way is to use QlikView, Tableau or Spotfire, which read the data sources directly. In terms of cost, the latter is lower than the former.

You can attempt to build a DW (say in MS BI) but not a proper Kimball DW. I call this a Pseudo Dimensional DW: it has fact and dim tables, but no surrogate keys; the dimension tables have their natural keys as their PKs. This is a lot simpler and faster to build than a proper Kimball DW. For a small scale (5 dims + 2 facts), instead of taking 3 months, a Pseudo DW could be done in 1 month.
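A hedged sketch of the difference (invented names again): in a Pseudo Dimensional DW the dimension's primary key is the natural key itself, and the fact table references it directly.

```python
# Pseudo Dimensional DW sketch: the dimension's PK is the natural key,
# and the fact table references that natural key directly (no surrogate keys).
# Names are invented; SQLite is used only to keep the sketch runnable.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_id    TEXT PRIMARY KEY,      -- natural key from the source system
    customer_name  TEXT,
    segment        TEXT
);

CREATE TABLE fact_sales (
    date_key       INTEGER,
    customer_id    TEXT REFERENCES dim_customer(customer_id),
    quantity       INTEGER,
    sales_amount   REAL
);
""")
```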

8 November 2013

The 5 Stages of DWBI Journey

Filed under: Business Intelligence — Vincent Rainardi @ 7:39 am

During my years working in MS BI, I have seen that companies' journeys in DWBI, particularly in MS BI, can be grouped into a few stages. I'm going to write up that view, which I hope will be useful.

Stage 1: Report Developer (Cost: £25k/year. Duration: 2-3 years)

In most companies, the "BI Team" usually started with just 1 report developer. The business asked IT to produce a report from "the database" (i.e. the back-end database of the business application), and this report developer sat down with the business to "define" the requirements and write the report, say using SSRS, Jasper or Crystal. At this time they were not called BI. The word "BI" was never heard anywhere in the company. It was "Reports", not BI, not even MI. There were probably 5 people in IT (out of 100 employees, annual revenue of £5m) and 90% of the IT work was desktop support, i.e. Exchange/email, file server, network, SQL Server. This report business was only a part-time job. Then the business asked for a second report. And then a third. And the report developer became busier and busier. If the first report took him 3 weeks, the 5th report only took him 3 days. He became conversant with the tool, and with the data structure in the database (of the main business application, e.g. EPOS). No formal report specification was written, just rough sketches of the report layout when sitting down with the business users.

How much? Annual spending at this stage was probably in the region of £20-25k: £14k for half an FTE (Full Time Employee) for the junior developer (because he's doing desktop support too), plus £6k for training, PC, office, expenses and pension. No cost for software. For SQL Server shops, RS came with SQL. For MySQL shops, Jasper is free. Maybe £500 for Crystal, for the Oracle shops. I don't like the perception that RS is free, so let's calculate it. Say RS is 25% of SQL Server (the other 75% is the relational DB, IS and AS), and SQL Server is say £15k (server hardware + licences). So RS is £4k (25% of £15k), and the annual spending is 14k + 6k + 4k = £24k. In the US the numbers are probably the same, but in dollars, i.e. $24k. How long? A typical company would probably spend about 2-3 years at this stage. It depends on the business growth of course, and the industry sector: a law firm would be quite different from an asset management firm. The former is not IT intensive whereas the latter is.

Stage 2: More Reports (Cost: £150k/year. Duration: 1-2 years)

As the business grew, the business users would request more and more reports from IT. This is especially true if the business application doesn't have a good reporting tool. If it can only report at a transaction-by-transaction level, or customer-by-customer level, then the pressure to have reports developed by IT is bigger. If the business application doesn't have the capability to report things at the company level, or provide "time series" information which shows how the company progresses from week to week and month to month, then the pressure to create reports is bigger still. The number of reports produced by the IT department would probably exceed 50 now, probably even 80. Many divisions are now requesting reports: not just the sales department, but also stock management and finance. And IT needs its own reports too, out of the helpdesk calls. Report development is no longer a part-time job; it is a full-time job now, and has probably even grown from 0.5 person to 2 people. The number of IT staff has grown from 5 to 15. The company size is probably about 200 employees, or even 400, depending on the industry sector. The annual revenue is probably £30m now, with a total IT spend of £0.5m, again depending on the industry. They don't just use 1 business application, but probably 4-5 applications, most of them off-the-shelf. The maintenance of these apps is outsourced. They have 1 or 2 full-time report developers, and probably 1 or 2 full-time application developers, most probably Web-.NET-SQL developers, or PHP-Java-PL/SQL.

I'm going to give an example in MS BI now, but it's probably a similar situation in other tech stacks. The report developer is now called an MI Report Developer. He/she uses SSRS to develop the reports and hosts them in SharePoint. There are a few reports which are scheduled and emailed out. Reports provide value to the business. The quantity of the reports and the workload are both manageable. The business loves the IT department because IT can get the information out of these 4-5 business applications and deliver this valuable information to the business. It takes 2-3 days to produce a new report, but sometimes 2 weeks because there is a queue. Prioritisation works quite well. Some of the reports are beginning to go unused, particularly those which were developed 2 years ago; they are now superseded by newer reports. So there is an initiative to archive reports which are not used, and to monitor the usage of the reports. The IT manager wants to make sure that the reports are worth the time invested to create them. This usage report is a worthwhile tool for the IT manager to put forward good arguments to those business users who ordered reports but are not using them actively.

How much? The cost of MI is probably about £150k now. The salary of 2 report developers = 2 x £28k = £56k. Maybe 3 developers = £84k. Plus 3 x £6k for expenses & overheads (PC, office, training, pension) = 18k + 84k = £102k. Now IT has a dedicated SQL Server for MI (£15k for H/W + S/W), probably 2 servers (£30k), plus some SharePoint licences to host the RS reports (say £10k), plus perhaps £5k in consultancy fees to set up/configure RS/SharePoint. So the annual spending is 102k + 30k + 10k + 5k = £147k. How long? This stage is surprisingly fast: probably only 1-2 years, at which point the company will experience stage 3: too many reports.

Stage 3: Too Many Reports – Self Service BI (Cost: £150k/year + £50k one-off. Duration 1 year)

By the time the report development workload reaches 3 FTEs (Full Time Employees), or £150k per year, the IT manager becomes very interested in the idea of Self Service BI. The idea of creating a data warehouse and cubes on top of it becomes a good business proposition from the cost point of view. The users will be able to use Excel to query the cubes. They may consider the idea of buying BI software like BO, Cognos, MicroStrategy or QlikView, but the cost probably becomes a barrier. It's incredibly difficult to get the prices (if you are not buying) but I'll take a stab, otherwise we won't have an idea about the cost. If I'm way off, I'll be glad if the corresponding vendors can correct me. QlikView is probably about £150k for 100 users, i.e. £1,500 per named user including document licences. Plus £50k for live & test server licences. Plus £10k for the SharePoint component + workbench. Plus £10k for server/infra cost. Plus £10k for training and £20k for consulting services. So a total of 150k + 50k + 10k + 10k + 10k + 20k = £250k one-off cost, plus 20% annual cost for support (£50k). BO is £35k per 25 CALs, x4 for 100 users = £140k. Plus 3 x £400 = £1,200 for developer licences. Say £50k for training, consultants and servers. So £200k one-off, plus probably 20% = £40k annually. Cognos and MicroStrategy are probably similar, i.e. in the region of £200-250k one-off cost, plus £40-50k annually, for 100 business users. So it's a no-goer if the Stage 2 cost is £150k/year, because the developer costs won't go away; they will increase, as they have to hire a QlikView/BO/Cognos developer or retrain existing staff.

Hence the idea of doing self service using a data warehouse + SSAS cubes starts to take off, in the head of the IT manager. The numbers look good: SSAS is free. At least that is the initial perception. SharePoint is already used for SSRS, and Excel is already deployed to every desktop, so no additional cost then! And this data warehousing stuff will only cost 1 new server with big storage, so £20k. Say 2 new servers, plus storage = £40k. Much more palatable than the £250k of the popular off-the-shelf BI software above. There is a training cost, i.e. SSAS, SSIS, modelling, say £10k. This initiative also double-acts as a "retainer", from an HR point of view. Good people stay because they can develop their skills, getting them ready to go to the next level in their career. And the whole DW/OLAP thing is a new world for them. Of course they'd stay. Eagerly. At this point they learn about dimensional modelling, about ETL, and about data quality. They learn about metadata, and about the technicalities of building cubes. After struggling for about 6 months, in their spare time, they manage to construct 1-2 fact tables and a few dimensions in the data warehouse, and build a cube on top of it. The demo is a great success; the business absolutely loves it. I've attended a "first business demo" (on SSAS) about 8 times now, and I haven't experienced a single event where the business was not excited seeing their numbers being sliced and diced in front of them, turned into time series charts and gadgets.

Now, with all the data being delivered to them via the cubes, the business can do self-service BI. What they loved in the first few weeks was that they didn't need to wait 3 weeks for IT to create a report for them; they could get the information they needed directly, themselves. And what the IT manager loved in the first few months was that the number of report requests went down a bit. The number of reports each quarter was something like: 100, 120, 150, 200, 160. They managed to knock some reports off the list (in this example 40) because the information needs could be satisfied from the cubes, i.e. self-service BI.

At this point they would also consider 2 kinds of software which help the development of the data warehouse: a one-click warehouse builder (such as RED or Kalido), which can build a data warehouse very quickly, and data virtualisation software (such as Composite), which can deliver BI without the pain of building a data warehouse. So they try out 1 or 2 of these products, hoping they could lighten their workload. A few companies might switch and abandon MS BI, but the majority would stay with MS BI and only use tools that complement it.

How much and how long? The one-off cost is about £50k: £40k for servers and £10k for training. The annual spend is the same as Stage 2, £150k/year (£100k for 3 people, £50k for infra). This is for 100 business users. How long? Usually companies stay at this stage only for a short period, i.e. 1 year. This is the period when the initial data warehouse is constructed and the cubes developed. After the initial version is deployed, the IT department will have many requests to put more and more data into the DW and cubes. And that's Stage 4.

Stage 4: Growth of DW and Cubes (Cost: £200k/year. Duration: 2-3 years)

There were many requests to put various attributes and measures into the cubes and warehouse. Can you put the Sales Person into the cube please? Oh, and the supplier please. And the store manager names please? There will be a change of product grouping next month, can you accommodate that please? We have a new price list coming this quarter, that needs to go into the warehouse. Can we have the employee hierarchy in the cube? You get the picture. Lots of changes and new requests to add more data into the warehouse and the cubes.

The danger is, the "BI Team" is still new to DW. They just learned yesterday what a dimension is, what a fact table is, and what grain is. They are inexperienced in data modelling, and if they get it wrong, it costs a lot of working days to correct it. Most probably weeks, not days. They read Kimball's data modelling book and expect to see a design that suits their case, but there are a lot of specific points which are not mentioned in the book. They try to browse the internet for information about data modelling, but get even more confused. And they need even more guidance on the ETL side. They are clueless about how to do SCD type 2 (the basic idea is sketched below). They hope that the SSIS SCD2 task will solve it for them, and it does to some degree, but alas, the population of the fact tables now becomes an issue. Plus performance is an issue, and metadata is an issue. At this point, they seriously think about hiring a data architect: someone who has been designing and building data warehouses and cubes in different companies, someone who knows how the ETL should be built. So they hire a data architect, say £50k/year. He knows about data modelling, ETL and SSAS. And they hire an ETL developer too: an SSIS + T-SQL guy who has some exposure to cubes. This guy has a lot of experience moving data from one place to another: data integration experience. One of the 3 report developers now becomes an ETL developer. They are now called a BI Team, with 5 people: 2 report developers, 2 ETL developers, and 1 data architect who also does SSAS.
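For completeness, here is a minimal, tool-agnostic sketch of the SCD type 2 idea in Python (the column names are invented): when a tracked attribute changes, the current dimension row is expired and a new row is inserted as the current version.

```python
# Minimal SCD type 2 sketch (tool-agnostic, invented column names):
# when a tracked attribute changes, expire the current dimension row
# and insert a new row that becomes the current version.
from datetime import date

dim_customer = [
    {"customer_id": "C1", "segment": "Retail", "valid_from": date(2012, 1, 1),
     "valid_to": date(9999, 12, 31), "is_current": True},
]

def apply_scd2(dim_rows, customer_id, new_segment, load_date):
    current = next(r for r in dim_rows
                   if r["customer_id"] == customer_id and r["is_current"])
    if current["segment"] == new_segment:
        return  # nothing changed, nothing to do
    # Expire the old version...
    current["valid_to"] = load_date
    current["is_current"] = False
    # ...and insert the new version
    dim_rows.append({"customer_id": customer_id, "segment": new_segment,
                     "valid_from": load_date, "valid_to": date(9999, 12, 31),
                     "is_current": True})

apply_scd2(dim_customer, "C1", "Corporate", date(2013, 6, 1))
print(dim_customer)   # two rows: the expired version and the current one
```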

Some of the SSRS reports are now decommissioned. They manage to get the reports down, say from 200 to 100. Some of these reports are re-pointed to the data warehouse: instead of reading from a copy of the business application database, the SSRS reports read from the data warehouse tables. Initially the DW needs to be reconciled with a copy of the business application (of course!), but as time moves on, there are measures created in the warehouse which don't exist in the live app, for example ranking measures such as deciles and quartiles, and various ratios between original measures such as financial ratios.

Business users enjoy the benefit of having a lot of data at their fingertips. They can have the information instantly, without waiting 3 weeks like before for reports to be created by the report writer.

How much and how long? Surprisingly, instead of decreasing, the cost of BI is increasing. It's now 5 people instead of 3, so 4 x £28k + 1 x £50k = £162k. Plus £40k of annual infra cost such as additional capacity, replacement servers, etc. So about £200k/year. How long? A typical period is 2-3 years, after which the business users become dependent on BI. The data coming from the DW via cubes and reports really drives the business forward.

Stage 5: Users Become Dependent on BI (Cost: £50k one off + £10k/year. Duration: 3-4 years)

One by one, each business area uses the new technology (DW & cubes): the sales department, manufacturing, finance, marketing, even HR. In the insurance industry: underwriting, claims, reinsurance, actuarial, finance. In the utility industry: sales, customer service, operations, finance. The information from BI affects the business decisions in each area: premiums at renewal, pricing policies, CRM strategy, product design, marketing strategy, production cost control, etc. There are even more requests to put more and more data into the warehouse, cubes and reports.

Companies at stage 5 usually look for BI tools: visualisation tools, and data warehouse tools such as DQ and ETL. They want to make the lives of their users easier with BI and visualisation tools, and their own lives (IT) easier using various development tools. As the development work grows, they need to tidy up governance, version control & code repository, request & bug management, release management and data quality. Software like Red Gate, SQL Sentry, JIRA and TFS would come into their consideration. Red Gate and SQL Sentry are SQL Server development (and server monitoring) tools. JIRA is a request and bug management tool. TFS is a code repository and version control tool; it also does task management such as Sprint planning.

With regards to data visualisation, software like Cube Player, Power BI, Strategy Companion, Panorama, XLCubed, Pyramid, Varigence Vivid, BonaVista Dimensions and Tableau would come into the picture of any MS BI shop. Out of the 15-20 SSAS clients out there, most companies only know 1 or 2. Probably 30 actually (SSAS clients, I mean); the complete list is at Vidas Matelis' ssas-info website, the most content-intensive website on SSAS. At this stage, companies try out these SSAS clients, and eventually get hooked on one of them. I can't recommend which one (even though I have a personal preference); it would probably be good to write about SSAS clients actually. I just browsed for comparisons and very little has been written; the most notable ones are here, here and here.

Apart from SSAS clients, companies also frequently think about reporting tools, i.e. what's better than RS? If their budget is big enough, they will probably be approached by (or think about) QlikView and Spotfire. If their budget is minimal they probably think about Jasper and Pentaho. QlikView, Spotfire, Hana, TM1 and PowerPivot are all "in-memory OLAP". Unlike SSAS, which stores the multidimensional structure on disk, these tools read relational tables and create the multidimensional structure in memory, much like MicroStrategy. TM1 is not quite like that: it stores its proprietary multidimensional structure as files. Ditto QlikView.

How much and how long? It depends on the tools they select and the industry they are in. For retail, telco, shipping, manufacturing, legal, mining, transport and utilities, it is about 2-3 years. For banks, fund managers, healthcare and insurance, probably 3-4 years, could be 5. For costing, let's use Tableau. The cost of implementing Tableau on top of DW + SSAS is surprisingly low: £50k one off + £10k/year. Developers use Desktop Pro, which is $1,999 + $199/year maintenance (source: link). Business users become report creators; they use Tableau Online licences, which are $500/year each. And report readers are free. So if we have 3 developers and 15 report creators, the cost is about (assuming the price is the same in the UK but in £): 3 x £2k = £6k one off, plus an annual cost of 3 x £200 + (15+3) x £500 = £9,600/year. Plus £40k for training, servers and consultants, it's around £50k one-off project cost and £10k/year ongoing cost.

After

What's after stage 5? BI is stable, mature and well managed. Operations are governed. What usually happens at this stage is a complete change of technology, for example an implementation of Cognos or Business Objects: yes, replacing the whole Microsoft BI stack with the Cognos or BO stack, leaving just the SQL Server DB engine, which could be replaced with Oracle too. The other approach is to stay on the Microsoft BI path, and look at a few things:

a. Parallel Database

For high-volume industries such as retail and telecom, as the business grows, the number of transactions per day could be in the millions. This requires some kind of MPP (Massively Parallel Processing), meaning a parallel database engine: Netezza, PDW (Parallel Data Warehouse, Microsoft's MPP), Exadata, Greenplum or Teradata. Or a column-oriented database such as Vertica or Sybase IQ.

b. Data mining
For certain industries, CRM and marketing are vital, particularly in the retail sector. They could look into integrating CRM and marketing campaign management with their data warehouse. This requires data mining and forecasting, which can be done in SSAS, but some companies use SAS, SPSS or BIRT, or even do it manually using .NET or R.

c. Big Data

This is where most companies after stage 5 are going, or at least trying to go: MapReduce, Hadoop, MongoDB, Hive, and things like that. The project cost is not high, because it runs on commodity hardware and the software is mostly free, but the benefit is potentially big. Hence the motivation.

d. Text Mining

For certain industries such as healthcare, insurance and retail, the proposition of being able to crawl over millions of patent documents, email correspondence with customers, and thousands of policies has good business value. Software like Clarabridge, Angoss, Attensity and NetOwl can be implemented for this purpose (mining text within documents or emails for information).

For IT managers, I hope this helps broaden the DWBI horizon, whatever stage you are in. For vendors, I'd be grateful if you could advise if I'm wrong, particularly about the costs. For developers, I hope this article clarifies a bit the future of your career path, if you stay within the DWBI world. There are many experts out there and I'm not even an expert, so I would be grateful for your comments.

VR, 9/11/14:

  • Andrei Padrei on QlikView and Tableau prices: link
  • Tableau prices: link

14 July 2012

What is Big Data, Data Warehouse, Data Mining

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:35 am

Michael Koploy from Software Advice gathered the opinions of 5 BI experts about the definitions of 3 BI terms that puzzle many people:

  • What is Big Data
  • What is a Data Warehouse
  • What is Data Mining

The article is here. The 5 experts are:

My definitions:

What is Big Data? (from the Hadoop in Data Warehousing article on my blog)

What is big data? Again, it is easier to learn by example than by definition. Big data is images/photos/video/music streams (and files too), unstructured text (documents, emails), website log files, and outputs (streams) from scientific/industrial instruments and sensors such as thermometers and digital pneumatic/pressure gauges.

Big data does not have to be big (peta/exabytes). Even 50 GB can be called big data if the structure is too complex for a normal RDBMS to store.

What is small data? Small data is simple data structures, e.g. numbers (be it monetary, integers, fractions or floating points), strings (names, description, types), dates, times, and all the data we used to know in the last 30 years of data warehousing history.

A large collection of small data is big data. Again, I have to put a number here so you have an idea of how big "big" is; I don't like being unclear. A collection of simple data structures of more than 1 petabyte is big data.

Bear in mind that it is now 2012. Petabyte (PB) is a 2012 term; 5 years ago, in 2007, it was terabytes (TB). So in 2014 the definition will change to: a collection of simple data structures of more than 1 exabyte is big data. And in 2015 the definition will change to "more than 1 zettabyte".

In a few years' time, these seemingly foreign words will be common words in IT: petabytes, exabytes, zettabytes, yottabytes. Here is the Wikipedia page for their definitions: link.

What is a Data Warehouse?

From my book (Amazon.com, Amazon.co.uk, Apress, my blog):

A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Below is a diagram of a data warehouse system: (from my book)

Below is a short explanation about each component in the above diagram, and how the components work together in a data warehouse system: (again, this is from my book)

The source systems are the OLTP systems that contain the data you want to load into the data warehouse. Online Transaction Processing (OLTP) is a system whose main purpose is to capture and store the business transactions. The source systems’ data is examined using a data profiler to understand the characteristics of the data. A data profiler is a tool that has the capability to analyse data, such as finding out how many rows are in each table, how many rows contain NULL values, and so on.

The extract, transform, and load (ETL) system then brings data from various source systems into a staging area. ETL is a system that has the capability to connect to the source systems, read the data, transform the data, and load it into a target system (the target system doesn’t have to be a data warehouse). The ETL system then integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The reason for getting the data from the source system into the DDS and then querying the DDS instead of querying the source system directly is that in a DDS the data is arranged in a dimensional format that is more suitable for analysis. The second reason is because a DDS contains integrated data from several source systems.

When the ETL system loads the data into the DDS, the data quality rules do various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems. Bad data can also be automatically corrected or tolerated if it is within a certain limit. The ETL system is managed and orchestrated by the control system, based on the sequence, rules, and logic stored in the metadata. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules, and other information about the data.
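A minimal sketch of that flow, with invented rows and an invented data quality rule: rows that fail the rule are routed to the DQ store instead of the DDS.

```python
# Minimal ETL + data quality sketch: extract rows, apply a DQ rule,
# load good rows into the DDS and route bad rows to the DQ store.
# The source rows and the rule are invented for illustration.

source_rows = [
    {"order_id": 1, "customer": "Alice", "amount": 120.0},
    {"order_id": 2, "customer": "Bob",   "amount": -50.0},   # fails the DQ rule
]

dds_rows, dq_rows = [], []

def dq_rule(row):
    """Data quality rule: order amounts must not be negative."""
    return row["amount"] >= 0

for row in source_rows:                                     # extract
    row = {**row, "amount_gbp": round(row["amount"], 2)}    # transform
    if dq_rule(row):
        dds_rows.append(row)    # load into the dimensional data store
    else:
        dq_rows.append(row)     # report, then correct in the source system

print("DDS:", dds_rows)
print("DQ :", dq_rows)
```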

The audit system logs the system operations and usage into the metadata database. The audit system is part of the ETL system that monitors the operational activities of the ETL processes and logs their operational statistics. It is used for understanding what happened during the ETL process.

Users use various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to retrieve and analyse the data in a DDS. Some applications operate on a multidimensional database format. For these applications, the data in the DDS is loaded into multidimensional databases (MDBs), which are also known as cubes. A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened.
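A toy Python illustration of that idea (the dimension values are invented): each cell holds a measure, and its position is given by a tuple of dimension values.

```python
# Toy illustration of a multidimensional database: each cell holds a measure
# and is addressed by a tuple of dimension values (month, store, product).
cube = {
    ("2012-06", "London", "Books"): 1500.0,
    ("2012-06", "Paris",  "Music"):  820.0,
    ("2012-07", "London", "Books"): 1725.0,
}

# "Slicing" the cube: total sales for London across all months and products
london_sales = sum(value for (month, store, product), value in cube.items()
                   if store == "London")
print(london_sales)   # 3225.0
```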

Tools such as analytics applications, data mining, scorecards, dashboards, multidimensional reporting tools, and other BI tools can retrieve data interactively from multidimensional databases. They retrieve the data to produce various features and results on the front-end screens that enable the users to get a deeper understanding about their businesses. An example of an analytic application is to analyse the sales by time, customer, and product. The users can analyse the revenue and cost for a certain month, region, and product type.

What is Data Mining? (from my book)

Data mining is the process of exploring data to find the patterns and relationships that describe the data and to predict the unknown or future values of the data. The key value of data mining is the ability to understand why some things happened in the past and the ability to predict what will happen in the future. To refer to predicting the future with regard to data mining, some people use the term forecasting, and some call it predictive analytics. On the other hand, when data mining is used to explain the current or past situation, it is called descriptive modeling, descriptive analytics, or knowledge discovery.

Implementing data mining in the business is growing by the day, both through descriptive and predictive analytics. Using data mining, we can find the correlation between purchase patterns and customer demographics. For example, in our Amadeus Entertainment case study, we can find whether there is a relation between the sales of a music product type and the customer interest or occupation. In the financial services industry, data mining is traditionally used for credit scoring, for example, to predict the ability and willingness to pay the loan based on (each country has different factors) payment behavior, past delinquency, current debt, credit history, number of inquiries, address, bankruptcies filed, and customer demography.

In business intelligence, popular applications of data mining are for fraud detection (credit card industry), forecasting and budgeting (finance), cellular/mobile package development by analyzing call patterns (telecommunication industry), market basket analysis (retail industry), customer risk profiling (insurance industry), usage monitoring (energy and utilities), and machine service times (manufacturing industry).

11 July 2012

Business Objects Voyager

Filed under: BI Tools,Business Intelligence — Vincent Rainardi @ 1:00 am

What is it?

BO Voyager is a tool for "browsing" cubes. It can read 3 kinds of OLAP cubes: SSAS, Hyperion Essbase and SAP BW (link, page 52). DB2 OLAP (8.2) uses the same provider as Essbase 7, hence Voyager can open DB2 OLAP too. Oracle OLAP can only be accessed using the Simba MDX provider (link), not natively. SAP BW (or BIW) can only be accessed using the BO Integration Kit for SAP.

The formal name of Voyager today (10/7/12) is “SAP BusinessObjects Analysis, edition for OLAP” (link). The user guide is here, the admin guide is here, the product sheet is here.

There are 2 versions of "BO Analysis"; the other one is the "edition for Microsoft Office", which is an Excel plugin (link). This is the new name for Business Explorer Analyzer (link).

This is how Voyager looks (link, page 51):

What is it for?

Voyager has been used for many years (probably more than 7) to browse SSAS cubes. Now that Excel can browse SSAS cubes, the use of Voyager for this purpose is questionable. Tableau and XLCubed have better capabilities than Voyager; we can't even create a hyperlink in Voyager.

Unlike Tableau, Excel or XLCubed, however, Voyager can browse Hyperion/Essbase and SAP BW as well. Small companies usually only have 1 OLAP platform, but large corporations like global banks usually have multiple OLAP platforms, and this is where Voyager comes in.

Apart from “browsing”/querying the cube and creating charts from it, we can use Voyager to do the following:

  • Highlight exceptions, e.g. make the cell red (or any colour) if the number is bigger or smaller than a certain limit.
  • Sort a column or a row, i.e. ascending, descending or by hierarchy.
  • Filter a column or a row, i.e. display only the top 10, or only rows where revenue exceeds a certain threshold.
  • Do calculations, i.e. sum, sumproduct, +, -, /, x, % of total.
  • Useful OLAP functions: prior period, period to date, moving average, rank.
  • Link to a BO report (but not to a URL).

Is it good?

No. In my opinion, Voyager is a last-generation product; it feels like ProClarity to me. It needs a lot of investment to beat modern BI products such as Tableau, XLCubed, QlikView, Roambi and Spotfire.

It provides basic functionality to explore Essbase & SSAS cubes, but nothing more. If this were 2006, Voyager would have a chance. But in 2012, when BI tools have grown strong through competition, there's no chance.

Prominent areas for improvements:

  1. Port to true thin client i.e. HTML5/Javascript
  2. Visualisation e.g. chart, dashboard
  3. Interactive ability, i.e. dashboard, drilldown, map, alert
  4. More integration with SSAS & Essbase, e.g. displaying description of measures & attributes, drilldown, KPI
  5. Scripting to support queries
  6. Automation and scheduling
  7. Mobile platform i.e. iPad, iPhone, Galaxy, Android, BB
  8. Cloud interface (S/I/PaaS), stream data
  9. Big data integration, i.e. Hadoop
  10. Silverlight integration
  11. Data mashup integration, e.g. Composite

SAP is a giant corporation with good R&D. I admire Hana, which is a modern (even futuristic) platform, i.e. an in-memory database & analytics engine. I wrote an article about Hana (link). I also admire BO Webi, which in my opinion is the best in its class, i.e. dynamic reporting; its functionality is a lot richer than SSRS Report Builder's.

But I don't admire Voyager. I believe SAP has the capacity to improve it, which is why I took the time to write the 11 points above. It needs good investment to match today's standards. SAP is one of the top 5 giants in BI, along with Oracle, Microsoft, SAS and IBM (link). SAP has the financial & technical capability to refurbish Voyager into one of the best analytic tools.

7 July 2012

SAP Hana, an in-Memory Database

Filed under: BI Tools,Business Intelligence,Data Warehousing — Vincent Rainardi @ 7:19 am

What is it?

Hana is an in-memory database from SAP. It enables us to do lightning-fast OLAP analytics on relational databases. We can access Hana data using SQL or MDX, or from BO, Cognos, Report Builder and PowerPivot. We can create calculated attributes and measures. We can create our own data models and hierarchies.

A must-read technical introduction from SAP is here, the official website home page is here, the technical operations manual is here, the data modelling manual is here (very bad, no screenshots at all), and the Wikipedia page is here.

HANA stands for HAsso's New Architecture; Hasso Plattner is a co-founder of SAP.
HANA also stands for High performance ANalytic Appliance.
Hana is also simply a name, rather than an acronym.

Actually Hana is not just an in-memory database. It is also an analytics appliance.
Cisco provides a Hana appliance on Intel Xeon 7500 (link).
HP provides a Hana appliance on ProLiant DL580 and DL980 (link).
IBM provides a Hana appliance on x3690 X5 and x3950 X5 (link).
Violin provides a Hana appliance on Intel Xeon 5600 (link).

HANA is also a cloud application. It is a PaaS, Platform as a Service. EMC and Cisco provide a Hana cloud app on Cisco UCS and EMC VNX 5300 (link).

Hana was released in Nov 2010. Version 3 was Nov 2011, and version 4 will be in Q2 2012.

How does it work?

Hana has a parallel calculation engine. Hana takes in a SQL query and runs it in parallel across many partitioned data sections in memory, and across many multicore processors.

Hana also has business functions to speed up calculations, for example currency conversion and calendar functions, as well as a predictive analysis library.

The Hana database is both row-based and column-based. If a calculation involves only a few columns, those data columns are stored in a compressed columnar store in memory. Hana also supports temporal tables, which speed up queries on historical data.

The key to Hana speed is the first sentence I wrote above: in-memory database. Memory is 1000 times faster than disk, and these days we can have a server with 2 TB of memory.

But that is the story of the past. Everybody now stores data in memory. What Hana has that the crowd doesn't is the technique of loading data from memory into CPU cache: partitioning a query and its data, and loading them into different CPU caches so that each CPU can process them in parallel. That's what differentiates Hana from the others.

This is how Hana works: (from the must read technical introduction that I mentioned earlier, link, page 10)

Hana does parallel SQL execution and parallel aggregation. This parallel execution and aggregation happens on multiple threads. Each thread has a hash table which stores its aggregation results. These hash tables are then merged by merger threads, as we can see below (from the technical introduction doc, link, page 12):
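A rough sketch of that pattern in Python (this only illustrates per-worker hash tables being merged; it is not SAP code):

```python
# Rough sketch of parallel aggregation with per-worker hash tables, then a merge.
# This only illustrates the pattern described above; it is not SAP Hana code.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Invented fact rows: (region, sales_amount)
rows = [("UK", 10), ("DE", 20), ("UK", 5), ("FR", 7), ("DE", 3), ("UK", 2)]

def aggregate(partition):
    """Each worker builds its own hash table of partial sums."""
    partial = Counter()
    for region, amount in partition:
        partial[region] += amount
    return partial

# Partition the data, aggregate each partition in parallel, then merge
partitions = [rows[0::2], rows[1::2]]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(aggregate, partitions))

merged = Counter()
for partial in partials:
    merged.update(partial)            # merge the per-worker hash tables

print(dict(merged))                   # {'UK': 17, 'DE': 23, 'FR': 7}
```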

What is it for?

How do we use Hana? What can we use it for? Using Hana, we don't have to create summary tables in the data warehouse; this removes the basic need for a data warehouse. We need a DW to aggregate data fast; now we can do it using Hana. This also eliminates the need for OLAP cubes. We use OLAP cubes such as SSAS to aggregate data fast; now we can do it using Hana.

We still need a SQL interface though, which is a no-goer from the start. Who wants to type in SQL queries? Definitely not the business users! Hence we need a BI tool, such as a BO Universe, Cognos, SSRS Report Builder or PowerPivot. The major weakness of BO Universes, RS Report Builder, Cognos and other flexible reporting technologies when operating on a DW is that they are slow, because the DW is disk-based. Now that we have Hana, an in-memory database which is both row-based and columnar (and temporal!), these flexible reporting tools will be incredibly fast.

17 August 2011

How to Choose the Right BI Technology to Suit Your Style

Filed under: Business Intelligence — Vincent Rainardi @ 5:42 pm

By: Tejas Goradia

I want to see reports. Microsoft offers Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder) and PerformancePoint Services; which one should I use?

This post will help you choose a Microsoft data visualization tool based on reporting requirements driven by business process and stakeholder expectation and needs analysis. Typically, reporting styles fall into Self-Service Analysis, Business Reporting, Parameterized & Operational Reporting, Performance Monitoring and Scorecarding.

Figure 1 – Choosing the Right Microsoft BI Tool (Microsoft Excel, PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), and PerformancePoint Services in SharePoint)

Description and Key Characteristics

  • Self-Service Analysis – The best analysts (non-technical users), and often the best assets, in most organizations are those users who love to dig into the data and really understand why the business is performing the way it is. They want to perform self-service analysis (reports, graphs, dashboards, and so forth) without reliance on IT. They want to be able to easily drill down, pivot, filter and format data, and integrate information from a variety of sources without the need for specialized technical skills such as SQL, MDX, or other query languages, using familiar business analysis skills & tools such as Excel. Because Self-Service BI tools empower users to gather data from multiple sources, the need for a structured corporate system to be built, such as a data warehouse, is reduced. Tool Options: Excel, PowerPivot for Excel
  • Business Reporting – While typically not as fluid, fun, or "sexy" as the self-service creations discussed above, Business Reporting is just as important to organizations because of the need for power users in business functions to create highly formatted and distributable reports delivered in multiple formats such as PDF, Excel, HTML, and so on. Reports are typically based upon approved corporate data, and then shared more broadly with managers, teams, or departments. Specialized technical skills such as SQL, MDX, or other query languages may not be required to author reports. IT involvement is moderate, usually overseeing the distribution and monitoring of the reporting environment and the building of the structured data layer upon which the reports are built. A data warehouse, data mart and/or online-analytical-processing (OLAP) environment is often in place and used as the data source for business user reporting. Tool Options: Excel, PowerPivot for Excel, Report Builder
  • Parameterized & Operational Reporting – Similar to the Business Reporting style, Parameterized & Operational Reporting is also characterized by fixed-format reports that need to be distributed to the responsible parties for each store, region, or other organizational entity in a timely manner and without requiring user involvement to locate and run the report. The reports, however, are authored and managed centrally by IT instead of business users and usually follow a pixel-perfect format and rendering style. Consistency, scalability, manageability, auto-refresh and automated distribution are some of the key characteristics of this style. Data used to create reports comes from corporate-sanctioned and IT-managed data sources. Reports may be complex and require special technical skills such as advanced SQL, MDX, or other query languages to build. The reports may be delivered via portal, file share, email, or even sent directly to a printer, and may be generated as PDF, Excel, XML, comma-delimited text file, TIFF image, HTML or Microsoft Word formats.
    Tool Options: SQL Server Reporting Services (including Report Builder)
  • Performance Monitoring – Just as a dashboard in a car helps a driver quickly understand the real-time operational performance of the automobile, a dashboard report helps business decision makers understand the current health of their business. A dashboard often combines key performance indicators from various business functions on a single page to help provide an at-a-glance view of performance and the ability to drill down further when something is off track or performing extremely well. This style is called "Performance Monitoring (Dashboard)". This style is generally found in more mature BI environments where data points of interest have been defined, key determinants of business success have been identified, and a comprehensive BI strategy exists. Tool Options: Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), PerformancePoint Services.
  • Scorecarding – Scorecarding is a style that describes highly summarized views with Key Performance Indicators (or KPIs) measured and scored against predefined targets, such as a balanced scorecard. This style is generally part of a performance management program, though it can also be used to measure operational performance. While all of the Microsoft BI tools permit “scorecard-like” reporting, PerformancePoint Services is the Microsoft tool that enables Scorecards, either by using a methodology such as Balanced Scorecard or by creating your own methodology. PerformancePoint Services facilitates weighted score rollups, measurement against targets, and pre-defined KPIs (a minimal sketch of such a weighted rollup appears just after this list), in addition to providing the broader set of PerformancePoint functionality to allow integrating the Scorecard into a dashboard alongside other reporting and analysis capabilities.
    Tool Options: Excel Services / PowerPivot for Excel, SQL Server Reporting Services (including Report Builder), PerformancePoint Services
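
To make the scorecarding style concrete, here is a minimal sketch (in Python, purely for illustration) of a weighted KPI rollup against predefined targets. The KPIs, targets and weights below are invented; a tool such as PerformancePoint does this kind of calculation for you, so this only shows the arithmetic behind a scorecard.

    # Hypothetical KPIs with targets and weights; not taken from any real scorecard.
    kpis = [
        {"name": "Revenue",               "actual": 9.2,  "target": 10.0, "weight": 0.5},
        {"name": "Customer satisfaction", "actual": 4.6,  "target": 4.5,  "weight": 0.3},
        {"name": "On-time delivery",      "actual": 0.88, "target": 0.95, "weight": 0.2},
    ]

    def kpi_score(actual, target):
        # Score a KPI as actual/target, capped at 1.0 (i.e. 100% of target).
        return min(actual / target, 1.0)

    # Roll the individual KPI scores up into one weighted scorecard value.
    overall = sum(k["weight"] * kpi_score(k["actual"], k["target"]) for k in kpis)
    print(f"Overall scorecard: {overall:.0%}")  # e.g. prints "Overall scorecard: 95%"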

References:
Choose the right business intelligence technology to suit your style (white paper)

31 May 2011

Differences Between Data Warehousing and Business Intelligence

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:38 pm

Try asking your colleagues what the difference is between business intelligence and a data warehouse. I find that a lot of people, even those who work on BI projects and in the BI industry, do not understand the difference. A lot of people use these 2 terms interchangeably. Some people even prefer to use 1 term instead of the other because it simply “sounds better”. Many people think that business intelligence is not just a data warehouse, that there is more to it. But when asked “what business intelligence systems are not data warehouse systems?” or “what part of a business intelligence system is not a data warehouse?”, most of them have difficulty explaining the answer.

These days, “business intelligence” is the term used by most vendors in the industry, rather than “data warehouse”. Most of them call / classify their tools as business intelligence software, not data warehouse software. The name of the Cognos product is “Cognos 8 Business Intelligence”. BusinessObjects label themselves as a “BI software company” and the “global leader in BI software”. The name of one of the Hyperion products is “Hyperion System 9 BI+”. SAS Enterprise BI Server provides a fully integrated and comprehensive suite of business intelligence software. Microsoft promotes SQL Server 2005 as an end-to-end business intelligence platform. It seems that only the Kimball Group consistently uses the term data warehouse. Bill Inmon, as the inventor of the term, also uses the term data warehouse.

So, let’s get into the details. This is an example of a data warehouse system:


It includes ETL from the source system, front end applications (those 10 boxes on the right hand side), and everything in between. It has a control system, an audit system and a data quality system (also known as a data firewall). Not all data warehouse systems have all the components pictured above; for example, some data warehouse systems may not have an operational data store (ODS) – see this article for details.
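
As an illustration of the data quality system (data firewall) that sits between the ETL and the warehouse, here is a minimal sketch in Python with a couple of made-up rules; real data quality systems have far richer rule sets, logging and reconciliation.

    # Rows extracted from a hypothetical source system (invented for illustration).
    extracted_rows = [
        {"customer_id": 101,  "order_amount": 250.0},
        {"customer_id": None, "order_amount": 99.0},   # fails: missing business key
        {"customer_id": 102,  "order_amount": -15.0},  # fails: negative amount
    ]

    def passes_quality_rules(row):
        # Reject rows with a missing business key or an out-of-range amount.
        return row["customer_id"] is not None and row["order_amount"] >= 0

    clean_rows  = [r for r in extracted_rows if passes_quality_rules(r)]
    quarantined = [r for r in extracted_rows if not passes_quality_rules(r)]

    # clean_rows would be loaded into the warehouse; quarantined rows would go
    # to an audit area and be reported back to the source system owners.
    print(len(clean_rows), "rows loaded;", len(quarantined), "rows quarantined")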

The 2 blue items are data warehouse databases. The cylinder is in relational format (labelled as dimensional data store, or DDS for short), and the box is in multidimensional format (labelled as cubes in the picture above). This blue cube is also known as an online analytical processing cube, or OLAP cube for short.
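
To illustrate what “relational format” means for the dimensional data store, here is a minimal sketch using Python’s built-in sqlite3: a fact table joined to a dimension table and aggregated, which is the same slice-and-dice that an OLAP cube pre-aggregates. The table and column names are invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
        CREATE TABLE fact_sales  (product_key INTEGER, quantity INTEGER, amount REAL);
        INSERT INTO dim_product VALUES (1, 'Widget'), (2, 'Gadget');
        INSERT INTO fact_sales  VALUES (1, 10, 100.0), (2, 3, 45.0), (1, 5, 50.0);
    """)

    # Aggregate the fact table by a dimension attribute (units and revenue per product).
    for row in con.execute("""
        SELECT p.product_name, SUM(f.quantity) AS units, SUM(f.amount) AS revenue
        FROM fact_sales f JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY p.product_name
    """):
        print(row)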

The yellow items are business intelligence applications. Most business intelligence applications take data from the multidimensional-format data warehouse, but some take data from the relational format. The whole diagram above is also known as a business intelligence system.

Some business intelligence applications take data directly from the source system. For example, some dashboard systems may get sales summary data from the source system and display it in gauge meter format. In this case, we cannot call the system a data warehouse system. It is still a business intelligence system, but it is not a data warehouse system, because it does not have a data warehouse database behind the gauge meter application.

Business intelligence systems, in the past also known as executive information systems or decision support systems, are non-transactional IT systems used to support business decision making and solve management problems, normally used by top executives and managers. Many varied definitions of business intelligence systems exist in the marketplace today; the one from Dr. Jay Liebowitz is arguably one of the better ones. Most people agree that OLAP and data warehouse systems are a major and important part of business intelligence systems. Most business intelligence systems are in the form of data warehouse systems. Yes, there are business intelligence systems that do not use OLAP or data warehouses, as illustrated by the gauge meter application above, but they are rarer than the ones with OLAP or a data warehouse.

According to Ralph Kimball, in his book The Data Warehouse ETL Toolkit, a data warehouse is a system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making. He stressed that a data warehouse is not a product, a language, a project, a data model or a copy of a transaction system. In an interview with the Professional Association for SQL Server (PASS) on 30th April 2004, he explained the relationship between data warehousing and business intelligence.

In their latest book, The Microsoft Data Warehouse Toolkit, Joy Mundy and Warren Thornthwaite do not differentiate data warehouse systems and business intelligence systems. They consistently use the term DW/BI system throughout the book. This is understandable because, as I describe above, most business intelligence systems are in the form of a data warehouse system.

Bill Inmon, who invented the term data warehouse, defines a data warehouse as a source of data that is subject oriented, integrated, nonvolatile and time variant for the purpose of management’s decision processes. He pointed out that the term data warehouse was never trademarked or copyrighted. As a result, anyone can call anything a data warehouse. He recently defined a new term, DW 2.0, and this one is trademarked so nobody can change the definition. He explained the architecture in his article in DM Review, along with the differences between the first generation of data warehouses and DW 2.0 and its advantages.

So, as a summary, back to the original question: what is the difference between a data warehouse and business intelligence? Most business intelligence systems are based on data warehouse systems (the ones with a dimensional model, fact tables, dimensions, etc.), but some business intelligence systems are not data warehouse systems, i.e. they take data directly from the source system, like the example described above. Business intelligence applications (as opposed to business intelligence systems) are the yellow boxes on the diagram above, i.e. the front end applications. The data warehouse database (or sometimes people drop the word database, so it becomes just ‘data warehouse’) is the blue cylinder and blue box on the diagram above, i.e. the dimensional storage, whether in relational database format or in multidimensional database format.

If people say ‘data warehouse’, be careful because it can mean either a data warehouse system (the whole diagram above) or a data warehouse database (just the blue items). If people say ‘business intelligence’, it can mean either a business intelligence system (the whole diagram above, or a BI system without a data warehouse) or a business intelligence application (the yellow boxes).

I hope this article makes the terms clearer, but I am open to comments and suggestions. As Ralph Kimball said, if you ask 10 different people what a data warehouse is, you are likely to get 10 different answers.

Vincent Rainardi
1st May 2006

This is a repost from SQLServerCentral.

19 May 2011

Data Warehousing & BI Companies

Filed under: Business Intelligence — Vincent Rainardi @ 10:57 pm

As a professional in data warehousing and BI, I need to know the companies in DWBI. And I do intend to stay in data warehousing and BI for the next 10 years, maybe 20, making this need even more important. That’s the first purpose of writing this page. Secondly, it’s for you. Everybody knows (or thinks they know) who IBM are. But who knows who Adatis are? Not many. And those who think they know IBM (in a data warehousing/BI context, that is) don’t know it that well. For example, do you know what Initiate is? Even people who work for IBM might not know what Initiate is (it is an MDM service). That was the 2nd purpose of this page. The 3rd purpose: people forget. I often ask myself “What was the name of that company again?” If I know the name I can Google it. But I forgot the name! With this page I can browse and find it. And you can too. The 4th and last purpose is to share the DW/BI companies I have come across over the years.

I think that’s enough intro. I’ll get on with it. This page will contain the companies in the data warehouse/BI world, e.g. solution providers, product providers, consultancy firms, education/training companies, etc. That is how I will group them. For each company, I will describe who they are in 1 or 2 lines. I will also give the URL. Later on, if I need to write more than 1 paragraph about a company, I will provide a separate page. I will limit myself to data warehousing and BI. A lot of companies are not just doing DW/BI; they do other businesses as well. I will not write about their non-DWBI businesses. If a company is mixed, e.g. it is both a product provider and a consultancy firm, I will put it in 1 area only, i.e. the area it is most known for.

Some people with whom I discussed the idea of this page responded with “How about individuals? Why don’t you put famous DW/BI people here as well? We need to know them too.” Some of the readers of my “Analysis Services Who’s Who” page asked “Are you going to do the same for data warehousing people, i.e. Ralph Kimball, Bill Inmon, etc?” The answer is I will. I have, in fact. I started writing it 12 months ago, got only 2-3 names and then never finished it. But I will. It will be on a separate page though, not on this companies page. What I will probably do is write about 30 or so companies, then 10 or so people, then publish them. And I will add to them as we go along. But I will finish this page first, before I do the individuals.

Before I start, the credit first. There are many people, web pages, and books (on Safari) that inspired me to write this page. But the most influential of them all is Andrea Vincenzi’s BI & DW Resource Centre. So thank you Andrea, for the inspiration.

So here we go. If you find something that is incorrect, I would be grateful if you could let me know and I will correct it.

DWBI Consultancy – UK

  1. IM Group: A well known and respected name in London for their expertise in MS BI. One of the very few who are still specialising in MS BI. About 70 consultants.
  2. Conchango: Conchango was a well known and respected name in London for their expertise in MS BI. Now part of EMC, a worldwide IT company (not just consulting).
  3. Hitachi Consulting (was Edenbrook): A worldwide consulting company. Not a BI consulting specialist, but a general IT consulting practice. Edenbrook was a BI specialist.
  4. Altius Consulting: 2 areas: Enterprise Performance Management and BI. A respectable old name in London in MS BI. Now also do Oracle BI (Hyperion and OBIEE).
  5. Adatis: A DW/BI consultancy specialising in MS BI.
  6. Contemporary: A specialist in BO and MS BI. Also gives training.
  7. Business & Decision: A worldwide company, 19 countries, 2800 consultants. Public company (listed on Euronext). Also does lots of non-BI stuff.
  8. bIntelligent: Moore Stephens’ consulting arm. Well known in London for MS BI in insurance sector. Its parent company, Moore Stephens, is an accountancy firm.
  9. Insource: offering a product called Data Academy, which builds a SQL Server DW fast (like Red). Analyzer is the front end client. A pity they chose to specialise in Health Care.
  10. Clarity Integration: a generalist BI consulting practice (not a specialist in a particular platform; well, if I must specify: MS BI, Informatica, BO). Also offers their own product (ClearView).
  11. Peak Consulting: A DW/BI consulting company, but generalist. In London, NY, Dubai and Mumbai.
  12. IT Performs: A BI consulting firm specialising in BO and MS BI. Also QlikView. Based near Birmingham, UK.

Companies Producing DWBI Software:

  1. Microsoft: SQL Server, PDW, SSAS, SSIS, SSRS, Crescent, Performance Point, PowerPivot. The most widely used DWBI products on the planet, in my observation.
  2. Oracle: Oracle 11g, OBIEE (was Siebel), ODI, Hyperion, Essbase, Exadata, Golden Gate, Oracle OLAP. Widely used, good quality, good support. One of the oldest, top, respectable names in DWBI.
  3. IBM: Data Stage, Infosphere, Cognos, TM1, DB2, Express. Now includes Netezza. A giant in DWBI, possibly in the top 3 if I have to answer “Who’s the biggest in DWBI?”
  4. Informatica: PowerCentre, IDE, IDQ. The best ETL/DI tool on the planet, feature-wise (and performance-wise). One of the oldest, most respected names in BI that hasn’t been bought yet by the big companies.
  5. SAP: BusinessObjects, Business Warehouse, Netweaver BI. Giant in DWBI, worldwide.
  6. Strategy Companion: offering a product called Analyzer, which is a SSAS client (cube browser) and dashboard. Very user friendly.
  7. Tableau: the best SSAS client (appearance wise), in my opinion. Good charting/visualisation.
  8. QlikTech: appearance-wise, QlikView is the slickest analytics tool on the planet. Quick to build (1-2 months).
  9. MicroStrategy: comprehensive reporting solution and integrated BI platform. Multidimensional and relational. One of the oldest, well respected names in BI that is still going strong as an independent company (not yet acquired).
  10. Ab Initio: ETL/DI tool. Very secretive. I think feature-wise it offers much less than Informatica, hence the reluctance to reveal.
  11. Teradata: a very good MPP, feature-wise. Long history, proven with many customers. Wide range of products (in terms of capacity & price).
  12. Wherescape: offering a DW product called Red, which is known for its speed in building a DW in any RDBMS. In UK, US and NZ.
  13. SSAS clients: various companies producing SSAS tools. The most comprehensive list on the planet (by far), thanks to Vidas. 74 tools (!) and counting.
  14. Jaspersoft: growing rapidly in popularity. Widely known for their very low cost. Complete BI platform: DI/ETL, Reporting, Analytics.
  15. SAS: analytics, data mining, ETL/DI, DQ (DataFlux), MDM. One of the most complete BI platforms. Historically respected as the world #1 in data mining.
  16. Actuate: a product suite called BIRT (Business Intelligence and Reporting Tools), which is known for its visualisation. No ETL/DI.
  17. Inetsoft: a product suite called Style, a reporting tool known for its visualisation. No ETL/DI.
  18. iQ4bis: reporting/analytics solution. Prebuilt plugins for JD Edwards & MS ERPs (Dynamics, Axapta, Great Plains, Dynamics NAV) for reporting/analysis via SSAS, SSRS, Excel pivot tables and iQ4bis’s own tool.
  19. Board: reporting/analytics plus EPM (budget vs performance, etc). Award winning (BI Survey). Quick to build (1-2 months).
  20. Trillium: considered one of the best and most widely used data quality tools in the DWBI world. The company is a DQ specialist, not yet acquired by anyone.

DWBI Consultancies – World Wide

  1. SQLPower: BI, XBRL and data migration. Based in Canada. Many platforms: BO, Microstrategy, Oracle, Informatica, etc.
  2. KDNuggets: list of many consulting companies in data mining and analytics, probably 100 companies from all over the world.
  3. McKnight: a specialist in DWBI consulting based in Texas. Various DWBI platforms & tools. Also does training.
  4. Data Warehouse Consultants: DWBI specialist based in Pittsburgh. Founded in 2004 by John Shantz.

DWBI Recruitment Agencies

  1. Business Intelligence Recruitment: DWBI recruitment specialist in Australia, based in Brisbane.
  2. The Data Warehouse Company (TDWC): DWBI recruitment specialist in UK, based in London.

DWBI Education and Training

  1. The Data Warehouse Institute (TDWI): provides education, training and certification in data warehousing, based in US. The most well-known and influential certification program in data warehousing.
  2. Toronto College of Technology: offering Postgraduate Diploma in DWBI for CAD 9950. Covers Oracle, SQL Server, Erwin, Cognos, Crystal, Informatica and SAS.
  3. Koenig: 1 week Oracle Data Warehousing boot camp in India for £790 incl. accommodation, meals and transport.
  4. Axioms Semantics: offers DWBI courses in Chennai, India. Microstrategy, MS BI, Informatica, OWB, Cognos and Crystal training courses.
  5. CCCITM (Canadian Career College of Innovative Technology and Management): offers a Post Graduate Diploma in DWBI in Canada.
  6. Aroha: a training company in Bangalore, India, specialising in DWBI. Offers a Diploma in Data Warehousing, covering SQL, DW concepts, Informatica, BO.
  7. University of Dundee: the School of Computing offers an MSc degree in Business Intelligence. Taught by Dr Mark Whitehorn, one of the well-known, long-standing, very respectable names in data warehousing in the UK. To my knowledge, this is the only formal MSc title in BI in the world, and it’s from a reputable university.

4/1/16 Update:

Apologies, I missed Kimball University in the Education section. This was the best data warehouse training course in the world. Unfortunately they stopped running DWBI courses in December 2015.

I was recently asked to name DWBI companies in India. In India, the big 6 (TCS, Infosys, Tech Mahindra, Wipro, HCL, Cognizant) all have a BI practice.

And these are the top BI companies in India:

  • Aditi, Bangalore & Chennai, aditi.com
  • Birla, Noida, birlasoft.com
  • Exilant, Bangalore, exilant.com
  • Advantics, Mumbai, advantics.co.in
  • Board, Mumbai, board.com
  • Birst, Ahmedabad, birst.com
  • Datawise, Hyderabad, mydatawise.com
  • Deflytics, Mumbai, Deflytics.com
  • ElegantJ, Ahmedabad, elegantjbi.com
  • Graymatter, Bangalore, graymatter.co.in
  • Helical, Hyderabad, helicaltech.com
  • Inteliment, Pune, inteliment.com
  • Ixsight, Mumbai, ixsight.com (look at the MDM)
  • Jaspersoft, Bangalore, jaspersoft.com
  • MAIA, Mumbai, maia-intelligence.com
  • Nalgan, Hyderabad, nalgan.com
  • OTSI, Hyderabad, OTSI.co.in
  • Progen, Hyderabad, progenbusiness.com
  • Sryas, Chennai, sryas.com
  • Symtrax, Mumbai, symtrax.com
  • Techaxes, Gurgaon, techaxes.com

(source: cioreview.in, removing companies which I think are not DWBI enough)

We must also look at the IT centres of investment banks, as they do a lot of DWBI development in India, such as: Royal Bank of Scotland (Gurgaon), Bank of America (Hyderabad), Citi (Pune). Or other business sectors such as telecoms: Verizon (Chennai), T-Mobile, Sprint.

And my favourite DWBI company in India is: Capgemini, Bangalore, https://www.in.capgemini.com/insights-data
