Data Warehousing and Data Science

29 November 2013

Data Consistency in Oracle and SQL Server

Filed under: SQL Server — Vincent Rainardi @ 8:09 am

I feel ashamed that I have not taken the time to really understand the data consistency mechanism in Oracle and SQL Server. I’ve read about transaction isolation level a few times, but didn’t really realise that in Oracle there is no dirty read like in SQL Server (Oracle does not offer Read Uncommitted mode). 

Oracle SCN Mechanism

When a query begins, the current system change number (SCN) is determined. In the picture above the SCN is 10. As data blocks are read, only blocks written with the observed SCN are used. Blocks with recent SCNs containing changed or uncommitted data are not read. This guarantees that the data read does not change during the duration of the query. This is why in Oracle there is no dirty reads (reading uncommitted data). Source: Link

In SQL Server, when the data is being read, the table is locked. Even thought table is locked, other processes can modify the data. Rows can be deleted and updated, new rows can appear. The lock only prevents SQL Server from reading uncommitted transactions (dirty reads). But it does read new rows and deleted rows (which is correct, we want the latest data). With NO LOCK, SQL Server will read uncommitted transactions. Source: Link

So, both SQL Server and Oracle, by default, allows phantom read and non-repeatable read, but prevent dirty reads. But they do it using different mechanism: Oracle uses SCN and SQL Server uses locks.

There are 3 issues with data consistency:

  • A dirty read is when a process reads a row written by an uncommitted transaction. If this transaction is rolled back, the read becomes incorrect.
  • A non repeatable read is when a transaction rereads a row it has previously read (for example, for updating a row) and finds that another process has modified or deleted the row.
  • A phantom read is when a transaction rerun the same query, it finds a new row inserted by another process.

Here are the 4 isolation levels and their ability to solve the issues:

Isolation Level Dirty Read Non Repeatable Read Phantom Read
Read uncommitted Yes Yes Yes
Read committed No Yes Yes
Repeatable read No No Yes
Serializable No No No

8 November 2013

The 5 Stages of DWBI Journey

Filed under: Business Intelligence — Vincent Rainardi @ 7:39 am

During my years working in MS BI, I saw that the companies’ journeys in DWBI, particularly in MS BI, can be grouped into a few stages. I’m going to write that view, which I hope can be useful.

Stage 1: Report Developer (Cost: £25k/year. Duration: 2-3 years)

In most companies, the “BI Team” usually started with just 1 report developer. The business asked IT to produce report from “the database” (i.e. the back end database of the business application), and this report developer sat down with the business to “define” the requirements and write the report, say using SSRS, Jasper or Crystal. At this time they were not called BI. The word “BI” was never heard anywhere in the company. It was “Report”, not BI, not even MI. There were probably 5 people in IT (out of 100 employees, annual revenue of £5m) and 90% of the IT business was about desktop support, i.e. Exchange/email, file server, network, SQL server. And this report business was only a part time job. Then the business asked for the second report. And then the third. An the report developer became busier and busier. If the first report took him 3 weeks. The 5th report only took him 3 days. He became conversant with the tool, and with the data structure in the database (of the main business application, e.g. EPOS). No formal report specification was written. Just rough sketches of the report layout when sitting down with the business users.

How much? Annual spending at this stage was probably in the region of £20-25k. £14k for half FTE (Full Time Employee) for the junior developer (cause he’s doing desktop support too), plus £6k for training, PC, office, expenses and pension. No cost for software. For SQL Server shops, the RS came with SQL. For MySQL shop Jasper is free. May be £500 for Crystal, for the Oracle shops. I don’t like the perception that RS is free, so let’s calculate it. Say RS is 25% of SQL Server (the other 75% is Relational DB, IS, AS). And SQL Server is say £15k (server hardware + licences). So RS is £4k (25% of £15k). So the annual spending is 14k + 6k + 4k = £24k. In US the numbers are probably the same, but in $, i.e. $24k. How long? A typical company would probably spend about 2-3 years at this stage. Depends on the business growth of course. And the industry sector. A law firm would be quite different from an asset management firm. The former is not IT intensive whereas the latter is IT intensive.

Stage 2: More Reports (Cost: £150k/year. Duration: 1-2 years)

As the business grew, the business users would requested more and more reports from IT. This is especially true if the business application doesn’t have good reporting tool. If it can only report at transaction-by-transaction level, or customer-by-customer level, then the pressure to have reports developed by IT is bigger. If the business application doesn’t have the capability to report things at the company level, or provide “time series” information which shows how the company progresses from week to week, from month to month, then the pressure to create reports will be bigger. The number of reports produced by the IT department would probably exceeds 50 now. Probably even more than 80. Many divisions now are requesting reports. Not just the sales department, but also stock management and finance. And IT also need their own reports too, out of the helpdesk calls. The report development is no longer a part time job. It is a full time job now. Probably even grow from 0.5 person to 2 persons. Number of IT staff has grown from 5 to 15. The company size is probably about 200 employees. Or even 400 employees, depending on which industry sector. The annual revenue is probably £30m now, with total IT spend of £0.5m. Again depending on which industry. They don’t just use 1 business application, but probably 4-5 applications, most of them are off-the-shelf. The maintenance of these apps are outsourced. They have 1 or 2 full time report developers, and probably 1 or 2 full time application developer, most probably Web-.NET-SQL developers, or PHP-Java-PL/SQL.

I’m going to give an example in MS BI now, but it’s probably the similar situation in other tech stack. The report developer is now called MI Report developer. He/she uses SSRS to develop the reports, and host them in SharePoint. There are a few reports which are scheduled and emailed out. Reports provide values to the business. The quantity of the reports and the workload are both manageable. The business loves IT department because they can get the information out of these 4-5 business application, and get this valuable information to the business. It took 2-3 days to produce a new report, but sometimes 2 weeks because there are a queue. Prioritisation works quite well. Some of the reports beginning to be not used, particularly those which were developed 2 years ago. They are now superseded by newer reports. So they have initiative to archive reports which are not used. They monitor the usage of the reports. The IT manager want to make sure that the reports are worth the time invested to create them. These usage report is a worthwhile tool for the IT manager to put forward good arguments to those business users who ordered reports but not using them actively.

How much? The cost of MI is probably about £150k now. The salary of 2 report developers = 2 x £28k = £56k. May be 3 developers = £84k. Plus 3 x £6k for expenses & overhead (PC, office, training, pension) = 18k + 84k = £102k. Now IT has dedicated SQL Server for MI (£15k for H/W + S/W). Probably 2 servers (£30k). Plus some SharePoint licences to host the RS reports (say £10k). Probably plus £5k for some consultant fee to setup/configure RS/SharePoint. So the annual spending is 102k + 30k + 10k + 5k = £147k. How long? This stage is surprisingly fast. Probably only 1-2 year, at which point the company then will experience stage 3: too many reports.

Stage 3: Too Many Reports – Self Service BI (Cost: £150k/year + £50k one-off. Duration 1 year)

By the time the number of report development workload achieved 3 FTEs (Full Time Employees), or 150k per year, the IT manager is very much interested to the idea of Self Service BI. The idea of creating a data warehouse and cubes on top of it becomes a good business proposition from the cost point of view. The users will be able to use Excel to query the cubes. They may consider the idea of buying BI software like BO, Cognos, MicroStrategy, QlikView but the cost is probably becomes a barrier. It’s incredibly difficult to get the prices (if you are not buying) but I’ll take a stab, otherwise we won’t have an idea about the cost. If I’m way off, I’ll be glad if the corresponding vendors can correct me. QlikView is probably about £150k for 100 users, i.e. £1500 per named user including document licence. Plus £50k for live & test server licences. Plus £10k for SharePoint component + workbench. Plus £10k for server/infra cost. Plus training £10k for training, £20k for consulting services. So total of 150k + 50k + 10k + 10k + 10k + 20k = £250k one off cost. Plus 20% annual cost for support (£50k). BO is £35k per 25 CAL, x4 for 100 users = £140k. Plus 3 x £400 = £1200 for developer licences. Say £50k for training, consultant, servers. So £200k one off, plus probably 20% = £40k annually. Cognos and MicroStrategy is probably similar, i.e. in the region of £200-250k one off cost, plus £40-50k annually, for 100 business users. So it’s a no-goer if the Stage 2 cost is £150/year, because these developer cost won’t go away, but increase as they have to hire a QlikView/BO/Cognos developer, or retrain existing staff.

Hence the idea of doing self service using a data warehouse + SSAS cubes starting to take off, in the head of the IT manager. The numbers look good: SSAS is free. At least that was the initial perception. SharePoint is already used for SSRS, and the Excel is widely deployed to each desktop, so no additional cost then! And this Data Warehousing stuff will only cost 1 new server, with a big storage, so £20k. Say 2 new servers, plus storage = £40k. Much more palatable than the £250k of the popular off-the-shelf BI software above. There is a training cost, i.e. SSAS, SSIS, Modelling, say £10k. This initiative is double-acting as a “retainer” too, from HR point of view. Good people stay because they can develop their skills. Getting them ready to go to the next level in their career. And the whole DW OLAP thing is a new world for them. Of course they’d stay. Eagerly. At this point they learn about dimensional modelling, about ETL, and about data quality. They learn about metadata, about the technicality of building cubes. After struggling about 6 months, in their spare time, they managed to construct 1-2 fact tables and a few dimensions in the data warehouse, and built a cube on top of it. The demo was a great success, the business absolutely loved it. I’ve been attending “first business demo” (on SSAS) for about 8 times now and I haven’t experienced a single event where the business is not excited seeing their numbers being sliced and diced in front of them, turned into time series charts and gadgets.

Now, with all data being delivered via the cubes to them, the business can do self service BI. What they loved in the first few weeks was that they don’t need to wait 3 weeks for IT to create a report for them. They can directly get the information they need themselves. And what the IT manager loves in the first few months is that the number of report requests was down a bit. The number of reports each quarter was something like: 100, 120, 150, 200, 160. They managed to knock some reports of the list (in this example 40) because the information need can be satisfied/fulfilled from the cubes, i.e. self service BI.

At this point they would also consider 2 kinds of software which helps the development of the data warehouse: a one click warehouse builder (such as RED and Kalido), which can build a data warehouse very quickly, and a data virtualisation software (such as Composite), which can deliver BI without the pain of building a data warehouse. So they tried out 1 or 2 of these software. They would hope that these software could enlighten their workload. A few companies might switch and abandon the MS BI, but majority would stay with MS BI and only use tools that complement to MS BI.

How much and how long? The one off cost is about £50k. £40k for servers and £10k for training. The annual spend is the same as Stage 2, £150k/year (£100 for 3 people, £50 for infra). This is for 100 business users. How long? Usually companies stay at this stage only for a short period, i.e. 1 year. This is the period when the initial data warehouse is constructed, and the cubes developed. After the initial version is deployed, the IT Department will have many requests to put more and more data into the DW and cubes. And that’s Stage 4.

Stage 4: Growth of DW and Cubes (Cost: £200k/year. Duration: 2-3 years)

The were many requests to put various attributes and measure into the cubes and warehouse. Can you put the Sales Person into the cube please? Oh and supplier please. And the store manager names please? There will be a change of product grouping next month, can you accommodate that please? We have a new price list coming this quarter, that needs to go into the warehouse. Can we have the employee hierarchy in the cube? You get the picture. Lots of changes and new requests to add more data into the warehouse and the cubes.

The danger is, the “BI Team” is still new to DW. They just learned yesterday what a dimension is, what a fact table is, and what grain is. They are inexperienced in data modelling. And if they got it wrong, it cost a lot of working days to correct it. Most probably weeks, not days. They read Kimball’s data modelling book, and expect to see a design that suitable for their case, but there are a lot of specific points which are not mentioned in the book. They tried to browse the internet to get information about data modelling, but got even more confused. And they need even more guidance on the ETL side. They are clueless how to do SCD type 2. They hope that SSIS SCD2 task will solve it for them, and it did to some degree, but alas, the population of the fact tables now becomes an issue. Plus the performance is an issue, and the metadata is an issue. At this point, they seriously thinking to hire a data architect. Someone who have been designing and building data warehouses and cubes in different companies. Someone who knows how the ETL should be built. So they hire a data architect, say £50k/year. He knows about data modelling, ETL, and SSAS. And they hire an ETL developer too. An SSIS + T-SQL guy who has some exposure to cubes. This guy has a lot of experience moving data from 1 place to another. Data integration experience. 1 of the 3 report developer now becomes an ETL developer. They are now called a BI Team, with 5 people: 2 report developers, 2 ETL developers, 1 data architect who also does SSAS.

Some of the SSRS reports are now decommissioned. They managed to get the reports down. Say from 200 to 100. Some of these reports are re-pointed to the data warehouse. Instead of reading from a copy of the business application database, the SSRS reports read from the data warehouse tables. Initially the DW needs to be reconciles with a copy of the business application (of course!), but as time moves on, there are measures which are created in the warehouse which don’t exist in the live app, for example: ranking measures such as decile and quartile, and various ratios between original measures such as financial ratios.

Business users enjoys the benefit of having a lot of data at their finger tips. They can have the information instantly, without waiting for 3 weeks like before, for the reports to be created by the report writer.

How much and how long? Surprisingly, instead of decreasing, the cost of BI is increasing. It’s now 5 people instead of 3, so 4 x £28k + 1 x £50k = £162k. Plus £40k of annual infra cost such as additional capacity, replacement servers, etc. So about £200k/year. How long? A typical period is 2-3 years, after which the business users become dependant of BI. The data coming from the DW via cubes and reports really drive the business forward.

Stage 5: Users Became Dependant of BI (Cost: £50k one off + 10k/year. Duration: 3-4 years)

One by one, each business area uses the new technology (DW & cubes). Sales department, manufacturing department, finance department, marketing department, even HR department. In insurance industry: underwriting, claim, reinsurance, actuarial, finance. In utility industry: sales, customer service, operation, finance. The information from BI affects the business decision in each area: premium at renewal, pricing policies, CRM strategy, product design, marketing strategy, production cost control, etc. Even more requests to put more and more data into the warehouse and cubes and reports.

The companies at stage 5 usually looked for BI tools. Visualisation tools. And data warehouse tools such as DQ and ETL. They want to make the lives of their users easier with BI and visualisation tools. They want to make their own lives (IT) easier, using various development tools. As development work grows, they need to tidy up on the governance, version control & code repository, request & bug management, release management and data quality. Software like Red Gate, SQL Sentry, JIRA, TFS and would come into their consideration horizon. Red Gate & SQL Sentry is SQL Server development tools (and server monitoring). JIRA is request and bug management tool. TFS is code repository and version control tool. It also does task management such as Sprint planning.

With regards to data visualisation, software like Cube Player, Power BI, Strategy Companion, Panorama, XLCubed, Pyramid, Verigence Vivid, Bonavista Dimensions and Tableau would come into the picture of any MS BI shop. Out of about 15-20 SSAS client out there, most companies only know 1 or 2. Probably 30 actually (I mean SSAS client), complete list is here at Vidas Matelis’ ssas-info website, the most content intensive website on SSAS. At this stage, companies are trying out these SSAS client, and eventually hooked with 1 of them. I can’t recommend which one (even though I have personal preference), hmm… probably good to write about SSAS clients actually. I just browsed for them and very little is written about their comparison. Most notably ones are here, here and here.

Apart from SSAS clients, companies also frequently thought about reporting tool, i.e. what’s better than RS? If their budget is big enough, they will probably be approached by (or think about) QlikView and Spotfire. If their budget is minimal they probably think about Jasper and Pentaho. QlikView, Spotfire, Hana, TM1, PowerPivot are all “in memory OLAP”. Unlike SSAS which stores the Multidim structure on disk, these tools read relational tables, and create the multidimensional structure in memory. Much like MicroStrategy. TM1 is not like that, it stores the proprietary Multidim structure as files. Ditto QlikView.

How much and how long? It depends on the tools they select and the industry they are in. For retail, telco, shipment, manufacturing, legal, mining, transport, utilities, it is about 2-3 years. For banks, fund managers, healthcare, insurance, probably 3-4 years. Could be 5. For costing, let’s use Tableau. The cost of implementing Tableau on top of DW + SSAS is surprisingly low: 50k one off + 10k/year. Developers uses Desktop Pro which is $1999 + $199/year maintenance (source: link). Business users will become report creators, they uses Tableau Online Licences which is $500/year. And report reader is free. So if we have 3 developers and 15 report creators the cost is about (assuming the price is same in the UK but in £): 3 x £2k = £6k one off + annual cost of 3x£200 + (15+3) x £500 = £9600/year. Plus £40k for training, servers and consultant it’s around £50k one off project cost and £10k/year on-going cost.


What’s after stage 5? BI is stable, matured and managed well. Operation are governed. What usually happened at this stage is a complete change of technology. For example: implementation of Cognos or Business Objects. Yes, changing all the Microsoft BI stack with Cognos or BO stack, leaving just the SQL Server DB engine, which could be replaced with Oracle too. The other approach is to stay in Microsoft BI path, and look at a few things:

a. Parallel Database

For high volume industry such as retail and telecom, as the business grow, the number of transactions per day could be in the million. This requires some kind of MPP (Massively Parallel Processing), meaning parallel database engine. Either Netezza, PDW (Parallel Data Warehouse, Microsoft new MPP), Exadata, Greenplum or Teradata. Or column oriented database such as Vertica, Sybase IQ.

b. Data mining
For certain industries, CRM and marketing is vital, particularly in the retail sectors. They could look into integrating CRM and marketing campaign management with their data warehouse. This requires data mining and forecasting, which can be done in SSAS, but some companies uses SAS, SPSS, BIRT or even manually using .Net or R.

c. Big Data

This is where most company after stage 5 are going. At least trying. Map Reduce, Hadoop, Mongo DB, Hive, and things like that. The project cost is not high, because it’s common hardware and the software is mostly free. But the benefit is potentially big. Hence the motivation.

d. Text Mining

For certain industries such as healthcare, insurance and retail, the proposal of able to crawl over millions of patents documents, emails correspondence with customers, and thousands of policies, have business good values. Software like Clarabridge, Angoss, Attensity, NetOwl can be implemented for this purpose (mining text within documents or emails for information).

For IT managers, I hope this helps broadening the DWBI horizon, whatever stage you are in. For vendors, I’d be grateful if you could advise if I’m wrong, particularly about the cost. For developers, I hope this article could clarify a bit about the future of your career path, if you stay within DW BI world. There are many experts out there and I’m not even an expert so I would be grateful for your comments.

VR, 9/11/14:

  • Andrei Padrei on QlikView and Tableau prices: link
  • Tableau prices: link

Blog at