Data Warehousing, BI and Data Science

29 November 2017

What’s in Azure?

Filed under: Business Intelligence,Data Science,Data Warehousing — Vincent Rainardi @ 5:31 am

As of 28th Nov 2017 we can create the following in Microsoft Azure:

  • Data: SQL Database, SQL Server, SQL Data Warehouse, SQL Elastic Database Pool, MySQL, PostgreSQL, Data Lake, CosmosDB, MongoDB, Redis, Teradata Database, Oracle Database, IBM DB2, Aerospike, XtremeData, HazelCast, ArangoDB.
  • Data Integration: Informatica, Data Factory, Data Catalog, File Sync, Profisee Maestro Integration Server, Information Builder iWay Big Data Integrator, BizTalk Service, IBM WebSphere Application Server, IBM WebSphere MQ, IBM MQ, Datameer, QuerySurge.
  • Big Data: HDInsight, Hadoop, HDFS, HBase, Hive (Interactive Query), Hive Streaming, Spark, Kafka, Storm, Hortonworks, Cloudera, Cassandra, GridGain, MapR, F5 BIG-IP, Syncfusion, Informatica Big Data Mgt, Kyligence, AtScale.
  • Data Science: ML Experimentation, ML Model Management, R Server, Data Science VM, Bot Service, Computer Vision API, Face API, Language Understanding Intelligent Service, Translation Speech API, Text Analytics API, ML Studio Web Service, Sparkling Water H2O.
  • Analytics: Databricks, Stream Analytics, Analysis Services, Data Lake Analytics, Time Series Insights, Tableau Server, Qlik Sense Server, Pyramid Analytics, Nutanix Analytics, Real World Analytics, Exasol Analytic, HP Vertica Analytics, Teradata Aster Analytics, Bing Search, Dundas BI, Power BI, Panorama Necto, Spago BI, Targit, KNIME, SAP Hana, Kepion, Jedox, Diagramics.
  • Internet of Things: IoT Hub, Event Hub, Notification Hub, Pilot Things IoT VPN, TrendMicro IoT Security.
  • Developer Tools: Visual Studio, VS Anywhere, DevOps Project, Team Project, DevTest Labs, Application Insights, API Management, Operational Insights, Jenkins,
  • Infrastructure: Virtual Network, Load Balancer, Network Security Group, Security & Audit, Security & Compliance, Event Tracker Security Centre, Log Analytics, Automation, Active Directory, Scheduler, CloudAMQP, Cradeon.
  • Web: Website + SQL, WebApp, WordPress, Drupal, Joomla, Django, API App, Logic App, CDN, Media Services, VoIP, SiouxApp, App Dynamics, CakePHP, BlogEngine.NET, MVC Forum, Better CMS, Node JS
  • Compute: Windows Server, Red Hat Enterprise Linux, Oracle Linux, Ubuntu Server, VM Instances, Service Fabric Cluster, Web App for Container, Function App, Batch Service, Cloud Service.
  • Blockchain: Ethereum, Hyperledger, Corda, Quorum, STRATO, Chain Core, Stratis.

7 February 2015

Data Scientist

Filed under: Business Intelligence,Other — Vincent Rainardi @ 8:57 am

Some people wonder what "data scientist" actually means. Put simply, a data scientist is an expert in:

  1. Statistical analysis
  2. Data mining
  3. Big data
  4. Data visualisation

In addition to the above technical skills, a data scientist usually has good business knowledge in one or two sectors, such as banking, insurance, investment, oil or biotech.

In this article, I will explain each of the 4 technical areas, and then give an example.

Statistical Analysis

I consider the following statistical analysis skills to be important for a data scientist to have:

  1. Generalised Linear Model (Bernoulli distribution, Bayesian methods, Gaussian regressor)
  2. Cluster analysis (k-means, Expectation Maximisation, fuzzy clustering)
  3. Goodness of Fit Tests (Kolmogorov Smirnov, Pearson’s Chi2, Anderson-Darling)
  4. Bayesian Statistics (Naïve Bayes Classifier, Bayesian Network, Hidden Markov Models)
  5. Factor Analysis (Maximum Likelihood Extraction, Principal Component Extraction)
  6. Time Series, Matrix computations, Latent Dirichlet Allocation

For many candidates, this first subject (Statistical Analysis) is the biggest stumbling block. The above topics are advanced statistics; some of them are at PhD level. Very few people in IT know them, because they are usually only taught at university in mathematics, physics or engineering degrees.

As with anything else, Statistical Analysis requires the use of software/tools. The most widely used Statistical Analysis tools are SPSS and SAS; these two are the de facto industry standard. Matlab has a special place: it is not as user-friendly or comprehensive as SPSS and SAS, but it can be programmed to do anything and is extremely flexible for any kind of statistical processing.

Statistical analysis requires good programming skills, particularly in R and S. Knowledge of parallel processing and multithreading will also be useful. R is the de facto standard language for statistical computing, data analysis and data mining. R is derived from S, with lexical scoping semantics added to it, so S provides a good foundation but is practically no longer used. Some of the popular IDEs for writing R are RStudio, Rattle GUI and RWeka.
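To make the first two items on the list above concrete, here is a minimal sketch of a generalised linear model (logistic regression, i.e. a Bernoulli response) and a k-means cluster analysis. I am using Python with scikit-learn purely for illustration; R would serve equally well, and the data here is synthetic.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # Synthetic data: 500 observations, 4 explanatory variables, binary outcome
    X, y = make_classification(n_samples=500, n_features=4, random_state=42)

    # Generalised linear model with a Bernoulli response (logistic regression)
    glm = LogisticRegression().fit(X, y)
    print("GLM coefficients:", glm.coef_)

    # Cluster analysis: k-means with 3 clusters
    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print("Cluster sizes:", np.bincount(km.labels_))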

Data Mining

The reason Data Mining is required is for predictive analysis and forecasting the future, as well as descriptive analysis (explaining the past). For this you need to be able to build data mining models, train the models, and use them for forecasting. Data mining requires a strong mathematical foundation in areas such as clustering, regression and neural networks. It also requires knowledge of specific tools such as SAS and Analysis Services.
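As an illustration of "build, train, then forecast" (again in Python; the numbers below are invented for the sketch, and a real model would of course be far richer):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented example: 24 months of sales history (a trend plus noise)
    months = np.arange(24).reshape(-1, 1)
    sales = 100 + 5 * months.ravel() + np.random.default_rng(0).normal(0, 3, 24)

    # Build and train the model on the historical data
    model = LinearRegression().fit(months, sales)

    # Use the trained model to forecast the next three months
    future = np.arange(24, 27).reshape(-1, 1)
    print("Forecast:", model.predict(future))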

Data mining requires knowledge of data warehousing and BI, because data mining can only uncover patterns actually present in the data. As the data is usually a very large set, it is commonly stored in a data warehouse, and undergoes data warehousing processes such as data integration, denormalisation modelling, data quality improvement and data cleansing.

Data mining requires business knowledge, in areas such as Customer Relationship Management, market basket analysis, credit card fraud, insurance claims, marine underwriting, credit risk and FX trading. Without the business knowledge, it is impossible to create a good mining model.

Big Data

A data scientist needs to know about big data because, increasingly, more data is stored in big data architectures, i.e. Hadoop, HBase, MapReduce, EC2/S3. They do not need to know Pig, Oozie, Lucene, Flume or Sqoop in detail, but they need to have experience with the platform that the company uses, such as Hortonworks, Cloudera, BigInsights (IBM) or HDInsight (Microsoft). These platforms are fully equipped with all the tools you need to load and process data in Hadoop: data access layers, streaming, query language, security, scheduling and governance are all rolled into an enterprise-ready platform.

A data scientist may not write complex MapReduce transformations in Pig Latin and extend the functions using Python or Ruby (that's for the programmer to do), but they do need to understand the basic concepts. For instance, how the Reduce job combines the output tuples from a Map into a smaller set of tuples, what keys and values are in Hadoop, why you need a Shuffle between a Map and a Reduce, and so on. Whether you use Hortonworks, BigInsights or HDInsight, the implementations differ between companies, but the core concepts are always the same.
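A simple way to see the map/shuffle/reduce idea without a cluster is to simulate a word count in plain Python. This is only a conceptual sketch, not Hadoop code:

    from collections import defaultdict

    docs = ["big data is big", "data science needs data"]

    # Map: emit (key, value) pairs, here (word, 1)
    mapped = [(word, 1) for doc in docs for word in doc.split()]

    # Shuffle: group the values by key, as Hadoop does between Map and Reduce
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)

    # Reduce: combine each key's values into a smaller set of tuples
    reduced = {key: sum(values) for key, values in grouped.items()}
    print(reduced)  # {'big': 2, 'data': 3, 'is': 1, 'science': 1, 'needs': 1}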

Using a Big Data platform such as BigInsights enables data scientists to do data discovery and visualisation. It comes with advanced text analytics tools, machine learning analytics, large scale indexing, adaptive MapReduce, compression, security and stream analytics. Without knowledge of such a platform, a data scientist limits their capability to process the data.

Data Visualisation

Data visualisation is the buzzword for BI tools. Tableau, QlikView and Spotfire are the most popular. IBM has five: Cognos Insight, Cognos Express, SPSS Analytic Catalyst, SPSS Visualisation Designer and Watson Analytics. SAP has Lumira. SAS has Visual Analytics and Visual Statistics. And there are tons of other tools: Dundas, iDashboards, Datazen, MicroStrategy, Birst, Roambi, Klipfolio, Inetsoft, to name a few.

A good data scientist must have experience creating visualisations using one or two of the popular tools above. None of these tools is difficult to use, compared to programming in Python or Ruby for example, or even compared to Statistical Analysis. Within 2 weeks you will be able to grasp the basics, and within a month you would be able to use them fluently. They are very user friendly, with highly visual GUIs (point and click, drag and drop, that sort of thing).

Example

One of the most famous data science teams is AIG's. As the Wall Street Journal reported (link): "AIG analyzes reams of data, from massive databases to handwritten notes from claim adjusters, to identify potentially fraudulent activity. It produces a list of around 100 claims to investigate, ranked highest priority to lowest. It also produces charts and other visualizations, like heat maps, that helps investigators understand why machine learning algorithms make the connections they do."

Jobs and Pay

I recently saw a job advert for a data scientist at £800 per day. Admittedly this is the highest I have ever seen. I believe the normal range is £400 to £600 per day, depending on the industry sector: banking pays the highest, insurance or pharma probably second, and the public sector, including the NHS, the lowest.

  • ITJobsWatch (link) reported that the average salary for a permanent position is £57,500, whereas for a contract position it is £450/day.
  • Glassdoor (link) reported that the national average for a data scientist is £40,000.
  • PayScale (link) reported that the median is £39,691.

It is impossible to get all 4 skills in 1 person. So in many companies this role is carved up into two or three positions: a big data guy, a visualisation developer, and a statistical analyst. In some companies, you don't need to know any statistics to be called a Data Scientist; you just need to know SQL. They call the SQL database guy a Data Scientist just to make the role more attractive and look good. Ten years ago the "SQL guy" used to be called a "report developer", but nowadays he could be (mis)labelled a "data scientist".

 

1 December 2014

Using BI Tools as ETL Tools

Filed under: Business Intelligence — Vincent Rainardi @ 6:59 pm

A few years ago a colleague said that sometimes users only use the BI tool as an ETL tool, i.e. they use a BI tool such as Tableau or QlikView just to get the data, just to bring the data into Excel. Then they do everything in Excel.

Today I realise that it is true. Some users have all of their data in Excel, which they get from various sources: ERP, trading system, settlement system, reference data system, etc. And the data from the data warehouse is just one additional source to them. If we give them the DW/BI data in SSAS cubes via an Excel pivot table, great! They can "integrate" the DW data straight away.

But if we give them a BI tool such as Tableau, QlikView, Spotfire, Hana, Cognos, BO or MicroStrategy, they export the data into Excel and use it there. This is particularly true in the financial industry, where the users are so advanced with Excel. Investment banks, retail banks, insurance and asset management are the industries where there are a lot of Excel wizards amongst the underwriters, claims adjusters, financial analysts, debt traders and fund managers. Yes, in the financial industry, Excel is the de facto tool that every analyst uses to do their day-to-day job.

23 November 2014

SQL Server 2014 Installation

Filed under: Business Intelligence,SQL Server — Vincent Rainardi @ 4:42 pm

I’m installing SQL Server 2014 on a new laptop and it has not been an easy experience.

First of all, Books Online is not included, and when I tried to "Install Content from Online" I hit the issue below (I include the solution).

Secondly, there is no BIDS! BIDS needs to be downloaded from this link, and installed into Visual Studio 2013. But my Visual Studio is 2015 (Preview), so I had to install VS 2013 first.

After installing VS 2013 and downloading SSDT BI (the new name for BIDS), when I installed SSDT BI it said that the architecture was incorrect: my SQL Server 2014 was 64-bit, and the SSDT BI download page said that it could do 32 or 64 bit. But in fact, the SSDT BI installation file was only for 32 bit (SSDTBI_x86_ENU.exe). There is no SSDT installation for 64 bit, because Visual Studio is only available in 32 bit.

The trick is, as “g2server” pointed out here: when installing SSDT BI, on the Installation Type we need to choose “new instance”, not “existing instance” (even though you already installed both SQL Server 2014 and VS 2013).

All sorted now. Hopefully with this article those of you who want to install the SQL Server 2014 documentation and SSDT BI won't have to spend as much time as I did.

Symptom: opened the SQL Server 2014 documentation, Microsoft Help Viewer 1.1 opened up, chose Install Content from Online, selected the content (SQL Server 2014 Books Online), clicked the Update button. Help Library Manager said "Updating Local Library, Analyzing package delta" for 2 minutes, then displayed this error message: "An exception has occured. See event log for details."

The Event log:

An error occurred while updating local content: Microsoft.Help.CacheLib.CacheLibBitsException: Retrieving the COM class factory for component with CLSID {4991D34B-80A1-4291-83B6-3328366B9097} failed due to the following error: 80080005 Server execution failed (Exception from HRESULT: 0x80080005 (CO_E_SERVER_EXEC_FAILURE)). —> System.Runtime.InteropServices.COMException: Retrieving the COM class factory for component with CLSID {4991D34B-80A1-4291-83B6-3328366B9097} failed due to the following error: 80080005 Server execution failed (Exception from HRESULT: 0x80080005 (CO_E_SERVER_EXEC_FAILURE)). at Microsoft.Help.CacheLib.Downloader.CreateManager() — End of inner exception stack trace — at Microsoft.Help.CacheLib.Downloader.CreateManager() at Microsoft.Help.CacheLib.Downloader.Jobs() at Microsoft.Help.CacheLib.Downloader.CancelAllPendingJobs() at Microsoft.Help.CacheLib.DocumentationCache.DownloadWebContent(IEnumerable`1 webDownloads, Int32 totalDownloads, Int32 completedDownloads, String workingDirectory, StatusCallback`1 statusCallback) at Microsoft.Help.CacheLib.DocumentationCache.Download(ChangeDescription change, StatusCallback`1 statusCallback) at Microsoft.Help.CacheLib.DocumentationCache.Update(VendorName vendorName, Boolean checkForOnlineUpdates, StatusCallback`1 statusCallback, CertificateCheckCallback certCheckCallback) at Microsoft.Help.CacheLib.CacheManager.<>c__DisplayClass24.<UpdateAsync>b__23() at Microsoft.Help.CacheLib.AsyncOperationRunner.Run(Object state)

Source: https://support.microsoft.com/kb/870655?wa=wsignin1.0

In the Registry, under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\SubSystems:

%SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,20480,768 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=sxssrv,4 ProfileControl=Off MaxRequestThreads=16

Added 512 like this: Windows SharedSection=1024,20480,768,512

Still the same error.

Solution: download the 2012 and 2014 documentation .msha files from

SQL Server 2014: http://www.microsoft.com/en-gb/download/confirmation.aspx?id=42557

Product Documentation for Microsoft SQL Server 2014 for firewall and proxy restricted environments

SQL Server 2012: https://www.microsoft.com/en-us/download/confirmation.aspx?id=347

Product Documentation for Microsoft SQL Server 2012 for firewall and proxy restricted environments

16 July 2014

iDashboards

Filed under: Business Intelligence — Vincent Rainardi @ 7:22 pm

A lot of people have heard of QlikView and Tableau, which in my opinion are the top two dashboard products. But I have found three dashboard products which are quite good but much less well known: Dundas, iDashboards and Datazen. Today I would like to write about iDashboards; perhaps on another occasion I'll write about the other two. I want to write about them because I feel these three products need more exposure in the BI community: they are functionally good, but most BI developers don't know them.

What I write here is purely my opinion. I will start with what a dashboard is, and what features are important for it to have. Then I will examine the iDashboards software with regard to visualisation, development environment, data sourcing, collaboration and architecture. After the conclusion, I'll mention 43 other dashboard products outside the five above.

What is a Dashboard?

A dashboard is a BI tool used for visualising and analysing data (BI tools which are not dashboards include reporting tools, OLAP cubes and data mining tools). Using a dashboard, business users can view their data at a high level to understand trends and anomalies, and then drill down while still staying at a high level (not at transaction level, as in a BI reporting tool).

From the user's point of view, a dashboard consists of a collection of controls, i.e. widgets such as tables, charts/graphs (bar, pie, line, combi), maps, pictures, videos, gauges, thermometers, sparklines, KPI indicators (traffic lights, smileys, thumbs up/down, arrows), clocks, heatmaps, dropdowns, buttons, texts, labels, tabs, scrolling bars and sliders.

From the developer's point of view, a dashboard system (I'm talking about a good one here, one which has undergone at least 5 years of evolution) consists of a development environment (IDE = Integrated Dev Env), a database and a rendering engine. When we create a dashboard, we use the IDE to combine all those controls, connect them with our source database (or XML/text files, Excel spreadsheets, cubes), define KPIs, and set up interactivity and the drilldown path. When we click save, all that work is saved into an internal database containing 40-50 different tables (controls, KPIs, users, dashboards, chart settings, permissions, data sources, formatting, hierarchies, tabs, etc). We then click view (or publish) and the rendering engine reads those tables and displays the dashboard on the screen. The rendering engine also responds to mouse hovers and mouse clicks from the users, showing different things and different data accordingly.

A scorecard used to be a particular type of dashboard without interactivity: a static dashboard where people can see high-level scores or snapshots of each key area of the business, so that at a glance the business users are able to understand the status and position of their business. But that was 10 years ago. Nowadays scorecards are interactive and users can filter, drill down, and highlight certain areas of the scorecard; no different from any other dashboard, apart from their function, which is to report a particular state and status of the business.

What features does a Dashboard need to have?

A good dashboard will have these features:

  1. Very slick, modern looking visualisation controls with good interactivity abilities, including drilldown/up, hyperlinks, mouse click/hover.
  2. Good, easy to use development environment, complete with version control, layout tool, calculation, KPI, undo, scripting language, migration tool.
  3. Data sourcing: a fantastic ability to combine data from different tables and files (mash up), i.e. multi-table, multi-column joins, filtering, sorting, grouping, XML reading, executing SPs, able to accept parameters and variables, data type conversion.
  4. Collaboration: different users see different data depending on their permissions, users can copy and modify dashboards, save and share their versions within their team, collaborate and share comments, single sign on, integration with SharePoint, templates, export/import, printing, alerts.
  5. Architecture: rendering engine, web-based platform (thin client not Windows-based app), mobile access, web server, database storage, API, custom extension, flexibility.

Dashboards are used by 3 groups of users, each using the dashboards in different ways:

  1. Administrators: install the software, create data sources, create user groups, manage user permissions, manage licences, upgrade and patching.
  2. Developers: create KPIs, charts, join data tables, feed to controls and compose them into dashboards.
  3. Business users: view the dashboards, modify and save them, share their comments within their teams.

In the next 5 sections I will describe how iDashboards is doing with the 5 dashboard features above: visualisation, development environment, data sourcing, collaboration, and architecture.

1. Visualisation

iDashboards has a stunning visualisation front end. If you download and install the evaluation copy, you will see various controls and images such as thermometer-like controls, speedometer-like gauges, sliders, bar charts and line charts, maps, clocks, cones, KPIs, bubble charts, month-choosers, calendar controls, videos, colourful pie charts and ring charts, pyramids, butterfly charts, calculators, data tables, buttons, dropdown lists, rolling marquees, Gantt charts, heatmaps, countdown timers, images, two-axis charts, traffic lights, texts and labels.

The interactivity is basic but useful: we can filter charts based on input values (parameters); choosing specific data in one control limits the data in other controls; we can drill down and go back up; when we click on a chart it drills down to a lower-level chart or data table; if we hover the mouse it explains the numbers; and we can choose the value which acts as a parameter using a dropdown list, a calendar input, or by selecting a particular section of a chart, which will automatically select the corresponding part in another chart.

I said basic but there is not much more interactivity you can expect to get from a dashboard. So from that angle I think it is quite good. Plus we still have the following interactivity features: we can add/open bookmark, enter dashboard parameters, print the dashboard, change language, change skin colour, create alerts, receive alerts by email, change zoom levels, import and export to Excel.

2. Development Environment

I feel that iDashboards is designed with business users as the developers. It is not difficult for business users to create a dashboard. It is almost as easy as creating charts in Excel: choose a layout, create charts, customise a little bit and we are done. Unfortunately the development environment has a non-intuitive user interface. For example, to discard a dashboard we click on the left arrow, and to return to the application we click on the right arrow. Another example: to set the thickness of a border, we use a slider. It is not the standard Windows interface that we find in Excel or PowerPoint. Throughout the development process I did not feel comfortable with the user interface.

Putting a chart into a dashboard is quite easy, i.e. using drag-and-drop. We can read data from the output of a stored procedure, Excel file, text file, or database tables and views. We can create various charts, set the title size, colours, number formatting and margins. We can define legends and define drilldown to a different chart, different dashboard or a URL. We can create hierarchy for drilldown, we can pass a parameter when drilling down, and we can create macros.

Users can create reports based on the dashboard. The report can be in HTML format or in PDF. Reports contain the data snapshot which was in a particular dashboard at a particular point in time. Again, this "create report" feature seems to be intended for the user community rather than IT developers. Users can create reports for a particular chart, or for the whole dashboard. This feature is very basic, almost like a printing facility (to a PDF).

We can build a custom application and place it into our dashboard. It is called VIPanel, which stands for Visually Interactive Panel. On the iDashboards Support website, developers can download VIPanel applications along with their documentation. But VIPanel does not work on iOS (understandably, as iDashboards is built on Flash) and may not work on Android.

Unfortunately there is no facility for source control (check in, check out, version control), a feature which is important if more than one developer/user is working on the same dashboard. But there is a feature for migrating a dashboard from a Dev/UAT environment to the Prod environment and vice versa. We can export any chart or dashboard from one environment to another.

3. Data Sourcing

iDashboards reads RDBMS tables via ODBC or JDBC. It also reads Excel. It can read SQL Server using the native driver (version 6.5 to 2014), but the UI is not user friendly, i.e. we need to type the server name, database name, instance name, port number, and table name pattern. It is not obvious that iDashboards can join multiple data sources. It seems (very vaguely) that using the "Custom SQL" data source we can join multiple tables, but it is not obvious nor certain. As a workaround, where joining tables is not possible, we can create a view on the RDBMS. We can (definitely) use a stored procedure as a data source, and when doing so we can supply parameters. We can create a custom drop-down list for the users to choose from as a parameter.

iDashboards has 3 data types: string, number and datetime. Data from RDBMSs and spreadsheets is mapped into these 3 types. I don't see that iDashboards can read XML files, with or without an XSD/schema. It cannot read Analysis Services cubes, or any other OLAP cubes. It can only read tables and files. There is a facility to pivot data, which I found very useful when I was developing dashboards using Spotfire.

Users can create calculations, for example column1 / column2. The functions are very basic; only a few are available: min, max, avg, total, median, standard deviation. iDashboards also has an if statement to enable branching. We cannot filter on a calculated column. We can read a text file, or a stored procedure. iDashboards seems to have quite an extensive range of date functions (they are called "date macros"), from "current month" to "beginning of current month + 3, MMMM d, yyyy", which means that if today is 15th July 2014 it returns "October 1, 2014" (the first day of the month 3 months from now).
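To make the date-macro semantics concrete, here is the same calculation in plain Python. This is just an illustration of the logic, not iDashboards syntax:

    from datetime import date

    def beginning_of_current_month_plus(months, today):
        # First day of the month that is `months` months after the current month
        total = today.month - 1 + months
        return date(today.year + total // 12, total % 12 + 1, 1)

    # If today is 15th July 2014, this prints "October 1, 2014"
    result = beginning_of_current_month_plus(3, date(2014, 7, 15))
    print(f"{result.strftime('%B')} {result.day}, {result.year}")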

4. User Collaboration

iDashboards' security is Single Sign On (SSO) friendly, meaning that users don't have to log in to iDashboards. When users access the dashboard, the dashboard server automatically recognises who the users are, and whether they are allowed to access that dashboard or not. "Who can access what" is stored in the iDashboards database in SQL Server (or Oracle, DB2, Sybase). This database is called the "repository". Different users see different data depending on their permissions. Depending on their roles, users can copy and modify dashboards, and save and share their versions with their colleagues. A dashboard consists of several charts. Charts and dashboards are grouped, and the users are grouped, then the permissioning assigns which user groups can access which dashboard groups.

In iDashboards we have a function to get the current user, which is very useful to pass into the RDBMS to get "different data for different users". The "filter on user" facility is more direct: we set the chart itself to present different data to different users.

In iDashboards, users cannot write comments about a particular chart within a dashboard and collaborate using these comments. There is no integration with SharePoint, i.e. the permissioning model does not inherit SharePoint security, such as folders and sites, so users can't collaborate the SharePoint way, i.e. via web parts. Still on collaboration, in iDashboards we don't have "corporate templates", i.e. a place where all standard styling is stored to ensure that dashboards adhere to the corporate standard. All standard appearance and styles need to be implemented manually on each dashboard.

Excel is always a big topic in the user community when it comes to collaboration. "Can I export my dashboard to Excel?" is the most frequent question every BI user asks (because they want to share the data with their colleagues using Excel, merged with other data that they have). "Can I import my Excel file into my dashboard?" is the 2nd most popular question. In iDashboards the answer is yes to both questions. A dropdown list in a dashboard can be populated from a table in Excel, and the chart data can be sourced from a table in Excel. We can right-click on any chart and export the chart data to Excel. It is not native Excel format, but CSV, which can be opened in Excel.

Printing: users can print a single chart or the entire dashboard. Most dashboard software is web based and we can't rely on IE's printing feature, because this will include the menu bar of the browser. So an important question to ask is: "Does the print feature print just the dashboard/chart, or the whole browser window?" The answer is just the chart/dashboard, not the browser window. iDashboards has a "Print Dashboard" button which prints just the dashboard.

5. Architecture

iDashboards, like all other good dashboard software, consists of several components: a repository/database, a development environment, and a dashboard engine/server. On the server side, we have the repository (which is on SQL Server or Oracle), the dashboard server (which runs on a Java web server such as Tomcat or WebSphere), and connections to all the source systems. On the client side, we have the user's PC, and on this PC we have a browser (IE, Chrome, Firefox), and in this browser we have Flash, and within Flash we have the dashboards. On the admin PC, we have the admin application, which is an HTML application.

It needs a browser with a Macromedia Flash add-on installed. This is not a problem on a PC, but it is a problem for tablets and mobile devices. Flash gives iDashboards an edge with regards to appearance, but it does limit the platforms it can operate on. There are mobiles/tablets with Flash, but they are limited.

Conclusion

As always, there are pluses and minuses. The top 3 strengths of iDashboards are: visualisation richness, self service (easy to use by business users), and I believe the cost as well. Its top 3 weaknesses are: data integration capability (the ability to combine data and perform calculations), a development environment which looks outdated, and being too simple/basic.

Would I recommend it? Yes, if your requirements are simple/not too complex and if self service is very important to your business users (they want to create the dashboards themselves). No, if you need to read cubes or perform complex calculations.

Besides the 2+3 I mentioned at the beginning of this article, there are also 5 other dashboard products which have strong visualisation abilities: MicroStrategy, Birst, Roambi, Klipfolio, Inetsoft. Outside these ten, if you want a comprehensive list of dashboard software and you would like to evaluate them yourself, they are: Prognoz, Targit, Actuate BIRT, LogiAnalytics, Cognos Insight, Spotfire, Datameer, Information Builders, Cognos Express, PowerView, SiSense, BO Exelcius, SAS, Panorama Novaview, Jaspersoft, Salient, Board, Strategy Companion, BO Webi, CubePlayer, XLCubed, BO Voyager, PowerOLAP, Bitam, VisualMining, PureShare, Varigence Vivid, dminebi, Arcplan, Gooddata, InforSense, Pentaho, Alteryx, Zoho, Domo, Xtraction, Target Dashboard, Cartodb. Some of them are not just dashboard software, but BI software which has a lot of other capabilities apart from just creating dashboards. Most of their websites can be found by googling "Dashboard + their name", but if you need their URLs please let me know. Good sources for finding dashboard software are Dashboard Insight and Dashboard Spy.

15 February 2014

Building a Data Warehouse and BI on Low Budget

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 11:34 am

The core activities in building a data warehouse (DW) and BI are:

  1. Design the dimensional DW (fact and dimension tables)
  2. Identify the data source for each column in the DW
  3. Build the ETL to populate the DW
  4. Build a cube on top of the DW

A lot of companies want to know how to build a DWBI system on a small budget. A medium-size DWBI system perhaps costs about £400k to build: £100k on hardware, £100k on software, £200k on labour. Many companies don't have £400k. They want to spend £25k on a POC, then another £75k to build the real one. Some only have £50k.

Building a DW + BI with £400k

The £400k DW is as follows. The £100k hardware is probably something like this:

  • 4 DW servers 32 GB RAM (Dev, UAT, Staging/DR, Production), 6 cores. Blade or rack with cabinet. Dev & UAT could be 16 GB 4 cores, or VM.
  • 5-10 TB disk storage (1-2 TB per environment), RAID 5, 146 GB 15k rpm, e.g. NAS or SAN including enclosure.
  • 2 ETL servers 16 GB RAM, 4 cores (1 for DEV+UAT, 1 for Production + Staging/DR)
  • 2 Reporting & OLAP servers 16 GB RAM

The £100k software is to purchase ETL and BI software, such as Informatica, QlikView or Cognos, and the database software, such as SQL Server or Oracle. Also OS e.g. Windows Server 2012, Spotlite and VMWare, team collaboration tools e.g. TFS, and other software such as JIRA, Beyond Compare, Visio, Office, MS Project, SnagIt and RedGate.

The £200k labour is for paying 5-6 FTE for 1 year: 1 DW architect & data modeller, 1 BA, 1 ETL Developer, 1 Report Developer, 1 tester, ½ PM, ½ Infra.

Building a DW and BI with £50k

To get this down to £50k we need to minimise on all 3 fronts. We need to use only 2 servers: one for Dev + UAT, one for Prod. Each server will be installed with everything: ETL, BI software and the RDBMS. We won't have DR. This could get the H/W and S/W cost down to £20k including storage and OS.

The remaining £30k is to pay 1 FTE for 1 year who does everything: requirement analysis, design, build, test and management; from data models to creating databases and tables on the database server, stored procedures, ETL packages, reports and analytic cubes. It will be difficult to find this one person, because the variety of skills required probably won't exist in one person.

With one server functioning as RDBMS, ETL, reporting and analytics, we can only have a small DWBI system. Say 1 TB total net storage, which means that the DW size is probably about 100 GB max, including staging tables.

We may be able to spend £3k on software, but not more than that. SQL Server 2012 Standard edition is about £3k for 10 users (link), and that includes the RDBMS, Reporting and Analytics. Oracle BI Standard Edition is $460 (about £345 including VAT, link) per named user, so 10 users is about £3,450. You would probably need to consider open source BI and databases such as MySQL, JasperSoft and Pentaho.

I think it is possible to build a DW with £50k, but it has to be a small, simple DW, so that a) the small infra can host it, and b) a single developer can build it within 1 year. Simple, i.e. type 1 dimensions, no accumulating snapshot fact tables (only transactional or periodic snapshot).

Building a DW and BI with £100k

With £100k we can definitely buy Microsoft BI (£3k for 10 users), or Oracle BI (£3.5k for 10 users), but not Cognos or Informatica. Cognos 10 BI for 10 users is about £30k (link). This is just Analysis Studio + Report Studio (assuming the same price) and 1 Admin pack, including VAT. BO Edge is about £27k (link). That’s for 5 users and 1 report developer. Informatica PowerCenter starts at $140k (link).

The allocation of £100k is probably something like: £30k hardware, £10k software, £60k labour. To me this sounds right for a small company. The £60k is for 2 developers for a year, with 1 slightly more senior (say a £25k-£35k split). With 2 developers a lot more can be built. The first version can be released after 6 months, and we have another 6 months for enhancing it to the users' needs.

The £30k hardware will buy us 2 servers with an ok-ish spec (something like 4 cores, 32 GB) and a good amount of storage, probably about 5 TB. We won't be able to afford a SAN, but we can afford RAID 5 on DAS.

Building a DW and BI with less than £50k

If the budget is below £50k, or the time frame is less than 6 months, I would recommend not building a proper data warehouse. We can still deliver BI without having a DW. One way is using data virtualisation such as Composite. Another way is using QlikView, Tableau or Spotfire, which read the data sources directly. In terms of cost, the latter is lower than the former.

You can attempt to build a DW (say with MS BI) but not a proper Kimball DW. I call this a Pseudo Dimensional DW. It has fact and dim tables, but no surrogate keys; the dimension tables have their natural key as the PK. This is a lot simpler and faster to build than a proper Kimball DW. For a small scale (5 dims + 2 facts), instead of taking 3 months, a Pseudo DW could be done in 1 month.
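To show what "pseudo dimensional" means in practice, here is a minimal sketch using SQLite from Python; the table and column names are invented for the example. Note that the dimension's primary key is the natural key (product_code), not a surrogate key.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        -- Dimension keyed on the natural key, no surrogate key (type 1 only)
        CREATE TABLE dim_product (
            product_code  TEXT PRIMARY KEY,
            product_name  TEXT,
            product_group TEXT
        );
        -- The fact table references the natural key directly
        CREATE TABLE fact_sales (
            sale_date    TEXT,
            product_code TEXT REFERENCES dim_product(product_code),
            quantity     INTEGER,
            amount       REAL
        );
    """)
    con.execute("INSERT INTO dim_product VALUES ('P001', 'Widget', 'Hardware')")
    con.execute("INSERT INTO fact_sales VALUES ('2014-02-15', 'P001', 3, 29.97)")
    print(con.execute("SELECT * FROM fact_sales").fetchall())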

8 November 2013

The 5 Stages of DWBI Journey

Filed under: Business Intelligence — Vincent Rainardi @ 7:39 am

During my years working in MS BI, I have seen that companies' journeys in DWBI, particularly in MS BI, can be grouped into a few stages. I'm going to write up that view, which I hope can be useful.

Stage 1: Report Developer (Cost: £25k/year. Duration: 2-3 years)

In most companies, the "BI Team" usually started with just 1 report developer. The business asked IT to produce a report from "the database" (i.e. the back end database of the business application), and this report developer sat down with the business to "define" the requirements and write the report, say using SSRS, Jasper or Crystal. At this time they were not called BI. The word "BI" was never heard anywhere in the company. It was "Report", not BI, not even MI. There were probably 5 people in IT (out of 100 employees, annual revenue of £5m) and 90% of the IT business was about desktop support, i.e. Exchange/email, file server, network, SQL Server. And this report business was only a part time job. Then the business asked for the second report. And then the third. And the report developer became busier and busier. If the first report took him 3 weeks, the 5th report only took him 3 days. He became conversant with the tool, and with the data structure in the database (of the main business application, e.g. EPOS). No formal report specification was written, just rough sketches of the report layout when sitting down with the business users.

How much? Annual spending at this stage was probably in the region of £20-25k: £14k for half an FTE (Full Time Employee) for the junior developer (because he's doing desktop support too), plus £6k for training, PC, office, expenses and pension. No cost for software. For SQL Server shops, RS came with SQL. For a MySQL shop, Jasper is free. Maybe £500 for Crystal, for the Oracle shops. I don't like the perception that RS is free, so let's calculate it. Say RS is 25% of SQL Server (the other 75% is the relational DB, IS and AS), and SQL Server is say £15k (server hardware + licences). So RS is £4k (25% of £15k), and the annual spending is 14k + 6k + 4k = £24k. In the US the numbers are probably the same, but in $, i.e. $24k. How long? A typical company would probably spend about 2-3 years at this stage. It depends on the business growth of course, and the industry sector. A law firm would be quite different from an asset management firm: the former is not IT intensive whereas the latter is IT intensive.

Stage 2: More Reports (Cost: £150k/year. Duration: 1-2 years)

As the business grew, the business users would request more and more reports from IT. This is especially true if the business application doesn't have a good reporting tool. If it can only report at transaction-by-transaction level, or customer-by-customer level, then the pressure to have reports developed by IT is bigger. If the business application doesn't have the capability to report things at the company level, or provide "time series" information which shows how the company progresses from week to week and month to month, then the pressure to create reports will be bigger. The number of reports produced by the IT department would probably exceed 50 now, probably even more than 80. Many divisions are now requesting reports: not just the sales department, but also stock management and finance. And IT needs its own reports too, out of the helpdesk calls. Report development is no longer a part time job. It is a full time job now, probably even growing from 0.5 person to 2 persons. The number of IT staff has grown from 5 to 15. The company size is probably about 200 employees, or even 400 employees, depending on the industry sector. The annual revenue is probably £30m now, with total IT spend of £0.5m, again depending on the industry. They don't just use 1 business application, but probably 4-5 applications, most of them off-the-shelf. The maintenance of these apps is outsourced. They have 1 or 2 full time report developers, and probably 1 or 2 full time application developers, most probably Web-.NET-SQL developers, or PHP-Java-PL/SQL.

I'm going to give an example in MS BI now, but it's probably a similar situation in other tech stacks. The report developer is now called an MI Report developer. He/she uses SSRS to develop the reports, and hosts them in SharePoint. There are a few reports which are scheduled and emailed out. Reports provide value to the business. The quantity of the reports and the workload are both manageable. The business loves the IT department because they can get the information out of these 4-5 business applications and deliver this valuable information to the business. It takes 2-3 days to produce a new report, but sometimes 2 weeks because there is a queue. Prioritisation works quite well. Some of the reports are beginning to go unused, particularly those which were developed 2 years ago; they are now superseded by newer reports. So they have an initiative to archive reports which are not used. They monitor the usage of the reports. The IT manager wants to make sure that the reports are worth the time invested to create them. These usage reports are a worthwhile tool for the IT manager to put forward good arguments to those business users who ordered reports but are not using them actively.

How much? The cost of MI is probably about £150k now. The salary of 2 report developers = 2 x £28k = £56k. Maybe 3 developers = £84k. Plus 3 x £6k for expenses & overhead (PC, office, training, pension) = 18k + 84k = £102k. Now IT has a dedicated SQL Server for MI (£15k for H/W + S/W), probably 2 servers (£30k), plus some SharePoint licences to host the RS reports (say £10k), probably plus £5k in consultant fees to set up/configure RS/SharePoint. So the annual spending is 102k + 30k + 10k + 5k = £147k. How long? This stage is surprisingly fast, probably only 1-2 years, at which point the company will experience stage 3: too many reports.

Stage 3: Too Many Reports – Self Service BI (Cost: £150k/year + £50k one-off. Duration 1 year)

By the time the report development workload reaches 3 FTEs (Full Time Employees), or £150k per year, the IT manager becomes very interested in the idea of Self Service BI. The idea of creating a data warehouse and cubes on top of it becomes a good business proposition from the cost point of view. The users will be able to use Excel to query the cubes. They may consider the idea of buying BI software like BO, Cognos, MicroStrategy or QlikView, but the cost probably becomes a barrier. It's incredibly difficult to get the prices (if you are not buying) but I'll take a stab, otherwise we won't have an idea about the cost. If I'm way off, I'll be glad if the corresponding vendors can correct me. QlikView is probably about £150k for 100 users, i.e. £1500 per named user including document licence. Plus £50k for live & test server licences. Plus £10k for the SharePoint component + workbench. Plus £10k for server/infra cost. Plus £10k for training and £20k for consulting services. So a total of 150k + 50k + 10k + 10k + 10k + 20k = £250k one-off cost, plus 20% annual cost for support (£50k). BO is £35k per 25 CALs, x4 for 100 users = £140k. Plus 3 x £400 = £1200 for developer licences. Say £50k for training, consultants and servers. So £200k one-off, plus probably 20% = £40k annually. Cognos and MicroStrategy are probably similar, i.e. in the region of £200-250k one-off cost, plus £40-50k annually, for 100 business users. So it's a no-goer if the Stage 2 cost is £150k/year, because the developer cost won't go away but will increase, as they have to hire a QlikView/BO/Cognos developer or retrain existing staff.

Hence the idea of doing self service using a data warehouse + SSAS cubes starts to take off in the head of the IT manager. The numbers look good: SSAS is free. At least that was the initial perception. SharePoint is already used for SSRS, and Excel is widely deployed to each desktop, so no additional cost then! And this Data Warehousing stuff will only cost 1 new server, with big storage, so £20k. Say 2 new servers, plus storage = £40k. Much more palatable than the £250k of the popular off-the-shelf BI software above. There is a training cost, i.e. SSAS, SSIS, modelling, say £10k. This initiative double-acts as a "retainer" too, from an HR point of view: good people stay because they can develop their skills, getting them ready to go to the next level in their career. And the whole DW OLAP thing is a new world for them. Of course they'd stay. Eagerly. At this point they learn about dimensional modelling, about ETL, and about data quality. They learn about metadata and about the technicality of building cubes. After struggling for about 6 months, in their spare time, they managed to construct 1-2 fact tables and a few dimensions in the data warehouse, and built a cube on top of it. The demo was a great success; the business absolutely loved it. I've attended a "first business demo" (on SSAS) about 8 times now and I haven't experienced a single event where the business was not excited seeing their numbers being sliced and diced in front of them, turned into time series charts and gadgets.

Now, with all the data being delivered to them via the cubes, the business can do self service BI. What they loved in the first few weeks was that they didn't need to wait 3 weeks for IT to create a report for them; they could directly get the information they needed themselves. And what the IT manager loved in the first few months was that the number of report requests came down a bit. The number of reports each quarter was something like: 100, 120, 150, 200, 160. They managed to knock some reports off the list (in this example 40) because the information need could be satisfied/fulfilled from the cubes, i.e. self service BI.

At this point they would also consider 2 kinds of software which help with the development of the data warehouse: a one-click warehouse builder (such as RED or Kalido), which can build a data warehouse very quickly, and data virtualisation software (such as Composite), which can deliver BI without the pain of building a data warehouse. So they try out 1 or 2 of these products, hoping that they could lighten their workload. A few companies might switch and abandon MS BI, but the majority would stay with MS BI and only use tools that complement MS BI.

How much and how long? The one-off cost is about £50k: £40k for servers and £10k for training. The annual spend is the same as Stage 2, £150k/year (£100k for 3 people, £50k for infra). This is for 100 business users. How long? Usually companies stay at this stage only for a short period, i.e. 1 year. This is the period when the initial data warehouse is constructed and the cubes developed. After the initial version is deployed, the IT department will have many requests to put more and more data into the DW and cubes. And that's Stage 4.

Stage 4: Growth of DW and Cubes (Cost: £200k/year. Duration: 2-3 years)

There were many requests to put various attributes and measures into the cubes and warehouse. Can you put the Sales Person into the cube please? Oh, and supplier please. And the store manager names please? There will be a change of product grouping next month, can you accommodate that please? We have a new price list coming this quarter, that needs to go into the warehouse. Can we have the employee hierarchy in the cube? You get the picture. Lots of changes and new requests to add more data into the warehouse and the cubes.

The danger is, the "BI Team" is still new to DW. They just learned yesterday what a dimension is, what a fact table is, and what grain is. They are inexperienced in data modelling, and if they get it wrong, it costs a lot of working days to correct it. Most probably weeks, not days. They read Kimball's data modelling book and expect to see a design that suits their case, but there are a lot of specific points which are not mentioned in the book. They try to browse the internet for information about data modelling, but get even more confused. And they need even more guidance on the ETL side. They are clueless about how to do SCD type 2. They hope the SSIS SCD2 task will solve it for them, and it does to some degree, but alas, the population of the fact tables now becomes an issue. Plus performance is an issue, and metadata is an issue. At this point, they seriously think about hiring a data architect. Someone who has been designing and building data warehouses and cubes in different companies. Someone who knows how the ETL should be built. So they hire a data architect, say £50k/year. He knows about data modelling, ETL, and SSAS. And they hire an ETL developer too: an SSIS + T-SQL guy who has some exposure to cubes. This guy has a lot of experience moving data from one place to another, i.e. data integration experience. One of the 3 report developers now becomes an ETL developer. They are now called a BI Team, with 5 people: 2 report developers, 2 ETL developers, and 1 data architect who also does SSAS.
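For readers at this stage, the core of SCD type 2 is simpler than the tooling around it suggests: when a tracked attribute changes, expire the current dimension row and insert a new version. Here is a minimal sketch of that logic in Python, with invented columns and no tool specifics:

    from datetime import date

    # Current dimension rows: natural key -> live row (expired versions go to history)
    dim_customer = {
        "C001": {"surrogate_key": 1, "city": "London",
                 "valid_from": date(2013, 1, 1), "valid_to": None, "is_current": True},
    }
    history = []
    next_key = 2

    def apply_scd2(natural_key, new_city, load_date):
        """Expire the current row and insert a new version if the city changed."""
        global next_key
        current = dim_customer.get(natural_key)
        if current and current["city"] == new_city:
            return                              # no change, nothing to do
        if current:
            current["valid_to"] = load_date     # expire the old version
            current["is_current"] = False
            history.append(current)
        dim_customer[natural_key] = {
            "surrogate_key": next_key, "city": new_city,
            "valid_from": load_date, "valid_to": None, "is_current": True,
        }
        next_key += 1

    apply_scd2("C001", "Leeds", date(2013, 11, 8))
    print(dim_customer["C001"], len(history))   # new version plus 1 expired row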

Some of the SSRS reports are now decommissioned. They manage to get the number of reports down, say from 200 to 100. Some of these reports are re-pointed to the data warehouse: instead of reading from a copy of the business application database, the SSRS reports read from the data warehouse tables. Initially the DW needs to be reconciled with a copy of the business application (of course!), but as time moves on, there are measures created in the warehouse which don't exist in the live app, for example ranking measures such as decile and quartile, and various ratios between original measures such as financial ratios.
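Ranking measures like the deciles mentioned above are easy to derive once the data is in the warehouse. A small sketch with made-up numbers:

    # Assign each customer a decile (1 = top 10%) based on a measure such as revenue
    revenues = {"C001": 520.0, "C002": 80.0, "C003": 310.0, "C004": 45.0, "C005": 900.0}

    ranked = sorted(revenues, key=revenues.get, reverse=True)
    deciles = {}
    for position, customer in enumerate(ranked):
        # position / count is the fraction of customers ranked above this one
        deciles[customer] = int(10 * position / len(ranked)) + 1

    print(deciles)  # {'C005': 1, 'C001': 3, 'C003': 5, 'C002': 7, 'C004': 9}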

Business users enjoy the benefit of having a lot of data at their fingertips. They can have the information instantly, without waiting 3 weeks as before for the reports to be created by the report writer.

How much and how long? Surprisingly, instead of decreasing, the cost of BI is increasing. It's now 5 people instead of 3, so 4 x £28k + 1 x £50k = £162k. Plus £40k of annual infra cost such as additional capacity, replacement servers, etc. So about £200k/year. How long? A typical period is 2-3 years, after which the business users become dependent on BI. The data coming from the DW via cubes and reports really drives the business forward.

Stage 5: Users Became Dependent on BI (Cost: £50k one off + 10k/year. Duration: 3-4 years)

One by one, each business area uses the new technology (DW & cubes): the sales department, manufacturing department, finance department, marketing department, even the HR department. In the insurance industry: underwriting, claims, reinsurance, actuarial, finance. In the utility industry: sales, customer service, operations, finance. The information from BI affects the business decisions in each area: premium at renewal, pricing policies, CRM strategy, product design, marketing strategy, production cost control, etc. Even more requests come in to put more and more data into the warehouse, cubes and reports.

Companies at stage 5 usually look for BI tools. Visualisation tools. And data warehouse tools such as DQ and ETL. They want to make the lives of their users easier with BI and visualisation tools, and their own lives (IT) easier using various development tools. As the development work grows, they need to tidy up governance, version control & code repository, request & bug management, release management and data quality. Software like Red Gate, SQL Sentry, JIRA and TFS would come into their consideration. Red Gate and SQL Sentry are SQL Server development tools (and server monitoring). JIRA is a request and bug management tool. TFS is a code repository and version control tool; it also does task management such as Sprint planning.

With regards to data visualisation, software like Cube Player, Power BI, Strategy Companion, Panorama, XLCubed, Pyramid, Varigence Vivid, Bonavista Dimensions and Tableau would come into the picture of any MS BI shop. Out of the 15-20 SSAS clients out there (probably 30 actually), most companies only know 1 or 2. The complete list is at Vidas Matelis' ssas-info website, the most content-intensive website on SSAS. At this stage, companies are trying out these SSAS clients, and eventually get hooked on one of them. I can't recommend which one (even though I have a personal preference); hmm, it would probably be good to write about SSAS clients actually. I just browsed for them and very little is written comparing them. The most notable comparisons are here, here and here.

Apart from SSAS clients, companies also frequently think about reporting tools, i.e. what's better than RS? If their budget is big enough, they will probably be approached by (or think about) QlikView and Spotfire. If their budget is minimal they probably think about Jasper and Pentaho. QlikView, Spotfire, Hana, TM1 and PowerPivot are all "in memory OLAP". Unlike SSAS, which stores the multidimensional structure on disk, these tools read relational tables and create the multidimensional structure in memory, much like MicroStrategy. TM1 is not quite like that: it stores its proprietary multidimensional structure as files. Ditto QlikView.

How much and how long? It depends on the tools they select and the industry they are in. For retail, telco, shipment, manufacturing, legal, mining, transport and utilities, it is about 2-3 years. For banks, fund managers, healthcare and insurance, probably 3-4 years. Could be 5. For costing, let's use Tableau. The cost of implementing Tableau on top of DW + SSAS is surprisingly low: £50k one off + £10k/year. Developers use Desktop Pro, which is $1999 + $199/year maintenance (source: link). Business users become report creators; they use Tableau Online licences, which are $500/year. And report readers are free. So if we have 3 developers and 15 report creators the cost is about (assuming the price is the same in the UK but in £): 3 x £2k = £6k one off, plus an annual cost of 3 x £200 + (15+3) x £500 = £9,600/year. Plus £40k for training, servers and consultants, it's around £50k one-off project cost and £10k/year ongoing cost.

After

What's after stage 5? BI is stable, mature and managed well. Operations are governed. What usually happens at this stage is a complete change of technology, for example an implementation of Cognos or Business Objects. Yes, replacing the whole Microsoft BI stack with the Cognos or BO stack, leaving just the SQL Server DB engine, which could be replaced with Oracle too. The other approach is to stay on the Microsoft BI path, and look at a few things:

a. Parallel Database

For high-volume industries such as retail and telecom, as the business grows, the number of transactions per day could be in the millions. This requires some kind of MPP (Massively Parallel Processing), i.e. a parallel database engine: Netezza, PDW (Parallel Data Warehouse, Microsoft's new MPP), Exadata, Greenplum or Teradata. Or a column-oriented database such as Vertica or Sybase IQ.

b. Data mining
For certain industries, CRM and marketing are vital, particularly in the retail sector. They could look into integrating CRM and marketing campaign management with their data warehouse. This requires data mining and forecasting, which can be done in SSAS, but some companies use SAS, SPSS, BIRT, or even do it manually using .NET or R.

c. Big Data

This is where most companies after stage 5 are going, or at least trying to go: MapReduce, Hadoop, MongoDB, Hive, and things like that. The project cost is not high, because it runs on commodity hardware and the software is mostly free. But the benefit is potentially big. Hence the motivation.

d. Text Mining

For certain industries such as healthcare, insurance and retail, the ability to crawl over millions of patent documents, email correspondence with customers, and thousands of policies has good business value. Software like Clarabridge, Angoss, Attensity or NetOwl can be implemented for this purpose (mining text within documents or emails for information).

For IT managers, I hope this helps broaden your DWBI horizon, whatever stage you are in. For vendors, I'd be grateful if you could let me know if I'm wrong, particularly about the costs. For developers, I hope this article clarifies a bit about the future of your career path, if you stay within the DW BI world. There are many experts out there and I'm not one of them, so I would be grateful for your comments.

VR, 9/11/14:

  • Andrei Padrei on QlikView and Tableau prices: link
  • Tableau prices: link

17 December 2012

Composite

Filed under: Business Intelligence — Vincent Rainardi @ 8:19 pm

VR, Update 11/10/2015: Composite is now formally called Cisco Data Virtualisation Suite

Composite enables us to integrate data from various sources to provide BI. This approach is called data virtualisation (DV). It is the opposite of a data warehouse (DW): instead of collecting data from various sources and storing it in a new data store, we query the sources directly.

The biggest advantage of using Composite is its speed to market. Without building a data warehouse, we can provide the BI in 2-3 months, as opposed to 12-18 months. We don’t need to design the DW (3-6 months), we don’t need to design the ETL (6-9 months), and we don’t need to spend a lot of time testing the ETL (2-3 months).

As a DV tool, Composite is very good. It has its own SQL-like language with many functions. It has adapters for most RDBMSs, as well as Hadoop and Teradata, plus flat files and Excel. We can create views and procedures which integrate many sources. We can cache them so that the BI tool gets fast query performance. The cache can be a database table, or we can cache it in Hadoop, so that queries are super fast.

Rather than caching the data, we can also build loading procedures to retrieve data from the source systems incrementally. This way we lighten the burden on the source systems. We can also set up various triggers to invoke those loading procedures, either on a timer or event-based. And we can create a "batch" procedure which calls the loading procedures one by one, so there is none of the waiting and delay that happens if we arrange them on timers.
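Composite implements this with its own procedures and triggers; purely to illustrate the pattern (not Composite's syntax), here is a Python sketch of a high-watermark incremental load plus a batch driver that runs the loaders back to back. The table and column names are made up.

```python
# Conceptual sketch of incremental loading with a high watermark, plus a
# "batch" driver that calls the loaders one by one. Not Composite code;
# table/column names are invented.
from datetime import datetime

watermarks = {"orders": datetime(2012, 12, 1), "customers": datetime(2012, 12, 1)}

def load_incremental(table, source, cache):
    """Pull only rows changed since the last run, then advance the watermark."""
    since = watermarks[table]
    new_rows = [r for r in source[table] if r["updated_at"] > since]
    cache[table].extend(new_rows)
    if new_rows:
        watermarks[table] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

def batch(source, cache):
    """Run the loaders back to back, so no loader sits waiting on a timer."""
    for table in ("orders", "customers"):
        print(table, load_incremental(table, source, cache), "new rows")
```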

On the publication side, there is a facility to publish the views and procedures that we created as web services. These can then be consumed by BI tools, for example by Spotfire Information Designer.

Those are the strong points. What are the weaknesses? The scripting language is primitive; compared to Java, C# or even PL/SQL there is a lot of ground to make up. The data manipulation functions are poor; in this area Informatica's functionality is far superior. The programming environment is also poor. It's like going back to 1992: no debugging, no breakpoints, no watch window.

The database functionality is also weak. For example, we can't truncate a partition, something that is very basic in data loading. Data inspection (and DQ in general) is also weak: we can't easily generate statistics on the source data, e.g. the distribution of values.

Would I recommend Composite for BI projects? Yes, of course, because it provides an excellent, easy-to-use tool to integrate data from many places for consumption by the reporting/BI tool. It can cut development time significantly, as in 3 months instead of 12.

Disclosure: I receive no financial benefit from any company for saying/writing this. The opinion is entirely mine, and it doesn't necessarily reflect the opinion of my employer or my clients. I have used both Informatica and Composite hands-on, as well as SSIS, BO, Oracle and Spotfire. As always I may be wrong and I'm open to suggestions. If I said something incorrect above, I would be happy to be corrected.

Vincent Rainardi, 17/12/2012

2 December 2012

Tibco Spotfire

Filed under: Business Intelligence — Vincent Rainardi @ 12:49 am

Like QlikView, Spotfire provides the capability to serve BI users with analytical reports, with which users can slice and dice the data and export to Excel/PDF, using various types of visualisations including pivot tables and charts.

Unlike SSAS, Spotfire doesn't require an OLAP cube to be built first. We define a set of data tables which are then used by the visualisations.

Spotfire has good back-end capability, called Information Designer. In there we can create columns and procedures, from which we then create Information Links which feed the data to the reports. Info Links accept parameters and filters, which makes them flexible enough to retrieve only the data we need.

Most of a Spotfire developer's time is spent on three things:

  1. Create information links (20%)
  2. Create data tables (30%)
  3. Create visualisations (30%)

Once the Info Links are set up, developers usually repeat 2 & 3 with the users in an iterative loop. Occasionally, they also do some automation (using Automation Services) and administration (arranging library content/folders and administering users/access).

Spotfire also has a development kit which plugs into Visual Studio (using a .NET language such as C#), where we can write code to customise the visualisations and data tables. Alternatively, we can do the programming inside Spotfire itself, i.e. as code behind a button in a text area, using IronPython as the scripting language. The scripting capability has variables, branching and looping as usual, plus API access to all the visualisation and data table objects within Spotfire.
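As an illustration, below is the kind of script that sits behind a button in a text area. I'm writing it from memory, so treat the exact API members as an assumption; the point is that the Document object gives the script access to the pages, visualisations and data tables.

```python
# Example IronPython script behind a Spotfire text-area button (from memory;
# exact API members may differ between Spotfire versions). 'Document' is a
# built-in global provided by the Spotfire script context.

# List every data table loaded in the analysis.
for table in Document.Data.Tables:
    print table.Name, table.RowCount

# Rename the visualisations on the active page.
for visual in Document.ActivePageReference.Visuals:
    visual.Title = visual.Title + " (reviewed)"
```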

Like most BI tools, Spotfire is a rapid development tool. An analytical report is typically developed within a week, from the point where we sit down with the business (or a BA) detailing the requirements to the point where we present the first draft back to the business. From then on there are of course many iterations to fine-tune it until we get the finished product. This fine-tuning typically takes 2-3 weeks, depending on how wide the business community involved is and how fast or slow their responses are. And of course it depends on how complex the report is. A one-pager with 2 visualisations and 3 data tables is about 2 days of development time; four pages with 7 different visualisations and 15 data tables take about a week.

Most of the issues in Spotfire development lie in the data tables. This is not specific to Spotfire; it is a general phenomenon in BI. The data is a very difficult thing to get right. The reports, on the other hand, are easy: they just reflect the numbers in the data. If somebody could guarantee that the data in the database/data warehouse was 100% correct and accurate, then producing accurate reports would be easy. I would say that 90% of the BI effort goes into preparing a database that is accurate, and only 10% goes into the reporting.

Overall Spotfire is a good BI tool for presenting data to the users and letting them analyse it. It has a Web Player and an Enterprise Player: the Web Player (WP) runs in a web browser as a thin client, and the EP is a thick client. Like other BI tools, the thick client has more features than the web client. Your slice-and-dice capability is more limited if you use the Web Player; for example, you can't change the axes.

Strengths of Spotfire? I would say the back end, i.e. Information Designer and the data tables. Also the calculated columns: there is so much we can do there, with very rich functions. Another strength is the API. Weaknesses? The visualisations, in my opinion. For example, arranging visualisations on a page is unbelievably difficult. Bar charts: we can't put the numbers on each bar. Using filters to slice data is difficult. Tableau and Strategy Companion are a lot easier to use when it comes to slicing and dicing the data.

Would I recommend it as a BI tool? Absolutely. Along with QlikView, I think Spotfire is one of the best visualisation/reporting tools in the BI world.

14 July 2012

What is Big Data, Data Warehouse, Data Mining

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 6:35 am

Michael Koploy from Software Advice gathered the opinions of 5 BI experts on the definitions of 3 BI terms that puzzle many people:

  • What is Big Data
  • What is a Data Warehouse
  • What is Data Mining

The article is here. The 5 experts are:

My definitions:

What is Big Data? (from Hadoop in Data Warehousing article in my blog)

What is big data? Again, it is easier to learn by example than by definition. Big data is images/photos/video/music streams (and files too), unstructured text (documents, emails), website log files, and outputs (streams) from scientific/industrial instruments and sensors such as thermometers and digital pneumatic/pressure gauges.

Big data does not have to be big (peta/exabytes). Even 50 GB can be considered big data if the structure is too complex for a normal RDBMS to store.

What is small data? Small data is simple data structures, e.g. numbers (be they monetary, integers, fractions or floating points), strings (names, descriptions, types), dates, times, and all the data we have come to know in the last 30 years of data warehousing history.

A large collection of small data is also big data. Again, I have to put a number here so you have an idea of how big is big; I don't like being vague. A collection of simple data structures of more than 1 petabyte is big data.

Bear in mind that now is 2012. Petabytes (PB) is the 2012 term, but 5 years ago, in 2007, it was terabytes (TB). So in 2014 the definition will change to: a collection of simple data structures of more than 1 exabyte is big data. And in 2015 the definition will change to "more than 1 zettabyte".

In a few years' time, these seemingly foreign words will be common words in IT: petabytes, exabytes, zettabytes, yottabytes. Here is the Wikipedia page for their definitions: link.

What is a Data Warehouse?

From my book: (Amazon.com, Amazon.co.uk, Apress, my blog)

A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalized data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Below is a diagram of a data warehouse system: (from my book)

Below is a short explanation about each component in the above diagram, and how the components work together in a data warehouse system: (again, this is from my book)

The source systems are the OLTP systems that contain the data you want to load into the data warehouse. Online Transaction Processing (OLTP) is a system whose main purpose is to capture and store the business transactions. The source systems’ data is examined using a data profiler to understand the characteristics of the data. A data profiler is a tool that has the capability to analyse data, such as finding out how many rows are in each table, how many rows contain NULL values, and so on.

The extract, transform, and load (ETL) system then brings data from various source systems into a staging area. ETL is a system that has the capability to connect to the source systems, read the data, transform the data, and load it into a target system (the target system doesn’t have to be a data warehouse). The ETL system then integrates, transforms, and loads the data into a dimensional data store (DDS). A DDS is a database that stores the data warehouse data in a different format than OLTP. The reason for getting the data from the source system into the DDS and then querying the DDS instead of querying the source system directly is that in a DDS the data is arranged in a dimensional format that is more suitable for analysis. The second reason is because a DDS contains integrated data from several source systems.

When the ETL system loads the data into the DDS, the data quality rules do various data quality checks. Bad data is put into the data quality (DQ) database to be reported and then corrected in the source systems. Bad data can also be automatically corrected or tolerated if it is within a certain limit. The ETL system is managed and orchestrated by the control system, based on the sequence, rules, and logic stored in the metadata. The metadata is a database containing information about the data structure, the data meaning, the data usage, the data quality rules, and other information about the data.
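As a concrete (and deliberately simplified) illustration of the data quality step, here is a small Python sketch, with made-up rules and row layout, showing bad rows being diverted to the DQ store instead of the DDS.

```python
# Simplified sketch of an ETL data quality check: rows that break a rule are
# diverted to the DQ store; the rest go to the DDS. Rules are illustrative only.

def check_row(row):
    """Return the list of rule violations for one incoming row."""
    issues = []
    if row.get("customer_id") is None:
        issues.append("missing customer_id")
    if not (0 <= row.get("unit_price", 0) <= 10_000):
        issues.append("unit_price outside tolerated range")
    return issues

def load(rows, dds, dq_store):
    for row in rows:
        issues = check_row(row)
        if issues:
            dq_store.append({"row": row, "issues": issues})  # report, then fix at source
        else:
            dds.append(row)                                  # good data goes to the DDS
```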

The audit system logs the system operations and usage into the metadata database. The audit system is part of the ETL system that monitors the operational activities of the ETL processes and logs their operational statistics. It is used for understanding what happened during the ETL process.

Users use various front-end tools such as spreadsheets, pivot tables, reporting tools, and SQL query tools to retrieve and analyse the data in a DDS. Some applications operate on a multidimensional database format. For these applications, the data in the DDS is loaded into multidimensional databases (MDBs), which are also known as cubes. A multidimensional database is a form of database where the data is stored in cells and the position of each cell is defined by a number of variables called dimensions. Each cell represents a business event, and the values of the dimensions show when and where this event happened.
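A deliberately simple way to picture this: each cell holds the measures, and its position is the combination of dimension members. The Python sketch below (with invented data) shows cells keyed by date, store and product, and a "slice" that fixes one dimension.

```python
# Conceptual picture of multidimensional cells: the key is the combination of
# dimension members, the value is the measures. Data is invented.
cube = {
    ("2012-07-01", "London", "Pop CD"):   {"sales_amount": 1250.0, "units": 90},
    ("2012-07-01", "Leeds",  "Jazz MP3"): {"sales_amount":  310.0, "units": 41},
}

# "Slicing" = fixing one dimension member and aggregating over the rest.
london_sales = sum(
    measures["sales_amount"]
    for (date, store, product), measures in cube.items()
    if store == "London"
)
print(london_sales)   # 1250.0
```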

Tools such as analytics applications, data mining, scorecards, dashboards, multidimensional reporting tools, and other BI tools can retrieve data interactively from multidimensional databases. They retrieve the data to produce various features and results on the front-end screens that enable the users to get a deeper understanding about their businesses. An example of an analytic application is to analyse the sales by time, customer, and product. The users can analyse the revenue and cost for a certain month, region, and product type.

What is Data Mining? (from my book)

Data mining is the process of exploring data to find the patterns and relationships that describe the data and to predict the unknown or future values of the data. The key value of data mining is the ability to understand why some things happened in the past and the ability to predict what will happen in the future. To refer to predicting the future with regard to data mining, some people use the term forecasting, and some call it predictive analytics. On the other hand, when data mining is used to explain the current or past situation, it is called descriptive modeling, descriptive analytics, or knowledge discovery.

Implementing data mining in the business is growing by the day, both through descriptive and predictive analytics. Using data mining, we can find the correlation between purchase patterns and customer demographics. For example, in our Amadeus Entertainment case study, we can find whether there is a relation between the sales of a music product type and the customer interest or occupation. In the financial services industry, data mining is traditionally used for credit scoring, for example, to predict the ability and willingness to pay the loan based on (each country has different factors) payment behavior, past delinquency, current debt, credit history, number of inquiries, address, bankruptcies filed, and customer demography.
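To show the predictive side in miniature, here is a toy credit-scoring sketch in Python using logistic regression; the features and data are invented purely for illustration.

```python
# Toy credit-scoring model (logistic regression). Features and labels are
# invented; a real model would be trained on historical loan performance.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: past delinquencies, current debt (thousands), number of credit inquiries.
X = np.array([[0, 5, 1], [2, 40, 6], [0, 12, 2], [3, 55, 8], [1, 20, 3], [0, 8, 1]])
y = np.array([0, 1, 0, 1, 1, 0])          # 1 = defaulted on the loan

model = LogisticRegression().fit(X, y)

new_applicant = np.array([[1, 30, 4]])
print("Probability of default:", model.predict_proba(new_applicant)[0, 1])
```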

In business intelligence, popular applications of data mining are for fraud detection (credit card industry), forecasting and budgeting (finance), cellular/mobile package development by analyzing call patterns (telecommunication industry), market basket analysis (retail industry), customer risk profiling (insurance industry), usage monitoring (energy and utilities), and machine service times (manufacturing industry).
