February | 2014 | Data Platform and Data Science

15 February 2014

Dedup

Filed under: Business Knowledge,Data Warehousing — Vincent Rainardi @ 5:23 pm

This little word often puzzled many people. What is it? Is it de-duplication? Removing the duplicates? What duplicates? In which table, dimension or fact? Is it only used in data warehousing? I came into this word / situation again yesterday and as this is the 4th time I came across it I thought I’d write it down. Probably useful for other people, while I still got the chance.

Dedup is the short for de-duplication. In data warehousing, in the source tables we sometimes found a table without a primary key. Or a unique index for that matter. This means that this table has a potential to have duplicate rows. Meaning that two or more rows with the same value across all columns. Or at least across the columns which we are interested in.

For example, in a hedge fund database you come across these 2 rows (this is a made-up data, not real):

Each of the 2 rows above contains data about a fund called Good Income B. They are identical.

So must we do to bring duplicate data into the warehouse? We de-duplicate them, by doing select distinct (or using group by). From 2 rows, we make them 1 row.

What is the de-duplication criteria? We need to have sufficient business knowledge to be able to determine the de-dup criteria. It is usually not all columns, but only a few of them. In the case of fund dimension like above, usually the business identifier is SEDOL. In the absence of Sedol or Fund ID, we use Fund Name and Sub Type (which is either Income or Accumulation).

To bring different share classes (clean vs not), we need to include AMC column. Because the clean share classes have lower Annual Management Charge e.g. 0.75% instead of 1.5%. But to differentiate between retail and institution share classes we need to include Min Initial Amount. Because usually the retail minimum amount is £500 or £1000, whereas Institution minimum amount is £50k or £100k.

Once we determine the de-dup columns, we can run group by over those columns to find the duplicates. Now we need to have another business decision, about which row we would like to bring to the Data Warehouse. In the above case, if the Fund Name + Sub Type + AMC + Min Initial Amount are the same between 2 rows, but their Lipper Preservations are different, which one are we going to bring? They lower, the higher, or any?

In this case the business knowledge (asset management industry) is required. In the above case the row with the latest Lipper scores needs to be brought into the data warehouse.

Building a Data Warehouse and BI on Low Budget

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 11:34 am

The core activities in building a data warehouse (DW) and BI are:

Design the dimensional DW (fact and dimension tables)
Identify the data source for each column in the DW
Build the ETL to populate the DW
Build a cube on top of the DW

A lot of companies want to know how to build a DWBI on small budget. A medium size DWBI perhaps cost about £400k to build: £100k on hardware, £100k on software, £200k on labour. Many companies don’t have £400k. They want to spend £25k for POC, then another £75k to make the real one. Some even only have £50k.

Building a DW + BI with £400k

The £400k DW is as follows. The £100k hardware is probably something like this:

4 DW servers 32 GB RAM (Dev, UAT, Staging/DR, Production), 6 cores. Blade or rack with cabinet. Dev & UAT could be 16 GB 4 cores, or VM.
5-10 TB disk storage (1-2 TB per environment), RAID 5, 146 GB 15k rpm, e.g. NAS or SAN including enclosure.
2 ETL servers 16 GB RAM, 4 cores (1 for DEV+UAT, 1 for Production + Staging/DR)
2 Reporting & OLAP servers 16 GB RAM

The £100k software is to purchase ETL and BI software, such as Informatica, QlikView or Cognos, and the database software, such as SQL Server or Oracle. Also OS e.g. Windows Server 2012, Spotlite and VMWare, team collaboration tools e.g. TFS, and other software such as JIRA, Beyond Compare, Visio, Office, MS Project, SnagIt and RedGate.

The £200k labour is for paying 5-6 FTE for 1 year: 1 DW architect & data modeller, 1 BA, 1 ETL Developer, 1 Report Developer, 1 tester, ½ PM, ½ Infra.

Building a DW and BI with £50k

To get this down to £50k we need to minimize all 3 fronts: we need to use only 2 servers. One for Dev + UAT, 1 for Prod. The server will be installed with everything: ETL, BI software and RDBMS. We won’t have DR. This could get the H/W and S/W cost down to 20k including storage and OS.

The remaining £30k is to pay 1 FTE for 1 year who does everything from requirement analysis, design, build, test, manage. From data models, create databases and tables on the database server, stored procedure, ETL packages, reports, and analytic cubes. It will be difficult to find this 1 person because the variety of skills required probably won’t exist in 1 person.

With 1 server functioning as both RDBMS and ETL and Reporting and Analytics, it means that we can only have a small DWBI system. Say 1 TB total net storage, which means that the DW size is probably about 100 GB max, including Staging tables.

We may able to spend £3k on software, but not more than that. SQL Server is about £3k for 2012 Standard edition for 10 users (link). And that includes RDMBS, Reporting and Analytics. Oracle BI Standard Edition is $460 (about £345 including VAT, link) per named user, so 10 users is about £3450. You would probably need to consider open source BI and databases such as MySQL, JasperSoft and Pentaho

I think it is possible to build a DW with £50k, but it has to be a small, simple DW, so that a) the small infra can host it, and b) a single developer can build it within 1 year. Simple, i.e. type 1 dimensions, no accumulative snapshot fact table (only transactional or periodic)

Building a DW and BI with £100k

With £100k we can definitely buy Microsoft BI (£3k for 10 users), or Oracle BI (£3.5k for 10 users), but not Cognos or Informatica. Cognos 10 BI for 10 users is about £30k (link). This is just Analysis Studio + Report Studio (assuming the same price) and 1 Admin pack, including VAT. BO Edge is about £27k (link). That’s for 5 users and 1 report developer. Informatica PowerCenter starts at $140k (link).

The allocation of £100k is probably something like: £30k hardware, £10k software, £60k labour. To me this sounds right for a small company. The £60k is for 2 developers for a year, with 1 slightly more senior (say £25k – £35k split). With 2 developers we can be built a lot more. The first version can be release after 6 months, and we have another 6 months for enhancing it to the users’ needs.

The £30k hardware will buy us 2 servers with ok-ish spec (something like 4 cores, 32 GB) with good amount of storage, probably about 5 TB. We won’t be able to effort SAN, but we can effort RAID 5 on DAS.

Building a DW and BI with less than £50k

If the budget is below £50k, or the time frame is less than 6 months, I would recommend not to build a proper data warehouse. We can still deliver BI without having a DW. One way is using data virtualisation such as Composite. Another way is using QlikView, Tableau, or Spotfire which reads the data sources directly. From cost, the latter is lower than the former.

You can attempt to build a DW (say MS BI) but not proper Kimball DW. I call this Pseudo Dimensional DW. It is fact and dim tables, but no surrogate key. The dimension tables have natural key as their PK. This is a lot simpler and faster to build than a proper Kimball DW. For a small scale (5 dims + 2 facts), instead of taking 3 months, a Pseudo DW could be done in 1 month.

Comments (4)