This little word often puzzled many people. What is it? Is it de-duplication? Removing the duplicates? What duplicates? In which table, dimension or fact? Is it only used in data warehousing? I came into this word / situation again yesterday and as this is the 4th time I came across it I thought I’d write it down. Probably useful for other people, while I still got the chance.
Dedup is the short for de-duplication. In data warehousing, in the source tables we sometimes found a table without a primary key. Or a unique index for that matter. This means that this table has a potential to have duplicate rows. Meaning that two or more rows with the same value across all columns. Or at least across the columns which we are interested in.
For example, in a hedge fund database you come across these 2 rows (this is a made-up data, not real):
Each of the 2 rows above contains data about a fund called Good Income B. They are identical.
So must we do to bring duplicate data into the warehouse? We de-duplicate them, by doing select distinct (or using group by). From 2 rows, we make them 1 row.
What is the de-duplication criteria? We need to have sufficient business knowledge to be able to determine the de-dup criteria. It is usually not all columns, but only a few of them. In the case of fund dimension like above, usually the business identifier is SEDOL. In the absence of Sedol or Fund ID, we use Fund Name and Sub Type (which is either Income or Accumulation).
To bring different share classes (clean vs not), we need to include AMC column. Because the clean share classes have lower Annual Management Charge e.g. 0.75% instead of 1.5%. But to differentiate between retail and institution share classes we need to include Min Initial Amount. Because usually the retail minimum amount is £500 or £1000, whereas Institution minimum amount is £50k or £100k.
Once we determine the de-dup columns, we can run group by over those columns to find the duplicates. Now we need to have another business decision, about which row we would like to bring to the Data Warehouse. In the above case, if the Fund Name + Sub Type + AMC + Min Initial Amount are the same between 2 rows, but their Lipper Preservations are different, which one are we going to bring? They lower, the higher, or any?
In this case the business knowledge (asset management industry) is required. In the above case the row with the latest Lipper scores needs to be brought into the data warehouse.