In most cases, when we started designing a data warehouse, as a data warehouse architect the PM needs us to tell them what the size of the DW will be, so that infrastructure can start the procurement process, capacity planning, etc. The problem is, we do not know the size of the warehouse until we finished building the warehouse and ETL.
Below I describe the technique that I used to estimate the size of dimension and fact tables. Once you get that, you can estimate the size of the staging database based on the warehouse size (say 30-50% of DW size), size of the cubes (say 5-10% of DW size), size of the extract files (say 10-20% of DW size).
This way you can estimate the disk requirement, which is one of the most difficult thing to do, if you don’t have the experience. I’ve done about 10 projects of data warehousing, i.e. designing and building data warehouses and BI, and what I wrote here is based on that experience. Of course, other DW architects may have different experience, and if so, please share it so we can all learn from you as well.
OK let’s start with dimension tables, then the fact tables.
Estimating the Size of Dimension Tables
Categorise dimension tables into small, medium, and large, based on how many rows in the dimension. Up to 1000 rows is small, 1000 to 100k is medium, and > 100k rows is large dimension. For example:
- Country dimension: small
- Customer dimension: large
- Currency dimension: small
- Date dimension: small
- Salesperson dimension: small
- Product dimension: large
- Industry Sector dimension: small
- Store dimension: medium
We can roughly estimate the size of a small/medium/large dimension based on the KB per row, like this:
- Small: 1 KB/row x 1000 rows = 1 MB
- Medium: 5 KB/row x 100k rows = 500 MB
- Large: 10 KB/row x 1m rows = 10 GB
Then we multiply with the number of dimensions for each of the Small/Medium/Large category, like this:
- Small: 11 tables x 1 MB = 11 MB
- Medium: 6 tables x 500 MB = 3 GB
- Large: 4 tables x 10 GB = 40 GB
Total: 44 GB
We can then cross check this 44 GB using the daily volume of dimension data flowing into the data warehouse, to verify that we are not too far off from the reality.
- Total data volume each day loaded by ETL into the data warehouse from all source systems: 15 GB, comprised of say 13 GB delta load (incremental extract / transactional) and 2 GB static load (whole-table extract). This 15 GB is the total volume of data extract from all source systems, which is usually about 5-10% of the size of the source system, if we can do incremental extraction (use 50% if you can’t do extract the data incrementally).
- Of all date going through the ETL process, about 5-10% is dimension/static data, and 90-95% is fact/transaction data. If you are not sure which number to take between 5% and 10%, take 5%. This is because most cases are 5%, only a few cases it is 10%. So 5% x 15 GB = 0.75 GB. This is the volume dimension data uploaded to the data warehouse every single day.
- About 0.1-3% of the rows in the dimension tables are update or inserted every day. 3% for a very active dimension (like customer and product dimensions), 0.1% for very static dimension (like country and currency dimensions). So if your customer dimension has 10 million rows, about 300k rows (3%) are updated or inserted every day. If your country dimension has 200 rows, about 0.2 rows (0.1%) are updated every day, which means that 1 row per week on average.
- So based on #2 and #3 we can estimate the size of the dimension. There 1.75 GB dimension data flowing into the DW every day, and that is 1.5% of the dimension size (0.1% for a very static dimension, 3% for a very active dimension, so the average is (3%+0.1%)/2 = 1.5%). So the size of all dimension tables combined is 100%/1.5% = 67 times of the daily volume of dimension data = 67 x 0.75 GB = 50.25 GB.
So one method giving 44 GB and another method giving us 50.25 GB so we are not too far off. We take the larger of the two methods, which is 50.25 GB.
I don’t like to to be vague or unclear: what’s the tolerance range for “too far off”?
Answer: Based on my experience, if the second method gives us 500 GB (>10x than the first method), we need to question it and look into it again. But if the second method is 2-3x bigger than the first method (132 GB) that’s fine. This 3x is the limit, beyond 3x we have to question it.
Estimated the Size of Fact Tables
We estimate the size of the fact tables based on the data volume in the source systems, like this:
- Say that we have 3 source systems, 100 GB, 10 GB and 400 TB, total of 510 TB. If we can do incremental extracting, the volume of the extracted data is about 1-10% of the size of the source system (this is daily). The bigger the source system, the smaller the percentage, i.e. if the source system is > 1 TB, it is more likely to be 2-3% than 8-10%. The newer the source system, the bigger the percentage, i.e. if the the source system is 1 year old it will be closer to 10% than to 1%. If you have no idea, don’t take any risk just take 10%. In this case, as it is less than 1 TB, let’s take 9%, assuming they are quite new. So the daily volume of extracted data is 9% x 510 GB = 46 GB per day.
- If you can’t do incremental extracting, use 50%. So if the total volume of the source system is 510 GB, the size of the data extract (daily) is be 255 GB. It is extremely unlikely that you can’t do incremental extraction at all. In all 10 DW projects I’ve done, we could do incremental extraction on every single one of them. If it was a good ETL (we could do incremental extraction for almost all tables) then it was something like 5%. But if it was a bad ETL (could not find ID columns in some tables required for incremental extraction) then it was something like 15%. So before you use 50%, verify that there isn’t really anything that you can use for incremental extraction.
For more info please read my article about incremental extraction here: link. I described 4 things that we can use for doing incremental extraction: timestamp columns, identity column, triggers, and transaction date. It would be an extremely rare case that there in nothing in any tables in the source system that you cannot use to do incremental data extraction. If that is the case, the source system would be a small system any way, or very old. Modern, large, complex transaction systems were built with incremental extraction in mind, or, at the very least, with a unique row identifier in every table. Even in Hadoop.
- Landing area and staging tables
Landing area is the file system (directories) which is used to store data dump / export files from the source system, before they are loaded into the data warehouse. Instead of using files, we can keep the data dump from the store system in staging tables. Staging tables are the tables in the staging database. They have no indexes, no primary keys. They are usually truncated before each load. In modern RDBMS, such as SQL Server 2014, staging tables resides in memory so that loading and reading them are a lot faster than disk-based tables.
To estimate the size of the landing area, we multiply the daily data extract volume with the number of days we want to keep the data. In this example, let’s assume that we want to store the extracted data, which is 46 GB/day (see #1 above), for a week (5 days). So the volume of the landing area is 5 x 46 GB = 230 GB. The same calculation principles apply for calculating the staging tables.
- Intermediate fact tables
Sometimes, the fact data requires an intermediate tables. For example, to perform grouping and decile calculation. Or to calculate loyalty points. So the fact data flows into a layer of intermediate tables, before they are loaded into the final fact tables. This is because “one-pass” SQL cannot be done to perform that operation. The data needs to be landed in a table frist, then a second-pass is performed. The reason of not doing this in a temp table is because
- The data volume is large, for example 500 million rows of weather measurement points per day. Using a temp table would affect the performance of other streams in the warehouse.
- A comparison of today’s data with previous days is required. If we use a temp table, we would lose previous days’ data. By having a permanent table we don’t have to rebuild previous days’ data again.
Without performing data modelling, how do we estimate the size of intermediate fact tables? By taking a percentage of a day’s worth of data, like this: one day’s data is 46 GB. Of this, 10% is dimension data and 90% is fact data. So the fact data is 41.4 GB. Of this 41.4 GB fact data, most of it does not require intermediate fact tables. Only 10-20% requires double-passes (or triple-passes). Taking the maximum to be safe (20%), 20% x 41.4 GB = 8.3 GB. That’s the size of the intermediate tables to process fact data.
- Final fact tables
The core question is, how do we estimate the size of the fact tables, without first designing them. So at this point of the project we do not know yet whether we will have FactAccountBalanceSnapshot, FactPremiumTransaction, etc. We don’t know how many fact tables, and what they are. Yet.
One approach is to look at the size of the fact data that is coming into the warehouse every day. We know that this is 46 GB (see point 4 above). When this data is loaded into a dimensional model, it become about 20-40%, because the textual columns are put into the dimensions, leaving only the numerical columns in the fact tables. Textual columns take up most of the space in a row.
Most of the fact tables are periodic snapsnot, only 1-2 are transactional. This is because the business usually need to analyse the changes to measures over a period of time. Accummulating snapshot fact table is rare too, usually only 1 or 2 within the whole warehouse. The ratio is about 80-90% by number of tables, and about 95% by data volume (GB). 95% of the fact data GB is in the snapshot fact table. Transactional fact tables are very small compared to periodic snapshot fact tables, because they do not repeat the data every day.
In the case above, assuming that we snapshot all of the incoming fact data, the volume of the periodic snapshot fact tables would be 46 GB x 30%* x 5 days x 52 weeks = 3.6 TB per year. But only 95% (3.42 TB) is in periodic snapshot fact tables, and 5% (0.18 GB) is in transactional and accumulating snapshot fact tables.
*30% is the mid between 20% and 40%, the ratio when the raw fact data is loaded into dimensional model.
It is a common practice that beyond 1 year, we don’t store daily snapshot. But either weekly snapshot or monthly snapshot. So it is something like: daily snapshot for the last 1 year, weekly snapshot for the last 2-3 years, then monthly snapshot for the last 3-10 years. So the number of snapshots are 5 x 52 = 260 for the last 1 year, 2 x 12 = 24 snapshots in the last 2-3 years, and 7 x 12 = 84 snapshots for the last 3-10 years. 260 + 24 + 84 = 368 snapshots.
So to store 10 years of data in the fact tables, the space required would be 46 GB x 30% x 95% x 368 = 48.2 TB.
- Infrastructure factors
Depending on the RDBMS chosen, we will then need to add or deduct space for indices, log files, fill factor and database compression. Exadata can compress data to 20% of the original volume. The reference says 10% but the reality is 33% (see Ofir’s article here: link).
In SQL Server and Oracle (and I believe other non MPP RDBMS such as DB2), the space for indices in a DW environment is typically 20-40% of the data. The log files are only a small percentage of the data, say 1-5%. DB maintenance will keep the log files small.
Temp DB is another one to consider, particularly if you are using SQL Server, which is a fixed amount. In some companies it’s 20 GB, but some companies it’s 200 GB. It also depends on memory, if the server has 130 GB RAM (say 120 usable by SQL), then 20 GB may be enough. But if the server only has 16 GB RAM (say 14 GB usable by SQL), then we may need 200 GB. And crucially it depends on what kind of SQL queries are flowing into the RDBMS: joining 20 tables (3 of them self joins) with grouping and rank functions, with each table being 50 GB in size, that could take 100 GB tempDB space. Depends on the execution plan too, i.e. hash join or nested loop. Using recompile or not. But that’s specific to SQL Server.
We also need to consider high availability requirements, i.e. mirroring will double the number of space, RAID 5 stripping adds 20%, RAID 1+0 doubles the space, etc. Hot or cold stand by, snapshot database, replication, etc also requires additional space.