There are several reasons why we want to create a data mart from a data warehouse. But before I list them down I would like to clarify the meaning of “data mart” and the meaning of “data warehouse”.
Data mart: in this article I am referring to the Kimball meaning of the word “mart”, which means that it is in dimensional model. Meaning that it consists of a few fact tables (usually 1 to 5 but not more than 10), and the necessary dimensions (usually about 5 to 10). Quoting my own definition in my book, a dimensional data mart is a group of related fact tables and their corresponding dimension tables containing the measurements of business events categorised by their dimensions.
The other possible meanings of “data mart” in data warehousing practice are (which I’m not talking about them in this article):
- An OLAP cube. Usually quite small in size, created from 1 or 2 fact tables (and the necessary dimension). Probably not at the lowest grain but at higher/summarised level.
- A relational database but not dimensional (not fact & dim). Usually it is not at 3rd normal form either. Just a collection of tables without (necessarily) having referential integrity / foreign key relationships between the tables. It’s quite a small database, there are not many tables, probably between 5 and 15.
Whereas for the term “data warehouse”, it’s as what I defined in my book: a data warehouse is a system that retrieves and consolidates data periodically from the source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for business intelligence or other analytical activities. It is typically updated in batches, not every time the transaction happens in the source systems.
So, with those meanings of data mart and data warehouse in mind, the reasons for creating a data mart from a data warehouse are:
- Because we need to create a static copy of a few fact tables (with their corresponding dimensions) which does not change every day like the warehouse, for analysis purposes. Usually it’s only temporary. The users use the data in the mart for a project which does not last long, say a few days to a few weeks. And after that the data mart is deleted.
- Because they need a mart for data mining. Processing a data mining model (“training”, “predictive analysis”) creates heavy work load and we don’t want it to affect the performance of the central/core warehouse.
- To ease the query workload on the warehouse. Say 50% of the reports are querying one certain fact table (and its dimensions). To enlighten the burden of the warehouse we create a copy of that fact table (and its dimension) in a separated database located on different server, and point some of th reports that way. This mart is refreshed/updated every day.
- Because they want to change the data to simulate some business scenarios. They can’t change the data in the core warehouse (it will affect everybody!) so we provide a data mart for them to running their scenarios.
- To support the query performance of the reports (the mart is structured differently, indexed differently). Probably putting additional tables in it (summary tables/fact at higher grain, or list tables). The mart is usually read only, refreshed/updated every day.
- Because the data warehouse is in normalised format (kind of NDS, probably on PDW), so we need to build a dimensional mart for the analytics/OLAP to get the data from.