Data Platform and Data Science

19 January 2016

The ABC of Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 8:52 am

In this article I would like to define the terminology used in data warehousing. It is a data warehousing glossary, similar to my investing article: The ABC of Investing (link). I won’t cover terms used in business intelligence; I will only cover terms in data warehousing. Terms in bold are defined elsewhere in this glossary.

This glossary consists of 2 levels. Only the first level is alphabetical; the second level is not. So the best way to use this glossary is to search for the term (Ctrl-F).

People make mistakes and I am sure there are mistakes in this article. I would be grateful if you could correct me, either in the comments below or via vrainardi@gmail.com.

Why I wrote this: many people working in data warehousing do not understand the standard terminology used. Even the simplest of terms, such as “dimension”, can be a foreign word to them. My intention is to provide a “quick lookup”, enabling them to understand a particular word in about 15 seconds or so.

Why don’t they use internet searches or Wikipedia? Why create another glossary? Because a) it takes longer to search and find the information, particularly if you are new, b) the pages in the search results can be technically incorrect, and c) sometimes I have my own view or prefer to emphasise things differently.

  1. Archiving: an approach to remove old data from the fact table and store it in another table (usually in a different database). It is quite common that the old data is simply deleted and not stored anywhere else; the correct term for the latter is purging.
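To illustrate the difference between archiving and purging, here is a minimal sketch using an in-memory SQLite database; the fact_sales table, its columns and the cutoff date are invented for the example.

```python
import sqlite3

# A hypothetical fact table and a matching archive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_date TEXT, amount REAL)")
conn.execute("CREATE TABLE fact_sales_archive (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [("2010-03-01", 100.0), ("2015-06-01", 250.0), ("2016-01-01", 80.0)],
)

cutoff = "2014-12-31"

# Archiving: copy the old rows to the archive table, then delete them.
conn.execute(
    "INSERT INTO fact_sales_archive SELECT * FROM fact_sales WHERE sale_date <= ?",
    (cutoff,),
)
conn.execute("DELETE FROM fact_sales WHERE sale_date <= ?", (cutoff,))

# Purging would be the DELETE alone, with no copy step.
remaining = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM fact_sales_archive").fetchone()[0]
print(remaining, archived)  # 2 rows remain, 1 row archived
```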
  2. Corporate Information Factory (CIF): a top-down data warehousing approach/architecture created by W. H. Inmon in 1999. The “Data Acquisition” part loads the data from Operational Systems into a Data Warehouse (DW) and an Operational Data Store (ODS). The “Data Delivery” part loads the data from the DW and ODS into the Exploration Warehouse, Data Mart, and Data Mining Warehouse. The following are the definitions of the terms used in CIF (quoted from the CIF website). Back in 1999 these concepts were revolutionary, hence my admiration and respect for the author, W. H. Inmon.
    • Operational Systems: the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems including ERP, relational and legacy.)
    • Data Acquisition: the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data.
    • Data Warehouse: a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data.
    • Operational Data Store: a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data.
    • Data Delivery: the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts.
    • Exploration Warehouse: a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database.
    • Data Mart: customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse.
    • Data Mining Warehouse: an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks.
    • Meta Data Management: the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge.
    • Primary Storage Management: the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage.
    • Alternative Storage: the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed.
  3. Current Flag: a column in a dimension table which indicates that the row contains the current values (in all attribute columns). Also known as Active Flag.
  4. Data Lake: a data store built on the Hadoop file system, capable of storing structured and unstructured data. It has analytic and query tools which enable users to join the various underlying data types into a single output dataset.
  5. Data Mining: the old meaning: an approach to find patterns in the data using certain algorithms (such as clustering and decision trees), and then use these patterns to predict future values of the data. See also: Machine Learning. The new meaning: an approach to extract certain information from a data warehouse which is valuable to the business. “To mine the warehouse” is a commonly used term.
  6. Data Warehouse: a system that retrieves and consolidates structured data periodically from the source system into a dimensional or a normalised data store. A data warehouse is usually used for business intelligence and reporting.
  7. Data Warehouse vs Data Mart: a data warehouse contains everything for multiple purposes, whereas a data mart contains one data area for one purpose, for example a sales mart or a customer mart. A data mart is always in a dimensional model, whereas a data warehouse can be in a dimensional model or a normalised model.
  8. DW 2.0: a follow-up to the Corporate Information Factory (CIF) concept, which stores unstructured data as well as structured data. The DW 2.0 concept was created by W. H. Inmon in 2008.
  9. Dimensional modelling: a data modelling approach using fact and dimension tables. A dimensional model is also known as dimensional schema. There are 2 approaches in dimensional modelling: star schema and snowflake.
    • Star schema: only one level of dimension.
    • Snowflake schema: more than one level of dimension.
  10. Dimensional Data Warehouse: a data warehouse which consists of fact and dimension tables. The primary key of a dimension table becomes a foreign key in the fact table. A dimensional data warehouse can be a star schema or a snowflake schema.
  • Fact Table: a table that contains business events or transaction data (time-sensitive data), such as orders and account balances. See also: Fact Table types.
  • Dimension Table: a table that contains “static data” (non time-sensitive), such as customer and product.
  • Bridge Table: a table that defines many-to-many relationship between two dimensions.
  • Measure: a column in a fact table which contains the numbers we want to analyse. For example: sales amount per city. Sales amount is a measure.
  • Attribute: a column in a dimension table which we can use to analyse the numbers. For example: sales amount per city. City is an attribute.
  • Surrogate Key: the primary key column of a dimension table. It becomes a foreign key column in the fact table.
  • Business Key: the primary key of the source table in the transaction system. This becomes the identifier in the dimension table.
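To make the terms above concrete, here is a minimal star-schema sketch in Python with SQLite; the table and column names are invented for the example. It shows a surrogate key linking the fact table to the dimension table, a business key from the source system, and the measure “sales amount” analysed by the attribute “city”.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    customer_id  TEXT,                  -- business key from the source system
    city         TEXT                   -- attribute
);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    sales_amount REAL                   -- measure
);
""")
conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)",
                 [(1, "C042", "London"), (2, "C097", "Leeds")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 70.0)])

# "Sales amount per city": join the fact table to the dimension table
# on the surrogate key, then group the measure by the attribute.
rows = conn.execute("""
    SELECT d.city, SUM(f.sales_amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.city ORDER BY d.city
""").fetchall()
print(rows)  # [('Leeds', 70.0), ('London', 150.0)]
```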
  11. Enterprise Data Warehouse (EDW): a normalised database containing data from many departments (more than one division), sourced from several source systems. An EDW stores historical transactional data as well as historical attribute values. The EDW was pioneered by W. H. Inmon. See also: Normalised Data Warehouse.
  12. ETL (Extract, Transform, Load): the process of extracting data from the source system, transforming and cleansing it, and loading it into the data warehouse.
  13. Fact Table types: there are 3 types of fact tables.
  • Transaction Fact Table: a fact table that stores the measure values of each business event at the time the event happened. In a transaction fact table, each business event is stored only once, as one row.
  • Periodic Snapshot Fact Table: a fact table that contains the values of each measure taken at regular intervals. In a periodic snapshot fact table, each business event is stored multiple times. A Periodic Snapshot Fact Table is also known as a Snapshot Fact Table. See also: Snapshot Fact Table.
  • Accumulative Snapshot Fact Table (also known as an Accumulating Snapshot Fact Table): a fact table in which the timings and statuses from different points in time are stored as different columns on the same row. Each row describes one particular customer (or other dimension member).
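The difference between the first two types can be sketched in plain Python; the account, dates, and amounts are invented for the example.

```python
# Transaction fact: one row per business event (each deposit/withdrawal),
# stored once, when it happened.
transaction_fact = [
    {"account": "A1", "date": "2016-01-18", "amount": +100.0},
    {"account": "A1", "date": "2016-01-19", "amount": -30.0},
]

def daily_snapshots(transactions, dates):
    """Build a periodic (daily) snapshot fact from the transaction fact:
    one row per account per day, holding the end-of-day balance, so the
    same account appears many times."""
    txns_by_date = {}
    for t in transactions:
        txns_by_date.setdefault(t["date"], []).append(t)
    rows, balance = [], 0.0
    for d in dates:
        for t in txns_by_date.get(d, []):
            balance += t["amount"]
        rows.append({"account": "A1", "date": d, "balance": balance})
    return rows

snapshot_fact = daily_snapshots(transaction_fact, ["2016-01-18", "2016-01-19"])
print(snapshot_fact[-1]["balance"])  # 70.0 (100 deposited, 30 withdrawn)
```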
  14. Machine Learning: an approach to find patterns in the data using certain algorithms, then use these patterns to predict future values. In my opinion, Machine Learning is the new term for the old meaning of Data Mining.
  15. MPP Server (Massively Parallel Processing): a type of database server where a query is broken down into multiple streams. Each stream is executed concurrently on a different node. The outputs from the multiple streams are then combined into one and passed back to the user. This architecture is also known as a Shared Nothing architecture. Examples of MPP servers are: Teradata, Netezza, Azure SQL Data Warehouse, and Greenplum.
  16. Normalised Data Warehouse: a data warehouse which consists of transaction tables, master tables, history tables, and auxiliary tables, in first, second or third normal form.
  • Transaction table: a table that contains business events or transaction data (time-sensitive data), such as orders and account balances.
  • Master table: a table that contains “static data” (non time-sensitive), such as customer and product. It contains today’s values.
  • History table: a table that contains “static data” (non time-sensitive), such as customer and product. It contains historical values.
  • Auxiliary table: a table which contains data that describes the codes in a master table. For example: customer type and currency code.
  • Bridge table: a table which implements many-to-many relationship between a transaction table and a master table. Or between two master tables.
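As a small sketch of how a bridge table resolves a many-to-many relationship, consider a joint bank account held by two customers; the keys and names below are invented for the example.

```python
# Master tables, keyed by surrogate key.
customers = {1: "Alice", 2: "Bob"}
accounts = {10: "Joint savings", 20: "Alice current"}

# The bridge table: one row per (account, customer) pair, so an account
# can have many customers and a customer can have many accounts.
bridge_account_customer = [(10, 1), (10, 2), (20, 1)]

def customers_of(account_key):
    """Resolve the many-to-many link from the account side."""
    return sorted(c for a, c in bridge_account_customer if a == account_key)

print(customers_of(10))  # [1, 2] -> the joint account has two holders
```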
  17. OLAP (Online Analytical Processing): the process of interrogating a multidimensional database to explore the data and find patterns. See: Multidimensional Database.
  18. OLAP Cubes: a compressed, in-memory, multidimensional database.
  19. Operational Data Store (ODS): a normalised database containing transaction data from several source systems (it is an integration point). It is similar to an Enterprise Data Warehouse (EDW) but only contains the current values of the attributes. Of course it contains the historical values of the transactions.
  20. Multidimensional Database (quoted from my book, Building a Data Warehouse): a form of database where the data is stored in cells and the position of each cell is defined by a number of hierarchies called dimensions. The structure stores the aggregate values as well as the base values, typically in compressed multidimensional array format.
  21. Slowly Changing Dimension (SCD): an approach to preserve the historical/old values of attributes in the dimension tables. There are 5 types of SCD:
  • SCD Type 0: the attribute is static and the values never change.
  • SCD Type 1: the old attribute value is overwritten by the new attribute value. The old values are not preserved; only the latest value is stored.
  • SCD Type 2: the old attribute values are stored in different rows from the new attribute value. In the dimension table, each version of the attribute value is stored as a different row. The row containing the latest value is marked with a special indicator (see: Current Flag).
  • SCD Type 3: the old attribute values are stored in columns in the dimension table. Only the last two to three values are stored. Type 3 is usually used when most of the rows in the dimension table change their values together at the same time.
  • SCD Type 4: (as defined by SQL 2016, link) the old attribute values are stored in a different table called a history table. Only the latest value is stored in the dimension table.
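The Type 2 mechanics can be sketched in a few lines of Python over an in-memory dimension table; the column names (is_current, valid_from, valid_to) and the customer data are invented for the example.

```python
from datetime import date

# A hypothetical customer dimension with one current row.
dim_customer = [
    {"customer_key": 1, "customer_id": "C042", "city": "London",
     "is_current": True, "valid_from": date(2015, 1, 1), "valid_to": None},
]
next_key = 2

def apply_scd2(dim, customer_id, new_city, change_date):
    """SCD Type 2: expire the current row for the business key,
    then insert a new row holding the new attribute value."""
    global next_key
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return  # no change, nothing to do
            row["is_current"] = False        # expire the old version
            row["valid_to"] = change_date
    dim.append({"customer_key": next_key, "customer_id": customer_id,
                "city": new_city, "is_current": True,
                "valid_from": change_date, "valid_to": None})
    next_key += 1

# The customer moves from London to Leeds: the old row is kept,
# a new current row is added.
apply_scd2(dim_customer, "C042", "Leeds", date(2016, 1, 19))
print(len(dim_customer))  # 2 rows: the old and the new version
print([r["city"] for r in dim_customer if r["is_current"]])  # ['Leeds']
```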
  22. Slowly Changing Dimension: a dimension which has type 2 attributes.
  23. Snapshot Fact Table: a fact table that contains the values of each measure at a particular point in time. For example, the balance of every account at the end of the day. There are two types of snapshot fact tables: periodic and accumulative, but when people say “Snapshot Fact Table” they usually mean periodic. See also: Fact Table types.
  • Daily Snapshot: contains the value of each measure at the end of every day.
  • Weekly Snapshot: contains the value of each measure at the end of every week.
  • Monthly Snapshot: contains the value of each measure at the end of every month.
  24. Staging table: a table in a data warehouse which contains the raw data from the source system. This raw data will be processed further and loaded into either a fact table or a dimension table.
