During my 10 years or so doing data warehousing projects, I have seen several initiatives doing data dictionary. A data dictionary is a list of definitions used in the system. Each definition is about usually about 10 to 50 words long. And there are about 50 to 200 definitions. What being defined is mostly technical terms, such as the meaning of each field in the database. This is the origin of data dictionary, grew from the need to explain the fields. For example, delivery date, product category, client reference number, tax amount, and so on. I saw this in 1998 all the way to today (2015).
This data dictionary was created by the business analyst in the project, who was trying to explain each field to clear any ambiguity about the terms used in the system. I found that the business people view it differently. They found that it is not too useful for them. The data dictionary is very useful to the new starter in the IT development project, but not too useful for the business. I’ll illustrate with several examples below.
In the database there are terms like: CDS, obligor, PD, yield. The BA defines these terms by copying a definition from internet searches, so they ended up with something like this:
- Credit Default Swap (CDS): a financial contract on which the seller will pay the buyer if the bond defaults
- Obligor: a person or entity who is legally obliged to provide some payment to another
- Probability of Default (PD): likelihood of a default over a particular time frame
- Yield: the interest received from a bond
The CDS definition is trying to explain a big concept in one sentence. Of course it fails miserably: what bond? What do you mean by “defaults”? After reading the whole sentence, the readers are none of the wiser. The readers will get much better understanding about CDS if they read the Wikipedia page about CDS.
The Obligor definition is too generic. It is too high level so that it is useless because it doesn’t provide the context to the project/system. In a credit portfolio project, obligor should be defined as the borrower of the loan, whereas in a fixed income portfolio project, obligor should be defined as the issuer of the bond.
The PD definition contains a common mistake in data dictionary: the definition contains the word being defined, because the author doesn’t have a good understanding about the topic. What is a default? It is not explained. Which PD is used in this field, Through The Cycle or Point In Time? Stressed or Unstressed? Is it calculated using linear regression or discriminant analysis? That is what the business wants to know.
The Yield definition does not explain the data source. What the business users need is whether it is Yield To Maturity, Current Yield, Simple Yield, Yield to Call or Yield to Worse. There are different definitions of yield depending on the context: fixed income, equity, property or fund. The author has firmly frame it for fixed income, which is good, but within fixed income there are 5 different definitions so the business want to know which one.
Delivery mechanism: Linking and Web Pages
To make the life of the business users easier, sometimes the system (the BI system, reporting system, trade system, finance system, etc.) is equipped with hyperlinks to the data dictionary, directly from the screen. So from the screen we can see these words: CDS, obligor, PD, yield are all hyperlinks and clickable. If we click them, the data dictionary page opens, highlighting the definition. Or it can also be delivered as a “balloon text”.
But mostly, data dictionaries are delivered as static web pages. It is part of the system documentation such as architecture diagram, content of the warehouse, support pages, and troubleshooting guide.
It goes without saying that as web pages, it should have links between definitions.
I would say that the mechanism of delivery only doesn’t contribute much to the value. It is the content which adds a lot of value.
A data dictionary should be accessible by the business, as well as by IT. Therefore it is better if it is delivered as web pages rather than links direct from the system’s screens. This is because web pages can be accessed by everybody in the company, and the access can be easily controlled using AD groups.
The most important thing in data dictionary is in-context, concise definition of the field. The second most important thing is the data source, i.e. where is this field coming from. System A’s definition of Duration could be different from system B’s. They may be coming from different sources. In System A it might be sourced from Modified Duration, whereas in System B it is sourced from Effective Duration.
Because of this a data dictionary is per system. We should not attempt building a company-wide data dictionary (like this: link) before we completed the system-specific data dictionaries.
How deep? If the field is sourced from system A, and system A is sourced from system B, which one should we put as the source, system A or system B? We should put both. This is a common issue. The data dictionary says that the source of this field is the data warehouse. Well of course! A lot of data in the company ended up being in the data warehouse! But where does the data warehouse get it from? That’s what the business what to know.
Not Just the Fields
A good data dictionary should also define the values in the field. For example, if we have a field called Region which has 3 possible values: EMEA, APAC, Americas, we should explain what APAC means, what EMEA means and what Americas means (does it include the Caribbean countries?)
This doesn’t mean that if we have a field called currency we then have to define USD, EUR, GBP and 100 other currencies. If the value of a field is self-explanatory, we leave it. But if it is ambiguous, we explain it.
If the field is a classification field, we should explain why the values are classified that way. For example: the value of Asset Class field could be: equity, fixed income, money market instruments (MMI), CDS, IRS. Many people would argue that CDS and MMI are included in fixed income, so why having separate category for CDS and MMI? Perhaps because MMI has short durations and the business would like to see its numbers separately. Perhaps because the business views CDS as hedging mechanism rather than investment vehicle so they would like to see its numbers separately.
So in summary, a data dictionary should:
- Contain in-context, concise definition of every field
- Contain where the field is sourced from
- Contain the definition of the values in the field
- It should be system specific
- It should be delivered as web pages
- It should be accessible by both the business and IT
A good data dictionary is part of every data warehouse. I would say that a data warehouse project is not finished until we produce a good data dictionary.