Data Platform and Data Science

21 February 2022

NoSQL in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 3:42 am

Where does NoSQL sit in data warehousing landscape? Does it just sit on the bench, having no role? Do we use it in staging area? Or does it function as one of the main data stores?

NoSQL has many faces. Document database, key value pair, graph database, they are all NoSQL. So obviously its usage depends on what type of data we have to store. If our data is about individual people (or individual companies) then it lends itself to be stored as a graph database. Because people have connections (ditto companies). Or you can store them (people and companies) as documents.

That is if you focus on one of the at the time. But if you’d like to process them en-masse, aggregate them, etc. then tabular storage has its advantages. KVP has a distinct advantage when it comes to schema-on-read. Data from any sources can be stored as key value pairs in the data lake, and thus deliberately avoiding modelling the data.

The idea of schema-on-read is simple, but a bit difficult to be agreed with. But it is key to understanding NoSQL’s role. The basic premise of a data lake is that we don’t model the data when storing it in the lake. We model it when we use it. KVP is ideal for this, for data lake store. All the attributes from all tables are stored in just one column. Obviously this requires a good data dictionary and data lineage. Not just at entity level (what they contain, how they are used, etc.) but down to each attribute.

Document DB’s role in data warehousing is very different (or any data lakescape for that matter). Document DB has models. Every entity has attributes. And that becomes the basis of the data model. But unlike tabular model, in document DB individual entity can have different attributes. Let’s look at security as an example (as in financial instrument). The attribute of an option is very different to a CDS. An option has a “strike price” attribute for example, and a CDS doesn’t. A CDS has a “reference entity” attribute whereas an option doesn’t.

In document DB entities have relationship with each other. If it is one to many then the parent IDs are put on the child entities. If it many to many then we use a “joining document” or we can do embedding and referencing (see: ref #6 below).

There are other types of NoSQL too: wide column store, tuple store, object store, triple store, multi model DB. We really are limiting ourselves if we only do relational databases. There are many, many different types of data store outside relational DB. And as a data architect, it is our job to know them well. Even if eventually we use SQL for a project, we know know very well why we are not implementing it on the other data stores.

A “wide column store” uses rows and columns like in a relational database, but the column names and data types can vary from row to row. The most famous wide column store is probably Cassandra, but both Amazon and Azure have them.

So where does that leave us. Where do we use NoSQL in data warehousing or data lake? KVP is useful to be used as a data lake store. Graph database would be useful as the main data store if our data is about individual object (such as person, security, or company) which is not processed/queried en-masse. Document DB is useful for modelling entites which has variable attributes, such as security. So NoSQL can be used as a data lake store, or as one of the main data stores.

Reference:

  1. Azure Cosmos DB – Documentation: link
  2. Azure Cosmos DB – Key Value Pair: link
  3. Azure Cosmos DB – Document DB: link (MongoDB)
  4. Azure Cosmos DB – Graph DB: link (Gremlin)
  5. Azure Cosmos DB – Designing Data Structure: link
  6. Azure Cosmos DB – Data Modelling: link
  7. Azure Cosmos DB – Wide Column Store: link (Cassandra)
  8. Amazon NoSQL – Documentation: link
  9. Amazon NoSQL – Key Value Pair: link (DynamoDB)
  10. Amazon NoSQL – Document DB: link
  11. Amazon NoSQL – Graph DB: link (Neptune)
  12. Technical Review of Two wide column store NoSQL Databases: Amazon DynamoDB and Apache Cassandra, by Abigail Udo: link
  13. Amazon Athena: link
  14. Oracle NoSQL Database Documentation: link
  15. Wikipedia on NoSQL: link
  16. The curse of Schema-on-read in Data Lakes, by Sandeep Uttamchandani: link

2 Comments »

  1. Great post. Clear, concise and just perfect.

    Comment by thoughtsallsorts — 21 February 2022 @ 8:23 pm | Reply


RSS feed for comments on this post. TrackBack URI

Leave a reply to Vincent Rainardi Cancel reply

Blog at WordPress.com.