Data Warehousing and Data Science

29 November 2017

What’s in Azure?

Filed under: Business Intelligence,Data Science,Data Warehousing — Vincent Rainardi @ 5:31 am

As of 28th Nov 2017 we can create the following in Microsoft Azure:

  • Data: SQL Database, SQL Server, SQL Data Warehouse, SQL Elastic Database Pool, MySQL, PostgreSQL, Data Lake, CosmosDB, MongoDB, Redis, Teradata Database, Oracle Database, IBM DB2, Aerospike, XtremeData, HazelCast, ArangoDB.
  • Data Integration: Informatica, Data Factory, Data Catalog, File Sync, Profisee Maestro Integration Server, Information Builder iWay Big Data Integrator, BizTalk Service, IBM WebSphere Application Server, IBM WebSphere MQ, IBM MQ, Datameer, QuerySurge.
  • Big Data: HDInsight, Hadoop, HDFS, HBase, Hive (Interactive Query), Hive Streaming, Spark, Kafka, Storm, Hortonworks, Cloudera, Cassandra, GridGain, MapR, F5 BIG-IP, Syncfusion, Informatica Big Data Mgt, Kyligence, AtScale.
  • Data Science: ML Experimentation, ML Model Management, R Server, Data Science VM, Bot Service, Computer Vision API, Face API, Language Understanding Intelligent Service, Translation Speech API, Text Analytics API, ML Studio Web Service, Sparkling Water H2O.
  • Analytics: Databricks, Stream Analytics, Analysis Services, Data Lake Analytics, Time Series Insights, Tableau Server, Qlik Sense Server, Pyramid Analytics, Nutanix Analytics, Real World Analytics, Exasol Analytic, HP Vertica Analytics, Teradata Aster Analytics, Bing Search, Dundas BI, Power BI, Panorama Necto, Spago BI, Targit, KNIME, SAP Hana, Kepion, Jedox, Diagramics.
  • Internet of Things: IoT Hub, Event Hub, Notification Hub, Pilot Things IoT VPN, TrendMicro IoT Security.
  • Developer Tools: Visual Studio, VS Anywhere, DevOps Project, Team Project, DevTest Labs, Application Insights, API Management, Operational Insights, Jenkins,
  • Infrastructure: Virtual Network, Load Balancer, Network Security Group, Security & Audit, Security & Compliance, Event Tracker Security Centre, Log Analytics, Automation, Active Directory, Scheduler, CloudAMQP, Cradeon.
  • Web: Website + SQL, WebApp, WordPress, Drupal, Joomla, Django, API App, Logic App, CDN, Media Services, VoIP, SiouxApp, App Dynamics, CakePHP, BlogEngine.NET, MVC Forum, Better CMS, Node JS
  • Compute: Windows Server, Red Hat Enterprise Linux, Oracle Linux, Ubuntu Server, VM Instances, Service Fabric Cluster, Web App for Container, Function App, Batch Service, Cloud Service.
  • Blockchain: Ethereum, Hyperledger, Corda, Quorum, STRATO, Chain Core, Stratis.

13 November 2017

Numerical Attributes

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 7:24 pm

Should we put numerical attributes in the dimension table or in the fact table? Numerical attributes are static attributes of an entity but they are numeric, not string. For example: interest rate of a swap and weight of a product.

I usually follow the textbook approach and put them in the dimension. But for measures and attributes of OTC derivative instruments, we should be putting them in the fact table.

The fact table contains daily (and intraday) valuations, such as price, market value and the “Greeks”. An interest rate swap usually has a fixed leg and a floating leg. For example, receive 0.886% pay EUR 3m + 0.567%. So we will get a fixed interest rate of 0.886% and pay our counterparty the rate of EURIBOR 3m plus 0.567% (currently EURIBOR 3m is -0.329%).

These two attributes (the 0.886% and 0.567%) are part of the interest rate swap contract and are fixed for the duration of the contract (until maturity). When it reaches the maturity date, the contract ends and if we want to extend it, we will normally make a new contract, with a new SEDOL. Hence these 2 attributes are essentially static attributes (they are called receive interest rate and pay interest rate).

Because they are static they should be put into the dimension, which in this case is the security dimension or instrument dimension, rather than the fact table, which in this case is the instrument valuation fact table.

But I recommend putting them into the fact table for 4 reasons:

  1. Incorrect business judgement (a simple “sweep all” rule, and if we are wrong then it’s not too bad)
  2. Bytes
  3. Floating
  4. Calculation

1. Incorrect business judgement

There are many types of OTC derivative instruments, e.g. Interest Rate Swaps, Credit Default Swap, FX Options, Equity Index Swap, CDS Options, Asset Swap, Total Return Swap, etc. There are about 15 of them, each with different numerical attributes. So in there are about 50 or so numerical attributes.

We could sit down with the business and categorise these 50 numerical attributes into two:

  • A measure or fact: meaning they are potentially changing every day
  • A dimension attribute: meaning that they are static and never change

But the business judgement could be wrong. Something which they think is static could be changing. Here are some of those 50 attributes and imagine you sit down with the business categorising them into the two categories above:

Strike price interval, put option premium, strike price, intrinsic value, market value, fair value, time value, put call ratio, put call parity, recovery rate, expiration value, underlying asset value, fixed interest rate, floating interest rate margin, nominal amount, notional value, pay duration, pay/receive spread, pay/receive margin, credit spread.

In addition to the input from the business people, it can be useful to check the source system for historical changes. Does the value of that column/field ever changes? If yes, how often?

If we miscategorise them, there are two things which can happen:
a) A measure is miscategorised as an attribute
b) An attribute is miscategorised as a measure

Generally speaking, b) is fine but a) has quite serious consequences, because a) means that 1 instrument in the instrument dimension can have 100 rows, even 1000 rows. And if we have 300,000 instruments (covering both active positions, and non-active positions), the size of the instrument dimension will be very big, 300 million rows, making it slow.

The reason an instrument can have 1000 rows is that the attribute is actually a measure. We think it is static, but it actually changes every day. Or even several times a day.

On the other hand if an attribute is miscategorised as a measure, that attribute will be put in the fact table. And because it is numeric, it does not take a lot of bytes, even if it is repeated every day.

So to be safe, if you are not sure, put it as a measure in the fact table. The setup of an instrument can take a few weeks, particularly if it is a swap. During this period there is a continuous update on the instrument rows. If we have 20 numerical attributes for a swap, during this few weeks period we could have 20 rows because on day 1 only 5 attributes are setup, on day 2 another 5 is setup, on day 3 another 5, and on day 4 they change the attribute setup on day 1, and so on. So the whole thing can take 3 weeks, producing 20 rows. If it is in the fact table, a) we will only get 1 version a day, and b) we are preventing the instrument dimension from ballooning into many millions of rows.

2. Bytes

Numerical attributes take up much less bytes / disk space compared to textual attributes. Therefore, it does not slow things down if we put numerical attributes in the fact table, compared to the risk of miscategorising them into the instrument dimension, which will cause the dimension to “explode”.

3. Floating

If the attribute value is 0.00352281 and it is a float in the source system, it can be stored as 0.0035228100000000001. And it is possible that the value in the following day is 0.0035228100000000000, just because of the rounding in the floating point processor. Perhaps it is derived from a calculation of A/B. And if it changing from 00001 to 00000 then to 00001 again the next day, we have a phenomenon called “Flip Flop”. Flip Flop is where a dimension attribute keep changing values (usually twice a day) because it is updated from 2 different systems. This causes the number of rows in the dimension to “explode”.

4. Calculation

The last argument is that numerical attributes might be required for calculation, for example column1 divided by column2, or column1 + column2 + column3. If all columns involved in the calculation are stored in the fact table, we will be able to perform the calculation without joining the any dimension. We just need to filter on the partitioning date of the fact table, which is very fast.

Conclusion

I am not advocating to store all numerical attributes in the fact table for all cases. Only in the case of OTC instrument, I would recommend putting the numerical attributes in the fact table, because of the 4 reasons above.

But for most cases, usually the best approach is to put the numerical attributes in the dimension. For example: the dimension of a product (the width, length and height) and the weight of a product. Yes they do change from time to time, but not daily, hence we want to record these changes as SCD type 2 in the product dimension.

12 November 2017

Relational vs Non Relational Databases

Filed under: Data Architecture — Vincent Rainardi @ 6:35 am

Most of us database professionals are familiar with relational databases like Oracle and SQL Server. This is understandable because most companies are using relational databases. But we also have graph database, object databases, document databases, key-value databases, columnar databases. All of these are known as non relational databases. They are also called NoSQL databases.

So let’s find out what a relational database is, and what a non relational database is. A data architect should understand non relational databases, even though they are not as popular as relational databases.

A. Relational Databases

A relation means a table, which consists of columns (aka attributes) and rows (aka tuples). More formally, a relation is a header with a body. A header is a set of attribute names with the constraints, and the body is a set of tuples from a domain.

A table is called a relation because it is a relation between the header and the body, between the attribute and the tuples. For example: Employee relation consists of a header with 2 attributes: Name and Date of Birth. The Name attribute can contain any value up to 10 characters, as long as it is alphanumeric. The Date of Birth attribute can contain any valid date. So “Andy”, “Bob” and “Carol” relate to the Name attribute. “25 Dec 2000” and “1 Jan 1995” relate to the Date of Birth attribute. That is why a table is called a relation, and the whole thing is called Relational Model, by Edgar F Codd in 1969.

A relational database is a database where the data is stored as values in tables. Data inside a relational database is retrieved using a query called SQL, Structured Query Language. SQL is also used to add and update the data.

A1. Relvar, attribute value, database schema

Actually, a relation does not mean a table. A relation means a table definition (does not contain data). A relation variable means the table instance which contains data. A relation variable is called a relvar.

An attribute means a column. An attribute value means the value of an attribute for a particular tuple (row).

A database schema is a set of relations (tables), the header (columns) associated with those relations, and their predicates (constraints).

A2. Constraints

A constraint can be of one several forms:

  • Domain constraint is a restriction on the type of values that an attribute can have, e.g. numbers, dates, or alphanumeric. Today a domain constraint is known as a data type.
  • Data constraint is a restriction on the possible values that an attribute can have, e.g. only 1 to 10. Today a data constraint is known as a check constraint.
  • Key constraint is a way to identify a tuple using the values of certain attributes.

There several types of key constraint:

  • A superkey is a set of attributes which identifies a tuple (row), which is chosen to be implemented. There are no two tuples which have the same superkey. A superkey is now known as a primary key.
  • A foreign key is an attribute which identifies a tuple in another relation.
  • A candidate key is an attribute (or a set of attributes) which identifies a tuple. It can be chosen as a superkey. If not chosen it becomes an alternate key.
  • An alternate key is a candidate key which is not chosen as a superkey.

A3. Query

A query means getting data from a set of relations in the database. The output of a query is called an answer, or a result set.

A query can do several operations:

  • Restrict means limiting the rows returned on the result set.
  • Project means limiting the columns returned on the result set.
  • Product means combining all tuples in a relation to all tuples in another relation (Cartesian explosion)
  • Join means doing a product, and then filter the result to only the tuples we want.
  • Union means combining a table with similar structures.
  • Difference means listing the tuples in a relation which are not found in another relation.
  • Intersect means listing the tuples found in in both relations.

We also have other operations: outer join, outer union, aggregation (e.g. sum), group and ungroup.

A4. What A Relational Database Is

So a Relational Database is a database in the form of tables (relation), which consists of columns (attribute) and rows (tuples). The database is created, populated and queried using SQL language. Each column (attribute) has a domain (data type). Each table has an identifier (superkey or primary key) and can have foreign keys. In SQL we have join, product, union, project, restrict, difference and intersect.

Relation means table. The word “Relational” here means the relation between the Table Header (the data structure) and the Table Body (the data content). Not between one table and another table. It is based on Relational Model by Edgar F Codd in 1969.

SQL Server, Oracle, MySQL, DB2, Sybase, Teradata, Informix are examples of Relational Databases.

In a relational database, the data domain (data type) must be known first, before we can store the data. And secondly, the data structure must be designed first, before we can store the data in the database.

B. Non Relational Databases (aka NoSQL Databases)

Non Relational Databases are databases which are not in the form of tables, where the data attributes are put as table header, and the data is put in the table body. The attributes can be stored in the table body. Or they can be stored as a node or as a file. Or the data might not have attributes. The data might not be numbers, text and dates, but images, videos and documents. The data could be DNA sequences, social network, or chemical molecules. Storing them as text and numbers might require a lot of space, or might not be suitable to how the data is used/queried.

Non Relational Database is known as NoSQL, stands for Not Only SQL (No doesn’t mean no here, but it means not only).

There are various different types of Non Relational Database. The 5 most famous ones are #1 to #5 below: (#6 no longer exists and #7 is a combination of #1 to #5)

  1. Key-Value Pair Database
  2. Document Database
  3. Graph Database
  4. Object Database
  5. Columnar Database
  6. Hierarchical Database
  7. Multi Model Database

 

B1. Key-Value Database

The most widely used Non Relational Database is Key-Value database, aka Key-Value store. It consists of files consisting of 2 fields: Key field and Value field. It is very simple. Because there is no data type, we can store anything in there. The benefit of using Key-Value database is we do not need to define the data structure (the schema). The other benefit is that it can be distributed across many nodes (file servers), so the performance is highly scalable. But the winning benefit is the money: it can store a large amount of data at much lower cost than relational database.

A Key-Value database is ideal for storing large amount of unstructured data. In some implementation of the Key-Value Database we can store not only strings but also lists of string. In other implementation, the Key-Value Database is in-memory, boosting the query performance.

The main disadvantage of Key-Value Databases is that the data consistency is not guaranteed. After 1 second it is consistent, but before 1 second it is not.

Examples of Key-Value Databases are Redis, Aerospike, Riak KV, Berkeley DB, LevelDB and Apache Ignite. Amazon DynamoDB, Azure CosmosDB and Oracle NoSQL are not only a Key-Value Database, they are also a Document Database, which is why they are classified as a Multi Model Database (see point B7 below).

Hadoop is essentially Key-Value store, it stores Key-Value data. MapReduce operates on Key-Value data stored in Hadoop. But Hadoop, unlike Redis or Aerospike, is not a Key-Value database. Hadoop is not a database system, it is merely a file system (HDFS).

B2. Document Database

Document Database stores JSON and XML documents. It doesn’t stores binary files such as videos, PDFs and images. The word “document” in Document Database means JSON and XML, not a file we normally work in the office. Microsoft Office files such as xlsx, docx, pptx (2007 and above) are XML files, but they are zipped so they becomes binary. So Excel and Word files are not what we store in Document Database. This is why some people prefer to call it “Document-oriented Database”.

Why is it important to store JSON and XML files? Because they contain data, such as loan, claim or patient data. It’s just that the format is JSON or XML. Data is widely stored as text files, i.e. CSV or tab delimited. But the limitation is that it is not validated. If a row in a CSV file is missing a field, it causes problems when we try to import it using an ETL tool. So we prefer XML files. Using a Schema Definition (XSD), the maker of the file can automatically validate the file, e.g. the data type for each field is, what values are permitted, etc. So the consumer of the file is guaranteed that the file is “good”.

That was in 2012, when XML is the de-facto standard. In 2017 JSON practically replaces XML, and is now the de-facto standard. People prefer putting data into JSON files than into XML files because JSON files are smaller and the processing is faster and easier. And JSON files are easier to read than XML (link, link). You can’t win against these 3 arguments.

The power of putting data in JSON documents is the hierarchical navigation. If we put a product hierarchy into JSON, we can query all nodes under product A up to 3 levels deep, or all the siblings of product A under the same grandparent. Hierarchical query like this is the main advantage of using a Document Database, compared to a Relational Database.

In a Document Database, the data structure is contained within the document. Whether we use XML files or JSON files, the data structure is inside the data file. So we don’t need to define the data structure in advance, like in a Relational Database.

A Document Database enables us to update a JSON document (or XML), adding data into it or deleting data from it. If we want to add a new field, we just add it into the JSON document. There is schema or data structure that we need to modify like in a Relational Database.

Examples of Document Database are: MongoDB, RavenDB, Apache CouchDB, RethinkDB, RaptorDB, JasDB, IBM Cloudant.

B3. Graph Database

In Graph Database data is stored as nodes connected to other nodes using lines. For example, a Graph Database containing social network data, a node can be: a person, a company, a city or a place (e.g. a restaurant, a shop, a school). The line shows the relationship between 2 nodes. A line has a time property, which shows when the relationship occurred.

A line between 2 persons can be: “is a friend of” or “is a colleague of”. A line between a person and a company can be “is an employee of” or “is a customer of”. A line between a place and a person can be: “had dinner in” or “bought something in”. The time property of a line can be “on 3rd March 2016”, “from 4th April 2016 to 5th May 2017” or “every Monday”.

A property of a “person” node can be name, date of birth, address. A property of a “place” node can be name, type of place and rating.

In the case of a hospital, the node can be patient, healthcare professional, medication, ward and illness. In the case of an electricity company, the node can be customer, product, tariff, engineer and bill.

In the case of social media, examples of the queries on a graph database are:

  1. Who likes “Amari Noodle Bar” in the last 3 months?
  2. The friends of the people who likes “Amari Noodle Bar” (up to 3 hops) – where do they live (just street / post code, not the house number)
  3. What are the education background of people who have ever worked for both IBM and Microsoft?

As you can see from these queries, a Graph Database is a very powerful tool to have, if it is populated with good data. It is very different to a relational database.

Examples of Graph Database are Neo4J, GraphBase, Gephi, and InfiniteGraph.

B4. Object Database

Object Database stores data as objects. By objects I mean as in classes in Java or C#.NET. For example: a customer is an object, with name, date of birth and address as the attributes. The Java code defines this customer object, populates for 10 customers then stores these 10 customers in the Object Database (each customer is called an “instance”, and the customer itself is a “class”). The Java code can then update the address of customer #5, delete customer #3 and insert customer #11, all in 1 database call. It can also copy customer #6 into a new customer with all attributes having the same values except the name.

To query an Object Database, we use JPA (Java Persistence API). JPA is the object oriented version of SQL. Like SQL, JPA uses SELECT FROM WHERE GROUP BY.

For example: SELECT C.Name, C.Address.City FROM Customer as C WHERE C.DateOfBirth.Month = “December”

About the “C.Address.City” above: because it is an Object Database, Address is an object which has attributes, such as Street, City and Post Code. JPA also have JOIN, ORDER BY and SUB QUERY. JPA has UPDATE and DELETE (no INSERT) but this is not the recommended way to update objects.

Object Databases are not popular because object oriented languages such as Java and .NET stores their object data in Relational Databases. The OO language has an interface to save objects as tables in Relational Databases. This interface is called Object-Relational Mapping (link). For example, .NET has TableAdapter (link), which enables us to do this:

CustomerTableAdapter.Insert(Customer.Name, Customer.DateOfBirth, Customer.Address)

And to update or delete, just replace the word Insert above with Update and Delete. To query we use the standard SQL language, because the data is stored in the Customer table in Relational Database. This makes the query a lot easier, especially when aggregating data.

Examples of Object Databases are: ObjectDB, GemStone, Ignite ObjectStore, Ekky ObjectDatabase++.

B5. Columnar Database

Columnar Databases store the data in tables. But instead of storing it as rows, the data is stored as columns. For example, if we have a table 1 million orders, with 3 fields: date, product and quantity. The order table in the Columnar Database looks like this: (red numbers are the ID of the order)

Date: 1/11/2017: 1,2,3,4,5. 2/11/2017: 6,7,8,9. 3/11/2017: …
Product: Product1: 1,3,4,6,7,15. Product2: 2,5,8,9,10. Product3: …
Quantity: 1: 1,2,3,5,6,9,10. 2: 4,7,8,11,12,13. 3: …

Because orders with the same date are stored together, orders with the same products are stored together, and orders with the same quantity are stored together, then when we query how many orders are for Product2 and date = 2/11/2017 and quantity = 1, it will be very fast to return the result.

Examples of Columnar Databases are: Greenplum, MariaDB AX, MonetDB, InfiniDB, Apache Kudu.

B6. Hierarchical Database

In a Hierarchical Data Model the data is organised like a tree. Today there is no Hierarchical Database, not as a stand-alone commercial product. Today Hierarchical Data Models are now mainly implemented in Document Databases, such as MongoDB (see B2 above). In a Document Database we can do hierarchical navigation on the data, i.e. query the data hierarchically. For example: in a product hierarchy we can query all nodes under product A up to 3 levels deep, or all the siblings of product A under the same grandparent.

B7. Multi Model Database

Multi Model Databases can store data as Key-Value, Document, Graph, Object or Columnar. It does not have to be all five formats; two formats are enough to be called as Multi Model. Most Multi Model NoSQL Databases can store data as Key-Value and Document. AWS is arguably the most complete, it can store data as Key-Value, Document, Object and Columnar.

Oracle NoSQL, Amazon DynamoDB, Azure Cosmos DB, MUMPS, OrientDB

Reference:

  1. Paraha Sarathi, A Deep Dive into NoSQL: A complete list of NoSQL databases, link
  2. Josh Wyse, Why JSON is better than XML, link
  3. W3Schools, JSON vs XML, link
  4. Couchbase, Comparing document-oriented and relational data, link
  5. org, Definition of Object DBMS: link
  6. Wikipedia, NoSQL databases, link
  7. Wikipedia, Column-oriented DMBS, link

 

Blog at WordPress.com.