A Data Lake is a storage and analytic system which stores structured and unstructured data from all source systems in the company in its raw form. The data is queried, combined and analysed to get useful patterns and insights. It is built on a cost effetive Hadoop storage infrastructure, and can be dynamically scaled.
A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It usually keeps years of history and queried for business intelligence or other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.
Data Lake Advantages
Because data from all source systems are there in the data lake, the main advantage of data lake is it enables the analyst to get analytics and insight from multiple systems. Yes the data is still in their raw form, so would requires some processing, but it is a lot quicker (1-2 week) than building a data warehouse (6-12 months). And that is the second advantage: time to build.
The third advantage of data lake is its ability to store unstructured data such as documents and social media, which we can query and combine with structured data from databases.
The fourth and probably the most powerful business case for a data lake is the cost efficiency. In investment banking world for example, we can store market data (e.g. prices, yields, spreads, ratings) not only for the securities that we hold but all securities in the market, cheaper than if we store it in Oracle or SQL Server.
The fifth advantage of a data lake is the flexibility. Unlike Oracle or SQL Server, data lakes dynamically scales up. If this morning it is 2 TB, in the afternoon it can be 3 TB. As new data arrives, and we need new capacity, we can add storage easily. As we requires more computing power, we can get more computing power there and then, instantly. There is no need to wait a few days for adding a new node in a SQL always-on availability groups, or adding storage to the SAN, or extending the computing power of an Oracle Grid. This means that we do not need to spend a lot of money upfront paying for 3 TB data and 16 processors. We can just start with 300 GB and 4 processors, and expand when required.
Data Lake Disadvantages
The first issue is that the data lake technology is immature. The language which can query across database and unstructure files only has very limited features (each vendor has different language, for Microsoft Azure it is U-SQL). It is probably only 10% of what PL/SQL or T-SQL can do. We can solve this by putting QlikView or Tableau on top. We use U-SQL only to query individual table and we join the data in QlikView/Tableau and do further processing there.
The second issue is cost (but I don’t find this argument holds water). The issue is: it actually costs a lot of money to store data from all source systems in the company, let alone storing external market data which requires a lot of licence cost.
Let’s take Microsoft Azure Data Lake for pricing example. Being in the cloud, the price is only US$ 0.04 per GB and US$0.07/1 million transactions (ref: link). Let’s say that “data from ALL systems in the company” is 500 GB. And every day we store this data in the data lake. So 500 GB per day x 365 days = 182,500 GB per year x $0.04/GB = $7300 per year. Let’s say we have 10 million transactions per day, which is 3650 million transactions per year x $0.07/million transaction = $256 per year. So it is only cost us about $7500 per year. This is a very reasonable price to pay, to have “all the data in the company in one place”. Even 5 years later, when the volume grows to 5 times, it is only about $35,000 per year. Still very reasonable price to pay.
The third issue is performance. Because data lakes stores the data in its raw format, the query which joins the different data could be running like a dog. Luckily Data Lake is running on HDFS, a distributed file system, which is very fast. So yes it is slower than a data warehouse, but it is not too bad. We are not talking about 30 minutes to run a query, but something like 20-30 seconds (compared to DW query which is say 1-3 seconds).
Data Lake being “exploration tool”, I can tolerate a little bit of slowness. After we proof that (for example) the risk data we query is making money, then it can pay for creating a Risk data mart specifically for that risk analysis purpose.
The forth issue is skill. Data Lake requires a superb business knowledge. Because the analyst needs to join tables from different source system. What is the difference between Real Modified Duration and Effective Real Duration and Modified Duration? But this issue is the same whether we are building ad Data Warehouse or a Data Lake. Both requires good business knowledge.
I don’t find U-SQL is difficult to understand for people with SQL knowledge, which is a common skill. But how about Machine Learning? That is difficult to master, right? Yes, that is true, but it is worth paying an expert data scientist to discover the insight in the data, because this insight can be used to save cost or boost our revenue. The potential benefit is 10-100 times the salary of the data scientist.
So considering all the pros and cons above, I am in favour for creating a Data Lake. In addition to having a Data Warehouse of course.