Data Warehousing and Data Science

20 June 2016

Data Lake vs Data Warehouse

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 6:22 pm

A Data Lake is a storage and analytics system which stores structured and unstructured data from all source systems in the company, in its raw form. The data is queried, combined and analysed to find useful patterns and insights. It is built on cost-effective Hadoop storage infrastructure, and can be scaled dynamically.

A Data Warehouse is a system that retrieves and consolidates data periodically from source systems into a dimensional or normalised data store. It usually keeps years of history and is queried for reporting, business intelligence and other analytical activities. It is typically updated in batches, not every time a transaction happens in the source system.

Data Lake Advantages

Because data from all source systems is there in the data lake, the main advantage of a data lake is that it enables the analyst to get analytics and insight from multiple systems. Yes, the data is still in its raw form, so it requires some processing, but that is a lot quicker (1-2 weeks) than building a data warehouse (6-12 months). And that is the second advantage: time to build.

The third advantage of a data lake is its ability to store unstructured data, such as documents and social media content, which we can query and combine with structured data from databases.

The fourth, and probably the most powerful, business case for a data lake is cost efficiency. In the investment banking world, for example, we can store market data (e.g. prices, yields, spreads, ratings) not only for the securities we hold but for all securities in the market, more cheaply than if we stored it in Oracle or SQL Server.

The fifth advantage of a data lake is flexibility. Unlike Oracle or SQL Server, a data lake scales dynamically. If this morning it is 2 TB, in the afternoon it can be 3 TB. As new data arrives and we need more capacity, we can add storage easily. As we require more computing power, we can get it there and then, instantly. There is no need to wait days, weeks or months to add a node to a SQL Server Always On availability group, add storage to the SAN, or extend the computing power of an Oracle grid. This means we do not need to spend a lot of money upfront paying for 3 TB of storage and 16 processors. We can start with 300 GB and 4 processors, and expand when required.

Data Lake Disadvantages

The first issue is that data lake technology is immature. The language which can query across databases and unstructured files has very limited features (each vendor has a different language; for Microsoft Azure it is U-SQL). It is probably only 10% of what PL/SQL or T-SQL can do. We can work around this by putting QlikView or Tableau on top: we use U-SQL only to query individual tables, then join the data in QlikView/Tableau and do further processing there.

The second issue is cost (although I don’t find that this argument holds water). The argument is that it costs a lot of money to store data from all source systems in the company, let alone external market data, which carries a lot of licence cost as well.

Let’s take Microsoft Azure Data Lake as a pricing example. Being in the cloud, the price is only US$0.04 per GB plus US$0.07 per 1 million transactions (ref: link). Let’s say that “data from ALL systems in the company” is 500 GB, and every day we store this data in the data lake. So 500 GB per day x 365 days = 182,500 GB per year, x $0.04/GB = $7,300 per year. Let’s say we have 10 million transactions per day, which is 3,650 million transactions per year, x $0.07/million transactions = $256 per year. So it only costs us about $7,500 per year. This is a very reasonable price to pay to have “all the data in the company in one place”. Even 5 years later, when the volume grows to 5 times today’s, it is only about $35,000 per year. Still a very reasonable price to pay.
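If you want to double-check the arithmetic, here it is as a small T-SQL calculation (the daily volumes and the unit prices are the assumptions above):

-- Back-of-envelope Azure Data Lake cost, using the assumed figures above:
-- 500 GB stored per day, 10 million transactions per day,
-- $0.04 per GB and $0.07 per million transactions.
DECLARE @GBPerDay        DECIMAL(18,2) = 500;
DECLARE @TxPerDayMillion DECIMAL(18,2) = 10;

SELECT
    @GBPerDay * 365 * 0.04        AS StorageCostPerYear,     -- 182,500 GB -> $7,300
    @TxPerDayMillion * 365 * 0.07 AS TransactionCostPerYear, -- 3,650 M tx -> $255.50
    @GBPerDay * 365 * 0.04
  + @TxPerDayMillion * 365 * 0.07 AS TotalCostPerYear;       -- about $7,556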

The third issue is performance. Because a data lake stores the data in its raw format, a query which joins the different data sets could run like a dog. Luckily a data lake runs on HDFS, a distributed file system, which is very fast. So yes, it is slower than a data warehouse, but it is not too bad. We are not talking about 30 minutes to run a query, but something like 20-30 seconds (compared to a DW query, which is, say, 1-3 seconds).

The data lake being an “exploration tool”, I can tolerate a little bit of slowness. Once we prove that (for example) the risk data we query is making money, it can pay for creating a Risk data mart specifically for that risk analysis purpose.

The fourth issue is skill. A data lake requires superb business knowledge, because the analyst needs to join tables from different source systems. What is the difference between Real Modified Duration, Effective Duration and Modified Duration? But this issue is the same whether we are building a Data Warehouse or a Data Lake: both require good business knowledge.

I don’t find U-SQL difficult to understand for people with SQL knowledge, which is a common skill. But how about machine learning? That is difficult to master, right? Yes, that is true, but it is worth paying an expert data scientist to discover the insights in the data, because those insights can be used to save cost or boost revenue. The potential benefit is 10-100 times the salary of the data scientist.

Conclusion

So, considering all the pros and cons above, I am in favour of creating a Data Lake, in addition to having a Data Warehouse of course.

 

18 June 2016

About NOLOCK

Filed under: Analysis Services — Vincent Rainardi @ 10:13 pm

When we issue a SELECT statement in SQL Server, it locks the table. Many people wonder: why does SQL Server need to lock the table?

While the SELECT is being processed, i.e. while SQL Server is reading the rows in the table, imagine that someone drops the table. What will happen? The DROP statement will have to wait until SQL Server finishes reading all the rows for this SELECT. (1)

What if someone truncates the table while the SELECT is reading the rows? Or deletes the rows that the SELECT was about to read? Same thing: the TRUNCATE or DELETE will have to wait until the SELECT finishes. (2)

What if someone updates the rows that the SELECT was about to read? Same thing: the UPDATE has to wait. (3)

What if someone inserts rows that fall within the SELECT criteria? Same thing: the INSERT has to wait. (4)

(1), (2), (3) and (4) are the purposes of locking the table when reading it, i.e. to get a consistent result from the beginning of the read until the end of the read.

This is the default behaviour of SQL Server. I call it “Clean Read”. Formally it is called the READ COMMITTED isolation level.
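A quick way to see this in action (a sketch; TABLE1 is an illustrative table):

-- Session 1: a deliberately long-running read. Even under the default
-- READ COMMITTED isolation level, the SELECT holds a schema-stability
-- (Sch-S) lock for its whole duration.
SELECT COUNT_BIG(*) FROM TABLE1 AS a CROSS JOIN TABLE1 AS b;

-- Session 2, while Session 1 is still running: DROP needs a
-- schema-modification (Sch-M) lock, which conflicts with Sch-S,
-- so it waits until the SELECT finishes.
DROP TABLE TABLE1;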

The opposite of Clean Read is Dirty Read, which happens when the data is changed before the read finishes, so that at one point we see a row, but a split moment later, when we try to update that row, we can no longer find it. Strictly speaking, in the SQL Server world a “Dirty Read” means reading data that another transaction has changed but not yet committed; a row changing between two reads in the same transaction is called a “Non-Repeatable Read”.

SELECT With NOLOCK

What if we do the SELECT with NOLOCK? Does SQL Server still prevent someone from dropping the table? Yes: nobody can drop the table while it is being read, even if the read is using NOLOCK. The DROP will only be executed after the SELECT has finished.

What about TRUNCATE? Can someone truncate the table while a SELECT with NOLOCK is reading it? No, the TRUNCATE will wait until the SELECT finishes.

What about DELETE? Can someone delete rows from the table while those rows are being read by a SELECT with NOLOCK? Yes, that is possible.

If you have a table with 100,000 rows and User1 issues SELECT * FROM TABLE1 WITH (NOLOCK), and at the same time User2 issues DELETE FROM TABLE1, then User1 will only get something like 15,695 rows*, because the other 85,000 or so rows were deleted.

*This is the result of my test. Obviously you may get different numbers, depending on various factors.

These 15,695 rows are not sequential but jumpy, i.e. User1 will get
rows 90351-90354, then jump to:
rows 90820-90824, then jump to:
rows 94100-94103,
and will not get rows 94104 to 100,000.

This is because the missing rows had been deleted by User2.
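Here is a minimal sketch of that test (the table name, shape and row count are illustrative):

-- Setup: a test table with 100,000 rows.
CREATE TABLE TABLE1 (Id INT PRIMARY KEY, Payload CHAR(200) NOT NULL DEFAULT 'x');
INSERT INTO TABLE1 (Id)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

-- User1:
SELECT * FROM TABLE1 WITH (NOLOCK);

-- User2, at the same time:
DELETE FROM TABLE1;

-- User1 gets only whichever rows the scan reached before User2's
-- delete removed them; the exact count depends on timing.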

Error 601

But sometimes, instead of getting the above rows, User1 will get this error message:
Error 601: Could not continue scan with NOLOCK due to data movement

This happens when “the page at the current position of the scan” is deleted. See link.
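If the query can safely be re-run, one common workaround is simply to catch error 601 and retry; a hedged sketch (TABLE1 is illustrative, and three attempts is an arbitrary choice):

DECLARE @attempt INT = 1;
DECLARE @done BIT = 0;
WHILE @attempt <= 3 AND @done = 0
BEGIN
    BEGIN TRY
        SELECT * FROM TABLE1 WITH (NOLOCK);
        SET @done = 1;          -- success, stop retrying
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() <> 601
            THROW;              -- not a data-movement error: re-raise it
        SET @attempt += 1;      -- error 601: try again
    END CATCH
END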

 

 

Domain Knowledge

Filed under: Analysis Services — Vincent Rainardi @ 9:59 pm

In my opinion one can’t be a data architect or data warehouse architect without good domain knowledge. In asset management, for example, the main dimension or reference data is Security (aka Instrument). A security has many attributes, such as:

  • Security IDs: ISIN, Sedol, Cusip, Ticker Code, Bloomberg ID, Markit RED, in-house ID, etc.
  • Industry/Sector: Barclays, Markit, JP Morgan, Merrill Lynch, GICS, UK SIC, International SIC, IMF Financial Sub Sector, Moody 35, Moody 11, etc.
  • Credit Ratings: S&P, Moody’s, Fitch, house rating, and numerous combinations of them.
  • Stock Exchange (where the security is listed)
  • Issuer, Parent Issuer (for a derivative, use the underlying’s issuer?)
  • Dates (maturity date, issue date, callable dates, putable dates)
  • Country: country of incorporation, country of domicile, country of risk
  • Currency: denomination currency
  • Asset class: inflation-linked, treasury, sovereign, IG corporate, CDS, IRS, options, futures, FX swap, etc.

And we have many measures, such as valuations (market value, exposure, P&L), analytics (yield to maturity, current yield, modified duration, real duration, spread duration, convexity, etc.) and risk (value at risk, tracking error, PV01, IE01, DV01, etc.).
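To make this concrete, here is a minimal sketch (with illustrative names and types, not a prescribed design) of how some of these attributes and measures might land in a dimensional model:

-- Security dimension: one row per security, carrying attributes like the above.
CREATE TABLE DimSecurity (
    SecurityKey      INT IDENTITY PRIMARY KEY,
    Isin             CHAR(12),
    Sedol            CHAR(7),
    Cusip            CHAR(9),
    Ticker           VARCHAR(20),
    AssetClass       VARCHAR(50),   -- e.g. 'Equity Index Option'
    CreditRatingSP   VARCHAR(10),
    CountryOfRisk    CHAR(2),
    CurrencyCode     CHAR(3),
    MaturityDate     DATE
);

-- Valuation/analytics fact: measures keyed by security and date.
CREATE TABLE FactValuation (
    SecurityKey      INT NOT NULL REFERENCES DimSecurity (SecurityKey),
    ValuationDate    DATE NOT NULL,
    MarketValue      DECIMAL(19,4),
    YieldToMaturity  DECIMAL(9,6),
    ModifiedDuration DECIMAL(9,6),
    PV01             DECIMAL(19,4)
);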

If we don’t understand what they mean, and how they relate to each other, then how can we design the data structures to store them properly? We can’t. As data architects we need to understand the definition of each of the attributes and measures above.

We also need to understand the meaning of each value. For example, one possible value of the asset class attribute is Equity Index Option. What does Equity Index Option mean? How does it differ from Equity Index Swap and Equity Option? We need to understand that, because by understanding it we will be able to tell the difference between Asset Class and Asset Type, see how these two fields relate to each other, and decide whether we need an Asset Sub Class attribute and what the hierarchy between Asset Class and Asset Sub Class should be.

Of course a business analyst needs to have the domain knowledge. But a data architect also needs to have it. A data architect who doesn’t understand the domain will not be able to do their job properly. A data architect who has worked for a pharmaceutical company for 5 years will not be able to design a database for a Lloyd’s insurance company without learning the domain knowledge first, and it could be 6 months before they reach the necessary level of understanding of insurance. Like a business analyst’s, a data architect’s job is industry specific.

I agree, not all industries are as complex as Lloyd’s underwriting/claims or investment banking. Retail, mining, manufacturing, distribution and publishing, for example, are pretty easy to understand, perhaps taking only 1-2 months. But healthcare, insurance, banking, finance and investment are quite difficult to understand, perhaps requiring about 6 months, or even a year.

So how do we get that industry knowledge? The best way is formal training in that industry sector, which can be either self-study or a class. Fixed-income securities, for example, can be learned from a book. Ditto the trade lifecycle. I don’t believe there is anything that one cannot learn from a book. Yes, some people prefer to go to a class and have a teacher explain it to them. But some people prefer to read it for themselves (myself included).

I agree that a developer doesn’t necessarily need industry knowledge: it is nice to have, but not mandatory, be it an application developer, a database developer or a report/BI developer. A project manager also doesn’t need industry knowledge; they will be able to do their job without it. But a data architect, like a business analyst, needs to have it. A business analyst needs the business knowledge to understand the business requirements. A data architect needs the industry knowledge to design the data model and to analyse where to get the data from.

 
