Data Warehousing and Data Science

24 November 2022

Data Lake for Asset Management Companies

Filed under: Data Warehousing — Vincent Rainardi @ 6:36 pm

For many years I have been asking myself what the value of a data lake is, specifically for asset management companies. If an asset management company does not have a data warehouse (or any other data management solution), would I propose a data lake as the solution? The business needs are reporting and investment management. Let me spell it out:

  1. To provide portfolio managers and financial analysts with the latest, valid version of all holdings (IBOR and ABOR positions) and benchmark constituents, plus tearsheets for their meetings with various clients and companies, as well as for evaluating the companies they are looking to buy.
  2. To provide the client reporting team with performance reports, attribution reports, valuation reports and ESG reports to be sent to clients and to be published in the factsheets and KII documents.
  3. To report to regulators including the FCA and SEC on shareholding, OTC derivatives, trades, ESG, MiFID and SFDR, and to comply with the requirements for funds to be classified as sustainable.
  4. To provide the finance department with the assets under management (AUM) numbers, NAV and transactions.
  5. To assist the marketing and sales teams with various data requests for proposals (RFPs) and presentations to clients and prospective clients.
  6. To provide the risk and compliance team with risk analytics (spread duration, DTS, etc), exposures and VaR breakdowns, stress test results and compliance reports.

The reason I ask this question is that the data in asset management companies is mostly structured, if not 100% so. All six points above involve structured data. The data is tabular: it consists of columns and rows, and the data types are strings, numbers and dates. There are no audio files, images or videos. There is no Twitter data, no Facebook data, no documents.

A data lake, on the other hand, handles multimedia files well, and social media too. That is the key differentiator between a data lake and a data warehouse. Conversely, there are many, many superior features that a data warehouse has but a data lake doesn't. The top one is data integration. A data warehouse integrates data. Company information from various data sources (Bloomberg, Factset, Markit, MSCI, Trucost, etc.) is integrated in the warehouse: industry sector, for example, and the primary equity identifier such as ISIN, as well as company names, country of domicile and ratings.

Allocating proper IDs to each company is not a simple process. Anyone who has done this will testify how complex it is. There are many ISINs, SEDOLs and CUSIPs that a company has issued; which one should we choose? Yes, they are security IDs, not company IDs, but they are what other data providers such as MSCI and Trucost use, so we need to use ISIN. Should we use the Factset entity ID or the Bloomberg issuer ID as the main issuer ID (like the primary key)? There are many company names: the company name from Bloomberg is not the same as Factset's, and both are different from Refinitiv, MSCI, Trucost, S&P, FTSE and Rimes. One is "Company, Inc." and another drops the comma or the period; one uses "Corp" and another uses "Corporation", and so on through many other abbreviations. Which one should you use? Do you use the Factset proper name? Should you take the Bloomberg issuer hierarchy or Factset's?

The same goes for security integration. How do you do your waterfall matching? SEDOL first or CUSIP first? CUSIP 8 or 9? Should you use the Bloomberg security ID first? And ticker is a mess: with exchange, without exchange, with country, without country, Bloomberg ticker, Factset ticker, etc. The same applies to valuation data. Do you use start of day, end of day, mid month or month end? Do you use the latest available version, or the signed-off version? In a data warehousing project, the first thing we do is to establish credible issuer records, security data, ESG data, valuation data and performance attribution data, plus credible exchange rates, prices and risk data, by integrating them from various sources. Yes, it is a large project, but that is what is required for reporting.
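To make the waterfall idea concrete, here is a minimal sketch in Python/pandas of what that matching logic might look like. The identifier columns (sedol, cusip, isin, bbg_ticker) and the security_id field are hypothetical names chosen for illustration, not a real vendor layout.

```python
import pandas as pd

# Illustrative sketch only: column names (sedol, cusip, isin, bbg_ticker,
# security_id) are hypothetical, not from any specific vendor feed.
WATERFALL = ["sedol", "cusip", "isin", "bbg_ticker"]

def waterfall_match(incoming: pd.DataFrame, master: pd.DataFrame) -> pd.DataFrame:
    """Match incoming securities to the master list, trying each identifier
    in turn and only falling through to the next one for unmatched rows."""
    result = incoming.copy()
    result["security_id"] = pd.NA
    result["matched_on"] = pd.NA

    for id_col in WATERFALL:
        # Build a lookup from this identifier to the master security_id,
        # dropping blanks and duplicates so the mapping is unambiguous.
        lookup = (master.dropna(subset=[id_col])
                        .drop_duplicates(subset=[id_col])
                        .set_index(id_col)["security_id"])
        unmatched = result["security_id"].isna()
        hits = result.loc[unmatched, id_col].map(lookup)
        result.loc[unmatched, "security_id"] = hits
        result.loc[hits.dropna().index, "matched_on"] = id_col
    return result
```

The order of the WATERFALL list is exactly the design decision discussed above: whichever identifier you trust most goes first, and the later identifiers only get a chance on rows that are still unmatched.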

A data warehouse is integrated; a data lake is not. A data warehouse is a golden source (meaning it is credible); a data lake is not. A data warehouse has a DQ process (the data has been corrected and signed off); a data lake does not. Can you trust holding valuations coming out of a data lake? No. There is no robust sign-off process behind them. Can you trust a company ISIN coming from the data lake? No. It comes from a single source which hasn't been integrated or cross-checked against other sources.

A data lake is technically superior (in terms of the data formats it can handle) but functionally poor. So if your goal is reporting, the right solution is a data warehouse, not a data lake. If your purpose is data science (machine learning), then, the argument goes, a data lake is the right solution because machine learning processes raw data. Well, that's not quite right, because machine learning processes both raw data and calculated data. Say you'd like to predict a company's scope 1 carbon emissions using a time series, say with a CNN or RNN algorithm. That part uses raw data: the scope 1 emission number comes straight from the ESG data provider. But if you'd like to include the company's annual revenue, or its industry sector, ratings, country of domicile or market cap, then that is not "raw" data; it is "processed" data.

The company information coming out of the data warehouse is of much higher quality than what sits in the data lake. Why? Because annual revenue is a calculation. In the US, for example, we get the quarterly sales numbers from the 10-Q filings and sum them up. But which period are we using? Does it tally with the period of the carbon emissions? If the carbon emissions are as at 31st December 2020, and the company's financial year runs from 1st April to 31st March, which quarterly sales figures do you take? If there are several sources for sector, rating and country, then there is a waterfall process plus a data quality check to determine which one to use. All of this happens in a data warehouse. None of it happens in the lake, because by definition the lake stores only raw data.
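As an illustration of why "annual revenue" is a calculated number rather than a raw one, here is a small sketch of summing the trailing four reported quarters up to the emissions reference date. The figures, column names and fiscal periods are made up for the example.

```python
import pandas as pd

# Hypothetical quarterly sales from 10-Q filings for one company whose
# financial year runs 1 April to 31 March (the figures are made up).
quarterly_sales = pd.DataFrame({
    "period_end": pd.to_datetime(
        ["2020-03-31", "2020-06-30", "2020-09-30", "2020-12-31", "2021-03-31"]),
    "sales_usd_m": [200.0, 210.0, 195.0, 230.0, 250.0],
})

def trailing_annual_revenue(sales: pd.DataFrame, as_at: str) -> float:
    """Sum the last four reported quarters ending on or before the as-at date,
    so the revenue period lines up with the carbon-emission reference date."""
    eligible = sales[sales["period_end"] <= pd.Timestamp(as_at)]
    last_four = eligible.sort_values("period_end").tail(4)
    if len(last_four) < 4:
        raise ValueError("Fewer than four quarters available before the as-at date")
    return float(last_four["sales_usd_m"].sum())

# Emissions reported as at 31 Dec 2020: use the quarters ending Mar/Jun/Sep/Dec 2020.
print(trailing_annual_revenue(quarterly_sales, "2020-12-31"))
```

With a 1 April to 31 March financial year, the four quarters that line up with a 31 December 2020 emissions date straddle two fiscal years, which is exactly the alignment decision the warehouse has to make once, consistently, for every company.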

So there goes the argument that “for machine learning you need a data lake, not a data warehouse, because you need raw data”. The fact is machine learning predictions require both raw data and processed/calculated data.

For an e-commerce company which takes in Twitter feeds, or a share price prediction algorithm which takes public news as input, a data lake is ideal. But if your goal is reporting to regulators, investment desks, sales and marketing, risk and compliance, etc., then you need a data warehouse, not a data lake.

I have built a data lake for an asset management company (as part of a big team). And as I said, the data is structured. It is tabular: rows and columns. No images, audio files or videos. No Twitter feeds or Facebook exports. It is about holdings, valuations, portfolios, issuers, assets and attributions, the normal day-to-day asset management business data.

Monolith is dead

That said, I completely agree that monoliths are dead. A data warehouse as a monolith is not suitable for today. It takes a long time to build (1-2 years), and once it's built it's difficult to change because the regression testing takes ages. We need to build small data marts instead, each for one specific purpose: calculating the total AUM (with its breakdowns) for finance, month end and daily; SFDR reporting; client reporting. One mart for one purpose, and you'll have many marts, each with a different purpose. That way, if you change the portfolio dimension in one mart, the other marts won't be affected. There is no single version of the truth; there are many versions. The issuer hierarchy, for example, can be implemented in several ways, and fixed income is different from equity.

And if you do require a single version of the truth, for holdings, valuation or attribution for example, then you create a special mart that calculates it and publishes it to all the other marts. More on this publishing mechanism later.

That said, the mart doesn't have to be dimensional. No. It can be relational, it can be vertical. The data model in a mart needs to suit the business purpose of that mart. It might be serving a website; it might be for reporting, for analytics (BI), for machine learning, for calculating ESG numbers, or for calculating AUM. The data model in each mart needs to suit its purpose. It would be silly to insist on using a dimensional model (fact and dimension) all the time. Of course not. We have to tailor the data model to suit what the data is used for.

Because of this you might not want to call it a data mart, because a data mart traditionally means a dimensional model: a small data warehouse in dimensional format, consisting of fact and dimension tables, in the Kimball style (as in the group, not just the one man). As a fan of the Kimball method, I have found that the dimensional format has many advantages. But when it is used to feed a website, that's not the right format. When it is used solely for calculating ESG numbers from a Python notebook using Pandas and NumPy, dimensional is not the right format. Some applications require flattened tables, some require relational tables, some require a data vault format. No, definitely not "always dimensional". So you might not want to call it a data mart. You might want to call it a client application, or a business application database, along those lines, i.e. named after the client or business purpose. Call it an ESG data store, an AUM data store, a client reporting database, an SFDR database, etc. Remember: only for one specific purpose.

The advantage of this "only for one specific purpose" approach, from the management point of view, is the clarity of funding. It is easier to do the cost-benefit analysis and to monitor the payback, because the whole project is about that one purpose. Compare this with funding a traditional "enterprise scale" data warehouse. How do you budget for it? You have to take it from the core IT budget, the central budget, or combine many projects, and the payback could be 3 years, or 5. It is not financially attractive. On the other hand, a business application that serves only one specific purpose is hugely attractive. You don't mess around doing other things; you do just one thing, like calculating AUM. When you need another one, for SFDR say, you build another one: another application, with another database. The budget allocation could not be clearer, and neither could the payback.

Data Lake

That said, I would argue that there is a point in building a data lake for an asset management company. First, as a central data store where everything is kept. Second, as a landing area. Third, for data publishing.

A data lake by definition is centralised storage for raw data. Everything is stored there, from many data sources, external and internal. Even calculated data from various systems is stored there too. The advantage? Everything is in one place. And if you use Databricks for your data lake, then you can join the files like tables. You can use normal SQL to query the data. You can feed your BI tools and your reporting tools, and many, many business applications too, from your company website to your externally-facing client reporting portal. You can feed your internal data portal as well, where everybody can get any data they need.
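For example, here is a hedged sketch of what "joining files like tables" might look like on Databricks with PySpark: read Parquet files from the lake, register them as temporary views, and query them with plain SQL. The paths and column names are hypothetical.

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession already exists as `spark`; getOrCreate() keeps
# this sketch self-contained elsewhere. Paths and column names are made up.
spark = SparkSession.builder.getOrCreate()

holdings = spark.read.parquet("/mnt/lake/bronze/holdings/")
issuers = spark.read.parquet("/mnt/lake/bronze/issuers/")

holdings.createOrReplaceTempView("holdings")
issuers.createOrReplaceTempView("issuers")

# Plain SQL over files in the lake, as if they were warehouse tables.
exposure_by_sector = spark.sql("""
    SELECT i.industry_sector,
           SUM(h.market_value) AS market_value
    FROM holdings h
    JOIN issuers  i ON h.issuer_id = i.issuer_id
    WHERE h.valuation_date = '2022-10-31'
    GROUP BY i.industry_sector
    ORDER BY market_value DESC
""")
exposure_by_sector.show()
```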

Second, the landing area. I'd argue that, rather than staging or landing the data in the SQL database where the data mart is located, it is better to put it in the data lake. The reason is that we can then do incremental loading into the data marts / business applications, keeping them lean and performant.

Third, data publishing. If you have an application which calculates something used by many business applications, AUM for example, then it is a good idea for that application to read from the lake and publish the calculation results back into the data lake (but in another area, not the raw data area). For example, you could label the raw data area as Bronze and the AUM area as Silver. So the idea is that we use the data lake for publishing calculated numbers, to be consumed by other business applications (because all business applications read their data from the lake).
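A minimal sketch of that publishing step, assuming Delta format and hypothetical paths and column names: read the raw (Bronze) holdings, calculate AUM per portfolio, and write the result to a separate Silver location for other applications to pick up.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read raw (Bronze) holdings, calculate AUM per portfolio, and publish the
# result to a separate Silver area for other business applications to consume.
# Paths, column names and the Bronze/Silver labels are illustrative only.
holdings = spark.read.format("delta").load("/mnt/lake/bronze/holdings/")

aum = (holdings
       .filter(F.col("valuation_date") == "2022-10-31")
       .groupBy("portfolio_code", "valuation_date")
       .agg(F.sum("market_value").alias("aum")))

(aum.write
    .format("delta")
    .mode("overwrite")
    .option("replaceWhere", "valuation_date = '2022-10-31'")
    .save("/mnt/lake/silver/aum/"))
```

The replaceWhere option here overwrites only that valuation date in the Silver area, so the publishing job can be rerun for a single day without touching the history.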

Conclusions

  1. Asset managers: if your need is reporting, use a data warehouse not a data lake.
  2. Monolith is dead: build data marts (business apps) rather than a data warehouse.
  3. There is value for an asset manager in having a data lake (the three uses above).

Thank you to Robert Adler whose article inspired me to write this article:
https://www.portfoliobi.com/post/data-lakes-vs-data-warehouses-considerations-for-storing-data

1 November 2022

Power BI Datamart

Filed under: Business Intelligence, Data Warehousing — Vincent Rainardi @ 8:39 am

Like SQL Data Warehouse, Microsoft once again named a product by what it does: Power BI Datamart. It is an Azure SQL database, with a visual query designer so users don’t need to write SQL to get the data out to BI tools, Excel or ML algorithms. It also generates a Power BI dataset automatically, which we can use to create Power BI reports.

Loading data into Datamart

Power BI Datamart uses a built-in Power BI Dataflow for loading data into it. After we create a Datamart, we click on Get Data and use PowerQuery to connect to various data sources such as Azure SQL database, Azure Data Lake Storage, Databricks (Spark), Azure Synapse, HDInsight, Sharepoint, Excel files, text files, JSON files, Parquet files, API, Salesforce, Snowflake, Redshift, BigQuery, Oracle DB, IBM DB2, Teradata, PostgreSQL, MySQL, Access, SAP, Analysis Services and ODBC (link):

The above list of data sources is so comprehensive that I could not believe it at first. This is a game changer. Gone are the days when we struggled in SSIS, Informatica or ADF to load data from those varied sources. PowerQuery has the connectors and interfaces for every single one of those databases, files, data lakes and big data platforms.

We then select the tables and specify the transformations, and PowerQuery builds the ETL rules in that built-in Dataflow, creates the relationships and loads the data from those source tables into the Datamart. We can then see all those tables and relationships in the Table Tools tab:

Afterwards, we can create relationships (again) and measures. We can also hide the tables and columns that we don’t need.

In the workspace we automatically get a dataset:

We can schedule when we want to reload data from those data sources into the Datamart (called Scheduled Refresh). We can also set up incremental load, i.e. only changed rows are loaded into the Datamart, based on a column that we choose (known as the "watermark column", link):

Security and access control

Power BI Datamart uses row-level security and roles to restrict user access to the Datamart. In the example below (link), users are restricted so that they can only access rows with a category ID of 1 or 2. Once that role is defined, we can assign users to it, as we can see below:

Querying data

As I said at the start of this article, we can use a visual query designer to get the data out without writing SQL (link). As we can see below, the Visual Query Designer is like Markit EDM, SSIS and Informatica.

We can select rows, filter rows, select columns, split columns, remove duplicates, join tables, change data types, transpose tables and pivot columns. We can also do error handling, such as removing errors or replacing errors (link). And we can also create custom functions (link).

Of course we can also write SQL:

Accessing Power BI Datamart from SSMS

We can also query the data in the Power BI Datamart using SSMS (SQL Server Management Studio), and using Azure Data Studio too (link):

Not only can we use SSMS for querying data in the Datamart, but also for adding users and configuring security. We can also use SSMS to look at the query statistics. This is how it looks from SSMS (link):

We can see above that the Power BI Datamart tables are listed as views under the "model" schema. The relationships in the Power BI Datamart are stored in two views called metadata.relationships and metadata.relationshipColumns, as shown above (link).
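Because the Datamart exposes a SQL endpoint, the same queries can be run from code as well as from SSMS. Below is a rough sketch using Python and pyodbc; the server placeholder, the table name (model.DimCustomer) and the Azure AD interactive authentication option are assumptions to adapt to your own Datamart settings.

```python
import pyodbc

# Hypothetical sketch: copy the SQL endpoint connection string from the
# Datamart settings in the Power BI service. The table name and the
# authentication option are assumptions, not fixed names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your Datamart SQL endpoint>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)

cursor = conn.cursor()

# Datamart tables are exposed as views under the "model" schema.
cursor.execute("SELECT TOP 10 * FROM model.DimCustomer")
for row in cursor.fetchall():
    print(row)

# Relationships are exposed as metadata views.
cursor.execute("SELECT * FROM metadata.relationships")
print(cursor.fetchall())
```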

Conclusion

I think Power BI Datamart is a game changer.

  • It enables us to gather data from databases, data lakes, files and APIs into one data mart.
  • We can reload data every day/week and we can do incremental loading.
  • We can query the data using SSMS (and do admin tasks there too).
  • We can build Power BI reports easily (the dataset is automatically created).
  • We can query the data using the visual query designer or by writing SQL.
  • We can easily get the data out to BI or ML tools, and to Excel.

That is a very powerful tool. Anyone in data warehousing and BI should consider this tool. At least take a look.

And I think the name suits it well. The Power BI Datamart is indeed a good data mart. With all its bells and whistles.
