Data Warehousing and Data Science

21 January 2023

Elements of Data Architecture

Filed under: Data Warehousing — Vincent Rainardi @ 1:21 am

There are many elements of data architecture:

  1. Data models
  2. Data storage
  3. Data infrastructure
  4. Data ingestion
  5. Data calculation
  6. Data analytics
  7. Data access
  8. Data security
  9. Data sharing
  10. Data availability
  11. Data ownership
  12. Data standards
  13. Data dictionary
  14. Data quality

Note:

  • #3 includes cloud infrastructure, data platform, scalability
  • #10-14 plus #7 are included in data governance

In this article I’ll go through all those elements one by one, in order to paint a high-level picture of data architecture, before finally answering “What is data architecture?” That way our understanding of data architecture becomes more down to earth, rather than highly conceptual.

1. Data Models

Before we do anything else, we model all the business concepts and processes. The output of this exercise is called a data model. If the business has something called accounts, then in the data model we also have something called accounts. If the business has clients, in the data model we also have clients.

A data model defines what the data containers look like. To use an analogy, data is like water. It is stored in different kinds of water containers, such as bottles, cups, buckets and tanks. A data model defines what these bottles, cups, buckets and water tanks look like. It defines their shapes and sizes.

In the actual world, data is stored in database tables, or in files, or in documents. A data model defines the structure of these tables, files or documents. A data model also defines the relationships between those tables/files/documents.

So to summarise, a data model implements the business concepts and it defines the structure of the tables/files/documents where the data is stored. As a data architect, our role is to define those data models.
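As a small illustration (my own sketch, with made-up attributes, not something from the original post), the clients and accounts above could be expressed as two entities and one relationship:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical entities for illustration only; a real data model would be
# agreed with the business and expressed as tables, files or documents.

@dataclass
class Client:
    client_id: int            # primary key
    client_name: str
    country_of_domicile: str

@dataclass
class Account:
    account_id: int           # primary key
    client_id: int            # foreign key to Client: this is the relationship
    account_type: str         # e.g. "ISA", "Pension"
    opened_on: date
```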

2. Data Storage

Data storage is the physical place where the data is located. To use the water analogy, data storage is the actual buckets, cups and bottles. These bottles and cups can be located on shelves, in rooms, etc. The data model defines the shapes and sizes of the bottles and cups, and the data storage is the actual bottles and cups.

In the actual world, data storage means databases or data lakes. In databases, data is stored in tables. In data lakes, data is stored in files. In a document database, data is stored in documents. These databases and data lakes are physically located on storage devices such as a Storage Area Network (SAN) or Solid State Drives (SSD).

One property of data storage is capacity. Other properties are scalability and availability. Capacity means how much data can be stored. Scalability means how easy it is to stretch the capacity. If the current capacity is 10 terabytes, how easy is it to increase it to 100 terabytes? Will that impact the performance? In other words, will it be slower?

Availability means how much downtime could possibly occur, if something goes wrong. If the primary data storage goes down, will the system automatically switch to a secondary storage? How out of date is the secondary storage? How long does it take to make it up-to-date? If the primary network path from the user to the storage is not available, is there a secondary path? The downtime could be seconds or days depending on whether we have a secondary storage or not, and whether it is always kept up-to-date.

To summarise, data storage is the physical containers where the data is stored, like databases or data lakes, located in SAN or SSD. As a data architect, we design the data storage, and set the capacity, scalability and availability according to the requirements and infrastructure costs.

3. Data Infrastructure

The data storage is physically located in an IT infrastructure such as a data centre. In this day and age, the infrastructure is all cloud based, meaning that the data centre is owned by Microsoft, Amazon or Google, and we rent it from them, paying by the hour. In the old days, we needed to pay millions of dollars upfront to buy all the components of a data centre such as servers, storage and routers. Today we don’t pay a dime upfront. We only pay for what we used last month.

The cloud data centre also contains network and security resources such as VNETs, IP Addresses, APIs, active directory, roles, Front Door, key vaults (I’m using Azure terminology here but the concept applies to AWS and GCP too) as well as data resources such as storage accounts, files, folders, ADLS, SQL databases, Databricks, Python Notebooks and ADF pipelines.

All of that is what I call the “data infrastructure”. To summarise, data infrastructure is all the data resources, network resources and security resources related to data storage and processing. As a data architect, our job is to work with the network team to design and build the data infrastructure.

4. Data Ingestion

Data Ingestion is the process of loading data into the data storage. This process uses data pipelines, which are controlled by an Enterprise Scheduler such as Autosys, Active Batch or Control-M. It includes the audit logs, error handling, data quality monitoring, data ingestion monitoring and failure alerts.

As a data architect, our job is to design the data ingestion processes, according to the business requirements and the data models. We then hand over the design to the development team to be implemented. Once the development and testing are completed, the development team deploys the data ingestion pipelines to the production environment using release pipelines.
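As a rough sketch of what one ingestion step with audit logging, error handling and a failure alert might look like (the function names below are my own illustration, not a prescribed design):

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def load(source_path: str, target_table: str) -> int:
    # Stub for illustration: a real implementation would copy the file into the
    # target table (e.g. via a Spark job or stored procedure) and return the row count.
    return 0

def send_alert(message: str) -> None:
    # Placeholder: in practice this would notify the support team (email, Teams, pager).
    log.error("ALERT: %s", message)

def ingest_file(source_path: str, target_table: str) -> None:
    """Load one file into one table, with an audit trail, error handling and a failure alert."""
    started = datetime.now(timezone.utc)
    log.info("Ingesting %s into %s (started %s)", source_path, target_table, started)
    try:
        rows = load(source_path, target_table)
        log.info("Loaded %d rows into %s", rows, target_table)
    except Exception as exc:
        send_alert(f"Ingestion of {source_path} failed: {exc}")
        raise
    finally:
        log.info("Finished %s at %s", source_path, datetime.now(timezone.utc))
```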

Note on Release Pipelines

A release pipeline is a mechanism to deploy code from the code repository (Repo) to the dev, test and prod environments. So we don’t copy code from Dev to Test and from Test to Prod; instead we deploy code from the Repo to Dev and test it there. If required, we fix the code in the Repo. Then we deploy the code from the Repo to Test, and test it there. If required, we fix the code in the Repo. Then we deploy the code from the Repo to Prod.

The concept here is that we package the whole change as a “Release”, so that no changes get left behind. In the old days before DevOps, we didn’t package the changes into a Release, so sometimes “it works in Dev but not in Prod”. That’s because some changes got left behind. Using “Release pipelines” we test the functionality as well as the deployment.

5. Data Calculation

The other development that we need to do is Data Calculation. Data Calculation is all the processing and calculations that need to be performed on the data, to satisfy the business requirements.

Our role as a data architect is to guide the BA with regards to technical requirements and architectural design. For instance, perhaps the output of the calculations needs to be persisted, and if so, what is the best way to persist it? Perhaps the calculation is common across two applications, and if so, should it be built as a single piece of code that receives parameters?
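For instance, a calculation that is common to two applications could be built once as a parameterised function, along these lines (a sketch; the formula and numbers are purely illustrative):

```python
def market_value(nominal: float, unit_price: float, fx_rate: float = 1.0) -> float:
    """One shared calculation, parameterised so that both applications can call it."""
    return nominal * unit_price * fx_rate

# Application 1: value in local currency (no FX rate needed)
value_local = market_value(nominal=10_000, unit_price=101.25)

# Application 2: the same code, with an FX rate parameter, for base currency
value_base = market_value(nominal=10_000, unit_price=101.25, fx_rate=0.79)
```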

6. Data Analytics

Data analytics means analysing and examining raw data in order to:

  1. Present the data
  2. Find patterns and trends
  3. Draw conclusions and insights
  4. Make a prediction

There are 5 types of data analytics:

  1. Descriptive data analytics: the TRENDS & ANALYSIS in the past data
  2. Diagnostic data analytics: to find out WHY some issues happened
  3. Predictive data analytics: to forecast what WILL happen in the future
  4. Prescriptive data analytics: what should be done to make something happen in the future
  5. Cognitive data analytics: human-like intelligence including semantics and AI

The 5 above are known as the 2Ds, 2Ps and 1C. It doesn’t mean that we should be doing all 5. No. We could do just 1.

Doing data analytics is very much a BA job, and so it should be. As a data architect, our job is to provide a framework for how it should be built. For example, if the BAs are analysing the trends in the data, how should they do it? Meaning: which data should be used for this analysis? What tool should be used? Where should the result be stored? Should it be automated? The same questions apply when a Quant does forecasting using ML: which data, which tool, storing results and automation.

7. Data Access

Data Access means the mechanism by which users and applications are allowed to read from or write to the data storage. Users usually access the data using user IDs, whereas applications use service accounts. Both user IDs and service accounts are created in Active Directory, and then they are assigned to a role.

The role is then given certain permissions, for example: a BA role can write into database 1,2,3 and can read from fileshare 4,5,6, whereas a Quant role can write into database 11,12,13 and read from fileshare 14,15,16. When there are new users they are simply assigned to those roles.

Resources are then grouped together. For example, tables 1,2,3 are grouped into an Account Tables resource and fileshares 4,5,6 are grouped into a Product Fileshare resource. The role permissions are then changed to use the Resources rather than the actual databases or fileshares. Using the above example, a BA role can write into the Account Tables resource and a Quant role can read from the Product Fileshare resource.
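To make the Roles and Resources idea concrete, here is a tiny illustrative sketch (the mappings mirror the example above; this is not a real access-control system, just illustrative Python):

```python
# Hypothetical role-to-resource permissions, for illustration only.
ROLE_PERMISSIONS = {
    "BA":    {"write": ["Account Tables"], "read": ["Product Fileshare"]},
    "Quant": {"write": [],                 "read": ["Product Fileshare"]},
}

def is_allowed(role: str, action: str, resource: str) -> bool:
    """Check whether a role may perform an action (read/write) on a resource group."""
    return resource in ROLE_PERMISSIONS.get(role, {}).get(action, [])

print(is_allowed("BA", "write", "Account Tables"))     # True
print(is_allowed("Quant", "write", "Account Tables"))  # False
print(is_allowed("Quant", "read", "Product Fileshare"))  # True
```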

In Azure, for database access, we should also decide whether we will be using SQL Authentication or Windows Authentication or both. More on this in section 8 below (Data Security).

Who should approve data access requests should also be defined (usually the line manager + InfoSec).
Who should audit data access should also be defined (usually once every 3 or 6 months, by Production Support and the Development Head). It is very important to audit data access every quarter, and remove access which is no longer required.

What is our role here as a Data Architect? Our role is to work with InfoSec to define the mechanism. Using Roles and Resources as I described above is only one way. There are other ways, such as Azure Resource Groups and PIM (Privileged Identity Management) for production access. Best practice is usually achieved not with a single approach but with a combination of several approaches.

The data can also be accessed using APIs. As a data architect our job is to define the mechanism, again working with InfoSec. For example: key vaults, access tokens, role validation, firewalls, endpoints, etc.

8. Data Security

Like any security, Data Security is about Authentication, Authorisation and Accounting (known as AAA). Authentication is about identity. If someone claims to be User ID1 or Application Account2, how do we know that they are not lying? Authorisation is about permission. Is this user allowed to access that data? Accounting is recording how long the user accessed that resource and how many MB of data were retrieved or written. This accounting data is usually used for billing, trend analysis and capacity planning.

Data security is also about preventing data theft and cyberattacks. This means preventing unauthorised access, both external and internal, for example by using a firewall or Front Door. It is also about encrypting the data so that in the event of the data being stolen, the thief can’t read it.

Data security is not only about protecting the data when it is being stored, but also when it is being moved. For example, the data should be encrypted while it is in transit.

Data security is also about classifying which data is considered sensitive data. And controlling the access to that sensitive data.

Data security is also about preventing data loss, for example when the medium where the data is stored is damaged. So we need to provide regular backups, or a secondary server (not just RAID disks).

In the context of data security, “data” means the raw data i.e. the output of #4, but also the processed data i.e. the output of #5 and #6.

What is our role as a data architect in data security? We know the ins and outs of the data, we know the structure and the flow of the data, and we know how the data is stored and who accesses it. Our job as a data architect is to work with InfoSec to protect the data when it is being stored and when it is being moved, and to prevent data theft, unauthorised access and data loss.

9. Data Sharing

Data Sharing is a complement to Data Analytics and Data Calculations: rather than analysing or processing the raw data, we simply share the raw data with the users and applications. You might be underestimating Data Sharing because it looks simple, but it can provide tremendous value.

In its simplest form, data sharing is data access. We just allow the users to access the data, just like in point #7 above. But data sharing can also mean building a set of APIs, for example a REST API or GraphQL, so that the users can access the data programmatically, for example from their ML applications or from their BI reports.
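As an illustration of the API route (a minimal sketch assuming FastAPI; the endpoint, fields and data are made up, and a real endpoint would sit behind the authentication and authorisation described in #7 and #8):

```python
from fastapi import FastAPI, HTTPException

app = FastAPI(title="Data Sharing API (illustrative sketch)")

# Hypothetical in-memory data standing in for the data platform.
HOLDINGS = {
    "PORT1": [{"isin": "XS0000000001", "nominal": 10_000}],
    "PORT2": [{"isin": "XS0000000002", "nominal": 5_000}],
}

@app.get("/portfolios/{portfolio_id}/holdings")
def get_holdings(portfolio_id: str):
    """Return one portfolio's holdings so users and applications can consume it programmatically."""
    if portfolio_id not in HOLDINGS:
        raise HTTPException(status_code=404, detail="Portfolio not found")
    return HOLDINGS[portfolio_id]
```

The same endpoint could then serve the website, the FTP export job and the programmatic consumers, which is the single-code-base point made below.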

Data Sharing is also about sharing internal data with external users such as clients and regulators, for example via FTP, an API (as above) or a website (see below). I’d like to point out here that, whether the user accesses it via a website, FTP or an API, the underlying mechanism should be the same (single code base), i.e. using the API.

Data Sharing is also about building a website for internal or external users to enable them to view, filter and export the data. They can export it into a text file (CSV or pipe delimited), PDF, JSON or Excel. Or they can save the filter settings (giving it a name), and schedule the export to run daily or weekly so that they get the latest data every time.

In Data Sharing, Data Access and Data Security are paramount (see #7 and #8 above), particularly if it is for external users. Here’s an example for Data Access: in the data sharing website, if a user chooses Client X, how do we know that this user is allowed to see Client X’s data? Here’s an example for Data Security: how do we protect the data from being stolen? Or being destroyed?

Data Sharing is not about building reports and giving users access to those reports. No, that’s Data Analytics. Data Sharing is about allowing users and apps to access our data, such as via an API, a website or FTP. As a data architect, our job is to design the mechanism and hand it over to the Dev team for development. When designing the mechanism, we work with InfoSec to ensure that the mechanism is secure, i.e. the data is encrypted during transport, and the access is tightly restricted and auditable (who accessed what and when is logged).

10. Data Availability

Data Availability is about keeping the downtime to a minimum. The data is always available to the users, to the reports and to the APIs, 24×7. Data Availability is a bit of an art, because the database server might be up and running, but if the network path from the users to the database is having a problem, the users can’t access the data. Ditto for the reports or APIs.

The last thing we should do in cases like that is to blame each other (“it’s not a database issue guv, it’s the network”). Business users couldn’t care less whether it is the left hand of IT or the right hand making the mistake, and they are right. From their point of view we are all IT. The right thing to do is to have a Teams session between the DBA and the network team to work it out and resolve the issue. The data architect, InfoSec or the Dev team might join in as well, depending on the issue.

Data Availability also means this: the data is in server X or application Y and the user doesn’t have access to it. In other words, the data is in the building, but it’s not available to the users. In this case it is not about downtime at all. It is not even about operations. It is about development prioritisation. If we allowed everyone to access every piece of data in the company, it would be spaghetti. There’s a reason why we adopt a data warehouse or data lake as the data architecture, one of which is so that users, reports and APIs can access all the data in one place.

What’s our role as a data architect in Data Availability? Our role is to define the ingestion mechanism for new data. And for production issues, our job is to help resolve them, often by describing the security architecture, the data architecture and the data flows (data ingestion or publication).

11. Data Ownership

Data Ownership means that every piece of data must have an owner. Not the technical owner, but the business owner. For example, billing data is owned by the finance department, and inventory data is owned by the production department.

We data architects can easily create entities such as “policy” and “client”, but who owns those data entities? It’s not always a single owner. Master entities such as policy and client can have 2 or more owners. In insurance companies the policy data is usually owned by the product team, not underwriting, whereas clients are jointly owned by the Sales & Marketing team and the Customer Service team.

The reason every piece of data must have an owner is that if there is a dispute with regards to the definition of a particular piece of data, the owner has the final word. Ditto when there is a dispute with regards to the sourcing of a particular piece of data, or its data quality checks. See #12, #13 and #14 below for more details.

But Data Ownership is a matter for Data Governance. What’s it to do with us data architects? Data Architects design the entities in the data model, so of course we should work with the data owners when doing data modelling. And also when we are designing data ingestion, data access, data security, data analytics, data calculations and data sharing. In almost every single thing that we do, we need to work with the data owners.

We know when a particular piece of data has no owner, and we can bring that up at the Data Governance forum/meeting.

12. Data Standards

Data Standards are conventions about how the data models should be designed. For example, we have standards on data types. We have standards on the naming of data entities (and other data objects). We have standards on the data ingestion process, e.g. the naming of pipelines and resources, the way connections are established, the way ingestion pipelines should be built, etc.

We have standards on referential integrity between entities, and on the naming of primary and foreign keys. We have standards on collations, partitioning strategy, index rebuilds and statistics updates.

We have standards for the Data APIs, both the APIs for Data Sharing and the APIs for Data Ingestion: the naming of endpoints, data types for parameters, the format of the return values (e.g. JSON), expected response times, etc.

As a data architect it is our job to define those standards and monitor their implementation. It is sometimes annoying when a “pull request” satisfies everything except one count of naming standards, when everything else is working except one entity name not following the standards. But we need to be disciplined with regards to following the data standards. If we allow one exception the whole thing will look funny.

Note: a pull request means merging the code from a “feature branch” into the “release branch”. The feature branch is where we do our development and unit testing. The release branch is where we package the code for releasing it into the Test or Production environment.

13. Data Dictionary

A data dictionary is a collection of the business terms used in the whole data platform. Yes, this is very much in the Data Governance domain, but when designing the data model we Data Architects also have to define every single column in every entity.

When designing the data calculations we also have to define every single calculation. Yes, defining calculations is very much a BA task, but a Data Architect also needs to read those definitions. Quite often, the BA will need to ask the Data Architect where a particular piece of data should be sourced from, especially when there are several possibilities. So in a way, Data Architects are also involved in the definition of those calculations.

A data dictionary also contains the definitions of all the source columns. These are obtained from the data providers. If the data is sourced internally, then we need to ask the SME of that source system what every column means.

A data dictionary also contains the definitions of all the fields in every report. Again this is very much a BA task, but a Data Architect is often involved in the sourcing of the elements that make up that particular field. For example if field X in the report is Nominal x Unit Price x FX Rate, the BA would ask us Data Architects about where the Nominal, Unit Price and FX Rate should be sourced from (which column in which table, and potentially which row).

Creating a data dictionary is a mammoth task. We are talking probably 6-12 months of 1 FTE. Maintaining a data dictionary is not an easy task either, because new data keeps cropping up and new components keep being built. They all need to be added into the data dictionary.

With regards to the Data Dictionary, as data architects we take a back-seat role. Our job is to make ourselves available to the Data Governance team and the BAs for them to consult and ask us questions. Quite often those questions are about data definitions and data sourcing (what does this mean, and where can I find X).

14. Data Quality

We finally arrive at the last element: Data Quality. Data Quality is checking that the data is correct and good for use. When the data is ingested into the data platform, before any calculation is made, we first check the quality of the data. By that I mean these 5 factors: data existence, data format, data timeliness, data completeness and data correctness.

  1. Data Existence means that the data file for that day exists. Say every day we expect to receive 12 files from a data provider, and today we only receive 10 files, meaning that 2 files are missing. That’s data existence.
  2. Data Format means that the physical format of the file is correct. Say we expect file X to be in comma separated format (CSV), which it usually is, but today it’s pipe delimited. That means it fails the data quality check. Another example is if we expect the file to have 10 header lines but today it contains 11 header lines. Or we expect the file to contain 40 columns (based on the agreed file spec with the data provider) but today it contains 41 columns. Or column 1 should contain a number, but today it contains a letter, e.g. 1.2E-5. Or column 2 should contain a date in YYYY-MM-DD format, but today it contains a date in DD/MM/YYYY format. Or blanks should be expressed as blank strings, but today they are expressed as “NULL”.
  3. Data Timeliness means that a certain file should be in by a certain time. For example, a monthly file should be received by the first Monday at 9am, and a daily file by 7am. Our Data Quality check will check the existence of the file at 7am (or the first Monday at 9am) and if the file is not there then it fails the timeliness check.
  4. Data Completeness is about the content of the file. If the file usually contains around 10,000 rows and today it contains 1,000 rows then it’s not complete. Usually we can tolerate 10% to 20%, depending on the data.
  5. Data Correctness is about the accuracy of the data. If the average of column 1 in the file is usually around 55 and today the average is 85, then it fails the correctness check. Usually we can tolerate 10% to 20%, depending on the data. Usually we compare it with the last 3 days’ average, or with the previous month. We can also use the median or quartiles. For text values we count. For example, if out of 10,000 rows the values are usually about 5,000 “High”, 3,000 “Medium” and 2,000 “Low”, then today they should be around those counts too. We can replace High with 3, Medium with 2 and Low with 1, so we get 5,000×3 + 3,000×2 + 2,000×1 = 23,000. Ditto with dates: we count the years or the months.

Those are the 5 categories of Data Quality checks/rules. But how about the Data Quality mechanism? How is it done? It needs to be flexible, not hard coded, so that when we need to change it we don’t need to change the code, just the parameters, which are stored in a config file. In the old world, the data files were loaded into staging tables first, then we performed the DQ checks on those staging tables using SQL. But in a modern data platform, the DQ checks are performed on the data files themselves, not using SQL but using Python notebooks.
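As a rough illustration of that config-driven approach (the thresholds, column name and rule names below are made up), a DQ check in a Python notebook might look like this:

```python
import pandas as pd

# In practice these parameters would live in a config file (e.g. JSON or YAML),
# so changing a threshold means changing the config, not the code.
DQ_CONFIG = {
    "expected_columns": 40,                # format check: agreed file spec
    "min_rows": 8_000,                     # completeness: usual ~10,000 rows, 20% tolerance
    "column1_avg_range": (44.0, 66.0),     # correctness: usual average ~55, +/- 20%
}

def dq_check(df: pd.DataFrame, config: dict) -> dict:
    """Run format, completeness and correctness checks; return pass/fail per rule."""
    results = {
        "format_column_count": len(df.columns) == config["expected_columns"],
        "completeness_row_count": len(df) >= config["min_rows"],
    }
    low, high = config["column1_avg_range"]
    results["correctness_column1_avg"] = low <= df["column1"].mean() <= high
    return results   # in practice, written to a DQ results table/file and shown on a DQ report
```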

Where are the results of these DQ checks stored? In a table in the data warehouse or a file in the data lake. This table or file is then displayed on a DQ report.

Data which fails a DQ check is marked. There should be a column in each table for this, e.g. a “Pass DQ Check (Y/N)” column and/or a “DQ Rule ID” column.

What about data correction? It’s best to do data correction manually rather than automatically, but first we need to try correcting it at source i.e. ask the data provider to send a corrected file.

So Data Quality is checking and reporting the data existence, format, timeliness, completeness and correctness, before the data is used by applications and in the APIs. As a Data Architect, our job is to define the mechanism for doing the Data Quality checks and for logging and reporting the results, as well as the mechanism for marking the data which failed the DQ checks and for data correction. We must also work with the BAs and Data Governance to define the Data Quality rules.

CONCLUSION

So those are the elements of Data Architecture. It is clear now that Data Architecture is not just about Data Models, but about 13 other things too, including Data Security.

Of course different people have different opinions. I welcome comments and suggestions, either below on WordPress, or on LinkedIn (if you accessed this article through LinkedIn), or you can email me at vrainardi@gmail.com. I hope this article has been useful for you.

30 December 2022

Data Lake in Azure: Databricks or Synapse?

Filed under: Data Warehousing — Vincent Rainardi @ 4:17 pm

Which one is better for a data lake in Azure: Databricks or Synapse Analytics? Databricks is easy to query using SQL from Power BI, Qlik or Tableau, as it has a SQL endpoint and Partner Connect. On the other hand, Synapse Analytics can run on a SQL database, on a Spark cluster or on a storage account, all of which use SQL so they are easy to query too.

In terms of loading data in, both of them use Azure Data Factory (ADF). ADF has a Databricks Notebook activity which enables us to transform data and ingest files from an Azure storage account (either containers or file shares) into Databricks tables on ADLS2. This ETL process can be automated using Delta Live Tables and Auto Loader (link).

Both ADF pipelines and Synapse pipelines can be used to ingest data into Synapse Analytics from many data sources using the Copy activity. Synapse Analytics has Data Explorer, which can pull data in batches or streaming from various sources, validate the data and do data conversion too.

Synapse Analytics has changed a lot in the last 2 years. Previously it was synonymous with SQL Data Warehouse (previously known as Parallel Data Warehouse, PDW). It runs PolyBase, a distributed query engine. The main drawback was that this SQL DW or PDW was very expensive (this was 2010: link). Today the cost perception has completely changed: with $50/month we can start development on Synapse. I created a Synapse workspace with a SQL Pool and a Spark Pool with 3 nodes, disabled after 15 minutes of inactivity, with no Data Explorer, and ended up with about $10/month. So it is possible for small and medium companies to use it, unlike SQL DW 10 years ago.

The key to the answer, I think, is the “data lake”. How does Databricks compare with Synapse when they are used as a data lake? To be used as a data lake, Databricks is better, because we can easily create metadata-driven data ingestion. Meaning that when we have new data (from a file or a table), we can automatically create the Databricks ingestion notebooks to ingest the new data into the data lake.
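To illustrate what I mean by metadata-driven ingestion (a minimal sketch assuming PySpark on Databricks; the metadata entries, paths and table names are made up), the notebook reads a metadata list and ingests each entry the same way, so adding a new source means adding a metadata row rather than writing a new notebook:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical metadata; in practice this would be a config table or file.
INGESTION_METADATA = [
    {"source_path": "/mnt/raw/trades/*.csv",        "format": "csv",     "target": "bronze.trades"},
    {"source_path": "/mnt/raw/positions/*.parquet", "format": "parquet", "target": "bronze.positions"},
]

for entry in INGESTION_METADATA:
    df = (spark.read
                .format(entry["format"])
                .option("header", "true")
                .load(entry["source_path"]))
    # Append into a Delta table in the lake; the same loop handles every source.
    df.write.format("delta").mode("append").saveAsTable(entry["target"])
```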

On the other hand, if it is for a data warehousing platform, Synapse is better, because it has 3 modes, including serverless SQL pool endpoints that we can use to query data files in Azure Data Lake using SQL (link). We can then create a logical data warehouse on top of the raw data without moving the data (see my article here). We can transform the data in the data lake using Transact-SQL and load it into a dimensional model in a Synapse SQL database.

I mean, the difference between them (for a data lake) is not that big. In Synapse, data scientists can also query the data lake using OPENROWSET. And with a bit of effort we can programmatically create metadata-based ingestion in a Synapse pipeline (not using Copy Into, but using Python Notebooks and SQL Scripts).

Conclusions

  1. Synapse today is very different to two years ago, so take a look and try it. It is still changing rapidly, so two years from now it will have more new features and may look different.
  2. Which one is better for data lake in Azure: Databricks or Synapse Analytics? In my opinion Databricks is slightly better for data lake, but as a complete package for Data Lake + Data Warehouse, Synapse Analytics is better (compared to Databricks Lakehouse).
  3. Both platforms are changing rapidly. For example, a year ago we didn’t have Lakehouse, Photon, SQL Warehouse or Partner Connect on Azure Databricks. And a year ago there was no Data Explorer in Synapse Analytics, and Synapse has completely changed. Two years ago Databricks was way better than Synapse for a data lake, but today the gap has narrowed a lot. A year from now the status could change or reverse.

PS. Note that what I’m talking about here is Azure. Databricks in Azure. Not in AWS.

Would welcome your opinion and comments.

28 December 2022

Share it, don’t move it

Filed under: Data Warehousing — Vincent Rainardi @ 9:32 am

I came across JP Morgan Chase’s implementation of Data Mesh in AWS (YouTube, AWS, Dave Vellante), and was amazed by the concept of “don’t move the data, but share it”. Many years ago when I was doing a data platform project in a large bank in London, the bank insisted that we should not move the data. They already had a data store, and didn’t want to create another data store. So the concept was not new, but today with data mesh it has become much more popular. Fannie Mae did it (link), Go Daddy did it (link), and many others are implementing this concept of “don’t move the data”.

In my book (2007) I called it Enterprise Information Integration (A below), where the data is shared. Compare this with a Federated Data Warehouse (B below) where the data is moved:

OpenFin

OpenFin technology (link) enables companies to consume data from financial companies such as Factset and Refinitiv, as well as their own internal data. We can use the OpenFin desktop to integrate data from various data sources, without moving the data. That is a new concept. For 20 years I have been integrating data into databases and data lakes. But with OpenFin, we integrate data directly on the user desktop, without integrating it first in a database or a data lake (link).

So the core idea is, rather than integrating data in a data warehouse (or a data lake), companies should be integrating data directly on the user desktop, as shown below. Data from Refinitiv, Salesforce, Factset, S&P Global, etc. can be integrated directly on the user desktop (B), without being integrated first in a data warehouse (A). We can also create an application which publishes the data from the data warehouse to the user desktop (C).

OpenFin is a desktop/workspace. It is like a browser where you can open many tabs or boxes, each tab or box containing an application. The key feature here is that the tabs and boxes are connected and synchronised. If we change the data in one box, the data in all the other boxes changes automatically. For example, if in Box 1 you change from product A to product B, Box 2 would change from displaying, say, the sales and costs for product A to those for product B. That sales and costs data can come from within your company, or it can come from another company, shared via a protocol called FDC3 (link).

So there are 2 new concepts:

  1. Don’t move the data but share it.
  2. Integrate data directly on user desktop, not in a database or data lake.

Data Mesh

That concept of “don’t move the data but share it” is probably best achieved using Data Mesh. What is Data Mesh and why does it play a critical role in implementing the “data sharing” concept?

The Data Mesh concept was introduced by Zhamak Dehghani in 2019. First, please read her article on Data Mesh: link. There is no point in me rewriting it here as she explains it very clearly there.

Data Mesh is based on 4 principles. One of them is the self-service data platform. This data platform should be cloud based (SaaS if possible), where data is treated as products that live in separate business domains.

It should be self-service, meaning data users should be able to get the data themselves. For that it is critical that the platform provides two things:

  1. A searchable data catalogue to enable the users to find the data they need.
  2. An API which enables the users to download the data, or their applications to consume the data.

Data Processing

That concept of “don’t move the data but share it” is very difficult to do in practice. This is because of the processing required before the data can be consumed. Take for example the performance of the funds (3 months, 1 year, etc.), which needs to be attributed to various industry sectors, asset classes and currencies, down to the instrument level (or even trade level). Once this processing is done, we will need to store the results. So very quickly, we realise that we need to store the data.

But the output of the attribution calculation is not the raw data. It is new data. If your application produces new data, then you should be publishing it back to the data platform (e.g. the data lake or data warehouse), so that other data consumers can use that new data.

In fact, the discipline of publishing the results of data processing back into the data platform is the cornerstone principle of the whole “share it, don’t move it” concept. Otherwise the entire thing won’t work.

Data Lake

One last thing: because of this “publishing calculation results back to the data platform”, a Data Lake is more suitable than a Data Warehouse as the data platform. A Data Lake is more versatile for that job, and more cost effective.

27 December 2022

Difference between Extract and Ingest

Filed under: Data Warehousing — Vincent Rainardi @ 9:57 am

What is the difference between Extract (as in ETL = Extract-Transform-Load) and Ingest (as in data ingestion)?

Difference #1: The direction of the flow

  • Extract means pulling data out of the operational system.
  • Ingest means putting data into an application.

So the difference is clear. Extract is pulling data out, whilst Ingest is pushing data in.

Difference #2: The reason for doing it

The reason we extract data out of the operational system is to populate a data warehouse. Or a data mart. So we extract in order to store it in another place.

The reason we ingest data into an application is to use it for analysis. Or for prediction (ML). Or for calculating something.

So it is clear that the reason we pull data out of an operational system is to store it (not to use it directly), whereas the reason we push data into an application is to use it directly.

Difference #3. The transformation

The third difference is transformation:

  • We extract data in order to transform it. To change the format. And then to store it, in a different format. For example, in dimensional model format.
  • We ingest data not in order to change its format, but to use it. To consume it.

Data Lake

What about data lake? Is it ingest or extract + load?

It is to ingest. Because we don’t transform the data. We don’t change the format.

Difference #4. End to end picture

It would help our understanding if we look at the end-to-end picture. There are 3 use cases below: A, B and C.

In A, the data is extracted into a data warehouse/mart to be used for ML.

In B, the data is ingested into a data lake to be used for ML. Note the use of “ingest”.

In C, the data is ingested into a data lake to be extracted and loaded into a data warehouse, before it is used by a Business Intelligence application.

18 December 2022

Building Prod and Dev Environments

Filed under: Data Warehousing — Vincent Rainardi @ 5:54 pm

About 20 years ago, in a leafy little town in Surrey, I had a week of IT training. Every year, the company paid for each member of staff to have a development training course of their choosing. That year I chose “Iterative Development”, after choosing .NET the previous year. The methodology was called Rational Unified Process, which is an iterative development process. Rational is now part of IBM.

In that training class we were taught that it is better to deliver things incrementally. So if the software we build has 10 different parts, for the first release we should only do 1 part. In the second release we do another part, and so on.

What was remarkable, however, was that the first release was the most difficult, because we would need to set up the development and production environments. Set up the architecture, if you like. So if we call the architecture build Release 0, and then we have 10 subsequent releases delivering the various functionalities, the effort for Release 0 would take a very large chunk of the whole project (say 25%).

That was why the training instructor said to me that in the first release you should only deliver minimal functionality, because you have all this architecture and environment to deal with: procuring database servers, web/application servers, setting them up, installing storage and configuring it, dealing with network configuration, and so on. Setting up the development environment, code repository, test harnesses, release pipelines, and so on.

I understood that infrastructure and architecture took a lot of effort because, prior to that training in Surrey, I had had several experiences configuring IT infrastructure. I had to configure an HP Unix server + storage in one project, and a cabinet of SQL Servers + storage in another. I worked as an environment manager on a mainframe system in one project, and set up LANs for Ethernet and token ring in another.

That principle was ingrained in my mind. Since then I’ve worked for no fewer than 10 companies, delivering various data platforms (mostly data warehouses, big data and data lake platforms). And I always try to apply that principle: deliver the architecture first, both the production environment and the development environment. That is an enormous undertaking.

I call it “delivering 2 numbers”. In the first release (what I called Release 0), we would just deliver 2 numbers. Not even to the screen, just to a database or data lake. Not even regularly, just a one off. But it is in production. The development environment may not be set up properly. Neither may the test environment. The release pipeline may not be working properly. The user access might not be set up. We may not be using a service account or a key vault. But the data load works in production. And it is only delivering 2 numbers (well, not literally 2 numbers, but a minimum amount of data, like just 1 table).

If we have to deliver 100 tables to the data warehouse/lake and build 10 different reports out of it, the effort for delivering the project is probably something like this: (all items include design, build and test)

  1. Build development environment (including release pipeline): 15%
  2. Build production environment (including scheduling): 20%
  3. Build test environment (including automatic testing): 15%
  4. Setup security: 10%
  5. Build data load for 100 tables: 20%
  6. Build 10 reports: 10%
  7. Setup operational process: 10%

So the environment (including security) is 60% of the effort, whereas the business functionality (data load + reports) is 30%. So the environment is about twice the functionality.

And that is what the business users often don’t realise. “How come you guys have been working for 3 months and not a single report has been built yet? I mean, there are 5 of you! What have you been doing?”

Building the environment is like digging the foundation of a building. It takes a lot of time, and nobody appreciates it. When the foundation is built, the structure of the building can be built on it. Then the kitchen, flooring, etc. can be done very quickly. But it is the foundation that takes up a lot of time. For a long time we don’t see the results.

I hope this was helpful. As usual if you have any comments or questions just click the link below. Or you can email me at vrainardi@gmail.com.

24 November 2022

Data Lake for Asset Management Companies

Filed under: Data Warehousing — Vincent Rainardi @ 6:36 pm

For many years I have been asking myself about the value of a data lake specifically for asset management companies. If an asset management company does not have a data warehouse (or any other data management solution), would I put in a data lake as a solution? The business needs are for reporting, and for investment management. Let me spell it out:

  1. To provide portfolio managers and financial analysts with the latest, valid version of all holdings (IBOR and ABOR positions) and benchmark constituents, and tearsheets for their meetings with various clients and companies, as well as for evaluating the companies they are looking to buy.
  2. To provide the client reporting team with performance reports, attribution reports, valuation reports and ESG reports to be sent to clients and to be published in the factsheets and KII documents.
  3. To report to regulators including the FCA and the SEC on shareholdings, OTC derivatives, trades, ESG, MiFID and SFDR, and to comply with the requirements for sustainable funds.
  4. To provide the finance department with the asset under management numbers (AUM), NAV and transactions.
  5. To assist the marketing and sales teams with various data requests for proposals (RFPs) and presentations to clients and prospective clients.
  6. To provide the risk and compliance team with risk analytics (spread duration, DTS, etc), exposures and VaR breakdowns, stress test results and compliance reports.

The reason I ask this question is because the data in asset management companies is mostly structured, if not 100%. All of the above 6 points are structured data. They are tabular. They consist of columns and rows. The data types are strings, numbers and dates. There are no audio, image or video files. There is no Twitter data, Facebook data or documents.

A data lake, on the other hand, handles multimedia files well, and social media too. That is the key differentiator between a data lake and a data warehouse. On the other hand, there are many, many superior features that a data warehouse has but a data lake doesn’t. The top one is data integration. A data warehouse integrates data. Company information from various data sources (Bloomberg, Factset, Markit, MSCI, Trucost, etc.) is integrated in the warehouse. Industry sector for example, and the primary equity ID such as ISIN. Also company names, country of domicile, and ratings.

Allocating proper IDs to each company is not a simple process. Anyone who has done this will testify how complex the process is. There are many ISINs, SEDOLs and CUSIPs that a company has issued; which one should we choose? Yes, they are security IDs not company IDs, but that is what is used by other data providers such as MSCI and Trucost, so we need to use ISINs. Should we use the Factset entity ID or the Bloomberg issuer ID as the main issuer ID (like the primary key)? There are many company names. The company name from Bloomberg is not the same as Factset’s, and they are different to Refinitiv, MSCI, Trucost, S&P, FTSE and Rimes. One is “Company, Inc.” and the other is without a comma or a period. One uses “Corp” and the other uses “Corporation”. And many other abbreviations. Which one should you use? Do you use the Factset proper name? Should you take the Bloomberg issuer hierarchy or Factset’s?

The same goes for security integration. How do you do your waterfall matching? SEDOL first or CUSIP first? CUSIP 8 or 9? Should you use the Bloomberg security ID first? And Ticker is a mess: with exchange, without exchange, with country, without country, Bloomberg ticker, Factset ticker, etc. The same thing with valuation data. Do you use the start of day, end of day, mid month or month end? Do you use the latest available, or the signed-off version? In a data warehousing project, the first thing we do is to establish credible issuer records, security data, ESG data, valuation data and performance attribution data. Credible exchange rates, prices and risk data. By integrating them from various sources. Yes, it is a large project, but that is what is required for reporting.
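To give a feel for what a waterfall match looks like (a simplified sketch; the SEDOL → CUSIP → ISIN → Ticker order and the identifier values are made up for illustration):

```python
from typing import Optional

def waterfall_match(security: dict, id_master: dict) -> Optional[str]:
    """Try identifiers in a fixed order; return the internal ID of the first match."""
    for id_type in ("sedol", "cusip", "isin", "ticker"):
        value = security.get(id_type)
        if value and value in id_master.get(id_type, {}):
            return id_master[id_type][value]
    return None   # unmatched: goes to a manual review / exception queue

# id_master maps each external identifier value to our internal security ID (values made up)
id_master = {
    "sedol":  {"1234567": "SEC001"},
    "cusip":  {"123456789": "SEC002"},
    "isin":   {"XS1234567890": "SEC002"},
    "ticker": {},
}

print(waterfall_match({"sedol": "1234567"}, id_master))                     # SEC001
print(waterfall_match({"sedol": None, "isin": "XS1234567890"}, id_master))  # SEC002
```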

A data warehouse is integrated. A data lake is not. A data warehouse is a golden source (meaning it’s credible). A data lake is not. A data warehouse has a DQ process (the data has been corrected and signed off); a data lake does not. Can you trust holding valuations coming out of a data lake? No. It does not have a robust sign-off process. Can you trust a company ISIN coming from the data lake? No. It is only from one source which hasn’t been integrated or cross checked with other sources.

A data lake is technically superior (in terms of the data formats it can handle) but functionally poor. So if your goal is reporting, the right solution is a data warehouse, not a data lake. If your purpose is data science (machine learning) then a data lake is the right solution, because machine learning processes raw data. Well, that’s not quite right, because machine learning processes both raw data and calculated data. Say you’d like to predict the scope 1 carbon emissions of a company using a time series, say with a CNN or RNN algorithm. That’s using raw data. That scope 1 emission number is from the ESG data provider (raw). But if you’d like to include the company’s annual revenue, or the company’s industry sector, ratings, country of domicile or market cap, then it is not “raw” data. It is “processed” data. The company information coming out of the data warehouse is of much higher quality than what is in the data lake. Why? Because annual revenue is a calculation. For example, in the US we get the quarterly sales numbers from the 10-Q and we sum them up. But which period are we using? Does it tally with the period of the carbon emissions? If the carbon emissions are as at 31st Dec 2020, and the company financial year is from 1st April to 31st March, which quarterly sales do you take? If there are several sources for sector, rating or country, then there is a waterfall process + data quality check to determine which one to use. All this happens in a data warehouse. None of it happens in the lake, as by definition the lake stores only raw data.

So there goes the argument that “for machine learning you need a data lake, not a data warehouse, because you need raw data”. The fact is machine learning predictions require both raw data and processed/calculated data.

For an ecommerce company which takes a Twitter feed, or a share prediction algorithm that takes public news as input, a data lake is ideal. But if your goal is reporting to regulators, investment desks, sales and marketing, risk and compliance, etc., then you need to use a data warehouse, not a data lake.

I have created a data lake for an asset management company (as part of a big team). And as I said, the data is structured. It is tabular. Rows and columns. No images, audio files or videos. No Twitter feed or Facebook export. It is about holdings, valuations, portfolios, issuers, assets, attributions: the normal day-to-day asset management business data.

Monolith is dead

That said, I completely agree that monoliths are dead. A data warehouse as a monolith is not suitable for today. It takes a long time to build (1-2 years), and once it’s built it’s difficult to change as the regression testing takes ages. We need to build small data marts, each specifically for one purpose, like calculating the total AUM (with the breakdowns) for finance at month ends and dailies. Like SFDR reporting. Or client reporting. One mart for one purpose. And you’ll have many marts, each with a different purpose. That way if you change the portfolio dimension in one mart, the other marts won’t be affected. There is no single version of the truth. There are many ways in which the issuer hierarchy can be implemented. Fixed income is different to equity.

And if you require a single version of the truth, for holdings, valuation or attribution for example, then you create a special mart that calculates that, and then publish it to all the other marts. More on this publishing mechanism later.

That said, the mart doesn’t have to be dimensional. No. It can be relational, it can be vertical. The data model in a mart needs to suit the business purpose of that mart. It might be serving a website, it might be for reporting, or for analytics (BI). Or for machine learning. Or calculating ESG numbers. Or calculating AUM. The data model in each mart needs to suit its purpose. It would be silly if we insisted on using a dimensional model all the time (fact and dimension). Of course not. We have to tailor the data model to suit what the data is used for.

Because of this you might not want to call it a data mart, because a data mart traditionally means a dimensional model: a small data warehouse, in dimensional format. A data mart traditionally consists of fact and dimension tables, in Kimball format (as in the group, not just one man). As a fan of the Kimball method, I have found that the dimensional format has many advantages. But when it is used for feeding a website, that’s not the right format. When it is used solely for calculating ESG numbers from a Python notebook using Pandas and Numpy, dimensional is not the right format. Some applications require flattened tables, some applications require relational tables. Some require data vault format. No, definitely not “always dimensional”. So you might not want to call it a data mart. You might want to call it a client application. A business application database. Along those lines, i.e. the client or business purpose. Call it an ESG data store, an AUM data store, a client reporting database, an SFDR database, etc. Remember: only for 1 specific purpose.

The advantage of this “only for 1 specific purpose” approach, from a management point of view, is the clarity of funding. It is easier to do the cost-benefit analysis, and to monitor the payback, because the whole project is about that 1 purpose. Compare this with funding a traditional “enterprise scale” data warehouse. How do you budget it? You will have to take it from the core IT budget. The central budget. Or combine many projects. And the payback could be 3 years. Or 5 years. It is not financially attractive. On the other hand, a business application that serves only 1 specific purpose is hugely attractive. You don’t mess around doing other things. You just do 1 thing, like calculating AUM. When you need another one, for SFDR say, you build another one. Another application, with another database. The budget allocation could not be clearer. The payback too.

Data Lake

That said, I would argue that there is a point in building a data lake for an asset management company. First, as a central data storage, where everything is kept. Second, as a landing area. Third, for data publishing.

A data lake by definition is a centralised storage for raw data. Everything is stored there, from many data sources, external and internal. Even calculated data from various systems is stored there too. The advantage? Everything is in one place. And if you use Databricks for your data lake, then you can join the files like tables! You can use normal SQL to query the data. You can feed your BI tools. And your reporting tools. And many, many business applications too, from your company website to your externally-facing client reporting portal. You can feed your internal data portal too, where everybody can get any data that they need.

Second, the landing area. I’d argue that, rather than doing data staging or landing in the SQL database where the data mart is located, it is better to put the data in a data lake. The reason is that we can then do incremental loading into the data marts / business applications, keeping them lean and performant.

Third, data publishing. If you have an application which calculates something that is used by many business applications, like AUM for example, then it is a good idea for this application to read the lake and publish the calculation results back into the data lake (but in another area, not in the raw data area). For example, you could label the raw data as the Bronze area and the AUM as the Silver area. So the idea is that we use the data lake for publishing calculated numbers, to be consumed by other business applications (because all business applications read data from the lake).
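Continuing the Bronze/Silver labelling above, a sketch of that publish step might look like this (assuming PySpark with Delta on Databricks; the paths and column names are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read raw (Bronze) holdings from the lake.
holdings = spark.read.format("delta").load("/mnt/lake/bronze/holdings")

# Calculate AUM per portfolio per valuation date.
aum = (holdings
       .groupBy("portfolio_code", "valuation_date")
       .agg(F.sum("market_value").alias("aum")))

# Publish the calculated numbers back into the Silver area of the lake,
# so other business applications can consume them from one place.
aum.write.format("delta").mode("overwrite").save("/mnt/lake/silver/aum")
```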

Conclusions

  1. Asset managers: if your need is reporting, use a data warehouse not a data lake.
  2. Monoliths are dead: build data marts (business apps) rather than a data warehouse.
  3. There is value for an asset manager in having a data lake (3 advantages).

Thank you to Robert Adler whose article inspired me to write this article:
https://www.portfoliobi.com/post/data-lakes-vs-data-warehouses-considerations-for-storing-data

1 November 2022

Power BI Datamart

Filed under: Business Intelligence,Data Warehousing — Vincent Rainardi @ 8:39 am

Like SQL Data Warehouse, Microsoft has once again named a product after what it does: Power BI Datamart. It is an Azure SQL database, with a visual query designer so users don’t need to write SQL to get the data out to BI tools, Excel or ML algorithms. It also generates a Power BI dataset automatically, which we can use to create Power BI reports.

Loading data into Datamart

Power BI Datamart uses a built-in Power BI Dataflow for loading data into it. After we create a Datamart, we click on Get Data and use Power Query to connect to various data sources such as Azure SQL Database, Azure Data Lake Storage, Databricks (Spark), Azure Synapse, HDInsight, SharePoint, Excel files, text files, JSON files, Parquet files, APIs, Salesforce, Snowflake, Redshift, BigQuery, Oracle DB, IBM DB2, Teradata, PostgreSQL, MySQL, Access, SAP, Analysis Services and ODBC (link):

The above list of data sources is so comprehensive that I could not believe it at first. This is a game changer. Gone are the days when we struggled in SSIS, Informatica or ADF to load data from those varied sources. Power Query has all the connectors and interfaces to every single one of those databases, files, data lakes and big data platforms.

We then select the tables and specify the transformations, and Power Query builds the ETL rules in that built-in Dataflow, creates the relationships and loads the data from those tables in the data source into the Datamart. We can then see all those tables and relationships in the Table Tools tab:

Afterwards, we can create relationships (again) and measures. We can also hide the tables and columns that we don’t need.

In the workspace we automatically get a dataset:

We can schedule when we want to reload data from those data sources into the Datamart (called Scheduled Refresh). And we can also set up incremental load, i.e. only changed rows are loaded into the Datamart, based on the column that we choose (known as the “watermark column”, link):

Security and access control

Power BI Datamart uses row-level security and roles to restrict user access to the Datamart. In the example below (link), the users are restricted to only being able to access rows with a category ID of 1 or 2. Once that role is defined, we can assign users to that role, as we can see below:

Querying data

As I said at the start of this article, we can use a visual query designer to get the data out without writing SQL (link). As we can see below, the Visual Query Designer is like Markit EDM, SSIS and Informatica.

We can select rows, filter rows, select columns, split columns, remove duplicates, join tables, change data types, transpose table and pivot columns. We can also do error handling, such as removing errors or replacing errors (link). And we can also create custom functions (link).

Of course we can also write SQL:

Accessing Power BI Datamart from SSMS

We can also query the data in the Power BI Datamart using SSMS (SQL Server Management Studio), and using Azure Data Studio too (link):

Not only can we use SSMS for querying data in the Datamart, but also for adding users and configuring security. We can also use SSMS to look at the query statistics. This is how it looks from SSMS (link):

We can see above that the Power BI Datamart tables are listed as views under the “model” schema. The relationships in the Power BI Datamart are stored in two views, metadata.relationships and relationshipColumns, as shown above (link).
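For readers who prefer to script this, here is a hedged sketch of querying the Datamart's SQL endpoint from Python with pyodbc; the server name is a placeholder you would copy from the Datamart settings, the table name model.DimProduct is hypothetical, and the authentication option assumes Azure AD interactive login.

import pyodbc

# The connection string details are assumptions; copy the actual SQL endpoint
# from the Datamart settings page
conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<your-datamart-sql-endpoint>;"
    "Authentication=ActiveDirectoryInteractive;"
    "Encrypt=yes;"
)
conn = pyodbc.connect(conn_str)
cursor = conn.cursor()

# The Datamart tables are exposed as views under the "model" schema
cursor.execute("SELECT TOP 10 * FROM model.DimProduct")   # hypothetical table name
for row in cursor.fetchall():
    print(row)

# The relationships are exposed in the metadata views mentioned above
cursor.execute("SELECT * FROM metadata.relationships")
print(cursor.fetchall())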

Conclusion

I think Power BI Datamart is a game changer.

  • It enables us to gather data from databases, data lakes, files and APIs into one data mart.
  • We can reload data every day/week, and we can do incremental loading.
  • We can query the data using SSMS (and do admin tasks too).
  • We can build Power BI reports easily (the dataset is created automatically).
  • We can query the data using the visual query designer or by writing SQL.
  • We can easily get the data out to BI or ML tools, and to Excel.

That is a very powerful tool. Anyone working in data warehousing and BI should consider it, or at least take a look.

And I think the name suits it well: the Power BI Datamart is indeed a good data mart, with all its bells and whistles.

20 October 2022

Forecasting stock prices using LSTM

Filed under: Data Warehousing — Vincent Rainardi @ 7:53 am

LSTM

Long Short Term Memory (LSTM) is a machine learning algorithm that can recognise a series of data in a sequence: for example, a series of words in an email, a series of spoken words (voice), a series of pictures (a video), a series of daily rainfall figures, daily solar radiation, or monthly sales. LSTM can recognise the pattern in the data and make predictions about future values. It has been implemented widely in many areas, including the financial markets.

RNN

LSTM belongs to a family of algorithms called Recurrent Neural Networks (RNN). An RNN is a machine learning algorithm that can learn the pattern in sequential data. Apart from LSTM, there are other algorithms in the RNN family, for example the Gated Recurrent Unit (GRU). An RNN, as the name implies, is a neural network. A neural network is an architecture made of a series of layers, and each layer consists of nodes. Each node in an LSTM consists of three gates: the forget gate, the input gate and the output gate.

Stock Markets

There are many different forms of financial markets: stock markets, bond markets, commodity markets (which trade the future price of oil, gold, aluminium, wheat, etc.), currency markets (trading the exchange rates of fiat currency pairs and cryptocurrencies), money markets (trading very liquid securities such as US Treasuries) and derivative markets (trading fixed income and equity derivatives such as options, futures and swaps). These markets have different characteristics. Some are volatile, like stock markets and currency markets; some are quite stable, like bond markets and money markets. For individual investors, the stock market is arguably the most popular one, much more popular than the bond, commodity, derivative or money markets.

Forecasting vs prediction

Before we get into stock price forecasting, let’s understand the difference between prediction and forecasting. In machine learning the word “prediction” means trying to guess the value of a variable. If you have the number of windows in a house and you try to guess the price of the house, that’s prediction. Or if you have people’s heights and you try to guess their weights, that’s prediction. Of course you can use many variables for prediction, for example using the house’s location, floor area, number of bedrooms and age to predict its price.

In machine learning, forecasting deals specifically with time: it means predicting future values. For example, how many washing machines will be sold in our store next month, or whether it is going to rain tomorrow. We can base our forecast on past values of the same metric. For example, we can forecast next month’s inflation based on the inflation over the last two years. Or we can base our forecast on something else entirely. For example, we can forecast traffic jams based on the scheduled events in that city. And of course forecasting is a form of prediction, because predicting means trying to guess the value of a variable, including its value in the future.

Stock Price Forecast

In the stock market there are a lot of variables that we can forecast (i.e. guess the future value of). For example, we can try to predict which companies will be leaving the S&P 500 next year. We can predict the number of transactions tomorrow, or the volume, or the direction of the price tomorrow (up or down), or the market sentiment. Or we can try to correlate one variable with another. Out of all these variables, the stock price is the most popular one. That is not surprising, because if you knew what the stock price would be in the future, you could make a huge fortune.

There are many stock price forecasting articles. Most of them use only the last few weeks of data, and they forecast only the next day’s price. That is not what happens in the real world. Many use the closing price, or the opening price. Again, that is not what happens in the real world. When an investor sells, the price they get is not the opening or closing price, nor the highest or lowest price. They get something between the highest and the lowest price. It is not called the mid price, because the mid price is the midpoint between the bid price and the offer (ask) price for a particular trade, whereas high, low, open and close are for a whole day’s worth of trades in a particular stock or bond (or option, future, any other derivative, commodity or currency for that matter).
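If we want to reflect that in the data preparation, a small pandas sketch like the one below could be used; the file name and the OHLC column names are assumptions, and the halfway point between the daily high and low is just one possible proxy for the price an investor actually gets.

import pandas as pd

# Assumed CSV with Date, Open, High, Low, Close columns
df = pd.read_csv("stock_prices.csv", parse_dates=["Date"], index_col="Date")

# A simple proxy for the achievable price: halfway between the day's high and low
# (note this is not the bid/offer mid price discussed above)
df["target_price"] = (df["High"] + df["Low"]) / 2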

Principles of Forecasting

Many forecasting methods are based on the assumption that the future will behave like the past. That is the fundamental reason to believe that we can predict the future at all. This assumption is the foundation of traditional time series forecasting methods such as ARIMA and Exponential Smoothing.

The technical term for this is autoregression, meaning that the future values of a time series are modelled as a linear combination of its past values (the strength of that relationship is measured by autocorrelation). The other technical term is the moving average model, meaning that the future values of a time series are modelled from past forecast errors. Autoregression and the moving average model are the two fundamental building blocks of ARIMA.

The Exponential Smoothing technique, on the other hand, models the future values of a time series as a weighted average of its past values, where recent observations get larger weights than observations in the distant past.
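As a rough illustration of these two traditional approaches, here is a minimal statsmodels sketch; the file name, column name and the ARIMA order (5, 1, 2) are assumptions, not recommendations.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Assumed daily closing prices indexed by date
series = pd.read_csv("stock_prices.csv", index_col="Date", parse_dates=True)["Close"]

# ARIMA(p, d, q): p autoregressive lags, d differencing steps, q moving average terms
arima_forecast = ARIMA(series, order=(5, 1, 2)).fit().forecast(steps=5)

# Exponential smoothing: a weighted average where recent values weigh more
es_forecast = ExponentialSmoothing(series, trend="add").fit().forecast(steps=5)

print(arima_forecast, es_forecast)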

The Art of Stock Forecasting

Like any prediction, the art of stock forecasting is choosing the right variables to use. The stock’s own time series might not be enough to forecast its future. In that case we need to include other variables, such as volume, other stocks and the industry average. That is the most important ingredient: the input variables. The first step, before choosing the algorithm or method, is to establish a set of variables which are highly correlated with the future price of the stock.

The second most important ingredient in stock forecasting is choosing the right algorithm. It does not have to be machine learning. Time series forecasting is over a hundred years old, so ignore the non-ML methods at your peril. Combine the traditional methods and the ML methods. Learn about Facebook’s Prophet, GARCH, Exponential Smoothing, ARIMA, Theta and other non-ML methods. Combine them with ML methods commonly used for forecasting time series, such as LSTM and CNN. But don’t stop there: also try ML algorithms not normally associated with time series, such as XGBoost and Random Forest. There is plenty of literature on each method, which gives us a good starting point.

That is the art of stock forecasting: finding the right combination of variables and algorithms that gives us the best results. And then there is the third ingredient: tuning those combinations. Trying out different regularisations, different hyperparameters, different optimisers and different evaluation criteria, all in an attempt to get the best results.

Different types of LSTM

When it comes to stock price forecasting using LSTM, there are 5 different architectures that we can use: vanilla LSTM, stacked LSTM, convolutional LSTM, bidirectional LSTM and CNN LSTM.

  • Vanilla LSTM is a single LSTM layer, possibly with a regularisation measure such as a dropout layer, and optionally one or two dense layers for converting the output into classes.
  • Stacked LSTM means multiple LSTM layers stacked one after another.
  • Bidirectional LSTM learns the time series both forwards and backwards.
  • Convolutional LSTM reads the time series using the convolution method (see my article here for what convolution means).
  • CNN LSTM also reads the time series using the convolution method.

The difference between CNN LSTM and Convolutional LSTM is that CNN LSTM uses a convolution layer to read the input and then passes the output to an LSTM layer, whereas Convolutional LSTM does not use a separate convolution layer; it uses a Keras ConvLSTM layer (ConvLSTM2D) which already has the convolutional reading built in.
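To make the differences concrete, here are hedged Keras sketches of four of these architectures (these are not the implementation shown in the next section); the layer sizes, window length and number of input features are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

n_steps, n_features, n_outputs = 30, 1, 5   # assumed window length, input series, forecast length

# Vanilla LSTM: one LSTM layer, a dropout layer for regularisation, a dense output layer
vanilla = keras.Sequential([
    layers.LSTM(50, input_shape=(n_steps, n_features)),
    layers.Dropout(0.2),
    layers.Dense(n_outputs),
])

# Stacked LSTM: LSTM layers one after another (return_sequences feeds the next LSTM)
stacked = keras.Sequential([
    layers.LSTM(50, return_sequences=True, input_shape=(n_steps, n_features)),
    layers.LSTM(50),
    layers.Dense(n_outputs),
])

# Bidirectional LSTM: reads the window forwards and backwards
bidirectional = keras.Sequential([
    layers.Bidirectional(layers.LSTM(50), input_shape=(n_steps, n_features)),
    layers.Dense(n_outputs),
])

# CNN LSTM: a convolution layer reads the input, then passes it to an LSTM layer
cnn_lstm = keras.Sequential([
    layers.Conv1D(64, kernel_size=3, activation="relu", input_shape=(n_steps, n_features)),
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(50),
    layers.Dense(n_outputs),
])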

Implementation

The implementation examples of these 5 architectures can be seen on Jason Brownlee’s website: https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/

Here is an example for Vanilla LSTM: (my own implementation)

In the above example, the Y Gap is 66, i.e. the number of trading days between [the last trading day in the stock time series] and [the start of the forecasting period]. For example: the time series is from 1st Jan 2016 to 31st Dec 2021, the Y Gap is the period covering Jan, Feb and March 2022, and the forecasting period is the first week of April 2022.

That is why, in the example above, the LSTM output is set to 71 days: 66 days (the number of trading days in the three-month gap, i.e. Jan, Feb and March 2022) plus 5 days (the number of trading days in the first week of April).

The number of trading days corresponds to the days when the stock market is open, which is usually Monday to Friday, minus public holidays.
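To show how such a gap could be wired into the training data, here is a sketch of building the input windows and targets with numpy; the window length of 250 trading days and the prices.csv file are assumptions, while the 66-day gap and 5-day forecast follow the example above.

import numpy as np

def make_windows(prices, window=250, y_gap=66, horizon=5):
    # The output covers the gap plus the forecast period (66 + 5 = 71 days);
    # only the last 5 predicted values are the actual forecast
    out_len = y_gap + horizon
    X, Y = [], []
    for i in range(len(prices) - window - out_len + 1):
        X.append(prices[i : i + window])
        Y.append(prices[i + window : i + window + out_len])
    return np.array(X), np.array(Y)

prices = np.loadtxt("prices.csv", delimiter=",")   # hypothetical file of daily prices
X_train, Y_train = make_windows(prices)
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))   # samples, time steps, features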

Running the model is quite simple:
model.fit(X_train, Y_train, epochs=50, batch_size=32)

I do hope this article inspired you to take a stab at stock price forecasting. If you have any questions or would like to discuss any aspect of stock forecasting, I’m available at vrainardi@gmail.com.

14 October 2022

Data Warehouse on Data Lake

Filed under: Data Warehousing — Vincent Rainardi @ 7:04 am

Some people argue that the best way to build a data warehouse on a data lake is to create a SQL Server database (or Oracle), populated from a Databricks data lake, like the diagram below. It is certainly a good way of doing it, but yesterday I saw a better way: build the data warehouse on the Databricks SQL Warehouse feature (which used to be called SQL endpoint).

The SQL Warehouse in Databricks makes the tables in Databricks queryable using SQL commands, as if they were tables in a SQL database. More importantly, it enables BI and reporting tools to connect to it as if it were a SQL database, and we can control which users are allowed to access which schemas, tables and views. Yes, views! That is something we could not do previously on the Databricks cluster. But now we can, on the SQL Warehouse. See the diagram below.

Similarly, for data scientists and ML engineers, instead of creating a SQL database populated from the data lake, it is better to create the data science area on Databricks itself, as a SQL Warehouse.

Here’s how to do it. We build the data lake as normal on Databricks, one Databricks database per data source. As usual, in each database we have the bronze area, silver area, gold area, etc. We bring the data in from the data sources (say from Charles River, or from SAP) into the bronze area using Azure Data Factory (or Glue + Transform if you use AWS). We create Python notebooks to process data from bronze to silver to gold, validating, cleaning and transforming it along the way.

At this point the usual approach is to create an Azure SQL database (or Redshift if you use AWS, or Snowflake), and to create a Kimball dimensional model on it (star schema, fact and dimension tables). To populate it we can use Azure Data Factory to read the lake. To get the data out of the lake we can either a) create APIs on top of Databricks (so the ADF pipelines consume these APIs) or b) read the Databricks cluster directly using Spark SQL, Scala or Python.

Using SQL Warehouse we don’t create this Azure SQL database. Instead, we create a database in the SQL Warehouse (a database in SQL Warehouse is formally called a schema). Then we create the Kimball dimensional model (fact and dimension tables) in this schema, and put the location path of those tables in the data lake, on ADLS (or S3 if you use AWS), preferably in Delta format (you can also choose JSON, CSV, Parquet or Avro). For large fact tables we can partition the tables.

Then we use COPY INTO to load data from a file location in the data lake into staging tables on the SQL Warehouse, so we don’t need to create the data lake tables as external tables. These staging tables are created as Delta tables. Then we populate the fact and dimension tables using an INSERT INTO statement, e.g. INSERT INTO FactTable1 TABLE StagingTable1. So we select from the staging tables and insert into the fact and dimension tables.
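Here is a hedged sketch of those steps, wrapped in spark.sql() so it could be run from a Databricks Python notebook where the spark session is already defined (the same statements can be run directly in the SQL Warehouse query editor); the schema name, table names, columns and ADLS paths are all hypothetical.

# Create the warehouse schema (a "database" in SQL Warehouse terms)
spark.sql("CREATE SCHEMA IF NOT EXISTS warehouse")

# A dimension table whose data lives in the data lake (ADLS), stored in Delta format
spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.DimClient (
        client_key BIGINT,
        client_name STRING,
        country STRING)
    USING DELTA
    LOCATION 'abfss://lake@mystorage.dfs.core.windows.net/warehouse/dim_client'
""")

# A staging table, created as a Delta table
spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.StagingClient (
        client_key BIGINT,
        client_name STRING,
        country STRING)
    USING DELTA
""")

# Load files from a location in the data lake into the staging table
spark.sql("""
    COPY INTO warehouse.StagingClient
    FROM 'abfss://lake@mystorage.dfs.core.windows.net/gold/client'
    FILEFORMAT = PARQUET
""")

# Populate the dimension table from the staging table
spark.sql("INSERT INTO warehouse.DimClient TABLE warehouse.StagingClient")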

And that’s how we create a data warehouse on Databricks SQL Warehouse. And populate it.

As a bonus, using Databricks SQL Warehouse we can capture the data lineage out of the box. See the Reference section below.

Reference:

5 October 2022

MCC formula for multiclass classification

Filed under: Data Warehousing — Vincent Rainardi @ 7:51 am

Problem: you produced a multiclass classification result like the one below, and need to evaluate it using MCC:

Note: MCC (Matthews Correlation Coefficient), also known as the Phi Coefficient (link), is a measure of how closely related two variables are. For multiclass classification, MCC is a better evaluation measure than accuracy, precision, recall or the F1 score. MCC and AUC measure different things: MCC measures statistical accuracy, whereas AUC measures robustness (link).

Solution:

from sklearn.metrics import matthews_corrcoef
matthews_corrcoef(Y_Actual, Y_Predicted)
# Output: 0.035662657

Where Y_Actual and Y_Predicted are lists containing the class IDs, like [3, 8, 2, 5, …, 9]. In the above example both lists contain 1000 items.

How does it work?

But how does it actually work? What is the formula to calculate that 0.035662657?

Well, Scikit Learn specifies the formula here:
https://scikit-learn.org/stable/modules/model_evaluation.html#matthews-corrcoef

For binary classification it is:

MCC = (TP x TN - FP x FN) / sqrt( (TP + FP) x (TP + FN) x (TN + FP) x (TN + FN) )

Where TP, TN, FP, FN are the number of true positives, true negatives, false positives and false negatives, respectively.

But that’s for binary classification. How about multiclass classification?

Well, Scikit Learn specifies the formula too, on the same page as above. For multiclass classification it is:

MCC = (c x s - sum over k of (tk x pk)) / sqrt( (s^2 - sum of pk^2) x (s^2 - sum of tk^2) )

Where:

k = class, from 1 to K (in the above example: 10 classes)
s = the number of samples (in the above example: 1000 samples)
c = the number of samples correctly predicted (in the above example: 157)
tk = the number of times class k truly occurred (in the above example: 67, 42, 67, etc.)
pk = the number of times class k was predicted (in the above example: 17, 8, 20, etc.)

When I looked at this formula, I got confused. So let’s pick it apart and do it in Excel.

Numerator

The numerator is c x s minus the sum of tk x pk, see below:

So it is 157 x 1000 (c x s) minus the sumproduct of the t column and the p column, i.e. 67 x 17 + 42 x 8 + … and so on.

In Excel we can use SUMPRODUCT like this:
(refer to the column and row numbers in the Excel screenshot above)

= B12*D12-(SUMPRODUCT(B2:B11,C2:C11))
= 28918

Denominator

The denominator is:

sqrt( (s^2 - sum of pk^2) x (s^2 - sum of tk^2) )

In Excel it is:
= SQRT((B12^2-SUMPRODUCT(C2:C11,C2:C11))*(B12^2-SUMPRODUCT(B2:B11,B2:B11)))
= 810876.198255

Dividing the numerator by the denominator gives us:
28918 / 810876.198255 = 0.035662657

There we go. That’s how you calculate MCC in Excel. I hope this is useful.
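If you would rather check the same arithmetic in Python than in Excel, here is a short sketch that implements the multiclass formula above directly and can be compared with scikit-learn; Y_Actual and Y_Predicted are assumed to be the same lists of class IDs as before.

import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_multiclass(y_actual, y_predicted):
    y_actual = np.asarray(y_actual)
    y_predicted = np.asarray(y_predicted)
    classes = np.unique(np.concatenate([y_actual, y_predicted]))

    s = len(y_actual)                                 # number of samples
    c = np.sum(y_actual == y_predicted)               # correctly predicted samples
    t = np.array([np.sum(y_actual == k) for k in classes])      # tk per class
    p = np.array([np.sum(y_predicted == k) for k in classes])   # pk per class

    numerator = c * s - np.sum(t * p)
    denominator = np.sqrt((s**2 - np.sum(p**2)) * (s**2 - np.sum(t**2)))
    return numerator / denominator

# Should give the same number as scikit-learn:
# print(mcc_multiclass(Y_Actual, Y_Predicted), matthews_corrcoef(Y_Actual, Y_Predicted))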
