Data Platform and Data Science

28 January 2022

Feature Importance

Filed under: Data Warehousing — Vincent Rainardi @ 9:15 am

Quite often in machine learning we would like to know which variables have the most impact on predicting the target variable. Knowing which product features affect the customer’s decision to buy the product is critical for a consumer goods manufacturer. Knowing which financial ratios affect the stock price is very important not only to fund managers, but also to the board of directors. Imagine for a second that you were able to know that, out of 10 different factors, there are 2 factors which have a lot of influence and 1 factor which has a minor influence, like this:

That would be very useful right? Some people would do anything to get this knowledge.

Well, the good news is: it is possible. In fact, we have been doing it for many decades, since before machine learning. But with machine learning it got better.

Like many things in machine learning, the question is usually: do we have the data? When I first started studying ML, I always thought “Do we have the algorithm?” or “Which algorithm should I use?” (link) But after studying many algorithms, I’m now convinced that there will always be a suitable one. Given that we have deep learning, which can find patterns in any data, it’s quite unlikely that we don’t have an algorithm. Even if there isn’t one (just for argument’s sake), we can make one, either by enhancing an existing algorithm or by combining several.

No, the question in machine learning is usually: do we have the data? And in this case, yes we do. I work a lot with ESG data (environmental, social and corporate governance), such as carbon emissions, air pollution, customer satisfaction, employee well-being, board composition and executive remuneration. There are hundreds of ESG factors for every company, and we have their share prices (they are listed companies). Naturally, the question everyone wants to know the answer to is: which ESG factors affect the share price the most? And not only that: for every company we also have many financial ratios, such as return on equity, net profit margin, debt to total assets and earnings per share. Which ratios affect the share price the most?

Different industry sectors have different variables. Imagine the variables in the retail sector, the healthcare sector, oil, banking, etc. They all have different variables, and they all have different target variables. And yes, we can find out which variables affect the target variable the most. So yes, we do have the data!

So it’s totally worth it to spend an hour reading and learning about this. The benefit for us is enormous. And once again, it is called Feature Importance (in machine learning features means input variables).

The best starting point I have found is to read what Jason Brownlee wrote on this topic: link. In that article he explains how to get the feature importance using various algorithms such as Random Forest, Decision Trees, Extreme Gradient Boosting, Linear Regression and Permutation, both for regression (predicting a value) and for classification (grouping data).
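To give a flavour of the kind of code Jason walks through, here is a minimal sketch (my own, not his) that gets the impurity-based feature importance from a Random Forest regressor on a synthetic dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only 3 of which actually drive the target.
X, y = make_regression(n_samples=1000, n_features=10, n_informative=3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Impurity-based importance score per input variable (the scores sum to 1).
for i, score in enumerate(model.feature_importances_):
    print(f"Feature {i}: {score:.4f}")
```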

Yes, there are 2 different things in Feature Importance:

  1. Which variables affect the target variable the most (as I explained above), and
  2. Which variables affect the classification the most i.e. grouping the data points (putting similar data points into the same group)

In certain industry sectors #1 is more important, but in other industry sectors #2 is more important. That’s why we need to understand both.

If you read Jason’s article above, you’ll find that for each algorithm he provides Python code to calculate the feature importance (mostly using scikit-learn), as well as the results (both the numbers and the bar charts). To make them comparable, we can scale those numbers to between 0 and 100, like this:

So if the output of a model is between 0 and 1, we multiply it by 100. If it is between 0 and 1000, we divide it by 10. This way all model outputs end up between 0 and 100.
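A generic way to implement that idea (a slight generalisation of the multiply-or-divide rule above) is to divide every score by the largest one and multiply by 100:

```python
import numpy as np

def scale_to_0_100(scores):
    """Bring any non-negative importance scores into the 0-100 range."""
    scores = np.asarray(scores, dtype=float)
    return 100 * scores / scores.max()

print(scale_to_0_100([0.02, 0.65, 0.33]))    # model output between 0 and 1
print(scale_to_0_100([12.0, 850.0, 330.0]))  # model output between 0 and 1000
```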

In the above tables, the legend for the columns are as follows:

  • XGB = Extreme Gradient Boosting
  • RF = Random Forest
  • DT = Decision Trees (more specifically CART, Classification and Regression Trees)
  • LR = Linear Regression or Logistic Regression (depending on whether it is for regression or for classification), using the coefficient for each feature
  • Perm = Permutation (using K Neighbours Regression and then calculating the permutation feature importance; a minimal sketch follows this list)
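For that last column, the permutation approach can be sketched like this (illustrative data, not the numbers in the tables):

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=1000, n_features=10, n_informative=3, random_state=42)
knn = KNeighborsRegressor().fit(X, y)

# Shuffle each feature in turn and measure how much the score degrades.
result = permutation_importance(knn, X, y, scoring="neg_mean_squared_error",
                                n_repeats=10, random_state=42)
for i, score in enumerate(result.importances_mean):
    print(f"Feature {i}: {score:.1f}")
```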

But the above is still not easy to compare between models. To make it easier to compare between models, we can express them as percentage of total. So we first calculate the column totals like this:

And then we express each cell as a percentage of total like this:
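The tables themselves are not reproduced here, but the transformation is a one-liner in pandas (the numbers below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"XGB": [5, 10, 25, 40, 15, 3, 2],
     "RF":  [4, 12, 30, 35, 12, 4, 3],
     "LR":  [3, 30, 20, 22, 10, 3, 12]},
    index=[f"Factor {i}" for i in range(1, 8)],
)

# Divide each cell by its column total so every column sums to 100%.
pct = 100 * df / df.sum(axis=0)
print(pct.round(1))
```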

Now it is easier to see. We can go through line by line, and spot the anomalies like this:

We can see above that for regression, the Linear Regression considers Factor 2 and Factor 7 as much more important compared to the other models.

On the classification side, we can see that Decision Trees considers Factor 7 more important than the other models do, but Factors 2 and 6 less important. And the Logistic Regression model considers Factor 2 more important than the other models do, but Factor 4 less important.

It is usually easier to spot the anomalies if we put them into 3D bar chart like this:

Here on the Regression chart (see the red arrows) we can see the 2 yellow bars for Linear Regression model on factor 2 and 7 are higher than other models.

And on the Classification chart (see the green arrows) we can easily see the bars which are higher than other models. But (see the black arrows) it is more difficult to spot bars which are lower than other models.

The question here is: which model is the best one to use? Well, each model has a different way of deciding which variables are important, as Jason explains in his article. Random Forest, for example, uses the decrease in impurity to determine which variables are more important. Specifically, for every variable, the impurity decrease is accumulated every time that variable is used to split a node, and summed across every tree in the forest. The sum is then divided by the number of trees in the forest to give an average. By impurity here I mean Gini impurity (for classification trees; for regression trees it is the variance reduction). Similarly, the Decision Trees (CART) and XGB models also use the impurity decrease, but in a slightly different flavour, because the ensemble technique used in those models is different.
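We can see this averaging behaviour directly in scikit-learn; here is a small check on synthetic data (the exact internals may vary slightly between versions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, random_state=42)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Each tree exposes its own Gini-based importances; the forest's
# feature_importances_ is (up to normalisation) their average across trees.
per_tree = np.stack([tree.feature_importances_ for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.feature_importances_))  # True
```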

So it is not about which model is the best one to use. Instead, we have to choose a few models which cover different opinions. The wider the range of opinions, the better. So we need to choose models which contrast with each other. For example, for regression we can choose Random Forest and Linear Regression (coefficients), because Linear Regression points out that Factors 2 and 7 are important, whereas the other models don’t.

For classification we can use XGB, Decision Trees and Logistic Regression, because DT and LR have different opinions to XGB with regard to which factors are important. We know from XGB that Factor 4 is the most important, followed by Factors 3, 5 and 6. But DT is saying that Factor 7 is important too, and LR is saying that Factor 2 is important too. This way we can examine Factors 2 and 7 to see whether they are significant or not. So the most important thing in choosing the models is to get different opinions about which factors are important.

25 January 2022

Data Lake on Oracle

Filed under: Data Warehousing — Vincent Rainardi @ 8:33 am

A modern data warehouse consists of a data lake and a dimensional data warehouse (link). The data lake is just a collection of files, but there is a SQL layer on top, enabling us to query the data lake using SQL; this is called a Data Lakehouse (link). I knew that Databricks and Snowflake have this feature (link), but it turns out that Oracle has it too (link). It is based on Oracle external tables, which create a SQL layer on top of the data lake files (link). It works not only on Oracle Cloud files, but also on S3, Azure and GCP files.

Oracle Autonomous Database is a cloud database that uses machine learning to automate database tuning, security, backups, updates and other routine database management tasks which are usually performed by a DBA (link). Data Catalog is a metadata management service that provides an inventory of assets, a business glossary and a metadata store for data lakes. The Autonomous Database can automatically create external tables for thousands of files harvested by the Data Catalog (link). This way, users can immediately query all the data in the data lake using SQL. For each file in the data lake the Data Catalog creates an entity automatically, without us having to manually define the columns and data types. And this entity metadata is automatically synchronised with the files in the lake (link), so the Autonomous Database always reflects what’s in the data lake.
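As a sketch of what this enables (the connection details and the sales_ext table name below are made up for illustration), querying a harvested external table from Python is just ordinary SQL:

```python
import oracledb

# Hypothetical credentials and DSN for an Autonomous Database instance.
conn = oracledb.connect(user="analytics", password="...", dsn="mydb_high")

with conn.cursor() as cur:
    # sales_ext is assumed to be an external table auto-created over files in the lake.
    cur.execute("SELECT region, SUM(amount) FROM sales_ext GROUP BY region")
    for row in cur:
        print(row)
```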

This way, the data lake objects/files are harvested into the Data Catalog, and the Data Catalog is automatically synchronised into the Autonomous Database, enabling users to query the data lake using SQL, as Martin Gubar describes in the Oracle blog (link):

This makes Oracle a good platform for creating a modern data warehouse, i.e. a data lake + a dimensional data warehouse. Oracle’s relational database engine is legendary, making it a very good platform for hosting the dimensional data warehouse (star schema). And the Autonomous Database with Data Catalog described above can provide a SQL-queryable layer on top of data lakes in AWS S3, Azure, GCP or Oracle Cloud Infrastructure (OCI, link).

For data movement, on OCI we can use Data Integration to ingest data into the data lake and into the warehouse. It is a cloud-based, managed service (link) with rich transformation, data explorer and data profiler features. We can monitor the running pipelines using various metrics, notifications or alarms. We can use Data Flow to create and run Spark applications (link). It’s a fully managed service, so we don’t need to create Spark clusters or worry about the underlying infrastructure. We can use Data Transfer to migrate large volumes of data into the data lake (link).

Like Azure, AWS and GCP, OCI also provides a machine learning platform. It is called Data Science. We can use it to build, train and deploy ML models (link). It is fully managed and serverless, so we don’t need to worry about the underlying infrastructure such as updating Python library dependencies, Conda or JupyterLab. Anomaly Detection enables us to identify undesirable events in real time (link), for example in IoT data. It automatically analyses the dataset to build ML models for multivariate signals by considering their correlations.

For BI & reporting, we can use Analytics Cloud to explore, load, model and analyse data by creating reports and dashboards (link). We can do development operations such as release pipelines, build, deployment approval, code repository and continuous integration using a tool called DevOps (link).

So all in all, OCI provides an end-to-end platform for creating a modern data warehouse, i.e. a data lake and a data warehouse. I’m glad to have learned this today, because all this time I was wondering what Oracle had been doing in this era of cloud data warehousing. Azure, Amazon and Google all offer cloud data platforms for data lakes and data warehousing, so what about Oracle, the leader in relational databases? It turns out that Oracle also has a good cloud data platform offering.

20 January 2022

Data Lake Architecture

Filed under: Data Warehousing — Vincent Rainardi @ 8:47 am

When we google “data lake architecture” we usually get a diagram showing the functions of a data lake (what it is used for), but not showing what’s inside, like this:

But what we really want to know is what is inside that Data Lake box, not the inputs or the outputs. When we try to find out what’s inside the lake, we’ll find something like this:

They show the layers and functions of a data lake, but not the internal mechanism, i.e. what is really inside a data lake. Note that instead of raw, clean and curated layers, some people name the layers differently, such as:

  • Bronze, silver, gold
  • Temporal, raw, trusted, clean
  • Raw, core, data mart
  • Transient, raw, trusted, refined
  • Landing, raw, standardised, curated
  • Raw, staged, analytics

After reading a few books on data lakes we’ll find that a data lake is just a collection of files. That’s it. Seriously? Are you telling me that a data lake is just a collection of files? Yup. Wow, how disappointing! That’s what I initially thought.

Everybody agrees that data lakes should be in the cloud. In other words, we should use Amazon S3 (AWS), Google Cloud Storage (GCS) or Azure Data Lake Storage (ADLS) to create a data lake. But it is still just a collection of files.

Then I saw this diagram in Holt Calder’s article (link):

So we can query a data lake in S3 / GCS / ADLS by creating Snowflake external tables on top of it. This way we can use SQL to query the data lake. You can read how to do it on Snowflake documentation: link.
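Roughly, it looks like this (a hedged sketch: the stage, table name and credentials are made up, and the exact CREATE EXTERNAL TABLE options are in the Snowflake documentation linked above):

```python
import snowflake.connector

conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()

# Expose parquet files already staged in S3/GCS/ADLS as a queryable table.
cur.execute("""
    CREATE OR REPLACE EXTERNAL TABLE sales_ext
    LOCATION = @my_lake_stage/sales/
    FILE_FORMAT = (TYPE = PARQUET)
""")

cur.execute("SELECT COUNT(*) FROM sales_ext")
print(cur.fetchone())
```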

Creating external tables is an old technique. I remember creating an external table in Hive about 10 years ago. Yes, in 2012! At that time there were no data lakes. I was working for an investment bank and all we had was Hadoop (HDFS). On top of this distributed file system we used Apache Hive so we could query the data using SQL. See the Hive documentation on how to do this: link.

Then recently, when working with Databricks, I read Armbrust’s seminal paper on Delta tables: link. Well, not just Michael Armbrust; 10 other people wrote it too. In that paper they explain how they store the transaction log, checkpoints and change metadata directly within the object store, like below:

It is similar to Apache Hive ACID, but in Hive ACID the metadata is stored in an RDBMS, not in HDFS or object stores. Armbrust calls this approach Delta Lake. It is similar to Apache Hudi and Apache Iceberg.

In this paper (link) they explain the mechanism in great detail, including the storage format, the log checkpoints, the isolation levels, adding log records atomically, writing transactions and reading from tables.

Even today, whenever I query a table in Databricks, I’m amazed by the wonderful mechanism underneath it. Earlier I just took it for granted and didn’t really think about how it works, treating it as just another RDBMS, because we can run SQL select, update and delete statements, etc. (link, link) just like in Oracle or SQL Server. And yes, we can do JOINs too! (link). But underneath, it’s all stored as files in the data lake. What a wonderful mechanism!
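A minimal sketch of what that looks like from a Databricks notebook (paths and table names are made up; `spark` is the session Databricks provides):

```python
# Write a DataFrame as a Delta table: parquet files plus a _delta_log folder in the lake.
df = spark.range(0, 1000).withColumnRenamed("id", "customer_id")
df.write.format("delta").mode("overwrite").save("/mnt/lake/curated/customer")

# Register it and query it with ordinary SQL, even though underneath it is just files.
spark.sql("CREATE TABLE IF NOT EXISTS customer USING DELTA LOCATION '/mnt/lake/curated/customer'")
spark.sql("SELECT COUNT(*) FROM customer").show()
```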

So a data lake these days is not just in the cloud; it is also presented as tables which we can query using SQL statements. That is the internal mechanism of cloud data lakes: either using Snowflake external tables, or using Delta tables in Databricks.

18 January 2022

How to do AI without Machine Learning?

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:40 am

I’m doing a master’s degree titled ML and AI, and all this time I’ve been wondering what the difference between AI and ML is. I know AI is a superset of ML, but what is in AI but not in ML? Is it possible to do AI without ML? If so, how?

The Old Days of AI: rule-based

In the 1990s there was no machine learning. To be clear, machine learning here includes classical algorithms like Decision Trees, Naive Bayes and SVM, as well as Deep Learning (neural networks). There was no machine learning back then, but there was a lot of news about Artificial Intelligence, and Deep Blue was the culmination of that.

So we know there was AI when there was no ML. There was AI without ML. But what was it? What was that AI without ML? Rule-based systems, of course. The technology that Deep Blue used is called an “Expert System”, which is based on rules defined and tuned by chess masters. You can read about the software behind Deep Blue here: link.

A rule-based system is essentially IF-THEN. There are many different types of rules, so I need to clarify which one: it is the IF-THEN rule that makes up an Expert System. There are 2 main components of an Expert System (ES): the Inference Engine and the Knowledge Base. You can read about the software architecture of an Expert System here: link.
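As a toy illustration (nothing like Deep Blue’s actual code), an IF-THEN knowledge base and a tiny forward-chaining inference engine can be sketched like this:

```python
# Knowledge base: each rule is (set of conditions, conclusion).
rules = [
    ({"has_fever", "has_cough"}, "suspect_flu"),
    ({"suspect_flu", "short_of_breath"}, "refer_to_doctor"),
]

def infer(facts, rules):
    """Forward-chaining inference engine: keep firing rules until nothing new is derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Derives both suspect_flu and refer_to_doctor from the initial facts.
print(infer({"has_fever", "has_cough", "short_of_breath"}, rules))
```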

Search and Fuzzy Logic

Besides ML and ES, another way to do AI is using Search. There are various ways to do search, such as Heuristic Search (Informed Search), Iterative Search and Adversarial Search. You can read the details in an excellent book by Crina Grosan and Ajith Abraham: link, page 13 to 129.

In the Expert System world, IF-THEN rules are not the only way to build an Expert System. There is another way: using fuzzy logic. In an IF-THEN rule-based expert system, the truth value is either 0 or 1. In a fuzzy logic system, the truth value is any real number between 0 and 1 (link). There are several fuzzy logic systems, such as Mamdani and TSK (you can read the details here: link).
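A toy sketch of the difference (the membership functions are made up, with min playing the role of AND as in Mamdani-style systems):

```python
def hot(temp_c):          # membership function: how "hot" is this temperature?
    return min(max((temp_c - 20) / 15, 0.0), 1.0)

def humid(humidity_pct):  # membership function: how "humid" is it?
    return min(max((humidity_pct - 40) / 40, 0.0), 1.0)

def fuzzy_and(a, b):
    return min(a, b)

# "IF hot AND humid THEN turn on the air conditioning" gets a fuzzy truth value:
temp, hum = 29, 70
print(fuzzy_and(hot(temp), humid(hum)))   # 0.6, not just 0 or 1
```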

Evolutionary Algorithm and Swarm Intelligence

Another way of doing AI is using an Evolutionary Algorithm (EA). An EA uses concepts from evolution/biology, such as reproduction, natural selection and mutation, in order to develop a solution/AI: link.
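A toy sketch of the idea (illustrative only): evolve a population of bit strings towards a target using selection, crossover and mutation.

```python
import random

TARGET = [1] * 20

def fitness(ind):
    return sum(g == t for g, t in zip(ind, TARGET))

def evolve(pop_size=50, generations=100, mutation_rate=0.02):
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]                 # natural selection
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)           # reproduction
            cut = random.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]                  # crossover
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                   # mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

best = evolve()
print(fitness(best), best)   # typically reaches (or nearly reaches) the all-ones target
```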

And finally, another way of doing AI is Swarm Intelligence: link. Swarm Intelligence (SI) is inspired by the behaviour of a group of animals, such as birds and ants. An SI-based AI system consists of a group of agents interacting with one another and with the environment (similar to Reinforcement Learning but using many agents).
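A toy sketch of the swarm idea (illustrative only): particles search for the minimum of a simple function by combining their own best position with the swarm’s best position.

```python
import random

def f(x):                      # function to minimise
    return (x - 3.0) ** 2

n, iters = 20, 100
pos = [random.uniform(-10, 10) for _ in range(n)]
vel = [0.0] * n
personal_best = pos[:]
global_best = min(pos, key=f)

for _ in range(iters):
    for i in range(n):
        r1, r2 = random.random(), random.random()
        vel[i] = (0.7 * vel[i]
                  + 1.5 * r1 * (personal_best[i] - pos[i])   # pull towards own best
                  + 1.5 * r2 * (global_best - pos[i]))       # pull towards swarm best
        pos[i] += vel[i]
        if f(pos[i]) < f(personal_best[i]):
            personal_best[i] = pos[i]
    global_best = min(personal_best, key=f)

print(round(global_best, 3))   # close to 3.0
```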

Ending

So there you have it, there are several ways of doing AI:

  • Machine Learning
  • Expert System (Rule-Based)
  • Fuzzy Logic
  • Search
  • Evolutionary Algorithm
  • Swarm Intelligence

So just because we study ML, we should not think that it is the only way to do AI. There are other ways, which might be better, which may produce a better AI. Who knows? You haven’t studied them yet, right? Well, I know for sure now that AI is not just ML. I hope this article is useful for you.

References:

  1. Expert System, Wikipedia: link
  2. History of AI, Wikipedia: link
  3. AI without ML, Teradata: link
  4. AI without ML, Claudia Pohlink: link
  5. Rule-based AI vs ML, We Are Brain: link
  6. Intelligence Systems, Crina Grosan & Ajith Abraham: link

17 January 2022

Machine Learning or Data Science?

Filed under: Data Science,Machine Learning — Vincent Rainardi @ 8:07 am

I’ve just got my post grad diploma in machine learning and all this time I was wondering what data science was. I have written an article about what data science is: link, but now that I understand a bit more about machine learning, I understand there is a lot of overlap between the two (ML and DS).

Last night, when I read a data science book by Andrew Vermeulen (link), I was wondering which of the things I’ve learned in ML are actually DS. I listed the items and labelled them ML or DS:

Yes, machine learning is definitely part of data science. Strictly speaking, data cleansing, data analysis, statistics and visualisation are data science but not machine learning. We can see this in these proceedings: link.

So Data Science consists of the following:

  • Data Cleansing
  • Data Analysis
  • Statistics (including probability, central limit theorem, hypothesis testing)
  • Data Visualisation
  • Machine Learning (including all ML models)

But in my opinion one cannot learn ML without studying statistics, visualisation, data loading, data cleansing and data analysis. In order to understand ML models properly, one must understand all the above fields.

Berkeley School of Information argues that the following are also included in data science: link

  • Data Warehousing
  • Data Acquisition
  • Data Processing
  • Data Architecture
  • Business Intelligence
  • Data Reporting

I disagree with this opinion. From what I see in many companies, Data Warehousing, data acquisition/processing and Data Architecture are part of a role called Data Engineer. A Data Engineer prepares and stores the data, including designing the data models and the data ingestion process.

Because Data Visualisation is part of data science, it is tempting to think that Business Intelligence and Data Reporting are part of Data Science too. But this is not true. The data visualisation in data science is more about the data’s behaviour, such as clustering and statistical analysis, whereas BI is more about the business side, such as portfolio performance or risk reporting. This is only my opinion though; I’m sure other people have different opinions.

So there are 2 fields/roles in the data industry these days:

  • Data Science: data cleansing, data analysis, statistics, machine learning, data visualisation.
  • Data Engineering: data acquisition, data loading/processing, data quality, data architecture.

Whereas in the old days the roles were: business/data analyst, data architect, BI developer and ETL developer.

15 January 2022

Do we still need a data warehouse?

Filed under: Data Warehousing — Vincent Rainardi @ 9:07 am

Yesterday someone asked me what I would include in an “introduction to data warehousing” (including data marts). I have been asked that question a lot over the years. This is because I wrote a book on data warehousing, and a blog too. People contact me through my blog and ask, “I want to learn data warehousing, what should I study?” From 2008 to 2010 the answer was not difficult: “Just read my book”. But today, in 2022, there has been so much change in data warehousing that what I wrote in 2007 is no longer enough. It still provides the basics, but there are so many new things going on.

For a start, the architecture is now different. See the architecture for a modern data warehouse here: link. Then we have a data lake. See the architecture of data lake + data warehouse here: link. And now we have a Data Lakehouse, a data lake which has data warehouse features: link.

So what would I write or say to someone who is completely new to data warehousing and wants to learn about it? Do I say “Don’t learn it”? I actually thought about this answer for a long time. Is data warehousing no longer relevant in 2022? Has it been completely replaced by data lakes? Can it still be useful in the era of AI and machine learning?

Still Relevant

After considering many factors and weighing everything up, I am now sure that the answer is: it is still relevant. You still need to learn data warehousing. Why is that? For a start, a data warehouse integrates data from many different sources. Does a data lake do that? No. A data lake only puts the data in one place, but does not integrate it. A dimensional model is the right way to integrate customers, to integrate products, to integrate sales from multiple sources (those who have experienced it will understand this). An MDM integrates the master data, but not the transaction data. You need a DW/mart to integrate both.

Secondly, a data warehouse is easy to use. It uses a star schema which provides granularity at the base level, and yet is easily rolled up to higher levels, and you can do this roll-up on any attribute. It has been developed over 20 years; it is mature and has been implemented for all kinds of use cases, thanks to Bill Inmon, Ralph Kimball, Margy Ross, Bob Becker, Joy Mundy, Warren Thornthwaite, Chris Adamson, Nicholas Galemmo, Claudia Imhoff and thousands of other dimensional practitioners. See my Who’s Who in Data Warehousing here, and all data warehousing authors here.
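As a tiny illustration of that roll-up idea (made-up data; in practice this would be SQL against the warehouse rather than pandas), a fact table joined to a dimension rolls up on any dimension attribute:

```python
import pandas as pd

dim_product = pd.DataFrame({
    "product_key": [1, 2, 3],
    "product_name": ["Widget A", "Widget B", "Gadget C"],
    "category": ["Widgets", "Widgets", "Gadgets"],
})
fact_sales = pd.DataFrame({
    "product_key": [1, 1, 2, 3, 3],
    "sales_amount": [100.0, 150.0, 80.0, 200.0, 120.0],
})

# Roll up from transaction grain to category simply by joining and grouping.
rolled_up = (fact_sales.merge(dim_product, on="product_key")
                        .groupby("category")["sales_amount"].sum())
print(rolled_up)
```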

In hundreds of thousands of companies throughout the world, big and small, data warehouse is the back end of millions of reports and analytics. It supports the business intelligence and now machine learning too. You need to have a working knowledge of data warehousing to work in the reporting team in these companies (almost every company has one). Not just modelling techniques, but also loading techniques. And reporting techniques. And infrastructure knowledge too i.e. database, security, DevOps.

Replaced By Data Lakehouse?

Is it true that data warehousing can be replaced by a Data Lakehouse? I said that here. But I was wrong. Yes, a data lakehouse has ACID properties. But that does not make it a star schema. It only makes it a database with lots of disconnected tables. Can you join the tables? Of course you can, but it is not as slick as a star schema, and nowhere near as integrated (e.g. customers, products, securities). Those who have tried different approaches to data integration will appreciate the power of the dimensional approach.

Can we build a dimensional model in a data lakehouse? Of course we can, but it is not the right platform. It is far better to build the dimensional model on a relational database. Because physically a data lakehouse is just a collection of files, the reason to put a SQL layer on the top is to enable us to access the files easily. But a data lakehouse was not designed to do join, indexing, filtering, etc. that a dimensional model requires. Dimensional model was designed and born in a relational database platform, which supports it very well.

A Data Lakehouse can’t replace a dimensional data warehouse, because it doesn’t have integration capability and it’s not as easy to use/query. If what you need is to put data from different systems in one place, then yes, use a Data Lakehouse (which is better than a data lake, see here). But the data will be disintegrated. You won’t have a single customer table, or a single product table. There won’t be conversion between different measures (say sales from different systems) to integrate them into one table.

Reporting is all about integrating data from various data sources. If you only have one data source, then you can just report from that single source directly. It is because your data is in different places that you need a data warehouse. A data lake or a data lakehouse does not “stitch” the data. Inside a data lake/lakehouse the data is still disintegrated.

Please read James Serra’s article on this point here: link. I completely agree with James that for a small amount of data (say 10 tables), or if the users are all data scientists, or to build a POC or get a quick win with a report/dashboard, you can use a Data Lakehouse.

Don’t worry if you don’t know what a Lakehouse is; many people in data warehousing and BI don’t. For that I would recommend reading the CIDR 2021 paper by Armbrust et al: link. This paper explains the Lakehouse architecture and how it addresses some of the data warehousing challenges.

What to study in data warehousing?

So, these days, what does one need to study in data warehousing? (I include data mart)

First of all, they need to understand the problem, preferably illustrated using several cases: the difficulty of reporting on or analysing data, because the data needs to be transformed, because the data does not conform to one another, and because the data is located in different places. They need to understand the purpose of the whole exercise, i.e. that a data warehouse/mart makes reporting seamless and easy because the data is integrated and conformed, and that the purpose is to enable users to do analysis and reporting.

They can call it BI, AI, or whatever, but they must realise that it is only possible because all the data they need is integrated in one place. Yes, a BI system can connect to the US, Japan and France databases to get the sales figures, but that is not integration. They have to think about the FX rates, about one source being quarterly while another is daily or monthly, about different data using different product codes and different hierarchies. How do you integrate them? They must appreciate what a data warehouse (and all the analysts, developers and testers behind it) goes through to integrate the sales data worldwide, the AUM data worldwide, the customer data worldwide, the financial data worldwide. Not “all the data in one place”, but “all the data integrated in one place”. The difference between the two is not one little word. The difference, my friends, is a million dollars and 2 years of hard work.

It is equally important to understand that if the client is small enough and lucky enough to have just one currency, one office and one system, then they don’t need a data warehouse: report directly from that system. Because of their size, there is no way to justify a million dollar spend, or having 5-10 people in IT working on this for 2 years. If the student understands this perspective, they are on the right track.

Second, they need to understand why we need to use dimensional modelling. Why not use relational models, or transactional reporting. Why not use a data vault or data mesh. Why not use data lake or data lakehouse. They need to understand which architecture is suitable for which cases. If you just want to analyse Twitter big data, then there is no need to build a DW or mart. Forget dimensional modelling. You need to use a data lake.

Until they are able to answer “which cases are suitable for data warehousing and which are not”, the learner should not move on to the next step. It is better to understand the why without understanding the what than the opposite. As a learner, ask yourself: “Why don’t I use a lake or a relational model?” If you can’t answer this, keep learning at this second stage. Keep reading and thinking until you can answer that question.

Third, study the architecture. Start with the architecture of a modern data warehouse (i.e. a lake + a warehouse): link. Carefully look at where ML, BI, IoT (sensors), NoSQL, MDM, the DW and the marts are positioned in this architecture. Then study dimensional modelling concepts: link. This is a vast topic and you will need at least 3 months to understand all the details, so please don’t get trapped in the details. Instead, keep to the basics, which you can learn in a week. And as a beginner, limit yourself to only what Kimball wrote (the Kimball Group, not just Ralph Kimball). I can’t say this enough: avoid the advanced stuff and keep to the basics, i.e. fact tables, dimension tables, SCD, hierarchies, and that’s it.

After studying the basics of dimensional modelling, ask yourself: why use star schema? What is the advantage compared to other models? (i.e. snowflake, relational, data vault, ODS) Until you can answer this question you shouldn’t move on to the next step. There is no point understanding the what, if you don’t understand the why. Keep reading until you can answer that question: why do you have to arrange the data in dim and facts, why not relational model, snowflake model, data vault model, or ODS model?

So those are the first 3 things to learn in data warehousing: the purpose, the architecture, and dimensional modelling.

How DW is used and populated

After those first 3 things, the next stage is to understand how a DW/mart is used: BI, reporting and analytics. Not how to get the data out of the DW/mart, but how the reports will be used. Many people are against this; they feel that people in IT don’t need to know the business. On the contrary, we absolutely need to know the business: what the data will be used for. The whole design is based on how the data will be used. So if you don’t know the business, your DW will be poorly designed, and your reports/dashboards too.

15 years ago, BI & reporting were almost the whole purpose of data warehousing. How about today, are they still relevant? Does one still need to learn BI & reporting tools such as Power BI, Tableau and Looker? Yes, today BI is still a major use of a data warehouse, and yes, one still needs to learn BI/reporting tools. But I would argue that, considering there are so many things to learn in a modern DW, BI/reporting tools can be considered a separate skill to master. Ditto ML (AI): more and more, the DW is used for ML, but does that make ML one of the subjects that one needs to learn in DW? No, it is a separate subject. So when one learns data warehousing, I would recommend not learning BI tools and ML at the same time. Finish learning DW first (see below), then do BI or ML (not both; pick only one, because each is a very large subject to learn).

Then you need to understand the various techniques of data loading: ETL, ELT, ingestion, ETLTL, streaming, (small) batching, and so on. Knowing the model is one thing, but loading data into that model is a completely different thing. There are various tricks and methods that we need to know. Staging areas, data quality, alerts, monitoring, error handling: all of these need to be built.

A modern data warehouse

In a Modern Data Warehouse, we have a lake and a warehouse (link). So not only do you need to know how to load data into the dimensional model, but also how to ingest data into the data lake (preferably a data lakehouse, loading the metadata too), and how to get the data out of the lake into the warehouse. So it is critical that you know the ins and outs of data lakes: ADLS Gen2, ADF pipelines, Synapse, Databricks, Python notebooks, Spark, linked services, Key Vault, APIs, storage accounts, resource groups (if you use Azure; otherwise the corresponding services in AWS or Google Cloud, such as Redshift or BigQuery). This is where an old-school DW practitioner usually gets stuck: a modern DW demands cloud technology, and they are stuck in the old SQL world. They are not flexible enough (or humble enough) to adapt and learn the cloud and data lake technologies.

A modern data warehouse can also have machine learning, streaming data and master data management. Machine learning is so popular these days that every company is using it. It’s no longer just about BI; it’s about AI. So you need to know how the data scientists are using the data to make predictions, which is completely different from BI and reporting. Streaming data like social media feeds, IoT feeds (Internet of Things / sensors) and audio-visual streams are quite common these days (yes, even in banking). For this we use Kafka or Azure Stream Analytics (or Kinesis/Flink in AWS). Data mastering is sometimes outside a modern DW, but sometimes it is included in the ecosystem.

Data Governance & DevOps

A DW ecosystem must always be accompanied by a data governance programme. Yes, the data quality process is built in (into the modern DW ecosystem), but Data Governance is completely different. Data Governance includes data definition, data ownership, data mastering, data lineage and data quality. Whether you are doing a lake, a DW, an ODS, a Data Vault or even a transaction system database, you need to understand Data Governance. Otherwise what you are doing doesn’t make sense. There is no point creating a data warehouse if we don’t understand the principles of Data Governance.

A data warehouse without Data Governance may be able to deliver in the short term, but not in the long term. So we need to know it. Data Governance is a vast world with a myriad of tools, so as a beginner you will need to limit yourself to just the essentials. Just try to understand the concepts, and forget about the tools and how to use each of them.

Any IT development requires DevOps, and data warehousing is no exception. DevOps (development operations) means development processes and production processes. Development processes include coding, automated testing, pull request, branching, release pipelines, as well as the infrastructure such as databases, servers, service accounts, resource groups, storage accounts and networking. Production processes include change control, monitoring data ingestion, setting up alert, error logging, and managing user access. Knowing DevOps would help a lot in data warehousing.

Recap

To recap:

  • I want to learn data warehousing, what should I learn?
    The purpose, the architecture and dimensional modelling.
    Then the business knowledge, BI/reporting and data loading.
    Then data governance and DevOps.
  • Is learning data warehousing still relevant today?
    Yes, learning data warehousing is still relevant today because of its capability to integrate data, it is easy to use and because thousands of companies throughout the world have data warehouses.
  • Will data warehousing be replaced by data lake/lakehouse?
    No, because a data lake/lakehouse doesn’t have integration capability, it is not easy to use/query, and it is not a suitable platform for dimensional model.
    There are business cases which are suitable for data lake (ones which don’t require data integration), and there are business cases which are suitable for dimensional data warehouse. They have different purposes and capabilities. A modern DW includes a data lake.

In the meantime, if you are new to data warehousing, or looking for inspiration, read the Who’s Who in Data Warehousing: link. Even for an experienced veteran, it is still a good read. You will recognise all the names, which makes it interesting.

If you are a learner, I wish you all the best in your data warehousing journey. Enjoy your learning experience, that’s the most important thing. Not the results. Or achievements. Any questions, just ping me at vrainardi@gmail.com, I’ll be happy to answer them. I now spend half of my time in machine learning, but still work in data warehousing. And still learning data warehousing.
