Using Production Data in Dev and Test Environments

13 January 2024

Using Production Data in Dev and Test Environments

Filed under: Data Warehousing — Vincent Rainardi @ 4:40 pm

There are many data projects. It could be a business application like payment systems, it could be a data warehousing project, it could be a data migration project. What I’m writing about here is specifically for a data warehousing project. So for other types of data project it might not be appropriate or even applicable.

Goal of a data warehousing project

The purpose of creating a data warehouse is to enable reporting and analytics which can’t be done directly on the main business application. Either because the buiness application is too transaction oriented, or because the reporting requires historical balances whereas the business application only stores today’s balances. The other reason is that because reporting/analytics would slow down the operational performance of the business application.

The second purpose of creating a data warehouse is to integrate data from many applications or many data sources. This is something that data lakes fundamentally can’t do. If the assets of your company is located in ten different systems, then you need to integrate them before you can report the totals. By assets I mean factories, offices, trucks, machinery, IT equipments. If your customer data is located in ten systems, if your revenues are stored in ten different applications, then you will need to integrated them in one place before you can report the totals. And that place is a data warehouse.

Testing a data warehouse

As I said above, a data warehouse has a reporting/analytic front end, such as Tableau, Looker or Power BI. Or a custom application, developed in house for a specifical business purpose. Or an AI application, such as forecasting and classification. Or Generative AI.

So the first and most important test of a data warehouse is that those front end reporting/analytic tools (or AI) are producing the correct results. Correct means matching expectations. So you need to calculate in advance, what the results should be. And based on that you can judge whether the numbers on the report/dashboard/apps are correct or not.

So a data warehouse is never a stand alone system. The data inside the warehouse is used for something, and it is the output of that something that is tested. For example, if you use your data warehouse to integrate customer data from five different applications (CRM, sales, billing, finance, services), your integrated reporting should be outputing the correct number of customers, and each customer needs to have the correct purchase amounts, invoice amounts and payment amounts. For each customer the account manager, name, applicances, service agreements must be correct.

If the output of those reports are incorrect, then the testers would look at the data in the warehouse. Does that customer have the correct account manager, address, service agreement, etc. So it’s a two level testing: firstly test the report, and secondly look at the warehouse. For this testing, we use real production data. Because in data warehousing we want to test if the values displayed in the reports are correct or not.

Data In Development

Before the developers hand it over to the testers, they need to test it first. For example, if the bug they are fixing is: the January invoice for Mr Harris should be $2400 not $3700, then in Dev they need to see this $2400 amount so they can find out why it is $2400. In the source application (billing) it is $3700, so somewhere in the ETL the amount changed to $2400.

If the issue they are trying to fix is: the sales amount for January in LA should be $86k not $83k, and the issue is because the customer address is not correct, they need to see the sales invoices data (real data) and customer addresses (real data). The issue could be because a customer which is located in LA but the city is incorrectly set to Denver.

So in data warehousing, in Dev we can’t use dummy data, like fake amounts and fake names. Perhaps when developing a business application you can use dummy data, but in data warehousing you can’t. Data warehousing is about correctly aggregating/integrating data from several systems, or taking weekly snapshots from one system, for the purpose of reporting/analytics (or AI).

Therefore it is of paramount importance that the data in the warehouse (in Dev and Test) is the same as in Production. Because the testing is about check the numbers, and bug fixing is about fixing incorrect numbers (incorrect invoice amount, incorrect city, incorrect logic, etc.) If you put dummy data in Dev and Test, you can’t function as a tester or developer.

Data warehousing vs payment system

When you are developing a new payment system, you can use dummy data in Dev and Test. Made up data. Like Mr John Doe paying $100 on 31^st January. But that is only when you are developing. Once it goes live, then for fixing issues and testing the fix you need to have live production data in Dev and Test. See the top half below.

But as we can see on the bottom half above, in a data warehousing project, the real data already exists in the source system. And the purpose is to report on that data. So from the moment the project starts, development uses that real data from the source system. The development is about flowing that real data from the source system down to the data warehouse, and onto the reports. So it does not make sense to develop a data warehouse using dummy/made up data.

Data warehouse vs data migration

A data migration project moves data from one system to another. During the project they develop a system which transfers the data, transforms the data and validates the data quality. Like data warehousing, a data migration project uses real live data in Dev and Test. This is because the live data already exists in the system to be migrated, and the main purpose of the project is to transfer the data to the target system. See the bottom right below, in blue.

That’s why the migration tooling which was developed in a data migration project (to transfer, transform and validate data) uses the live data from the source system. There is no point using dummy/fake/made up data, because the whole purpose of the project is to transfer the data over!

Data Protection

Of course now we have GDPR (General Data Protection Regulation, applicable in EU countries, and ex-EU), meaning if that data is a customer data, then it needs to be processed lawfully, fairly and transparantly. Meaning that the customer has to know what we do with their data. In data warehousing, the purpose is for reporting. In data migration, the purpose is for transfering to the new system. Those purposes have to be specified to the customer, when we collect that personal data.

This is because customer data contains PII (Personally Identifiable Information) such as name, date of birth, phone number, email address, passport/NI number and home address. If you work in a bank or a wealth manager, then that customer data also contains financial information, like their bank account number, how much money the customer has in that account, and when they made big transfers, etc. If you work in a hospital, then that customer data also contains medical information, like what illness they had in the past, family medical history, lab test results, and the medication they are currently on. This financial and medical information is called SPII (Sensitive PII).

We can only process data which are relevant to the purpose that we have communicated to the customer. We need to keep customer data accurate and up-to-date. We cannot keep customer data longer than necessary. When we no longer use it, we need to delete it. And above all, we need to keep customer data secure. Meaning that we store the customer data in a secure place, i.e. only people with legitimate rights can access it. This is called access control, and it is of paramount importance. When we transfer the data, it has to be encrypted, especially when it passes public network.

Which is why in places like hospitals and banks, they prefer the IT developers and testers to not have access to SPII. Developers and testers should not see how much money Mr Harris has in his account, or Mrs Harris medical history. Not only internal IT staff (direct employee) but also IT contractors, IT consultants and outsourced IT companies. It’s ok if it’s retail business, like fashion retail and supermarkets, or manufacturing business – because the customer data is PII. But financial and medical data (banks, wealth management, hospital, NHS) they are SPII.

But as I mentioned from the beginning, it is unavoidable that in data warehousing and data migration projects IT developers and testers see those SPII data. They are the ones who moves the data from the source system into the reports or to the target system! So for that GDPR assessment needs to be conducted, forms filled, all the risks are evaluated and considered, all data processing and storage are documentated, data security are reviewed to the minute details, penetration testing conducted, and all that are presented to GDPR officer in the company.

If there is a breach, where the customer data is out in the public domain accidentally, the process needs to be defined in advance, and executed step by step as planned. Everything needs to be prepared in advance. Who should be informed (regulators, customers, etc.), how to contain the issue, where to investigage, etc. Particularly for SPII data. So it is a lot of work in terms of protocol and GDPR, but it has to be done. Because, in data warehousing and data migration projects, developers and testers will need access to live, real data.

Summary

In data warehousing and data migration projects, live data needs to be in development and test environments from the beginning. In a business application project like building a new payment system, we can use fake/made up data when building the new system, but after the system goes live dev and test needs to be populated with live data from production, in order for developers and testers to be able to fix any data issues and bugs.

Data Protection regulations like GDPR dictates that we must process customer data lawfully, fairly and transparantly. The customer needs to be informed, when we collect their data, for what purpose their data will be used for. We need to keep customer data accurate and up-to-date. We cannot keep customer data longer than necessary. We need to keep customer data secure. We need to store it in a secure place, and only people with legitimate rights can access it (access control). When we transfer the data, it has to be encrypted, especially when it passes public network. Every aspect of data processing involving PII and SPII needs to be documented and risk-assessed, including the systems, the controls, and the security.

Leave a Comment »

No comments yet.

RSS feed for comments on this post. TrackBack URI

Data Platform and Data Science

13 January 2024