Data Warehousing and Machine Learning

26 November 2021

Automating Machine Learning using Azure ML

Filed under: Data Warehousing — Vincent Rainardi @ 8:06 am

I have been using Google Colab (see my article here) and Jupyter to build and train various machine learning models, including LR, KNN, PCA, SVM, XGBoost, RF, NLP, ANN, CNN, RNN and RL. I have also been using Azure (Databricks, Data Lake, Data Factory, SQL, etc.), so I'm intrigued to try Azure Machine Learning and see if it is as good as Colab.

The first thing I notice in Azure ML is Automated ML, which enables us to train an ML model without doing any coding. We specify the data, and Azure ML will try various algorithms, build various models, and evaluate them according to the criteria that we give.

This sounds too good to be true, but it is entirely possible. One of my ML projects is about credit card transactions. In that project I used 6 algorithms (LR, KNN, SVM, DT, RF, XGB) and each model has many hyperparameters. Each of these hyperparameters has many values to try. So to find the best model I had to do a lot of hyperparameter tuning using GridSearch cross validation on the training data. Once I found the best parameters for a model, I had to evaluate its performance on the test data using the Area Under the ROC Curve, or AUC. Then I had to select the best model based on that evaluation. And on top of that I needed to find out the top features. Can all this be automated in Azure ML? Sounds too good to be true, but entirely possible.

First, I loaded the data using a procedure similar to this demo: link. I then set the evaluation metric to AUC, the train-test split to K-fold cross validation with K = 3, the ML algorithm to auto, "explain best model" to True, and the maximum concurrent iterations to 5. For the compute node I used a DS12 v2 with 4 CPUs, 28 GB memory and 200 GB SSD space (16×500 IOPS).
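
To give a sense of the equivalent code, here is a rough sketch using the Azure ML Python SDK. The dataset name, label column, experiment name and compute cluster name are placeholders, not the actual ones I used:

from azureml.core import Workspace, Dataset, Experiment
from azureml.core.compute import ComputeTarget
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()                                        # reads the config.json downloaded from the portal
train_data = Dataset.get_by_name(ws, "credit-card-transactions")    # placeholder dataset name

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="AUC_weighted",                        # evaluation metric = AUC
    training_data=train_data,
    label_column_name="is_fraud",                         # placeholder label column
    n_cross_validations=3,                                # K-fold cross validation with K = 3
    max_concurrent_iterations=5,                          # maximum concurrent iterations = 5
    model_explainability=True,                            # explain the best model
    compute_target=ComputeTarget(ws, "ds12-v2-cluster"),  # placeholder compute cluster name
)

run = Experiment(ws, "automl-credit-card").submit(automl_config, show_output=True)
best_run, best_model = run.get_output()                   # the best model once AutoML finishes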

The top 10 models came out like this:

I expected the XGBoost classifier (XGB) to be the top model and it is (I didn't enable neural networks in the AutoML). The top XGB model uses SparseNormalizer, which is expected because the data is skewed on many features. A training time of 2m 24s on 30k observations/examples on 4 CPUs/28 GB is not quick.

The eta (learning rate) is the step size shrinkage used in updates to prevent overfitting. In this case it is 0.4 (the default is 0.3, the range is 0 to 1, see link, link). Gamma is the minimum loss reduction required to make a further partition on a leaf node of the tree, ranging from 0 to infinity (default is 0). It is a regularisation measure, and in this case it is conservative (the larger the gamma, the more conservative the model is). The maximum depth is 10. For comparison, when I tuned my XGB model for the credit card fraud data, the eta was 0.2, the gamma was 0 and the max depth was 6.
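
For illustration, here is a minimal sketch (on synthetic data, not the data in this post) of fitting XGBoost with the hyperparameter values I ended up with for the credit card fraud model, i.e. eta 0.2, gamma 0 and max depth 6:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# synthetic, imbalanced data standing in for the credit card transactions
X, y = make_classification(n_samples=30000, n_features=20, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = XGBClassifier(
    learning_rate=0.2,   # eta: step size shrinkage, default 0.3, range 0 to 1
    gamma=0,             # minimum loss reduction required to split a leaf node, default 0
    max_depth=6,         # maximum tree depth
)
model.fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))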

We can see it in more detail by clicking the Algorithm Name, then clicking View Hyperparameters:

We can see the top influencing features like this:

F is the feature number, such as account age, location, or customer behaviour.

We also get a chart of the top feature against the probability of the predicted variable, like this: (I would prefer charting the top two features on the x and y axes, but for something that comes out of the box this looks good and useful)

And we get the Precision-Recall chart out of the box too (you can choose which version of AUC to use, i.e. weighted, macro or micro):

An ROC curve is the True Positive Rate (TPR) on the Y axis against the False Positive Rate (FPR) on the X axis, so the above is not an ROC curve. But it gives us a good sense of how we can maximise recall or precision.

We want recall to be as large as possible and precision to be as large as possible, but the curve limits them, so there will always be a trade-off between the two. For example, if you take the Weighted Average AUC line, the best balance between recall and precision might be point A. But in the case of credit card fraud we would want high recall, so we would choose point B instead of point C, which is for high precision.
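
For comparison, this is roughly how you would produce the same two charts yourself with scikit-learn (reusing the model, X_test and y_test from the XGBoost sketch above):

from sklearn.metrics import precision_recall_curve, roc_curve
import matplotlib.pyplot as plt

scores = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, scores)
fpr, tpr, _ = roc_curve(y_test, scores)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(recall, precision)
ax1.set_xlabel("Recall"); ax1.set_ylabel("Precision"); ax1.set_title("Precision-Recall")
ax2.plot(fpr, tpr)
ax2.set_xlabel("False Positive Rate"); ax2.set_ylabel("True Positive Rate"); ax2.set_title("ROC")
plt.show()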

And AutoML in Azure ML also gives us the data transformation steps, such as below:

We can see above that during the preprocessing of the data, for numerical features AutoML uses a MeanImputer to mitigate missing values, whereas for categorical features it uses a CharGram count vectoriser and a ModeCatImputer with a label encoder. It then uses a maximum absolute scaler before feeding the preprocessed data to the XGBoost model.
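
As a rough approximation (this is not Azure ML's actual implementation, just scikit-learn equivalents, and the column names are placeholders), the preprocessing above is similar to this pipeline; the CharGram count vectoriser is omitted for brevity:

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MaxAbsScaler, OrdinalEncoder
from xgboost import XGBClassifier

numeric_cols = ["amount", "account_age"]          # placeholder column names
categorical_cols = ["merchant_type", "country"]   # placeholder column names

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),            # mean imputer
                      ("scale", MaxAbsScaler())]), numeric_cols),            # maximum absolute scaler
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),   # mode imputer
                      ("encode", OrdinalEncoder(handle_unknown="use_encoded_value",
                                                unknown_value=-1))]), categorical_cols),
])

pipeline = Pipeline([("preprocess", preprocess), ("model", XGBClassifier())])
# pipeline.fit(X_train, y_train)   # X_train would be a pandas DataFrame with the columns above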

Overall I found that AutoML is useful. It tried various algorithms including Random Forest, Logistic Regression and XGBoost: over 60 models, in under 2 hours! The test AUC is 94.8%, which is a good result for this data. And it gives us feature importance as well. It tried various hyperparameter values for each model and chose the best values for us, automatically. Very, very easy to use. Well done Microsoft! Of course, once we get the top models from AutoML, we can tune them further ourselves to get a higher AUC. It is finding the top models which is very time consuming (it took me a week, but with AutoML it only took 2 hours).

25 November 2021

How to download files from Azure storage account using Control-M

Filed under: Data Warehousing — Vincent Rainardi @ 8:46 am

Control-M has a dedicated Azure Blob Storage job: (link)

On the Action dropdown we can select Upload, Download or List. Upload means uploading a file from an on-prem folder to an Azure container, whilst Download means the reverse. List means getting the names of the files in an Azure container.

We can also copy a file from one container to another, delete a file, and create or delete a container. We just need to specify the appropriate action on the Action dropdown list.

Note that we must install Azure CLI 2.0 and either include the Azure authentication parameters in the connection profile or authenticate the CLI ourselves.
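
For reference, this is roughly the equivalent operation done directly with the Azure Storage Python SDK (azure-storage-blob). It is not what Control-M runs internally, just an illustration; the account URL, container name, prefix and credential are placeholders:

from azure.storage.blob import ContainerClient

container = ContainerClient(
    account_url="https://mystorageaccount.blob.core.windows.net",   # placeholder account
    container_name="landing",                                       # placeholder container
    credential="<account key or SAS token>",
)

for blob in container.list_blobs(name_starts_with="daily/"):        # the "List" action
    local_name = blob.name.split("/")[-1]
    with open(local_name, "wb") as f:                               # the "Download" action
        f.write(container.download_blob(blob.name).readall())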

Upload files to FTP server

Once the files are downloaded to a local folder, we can push them to an FTP server using the Managed File Transfer job (MFT, link, link). Note that the File Transfer Server must be configured first.

24 November 2021

Azure Big Data Analytics

Filed under: Data Warehousing — Vincent Rainardi @ 8:57 am

I'm not going to dwell on "what is big data". You can read the definition here: link. It is basically data that doesn't fit well in a database, so we store it as files in Hadoop or a data lake.

In this article I would like to write specifically about big data analytics, i.e. how big data is processed and analysed. Specifically, which tools in Azure we can use to analyse big data.

Databricks and HDInsight

The most popular approach today is to store big data in Azure Data Lake (ADLS2), create a Spark cluster on top of it, and do the analysis using Azure Databricks notebooks. You can use SQL, Python, R or Scala to query and analyse the data. Here is the architecture (link):
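
To give a flavour of it, here is a minimal sketch of querying files in ADLS2 from a Databricks notebook with PySpark (the storage account, container and path are placeholders; the spark session is provided by the notebook):

path = "abfss://raw@mydatalake.dfs.core.windows.net/transactions/2021/"   # placeholder ADLS2 path

df = spark.read.option("header", "true").csv(path)
df.createOrReplaceTempView("transactions")

spark.sql("""
    SELECT country, COUNT(*) AS txn_count, SUM(amount) AS total_amount
    FROM transactions
    GROUP BY country
    ORDER BY total_amount DESC
""").show()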

The older method is to use Azure HDInsight. We create an HDInsight Spark cluster on ADLS2 storage and put the data in it. We then use Jupyter notebooks to query and analyse the data, using PySpark, SQL or Scala. We can also use HBase to process NoSQL (schemaless) data and LLAP to query Hive tables interactively. Here is the architecture (link):

Stream Analytics

One of the things in big data is streaming data, such as stock market data, social media feeds, web logs, traffic data, weather data and IoT data (sensors, RFIDs).

The most popular method is to use Azure Stream Analytics (ASA) to analyse real-time data streams. ASA can take data from Azure Event Hubs, Azure IoT Hub or Azure Blob Storage. The ASA query language is based on T-SQL (link), which we can use to filter or aggregate the data stream over time. Here is the architecture (link):

If we use HDInsight, the older method is to use Kafka to build real-time streaming data pipelines and applications. We can also use Storm to do real-time event processing. Here is the architecture (link):

Azure Synapse

The alternative to Databricks and HDInsight is Azure Synapse Analytics (link). We put the data in ADLS2 and use Azure Synapse Analytics to analyse it using SQL or Spark. We can also use Data Explorer for time series data. In Synapse Studio we can create pipelines (link) to process data (similar to ADF, but not the same: link). Here is the architecture: (link)

Machine Learning

These days, the main analytics is not business intelligence or reporting, but machine learning. Machine learning may not be the most widely used form of analytics, but it is certainly the most powerful, i.e. in terms of prediction capability, understanding the most influential factors, etc.

Whether your data is in Databricks, HDInsight or Synapse, you can use Azure Machine Learning (AML, link). We can use AML to create ML models for prediction or regression, whether using a Spark MLlib notebook (link), a Python notebook (link), or without coding (link). Here is the architecture: (link)

As somebody who has developed many ML models, I can tell you that DevOps is usually a big gap. Azure Machine Learning has complete DevOps, from development to deployment and operational support. This takes away a big headache if you are a development manager trying to do machine learning in your company.

Data Visualisation

Of course, no discussion about analytics is complete without mentioning data visualisation, i.e. BI and reporting. There is only one tool in the Microsoft toolbox: Power BI. Whether your data is in a data lake, data warehouse, data mart, CSV files, Excel or an API (or any other form), the Microsoft way is Power BI. No, they don't promote Python visualisation libraries such as Seaborn; in fact, you can do Python visuals within Power BI: link. Here is the architecture: (link)
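
As a small illustration, a Python visual in Power BI receives the selected fields as a pandas DataFrame called dataset, so a Seaborn chart is only a few lines (the column names below are placeholders):

import seaborn as sns
import matplotlib.pyplot as plt

sns.barplot(data=dataset, x="Product", y="Cost")   # dataset is created by Power BI from the selected fields
plt.show()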

8 November 2021

What is a Data Lake?

Filed under: Data Warehousing — Vincent Rainardi @ 9:05 am

When asked what a data lake is, most people describe its function rather than describing what it is physically. That is like answering the question "What is a car?" with "A vehicle that we can use to move from place A to place B". This is because not many people know what a data lake is physically. So in this article I would like to answer that seemingly simple question.

Collection of Files

A data lake, physically, is a collection of files. Files can be structured (tabular, hierarchical, graph, etc.) or unstructured (images, audio, videos, documents, etc.). Typically, structured data files are stored as delimited text files such as CSV or pipe-delimited files. Other commonly found formats for data files are JSON, XML and Excel.

Unstructured data files such as multimedia files are stored in their native formats, such as:

  • Images: JPEG, PNG, GIF, BMP (link)
  • Audio: WAV, MP3, M4A (link)
  • Video: MPEG, AVI, WMV (link)
  • Document: PDF, HTML, DOCX (link)

Other types of files stored in the data lake are:

  • Database backup files (typically .BAK but can be other extensions)
  • Log files (typically .LOG but can be other extensions)
  • Email files (typically MSG, PST, EDB, OST)
  • Social media data such as Facebook and Twitter

BLOB (Binary Large Object)

Let me clarify a term which is used often, but what it is physically is not usually clear: BLOB. A blob, or binary large object, is a file containing binary data, such as multimedia files or executable files. So originally "blob" did not include human readable files such as text files. But in the data lake world, blob generally means any file, including human readable files.

But if we want to be precise and get physical, a blob is not stored as a single file. A blob is stored as a collection of blocks or pages, which can be organised in different ways.

In Azure, there are 3 types of blobs (or blob storage): block blob, append blob and page blob. When people say "file", it is generally a block blob, which is optimised for storing and uploading large amounts of data. Append blobs are optimised for appending data at the end of the blob (not efficient for updating or deleting existing blocks), and page blobs are optimised for random read/write access, e.g. virtual hard disk files.

In AWS, files are called objects. What we call "blob storage" in Azure is called an object store in AWS. An object store in AWS uses unique keys to store and retrieve objects. An object in AWS consists of the file data and metadata describing the object.

Containers

The files in the data lake are organised in folders. The top-level folders are called "containers" or "buckets": in Azure they are called containers, and in AWS and Google Cloud they are called buckets. Within a container there can be multiple levels of folders (folders within folders).

Each container belongs to an account. These accounts are called "storage accounts". Users are permissioned to access these storage accounts, but users can also be permissioned to access an individual container or file.

HTTP Access

Files in the data lake are accessible via HTTP. The data lake has a RESTful architecture, so each file has a URI (i.e. a URL).

For example, in Azure the general pattern is https://<storage account>.blob.core.windows.net/<container>/<folder>/<file>, or https://<storage account>.dfs.core.windows.net/<container>/<folder>/<file> when using the ADLS2 (Data Lake Storage) endpoint.

Database-like Access

Structured data files in the data lake can be accessed as if they were tables in a database. We can do this in Azure, AWS and Google Cloud. For that we use technologies like Hive and Databricks. In Databricks we create a Hive table which is linked to a data file. In Databricks, a collection of tables is called a "database".

Afterwards, we can query that Hive table using an SQL SELECT statement (Spark SQL). To run those Spark SQL queries we need to create a Spark cluster.
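
For example, in a Databricks notebook the table creation and the query could look like this sketch (the path, table and column names are placeholders):

spark.sql("""
    CREATE TABLE IF NOT EXISTS sales_orders
    USING CSV
    OPTIONS (header "true", inferSchema "true")
    LOCATION 'abfss://raw@mydatalake.dfs.core.windows.net/orders/'
""")

spark.sql("""
    SELECT customer_id, SUM(order_value) AS total_value
    FROM sales_orders
    GROUP BY customer_id
""").show()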

Database-like access is a very important feature of a data lake. It means that the data lake is queryable as if the data were stored in SQL tables. Many BI tools (such as Power BI) can access Databricks tables, be it in Azure, AWS or Google Cloud. So we don't need to put the data into a SQL Server or Oracle database; the BI tools can directly query the data lake as if it were a set of database tables.

18 October 2021

Why Data Lake?

Filed under: Data Warehousing — Vincent Rainardi @ 6:55 am

I’ve been building data warehouses for 15 years and my typical architecture is like this:

My main focus has been to build a data warehouse in a database in dimensional model format (Kimball star schema). The database is usually SQL Server, but also Oracle and Teradata. This works well because a) the warehouse is queryable by SQL, b) the data is integrated, and c) the star schema is easy to use and performant. The front end varied from company to company, e.g. Reporting Services, Excel, Tableau, Power BI, Qlik, Cognos, Business Objects, and users also query the warehouse directly. Occasionally I built OLAP cubes (mostly Analysis Services but also Hyperion) and users accessed them from Excel. For 15 years the name of the game was "Business Intelligence" (BI); that's what data was used for.

In the last few years I've been doing machine learning, and it mostly requires data files (rather than a database), including image files (for CNN) and sequential/streamed data (for RNN). This makes the data warehouse not fit for purpose, so we use a data lake instead. The ML tools I use are varied, from Jupyter notebooks, TensorFlow in Google Colab and Azure ML Studio, to the Rasa NLP engine and Watson, to the NVIDIA Jetson Nano. Data is no longer used for BI, but for modelling AI. It is the output of the ML models that is used for reporting and BI. The model files (h5), the augmentation (images), the model parameters and the model output (pickle files) are stored in the data lake. The data lake started to be used as a staging area as well, as users require access to the raw data files. So the architecture becomes like this:

It is a completely different world to 15 years ago. We change and adapt. The data lake is a collection of files, yes, but it is also queryable via SQL for those old school folks (if we use Databricks or Hive). And Python friendly too, for those ML folks. The data lake is available in various formats, one of the best being Databricks Delta (available on Azure and AWS), which stores data in optimised Parquet files (a compressed columnar format). It provides transaction logs, consistent views, concurrent read/write operations and good performance (z-order partitioning).
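
For a flavour of Databricks Delta, here is a minimal sketch (in a Databricks notebook, with placeholder paths and column names) of writing a Delta table and optimising it with Z-ordering:

source = "abfss://raw@mydatalake.dfs.core.windows.net/trades/"            # placeholder path
target = "abfss://curated@mydatalake.dfs.core.windows.net/trades_delta/"  # placeholder path

df = spark.read.option("header", "true").csv(source)
df.write.format("delta").mode("overwrite").save(target)

spark.sql(f"CREATE TABLE IF NOT EXISTS trades USING DELTA LOCATION '{target}'")
spark.sql("OPTIMIZE trades ZORDER BY (trade_date)")   # Databricks: compact files and co-locate related data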

So as we can see above, the data mart/warehouse is still there in the architecture, with the data lake used as its staging area. The warehouse also takes in the output of the ML models, e.g. predictions, etc. And the BI tools and reports are still there too, like before, publishing the content of the warehouse to the users. Where is the integration point? The warehouse of course. Or if you use a multi-mart architecture (instead of a single warehouse), then the integration point is the data marts.

Note: what I labelled as "Data Lake" in my diagram above (e.g. Databricks Delta Lake, see the diagram below) some people call a "Lakehouse". A Lakehouse is a data lake and a data warehouse in one thing, see Ref #6 below. My architecture diagram above (and below) is not a Lakehouse; it is the architecture of a modern data warehouse, which uses both a data warehouse and a data lake for the reasons I outlined above.

The Cloud Architecture

The main difference to 15 years ago is the cloud architecture (which is also the main driver). It is hard to argue against the cloud, whether you choose Azure, AWS or GCP. First, we don't pay a million dollars in advance, but pay by the hour (tens of thousands per month). Second, we enjoy robust infrastructure, high availability, hardware redundancy, 24×7 support and tight security. We don't need to think about backup and failover; it is always available. Third, there is no patching. This is like heaven for the infrastructure team, and even more so for the head of IT, the Chief Financial Officer and the Chief Operating Officer.

With that, data warehouses are now in the form of PaaS databases (Platform as a Service). Whether you are a SQL Server shop or an Oracle shop, you will be pushed to be in the cloud. We won't be able to argue for maintaining on-premise databases.

The pressure for a data lake platform to be in the cloud is even stronger than for a data warehouse. 10 years ago, people still talked about building their own Hadoop cluster, on premise. The "zoo" as we used to call it (because it's not just the elephant but also the pig, the bee hive, the oozie, the hbase whale). But now that is unheard of. Everyone just uses a cloud based data lake like Azure and AWS (between them they probably have 90% of the market). It is pay-as-you-go (pay by the hour), it has high availability, hardware redundancy, 24×7 support and tight security. We get a world class platform for a low cost.

Conclusion

So why a data lake? Because a) the ML systems need files rather than a database, e.g. image files, streamed data, b) to store unstructured data, e.g. tweets, text, images, videos, and c) to be used as a staging area for the data warehouse.

The data lake is not the integration point; it is just storage. The integration point is still the data warehouse (for the structured data). So the data warehouse is still being used, and the BI tools are still being used too.

This modern data warehouse architecture (which is a data lake plus a data warehouse) must be in the cloud; the most popular platforms are Azure and AWS.

References

  1. BlueGranite’s diagram of a modern data warehouse: link, link
  2. Scott Muniz’s diagram of a Databricks data lake in Azure: link
  3. Kevin Clugage and Denny Lee’s diagram of a modern data warehouse: link
  4. Alexandre Gattiker’s diagram of Azure Databricks streaming architecture with a data warehouse: link
  5. Clinton Ford’s diagram of a modern data architecture with Delta Lake and Azure Databricks: link
  6. What is a LakeHouse by Databricks team: link

21 February 2021

One or Two Fact Tables

Filed under: Data Warehousing — Vincent Rainardi @ 7:34 pm

One of the data feeds I receive from my data sources is like this:

In this case, should we make it into one fact table like this: (one row per company per year, with M1, M2, M3 repeated) – let’s call this option 1

Or should we make it into two fact tables like this: (the first one is 1 row per company and the second one is 1 row per company per year) – let’s call it option 2

Kimball theory says we should do option 2, because the grains are different. A fact table is determined by its grain. If the data is at a different grain, then it does not belong in this fact table; it belongs to another fact table. In option 1, M1 to M3 do not belong there, because their grain is different. They belong to another fact table, hence option 2 is the correct way.

So that was what I did when I first came into this situation. Stick to the book, I said to myself, and you'll be safe.

But then the same situation came up again, on another data source. I should model it the same way, I thought. But this one is a bit "rugged": notice M4 and M5 below, which are only applicable to some years:

And M1A is closely related to M1 (they are used together), M2A is closely related to M2, M3A is closely related to M3, so the two fact table almost always have to be joined together.

Also notice that there are 15 measures which are annual, and only 3 which are not.

So rather than the users having to join those two fact tables every time (because the first 3 measures are used together), I thought the right thing to do was to put them in one fact table like this:

So I did. And there were two more data sources like that afterwards. I still feel it is the right thing to do, despite it breaking the normalisation rules (who's doing normalisation in data warehousing anyway?) and breaking a Kimball principle on fact tables.

It's not because there are only 3 measures which are repeated (even one is too many), but because otherwise users would have to join the two fact tables every time they use them.

What's wrong with joining two fact tables? Nothing. Joining fact tables has been done since the dawn of data warehousing, with shipments and orders as the classic example in the 1990s.

Oh well, both options have their good sides and bad sides, their positives and negatives, and I think we could present good arguments for both. What do you think?

And that is the title of this article: One or Two Fact Tables.

Vincent Rainardi, 21st Feb 2021 (gosh it's been a looong time since I last wrote about data warehousing. That's because I'm currently doing a master's degree in machine learning, which is a lot of fun! Can't have enough of it! Doing Python and maths every day 🙂 At first I thought I would be converting to ML. But now that I've done quite a lot of it, I'm sure I don't want to let DWBI and Investment Management go. I will bring ML into my DWBI work in the financial sector, and keep BA as my main work, not ML engineer 🙂 It's a very good feeling to know what we want to do in life!)

8 November 2020

Day Measures

Filed under: Data Warehousing — Vincent Rainardi @ 3:04 am

Sometimes the measure we want to store is a number of days. For example:

  • the number of days from when we received an order until we shipped it
  • the number of days from when a booking was made until the engineer visited
  • the number of days from today until a bond matures
  • the number of days since the policy was incepted (began) until today

I call these measures Day Measures.
When a day measure is "from today" or "until today", the measure changes every day.
For example: the maturity of a bond is the number of years from today until the bond matures. If the bond matures on 31st December 2030, and today is 9th Nov 2020, then the maturity is 10.1424 years (10 + 52/365.25).
Tomorrow it will be 10.1396 (10 + 51/365.25). The maturity of a bond is important because the bigger the maturity, the higher the risk that the bond's value is impacted by interest rate changes.
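
A quick sketch of that arithmetic in Python (whole years plus the remaining days over 365.25, ignoring the 29th February edge case):

from datetime import date

def years_to_maturity(maturity_date: date, as_of: date) -> float:
    whole_years = maturity_date.year - as_of.year
    anniversary = as_of.replace(year=maturity_date.year)
    if anniversary > maturity_date:                    # anniversary not yet reached in the maturity year
        whole_years -= 1
        anniversary = as_of.replace(year=maturity_date.year - 1)
    return whole_years + (maturity_date - anniversary).days / 365.25

print(years_to_maturity(date(2030, 12, 31), date(2020, 11, 9)))    # 10 + 52/365.25 ≈ 10.1424
print(years_to_maturity(date(2030, 12, 31), date(2020, 11, 10)))   # 10 + 51/365.25 ≈ 10.1396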

Days to shipment (the number of days from when the order was received until shipment) does not change if the order was from last year. But if the order was placed yesterday and it usually takes about 3 weeks for us to fulfil an order, then over the next 3 weeks the "days to shipment" changes every day.

If we have a daily periodic snapshot fact table, and we have a day measure in this fact table, then we need to calculate this measure every day.

The question is: is that right?
Couldn’t we just store the maturity date of the bond, rather than calculating the Maturity daily?
Couldn’t we just store the shipment date and the order date, rather than calculating “days to shipment” daily?

We can and we should. But "in addition to", not "instead of".
It would be silly to store just the date, because every time we want to use the measure we would need to calculate it. The point of having a data warehouse is not to save storage, but to make it easy to use. Easy to query.
When we store the Maturity or Days to Shipment, we don't need to calculate anything at query time. It is there, ready for us to use.

So yes, it is worth calculating day measures every day and storing them in the daily periodic snapshot fact table.

11 September 2020

Copy Paste Anchor Links (Page Jumps) from Microsoft Word to WordPress

Filed under: Data Warehousing — Vincent Rainardi @ 6:57 am

Problem: when copy-pasting a Word document to WordPress, hyperlinks are preserved but bookmarks are gone.

Solution:

  1. In Microsoft Word create them as hyperlinks
  2. Paste it to WordPress
  3. Change the hyperlinks to anchor links using Notepad
  4. Paste back to WordPress

Step by step:

Step 1. In Microsoft Word, create them as hyperlinks

  • In a blank Microsoft Word document, create this:

Both of the "Section 1" texts are normal hyperlinks, with text = "Section1" (no space) and address = "#section1" (no space, with a # prefix), like this:

  • Highlight it, Control-K (Insert Hyperlink), type Address = #Section1 (notice the hash at the front).
  • Create Section 2 and Section 3 the same way.

Step 2. Copy to WordPress

  • Control-A (select all)
  • Copy paste to WordPress visual editor.

Step 3. Replace the hyperlinks with anchor links in Notepad

  • Control-Shift-Alt-M (switch to Code Editor)
  • Control-A (select all)
  • Paste to Notepad (Notepad++ is better)
  • Control-H (replace), replace href with name, and remove the # as follows:
    find what: <a href="#, replace with: <a name="

Step 4. Paste back to WordPress

  • Control-A and paste back to WordPress code editor.
  • Click Update on the top right corner
  • View the page and test the hyperlink

6 September 2020

Dynamic Difference in Power BI

Filed under: Data Warehousing — Vincent Rainardi @ 3:30 pm

If you are looking for how to do a dynamic difference (dynamic delta) in Power BI, you've come to the right place. This article shows how to calculate the difference (say between actual and budget) dynamically, or on-the-fly, as you switch from product to product. This is done using the SELECTEDVALUE function.

Business Scenario

Here is the business scenario: you work in product development in a manufacturing company. You have many products in your product development pipeline, and each has different costs for each component, like this:

Report Layout

The Power BI report should look like this:

You need to be able to select a product (Product A in the above example) and have the actual spend vs budget displayed in a table, along with the difference.

So you are comparing the cost breakdown of a product to the budget. Another business scenario which is similar to this is comparing the performance attribution of a mutual fund to the benchmark. In this case instead of components, we have industry sectors.

Dynamic Difference

The point here is that the differences are calculated on-the-fly, depending on the user selection. That is the key Power BI functionality that I’d like to show you.

There is a function in Power BI called SELECTEDVALUE, which we need to use for this. We use it to find out which product is being selected.

How To Do It

So let’s create the Power BI report.

Step 1. First, in Excel, type this table, and load it into Power BI as the Cost table:

Step 2. Create a calculated table to select distinct values from the Product column:

   Product = DISTINCT('Cost'[Product])

Step 3. Use it for a slicer so we can select Product A, B and C (filter the Budget out):

Step 4. Create 3 measures: Cost_Budget, Cost_Actual and Cost_Difference, as follows:

1. Cost for the budget:
Cost_Budget =
CALCULATE (
    SUM('Cost'[Cost]),
    FILTER('Cost', 'Cost'[Product] = "Budget")
)

2. Cost for the selected product:
Cost_Actual =
CALCULATE (
    SUM('Cost'[Cost]),
    FILTER('Cost', 'Cost'[Product] = SELECTEDVALUE(Product[Product]))
)

3. The difference between the budget and actual:
Cost_Difference = [Cost_Budget] - [Cost_Actual]

Step 5. On the report, create a table with Component, Budget, Product and Difference in the Values.

Now when we select Product A it will display the cost of Product A and calculate the difference on-the-fly.
And when we select Product B it will display the cost of Product B and calculate the difference on-the-fly.

26 September 2019

Development Operations (DevOps)

Filed under: Data Warehousing — Vincent Rainardi @ 5:45 pm

Whether you are building a data warehouse, a client reporting system, an MDM system or a trade booking system, you need to run a development team. Development Operations (DevOps) covers the things you need to run a software development team. What do you need? They are:
1. Release pipeline
2. Change control
3. Sprint planning
4. Automated testing
5. Code repository
6. Server build

In writing this article I will try to combine my experiences from about 10 companies over the last 12 years. Some of them followed good practices, some bad practices. I will only write about the good ones here.

1. Release Pipeline

The first thing that you need as a development team is a release pipeline. A release pipeline is a process to automatically build, deploy and test the code to a series of environments, i.e. development, test, UAT, production and support. A release is a collection of code such as stored procedures, SSIS packages, .Net code, views, tables, etc. bundled into 1 package to be deployed together as 1 unit.

Can't you deploy it manually by copying the code? Of course you can. But it will take more effort, and it is prone to error. What if you want to roll back to the previous version? Well, you have to do it manually, right, by copying the previous version of each component. Who is keeping track of the version of each component? What if you missed 1 component? You could be deploying 50 components in test but only 49 in production. It's a headache, right? With a release pipeline, you can deploy all components into many environments consistently and effortlessly. You deploy into the Test environment, and test it there. You deploy into the UAT environment, and test it there. And you deploy the same thing into Production. And then you deploy the same thing into the Support environment. The environments are guaranteed to be consistent.

If you have 100 stored procedures, 100 tables, 100 views, 100 SSIS packages, a few SSAS cubes, 50 SSRS reports and hundreds of .Net components, then you will appreciate the benefit of having an automated release pipeline. It is a huge time saver. And it improves your quality.

Deployment into an environment requires approvals. A deployment into the test environment, for example, may require the test manager's approval, whereas deploying a release into production requires approval from the production support manager.

2. Change Control

The second thing that you need to run a development team is change control. Change control is a process of approving every single change going into the production environment. It can be a code change (like a release). It can be a data change (like an UPDATE or DELETE statement). It can be a configuration change (like moving the data folder from the C drive to the E drive). It can be an infrastructure change (like setting up a new SQL Server).

Change control is very important because:
– It shows that the change has been tested properly in Test environment, before it is deployed to production.
– It shows that the development manager and the production support manager are aware about the change.
– When the company is audited you have evidence that every single change in production is approved and tested.
– You can choose which changes will be deployed together in production and which ones will not, to minimise the risk.
– It shows the sign off from the business, agreeing to the change.
– When a production deployment fails, it is recorded properly, including the reason and the solution.
– It shows that every change in production is controlled, so the production systems are stable and reliable.

A change request is an electronic form which specifies:
– The reason for the change (usually a business reason but can be a technical reason)
– The list of components and data to be changed
– The evidence that the change has been deployed to Test/UAT environment the same way it will be deployed into production
– Link to the design specification or requirement which forms the basis of the change
– The evidence that the change satisfies the design specification or requirement (or if it is a bug fix, the evidence that the change fixes the bug)
– The impact of this change to the business users
– The approval from the business, from the development manager, and from the production support
– The updated documentation for the production support
– How the change will be deployed to production and to the DR system
– How the change will be tested that it works in production
– How the change will be rolled back if it doesn’t work in production
– The date the change is intended to be deployed to production
– The developer who did the change
– Who from the development team will be on support on the day the change is deployed to production
– Which teams are required to deploy the change
– The applications which are impacted, including downstream applications
– The severity of impact of the change, i.e. high, medium, low (the risk)
– The scale of the change, i.e. major, medium or minor (how many components)
– Whether it is an emergency change or not

The change request form is not just for application changes, but also for infrastructure changes such as servers and networks. It is for any changes impacting any production system, including codes, files, databases, servers, configurations, firewall, connections, permissions, scheduling.

Major change requests are discussed in a weekly meeting between the developers and the production support team to understand:
– What changes are going in this week
– Whether there are any conflicts between the changes which are going in this week
– Where those changes are in terms of their positions in the daily batch
– If things do go wrong, how long it will take to recover (or to fix)
– Who is required to standby to fix if anything goes wrong (DBA, infrastructure, developer)
– When those changes should be done from production point of view (to minimise the impact).
– To scrutinise the test evidence, not whether the change satisfies the requirements, but whether the change has been deployed in the Test system the same way as it would be deployed in production. In particular, whether the condition of the Test environment is the same as production. Also, whether the impacted application has been tested in Test environment.
– To check that the rollback script really covers the whole change, not just part of it. And whether it has been tested.
– Whether you need to communicate the change to the users or not, and whether the help desk should be prepared to field calls from the users on the deployment day.

The changes are then either approved or rejected. If a change is approved, it is given a date and time when it will be deployed. Naturally, the business and the development team want the change to be deployed as soon as possible, but it is the responsibility of the production support team to decide the timing, to minimise the risk. Changes with major impact (or high risk) are usually deployed on Saturday, when the business users are not working so they are not impacted, and you have Sunday to recover or fix any issues. Changes with medium impact (and medium risk) can be deployed on Friday, so you have Saturday for fixing. Changes with low impact and low risk can be deployed during the week (Wed or Thu, but not Mon or Tue).

If there are major changes going in this weekend, it is wise to defer deploying other major changes to next weekend. Similarly, if the required support resources are not available this weekend (in case it goes wrong), then the change should be postponed to next week.

Some changes require coordination with an external vendor, so the timing of the production deployment must be mutually agreed. Some changes are time-bound, i.e. they have to be done before a certain date, otherwise the company will get a penalty from the regulator.

Some changes need to be done immediately, for example because of a security risk, or the risk of losing a great deal of money. This is classified as an emergency change, and it is usually a bug which needs to be fixed immediately. The procedure for an emergency change request is different to a normal change request. It still requires approval and it still needs to be tested, but it is a lot more streamlined. In most cases the principle for applying emergency changes is "fix after", meaning that if things go wrong you do not roll the change back, but fix it. This requires more people to stand by compared to a normal change. This "fix after" principle allows the testing to be lightweight.

3. Sprint Planning

The third thing that you need to run a development team is sprint planning. A sprint is a unit of two weeks. In theory it is possible to define a sprint as 1 or 3 weeks, but I've worked at 4 companies which do sprints, and every single one does 2 weeks.

Yes, sprints are a way of running Agile. No one does Waterfall any more these days. Funny how things have moved: 5 years ago people were still debating Agile vs Waterfall. But not now. No more debate on that. The debate is now more around CI/CD or automated testing.

Azure DevOps is probably the best tool in the market, followed by Jenkins and Travis. Many are not complete DevOps tools: GitHub, Gradle, JIRA, Bamboo, AWS CodePipeline and Trello, for example. They just do build and source code control but without CI/CD, or vice versa. Or they do both but not sprint planning (Trello does sprint planning but has no release pipeline, whereas AWS CodePipeline is the opposite). Even Azure DevOps is not complete (it doesn't do change control), but it does sprint planning, source code control, release pipelines, CI/CD and automated testing.

Every two weeks you plan the work that you need to do in the next sprint. These pieces of work are called user stories. First you define the team capacity, i.e. the number of working days minus the holidays. You then allocate how many days are required for each story, who will be doing it, and the acceptance criteria (the definition of done), i.e. just doing the development and test, or all the way to production. Then you check if anyone is under or over allocated. If someone is over capacity, redistribute or reduce the work.

After planning, during the 2-week sprint run you track the progress of each story. For each story you create a few tasks, with an estimate and the person doing it. As each task gets done, you mark the actual hours against the estimate. The status of each task changes from new to in-progress to completed. This creates a burndown chart, which tracks the whole team's actual hours against the estimate.

It's all very well and good doing planning and tracking as above, but how do you determine what needs to be done in the first place? This is where the product owner comes in. The list of work in each sprint should be driven largely by the prioritised list of projects plus product increments. If the team is a dedicated project team, then the list of work comes from just 1 project plus the BAU of that project. Some teams serve multiple projects (3 months to 1 year in size) plus hundreds of enhancement requests (2 to 10 days in size) throughout the year.

4. Automated Testing

The fourth thing that you need to run a development team is automated testing. Many people leave this to the back of the queue, but the reality is that it saves a lot of time (and money) and increases the quality of the code enormously. Automated testing means two things:
a) Ensuring that deployment to production does not break anything
b) Ensuring that the code satisfies all the intended functionalities, and none of the unintended ones

Point a) is achieved by copying the production system into the test environment. The lot, i.e. thousands of components, code, settings, databases and files from production, is restored into the test environment overnight. Then you apply the latest code and data changes from the release pipeline into this test environment. Then you run the whole system and check the expected results. The key thing is: all of this is conducted programmatically, every day, without any human intervention.

Point b) is achieved by having a test environment which is completely empty, i.e. no data. You have the databases, but there are no rows in any tables. You have the data folders, but there are no data files in them. Then you test 1 piece of code, e.g. a .Net web service or a stored procedure. You do this by setting up test data, e.g. you insert a few rows into the input tables. Each row reflects one specific test scenario. Then you run the code and check the result. Then you repeat this for all pieces of code in the system, thousands of them. For 1 piece of code you need to run some positive tests and some negative tests. The key thing is: all of this is conducted programmatically, every day, without any human intervention. A small illustration of the pattern is sketched below.
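
Here is a sketch of that pattern using pytest, with an in-memory SQLite database standing in for the empty test environment (the table and the piece of code under test are made up for the example):

import sqlite3
import pytest

def apply_discount(conn):
    # the piece of code under test: flag orders over 100 as discounted
    conn.execute("UPDATE orders SET discounted = 1 WHERE amount > 100")

@pytest.fixture
def conn():
    c = sqlite3.connect(":memory:")   # completely empty environment, no data
    c.execute("CREATE TABLE orders (id INT, amount REAL, discounted INT DEFAULT 0)")
    return c

def test_order_over_100_is_discounted(conn):              # positive test
    conn.execute("INSERT INTO orders VALUES (1, 150, 0)")  # set up test data for this scenario
    apply_discount(conn)
    assert conn.execute("SELECT discounted FROM orders WHERE id = 1").fetchone()[0] == 1

def test_order_under_100_is_not_discounted(conn):          # negative test
    conn.execute("INSERT INTO orders VALUES (2, 50, 0)")
    apply_discount(conn)
    assert conn.execute("SELECT discounted FROM orders WHERE id = 2").fetchone()[0] == 0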

If you had to do all that manually, it would require thousands of hours. Actually, it would be impossible to do, because for every release you would have to repeat both tests. If you don't have automated testing, you can't deploy to production 3 times a week. The best you can do is once a quarter, because of the amount of testing involved. It does not make sense to do 10 hours of development and then spend 2,000 hours testing, right? But once your hours of testing are down to zero, then you can do 10 hours of development, test it and deploy to production. And then you can do another 10 hours of development, and so on.

Because for every single change you run "the whole lot of testing" twice (a and b above), you can release very often, and you increase the quality of the code. You do positive testing and negative testing on every component, and also integration testing. Point a) above is called "system testing", and point b) above is called "unit testing". Both are automated.

Does this mean you do not need to do manual testing? No, you still need to. When a major feature is developed, you need to manually test the functionality across the whole piece. Suppose you are adding a new feature to an order processing system which enables the system to combine orders from different customers and process them together. There are 9 places in the system which need code changes. Each of these 9 pieces has been unit tested individually using b) above, which proves that each piece works individually (using a minimal amount of data, i.e. a few rows). The whole system has been system tested using a) above, which proves that there is no negative impact on the system as a whole.

But how do you know that the ability to combine orders really works end-to-end, using normal production data volumes? The automated unit tests only use a few rows, and they are made-up data. They do not prove that the new feature works end to end. To prove that the new feature is working we need to run thousands of real orders from last month through the system, combining hundreds of them, process them and check the output. Are all orders processed correctly? Are there any issues? If it passes this test we can say, hand on heart, that the system now has the capability of combining orders.

5. Code Repository

This is a must in any development team. You can use GitHub, TFS, SVN, PVCS, BitBucket or others, but you need to have the capability of storing code, versioning, master and feature branches, committing changes, packaging and releasing code, merging, collaborating, restoring to a previous version, doing code reviews and pull requests.

Whether it is Java code, Python, databases, ETL tools, cubes or reports, they are code. And they ALL need to be in ONE code repository. And the code repository MUST be integrated with the release pipeline. You need to be able to push a release candidate to the release pipeline, test it and deploy it automatically into the various environments, including test, support and production.

6. Server Build

Like automated testing, this one is frequently put at the back of the queue, but it really is a time saver and money saver, so we need to do it. Server build is the ability to create a server programmatically. A release pipeline consists of a series of environments such as Dev, Test, UAT, Prod and Support. Each environment consists of several servers, such as a SQL Server, an application server, an SSIS server, an SSAS server, a Python server, etc.

These servers need to be built. If you build them manually, it takes a lot of hours. Not just once: they will also need to be configured, patched and maintained. And when they don't work, you need to troubleshoot and fix them.

A better way to create servers in Azure is to use PowerShell DSC. For each environment you create a DSC configuration which tells PowerShell DSC the server name, resource group, folder paths, Windows features and any other characteristics which are specific to each environment and each server. PowerShell DSC then reads this configuration and creates each server in each environment according to the settings written in this file.

The best practice is to recreate these servers on a weekly basis (except the production environment), to avoid having to upgrade and patch them.

7. BAU/support/bug fixes
8. Business requirement & design
9. Architecture

The traditional understanding of "DevOps" as a system normally includes points 1 to 6, but not 7 to 9. But 7 to 9 are certainly required to run a development team for any system (I do mean software, not civil engineering).
