Data Platform and Data Science

30 March 2022

Infrastructure in Data Warehousing

Filed under: Data Warehousing — Vincent Rainardi @ 7:43 am

We often forget that underpinning the data mart/lake/warehouse are the storage, the network and the security. Without them users can’t access the data. We can’t even store the data! We are so dependent on them, and yet we spend so little time on them. Today let’s think about them. About the infrastructure.

When we talk about infrastructure in Data Warehousing, there are three things we need to think about:
1. Storage
2. Network
3. Security

1. Storage

Storage means the place where we store the data, e.g. a file storage, a blob storage, a data lake, a database, etc. It all depends on your architecture. So in Azure or AWS we need to create a storage account, a data lake store, or a database.

This is what we call them in Azure:

  • Blob storage
  • File share
  • Data lake store
  • SQL database

1.1. Storage – SQL Database

If it’s a SQL database, then we need to think about sizing, i.e. how much computing power, I/O and memory is allocated to it. In Azure this is measured in DTUs (database transaction units); a DTU is a blended unit of measure for CPU, memory, reads and writes together. The default tier is S0, which has 10 DTU and 250 GB storage. An S0 has very limited power and is very slow. For lightweight applications an S3 is enough; for “real” applications you’ll need an S4 or S6 (link).

  • S0 is 10 DTU with 250 GB storage, $18/month
  • S3 is 100 DTU with 250 GB storage, $184/month
  • S4 is 200 DTU with 250 GB storage, $368/month
  • S6 is 400 DTU with 250 GB storage, $736/month
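The S-tier pricing above is roughly linear in DTU, which makes back-of-the-envelope sizing easy. A quick sanity check (the prices are the indicative figures from the list above; actual prices vary by region):

```python
# Standard-tier sizes from the list above (indicative prices; they vary by region).
tiers = {
    "S0": {"dtu": 10,  "usd_per_month": 18},
    "S3": {"dtu": 100, "usd_per_month": 184},
    "S4": {"dtu": 200, "usd_per_month": 368},
    "S6": {"dtu": 400, "usd_per_month": 736},
}

# Cost per DTU: S0 works out slightly cheaper per DTU ($1.80),
# while S3/S4/S6 all land at $1.84 per DTU per month.
for name, t in tiers.items():
    per_dtu = t["usd_per_month"] / t["dtu"]
    print(f"{name}: {t['dtu']} DTU -> ${per_dtu:.2f}/DTU/month")
```

So moving from S0 to S6 buys 40x the compute for roughly 40x the price; the decision is about throughput, not cost efficiency.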

It’s probably easier to visualise a DTU using a “bounding box” of CPU, memory and I/O, like this (source: link):

In my experience, a process consisting of a combination of stored procedures and data loading that ran for 2-3 hours on an S0 ran in 1-2 minutes after I changed it to an S6.

1.2. Storage – Files

For storing files we first create a “storage account” (link). We can choose either a GPv2 (general-purpose v2) storage account, which is hard-drive based, or a FileStorage storage account, which is SSD based (a lot faster). We also need to choose between local redundancy (LRS: 3 copies in one physical location), zone redundancy (ZRS: 3 copies across 3 zones in 1 region) or geo redundancy (GRS: 6 copies across 2 regions), see link.
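The redundancy options above boil down to how many copies Azure keeps and where. A small sketch summarising the trade-off (the copy counts are from the description above):

```python
# Azure storage redundancy options as described above: copies kept and where.
redundancy = {
    "LRS": {"copies": 3, "scope": "one physical location (single datacenter)"},
    "ZRS": {"copies": 3, "scope": "three availability zones in one region"},
    "GRS": {"copies": 6, "scope": "two regions (3 local + 3 geo-replicated)"},
}

def total_copies(option: str) -> int:
    """Return how many copies of the data the chosen redundancy keeps."""
    return redundancy[option]["copies"]

print(total_copies("GRS"))  # 6 copies: survives a whole-region outage
```

The more copies and the wider the scope, the higher the durability (and the monthly cost).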

Then we create a container, which is like a folder (link). Then we upload the files into this container. To enable users to access the files from Windows Explorer (via a path like \\location\folder\) we need to create a file share (link).

You can also create an Oracle database in Azure (link), but you need to create it on a VM (virtual machine). If you use AWS, the storage is called “S3” (Simple Storage Service). What we call a “container” in Azure is called a “bucket” in S3. Basically it means a group of files (link).

1.3. Storage – Data Lake

In Azure, the storage for a data lake is called ADLS Gen2 (Azure Data Lake Storage Generation 2, link). It can be used with Hadoop (HDInsight), Synapse or Databricks. It’s very scalable (exabytes) and cost effective ($0.019 per GB per month for standard storage, i.e. the “hot” tier).

There are 4 access tiers in ADLS: hot, cool, archive and premium. Hot is the standard tier. Cool and archive are cheaper per GB but have minimum retention periods: data deleted or moved out of the cool tier within 30 days, or out of the archive tier within 180 days, incurs an early-deletion charge. Premium has low latency (meaning fast).
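The minimum retention periods above can be checked with a simple rule: how long has the blob been in the tier versus that tier’s minimum. A sketch (the day counts are from the description above; the function name is mine):

```python
from datetime import date

# Minimum retention (days) before an early-deletion charge applies,
# per the access tiers described above (cool: 30 days, archive: 180 days).
MIN_RETENTION_DAYS = {"hot": 0, "cool": 30, "archive": 180}

def early_deletion_charge(tier: str, stored_on: date, deleted_on: date) -> bool:
    """True if deleting the blob on deleted_on would incur an early-deletion fee."""
    days_kept = (deleted_on - stored_on).days
    return days_kept < MIN_RETENTION_DAYS[tier]

# Archived on 1 Jan, deleted on 1 Mar: only 59 days kept, well under 180.
print(early_deletion_charge("archive", date(2022, 1, 1), date(2022, 3, 1)))  # True
```

So archive only pays off for data you genuinely won’t touch for months.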

2. Network

Network means access to that storage, i.e. how users connect to the storage or database, and how reports access the database or data lake.

2.1 Network for a storage account

After we have a storage account, we configure the “networking” of that storage account. By that I mean we click on the Networking pane and configure which IP ranges are allowed to access this storage account. This is done by configuring the “firewall rules” to allow only certain IP ranges in or out.
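The firewall rules above are essentially an allow-list of CIDR ranges checked against the client’s IP. The matching logic can be sketched with Python’s standard `ipaddress` module (the ranges below are hypothetical examples from the reserved documentation blocks):

```python
import ipaddress

# A hypothetical allow-list, mimicking a storage account's firewall rules.
# 203.0.113.0/24 and 198.51.100.0/28 are reserved documentation ranges.
ALLOWED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/28"),
]

def is_allowed(client_ip: str) -> bool:
    """Check whether a client IP falls inside any allowed range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in net for net in ALLOWED_RANGES)

print(is_allowed("203.0.113.42"))  # True: inside the /24 range
print(is_allowed("192.0.2.1"))     # False: not in any allowed range
```

Azure evaluates its real rules on the service side, of course; this just shows what “allow only certain IP ranges” means.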

Storage accounts can also have private endpoints, which are not accessible through the internet. A private endpoint assigns a private IP address from a VNet to the storage account. A VNet (virtual network) is a private network in Azure: a group of computers and devices that can communicate with each other without going over the public internet. We can put a VM (virtual machine), a storage account and a SQL database into one VNet so they can access each other.

The concept of private endpoint can also be applied to an application, like this (source: link):

We can see above that the Europe West and Europe North VNets are both using private IP addresses (in private ranges such as 10.x.x.x or 192.168.x.x), communicating via private endpoints (the green arrows) to a web app front end (link).

2.2. Network for a SQL database

In Azure, a SQL database also has a Network pane on which we can configure which IP ranges are allowed to access this database. This is done by configuring the virtual network rules (VNet rules) or IP rules to allow certain IP ranges to go in, or to go out.

VNet rules are applied to the whole SQL server, not just to one database, whereas IP rules can be applied at either the server level or the database level.

Networking wise, in front of a SQL database there is a “database gateway” (the green boxes below, source: link). A database gateway is an Azure service which has a public IP address, and listens to port 1433 (the SQL port).

The gateway then redirects the traffic to the right cluster (cluster 54 in the diagram), and inside the cluster the traffic is forwarded to the right database; after that first contact the client connects to the database node directly. This is called “redirect mode” (the default for connections from within Azure). The alternative is called “proxy mode”, in which all traffic keeps flowing through the gateway for the lifetime of the connection.
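Whichever mode is used, the client always starts by contacting the gateway on port 1433, which is why connection strings pin that port. A sketch of building one (the server and database names are placeholders, not real resources):

```python
# Building an Azure SQL connection string (ODBC style).
# Server and database names below are hypothetical placeholders.
server = "myserver.database.windows.net"   # hypothetical logical server name
database = "mydb"                          # hypothetical database name

# Port 1433 is the gateway's listening port mentioned above. In redirect
# mode the gateway then hands the client the node's address and the client
# reconnects directly; in proxy mode traffic keeps going via the gateway.
conn_str = (
    f"Driver={{ODBC Driver 18 for SQL Server}};"
    f"Server=tcp:{server},1433;"
    f"Database={database};"
    f"Encrypt=yes;"
)
print(conn_str)
```

Note that redirect mode means the client firewall must also allow the node port range, not just 1433.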

Of course you don’t have to have a cluster. In most cases the SQL server is “stand-alone” (meaning not in a cluster). A database cluster is a group of SQL servers configured for high availability. Each SQL server is called a “node”. One node (the primary node) is replicated to the other nodes (the secondary nodes), like this: (source: link)

3. Security

Security is about allowing users to access the data, and protecting the data.

3.1. Data Access

Data Access is about 2 things: authentication and authorisation.

  • Authentication means that we look up the incoming user in our Active Directory and verify the password.
  • Authorisation means that we look up what the user is allowed to access.

Bear in mind that a user can be a service account, not just a human user. And bear in mind that instead of a user ID and password, a service account can have a “key” and a “secret”, which is normally stored in a “key vault”.

If you have a database, and a user wants to read a table in that database, the system first finds out who this user is. This is done by looking the user up in the Active Directory (the list of all the computers, servers and users in our organisation) and checking whether the supplied password is correct (or the “secret” for the “key”).
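The two steps above — authenticate first, then authorise — can be illustrated with a toy directory. This is purely conceptual (the account name, secret and table names are made up; a real directory would never be a Python dict, and passwords would be salted):

```python
import hashlib

# Toy "directory" and permission store illustrating the two steps:
# authentication (who are you?) then authorisation (what may you read?).
# Account, secret and table names are all hypothetical.
DIRECTORY = {
    "svc_etl": hashlib.sha256(b"s3cret").hexdigest(),  # service account, hashed secret
}
PERMISSIONS = {
    "svc_etl": {"dbo.FactTrades", "dbo.DimDate"},      # tables the account may read
}

def can_read(user: str, secret: str, table: str) -> bool:
    """Grant access only if the user both authenticates and is authorised."""
    authenticated = DIRECTORY.get(user) == hashlib.sha256(secret.encode()).hexdigest()
    authorised = table in PERMISSIONS.get(user, set())
    return authenticated and authorised

print(can_read("svc_etl", "s3cret", "dbo.FactTrades"))  # True: both checks pass
print(can_read("svc_etl", "wrong", "dbo.FactTrades"))   # False: authentication fails
```

The key point is the order: a correct secret with no table permission is refused just as firmly as a wrong secret.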

We can also create a “guest user” in our Active Directory (link). A guest user “signs in” at their own company, then comes into our network as a temporary user. So we verify their user ID (or email address) but not their password, which is verified by their home company. But remember, only users from that company are allowed in (“company” here being defined as a range of IP addresses).

3.2. Data Protection

Data protection is about 2 things: protecting the data storage, and protecting the data transport.

Data storage means the place where our data is stored, i.e. the storage accounts, the data lake, or the databases. We do this by allowing certain users to be able to access certain tables (or views, or stored procedures). We do that using Active Directory.

Data transport means that when we transfer data through the internet, no one can know its content. And no one can eavesdrop either. We do this by encrypting the data, and securing the “channel” we use to transport the data. “Securing” means that public can’t access this channel. Only authorised users are allowed to access it.

Data Architect

What really strikes me is that as a data architect you need to know all those things: the networking, the security and the storage. Earlier in my career, when I had just started working, I was a network guy. Back in uni I installed a Novell Ethernet network in my lab and configured it. And in my first job I was connecting IBM token ring terminals to an AS/400 network. That was my job when I started. Well, not for long, because straight after that I became a COBOL developer on the AS/400, as well as a SQL developer and a C# developer (for security access). When you have just started working you do anything your boss asks you to, right? 🙂

So later in my career, when I became a data architect, I didn’t have much trouble understanding the terminology. True, it had been 20 years, so I needed to read up a bit. But the basic understanding was already there. Networking and security don’t change much. They have a “new face”, new names, but the principles are still the same.

But yes, my point is that as a data architect you need to do those things. You need to design the access to your data, how to secure the storage and the transport, and how you are going to store the data. Basically the three things above: the storage, the network and the security. It’s part of the job of a data architect. We are not data modellers, or data analysts. We are data architects, meaning that we design the data models, and we architect the data storage, data loading and data access, and everything around them. Not just the data, but also the processes and the architecture.
