There are many elements of data architecture:
- Data models
- Data storage
- Data infrastructure
- Data ingestion
- Data calculation
- Data analytics
- Data access
- Data security
- Data sharing
- Data availability
- Data ownership
- Data standards
- Data dictionary
- Data quality
Note:
- #3 includes cloud infrastructure, data platform, scalability
- #10-14 plus #7 are included in data governance
In this article I’ll go through all those elements one by one, in order to paint a high-level picture of data architecture, before finally answering “What is data architecture?” That way our understanding of data architecture stays down to earth, rather than highly conceptual.
1. Data Models
Before we do anything else, we model all the business concepts and processes. The output of this exercise is called a data model. If the business has something called accounts, then the data model also has something called accounts. If the business has clients, the data model also has clients.
A data model defines what the data containers look like. To use an analogy, data is like water. It is stored in different kinds of water containers, such as bottles, cups, buckets and tanks. A data model defines what these bottles, cups, buckets and tanks look like. It defines their shapes and sizes.
In the real world, data is stored in database tables, in files, or in documents. A data model defines the structure of these tables, files or documents. A data model also defines the relationships between those tables/files/documents.
So to summarise, a data model implements the business concepts and defines the structure of the tables/files/documents where the data is stored. As data architects, our role is to define those data models.
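To make this concrete, here is a minimal sketch of two entities and their relationship, using SQLAlchemy’s declarative syntax. The Client and Account entities and their columns are hypothetical examples, not a prescribed model:

```python
# A minimal, hypothetical data model: two business entities and their relationship.
# Table and column names are illustrative only.
from sqlalchemy import Column, Integer, String, Date, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Client(Base):
    __tablename__ = "client"
    client_id = Column(Integer, primary_key=True)       # surrogate key
    client_name = Column(String(100), nullable=False)   # business name
    country_code = Column(String(2))                     # ISO country code

class Account(Base):
    __tablename__ = "account"
    account_id = Column(Integer, primary_key=True)
    client_id = Column(Integer, ForeignKey("client.client_id"), nullable=False)
    opened_date = Column(Date)
    client = relationship("Client")  # each account belongs to one client
```

The point is not the tool; the same shapes, sizes and relationships could equally be expressed as SQL DDL or a JSON schema.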
2. Data Storage
Data storage is the physical place where the data is located. To use the water analogy, data storage is the actual buckets, cups and bottles. These bottles and cups can be located on shelves, in rooms, and so on. The data model defines the shapes and sizes of the bottles and cups; the data storage is the actual bottles and cups.
In the real world, data storage means databases or data lakes. In databases, data is stored in tables. In data lakes, data is stored in files. In a document database, data is stored in documents. These databases and data lakes are physically located on storage devices such as a Storage Area Network (SAN) or Solid State Drives (SSDs).
Data storage has three key properties: capacity, scalability and availability. Capacity means how much data can be stored. Scalability means how easy it is to stretch that capacity. If the current capacity is 10 terabytes, how easy is it to increase it to 100 terabytes? Will that impact the performance? In other words, will it be slower?
Availability means how much downtime could occur if something goes wrong. If the primary data storage goes down, will the system automatically switch to a secondary storage? How out of date is the secondary storage? How long does it take to bring it up to date? If the primary network path from the user to the storage is not available, is there a secondary path? The downtime could be seconds or days, depending on whether we have a secondary storage and whether it is always kept up to date.
To summarise, data storage is the physical containers where the data is stored, such as databases or data lakes, located on SAN or SSD. As data architects, we design the data storage and set the capacity, scalability and availability according to the requirements and the infrastructure costs.
3. Data Infrastructure
The data storage is physically located in IT infrastructure such as a data centre. These days the infrastructure is largely cloud based, meaning that the data centre is owned by Microsoft, Amazon or Google, and we rent it from them, paying by the hour. In the old days we needed to pay millions of dollars upfront to buy all the components of a data centre, such as servers, storage and routers. Today we don’t pay a dime upfront; we only pay for what we used last month.
The cloud data centre also contains network and security resources such as VNets, IP addresses, APIs, Active Directory, roles, Front Door and key vaults (I’m using Azure terminology here but the concepts apply to AWS and GCP too), as well as data resources such as storage accounts, files, folders, ADLS, SQL databases, Databricks, Python notebooks and ADF pipelines.
All of that is what I call the “data infrastructure”. To summarise, data infrastructure is all the data resources, network resources and security resources related to data storage and processing. As a data architect, our job is to work with the network team to design and build the data infrastructure.
4. Data Ingestion
Data Ingestion is the process of loading data into the data storage. This process uses data pipelines, which are controlled by an enterprise scheduler such as AutoSys, ActiveBatch or Control-M. It includes the audit logs, error handling, data quality monitoring, data ingestion monitoring and failure alerts.
As data architects, our job is to design the data ingestion processes according to the business requirements and the data models. We then hand over the design to the development team to be implemented. Once development and testing are complete, the development team deploys the data ingestion pipelines to the production environment using release pipelines.
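As a rough illustration (not a prescription), a single ingestion step could look something like the sketch below: read a source file, load it into storage, and write an audit log entry so the scheduler and monitoring can pick up failures. The file path, table name and connection string are all hypothetical:

```python
# A hypothetical, simplified ingestion step: read a source file, load it into
# storage, and record an audit log entry. Names and paths are illustrative only.
import logging
from datetime import datetime, timezone

import pandas as pd
from sqlalchemy import create_engine

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def ingest_file(file_path: str, target_table: str, connection_string: str) -> None:
    started = datetime.now(timezone.utc)
    engine = create_engine(connection_string)
    try:
        df = pd.read_csv(file_path)                                           # extract
        df.to_sql(target_table, engine, if_exists="append", index=False)      # load
        log.info("Loaded %d rows from %s into %s (started %s)",
                 len(df), file_path, target_table, started.isoformat())       # audit log
    except Exception:
        log.exception("Ingestion of %s failed", file_path)
        raise  # let the scheduler mark the job as failed and raise an alert
```

In a real pipeline the data quality checks (see #14) would run between the extract and the load.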
Note on Release Pipelines
A release pipeline is a mechanism to deploy code from the code repository (repo) to the dev, test and prod environments. We don’t copy code from Dev to Test and from Test to Prod. Instead, we deploy code from the repo to Dev and test it there. If required, we fix the code in the repo. Then we deploy the code from the repo to Test and test it there. If required, we fix the code in the repo again. Then we deploy the code from the repo to Prod.
The concept here is that we package the whole change as a “Release”, so that no changes get left behind. In the old days before DevOps, we didn’t package the changes into a release, so sometimes “it works in Dev but not in Prod”, because some changes got left behind. Using release pipelines we test the deployment as well as the functionality.
5. Data Calculation
The other development that we need to do is Data Calculation. Data Calculation is all the processing and calculations that need to be performed on the data to satisfy the business requirements.
Our role as data architects is to guide the BA with regard to technical requirements and architectural design. For instance, perhaps the output of the calculations needs to be persisted, and if so, what is the best way to persist it? Perhaps the calculation is common across two applications, and if so, should it be built as a single piece of code that receives parameters?
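For example, a calculation that is common across two applications could be built as one parameterised function that both call, instead of two copies of the same logic. Here is a minimal sketch with hypothetical names and values:

```python
# A hypothetical shared calculation: one function, called by both applications
# with different parameters, instead of two copies of the same logic.
def market_value(nominal: float, unit_price: float, fx_rate: float = 1.0) -> float:
    """Return the market value in the reporting currency."""
    return nominal * unit_price * fx_rate

# Application A (base currency):    market_value(10_000, 101.5)
# Application B (foreign currency): market_value(10_000, 101.5, fx_rate=0.79)
```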
6. Data Analytics
Data analytics means analysing and examining raw data in order to:
- Present the data
- Find patterns and trends
- Draw conclusions and insights
- Make a prediction
There are 5 types of data analytics:
- Descriptive data analytics: the TRENDS & ANALYSIS in the past data
- Diagnostic data analytics: to find out WHY some issues happened
- Predictive data analytics: to forecast what WILL happen in the future
- Prescriptive data analytics: what should be done to make something happen in the future
- Cognitive data analytics: human-like intelligence including semantics and AI
The five above are known as the 2 Ds, 2 Ps and 1 C. It doesn’t mean that we should be doing all five. No. We could do just one.
Doing data analytics is very much a BA job, and so it should be. As data architects, our job is to provide a framework for how it should be built. For example, if the BAs are analysing trends in the data, how should they do it? Meaning: which data should be used for this analysis? What tool should be used? Where should the result be stored? Should it be automated? The same questions apply when a quant does forecasting using ML: which data, which tool, where to store results, and automation.
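To illustrate the descriptive side, a simple trend analysis can be as little as the sketch below: pull the data, aggregate it by month, and look at the change. The file and column names are hypothetical:

```python
# A hypothetical descriptive-analytics step: monthly sales trend from raw data.
# File and column names are illustrative only.
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["order_date"])   # or read from the warehouse

monthly_trend = (
    sales
    .assign(month=sales["order_date"].dt.to_period("M"))
    .groupby("month")["amount"]
    .sum()
)

print(monthly_trend)                # present the data
print(monthly_trend.pct_change())   # month-on-month change, i.e. the trend
```

The framework questions above are exactly the ones this sketch forces: where does sales.csv come from, which tool runs it, and where does monthly_trend get stored.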
7. Data Access
Data Access means the mechanism by which users and applications are allowed to read from or write to the data storage. Users usually access the data using user IDs, whereas applications use service accounts. Both user IDs and service accounts are created in Active Directory, and then they are assigned to a role.
The role is then given certain permissions, for example: a BA role can write into databases 1, 2 and 3 and can read from fileshares 4, 5 and 6, whereas a Quant role can write into databases 11, 12 and 13 and read from fileshares 14, 15 and 16. When there are new users they are simply assigned to those roles.
Resources are then grouped together. For example, tables 1, 2 and 3 are grouped into an Account Tables resource and fileshares 4, 5 and 6 are grouped into a Product Fileshare resource. The role permissions are then changed to use the resources rather than the actual databases or fileshares. Using the above example, a BA role can write into the Account Tables resource and a Quant role can read from the Product Fileshare resource.
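Here is a minimal sketch of that role-to-resource mapping, purely for illustration; in practice this lives in Active Directory and the platform’s own permission system, not in application code, and the role and resource names are hypothetical:

```python
# Hypothetical role/resource permission mapping, for illustration only.
PERMISSIONS = {
    "BA":    {"write": {"Account Tables"},  "read": {"Product Fileshare"}},
    "Quant": {"write": {"Quant Databases"}, "read": {"Product Fileshare"}},
}

def can_access(role: str, action: str, resource: str) -> bool:
    """Return True if the role has the given permission on the resource."""
    return resource in PERMISSIONS.get(role, {}).get(action, set())

print(can_access("BA", "write", "Account Tables"))     # True
print(can_access("Quant", "write", "Account Tables"))  # False
```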
In Azure, for database access, we should also decide whether we will be using SQL Authentication or Azure Active Directory authentication, or both. More on this in section 8 below (Data Security).
Who should approve data access requests should also be defined (usually the line manager + InfoSec).
Who should audit data access should also be defined (usually once every 3 or 6 months, by Production Support and the Development Head). It is very important to audit data access regularly and remove any access that is no longer required.
What is our role here as data architects? Our role is to work with InfoSec to define the mechanism. Using roles and resources as I described above is only one way. There are other ways, such as Azure resource groups and PIM (Privileged Identity Management) for production access. Best practice is usually achieved not with a single approach but with a combination of several.
The data can also be accessed using APIs. As data architects our job is to define the mechanism, again working with InfoSec. For example: key vaults, access tokens, role validation, firewalls, endpoints, etc.
8. Data Security
Like any security, Data Security is about Authentication, Authorisation and Accounting (known as AAA). Authentication is about identity: if someone claims to be User ID 1 or Application Account 2, how do we know that they are not lying? Authorisation is about permission: is this user allowed to access that data? Accounting records how long the user accessed the resource and how many megabytes of data were retrieved or written. This accounting data is usually used for billing, trend analysis and capacity planning.
Data security is also about preventing data theft and cyberattacks. This means preventing unauthorised access from both external and internal actors, for example by using a firewall or Front Door. It is also about encrypting the data so that in the event of the data being stolen, the thief can’t read it.
Data security is not only about protecting the data when it is being stored, but also when it is being moved, for example by encrypting the data in transit as well as at rest.
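As a tiny illustration of encryption at rest, here is a sketch using the Python cryptography library’s Fernet recipe. In a real platform the storage service and TLS handle most of this, and the key would live in a key vault rather than in code; the payload below is hypothetical:

```python
# A hypothetical illustration of encrypting data at rest.
# In practice the key is kept in a key vault and managed by the platform.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in reality: retrieved from a key vault
cipher = Fernet(key)

ciphertext = cipher.encrypt(b"client account balance: 1,250,000")
print(ciphertext)                    # unreadable without the key
print(cipher.decrypt(ciphertext))    # original bytes, for authorised readers only
```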
Data security is also about classifying which data is considered sensitive, and controlling the access to that sensitive data.
Data security is also about preventing data loss, for example when the medium where the data is stored is damaged. So we need regular backups or a secondary server (not just RAID disks).
In the context of data security, “data” means the raw data i.e. the output of #4, but also the processed data i.e. the output of #5 and #6.
What is our role as data architects in data security? We know the ins and outs of the data, we know the structure and the flow of the data, we know how the data is stored and who accesses it. Our job as data architects is to work with InfoSec to protect the data when it is being stored and when it is being moved, and to prevent data theft, unauthorised access and data loss.
9. Data Sharing
Data Sharing is a complement to Data Analytics and Data Calculation: rather than analysing or processing the raw data, we simply share it with users and applications. You might underestimate Data Sharing because it looks simple, but it can provide tremendous value.
In its simplest form, data sharing is data access: we just allow the users to access the data, as in point #7 above. But data sharing can also mean building a set of APIs, for example REST APIs or GraphQL, so that users can access the data programmatically, for example from their ML applications or their BI reports.
Data Sharing is also about sharing internal data with external users such as clients and regulators, for example via FTP, an API (as above) or a website (see below). I’d like to point out that whether the user accesses it via a website, FTP or API, the underlying mechanism should be the same (a single code base), i.e. using an API.
Data Sharing is also about building a website for internal or external users so they can view, filter and export the data. They can export it to a text file (CSV or pipe delimited), PDF, JSON or Excel. Or they can save the filter settings (giving them a name) and schedule the export to run daily or weekly, so that they get the latest data every time.
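As an illustration of the programmatic side, a data-sharing endpoint could be as simple as the sketch below, here using Flask. The route, file and column names are hypothetical, and in a real system the access and security controls from #7 and #8 would sit in front of it:

```python
# A hypothetical REST endpoint for data sharing, for illustration only.
# Real deployments need authentication, authorisation and encryption in front of this.
from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

@app.route("/api/clients/<client_id>/positions")
def client_positions(client_id: str):
    # In a real system: validate the caller's token and check that they are
    # allowed to see this client's data (see #7 and #8).
    positions = pd.read_csv("positions.csv")                 # hypothetical data source
    result = positions[positions["client_id"].astype(str) == client_id]
    return jsonify(result.to_dict(orient="records"))

if __name__ == "__main__":
    app.run()
```

The website, FTP export and ML applications could all sit on top of the same endpoint, which is the single-code-base point made above.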
In Data Sharing, Data Access and Data Security are paramount (see #7 and #8 above), particularly for external users. Here’s an example for Data Access: in the data sharing website, if a user chooses Client X, how do we know that this user is allowed to see Client X’s data? Here’s an example for Data Security: how do we protect the data from being stolen or destroyed?
Data Sharing is not about building reports and giving users access to those reports. No, that’s Data Analytics. Data Sharing is about allowing users and apps to access our data, for example using an API, a website or FTP. As data architects, our job is to design the mechanism and hand it over to the Dev team for development. When designing the mechanism, we work with InfoSec to ensure that it is secure, i.e. the data is encrypted in transport, and the access is tightly restricted and auditable (who accessed what and when is logged).
10. Data Availability
Data Availability is about keeping downtime to a minimum: the data is always available to the users, the reports and the APIs, 24×7. Data Availability is a bit of an art, because the database server might be up and running, but if the network path from the users to the database has a problem, the users still can’t access the data. Ditto for the reports or APIs.
The last thing we should do in cases like that is blame each other (“it’s not a database issue, guv, it’s the network”). Business users couldn’t care less whether it is the left hand of IT or the right hand making the mistake, and they are right: from their point of view we are all IT. The right thing to do is to have a Teams session between the DBA and the network team to work it out and resolve the issue. The data architect, InfoSec or the Dev team might join in as well, depending on the issue.
Data Availability also means this: the data is in server X or application Y but the users don’t have access to it. In other words, the data is in the building, but it’s not available to the users. In this case it is not about downtime at all. It is not even about operations; it is about development prioritisation. If we allowed everyone to access every piece of data in the company directly, it would be spaghetti. There’s a reason why we adopt a data warehouse or data lake as the data architecture, one of which is so that users, reports and APIs can access all the data in one place.
What’s our role as data architects in Data Availability? Our role is to define the ingestion mechanism for new data. And for production issues, our job is to help resolve them, often by describing the security architecture, the data architecture and the data flows (data ingestion or publication).
11. Data Ownership
Data Ownership means that every piece of data must have an owner. Not a technical owner, but a business owner. For example, billing data is owned by the finance department, and inventory data is owned by the production department.
We data architects can easily create an entity, such as “policy” or “client”, but who owns that data entity? It’s not always a single owner; master entities such as policy and client can have two or more owners. In insurance companies the policy data is usually owned by the product team, not underwriting, whereas clients are jointly owned by the Sales & Marketing team and the Customer Service team.
The reason every piece of data must have an owner is that if there is a dispute with regards to the definition of a particular data item, the owner has the final word. Ditto when there is a dispute with regards to the sourcing of a particular data item, or its data quality checks. See #12, #13 and #14 below for more details.
But isn’t Data Ownership a matter for Data Governance? What does it have to do with us data architects? Data architects design the entities in the data model, so of course we should be working with the data owners when doing data modelling. And also when we are designing data ingestion, data access, data security, data analytics, data calculations and data sharing. In almost every single thing that we do, we need to work with the data owners.
We also know when a particular piece of data has no owner, and we can bring that up at the Data Governance forum/meeting.
12. Data Standards
Data Standards are conventions about how the data models should be designed. For example, we have standards on data types. We have standards on the naming of data entities (and other data objects). We have standards on the data ingestion process, e.g. the naming of the pipelines and resources, the way the connections are established, the way the ingestion pipelines should be built, etc.
We have standards on referential integrity between entities and on the naming of primary and foreign keys. We have standards on collations, the partitioning strategy, index rebuilds and statistics updates.
We have standards for the data APIs, both APIs for Data Sharing and APIs for Data Ingestion: naming of endpoints, data types for parameters, format of the return values (e.g. JSON), expected response times, etc.
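Some of these standards can even be checked automatically, for example as part of pull request validation. Here is a small sketch with a hypothetical naming pattern:

```python
# A hypothetical naming-standard check: entity (table) names must be
# lower_snake_case and start with an agreed prefix. The pattern is illustrative only.
import re

NAMING_PATTERN = re.compile(r"^(dim|fact|ref)_[a-z][a-z0-9_]*$")

def check_table_name(name: str) -> bool:
    """Return True if the table name follows the (hypothetical) standard."""
    return bool(NAMING_PATTERN.match(name))

print(check_table_name("dim_client"))   # True
print(check_table_name("DimClient"))    # False - violates the standard
```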
As data architects it is our job to define those standards and monitor their implementation. It is sometimes annoying when a pull request satisfies everything except one count of the naming standards, when everything else is working except one entity name not following the standards. But we need to be disciplined about following the data standards. If we allow one exception, the whole thing will look funny.
Note: a pull request means merging the code from a “feature branch” into the “release branch”. The feature branch is where we do our development and unit testing. The release branch is where we package the code for releasing it into the Test or Production environment.
13. Data Dictionary
A data dictionary is a collection of business terms used across the whole data platform. Yes, this is very much in the Data Governance domain, but when designing the data model we data architects also have to define every single column in every entity.
When designing the data calculations we also have to define every single calculation. Yes, defining calculations is very much a BA task, but a data architect also needs to read those definitions. Quite often the BA will need to ask the data architect where a particular piece of data should be sourced from, especially when there are several possibilities. So in a way, data architects are also involved in the definition of those calculations.
A data dictionary also contains the definitions of all the source columns. These are obtained from the data providers. If the data is sourced internally, then we need to ask the SME of that source system what every column means.
A data dictionary also contains the definitions of all the fields in every report. Again this is very much a BA task, but a data architect is often involved in the sourcing of the elements that make up a particular field. For example, if field X in the report is Nominal x Unit Price x FX Rate, the BA would ask us data architects where the Nominal, Unit Price and FX Rate should be sourced from (which column in which table, and potentially which row).
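As a small illustration, a data dictionary entry for such a report field could capture both the definition and the sourcing. The structure and names below are hypothetical; in practice this usually lives in a data catalogue or governance tool rather than in code:

```python
# A hypothetical data dictionary entry for one report field, for illustration only.
field_x = {
    "name": "Market Value",
    "definition": "Nominal x Unit Price x FX Rate",
    "owner": "Finance",                        # the business data owner (see #11)
    "sources": {
        "Nominal":    "position.nominal",      # table.column, illustrative
        "Unit Price": "price.unit_price",
        "FX Rate":    "fx_rate.rate (rate date = position date)",
    },
}
```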
Creating a data dictionary is a mammoth task. We are talking probably 6-12 months of one FTE. Maintaining a data dictionary is not an easy task either, because new data keeps cropping up and new components keep being built. They all need to be added to the data dictionary.
With regards to the data dictionary, as data architects we take a back-seat role. Our job is to make ourselves available to the Data Governance team and the BAs so they can consult us and ask us questions. Quite often those questions are about data definitions and data sourcing (what does this mean, and where can I find X).
14. Data Quality
We finally arrive at the last element: Data Quality. Data Quality is checking that the data is correct and fit for use. When the data is ingested into the data platform, before any calculation is made, we first check the quality of the data. By that I mean these five factors: data existence, data format, data timeliness, data completeness and data correctness.
- Data Existence means that the data file for that day exists. Say every day we expect to receive 12 files from a data provider, and today we only receive 10 files, meaning that 2 files are missing. That’s data existence.
- Data Format means that the physical format of the file is correct. Say we expect file X to be in comma-separated format (CSV), which it usually is, but today it’s pipe delimited. That means it fails the data quality check. Another example is if we expect the file to have 10 header lines but today it contains 11 header lines. Or we expect the file to contain 40 columns (based on the file spec agreed with the data provider) but today it contains 41 columns. Or column 1 should contain a number, but today it contains a letter, e.g. 1.2E-5. Or column 2 should contain a date in YYYY-MM-DD format, but today it contains a date in DD/MM/YYYY format. Or blanks should be expressed as empty strings, but today they are expressed as “NULL”.
- Data Timeliness means that a certain file should be in by a certain time. For example, if it is a monthly file we should receive it by the first Monday at 9am; if it is a daily file we should receive it by 7am. Our data quality check will check the existence of the file at 7am (or on the first Monday at 9am) and if the file is not there then it fails the timeliness check.
- Data Completeness is about the content of the file. If the file usually contains around 10,000 rows and today it contains 1,000 rows then it’s not complete. Usually we can tolerate 10% to 20%, depending on the data.
- Data Correctness is about the accuracy of the data. If the average of column 1 in the file is usually around 55 and today the average is 85, then it fails the correctness check. Usually we can tolerate 10% to 20%, depending on the data. Usually we compare it with the last 3 days’ average, or with the previous month. We can also use the median or quartiles. For text values we count the occurrences. For example, if out of 10,000 rows the values are usually about 5,000 “High”, 3,000 “Medium” and 2,000 “Low”, then today they should be around those counts too. We can also replace High with 3, Medium with 2 and Low with 1, so we get 5,000×3 + 3,000×2 + 2,000×1 = 23,000, and compare that score day on day. Ditto with dates: we count the years or the months.
Those are the five categories of Data Quality checks/rules. But what about the Data Quality mechanism? How is it done? It needs to be flexible, not hard coded, so that when we need to change it we don’t change the code, just the parameters, which are stored in a config file. In the old world, the data files were loaded into staging tables first, and then we performed the DQ checks on those staging tables using SQL. But in a modern data platform, the DQ checks are performed on the data files themselves, not using SQL but using Python notebooks.
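Here is a minimal sketch of such a config-driven check, in the spirit of a Python notebook cell. The config values, file name, column name and tolerance are all hypothetical:

```python
# A hypothetical config-driven data quality check on a data file.
# The config would normally live in a separate config file; values here are illustrative.
import pandas as pd

config = {
    "file": "positions_20240131.csv",
    "min_rows": 8000,                 # completeness: lower bound on row count
    "expected_columns": 40,           # format: agreed number of columns
    "correctness": {"column": "unit_price", "expected_mean": 55, "tolerance_pct": 20},
}

df = pd.read_csv(config["file"])

results = {
    "completeness": len(df) >= config["min_rows"],
    "format": df.shape[1] == config["expected_columns"],
}

rule = config["correctness"]
mean = df[rule["column"]].mean()
lower = rule["expected_mean"] * (1 - rule["tolerance_pct"] / 100)
upper = rule["expected_mean"] * (1 + rule["tolerance_pct"] / 100)
results["correctness"] = bool(lower <= mean <= upper)

print(results)   # in practice, written to a DQ table or file and shown on a DQ report
```

Changing the tolerance or the expected column count then only means changing the config, not the code.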
Where are the results of these DQ checks stored? In a table in the data warehouse or a file in the data lake. This table or file is then displayed on a DQ report.
Data which fails the DQ checks is marked. There should be a column in each table for this, e.g. a “Pass DQ Check (Y/N)” column and/or a “DQ Rule ID” column.
What about data correction? It’s best to do data correction manually rather than automatically, but first we should try to correct it at source, i.e. ask the data provider to send a corrected file.
So Data Quality is checking and reporting the data existence, format, timeliness, completeness and correctness before the data is used by applications and APIs. As data architects, our job is to define the mechanism for doing the Data Quality checks and for logging and reporting the results, as well as the mechanism for marking the data which failed the DQ checks and for data correction. We must also work with the BAs and Data Governance to define the Data Quality rules.
CONCLUSION
So those are the elements of Data Architecture. It is clear now that Data Architecture is not just about Data Models, but about 13 other things too, including Data Security.
Of course different people have different opinions. I welcome comments and suggestions, either below on WordPress, or on LinkedIn (if you accessed this article through LinkedIn), or you can email me at vrainardi@gmail.com. I hope this article has been useful for you.