What does a data architect do? Designing tables? That was what I thought at first. But it turned out that it’s not just that. The main responsibilities of a data architect are:
- Data modelling and analysis (45%)
- Data loading and delivery (25%)
- Data availability, performance, security (15%)
- Data quality & governance (10%)
- Other (5%)
These percentages are applicable if you are a data architect in an “end user” organisation i.e. a bank, health care company, insurance company, retail companies, telecommunication company, airline, etc. The percentages are not applicable if you work as a consultant, meaning that you are a data architect in an IT consultancy firms such as TCS, Accenture, Cognizant, IBM. See under the “Other (5%)” section below for details.
1. Data Modelling and Analysis
- Understand the data
It is the most important duty of a data architect to understand the data in the organisation. If he works in an asset management, he needs to understand what is yield, spread, duration, etc. Without understanding the data well, one cannot hope to function well as a data architect.
- Data modelling
It is not just about designing tables. But also XML files and schemas (XSD), OLAP Cubes, format of interface files, etc. Not just relational, but also dimensional. Not just tables, but also indexes, logs, partitions, database and data servers. Triggers, CDC, audit too. Physical data modelling / PDM as well as logical / LDM. Let’s be honest, we rarely dwell into the physical aspect. We often ignore the physical make up of a database. We usually leave it to the DBA.
- Data Analysis
We often leave data analysis to the Business Analyst. My job is to design the data model, but the content of it is not my business. That’s the usual attitude of a data architect. This is wrong. We as a DA needs to understand where everything is. When we join a company as a starter, we need to spend a great deal of time understanding where the supplier SIC is, what the branch code is made up of, how many joins are required to query the historical data, how the profitability is stored, how the customer ID is generated, etc.
2. Data Loading and Delivery
The data doesn’t get there by itself. It is delivered into the tables by some processes. It is our job to understand the exact details of the data loading mechanism. Understanding the batch architecture, understanding what each step of the loading process does. Whether they are using Active Batch or Autosys, we need to understand how it works. How to create an conditional alert, a loop, monitoring loading time, etc. It’s true that the operational aspect of it is the job of production support, but the architectural design of the data loading process is our responsibility as a DA.
A data architect which only does data modelling and doesn’t understand about the data loading process is like a blind man. When and how the data is loaded into the data store affect the data significantly. For example, if a large, static table is constantly be truncated and reloaded every day, would you be quiet about it? If the staging table is thousand times bigger than the main target table, reaching 4 billion rows, would you not investigate it? If the loading SP is using merge causing nested loop of 1 billion rows, would you just sit and watch? We need to be fully aware of all these things.
We as a data architect need to control the purging process, the loading process, index maintenance, log maintenance. If you work as a DA in SQL Server and somebody mention Olla Hallengren and you raised your eye browse asking “Who is that?”, you can’t claim yourself to be a good data architect. We have to understand how the databases are maintained, and how the data is maintained. That’s the minimum. After understanding it, we need to improve it. Offline backup, restore process, common staging area, reuseable mappings and lookups, capacity planning, we need to improve all that.
Yes we work with the DBAs on these, but we can’t leave it to the DBAs. Does a DBA care about reusable Lookups? Do they even understand it? Yes we work with Production Support team on these, but we should not leave it to them.
Data loading is about piping the data into the data stores. Data delivery is about piping the data out from the data stores, into the business applications. O yes, we as a data architect need to understand and improve this process as well. If the data delivery is not efficiently done, we need to improve it. Six different applications may be requiring the same sales or costing data and they all do 10-join query involving huge tables, hundreds of times a day. Wouldn’t it be better if we create a different structure (additional structure) to store the data? A DBA would only look at the indexes, but they don’t have the knowledge about the data structure to be able to suggest the additional tables. We do!
Hence it is our job to make the data delivery as efficient as possible. How many reports are accessing this particular area of the database? Should we create a flatten table to support those main reports? Why are those reports using views, not stored procedures? (SPs can have temp tables, loop, passed-in-parameters; views can’t)
After we are in control of the data loading and data delivery and data maintenance processes, it is only then we can attempt to improve the architecture. You see, architecture is not just about the structure of the database, but also about designing the processes and mechanism that load and extract the data, into and from the databases.
Architecting the data flow means seeing it from ten thousand feet. At this height we can see things like:
- Why there are 2 jobs pushing promotional sales data into the same set of tables?
- At 7 am new data is being loaded into those main tables, whilst at the same time the cube processing are reading from those table causing locking. No wonder we have time outs.
- Many reports and processes read the relational warehouse directly, shouldn’t they be reading the dimensional marts, the flatten structures?
- Should we shield the structure by implementing views? That would make the query simpler, but also good for security (maintaining permissions)
- If many applications need FX rates, should we load that first and signal the consuming systems when it’s ready?
We design the data loading and data delivery process as efficient as possible. That’s architecting!
3. Data Availability, Performance, Security
Data availability is not just about high availability e.g. the databases should be available and accessible 99.99% of the time. It is also about the timing: it is no good if the position-level performance data or the sales data is available at T+2 (two days after the event). We should make it available on T+1 and if possible on T using intraday load.
The sooner the data is available to the end users the better it is. We have to always look at the data availability from the end user point of view. We have to think about the data delivery all the way to the end users who consumes it, not just half way into the Data Warehouse. For example, it is no good if we get the purchase data (or the risk data, or any other data relevant to your industry) into the Data Warehouse on the same day, but the report delivers the risk data to the end user on T+1 because it was waiting for rating data from a vendor.
Data availability is also about retrieving old data. Consider this conversation: User: can I view the sales data in that region for that product group in 2003? IT: I’m sorry Sir, it’s not available in the database, we have to load it first as the database only has data from 2005. User: ok, how long is it going to take to load? IT: about 4 hours. User: So I can view it tomorrow? IT: No Sir, because we have to bring the tape from Hounslow into here (Chester) and it will take a week. User: ???
And the following week: IT: the 2003 data is ready Sir, you can view it on the screen. User: Why there is only 2003 data? Where is 2004 to 2015 data? IT: Oh, the archive files are arranged by year. If we load 2003 data, we lose all the other year. User: ???
Or, something like this: User: why the 2003 data doesn’t have London region? It only has South East. In 2007 we divided South East into 4 regions. IT: That regional classification didn’t exist in 2003 Sir, and this is 2003 data. User: I know that, but can’t you load the 2015 regional classification? IT: No we can’t. User: Then this data is useless! We need to be able to analyse this 2003 sales data by current product grouping, and by current regional divisions, so we can compare it with the this year’s sales performance!
Query and Load Performance
This is about how performant is the query which retrieves the data from the data store, how fast the reports run, how long does the data export take, how long does the cube processing take, how long does the data load/upload/download take, how long does the data publishing take.
Structures like code-decode tables (link), rows with begin and end dates, etc could hamper the performance of the query. They may be suitable for certain type of query, but absolutely useless for another type of query and causes huge performance problem. As a data architect we need to take care performance issues like these.
It’s not just about the good days. More importantly, it is about the bad days. If during the morning data load, the database server has a problem, when will the data load finishes? 9 am? 1 pm? That’s what the business user wants to know. Do you have a secondary server automatically taking over the data loading?
The business users expect the Cognos Reports or Hyperion Cubes to work within a 1-2 seconds, day-in, day-out. If it takes 1-2 minutes to run a report or a cube query, that’s not good. Business users concern about the performance of what affect them. Not how many milliseconds the query runs! They don’t want to know if it is because of out-of-date statistics, missing indexes, incorrect partitioning, cartesian joins, badly written code, nested loop, locking, latching or dead lock. They do not care about the technicalities of it! If we have to put a secondary server, RAID 1+0, Fiber Channel SAN, Amazon Redshift or Hadoop – so be it. It is our job (data architect) to ensure that the Reports and Cubes respond time is always 1-2 seconds.
And if we don’t know what Redshift is, if we don’t know who Claudera are, if we don’t know how Twinfin FPGA works (or Oracle RAC), then we are not in the business, we are not good Data Architects. We as Data Architects need to know a great deal about stuff that influence the performance.
This is not “oh tell that to Info Sec” debate. We Data Architects work with Information Security people. We work with Compliance guys. We work with Infrastructure guys, with DBAs. We work with Production Support team and Helpdesk team. We work with external penetration test consultant. We work with anybody required to deliver data security, i.e. to ensure that only the relevant users can access relevant data. To ensure that nobody in the organisation (and outside!) can see something that they are not supposed to see.
Reviewing database accounts, reviewing the way the security grouping are arranged, how the database roles are structured (group within group, group hierarchy, etc), Kerberos authentication, view layers, cube roles, database permissions, segregation of network between Dev and Prod, etc. all of these are our reponsibilities as a Data Architect.
It is not just the databases, but also data on the web portals, data we receive/send on files, maintenance of the security grouping (usually via a web interface), system accounts used in data loads, system accounts used by reports, segregations of permissions in the applications, and the security of data in the archive and backups.
We have to understand the concept of Chinese Walls, extranet security, penetration testings, data masking, legallity of data retention periods and scope, public key cryptography (especially in retail banking sector) and security authentication process.
In retail banking sector, the customer data obtained and retained by Singapore, Switzerland, Luxembourg branches may not be viewable by the bank employees in the US, UK or France. In the health care sector, the personal health data obtained in UK, Germany, Australia, and US should not be used or disclosed “in a form which can identify the patient”, without their consent. But the business may want to analyse the data to find the trend at regional and global level, without identifying individual bank account or patient.
We as a Data Architect needs to find a solution to make this happen, for example using data masking, encryption, or nulling out. In retail industry for example, we could be storing customers’ credit card numbers but replace the first 12 digits with XXX so that the Call Centre users can’t view the complete numbers. In banking for example, the name of certain accounts may be blanked out. But because of the sector, country and size of the account, users are still able to identify the account, for instance there might be hundreds of large petrochemical companies in China but there’s only one large cocoa plantation in Peru.
4. Data Quality and Governance
This is not just about implementing data quality software like Trillium and IDQ to improve the quality of customer data. It is the whole framework of data quality. It is about defining the rules to certify the quality of the data, data profiling and clensing, creating workflows for DQ monitoring and exception handling, establishing Data Stewards who identify the business users who need to fix incorrect data, implementing tools for data reconciliation and data lineage, “entity” resolution for Single Customer View, the list goes on.
In Asset Management sector, if the country, currency, or classification of a position is not set, the breakdown of a portfolio will be reported incorrectly. In corporate banking, identifying an “entity” is crucial in managing derivative transactions and could be done in waterfall manner using Markit RED (Reference Entity Database), MIN (Moody’s Issuer Number), ISIN/CUSIP/SEDOL of issues, Reuter/Bloomberg Ticker of the security, and Entity Legal Name + Country of Incorporation.
The reporting of data quality is also Data Architect’s responsibility, i.e. how many breaks this month, how good/bad it is compared to last month, which DQ rules was violeted, how much over the tolerance limit were the breaks, does it affect the top level numbers, was it caused by the raw data coming from the source or caused by a bug in our systems, how many days did it take to be fixed, has it been previously identified and reported, etc.
Common DQ issues in banking are:
- Duplicate records, i.e. two securities with the same ISIN
- Inclusion/exclusion of outstanding trades
- Incorrect instrument attributes
- Ambiguity in identifying where yield or duration is sourced from, and how it was calculated
- Timing of NAV which changes due to single swing price adjustments
- Incorrect risk calculation due to incorrect instrument attributes
Data governance is difficult to get funding if it is too “blue sky”. It should focus on things that quickly bring business benefits, such as: completeness of customer data, business rules to check data accuracy, ensure that key fields are maintained.
It is part of a Data Architect role to deal with the non-technical aspects, ie. Management, dealing with “politics” i.e. resolving conflict of interest (with regards to data) between different groups in the organisation. In a good organisation this is only 5% of the time, but in reality, in many organisation, this “management and politics” work could easily become 30% of our time, making us unable to deliver the 4 responsibilities above with good results.
Data Architects in IT Consultancy Firms and Vendors
I mentioned in the beginning that the percentages are not applicable if we were a data architect working for an IT consultancy firm such as TCS, Accenture, and Cognizant. Or a if we were a data architect working for a database vendor, a BI/DW vendor or a data integration vendor such as IBM, Teradata, SAS, and Informatica. This is because we also have an additional responsibility to help sell data-related projects to the clients. This includes pre-sales activities such as preparing proposals, drafting potential solutions, conducting presentation, preparing demonstrations, proof of concepts, preparing business case including setting up the data, conducting research, answering RFPs particularly the technical pages, share customer feedback with the product development team and marketing team, assisting costing calculation.
Data architects working for an IT consultancy firm or vendor do not normally do sales or account management, but they could be asked to help with support work for certain accounts. Usually as a third line support (the first line being the helpdesk/technical support team, and the second line being the developers/product development team, or training team in certain cases). This support work is usually because they were working on that project before.
And then after they are deployed in the projects, their role become similar to DAs in the “end user” organisations, but with extra time doing the “internal agendas” (work not for the client but for the consultancy firm or vendor itself), for example: working on other proposal or presentation to other client.
The percentages could be something like this:
|“In the office” time: (aka “on the bench” time)
||“In the project” time: (depends on what project, e.g. DQ project is different to BI project)