Data Warehousing and Business Intelligence

31 May 2015

Responsibilities of Data Architects

Filed under: Data Architecture — Vincent Rainardi @ 7:49 am

What does a data architect do? Designing tables? That was what I thought at first. But it turned out that it’s not just that. The main responsibilities of a data architect are:

  1. Data modelling and analysis (45%)
  2. Data loading and delivery (25%)
  3. Data availability, performance, security (15%)
  4. Data quality & governance (10%)
  5. Other (5%)

These percentages are applicable if you are a data architect in an “end user” organisation, e.g. a bank, health care company, insurance company, retail company, telecommunication company, or airline. The percentages are not applicable if you work as a consultant, meaning that you are a data architect in an IT consultancy firm such as TCS, Accenture, Cognizant, or IBM. See the “Other (5%)” section below for details.

1. Data Modelling and Analysis

  • Understand the data
    It is the most important duty of a data architect to understand the data in the organisation. If he works in asset management, he needs to understand what yield, spread, duration, etc. are. Without understanding the data well, one cannot hope to function well as a data architect.
  • Data modelling
    It is not just about designing tables, but also XML files and schemas (XSD), OLAP cubes, the format of interface files, etc. Not just relational, but also dimensional. Not just tables, but also indexes, logs, partitions, databases and data servers. Triggers, CDC and audit too. Physical data modelling (PDM) as well as logical (LDM). Let’s be honest, we rarely delve into the physical aspect. We often ignore the physical make-up of a database and usually leave it to the DBA. (A minimal physical sketch follows this list.)
  • Data Analysis
    We often leave data analysis to the business analyst. “My job is to design the data model; its content is not my business” is the usual attitude of a data architect. This is wrong. We as DAs need to understand where everything is. When we join a company as new starters, we need to spend a great deal of time understanding where the supplier SIC is, what the branch code is made up of, how many joins are required to query the historical data, how the profitability is stored, how the customer ID is generated, etc.
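
To make the physical side concrete, here is a minimal dimensional sketch in T-SQL. All schema, table and column names are invented for illustration, not taken from any particular system: a dimension with a surrogate key, a fact table referencing it, and a supporting index.

    -- A minimal, hypothetical dimensional model: one dimension, one fact.
    CREATE TABLE dw.DimProduct
    (
        ProductKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
        ProductCode  varchar(20) NOT NULL,                    -- business key
        ProductGroup varchar(50) NOT NULL
    );

    CREATE TABLE dw.FactSales
    (
        SalesDateKey int   NOT NULL,  -- e.g. 20150531, partition-friendly
        ProductKey   int   NOT NULL REFERENCES dw.DimProduct (ProductKey),
        SalesAmount  money NOT NULL
    );

    -- The physical aspects (indexes, partitions, logs) are part of the model too.
    CREATE INDEX IX_FactSales_SalesDateKey ON dw.FactSales (SalesDateKey);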

2. Data Loading and Delivery

Data Loading

The data doesn’t get there by itself. It is delivered into the tables by some process, and it is our job to understand the exact details of that data loading mechanism: the batch architecture, and what each step of the loading process does. Whether they are using ActiveBatch or AutoSys, we need to understand how it works: how to create a conditional alert, a loop, how to monitor loading times, etc. It’s true that the operational aspect of it is the job of production support, but the architectural design of the data loading process is our responsibility as DAs.

A data architect who only does data modelling and doesn’t understand the data loading process is working blind. When and how the data is loaded into the data store affects the data significantly. For example, if a large, static table is truncated and reloaded every day, would you be quiet about it? If the staging table is a thousand times bigger than the main target table, reaching 4 billion rows, would you not investigate it? If the loading stored procedure uses a MERGE that causes a nested loop over 1 billion rows, would you just sit and watch? We need to be fully aware of all these things.
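
For instance, instead of truncating and reloading a large, mostly static table every day, a watermark-based incremental load could look like the hedged sketch below. Schema, table and column names are made up for illustration.

    -- Load only the rows changed since the last successful load,
    -- rather than truncating and reloading the whole table.
    DECLARE @LastLoaded datetime2 =
        (SELECT MAX(LoadedUpTo) FROM etl.LoadWatermark WHERE TableName = 'dbo.Customer');

    -- New rows
    INSERT INTO dbo.Customer (CustomerId, CustomerName, Region, UpdatedAt)
    SELECT s.CustomerId, s.CustomerName, s.Region, s.UpdatedAt
    FROM   staging.Customer s
    WHERE  s.UpdatedAt > @LastLoaded
      AND  NOT EXISTS (SELECT 1 FROM dbo.Customer c WHERE c.CustomerId = s.CustomerId);

    -- Changed rows
    UPDATE c
    SET    c.CustomerName = s.CustomerName,
           c.Region       = s.Region,
           c.UpdatedAt    = s.UpdatedAt
    FROM   dbo.Customer c
    JOIN   staging.Customer s ON s.CustomerId = c.CustomerId
    WHERE  s.UpdatedAt > @LastLoaded;

    -- Move the watermark forward
    UPDATE etl.LoadWatermark SET LoadedUpTo = SYSDATETIME() WHERE TableName = 'dbo.Customer';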

We as data architects need to control the purging process, the loading process, index maintenance and log maintenance. If you work as a DA on SQL Server and somebody mentions Ola Hallengren and you raise your eyebrows asking “Who is that?”, you can’t claim to be a good data architect. We have to understand how the databases are maintained, and how the data is maintained. That’s the minimum. After understanding it, we need to improve it. Offline backups, the restore process, common staging areas, reusable mappings and lookups, capacity planning: we need to improve all of that.
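
As a rough illustration of the kind of maintenance we should at least be able to read (many SQL Server shops use Ola Hallengren’s maintenance scripts rather than hand-rolled statements like these; the object names below are made up):

    -- Reorganise a lightly fragmented index, rebuild a heavily fragmented one,
    -- and refresh statistics so the optimiser has current information.
    ALTER INDEX IX_FactSales_SalesDateKey ON dw.FactSales REORGANIZE;
    ALTER INDEX ALL ON dw.FactSales REBUILD;        -- consider WITH (ONLINE = ON) on Enterprise edition
    UPDATE STATISTICS dw.FactSales WITH FULLSCAN;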

Yes, we work with the DBAs on these, but we can’t leave it to the DBAs. Does a DBA care about reusable lookups? Do they even understand them? Yes, we work with the production support team on these, but we should not leave it to them.

Data Delivery

Data loading is about piping the data into the data stores. Data delivery is about piping the data out of the data stores and into the business applications. Oh yes, we as data architects need to understand and improve this process as well. If the data delivery is not done efficiently, we need to improve it. Six different applications may require the same sales or costing data, and they all run a 10-join query involving huge tables, hundreds of times a day. Wouldn’t it be better if we created a different (additional) structure to store that data? A DBA would only look at the indexes; they don’t have the knowledge of the data structures to be able to suggest the additional tables. We do!

Hence it is our job to make the data delivery as efficient as possible. How many reports are accessing this particular area of the database? Should we create a flattened table to support those main reports? Why are those reports using views, not stored procedures? (SPs can have temp tables, loops and passed-in parameters; views can’t.)
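
As a hypothetical example of such a flattened structure (all names below are illustrative): a summary table refreshed once after the nightly load, so the reports and the six applications read one narrow table instead of repeating the 10-join query hundreds of times a day.

    -- Refresh the flattened summary once, after the nightly load completes.
    TRUNCATE TABLE mart.SalesSummaryDaily;

    INSERT INTO mart.SalesSummaryDaily (SalesDateKey, ProductGroup, Region, SalesAmount)
    SELECT f.SalesDateKey, p.ProductGroup, r.RegionName, SUM(f.SalesAmount)
    FROM   dw.FactSales f
    JOIN   dw.DimProduct p ON p.ProductKey = f.ProductKey
    JOIN   dw.DimRegion  r ON r.RegionKey  = f.RegionKey
    GROUP BY f.SalesDateKey, p.ProductGroup, r.RegionName;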

Architecting

Only after we are in control of the data loading, data delivery and data maintenance processes can we attempt to improve the architecture. You see, architecture is not just about the structure of the database, but also about designing the processes and mechanisms that load data into, and extract data from, the databases.

Architecting the data flow means seeing it from ten thousand feet. At this height we can see things like:

  • Why are there two jobs pushing promotional sales data into the same set of tables?
  • At 7 am new data is being loaded into those main tables, whilst at the same time the cube processing is reading from those tables, causing locking. No wonder we have timeouts.
  • Many reports and processes read the relational warehouse directly; shouldn’t they be reading the dimensional marts, the flattened structures?
  • Should we shield the structure by implementing views? That would make the queries simpler, and it is also good for security (maintaining permissions). See the sketch after this list.
  • If many applications need FX rates, should we load those first and signal the consuming systems when they are ready?
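
For example, a minimal sketch of the view-shielding idea from the list above (names are illustrative): consumers query the view, so the underlying tables can be restructured, and permissions managed, without breaking them.

    CREATE VIEW semantic.CustomerSales
    AS
    SELECT c.CustomerId, c.CustomerName, f.SalesDateKey, f.SalesAmount
    FROM   dw.FactSales f
    JOIN   dw.DimCustomer c ON c.CustomerKey = f.CustomerKey;
    GO

    -- Permissions are granted on the view layer, not on the base tables.
    GRANT SELECT ON semantic.CustomerSales TO ReportingUsers;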

We design the data loading and data delivery processes to be as efficient as possible. That’s architecting!

3. Data Availability, Performance, Security

Data availability

Data availability is not just about high availability, e.g. the databases being available and accessible 99.99% of the time. It is also about timing: it is no good if the position-level performance data or the sales data is available at T+2 (two days after the event). We should make it available at T+1 and, if possible, on T using an intraday load.

The sooner the data is available to the end users, the better. We always have to look at data availability from the end user’s point of view. We have to think about the data delivery all the way to the end users who consume it, not just halfway into the data warehouse. For example, it is no good if we get the purchase data (or the risk data, or any other data relevant to your industry) into the data warehouse on the same day, but the report only delivers it to the end user at T+1 because it was waiting for rating data from a vendor.

Data availability is also about retrieving old data. Consider this conversation: User: can I view the sales data in that region for that product group in 2003? IT: I’m sorry Sir, it’s not available in the database; we have to load it first, as the database only has data from 2005. User: OK, how long is it going to take to load? IT: about 4 hours. User: So I can view it tomorrow? IT: No Sir, because we have to bring the tape from Hounslow to here (Chester) and it will take a week. User: ???
And the following week: IT: the 2003 data is ready Sir, you can view it on the screen. User: Why is there only 2003 data? Where is the 2004 to 2015 data? IT: Oh, the archive files are arranged by year. If we load the 2003 data, we lose all the other years. User: ???
Or something like this: User: why doesn’t the 2003 data have the London region? It only has South East. In 2007 we divided South East into 4 regions. IT: That regional classification didn’t exist in 2003 Sir, and this is 2003 data. User: I know that, but can’t you apply the 2015 regional classification? IT: No we can’t. User: Then this data is useless! We need to be able to analyse this 2003 sales data by the current product grouping and by the current regional divisions, so we can compare it with this year’s sales performance!

Query and Load Performance

This is about how performant the queries that retrieve data from the data store are, how fast the reports run, how long the data export takes, how long the cube processing takes, how long the data load/upload/download takes, and how long the data publishing takes.

Structures like code-decode tables, rows with begin and end dates, etc. can hamper query performance. They may be suitable for certain types of query, but absolutely useless for other types, causing huge performance problems. As data architects we need to take care of performance issues like these.
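
For example, rows with begin and end dates force every consumer into a range (“as at”) predicate like the hypothetical one below; that is fine for a point-in-time lookup, but expensive when a report scans millions of such rows without a supporting index.

    DECLARE @AsAtDate   date = '2015-05-31';
    DECLARE @CustomerId int  = 12345;

    -- Point-in-time lookup against a begin/end-dated history table (names illustrative).
    SELECT h.CustomerId, h.Segment, h.Region
    FROM   dw.DimCustomerHistory h
    WHERE  h.CustomerId = @CustomerId
      AND  @AsAtDate >= h.BeginDate
      AND  @AsAtDate <  h.EndDate;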

It’s not just about the good days. More importantly, it is about the bad days. If the database server has a problem during the morning data load, when will the data load finish? 9 am? 1 pm? That’s what the business users want to know. Do you have a secondary server automatically taking over the data loading?

The business users expect the Cognos reports or Hyperion cubes to respond within 1-2 seconds, day in, day out. If it takes 1-2 minutes to run a report or a cube query, that’s not good. Business users care about the performance of what affects them, not about how many milliseconds a query runs! They don’t want to know whether it is because of out-of-date statistics, missing indexes, incorrect partitioning, Cartesian joins, badly written code, nested loops, locking, latching or deadlocks. They do not care about the technicalities! If we have to put in a secondary server, RAID 1+0, a Fibre Channel SAN, Amazon Redshift or Hadoop, so be it. It is our job as data architects to ensure that report and cube response times are always 1-2 seconds.

And if we don’t know what Redshift is, if we don’t know who Cloudera are, if we don’t know how the Netezza TwinFin FPGA works (or Oracle RAC), then we are not in the business; we are not good data architects. We as data architects need to know a great deal about the things that influence performance.

Data Security

This is not an “oh, tell that to Info Sec” debate. We data architects work with information security people. We work with compliance guys. We work with infrastructure guys, and with DBAs. We work with the production support team and the helpdesk team. We work with external penetration test consultants. We work with anybody required to deliver data security, i.e. to ensure that only the relevant users can access the relevant data, and that nobody in the organisation (or outside!) can see something they are not supposed to see.

Reviewing database accounts, reviewing the way the security groups are arranged, how the database roles are structured (group within group, group hierarchies, etc.), Kerberos authentication, view layers, cube roles, database permissions, segregation of networks between Dev and Prod: all of these are our responsibilities as data architects.
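
As a hedged illustration of the “group within group” role structuring mentioned above (all schema and principal names are hypothetical):

    -- Readers get the semantic (view) layer only; the base schema stays off limits.
    CREATE ROLE ReportReaders;
    GRANT SELECT ON SCHEMA::semantic TO ReportReaders;
    DENY  SELECT ON SCHEMA::dw       TO ReportReaders;

    -- A regional role nested inside the reader role (group within group),
    -- with a Windows (Kerberos-authenticated) database user as a member.
    CREATE ROLE UkAnalysts;
    ALTER ROLE ReportReaders ADD MEMBER UkAnalysts;
    ALTER ROLE UkAnalysts ADD MEMBER [DOMAIN\analyst01];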

It is not just the databases, but also data on web portals, data we receive or send in files, maintenance of the security groups (usually via a web interface), system accounts used in data loads, system accounts used by reports, segregation of permissions in the applications, and the security of data in archives and backups.

We have to understand the concepts of Chinese walls, extranet security, penetration testing, data masking, the legality of data retention periods and scope, public key cryptography (especially in the retail banking sector) and authentication processes.

In the retail banking sector, the customer data obtained and retained by the Singapore, Switzerland and Luxembourg branches may not be viewable by the bank’s employees in the US, UK or France. In the health care sector, the personal health data obtained in the UK, Germany, Australia and the US should not be used or disclosed “in a form which can identify the patient” without their consent. But the business may want to analyse the data to find trends at regional and global level, without identifying individual bank accounts or patients.

We as data architects need to find a solution to make this happen, for example using data masking, encryption, or nulling out. In the retail industry, for example, we could store customers’ credit card numbers but replace the first 12 digits with XXX so that the Call Centre users can’t view the complete numbers. In banking, for example, the names of certain accounts may be blanked out; but because of the sector, country and size of the account, users may still be able to identify it. For instance, there might be hundreds of large petrochemical companies in China but only one large cocoa plantation in Peru.
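
For instance, a minimal sketch of the card-number masking (schema, table and column names are made up): the Call Centre application reads a view that exposes only the last four digits.

    CREATE VIEW callcentre.CustomerCard
    AS
    SELECT CustomerId,
           -- CardNumber is assumed to be stored as varchar; the first 12 digits are hidden.
           'XXXX-XXXX-XXXX-' + RIGHT(CardNumber, 4) AS MaskedCardNumber
    FROM   dw.CustomerCard;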

4. Data Quality and Governance

This is not just about implementing data quality software like Trillium and IDQ to improve the quality of customer data. It is the whole framework of data quality: defining the rules that certify the quality of the data, data profiling and cleansing, creating workflows for DQ monitoring and exception handling, establishing data stewards who identify the business users who need to fix incorrect data, implementing tools for data reconciliation and data lineage, “entity” resolution for a Single Customer View; the list goes on.

In the asset management sector, if the country, currency, or classification of a position is not set, the breakdown of a portfolio will be reported incorrectly. In corporate banking, identifying an “entity” is crucial in managing derivative transactions and can be done in a waterfall manner using Markit RED (Reference Entity Database), MIN (Moody’s Issuer Number), the ISIN/CUSIP/SEDOL of the issues, the Reuters/Bloomberg ticker of the security, and the entity’s legal name plus country of incorporation.

The reporting of data quality is also the data architect’s responsibility, i.e. how many breaks there were this month, how good or bad that is compared to last month, which DQ rules were violated, how far over the tolerance limit the breaks were, whether they affect the top-level numbers, whether they were caused by the raw data coming from the source or by a bug in our systems, how many days they took to be fixed, whether they had previously been identified and reported, etc.
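
A hedged sketch of what a simple DQ rule and its monthly break reporting could look like (the rule, table and column names are invented for illustration):

    -- Record today's breaks for one rule: duration outside its tolerance band.
    INSERT INTO dq.RuleBreak (RuleId, RunDate, EntityId, MeasuredValue)
    SELECT r.RuleId, CAST(GETDATE() AS date), s.SecurityId, s.ModifiedDuration
    FROM   dw.SecurityAnalytics s
    JOIN   dq.DqRule r ON r.RuleId = 101
    WHERE  s.ModifiedDuration NOT BETWEEN r.LowerLimit AND r.UpperLimit;

    -- Monthly reporting: how many breaks per rule, month by month.
    SELECT RuleId,
           DATEFROMPARTS(YEAR(RunDate), MONTH(RunDate), 1) AS BreakMonth,
           COUNT(*) AS Breaks
    FROM   dq.RuleBreak
    GROUP BY RuleId, DATEFROMPARTS(YEAR(RunDate), MONTH(RunDate), 1);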

Common DQ issues in banking are:

  • Duplicate records, e.g. two securities with the same ISIN
  • Inclusion/exclusion of outstanding trades
  • Incorrect instrument attributes
  • Ambiguity in identifying where yield or duration is sourced from, and how it was calculated
  • Timing of NAV which changes due to single swing price adjustments
  • Incorrect risk calculation due to incorrect instrument attributes

It is difficult to get funding for data governance if it is too “blue sky”. It should focus on things that quickly bring business benefits, such as the completeness of customer data, business rules to check data accuracy, and ensuring that key fields are maintained.

5. Other

It is part of a data architect’s role to deal with the non-technical aspects, i.e. management and dealing with “politics”: resolving conflicts of interest (with regard to data) between different groups in the organisation. In a good organisation this is only 5% of the time, but in reality, in many organisations, this “management and politics” work can easily become 30% of our time, making us unable to deliver the four responsibilities above with good results.

Data Architects in IT Consultancy Firms and Vendors

I mentioned at the beginning that the percentages are not applicable if we are a data architect working for an IT consultancy firm such as TCS, Accenture, or Cognizant, or for a database vendor, a BI/DW vendor or a data integration vendor such as IBM, Teradata, SAS, or Informatica. This is because we also have an additional responsibility: to help sell data-related projects to clients. This includes pre-sales activities such as preparing proposals, drafting potential solutions, conducting presentations, preparing demonstrations and proofs of concept, preparing the business case (including setting up the data), conducting research, answering RFPs (particularly the technical pages), sharing customer feedback with the product development and marketing teams, and assisting with costing calculations.

Data architects working for an IT consultancy firm or vendor do not normally do sales or account management, but they can be asked to help with support work for certain accounts, usually as third-line support (the first line being the helpdesk/technical support team, and the second line being the developers/product development team, or the training team in certain cases). This support work usually arises because they worked on that project before.

And then, once they are deployed on projects, their role becomes similar to that of DAs in “end user” organisations, but with extra time spent on “internal agendas” (work not for the client but for the consultancy firm or vendor itself), for example working on a proposal or presentation for another client.

The percentages could be something like this:

“In the office” time: (aka “on the bench” time)

  • Proposal 60%
  • Presentation, demo, PoC 20%
  • Research & documentation 10%
  • Support 5%
  • Management 5%

“In the project” time: (depends on the project, e.g. a DQ project is different from a BI project)

  • Data Modelling & Analysis 40%
  • ETL Design / Data Loading 30%
  • Data Quality Framework 20%
  • Management 5%
  • Internal agendas 5%

Comments

  1. Hi Vincent,
    I have a question about data quality.
    I am using a data quality tool (Talend), and I can define tests (validations) that I run periodically on some critical tables of my model.
    These tests let me control the data quality level in those tables.
    I also have some data quality validations in my ETL that I created before using that DQ tool.
    I don’t know what the best practices in DQ are. I am thinking of moving the tests that I have created with Talend DQ into my ETL (Pentaho), because I have more tests in Talend DQ than in Pentaho.
    I think that my validation tests should be in all the data flows into my DW, shouldn’t they?

    Can you give me some advice, best practices or a roadmap to go further with DQ on my DW?

    Thanks in advance,

    Comment by Juan V. — 16 April 2016 @ 8:21 am

    • Hi Juan, I think the most important thing about data quality is to fix the issues captured by the DQ/ETL tools. There is no point spending time and money configuring data quality validations if we don’t fix the data issues. But ironically, many companies do just that: the DQ system outputs a lot of numbers which we do nothing about. The second most important thing is coverage. We should not spend time and money capturing issues which are not important or valuable to the business. For example, we need to be able to handle incoming data files being half empty (and potentially causing the DW batch to fail), or containing incorrect data (and potentially causing incorrect decisions to be made). We capture technical data issues (such as orphan rows or a business key that doesn’t exist) but don’t capture important business data issues (such as an incorrect country of risk, or a duration that is way outside the normal range). So yes, by all means you should be validating all fields in all tables, but as time and budget are limited you should prioritise the ones with most value to the business. For every validation we set up, we should ask ourselves: what value does this bring to the business?
      About validation being in two places (Talend and Pentaho), the best practice is one place. Every DQ rule should be created in only one place. The ideal place is after the DW load has completely finished. So don’t validate the data while it is being loaded (in-row validation), but after it has been fully loaded. Once all the data has been loaded we can perform much more meaningful validations than we can in-row (when the row is being loaded). For example, we can verify the row counts and the total sums (per product, per portfolio, per store, etc.). We can compare today’s max, min, and average values to the last week’s data. We can detect anomalies in textual values by comparing the counts to historical values.
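      For illustration, a rough sketch of such a post-load check in SQL (all the table and column names here are made up): compare today’s total per store against the average of the previous seven days and log an alert when it deviates by more than 20%.

        INSERT INTO dq.LoadAlert (RunDate, StoreId, TodayAmount, AvgLast7Days)
        SELECT CAST(GETDATE() AS date), t.StoreId, t.TodayAmount, h.AvgAmount
        FROM  (SELECT StoreId, SUM(SalesAmount) AS TodayAmount
               FROM   dw.FactSales
               WHERE  SalesDate = CAST(GETDATE() AS date)
               GROUP BY StoreId) t
        JOIN  (SELECT StoreId, AVG(DailyAmount) AS AvgAmount
               FROM  (SELECT StoreId, SalesDate, SUM(SalesAmount) AS DailyAmount
                      FROM   dw.FactSales
                      WHERE  SalesDate >= DATEADD(day, -7, CAST(GETDATE() AS date))
                        AND  SalesDate <  CAST(GETDATE() AS date)
                      GROUP BY StoreId, SalesDate) d
               GROUP BY StoreId) h ON h.StoreId = t.StoreId
        WHERE ABS(t.TodayAmount - h.AvgAmount) > 0.2 * h.AvgAmount;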
      I hope this helps, Vincent

      Comment by Vincent Rainardi — 16 April 2016 @ 3:05 pm

  2. Hi Vincent,

    Thanks for your help, it is really useful.
    Reading your answer I noticed some things that are perhaps not best practice in my ETL.
    I do my validations (duplicates, nulls, integrity…) and fixes in my ETL before loading into my target table.
    In my Pentaho ETL I don’t have a high data volume (for example, when I worked in SAS DI the volume was higher). I built my ETL as a row-by-row process in a pipeline dataflow, so I don’t treat the datasets as a whole in many steps, only when it is required (sorting or grouping).
    Some validation tests are made row by row, and these validations are made before loading.

    You are right that some validations are better after load, like row counts or sums; also, considering performance, perhaps it is better to do validation after load, since you can then run the test on the whole dataset, not row by row.

    Another question: I remember hearing about the EQTL (extract-quality-transform-load) model, in which the validation tests are made before loading.

    What is your opinion?

    Thanks for your help

    Comment by Juan V. — 16 April 2016 @ 7:25 pm

    • Hi Juan, it is very common to fix nulls and integrity when the row is being loaded into the dimension or fact table. The null is replaced by zero or a blank string, and non-existent business keys and invalid dates are given a surrogate key of zero or -1. And the duplicate rows are removed while the data is in the staging tables. This is all good and is recommended. But these EQTL checks are technical solutions with relatively low value. A more valuable solution is to detect, for example, that:
      1) today’s prices are 80% the same as yesterday’s (which indicates a problem at the source and will cause us to trade incorrectly today)
      2) today’s sales amount is 20% lower than the last one-week average (which again indicates a problem at the source and will cause us a big problem)
      3) 500 bonds have an effective modified duration outside their tolerable values (again, caused by a problem at the source, but it will cause us a problem)
      These “business validations” are highly valuable because they stop us making silly trades and incorrect decisions based on wrong data. You see, usually when the data issue is caused by a problem at the data source, we have a tendency to say, “It is not our fault in the DW team. We have flagged it to the source system team and there is nothing we can do about it. They should fix it.” While this is true, our DW reports are churning out false information and our BI users are consuming toxic data, potentially costing us $1 million today. If it is a regulatory report we could be fined $1 million. It is our duty to stop our BI tools and reports giving out toxic information. And these “business validations” can only be done from “outside the row”: not when the row is being loaded into the DW, but when all the complete datasets have been loaded into the DW, and before it is opened to the public for access. Some companies decide to replace today’s bad data with the last 3 days’ average (only the bad rows, not all rows). That brings good value to the business. Risk data and market data, for example: if the data we receive from an external data provider is bad today and they can’t correct it today, it will take them a few days to investigate and sort out. We can’t wait until tomorrow because trades and reports need to be done today. Some companies can wait for a week; well, they are lucky to have that luxury. In some companies the risk is only $100, not $1 million; well, they are lucky. But some companies, particularly investment banks, are not that lucky. If we can’t correct the data with the last few days’ average, at the very least we have to warn the report users and BI users that some numbers are red (bad) today whereas the majority are green (good). We need to think about what matters to the business users. Often we spend a lot of time thinking about various technical DQ things that the business users have no concern about. If they worry about wrong prices, we should be focusing our attention on wrong prices, not on nulls. I hope this helps.

      Comment by Vincent Rainardi — 17 April 2016 @ 7:17 am

  3. Thanks for your help, Vincent,

    Juan

    Comment by Juan V. — 17 April 2016 @ 2:25 pm

