Data Warehousing, BI and Data Science

18 December 2015

Measure or Attribute

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 10:00 pm

Last week two people asked me about this subject so I thought it would be beneficial for others if I write it. It is about whether a number should be a measure (a fact) or an attribute (of a dimension). Apologies in advance I can’t reveal the real case as it is a client’s case. But I will present several similar cases to illustrate this issue. As always, it is better to learn by example (cases) than by theory.

First case: worker rate

The first case is the timesheet system in the construction industry. The daily rate of the worker is the number in question. The hourly rate is recorded as an attribute of the worker. The fact table recorded the hours each worker worked in the project, every day. The data mart is used to analyse the cost the project (by work area), and to plan for the next 6-12 months. The cost for each worker for each day is simply the rate multiplied by the hours worked.

There are 2 issues here. One, when the rate entered last month is incorrect, how do we correct it. Second, how to plan for next year using the current rate + 5%, and current work hours -5%.

In this case, it is better to treat the rate as a measure, located in the fact table for actual costs as well as in the fact table for planning. This way we can correct the past rates and recalculate last month’s costs. We can also change the future rates (in the planning fact table) and recalculate the future costs. Yes we will occupy more disk space because we are repeating the rate on every row, but that is a very good price to pay for the advantage we get: more flexibility, ability to support the required functionality and higher performance.

The need to put the rate in the worker dimension arises when we need to know the historical rates for each worker. But this need can also be satisfied if we store the rate as a measure. Granted it would be less performant as we have to plough through lots of fact rows to distil the data to get distinct historical rate per worker. So if this is a mandatory requirement, it is a good idea if we put rate an type 2 attribute, in addition to putting the rate as a measure.

Second case: product price

This is a classic example from the retail industry. But it is also applicable to banking and insurance, pharma, energy and telecom sectors. In a shop, each product has a price. The same product can have different price in different stores. The price can change at any time. The price reviews can happen weekly or monthly, or ad-hoc. It can be done at the store level, city level or regional level. The actual price paid depends on discount or promotion applied for each transaction, at the time the customer purchase the product.

The issue here is, whether we should store the price of the product in a sales fact table, or as an attribute in the product dimension. I maintain an opinion that the price must be store in the sales fact table, for each transaction. In addition, the price should also be stored in the price list fact table. The price on the price list fact table is the official price for each store, for that day. Whereas the price on the sales fact table is the actual price paid, which could be different from price list, because of the offer or promotion applied to that transaction. For example, buy one get one free, or three for the price of two. Or, because the product is in a bundle such as cheese and wine, pyjamas and sleeper, a dictionary and thesaurus, etc. Or, because the product is simply discounted on that day, perhaps due to expiry date, or because the manager want to get rid of the stock quickly.

Having a price in the product dimension, in my opinion is not right. Product attributes such as size and colour, model and optional extras, are true product properties and therefore should be in the product dimension as attributes. These attributes are the same across stores, they are truly the property of a product. But prices can be different across stores, so they are not sole property of a product.

The price may be attached not to an individual product code, but to several product codes. For example, regardless of colour, the prices of the products are the same. Prices can be dynamic, e.g. prices of tickets (entertainment industry or transport industry) of the same show or flight can be different from one seat to the next, from one booking to the next, from one time slot to the next. Prices can be in bands, for example in utility industry (electricity and gas) and in telecommunication industry (mobile calls) the first 100 kWh can have different price to the second 100 kWh, Friday price can be different to Sunday calls.

The concept of price list fact table is more accommodative to satisfy the above demanding requirements, compared to putting price as a type 2 attribute in the product dimension.

11 November 2015

Indexing Fact Tables

A bit on Primary Key

Yesterday a friend asked me why there was no primary key on a fact table. I explained that we did have a fact table surrogate key (FSK) on that fact table, but I made it as a unique non-clustered index because we needed the physical ordering (the clustered index) to be on the snapshot date as it was a daily periodic snapshot fact table, queried mostly by using the snapshot date.

The purpose of having a primary key (PK) is to enforce uniqueness in one column of the table. We can achieve the same thing, by creating a unique key, hence we do not need a PK in that fact table.

We need to pause and think, if in the fact table we need a unique identifier of the rows. If we need to refer a fact row from another fact row (self referencing), then we do need a PK, which is usually a single column bigint FSK. But this unique identifier single column bigint FSK does not have to be an FK, it can be a non-clustered unique index.

The second purpose of having a PK is to enforce not null. This is really not the function of the PK, but more of a requirement of a PK. A PK requires that the column must be not-nullable. So not-nullable is a property of the column itself, not a property of the PK. And we implement this non-nullability when declaring the column on the table DDL.

We need to bear in mind that a PK has nothing to do with clustered or non-clustered indexes. SQL Server will automatically implement a PK as either a unique clustered index (UCI) or a unique non-clustered index (UNCI), depending on whether a clustered index already exists.

The Clustered Index

A clustered index does two things:

  1. Sorting the table physically
  2. Act as the row locator in non-clustered indexes

Point a) is for the performance of the queries. If we don’t partition a periodic snapshot fact table on the snapshot date, the next best thing is to cluster the table on the snapshot date.

But point a) is also for the performance of the update and delete. It is rare, but in some cases we need to update a periodic snapshot fact table (PSFT). So far I only found 1 case where I need to update a PSFT, out of about 150 PFSTs that I have created over the last 10 years. When updating fact table, it is absolutely crucial that the partitioning key, or the clustered fact table if you don’t have it partitioned, to be on the business date, plus the columns used as the joining criteria between the fact staging table and the PSFT. The clustered index should not be on the query criteria columns. It is the job of the non-clustered index to support the query.

Point b) means that the narrower the clustered key, the smaller the non-clustered indexes. Some people think that the narrow clustered key means that the non-clustered index will also have better performance but in my opinion and observation this performance increase is negligible.

So that’s the clustered index. For an insert-only PSFT we put the clustered index on the snapshot date plus the query criteria column to support the query performance. For an insert-and-update PSFT we put the clustered index on the joining criteria of the update statement.

For example, if the joining criteria of the update is snapshot date key + order ID (a degenerate dimension, the PK in the source table), whereas the query criteria is snapshot date key + account key, then for insert-only PSFT the clustered index should be snapshot date key + account key whereas for update PSFT the clustered index should be on snapshot date key + order ID.

The join SQL engine takes when updating the fact table depends on not only the clustered index of the PSFT, but also on the clustered index of the fact staging table (the source of the merge). If we do use the Merge command, we should convert it to update & insert commands. See my article here (link) about the Merge command’s inefficiency.

Partitioning

We can replace the physical ordering functionality above with partitioning. It is common and it make sense to partition a PSFT, I agree. But I would recommend not to partition the fact table when we create it, but later on. This is because of these reasons:

  1. We should not spend the time unless it is required, and when we create the fact table we don’t yet know if the partitioning is required.
  2. When the table is populated and queried we can learn about its condition and behaviour. Much, much better than imagining. For example, is the partitioning required to support query performance, or loading performance?
  3. We may have purging on the fact table, limiting the volume so that we don’t need to partition it because the performance is OK.
  4. We may need to create a new column for the partitioning key.
  5. When we create the fact table, we don’t yet know how the data will be distributed and queried. Purging can change the data distribution. We don’t know the query pattern, for example, 90% of the queries might be on today’s data so we should put it into a separate partition.
  6. Point e above dictates the partition maintenance strategy, i.e. whether we have partitioning functions splitting the table into 10,000 partitions or to 100 partitions with a “sliding window” maintenance. At the creation time, we have very limited knowledge of this.

Non Clustered Indexes

Each surrogate key is ideally indexed. Not combined as 1 index, but as separate indexes. All as non-clustered indexes (NCI). Say we have order date key, customer key, product key, store key, ship date key. Then we create 1 NCI on order date key, 1 NCI on customer key, 1 NCI on product key, 1 NCI on store key and 1 NCI on ship date key.

We should not combine these NCIs into 1 NCI because the second, third, and forth column of the combined NCI will not be used. Because their ordering is not sequential.

See also two of my articles which are related:

  • Indexing Fact Tables in SQL Server (link)
  • Primary Key and Clustered Index in the Fact Table (link)

12 September 2015

EAV Fact Tables

Filed under: Data Architecture,Data Warehousing — Vincent Rainardi @ 4:20 am
Tags: ,

A few weeks ago I came across EAV fact tables. EAV is Entity Attribute Values data model (read this for a background on EAV structure). It is a data model which enables us to add column into a table without actually changing the structure. At first I thought this EAV approach has no place in Kimball dimensional model. But after thinking and rethinking I saw that it had some advantages too, not just disadvantages. And in some cases it is appropriate to be used.

So below I’ll explain what it it is, the advantages, the disadvantages and the verdict about it.

What does it look like?

An EAV Fact Table looks like this:

What does an EAV Fact Table look like

The above example is taken from retail industry (a fashion retailer), where they analysed the profitability of each product line every day. What is a product line? They have 9 product groups: women, men, kids, shoes, handbags, accessories, watches, jewelry, beauty. The women product group consists of 23 product lines: dresses, tops, jumpers, cardigans, shirts, T-shirts, blazers, capes, jackets, coats, skirts, shorts, trousers, jeans, leggings, tights, socks, nightware, lingerie, leisureware, swimware, suits, new.

The above EAV fact table is product line performance fact table. Every day, based on the sales figures, profit margins, direct costs and overheads, they calculated various performance measure for each product line: sales growth, profit margin, product distribution, margin stability, cost effectiveness, price variation, colour variation, size variation, style variation, etc. TypeKey 217 means 1 week sales growth, 218 means 1 week margin stability, 219 is product distribution, and so on. Some measures are time-based, so they have periods such as 1 day, 1 week or 1 month. Some measures have 2 versions: net and gross. Some measures have location variations i.e. local and global. Similar measures are grouped.

Performance measurements are different for each product line. Measurements applicable for a product line may not be appliable to other product line. Using EAV structure fits the structure of performance measurement data, and makes it flexible. Because there are so many product lines with so many different performance measures, almost each week they have a new performance measure. This is because in the business analytic software they can create a new measure at any time. They define the formula for that measure, and on which product lines the measure are applicable, the different range of time periods applicable to that new measure, whether it is gross or net measurement, and whether it is global or local based.

What’s bad about it?

The main down side of EAV fact tables is: when we query the fact table we may need to pivot it. If we need to retrieve the weekly margin stability for every product lines, we could just filter on Performance Type Key = 128 and we get what we need. But if we need to retrieve all time variances for sales growth (1 week, 2 weeks, 1 month, etc) for a certain product lines, then we will need to pivot the data.

This pivoting is annoying because we have to hardcode the performance type names to make them as columns. At times it could be so annoying that we wished we had a normal Kimball style fact table so we didn’t have to pivot the data.

What’s good about it?

In Kimball star schema, in the fact tables the measures are created as fixed columns. If we have a new measure, we will need to create a new column. If the retail analytics software is churning out new measures almost every week, our development cycle will not be able to cope with the pace. In a mature warehouse, it will probably take about a month to complete the analysis, design, development, testing and release process, just to add 1 column. Equal to probably 10 man days.

If the employee cost rate is $550 per day (sum of salary, bonus, medical, dental, vision, relocation, life insurance, accident insurance, training, pension, cost of HR staff, childcare, cost of facilities, and 401k for a year, say $120k, divided by 250 working days per year minus 30 days vacation and sick time) that 1 column would cost $5500. It is very not cost effective.

Having a flexible structure like the EAV fact table means that we don’t have to worry about new performance measure churned by the retail analytic software almost every week, ever. That saves $5500 per week, or $275,000 per year, which is a lot of money.

Final verdict

If new measures are created quite often (more than once per quarter) and the measures are different for every product lines (or every customer, or other dimensionality such as instrument or asset type), then the EAV Fact Table approach is justified.

If the measures are quite stable (changes are less than once per quarter), and the measures are the same for every dimensionality, then an EAV Fact Table is not justified. We should build it as a normal, Kimball-style fact table.

But that is just my opinon. You could have different experiences, and therefore different opinions. Hence I would like to hear your comments in this topic.

31 May 2015

Responsibilities of Data Architects

Filed under: Data Architecture — Vincent Rainardi @ 7:49 am

What does a data architect do? Designing tables? That was what I thought at first. But it turned out that it’s not just that. The main responsibilities of a data architect are:

  1. Data modelling and analysis (45%)
  2. Data loading and delivery (25%)
  3. Data availability, performance, security (15%)
  4. Data quality & governance (10%)
  5. Other (5%)

These percentages are applicable if you are a data architect in an “end user” organisation i.e. a bank, health care company, insurance company, retail companies, telecommunication company, airline, etc. The percentages are not applicable if you work as a consultant, meaning that you are a data architect in an IT consultancy firms such as TCS, Accenture, Cognizant, IBM. See under the “Other (5%)” section below for details.

1. Data Modelling and Analysis

  • Understand the data
    It is the most important duty of a data architect to understand the data in the organisation. If he works in an asset management, he needs to understand what is yield, spread, duration, etc. Without understanding the data well, one cannot hope to function well as a data architect.
  • Data modelling
    It is not just about designing tables. But also XML files and schemas (XSD), OLAP Cubes, format of interface files, etc. Not just relational, but also dimensional. Not just tables, but also indexes, logs, partitions, database and data servers. Triggers, CDC, audit too. Physical data modelling / PDM as well as logical / LDM. Let’s be honest, we rarely dwell into the physical aspect. We often ignore the physical make up of a database. We usually leave it to the DBA.
  • Data Analysis
    We often leave data analysis to the Business Analyst. My job is to design the data model, but the content of it is not my business. That’s the usual attitude of a data architect. This is wrong. We as a DA needs to understand where everything is. When we join a company as a starter, we need to spend a great deal of time understanding where the supplier SIC is, what the branch code is made up of, how many joins are required to query the historical data, how the profitability is stored, how the customer ID is generated, etc.

2. Data Loading and Delivery

Data Loading

The data doesn’t get there by itself. It is delivered into the tables by some processes. It is our job to understand the exact details of the data loading mechanism. Understanding the batch architecture, understanding what each step of the loading process does. Whether they are using Active Batch or Autosys, we need to understand how it works. How to create an conditional alert, a loop, monitoring loading time, etc. It’s true that the operational aspect of it is the job of production support, but the architectural design of the data loading process is our responsibility as a DA.

A data architect which only does data modelling and doesn’t understand about the data loading process is like a blind man. When and how the data is loaded into the data store affect the data significantly. For example, if a large, static table is constantly be truncated and reloaded every day, would you be quiet about it? If the staging table is thousand times bigger than the main target table, reaching 4 billion rows, would you not investigate it? If the loading SP is using merge causing nested loop of 1 billion rows, would you just sit and watch? We need to be fully aware of all these things.

We as a data architect need to control the purging process, the loading process, index maintenance, log maintenance. If you work as a DA in SQL Server and somebody mention Olla Hallengren and you raised your eye browse asking “Who is that?”, you can’t claim yourself to be a good data architect. We have to understand how the databases are maintained, and how the data is maintained. That’s the minimum. After understanding it, we need to improve it. Offline backup, restore process, common staging area, reuseable mappings and lookups, capacity planning, we need to improve all that.

Yes we work with the DBAs on these, but we can’t leave it to the DBAs. Does a DBA care about reusable Lookups? Do they even understand it? Yes we work with Production Support team on these, but we should not leave it to them.

Data Delivery

Data loading is about piping the data into the data stores. Data delivery is about piping the data out from the data stores, into the business applications. O yes, we as a data architect need to understand and improve this process as well. If the data delivery is not efficiently done, we need to improve it. Six different applications may be requiring the same sales or costing data and they all do 10-join query involving huge tables, hundreds of times a day. Wouldn’t it be better if we create a different structure (additional structure) to store the data? A DBA would only look at the indexes, but they don’t have the knowledge about the data structure to be able to suggest the additional tables. We do!

Hence it is our job to make the data delivery as efficient as possible. How many reports are accessing this particular area of the database? Should we create a flatten table to support those main reports? Why are those reports using views, not stored procedures? (SPs can have temp tables, loop, passed-in-parameters; views can’t)

Architecting

After we are in control of the data loading and data delivery and data maintenance processes, it is only then we can attempt to improve the architecture. You see, architecture is not just about the structure of the database, but also about designing the processes and mechanism that load and extract the data, into and from the databases.

Architecting the data flow means seeing it from ten thousand feet. At this height we can see things like:

  • Why there are 2 jobs pushing promotional sales data into the same set of tables?
  • At 7 am new data is being loaded into those main tables, whilst at the same time the cube processing are reading from those table causing locking. No wonder we have time outs.
  • Many reports and processes read the relational warehouse directly, shouldn’t they be reading the dimensional marts, the flatten structures?
  • Should we shield the structure by implementing views? That would make the query simpler, but also good for security (maintaining permissions)
  • If many applications need FX rates, should we load that first and signal the consuming systems when it’s ready?

We design the data loading and data delivery process as efficient as possible. That’s architecting!

3. Data Availability, Performance, Security

Data availability

Data availability is not just about high availability e.g. the databases should be available and accessible 99.99% of the time. It is also about the timing: it is no good if the position-level performance data or the sales data is available at T+2 (two days after the event). We should make it available on T+1 and if possible on T using intraday load.

The sooner the data is available to the end users the better it is. We have to always look at the data availability from the end user point of view. We have to think about the data delivery all the way to the end users who consumes it, not just half way into the Data Warehouse. For example, it is no good if we get the purchase data (or the risk data, or any other data relevant to your industry) into the Data Warehouse on the same day, but the report delivers the risk data to the end user on T+1 because it was waiting for rating data from a vendor.

Data availability is also about retrieving old data. Consider this conversation: User: can I view the sales data in that region for that product group in 2003? IT: I’m sorry Sir, it’s not available in the database, we have to load it first as the database only has data from 2005. User: ok, how long is it going to take to load? IT: about 4 hours. User: So I can view it tomorrow? IT: No Sir, because we have to bring the tape from Hounslow into here (Chester) and it will take a week. User: ???
And the following week: IT: the 2003 data is ready Sir, you can view it on the screen. User: Why there is only 2003 data? Where is 2004 to 2015 data? IT: Oh, the archive files are arranged by year. If we load 2003 data, we lose all the other year. User: ???
Or, something like this: User: why the 2003 data doesn’t have London region? It only has South East. In 2007 we divided South East into 4 regions. IT: That regional classification didn’t exist in 2003 Sir, and this is 2003 data. User: I know that, but can’t you load the 2015 regional classification? IT: No we can’t. User: Then this data is useless! We need to be able to analyse this 2003 sales data by current product grouping, and by current regional divisions, so we can compare it with the this year’s sales performance!

Query and Load Performance

This is about how performant is the query which retrieves the data from the data store, how fast the reports run, how long does the data export take, how long does the cube processing take, how long does the data load/upload/download take, how long does the data publishing take.

Structures like code-decode tables (link), rows with begin and end dates, etc could hamper the performance of the query. They may be suitable for certain type of query, but absolutely useless for another type of query and causes huge performance problem. As a data architect we need to take care performance issues like these.

It’s not just about the good days. More importantly, it is about the bad days. If during the morning data load, the database server has a problem, when will the data load finishes? 9 am? 1 pm? That’s what the business user wants to know. Do you have a secondary server automatically taking over the data loading?

The business users expect the Cognos Reports or Hyperion Cubes to work within a 1-2 seconds, day-in, day-out. If it takes 1-2 minutes to run a report or a cube query, that’s not good. Business users concern about the performance of what affect them. Not how many milliseconds the query runs! They don’t want to know if it is because of out-of-date statistics, missing indexes, incorrect partitioning, cartesian joins, badly written code, nested loop, locking, latching or dead lock. They do not care about the technicalities of it! If we have to put a secondary server, RAID 1+0, Fiber Channel SAN, Amazon Redshift or Hadoop – so be it. It is our job (data architect) to ensure that the Reports and Cubes respond time is always 1-2 seconds.

And if we don’t know what Redshift is, if we don’t know who Claudera are, if we don’t know how Twinfin FPGA works (or Oracle RAC), then we are not in the business, we are not good Data Architects. We as Data Architects need to know a great deal about stuff that influence the performance.

Data Security

This is not “oh tell that to Info Sec” debate. We Data Architects work with Information Security people. We work with Compliance guys. We work with Infrastructure guys, with DBAs. We work with Production Support team and Helpdesk team. We work with external penetration test consultant. We work with anybody required to deliver data security, i.e. to ensure that only the relevant users can access relevant data. To ensure that nobody in the organisation (and outside!) can see something that they are not supposed to see.

Reviewing database accounts, reviewing the way the security grouping are arranged, how the database roles are structured (group within group, group hierarchy, etc), Kerberos authentication, view layers, cube roles, database permissions, segregation of network between Dev and Prod, etc. all of these are our reponsibilities as a Data Architect.

It is not just the databases, but also data on the web portals, data we receive/send on files, maintenance of the security grouping (usually via a web interface), system accounts used in data loads, system accounts used by reports, segregations of permissions in the applications, and the security of data in the archive and backups.

We have to understand the concept of Chinese Walls, extranet security, penetration testings, data masking, legallity of data retention periods and scope, public key cryptography (especially in retail banking sector) and security authentication process.

In retail banking sector, the customer data obtained and retained by Singapore, Switzerland, Luxembourg branches may not be viewable by the bank employees in the US, UK or France. In the health care sector, the personal health data obtained in UK, Germany, Australia, and US should not be used or disclosed “in a form which can identify the patient”, without their consent. But the business may want to analyse the data to find the trend at regional and global level, without identifying individual bank account or patient.

We as a Data Architect needs to find a solution to make this happen, for example using data masking, encryption, or nulling out. In retail industry for example, we could be storing customers’ credit card numbers but replace the first 12 digits with XXX so that the Call Centre users can’t view the complete numbers. In banking for example, the name of certain accounts may be blanked out. But because of the sector, country and size of the account, users are still able to identify the account, for instance there might be hundreds of large petrochemical companies in China but there’s only one large cocoa plantation in Peru.

4. Data Quality and Governance

This is not just about implementing data quality software like Trillium and IDQ to improve the quality of customer data. It is the whole framework of data quality. It is about defining the rules to certify the quality of the data, data profiling and clensing, creating workflows for DQ monitoring and exception handling, establishing Data Stewards who identify the business users who need to fix incorrect data, implementing tools for data reconciliation and data lineage, “entity” resolution for Single Customer View, the list goes on.

In Asset Management sector, if the country, currency, or classification of a position is not set, the breakdown of a portfolio will be reported incorrectly. In corporate banking, identifying an “entity” is crucial in managing derivative transactions and could be done in waterfall manner using Markit RED (Reference Entity Database), MIN (Moody’s Issuer Number), ISIN/CUSIP/SEDOL of issues, Reuter/Bloomberg Ticker of the security, and Entity Legal Name + Country of Incorporation.

The reporting of data quality is also Data Architect’s responsibility, i.e. how many breaks this month, how good/bad it is compared to last month, which DQ rules was violeted, how much over the tolerance limit were the breaks, does it affect the top level numbers, was it caused by the raw data coming from the source or caused by a bug in our systems, how many days did it take to be fixed, has it been previously identified and reported, etc.

Common DQ issues in banking are:

  • Duplicate records, i.e. two securities with the same ISIN
  • Inclusion/exclusion of outstanding trades
  • Incorrect instrument attributes
  • Ambiguity in identifying where yield or duration is sourced from, and how it was calculated
  • Timing of NAV which changes due to single swing price adjustments
  • Incorrect risk calculation due to incorrect instrument attributes

Data governance is difficult to get funding if it is too “blue sky”. It should focus on things that quickly bring business benefits, such as: completeness of customer data, business rules to check data accuracy, ensure that key fields are maintained.

5. Other

It is part of a Data Architect role to deal with the non-technical aspects, ie. Management, dealing with “politics” i.e. resolving conflict of interest (with regards to data) between different groups in the organisation. In a good organisation this is only 5% of the time, but in reality, in many organisation, this “management and politics” work could easily become 30% of our time, making us unable to deliver the 4 responsibilities above with good results.

Data Architects in IT Consultancy Firms and Vendors

I mentioned in the beginning that the percentages are not applicable if we were a data architect working for an IT consultancy firm such as TCS, Accenture, and Cognizant. Or a if we were a data architect working for a database vendor, a BI/DW vendor or a data integration vendor such as IBM, Teradata, SAS, and Informatica. This is because we also have an additional responsibility to help sell data-related projects to the clients. This includes pre-sales activities such as preparing proposals, drafting potential solutions, conducting presentation, preparing demonstrations, proof of concepts, preparing business case including setting up the data, conducting research, answering RFPs particularly the technical pages, share customer feedback with the product development team and marketing team, assisting costing calculation.

Data architects working for an IT consultancy firm or vendor do not normally do sales or account management, but they could be asked to help with support work for certain accounts. Usually as a third line support (the first line being the helpdesk/technical support team, and the second line being the developers/product development team, or training team in certain cases). This support work is usually because they were working on that project before.

And then after they are deployed in the projects, their role become similar to DAs in the “end user” organisations, but with extra time doing the “internal agendas” (work not for the client but for the consultancy firm or vendor itself), for example: working on other proposal or presentation to other client.

The percentages could be something like this:

“In the office” time: (aka “on the bench” time)

  • Proposal 60%
  • Presentation, demo, PoC 20%
  • Research & documentation 10%
  • Support 5%
  • Management 5%
“In the project” time: (depends on what project, e.g. DQ project is different to BI project)

  • Data Modelling & Analysis 40%
  • ETL Design / Data Loading 30%
  • Data Quality Framework 20%
  • Management 5%
  • Internal agendas 5%
« Previous Page

Blog at WordPress.com.