Data Warehousing, BI and Data Science

16 June 2010

Data Architect, Data Warehouse Architect and BI Solution Architect

Filed under: Other — Vincent Rainardi @ 7:18 am

What is the difference between a data warehouse architect and a data architect? How about BI solution architect?

Both a data architect and a data warehouse architect do data modelling, using tools such as ERwin or Embarcadero ER/Studio. But a data warehouse architect specialises in designing the data model for a data warehouse, whereas a data architect specialises in designing the data model for databases used by transactional systems.

A data warehouse architect does a lot more than just data modelling. They also do the ETL and the infrastructure. These are areas that a data architect doesn’t normally cover.

For a data architect to be able to call themselves a data warehouse architect, they don’t only need to learn how to create a data model for a data warehouse (as in dimensional modelling); they also need to understand the ETL architecture. And they need to understand the servers. For example, they need to be able to write the specification for the production data warehouse servers, i.e. CPU, memory and disks, and other server matters like clustering, mirroring and DR. And they need to understand the physical database side too, like table partitioning, filegroups and materialised views.
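To make that physical database side concrete, here is a minimal SQL Server sketch of a fact table partitioned by date key across separate filegroups. All object and filegroup names are illustrative, not prescriptive:

    -- A minimal sketch (SQL Server): monthly partitions mapped to separate filegroups.
    -- Function, scheme, table and filegroup names are illustrative.
    CREATE PARTITION FUNCTION pfFactDate (int)
    AS RANGE RIGHT FOR VALUES (20100101, 20100201, 20100301);   -- yyyymmdd date keys

    CREATE PARTITION SCHEME psFactDate
    AS PARTITION pfFactDate TO (FG_Archive, FG_2010_01, FG_2010_02, FG_2010_03);

    CREATE TABLE FactSales
    ( DateKey     int   NOT NULL
    , ProductKey  int   NOT NULL
    , SalesAmount money NOT NULL
    ) ON psFactDate (DateKey);   -- the fact table is partitioned on the date key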

In my book I explain that there are two sides to data warehouse architecture: the logical architecture and the physical architecture. A data warehouse architect needs to understand both.

A data warehouse architect, in my opinion, needs four separate skills: data architecture, ETL, database platform and physical infrastructure. By “database platform” I mean SQL Server knowledge, Oracle knowledge, Teradata knowledge, Netezza knowledge, etc. For example, “SQL Server 2008 Data Warehousing Features” is a must-know for a DW architect on SQL Server, whereas UPI and USI are must-knows for a DW architect on Teradata. If we design a DW on Oracle 11g R2, we need to know the Oracle-specific DW features, such as initialisation parameter settings, partition-wise joins, compression, parallel execution, etc.

A BI architect works more on the application side, as in SharePoint architecture, Hyperion architecture, Reporting Services architecture and Analysis Services architecture. For example: a BI system where we have Reporting Services running on SharePoint, plus Excel Services and PerformancePoint Services on SharePoint, and SSAS cubes reading from a warehouse or mart. On top of that, some custom .NET coding for authentication or customised security, plus some self-service BI using QlikView or PowerPivot.

Back to the data warehouse architect: the ETL aspect of the job is sometimes quite demanding. There is something called ETL architecture in warehousing, which is basically:

a) the architecture of the overnight batch, i.e. the structure of the tasks and workflows, the execution order of the workflows, the backup, the reconciliation, the checking and alerting, and the data quality. The overnight batch is not only about data loading/ETL; it also has data-serving elements, i.e. pre-processing reports (stored as PDFs, ready to be served for split-second response times) and refreshing the OLAP cubes.

b) the architecture of the continuous feed throughout the day for real time warehousing

c) the physical infrastructure of the ETL, i.e. the servers, the databases, the data connections

d) the methods to extract and load the data, i.e. sliding window, swap partition, flipping twin tables, incremental extraction, change data capture, change tracking, (filtered) replication between the warehouse and the mart (or mirroring), and how to extract data from cubes (a partition-switch sketch follows this list).
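As an example of the “swap partition” method, in SQL Server a loaded staging table can be switched into the fact table as a metadata-only operation. A minimal sketch with illustrative table names and partition number:

    -- A minimal "swap partition" sketch (SQL Server). StageSales must match the target
    -- partition's structure, indexes and filegroup, with a check constraint on its boundary.
    ALTER TABLE StageSales SWITCH TO FactSales PARTITION 42;   -- metadata-only, so near-instant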

If a data warehouse architect only understands dimensional modelling, the company will have problems in the ETL and infrastructure. The servers might not be fit for purpose, for example the disk configuration may not be optimised for warehousing.

How about “BI solution architect”? A data warehouse is the back end; Business Intelligence is the front end. Data warehousing is about the data model, the ETL and the databases. BI is about the reports, the OLAP cubes, the analytical applications, the data mining, the KPIs, the dashboards, the scorecards and the performance management. It is a common perception that a BI solution architect is a “front end” person, an application person. This perception is incorrect.

A “solution architect” is responsible for the whole solution, not only the front end but also the back end. In the case of a BI solution architect, he or she needs to look after both the BI front end (reports, cubes, performance management, etc.) and the back end (data warehouse, ETL). It is impossible for a BI solution architect to be responsible for the BI without looking after the data warehouse. In some companies, instead of calling the role “BI solution architect”, it is called “data warehouse solution architect”. That doesn’t mean the role is only responsible for the back-end data warehouse; it is also responsible for the BI front end. Some companies have not just one but several BI solution architects, each responsible for a certain area.

There are several different types of architect in IT, for example: system architect, data architect, solution architect, information architect and enterprise architect. This segregation of duties only happens in very large organisations (enterprises), for example a banking group with 3,000 IT staff. If there are only 100 IT staff, one architect usually performs multiple functions. A system architect is responsible for infrastructure, including networks and servers. I’ve already mentioned the data architect and solution architect. An information architect is responsible for the flow of information throughout the enterprise, including databases and data quality. An enterprise architect is responsible for all the applications in the enterprise, making sure they run in sync and adhere to the standards.

Scheduled Deploy in SSAS

Filed under: Analysis Services — Vincent Rainardi @ 7:13 am

Just a quick note to share one point about deploying changes to SSAS cubes in a “scheduled” way, as opposed to the “deploy from BIDS” way. We want to deploy some cube changes, but we can’t do that online from BIDS during the day because (for example) it takes a long time to fully process the cube and the users can’t tolerate the cube being unavailable for that long (say 2 hours).

Say your cube* is processed “full” daily at night, after the warehouse is loaded, and you want to deploy the cube changes tonight, just before that nightly “cube refresh”*.

*when I say “cube” here I mean “SSAS database”. And when I say “refresh” here I mean “process”.

To do this, we can use the SSAS Deployment Wizard. It’s not in BIDS; it’s in the SQL Server branch of the Start menu. Click Start, All Programs, Microsoft SQL Server 2008, Analysis Services, and there you will find it (the wizard).

In this wizard, you first need to select the database file (the extension is .asdatabase). This is in your local copy of the solution files, so for your “Sales” cube go to (for example) the c:\data\SSAS projects\Sales\Sales\bin\ folder and select Sales.asdatabase.

After selecting the database file and navigating through the wizard, you’ll be offered these choices:

a) change the data source connection (from Dev SQL Server to Prod for example)

b) whether you want to deploy the roles and members or not, and

c) whether you want to deploy the partitions or not.

After that it asks whether you want to script the deployment. Choose yes and you will get an XMLA file containing the cube definition. The default location of this file is the same folder as the .asdatabase file, but you can change it to any folder you want. Oh, one more thing: the wizard also asks how you want to process the cube (i.e. full, default, or not processed).

Copy and paste this XMLA into a step in a SQL Agent job and you can schedule it to run at a certain time tonight or after midnight (it’s called morning if it’s after midnight, right 🙂). Or you can modify an existing job that processes the cube, inserting one new step containing this XMLA; in this case, in the wizard you should choose “don’t process the cube”. A sketch of such a job step follows.
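For illustration, a new step could be added to an existing job with something like the sketch below. The job name, step name and server name are hypothetical, and @command holds the XMLA produced by the Deployment Wizard:

    -- A minimal sketch of adding the wizard's XMLA as a step to an existing SQL Agent job.
    -- Job name, step name and SSAS server name are illustrative.
    EXEC msdb.dbo.sp_add_jobstep
         @job_name  = N'Nightly Cube Refresh',    -- hypothetical existing job
         @step_name = N'Deploy Sales cube changes',
         @subsystem = N'ANALYSISCOMMAND',          -- the SSAS command subsystem runs XMLA
         @server    = N'MyOlapServer',             -- the target Analysis Services instance
         @command   = N'<!-- paste the XMLA produced by the Deployment Wizard here -->';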

If your existing “cube refresh” job does a “process default” on all the dims and then a “process full” on certain partitions*, you probably want to disable those “refresh cube” steps and specify “process full” in the Deployment Wizard.

*the ETL stores information about which partitions changed / need to be processed

That’s it, that’s how we can do a “scheduled deployment” for an SSAS cube, i.e. put the XMLA output into a SQL Agent job.

Who Uses What Cubes and When (The Data Mart Approach)

Filed under: Analysis Services — Vincent Rainardi @ 7:07 am

Background

A few months ago I wrote about “who are using the cube right now” [link]. Perhaps we are trying to find out who is making the server so slow, and what query they are executing.

Today the question is slightly different. We want to know who uses what cubes and when. We want to know not only at present, but over a period of time. There are three different things here: the who, the what, and the when.

The who: how many users do we have? Who are they? Which department, town and country are they from? Which product families or services are they responsible for? What are their roles (job titles)? Do we have their email addresses?

The what: which cube do they use? Which dimensions? Which attributes? Which measure groups?

The when: on which day do they use the cubes? At what time, morning or afternoon? How many times a month? How long do they use it for? 1 hour a week, or 10 hours a month? Heavy usage or light usage?

The Data Mart Approach

We are going to answer all the questions above using a data mart. We create fact and dimension tables. We populate them with data from the query log table. We present the data in this mart to a cube. We can then browse this cube, to find out who uses what and when.

Well, that’s the idea. It’s quite simple, isn’t it? I like simple ideas (complex ideas don’t usually work). But putting an idea into reality is never simple. There are always issues, but issues are there to be solved.

This idea of using a data mart to store system usage is not specific to OLAP systems. It’s a generic idea, and it has been used in many technology areas for a long time. I first noticed it when I was working for a large ecommerce company. The web log stores who accesses what and when. A web analytics system then harvests this web log and stores the information in a data mart. It then presents this information nicely in the form of charts and web site statistics. Actually I came across it before then, when I was working for another company in ecommerce development. We used WebTrends, which works in a similar way.

So, let’s start. We’ll create the data mart, populate it with data from the query log, then create the cube.

Creating the Data Mart

Dimensions:

There are five dimensions in the mart: date, time of day, user, measure group and attribute.

  1. Date dimension: contains all dates from when the measurement began until 10 years from today. Grain: date. Columns: date key, date, month, quarter, year, day of the week, week number, and weekend/weekday indicator.
  2. Time of day dimension: contains the time in 1-minute intervals within a 24-hour period. Grain: minute. Columns: time key, time (hh:mm, 24-hour format), 15 minutes, half hour, hour, day part. The time of day dimension is important for global cubes as it enables us to understand the peaks and troughs of the usage. It is useful for server maintenance, cube deployment schedules and downtime planning. Day part is in 6-hour intervals, i.e. 1 is midnight to 6 am, 2 is 6 am to 12 pm, 3 is 12 pm to 6 pm and 4 is 6 pm to midnight. You could also have 2- or 3-hour intervals if that suits you better.
  3. User dimension: contains every single user in the organisation, regardless of country. Grain: Windows user account. Columns: user name, first name, last name, email address, team, department, office, country, job title, and the product line or services they are responsible for.
  4. Cube dimension: contains every measure group in every cube. Grain: measure group. Columns: measure group, cube, database. If we have 3 databases, each with 2 cubes, and each cube has 4 measure groups, then the Cube dimension will have 3 x 2 x 4 = 24 rows.
  5. Attribute dimension: contains every visible and enabled attribute in every dimension of every cube. If we have 3 databases, each with 1 cube, each cube has 5 dimensions, and each dimension contains 4 attributes, then the Attribute dimension will have 3 x 5 x 4 = 60 rows. Grain: attribute. Columns: attribute, dimension, cube, database. The Attribute dimension does not contain hierarchies; it only contains attributes.

Fact table: only one fact table, the fact query log. It contains the count and duration. Grain: 1 row for each row in the query log. Columns: fact key, 5 dim keys (1 for each of the above 5 dims), count, duration. A DDL sketch follows.
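Here is a minimal SQL Server DDL sketch of the cube dimension and the fact table (the other four dimensions follow the same pattern); all table and column names are illustrative, not prescriptive:

    -- A minimal DDL sketch of the mart. Names are illustrative.
    CREATE TABLE DimCube
    ( CubeKey      int IDENTITY(1,1) PRIMARY KEY
    , MeasureGroup varchar(100) NOT NULL
    , Cube         varchar(100) NOT NULL
    , [Database]   varchar(100) NOT NULL
    , ObjectPath   varchar(400) NOT NULL   -- business key, used to match the query log rows
    );

    CREATE TABLE FactQueryLog
    ( FactKey      bigint IDENTITY(1,1) PRIMARY KEY
    , DateKey      int NOT NULL
    , TimeOfDayKey int NOT NULL
    , UserKey      int NOT NULL
    , CubeKey      int NOT NULL
    , AttributeKey int NOT NULL
    , Hit          int NOT NULL DEFAULT (1)   -- one hit per query log row
    , Duration     int NOT NULL               -- duration as logged by SSAS
    );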

Measures:

  1. Hit: number of times the cube serves data to the user
  2. Duration: time taken to run the query

Populate the Mart

The main source for populating the mart is the SSAS query log. We could combine it with the SSAS Dynamic Management Views (DMVs) to get the resource data (CPU, memory, bytes read, connection duration), but that’s for another post. We could also combine it with the output from Profiler, but that too is for another post. Apologies, but I’ve got to limit the scope if I want to get this article published on the blog soon; otherwise it will become a never-ending story.

The query log provides information about who runs what and when. The query log table can become large very quickly, so after we load it into the mart we truncate the log table. I wouldn’t advise truncating it every day, as it is useful for operational purposes, i.e. knowing who’s accessing the cubes right now and what they are doing (see my post here [link]). A weekly or bi-weekly rolling window would be good, i.e. keep the data for 1 or 2 weeks, rolling.
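A minimal purge sketch, assuming the query log is written to the default dbo.OlapQueryLog table in a SQL Server database (the target set by the QueryLogConnectionString server property):

    -- Keep a two-week rolling window in the SSAS query log table.
    DELETE FROM dbo.OlapQueryLog
    WHERE StartTime < DATEADD(week, -2, GETDATE());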

Data sources for each dimension:

  • As per standard practice, the date and time of day dimensions are populated from a script (a minimal sketch follows this list).
  • The main source for the user dimension is Active Directory. Using LDAP programming we can enumerate all user accounts in a domain, along with their first name, last name, email address and which groups they belong to. The job title, department, office and country could be obtained from the intranet (company directory), or from an extract from the HR system. The matching key to join these two sources is usually first name + last name. The data from AD sometimes needs cleansing: some user accounts are not real users, e.g. they are system, machine or dummy accounts.
  • The cube and attribute dimensions could be populated using AMO. AMO can be used to enumerate the cubes, the dimensions and the attributes within an SSAS database. AMO’s Server class contains a collection of all the databases on that server.
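Here is a minimal population sketch for the time of day dimension (the date dimension follows the same pattern); the table and column names (DimTimeOfDay, TimeKey, DayPart, etc.) are illustrative:

    -- Populate the time of day dimension with 1,440 one-minute rows.
    ;WITH Minutes AS
    (   SELECT TOP (1440) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS MinuteOfDay
        FROM sys.all_objects a CROSS JOIN sys.all_objects b
    )
    INSERT INTO DimTimeOfDay (TimeKey, [Time], FifteenMinutes, HalfHour, [Hour], DayPart)
    SELECT MinuteOfDay
         , CONVERT(char(5), DATEADD(minute, MinuteOfDay, 0), 108)   -- hh:mm, 24-hour format
         , MinuteOfDay / 15
         , MinuteOfDay / 30
         , MinuteOfDay / 60
         , MinuteOfDay / 360 + 1      -- day part: 1 = 00:00-06:00, 2 = 06:00-12:00, and so on
    FROM Minutes;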

The fact table (the measure group) is populated from the OLAP query log table. You need to set the query log settings (SSAS server properties) so that each and every query is logged (as opposed to 1 in every 10 queries, which is the default sampling), so that the data mart contains all the queries hitting the AS server.
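As a rough illustration, the nightly load from the query log into the fact table could look like the sketch below. It assumes the default OlapQueryLog table (with MSOLAP_User, MSOLAP_ObjectPath, StartTime and Duration columns) and the illustrative dimension tables sketched earlier; matching the cube dimension on the object path is an assumption about how you choose to key it.

    -- Load the fact table from the SSAS query log; dimension lookups are simplified.
    INSERT INTO FactQueryLog (DateKey, TimeOfDayKey, UserKey, CubeKey, AttributeKey, Hit, Duration)
    SELECT d.DateKey
         , t.TimeKey
         , u.UserKey
         , c.CubeKey
         , -1                 -- attribute handling is discussed below (the "101010" Dataset column)
         , 1                  -- one hit per query log row
         , q.Duration
    FROM dbo.OlapQueryLog q
    JOIN DimDate      d ON d.[Date]     = CONVERT(date, q.StartTime)
    JOIN DimTimeOfDay t ON t.TimeKey    = DATEDIFF(minute, CONVERT(date, q.StartTime), q.StartTime)
    JOIN DimUser      u ON u.UserName   = q.MSOLAP_User
    JOIN DimCube      c ON c.ObjectPath = q.MSOLAP_ObjectPath;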

Creating the Cube

When creating the cube I found that the following hierarchies are useful in increasing the cube’s usability:

  • Date dimension: date – month – year hierarchy
  • Time of day dimension: hour – 15 minutes hierarchy
  • User dimension: department – user, product line – user, country – office – user hierarchies
  • Cube dimension: MG – cube – database hierarchy
  • Attribute dimension: attribute – dimension hierarchy

Measures: both hit and duration are plain, simple measures using the sum aggregate function. I don’t find the need to create distinct counts. Both measures are in one MG, with regular relationships to the five dimensions.

That “101010” column

There is a column in the OLAP query log table that contains 1s and 0s, e.g. “101010”. I think the column name is “Dataset”. Each digit in this column corresponds to an attribute of a cube dimension (cube dimension, not database dimension). A comma separates one dimension from the next. For example: “100,1000,01000”. This means that the cube has 3 dimensions; the first dimension has 3 attributes, the second dimension has 4 attributes and the third dimension has 5 attributes. The order of the dimensions is as shown in BIDS (without sorting) or in an XMLA export.

The position of the 1s shows which attributes are used in the query. For example, “100,1000,01000” means that the query uses the first attribute of the first dimension, the first attribute of the second dimension and the second attribute of the third dimension. The order of the attributes within a dimension is as shown in BIDS (unsorted) or in an XMLA export.
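To make that concrete, here is a minimal T-SQL parsing sketch, assuming the Dataset column really is a comma-separated list of bitmaps as described above. It returns one (dimension ordinal, attribute ordinal) pair per attribute used:

    -- Parse a Dataset value into the attributes used in the query.
    DECLARE @Dataset varchar(500) = '100,1000,01000';

    WITH Tally AS
    (   -- numbers 1..LEN(@Dataset); for very long strings, cross join sys.all_objects to itself
        SELECT TOP (LEN(@Dataset)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects
    ),
    Dims AS
    (   -- one row per dimension bitmap, in the order the dimensions appear
        SELECT ROW_NUMBER() OVER (ORDER BY t.n) AS DimensionOrdinal,
               SUBSTRING(@Dataset, t.n, CHARINDEX(',', @Dataset + ',', t.n) - t.n) AS BitString
        FROM Tally t
        WHERE SUBSTRING(',' + @Dataset, t.n, 1) = ','   -- t.n is the start of a segment
    )
    SELECT d.DimensionOrdinal, t.n AS AttributeOrdinal
    FROM Dims d
    JOIN Tally t ON t.n <= LEN(d.BitString)
    WHERE SUBSTRING(d.BitString, t.n, 1) = '1'
    ORDER BY d.DimensionOrdinal, AttributeOrdinal;
    -- For '100,1000,01000' this returns (1,1), (2,1), (3,2), matching the example above.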

The above method enables us to understand which attributes are used by each query, i.e. using the positions of the digit “1” in the Dataset column for each dimension. We have four alternatives for presenting this information in a dimensional model:

  1. The key column on the fact table for the attribute dimension is a single surrogate key. The attribute dimension is like a “junk dimension”, i.e. it contains all possible combinations of attributes. For N attributes, we would need 2^N rows in the attribute dimension. For example, for 30 attributes we need 2^30 ≈ 1 billion rows. This approach is simple but it’s not workable for a complex cube with, say, 100 attributes.
  2. For each digit we have a key column, so the attribute dimension is “role-playing”, i.e. if we have 100 attributes in the cube we will have 100 key columns on the fact table. Note that the attribute dimension only needs two rows: “used” and “not used”; it doesn’t need a row for each attribute in the cube. This approach is quite rigid, i.e. if we have a new dimension we will need to change the structure. And it can’t help us answer the question: to which dimensions do those attributes belong?
  3. The third option is to “normalise” the attribute dimension, preferably into another fact table. If it’s in the same fact table, then the measures need to be proportionally allocated. What I mean by “normalise” is: we have only one key column, but (using the above “100,1000,01000” example) we have 12 rows in the fact table, one per attribute in the cube.
  4. The last option is to have a fixed number of key columns, i.e. column 1 is the first attribute used, column 2 is the second attribute used, and so on. The benefit of this approach is that it is easy to answer the question: which attributes were used in the query? The disadvantage is that some attributes could be missing (not stored), i.e. if we only provide 10 key columns but have 12 attributes in the query, we will lose the last 2. In this approach the attribute dimension has attribute, dimension, cube and database columns, so it is easy to answer the question: to which dimensions do those attributes belong?

All the above options aim to answer the question: which attributes are used in the query? And secondly: to which dimensions do those attributes belong? As usual, the best approach depends on the situation. In one project, option 4 seemed the most suitable as it can answer both questions. And although the cubes were quite complex (60-100 attributes), the majority of the queries were simple, i.e. 4-6 attributes, occasionally 8 and very rarely 10. Twelve attribute keys covered more than 99.9% of the queries. For the other 0.1% of queries, yes, we lost a few attributes, but we still captured 12. We also worked out a scheme for which attributes get low priority, so we don’t mind if they are not captured. This way we ensure that we capture the most important attributes.
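For illustration, under option 4 the single attribute key in the earlier fact table sketch would be replaced by a fixed set of attribute key columns, each pointing to the attribute dimension (attribute, dimension, cube, database). A minimal, assumed layout:

    -- Option 4 sketch: twelve attribute key columns on the fact table (names are illustrative;
    -- the single AttributeKey column from the earlier sketch would be dropped).
    ALTER TABLE FactQueryLog
    ADD AttributeKey1  int NULL, AttributeKey2  int NULL, AttributeKey3  int NULL,
        AttributeKey4  int NULL, AttributeKey5  int NULL, AttributeKey6  int NULL,
        AttributeKey7  int NULL, AttributeKey8  int NULL, AttributeKey9  int NULL,
        AttributeKey10 int NULL, AttributeKey11 int NULL, AttributeKey12 int NULL;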

Further Development

There are 2 things that we can do to further develop this concept:

  1. Combine with DMV
  2. Combine with SQL Profiler

The DMVs give us additional information such as the following (a minimal query sketch follows the list):

  • how much memory and CPU is used by each command and each user
  • which aggregates are used/hit and which are missed
  • which objects are locked and by whom
  • how much memory and CPU is used by each dimension and each measure
  • how much memory is used by each data cache and objects of SSAS engine
  • which cubes, dimensions, partitions and measures are used most and which are rarely used (how many times they are read, how many rows returned, how many KB read)
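As a starting point, the sketch below pulls session-level information from the SSAS DMVs into SQL Server. It assumes a linked server named SSAS_PROD pointing at the Analysis Services instance (the DMVs require SSAS 2008 or later); the column list is a typical subset of $SYSTEM.DISCOVER_SESSIONS.

    -- Read the SSAS session DMV from SQL Server via a linked server to the SSAS instance.
    SELECT *
    FROM OPENQUERY(SSAS_PROD,
        'SELECT SESSION_SPID, SESSION_USER_NAME, SESSION_CURRENT_DATABASE,
                SESSION_CPU_TIME_MS, SESSION_READS, SESSION_LAST_COMMAND
         FROM   $SYSTEM.DISCOVER_SESSIONS');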

Whereas SQL Profiler gives us the MDX (the “TextData” column on the query event), as well as information about the query duration, memory, locks and CPU time.

All this could add value to the information that we extract from the query log.

