A few months ago I wrote about “who is using the cube right now” [link]. There, perhaps, we were trying to find out who was making the server so slow, and what query they were executing.
Today the question is slightly different. We want to know who uses which cubes, and when. And we want to know this not just for the present moment, but over a period of time. There are three different things here: the who, the what, and the when.
The who: how many users do we have? Who are they? Which department, town, and country are they from? Which product families or services are they responsible for? What are their roles (job titles)? Do we have their email addresses?
The what: which cube do they use? Which dimensions? Which attributes? Which measure groups?
The when: on which days do they use the cubes? At what time of day, morning or afternoon? How many times a month? How long do they use them for: 1 hour a week, or 10 hours a month? Heavy usage or light usage?
The Data Mart Approach
We are going to answer all the questions above using a data mart. We create fact and dimension tables. We populate them with data from the query log table. We present the data in this mart to a cube. We can then browse this cube, to find out who uses what and when.
Well, that’s the idea. It’s quite simple, isn’t it? I like simple ideas (complex ideas don’t usually work). But putting an idea into reality is never simple. There are always issues. But issues are there to be solved.
This idea of using a data mart to store system usage is not specific to OLAP systems. It’s a generic idea, and it has been used in many technology areas for a long time. I first noticed it when I was working for a large ecommerce company. The web log stores who accesses what and when. A web analytics system then harvests this web log and stores the information in a data mart. It then presents this information nicely in the form of charts and web site statistics. Actually, I came across it before then, when I was working for another company, in ecommerce development. We used WebTrend, which works in a similar way.
So, let’s start. We’ll create the data mart, populate it with data from the query log, and then create the cube.
Creating the Data Mart
There are five dimensions in the mart: date, time of day, user, cube, and attribute.
- Date dimension: contains all dates from when measurement began until 10 years from today. Grain: date. Columns: date key, date, month, quarter, year, day of the week, week number, and weekend/weekday indicator.
- Time of day dimension: contains the time at 1-minute intervals within a 24-hour period. Grain: minute. Columns: time key, time (hh:mm, 24-hour format), 15 minutes, half hour, hour, day part. The time of day dimension is important for global cubes, as it enables us to understand the peaks and troughs of usage. It is useful for server maintenance, cube deployment scheduling and downtime planning. Day part is a 6-hour interval, i.e. 1 is midnight to 6 am, 2 is 6 am to 12 pm, 3 is 12 pm to 6 pm, and 4 is 6 pm to midnight. You could also use 2- or 3-hour intervals if that suits you better.
- User dimension: contains every single user in the organisation, regardless of country. Grain: Windows user account. Columns: user name, first name, last name, email address, team, department, office, country, job title, and the product lines or services they are responsible for.
- Cube dimension: contains every measure group in every cube. Grain: measure group. Columns: measure group, cube, database. If we have 3 databases, each with 2 cubes, and each cube has 4 measure groups, then the cube dimension will have 3 x 2 x 4 = 24 rows.
- Attribute dimension: contains every visible, enabled attribute in every dimension of every cube. If we have 3 databases, each with 1 cube, each cube has 5 dimensions, and each dimension contains 4 attributes, then the attribute dimension will have 3 x 1 x 5 x 4 = 60 rows. Grain: attribute. Columns: attribute, dimension, cube, database. The attribute dimension does not contain hierarchies; it contains only attributes.
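The time of day dimension above is typically generated with a short script. As a minimal sketch (the column names and the hhmm-style integer key are my own assumptions, not prescribed here), the 1,440 minute-grain rows, including the 6-hour day part, could be produced like this:

```python
def build_time_of_day_rows():
    """Generate one row per minute of the day (1,440 rows)."""
    rows = []
    for minute_of_day in range(24 * 60):
        hour, minute = divmod(minute_of_day, 60)
        rows.append({
            "time_key": hour * 100 + minute,         # e.g. 13:45 -> 1345 (assumed format)
            "time": f"{hour:02d}:{minute:02d}",      # hh:mm, 24-hour format
            "fifteen_minutes": f"{hour:02d}:{(minute // 15) * 15:02d}",
            "half_hour": f"{hour:02d}:{(minute // 30) * 30:02d}",
            "hour": hour,
            "day_part": minute_of_day // 360 + 1,    # 6-hour buckets: 1 to 4
        })
    return rows
```

The date dimension can be scripted in the same way, looping over dates instead of minutes.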
Fact table: there is only 1 fact table, fact query log. It contains the hit count and duration. Grain: 1 row for each row in the query log. Columns: fact key, 5 dimension keys (1 for each of the above 5 dimensions), hit, duration.
- Hit: number of times the cube serves data to the user
- Duration: time taken to run the query
Populate the Mart
The main source for populating the mart is the SSAS query log. We could combine it with the SSAS Dynamic Management Views (DMVs) to get resource data (CPU, memory, bytes read, connection duration), but that’s for another post. We could also combine it with the output from Profiler, but that’s also for another post. Apologies, but I’ve got to limit the scope if I want to get this article published on the blog soon. Otherwise it will become a never-ending story.
The query log provides information about who runs what and when. The query log table can become large very quickly, so after we load it into the mart, we truncate the log table. I wouldn’t advise truncating it every day, as it is useful for operational purposes, i.e. knowing who’s accessing the cubes right now and what they are doing (see my post here [link]). A weekly or bi-weekly rolling window would be good, i.e. keep the data for 1 or 2 weeks, rolling.
Data sources for each dimension:
- As per standard practice, the date and time of day dimensions are populated from a script.
- The main source for the user dimension is Active Directory. Using LDAP programming we can enumerate all user accounts in a domain, along with their first name, last name, email address and the groups they belong to. The job title, department, office and country could be obtained from the intranet (company directory), or from an extract from the HR system. The matching key to join these 2 sources is usually first name + last name. The data from AD sometimes needs cleansing: some user accounts are not real users, e.g. they are system, machine, or dummy accounts.
- The cube and attribute dimensions could be populated using AMO. AMO can be used to enumerate the cubes, dimensions and attributes within an SSAS database. AMO’s Server class contains a collection of all the databases on that server.
The measure group that each query hits is obtained from the OLAP query log table. You need to set the query log settings (in the SSAS server properties) so that each and every query is logged (as opposed to 1 in every 5 queries, etc.), so that the data mart contains all the queries hitting the AS server.
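To make the fact load concrete, here is a minimal Python sketch that maps one query log row to a fact row. The source column names (MSOLAP_User, MSOLAP_ObjectPath, StartTime, Duration) are those of the query log table; the key-lookup dictionaries and the smart-key formats for the date and time keys are my own assumptions. The attribute key(s), derived from the Dataset column, are covered later in this post.

```python
from datetime import datetime  # StartTime arrives as a datetime

def to_fact_row(log_row, user_keys, cube_keys):
    """Map one query log row to a fact row.

    user_keys / cube_keys are hypothetical lookup dicts built from the
    user and cube dimensions; -1 stands for an "unknown" dimension row.
    """
    start = log_row["StartTime"]
    return {
        "date_key": start.year * 10000 + start.month * 100 + start.day,
        "time_key": start.hour * 100 + start.minute,
        "user_key": user_keys.get(log_row["MSOLAP_User"], -1),
        "cube_key": cube_keys.get(log_row["MSOLAP_ObjectPath"], -1),
        "hit": 1,                          # each log row is one query
        "duration": log_row["Duration"],   # duration as recorded in the log
    }
```

In a real load this would run over the whole log table inside the ETL, followed by the truncate discussed above.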
Creating the Cube
When creating the cube I found that the following hierarchies are useful in increasing the cube’s usability:
- Date dimension: date – month – year hierarchy
- Time of day dimension: hour – 15 minutes hierarchy
- User dimension: department – user, product line – user, country – office – user hierarchies
- Cube dimension: MG – cube – database hierarchy
- Attribute dimension: attribute – dimension hierarchy
Measures: both hit and duration are plain, simple measures using the sum aggregate function. I don’t find the need to create distinct counts. Both measures are in 1 measure group, with regular relationships to the 5 dimensions.
That “101010” column
There is a column in the OLAP query log table that contains the digits 1 and 0, e.g. “101010”. I think the column name is “Dataset”. Each digit in this column corresponds to an attribute of a cube dimension (cube dimension, not database dimension). A comma separates one dimension from the next. For example, “100,1000,01000” means that the cube has 3 dimensions: the first dimension has 3 attributes, the second has 4 attributes and the third has 5 attributes. The order of the dimensions is as shown in BIDS (without sorting) or in an XMLA export.
The position of each digit 1 shows which attributes are used in the query. For example, “100,1000,01000” means that the query uses the first attribute of the first dimension, the first attribute of the second dimension and the second attribute of the third dimension. The order of the attributes within a dimension is as shown in BIDS (unsorted) or in an XMLA export.
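The decoding described above is easy to script. Here is a minimal Python sketch (the function name is mine) that returns, for each dimension, the positions of the attributes used:

```python
def parse_dataset(dataset):
    """Decode the query log's bitmask-style dataset column.

    Returns a list of (dimension_index, attribute_index) pairs, both
    zero-based, one pair per attribute used by the query.
    """
    used = []
    for dim_index, digits in enumerate(dataset.split(",")):
        for attr_index, digit in enumerate(digits):
            if digit == "1":
                used.append((dim_index, attr_index))
    return used
```

For “100,1000,01000” this returns [(0, 0), (1, 0), (2, 1)]: the first attribute of the first and second dimensions, and the second attribute of the third.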
The above method enables us to understand which attributes are used by each query, i.e. using the positions of the digit “1” in the dataset column for each dimension. We have 4 alternatives for presenting this information in a dimensional model:
- The key column on the fact table for the attribute dimension is a single surrogate key. The attribute dimension is like a “junk dimension”, i.e. it contains all possible combinations of attributes. For N attributes, we need 2^N rows in the attribute dimension. For example, for 30 attributes, we need 2^30 ≈ 1 billion rows. This approach is simple, but it’s not workable for a complex cube with, say, 100 attributes.
- For each digit we have a key column, so the attribute dimension is “role-playing”, i.e. if we have 100 attributes in the cube we will have 100 key columns on the fact table. Note that the attribute dimension only needs 2 rows: exists and doesn’t exist; it doesn’t need a row for each attribute in the cube. This approach is quite rigid, i.e. if we add a new dimension we need to change the structure. And it can’t help us answer the question: to which dimensions do those attributes belong?
- The third option is to “normalise” the attribute dimension, preferably into another fact table. If it’s in the same fact table, then the measures need to be proportionally allocated. What I mean by “normalise” is: we have only one key column, but (using the “100,1000,01000” example above) we have 12 rows in the fact table.
- The last option is to have a fixed number of key columns, i.e. column 1 is the first attribute used, column 2 is the 2nd attribute used, etc. The benefit of this approach is that it is easy to answer the question: which attributes were used in the query? The disadvantage is that some attributes could be missing (not stored), i.e. if we only provide 10 key columns but have 12 attributes in the query, we lose the last 2. In this approach the attribute dimension has attribute, dimension, cube and database columns, so it’s also easy to answer the question: to which dimensions do those attributes belong?
All the above options answer the question: which attributes are used in the query? And secondly: to which dimensions do those attributes belong? As usual, the best approach depends on the situation. In one project, option 4 seemed the most suitable as it can answer both questions. And although the cubes were quite complex (60-100 attributes), the majority of the queries were simple, i.e. 4-6 attributes, occasionally 8, and very rarely 10. Twelve attributes covered more than 99.9% of the queries. For the other 0.1% of queries, yes, we lost a few attributes, but we still captured 12. We also worked out a scheme for which attributes get low priority, so we don’t mind them not being captured. This way we ensure that we capture the most important attributes.
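Option 4 can be sketched as follows, assuming the dataset column has already been decoded into a priority-ordered list of attribute surrogate keys. The 12-column limit and the padding value of -1 for “no attribute” are my own assumptions for illustration:

```python
MAX_ATTRIBUTE_COLUMNS = 12   # fixed number of attribute key columns on the fact table

def to_attribute_columns(attribute_keys, max_cols=MAX_ATTRIBUTE_COLUMNS):
    """Fill the fixed key columns; attributes beyond max_cols are dropped.

    attribute_keys should already be ordered by capture priority, so
    that when a query uses more than max_cols attributes, only the
    low-priority ones are lost.
    """
    padded = list(attribute_keys[:max_cols])
    padded += [-1] * (max_cols - len(padded))   # -1 points to a "no attribute" row
    return padded
```

The prioritisation scheme mentioned above would simply sort attribute_keys before this function is called.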
There are 2 things that we can do to further develop this concept:
- Combine with DMV
- Combine with SQL Profiler
The DMVs give us additional information such as:
- how much memory and CPU is used by each command and each user
- which aggregates are used/hit and which are missed
- which objects are locked and by whom
- how much memory and CPU is used by each dimension and each measure
- how much memory is used by each data cache and objects of SSAS engine
- which cubes, dimensions, partitions and measures are used most and which are rarely used (how many times they are read, how many rows returned, how many KB read)
SQL Profiler, meanwhile, gives us the MDX (the “TextData” column on the query event), as well as information about the query duration, memory, locks, and CPU time.
All this could add value to the information that we extract from the query log.