Data Warehousing and Data Science

29 June 2010

The most useful SSAS book

Filed under: Analysis Services — Vincent Rainardi @ 1:10 pm

I have written before that Chris, Alberto and Marco's SSAS book is the most useful SSAS book to me. This week I read it again and found a lot of very useful things inside. The reason this book is so useful, in my opinion, is that it encapsulates their experience in the field. It's not the "standard" kind of book where the authors just rewrite the BOL content, i.e. how to do this and that. It contains, decisively, their opinions and views on how things should be done, based on their experience. And believe me, if people like Chris, Alberto and Marco allow you to tap into their experience, it's worth a lot of money. These people are among the best in the field, and they have tons of experience.

Yesterday and today I learned the following from these experts:

1. When creating a cube, instead of writing the SQL queries and joins in the DSV, it is better to put them in SQL Server views. Why are views better?

The book provides ten reasons. I have come across this question many times and can give three or four reasons off the top of my head, but reading the other reasons in the book was illuminating, and I'm grateful to the writers for them.
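As an illustration of what I mean by putting the query logic in a view rather than in the DSV, here is a minimal sketch (the table and column names are made up):

-- The query logic (derived columns, filters, joins) lives in a database view...
CREATE VIEW dbo.vw_fact_sales AS
SELECT f.date_key,
       f.product_key,
       f.customer_key,
       f.sales_amount,
       f.sales_amount - f.discount_amount AS net_amount   -- derived column
FROM   dbo.fact_sales f
WHERE  f.is_cancelled = 0;                                 -- filter out cancelled orders

-- ...and the DSV named query then becomes simply:
-- SELECT * FROM dbo.vw_fact_sales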

2. SSAS cubes are populated from a data mart*. Why, ideally, we should build the mart as a "star", not a "snowflake".

*”But we can populate it straight from the operational system!”  you may say. I know, I know, I wrote about it!

3. General rules for indexing the data mart, to be used by SSAS.

This has been a particular interest of mine for some time. Although the book only covers it briefly, it gave me some ideas which I will treasure for the future.

4. SQL queries that SSAS generates.

I already knew about this before I read the book. I knew that the dimension processing query was a "select distinct". From my own experience, I knew that reference dimensions, if materialised, could return incorrect data. I knew that a fact dimension did a "select distinct" on the fact table. But still, reading about this subject in the book gave me lots of insight that I didn't have. For example: the "order by" on distinct count queries – this knowledge is so valuable to me, and very useful when optimising cube processing time.
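To give a rough sketch of the kind of SQL I mean (the view names are illustrative; the exact statements SSAS generates depend on the cube design):

-- Dimension attribute processing: SSAS issues a "select distinct" per attribute
SELECT DISTINCT [City] FROM [dbo].[vw_dim_customer]

-- Partition processing for a distinct count measure: the query is ordered by the distinct count column
SELECT [customer_key], [sales_amount]
FROM [dbo].[vw_fact_internet_sales]
ORDER BY [customer_key]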

5. Calculation dimension.

A technique that enables us to apply a calculation to many measures: we define the calculation only once, but it is applicable to many measures. I already knew about this technique, but reading the explanation from the experts themselves gave me different ideas which I can use to improve the performance of a cube. Not as a "date tool", but for something else. That idea is such a treasure to me that, once again, I'm grateful to the writers.

A lot of people consider me an expert in data warehousing and, in the last 2 years, in SSAS (I never consider myself an expert), but even I learned a lot from this book. My point is very simple: if you are a cube developer, cube designer, OLAP architect, or whatever you want to label yourself, buy their book and tap into their experience. The knowledge in the book is worth thousands of times its price.

If you have read it, or you think you have read it, my advice is: read it again. In my opinion, anybody who is involved in cube building needs to read this book.

25 June 2010

MDX: EXISTING

Filed under: Analysis Services,MDX — Vincent Rainardi @ 2:31 pm

EXISTING evaluates a set in the current context. It's like writing the tuple (hierarchy.CurrentMember, measure), except that we don't have to specify the hierarchy explicitly.

In this MDX, measure X is the Sales Amount for the current country:

with member [Measures].[x] as
( [Sales Territory].[Sales Territory Country].CurrentMember,
[Measures].[Internet Sales Amount]
)
select [Product].[Category].[All].Children on 1,
[Measures].[x] on 0
from [Adventure Works]
where [Sales Territory].[Sales Territory Country].[United Kingdom];

Output:

Using EXISTING, it’s like this:

with member [Measures].[x] as
sum(existing [Measures].[Internet Sales Amount])
select [Product].[Category].[All].Children on 1,
[Measures].[x] on 0
from [Adventure Works]
where [Sales Territory].[Sales Territory Country].[United Kingdom];

Output:

Measure X is the Sales Amount in the current context i.e. current country.

Simple, isn’t it?

23 June 2010

SQLBits7

Filed under: Event — Vincent Rainardi @ 8:24 pm

SQLBits 7, the largest SQL Server conference in Europe, will take place in York, 30th Sep to 2nd Oct. Unlike SQLBits 6, which was only 1 day, SQLBits 7 will follow the same format as SQLBits 5. It will run for 3 days:

Thursday: in-depth, full-day seminars (paid)
Friday: deep-dive, advanced-session conference (paid)
Saturday: community day (free)

Submit a session for SQLBits

SQL Server 2008 Data Warehousing Features

Covers data warehousing features in SQL Server 2008 such as merge, change data capture, star join query/bitmap filter, change tracking, minimally logged insert, resource governor, backup compression, data compression and indexed views.

For each feature I will explain what it is and why we should use it/what we can use it for in data warehousing, and give a demo so you can see it for yourself.
This is an introductory session. I won't go into deep technical detail about how they work internally. My goal is to introduce them to you, and if you are interested, you can study them further. Many SQL Server data warehousing "shops" have not used/taken advantage of these features, which is a shame because they are very useful.
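To give a flavour of one of these features, here is a minimal MERGE sketch for upserting a dimension table from a staging table (the table and column names are made up):

MERGE dbo.dim_customer AS tgt
USING staging.customer AS src
   ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN
   UPDATE SET tgt.customer_name = src.customer_name,
              tgt.city          = src.city
WHEN NOT MATCHED BY TARGET THEN
   INSERT (customer_id, customer_name, city)
   VALUES (src.customer_id, src.customer_name, src.city);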

21 June 2010

MDX: Filter

Filed under: Analysis Services,MDX — Vincent Rainardi @ 10:03 pm

We can filter an axis based on the members (for example, city names beginning with F or later) or based on the measure value (for example, Internet Sales Amount > $40,000).

To filter based on members, we can use string comparison:

select non empty
filter
( {[Customer].[City].[All].Children},
[Customer].[City].CurrentMember.Name > "F"
) on 1,
[Measures].[Internet Sales Amount] on 0
from [Adventure Works];

Or use string functions:

select non empty
filter
( {[Customer].[City].[All].Children},
left([Customer].[City].CurrentMember.Name,1) = "F"
) on 1,
[Measures].[Internet Sales Amount] on 0
from [Adventure Works];

To filter based on measure value, we use numeric comparison:

select non empty
filter
( {[Customer].[City].[All].Children},
[Measures].[Internet Sales Amount] > 40000
) on 1,
[Measures].[Internet Sales Amount] on 0
from [Adventure Works];

Other examples of string functions:

Second letter is A:
mid([Customer].[City].CurrentMember.Name,2,1) = "A"

Ends with A:
right([Customer].[City].CurrentMember.Name,1) = "A"

Begins with O and ends with A:
left([Customer].[City].CurrentMember.Name,1) = "O" and
right([Customer].[City].CurrentMember.Name,1) = "A"

MDX: Sorting

Filed under: Analysis Services,MDX — Vincent Rainardi @ 9:56 pm

We can sort an axis based on the member on that axis, or on the measure.


To sort an axis based on the member, we can use the NAME property:

select non empty
order
( {[Customer].[City].[All].Children},
[Customer].[City].CurrentMember.Name, DESC
) on 1,
[Measures].[Internet Sales Amount] on 0
from [Adventure Works];

To sort an axis based on the measure, we use the measure name:

select non empty
order
( {[Customer].[City].[All].Children},
[Measures].[Internet Sales Amount], DESC
) on 1,
[Measures].[Internet Sales Amount] on 0
from [Adventure Works];

20 June 2010

Creating a Dimension with Multiple Column Key

Filed under: Analysis Services — Vincent Rainardi @ 7:07 pm

I’ve come across this question three times, the latest one being today. So I thought I better write it down. The question is: in SSAS, how do we create a dimension with multiple columns as the key? Say you want to make 2 physical columns as the key of your AS dimension.

In the DSV, mark those 2 columns on the named query as the logical primary key: hold Ctrl to select both columns, then right-click on either and choose "Set Logical Primary Key".

In Solution Explorer, right-click Dimensions and choose New Dimension. BIDS will offer those 2 columns as the key of the dimension. On that first page (of the dimension wizard), BIDS also warns that we need to have a name for this attribute, so choose a column as the name.

The column that we use for the name is usually created in the DSV select statement as a concatenation of the 2 columns. For example, if you have col1 and col2 on the dim1 table, your DSV named query is like this:
select col1, col2, col1 + '|' + col2 as name
from dim1

The rest of the dimension creation process is like normal.

Then, after you have created the dimension, on the Dimension Usage tab of the cube, link the dimension to the measure group on that multiple-column key, using that key attribute.

As they say, a picture speaks a thousand words, so I’ll add the DSV, the Properties of the Key Attribute, and the Define Relationship window screen shots below.

 

18 June 2010

Vertical Fact Table

Filed under: Analysis Services,Data Warehousing — Vincent Rainardi @ 5:17 pm

A "vertical fact table*" is a fact table with only 1 measure. There is a dimension key column in this fact table which determines what kind of measure each row holds. On row 1 that measure could mean "sales", but on row 2 it could mean "discount". In most cases the measure is a financial amount, but in some cases it can be a quantity.

*I got this name from the SSAS forum. Some people call it a "Normalised Fact Table". I believe it is called vertical because it is very long: if you have 40 measures, after you 'normalise' the fact table it will be 40x longer.

A vertical fact table is normally used for an accounting/finance cube. It's unusual to 'dimensionalise' the measures into 1 column, especially if they are not additive to each other.

Having said that, if you decide to do it, create a Dim_Measure table (in accounting/finance this is the account dimension), containing these 5 columns:

  • Measure_Key (PK)
  • Measure_Name
  • Parent_Measure
  • Unary_Operator
  • Custom_Rollup

On the DSV, on this Dim_Measure named query, create a relationship from Parent_Measure to Measure_Key (Measure_Key as PK).

In BIDS, create a dimension with 2 attributes: Measure and Parent. On the Measure attribute, set the KeyColumns property to the Measure_Key column, the NameColumn property to the Measure_Name column, and Usage = Key. On the Parent attribute, set Usage = Parent, UnaryOperatorColumn = Unary_Operator, and CustomRollupColumn = Custom_Rollup.

A vertical fact table contains a few dimension key columns as usual (one of them being Measure_Key), plus 1 column called Amount. It is the value of the Measure_Key column that determines which measure each row represents.
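As a minimal sketch of the two tables described above (the data types and the extra key columns are illustrative):

CREATE TABLE dbo.Dim_Measure
( Measure_Key    int           NOT NULL PRIMARY KEY,
  Measure_Name   varchar(100)  NOT NULL,
  Parent_Measure int           NULL,            -- points back to Measure_Key (parent-child)
  Unary_Operator char(1)       NULL,
  Custom_Rollup  varchar(2000) NULL
);

CREATE TABLE dbo.Fact_Finance                   -- the "vertical" fact table
( Date_Key    int   NOT NULL,
  Account_Key int   NOT NULL,
  Measure_Key int   NOT NULL,                   -- determines what the Amount on each row means
  Amount      money NOT NULL
);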

Populate the Dim_Measure table:

  • Measure_Key: surrogate (1,2,3…)
  • Measure_Name: as per your project
  • Parent_Measure: set to 1 (all rows)
  • Unary_Operator: set to ~ (all rows)
  • Custom_Rollup: ([Dim_Measure].[Measure].&[8], [Measures].[Amount]) + ([Dim_Measure].[Measure].&[9], [Measures].[Amount])

[8] and [9] are examples only, replace them with the corresponding row.
+ is an example only, replace it with the formula in your project.
There is a property called CustomRollupPropertiesColumn that we can use to set the decimal format etc.

Creating Many Roles in SSAS Cube

Filed under: Analysis Services — Vincent Rainardi @ 4:54 pm

If one day you had to create 100 roles in an SSAS cube, all of them with dimensional security*, then it might be worth automating the process. Here I would like to write down the possible options.

*As in access to the dimension data, not access to the dimension itself. And I think the methods I'm about to describe here are also applicable to cell security (as in access to cell data). Remember that "access to cell data does not automatically give access to the dimension data" – see the R2 doc: Analysis Services – Multidimensional Data – Security and Protection – Secure Operation – Granting User Access.

The methods that I’m thinking we can use to automate the process of creating the roles with dimensional security are:

a)  AMO
The idea is to add those roles to the cube using AMO, then define the dimensional security.

b)  XMLA
The idea is to compose XMLA scripts that create the roles and the dimensional security, execute this XMLA against the cube to create those roles, then rebuild the BIDS solution from the server.

Using AMO

In AMO we have a Role class. The constructor of this class is (C#) “public Role(string)” with string being the name of the role. We also have the option of using Role(string, string) constructor, with the second string being the ID of the role.

We also have the RoleCollection class. In this class we have an Add method, which is useful for creating a new role at the end of the collection. We create the role by the ID of the role, or by name and ID. If we want to create the role somewhere in the middle of the collection (rather than at the end), we can use the Insert method. Again we can use the ID of the role, or name and ID.

The RoleCollection class also has a Contains method, which indicates whether the collection contains a certain role, by ID (as a string). This method is very useful for checking whether the role we are about to create already exists. As usual, like in the Cube class, the more useful methods are Find and FindByName.

After we create the role, we need to give it access to the cube. For this we can use CubePermission class. The constructor has 3 parameters: the role ID, the cube name and the cube ID.

We also have CubePermissionCollection class, which has an Add method. This method can take 2 parameters (role ID and the cube permission name) or 3 parameters, with the 3rd param being the ID of the cube permission.

Then we need to create the dimensional security. We can use DimensionPermission class for this. The constructor has 3 parameters: the role ID, the dimension permission name, and the dimension permission ID.

We also have DimensionPermissionCollection class. This class has an Add method, which takes 2 parameters: the role ID and the dimension permission name. The Add method can also take a third parameter: dimension permission ID. As usual this class also has an Insert method.

Using XMLA

First we create the role, then give it cube access, then give it dimension permission. Something like this:
<Role>
  <ID>Role1</ID>
  <Name>Role1</Name>
  <Members>
    <Member>
      <Name>DOMAIN\ADGroup1</Name>
    </Member>
  </Members>
</Role>

<CubePermission>
  <ID>CubePermission 73</ID>
  <Name>CubePermission 73</Name>
  <RoleID>Role1</RoleID>
  <Read>Allowed</Read>
  <ReadSourceData>None</ReadSourceData>
</CubePermission>

<DimensionPermission>
  <ID>DimensionPermission 23</ID>
  <Name>DimensionPermission 23</Name>
  <RoleID>Role1</RoleID>
  <Read>Allowed</Read>
  <AttributePermissions>
    <AttributePermission>
      <AttributeID>Attr1</AttributeID>
    </AttributePermission>
    <AttributePermission>
      <AttributeID>Attr2</AttributeID>
      <AllowedSet>{[Dim1].[Attr2].&amp;[Member1]}</AllowedSet>
      <VisualTotals>1</VisualTotals>
    </AttributePermission>
  </AttributePermissions>
</DimensionPermission>

We then execute the XMLA against a cube, and then create the BIDS project from the cube: in BIDS, File menu, New Project, choose project type = Import SSAS database.

Creating Roles from Another Cube

If the task is to mimic the roles on cube1 based on the roles on cube2, then we can either a) use AMO to get the details of the roles and the dimensional security of the existing roles on cube2, then create the roles in cube1 using AMO, or b) script cube2 on the server into XMLA, programmatically extract the role and dimensional security parts from the XMLA, then execute that extract on cube1.

I prefer method a) (AMO) over method b) (XMLA) because we have more control and it's more flexible.

17 June 2010

Normalised Data Warehouse (1)

Filed under: Data Warehousing — Vincent Rainardi @ 7:26 am

What are the advantages of creating a data warehouse in a normalised format? Some people, including myself, have been exploring the answer to this question. I'm going to rewrite here the answer that I posted in the Kimball forum. But before I do that, let's clarify the terms: a) normalised, and b) normalised data warehouse.

Normalised: “Normalisation” is the process of removing data redundancy by implementing normalisation rules. I am referring to Codd’s  normalisation rules and forms. A “normalised database” is a database which has gone through the process of normalisation.

A “normalised data warehouse” is a database consisting of entities (tables) with parent-child relationship between them, usually normalised to 3rd normal form (3NF). But it could be in other normal forms, such as BCNF, 5NF, etc. Some people call it “relational data warehouse” instead of normalised data warehouse. Some people call it “3NF data warehouse”.

In my book I wrote about 3 data warehouse architectures: Single DDS, NDS+DDS and ODS+DDS. The second one (NDS + DDS) has a normalized database as its master data store. But the presentation layer (DDS) is in dimensional format. That is not what I mean by “normalised data warehouse” in this article.

The advantages of using a normalised model are:
1. To minimise disk storage
2. A normalised model is more suitable for some reports
3. To increase flexibility
4. To reduce the data integration (ETL) effort
5. To eliminate data redundancy, making it quicker to update the DW.

My response to the 2 reasons that were posted in the Kimball forum (Teradata and real time):

1. Teradata

Even though the terminology and the architecture are slightly different, the "Teradata" reason that you mentioned is generally also applicable to other MPP database systems such as Netezza, Greenplum, Neoview or Madison.

In Teradata, the primary index is used to distribute the table rows. The primary index significantly affects query performance, which is why it is the main consideration when designing a table. So the main issue when implementing a Kimball-style fact table (with a composite primary key consisting of dimension surrogate keys) in Teradata is: what is the primary index? If we use a normalised model we don't have fact tables and we don't have this problem. In my opinion (Teradata folks, please correct me if I'm wrong) we can solve this problem by creating a fact_key, which is a single-column surrogate key on the fact table that acts as the primary key. A fact_key column is also used when implementing a fact table on the SQL Server platform, i.e. as the clustered index.
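On SQL Server, a minimal sketch of what I mean by a fact_key acting as the clustered index looks like this (the table and column names are illustrative):

CREATE TABLE dbo.fact_sales
( fact_key     bigint IDENTITY(1,1) NOT NULL,   -- single-column surrogate key
  date_key     int    NOT NULL,
  customer_key int    NOT NULL,
  product_key  int    NOT NULL,
  sales_amount money  NOT NULL,
  CONSTRAINT pk_fact_sales PRIMARY KEY CLUSTERED (fact_key)
);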

That's why I think the argument that "MPP database systems require a normalised model" is incorrect. I believe that we can implement a DW with a dimensional model on an MPP database. If we do that, we will also enjoy the performance benefits of an MPP database (i.e. scalable, linear) compared to an SMP database.

Having said that, Tom Russell and Rob Amstrong from Teradata wrote about the arguments for using a normalised model on MPP, and I believe we can all learn from their expertise: http://www.teradata.com/t/page/115866/index.html

2. Real time DW

What I mean by a real-time DW here is: when the source system is updated, that change is reflected in the DW within 5 seconds. Generally speaking there are 3 ways to achieve this: 1) using a normalised data model, 2) using a dimensional data model, and 3) keeping the data in the source system.

A few years ago (we are talking 5-10 years ago), when we needed real-time data in the DW, we normally used a normalised model. A popular case at that time was the class 1 ODS, which Bill Inmon, Claudia Imhoff and Greg Battas explained in chapter 3 of their ODS book (1996). The data structure of a class 1 ODS is almost exactly like the source system: it is normalised.

As Ralph Kimball and Joe Caserta explained in chapter 11 of their ETL Toolkit book (originally in 2001, Kimball Design #13), we can now do a real-time DW on a dimensional model. This is achieved by creating 2 fact tables: a "static fact table" and a "real-time partition". The "UDM" concept of SSAS 2005 (and 2008) works on the same principle: the main partitions can be MOLAP whereas the real-time partition is set to ROLAP.

Storing History

Over and over this question comes up: how do we store history in a normalised DW? History of the master tables, that is, not historical transaction records. One popular method is to use audit tables. Now, with the change tracking and CDC features in SQL Server 2008, it is becoming easier to create and populate the audit tables. Using 2008's CDC, everything is built in and ready for use out of the box. The detailed explanation of how to do it is for a separate article/post.
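As a minimal sketch of enabling CDC on a master table in SQL Server 2008 (the database, schema and table names are illustrative):

-- Enable CDC at the database level (requires sysadmin)
USE SourceDB;
EXEC sys.sp_cdc_enable_db;

-- Enable CDC on the customer master table (requires db_owner)
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'customer',
    @role_name     = NULL;

-- The changes are then available from the generated change table and functions,
-- e.g. cdc.fn_cdc_get_all_changes_dbo_customer, which the ETL can use to populate the audit tables.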

Update 8/12/2010: I wrote the 2nd version of Normalised Data Warehouse here.

One or Two Dimensions

Filed under: Data Warehousing — Vincent Rainardi @ 5:41 am

Each customer can have more than one account. Should we

a)  create one dimension: dim_account, with grain = account, PK = account_key, and customer details become attributes of this dimension. On the fact table we only have account_key FK.
b)  create two dimensions: dim_account (grain = account, PK = account_key) and dim_customer (grain: customer, PK: customer_key). On the fact table we have 2 FKs: account_key and customer_key.
c)  snowflake, i.e. create dim_account and dim_customer as per b), but on the fact table we only have 1 FK: account_key. On dim_account we have customer_key FK.

What are the advantages and disadvantages for each of the three approaches above? In what case/situation does each approach work best?

Before I start, there are 2 more approaches, derived from the 3 main ones above:
d)  as per b) but on dim_account we have customer_key FK, which connects dim_account directly to dim_customer
e)  as per a) but on dim_account we have customer_key column, and on the fact table we have 2 FKs: account_key and customer_key

OK, now the advantages and disadvantages. For each of the 5 approaches above, I will elaborate on the main advantage, the situation in which that approach is best used, the main disadvantage, and any issues when using that approach.

a) One Dimension

The main advantage is simplicity, i.e. there is only 1 dimension and 1 fact table, and a 1-column link between them. This approach is best used when we don't have other fact table(s) with grain = customer and we don't need dim_customer to be created.

The main disadvantage is that we don't have dim_customer, so a fact table with grain = customer can't be built. An attempt to use dim_account for that fact table (using account_key to link) will result in difficulties in maintaining the account_key column on the fact table (to point to the right customer). The difficulties are especially felt when the dimension is SCD type 2.

b) Two Dimensions

The main advantage is that from the fact table we can get the customer details without knowing the account_key. This approach is ideal when we have a fact table with grain = account and another fact table with grain = customer, but we don't need to get the customer details for a particular account.

The main disadvantage is: we cannot ‘go’ from dim_account to dim_customer (and vice versa) without going through the fact table. All ‘traffic’ between dim_account and dim_customer must go via the fact table. How? Using select distinct, e.g.

select distinct account_key, customer_key from fact1

The fact table could be big. If it's a 1 billion row fact table, this query would be slow. A nonclustered index on both account_key and customer_key helps the performance, but it is still a lot slower than approach c), d) or e). Unfortunately, we can't materialise the above query using an indexed view: the select statement in an indexed view cannot contain the DISTINCT keyword.
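For instance, a minimal sketch of such an index, using the fact1 table from the query above:

create nonclustered index ix_fact1_account_customer
on fact1 (account_key, customer_key)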

c) Snowflake

The main advantage is that it's modular, i.e. we have 2 separate dimension tables but we can combine them easily to create a bigger dimension. If we have a fact table with grain = account and another fact table with grain = customer, and we need to get the customer details for a particular account without going through the fact table, then this is the ideal approach to use.

The main disadvantage of the snowflake approach is that in a fact table with grain = account, getting the breakdown of a measure by customer details is a little bit more complicated than in approach a).

With approach a) (One Dimension):

select a.customer_attribute1, sum(f.measure1)
from fact1 f
inner join dim_account a on f.account_key = a.account_key
group by a.customer_attribute1

With approach c) (Snowflake):

select c.customer_attribute1, sum(f.measure1)
from fact1 f
inner join dim_account a on f.account_key = a.account_key
inner join dim_customer c on a.customer_key = c.customer_key
group by c.customer_attribute1

But this disadvantage is relatively small compared to the advantage of having a customer dimension which is linked to the account dimension.

d) Two Dimensions with Inter-Dimension Link

The main disadvantage of the Two Dimensions approach is that we can’t “go” from account dimension to customer dimension without going through the fact table. And the main disadvantage of the Snowflake approach is that we can’t access dim_customer directly from the fact table. This approach is trying to fix both problems. So there are 2 main advantages:

a)  we can “go” from account dimension to customer dimension, and
b)  we can access dim_customer directly from the fact table

When should we use this approach? When a) we have a fact table with grain = customer so we need to create dim_customer, b) we need to access customer details from the fact table directly, and c) we need to get the customer details for a particular account without going through the fact table.

What is the main disadvantage of this approach? The “duplicated effort” in maintaining the customer key. We maintain the customer key column in the fact table as well as in the dim_account.

e) One Dimension With Customer Key

This approach is trying to fix the disadvantage of the One Dimension approach, which is the inability to build a fact table with the grain = customer because we don’t have dim_customer with its own customer_key.

So in this approach we add a column to dim_account: customer_key. This enables us to build a fact table with grain = customer, while keeping the number of dimensions to one.

This approach is not as popular as c) and d) for solving the dim_customer issue because it is "indecisive": it tries to create a dim_customer without actually creating one. In c) and d) we have dim_customer physically, but in e) dim_customer is hidden inside dim_account, making it a) more difficult to maintain (especially if it is type 2), and b) less modular/flexible. These are the main disadvantages of this approach.

Please bear in mind that the above concept is applicable not only to dim_customer & dim_account, but to all dimensions with a hierarchical relationship, for example dim_product and dim_product_group, or dim_policy and dim_line.

